Chapter 2

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

```{r, echo=F, message=F, warning=F}
library(datasets)
library(tidyverse)
library(shiny)
library(scales)
library(jpeg)
library(openintro)
library(dplyr)
library(ggplot2)
library(learnr)
library(readr)
library(knitr)
library(png)
library(gradethis)   # remotes::install_github("rstudio/gradethis")
library(learnrhash)  # devtools::install_github("rundel/learnrhash")
library(grid)
library(tinytex)
data("COL")
```

## Acknowledgement

These notes use content from OpenIntro Statistics Slides by Mine Cetinkaya-Rundel.

## Content Outline of Chapter 2

+ This chapter focuses on summarizing data
    - Numerical data
    - Categorical data
+ There are three sections in this chapter
    - 2.1 Examining numerical data
    - 2.2 Considering categorical data
    - 2.3 Case study

## 2.1 Examining numerical data

In this section, we explore techniques for summarizing numerical variables.

+ **Graphical Summary**
    - Scatterplot (for two numerical variables)
    - Dot plot
    - Histogram and shape (modality and skewness)
    - Boxplots & Outliers
+ **Numerical Summary**
    - Mean
    - Variance and Standard Deviation
    - Median
    - Quartiles and IQR

## Scatterplot

A scatterplot (also called a scatter graph, scatter chart, scattergram, or scatter diagram) is a plot that uses Cartesian coordinates to display the values of, typically, two numerical variables for a set of data. **Scatterplots** are useful for visualizing the relationship between two numerical variables.

Example 1. The figure below is a scatterplot of total income and borrowed loan amount for a data set of 50 cases.

```{r, echo=FALSE, out.width="50%"}
image <- readJPEG("img.jpg")
grid.raster(image)
```

Practice: Read two pairs of data values from the plot and interpret them.

## Scatterplot

- When two variables show some connection with one another, they are called **associated** variables.
    - Linear (positive or negative) association.
    - Nonlinear association.
- If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be **independent**.

Back to Example 1. Is the association linear or nonlinear? **Nonlinear**

```{r, echo=FALSE, out.width="50%"}
image <- readJPEG("img.jpeg")
grid.raster(image)
```

## Practice

Visualize the types of associations.

```{r, out.width="80%"}
hh <- readJPEG("ag.jpeg")
grid.raster(hh)
```

## Dot plots

- **Dot plot**: shows a dot for each observation, placed above its value on a number line.
- A dot plot is useful for visualizing one numerical variable.
- Darker colors represent areas where there are more observations.

```{r, echo= FALSE,out.width= "45%", Include= F, fig.align='center'}
d = read.csv("gpa.csv")
gpa = d$gpa[d$gpa <= 4]
gpa = gpa[!is.na(gpa)]
openintro::dotPlot(gpa, pch = 19, col = COL[1,4], xlab = "GPA", xlim = c(2.5,4), ylab = "")
```

## Stacked dot plot

Taller stacks represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

```{r, echo=F, message=F, warning=F, out.width="70%",fig.align='center'}
X <- c()
Y <- c()
for(i in 1:length(gpa)){
  x <- gpa[i]
  rec <- sum(gpa == x)
  X <- append(X, rep(x, rec))
  Y <- append(Y, 1:rec)
}
radius <- 0.0249
cex <- 2.5
seed <- 1
stacks <- dotPlotStack(gpa, radius=radius, addDots=FALSE, pch=19, col=COL[1], cex=1.25, seed=seed)
plot(0, type="n", xlab="GPA", axes=FALSE, ylab="", cex.lab = 2, xlim=c(2.6, 4.0), ylim=c(0, quantile(stacks[[3]], 0.994)))
dotPlotStack(gpa, radius=radius, pch=19, col=COL[1], cex=cex, seed=seed)
abline(h=0)
axis(1, cex.axis = 2)
```

## Dot plots & mean

```{r,echo= FALSE,out.width= "45%", Include= F, fig.align='center'}
openintro::dotPlot(gpa, pch = 19, col = COL[1,4], xlab = "GPA", xlim = c(2.5,4), ylab = "")
M <- mean(d$gpa[d$gpa <= 4], na.rm = TRUE)
polygon(M + c(-2,2,0)*0.01, c(0.25, 0.25, 0.5), border=COL[4], col=COL[4])
# The plot is stretched vertically, so we use the included plot PDF
```

- The **mean**, also called the **average** (marked with a triangle in the above plot), is one way to measure the center of a **distribution** of data.
- The mean GPA is 3.59.

## Mean

+ The **sample mean**, denoted as $\bar{x}$, can be calculated as

$$\bar{x} = \frac{x_1+x_2+ \dots + x_n}{n},$$

where $x_1, x_2, \dots, x_n$ represent the $n$ observed values.

+ The **population mean** is computed the same way but is denoted by $\mu$. It is often not possible to calculate $\mu$ since population data are rarely available.

+ The sample mean is a **sample statistic** and serves as a **point estimate** of the population mean (a parameter).

+ This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.

## Examples

Example 1. Calculate the mean of a sample with five observations: 5, 3, 8, 5, 6.

$$\bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} = \frac{5+3+8+5+6}{5}=\frac{27}{5}=5.4$$

Using R, we can calculate the mean with the `mean()` function. Notice that we need to put the values in a vector using the `c()` function, which stands for *concatenate*.

```{r Ex1, exercise=TRUE}
mean(c(5,3,8,5,6))
```

## Examples

Example 2. Given a data set with $n=234$ and $\sum_{i=1}^{n}x_i=2019$, find the mean $\bar{x}$ rounded to 4 decimals.

$$\bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} = \frac{2019}{234} = 8.6282$$

```{r Ex2, exercise=TRUE}
2019 / 234
```

## Discussions

1. If the data set has 5 observations with $\bar{x} = 5.4$, find $\sum_{i=1}^{5}x_i$.
2. Continuing from question 1, if we add one more observation, 10, will the mean $\bar{x}$ increase or decrease? What is the new $\bar{x}$?
3. Compare the data sets 5, 3, 8, 5, 6 and 5, 3, 80, 5, 6. Which one has the higher mean?

```{r Ex3, exercise=TRUE}

```

## Histograms - Extracurricular hours

(Example: Extracurricular hours)

+ Histograms provide a view of the **data density**. Higher bars represent where the data are relatively more common.
+ Histograms are especially convenient for describing the **shape** of the data distribution.
+ The chosen **bin width** can alter the story the histogram is telling.

```{r, echo=F, message=F, warning=F, out.width="50%",fig.align='center'}
d = read.csv("extracurr_hrs.csv")
extracurr_hrs = d$extracurr_hrs[!is.na(d$extracurr_hrs)]
histPlot(extracurr_hrs, col = COL[1], xlab = "Hours / week spent on extracurricular activities", ylab = "",cex.lab=2,cex.axis=2)
```

## Bin width

- Which one(s) of these histograms are useful?
- Which reveals too much about the data?
- Which hides too much?

```{r, echo=F, message=F, warning=F, out.width="75%", fig.align='center'}
histPlot(extracurr_hrs, col = COL[1],xlab = "Hours/week spent on extracurricular activities", ylab = "", breaks = 2,cex.lab=1.5,cex.axis=2)
histPlot(extracurr_hrs, col = COL[1],xlab = "Hours/week spent on extracurricular activities", ylab = "", breaks = 20,cex.lab=1.5,cex.axis=2)
```

```{r, echo=F, message=F, warning=F, out.width="75%", fig.align='center'}
histPlot(extracurr_hrs, col = COL[1], xlab = "Hours / week spent on extracurricular activities", ylab = "",cex.lab=1.5,cex.axis=2)
histPlot(extracurr_hrs, col = COL[1], xlab = "Hours / week spent on extracurricular activities", ylab = "", breaks = 30,cex.lab=1.5,cex.axis=2)
```
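The effect of the bin choice can also be explored numerically with base R's `hist()`. The sketch below is only an illustration: the `hours` vector is simulated, right-skewed data (an assumption, not the extracurricular data set), and `plot = FALSE` is used so that only the bin counts are returned.

```r
# Simulated right-skewed "hours" data; made up for illustration only
set.seed(1)
hours <- rexp(200, rate = 1 / 10)

# plot = FALSE returns the bins and counts without drawing anything
coarse <- hist(hours, breaks = 2,  plot = FALSE)
fine   <- hist(hours, breaks = 30, plot = FALSE)

# A single number passed to `breaks` is only a suggestion to hist(),
# so check how many bins were actually used
length(coarse$counts)
length(fine$counts)
```

With very few bins most of the shape is hidden, while with many bins individual observations start to dominate the picture.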

## Shape of the distribution: 1) skewness

Is the histogram **right skewed**, **left skewed**, or **symmetric**?

```{r, echo=F, message=F, warning=F,fig.width=6, fig.height=3,fig.align='center'}
set.seed(234)
x1 <- rchisq(65, 3)
x2 <- c(runif(20, 0,10), rnorm(100, 16.5, 2))
x3 <- rnorm(100, 35, 12)
par(mfrow=c(1,3), mar=c(1.9, 2, 1, 2), mgp=c(2.4, 0.7, 0))
histPlot(x1, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x2, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x3, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
```

__________________________

Note:

Histograms are said to be skewed to the side of the long tail.

## Shape of the distribution: 2) unusual observations

Are there any unusual observations or potential **outliers**?

```{r, echo=F, message=F, warning=F,fig.width=6, fig.height=3,fig.align='center'}
set.seed(195)
x1 <- c(rchisq(65, 3), 20)
x2 <- c(rnorm(100, 35, 10), rnorm(3, 100,3))
par(mfrow=c(1,2), mar=c(1.9, 2, 1, 2), mgp=c(2.4, 0.7, 0))
histPlot(x1, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x2, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
```

## Shape of the distribution: 3) modality

Does the histogram have a single prominent peak (**unimodal**), several prominent peaks (**bimodal/multimodal**), or no apparent peaks (**uniform**)?

```{r, echo=F, message=F, warning=F,fig.width=8, fig.height=2,fig.align='center'}
set.seed(51)
x1 <- rchisq(65, 6)
x2 <- c(rchisq(22, 5.8), rnorm(40, 16.5, 2))
x3 <- c(rchisq(20, 3), rnorm(35, 12), rnorm(42, 18, 1.5))
x4 <- runif(100,0,20)
par(mfrow=c(1,4), mar=c(1.9, 2, 1, 2), mgp=c(2.4, 0.7, 0))
histPlot(x1, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x2, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x3, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
histPlot(x4, axes=FALSE, xlab='', ylab='', col=COL[1])
axis(1)
axis(2)
```

______________________

**Note**:

To determine modality, step back and imagine a smooth curve over the histogram (see the next slide).

## Commonly observed shapes of distributions: Modality

+ Unimodal

```{r,warning=FALSE, message=FALSE, out.width= "35%", echo=FALSE,fig.align='center'}
Unimodal <- readPNG("unimodal.png")
grid.raster(Unimodal)
```

## Commonly observed shapes of distributions: Modality

+ Unimodal

```{r,warning=FALSE, message=FALSE, out.width= "35%", echo=FALSE,fig.align='center'}
Unimodal <- readPNG("unimodal.png")
grid.raster(Unimodal)
```

+ Bimodal

```{r,warning=FALSE, message=FALSE, out.width= "35%", echo=FALSE,fig.align='center'}
Bimodal <- readPNG("bimodal.png")
grid.raster(Bimodal)
```

## Commonly observed shapes of distributions: Modality

+ Unimodal

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
Unimodal <- readPNG("unimodal.png")
grid.raster(Unimodal)
```

+ Bimodal

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
Bimodal <- readPNG("bimodal.png")
grid.raster(Bimodal)
```

+ Multimodal

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
mult <- readPNG("multimodal.png")
grid.raster(mult)
```

## Commonly observed shapes of distributions: Modality

+ Unimodal

```{r,warning=FALSE, message=FALSE, out.width= "25%", echo=FALSE,fig.align='center'}
Unimodal <- readPNG("unimodal.png")
grid.raster(Unimodal)
```

+ Bimodal

```{r,warning=FALSE, message=FALSE, out.width= "25%", echo=FALSE,fig.align='center'}
Bimodal <- readPNG("bimodal.png")
grid.raster(Bimodal)
```

+ Multimodal

```{r,warning=FALSE, message=FALSE, out.width= "25%", echo=FALSE,fig.align='center'}
mult <- readPNG("multimodal.png")
grid.raster(mult)
```

+ Uniform

```{r,warning=FALSE, message=FALSE, out.width= "25%", echo=FALSE,fig.align='center'}
unif <- readPNG("uniform.png")
grid.raster(unif)
```

## Commonly observed shapes of distributions: Skewness

+ Right Skew

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
skew <- readPNG("right_skew.png")
grid.raster(skew)
```

+ Left Skew

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
skewd <- readPNG("left_skew.png")
grid.raster(skewd)
```

## Commonly observed shapes of distributions: Skewness

+ Right Skew

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
skew <- readPNG("right_skew.png")
grid.raster(skew)
```

+ Left Skew

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
skewd <- readPNG("left_skew.png")
grid.raster(skewd)
```

+ Symmetric

```{r,warning=FALSE, message=FALSE, out.width= "20%", echo=FALSE,fig.align='center'}
sk <- readPNG("symmetric.png")
grid.raster(sk)
```

## Extracurricular activities

How would you describe the shape of the distribution of hours per week students spend on extracurricular activities?

```{r, echo=F, message=F, warning=F, out.width="50%",fig.align='center'}
histPlot(extracurr_hrs, col = COL[1], xlab = "Hours / week spent on extracurricular activities", ylab = "",cex.lab=1.5,cex.axis=2)
```

Unimodal and right skewed, with a potentially unusual observation at 60 hours/week.

## Variance

**Variance** is roughly the average squared deviation from the mean.

$$s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$

For example, for the data $x$: 1, 2, 5, 20,

$$
\begin{aligned}
\bar{x} &= 7 \\
(x-\bar{x}) &: -6, -5, -2, 13 \\
(x-\bar{x})^2 &: 36, 25, 4, 169 \\
s^2 &= \frac{234}{3} = 78
\end{aligned}
$$

We use the squared deviation in the calculation of variance:

- To get rid of negatives so that observations equally distant from the mean are weighted equally.
- To weigh larger deviations more heavily.
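The worked example for the data 1, 2, 5, 20 can be reproduced step by step in R. This is a minimal sketch; the variable names are chosen here for illustration, and the last line checks the result against R's built-in `var()` function, which also divides by $n-1$.

```r
x <- c(1, 2, 5, 20)

deviations <- x - mean(x)    # -6, -5, -2, 13
squared    <- deviations^2   # 36, 25, 4, 169

# Sum of squared deviations divided by n - 1
s2_by_hand <- sum(squared) / (length(x) - 1)

s2_by_hand  # 78
var(x)      # 78
```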

## Variance

**Variance** is roughly the average squared deviation from the mean.

Example: For the sample data set "Hours of sleep per night", the sample size is $n = 217$, sample mean is $\bar{x} = 6.71,$ and the variance is calculated as follows:

```{r,echo=FALSE,message=F, warning=F,out.width="80%", fig.height=4, fig.align='center'}
d = read.csv("sleep.csv")
sleep = d$sleep[!is.na(d$sleep)]
# hist
histPlot(sleep, col = COL[1], xlab = "Hours of sleep / night", ylab = "", cex.lab=2)
```

$$s^2 = \frac{(5-6.71)^2+(9-6.71)^2+\dots+(7-6.71)^2}{217-1} = 4.11 \text{ hours}^2$$

## Standard Deviation

The **standard deviation** is the square root of the variance, and has the same units as the data.

$$s = \sqrt{s^2}$$

$$ s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}$$

Standard deviation measures the variability of data:

1) if $\color{blue}{s}$ is small, the data are concentrated around the mean $\color{blue}{\bar{x}}$;
2) if $\color{blue}{s}$ is large, the data are spread further from the mean $\color{blue}{\bar{x}}$.

## Standard Deviation

+ The standard deviation of the amount of sleep students get per night can be calculated as:

$$s = \sqrt{4.11} = 2.03 \text{ hours}$$

+ We can see that all of the data are within 3 standard deviations of the mean (center), i.e. between 0.62 and 12.8:

$$\bar{x} = 6.71, \quad 3s = 6.09$$

$$\bar{x} - 3s = 6.71 - 6.09 = 0.62$$

$$\bar{x} + 3s = 6.71 + 6.09 = 12.8$$
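The same interval can be computed in R from the reported summary values. This is a small sketch that works only from the numbers quoted above (mean 6.71, variance 4.11), not from the raw sleep data; the variable names are chosen for illustration.

```r
x_bar <- 6.71                  # sample mean reported above (hours)
s     <- round(sqrt(4.11), 2)  # standard deviation, rounded to 2.03 as above

lower <- x_bar - 3 * s
upper <- x_bar + 3 * s

round(c(lower, upper), 2)  # 0.62 12.80
```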

```{r,echo=FALSE,message=F, warning=F,out.width="110%", fig.height=4, fig.align='center'}
d = read.csv("sleep.csv")
sleep = d$sleep[!is.na(d$sleep)]
# hist
histPlot(sleep, col = COL[1], xlab = "Hours of sleep / night", ylab = "", cex.lab=2)
```

## Standard deviation

Example: The following are samples of women’s and men’s ideal number of children. Find the standard deviation for each group.

Men: 0, 0, 0, 2, 4, 4, 4

Women: 0, 2, 2, 2, 2, 2, 4

- Both men and women have a mean of $\bar{x} = \frac{14}{7}=2$.
- The deviations $(x_i - \bar{x})$ for men are: −2, −2, −2, 0, 2, 2, 2.
- The standard deviation (SD) for men is

$$s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{3(-2)^2+0+3(2)^2}{7-1}}=\sqrt{\frac{24}{6}}=2$$

- The deviations $(x_i - \bar{x})$ for women are: −2, 0, 0, 0, 0, 0, 2.
- The standard deviation (SD) for women is

$$s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(-2)^2+5(0)^2+(2)^2}{7-1}}=\sqrt{\frac{8}{6}}\approx 1.15$$

## Standard deviation

Example: The following are samples of women’s and men’s ideal number of children. Find the standard deviation for each group.

Men: 0, 0, 0, 2, 4, 4, 4

Women: 0, 2, 2, 2, 2, 2, 4

- Using R, we can calculate the standard deviation using the `sd()` function as follows:

```{r Ex4, exercise = TRUE}
sd(c(0, 0, 0, 2, 4, 4, 4))
sd(c(0, 2, 2, 2, 2, 2, 4))
```

## Understand Standard Deviation

$$ s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}$$

+ The standard deviation $s$ represents a type of average distance of observations from the mean.
+ The standard deviation is zero ($s = 0$) only when all observations have the same value; otherwise $s > 0$.
+ The larger the value of the standard deviation $s$, the greater the variability of the data. As the spread of the data increases, $s$ gets larger.
+ $s$ has the same units of measurement as the original observations, while the variance $s^2$ does not.
+ The standard deviation $s$ is not resistant. That is, strong skewness or a few outliers can greatly increase $s$.
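A quick way to see the last two properties is to reuse the small data sets from the mean discussion (5, 3, 8, 5, 6, and the same values with 8 replaced by 80). The sketch below is illustrative only; the constant vector in the last line is made up to show the $s = 0$ case.

```r
x         <- c(5, 3, 8, 5, 6)
x_outlier <- c(5, 3, 80, 5, 6)  # same data with 8 replaced by 80

sd(x)          # about 1.82
sd(x_outlier)  # about 33.67: one extreme value inflates s

sd(c(4, 4, 4, 4))  # 0: all observations have the same value
```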

## Median

- The **median** is the value that splits the data in half when ordered in ascending order.

$$0, 1, \textbf{2}, 3, 4$$

- If there is an even number of observations, then the median is the average of the two values in the middle.

$$0, 1, \underline{2, 3}, 4, 5 \rightarrow \frac{2+3}{2} = \textbf{2.5}$$

- Since the $\color{red}{median}$ is the midpoint of the data, 50% of the values are below it. Hence, it is also the $\color{red}{50^{\textbf{th}}}$ **percentile**.
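The definition above translates directly into a few lines of R. The helper `median_by_hand()` is a hypothetical name used only for this sketch; it sorts the data and then picks the middle value (odd $n$) or averages the two middle values (even $n$).

```r
# Hypothetical helper that follows the definition of the median
median_by_hand <- function(x) {
  x <- sort(x)                      # order the data in ascending order
  n <- length(x)
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(x[c(n / 2, n / 2 + 1)])    # even n: average of the two middle values
  }
}

median_by_hand(c(0, 1, 2, 3, 4))     # 2
median_by_hand(c(0, 1, 2, 3, 4, 5))  # 2.5
```

The results agree with the two small examples above.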

## Median

**Example.** The data below give the per capita CO2 emissions in the 9 largest nations, measured in metric tons per person. Find the value of the median.

China 5.9; India 1.4; U.S. 16.9; Indonesia 1.8; Brazil 2.1; Pakistan 0.8; Nigeria 0.3; Bangladesh 0.4; Russia 11.6.

**Solution.** First, put the 9 observations in ascending order:

0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9

Since $n = 9$ is odd, the median is the middle observation: median = 1.8 metric tons.

- In R, we calculate the median by applying the `median()` function to the raw (unordered) values:

```{r Ex5, exercise = TRUE}
median(c(5.9, 1.4, 16.9, 1.8, 2.1, 0.8, 0.3, 0.4, 11.6))
```

## Median

- Unlike the mean, the median is a resistant measure: its value is not sensitive to outliers.

For the data 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9:

```{r, out.width="80%", fig.align='center'}
px <- readJPEG("px.jpeg")
grid.raster(px)
```

- If we drop the U.S. value, what is the new median? The new mean?

    0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6

    The new median is $\frac{1.4+1.8}{2}=1.6$ and the new mean is 3.04.

- If we instead change the last value, what is the new median? The new mean?

    0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, $\color{red}{106.9}$

    The new median is still 1.8, but the new mean is $\color{red}{14.58}$.
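Both scenarios can be checked in R with the CO2 values from the example. This is a minimal sketch; the object names `co2`, `dropped` and `changed` are chosen here for illustration.

```r
co2 <- c(0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9)

# Scenario 1: drop the largest value (U.S., 16.9)
dropped <- co2[-which.max(co2)]
median(dropped)          # 1.6
round(mean(dropped), 2)  # 3.04

# Scenario 2: replace the largest value with 106.9
changed <- co2
changed[which.max(changed)] <- 106.9
median(changed)          # 1.8, unchanged from the original data
round(mean(changed), 2)  # 14.58
```

The median barely moves in either scenario, while the mean follows the extreme value.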

## Percentile

- The $p$th percentile is a value such that $p$ percent of the observations fall below or at that value, and $(100-p)$ percent of the observations fall above or at that value.

```{r, out.width= "60%"}
pt <- readJPEG("pt.jpg")
grid.raster(pt)
```

- The median is a special percentile: the 50th percentile.

## Quartiles and (IQR)

- The $25^{th}$ percentile is also called the first quartile, **Q1**.
- The $50^{th}$ percentile is also called the median.
- The $75^{th}$ percentile is also called the third quartile, **Q3**.
- Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the **interquartile range**, or the **IQR**.

$$IQR = Q3-Q1$$

**The Quartiles Split the Distribution into Four Parts**

```{r, out.width="100%"}
iqr <- readJPEG("iqr.jpeg")
grid.raster(iqr)
```

## Finding Quartiles

Steps:

- Arrange the data in order.
- Find the median. This is the second quartile, Q2.
- Consider the lower half of the observations (excluding the median itself if $n$ is odd). The median of these observations is the first quartile, Q1.
- Consider the upper half of the observations (excluding the median itself if $n$ is odd). Their median is the third quartile, Q3.
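The steps above can be sketched as a short R function. The name `quartiles_by_hand()` is hypothetical and used only for this illustration; it is applied to the nine CO2 values from the median example.

```r
# Hypothetical helper implementing the median-of-halves steps
quartiles_by_hand <- function(x) {
  x <- sort(x)               # step 1: arrange the data in order
  n <- length(x)
  half <- floor(n / 2)       # size of each half; the median is left out if n is odd
  lower <- x[seq_len(half)]
  upper <- x[(n - half + 1):n]
  c(Q1 = median(lower), Q2 = median(x), Q3 = median(upper))
}

co2 <- c(5.9, 1.4, 16.9, 1.8, 2.1, 0.8, 0.3, 0.4, 11.6)
quartiles_by_hand(co2)
# Q1 = 0.6, Q2 = 1.8, Q3 = 8.75
```

Here $n = 9$ is odd, so the median 1.8 is excluded from both halves: Q1 is the average of 0.4 and 0.8, and Q3 is the average of 5.9 and 11.6.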

## Five-Number Summary

**Example: Cereal Sodium Data**

For the sodium values of the 20 breakfast cereals in the table below:

```{r, out.width="80%"}
fs <- readJPEG("fs.jpeg")
grid.raster(fs)
```

- The median of the 20 values is the average of the 10th and 11th observations, 180 and 180, which is Q2 = 180 mg.
- The first quartile Q1 is the median of the 10 smallest observations (in the top row), which is the average of 130 and 140, Q1 = 135 mg.
- The third quartile Q3 is the median of the 10 largest observations (in the bottom row), which is the average of 200 and 210, Q3 = 205 mg.
- Five-number summary: minimum; first quartile; median; third quartile; maximum.
- In this example, the five-number summary is: 0, 135, 180, 205, 340.

## Five-Number Summary in R

**Example: Cereal Sodium Data**

```{r, out.width="30%"}
grid.raster(fs)
```

In R, we can compute the five-number summary using the `summary()` function as shown below. Note that `quantile.type = 2` reproduces the quartiles found by the hand method for most sample sizes. The exception is when $n$ leaves a remainder of 1 when divided by 4 (for example, `n = 5` or `n = 9`), in which case we need to use `quantile.type = 6` instead.

```{r Ex7, exercise = TRUE}
summary(c(0,50,70,100,130,140,140,150,160,180,
          180,180,190,200,200,210,210,220,290,340),
        quantile.type = 2) # change type to 6 if n = 5, 9, 13, ...
```

## The Interquartile Range (IQR) and Outliers

The interquartile range is the distance between the third quartile and the first quartile:

$$IQR = Q3-Q1$$

The IQR gives the spread of the middle 50% of the data distribution.

The IQR measures the variability.

```{r,out.width= "100%" }
ik <-readJPEG("ik.jpeg")
grid.raster(ik)
```

An observation is a potential outlier if it falls more than $1.5 \times IQR$ below the first quartile or more than $1.5 \times IQR$ above the third quartile. That is, if either

$$x < Q_1 - 1.5 \times IQR \quad \text{or} \quad x > Q_3 + 1.5 \times IQR,$$

then the value $x$ is a potential outlier.
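Applied to the cereal sodium data, the rule can be sketched in R as follows. The quartiles are the hand-computed values from the five-number summary (Q1 = 135, Q3 = 205); the names `lower_fence` and `upper_fence` are chosen here for illustration.

```r
sodium <- c(0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
            180, 180, 190, 200, 200, 210, 210, 220, 290, 340)

q1  <- 135        # first quartile from the five-number summary
q3  <- 205        # third quartile from the five-number summary
iqr <- q3 - q1    # 70

lower_fence <- q1 - 1.5 * iqr  # 30
upper_fence <- q3 + 1.5 * iqr  # 310

# Observations outside the fences are potential outliers
sodium[sodium < lower_fence | sodium > upper_fence]  # 0 340
```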

## Box Plot

A boxplot uses the quartiles to draw a box and two whiskers that summarize the data. (A boxplot can be horizontal or vertical.)

1) The box represents the middle 50% of the data, from Q1 to Q3. The thick line in the box is the median.
2) A line goes from the lower end of the box to the smallest observation that is not a potential outlier (lower whisker).
3) A line goes from the upper end of the box to the largest observation that is not a potential outlier (upper whisker).
4) Potential (suspected) outliers, if there are any, are marked individually.
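The numbers behind these four steps can be inspected with base R's `boxplot.stats()`, here applied to the cereal sodium data as a sketch. Note that `boxplot.stats()` uses Tukey's hinges for the ends of the box; for this data set they coincide with Q1 and Q3 found by hand, but they may differ slightly for other sample sizes.

```r
sodium <- c(0, 50, 70, 100, 130, 140, 140, 150, 160, 180,
            180, 180, 190, 200, 200, 210, 210, 220, 290, 340)

b <- boxplot.stats(sodium)  # default coef = 1.5, i.e. the 1.5 x IQR rule

# Lower whisker, lower end of box, median, upper end of box, upper whisker
b$stats  # 50 135 180 205 290

# Observations marked individually as potential outliers
b$out    # 0 340
```

The whiskers stop at 50 and 290, the most extreme observations that are not potential outliers. Calling `boxplot(sodium)` would draw the corresponding plot.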

## Box plot (Example)

Below is a boxplot of interest rates from the loan data set. From the plot:

1) Find (estimate) the quartiles, the IQR, and the five-number summary.
2) Are the data skewed?

```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'}
# layout
d = read.csv("study_hours.csv")
study_hours = d$study_hours[!is.na(d$study_hours)]
par(mar=c(0.8,4,0,1), mgp=c(2.8, 0.7, 0), las=1)
boxPlot(study_hours, col = COL[1,3], ylab = "# of study hours / week", axes=FALSE, xlim = c(0,3.5), pch = 20)
axis(2)
arrows(2,0, 1.40,min(study_hours)-0.5, length=0.08)
text(2,0.5,'lower whisker', pos=4, cex=1.5)
arrows(2, 8, 1.40, quantile(study_hours, 0.25), length=0.08)
text(2,8,expression(Q[1]~~'(first quartile)'), pos=4, cex=1.5)
m <- median(study_hours)
arrows(2, m, 1.40, m, length=0.08)
text(2,m,'median', pos=4, cex=2)
q <- quantile(study_hours, 0.75)
arrows(2, q, 1.40, q, length=0.08)
text(2,q,expression(Q[3]~~'(third quartile)'), pos=4, cex=1.5)
arrows(2, 35, 1.40, 35, length=0.08)
text(2,35,'max whisker reach\n& upper whisker', pos=4, cex=1.5)
arrows(2, 47, 1.40, 45, length=0.08)
arrows(2, 47, 1.40, 49, length=0.08)
text(2,47,'suspected outliers', pos=4, cex=1.5)
points(rep(0.4, 99), rev(sort(study_hours))[1:99], cex=rep(2, 27), col=rep(COL[1,3], 99), pch=rep(20, 99))
points(rep(0.4, 99), sort(study_hours)[1:99], cex=rep(2, 27), col=rep(COL[2], 99), pch=rep(1, 99))
```

Estimated answer: $Q_1 = 8$, $Q_2 = 10$, $Q_3 = 14$, so $IQR = 6$. The maximum whisker reach is $14 + 1.5 \times 6 = 23$. Min = 5 and Max = 26. The potential outliers are 25 and 26. The data are skewed to the right.

## More on Boxplot (Basic Boxplot)

When there are no outliers (the basic case), the boxplot is determined by the five-number summary.

**Example**. Given a data set with the following boxplot:

1) Calculate the IQR.
2) Explain why there is no outlier.
3) What percentage of the data is below 4?
4) What percentage of the data is above 10?
5) Can you read the five-number summary from the boxplot?

```{r, out.width= "90%"}
bx <- readJPEG("bx.jpeg")
grid.raster(bx)
```

## Boxplot

Multiple boxplots may be placed side by side to compare data sets.

**Example.** The following are two boxplots for the heights (in inches) of boys and girls. Use the boxplots to answer the following questions:

1) What is the five-number summary for each group (boys, girls)?
2) If a girl is 68 inches tall, what percentage of girls are at or below her height? What percentage of boys are at or below her height?
3) What percentage of boys are at least as tall as every girl?

```{r, out.width= "70%"}
hg <- readJPEG("hg.jpeg")
grid.raster(hg)
```

## Practice

Match each histogram to its boxplot.

```{r, out.width= "85%"}
mn <- readJPEG("mn.jpeg")
grid.raster(mn)
```

Introduction to Probability & Statistics