Chapter 7

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(readr) library(openintro) library(here) library(tidyverse) library(xtable) data(COL) ``` # One-sample means with the _t_ distribution ## Heights - According to the CDC, the mean height of U.S. adults ages 20 and older is about 66.5 inches (69.3 inches for males, and 63.8 inches for females). - In our sample data, we have a sample of 430 college students from a single college. ```{r, echo=F, message=F, warning=F, out.width="75%",fig.align='center'} set.seed(123) col_heights = rnorm(n = 430, mean = 67, sd = 5) hist(col_heights, xlab = "Height (inches)", main = "Histogram of college student heights") ``` ## Summary statistics \small \begin{center} \begin{tabular}{ccccc} \hline n & $\bar x$ & s & minimum & maximum \\ \hline 430 & 67.09 & 4.86 & 53.78 & 83.21 \\ \hline \end{tabular} \end{center} **Objective:** We would like to investigate if the mean height of students at this college is significantly different than 66.5 inches. ## From the Z-Test to the T-Test Similar to the case of proportions, under certain conditions, we can perform a hypothesis test about the mean $\mu$ using the test statistic $$Z = \frac{\bar{x}-\mu_0}{\frac{\sigma}{\sqrt{n}}}$$ where $\sigma$ is the population standard deviation and $\mu_0=66.5$ is the hypothesized value for $\mu$. \pause - But we do not know $\sigma$ to calculate the SE=$\sigma/\sqrt{n}$ \pause - We can estimate $\sigma$ using the sample standard deviation $s$. - The estimated SE will be SE=$s/\sqrt{n}$ \pause - Then the test statistic becomes $$T = \frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$$ ## Conditons As long as observations are independent, and the population distribution is not extremely skewed, a large sample would ensure that... - The sampling distribution of the mean is nearly normal by the central limit theorem. - The estimate of the standard error, as $\frac{s}{\sqrt{n}}$, is reliable. ## The _t_ distribution - When the population standard deviation is unknown (almost always), the uncertainty of the standard error estimate is addressed by using a new distribution: the **_t_ distribution**. \pause - This distribution also has a bell shape, but its tails are **thicker** than the normal model's. \pause - Therefore observations are more likely to fall beyond two SDs from the mean than under the normal distribution. \pause - Extra thick tails are helpful for resolving our problem with a less reliable estimate the standard error (since _n_ is small). ```{r, echo=F, message=F, warning=F, fig.width=4, fig.height=1.4,fig.align='center'} par(mar=c(2, 1, 1, 1), mgp=c(5, 0.6, 0)) plot(c(-4, 4), c(0, dnorm(0)), type='n', axes=FALSE, cex.axis=0.8) axis(1) abline(h=0) X <- seq(-5, 5, 0.01) Y <- dnorm(X) lines(X, Y, lty=3, lwd=2) Y <- dt(X, 2) lines(X, Y, lwd=1.5, col = "#225588") legend("topright", inset = 0.05, lty = c(3,1), lwd = c(2,1.5), c("normal","t"), col = c("black","#225588"), bty = "n") ``` ## The _t_ distribution - Always centered at zero, like the standard normal ($z$) distribution. - Has a single parameter: **degrees of freedom** ($\mathbf{df}$). ```{r, echo=F, message=F, warning=F, fig.width=4, fig.height=1.7,fig.align='center'} par(mar=c(2, 1, 1, 1), mgp=c(5, 0.6, 0)) plot(c(-3.5, 7), c(0, dnorm(0)), type='n', axes=FALSE) axis(1, cex.axis = 0.75) abline(h=0) COL <- c('#000000FF', '#00000091', '#0000006F', '#0000004B', '#00000022') DF <- c('normal', 10, 5, 2, 1) X <- seq(-5, 8, 0.01) Y <- dnorm(X) lines(X, Y, col=COL[1]) for(i in 2:5){ Y <- dt(X, as.numeric(DF[i])) lines(X, Y, col=COL[i]) } legend('topright', legend=c(DF[1], paste('t, df=', DF[2:5], sep='')), col=COL, lty=rep(1, 5), bty = "n", cex = 0.85) ``` \pause \alert{What happens to shape of the $t$ distribution as $df$ increases?} \pause Approaches normal. ## Back to the student heights survey \small \begin{center} \begin{tabular}{ccccc} \hline n & $\bar x$ & s & minimum & maximum \\ \hline 430 & 67.09 & 4.86 & 53.78 & 83.21 \\ \hline \end{tabular} \end{center} **Objective:** We would like to investigate if the mean height of students at this college is significantly different than 66.5 inches. ## Hypotheses \alert{What are the hypotheses for testing for the mean of college student heights being different from 66.5 inches?} \begin{enumerate}[A)] \item $H_0: \mu = 66.5$ \\ $H_A: \mu \ne 66.5$ \item $H_0: \mu = 66.5$ \\ $H_A: \mu > 66.5$ \item $H_0: \mu = 66.5$ \\ $H_A: \mu < 66.5$ \item $H_0: \mu \ne 66.5$ \\ $H_A: \mu > 66.5$ \end{enumerate} ## Hypotheses \alert{What are the hypotheses for testing for the mean of college student heights being different from 66.5 inches?} \begin{enumerate}[A)] \item $\textcolor{red}{H_0: \mu = 66.5}$ \\ $\textcolor{red}{H_A: \mu \ne 66.5}$ \item $H_0: \mu = 66.5$ \\ $H_A: \mu > 66.5$ \item $H_0: \mu = 66.5$ \\ $H_A: \mu < 66.5$ \item $H_0: \mu \ne 66.5$ \\ $H_A: \mu > 66.5$ \end{enumerate} ## Finding the test statistic The test statistic for inference on sample mean is the $T$ statistic with $df = n - 1$. \centering{$T_{df} = \frac{\text{point estimate} - \text{null value}}{SE}$} \pause \raggedright \alert{in context...} \[\text{point estimate} = \bar{x} = 67.09\] \pause \[SE = \frac{s}{\sqrt{n}} = \frac{4.86}{\sqrt{430}} = 0.234 \] \pause \[T = \frac{67.09 - 66.5}{0.234} = 2.52\] \pause \[df = 430-1 = 429\] \noindent\rule{4cm}{0.4pt} \small \alert{Note:} Null value is 66.5 because in the null hypothesis we set $\mu = 66.5.$ ## Finding the p-value - The p-value is, once again, calculated as the area tail area under the $t$ distribution. \pause - Using R: ```{r, echo=T} 2 * pt(2.52, df = 429, lower.tail = FALSE) ``` \pause - Using a web app: [https://gallery.shinyapps.io/dist_calc/](https://gallery.shinyapps.io/dist_calc/) \pause - Or when these aren't available, we can use a $t$-table. ## Conclusion of the test \alert{What is the conclusion of this hypothese test?} \pause We saw that the p-value was extremely low. Thus, we reject the null hypothesis. Based on the p-value, we conclude that the survey provide strong evidence that the mean of the college students height is different from the mean height of U.S. adults over 20. ## What is the difference? - We concluded that there is a difference in the mean heights of the college students compared to the mean height of U.S. adults \pause - But it would be more interesting to find out what exactly this difference is. \pause - We can use a confidence interval to estimate this difference. ## Confidence interval for a sample mean - Confidence intervals are always of the form \centering{$\text{point estimate} \pm ME$} \pause - ME is always calculated as the product of a critical value and SE. \pause - $ME = t^* \times SE$ \centering{$\text{point estimate} \pm t^* \times SE$} ## Finding the critical $t (t^*)$ - We want to find the 95% confidence interval. - Using R: ```{r, echo=T} qt(p = (1+0.95)/2, df = 429) ``` - Or use the $t$-table. ## Constructing a CI for a small sample mean \alert{Which of the following is the correct calculation of a 95\% confidence interval for the heights of the college students?} \centering{$\bar{x} = 67.09 \hspace{0.5cm} s = 4.86 \hspace{0.5cm} n = 430 \hspace{0.5cm} SE = 0.234$} \begin{enumerate}[A)] \item $66.5 \pm 1.96 \times 0.234$ \item $67.09 \pm 1.97 \times 0.234$ \item $67.09 \pm -2.26 \times 0.234$ \item $66.5 \pm 2.26 \times 4.86$ \end{enumerate} ## Constructing a CI for a small sample mean \alert{Which of the following is the correct calculation of a 95\% confidence interval for the heights of the college students?} \centering{$\bar{x} = 67.09 \hspace{0.5cm} s = 4.86 \hspace{0.5cm} n = 430 \hspace{0.5cm} SE = 0.234$} \begin{enumerate}[A)] \item $66.5 \pm 1.96 \times 0.234$ \item $\textcolor{red}{67.09 \pm 1.97 \times 0.234 \rightarrow (66.63, 67.55)}$ \item $67.09 \pm -2.26 \times 0.234$ \item $66.5 \pm 2.26 \times 4.86$ \end{enumerate} ## Synthesis \alert{Does the conclusion from the hypothesis test agree with the findings of the confidence intereval?} \pause Yes, the hypothesis test found a significant difference, and the CI does not contain the null value of 66.5. ## Recap: Inference using the $t$-distribution - If $\sigma$ is unknown, use $t$-distribution with $SE = \frac{s}{\sqrt{n}}$. \pause - Conditions: - Independence of observations (often verified by random sample, and if sampling w/o replacement, n < 10% of population). - No extreme skew. \pause - Hypothesis testing: \centering{$T_{df} = \frac{\text{point estimate}-\text{null value}}{SE}, \text{where }df=n-1$} \pause - Confidence interval: $\text{point estimate} \pm t^*_{df} \times SE$