Chapter 8

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(datasets) library(tidyverse) library(shiny) library(here) library(faraway) library(xtable) library(broom) library(scales) library(jpeg) library(openintro) library(learnr) library(readr) library(knitr) library(png) library(gradethis) #remotes::install_github("rstudio/gradethis") library(learnrhash) #devtools::install_github("rundel/learnrhash") library(grid) library(GGally) library(tinytex) data(COL) ``` ## Acknowledgement

These notes use content from OpenIntro Statistics Slides by Mine Cetinkaya-Rundel.

# Inference for linear regression ## Nature or nurture?

In 1966 Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart". The data consist of IQ scores for [an assumed random sample of] 27 identical twins, one raised by foster parents, the other by the biological parents. ```{r, echo=F, message=F, warning=F, fig.width=6, fig.height=3,fig.align='center'} data(twins) r = cor(twins$Foster, twins$Biological) m1 = lm(twins$Foster ~ twins$Biological) par(mar=c(4.5,4.5,0.5,0.5)) plot(twins$Foster ~ twins$Biological, xlab = "biological IQ", ylab = "foster IQ", pch = 20, col = COL[1,2], cex = 2, las = 1, cex.axis = 1.5, cex.lab = 1.5) abline(m1, col = COL[4], lwd = 2) text(paste("R = ", round(r,3)), x = 75, y = 125, cex = 1.5) ```

Which of the following is $\underline{false}$?

```{r,out.height="30%", out.height="30%"} summary(m1) ``` \ A) Additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average. \ B) Roughly 78% of the foster twins' IQs can be accurately predicted by the model. \ C) The linear model is $\widehat{fosterIQ} = 9.2 + 0.9 \times bioIQ$. \ D) Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well.

Which of the following is $\underline{false}$?

```{r, out.height="30%", out.width="30%"} summary(m1) ``` \ A) Additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average. \ B)

Roughly 78\% of the foster twins'IQs can be accurately predicted by the model.

\ C) The linear model is $\widehat{fosterIQ} = 9.2 + 0.9 \times bioIQ$. \ D) Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well.

## Testing for the slope

Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of IQ of the foster twin. What are the appropriate hypotheses?

- $H_0: b_0 = 0; H_A: b_0 \ne 0$ - $H_0: \beta_0 = 0; H_A: \beta_0 \ne 0$ - $H_0: b_1 = 0; H_A: b_1 \ne 0$ - $H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$

## Testing for the slope

- $H_0: b_0 = 0; H_A: b_0 \ne 0$ - $H_0: \beta_0 = 0; H_A: \beta_0 \ne 0$ - $H_0: b_1 = 0; H_A: b_1 \ne 0$ - \textcolor{red}{$H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$}

## Testing for the slope

$H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$ $$ \begin{eqnarray*} \hline Estimate && Std. Error && t value && Pr(>|t|) \\ \hline \text{(Intercept)} && 9.2076 && 9.2999 && 0.99 && 0.3316 \\ \text{bioIQ} && 0.9014 && 0.0963 && 9.36 && 0.0000\\ \hline \end{eqnarray*} $$ - We always use a $t$-test in inference for regression. $\:$ - $\color{red}{Test \hspace{0.2cm} statistic, T = \frac{point~estimate - null~value}{SE}}$ - Point estimate = $b_1$ is the observed slope. - $SE_{b_1}$ is the standard error associated with the slope. - Degrees of freedom associated with the slope is $df = n - 2$, where $n$ is the sample size. - **We lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters}, $\beta_0$ and $\beta_1.$**

## Testing for the slope

$H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$ $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & \color{orange}{0.9014} & \color{green}{0.0963} & \color{orange}{9.36} & \color{blue}{0.0000} \\ \hline \end{eqnarray*} $$

$$ \begin{eqnarray*} T &=& \frac{\color{orange}{0.9014} - 0}{\color{green}{0.0963}} = \color{orange}{9.36} \\ df &=& 27 - 2 = 25 \\ p-value &=& P(|T| > \color{orange}{9.36}) < \color{blue}{0.01} \end{eqnarray*} $$

## % College graduate vs. % Hispanic in LA

What can you say about the relationship between \% college graduate and \% Hispanic in a sample of 100 zip code areas in LA?

```{r, out.width="100%"} tyy <- readPNG("tyy.png") grid.raster(tyy) ```

```{r, out.width="100%"} tpy <- readPNG("tpy.png") grid.raster(tpy) ```

## % College graduate vs. % Hispanic in LA - another look

What can you say about the relationship between of \% college graduate and \% Hispanic in a sample of 100 zip code areas in LA?

```{r, out.width="75%"} cv <- readPNG("cv).png") grid.raster(cv) ```

## % College graduate vs. % Hispanic in LA - linear model

Which of the below is the best interpretation of the slope?

$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 0.7290 & 0.0308 & 23.68 & 0.0000 \\ \%Hispanic & -0.7527 & 0.0501 &-15.01 & 0.0000 \\ \hline \end{eqnarray*} $$ - 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads, on average. - 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads, on average. - An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%. - In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

## % College graduate vs. % Hispanic in LA - linear model

Which of the below is the best interpretation of the slope?

1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads, on average.

- An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%. - In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

## % College graduate vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between \% Hispanic and \% college graduates in zip code areas in LA?

$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 0.7290 & 0.0308 & 23.68 & 0.0000 \\ \%Hispanic & -0.7527 & 0.0501 & -15.01 & 0.0000 \\ \hline \end{eqnarray*} $$ Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different than 0.

How reliable is this p-value if these zip code areas are not randomly selected?

Not very...

## Confidence interval for the slope

Remember that a confidence interval is calculated as $point~estimate \pm ME$ and the degrees of freedom associated with the slope in a simple linear regression is $n - 2$. Which of the below is the correct 95\% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.

$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & 0.9014 & 0.0963 & 9.36 & 0.0000 \\ \hline \end{eqnarray*} $$ - $9.2076 \pm 1.65 \times 9.2999$ - $0.9014 \pm 2.06 \times 0.0963$ - $0.9014 \pm 1.96 \times 0.0963$ - $9.2076 \pm 1.96 \times 0.0963$

## Confidence interval for the slope

- $9.2076 \pm 1.65 \times 9.2999$ - $\color{red}{0.9014 \pm 2.06 \times 0.0963}$ - $0.9014 \pm 1.96 \times 0.0963$ - $9.2076 \pm 1.96 \times 0.0963$ $$n=27 \qquad df = 27 - 2 = 25 \\95\%:~t^\star_{25} = 2.06 \\0.9014 \pm 2.06 \times 0.0963 \\(0.7 , 1.1)$$

## Recap

- Inference for the slope for a single-predictor linear regression model: - Hypothesis test: \[ T = \frac{b_1 - null~value}{SE_{b_1}} \qquad df = n - 2 \] - Confidence interval: \[ b_1 \pm t^\star_{df = n - 2} SE_{b_1} \] - The null value is often 0 since we are usually checking for **any** relationship between the explanatory and the response variable. - The regression output gives $b_1$, $SE_{b_1}$, and **two-tailed** p-value for the $t$-test for the slope where the null value is 0. - We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.

## Caution

- Always be aware of the type of data you're working with: random sample, non-random sample, or population. - Statistical inference, and the resulting p-values, are meaningless when you already have population data. - If you have a sample that is non-random (biased), inference on the results will be unreliable. - The ultimate goal is to have independent observations.

Introduction to Probability & Statistics