```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(readr) library(openintro) library(here) library(tidyverse) library(xtable) library(broom) library(faraway) data(COL) ``` # Inference for linear regression ## Nature or nurture? In 1966 Cyril Burt published a paper called "The genetic determination of differences in intelligence: A study of monozygotic twins reared together and apart". The data consist of IQ scores for [an assumed random sample of] 27 identical twins, one raised by foster parents, the other by the biological parents. ```{r, echo=F, message=F, warning=F, fig.width=6, fig.height=3,fig.align='center'} data(twins) r = cor(twins$Foster, twins$Biological) m1 = lm(twins$Foster ~ twins$Biological) par(mar=c(4.5,4.5,0.5,0.5)) plot(twins$Foster ~ twins$Biological, xlab = "biological IQ", ylab = "foster IQ", pch = 20, col = COL[1,2], cex = 2, las = 1, cex.axis = 1.5, cex.lab = 1.5) abline(m1, col = COL[4], lwd = 2) text(paste("R = ", round(r,3)), x = 75, y = 125, cex = 1.5) ``` ## \alert{Which of the following is \underline{false}?} \tiny ```{r} summary(m1) ``` \normalsize A) Additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average. B) Roughly 78% of the foster twins' IQs can be accurately predicted by the model. C) The linear model is $\widehat{fosterIQ} = 9.2 + 0.9 \times bioIQ$. D) Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well. ## \alert{Which of the following is \underline{false}?} \tiny ```{r} summary(m1) ``` \normalsize A) Additional 10 points in the biological twin's IQ is associated with additional 9 points in the foster twin's IQ, on average. B) \alert{Roughly 78\% of the foster twins' IQs can be accurately predicted by the model.} C) The linear model is $\widehat{fosterIQ} = 9.2 + 0.9 \times bioIQ$. D) Foster twins with IQs higher than average IQs tend to have biological twins with higher than average IQs as well. ## Testing for the slope \alert{Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of IQ of the foster twin. What are the appropriate hypotheses?} \begin{enumerate}[A)] \item $H_0: b_0 = 0; H_A: b_0 \ne 0$ \item $H_0: \beta_0 = 0; H_A: \beta_0 \ne 0$ \item $H_0: b_1 = 0; H_A: b_1 \ne 0$ \item $H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$ \end{enumerate} ## Testing for the slope \alert{Assuming that these 27 twins comprise a representative sample of all twins separated at birth, we would like to test if these data provide convincing evidence that the IQ of the biological twin is a significant predictor of IQ of the foster twin. What are the appropriate hypotheses?} \begin{enumerate}[A)] \item $H_0: b_0 = 0; H_A: b_0 \ne 0$ \item $H_0: \beta_0 = 0; H_A: \beta_0 \ne 0$ \item $H_0: b_1 = 0; H_A: b_1 \ne 0$ \item \textcolor{red}{$H_0: \beta_1 = 0; H_A: \beta_1 \ne 0$} \end{enumerate} ## Testing for the slope \footnotesize \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & 0.9014 & 0.0963 & 9.36 & 0.0000 \\ \hline \end{tabular} \end{center} \normalsize \pause \begin{itemize} \item We always use a $t$-test in inference for regression. $\:$ \\ \pause \alert{Remember:} Test statistic, $T = \frac{point~estimate - null~value}{SE}$ \pause \item Point estimate = $b_1$ is the observed slope. \pause \item $SE_{b_1}$ is the standard error associated with the slope. \pause \item Degrees of freedom associated with the slope is $df = n - 2$, where $n$ is the sample size. \pause \alert{Remember:} We lose 1 degree of freedom for each parameter we estimate, and in simple linear regression we estimate 2 parameters, $\beta_0$ and $\beta_1$. \end{itemize} ## Testing for the slope \small \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & \textcolor{orange}{0.9014} & \textcolor{green}{0.0963} & \textcolor{orange}{9.36} & \textcolor{blue}{0.0000} \\ \hline \end{tabular} \end{center} \normalsize \pause \begin{eqnarray*} T &=& \frac{\textcolor{orange}{0.9014} - 0}{\textcolor{green}{0.0963}} = \textcolor{orange}{9.36} \\ \pause df &=& 27 - 2 = 25 \\ \pause p-value &=& P(|T| > \textcolor{orange}{9.36}) < \textcolor{blue}{0.01} \end{eqnarray*} ## % College graduate vs. % Hispanic in LA \alert{What can you say about the relationship between \% college graduate and \% Hispanic in a sample of 100 zip code areas in LA?} \begin{columns} \begin{column}{0.5\textwidth} \includegraphics[width=1\columnwidth]{Prop_EduHigherThan16th.pdf} \end{column} \begin{column}{0.5\textwidth} \includegraphics[width=1\columnwidth]{Prop_RaceEthHispanic.pdf} \end{column} \end{columns} ## % College graduate vs. % Hispanic in LA - another look \alert{What can you say about the relationship between of \% college graduate and \% Hispanic in a sample of 100 zip code areas in LA?} {width="90%"} ## % College graduate vs. % Hispanic in LA - linear model \alert{Which of the below is the best interpretation of the slope?} \centering \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 0.7290 & 0.0308 & 23.68 & 0.0000 \\ \%Hispanic & -0.7527 & 0.0501 & -15.01 & 0.0000 \\ \hline \end{tabular} \begin{enumerate}[A)] \item A 1\% increase in Hispanic residents in a zip code area in LA is associated with a 75\% decrease in \% of college grads. \item A 1\% increase in Hispanic residents in a zip code area in LA is associated with a 0.75\% decrease in \% of college grads. \item An additional 1\% of Hispanic residents decreases the \% of college graduates in a zip code area in LA by 0.75\%. \item In zip code areas with no Hispanic residents, \% of college graduates is expected to be 75\%. \end{enumerate} ## % College graduate vs. % Hispanic in LA - linear model \alert{Which of the below is the best interpretation of the slope?} \centering \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 0.7290 & 0.0308 & 23.68 & 0.0000 \\ \%Hispanic & -0.7527 & 0.0501 & -15.01 & 0.0000 \\ \hline \end{tabular} \begin{enumerate}[A)] \item A 1\% increase in Hispanic residents in a zip code area in LA is associated with a 75\% decrease in \% of college grads. \item \alert{A 1\% increase in Hispanic residents in a zip code area in LA is associated with a 0.75\% decrease in \% of college grads.} \item An additional 1\% of Hispanic residents decreases the \% of college graduates in a zip code area in LA by 0.75\%. \item In zip code areas with no Hispanic residents, \% of college graduates is expected to be 75\%. \end{enumerate} ## % College graduate vs. % Hispanic in LA - linear model \alert{Do these data provide convincing evidence that there is a statistically significant relationship between \% Hispanic and \% college graduates in zip code areas in LA?} \small \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 0.7290 & 0.0308 & 23.68 & 0.0000 \\ hispanic & -0.7527 & 0.0501 & -15.01 & 0.0000 \\ \hline \end{tabular} \end{center} \pause Yes, the p-value for % Hispanic is low, indicating that the data provide convincing evidence that the slope parameter is different than 0. \pause \alert{How reliable is this p-value if these zip code areas are not randomly selected?} \pause Not very... ## Confidence interval for the slope \alert{ Remember that a confidence interval is calculated as $point~estimate \pm ME$ and the degrees of freedom associated with the slope in a simple linear regression is $n - 2$. Which of the below is the correct 95\% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.} \footnotesize \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & 0.9014 & 0.0963 & 9.36 & 0.0000 \\ \hline \end{tabular} \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{enumerate}[A)] \item $9.2076 \pm 1.65 \times 9.2999$ \item $0.9014 \pm 2.06 \times 0.0963$ \item $0.9014 \pm 1.96 \times 0.0963$ \item $9.2076 \pm 1.96 \times 0.0963$ \end{enumerate} \end{column} \begin{column}{0.5\textwidth} \end{column} \end{columns} ## Confidence interval for the slope \alert{ Remember that a confidence interval is calculated as $point~estimate \pm ME$ and the degrees of freedom associated with the slope in a simple linear regression is $n - 2$. Which of the below is the correct 95\% confidence interval for the slope parameter? Note that the model is based on observations from 27 twins.} \footnotesize \begin{center} \begin{tabular}{rrrrr} \hline & Estimate & Std. Error & t value & Pr($>$$|$t$|$) \\ \hline (Intercept) & 9.2076 & 9.2999 & 0.99 & 0.3316 \\ bioIQ & 0.9014 & 0.0963 & 9.36 & 0.0000 \\ \hline \end{tabular} \end{center} \begin{columns} \begin{column}{0.5\textwidth} \begin{enumerate}[A)] \item $9.2076 \pm 1.65 \times 9.2999$ \item \alert{$0.9014 \pm 2.06 \times 0.0963$} \item $0.9014 \pm 1.96 \times 0.0963$ \item $9.2076 \pm 1.96 \times 0.0963$ \end{enumerate} \end{column} \begin{column}{0.5\textwidth} \begin{eqnarray*} \pause n &=& 27 \qquad df = 27 - 2 = 25 \\ \pause 95\%:~t^\star_{25} &=& 2.06 \\ \pause 0.9014 &\pm& 2.06 \times 0.0963 \\ \pause (0.7 &,& 1.1) \end{eqnarray*} \end{column} \end{columns} ## Recap \begin{itemize} \item Inference for the slope for a single-predictor linear regression model: \pause \begin{itemize} \item Hypothesis test: \[ T = \frac{b_1 - null~value}{SE_{b_1}} \qquad df = n - 2 \] \pause \item Confidence interval: \[ b_1 \pm t^\star_{df = n - 2} SE_{b_1} \] \end{itemize} \pause \item The null value is often 0 since we are usually checking for \textbf{any} relationship between the explanatory and the response variable. \pause \item The regression output gives $b_1$, $SE_{b_1}$, and \textbf{two-tailed} p-value for the $t$-test for the slope where the null value is 0. \pause \item We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope. \end{itemize} ## Caution \begin{itemize} \item Always be aware of the type of data you're working with: random sample, non-random sample, or population. \pause \item Statistical inference, and the resulting p-values, are meaningless when you already have population data. \pause \item If you have a sample that is non-random (biased), inference on the results will be unreliable. \pause \item The ultimate goal is to have independent observations. \end{itemize}