Chapter 9
These notes use content from OpenIntro Statistics Slides by Mine Cetinkaya-Rundel.
- Simple linear regression: Bivariate - two variables: $y$ and $x$. - Multiple linear regression: Multiple variables: $y$ and $x_1, x_2, \cdots$
## Poverty vs. Region (east, west) $$\widehat{poverty} = 11.17 + 0.38 \times west$$ - Explanatory variable: region, **reference level:** east - **Intercept:** The estimated average poverty percentage in eastern states is 11.17\% - This is the value we get if we plug in $\color{red}{0}$ for the explanatory variable. - **Slope:** The estimated average poverty percentage in western states is 0.38\% higher than eastern states. - Then, the estimated average poverty percentage in western states is 11.17 + 0.38 = 11.55\%. - This is the value we get if we plug in $\color{red}{1}$ for the explanatory variable
## Poverty vs. Region (northeast, midwest, west, south) Which region (northeast, mid-west, west, or south) is the reference level?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.50 & 0.87 & 10.94 & 0.00 \\ region4midwest & 0.03 & 1.15 & 0.02 & 0.98 \\ region4west & 1.79 & 1.13 & 1.59 & 0.12 \\ region4south & 4.16 & 1.07 & 3.87 & 0.00 \\ \hline \end{eqnarray*} $$ - northeast - mid-west - west - south - cannot tell Which region (northeast, mid-west, west, or south) is the reference level?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.50 & 0.87 & 10.94 & 0.00 \\ region4midwest & 0.03 & 1.15 & 0.02 & 0.98 \\ region4west & 1.79 & 1.13 & 1.59 & 0.12 \\ region4south & 4.16 & 1.07 & 3.87 & 0.00 \\ \hline \end{eqnarray*} $$ - northeast - mid-west
- west - south - cannot tell Which region (northeast, midwest, west, or south) has the lowest poverty percentage?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.50 & 0.87 & 10.94 & 0.00 \\ region4midwest & 0.03 & 1.15 & 0.02 & 0.98 \\ region4west & 1.79 & 1.13 & 1.59 & 0.12 \\ region4south & 4.16 & 1.07 & 3.87 & 0.00 \\ \hline \end{eqnarray*} $$ - northeast - mid-west - west - south - cannot tell Which region (northeast, midwest, west, or south) has the lowest poverty percentage?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 9.50 & 0.87 & 10.94 & 0.00 \\ region4midwest & 0.03 & 1.15 & 0.02 & 0.98 \\ region4west & 1.79 & 1.13 & 1.59 & 0.12 \\ region4south & 4.16 & 1.07 & 3.87 & 0.00 \\ \hline \end{eqnarray*} $$ - northeast
- mid-west - west - south - cannot tell $$ \begin{eqnarray*} \hline & weight (g) & volume (cm^\text{3}) & cover \\ \hline 1 & 800 & 885 & hc \\ 2 & 950 & 1016 & hc \\ 3 & 1050 & 1125 & hc \\ 4 & 350 & 239 & hc \\ 5 & 750 & 701 & hc \\ 6 & 600 & 641 & hc \\ 7 & 1075 & 1228 & hc \\ 8 & 250 & 412 & pb \\ 9 & 700 & 953 & pb \\ 10 & 650 & 929 & pb \\ 11 & 975 & 1492 & pb \\ 12 & 350 & 419 & pb \\ 13 & 950 & 1010 & pb \\ 14 & 425 & 595 & pb \\ 15 & 725 & 1034 & pb \\ \hline \end{eqnarray*} $$
```{r, out.width="100%"} book <- readPNG("book.png") grid.raster(book) ```
The scatter plot shows the relationship between weights and volumes of books as well as the regression output. Which of the below is correct?
```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} # scatterplot: weight vs. volume m1 = lm(weight ~ volume, data = allbacks) par(mar=c(4,4,1,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.7, cex.axis = 1.5) plot(weight ~ volume, data = allbacks, pch = 19, col = COL[1,2], xlab = expression(volume~(cm^{3})), ylab = "weight (g)", cex = 2) abline(m1, lwd = 3, col = COL[4]) text(x = 600, y = 1000, expression(paste(widehat(weight)," = 108 + 0.7 volume")), cex = 1.8, col = COL[1], pos = 1) text(x = 600, y = 900, expression(paste(R^{2},"= 80%")), cex = 1.7, col = COL[1], pos = 1) ```
- Weights of 80\% of the books can be predicted accurately using this model. \ - Books that are 10 cm$^\text{3}$ over average are expected to weigh 7 g over average. \ - The correlation between weight and volume is $R = 0.80^2 = 0.64$. \ - The model underestimates the weight of the book with the highest volume. The scatter plot shows the relationship between weights and volumes of books as well as the regression output. Which of the below is correct?
```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} # scatterplot: weight vs. volume m1 = lm(weight ~ volume, data = allbacks) par(mar=c(4,4,1,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.7, cex.axis = 1.5) plot(weight ~ volume, data = allbacks, pch = 19, col = COL[1,2], xlab = expression(volume~(cm^{3})), ylab = "weight (g)", cex = 2) abline(m1, lwd = 3, col = COL[4]) text(x = 600, y = 1000, expression(paste(widehat(weight)," = 108 + 0.7 volume")), cex = 1.8, col = COL[1], pos = 1) text(x = 600, y = 900, expression(paste(R^{2},"= 80%")), cex = 1.7, col = COL[1], pos = 1) ```
- Weights of 80\% of the books can be predicted accurately using this model. \ - Books that are 10 cm$^\text{3}$ over average are expected to weigh 7 g over average.
\ - The correlation between weight and volume is $R = 0.80^2 = 0.64$. \ - The model underestimates the weight of the book with the highest volume. ```{r} summary(m1) ```
## Weights of hardcover and paperback books Can you identify a trend in the relationship between volume and weight of hardcover and paper books?
```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} # scatterplot: weight vs. volume and cover ch = rep(NA, dim(allbacks)[1]) ch[allbacks$cover == "hb"] = 15 ch[allbacks$cover == "pb"] = 17 color = rep(NA, dim(allbacks)[1]) color[allbacks$cover == "hb"] = COL[1,2] color[allbacks$cover == "pb"] = COL[2,2] par(mar=c(4,4,0.25,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.8, cex.axis = 1.25) plot(weight ~ volume, data = allbacks, col = color, xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 2) legend("topleft", inset = 0.05, c("hardcover","paperback"), col = c(COL[1,2],COL[2,2]), pch = c(15,17), cex = 1.8) ``` Can you identify a trend in the relationship between volume and weight of hardcover and paper books?
Paperbacks generally weight less than hardcover books after controlling for the book's volume. ```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} # scatterplot: weight vs. volume and cover ch = rep(NA, dim(allbacks)[1]) ch[allbacks$cover == "hb"] = 15 ch[allbacks$cover == "pb"] = 17 color = rep(NA, dim(allbacks)[1]) color[allbacks$cover == "hb"] = COL[1,2] color[allbacks$cover == "pb"] = COL[2,2] par(mar=c(4,4,0.25,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.8, cex.axis = 1.25) plot(weight ~ volume, data = allbacks, col = color, xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 2) legend("topleft", inset = 0.05, c("hardcover","paperback"), col = c(COL[1,2],COL[2,2]), pch = c(15,17), cex = 1.8) ``` ```{r} m2 <- lm(weight ~ volume + cover, data = allbacks) summary(m2) ```
## Determining the reference level Based on the regression output below, which level of $\tt{cover}$ is the reference level? Note that $\tt{pb}$: paperback.
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.9628 & 59.1927 & 3.34 & 0.0058 \\ volume & 0.7180 & 0.0615 & 11.67 & 0.0000 \\ cover:pb & -184.0473 & 40.4942 & -4.55 & 0.0007 \\ \hline \end{eqnarray*} $$ A) paperback B) hardcover Based on the regression output below, which level of $\tt{cover}$ is the reference level? Note that $\tt{pb}$: paperback.
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.9628 & 59.1927 & 3.34 & 0.0058 \\ volume & 0.7180 & 0.0615 & 11.67 & 0.0000 \\ cover:pb & -184.0473 & 40.4942 & -4.55 & 0.0007 \\ \hline \end{eqnarray*} $$ A) paperback B) $\color{red}{\text{hardcover}}$ Which of the below correctly describes the roles of variables in this regression model?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.9628 & 59.1927 & 3.34 & 0.0058 \\ volume & 0.7180 & 0.0615 & 11.67 & 0.0000 \\ cover:pb & -184.0473 & 40.4942 & -4.55 & 0.0007 \\ \hline \end{eqnarray*} $$ A) response: weight, explanatory: volume, paperback cover. B) response: weight, explanatory: volume, hardcover cover. C) response: volume, explanatory: weight, cover type. D) response: weight, explanatory: volume, cover type. Which of the below correctly describes the roles of variables in this regression model?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.9628 & 59.1927 & 3.34 & 0.0058 \\ volume & 0.7180 & 0.0615 & 11.67 & 0.0000 \\ cover:pb & -184.0473 & 40.4942 & -4.55 & 0.0007 \\ \hline \end{eqnarray*} $$ A) response: weight, explanatory: volume, paperback cover. B) response: weight, explanatory: volume, hardcover cover. C) response: volume, explanatory: weight, cover type. D) response: weight, explanatory: volume, cover type.
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.96 & 59.19 & 3.34 & 0.01 \\ volume & 0.72 & 0.06 & 11.67 & 0.00 \\ cover:pb & -184.05 & 40.49 & -4.55 & 0.00 \\ \hline \end{eqnarray*} $$ \[ \widehat{weight} = 197.96 + 0.72~volume - 184.05~cover:pb \] - **hardcover** books: plug in $\color{red}{0}$ for $\tt{cover}$ $$\widehat{weight} = 197.96 + 0.72~volume - 184.05 \times \color{red}{0} \\= 197.96 + 0.72~volume$$ - For **paperback** books: plug in $\color{red}{1}$ for $\tt{cover}$ $$\widehat{weight} = 197.96 + 0.72~volume - 184.05 \times \color{red}{1} \\=13.91 + 0.72~volume$$
## Visualizing the linear model ```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} # scatterplot: weight vs. volume and cover with lines par(mar=c(4,4,0.25,1), las=1, mgp=c(3,0.7,0), cex.lab = 1.8, cex.axis = 1.25) plot(weight ~ volume, data = allbacks, col = color, xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 3) legend("topleft", inset = 0.05, c("hardcover","paperback"), col = c(COL[1,2],COL[2,2]), pch = c(15,17), cex = 1.8) abline(a = m2$coefficients[1], b = m2$coefficients[2], col = COL[1], lwd = 2) abline(a = m2$coefficients[1] + m2$coefficients[3], b = m2$coefficients[2], col = COL[2], lwd = 2, lty = 2.5) ```
## Interpretation of the regression coefficients $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.96 & 59.19 & 3.34 & 0.01 \\ volume & 0.72 & 0.06 & 11.67 & 0.00 \\ cover:pb & -184.05 & 40.49 & -4.55 & 0.00 \\ \hline \end{eqnarray*} $$ - **Slope of volume:** $\underline{\text{All else held constant}}$, books that are 1 more cubic centimeter in volume tend to weigh about 0.72 grams more. - **Slope of cover:** $\underline{\text{All else held constant}}$, the model predicts that paperback books weigh 184 grams lower than hardcover books. - **Intercept:** Hardcover books with no volume are expected on average to weigh 198 grams. - Obviously, the intercept does not make sense in context. It only serves to adjust the height of the line.
## Prediction Which of the following is the correct calculation for the predicted weight of a paperback book that is $600~cm^3?$
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.96 & 59.19 & 3.34 & 0.01 \\ volume & 0.72 & 0.06 & 11.67 & 0.00 \\ cover:pb & -184.05 & 40.49 & -4.55 & 0.00 \\ \hline \end{eqnarray*} $$ A) $197.96 + 0.72 \times 600 - 184.05 \times 1$ B) $184.05 + 0.72 \times 600 - 197.96 \times 1$ C) $197.96 + 0.72 \times 600 - 184.05 \times 0$ D) $197.96 + 0.72 \times 1 - 184.05 \times 600$ ## Prediction Which of the following is the correct calculation for the predicted weight of a paperback book that is $600~cm^3?$
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 197.96 & 59.19 & 3.34 & 0.01 \\ volume & 0.72 & 0.06 & 11.67 & 0.00 \\ cover:pb & -184.05 & 40.49 & -4.55 & 0.00 \\ \hline \end{eqnarray*} $$ A) $197.96 + 0.72 \times 600 - 184.05 \times 1$
B) $184.05 + 0.72 \times 600 - 197.96 \times 1$ C) $197.96 + 0.72 \times 600 - 184.05 \times 0$ D) $197.96 + 0.72 \times 1 - 184.05 \times 600$ Predicting cognitive test scores of three- and four-year-old children using characteristics of their mothers. Data are from a survey of adult American women and their children - a sub-sample from the National Longitudinal Survey of Youth. $$ \begin{eqnarray*} \hline & kid\_score & mom\_hs & mom\_iq & mom\_work & mom\_age \\ \hline 1 & 65 & yes & 121.12 & yes & 27 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 5 & 115 & yes & 92.75 & yes & 27 \\ 6 & 98 & no & 107.90 & no & 18 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 434 & 70 & yes & 91.25 & yes & 25 \\ \hline \end{eqnarray*} $$
## Interpreting the slope What is the correct interpretation of the $\underline{\text{slope for mom's IQ}}$?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 19.59 & 9.22 & 2.13 & 0.03 \\ mom\_hs:yes & 5.09 & 2.31 & 2.20 & 0.03 \\ mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\ mom\_work:yes & 2.54 & 2.35 & 1.08 & 0.28 \\ mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\ \hline \end{eqnarray*} $$ All else held constant, kids with mothers whose IQs are one point higher tend to score on average 0.56 points higher. What is the correct interpretation of the $\underline{intercept}$?
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 19.59 & 9.22 & 2.13 & 0.03 \\ mom\_hs:yes & 5.09 & 2.31 & 2.20 & 0.03 \\ mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\ mom\_work:yes & 2.54 & 2.35 & 1.08 & 0.28 \\ mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\ \hline \end{eqnarray*} $$ Kids whose moms haven't gone to HS, did not work during the first three years of the kid's life, have an IQ of 0 and are 0 yrs old are expected on average to score 19.59. Obviously, the intercept does not make any sense in context. What is the correct interpretation of the slope for $\underline{mom_{work}}?$
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 19.59 & 9.22 & 2.13 & 0.03 \\ mom\_hs:yes & 5.09 & 2.31 & 2.20 & 0.03 \\ mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\ mom\_work:yes & 2.54 & 2.35 & 1.08 & 0.28 \\ mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\ \hline \end{eqnarray*} $$ All else being equal, kids whose moms worked during the first three years of the kid's life A) are estimated to score 2.54 points lower. B) are estimated to score 2.54 points higher. than those whose moms did not work. What is the correct interpretation of the slope for $\underline{mom_{work}}?$
$$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 19.59 & 9.22 & 2.13 & 0.03 \\ mom\_hs:yes & 5.09 & 2.31 & 2.20 & 0.03 \\ mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\ mom\_work:yes & 2.54 & 2.35 & 1.08 & 0.28 \\ mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\ \hline \end{eqnarray*} $$ All else being equal, kids whose moms worked during the first three years of the kid's life A) are estimated to score 2.54 points lower. B) are estimated to score 2.54 points higher.
than those whose moms did not work. ```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} poverty = read.table("poverty.txt", h = T, sep = "\t") # rename columns names(poverty) = c("state", "metro_res", "white", "hs_grad", "poverty", "female_house") # reorder columns poverty = poverty[,c(1,5,2,3,4,6)] # pairs plot panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...){ usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) rreal = cor(x, y) txtreal <- format(c(rreal, 0.123456789), digits=digits)[1] txt <- format(c(r, 0.123456789), digits=digits)[1] if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txtreal, cex = 3 * cex.cor * r) } pairs(poverty[,c(2:6)], lower.panel = panel.cor, pch = 19, col = COL[1,2], cex = 1.5, cex.labels = 1.6, cex.axis = 1.5) ```
## Predicting poverty using % female householder $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 3.31 & 1.90 & 1.74 & 0.09 \\ female\_house & 0.69 & 0.16 & 4.32 & 0.00 \\ \hline \end{eqnarray*} $$
```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} # poverty vs. female_house pov_slr = lm(poverty ~ female_house, data = poverty) par(mar=c(4,4,1,1), las=1, mgp=c(2.5,0.7,0), cex.lab = 2, cex.axis = 1.5) plot(poverty$poverty ~ poverty$female_house, ylab = "% in poverty", xlab = "% female householder", pch=19, col=COL[1,2], cex = 2) abline(pov_slr, col = COL[4], lwd = 4) ```
$$\begin{align*}R &= 0.53 \\R^2 &= 0.53^2 = 0.28\end{align*}$$ $R^2$ can be calculated in three ways: - Square the correlation coefficient of $x$ and $y$ {\small (how we have been calculating it)} - Square the correlation coefficient of $y$ and $\hat{y}$ - Based on definition: - $R^2 = \frac{\text{explained variability in }y}{\text{total variability in }y}$ Using **ANOVA** we can calculate the explained variability and total variability in $y$.
## Sum of squares $$ \begin{eqnarray*} \hline & Df & Sum Sq & Mean Sq & F value & Pr(>F) \\ \hline \text{female_house} & 1 & 132.57 & 132.57 & 18.68 & 0.00 \\ \text{Residuals }& 49 & 347.68 & 7.10 & & \\ \hline \hline \end{eqnarray*} $$ $$ \begin{eqnarray*} \text{Sum of squares of y: } SS_{Total} &=& \sum(y - \bar{y})^2 = 480.25 \\ & & \color{red} \\ \text{Sum of squares of residuals: } SS_{Error} &=& \sum e_i^2 = 347.68 \\ & &\color{red} \\ \text{Sum of squares of x: } SS_{Model} &=& SS_{Total} - SS_{Error} \\ & &\color{red} \\ &=& 480.25 - 347.68 = 132.57 \end{eqnarray*} $$
## Why bother? Why bother with another approach for calculating $R^2$ when we had a perfectly good way to calculate it as the correlation coefficient squared?
- For single-predictor linear regression, having three ways to calculate the same value may seem like overkill. - However, in multiple linear regression, we can't calculate $R^2$ as the square of the correlation between $x$ and $y$ because we have multiple $x$s. - And next we'll learn another measure of explained variability, $\mathbf{adjusted~R^2}$, that requires the use of the third approach, ratio of explained and unexplained variability. $$ \begin{eqnarray*} \hline \textbf{Linear model:}& Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & -2.58 & 5.78 & -0.45 & 0.66 \\ female\_house & 0.89 & 0.24 & 3.67 & 0.00 \\ white & 0.04 & 0.04 & 1.08 & 0.29 \\ \hline \end{eqnarray*} $$ $$ \begin{eqnarray*} \hline \textbf{ANOVA:} & Df & Sum Sq & Mean Sq & F value & Pr(>F) \\ \hline female\_house & 1 & 132.57 & 132.57 & 18.74 & 0.00 \\ white & 1 & 8.21 & 8.21 & 1.16 & 0.29 \\ Residuals & 48 & 339.47 & 7.07 & & \\ \hline Total & 50 & 480.25\\ \hline \end{eqnarray*} $$ \[ R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57 + 8.21}{480.25} = 0.29 \]
## Does adding the variable $\tt{white}$ to the model add valuable information that wasn't provided by $\tt{female\_house}$?
```{r, echo=F, message=F, warning=F, out.width="80%",fig.align='center'} pairs(poverty[,c(2:6)], lower.panel = panel.cor, pch = 19, col = COL[1,2], cex = 1.5, cex.labels = 1.6, cex.axis = 1.5) ``` $\mathbf{\text{poverty vs. % female head of household}}$ $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 3.31 & 1.90 & 1.74 & 0.09 \\ female\_house & 0.69 & 0.16 & 4.32 & 0.00 \\ \hline \end{eqnarray*} $$ $\mathbf{\text{poverty vs. % female head of household and % female hh}}$ $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & -2.58 & 5.78 & -0.45 & 0.66 \\ female\_house & 0.89 & 0.24 & 3.67 & 0.00 \\ white & 0.04 & 0.04 & 1.08 & 0.29 \\ \hline \end{eqnarray*} $$
## Collinearity between explanatory variables $\mathbf{\text{poverty vs. % female head of household}}$ $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & 3.31 & 1.90 & 1.74 & 0.09 \\ female\_house & \color{red}{0.69} & 0.16 & 4.32 & 0.00 \\ \hline \end{eqnarray*} $$ $\mathbf{\text{poverty vs. % female head of household and % female hh}}$ $$ \begin{eqnarray*} \hline & Estimate & Std. Error & t value & Pr(>|t|) \\ \hline (Intercept) & -2.58 & 5.78 & -0.45 & 0.66 \\ female\_house & \color{red}{0.89} & 0.24 & 3.67 & 0.00 \\ white & 0.04 & 0.04 & 1.08 & 0.29 \\ \hline \end{eqnarray*} $$
## Collinearity between explanatory variables - Two predictor variables are said to be collinear when they are correlated, and this **collinearity** complicates model estimation. - $Remember$:Predictors are also called explanatory or $\underline{\text{independent}}$ variables. Ideally, they would be independent of each other. - We don't like adding predictors that are associated with each other to the model, because often times the addition of such variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. **parsimonious** model. - While it's impossible to avoid collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors.
## $R^2$ vs. adjusted $R^2$ $$ \begin{eqnarray*} && R^2 && Adjusted R^2 \\ \hline Model 1 (Single-predictor) && 0.28 && 0.26 \\ Model 2 (Multiple) && 0.29 && 0.26 \end{eqnarray*} $$ - When $\underline{any}$ variable is added to the model $R^2$ increases. - But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted $R^2$ does not increase.
## Adjusted $R^2$ $$R^2_{adj} = 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - p - 1} \right)$$ where $n$ is the number of cases and $p$ is the number of predictors (explanatory variables) in the model. - Because $p$ is never negative, $R^2_{adj}$ will always be smaller than $R^2$. - $R^2_{adj}$ applies a penalty for the number of predictors included in the model. - Therefore, we choose models with higher $R^2_{adj}$ over others.
## Calculate adjusted $R^2$ $$ \begin{eqnarray*} \hline \mathbf{ANOVA:} & Df & Sum Sq & Mean Sq & F value & Pr(>F) \\ \hline female\_house & 1 & 132.57 & 132.57 & 18.74 & 0.0001 \\ white & 1 & 8.21 & 8.21 & 1.16 & 0.2868 \\ Residuals & 48 & 339.47 & 7.07 & & \\ \hline Total & 50 & 480.25\\ \hline \end{eqnarray*} $$ $$ \begin{eqnarray*} R^2_{adj} &=& 1 - \left( \frac{ SS_{Error} }{ SS_{Total} } \times \frac{n - 1}{n - p - 1} \right) \\ &=& 1 - \left( \frac{ 339.47 }{ 480.25 } \times \frac{51 - 1}{51 - 2 - 1} \right) \\ &=& 1- \left( \frac{ 339.47 }{ 480.25 } \times \frac{50}{48} \right) \\ &=& 1 - 0.74 \\ &=& 0.26 \end{eqnarray*} $$