# Chapter 9

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

```{r load-packages, echo=FALSE, message=FALSE, warning=FALSE}
library(datasets)
library(tidyverse)
library(shiny)
library(here)
library(faraway)
library(xtable)
library(broom)
library(scales)
library(jpeg)
library(openintro)
library(learnr)
library(readr)
library(knitr)
library(png)
library(gradethis)   # remotes::install_github("rstudio/gradethis")
library(learnrhash)  # devtools::install_github("rundel/learnrhash")
library(grid)
library(GGally)
library(tinytex)
library(DAAG)
library(Sleuth3)
library(ROCR)
data(allbacks)
data(COL)
```

## Acknowledgement
These notes use content from OpenIntro Statistics Slides by Mine Cetinkaya-Rundel.
# Introduction to multiple regression

## Multiple regression
- Simple linear regression: bivariate, i.e. two variables: $y$ and $x$.
- Multiple linear regression: multiple variables: $y$ and $x_1, x_2, \cdots$
## Poverty vs. Region (east, west)
$$\widehat{poverty} = 11.17 + 0.38 \times west$$

- Explanatory variable: region; **reference level:** east
- **Intercept:** The estimated average poverty percentage in eastern states is 11.17\%.
    - This is the value we get if we plug in $\color{red}{0}$ for the explanatory variable.
- **Slope:** The estimated average poverty percentage in western states is 0.38\% higher than in eastern states.
    - So the estimated average poverty percentage in western states is $11.17 + 0.38 = 11.55\%$.
    - This is the value we get if we plug in $\color{red}{1}$ for the explanatory variable.
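In R, this model is just `lm()` with a two-level factor as the predictor. A minimal sketch, assuming the poverty data frame carries a factor `region2` with levels `east` (the reference) and `west` (both the variable and level names here are hypothetical):

```{r two-level-region, echo=TRUE, eval=FALSE}
# Hypothetical sketch: poverty regressed on a two-level region factor.
# lm() codes the reference level (east) as 0 and west as 1 automatically.
m_region <- lm(poverty ~ region2, data = poverty)
summary(m_region)
```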
## Poverty vs. Region (northeast, midwest, west, south)
Which region (northeast, mid-west, west, or south) is the reference level?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 9.50 & 0.87 & 10.94 & 0.00 \\
\text{region4midwest} & 0.03 & 1.15 & 0.02 & 0.98 \\
\text{region4west} & 1.79 & 1.13 & 1.59 & 0.12 \\
\text{region4south} & 4.16 & 1.07 & 3.87 & 0.00 \\
\hline
\end{array}
$$

- northeast
- mid-west
- west
- south
- cannot tell
## Poverty vs. Region (northeast, midwest, west, south)
Which region (northeast, mid-west, west, or south) is the reference level?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 9.50 & 0.87 & 10.94 & 0.00 \\
\text{region4midwest} & 0.03 & 1.15 & 0.02 & 0.98 \\
\text{region4west} & 1.79 & 1.13 & 1.59 & 0.12 \\
\text{region4south} & 4.16 & 1.07 & 3.87 & 0.00 \\
\hline
\end{array}
$$

- **northeast**
- mid-west
- west
- south
- cannot tell

Since $\tt{region4northeast}$ does not appear in the output, northeast is absorbed into the intercept: it is the reference level.
## Poverty vs. Region (northeast, midwest, west, south)
Which region (northeast, midwest, west, or south) has the lowest poverty percentage?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 9.50 & 0.87 & 10.94 & 0.00 \\
\text{region4midwest} & 0.03 & 1.15 & 0.02 & 0.98 \\
\text{region4west} & 1.79 & 1.13 & 1.59 & 0.12 \\
\text{region4south} & 4.16 & 1.07 & 3.87 & 0.00 \\
\hline
\end{array}
$$

- northeast
- mid-west
- west
- south
- cannot tell
## Poverty vs. Region (northeast, midwest, west, south)
Which region (northeast, midwest, west, or south) has the lowest poverty percentage?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 9.50 & 0.87 & 10.94 & 0.00 \\
\text{region4midwest} & 0.03 & 1.15 & 0.02 & 0.98 \\
\text{region4west} & 1.79 & 1.13 & 1.59 & 0.12 \\
\text{region4south} & 4.16 & 1.07 & 3.87 & 0.00 \\
\hline
\end{array}
$$

- **northeast**
- mid-west
- west
- south
- cannot tell

The intercept (9.50) is the estimated poverty percentage for the reference level, northeast, and every other region's estimate is positive relative to it, so northeast has the lowest estimated poverty percentage.
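By default, `lm()` uses the first level of a factor as the reference level. A minimal sketch of how to inspect and change it, assuming the poverty data carries a four-level factor named `region4` (a hypothetical name matching the output labels above):

```{r region-reference, echo=TRUE, eval=FALSE}
# Hypothetical: inspect the level ordering; the first level is the reference
levels(poverty$region4)

# relevel() moves a chosen level to the front, making it the new reference
poverty$region4 <- relevel(poverty$region4, ref = "south")
lm(poverty ~ region4, data = poverty)
```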
## Weights of books
$$
\begin{array}{rrrl}
\hline
 & \text{weight (g)} & \text{volume (cm}^3\text{)} & \text{cover} \\
\hline
1 & 800 & 885 & \text{hc} \\
2 & 950 & 1016 & \text{hc} \\
3 & 1050 & 1125 & \text{hc} \\
4 & 350 & 239 & \text{hc} \\
5 & 750 & 701 & \text{hc} \\
6 & 600 & 641 & \text{hc} \\
7 & 1075 & 1228 & \text{hc} \\
8 & 250 & 412 & \text{pb} \\
9 & 700 & 953 & \text{pb} \\
10 & 650 & 929 & \text{pb} \\
11 & 975 & 1492 & \text{pb} \\
12 & 350 & 419 & \text{pb} \\
13 & 950 & 1010 & \text{pb} \\
14 & 425 & 595 & \text{pb} \\
15 & 725 & 1034 & \text{pb} \\
\hline
\end{array}
$$
```{r book-image, out.width="100%"}
book <- readPNG("book.png")
grid.raster(book)
```
## Weights of books
The scatter plot shows the relationship between weights and volumes of books as well as the regression output. Which of the below is correct?
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="100%", fig.align='center'}
# scatterplot: weight vs. volume
m1 <- lm(weight ~ volume, data = allbacks)
par(mar = c(4, 4, 1, 1), las = 1, mgp = c(3, 0.7, 0), cex.lab = 1.7, cex.axis = 1.5)
plot(weight ~ volume, data = allbacks, pch = 19, col = COL[1, 2],
     xlab = expression(volume~(cm^{3})), ylab = "weight (g)", cex = 2)
abline(m1, lwd = 3, col = COL[4])
text(x = 600, y = 1000,
     expression(paste(widehat(weight), " = 108 + 0.7 volume")),
     cex = 1.8, col = COL[1], pos = 1)
text(x = 600, y = 900, expression(paste(R^{2}, "= 80%")),
     cex = 1.7, col = COL[1], pos = 1)
```
- Weights of 80\% of the books can be predicted accurately using this model.
- Books that are 10 cm$^\text{3}$ over average are expected to weigh 7 g over average.
- The correlation between weight and volume is $R = 0.80^2 = 0.64$.
- The model underestimates the weight of the book with the highest volume.
## Weights of books
The scatter plot shows the relationship between weights and volumes of books as well as the regression output. Which of the below is correct?
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="100%", fig.align='center'}
# scatterplot: weight vs. volume
par(mar = c(4, 4, 1, 1), las = 1, mgp = c(3, 0.7, 0), cex.lab = 1.7, cex.axis = 1.5)
plot(weight ~ volume, data = allbacks, pch = 19, col = COL[1, 2],
     xlab = expression(volume~(cm^{3})), ylab = "weight (g)", cex = 2)
abline(m1, lwd = 3, col = COL[4])
text(x = 600, y = 1000,
     expression(paste(widehat(weight), " = 108 + 0.7 volume")),
     cex = 1.8, col = COL[1], pos = 1)
text(x = 600, y = 900, expression(paste(R^{2}, "= 80%")),
     cex = 1.7, col = COL[1], pos = 1)
```
- Weights of 80\% of the books can be predicted accurately using this model.
- **Books that are 10 cm$^\text{3}$ over average are expected to weigh 7 g over average.**
- The correlation between weight and volume is $R = 0.80^2 = 0.64$.
- The model underestimates the weight of the book with the highest volume.
## Modeling weights of books using volume
```{r}
summary(m1)
```
## Weights of hardcover and paperback books
Can you identify a trend in the relationship between volume and weight of hardcover and paper books?
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
# scatterplot: weight vs. volume and cover
ch <- rep(NA, dim(allbacks)[1])
ch[allbacks$cover == "hb"] <- 15
ch[allbacks$cover == "pb"] <- 17
color <- rep(NA, dim(allbacks)[1])
color[allbacks$cover == "hb"] <- COL[1, 2]
color[allbacks$cover == "pb"] <- COL[2, 2]
par(mar = c(4, 4, 0.25, 1), las = 1, mgp = c(3, 0.7, 0), cex.lab = 1.8, cex.axis = 1.25)
plot(weight ~ volume, data = allbacks, col = color,
     xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 2)
legend("topleft", inset = 0.05, c("hardcover", "paperback"),
       col = c(COL[1, 2], COL[2, 2]), pch = c(15, 17), cex = 1.8)
```
## Weights of hardcover and paperback books
Can you identify a trend in the relationship between volume and weight of hardcover and paper books?
Paperbacks generally weigh less than hardcover books after controlling for the book's volume.

```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
# scatterplot: weight vs. volume and cover
par(mar = c(4, 4, 0.25, 1), las = 1, mgp = c(3, 0.7, 0), cex.lab = 1.8, cex.axis = 1.25)
plot(weight ~ volume, data = allbacks, col = color,
     xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 2)
legend("topleft", inset = 0.05, c("hardcover", "paperback"),
       col = c(COL[1, 2], COL[2, 2]), pch = c(15, 17), cex = 1.8)
```
## Modeling weights of books using volume \underline{and} cover type
```{r}
m2 <- lm(weight ~ volume + cover, data = allbacks)
summary(m2)
```
## Determining the reference level
Based on the regression output below, which level of $\tt{cover}$ is the reference level? Note: $\tt{pb}$ stands for paperback.
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.9628 & 59.1927 & 3.34 & 0.0058 \\
\text{volume} & 0.7180 & 0.0615 & 11.67 & 0.0000 \\
\text{cover:pb} & -184.0473 & 40.4942 & -4.55 & 0.0007 \\
\hline
\end{array}
$$

A) paperback
B) hardcover
## Determining the reference level
Based on the regression output below, which level of $\tt{cover}$ is the reference level? Note: $\tt{pb}$ stands for paperback.
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.9628 & 59.1927 & 3.34 & 0.0058 \\
\text{volume} & 0.7180 & 0.0615 & 11.67 & 0.0000 \\
\text{cover:pb} & -184.0473 & 40.4942 & -4.55 & 0.0007 \\
\hline
\end{array}
$$

A) paperback
B) $\color{red}{\text{hardcover}}$
## Determining the reference level
Which of the below correctly describes the roles of variables in this regression model?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.9628 & 59.1927 & 3.34 & 0.0058 \\
\text{volume} & 0.7180 & 0.0615 & 11.67 & 0.0000 \\
\text{cover:pb} & -184.0473 & 40.4942 & -4.55 & 0.0007 \\
\hline
\end{array}
$$

A) response: weight; explanatory: volume, paperback cover.
B) response: weight; explanatory: volume, hardcover cover.
C) response: volume; explanatory: weight, cover type.
D) response: weight; explanatory: volume, cover type.
## Determining the reference level
Which of the below correctly describes the roles of variables in this regression model?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.9628 & 59.1927 & 3.34 & 0.0058 \\
\text{volume} & 0.7180 & 0.0615 & 11.67 & 0.0000 \\
\text{cover:pb} & -184.0473 & 40.4942 & -4.55 & 0.0007 \\
\hline
\end{array}
$$

A) response: weight; explanatory: volume, paperback cover.
B) response: weight; explanatory: volume, hardcover cover.
C) response: volume; explanatory: weight, cover type.
D) **response: weight; explanatory: volume, cover type.**
## Linear model
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.96 & 59.19 & 3.34 & 0.01 \\
\text{volume} & 0.72 & 0.06 & 11.67 & 0.00 \\
\text{cover:pb} & -184.05 & 40.49 & -4.55 & 0.00 \\
\hline
\end{array}
$$

$$\widehat{weight} = 197.96 + 0.72~volume - 184.05~cover{:}pb$$

- For **hardcover** books: plug in $\color{red}{0}$ for $\tt{cover}$:
$$\widehat{weight} = 197.96 + 0.72~volume - 184.05 \times \color{red}{0} = 197.96 + 0.72~volume$$
- For **paperback** books: plug in $\color{red}{1}$ for $\tt{cover}$:
$$\widehat{weight} = 197.96 + 0.72~volume - 184.05 \times \color{red}{1} = 13.91 + 0.72~volume$$
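Both equations can be read off the fitted model `m2` from earlier:

```{r cover-coefficients, echo=TRUE}
coef(m2)
# Paperback intercept = overall intercept + cover:pb coefficient
unname(coef(m2)["(Intercept)"] + coef(m2)["coverpb"])  # about 13.91
```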
## Visualizing the linear model
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
# scatterplot: weight vs. volume and cover, with the two parallel fitted lines
par(mar = c(4, 4, 0.25, 1), las = 1, mgp = c(3, 0.7, 0), cex.lab = 1.8, cex.axis = 1.25)
plot(weight ~ volume, data = allbacks, col = color,
     xlab = expression(volume~(cm^{3})), ylab = "weight (g)", pch = ch, cex = 3)
legend("topleft", inset = 0.05, c("hardcover", "paperback"),
       col = c(COL[1, 2], COL[2, 2]), pch = c(15, 17), cex = 1.8)
abline(a = m2$coefficients[1], b = m2$coefficients[2], col = COL[1], lwd = 2)
abline(a = m2$coefficients[1] + m2$coefficients[3], b = m2$coefficients[2],
       col = COL[2], lwd = 2, lty = 2)
```
## Interpretation of the regression coefficients
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.96 & 59.19 & 3.34 & 0.01 \\
\text{volume} & 0.72 & 0.06 & 11.67 & 0.00 \\
\text{cover:pb} & -184.05 & 40.49 & -4.55 & 0.00 \\
\hline
\end{array}
$$

- **Slope of volume:** $\underline{\text{All else held constant}}$, books that are 1 cubic centimeter larger in volume tend to weigh about 0.72 grams more.
- **Slope of cover:** $\underline{\text{All else held constant}}$, the model predicts that paperback books weigh about 184 grams less than hardcover books.
- **Intercept:** Hardcover books with no volume are expected on average to weigh 198 grams.
    - Obviously, the intercept does not make sense in context. It only serves to adjust the height of the line.
## Prediction
Which of the following is the correct calculation for the predicted weight of a paperback book that is $600~cm^3?$
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.96 & 59.19 & 3.34 & 0.01 \\
\text{volume} & 0.72 & 0.06 & 11.67 & 0.00 \\
\text{cover:pb} & -184.05 & 40.49 & -4.55 & 0.00 \\
\hline
\end{array}
$$

A) $197.96 + 0.72 \times 600 - 184.05 \times 1$
B) $184.05 + 0.72 \times 600 - 197.96 \times 1$
C) $197.96 + 0.72 \times 600 - 184.05 \times 0$
D) $197.96 + 0.72 \times 1 - 184.05 \times 600$

## Prediction
Which of the following is the correct calculation for the predicted weight of a paperback book that is $600~cm^3?$
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 197.96 & 59.19 & 3.34 & 0.01 \\
\text{volume} & 0.72 & 0.06 & 11.67 & 0.00 \\
\text{cover:pb} & -184.05 & 40.49 & -4.55 & 0.00 \\
\hline
\end{array}
$$

A) $\color{red}{197.96 + 0.72 \times 600 - 184.05 \times 1}$
B) $184.05 + 0.72 \times 600 - 197.96 \times 1$
C) $197.96 + 0.72 \times 600 - 184.05 \times 0$
D) $197.96 + 0.72 \times 1 - 184.05 \times 600$
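We can check answer A against R's `predict()`, using the model `m2` fit earlier:

```{r predict-paperback, echo=TRUE}
# Predicted weight (g) of a 600 cm^3 paperback book
predict(m2, newdata = data.frame(volume = 600, cover = "pb"))
# Hand calculation with rounded coefficients: 197.96 + 0.72 * 600 - 184.05 = 445.91
# predict() uses the unrounded coefficients, so it differs slightly
```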
## Another example: Modeling kid's test scores
Predicting cognitive test scores of three- and four-year-old children using characteristics of their mothers. Data are from a survey of adult American women and their children, a subsample from the National Longitudinal Survey of Youth.

$$
\begin{array}{rrlrlr}
\hline
 & kid\_score & mom\_hs & mom\_iq & mom\_work & mom\_age \\
\hline
1 & 65 & \text{yes} & 121.12 & \text{yes} & 27 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
5 & 115 & \text{yes} & 92.75 & \text{yes} & 27 \\
6 & 98 & \text{no} & 107.90 & \text{no} & 18 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
434 & 70 & \text{yes} & 91.25 & \text{yes} & 25 \\
\hline
\end{array}
$$
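The corresponding model fit is a one-liner in R. A sketch under the assumption that the data live in a data frame called `kid_scores` with exactly the columns shown above (the data frame name is hypothetical):

```{r kid-scores-model, echo=TRUE, eval=FALSE}
# Hypothetical data frame name; mom_hs and mom_work are yes/no factors
m_kid <- lm(kid_score ~ mom_hs + mom_iq + mom_work + mom_age, data = kid_scores)
summary(m_kid)
```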
## Interpreting the slope
What is the correct interpretation of the $\underline{\text{slope for mom's IQ}}$?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 19.59 & 9.22 & 2.13 & 0.03 \\
mom\_hs\text{:yes} & 5.09 & 2.31 & 2.20 & 0.03 \\
mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\
mom\_work\text{:yes} & 2.54 & 2.35 & 1.08 & 0.28 \\
mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\
\hline
\end{array}
$$

All else held constant, kids with mothers whose IQs are one point higher tend to score on average 0.56 points higher.
## Interpreting the intercept
What is the correct interpretation of the $\underline{intercept}$?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 19.59 & 9.22 & 2.13 & 0.03 \\
mom\_hs\text{:yes} & 5.09 & 2.31 & 2.20 & 0.03 \\
mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\
mom\_work\text{:yes} & 2.54 & 2.35 & 1.08 & 0.28 \\
mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\
\hline
\end{array}
$$

Kids whose moms haven't gone to high school, did not work during the first three years of the kid's life, have an IQ of 0, and are 0 years old are expected on average to score 19.59. Obviously, the intercept does not make any sense in context.
## Interpreting the slope
What is the correct interpretation of the slope for $\underline{mom\_work}$?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 19.59 & 9.22 & 2.13 & 0.03 \\
mom\_hs\text{:yes} & 5.09 & 2.31 & 2.20 & 0.03 \\
mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\
mom\_work\text{:yes} & 2.54 & 2.35 & 1.08 & 0.28 \\
mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\
\hline
\end{array}
$$

All else being equal, kids whose moms worked during the first three years of the kid's life

A) are estimated to score 2.54 points lower
B) are estimated to score 2.54 points higher

than those whose moms did not work.
## Interpreting the slope
What is the correct interpretation of the slope for $\underline{mom\_work}$?
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 19.59 & 9.22 & 2.13 & 0.03 \\
mom\_hs\text{:yes} & 5.09 & 2.31 & 2.20 & 0.03 \\
mom\_iq & 0.56 & 0.06 & 9.26 & 0.00 \\
mom\_work\text{:yes} & 2.54 & 2.35 & 1.08 & 0.28 \\
mom\_age & 0.22 & 0.33 & 0.66 & 0.51 \\
\hline
\end{array}
$$

All else being equal, kids whose moms worked during the first three years of the kid's life

A) are estimated to score 2.54 points lower
B) **are estimated to score 2.54 points higher**

than those whose moms did not work.
## Revisit: Modeling poverty
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
poverty <- read.table("poverty.txt", header = TRUE, sep = "\t")
# rename columns
names(poverty) <- c("state", "metro_res", "white", "hs_grad", "poverty", "female_house")
# reorder columns
poverty <- poverty[, c(1, 5, 2, 3, 4, 6)]
# panel function: print the correlation in the lower panels of the pairs plot
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  rreal <- cor(x, y)
  txtreal <- format(c(rreal, 0.123456789), digits = digits)[1]
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  if (missing(cex.cor)) cex.cor <- 0.8 / strwidth(txt)
  text(0.5, 0.5, txtreal, cex = 3 * cex.cor * r)
}
pairs(poverty[, 2:6], lower.panel = panel.cor, pch = 19, col = COL[1, 2],
      cex = 1.5, cex.labels = 1.6, cex.axis = 1.5)
```
## Predicting poverty using % female householder
$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 3.31 & 1.90 & 1.74 & 0.09 \\
female\_house & 0.69 & 0.16 & 4.32 & 0.00 \\
\hline
\end{array}
$$
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
# poverty vs. female_house
pov_slr <- lm(poverty ~ female_house, data = poverty)
par(mar = c(4, 4, 1, 1), las = 1, mgp = c(2.5, 0.7, 0), cex.lab = 2, cex.axis = 1.5)
plot(poverty$poverty ~ poverty$female_house, ylab = "% in poverty",
     xlab = "% female householder", pch = 19, col = COL[1, 2], cex = 2)
abline(pov_slr, col = COL[4], lwd = 4)
```
$$
\begin{align*}
R &= 0.53 \\
R^2 &= 0.53^2 = 0.28
\end{align*}
$$
## Another look at $R^2$
$R^2$ can be calculated in three ways:

- Square the correlation coefficient of $x$ and $y$ (how we have been calculating it)
- Square the correlation coefficient of $y$ and $\hat{y}$
- Based on the definition:
$$R^2 = \frac{\text{explained variability in }y}{\text{total variability in }y}$$

Using **ANOVA** we can calculate the explained variability and total variability in $y$.
## Sum of squares
$$
\begin{array}{lrrrrr}
\hline
 & \text{Df} & \text{Sum Sq} & \text{Mean Sq} & \text{F value} & \text{Pr}(>F) \\
\hline
female\_house & 1 & 132.57 & 132.57 & 18.68 & 0.00 \\
\text{Residuals} & 49 & 347.68 & 7.10 & & \\
\hline
\end{array}
$$

$$
\begin{aligned}
\text{Sum of squares of } y\text{: } SS_{Total} &= \sum(y - \bar{y})^2 = 480.25 \\
\text{Sum of squares of residuals: } SS_{Error} &= \sum e_i^2 = 347.68 \\
\text{Sum of squares of the model: } SS_{Model} &= SS_{Total} - SS_{Error} \\
&= 480.25 - 347.68 = 132.57
\end{aligned}
$$
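Since `pov_slr` is already fit above, we can verify that all three approaches to $R^2$ agree:

```{r r-squared-three-ways, echo=TRUE}
# 1. Square of the correlation between x and y
cor(poverty$female_house, poverty$poverty)^2

# 2. Square of the correlation between y and the fitted values y-hat
cor(poverty$poverty, fitted(pov_slr))^2

# 3. Ratio of explained to total variability, via the sums of squares
ss_error <- sum(resid(pov_slr)^2)
ss_total <- sum((poverty$poverty - mean(poverty$poverty))^2)
(ss_total - ss_error) / ss_total
```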
## Why bother?
Why bother with another approach for calculating $R^2$ when we had a perfectly good way to calculate it as the correlation coefficient squared?
- For single-predictor linear regression, having three ways to calculate the same value may seem like overkill.
- However, in multiple linear regression, we can't calculate $R^2$ as the square of the correlation between $x$ and $y$ because we have multiple $x$'s.
- Next we'll learn another measure of explained variability, $\mathbf{adjusted~R^2}$, which requires the third approach: the ratio of explained to total variability.
## Predicting poverty using % female hh + % white
**Linear model:**

$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & -2.58 & 5.78 & -0.45 & 0.66 \\
female\_house & 0.89 & 0.24 & 3.67 & 0.00 \\
white & 0.04 & 0.04 & 1.08 & 0.29 \\
\hline
\end{array}
$$

**ANOVA:**

$$
\begin{array}{lrrrrr}
\hline
 & \text{Df} & \text{Sum Sq} & \text{Mean Sq} & \text{F value} & \text{Pr}(>F) \\
\hline
female\_house & 1 & 132.57 & 132.57 & 18.74 & 0.00 \\
white & 1 & 8.21 & 8.21 & 1.16 & 0.29 \\
\text{Residuals} & 48 & 339.47 & 7.07 & & \\
\hline
\text{Total} & 50 & 480.25 & & & \\
\hline
\end{array}
$$

$$R^2 = \frac{\text{explained variability}}{\text{total variability}} = \frac{132.57 + 8.21}{480.25} = 0.29$$
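These numbers can be reproduced directly from the fitted two-predictor model:

```{r poverty-mlr, echo=TRUE}
# Two-predictor model; anova() gives the sequential sums of squares above
pov_mlr <- lm(poverty ~ female_house + white, data = poverty)
anova(pov_mlr)
summary(pov_mlr)$r.squared  # (132.57 + 8.21) / 480.25
```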
##
Does adding the variable $\tt{white}$ to the model add valuable information that wasn't provided by $\tt{female\_house}$?
```{r, echo=FALSE, message=FALSE, warning=FALSE, out.width="80%", fig.align='center'}
pairs(poverty[, 2:6], lower.panel = panel.cor, pch = 19, col = COL[1, 2],
      cex = 1.5, cex.labels = 1.6, cex.axis = 1.5)
```
## Collinearity between explanatory variables
**poverty vs. % female head of household**

$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 3.31 & 1.90 & 1.74 & 0.09 \\
female\_house & 0.69 & 0.16 & 4.32 & 0.00 \\
\hline
\end{array}
$$

**poverty vs. % female head of household and % white**

$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & -2.58 & 5.78 & -0.45 & 0.66 \\
female\_house & 0.89 & 0.24 & 3.67 & 0.00 \\
white & 0.04 & 0.04 & 1.08 & 0.29 \\
\hline
\end{array}
$$
## Collinearity between explanatory variables
**poverty vs. % female head of household**

$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & 3.31 & 1.90 & 1.74 & 0.09 \\
female\_house & \color{red}{0.69} & 0.16 & 4.32 & 0.00 \\
\hline
\end{array}
$$

**poverty vs. % female head of household and % white**

$$
\begin{array}{lrrrr}
\hline
 & \text{Estimate} & \text{Std. Error} & \text{t value} & \text{Pr}(>|t|) \\
\hline
\text{(Intercept)} & -2.58 & 5.78 & -0.45 & 0.66 \\
female\_house & \color{red}{0.89} & 0.24 & 3.67 & 0.00 \\
white & 0.04 & 0.04 & 1.08 & 0.29 \\
\hline
\end{array}
$$
## Collinearity between explanatory variables
- Two predictor variables are said to be collinear when they are correlated, and this **collinearity** complicates model estimation.
- *Remember:* Predictors are also called explanatory or $\underline{\text{independent}}$ variables. Ideally, they would be independent of each other.
- We don't like adding predictors that are associated with each other to the model, because oftentimes the addition of such a variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. the **parsimonious** model.
- While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to prevent correlation among predictors.
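One quick diagnostic is simply the correlation between the two predictors, the same quantity displayed in the pairs plot above:

```{r predictor-correlation, echo=TRUE}
# Correlation between the two predictors in the poverty model
cor(poverty$female_house, poverty$white)
```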
## $R^2$ vs. adjusted $R^2$
$$
\begin{array}{lcc}
\hline
 & R^2 & \text{Adjusted } R^2 \\
\hline
\text{Model 1 (single predictor)} & 0.28 & 0.26 \\
\text{Model 2 (multiple)} & 0.29 & 0.26 \\
\hline
\end{array}
$$

- When $\underline{\text{any}}$ variable is added to the model, $R^2$ increases.
- But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted $R^2$ does not increase.
## Adjusted $R^2$
$$R^2_{adj} = 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n - 1}{n - p - 1} \right)$$

where $n$ is the number of cases and $p$ is the number of predictors (explanatory variables) in the model.

- Because $p$ is always positive, $R^2_{adj}$ will always be smaller than $R^2$.
- $R^2_{adj}$ applies a penalty for the number of predictors included in the model.
- Therefore, we choose models with higher $R^2_{adj}$ over others.
## Calculate adjusted $R^2$
**ANOVA:**

$$
\begin{array}{lrrrrr}
\hline
 & \text{Df} & \text{Sum Sq} & \text{Mean Sq} & \text{F value} & \text{Pr}(>F) \\
\hline
female\_house & 1 & 132.57 & 132.57 & 18.74 & 0.0001 \\
white & 1 & 8.21 & 8.21 & 1.16 & 0.2868 \\
\text{Residuals} & 48 & 339.47 & 7.07 & & \\
\hline
\text{Total} & 50 & 480.25 & & & \\
\hline
\end{array}
$$

$$
\begin{aligned}
R^2_{adj} &= 1 - \left( \frac{SS_{Error}}{SS_{Total}} \times \frac{n - 1}{n - p - 1} \right) \\
&= 1 - \left( \frac{339.47}{480.25} \times \frac{51 - 1}{51 - 2 - 1} \right) \\
&= 1 - \left( \frac{339.47}{480.25} \times \frac{50}{48} \right) \\
&= 1 - 0.74 \\
&= 0.26
\end{aligned}
$$
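The same calculation in R, reusing the two-predictor fit `pov_mlr` from the earlier sketch:

```{r adj-r-squared, echo=TRUE}
n <- nrow(poverty)  # 51 cases
p <- 2              # predictors: female_house and white
ss_error <- sum(resid(pov_mlr)^2)
ss_total <- sum((poverty$poverty - mean(poverty$poverty))^2)
1 - (ss_error / ss_total) * (n - 1) / (n - p - 1)
summary(pov_mlr)$adj.r.squared  # built-in check
```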