Multiple Linear Regression


We now expand the predictors under consideration so that we can better understand which instructor- and course-related factors are associated, on average, with the highest evaluation scores. We'll start by building a model that includes almost all of the available predictors, and then reduce that model until only significant predictors of overall evaluation score remain.

First, we'll check in with Dr. Çetinkaya-Rundel again for a quick discussion of multiple linear regression.

Multiple regression is a natural generalization of simple regression: in multiple regression we are allowed to include more than a single predictor. That is, a multiple regression model takes the form: \[\displaystyle{\mathbb{E}\left[y\right] = \beta_0 + \beta_1x_1 + \beta_2x_2 + \cdots + \beta_kx_k}\] Predictors in a multiple regression model may be numerical or categorical. If a predictor is numerical, its coefficient is a slope, describing the expected change in the response given a one-unit change in that predictor while all other predictors are held constant. If a predictor is categorical, its coefficient is a vertical shift parameter (adjusting the intercept). There are a lot of intricacies here and advanced techniques to consider – look for a full course on regression analysis to see more!
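To make the slope-versus-shift distinction concrete, here is a small side example using R's built-in mtcars data (our own illustration, separate from the evals analysis): wt is numerical, while am is a two-level categorical variable.

# side example with R's built-in mtcars data
# wt (weight, in 1000s of lbs) is numerical; am (transmission) is categorical
m_demo <- lm(mpg ~ wt + factor(am), data = mtcars)
summary(m_demo)
# the wt coefficient is a slope: the expected change in mpg per 1000-lb
# increase in weight, holding transmission type constant
# the factor(am)1 coefficient is a vertical shift: manual cars (am = 1) get
# a different intercept than automatic cars (am = 0)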

There is no guarantee that the most complicated regression model will yield the best predictions. One way to measure the quality of a multiple regression model is with the adjusted R-squared metric. Adjusted R-squared measures the proportion of variability in the response that is explained by the model, but penalizes more complex models (models with fewer terms are preferable due to their simplicity). Values of adjusted R-squared closer to 1 indicate greater explanatory value.
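For reference, with \(n\) observations and \(k\) predictors the metric is \[R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}\] In R, it can be pulled straight out of a model summary, which makes comparing models easy. The sketch below again uses mtcars as a stand-in; any pair of lm() fits works the same way.

# comparing a simpler and a more complex model by adjusted R-squared
m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp + factor(am) + qsec, data = mtcars)
summary(m_small)$adj.r.squared  # penalized measure for the simpler model
summary(m_large)$adj.r.squared  # the extra terms must earn their keep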

Since the most complex models are not necessarily the best, we often start with a large model (sometimes called a full model) and reduce it by removing one insignificant predictor at a time – removing the predictor with the highest \(p\)-value at each step – until we have a model in which all predictors are significant. This procedure is often referred to as backward elimination; a small automated sketch of the idea appears below. Again, there are some intricacies here, and you should look for a full course in regression analysis to learn more!
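The sketch assumes a fitted lm() model whose terms are all main effects. The helper name backward_eliminate is our own invention, and it relies on per-term F-tests from drop1() rather than the coefficient t-tests printed by summary() – the two agree for numerical and two-level categorical predictors like those in evals.

# a minimal sketch of p-value-based backward elimination (illustrative only)
# assumes m is an lm() fit with main-effect terms; alpha is the cutoff
backward_eliminate <- function(m, alpha = 0.05) {
  repeat {
    tests <- drop1(m, test = "F")       # F-test for dropping each term
    pvals <- tests[["Pr(>F)"]][-1]      # first row is "<none>"; skip it
    terms <- rownames(tests)[-1]
    if (all(pvals <= alpha)) break      # every remaining term is significant
    worst <- terms[which.max(pvals)]    # term with the highest p-value
    m <- update(m, as.formula(paste(". ~ . -", worst)))
  }
  m
}
# usage: m_reduced <- backward_eliminate(m_full)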

Now we get back to our application of regression to analyzing course evaluations. The code block below is pre-set to build a regression model including nearly all of the predictors available in the evals dataset. Use the code block to build the regression model, view its summary, and answer the questions that follow.

Aside: You’ll notice that, rather than including all of the predictors related to beauty ratings, we include only the bty_avg predictor. We make this choice because the beauty variables are highly correlated with one another – in some sense, they encode the same information. Including highly correlated predictors (multicollinearity) can cause problems for regression – look for a full course in regression analysis to learn more about these pitfalls.
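If you'd like to check the correlation claim yourself, a quick computation (with the beauty-rating column names as documented in the openintro evals data) is:

# pairwise correlations among the six individual beauty ratings
bty_cols <- c("bty_f1lower", "bty_f1upper", "bty_f2upper",
              "bty_m1lower", "bty_m1upper", "bty_m2upper")
round(cor(evals[, bty_cols]), 2)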

# build the full regression model
m_full <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
               cls_students + cls_level + cls_profs + cls_credits +
               bty_avg + pic_color + pic_outfit, data = evals)
# view the model summary
summary(m_full)

Quiz

Use the code block below to remove the predictor corresponding to the highest \(p\)-value (just delete it from the code which constructs the model). Re-run the model and print the summary.
# starter code: the full model from above
m_reduce1 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
                  cls_students + cls_level + cls_profs + cls_credits +
                  bty_avg + pic_color + pic_outfit, data = evals)

summary(m_reduce1)

# solution: cls_profs has the highest p-value in the full model, so it is removed first
m_reduce1 <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
                  cls_students + cls_level + cls_credits +
                  bty_avg + pic_color + pic_outfit, data = evals)

summary(m_reduce1)

Use the code block below to continue reducing the model, removing one predictor at a time, until every remaining predictor is significant. Build the final model as m_final and print the model summary.
m_final <- lm(score ~ ethnicity + gender + language + age + cls_perc_eval +
                cls_credits + bty_avg + pic_color, data = evals)
summary(m_final)

As with the simple regression model, we found several statistically significant predictors of overall course evaluation score. Statistical significance, however, does not guarantee practical significance: we should not expect the characteristics identified above to be primary drivers of evaluation scores. Even so, we have replicated Hamermesh and Parker’s finding that the instrument used to measure course evaluation scores carries several implicit biases.
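One informal way to gauge practical significance, sketched below, is to set each estimated effect against the spread of the response – evaluation scores only range from 1 to 5, so small coefficients move the needle very little.

# compare estimated effects to the spread of the response
coef(m_final)[-1]         # per-unit effects, intercept dropped
diff(range(evals$score))  # observed spread of evaluation scores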

If you liked this tutorial, know that it offers just a taste of a truly powerful statistical technique: linear regression. We can use regression analysis to build predictive models that help us understand phenomena and predict future observations. Look for full courses in regression, predictive modeling, statistical learning, or machine learning if you want to learn more.

Submit

If you have completed this tutorial and are happy with all of your solutions, please click the button below to generate your hash, and submit it using the corresponding tutorial assignment tab on Blackboard.


NCAT Blackboard