Getting Started

Load packages

In this lab, you will explore and visualize the data using the tidyverse suite of packages. You will also use the GGally package for visualisation of many variables at once.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(GGally)

This is not the first time we’re using the GGally package. We already made use of this package in the previous lab regarding the Data Analysis Project Part III. You will be using the ggpairs() function from this package later in the lab.

The data

For today’s lab we will use a data set from the textbook “An Introduction to Statistical Learning with Applications in R”. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales of a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can advise the client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.

Importing the data

# Read the Advertising data and drop the first column
advertising <- read.csv("https://raw.githubusercontent.com/nguyen-toan/ISLR/master/dataset/Advertising.csv") %>%
  select(-1)

Exploring the data

  1. Use the ggpairs() function from the GGally package. Report which variable has the strongest correlation with the sales variable, and the value of that correlation. Also, interpret the scatter plot of each variable against the sales variable.
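
As a rough illustration of what ggpairs() returns, here is a minimal sketch on simulated data. The data frame toy, its coefficients, and the column names (TV, Radio, Newspaper, Sales) are assumptions made for this example only; it is not the real Advertising data, so its numbers say nothing about the answer to the exercise.

```r
library(GGally)

# Simulated stand-in for the lab data (NOT the real Advertising values);
# the column names are assumed to match the imported data frame.
set.seed(1)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

# Scatter plot matrix: by default, scatter plots below the diagonal,
# densities on it, and pairwise correlations above it
pair_plot <- ggpairs(toy)
print(pair_plot)

# The correlations with Sales as plain numbers, largest first
sales_cor <- cor(toy)[, "Sales"]
sales_cor <- sort(sales_cor[names(sales_cor) != "Sales"], decreasing = TRUE)
sales_cor
```

On the lab data, the same two calls would be applied to advertising instead of toy.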

Simple linear regression

From the scatter plot matrix above, we can see that sales and TV are strongly correlated. So let’s create a simple linear regression model, where sales is the response variable and TV is the explanatory variable, to better understand the relationship between the two variables.

  1. Report the intercept and slope of the simple linear regression model.

  2. Is the variable TV significant in the model? If so, why? Report the \(R^2\) value and its interpretation.
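
The quantities asked for above can all be read from a fitted lm object. The following is a minimal sketch on simulated data; the object names toy and fit_tv and the simulated coefficients are assumptions for illustration, not part of the lab data.

```r
# Simulated stand-in for the lab data (NOT the real Advertising values)
set.seed(1)
n <- 200
toy <- data.frame(TV = runif(n, 0, 300))
toy$Sales <- 3 + 0.05 * toy$TV + rnorm(n, sd = 1.5)

# Simple linear regression: Sales as response, TV as explanatory variable
fit_tv <- lm(Sales ~ TV, data = toy)

# Intercept and slope
coef(fit_tv)

# Coefficient table: estimate, standard error, t value, and p-value per term
fit_summary <- summary(fit_tv)
fit_summary$coefficients

# Proportion of the variability in Sales explained by the model
fit_summary$r.squared
```

The p-value in the TV row of the coefficient table tests whether the slope differs from zero, which is the usual basis for calling a predictor significant.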

We can also see the linear relationship between two variables using ggplot(), particularly with the use of geom_smooth().

# Scatter plot of Sales against TV with the least-squares line overlaid
advertising %>%
  ggplot(aes(x = TV, y = Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

  1. Does the “Line of Best Fit” match the results from the model above in terms of intercept and slope?
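
One way to check this visually is to draw the line implied by the lm() coefficients on top of the geom_smooth() line. The sketch below does so on simulated data; toy, fit_tv, and the dashed red styling are assumptions chosen for this example.

```r
library(ggplot2)

# Simulated stand-in for the lab data (NOT the real Advertising values)
set.seed(1)
n <- 200
toy <- data.frame(TV = runif(n, 0, 300))
toy$Sales <- 3 + 0.05 * toy$TV + rnorm(n, sd = 1.5)

fit_tv <- lm(Sales ~ TV, data = toy)
b <- coef(fit_tv)

# geom_smooth(method = "lm") fits the same least-squares line internally;
# geom_abline() draws the line from the lm() intercept and slope
line_plot <- ggplot(toy, aes(x = TV, y = Sales)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  geom_abline(intercept = b[["(Intercept)"]], slope = b[["TV"]],
              colour = "red", linetype = "dashed")
print(line_plot)
```

If the two lines coincide, the plotted line and the model agree on intercept and slope.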

Multiple linear regression

The data set contains two more variables: radio and newspaper. We would like to add these variables to our previous model to see to what extent they change the results for sales.

  1. Create a multiple linear regression model with sales as the response and the other variables as the explanatory variables. Report whether all the variables are significant. If not, state which variables are not significant. Report the Adjusted \(R^2\) value.

  2. Interpret the estimated coefficient of each of the variables.
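
A minimal sketch of fitting and reading a multiple regression follows, again on simulated data. The data frame toy and its coefficients are assumptions for illustration, so the output does not reveal the results for the Advertising data.

```r
# Simulated stand-in for the lab data (NOT the real Advertising values)
set.seed(1)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

# "Sales ~ ." uses every other column of the data frame as a predictor
full_model <- lm(Sales ~ ., data = toy)
full_summary <- summary(full_model)

# One row per term; the last column holds the p-values
full_summary$coefficients

# Adjusted R-squared penalises R-squared for the number of predictors
full_summary$adj.r.squared
```

Each slope estimates the expected change in the response for a one-unit increase in that predictor while the other predictors are held constant.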

The search for the best model

In a previous exercise, we created the full regression model by adding all the explanatory variables to the model to predict the response variable. We noticed that one of the explanatory variables was not significantly associated with the response variable, as indicated by its high p-value. To improve the prediction performance of the multiple linear regression model, we will remove the variable with the highest p-value and re-run the model. We can repeat this step until we obtain a model in which all the variables are statistically significant, i.e., have low p-values.

  1. Create a new multiple linear regression model after removing the insignificant predictor from the full model. What is the Adjusted \(R^2\) for this new model? How does it compare to the Adjusted \(R^2\) of the full model? Which of the models provides a better fit?
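
One step of the elimination procedure described above can be sketched as follows. This runs on simulated data in which Newspaper has, by construction, no effect on Sales; the names toy, full_model, reduced_model, and worst are assumptions for this example.

```r
# Simulated stand-in for the lab data (NOT the real Advertising values);
# Newspaper is generated with no effect on Sales
set.seed(1)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

full_model <- lm(Sales ~ TV + Radio + Newspaper, data = toy)

# p-values of the predictors (the intercept row is dropped)
pvals <- summary(full_model)$coefficients[-1, "Pr(>|t|)"]
worst <- names(which.max(pvals))
worst

# Refit without the predictor that has the highest p-value
reduced_model <- update(full_model, as.formula(paste(". ~ . -", worst)))

# Compare the two fits on adjusted R-squared
adj_r2 <- c(full    = summary(full_model)$adj.r.squared,
            reduced = summary(reduced_model)$adj.r.squared)
adj_r2
```

Writing the formula out by hand, for example lm(Sales ~ TV + Radio, data = toy), gives the same reduced model as update().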

Now that we have found a satisfactory model, we need to make sure that it does not violate any of the linear regression assumptions. We do so by plotting and inspecting the residuals of the model. The following line of code creates two diagnostic plots for the final model, a Residuals vs Fitted plot and a Normal Q-Q plot, assuming the model is stored in an object named final_model.

plot(final_model, which = 1:2)

  1. Run the above code to create two residual plots, which you will use to assess the assumptions of the regression model. Do the plots indicate that the model violates any of the assumptions? If so, which one?
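
For reference, here is a minimal sketch of the same diagnostic call on a model fitted to simulated data that satisfies the assumptions by construction. The names toy and toy_model are assumptions for this example; the plots for the lab model may look different.

```r
# Simulated data with a linear mean and constant-variance normal errors
# (NOT the real Advertising values)
set.seed(1)
n <- 200
toy <- data.frame(
  TV    = runif(n, 0, 300),
  Radio = runif(n, 0, 50)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

toy_model <- lm(Sales ~ TV + Radio, data = toy)

# Show both diagnostic plots side by side, then restore the plot layout
old_par <- par(mfrow = c(1, 2))
plot(toy_model, which = 1:2)  # 1: Residuals vs Fitted, 2: Normal Q-Q
par(old_par)

# The raw ingredients of the first plot, if you want to build it yourself
head(data.frame(fitted = fitted(toy_model), residual = residuals(toy_model)))
```

In the Residuals vs Fitted plot, a curved trend suggests non-linearity and a funnel shape suggests non-constant variance; in the Normal Q-Q plot, points far from the reference line suggest non-normal residuals.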