In this lab, you will explore and visualize the data using the tidyverse suite of packages. You will also use the GGally package for visualisation of many variables at once.
Let’s load the packages.
library(tidyverse)
library(openintro)
library(GGally)
This is not the first time we’re using the GGally package. We already made use of this package in the previous lab regarding the Data Analysis Project Part III. You will be using the ggpairs() function from this package later in the lab.
For today’s lab we will make use of a data set from a textbook known as “An Introduction to Statistical Learning with Applications in R”. Suppose that we are statistical consultants hired by a client to provide advice on how to improve sales on a particular product. The Advertising data set consists of the sales of that product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper. It is not possible for our client to directly increase sales of the product. On the other hand, they can control the advertising expenditure in each of the three media. Therefore, if we determine that there is an association between advertising and sales, then we can instruct the client to adjust advertising budgets, thereby indirectly increasing sales. In other words, our goal is to develop an accurate model that can be used to predict sales on the basis of the three media budgets.
# read the data and drop the first column
advertising = read.csv("https://raw.githubusercontent.com/nguyen-toan/ISLR/master/dataset/Advertising.csv") %>%
  .[, -1]
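Before plotting, it can help to look at the correlations numerically. The following is a minimal sketch in base R, not part of the original lab. It uses a simulated stand-in data frame called toy (a hypothetical name) so that it runs without a network connection, and the capitalised column names TV, Radio, Newspaper and Sales are an assumption about how the real file is labelled.

```r
# Simulated stand-in for the advertising data (NOT the real values)
set.seed(1)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.1 * toy$Radio + rnorm(n, sd = 1.5)

# Correlation of every column with Sales, then drop Sales itself
cors <- cor(toy)[, "Sales"]
cors <- cors[names(cors) != "Sales"]

# Predictor with the largest absolute correlation with Sales
strongest <- names(which.max(abs(cors)))

print(round(cors, 2))
print(strongest)
```

With the real data, the same two cor() lines can be applied to advertising once it has been loaded.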
Create a pairs plot of the advertising data using the ggpairs() function from the GGally package. Report the strongest correlation a variable has with the sales variable. Also, interpret the scatter plot of each variable paired with the sales variable.
From the scatter plot above, we can see that sales and TV have a really high correlation. So let’s create a simple linear regression model where sales is the response variable and TV is the explanatory variable to better understand the relationship between the variables.
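As a rough illustration of the mechanics, here is a sketch of fitting a simple linear regression with lm() on simulated data. The data frame toy, the object name slr and the true coefficients used to simulate the data are all hypothetical; only lm(), coef() and summary() are standard base R.

```r
# Simulated stand-in data (NOT the real advertising values)
set.seed(1)
n <- 200
toy <- data.frame(TV = runif(n, 0, 300))
toy$Sales <- 7 + 0.05 * toy$TV + rnorm(n, sd = 2)

# Fit Sales as a linear function of TV
slr <- lm(Sales ~ TV, data = toy)

# Named vector holding the intercept and the TV slope
print(coef(slr))

# Table of estimates, standard errors, t values and p-values
print(summary(slr)$coefficients)

# Proportion of the variability in Sales explained by TV
print(summary(slr)$r.squared)
```

Replacing toy with advertising gives the model asked for in the exercises that follow.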
Report the intercept and slope of the simple linear regression model.
Is the variable TV significant in the model? If so, why? Report the \(R^2\) value and its interpretation.
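As a reminder of what is being reported, the usual definition of \(R^2\) can be written as follows. The notation is an assumption of this note rather than something defined earlier in the lab: \(y_i\) are the observed sales, \(\hat{y}_i\) the fitted values and \(\bar{y}\) the mean of the observed sales.
\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
It is the proportion of the variability in the response that the model explains. In a simple linear regression with an intercept, it equals the square of the correlation between the response and the explanatory variable.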
We can also see the linear relationship between two variables using ggplot(), particularly with the use of geom_smooth().
advertising %>%
  ggplot(aes(x = TV, y = Sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
The data set contains two more variables, namely radio and newspaper. We would like to add these variables to our previous model to see to what extent they change the results in terms of sales.
Create a multiple linear regression model with sales as the response and the other variables as the explanatory variables. Report if all the variables are significant. If not, then mention which variables weren’t significant. Report the Adjusted \(R^2\) value.
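The following sketch shows one way to fit a model with several explanatory variables and read off the quantities mentioned above. It again uses a simulated data frame toy with assumed column names, and the object name full_model is hypothetical. Newspaper is simulated with no true effect, purely for illustration.

```r
# Simulated stand-in data (NOT the real advertising values)
set.seed(2)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

# "." on the right-hand side means "all other columns in the data"
full_model <- lm(Sales ~ ., data = toy)

# One row per term; the last column holds the p-values
print(summary(full_model)$coefficients)

# Adjusted R-squared penalises R-squared for the number of predictors
print(summary(full_model)$adj.r.squared)
```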
Interpret the estimated coefficient of each of the variables.
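For the interpretation, it may help to write the fitted model out. The symbols \(\hat\beta_0, \dots, \hat\beta_3\) are notation assumed for this note and stand for the estimates reported by the model summary.
\[
\widehat{\text{sales}} = \hat\beta_0 + \hat\beta_1\,\text{TV} + \hat\beta_2\,\text{radio} + \hat\beta_3\,\text{newspaper}
\]
Each slope is read with the other budgets held fixed. For example, \(\hat\beta_1\) is the estimated change in average sales for a one-unit increase in the TV budget when the radio and newspaper budgets stay the same.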
In a previous exercise, we created the full regression model where we added all the explanatory variables into the model to predict the response variable. We noticed that one of the explanatory variables was not significantly associated with the response variable, as indicated by its high p-value. To improve the prediction performance of the multiple linear regression model, we will remove the variable with the highest p-value and re-run the model. We can repeat this step until we obtain a model where all the variables are statistically significant, i.e., have low p-values.
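The removal step described above can be sketched as a small loop. This is only an illustration on simulated data. The data frame toy, the cutoff alpha = 0.05 and the loop itself are assumptions of this sketch, and in the lab the same steps can just as well be done by hand, one refit at a time. Newspaper is simulated with no true effect, so it will usually (though not always) be the variable that gets dropped.

```r
# Simulated stand-in data (NOT the real advertising values)
set.seed(3)
n <- 200
toy <- data.frame(
  TV        = runif(n, 0, 300),
  Radio     = runif(n, 0, 50),
  Newspaper = runif(n, 0, 100)
)
toy$Sales <- 3 + 0.05 * toy$TV + 0.2 * toy$Radio + rnorm(n, sd = 1.5)

alpha <- 0.05                       # assumed significance cutoff
model <- lm(Sales ~ ., data = toy)  # start from the full model

repeat {
  # p-values of the predictors (first row, the intercept, is excluded)
  pvals <- summary(model)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
  # stop when no predictors remain or all of them are significant
  if (nrow(pvals) == 0 || max(pvals) < alpha) break
  # name of the predictor with the highest p-value
  worst <- rownames(pvals)[which.max(pvals)]
  # refit the model without that predictor
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

final_model <- model
print(formula(final_model))
```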
Now that we have found a satisfactory model, we need to make sure that the model does not violate any of the linear regression assumptions. We do so by inspecting diagnostic plots of the model’s residuals. The following line of code creates the residuals versus fitted values plot and the normal Q-Q plot of the final model.
plot(final_model, which = 1:2)