In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
library(tidyverse)
library(openintro)
library(infer)
To create your new lab report, in RStudio, go to New File -> R Markdown… Then choose From Template, and choose Lab Report for OpenIntro Statistics Labs from the list of templates.
Every two years, the Centers for Disease Control and Prevention conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects data from high schoolers (9th through 12th grade) to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss data set into your workspace.
data(yrbss)
There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbss
Consider the variable physically_active_7d, which has 8 categories: \(0, 1, 2, \dots, 7\). We are interested in estimating the proportion of students who are physically active at least 3 days of the week. So we will create a new variable called physical_3plus, which will be coded as “yes” if the student is physically active for at least 3 days a week, and “no” if not.
yrbss <- yrbss %>%
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
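To see what this recoding does, including what happens to missing values, here is a minimal sketch on a small made-up data frame. The `toy` tibble and its values are hypothetical and are not taken from yrbss; only the recoding rule is the one used in the lab.

```r
library(dplyr)

# Hypothetical toy data: days active in the past 7 days, with one missing value
toy <- tibble(physically_active_7d = c(0, 2, 3, 7, NA))

# Same recoding rule as in the lab: more than 2 days means "at least 3 days"
toy <- toy %>%
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))

print(toy)
```

A missing physically_active_7d stays missing in physical_3plus rather than being coded “no”, which is one reason to filter out NA values before running a test on the new variable.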
Now suppose we are interested in testing whether the proportion of male students who are physically active at least 3 days a week is different from the proportion of female students who are physically active at least 3 days a week. That is, we want to see if there is a relationship between being physically active at least 3 days a week and the student's gender.
We can conduct such tests using the prop_test() function, as in the case of one proportion, but this time we add the explanatory argument and set it to gender, as shown in the code below.
yrbss %>%
  # drop rows where either variable is missing
  filter(!is.na(physical_3plus), !is.na(gender)) %>%
  prop_test(response = physical_3plus,
            success = "yes",
            explanatory = gender,
            order = c("male", "female"),
            z = TRUE,
            conf_int = TRUE,
            conf_level = 0.95)
Notice that the argument order = c("male", "female") in the above code chunk ensures that the female students' proportion is subtracted from the male students' proportion for those who are physically active at least 3 days a week. The default is order = c("female", "male"), as R goes by alphabetical order.
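The effect of the order can be illustrated with the textbook pooled two-proportion z statistic, computed by hand in base R. This is a sketch under stated assumptions: the counts are hypothetical (not from yrbss), the helper `two_prop_z()` is introduced here for illustration only, and the formula is the standard pooled version rather than a claim about the internals of prop_test().

```r
# Hypothetical counts (NOT from yrbss): students active 3+ days, by gender
x_male <- 60; n_male <- 100
x_female <- 45; n_female <- 100

# Illustrative helper: pooled z statistic for (group 1 proportion - group 2 proportion)
two_prop_z <- function(x1, n1, x2, n2) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  p_pool <- (x1 + x2) / (n1 + n2)                       # pooled proportion under H0
  se <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) # standard error under H0
  (p1 - p2) / se
}

z_male_first <- two_prop_z(x_male, n_male, x_female, n_female)
z_female_first <- two_prop_z(x_female, n_female, x_male, n_male)

# The two-sided p-value does not depend on the order
p_value <- 2 * pnorm(-abs(z_male_first))

print(c(z_male_first = z_male_first,
        z_female_first = z_female_first,
        p_value = p_value))
```

Swapping the order only flips the sign of the statistic (and of the confidence interval endpoints); the two-sided conclusion is unchanged.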
Now consider the variable weight, which reports the student's weight in kilograms (kg). Let’s first plot the distribution of this variable.
yrbss %>%
  filter(!is.na(weight)) %>%
  ggplot(aes(x = weight)) +
  geom_density()
Let’s also calculate some summary statistics for weight.
yrbss %>%
  filter(!is.na(weight)) %>%
  summarise(n = n(),
            x_bar = mean(weight),
            s = sd(weight),
            min = min(weight),
            max = max(weight))
## # A tibble: 1 x 5
##       n x_bar     s   min   max
##   <int> <dbl> <dbl> <dbl> <dbl>
## 1 12579  67.9  16.9  29.9  181.
A recent study reported that college students have a mean weight of 66.82 kg. Suppose we are interested in testing if the mean weight of all high school students in the U.S. \((\mu)\) differs from 66.82 kg. We can conduct a hypothesis test and construct a 95% confidence interval using the t_test() function as shown below:
yrbss %>%
  filter(!is.na(weight)) %>%
  t_test(response = weight,
         mu = 66.82,
         conf_int = TRUE,
         conf_level = 0.95)
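As a rough cross-check, the one-sample t statistic and a 95% confidence interval can be computed by hand from the summary statistics printed earlier. This is only a sketch: it uses the rounded values from that table (n = 12579, x_bar = 67.9, s = 16.9), so the numbers will differ slightly from what t_test() reports on the full data.

```r
# Rounded summary statistics as printed in the summary table for weight
n <- 12579
x_bar <- 67.9
s <- 16.9
mu_0 <- 66.82  # null value from the text

se <- s / sqrt(n)                  # standard error of the sample mean
t_stat <- (x_bar - mu_0) / se      # one-sample t statistic
df <- n - 1                        # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)            # two-sided p-value
ci <- x_bar + c(-1, 1) * qt(0.975, df) * se    # 95% confidence interval

print(c(t_stat = t_stat, df = df, p_value = p_value))
print(ci)
```

With a sample this large, even a difference of about 1 kg gives a large t statistic, so statistical significance here says little about whether the difference matters in practice.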
Note: t_test() is for means, while prop_test() is for proportions.
Consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
Make side-by-side violin plots of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?
yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight)) %>%
  ggplot(aes(x = physical_3plus, y = weight, fill = physical_3plus)) +
  geom_violin()
The violin plots show how the shapes and spreads of the two distributions compare, but we can also compare the means of the distributions using the following code. We first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function.
yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight)) %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight))
To answer the previous question more rigorously, we will conduct a hypothesis test.
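The lab does not spell out the hypotheses at this point. One natural formulation, assumed here for illustration, compares the mean weight of students who are active at least 3 days a week with that of students who are not. The statistic shown is the standard unpooled (Welch) two-sample t statistic; the subscripts yes and no refer to the two levels of physical_3plus.

```latex
\begin{aligned}
H_0 &: \mu_{\text{yes}} - \mu_{\text{no}} = 0 \\
H_A &: \mu_{\text{yes}} - \mu_{\text{no}} \neq 0 \\[4pt]
T &= \frac{\bar{x}_{\text{yes}} - \bar{x}_{\text{no}}}
          {\sqrt{\dfrac{s_{\text{yes}}^2}{n_{\text{yes}}} + \dfrac{s_{\text{no}}^2}{n_{\text{no}}}}}
\end{aligned}
```

Here \(\bar{x}\), \(s\), and \(n\) are the sample mean, sample standard deviation, and sample size of weight within each group, after removing missing values.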
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.