Inference for numerical data

Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(infer)

Creating a reproducible lab report

To create your new lab report, in RStudio, go to New File -> R Markdown… Then, choose From Template and then choose Lab Report for OpenIntro Statistics Labs from the list of templates.

The data

Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Load the yrbss data set into your workspace.

data(yrbss)

There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss

Inference for proportion(s)

Consider the variable physically_active_7d which has 8 categories: \(0, 1, 2, \dots, 7\). We are interested in estimating the proportion of students who are physically active at least 3 days of the week. So we will create a new variable called physical_3plus which will be coded as either “yes” if the student is physically active for at least 3 days a week, and “no” if not.

yrbss <- yrbss %>% 
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))

Set up a Hypothesis Test for testing if the proportion of students who are physically active at least 3 days a week is different from 66%. Use a 2% significance level (\(\alpha = 0.02\)). Report the 98% confidence interval as well.

Now suppose we are interested in testing whether the proportion of male students who are physically active at least 3 days a week is different from the proportion of female students who are physically active at least 3 days a week. That is, we want to see if there is a relationship between being physically active at least 3 days a week and the students gender.

We can conduct such tests using the prop_test() function as for the case of one proportion but this time we add the explanatory argument and assign it to gender as shown in the code below.

yrbss%>%
  filter(!is.na(physical_3plus), !is.na(gender))%>%
  prop_test(response = physical_3plus,
            success = "yes",
            explanatory = gender,
            order = c("male","female"), 
            z = TRUE,
            conf_int = TRUE,
            conf_level = 0.95)

Notice that the argument order = c("male", "female") in the above code chunk ensures that the female students proportion is subtracted from the male students proportion for those who are physically active at least 3 days a week. The default is order = c("female", "male") as R goes by alphabetical order.

Set up a Hypothesis Test for testing if there is no difference in proportion of male and female students who are physically active at least 3 days a week. Use a 5% significance level (\(\alpha = 0.05\)). Report the 95% confidence interval as well.

Inference for mean(s)

Now consider the variable weight which reports the students weight in Kilograms (kg). Let’s first plot the distribution of this variable.

yrbss%>%
  filter(!is.na(weight))%>%
  ggplot(aes(x = weight))+
  geom_density()

Let’s also calculate some summary statistics for weight.

yrbss%>%
  filter(!is.na(weight))%>%
  summarise(n = n(),
            x_bar = mean(weight),
            s = sd(weight),
            min = min(weight),
            max = max(weight))

## # A tibble: 1 x 5
##       n x_bar     s   min   max
##   <int> <dbl> <dbl> <dbl> <dbl>
## 1 12579  67.9  16.9  29.9  181.

A recent study reported that college students have mean weight of 66.82 kg. Suppose we are interested in testing if the mean weight of all high school students in the U.S. \((\mu)\) differs from 66.82 kg. We can conduct a hypothesis test and construct a 95% confidence interval using t_test() function as shown below:

yrbss%>%
  filter(!is.na(weight))%>%
  t_test(response = weight,
         mu = 66.82,
         conf_int = TRUE,
         conf_level = 0.95)

Set up a Hypothesis Test for testing if the mean weight of all high school students is different from 66.82 kg. Use a 5% significance level (\(\alpha = 0.05\)). Report the 95% confidence interval as well.

Note: The t_test() is for means while the prop_test() is for porportions.

Exploratory Data Analysis

Consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.

Make a side-by-side violin plots of physical_3plus and weight. Is there a relationship between these two variables? What did you expect and why?

yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight))%>%
  ggplot(aes(x = physical_3plus, y = weight, fill = physical_3plus))+
  geom_violin()

The violin plots show how the medians of the two distributions compare, but we can also compare the means of the distributions using the following code. We first group the data by the physical_3plus variable, and then calculate the mean weight in these groups using the mean function.

yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight))%>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight))

Is there an observed difference in students weight by their rate of physically activity? Is this difference large enough to deem it “statistically significant”?

In order to answer the previous question better using statistical tools we will conduct a hypothesis test.

Set up a Hypothesis Test for testing if the mean weights are not different for those who physically active at least 3 days a week and those who are not. Use a 5% significance level (\(\alpha = 0.05\)). Report the 95% confidence interval as well.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.