Inference for Categorical Data

Please type "Your Name" in the interactive R chunk below and run it by clicking Run Code or by pressing Ctrl+Enter (Cmd+Enter on a Mac).

Introduction

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using the infer package. The data can be found in the companion package for OpenIntro resources, openintro.

library(infer)
library(tidyverse)
library(openintro)

The Data

Every two years, the Centers for Disease Control and Prevention conducts the Youth Risk Behavior Surveillance System (YRBSS) survey, which collects data from high schoolers (9th through 12th grade) to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.

Let’s start by loading the yrbss dataset and taking a glimpse of it by running the following code chunk.

data(yrbss)
glimpse(yrbss)

Notice that there are 13583 observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:

?yrbss

Inference for Mean(s)

Now consider the variable weight, which reports each student's weight in kilograms (kg). Let’s first plot the distribution of this variable in a density plot.

yrbss %>%
  filter(!is.na(weight)) %>%
  ggplot(aes(x = weight))+
  geom_density()

Let’s also calculate some summary statistics for weight using summarise() as follows. Notice that we first drop any cases with missing weight using the filter(!is.na(weight)) step in the code chunk below.

yrbss %>%
  filter(!is.na(weight)) %>%
  summarise(n = n(),
            x_bar = mean(weight),
            s = sd(weight),
            min = min(weight),
            max = max(weight))

A recent study reported that college students have a mean weight of 66.82 kg. Suppose we are interested in testing whether the mean weight of all high school students in the U.S. \((\mu)\) differs from 66.82 kg (the hypothesized mean weight of college students). We can conduct a hypothesis test and construct a 95% confidence interval using the t_test() function from the infer package as shown below.

\[H_0: \mu = 66.82 \ \ \ \text{Versus} \ \ \ H_a: \mu \neq 66.82\]
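
For reference, here is the arithmetic behind a standard one-sample t procedure, written with the same quantities computed in the summary table above (n, x_bar, and s). This assumes t_test() follows the usual textbook formulas; the symbols \(\mu_0\) (the hypothesized mean, here 66.82) and \(t^{*}_{n-1}\) (the critical value from a t distribution with \(n - 1\) degrees of freedom) are notation introduced here for illustration.

\[t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{95% CI: } \ \bar{x} \pm t^{*}_{n-1} \, \frac{s}{\sqrt{n}}\]

A large value of \(|t|\) means the sample mean lies many standard errors away from \(\mu_0\), which is evidence against \(H_0\).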

yrbss %>%
  filter(!is.na(weight)) %>%
  t_test(response = weight,
         mu = 66.82,
         conf_int = TRUE,
         conf_level = 0.95)
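
To see what a one-sample t-test computes, the sketch below reproduces the test statistic, p-value, and confidence interval by hand and cross-checks them against base R's t.test(). It uses base R only and a small made-up weight vector (not the yrbss data), so the numbers it produces say nothing about the real survey. It is an illustration of the standard procedure, not a description of how infer is implemented internally.

```r
# Hypothetical weights in kg (made up for illustration; NOT taken from yrbss)
weight <- c(52.2, 57.6, 61.2, 64.4, 68.0, 70.3, 72.6, 76.2, 81.6, 90.7)
mu0 <- 66.82                      # hypothesized mean under H0

n     <- length(weight)
x_bar <- mean(weight)
s     <- sd(weight)

# Test statistic and two-sided p-value from a t distribution with n - 1 df
t_stat  <- (x_bar - mu0) / (s / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

# 95% confidence interval: x_bar plus/minus critical value times standard error
t_crit <- qt(0.975, df = n - 1)
ci     <- x_bar + c(-1, 1) * t_crit * s / sqrt(n)

# Cross-check the hand calculation against base R's t.test()
check <- t.test(weight, mu = mu0, conf.level = 0.95)
all.equal(t_stat, unname(check$statistic))
all.equal(p_value, check$p.value)
all.equal(ci, as.numeric(check$conf.int))
```

Each all.equal() call compares one hand-computed quantity with the corresponding element of the t.test() result.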

Exercise: You are encouraged to use the code chunk below to test whether the mean height of all high school students is different from 1.68 meters. Use a 5% significance level (\(\alpha = 0.05\)). Report the 90% confidence interval as well.

Note: The t_test() function is for means, while the prop_test() function is for proportions.
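
When a test and an interval use different levels, it helps to remember how they pair up: a two-sided test at significance level \(\alpha\) rejects \(H_0\) exactly when the \((1 - \alpha)\) confidence interval excludes the hypothesized value. So a 95% interval matches \(\alpha = 0.05\), while a 90% interval is narrower and matches \(\alpha = 0.10\). The sketch below illustrates this with base R's t.test() and a made-up height vector (not the yrbss data).

```r
# Hypothetical heights in meters (made up for illustration; NOT from yrbss)
height <- c(1.55, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.83)
mu0 <- 1.68                       # hypothesized mean under H0

ci_95 <- t.test(height, mu = mu0, conf.level = 0.95)$conf.int
ci_90 <- t.test(height, mu = mu0, conf.level = 0.90)$conf.int

# The 90% interval is narrower and sits inside the 95% interval
inside <- ci_90[1] >= ci_95[1] && ci_90[2] <= ci_95[2]

# Rejecting at alpha = 0.05 agrees with the 95% interval excluding mu0
p_value   <- t.test(height, mu = mu0)$p.value
excluded  <- mu0 < ci_95[1] || mu0 > ci_95[2]
agreement <- (p_value < 0.05) == excluded
```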

Comparing Two Means

Consider the possible relationship between a high schooler’s weight and their physical activity. The variable physically_active_7d stores the number of days the student was physically active during the week, and it takes 8 values: \(0, 1, 2, \dots, 7\). We will first create a new variable called physical_3plus, which will be coded as “yes” if the student is physically active for at least 3 days a week, and “no” if not.

Then, we will make side-by-side boxplots of weight by physical_3plus to see the relationship between the two variables. Plotting the data is a useful first step because it helps us quickly visualize trends, identify associations, and develop research questions. The following code does that for us.

#first create the binary variable physical_3plus
yrbss <- yrbss %>% 
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))

#make side-by-side boxplots of weight
yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight))%>%
  ggplot(aes(x = physical_3plus, y = weight, fill = physical_3plus))+
  geom_boxplot(show.legend = FALSE)
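
One detail of the recoding step is worth a closer look: a comparison with a missing value yields NA, so a student with a missing physically_active_7d ends up with a missing physical_3plus rather than a “no”. That is why the plotting code filters on !is.na(physical_3plus). The sketch below shows the behavior on made-up values, using base R's ifelse() as a stand-in for dplyr's if_else(); the vector days_active is a hypothetical name used only here.

```r
# Hypothetical days-active values, including one missing response
days_active <- c(0, 2, 3, 7, NA)

# Same rule as the lab: "yes" if active on more than 2 days, otherwise "no"
physical_3plus <- ifelse(days_active > 2, "yes", "no")
physical_3plus

# The missing input stays missing instead of being counted as "no"
table(physical_3plus, useNA = "ifany")
```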

Based on the above plot, we can see that there is an observable difference between the two groups. However, a plot alone cannot tell us whether the difference is statistically significant; for that we need a hypothesis test. We can quantify the difference by calculating the mean() of weight for each group as shown below:

yrbss <- yrbss %>% 
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))

yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight)) %>%
  group_by(physical_3plus) %>%
  summarise(mean_weight = mean(weight))

We can also set up a hypothesis test to check whether the mean weight of those who are physically active at least 3 days a week differs from the mean weight of those who are not. We will use a \(5\%\) significance level \((\alpha = 0.05)\) and also report the \(95\%\) confidence interval.

\[H_0: {\mu}_{yes} - {\mu}_{no} = 0 \ \ \ \text{Versus} \ \ \ H_a: {\mu}_{yes} - {\mu}_{no} \neq 0\]

yrbss <- yrbss %>% 
  mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))

yrbss %>%
  filter(!is.na(physical_3plus), !is.na(weight)) %>%
  t_test(response = weight,
         explanatory = physical_3plus,
         order = c("yes","no"), #to make sure the order is mu_yes - mu_no
         mu = 0, #the difference under H0
         conf_int = TRUE,
         conf_level = 0.95)
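
The role of the order argument can be illustrated with base R's t.test(), which uses the Welch (unequal variances) version of the two-sample test by default. The sketch below uses two made-up weight vectors (not the yrbss data) and the hypothetical names weight_yes and weight_no. It computes the difference in the order \(\mu_{yes} - \mu_{no}\) by hand, then shows that swapping the groups flips the sign of the statistic while leaving the p-value unchanged. Whether infer's t_test() uses exactly these defaults is an assumption here; check its help page to confirm.

```r
# Hypothetical weights in kg for two activity groups (made up; NOT from yrbss)
weight_yes <- c(58.1, 63.5, 66.2, 70.3, 72.6, 77.1, 81.6)
weight_no  <- c(54.4, 59.0, 61.2, 65.8, 68.0, 74.8)

# Point estimate in the order mu_yes - mu_no
diff_means <- mean(weight_yes) - mean(weight_no)

# Welch standard error: does not assume the two groups share a variance
se <- sqrt(var(weight_yes) / length(weight_yes) +
           var(weight_no)  / length(weight_no))
t_stat <- diff_means / se

# Base R's t.test() defaults to the Welch test (var.equal = FALSE)
check <- t.test(weight_yes, weight_no, mu = 0, conf.level = 0.95)

# Swapping the group order flips the sign but not the two-sided p-value
flipped <- t.test(weight_no, weight_yes, mu = 0, conf.level = 0.95)
all.equal(t_stat, unname(check$statistic))
all.equal(unname(check$statistic), -unname(flipped$statistic))
all.equal(check$p.value, flipped$p.value)
```

Fixing the order up front, as the lab does with order = c("yes","no"), keeps the sign of the estimate and the confidence interval easy to interpret.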

Exercise: You are encouraged to use the three code chunks below to run a similar analysis on the variable height.

Resources for learning R and working in RStudio

The book R for Data Science by Grolemund and Wickham is a great resource for data analysis in R with the tidyverse. If you are Googling for R code, make sure to also include these package names in your search query. For example, instead of Googling “scatterplot in R”, Google “scatterplot in R with the tidyverse”.

These may come in handy throughout the semester:

Note that some of the code on these cheatsheets may be too advanced for this course. However, the majority of it will become useful throughout the semester.


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Foundations for Inference II: Inference for numerical data