Skip to Tutorial Content

Inference for Categorical Data

Please type "Your Name" in the interactive R chunk below and run it by clicking Run Code or by hitting crtl+Enter or cmd+Enter for MAC users.

Introduction

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using the infer package. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s install and load the packages.

library(tidyverse)
library(openintro)
library(infer)

The Data

For this lab, we will be analyzing a new dataset called yrbss which is short for Youth Risk Behavior Surveillance System. It is a survey which collects data from high schoolers to help discover health patterns among students. It contains variables such as age, gender, grade, hispanic, race, and so on. Run the following code chunk to load the data and peak into it using the head() command.

data(yrbss)
head(yrbss)

Inference on proportions

Today we will be focusing on the variable text_while_driving_30d which records the answers for the question “How many days did you text while driving in the last 30 days?” The following code chunk produces the frequency distribution of this categorical variable.

yrbss %>%
  count(text_while_driving_30d)

We notice that there is a large number of students in the sample that stated “I didn’t drive”. There is also a good number of students who didn’t answer the question as indicated by the NA category. More interesting is the number of students who stated that they have texted every day (30 days) while driving. Focusing on students who drove within the last 30 days and answered the question “How many days did you text while driving in the last 30 days?”, we want to answer the question “What proportion of high schoolers have texted while driving each day for the past 30 days?”

First, to focus on students who drove within the last 30 days and answered the question of interest, we will filter the dataset using the filter() command and reconstruct the frequency table as follows.

yrbss = yrbss %>%
  filter(text_while_driving_30d!="did not drive", !is.na(text_while_driving_30d)) 

yrbss %>% 
  count(text_while_driving_30d)

Next, to make it easier to calculate the proportion of those who texted every day when driving, we will create a new variable that specifies whether the individual has texted every day while driving over the past 30 days or not. We will call this variable text_ind. We will take the filtered data and use the mutate() command to define the new variable text_ind as follows.

yrbss = yrbss %>%
  filter(text_while_driving_30d!="did not drive", !is.na(text_while_driving_30d)) %>% 
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

yrbss %>% 
  count(text_ind) %>% 
  mutate(phat = round(n/sum(n),2))

Now, we can use the cleaned dataset with the new binary variable text_ind to answer the question: “What proportion of high schoolers have texted while driving each day for the past 30 days?”

Here we want to use the sample data from yrbss to make inference about the proportion of all high school students who have texted while driving each day for the past 30 days? We can conduct inference about a population proportion in two ways:

  1. Confidence Intervals
  2. Hypothesis Testing

The infer package provides us with tools for computing confidence intervals and conducting hypothesis tests about population proportions. Specifically, we will use the command prop_test(). The following code chunk shows you how to construct a 95% confidence interval for the proportion of proportion of high schoolers who have texted while driving each day for the past 30 days.

yrbss = yrbss %>%
  filter(text_while_driving_30d!="did not drive", !is.na(text_while_driving_30d)) %>% 
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

prop_test(yrbss,
          text_ind ~ NULL,
          success = "yes",
          z = TRUE,
          conf_int = TRUE,
          conf_level = 0.95, 
          correct = FALSE)

Note that since the goal is to construct an interval estimate for a proportion, it’s necessary to include the success argument, which accounts for the proportion of students who have consistently texted while driving in the past 30 days, in this example.

  1. What is the margin of error in the 95% confidence interval for those that have texted while driving each day for the past 30 days based?

Quiz

  1. Using the infer package, calculate a 98% confidence interval for the proportion of high school students who never wear helemt for the past 12 months. The variable helmet_12m in the dataset records the data for this. Make sure to filter out any missing data and any students who “did not ride”.

How does the proportion affect the margin of error?

Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong!. While the margin of error does change with sample size, it is also affected by the proportion.

Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval:

\[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\). Since sample size is irrelevant to this discussion, let’s just set it to some value (\(n = 1000\)) and use this value in the following calculations:

The first step is to make a variable p that is a sequence from 0 to 1 with each number incremented by 0.01. You can then create a variable of the margin of error (me) associated with each of these values of p using the above formula (\(ME = 1.96 \times SE\)). Lastly, you can plot the two variables against each other to reveal their relationship. To do so, we need to first put these variables in a data frame that you can call in the ggplot function.

n <- 1000

p <- seq(from = 0, to = 1, by = 0.01)
me <- 1.96 * sqrt(p * (1 - p)/n)

dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) + 
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")
  1. Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?

Quiz

Hypothesis testing

A researcher wants to test if the proportion of high school students who have texted while driving everyday in the past 30 days is different than 10%.

To test this research question, we set up and run the following hypothesis test.

\[ H_0: p = 0.10 ~~~~~~~~~~~~~~~~~ H_A: p \ne 0.10\]

yrbss = yrbss %>%
  filter(text_while_driving_30d!="did not drive", !is.na(text_while_driving_30d)) %>% 
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))

prop_test(yrbss,
          text_ind~NULL,
          success = "yes",
          p = 0.10,
          z = TRUE)
  1. Based on the result of the above test, would you reject or fail to reject \(H_0\)?

Quiz

  1. Set up a Hypothesis test for the variable from exercise 2 (helmet_12m) with p = 0.81 and explain why the null hypothesis should be rejected or not?

Quiz

Submit

If you have completed this tutorial and are happy with all of your solutions, please click the button below to generate your hash and submit it using the corresponding tutorial assignment tab on Blackboard


NCAT Blackboard

Resources for learning R and working in RStudio

The book R For Data Science by Grolemund and Wickham is a great resource for data analysis in R with the tidyverse. If you are Goggling for R code, make sure to also include these package names in your search query. For example, instead of Goggling “scatterplot in R”, Goggle “scatterplot in R with the tidyverse”.

These may come in handy throughout the semester:

Note that some of the code on these cheatsheets may be too advanced for this course. However the majority of it will become useful throughout the semester.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Foundations for Inference II: Inference for Categorical Data