Inference for Categorical Data
Please type "Your Name" in the interactive R chunk below and run it by clicking Run Code or by pressing Ctrl+Enter (Cmd+Enter for Mac users).
Introduction
In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using the infer package. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages (install them first if you have not already done so).
library(tidyverse)
library(openintro)
library(infer)
The Data
For this lab, we will be analyzing a new dataset called yrbss, which is
short for Youth Risk Behavior Surveillance System. It is a survey that
collects data from high schoolers to help discover health patterns among
students. It contains variables such as age, gender, grade, hispanic,
race, and so on. Run the following code chunk to load the data and peek
into it using the head() command.
data(yrbss)
head(yrbss)
Inference on proportions
Today we will be focusing on the variable text_while_driving_30d, which
records the answers to the question “How many days did you text while
driving in the last 30 days?” The following code chunk produces the
frequency distribution of this categorical variable.
yrbss %>%
  count(text_while_driving_30d)
We notice that a large number of students in the sample reported that
they did not drive. A good number of students also did not answer the
question, as indicated by the NA category.
More interesting is the number of students who stated that they have
texted every day (30 days) while driving. Focusing on students who drove
within the last 30 days and answered the question “How many days did you
text while driving in the last 30 days?”, we want to answer the question
“What proportion of high schoolers have texted while driving
each day for the past 30 days?”
First, to focus on students who drove within the last 30 days and
answered the question of interest, we will filter the dataset using the
filter() command and reconstruct the frequency table as follows.
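# Keep only students who drove and who answered the question
yrbss <- yrbss %>%
  filter(text_while_driving_30d != "did not drive", !is.na(text_while_driving_30d))
# Rebuild the frequency table on the filtered data
yrbss %>%
  count(text_while_driving_30d)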
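To see what the two filter conditions do, the same call can be tried on a small made-up data frame. This is only a sketch: toy and its answer column are hypothetical names and values, not part of yrbss. It shows that comparing NA with a string gives NA rather than TRUE or FALSE, and that filter() keeps only rows where every condition is TRUE, so the explicit !is.na() check mainly makes the intent clear.

```r
library(dplyr)

# Made-up responses for illustration only (not the yrbss data)
toy <- tibble(answer = c("0", "30", "did not drive", NA, "1-2"))

# A comparison involving NA evaluates to NA, not TRUE or FALSE
toy$answer != "did not drive"

# filter() keeps only rows where every condition is TRUE,
# so both the "did not drive" row and the NA row are dropped
kept <- toy %>%
  filter(answer != "did not drive", !is.na(answer))

kept
```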
Next, to make it easier to calculate the proportion of those who
texted every day when driving, we will create a new variable that
specifies whether the individual has texted every day while driving over
the past 30 days or not. We will call this variable text_ind. We will
take the filtered data and use the mutate() command to define the new
variable text_ind as follows.
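# Repeating the filter keeps this chunk self-contained; re-running it is harmless
yrbss <- yrbss %>%
  filter(text_while_driving_30d != "did not drive", !is.na(text_while_driving_30d)) %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
# Sample proportion of each text_ind level, rounded to two decimals
yrbss %>%
  count(text_ind) %>%
  mutate(phat = round(n / sum(n), 2))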
Now, we can use the cleaned dataset with the new binary variable
text_ind to answer the question: “What proportion of high schoolers have
texted while driving each day for the past 30 days?”
Here we want to use the sample data from yrbss to make inferences about
the proportion of all high school students who have texted while driving
each day for the past 30 days. We can conduct inference about a
population proportion in two ways:
- Confidence Intervals
- Hypothesis Testing
The infer package provides us with tools for computing confidence
intervals and conducting hypothesis tests about population proportions.
Specifically, we will use the command prop_test(). The following code
chunk shows you how to construct a 95% confidence interval for the
proportion of high schoolers who have texted while driving each day for
the past 30 days.
yrbss <- yrbss %>%
  filter(text_while_driving_30d != "did not drive", !is.na(text_while_driving_30d)) %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
# 95% confidence interval for the proportion of "yes" responses
prop_test(yrbss,
          text_ind ~ NULL,
          success = "yes",
          z = TRUE,
          conf_int = TRUE,
          conf_level = 0.95,
          correct = FALSE)
Note that since the goal is to construct an interval estimate for a
proportion, it’s necessary to include the success argument, which names
the level of text_ind that counts as a success. In this example that
level is "yes": students who texted while driving on each of the past
30 days.
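To make the pieces of a confidence interval concrete, here is a minimal by-hand sketch of the textbook interval, sample proportion plus or minus a critical value times the standard error. The counts n_yes and n are hypothetical values chosen for illustration, not results from yrbss, and the interval reported by prop_test() may be computed by a different method, so its bounds need not match this formula exactly.

```r
# Hypothetical counts, for illustration only (not the yrbss results)
n_yes <- 120   # number of "yes" responses
n     <- 1000  # sample size

p_hat  <- n_yes / n                       # sample proportion
se     <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
z_star <- qnorm(1 - (1 - 0.95) / 2)       # critical value for 95%, about 1.96
me     <- z_star * se                     # margin of error

ci <- c(lower = p_hat - me, upper = p_hat + me)
ci

# For an interval of this form, the margin of error is half its width
half_width <- unname((ci["upper"] - ci["lower"]) / 2)
half_width
```

Reading the margin of error off as half the interval width is also a convenient way to recover it from an interval whose endpoints are all you are given.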
- What is the margin of error of the 95% confidence interval for the proportion of students who have texted while driving each day for the past 30 days?
- Using the infer package, calculate a 98% confidence interval for the proportion of high school students who never wore a helmet in the past 12 months. The variable helmet_12m in the dataset records the data for this. Make sure to filter out any missing data and any students who “did not ride”.
How does the proportion affect the margin of error?
Imagine you’ve set out to survey 1000 people on two questions: are you at least 6 feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval:
\[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \]
Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\). Since we are not varying the sample size in this discussion, let’s just set it to some value (\(n = 1000\)) and use this value in the following calculations:
The first step is to make a variable p that is a sequence from 0 to 1
with each number incremented by 0.01. You can then create a variable of
the margin of error (me) associated with each of these values of p using
the above formula (\(ME = 1.96 \times SE\)). Lastly, you can plot the
two variables against each other to reveal their relationship. To do so,
we need to first put these variables in a data frame that you can call
in the ggplot function.
n <- 1000                                # fixed sample size
p <- seq(from = 0, to = 1, by = 0.01)    # grid of population proportions
me <- 1.96 * sqrt(p * (1 - p) / n)       # margin of error at each value of p
dd <- data.frame(p = p, me = me)
ggplot(data = dd, aes(x = p, y = me)) +
  geom_line() +
  labs(x = "Population Proportion", y = "Margin of Error")
- Describe the relationship between p and me. Include the margin of error vs. population proportion plot you constructed in your answer. For a given sample size, for which value of p is margin of error maximized?
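The earlier remark that the margin of error also changes with sample size can be checked with the same formula. The sketch below holds the proportion fixed and varies \(n\); the helper margin_of_error() and the values of p_fixed and n_values are hypothetical choices for illustration. Because \(n\) sits under a square root, quadrupling the sample size halves the margin of error.

```r
# Margin of error for proportion p and sample size n (1.96 gives a 95% level)
margin_of_error <- function(p, n, z_star = 1.96) {
  z_star * sqrt(p * (1 - p) / n)
}

p_fixed  <- 0.2                   # hypothetical proportion, held constant
n_values <- c(250, 1000, 4000)    # each sample size is 4 times the previous one

me_by_n <- margin_of_error(p_fixed, n_values)
data.frame(n = n_values, me = me_by_n)

# Ratio of each margin of error to the previous one
ratios <- me_by_n[-1] / me_by_n[-length(me_by_n)]
ratios
```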
Hypothesis testing
A researcher wants to test whether the proportion of high school students who have texted while driving every day in the past 30 days is different from 10%.
To test this research question, we set up and run the following hypothesis test.
\[ H_0: p = 0.10 ~~~~~~~~~~~~~~~~~ H_A: p \ne 0.10\]
yrbss <- yrbss %>%
  filter(text_while_driving_30d != "did not drive", !is.na(text_while_driving_30d)) %>%
  mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
# Two-sided test of H_0: p = 0.10 against H_A: p != 0.10
prop_test(yrbss,
          text_ind ~ NULL,
          success = "yes",
          p = 0.10,
          z = TRUE)
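For intuition about what such a test computes, here is a minimal by-hand sketch of the one-proportion z-test, \(z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}\), where the standard error uses the null value \(p_0\). The counts n_yes and n are hypothetical, not the yrbss results, and the cross-check uses base R's prop.test() without continuity correction as an assumed point of comparison.

```r
# Hypothetical counts, for illustration only (not the yrbss results)
n_yes <- 120    # number of "yes" responses
n     <- 1000   # sample size
p0    <- 0.10   # null value from H_0

p_hat   <- n_yes / n
se0     <- sqrt(p0 * (1 - p0) / n)   # standard error computed under H_0
z       <- (p_hat - p0) / se0        # test statistic
p_value <- 2 * pnorm(-abs(z))        # two-sided p-value

c(z = z, p_value = p_value)

# Cross-check with base R: without continuity correction,
# the chi-squared statistic from prop.test() equals z^2
base_test <- prop.test(n_yes, n, p = p0, correct = FALSE)
chisq <- unname(base_test$statistic)
```

A p-value below the chosen significance level (commonly 0.05) leads to rejecting \(H_0\); otherwise we fail to reject it.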
- Based on the result of the above test, would you reject or fail to reject \(H_0\)?
- Set up a hypothesis test for the variable from exercise 2 (helmet_12m) with p = 0.81, and explain why the null hypothesis should or should not be rejected.
Submit
NCAT Blackboard
Resources for learning R and working in RStudio
The book R For Data Science by Grolemund and Wickham is a great resource for data analysis in R with the tidyverse. If you are Googling for R code, make sure to also include these package names in your search query. For example, instead of Googling “scatterplot in R”, Google “scatterplot in R with the tidyverse”.
These may come in handy throughout the semester:
Note that some of the code on these cheatsheets may be too advanced for this course. However, the majority of it will become useful throughout the semester.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.