If you’ve never coded before (or even if you have), type print("Your Name") in the interactive R chunk below and run it by pressing Ctrl+Enter, or Cmd+Enter for Mac users.

Hypothesis Testing and Confidence Intervals for Categorical Data

In this workbook, we continue our exploration of statistical inference. We’ll cover the basic hypothesis testing framework and discuss more formally the confidence intervals you were introduced to in Workbook 10.

We’ll motivate this workbook by watching three videos from volunteers at OpenIntro.org. The first two will be from Dr. David Diez, a data scientist at YouTube, and the last will be from Dr. Shannon McClintock, who is a member of the statistics faculty at Cal Poly. After each of the videos, you’ll walk through a hands-on application of the video content to a new scenario.

Our first video discusses variability in point estimates. It is likely that some of the content will sound pretty familiar to you since we were working with the idea of point estimates and variability in the last workbook. Watch the video below, and then we’ll engage with the ideas Dr. Diez discusses by walking through an example together.

An Example: A June 2020 Pew Research survey revealed that 74% of Americans support offering a path to citizenship for undocumented immigrants who were brought to the US illegally as children – often referred to as DREAMers.

We’ve discussed the impossibility of a true census, so the Pew study did not poll every single American to get their estimate. Instead, they surveyed 9,654 US adults between the dates of June 4 and June 10, 2020. You can find out more about the study logistics here. This means that the 74% from the article is the proportion of individuals from the study who were in favor of a path to citizenship for the DREAMers.

Answer the questions below to check your understanding of some of our terminology.

Quiz

The code below simulates a random sample of 9,654 individuals in which each individual has a 74% chance of supporting a path to citizenship for DREAMers and a 26% chance of not supporting it. This code should look somewhat familiar to you, as you did something similar in Workbook 7, where you simulated shots from a basketball player. Run the code in the block a few times to see the results of the simulation.
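# Simulate 9,654 respondents, each supporting with probability 0.74
samp <- sample(c("Support Citizenship", "Do Not Support Citizenship"), size = 9654, prob = c(0.74, 0.26), replace = TRUE)
table(samp)

# Sample proportion in support, computed by label rather than by table position
p_hat <- mean(samp == "Support Citizenship")
paste0("The proportion supporting the Citizenship option is: ", p_hat)

# Sampling error: sample proportion minus the assumed proportion of 0.74
paste0("This is a sampling error of: ", p_hat - 0.74)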

By running the code block above multiple times, you’ve probably seen that most of the samples resulted in a sample proportion which was well within one percentage point (0.01) of the assumed proportion \(p = 0.74\).
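Rather than re-running the block by hand, the repeated sampling can be automated. The sketch below is one way to do that, not part of the original workbook code: it uses rbinom() as a shortcut for sample(), and the seed and the number of simulated samples (n_sims) are arbitrary choices made for illustration.

```r
set.seed(2020)  # arbitrary seed so the sketch is reproducible

n <- 9654       # sample size from the Pew study
p <- 0.74       # assumed population proportion
n_sims <- 1000  # number of simulated samples (an arbitrary choice)

# rbinom() draws the number of supporters in each simulated sample;
# dividing by n converts those counts to sample proportions
p_hats <- rbinom(n_sims, size = n, prob = p) / n

# Share of simulated samples whose proportion is within 0.01 of p
mean(abs(p_hats - p) < 0.01)

# Spread of the simulated sample proportions
sd(p_hats)
```

The first value reports how often a simulated sample landed within one percentage point of 0.74, and the second summarizes how much the sample proportion moves from one sample to the next.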

In the video Dr. Diez discusses how we can use the Central Limit Theorem to quantify how much variability we should see in the point estimate from one sample to the next. In the case of a single proportion, the Central Limit Theorem states the following:

Central Limit Theorem: When observations are independent and the sample size is sufficiently large, the sample proportion \(\hat{p}\) will tend to follow a normal distribution with \(\mu = p\) (the true population proportion) and standard error \(\displaystyle{S_E = \sqrt{\frac{p\left(1-p\right)}{n}}}\). That is, \(\displaystyle{\hat{p} \sim N\left(\mu = p, ~S_E = \sqrt{\frac{p\left(1-p\right)}{n}}\right)}\).

Use the code block below to answer the questions that follow.

Quiz

Good work – notice that the standard error is about half of a percentage point (close to \(0.005\)). Doubling this estimate closely matches what we observed about the sampling error using our simulations. This brings us to our next topic – confidence intervals.
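As a check on the figure above, the standard error formula from the Central Limit Theorem can be evaluated directly. This is a minimal sketch that plugs in the assumed proportion \(p = 0.74\) and the Pew sample size; the variable names are chosen here for illustration.

```r
n <- 9654  # sample size from the Pew study
p <- 0.74  # assumed population proportion

# Standard error of the sample proportion: sqrt(p * (1 - p) / n)
se <- sqrt(p * (1 - p) / n)
se

# Two standard errors, for comparison with the sampling errors seen in the simulation
2 * se
```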

Intro to Confidence Intervals

You’ll start again with a video from Dr. Diez. Once you’ve watched it, we’ll continue with our example about the 2020 Pew Research study on the proportion of American adults who are in favor of a citizenship option for the DREAMers.

As Dr. Diez mentions, a confidence interval can be used to capture a population parameter with some degree of certainty. In general, we construct a confidence interval using the following formula: \[\displaystyle{\left(\tt{point~estimate}\right)\pm \left(\tt{critical~value}\right)\cdot S_E}\] where the

  • \(\tt{point~estimate}\) comes from the sample data
  • \(\tt{critical~value}\) is related to the level of confidence
  • \(S_E\) is the standard error, which measures the spread of the sampling distribution

Recall that we’ve been working with a 2020 Pew Research study which included 9,654 participants. The study resulted in 74% of participants being in favor of a path to citizenship for the DREAMers, and we computed the standard error to be approximately \(0.0045\).

If we are sure that a sampling distribution is well-modeled by a normal distribution, we have the following critical values associated with several common levels of confidence.

  • The critical value for a 90% confidence interval is approximately 1.65
  • The critical value for a 95% confidence interval is approximately 1.96
  • The critical value for a 98% confidence interval is approximately 2.33
  • The critical value for a 99% confidence interval is approximately 2.58
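These critical values come from the standard normal distribution, and R's qnorm() function can reproduce them. The sketch below assumes a two-sided interval, so each confidence level leaves half of the remaining probability in each tail; the variable names are illustrative.

```r
conf_levels <- c(0.90, 0.95, 0.98, 0.99)

# Probability left over outside the interval, split evenly between the two tails
alpha <- 1 - conf_levels

# Standard normal quantile that leaves alpha / 2 in the upper tail
crit_values <- qnorm(1 - alpha / 2)
round(crit_values, 3)
```

To three decimal places the 90% value is 1.645, which the list above rounds to 1.65.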

Use what you learned in the video and your knowledge of the Pew Research study to answer the following questions. You can use the code block to make any necessary computations.

Quiz
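Once you have worked through the questions, you can check your arithmetic against a sketch like the one below. It applies the general confidence interval formula to the Pew numbers at the 95% level, using the sample proportion in place of \(p\) in the standard error; the variable names are chosen here for illustration.

```r
n <- 9654      # sample size from the Pew study
p_hat <- 0.74  # sample proportion (the point estimate)

# Standard error, using p_hat in place of the unknown population proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for 95% confidence under the normal model
z_star <- 1.96

# (point estimate) +/- (critical value) * SE
lower <- p_hat - z_star * se
upper <- p_hat + z_star * se
c(lower = lower, upper = upper)
```

Swapping z_star for one of the other critical values listed above gives the interval at a different level of confidence.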

So far, so good! There’s one more topic to go. Sometimes we’ll want to test a claim about a population parameter rather than build a confidence interval for it. Inferential statistics provides a formal framework called the hypothesis test for evaluating statistical claims such as

  • Is a population mean or proportion larger/smaller/different than some proposed value?
  • Do the population means or proportions differ across multiple groups?

Intro to Hypothesis Testing

Here’s one more video, this time from Dr. Shannon McClintock (also of OpenIntro.org), introducing the notion of the hypothesis test.

A 2018 poll and story from NPR reported that 65% of Americans supported a path to citizenship for DREAMers. Does the new poll from Pew Research provide evidence that support for a path to citizenship for DREAMers has grown over the past two years? Use an \(\alpha = 0.05\) level of significance.

Use what you learned from Dr. McClintock in the video about hypothesis testing to answer the following questions and complete the hypothesis test. You can use the code block below for any calculations you need to make.

Quiz

Submit

If you have completed this tutorial and are happy with all of your solutions, please click the button below to generate your hash and submit it using the corresponding tutorial assignment tab on Blackboard.


NCAT Blackboard

Summary

As a recap, this workbook covered the following main points and ideas:

  • Sample statistics provide a point estimate for their corresponding population parameters.
    • A sample mean provides an estimate of a population mean.
    • A sample proportion provides an estimate of a population proportion.
    • Any sample metric can provide an estimate for the corresponding population metric.
  • Sample statistics provide reasonable point estimates only when the sample used is representative of the population we wish to generalize to.
  • Each sample taken will result in a different sample statistic, and therefore a different point estimate for the population parameter.
    • Much of statistics is focused on quantifying the variability in these point estimates.

  • A confidence interval is used to capture a population parameter with some desired degree of confidence. We compute the bounds for a confidence interval with the expression below. \[\left(\tt{point~estimate}\right)\pm\left(\tt{critical~value}\right)\cdot S_E\]
    • The \(\tt{point~estimate}\) is an estimate from a sample – that is, the point estimate is a sample statistic.
    • The \(\tt{critical~value}\) depends on the desired level of confidence and the distribution being used to model the sampling distribution.
    • The \(\tt{standard~error}\) (\(S_E\)) quantifies the variability in the point estimate (expected variation due to different samples being taken).
    • The interpretation of a confidence interval is that “We are XX% confident that the true \(\tt{population~parameter}\) lies between \(\tt{lower~bound}\) and \(\tt{upper~bound}\)”.

  • A hypothesis test provides a formal framework for testing claims about a population parameter across one or more populations.
    • We begin with two hypotheses – the null hypothesis (\(H_0\)), and the alternative hypothesis (\(H_a\)). The null hypothesis is a statement assuming the “status quo” while the alternative hypothesis is the claim to be tested.
    • We set a level of significance, denoted by \(\alpha\), which determines how unlikely our sample must be in order for us to favor the alternative hypothesis over the null hypothesis.
    • We use our sample data and null hypothesis to compute a test statistic: \[\displaystyle{\tt{test~statistic} = \frac{\left(\tt{point~estimate}\right) - \left(\tt{null~value}\right)}{S_E}}\]
      • The \(\tt{point~estimate}\) is an estimate from a sample – that is, the point estimate is a sample statistic.
      • The \(\tt{null~value}\) is the assumed value of the population parameter from the null hypothesis.
      • The \(\tt{standard~error}\) (\(S_E\)) quantifies the variability in the point estimate (expected variation due to different samples being taken).
    • Once we have a test statistic, we use it to compute a \(p\)-value which will be compared to the level of significance, \(\alpha\).
      • The \(p\)-value measures the probability that we would observe a sample at least as favorable to the alternative hypothesis as our observed sample, under the assumption that the null hypothesis is true.
      • A \(p\)-value smaller than the \(\alpha\) threshold leads us to reject the null hypothesis: our sample would be so unlikely if the null hypothesis were true that we treat it as evidence in favor of the alternative hypothesis.
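The hypothesis testing steps summarized above can be sketched in R using the DREAMers example from this workbook. The sketch makes some assumptions that you should check against your own answers: the hypotheses are taken to be \(H_0: p = 0.65\) and \(H_a: p > 0.65\) (a one-sided test, since the question asks whether support has grown), the standard error is computed using the null value, and the variable names are illustrative.

```r
n <- 9654        # sample size from the 2020 Pew study
p_hat <- 0.74    # sample proportion (the point estimate)
p_null <- 0.65   # null value, taken from the 2018 NPR poll
alpha <- 0.05    # level of significance

# Standard error under the assumption that the null hypothesis is true
se <- sqrt(p_null * (1 - p_null) / n)

# test statistic = (point estimate - null value) / SE
z <- (p_hat - p_null) / se
z

# One-sided p-value: probability of a test statistic at least this large
# under the standard normal model
p_value <- pnorm(z, lower.tail = FALSE)
p_value

# TRUE means the sample provides evidence against the null hypothesis at level alpha
p_value < alpha
```

For a two-sided alternative, the tail probability would be doubled before comparing it to \(\alpha\).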

As a final item, here’s a link to a document that we’ll make heavy use of for the remainder of our class. The document probably looks intimidating right now, but look at the bottom-right corner. There’s our confidence interval formula! Similarly, on the lower left is the general formula for a test statistic. For now, you aren’t expected to understand most of what appears on this document. It will all be explained soon enough.