If you’ve never coded before (or even if you have), type "Your Name" in the interactive R chunk below and run it by hitting Ctrl+Enter (or Cmd+Enter for Mac users).
Statistical inference is the process of making claims about a population based on information from a sample of data.
Typically, the data represent only a small portion of the larger group which you’d like to summarize. For example, you might be interested in how a drug treats diabetes. Your interest is in how the drug treats all people with diabetes, not just the few dozen people in your study.
At first glance, the logic of statistical inference seems to be backwards, but as you become more familiar with the steps in the process, the logic will make much more sense.
In this tutorial we’ll begin investigating the true power of statistics – using sample data to make accurate claims about a population (even when we don’t have access to the entire population). We start by exploring the connection between a Population Distribution and the distribution of sample means, often called the Sampling Distribution. We’ll do this through a series of simple, interactive code blocks which you will run and use to answer questions.
Start by viewing the following video from the New York Times.
So the video claimed that the sampling distribution can help us answer questions about the population. This is really important because, as we mentioned in our first tutorial, a census is almost always impossible. Use the code blocks below to explore the connection between the population and the sampling distribution for various populations. Note that you do not need to understand all of the code contained in the code blocks – focus, instead, on the pictures that result each time you run the code. In general, you are invited to change the first few lines of code in each block, and you are not expected to look at the remaining code.
Suppose a poll suggested the US President’s approval rating is 45%. We would consider 45% to be a point estimate of the approval rating we might see if we collected responses from the entire population. This entire-population response proportion is generally referred to as the parameter of interest. When the parameter is a proportion, it is often denoted by p, and we often refer to the sample proportion as \(\hat{p}\) (pronounced "p-hat"). Unless we collect responses from every individual in the population, p remains unknown, and we use \(\hat{p}\) as our estimate of p. The difference between the poll’s estimate and the parameter is called the error in the estimate. Generally, the error consists of two aspects: sampling error and bias.
Question:
Suppose the proportion of American adults who support the expansion of solar energy is p = 0.88, which is our parameter of interest. Is a randomly selected American adult more or less likely to support the expansion of solar energy?
Answer: More likely.
Suppose that you don’t have access to the population of all American adults, which is a quite likely scenario. In order to estimate the proportion of American adults who support solar power expansion, you might sample from the population and use your sample proportion as the best guess for the unknown population proportion.
We will simulate data to play the role of the population. As discussed above, we will assume that 88% of the population support the expansion and the remaining 12% do not.
# Simulate a population of 250 million adults in which 88% support
# the expansion of solar energy and 12% do not
pop_size <- 250000000
possible_entries_solar <- c(rep("support", 0.88 * pop_size),
                            rep("not", 0.12 * pop_size))
First, we will sample, without replacement, 1000 American adults from the population and record whether or not each one supports solar power expansion.
# Draw a simple random sample of 1000 adults, without replacement
sampled_entries <- sample(possible_entries_solar,
                          size = 1000, replace = FALSE)
Second, we will find the sample proportion.
# Sample proportion: the fraction of the sample that supports expansion
sum(sampled_entries == "support") / 1000
## [1] 0.88
An interesting thing about sampling from a population is that it is always random. The first sample might give a completely different sample proportion than the second sample, the third sample, and so on. For example, if we perform the same sampling again using the same code, we will likely obtain a different sample proportion.
# A second sample of 1000 and its sample proportion
sampled_entries <- sample(possible_entries_solar,
                          size = 1000, replace = FALSE)
sum(sampled_entries == "support") / 1000

# A third sample of 1000 and its sample proportion
sampled_entries <- sample(possible_entries_solar,
                          size = 1000, replace = FALSE)
sum(sampled_entries == "support") / 1000
Run the code to see that you obtain a different result every time you perform the sampling.
Third, we will use the fact that the sample proportion changes with every sample: we will collect many samples and compute the sample proportion for each one. This lets us build a distribution of sample proportions and examine its center, spread, and shape.
set.seed(123)

# Create 10000 different sample proportions, each from a sample of size 1000
phat <- rep(NA, 10000)
for (i in 1:10000) {
  sampled_entries <- sample(possible_entries_solar, size = 1000, replace = FALSE)
  phat[i] <- sum(sampled_entries == "support") / 1000
}
sampling <- tibble(phat = phat)

# Plot the sample proportions, with the population proportion in red
# and a normal curve overlaid
ggplot(sampling, aes(x = phat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 col = "black", fill = "lightblue") +
  geom_vline(xintercept = 0.88, col = "red") +
  theme_minimal(base_size = 14) +
  labs(x = "Sample proportions", y = "Density") +
  stat_function(fun = dnorm,
                args = list(mean = mean(phat), sd = sd(phat)),
                linewidth = 1.2)
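Before reading the summary below, you can check the center and spread of the simulated \(\hat p\) values directly (a small addition of ours; the values should match the summary that follows):

# Center and spread of the simulated sampling distribution
mean(phat)  # very close to p = 0.88
sd(phat)    # close to 0.010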
This distribution of sample proportions is called a sampling distribution. We can characterize this sampling distribution as follows:
Center. The center of the distribution is \(\bar{x}_{\hat{p}} = 0.880\), which is the same as the parameter. Notice that the simulation mimicked a simple random sample of the population, which is a straightforward sampling strategy that helps avoid sampling bias.
Spread. The standard deviation of the distribution is \(s_{\hat{p}}\) = 0.010. When we’re talking about a sampling distribution or the variability of a point estimate, we typically use the term standard error rather than standard deviation, and the notation \(SE_{\hat{p}}\) is used for the standard error associated with the sample proportion.
Shape. The distribution is symmetric and bell-shaped, and it resembles a normal distribution.
These findings are encouraging! When the population proportion is \(p = 0.88\) and the sample size is \(n = 1000\), the sample proportion \(\hat p\) tends to give a pretty good estimate of the population proportion. We also have the interesting observation that the histogram resembles a normal distribution.
The distribution in the histogram plot above looks an awful lot like a normal distribution. That is no anomaly; it is the result of a general principle called the Central Limit Theorem.
Central Limit Theorem and the Success-Failure Condition
When observations are independent and the sample size is sufficiently large, the sample proportion \(\hat{p}\) will tend to follow a normal distribution with the following mean and standard error:
\[ \mu_{\hat p} = p ~~~~~~ SE_{\hat p} = \sqrt{\frac{p(1-p)}{n}} \]
In order for the Central Limit Theorem to hold, the sample size is typically considered sufficiently large when \(np \ge 10\) and \(n(1-p) \ge 10\), which is called the success-failure condition.
The Central Limit Theorem is incredibly important, and it provides a foundation for much of statistics. As we begin applying the Central Limit Theorem, be mindful of the two technical conditions: 1) the observations must be independent, and 2) the sample size must be sufficiently large such that \(np \ge 10\) and \(n(1-p) \ge 10\).
Use the code block below to compute the standard error of \(\hat p\) when \(p = 0.88\) and \(n = 1000\).
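If you get stuck, here is one way the computation might look (a sketch using the CLT formula above; the variable names p and n are our own):

p <- 0.88
n <- 1000

# Success-failure condition: both quantities should be at least 10
n * p        # 880
n * (1 - p)  # 120

# Standard error from the CLT formula
sqrt(p * (1 - p) / n)  # approximately 0.0103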
Let’s estimate how frequently the sample proportion \(\hat p\) should be within 0.02 (2%) of the population value, \(p = 0.88\). Based on the questions above, we know that the distribution is approximately \(N(\mu_{\hat p} = 0.88, SE_{\hat p} = 0.010)\).
After so much practice in Section 4.1, this normal distribution example will hopefully feel familiar! We would like to understand the fraction of \(\hat p\)’s between 0.86 and 0.90:
With \(\mu_{\hat p} = 0.88\) and \(SE_{\hat p} = 0.010\), we can compute the Z-score for both the left and right cutoffs:
\[ Z_{0.86} = \frac{0.86-0.88}{0.010}=-2 ~~~~~~~~~~~ Z_{0.90} = \frac{0.90-0.88}{0.010}=2\]
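These Z-scores are easy to verify in R (a quick check of the arithmetic above):

(0.86 - 0.88) / 0.010  # Z = -2
(0.90 - 0.88) / 0.010  # Z =  2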
We can use R, a graphing calculator, or a table to find the areas in the tails. Here we will use R to find the area under the curve.
pnorm(-2)
This gives us the area under the curve to the left of -2. Since the normal distribution is symmetric, this is the same as the area to the right of 2:
1 - pnorm(2)
Now, to find the shaded area between the two cutoffs, we subtract both tail areas from 1:
1 - 2*pnorm(-2)
So about 95.44% of the sampling distribution in the histogram plot from the sample proportion example is within \(\pm 0.02\) of the population proportion, \(p = 0.88\).
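Equivalently, pnorm() can work on the original scale if we supply the mean and standard error directly, skipping the standardization step:

# Area between 0.86 and 0.90 under N(mean = 0.88, sd = 0.010)
pnorm(0.90, mean = 0.88, sd = 0.010) - pnorm(0.86, mean = 0.88, sd = 0.010)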
An interesting question to ask is: what happens when \(np < 10\) or \(n(1-p) < 10\)? As we did before, we can simulate drawing samples of different sizes where, say, the true proportion is \(p = 0.25\). Here’s a sample of size 10:
\[no,~ no,~ yes,~ yes,~ no,~ no,~ no,~ no,~ no,~ no\]
set.seed(123)

# Simulate 10000 samples of size 10 from a population where the true
# proportion of "yes" responses is p = 0.25
phat <- rep(NA, 10000)
for (i in 1:10000) {
  sampled_entries <- sample(c("yes", "no"), size = 10,
                            replace = TRUE, prob = c(0.25, 0.75))
  phat[i] <- sum(sampled_entries == "yes") / 10
}
sampling <- tibble(phat = phat)

# Plot the sample proportions, with the true proportion in red
ggplot(sampling, aes(x = phat)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 col = "black", fill = "lightblue") +
  geom_vline(xintercept = 0.25, col = "red") +
  theme_minimal(base_size = 14) +
  labs(x = "Sample proportions", y = "Density")
Things to notice about the plot above:
The success-failure condition was not satisfied when \(n = 10\) and \(p = 0.25\):
\[np = 10 \times 0.25 = 2.5 < 10 ~~~~~~ n(1-p) = 10 \times 0.75 = 7.5 < 10 \]
This single sampling distribution does not show that the success-failure condition is a perfect guideline, but the guideline did correctly identify that a normal distribution might not be appropriate here.
If we complete several additional simulations, we can see some trends (you can explore them yourself with the sketch after this list):
When either \(np\) or \(n(1-p)\) is small, the distribution is more discrete, i.e., not continuous.
When \(np\) or \(n(1-p)\) is smaller than 10, the skew in the distribution is more noteworthy.
The larger both \(np\) and \(n(1-p)\) are, the more normal the distribution. This may be a little harder to see for the larger sample sizes in these plots, as the variability also becomes much smaller.
When \(np\) and \(n(1-p)\) are both very large, the distribution’s discreteness is hardly evident, and the distribution looks much more like a normal distribution.
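To try this yourself, you can wrap the simulation in a small helper and vary \(n\) and \(p\) (a sketch of ours; the function sim_phat is hypothetical, not part of the tutorial):

# Hypothetical helper: simulate the sampling distribution of p-hat
sim_phat <- function(n, p, reps = 10000) {
  replicate(reps, {
    s <- sample(c("yes", "no"), size = n, replace = TRUE, prob = c(p, 1 - p))
    mean(s == "yes")
  })
}

# Small n: the condition fails, and the distribution is discrete and skewed
hist(sim_phat(n = 10, p = 0.25), breaks = 40, main = "n = 10, p = 0.25")

# Larger n: the condition holds, and the distribution looks much more normal
hist(sim_phat(n = 500, p = 0.25), breaks = 40, main = "n = 500, p = 0.25")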
Good work – notice that the standard error is about half of a percentage point (close to \(0.005\)). Doubling this estimate closely matches what we observed about the sampling error in our simulations. This brings us to our next topic – confidence intervals.
You’ll start again with a video from Dr. Diez. Once you’ve watched it, we’ll continue with our example about the 2020 Pew Research study on the proportion of American adults who are in favor of a citizenship option for the DREAMers.
As Dr. Diez mentions, a confidence interval can be used to capture a population parameter with some degree of certainty. In general, we construct a confidence interval using the following formula: \[\displaystyle{\left(\tt{point~estimate}\right)\pm \left(\tt{critical~value}\right)\cdot SE}\]
The critical value is a Z-score. The Z-score of an observation is the number of standard deviations it falls above or below the mean. We compute the Z-score for a point estimate \(\hat p\) that follows a distribution with mean \(\mu_{\hat p}\) and standard deviation \(\sigma_{\hat p}\) using
\[ Z = \frac{\hat p - \mu_{\hat p}}{\sigma_{\hat p}}\]
Recall that we’ve been working with a 2020 Pew Research study which included 9,654 participants. The study resulted in 74% of participants being in favor of a path to citizenship for the DREAMers, and we computed the standard error to be approximately \(0.0045\).
If we are sure that a sampling distribution is well-modeled by a normal distribution, we have the following critical values associated with several common levels of confidence.
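For instance, you can recover common critical values with qnorm() and assemble the 95% interval for the Pew study (a sketch; the exact rounding may differ slightly from the table above):

# Critical values for common confidence levels
qnorm(0.95)   # 90% confidence: about 1.645
qnorm(0.975)  # 95% confidence: about 1.96
qnorm(0.995)  # 99% confidence: about 2.576

# 95% confidence interval for the path-to-citizenship proportion
p_hat <- 0.74
se <- 0.0045
p_hat + c(-1, 1) * qnorm(0.975) * se  # roughly 0.731 to 0.749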
Use what you learned in the video and your knowledge of the Pew Research study to answer the following questions. You can use the code block to make any necessary computations.
So far, so good! There’s one more topic to go. Sometimes we’ll want to test a claim about a population parameter rather than build a confidence interval for it. Inferential statistics provides a formal framework, called the hypothesis test, for evaluating such statistical claims.
Conclusion: The true proportion of Americans that are happy is between 0.71 and 0.84.
What do we mean by confident?
Let’s look deeper into this by starting with the confidence interval that we’ve already formed.
The data from which this interval was constructed is from 2016, and we can plot both p-hat and the resulting interval on a number line here. To understand what is meant by confident, we need to consider how this interval fits into the big picture.
In classical statistical inference, there is thought to be a fixed but unknown parameter of interest, in this case the population proportion of Americans that are happy.
In 2016, the survey drew a small sample of this population…
calculated \(\hat{p}\) to estimate the parameter, \(p\),
and quantified the uncertainty in that estimate with a confidence interval.
Now imagine what would happen if we were to draw a new sample…
of the same size from that population and come up with a new p-hat and a new interval. It wouldn’t be the same as our first, but it’d likely be similar.
We can imagine doing this a third time: a new data sample,
a new p-hat and a new interval.
We can keep this thought experiment going
but what we want to focus on is the properties of this collection of confidence intervals that are accumulating.
Now let’s do the same for another year from the GSS: 2014.
While we can’t go out right now and knock on doors to collect a new sample of data, we do have data from previous years that we can treat as separate samples. Let’s look at the data from 2014. In that sample, the proportion that are happy is about 0.89.
When we compute a 95% confidence interval, we see it stretches from about 0.83 to 0.94.
# Restrict the GSS data to the 2014 survey
gss2014 <- gss %>%
  filter(year == 2014)

# Point estimate: the proportion of respondents who are happy
p_hat_happy <- gss2014 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()

# Standard error from 500 bootstrap resamples (infer package)
SE_happy <- gss2014 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(se = sd(stat)) %>%
  pull()

# Approximate 95% interval: point estimate plus or minus 2 standard errors
c(p_hat_happy - 2 * SE_happy, p_hat_happy + 2 * SE_happy)
## [1] 0.8334290 0.9399043
Now let’s do the same for another year from the GSS: 2012.
When we compute a 95% confidence interval, we see it stretches from about 0.76 to 0.89.
# Repeat the same steps for the 2012 survey
gss2012 <- gss %>%
  filter(year == 2012)

p_hat_happy <- gss2012 %>%
  summarize(prop_happy = mean(happy == "HAPPY")) %>%
  pull()

SE_happy <- gss2012 %>%
  specify(response = happy, success = "HAPPY") %>%
  generate(reps = 500, type = "bootstrap") %>%
  calculate(stat = "prop") %>%
  summarize(se = sd(stat)) %>%
  pull()

c(p_hat_happy - 2 * SE_happy, p_hat_happy + 2 * SE_happy)
## [1] 0.7612300 0.8921033
If we were to continue this process many times, we’d get many different \(\hat{p}\)s and many different intervals. These intervals aren’t arbitrary: they’re designed to capture that unknown population parameter \(p\).
You can see in this plot that almost all of our intervals succeeded in capturing p, but not all of them.
This interval missed the mark. If these are 95% confidence intervals, they have the property that, across a very large collection of intervals, we’d expect about 95% of them to capture the parameter and 5% of them to miss it.
Interpretation: “We’re 95% confident that the true proportion of Americans that are happy is between 0.71 and 0.84.”
Width of the interval is affected by:
the sample size, n
the confidence level
the parameter, p
This is what is meant by 95% confident. It’s a statement about the way that these intervals behave across many samples of data. Another property of intervals that is important to consider is their width, which is affected by three factors: the sample size, n, the confidence level, and the value of the parameter, p.
In the following exercises you’ll get the chance to explore these factors and how they affect confidence intervals.
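For a preview of how you might explore interval width in code, here is a sketch (the helper ci_width is our own and assumes the normal-approximation standard error):

# Width of a normal-approximation interval: 2 * critical value * SE
# (ci_width is a hypothetical helper, not part of the tutorial)
ci_width <- function(n, p, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  2 * z * sqrt(p * (1 - p) / n)
}

ci_width(n = 100,  p = 0.5)               # wider: small sample
ci_width(n = 1000, p = 0.5)               # narrower: large sample
ci_width(n = 1000, p = 0.5, conf = 0.99)  # wider: higher confidence
ci_width(n = 1000, p = 0.9)               # narrower: p far from 0.5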
We learned that for a 95% confidence interval (a confidence level of 0.95), if we were to take many samples of the same size and compute many intervals, we would expect 95% of the resulting intervals to contain the parameter. Based on the set of confidence intervals plotted here, what is your best guess at the confidence level used in these intervals?
The population proportion is represented by the p in the cloud and the dotted line, and each confidence interval is represented by a segment that extends out from its \(\hat{p}\). Intervals that capture the true value are in green; those that miss it are in red.
You can guess the confidence level using the proportion of intervals that contain the true parameter value.
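To convince yourself of this, you can run a small simulation of your own (a sketch using the normal-approximation interval rather than the tutorial’s bootstrap):

set.seed(42)

# Simulate 1000 samples of size 100 from a population with p = 0.6,
# build a 95% interval from each, and check whether it captures p
p <- 0.6
n <- 100
captured <- replicate(1000, {
  x <- rbinom(1, size = n, prob = p)
  p_hat <- x / n
  se <- sqrt(p_hat * (1 - p_hat) / n)
  (p_hat - qnorm(0.975) * se <= p) & (p <= p_hat + qnorm(0.975) * se)
})

# Fraction of intervals that capture p; should be close to 0.95
mean(captured)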