Chapter 5

Acknowledgement

These notes use content from OpenIntro Statistics Slides by

Mine Cetinkaya-Rundel.

Chapter 5 outline

Statistical inference is primarily concerned with understanding and quantifying the uncertainty of parameter estimates.

In this chapter, we learn about the foundations for inference regarding population proportion 𝑝.

5.1 Point Estimates and Sampling Variability

5.2 Confidence Interval for a Proportion

5.3 Hypothesis Testing for a Proportion

5.1 Point estimates and sampling variability

We begin with investigating the probability distributions of one important statistic - sample proportion.

Point estimate
Sampling error
Bias
Sampling Distribution for Sample Proportions
Central Limit Theorem for Proportion
The effect of sample size 𝑛 on Sampling Distribution for Sample Proportions

Sample Statistics and Population Parameters

Review the concepts parameter and statistic

A parameter is a numerical summary of the population. Example: Percentage of the population of all Florida residents favoring gun control

A statistic is a numerical summary of a sample taken from the population.

Example: Percentage of a group of Florida residents favoring gun control

Notations: 𝑝 for population proportion, \(\hat{p}\) for sample proportion.

In this section, we investigate the probability distributions of one important statistic– sample proportion.

This is called the sampling distribution of sample proportion, represented in CENTRAL LIMIT THEOREM.

Point estimates and error

We are often interested in population parameters.
Complete populations are difficult to collect data on, so we use sample statistics as point estimates for the unknown population parameters of interest.
Error in the estimate \(\rightarrow\) difference between population parameter and sample statistic.
Bias is systematic tendency to over- or under-estimate the true population parameter.
Sampling error describes how much an estimate will tend to vary from one sample to the next.
Much of statistics is focused on understanding and quantifying sampling error, and sample size is helpful for quantifying this error.

Example- Population Proportion

Example. Suppose the proportion of American adults who support the expansion of solar energy is 𝑝=0.88, which is our parameter of interest. Is a randomly selected American adult more or less likely to support the expansion of solar energy?

Answer: More likely. (b/c 𝑝=0.88 is large)

Example - point estimates and variability

Example. Suppose you don’t know the exact proportion of American adults who support expansion of solar energy. In order to estimate the proportion, you might sample from the population and use your sample proportion as the best guess for the unknown population proportion.

Take simple random samples, say, each with \(𝑛=1000\) American adults from the population, and record whether they support solar power expansion or not.
For each sample, find the sample proportion \[\hat{𝑝}= \frac{𝑥}{𝑛}= \frac{(\text{count of those in favor})}{( \text{𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑖𝑧𝑒})}\]
Different samples may yield different sample proportions. We may simulate the sampling process.

Use WebApp

Sampling Distribution of the Sample Proportion https://istats.shinyapps.io/SampDist_Prop/

Put the value and the sample size,

say 𝑝=0.88, 𝑛=1000

We may simulate 1 by 1
Or we can draw 10,000 samples of the same size, say \(𝑛=1000\)

Use R to simulate the process

# 1. Create a set of 250 million entries, where 88%
# of them are "support" and 12% are "not".
set.seed(101) #for everyone running the code to see the same result

pop_size = 250000000
possible_entries = c(rep("support", 0.88 * pop_size),
                      rep("not", 0.12 * pop_size))

# 2. Sample 1000 entries without replacement.
sampled_entries = sample(possible_entries,
                          size = 1000, replace = F)
# 3. Compute p-hat: count the number that are
# "support", then divide by # of the sample size
sum(sampled_entries == "support")/1000

Use R to compute error (p - phat)

pop_size = 250000000
possible_entries = c(rep("support", 0.88*pop_size), rep("not", 0.12*pop_size))

sample_entries = sample( possible_entries, size=1000) 
phat1 = sum(sample_entries == "support")/1000
error1 = 0.88 - phat1
error1
## We have one sample proportion and the error! Do it again!

sample_entries = sample( possible_entries, size=1000)
phat2 = sum(sample_entries == "support")/1000
error2 = 0.88 - phat2
error2

Sampling distribution of the sample proportion

If we repeat this process many times and obtain many \(\hat{p}\) values. The distribution of these \(\hat{p}\) values is called the sampling distribution of \(\hat{p}\).

Practice

What is the shape and center of this distribution? Based on this distribution, what do you think is the true population proportion?

The distribution is unimodal and roughly symmetric.
A reasonable guess for the true population proportion is the center of this distribution, approximately 0.88.

Central Limit Theorem for Proportion

When observations are independent (take random samples of fixed size 𝑛, without replacement) and the sample size 𝑛 is large enough, i.e., \(n𝑝≥10\) and \(𝑛(1−𝑝)≥ 10 )\), then the sample proportion \(\hat{p}\) is approximately normal with

mean: \(\mu_{\hat{p}}=𝑝\), and
standard deviation (also called standard error S.E.): \(\sigma_{\hat{p}}= \sqrt{\frac{𝑝(1−𝑝)}{𝑛}}\)

Central Limit Thereom for Proportion

The CLT states that when observations are independent and the sample size is sufficiently large, then \[\hat{p} \sim N\left(p, \sqrt{\frac{𝑝(1−𝑝)}{𝑛}}\right)\]

The CLT indicates that \(\hat{𝑝}\) is an unbiased point estimator for the parameter 𝑝.
As the sample size 𝑛 increases, the variation of \(\hat{𝑝}\) decreases
By CLT, we may use the normal distribution to compute probabilities of \(\hat{𝑝}\) and \[𝑍=\frac{\hat{𝑝}- 𝜇_{\hat{p}}}{𝜎_{\hat{p}}} \sim N(0,1)\]

Example

For \(n=1000, 𝑝=0.88\), compute \(𝜇_{\hat{p}}\) and \(S.E_{\hat{p}}\) or \(𝜎_{\hat{p}}\), and state the sampling distribution of \(\hat{p}\).
Compute the z-scores for \(\hat{p}=0.86\) and \(\hat{p}=0.90\).
Find probability that \(\hat{p}\) is between 0.86 and 0.90.

Solution

\(𝜇_{\hat{p}}=p=0.88\), \(S.E_{\hat{p}}\) or (\(𝜎_{\hat{p}}\)) \(= \sqrt{\frac{𝑝(1−𝑝)}{𝑛}}=\sqrt{\frac{0.88(1−.88)}{1000}} = 0.010\), and as \(np= 880 > 10\) and \(n(1-p)= 120>10\), i.e., the sample size is sufficiently large, by CLT \(\hat{p} \sim N(.88,0.01)\)
For \(\hat{p} = 0.86\): \(z = \frac{0.86-0.88}{0.010}= -2\) and for \(\hat{p}= 0.90\): \(z=2\)
\(P(0.86 < \hat{p} < 0.90) = P(-2 < Z < 2)= 0.9545\).

pnorm(0.90, 0.88, 0.010) - pnorm(0.86, 0.88, 0.010)

Example (Practice, change sample size)

For \(n=100, 𝑝=0.88\), compute \(𝜇_{\hat{p}}\) and \(S.E_{\hat{p}}\) or \(𝜎_{\hat{p}}\), and state the sampling distribution of \(\hat{p}\).
Compute the z-scores for \(\hat{p}=0.86\) and \(\hat{p}=0.90\).
Find probability that \(\hat{p}\) is between 0.86 and 0.90.

Solution

\(𝜇_{\hat{p}}=p=0.88\), \(S.E_{\hat{p}}\) (or\(𝜎_{\hat{p}}\)) \(= \sqrt{\frac{𝑝(1−𝑝)}{𝑛}}=\sqrt{\frac{0.88(1−.88)}{100}} = 0.032496 \approx 0.0325\), and as \(np= 880 > 10\) and \(n(1-p)= 120>10\), i.e., the sample size is sufficiently large, by CLT, \(\hat{p} \sim N(.88,0.0325)\)
For \(\hat{p} = 0.86\): \(z = \frac{0.86-0.88}{0.0325}= -0.62\) and for \(\hat{p}= 0.90\): \(z=0.62\)
\(P(0.86 < \hat{p} < 0.90) = P(-0.62 < Z < 0.62)= 0.4617\).

pnorm(0.90, 0.88, 0.0325) - pnorm(0.86, 0.88, 0.0325)

Compare the results for \(𝑛=1000\) and \(n=100\): how does the sample size affect the S.E., and the values of z scores?

CLT - conditions

Certain conditions must be met for the CLT to apply:

Independence: Sample observations must be independent. How to verify;
- If random assignment is used for treatment groups
- If observations are from a simple random sample
- if sampling without replacement, CLT is for \(n < 10\%\) of the population \(N\). When \(𝑛 > 10\%\) of \(N\), then the SD is overly estimated, will typically adjust by a factor \(\sqrt{\frac{(𝑁−𝑛)}{(𝑁−1)}}\) (see next slide) .
Success-Failure condition (Large sample size): There should be at least 10 expected successes and 10 expected failures in the observed sample. Check if
- \(𝑛𝑝≥10\) and \(𝑛(1−𝑝)≥10\) or
- \(𝑛\hat{p}≥10\) and \(𝑛(1−\hat{p})≥10\) if \(𝑝\) is unknown

Applying CLT in real -world setting.

When \(p\) is unknown.

The CLT states \(SE = \sqrt{\frac{p(1-p)}{n}}\), with the condition that \(np\) and \(n(1-p)\) are at least 10, however we often don’t know the value of \(p\), the population proportion.

In these cases we substitute \(\hat{p}\) for \(p\). \[SE = \sqrt{\frac{p(1-p)}{n}} \approx = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\] Note: when the population size is N, and 𝑛≥0.10∗𝑁,

Then use a finite population correction factor \(\sqrt{\frac{(𝑁−𝑛)}{(𝑁−1)}}\) \[SE = \sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{(𝑁−𝑛)}{(𝑁−1)}}\] which gives smaller, more precise estimate of S.E.

More on CLT– When \(p\) is low or high, n is not large enough.

Suppose we have a population where the true population proportion is \(p=0.05,\) and we take random sample of size \(n = 50\) from this population. We calculate the sample proportion in each sample and plot these proportions. Would you expect this distribution to be nearly normal? Why, or why not?

No, the success-failure condition is not met \((50 *0.05 = 2.5)\), hence we would not expect the sampling distribution to be nearly normal.

Conditions

What happens when \(np\) and/or \(n(1-p)< 10\)?

When the conditions are not met…

When either \(np\) or \(n(1-p)\) is small, the distribution is more discrete.
When \(np\) or \(n(1-p)<10\), the distribution is more skewed.
The larger both \(np\) and \(n(1-p)\), the more normal the distribution.
When \(np\) and \(n(1-p)\) are both very large, the discreteness of the distribution is hardly evident, and the distribution looks much more like a normal distribution.

Extending the framework for other statistics

The strategy of using sample statistic to estimate a parameter is quite common, and it’s a strategy that we can apply to other statistics besides a proportion.
- Take a random sample of students at a college and ask them how many extracurricular activities they are involved in to estimate the average number of extra curricular activities all students in this college are interested in.
The principles and general ideas are from this chapter apply to other parameters as well, even if the details change a little.