Acknowledgement

These notes use content from OpenIntro Statistics Slides by

Mine Cetinkaya-Rundel.

5.2 Confidence intervals for a proportion

In this section, we discuss using confidence intervals to estimate the population proportion.

  • Construction of confidence intervals (width depends on given confidence level)

  • Margin of error

  • Interpreting confidence intervals

Confidence intervals

  • If we report a point estimate, we probably won’t hit the exact population parameter. If we report a range of plausible values, we have a good shot at capturing the parameter. A plausible range of values for the population parameter is called a confidence interval.

  • Using only a sample statistic to estimate a parameter is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net.

We can throw a spear where we saw a fish but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish.

Confidence Intervals

  • A confidence interval is a plausible range of values so defined that there is a specified probability that the value of the interested population parameter lies within it.

  • The probability of the confidence interval that contains the parameter is called the confidence level – which is a number chosen to be close to 1 (usually use percentages like 90%, 95%, 99%).

  • Confidence intervals are interval estimates and they are built around a point estimate (value of sample proportion)

Constructing confidence intervals

  • By CLT, if the conditions are met, \(\hat{p} \sim N(p, \sqrt{\frac{p(1-p)}{n}})\)

  • The Standard Error is \(\sigma_{\hat{p}}= S.E._{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}\)

  • When \(p\) is unknown, we replace \(p\) by \(\hat{p}\)

\[S.E._{\hat{p}}= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\]

  • A \((1-\alpha)100\%\) confidence interval for the proportion \(p\) is \[\hat{p} ± z_{\frac{\alpha}{2}} \times S.E._{\hat{p}}\] where \(z_{\frac{\alpha}{2}}\) is a z-score from the standard normal distribution and is related to the confidence level \((1-\alpha)100\%\).

Example: 95% confidence intervals

  • The 95% confidence interval for p is given by:

\[\hat{p} ± 1.96 \times S.E._{\hat{p}}\]

  • Note that \((1-\alpha)100\%=95\% \implies \alpha = 0.05 \implies \alpha/2 = 0.025\)

  • \(z_{\frac{\alpha}{2}} = z_{0.025} = 1.96\) is the z-score that makes the upper-tail area under the standard normal distribution curve equal to \(\alpha/2\) (0.025).

More about \(𝑧_{\frac{𝛼}{2}}\)

  • The value of \(𝑧_{\frac{𝛼}{2}}\) is determined by the value of the subscript.

  • The textbook uses the notation \(z*\) to refer to \(𝑧_{\frac{𝛼}{2}}\).

Margin of error (M.E.)

  • Recall that a \((1-\alpha)100\%\) confidence interval for the proportion \(p\) is given by: \[\hat{p} ± z_{\frac{\alpha}{2}} \times S.E._{\hat{p}}\]

  • The margin of error (M.E.) of the confidence interval is:

\[M.E. = z_{\frac{\alpha}{2}} \times S.E._{\hat{p}}\]

  • The margin of error (M.E.) depends on the confidence level (through \(z_{\frac{\alpha}{2}}\)) and the standard error (\(S.E._{\hat{p}}=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)).

  • The confidence interval can also be written as follows:

\[\hat{p} ± M.E.\]

Illusteration of the 95% confidence interval

  • For large random samples, there is about 95% chance that \(\hat{p}\) falls within \(1.96\times S.E.\) of the population proportion \(𝑝\).

Interpretation of the 95% confidence interval

What does 95% confident mean?

  • If we draw many samples of the same size and build a 95% C.I. \(\hat{𝒑} ±𝟏.𝟗𝟔\times S.E.\) from each sample. Then, about 95% of such C.I.s would contain the true population proportion \(𝑝\).

  • The figure below shows twenty-five (25) point-estimates and 95% confidence intervals from the simulations relative to the population proportion p = 0.88. Only 1 of these 25 intervals did not capture the population proportion, and this interval has been marked red bolded. (Note that: 1/25=0.04, 24/25=0.96)

Interpretation of the 95% confidence interval

What does 95% confident mean? The code below performs the simulation described in previous slide.

library(tidyverse)
nsim = 25 #number of simulations to run (larger nsim will produce more accurate estimate of true confidence level)
CL = 0.95 #confidence level
p = 0.88 #population proportion (parameter)
N = 100000 #population size
pop = c(rep("success", N*p), rep("failure", N*(1-p))) #population of successes and failures
n = 100 #sample size

phat = numeric(nsim)
L = numeric(nsim)
U = numeric(nsim)

for (i in 1:nsim){
  s = sample(pop, n, replace = TRUE)
  phat[i] = sum(s=="success")/n
  SE = sqrt(phat[i]*(1-phat[i])/n)
  L[i] = phat[i] - qnorm((1-CL)/2, lower.tail = FALSE)*SE
  U[i] = phat[i] + qnorm((1-CL)/2, lower.tail = FALSE)*SE
}

ci95 = data.frame(Sample = seq(1,nsim), phat, L, U) 
ci95 = mutate(ci95, capture = as.factor(ifelse(L <= p & p <= U, 1, 0)))

ggplot(data = ci95, aes(x = Sample, y = phat)) + 
  geom_point(aes(color = capture)) + 
  geom_errorbar(aes(ymin = L, ymax = U, color = capture)) + 
  scale_color_manual(values = c('0'='red','1'='black')) + 
  coord_flip() + 
  geom_hline(yintercept = p, linetype = "dashed", color = "blue") + 
  labs(title = paste0(CL*100, "% Confidence Intervals")) + 
  theme(plot.title = element_text(hjust = 0.5))

Changing the confidence level

If the confidence level is changed to 90%, would you expect the confidence interval to be wider or narrower?

  • Here, \(1 - \alpha = 0.90 \implies \alpha = 0.10 \implies \alpha/2 = 0.05\)

  • So, using R, we find \(z_{\frac{\alpha}{2}} = z_{0.05} = 1.645\)

qnorm(0.05, lower.tail = FALSE)
qnorm(1-0.05, lower.tail = TRUE) 
  • The 90% confident interval would be \(\hat{p} ± 1.645×𝑆.𝐸.\) which will be narrower than the 95% interval because the margin of error is higher for the 95% interval (\(z_{\frac{\alpha}{2}} = z_{0.025} = 1.96\)).

  • Practice: Find the 99% confidence interval.

qnorm(0.005, lower.tail = FALSE)
qnorm(1-0.005, lower.tail = TRUE)

\(\hat{p} ± 2.58×𝑆.𝐸.\)

Changing the confidence level

  • The following table shows the values of \(z_{\frac{\alpha}{2}}\) for commonly used confidence levels. You are encouraged to use R to confirm the values in the table.

  • Practice: For an 80% confidence level, find \(z_{\frac{\alpha}{2}}\).

Answer: \(𝛼= 0.20 \implies \frac{\alpha}{2} = 0.10 \implies z_{\frac{\alpha}{2}} = 𝑧_{0.10} = 1.281552\)

Practice

Which of the below Z scores is the appropriate \(z^*\) when calculating a 98% confidence interval?

  1. Z = 2.05
  2. Z = 1.96
  3. Z = 2.33
  4. Z = -2.33
  5. Z = -1.65

Practice

Which of the below Z scores is the appropriate \(z^*\) when calculating a 98% confidence interval?

  1. Z = 2.05
  2. Z = 1.96
  3. Z = 2.33
  4. Z = -2.33
  5. Z = -1.65

Facebook’s categorization of user interests

Most commercial websites (e.g. social media platforms, news outlets, online retailers) collect a data about their users’ behaviors and use these data to deliver targeted content, recommendations, and ads. To understand whether Americans think their lives line up with how the algorithm-driven classification systems categorizes them, Pew Research asked a representative sample of 850 American Facebook users how accurately they feel the list of categories Facebook has listed for them on the page of their supposed interests actually represents them and their interests. 67% of the respondents said that the listed categories were accurate.

Estimate the true proportion of American Facebook users who think the Facebook categorizes their interests accurately.

  1. Find the point estimate in this case

  2. Estimate the standard error

  3. Construct a 95% confidence interval for population proportion

  4. Construct a 90% confidence interval for population proportion

Facebook’s categorization of user interests

Note that: \(\hat{p} = 0.67\) and \(n=850\)

Ans 1. The point estimate is \(\hat{p}\) = 0.67

Ans 2. The standard error is: \[{S.E. = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.67 \times 0.33}{850}} \approx 0.0161281}\]

Ans 3. The approximate 95% confidence interval is

\[ \begin{align*} \hat{p} \pm 1.96 \times S.E. &= 0.67 \pm 1.96 \times 0.016\\ &= (0.67-0.03136, 0.67+0.03136)\\ &= (0.639, 0.701) \end{align*} \] We are 95% confident that 63.9% to 70.1% of all American Facebook users think Facebook categorizes their interests accurately.

Ans 4. The 90% confidence interval is

\[ \begin{align*} \hat{p} \pm 1.645 \times S.E. &= 0.67 \pm 1.645 \times 0.016\\ &= (0.67-0.02632, 0.67+0.02632)\\ &= (0.644, 0.696) \end{align*} \] * Note that the 90% confidence interval is narrower since it has a smaller margin of error.

Practice

Which of the following is the correct interpretation of this confidence interval?

We are 95% confident that

  1. 64% to 70% of American Facebook users in this sample think Facebook categorizes their interests accurately.

  2. 64% to 70% of all American Facebook users think Facebook categorizes their interests accurately.

  3. There is a 64% to 70% chance that a randomly chosen American Facebook user’s interests are categorized accurately.

  4. There is a 64% to 70% chance that 95% of American Facebook users’ interests are categorized accurately.

Practice

Which of the following is the correct interpretation of this confidence interval?

We are 95% confident that

  1. 64% to 70% of American Facebook users in this sample think Facebook categorizes their interests accurately.

  2. 64% to 70% of all American Facebook users think Facebook categorizes their interests accurately.
  3. There is a 64% to 70% chance that a randomly chosen American Facebook user’s interests are categorized accurately.

  4. There is a 64% to 70% chance that 95% of American Facebook users’ interests are categorized accurately.

Factors affecting width of confidence interval

  • The \((1-\alpha)100\%\) confidence interval for the population proportion is

\[\text{point estimate} \pm z_{\frac{\alpha}{2}} \times S.E. \implies \text{point estimate} \pm M.E. \]

  • If we increase our confidence level, then we would have a wider interval.

  • If we know the lower limit and upper limit of the confidence interval \((𝐿,𝑈)\), then we can find the point estimate and the M.E. as follows:

    \[\text{the point estimate} = \text{the midpoint of the interval}=\frac{𝐿+𝑈}{2}\]

    \[\text{M. E.} = \text{half of the length of the C.I.} = \frac{𝑈−𝐿}{2}\]

Determining the sample size given M.E. and C.L.

Since \(M.E. = z_{\frac{\alpha}{2}} \times \sqrt{\frac{p(1-p)}{n}}\), we can solve for \(n\) if we are given the M.E. and confidence level:

\[n=p(1-p) \left({\frac{z_{\frac{\alpha}{2}}}{M.E.}}\right)^2\]

  • When \(p\) is unknown, use \(\hat{p}\) to replace \(p\):

\[n=\hat{p}(1-\hat{p}) \left({\frac{z_{\frac{\alpha}{2}}}{M.E.}}\right)^2\] - As the maximum of \(p(1-p)\) is \(\frac{1}{4}\) (i.e., when \(p=0.5\)), we can use

\[n=\frac{1}{4}\left({\frac{z_{\frac{\alpha}{2}}}{M.E.}}\right)^2\] which gives the minimum size to guarantee the required M.E.

Example

Example. Determine the sample size to be 98% certain that the M.E. is below 0.05. Consider two cases:

  • \(p=0.3\)
  • \(p\) is unknown

Solution

For confidence level 98%, \(z_{\frac{\alpha}{2}} = z_{0.01} = 2.326\)

  • When \(p=0.3\), take the sample size as

\[ n=p(1-p) \left({\frac{z_{\frac{\alpha}{2}}}{M.E}}\right)^2 =0.3 *(0.7) \left({\frac{2.326}{0.05}}\right)^2= 454.4632\approx 455 \ (\text{Always round up!}) \]

  • When \(p\) is unknown, use \(p=0.5\) to get

\[ n=\frac{1}{4}\left({\frac{z_{\frac{\alpha}{2}}}{M.E}}\right)^2 =\frac{1}{4} \left({\frac{2.326}{0.05}}\right)^2 = 541.02 \approx 542 \ (\text{Always round up!}) \]

  • Note: whenever the sample size is not an integer, round it up to the next integer.

Interpreting confidence intervals

Confidence intervals are …

  • Always about the population.
  • Not probability statements.
  • Only about population parameters, not individual observation.
  • Only reliable if the sample statistic they’re based on is an unbiased estimator of the population parameter.

Extending to the population mean (end of 5.2)

For the population mean \(\mu\):

  • Use sample mean \(\bar{x}\) as the point estimator.

  • The Central Limit Theorem for the mean is similar: “If the sample observations are independent and the sample size is at least 30 (\(𝑛\ge30\)), then the sample mean \(\bar{X}\) is approximately normally distributed.”

  • The mean of the sampling distribution of \(\bar{X}\) is \(𝐸(\bar{X})=\mu\)

  • The standard error of the sampling distribution of \(\bar{X}\) is \[𝑆.𝐸._{\bar{X}} = \frac{\sigma}{\sqrt{𝑛}} \approx \frac{s}{\sqrt{𝑛}}\]

where \(\sigma\) is the population standard deviation and \(s\) is the sample standard deviation.

  • The CI for the mean \(\mu\) is: \[\color{red}{\text{point estimate}}\pm z_{\frac{\alpha}{2}} \times S.E \hspace{0.7cm} or \hspace{0.7cm} \color{red}{\text{point estimate}} \pm M.E\]

\[\bar{x} \pm z_{\frac{\alpha}{2}} \times \frac{𝜎}{\sqrt{𝑛}}(\hspace{0.2cm} or \hspace{0.2cm} \bar{x} \pm z_{\frac{\alpha}{2}} \times \frac{s}{\sqrt{n}})\]