Chapter 5

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(readr) library(openintro) library(here) library(tidyverse) data(COL) ``` # Confidence intervals for a proportion ## Confidence intervals - A plausible range of values for the population parameter is called a **confidence interval**. - Using only a sample statistic to estimate a parameter is like fishing in a murky lake with a spear, and using a confidence interval is like fishing with a net. \begin{multicols}{3} \includegraphics[width=1\columnwidth]{spear.jpg} \columnbreak We can throw a spear where we saw a fish but we will probably miss. If we toss a net in that area, we have a good chance of catching the fish. \columnbreak \includegraphics[width=1\columnwidth]{net.jpg} \end{multicols} - If we report a point estimate, we probably won't hit the exact population parameter. If we report a range of plausible values we have a good shot at capturing the parameter. ## Facebook's categorization of user interests Most commercial websites (e.g. social media platforms, news outlets, online retailers) collect a data about their users' behaviors and use these data to deliver targeted content, recommendations, and ads. To understand whether Americans think their lives line up with how the algorithm-driven classification systems categorizes them, Pew Research asked a representative sample of 850 American Facebook users how accurately they feel the list of categories Facebook has listed for them on the page of their supposed interests actually represents them and their interests. 67% of the respondents said that the listed categories were accurate. Estimate the true proportion of American Facebook users who think the Facebook categorizes their interests accurately. ## Facebook's categorization of user interests \centering{$\hat{p} = 0.67$ \text{ } $n=850$} \pause \raggedright The approximate 95% confidence interval is defined as \centering{$point \text{ }estimate \pm 1.96 \times SE$} \pause \centering{$SE = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.67 \times 0.33}{850}} \approx 0.016$} \pause \begin{align*} \hat{p} \pm 1.96 \times SE &= 0.67 \pm 1.96 + 0.03\\ &= (0.67-0.03, 0.67+0.03)\\ &= (0.64, 0.70) \end{align*} ## Practice \alert{Which of the following is the correct interpretation of this confidence interval?} We are 95% confident that A) 64% to 70% of American Facebook users in this sample think Facebook categorizes their interests accurately. B) 64% to 70% of all American Facebook users think Facebook categorizes their interests accurately. C) There is a 64% to 70% chance that a randomly chosen American Facebook user's interests are categorized accurately. D) There is a 64% to 70% chance that 95% of American Facebook users' interests are categorized accurately. ## Practice \alert{Which of the following is the correct interpretation of this confidence interval?} We are 95% confident that A) 64% to 70% of American Facebook users in this sample think Facebook categorizes their interests accurately. B) \alert{64\% to 70\% of all American Facebook users think Facebook categorizes their interests accurately.} C) There is a 64% to 70% chance that a randomly chosen American Facebook user's interests are categorized accurately. D) There is a 64% to 70% chance that 95% of American Facebook users' interests are categorized accurately. ## What does 95% confident mean? - Suppose we took many sample and built a confidence interval from each sampling using the equation \centering{$point \text{ }estimate \pm 1.96 \times SE$} - Then about 95% of those intervals would contain the true population proportion (p). ## Width of an interval \alert{If we want to be more certain that we capture the population parameter, i.e. increase our confidence level, should we use a wider interval or a smaller interval?} \pause A wider interval. \pause \alert{Can you see any drawbacks to using a wider interval?} ![](images/garfield.png) \pause If the interval is too wide it may not be very informative. ## Changing the confidence level \centering{$point \text{ }estimate \pm z^* \times SE$} - In a confidence interval, $z^* \times SE$ is called the **margin of error**, and for a given sample, the margin of error changes as the confidence level changes. - In order to change the confidence level we need to adjust $z^*$ in the above formula. - Commonly used confidence levels in practice are 90%, 95%, 98% and 99%. - For a 95% confidence interval, $z^* = 1.96$. - However, using the standard normal $(z)$ distribution, it is possible to find the appropriate $z^*$ for any confidence level. ## Practice \alert{Which of the below Z scores is the appropriate $z^*$ when calculating a 98\% confidence interval?} A) Z = 2.05 B) Z = 1.96 C) Z = 2.33 D) Z = -2.33 E) Z = -1.65 ## Practice \alert{Which of the below Z scores is the appropriate $z^*$ when calculating a 98\% confidence interval?} A) Z = 2.05 B) Z = 1.96 C) \alert{Z = 2.33} D) Z = -2.33 E) Z = -1.65 ```{r, echo=F, message=F, warning=F, fig.width=8, fig.height=4,fig.align='center'} normTail(M = c(-2.33,2.33), col = COL[6,4], cex.axis = 1.25) text(x = 0, y = 0.25, "0.98", col = COL[4], cex = 1.5) lines(x = c(-2.33,-2.33), y = c(0, 0.15), lty = 2) lines(x = c(2.33,2.33), y = c(0, 0.15), lty = 2) text(x = -2.33, y = 0.20, "z = -2.33", col = COL[1], cex = 1.5) text(x = 2.33, y = 0.20, "z = 2.33", col = COL[1], cex = 1.5) text(x = -2.7, y = 0.03, "0.01", col = COL[4], cex = 1.5) text(x = 2.7, y = 0.03, "0.01", col = COL[4], cex = 1.5) ``` where z is obtained from the standard normal table or RStudio ```{r, echo=TRUE} qnorm(0.01) ``` ## Interpreting confidence intervals Confidence intervals are ... - Always about the population. - Not probability statements. - Only about population parameters, not individual observation. - Only reliable if the sample statistic they're based on is an unbiased estimator of the population parameter.