These notes use content from OpenIntro Statistics Slides by
Mine Cetinkaya-Rundel.
We are often interested in comparing two population proportions, for example the graduation rates of two universities.
In this section, we extend our methods and results for one proportion \(p\) (point estimate, confidence interval, and hypothesis testing) to the difference of two proportions \(p_1 - p_2\).
The point estimate is the difference of sample proportions \(\hat{p}_1- \hat{p}_2\).
The sampling distribution of \(\hat{p}_1- \hat{p}_2\) has similar properties, but uses \(\color{blue}{S.E= \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\).
A confidence interval to estimate \(p_1 - p_2\) is constructed in a similar way.
Hypothesis testing is conducted in a similar way, but uses the null value 0 and \(\color{blue}{S.E_{\hat{p}_{pooled}}}\).
Let \(p_1 \hspace{0.2cm}\text{and} \hspace{0.2cm} p_2\) be the proportions of the same characteristic of two populations \(X_1 \hspace{0.2cm} \text{and} \hspace{0.2cm} X_2\).
Use the difference of sample proportions \(\hat{p}_1- \hat{p}_2\) as the point estimator of the difference of two proportions \(p_1 - p_2\).
Example:
“Melting Ice Cap” Scientists predict that global warming may have big effects on the polar regions within the next 100 years. One of the possible effects is that the northern ice cap may completely melt. Below is the list of distributions of responses from the 2020 GSS (General Social Survey) and from a group of introductory statistics students at Duke University.
\[ \begin{align*} \hline && GSS && Duke \\ \hline \text{A great deal} && 454 && 69 \\ \text{Some} && 124 && 30\\ \text{A little} && 52 && 4\\ \text{Not at all} && 50 && 2 \\ \hline \text{Total} && 680 && 105\\ \hline \end{align*} \]
Estimate the difference of proportions of all Duke students and all Americans who would be bothered a great deal by the northern ice cap completely melting. That is, estimate \(p_{Duke} - p_{GSS}\).
Estimate \(p_\text{Duke} - p_\text{GSS}\) using \(\hat{p}_{Duke} - \hat{p}_{GSS}\).
\(\hat{p}_{Duke} - \hat{p}_{GSS}\)
\(= \frac{x_1}{n_1}- \frac{x_2}{n_2}\)
\(=\frac{69}{105} - \frac{454}{680}\)
\(= 0.657-0.668\)
\(= -0.011\)
So we estimate that the proportion of all Duke students who would be bothered a great deal by the northern ice cap completely melting is about 1.1 percentage points lower than the corresponding proportion of all Americans.
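The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the variable names are illustrative, and only the counts come from the table.

```python
# Counts of "A great deal" responses and sample sizes, taken from the table.
x_duke, n_duke = 69, 105
x_gss, n_gss = 454, 680

# Sample proportions for each group.
p_hat_duke = x_duke / n_duke
p_hat_gss = x_gss / n_gss

# Point estimate of p_Duke - p_GSS.
diff = p_hat_duke - p_hat_gss

print(f"p_hat_duke = {p_hat_duke:.3f}")
print(f"p_hat_gss  = {p_hat_gss:.3f}")
print(f"difference = {diff:.3f}")
```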
The difference between the proportions of all Duke students and all Americans who would be bothered a great deal by the northern ice cap completely melting:
\[p_\text{Duke}-p_\text{US}\]
The difference between the proportions of sampled Duke students and sampled Americans who would be bothered a great deal by the northern ice cap completely melting:
\[\hat{p}_\text{Duke}-\hat{p}_\text{US}\]
Conditions:
1. The data are independent (random samples) within and between the two groups.
2. The success-failure condition holds for each group (so each sample proportion follows a nearly normal model, and the two groups are independent).
Conclusion:
The difference \(\hat{p}_{1}-\hat{p}_{2}\) is nearly normal with the \(\color{blue}{\text{mean}\hspace{0.2cm}{p_1 - p_2}}\) and standard deviation \[\color{blue}{SE_{(\hat{p}_1-\hat{p}_2)} = \sqrt{\frac{p_1(1-p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}}\]
where \(n_1 \hspace{0.2cm} \text{and}\hspace{0.2cm} n_2\) are the sizes of samples for group 1 and group 2 respectively.
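Since \(p_1\) and \(p_2\) are unknown in practice, the standard error is estimated with the sample proportions:
\[\color{blue}{SE_{(\hat{p}_1-\hat{p}_2)} \approx \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\]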
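One way to build intuition for this result is a small simulation. The sketch below assumes hypothetical population proportions \(p_1 = 0.66\) and \(p_2 = 0.67\) (these are not values from the notes), repeatedly draws two independent samples, and compares the standard deviation of the simulated differences with the standard error formula. The function names, the seed, and the number of repetitions are illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sample_proportion(p, n, rng):
    """Proportion of successes in n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n)) / n

def simulate_differences(p1, n1, p2, n2, reps, seed=0):
    """Simulate `reps` values of p1_hat - p2_hat from independent samples."""
    rng = random.Random(seed)
    return [
        sample_proportion(p1, n1, rng) - sample_proportion(p2, n2, rng)
        for _ in range(reps)
    ]

# Hypothetical population values, chosen only for illustration.
p1, n1 = 0.66, 105
p2, n2 = 0.67, 680

diffs = simulate_differences(p1, n1, p2, n2, reps=2000)

# Standard error from the formula, using the (assumed) true proportions.
se_formula = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print(f"mean of simulated differences: {mean(diffs):.4f}")
print(f"SD of simulated differences:   {stdev(diffs):.4f}")
print(f"SE from the formula:           {se_formula:.4f}")
```

The simulated mean should land close to \(p_1 - p_2\) and the simulated standard deviation close to the formula value, up to simulation noise.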
The Confidence Interval (C.I) is constructed in the same way as before: \[\color{blue}{\text{point estimate} \pm \text{margin of error}}\]
The details of the C.I. are as follows.
The \(\color{blue}{\text{point estimate} :\hat{p}_{1}-\hat{p}_{2}}\)
The Standard error \(\color{blue}{SE_{(\hat{p}_1-\hat{p}_2)} = \sqrt{\frac{\hat{p_1}(1-\hat{p_1})}{n_1}+\frac{\hat{p_2}(1-\hat{p_2})}{n_2}}}\)
The \(\color{blue}{\text{margin of error} = z_{\frac{\alpha}{2}} \times S.E}\)
C.I: \(\color{blue}{\text{point estimate} \pm \text{margin of error}}\)
i.e. \(\color{purple}{(\hat{p}_{1}-\hat{p}_{2}) \pm z_{\frac{\alpha}{2}} \times S.E}\)
i.e. \(\color{purple}{(\hat{p}_{1}-\hat{p}_{2}) \pm z_{\frac{\alpha}{2}} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}\)
\[ \begin{align*} \text{Data} \hspace{0.5cm}& Duke & US \\ \hline \text{A great deal} \hspace{0.5cm} & 69 & 454 \\ \text{Not a great deal} \hspace{0.5cm}& 36 & 226 \\ \hline \text{Total} \hspace{0.5cm}& 105 & 680 \\ \hline \hat{p} \hspace{0.5cm} & 0.657 & 0.668 \end{align*} \]
Back to our example
Independence within groups:
Independence between groups: The sampled Duke students and the US residents are independent of each other.
Success-Failure: At least 10 observed successes and 10 observed failures in the two groups.
\[ \begin{align*} \text{Data} \hspace{0.5cm}& Duke & US \\ \hline \text{A great deal} \hspace{0.5cm} & 69 & 454 \\ \text{Not a great deal} \hspace{0.5cm}& 36 & 226 \\ \hline \text{Total} \hspace{0.5cm}& 105 & 680 \\ \hline \hat{p} \hspace{0.5cm} & 0.657 & 0.668 \end{align*} \]
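The success-failure check can be written as a short helper. This is only a sketch: the function name `success_failure_ok` and the `threshold` parameter are illustrative, and the counts are those in the table above.

```python
def success_failure_ok(successes, n, threshold=10):
    """Return True if both successes and failures are at least `threshold`."""
    failures = n - successes
    return successes >= threshold and failures >= threshold

# (successes, sample size) for each group, from the table.
groups = {"Duke": (69, 105), "US": (454, 680)}

for name, (x, n) in groups.items():
    print(f"{name}: successes={x}, failures={n - x}, ok={success_failure_ok(x, n)}")
```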
Point estimate: \(\hat{p}_{1}-\hat{p}_{2}\)
\(\hat{p}_{Duke} - \hat{p}_{GSS} = 0.657 - 0.668 = -0.011\)
\((\hat{p}_{Duke} - \hat{p}_{US}) \pm z^\star \times \sqrt{ \frac{ \hat{p}_{Duke} (1 - \hat{p}_{Duke})}{n_{Duke} } + \frac{ \hat{p}_{US} (1 - \hat{p}_{US})}{n_{US}}}\)
\(=(0.657 - 0.668) \pm 1.96 \times \sqrt{ \frac{0.657 \times 0.343}{105} + \frac{0.668 \times 0.332}{680}}\)
\(= -0.011 \pm 1.96 \times 0.0497\)
\(= -0.011 \pm 0.097\)
\(= (-0.108, 0.086)\)
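The interval can also be computed without rounding intermediate values. The following is a minimal sketch using only the Python standard library; the function name `two_proportion_ci` and its `confidence` parameter are illustrative, not part of the notes.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ci(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 using the unpooled standard error."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # Critical value z* with area (1 - confidence) / 2 in each tail.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    estimate = p1_hat - p2_hat
    margin = z_star * se
    return estimate - margin, estimate + margin

# Group 1 is Duke, group 2 is the US (GSS) sample.
lower, upper = two_proportion_ci(69, 105, 454, 680)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Carrying full precision gives an upper endpoint near 0.087 rather than 0.086. The difference comes from rounding the point estimate and the margin of error in the hand calculation.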
For Hypothesis testing of difference of two proportions:
Hypotheses: \(\color{blue}{H_0: p_1 -p_2= 0, H_a: p_1-p_2 \ne 0}\) \(\color{red}{\text{(two sided)}}\)
or equivalently \({H_0: p_1= p_2, H_a: p_1\ne p_2}\)
For HT, we use the pooled proportion
\[ \begin{eqnarray*} \hat{p}_{pooled}&=&\color{blue}{\frac{\text{total success}}{\text{total size}}= \frac{x_1 + x_2}{n_1 +n_2}=\frac{n_1\hat{p}_1 +n_2\hat{p}_2}{n_1+n_2}}\\ S.E. &=& \sqrt{\frac{\hat{p}_{pooled}(1-\hat{p}_{pooled})}{n_1}+\frac{\hat{p}_{pooled}(1-\hat{p}_{pooled})}{n_2}}\\ &=& \sqrt{\hat{p}_{pooled}(1-\hat{p}_{pooled})\big(\frac{1}{n_1}+ \frac{1}{n_2}\big)} \end{eqnarray*} \]
- The z-test statistic: \(z = \frac{(\hat{p}_1-\hat{p}_2)-0}{S.E_{\hat{p}_{pooled}}}= \frac{\hat{p}_1-\hat{p}_2}{S.E_{\hat{p}_{pooled}}}\)
The P-value: P-value = \(P(|Z| >|z|)\) (probability of two tails)
The smaller the P-value, the stronger the evidence against \(H_0\) and in support of \(H_a\).
The P-value is calculated similarly for right-sided and left-sided tests.
Right sided:
- Hypotheses: \(\color{purple}{H_0: p_1 -p_2= 0, H_a: p_1-p_2 > 0}\) \(\color{red}{\text{(right sided)}}\)
or equivalently \({H_0: p_1= p_2, H_a: p_1 > p_2}\)
- The z-test statistic: \(z = \frac{\hat{p}_1-\hat{p}_2-0}{S.E.}= \frac{\hat{p}_1-\hat{p}_2}{S.E.}\)
The P-value: P-value = \(P(Z>z)\) (probability of right tail)
Left sided:
- Hypotheses: \(\color{purple}{H_0: p_1 -p_2= 0, H_a: p_1-p_2 < 0}\) \(\color{red}{\text{(left sided)}}\)
or equivalently \({H_0: p_1= p_2, H_a: p_1 < p_2}\)
- The z-test statistic: \(z = \frac{\hat{p}_1-\hat{p}_2-0}{S.E.}= \frac{\hat{p}_1-\hat{p}_2}{S.E.}\)
The P-value: P-value= \(P(Z<z)\) (probability of left tail)
The smaller the P-value, the stronger the evidence against \(H_0\) and in support of \(H_a\).
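The three P-value rules (two tails, right tail, left tail) can be collected in one helper. This is a sketch under stated assumptions: the function name `p_value` and the labels `"two-sided"`, `"greater"`, and `"less"` are illustrative choices, and \(z = 1.96\) is used only as a familiar reference value.

```python
from statistics import NormalDist

def p_value(z, alternative="two-sided"):
    """P-value for a z statistic under the standard normal model.

    alternative: "two-sided" (H_a: p1 != p2), "greater" (H_a: p1 > p2),
    or "less" (H_a: p1 < p2). These labels are illustrative.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "two-sided":
        return 2 * (1 - std_normal.cdf(abs(z)))  # P(|Z| > |z|)
    if alternative == "greater":
        return 1 - std_normal.cdf(z)             # P(Z > z)
    if alternative == "less":
        return std_normal.cdf(z)                 # P(Z < z)
    raise ValueError(f"unknown alternative: {alternative!r}")

for alt in ("two-sided", "greater", "less"):
    print(f"{alt:>9}: {p_value(1.96, alt):.3f}")
```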
Which of the following is the correct set of hypotheses for testing if the proportion of all Duke students who would be bothered a great deal by the melting of the northern ice cap differs from the proportion of all Americans who do?
A) \(\color{red}{H_0: p_{Duke} = p_{US}}\)
\(\color{red}{H_a: p_{Duke} \neq p_{US}}\)
B) \(H_0: \hat{p}_{Duke} = \hat{p}_{US}\)
\(H_a: \hat{p}_{Duke} \neq \hat{p}_{US}\)
C) \(H_0: p_{Duke}-p_{US}=0\)
\(H_a:p_{Duke}-p_{US} \neq 0\)
D) \(H_0: p_{Duke} = p_{US}\)
\(H_a: p_{Duke} < p_{US}\)
E) Both A) and C) are correct.
\[ \begin{eqnarray*} \text{Data} \hspace{0.5cm}& Duke & US \\ \hline \text{A great deal} \hspace{0.5cm} & 69 & 454 \\ \text{Not a great deal} \hspace{0.5cm}& 36 & 226 \\ \hline \text{Total} \hspace{0.5cm}& 105 & 680 \\ \hline \hat{p} \hspace{0.5cm} & 0.657 & 0.668 \end{eqnarray*} \]
\[\hat{p}_{pooled}=\color{purple}{\frac{\text{total success}}{\text{total size}}= \frac{69 + 454}{105 +680}=\frac{523}{785}}= 0.666\]
\[ \begin{eqnarray*} S.E. &=& \sqrt{\hat{p}_{pooled}(1-\hat{p}_{pooled})\big(\frac{1}{n_1} + \frac{1}{n_2}\big)}\\ &=&\sqrt{\frac{523}{785}\times\frac{262}{785}\times\big(\frac{1}{105}+ \frac{1}{680}\big)}\\ &=&0.0494 \end{eqnarray*} \]
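In the HT for two proportions, \(H_0 : p_1 - p_2=0\) states that \(p_1 = p_2\) but does not specify their common value.
So the null hypothesis gives no value of \(p\) to plug into the standard error; instead, we estimate the common proportion with the pooled sample proportion \(\hat{p}_{pooled}\).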
(Use the original fractions to avoid round-off errors when doing HW.)
The z-test statistic \(z = \frac{(\hat{p}_{Duke}-\hat{p}_{US})-0}{SE}= \frac{-0.011}{0.0494}= -0.2227\)
The P-value \(= P(|Z| > 0.2227)= 2\times P(Z>0.2227)\)
\(= 2\times P(Z < -0.2227)\)
\(= 2\times 0.4119\)
\(=0.824\)
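The same test can be run with unrounded fractions. The following is a minimal sketch using only the Python standard library; the function name `two_proportion_z_test` is illustrative, and it implements the two-sided pooled test described above.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test of H0: p1 - p2 = 0. Returns (z, p_value)."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Pooled proportion: total successes over total sample size.
    p_pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    # Two-tailed P-value: P(|Z| > |z|) = 2 * P(Z < -|z|).
    p_val = 2 * NormalDist().cdf(-abs(z))
    return z, p_val

# Group 1 is Duke, group 2 is the US (GSS) sample.
z, p_val = two_proportion_z_test(69, 105, 454, 680)
print(f"z = {z:.4f}, P-value = {p_val:.3f}")
```

With unrounded fractions the statistic is roughly \(-0.21\) and the P-value roughly 0.83, slightly different from the hand calculation, which used the rounded difference \(-0.011\). The conclusion is the same either way.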
Conclusion:
We fail to reject \(H_0\), so the data do not substantiate \(H_a\). In context, the data do not suggest that the proportion of \(\text{all Duke students}\) who would be bothered a great deal by the melting of the northern ice cap differs from the proportion of all Americans who would.
When working with one proportion,
If doing a HT, \(p\) comes from the null hypothesis: use \(p_0\), so \(S.E= \sqrt{\frac{p_0(1-p_0)}{n}}\)
If constructing a CI, use \(\hat{p}\) instead: \(S.E= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
When working with difference of two proportions,
if doing a hypothesis test with \(H_0: p_1 - p_2 = 0\), use the pooled proportion
\(\hat{p} \hspace{0.2cm} (\text{or} \hspace{0.2cm}\hat{p}_{pooled}) =\color{purple}{\frac{\text{total success}}{\text{total size}}= \frac{x_1 + x_2}{n_1 +n_2}=\frac{n_1\hat{p}_1 +n_2\hat{p}_2}{n_1+n_2}}\)
The z-test statistic: \(z= \frac{\hat{p}_1-\hat{p}_2-0}{S.E_{\hat{p}_{pooled}}}= \frac{\hat{p}_1-\hat{p}_2}{S.E_{\hat{p}_{pooled}}}\)
For confidence interval, use
\[\color{purple}{S.E= \sqrt{ \frac{ \hat{p}_{1} (1 - \hat{p}_{1})}{n_{1} } + \frac{ \hat{p}_{2} (1 -\hat{p}_{2})}{n_{2}}}}\]
\[\color{purple}{(\hat{p}_{1} - \hat{p}_{2}) \pm z_{\frac{\alpha}{2}} \times \sqrt{ \frac{ \hat{p}_{1} (1 - \hat{p}_{1})}{n_{1} } + \frac{ \hat{p}_{2} (1 -\hat{p}_{2})}{n_{2}}}}\]
When working with one mean,
\[ \bar{X} \sim N(\mu, \frac{\sigma}{\sqrt{n}})\]
\[S.E = \frac{s}{\sqrt{n}}\]