Chapter 6

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(readr) library(openintro) library(here) library(tidyverse) data(COL) ``` # Chi-square test of independence ## Popular kids \alert{In the dataset \texttt{popular}, students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A two-way table separating the students by grade and by choice of most important factor is shown below. Do these data provide evidence to suggest that goals vary by grade?} \begin{multicols}{2} \begin{table}[] \begin{tabular}{rrrr} \hline & Grades & Popular & Sports \\ \hline $4^{th}$ & 63 & 31 & 25 \\ $5^{th}$ & 88 & 55 & 33 \\ $6^{th}$ & 96 & 55 & 32 \\ \hline \end{tabular} \end{table} \columnbreak \includegraphics[width=1\columnwidth]{popular_mosaic.pdf} \end{multicols} ## Chi-square test of independence \begin{itemize} \item The hypotheses are: \begin{itemize} \item[$H_0$:] Grade and goals are independent. Goals do not vary by grade. \item[$H_A$:] Grade and goals are dependent. Goals vary by grade. \end{itemize} \pause \item The test statistic is calculated as \[ \chi^2_{df} = \sum_{i = 1}^{k} \frac{(O - E)^2}{E} \quad \text{ where } \quad df = (R - 1) \times (C - 1), \] where $k$ is the number of cells, $R$ is the number of rows, and $C$ is the number of columns. \noindent\rule{4cm}{0.4pt} \alert{Note:} We calculate $df$ differently for one-way and two-way tables. \pause \item The p-value is the area under the $\chi^2_{df}$ curve, above the calculated test statistic. \end{itemize} ## Expected counts in two-way tables **Expected counts in two-way tables** $Expected \text{ }Count = \frac{(row \text{ }total) \times (column \text{ }total)}{table \text{ }total}$ \pause \begin{table}[] \begin{tabular}{rrrr|r} \hline & Grades & Popular & Sports & Total\\ \hline $4^{th}$ & \textcolor{red}{63} & \textcolor{blue}{31} & 25 & 119 \\ $5^{th}$ & 88 & 55 & 33 & 176 \\ $6^{th}$ & 96 & 55 & 32 & 183 \\ \hline Total & 247 & 141 & 90 & 478 \\ \end{tabular} \end{table} \pause \textcolor{red}{$E_{row~1, col~1} = \frac{119 \times 247}{478} = 61$} \pause \textcolor{blue}{$E_{row~1, col~2} = \frac{119 \times 141}{478} = 35$} ## Expected counts in two-way tables \alert{What is the expected count for the highlighted cell?} \begin{table}[] \begin{tabular}{rrrr|r} \hline & Grades & Popular & Sports & Total\\ \hline $4^{th}$ & 63 & 31 & 25 & 119 \\ $5^{th}$ & 88 & \textcolor{red}{55} & 33 & 176 \\ $6^{th}$ & 96 & 55 & 32 & 183 \\ \hline Total & 247 & 141 & 90 & 478 \\ \end{tabular} \end{table} A) $\frac{176 \times 141}{478}$ B) $\frac{119 \times 141}{478}$ C) $\frac{176 \times 247}{478}$ D) $\frac{176 \times 478}{478}$ ## Expected counts in two-way tables \alert{What is the expected count for the highlighted cell?} \begin{table}[] \begin{tabular}{rrrr|r} \hline & Grades & Popular & Sports & Total\\ \hline $4^{th}$ & 63 & 31 & 25 & 119 \\ $5^{th}$ & 88 & \textcolor{red}{55} & 33 & 176 \\ $6^{th}$ & 96 & 55 & 32 & 183 \\ \hline Total & 247 & 141 & 90 & 478 \\ \end{tabular} \end{table} \begin{multicols}{2} A) \textcolor{red}{$\frac{176 \times 141}{478}$} B) $\frac{119 \times 141}{478}$ C) $\frac{176 \times 247}{478}$ D) $\frac{176 \times 478}{478}$ \columnbreak \large\alert{\rightarrow 52} \normalsize\alert{more than expected \# of 5th graders have a goal of being popular} \end{multicols} ## Calculating the test statistic in two-way tables Expected counts are shown in \textcolor{blue}{blue} next to the observed count. \begin{center} \begin{tabular}{rrrr|r} \hline & Grades & Popular & Sports & Total \\ \hline $4^{th}$ & 63 \textcolor{blue}{61} & 31 \textcolor{blue}{35} & 25 \textcolor{blue}{23} &119 \\ $5^{th}$ & 88 \textcolor{blue}{91} & 55 \textcolor{blue}{52} & 33 \textcolor{blue}{33} & 176 \\ $6^{th}$ & 96 \textcolor{blue}{95} & 55 \textcolor{blue}{54} & 32 \textcolor{blue}{34} & 183 \\ \hline Total & 247 & 141 & 90 & 478 \\ \end{tabular} \end{center} \vspace{0.5cm} \pause \begin{eqnarray*} \chi^2 &=& \sum \frac{(63 - 61)^2}{61} + \frac{(31 - 35)^2}{35} + \cdots + \frac{(32 - 34)^2}{34} = 1.3121 \\ \pause df &=& (R - 1) \times (C - 1) = (3 - 1) \times (3 - 1) = 2 \times 2 = 4 \end{eqnarray*} ## Calculating the p-value \alert{Which of the following is the correct p-value for this hypothesis test?} \centering{\textcolor{red}{$\chi^2=1.3121 \qquad df = 4$}} \begin{multicols}{2} ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} chiTail <- function(df, U, showdf = TRUE, axes = TRUE){ par(mar=c(2, 1, 1, 1), mgp=c(2.1, 0.8, 0), las=1) sd <- sqrt(2*df) xmax <- max(df + 4.000102*sd,U+2) x <- seq(0, xmax, 0.05) y <- dchisq(x, df) plot(x, y, type='l', axes=FALSE) if(axes == TRUE){ abline(h=0) axis(1, at=c(0,U,xmax+3), label = c(0,U,NA), cex.axis = 3) } if(axes == FALSE){ lines(c(0,xmax), c(0,0)) } if(showdf == TRUE){ text(x = 0.8*xmax, y = 0.5*max(y), paste("df =",df), cex = 3) } these <- which(x > U) X <- x[c(these[1], these, rev(these)[1])] Y <- c(0, y[these], 0) polygon(X, Y, col='#569BBD') } chiTail(df = 4, U = 1.3121) axis(1, at = 2.5, label = "1.3121", tick = FALSE, cex.axis = 3) ``` \columnbreak \raggedright A) More than 0.3 B) Between 0.3 and 0.2 C) Between 0.2 and 0.1 D) Between 0.1 and 0.05 E) Less than 0.001 \end{multicols} ## Calculating the p-value \alert{Which of the following is the correct p-value for this hypothesis test?} \centering{\textcolor{red}{$\chi^2=1.3121 \qquad df = 4$}} \begin{multicols}{2} ```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} chiTail(df = 4, U = 1.3121) axis(1, at = 2.5, label = "1.3121", tick = FALSE, cex.axis = 3) ``` \columnbreak \raggedright A) **More than 0.3** B) Between 0.3 and 0.2 C) Between 0.2 and 0.1 D) Between 0.1 and 0.05 E) Less than 0.001 \end{multicols} ## Conclusion \alert{Do these data provide evidence to suggest that goals vary by grade?} \begin{itemize} \item[$H_0$:] Grade and goals are independent. Goals do not vary by grade. \item[$H_A$:] Grade and goals are dependent. Goals vary by grade. \\ \end{itemize} \pause Since p-value is high, we fail to reject $H_0$. The data do not provide convincing evidence that grade and goals are dependent. It doesn't appear that goals vary by grade.

Introduction to Probability & Statistics