Chapter 7

```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ```{r, echo=F, message=F, warning=F} library(datasets) library(tidyverse) library(shiny) library(scales) library(jpeg) library(openintro) library(dplyr) library(ggplot2) library(learnr) library(readr) library(knitr) library(png) library(gradethis) #remotes::install_github("rstudio/gradethis") library(learnrhash) #devtools::install_github("rundel/learnrhash") library(grid) library(tinytex) data("COL") ``` ## Acknowledgement
These notes use content from OpenIntro Statistics Slides by Mine Cetinkaya-Rundel.
## 7.2 Paired data
Paired data --Two sets of observations are paired if each observation in one set has a special correspondence or connection with exactly one observation in the other data set. To analyze a paired data, we simply **analyze the differences of each pair of observations.** We **use the contents of 7.1 to discuss** two examples about the difference of paired observations.
## Example 1.
- 200 observations were randomly sampled from the High School and Beyond survey. The same students took a reading and writing test and their scores are shown below.
```{r, echo=F, message=F, warning=F, out.width="90%",fig.align='center'} data(hsb2) data(COL) par(mar=c(3, 4, 0.5, 0.5), las=1, mgp=c(2.8, 0.7, 0), cex.axis = 1.1, cex.lab = 1.1) scores <- c(hsb2$read, hsb2$write) gp <- c(rep('read', nrow(hsb2)), rep('write', nrow(hsb2))) openintro::dotPlot(scores, gp, vertical=TRUE, xlab="",ylab = "scores", at=1:2+0.13, col=fadeColor(COL[1],33), xlim=c(0.5,2.5), ylim=c(20, 80), axes = FALSE, cex.lab = 1.8, cex.axis = 1.8) axis(1, at = c(1,2), labels = c("read","write"), cex.lab = 1.8, cex.axis = 1.8) axis(2, at = seq(20,80,20), cex.axis = 1.25) boxplot(scores ~ gp, add=TRUE, axes=FALSE) abline(h=0) ```
```{r, out.width="90%"} b <- readPNG("b.png") grid.raster(b) ```
## Example (Cont.) Analyze paired data.
To analyze paired data, we look at the difference in outcomes of each pair of observations. For our example, diff = read βˆ’ write
```{r, echo=F, message=F, warning=F, out.width="100%",fig.align='center'} data(hsb2) data(COL) histPlot(hsb2$read - hsb2$write, col = COL[1], xlab = "Differences in scores (read - write)", ylab = "", cex.lab = 2, cex.axis = 3) ```
```{r, out.width="90%"} xx <- readPNG("xx.png") grid.raster(xx) ```
## Inference for paired data- Hypotheses testing.
The analysis is no different than what we have done before. We have data from one sample: differences. - **Parameter of interest:** Average difference between the reading and writing scores of **all** high school students. $πœ‡_{diff}$ - point estimate: Average difference between the reading and writing scores of sampled high school students. $\bar{x}_{diff}$
## Inference for paired data- Hypotheses testing
For the paired test scores, We would like to test to see if average difference is different than 0.
**Hypotheses** For testing if there is a difference between the average reading and writing scores, we have hypotheses $𝐻_0$ ∢ There is no difference between the average reading and writing score.$πœ‡_{diff}$ = 0 $𝐻_a$ ∢ There is a difference between the average reading and writing score.$πœ‡_{diff}\ne$ 0 **Checking assumptions & conditions** Since students are sampled randomly and are less than 10% of all high school students, we can assume that the difference between the reading and writing scores of one student in the sample is independent of another.
## Calculating the test-statistic and the p-value
The observed average difference between the two scores is βˆ’0.545 points and the standard deviation of the difference is 8.887 points. Do these data provide convincing evidence of a difference between the average scores on the two exams? Use 𝛼 = 0.05.
Recall n= 200, as $\sigma$ is unknown; Compute t test statistic $𝑑=\frac{βˆ’0.545βˆ’0}{8.887/\sqrt{200}}=0.8673$ **P – value** \ With $df$=200βˆ’1=199 \ \ $𝑃(|𝑇|>0.86732)=2βˆ—π‘ƒ(𝑇<βˆ’0.8673)=2βˆ—π‘ƒ(𝑇>0.8673) =2βˆ—0.1934=0.38$
```{r, out.width="80%"} qq <- readPNG("qq.png") grid.raster(qq) ```
\ \ \ \ **Conclusion** As the P-value > 0.05, we failed to reject $𝐻_0$. The data do not provide convincing evidence of a difference between the average reading and writing scores.
## Interpretation of p-value
**The interpretation of the p-value** P- value is 0.3868. $=P(|𝑇| > 0.86732)$ $=P(𝑇<βˆ’0.86732 \hspace{0.2cm} \text{or} \hspace{0.2cm} 𝑇>0.86732)$ $=𝑃(\bar{X} <βˆ’0.545 \hspace{0.2cm} \text{or} \hspace{0.2cm} \bar{X}>0.545)$ (Note: the mean is 0) \ So the p-value is the probability of obtaining a random sample of 200 students where the average difference between the reading and writing scores is at least **0.545 more or less** is 0.3868, if it assumes that the true average difference between the scores is 0. **Decision** \ Because the p-value is large, so the data does not provide strong evidence to reject the null hypothesis to support the alternative hypothesis: we do not see the average difference is different than 0.
## Using CI in Hypothesis Testing
Suppose we were to construct 95% confidence interval for the average difference between the reading and writing scores. Would you expect this interval to include 0? 1) Yes 2) No 3) Cannot tell from the information given. The choice is A)--b/c HT failed to reject $𝐻_0$ Check: The 95% CI is $\bar{x} \pm 𝑑_{0.025} \times \frac{s}{\sqrt{n}}$ with $df$=199, $t_{0.025}=1.972$ (use TI 84 Calculator or R to check) CI : ${βˆ’0.545} \pm 1.972 \times \frac{8.887}{\sqrt{200}}$, (βˆ’1.784, 0.694) which contains 0. So, we cannot reject $H_0$ to support $H_a$
## Example 2-- Friday the $13^{th}$
Between 1990 - 1992 researchers in the UK collected data on traffic flow, accidents, hospital admissions on Fridays $13^{th}$ and the previous Fridays, Friday $6^{th}$. Below is this data set on traffic flow. We can assume that traffic flow on given day at locations 1 and 2 are independent ```{r, out.width= "70%"} kk <- readPNG("kk.png") grid.raster(kk) ```
## Hypotheses Testing
We want to investigate if people’s behavior is different on Friday $13^{th}$ compared to Friday $6^{th}$. **Hypotheses** \ $H_0$ ∢ Average traffic flow on Friday $6^{th}$ and$13^{th}$ are equal. \ $H_a$ ∢ Average traffic flow on Friday $6^{th}$ and$13^{th}$ are different. \ $𝐻_0 ∢ πœ‡_{diff}= 0$ \ $𝐻_𝐴 ∢ πœ‡_{diff}\ne 0$ The difference data: 698 1104 1037 1889 1911 2416 2761 4382 1839 321 ```{r quiz5_1, exercise=TRUE} data <- c(698,1104,1037,1889,1911,2416,2761,4382,1839,321) xbar <- mean(data);xbar; s= sd(data);s ``` Data Summary: $n$=10, $\bar{x}$=1836, $s$=1176 **The t-test statistic** $𝑑=\frac{1836βˆ’0}{1176/\sqrt{10}}=4.937$ \ **The P-value is, with $df$ =πŸ—,**$𝑃(|𝑇|>4.937)=2βˆ—π‘ƒ(𝑇>4.937)=0.0008055524$ \ **The conclusion:** Since the p-value is quite low, we conclude that the data provide strong evidence to support that there is a difference between traffic flow on Friday $6^{th}$ and $13^{th}$.
## Estimate the difference using confidence interval (end of 7.2)
We concluded that there is a difference in the traffic flow between Friday $6^{th}$ and $13^{th}$ at the significance level $\alpha=0.05$. We can use a 95% confidence interval to estimate this difference. For $df$=9, $𝑑_{0.025}=2.2622$ the 95 CI is: $1836\pm 2.2622 \times \frac{1176}{\sqrt{10}}$, that is, (995, 2677) As the CI does not contain the null value 0, we reject $H_0$ and substantiated $H_a$. This conclusion using confidence interval agrees with the findings using P-value.