Tutorial 3: Descriptive Statistics for Numerical and Categorical Data

```{r setup, include=FALSE} library(dplyr) library(ggplot2) library(learnr) library(gradethis) #remotes::install_github("rstudio/gradethis") library(learnrhash) #devtools::install_github("rundel/learnrhash") set.seed(241) tutorial_options(exercise.timelimit = 60, exercise.checker = gradethis::grade_learnr) diamonds <- diamonds diamonds <- data.frame(diamonds) diamonds$region <- sample(c(1,2,3,4,5), prob = c( 0.11, .02 , 0.65, 0.20, .02), nrow(diamonds), replace = TRUE) diamonds$region <- as.factor(diamonds$region) samples <- data.frame(Sample_One = diamonds$carat[sample(1:nrow(diamonds), 8, replace = FALSE)], Sample_Two = diamonds$carat[sample(1:nrow(diamonds), 8, replace = FALSE)], Sample_Three = c(diamonds$carat[sample(1:nrow(diamonds), 7, replace = FALSE)], 5)) cut.samples <- data.frame(Sample_One = diamonds$cut[sample(1:nrow(diamonds), 8, replace = FALSE)], Sample_Two = diamonds$cut[sample(1:nrow(diamonds), 8, replace = FALSE)], Sample_Three = diamonds$cut[sample(1:nrow(diamonds), 8, replace = FALSE)]) ``` ## Objective If you've never coded before (or even if you have), type `"Your Name"` in the interactive R chunk below and run it by hitting `crtl+Enter` or `cmd+Enter` for MAC users. ```{r Student-Name, exercise = TRUE} ``` **Tutorial Objectives:** This tutorial covers the following objectives. * Compute and discuss appropriate summaries for both numerical and categorical data * Regarding numerical variables, discuss the difference between mean and median as well as standard deviation and inter-quartile range. Identify when each measure is appropriate. * Compute summary statistics both by hand and with the use of $\tt{R}$ **Important Reminders:** The following previously mastered material is necessary for success through this tutorial + A variable is *numerical* if a summary statistic such as the mean has a meaningful interpretation. + A variable is *categorical* if it serves to sort observations into different groups (categories). + The unique values which a variable takes are called the *levels* of that variable. ## Summarizing Numerical (Quantitative) Data **Recall**: Variables for which computation of measures like the mean (average) or standard deviation are meaningful are numerical variables. ### Measuring Central Tendancy **Measures of Central Tendency (Averages)**: The mean and median both attempt to measure the *center* of a dataset. ### * The **mean** of a set of observations is the traditional *average*. We typically denote the mean by $\bar{x}$ (or $\mu$ in the case of population-level data) and it is computed as follows: $\bar{x} = \frac{\displaystyle{\sum_{i=1}^{n}{x_i}}}{n} = \frac{x_1+x_2+x_3+...+x_n}{n}$ ### * The **median** is the *middle value* for a set of observations. To compute the median, list the numbers in ascending order and find the number or number(s) in the middle of the list. In the case that there is a single middle number, that is the median. In the case where there are two middle numbers, we take the *mean* of those two. ### **Example**: The following are three random samples of carat values from the *diamonds* dataset. ```{r printSamplesDf, eval = TRUE} samples ``` ### 1. Use the code block below to compute the mean of `Sample_One` ```{r meanSampleOne, exercise = TRUE, exercise.eval = TRUE} ``` ```{r meanSampleOne-check} grade_result( pass_if(~ (abs(.result - mean(samples$Sample_One)) < 0.005)) ) ``` ### In `R` we can easily compute the means and medians for our samples or for the entire dataset! Remember from our most recent tutorial that the `$` operator can be used to access an entire column of a data frame. I've stored the samples in a data frame called *samples*. `R` includes a function `mean()` for computing the mean of a list of numbers and a function `median()` for computing the median. This means that we could compute the mean of `Sample_Two` using `mean(samples$Sample_Two)`. 2. Use the code block below so that it computes the mean of `Sample_Three` using the `$` operator to access the `Sample_Three` column of the `samples` data frame. ```{r meanSampThree, exercise = TRUE, exercise.eval = TRUE} ``` ```{r meanSampThree-solution} mean(samples$Sample_Three) ``` ```{r meanSampThree-check} grade_code() ``` ### 3. Use the `median()` function and the code block below to compute the median of each of the samples and then answer the question that follows. ```{r findMedians, exercise = TRUE} ``` ```{r meanVsMedian, echo = FALSE} question( "What does your work above tell you about the mean and median as measures of central tendancy?", answer("The mean is generally larger than the median"), answer("The mean is generally smaller than the median"), answer("The mean is usually close to the median"), answer("The mean is more strongly distorted by outliers (unusually large or small observed values) than the median is", correct = TRUE), random_answer_order = TRUE, allow_retry = TRUE ) ``` ### 4. In our first tutorial we saw that we can use sample data to make generalizations about populations for which the sample is representative. Answer the following questions with this in mind. ```{r samplesToEstimatePop, echo = FALSE} quiz( question( "Assuming that the samples of diamonds were taken randomly, which of the following claims is justified", answer("The mean from `Sample_Three` is too High", message = "Can we be certain that this is the case?"), answer("The mean from `Sample_One` is too low"), answer("Each of the sample means provides an estimate for the population mean", correct = TRUE), answer("The population mean is equal to the sample means"), allow_retry = TRUE ), question_checkbox( "Which of the following would result in improved estimates for the true mean carat weight of diamonds in our population?", answer("Take the mean of the sample means", correct = TRUE), answer("Combine the three samples into a single sample of 24 diamonds (rather than three separate samples of 8), and take its mean", correct = TRUE), answer("Take samples of larger sizes", correct = TRUE), answer("Take samples of smaller sizes"), answer("Stratify the diamonds according to price (low, medium, and high) and take a random sample of diamonds of each price category"), allow_retry = TRUE, random_answer_order = TRUE ) ) ``` ### Aside: Defining my own data For data which is not already known to `R` (ie. data which is not part of a data frame), we can still use `R` to quickly perform compuations. Consider the distributions of doors knocked on by two political campaign workers last week (Monday - Friday): $\begin{array}{lcl} \text{Worker A} & : & 23,~24,~25,~26,~27\\ \text{Worker B:} & : & 0,~15,~25,~35,~50\end{array}$. We do this below with the help of the `c()` function in `R`, which can be used to create lists of values. The following code block finds the mean and median for `Worker A` -- execute the code block to find the mean and median. Once you've done this for `Worker A`, add two lines to the bottom of the code block so that it also finds the mean and median for `Worker B`. ```{r myData, exercise = TRUE, exercise.eval = FALSE} mean(c(23, 24, 25, 26, 27)) median(c(23, 24, 25, 26, 27)) ``` ```{r myData-solution} mean(c(23, 24, 25, 26, 27)) median(c(23, 24, 25, 26, 27)) mean(c(0, 15, 25, 35, 50)) median(c(0, 15, 25, 35, 50)) ``` ```{r myData-check} grade_code() ``` ### 5. Use your explorations of the means and medians for the poll workers to answer the following question. ```{r needForSpread, echo = FALSE} question_checkbox( "Which of the following are correct?", answer("Worker A and Worker B had the same mean and median number of doors knocked", correct = TRUE), answer("Worker A and Worker B had identical door-knocking strategies"), answer("Measures of center alone are not enough to describe numerical data", correct = TRUE), allow_retry = TRUE ) ``` ## Measuring Spread **Measures of Variability**: Clearly, the center of a dataset doesn't tell the entire story. Our two political pollsters obviously have very different door-knocking strategies but both have a mean (and median) of $25$ doors per day. We should also measure the *spread* of data. ### ### The *standard deviation* of a set of observations is denoted by $s$ (or $\sigma$ in the case of population-level data) and is computed as follows: $$s = \sqrt{\frac{\displaystyle{\sum_{i=1}^{n}{\left(x_i-\bar{x}\right)^2}}}{n-1}}$$ We should also note that if you are certain that you are working with population-level data, then the denominator used to compute the standard deviation should be changed to $N$ (the population size). We can do this because there is no uncertainty in estimating the population standard deviation if we have records from every element of the population. ### **Explaining the Standard Deviation Formula**: The standard deviation seeks to measure an "average deviation" from the mean. * If we don't look too closely at the formula, we can see the summation symbol $\left(\sum\right)$ as well as division (by just about the number of values we've added up). That's almost like an average! * What are we averaging? The quantity $\left(x - \bar{x}\right)$ denotes an observed value's deviation from the mean. We shouldn't average these values though, since the mean sits in *center* of the data and we would have deviations above the mean (positive) "cancelling out" deviations below the mean (negative). * We square the deviations which has two effects: (1) all of the squared deviations are now non-negative, so that no cancellation can occur, and (2) large deviations from the mean carry a larger weight in measuring the standard deviation. * Since we squared the deviations before computing the "average", the units of measure are no longer comparable to the original units that the variable was measured in -- the units are square units now. This is why we see the large square root as the last piece of the formula -- taking the square root brings us back to the original units. ### The *inter-quartile range* (IQR) of a set of observations measures the spread of the "middle-50-percent" of the observations. The IQR is the distance between $Q1$ (the 25th percentile) and $Q3$ (the 75th percentile). * The median of a set of observations splits the set into two halves: an upper half and a lower half. The median of the lower half is called the *first quartile* ($Q_1$) while the median of the upper half is called the *third quartile* ($Q_3$). The interquartile range is the distance between $Q_1$ and $Q_3$. That is, $$IQR = Q_3-Q_1$$ 6. Check your intuition about the standard deviation and interquartile range by answering the questions below. ```{r spreadIntuition, echo = FALSE} quiz( question( "Without carrying out the computations, which of the samples of diamonds has the largest standard deviation?", answer("Sample One"), answer("Sample Two"), answer("Sample Three", correct = TRUE), allow_retry = TRUE ), question( "Which measure of spread is more greatly impacted by the presence of outliers?", answer("Standard Deviation", correct = TRUE), answer("IQR"), allow_retry = TRUE ) ) ``` ### The two plots below are a histogram (left) and a boxplot (right), each showing the distribution of `carat`-weights for the diamonds in our population. ```{r printPlots, echo = FALSE, eval = TRUE, fig.width = 4.5, fig.height = 2.5, message = FALSE, warning = FALSE, results = FALSE} ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat)) + geom_vline(xintercept = mean(diamonds$carat), color = "purple", lwd = 1.75) ggplot(data = diamonds) + geom_boxplot(mapping = aes(y = carat)) + geom_hline(yintercept = mean(diamonds$carat), color = "purple", lwd = 1.75, alpha = 0.5) + coord_flip() ``` * The histogram does a nice job showing the true *shape* of the data, but does not always do a good job showing the presence of outliers. The purple line has been added to the histogram to show the true mean carat weight. * The boxplot doesn't show the detailed shape that the histogram does, but it does a great job showing the IQR, median, and any outliers present. * The lone dots in the boxplot show any outliers (extending more than 1.5 times the IQR, the distance from $Q1$ to $Q3$). * The box in the boxplot shows the IQR -- the left edge of the box is at $Q1$ and the right edge of the box is at $Q3$. * The line through the middle of the box denotes the location of the median. * I've added a faded purple line to our boxplot, showing the location of the mean carat weight. We can really see the impact of those outliers on the mean here. ### **A Note on Skew:** It is common to refer to data as *skewed* if the presence of outliers cause the mean and median to disagree with one another on the location of the "center" of our data. In this case, we say that the data is *skewed in the direction that those outliers have pulled the mean*. For example, we would say that the `carat` weight data (from above) is *skewed right*. ### In `R` we can easily compute the standard deviation with the function `sd()`, and IQR with the function `quantile()` or `IQR()`, for our samples or for the entire dataset! Recall that our diamond samples are stored in a data frame called `samples`. The code block below is preset to compute the standard deviation, Q1, Q3, and IQR for `Sample_One`. Note that in the `quantile()` function the `0.25` identifies the 25th percentile ($Q1$) and the `0.75` identifies the 75th percentile ($Q3$). Run the code to find the standard deviation, Q1, Q3, and the interquartile range for `Sample_One`. Once you've done that, edit the existing code to compute these metrics for the other two samples. ```{r computeSdIqr, exercise = TRUE, exercise.eval = FALSE} sd(samples$Sample_One) quantile(samples$Sample_One, c(0.25, 0.75)) IQR(samples$Sample_One) ``` ### **Remark**: Our third sample of diamond carat sizes contained an outlier. The presence of this outlier drastically impacted the computed mean and standard deviation, but didn't have much (if any) effect on the median or $IQR$. Because of this, we say that the median and $IQR$ are *robust* statistics in the presence of outliers. ### In `R` we can also easily explore these measures of spread for our campaign workers from earlier. Recall their door-knocking data: $\begin{array}{lcl} \text{Worker A} & : & 23,~24,~25,~26,~27\\ \text{Worker B:} & : & 0,~15,~25,~35,~50\end{array}$ Use the code blocks below to find the standard deviation and IQR for the doors visited by the campaign workers. ```{r sdIqrCampaign, exercise = TRUE} ``` ## Summarizing Categorical (Qualitative, Factor) Data Categorical data is best summarized using a frequency table. That is, we use a table that lists how *frequent* each observation occured in the dataset. For example, consider the samples of diamond *cuts* below: ```{r sampleCuts, echo = FALSE, eval = TRUE} cut.samples ``` ### ### We can use `R`'s `table()` function to construct frequency and relative frequency tables for a sample or our entire set of observations. The following code chunk is preset to compute a frequency and relative frequency table for `Sample_One`. Adapt the code to provide summaries for `Sample_Two` and `Sample_Three`. ```{r summarizeCut, exercise = TRUE, exercise.eval = FALSE} table(cut.samples$Sample_One) table(cut.samples$Sample_One)/nrow(cut.samples) ``` ### Below, we can see the distributions of diamond `cut` from `Sample_Three` (left) and from our entire population (right) below. Even with a sample of 8 diamonds, we gain "some" insight as to the most and least common diamond cuts. You may also notice that the frequency and relative frequency plots look identical aside from the scale on the vertical axis -- this will be the case in general. ```{r printCutPlots, echo = FALSE, eval = TRUE, fig.width = 4, fig.height = 2.5, warning = FALSE, message = FALSE, results = FALSE} ggplot(data = cut.samples) + geom_bar(mapping = aes(x = Sample_Three)) + labs(title = "Distribution of cut type in Diamonds (samp)", x = "Cut Type") ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut)) + labs(title = "Distribution of cut type in Diamonds (pop)", x = "Cut Type") ggplot(data = cut.samples) + geom_bar(mapping = aes(x = Sample_Three, y = (..count..)/sum(..count..))) + ylab("proportion") + labs(title = "Distribution of cut type in Diamonds (samp)", x = "Cut Type") ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = (..count..)/sum(..count..))) + ylab("proportion") + labs(title = "Distribution of cut type in Diamonds (pop)", x = "Cut Type") ``` ## Submit ```{r context="server"} learnrhash::encoder_logic(strip_output = TRUE) ``` ```{r encode, echo=FALSE} learnrhash::encoder_ui( ui_before = shiny::div( "If you have completed this tutorial and are happy with all of your", "solutions, please click the button below to generate your hash and", "submit it using the corresponding tutorial assignment tab on Blackboard", shiny::tags$br() ), ui_after = shiny::tags$a(href = "https://blackboard.ncat.edu/", "NCAT Blackboard") ) ``` ## Summary **Summary**: Here's a quick summary of the most important ideas from this tutorial. * We can summarize numerical data using measures of central tendency and measures of spread (or variability) * The *mean* ($\bar{x}$ for samples, $\mu$ for populations) and *median* measure the center of a set of numerical data. * The *standard deviation* ($s$ for samples, $\sigma$ for populations) and interquartile range ($IQR$) measure the spread of a set of numerical data. * The median and $IQR$ are robust measures in the presence of *outliers* (unusually large or small values). * Categorical data is best summarized in a *frequency table* or *relative frequency table*. **R Commands Introduced**: The following commands in `R` were introduced here. * Compute the **mean**: `mean()` * Compute the **median**: `median()` * Compute the **standard deviation**: `sd()` * Compute the **boundaries for the interquartile range**: `quantile(, c(0.25, 0.75))` * Compute the **interquartile range**: `IQR()` * Compute general **percentiles**: `quantile(, c(p1, p2,...))` * Build a **frequency table**: `table()` * Build a **relative frequency table**: `table()/nrow()` or `table()/len()`

Introduction to Probability & Statistics