If you’ve never coded before (or even if you have), type print("Your Name") in the interactive R chunk below and run it by pressing Ctrl+Enter (or Cmd+Enter for Mac users).
Tutorial Objectives: This tutorial covers the following objectives.
Important Reminders: The following previously mastered material is necessary for success in this tutorial.
Recall: Numerical variables are those for which computing measures like the mean (average) or standard deviation is meaningful.
Measures of Central Tendency (Averages): The mean and median both attempt to measure the center of a dataset.
Sample_One
In R we can easily compute the means and medians for our samples or for the entire dataset! Remember from our most recent tutorial that the $ operator can be used to access an entire column of a data frame. I’ve stored the samples in a data frame called samples. R includes a function mean() for computing the mean of a list of numbers and a function median() for computing the median. This means that we could compute the mean of Sample_Two using mean(samples$Sample_Two).
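As a minimal sketch of the $ operator together with mean() and median(), the block below builds a small stand-in data frame. The name toy_samples, its column names, and its values are all hypothetical (the tutorial's real samples data frame is not reproduced here); only the pattern of calls is the point.

```r
# Hypothetical stand-in for the tutorial's `samples` data frame (made-up values)
toy_samples <- data.frame(
  Sample_A = c(0.23, 0.31, 0.40, 0.52, 0.70),
  Sample_B = c(0.30, 0.30, 0.41, 0.90, 1.09)
)

# The $ operator pulls out one column as a vector of numbers
mean_a <- mean(toy_samples$Sample_A)
median_a <- median(toy_samples$Sample_A)

# sapply() applies the same function to every column at once
column_means <- sapply(toy_samples, mean)
column_medians <- sapply(toy_samples, median)

print(mean_a)
print(median_a)
print(column_means)
print(column_medians)
```

The sapply() calls are optional; they are just one convenient way to avoid retyping the same line for each sample.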
The code block below computes the mean of Sample_Three, using the $ operator to access the Sample_Three column of the samples data frame.
mean(samples$Sample_Three)
Use the median() function and the code block below to compute the median of each of the samples, and then answer the question that follows.
For data which is not already known to R (i.e., data which is not part of a data frame), we can still use R to quickly perform computations. Consider the distributions of doors knocked on by two political campaign workers last week (Monday - Friday): \(\begin{array}{lcl} \text{Worker A} & : & 23,~24,~25,~26,~27\\ \text{Worker B} & : & 0,~15,~25,~35,~50\end{array}\). We can enter these values directly with the help of the c() function in R, which can be used to create lists (vectors) of values.
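To make the idea of c() concrete, here is a small sketch that stores Worker A's values in a variable and recomputes the mean and median "by hand" so you can see what mean() and median() are doing. The variable names (worker_a, n_days, and so on) are illustrative choices, not names used elsewhere in the tutorial, and the median step assumes an odd number of values.

```r
# c() combines individual values into a single vector
worker_a <- c(23, 24, 25, 26, 27)
n_days <- length(worker_a)

# The mean is the total divided by the number of observations
mean_by_hand <- sum(worker_a) / n_days

# The median is the middle value once the data is sorted
# (this indexing assumes an odd number of observations)
sorted_a <- sort(worker_a)
median_by_hand <- sorted_a[(n_days + 1) / 2]

print(mean_by_hand)
print(median_by_hand)
```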
The following code block finds the mean and median for Worker A – execute the code block to find the mean and median. Once you’ve done this for Worker A, add two lines to the bottom of the code block so that it also finds the mean and median for Worker B.
mean(c(23, 24, 25, 26, 27))
median(c(23, 24, 25, 26, 27))
mean(c(23, 24, 25, 26, 27))
median(c(23, 24, 25, 26, 27))
mean(c(0, 15, 25, 35, 50))
median(c(0, 15, 25, 35, 50))
Measures of Variability: The center of a dataset doesn’t tell the entire story. Our two campaign workers have very different door-knocking strategies, but both have a mean (and median) of \(25\) doors per day. We should also measure the spread of the data.
The standard deviation of a set of observations is denoted by \(s\) (or \(\sigma\) in the case of population-level data) and is computed as follows: \[s = \sqrt{\frac{\displaystyle{\sum_{i=1}^{n}{\left(x_i-\bar{x}\right)^2}}}{n-1}}\]
We should also note that if you are certain that you are working with population-level data, then the denominator used to compute the standard deviation should be changed to \(N\) (the population size). We can do this because there is no uncertainty in estimating the population standard deviation if we have records from every element of the population.
Explaining the Standard Deviation Formula: The standard deviation seeks to measure an “average deviation” from the mean.
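One way to read the formula is as four small steps, sketched below with Worker A's door counts. The step breakdown and variable names are an illustration rather than part of the original tutorial. Note that R's sd() always divides by \(n-1\), so the population version (dividing by the number of observations) is computed by hand here.

```r
# Worker A's door counts from the example above
doors <- c(23, 24, 25, 26, 27)
n <- length(doors)

# Step 1: deviation of each observation from the mean
deviations <- doors - mean(doors)

# Step 2: square the deviations so negatives and positives do not cancel
squared_deviations <- deviations^2

# Step 3: "average" the squared deviations, dividing by n - 1 for a sample
sample_variance <- sum(squared_deviations) / (n - 1)

# Step 4: take the square root to return to the original units (doors)
sd_by_hand <- sqrt(sample_variance)

# Population version: same steps, but divide by n instead of n - 1
sd_population <- sqrt(sum(squared_deviations) / n)

print(sd_by_hand)     # should agree with sd(doors)
print(sd(doors))
print(sd_population)  # slightly smaller than the sample version
```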
The interquartile range (IQR) of a set of observations measures the spread of the “middle 50 percent” of the observations. The IQR is the distance between \(Q_1\) (the 25th percentile) and \(Q_3\) (the 75th percentile).
* The median of a set of observations splits the set into two halves: an upper half and a lower half. The median of the lower half is called the first quartile (\(Q_1\)) while the median of the upper half is called the third quartile (\(Q_3\)). The interquartile range is the distance between \(Q_1\) and \(Q_3\). That is, \[IQR = Q_3-Q_1\]
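The "median of each half" description can be sketched directly in R. The data below is made up for illustration, and the sketch assumes the convention that the overall median is left out of both halves when the number of observations is odd. Be aware that R's quantile() and IQR() interpolate between observations by default, so their answers can differ slightly from this hand method on small datasets.

```r
# Hypothetical data with an even number of observations
x <- c(1, 3, 5, 7, 9, 11, 13, 15)
x_sorted <- sort(x)
n <- length(x_sorted)

# Split into lower and upper halves (the middle value is dropped when n is odd)
lower_half <- x_sorted[1:(n %/% 2)]
upper_half <- x_sorted[(n - n %/% 2 + 1):n]

# Quartiles as medians of the halves
q1_halves <- median(lower_half)
q3_halves <- median(upper_half)
iqr_halves <- q3_halves - q1_halves

# R's built-in versions, which interpolate by default
q_default <- quantile(x, c(0.25, 0.75))
iqr_default <- IQR(x)

print(c(q1_halves, q3_halves, iqr_halves))
print(q_default)
print(iqr_default)
```

On larger datasets the two approaches usually give very similar values, so the difference matters mostly when checking small examples by hand.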
The two plots below are a histogram (left) and a boxplot (right), each showing the distribution of carat weights for the diamonds in our population.
A Note on Skew: It is common to refer to data as skewed if the presence of outliers causes the mean and median to disagree with one another on the location of the “center” of our data. In this case, we say that the data is skewed in the direction that those outliers have pulled the mean. For example, we would say that the carat weight data (from above) is skewed right.
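A rough way to check the direction of skew described here is to compare the mean with the median. The sketch below uses made-up carat values (not the tutorial's diamond data) with one large value on the high end; the variable names are hypothetical.

```r
# Hypothetical carat weights with one unusually large diamond
carats_toy <- c(0.3, 0.4, 0.5, 0.6, 0.7, 3.0)

toy_mean <- mean(carats_toy)
toy_median <- median(carats_toy)

# The mean is pulled toward the outlier; the median mostly stays put
skew_direction <- if (toy_mean > toy_median) {
  "right"
} else if (toy_mean < toy_median) {
  "left"
} else {
  "no skew suggested"
}

print(toy_mean)
print(toy_median)
print(skew_direction)
```

This comparison is only a quick heuristic; a histogram or boxplot is still the best way to see the shape of a distribution.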
In R we can easily compute the standard deviation with the function sd(), and the IQR with the function quantile() or IQR(), for our samples or for the entire dataset! Recall that our diamond samples are stored in a data frame called samples. The code block below is preset to compute the standard deviation, Q1, Q3, and IQR for Sample_One. Note that in the quantile() function the 0.25 identifies the 25th percentile (\(Q_1\)) and the 0.75 identifies the 75th percentile (\(Q_3\)). Run the code to find the standard deviation, Q1, Q3, and the interquartile range for Sample_One. Once you’ve done that, edit the existing code to compute these metrics for the other two samples.
sd(samples$Sample_One)
quantile(samples$Sample_One, c(0.25, 0.75))
IQR(samples$Sample_One)
Remark: Our third sample of diamond carat sizes contained an outlier. The presence of this outlier drastically impacted the computed mean and standard deviation, but didn’t have much (if any) effect on the median or \(IQR\). Because of this, we say that the median and \(IQR\) are robust statistics in the presence of outliers.
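The effect described in this remark can be seen in a small experiment. The values below are invented for illustration (they are not the tutorial's third sample), and four_stats() is a hypothetical helper defined only for this sketch.

```r
# Hypothetical carat weights, with and without one large outlier
without_outlier <- c(0.3, 0.4, 0.5, 0.6, 0.7)
with_outlier <- c(without_outlier, 3.0)

# Helper (illustrative only): the four summary statistics side by side
four_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

# One row per version of the data, one column per statistic
comparison <- rbind(
  without_outlier = four_stats(without_outlier),
  with_outlier = four_stats(with_outlier)
)

print(round(comparison, 3))
```

In the printed table, the mean and sd columns should change substantially between the two rows, while the median and IQR columns should barely move.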
In R we can also easily explore these measures of spread for our campaign workers from earlier. Recall their door-knocking data: \(\begin{array}{lcl} \text{Worker A} & : & 23,~24,~25,~26,~27\\ \text{Worker B} & : & 0,~15,~25,~35,~50\end{array}\)
Use the code blocks below to find the standard deviation and IQR for the doors visited by the campaign workers.
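If you would like to check your work, one possible approach is sketched below. It stores both workers' data in a named list (the names workers, worker_a, and worker_b are illustrative) and applies sd() and IQR() to each.

```r
# Door-knocking data for both workers, stored in a named list
workers <- list(
  worker_a = c(23, 24, 25, 26, 27),
  worker_b = c(0, 15, 25, 35, 50)
)

# Apply each measure of spread to every worker
worker_sds <- sapply(workers, sd)
worker_iqrs <- sapply(workers, IQR)

print(worker_sds)
print(worker_iqrs)
```

Both measures should come out much larger for Worker B, even though the two workers share the same mean and median.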
We can use R’s table() function to construct frequency and relative frequency tables for a sample or our entire set of observations. The following code chunk is preset to compute a frequency and relative frequency table for Sample_One. Adapt the code to provide summaries for Sample_Two and Sample_Three.
table(cut.samples$Sample_One)
table(cut.samples$Sample_One)/nrow(cut.samples)
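The same pattern works on any vector of categories. The sketch below uses a made-up sample of eight diamond cuts (toy_cuts is a hypothetical name, and the values are not taken from cut.samples). Because this is a plain vector rather than a data frame column, length() plays the role of nrow(). Base R's prop.table() is shown as an alternative way to get relative frequencies.

```r
# Hypothetical sample of 8 diamond cuts
toy_cuts <- c("Ideal", "Premium", "Ideal", "Good",
              "Very Good", "Ideal", "Premium", "Fair")

# Frequency table: how many diamonds fall in each category
freq <- table(toy_cuts)

# Relative frequency table: divide each count by the number of observations
rel_freq <- freq / length(toy_cuts)

# Alternative: prop.table() converts counts to proportions directly
rel_freq_alt <- prop.table(freq)

print(freq)
print(rel_freq)
print(rel_freq_alt)
```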
The plots below show the distributions of diamond cut from Sample_Two (left) and from our entire population (right). Even with a sample of 8 diamonds, we gain “some” insight as to the most and least common diamond cuts. You may also notice that the frequency and relative frequency plots look identical aside from the scale on the vertical axis – this will be the case in general.
Summary: Here’s a quick summary of the most important ideas from this tutorial.
R Commands Introduced: The following commands in R were introduced here.
mean(<data>)
median(<data>)
sd(<data>)
quantile(<data>, c(0.25, 0.75))
IQR(<data>)
quantile(<data>, c(p1, p2,...))
table(<data>)
table(<data>)/nrow(<dataframe>)
or table(<data>)/length(<data>)