If you’ve never coded before (or even if you have), type "Your Name" in the interactive R chunk below and run it by pressing Ctrl+Enter (or Cmd+Enter for Mac users).
Throughout this tutorial we’ll investigate the probability distribution that is most central to our study of statistics: the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. This tutorial gives you practice in working with normally distributed data.
Tutorial Objectives: After completing this tutorial you should be able to:
Definition: If a random variable \(X\) is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), we often write \(X\sim N\left(\mu, \sigma\right)\). Three different normal distributions appear below.
Notice that all three distributions are bell-shaped and centered at their common mean (\(\mu = 0\)). The larger the standard deviation, the shorter and wider the curve; the smaller the standard deviation, the taller and narrower the curve.
Given that \(X\sim N\left(\mu, \sigma\right)\), we can compute probabilities associated with observed values of \(X\) by finding the corresponding area beneath the normal curve with mean \(\mu\) and standard deviation \(\sigma\).
Properties of the Normal Distribution: We have the following properties associated with the normal distribution. Consider \(X\sim N\left(\mu, \sigma\right)\).
Sometimes it is useful to be able to estimate probabilities or to estimate the proportion of a population that falls into a range as long as the population is nearly normal. A convenient rule of thumb is the Empirical Rule.
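The Empirical Rule: If \(X\sim N\left(\mu, \sigma\right)\), then about 68% of observations fall within one standard deviation of the mean, about 95% fall within two standard deviations, and about 99.7% fall within three standard deviations.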
For each of the following, assume that \(X\sim N\left(\mu = 85, \sigma = 5\right)\)
Scenario: Two students, Bob and Sally, are trying to compare how well they did on a college entrance exam. The difficulty is that Bob took the SAT, whose scores are approximately normally distributed with a mean of 1068 and a standard deviation of 210, while Sally took the ACT, whose scores are also approximately normally distributed but with a mean of 20.8 and a standard deviation of 5.8. If Bob scored a 1400 on the SAT and Sally scored a 31 on the ACT, who scored relatively higher?
How do we answer this question? We’ll see two methods.
Method 1: We can standardize the test scores so that they have comparable units.
\[\displaystyle{z = \frac{x - \mu}{\sigma}}\]
An observation’s \(z\)-score is simply the number of standard deviations it falls above or below the mean.
A recap on \(z\)-scores: We can use \(z\)-scores as a common unit for comparing observations from completely different populations (such as SAT scores and ACT scores). Here’s a recap of the most important information so far:
If an observation \(x\) comes from a nearly normal population with mean \(\mu\) and standard deviation \(\sigma\), we can compute its \(z\)-score using the formula: \(\displaystyle{z = \frac{x - \mu}{\sigma}}\).
A \(z\)-score measures the number of standard deviations which an observation falls above or below the mean.
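As a quick illustration of Method 1, the standardization formula can be applied to Bob’s and Sally’s scores directly in R. This is only a sketch: the variable names `z_bob` and `z_sally` are made up for this example, and the means and standard deviations are the ones given in the scenario.

```r
# z = (x - mu) / sigma, using the values from the Bob/Sally scenario
z_bob   <- (1400 - 1068) / 210   # SAT: mean 1068, sd 210
z_sally <- (31 - 20.8) / 5.8     # ACT: mean 20.8, sd 5.8

# Compare the two standardized scores (Bob about 1.58, Sally about 1.76)
round(c(Bob = z_bob, Sally = z_sally), 2)
```

The larger \(z\)-score belongs to the student who scored further above their own exam’s mean, measured in standard deviations.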
Method 2: We can compute the percentile corresponding to Bob’s SAT score and the percentile corresponding to Sally’s ACT score.
Bob’s percentile corresponds to the shaded area in the distribution below.
Sally’s percentile corresponds to the shaded area in the distribution below.
There are many ways to compute percentiles. Before the widespread availability of statistical software, people converted observed values to \(z\)-scores and then looked up the percentile in a table. Luckily R provides nice functionality for computing percentiles.
Computing Percentiles in R: If \(X\sim N\left(\mu, \sigma\right)\), then \[\mathbb{P}\left[X\leq q\right] \approx \tt{pnorm(q, mean = \mu, sd = \sigma)}\]
The block below is preset to compute Bob’s percentile. Execute the code cell and then adapt the code to find Sally’s percentile. Use your results to answer the questions below.
pnorm(1400, 1068, 210)
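The two methods are closely connected. The sketch below, with illustrative variable names, shows that passing the raw score to pnorm() with the SAT mean and standard deviation gives the same area as standardizing first and using the standard normal distribution (the default when pnorm() is called without `mean` and `sd`). This mirrors the older table-lookup approach.

```r
# Bob's z-score from Method 1
z_bob <- (1400 - 1068) / 210

# Percentile computed directly on the SAT scale
p_direct <- pnorm(1400, mean = 1068, sd = 210)

# Percentile computed from the z-score on the standard normal
# (pnorm() defaults to mean = 0 and sd = 1)
p_via_z <- pnorm(z_bob)

# The two areas agree up to floating-point error
all.equal(p_direct, p_via_z)
```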
We’ll make good use of this second method for a while, but don’t forget about standardization and \(z\)-scores. We’ll need that strategy quite often later in our course! For now, let’s move on to practice finding probabilities from a normal distribution using R’s pnorm() function.
Throughout this section you’ll practice finding probabilities by using R’s pnorm() function to compute areas. Remember that pnorm() takes three arguments: the first is a \(\tt{boundary}\) value, the second is the \(\tt{mean}\) of the distribution, and the third is the \(\tt{standard~deviation}\). The value returned by pnorm() is the area to the left of the provided boundary value in the distribution with the mean and standard deviation you provided.
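Because pnorm() always returns the area to the left of a boundary, areas to the right and areas between two boundaries have to be built from left areas. Here is a small sketch of the three patterns on the standard normal distribution. The boundary values `a` and `b` are arbitrary choices for illustration, not values from the questions in this tutorial.

```r
# Arbitrary illustrative boundaries on the standard normal, Z ~ N(0, 1)
a <- -0.5
b <- 1.25

# Area to the LEFT of b: P[Z < b]
left_area <- pnorm(b, 0, 1)

# Area to the RIGHT of b: P[Z > b] = 1 - P[Z < b]
right_area <- 1 - pnorm(b, 0, 1)

# Area BETWEEN a and b: P[a < Z < b] = P[Z < b] - P[Z < a]
between_area <- pnorm(b, 0, 1) - pnorm(a, 0, 1)

c(left = left_area, right = right_area, between = between_area)
```

pnorm() also accepts `lower.tail = FALSE`, which returns the right-hand area directly, so `pnorm(b, 0, 1, lower.tail = FALSE)` matches `right_area`.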
For these first few questions I’ll draw pictures for you, but you should be prepared to draw your own shortly.
Question 1: Use the code block below to find \(\mathbb{P}\left[Z < 1.49\right]\). Remember that \(Z\sim N\left(\mu = 0, \sigma = 1\right)\).
Question 2: Find \(\mathbb{P}\left[Z > -2.17\right]\).
Question 3: Find \(\mathbb{P}\left[-1.04 < Z < 2.08\right]\).
Through the last three problems you only worked with the standard normal distribution, that is, the \(Z\)-distribution, which is \(N\left(\mu = 0, \sigma = 1\right)\). We can find probabilities from arbitrary normal distributions (normal distributions with any mean and any standard deviation) using R’s pnorm() function: just supply the appropriate mean and sd arguments to pnorm() instead of the 0 and 1 that we passed earlier.
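As one example with a non-standard normal distribution, the sketch below uses \(X\sim N\left(\mu = 85, \sigma = 5\right)\) from earlier in the tutorial and computes the probability that \(X\) falls within one, two, and three standard deviations of its mean. The variable names are illustrative. The results should land close to the Empirical Rule’s 68%, 95%, and 99.7%.

```r
# X ~ N(mu = 85, sigma = 5), the distribution used earlier in the tutorial
mu    <- 85
sigma <- 5

# Number of standard deviations on either side of the mean
k <- 1:3

# P[mu - k*sigma < X < mu + k*sigma] for k = 1, 2, 3
# (pnorm() is vectorized, so all three are computed at once)
within_k <- pnorm(mu + k * sigma, mean = mu, sd = sigma) -
  pnorm(mu - k * sigma, mean = mu, sd = sigma)

round(within_k, 4)
# 0.6827 0.9545 0.9973
```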
Recall from earlier that the \(p^{th}\) percentile of a random variable \(X\) is the value \(x^*\) such that \(\mathbb{P}\left[X < x^*\right] = p\).
If \(X\sim N\left(\mu, \sigma\right)\), then to find the cutoff \(x^*\) for which \(\mathbb{P}\left[X < x^*\right] = p\), we can use R’s qnorm() function. Similar to pnorm(), this function takes three arguments. The first is the \(\tt{area~to~the~\underline{LEFT}}\) of the desired cutoff, the second is the \(\tt{mean}\) of the distribution, and the third is the \(\tt{standard~deviation}\) of the distribution.
Recall from earlier that SAT scores followed \(N\left(\mu = 1068, \sigma = 210\right)\) and ACT scores followed \(N\left(\mu = 20.8, \sigma = 5.8\right)\). The code block below is set up to find the minimum required SAT score to fall in the 95th percentile (to do better than 95% of other test-takers). Execute the code and note the required score. Adapt the code to find the minimum ACT score required to fall into the top 10% of all ACT test takers. Does your answer seem right? How can you judge?
qnorm(0.95, 1068, 210)
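One way to judge whether a cutoff from qnorm() is reasonable is to feed it back into pnorm(), since the two functions undo one another. The sketch below does this for the SAT cutoff above and then shows how a "top" percentage is converted to a left-hand area before calling qnorm(). The top 25% figure is an arbitrary choice for illustration, and the variable names are made up for this example.

```r
# SAT scores: N(mu = 1068, sigma = 210)
sat_mean <- 1068
sat_sd   <- 210

# Cutoff for the 95th percentile (about 1413.4)
cutoff_95 <- qnorm(0.95, mean = sat_mean, sd = sat_sd)

# Round trip: the area to the left of that cutoff should be 0.95 again
pnorm(cutoff_95, mean = sat_mean, sd = sat_sd)

# "Top 25%" means 25% of the area is to the RIGHT of the cutoff,
# so the area to the LEFT is 1 - 0.25 = 0.75
cutoff_top25 <- qnorm(1 - 0.25, mean = sat_mean, sd = sat_sd)
cutoff_top25
```

A cutoff for any percentile above the 50th should sit above the mean, which gives a second quick sanity check.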
Nice job getting through this tutorial. Your hard work here will pay off as we move through much of the remainder of our course. Here are the major points we touched on.
A normal distribution is approximately bell-shaped and can be described by its mean \(\mu\) and standard deviation \(\sigma\).
As a shorthand, we often write \(N\left(\mu, \sigma\right)\) to mean the normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
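The Empirical Rule is a “rule of thumb” stating that, if data are normally distributed, we expect about 68% of observations to fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations.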
We can interpret areas underneath the normal distribution to be probabilities.
If \(X\sim N\left(\mu, \sigma\right)\), then \(\mathbb{P}\left[X\leq k\right] = \tt{pnorm(k, mean = \mu, sd = \sigma)}\)
If \(X\sim N\left(\mu, \sigma\right)\), then the \(p^{th}\) percentile of \(X\) (the cutoff for which the proportion of the population falling below is \(p\)) is given by \(\tt{qnorm(p, mean = \mu, sd = \sigma)}\).