Acknowledgement

These notes use content from the OpenIntro Statistics slides by Mine Cetinkaya-Rundel.

4.1 Normal distribution

In this section, we discuss the normal distribution, a probability distribution for a continuous random variable.

  • Probability Distribution of a Continuous Random Variable
  • The Characteristics of Normal Distribution (shape, mean, standard deviation)
  • The Standard normal distribution
  • Find probability for standard/general normal distribution using R
  • Compute z-scores for any value of normally distributed random variable
  • Find the value of normal distribution variable given probability
  • Find percentile
  • The 68-95-99.7 Rule (Empirical Rule)

Continuous Random Variable (3.5)

- A random variable is called continuous when its possible values form an interval.
- The probability distribution of a continuous random variable \(X\) is described by a probability density function (pdf) \(f(x)\), with \(f(x)\ge 0\).
- The pdf of \(X\) is usually represented by a probability density curve over the interval of all possible values of the random variable as shown below.
- The probability that \(X\) falls between two values \(a\) and \(b\) is the area under the curve between \(a\) and \(b\), written \(P(a<X<b)\).

- A similar definition applies to \(P(X<b)\) and \(P(a<X)\).
- The total probability (the total area under the density curve) equals 1.
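The area interpretation can be checked numerically. The sketch below is only an illustration: it assumes the standard normal density, dnorm(), as a stand-in for a generic pdf \(f(x)\), and it uses base R's integrate() to approximate areas. The endpoints are arbitrary choices.

```r
# Illustrative pdf (assumption): the standard normal density from base R.
f <- function(x) dnorm(x)

# P(a < X < b) is the area under f between a and b (endpoints chosen arbitrarily).
a <- -1
b <- 2
area_ab <- integrate(f, lower = a, upper = b)$value

# The total area under a density curve should be 1 (up to numerical error).
total_area <- integrate(f, lower = -Inf, upper = Inf)$value

print(area_ab)
print(total_area)
```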

Probability Distribution of a Continuous Random Variable

Example. The graph on the right represents the probability distribution of commuting time (in minutes) from a survey. The area under the curve for values higher than 45 is 0.15. That is, \(P(X\ge 45)=0.15\).


Practice:

  1. Find \(P(X<45)\).
  2. What percentage of commuting times are less than 45 minutes?

Note: \(P(X<45)=P(X\le 45)\). For a continuous random variable, including or excluding the endpoint makes no difference, because \(P(X=45)=0\).

From Histograms to Continuous Distribution

Example. Below are some histograms with different bin widths (10, 5, 2.5, 1.25) for the heights of U.S. adults.

In the last two plots, the bins are so slim that they start to resemble a smooth curve, which represents a density function.

Probabilities from a Continuous Distribution

Example. From the histogram, suppose the sample size is 3,000,000 and the counts in the bins [180, 182.5) and [182.5, 185) are 195,307 and 156,239. Find the proportion of U.S. adults whose heights are in [180, 185).

  • The estimated probability is \(\frac{195{,}307+156{,}239}{3{,}000{,}000}=0.1172\).

- For the continuous distribution, P(height in [180, 185)) = area under the curve between 180 and 185 = 0.1157.
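The histogram-based estimate is simply a ratio of counts. A minimal sketch of that arithmetic in R, using the counts quoted in the example (the variable names are hypothetical):

```r
# Counts quoted in the example.
n_total        <- 3000000
count_180_1825 <- 195307   # bin [180, 182.5)
count_1825_185 <- 156239   # bin [182.5, 185)

# Estimated proportion of heights in [180, 185).
prop_hat <- (count_180_1825 + count_1825_185) / n_total
print(round(prop_hat, 4))
```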

Normal distribution (4.1)

  • The normal distribution is a probability distribution of a continuous random variable with values on \((-\infty, \infty)\).
  • The normal distribution is characterized by its mean \(\mu\) and standard deviation \(\sigma\). Its probability density function is

\[f(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\ \ ; \ \ (-\infty<x<\infty)\]

  • The graph of this function is bell-shaped, symmetric about the center \(\mu\).

  • The normal distribution is denoted \(\color{red}{N(\mu, \sigma)}\). It is determined by the mean \(\mu\) (the center) and the standard deviation \(\sigma\).

  • \(N(0, 1)\) is the standard normal distribution, with \(\mu=0\) and \(\sigma=1\).

  • Many variables are nearly normal, but none are exactly normal.
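To connect the density formula to R, the sketch below writes the pdf out term by term and compares it with base R's dnorm(). The helper name normal_pdf and the parameter values are assumptions made for illustration.

```r
# Hypothetical helper: the normal pdf written out term by term.
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- c(60, 72, 90)   # arbitrary evaluation points
by_formula <- normal_pdf(x, mu = 72, sigma = 15.2)
by_dnorm   <- dnorm(x, mean = 72, sd = 15.2)

print(by_formula)
print(all.equal(by_formula, by_dnorm))   # TRUE when the two agree
```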

Normal Distributions with Different \(\mu\) and \(\sigma\)

Discussion:

  1. What does the value of \(\mu\) represent?
  2. How does the value of \(\sigma\) affect the shape of the curve?
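One way to explore these questions numerically is to evaluate the density with dnorm() for a few parameter choices. The values of \(\mu\) and \(\sigma\) below are arbitrary illustrations, not taken from the notes.

```r
# Density at the center for two spreads (same mean, different sd).
peak_sd1 <- dnorm(0, mean = 0, sd = 1)
peak_sd2 <- dnorm(0, mean = 0, sd = 2)
print(c(peak_sd1, peak_sd2))

# Shifting the mean moves the curve without changing its shape:
# the density at the center is the same for mean = 0 and mean = 5.
peak_shifted <- dnorm(5, mean = 5, sd = 1)
print(all.equal(peak_shifted, peak_sd1))
```

A call such as curve(dnorm(x, mean = 0, sd = 2), from = -8, to = 8) draws one of these curves for a visual comparison.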

Finding Normal Probabilities Using R

Using the function pnorm():

  • Find the area to the left of a given value \(x\): pnorm(x, mean, sd)
  • Find the area to the right of \(x\): pnorm(x, mean, sd, lower.tail = FALSE)

Example: Suppose \(X\sim N(72, 15.2)\) (that is, \(\mu=72, \sigma=15.2\)). To find

  • P(X < 84)
pnorm(84, mean = 72, sd = 15.2)
  • P(X > 84)
pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)
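As a quick sanity check on the two calls, the left and right areas at the same cutoff should add up to 1. A minimal sketch, with hypothetical variable names:

```r
mu    <- 72
sigma <- 15.2

p_left  <- pnorm(84, mean = mu, sd = sigma)                      # P(X < 84)
p_right <- pnorm(84, mean = mu, sd = sigma, lower.tail = FALSE)  # P(X > 84)

print(c(p_left, p_right))
print(p_left + p_right)   # the two areas together cover the whole curve
```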

Finding Normal Probabilities Using R

Using the function pnorm():

  • Find the area between two values \(a\) and \(b\): pnorm(b, mean, sd) - pnorm(a, mean, sd)

Example: Suppose \(X\sim N(72, 15.2)\) (that is, \(\mu=72, \sigma=15.2\)). To find

  • P(70 < X < 90)
pnorm(90, mean = 72, sd = 15.2) - pnorm(70, mean = 72, sd = 15.2)

OR

1 - (pnorm(70, mean = 72, sd = 15.2) + pnorm(90, mean = 72, sd = 15.2, lower.tail = FALSE))
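The two approaches can be compared directly. The sketch below wraps the first one in a small helper; prob_between is a hypothetical name, not a base R function.

```r
# Hypothetical helper (not part of base R): P(a < X < b) for X ~ N(mu, sigma).
prob_between <- function(a, b, mu, sigma) {
  pnorm(b, mean = mu, sd = sigma) - pnorm(a, mean = mu, sd = sigma)
}

direct <- prob_between(70, 90, mu = 72, sigma = 15.2)

# Same probability via the complement: 1 minus the two tails.
via_tails <- 1 - (pnorm(70, mean = 72, sd = 15.2) +
                  pnorm(90, mean = 72, sd = 15.2, lower.tail = FALSE))

print(direct)
print(all.equal(direct, via_tails))
```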

Standard Normal Distribution: \(N(0,1)\)

- The standard normal distribution is \(N(0,1)\). That is, \(\mu=0\) and \(\sigma=1\).
- The letter \(Z\) is used for a random variable that has the standard normal distribution.
- The cumulative probability under the standard normal distribution is given by a table (see Textbook, Appendix C, pages 410-411). See an example on the next slide.
- The function pnorm() can also be used to compute probabilities under the standard normal distribution by setting mean = 0 and sd = 1, or simply by not specifying them.
- Example: Suppose \(Z\sim N(0,1)\). Find \(P(Z<1)\).

pnorm(1, mean = 0, sd = 1)

OR

pnorm(1)

Calculating probabilities - using tables

- Example: Suppose \(Z\sim N(0,1)\). Find \(P(Z<1)\) using the Z-Table.

  • Note that the Table gives the same result as R (previous slide).

Practice

For \(N(0,1)\), what percent of the variable is in each region? Sketch the graph for each region and use the code chunk below to find the probabilities using pnorm().
1) \(Z<1.25\)
2) \(Z>-0.25\)
3) \(-0.4<Z<1.5\)
4) \(|Z|<1.25\)
5) \(|Z|>2.23\)

Practice Answers

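For \(N(0,1)\), what percent of the variable is in each region? Sketch the graph for each region.
1) \(Z<1.25\)
2) \(Z>-0.25\)
3) \(-0.4<Z<1.5\)
4) \(|Z|<1.25\)
5) \(|Z|>2.23\)

Answers:

1) 0.8944 (89.44%)
2) 0.5987 (59.87%)
3) 0.5886 (58.86%)
4) 0.7888 (78.88%)
5) 0.0258 (2.58%)

Note: Answers 4 and 5 match a calculation from four-decimal Z-table values; pnorm() without intermediate rounding gives 0.7887 and 0.0257.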
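One possible way to compute the five regions with pnorm() is sketched below (variable names are hypothetical). Results computed this way can differ from table-based answers in the last decimal place, because table entries are rounded before they are combined.

```r
p1 <- pnorm(1.25)                          # P(Z < 1.25)
p2 <- pnorm(-0.25, lower.tail = FALSE)     # P(Z > -0.25)
p3 <- pnorm(1.5) - pnorm(-0.4)             # P(-0.4 < Z < 1.5)
p4 <- pnorm(1.25) - pnorm(-1.25)           # P(|Z| < 1.25)
p5 <- 2 * pnorm(2.23, lower.tail = FALSE)  # P(|Z| > 2.23), using symmetry

print(round(c(p1, p2, p3, p4, p5), 4))
```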

The Z-score

- For an observation \(x\) of \(X\), the z-score (standardized score) is \(z=\frac{x-\mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation.

  • If \(X\) has a normal distribution \(N(\mu, \sigma)\), then \(Z=\frac{X-\mu}{\sigma}\) has the standard normal distribution \(N(0,1)\).

  • The z-score of an observation is the number of standard deviations it falls above (\(z>0\)) or below (\(z<0\)) the mean: \[x=\mu+z\sigma \]

  • The z-score of \(\mu\) is 0 (for any distribution).

  • Observations that are more than 2 SD away from the mean (\(|Z| > 2\)) are usually considered unusual.
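The two directions of the z-score relationship can be sketched as a pair of small functions. The helper names and the numbers (mean 100, SD 15, observation 130) are assumptions chosen for illustration.

```r
# Hypothetical helpers for standardizing and un-standardizing.
z_score <- function(x, mu, sigma) (x - mu) / sigma
from_z  <- function(z, mu, sigma) mu + z * sigma

# Illustrative values (assumed): mean 100, SD 15, observation 130.
z <- z_score(130, mu = 100, sigma = 15)
print(z)                               # number of SDs above the mean
print(from_z(z, mu = 100, sigma = 15)) # recovers the original observation
print(abs(z) > 2)                      # "more than 2 SD away" check
```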

Standard Normal to General Normal Distribution

Using R, we can find (see earlier example)

  • \(P(Z<1.43)=0.9236\),

  • \(P(Z>1.43)=0.0764\).

For a general normal random variable \(X\sim N(\mu, \sigma)\), we have

\[P(X<\mu+1.43\sigma)=0.9236\]

\[P(X>\mu+1.43\sigma)=0.0764\]
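This correspondence holds for any \(\mu\) and any \(\sigma>0\). A minimal check in R, assuming the arbitrary values \(\mu=50\) and \(\sigma=4\) purely for illustration:

```r
# Arbitrary (assumed) parameters; any mu and any sigma > 0 should behave the same way.
mu    <- 50
sigma <- 4

p_standard <- pnorm(1.43)                                   # P(Z < 1.43)
p_general  <- pnorm(mu + 1.43 * sigma, mean = mu, sd = sigma)  # P(X < mu + 1.43 sigma)

print(round(c(p_standard, p_general), 4))
print(all.equal(p_standard, p_general))
```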

Example (Standby Time of Smartphones)

Based on data from smartphones available from major carriers in the U.S. in 2014, the standby time approximately follows a normal distribution with a mean of \(\mu=330\) minutes and a standard deviation of \(\sigma=80\) minutes.

  1. What percentage of smartphones have a standby time more than 1.25 standard deviations below the mean?
  2. What percentage of smartphones have a standby time more than 1.25 standard deviations above the mean?
  3. What percentage of smartphones have a standby time that is within 1.25 standard deviations of the mean?

Solution. Use the pnorm() command in the R code chunk below to confirm the following probabilities.

  1. \(P(Z<-1.25)=0.1056\),
  2. \(P(Z>1.25)=0.1056\),
  3. \(P(-1.25<Z<1.25)=0.7887\) (0.7888 if computed from the rounded values above as \(1-2\times 0.1056\)).

Example (Cont.)

As \(\mu=330\) and \(\sigma=80\), we can compute \(\mu-1.25\sigma\) and \(\mu+1.25\sigma\):
\(\mu-1.25\sigma=330-1.25\times 80=230\),
\(\mu+1.25\sigma=330+1.25\times 80=430\).

  1. 10.56% of smartphones have a standby time shorter than 230 minutes;
  2. 10.56% of smartphones have a standby time longer than 430 minutes;
  3. About 78.87% of smartphones have a standby time between 230 and 430 minutes.
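The same three percentages can be obtained on the original scale (minutes) without converting to z-scores first. A sketch under the example's stated mean and standard deviation, with hypothetical variable names:

```r
mu    <- 330
sigma <- 80

lower <- mu - 1.25 * sigma   # cutoff 1.25 SD below the mean
upper <- mu + 1.25 * sigma   # cutoff 1.25 SD above the mean

p_short  <- pnorm(lower, mean = mu, sd = sigma)                      # shorter than lower
p_long   <- pnorm(upper, mean = mu, sd = sigma, lower.tail = FALSE)  # longer than upper
p_within <- pnorm(upper, mean = mu, sd = sigma) -
            pnorm(lower, mean = mu, sd = sigma)                      # between the cutoffs

print(c(lower = lower, upper = upper))
print(round(c(p_short, p_long, p_within), 4))
```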

Example: Quality Control

At a Heinz ketchup factory, the amounts that go into bottles of ketchup are supposed to be normally distributed with mean 36 oz. and standard deviation 0.11 oz. Once every 30 minutes a bottle is selected from the production line, and if it contains less than 35.8 oz. or more than 36.2 oz., the bottle fails the quality control inspection.

  1. What percent of bottles have less than 35.8 ounces of ketchup?
  2. What percent of bottles pass the inspection?

pnorm(35.8, mean = 36, sd = 0.11) # part (1)

pnorm(36.2, mean = 36, sd = 0.11) - pnorm(35.8, mean = 36, sd = 0.11) # part (2)

Solution. Let \(X\) represent the amount of ketchup in a bottle (in oz.); then \(X \sim N(36, 0.11)\).

  • For part (1), using pnorm() in R (see code chunk above), we get

\[P(X< 35.8) = 0.0345\]

  • For part (2), using pnorm() in R (see code chunk above), we get

\[P(35.8 < X< 36.2) = 0.9310\]

Example: Comparing z-Scores

SAT scores are distributed nearly normally with mean 1100 and standard deviation 200. ACT scores are distributed nearly normally with mean 21 and standard deviation 5. A college admissions officer wants to determine which of two applicants scored better on their standardized test relative to the other test takers: Ann, who earned 1300 on her SAT, or Tom, who scored 24 on his ACT?

Example (cont.)

We cannot just compare the two raw scores. Instead, we use z-scores to compare how many standard deviations above the mean each score is.

  • Ann's score is \(\frac{1300-1100}{200}=1\) standard deviation above the mean.
  • Tom's score is \(\frac{24-21}{5}=0.6\) standard deviations above the mean.
  • Therefore, Ann's score is better than Tom's score relative to the other test takers.
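The comparison can be reproduced in R. The sketch below computes both z-scores and, assuming the normal model holds for each test, the proportion of test takers each applicant outscored; variable names are hypothetical.

```r
z_ann <- (1300 - 1100) / 200   # Ann's SAT z-score
z_tom <- (24 - 21) / 5         # Tom's ACT z-score

print(c(Ann = z_ann, Tom = z_tom))

# Proportion of test takers scoring below each applicant, under the normal model.
print(round(c(Ann = pnorm(z_ann), Tom = pnorm(z_tom)), 4))
```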

Discussion: If Tom had an ACT score of 30, whose score would be better?

Percentiles and inverse normal distribution

  • The \(p^{th}\) percentile is the value such that \(p\)% of observations fall at or below it.

  • Graphically, the \(p^{th}\) percentile is the value of the random variable for which the area (probability) under the probability distribution curve to its left is \(p/100\). It is also called the value of the inverse cumulative distribution function at \(p/100\).

For any normal distribution, the mean \(\mu\) is the 50th percentile.

Find percentiles for normal distribution

For \(N(0,1)\), \(-1.96\) is the 2.5th percentile: the probability that a normal random variable falls at least 1.96 standard deviations below the mean (hence the negative sign) is 0.025, that is, \(P(Z<-1.96)=0.025\).

Find inverse Normal Distribution - Use R

The function qnorm() returns the \((100p)^{th}\) percentile of the normal distribution (i.e., the value that has probability \(p\) to its left under the normal curve).

Use the syntax qnorm(p, mean, sd) in general, or simply qnorm(p) for the standard normal distribution.

Example.

  • Finding quartiles for \(N(0,1)\)
qnorm(0.25)

qnorm(0.5)

qnorm(0.75)
  • The 95th percentile for \(N(10, 2)\) (supply the mean and sd for a general normal distribution)
qnorm(0.95, 10, 2)
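Because qnorm() is the inverse of pnorm(), feeding a percentile back into pnorm() should return the original probability. A minimal sketch for the \(N(10, 2)\) case, with hypothetical variable names:

```r
p   <- 0.95
x95 <- qnorm(p, mean = 10, sd = 2)   # 95th percentile of N(10, 2)

print(x95)
print(pnorm(x95, mean = 10, sd = 2)) # recovers p, up to floating-point error

# The same percentile via the standard normal: x = mu + z * sigma.
x95_via_z <- 10 + qnorm(p) * 2
print(all.equal(x95, x95_via_z))
```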

Example Finding cutoff points

Body temperatures of healthy humans are distributed nearly normally with mean \({98.2^\circ}\)F and standard deviation \({0.73^\circ}\)F. What is the cutoff for the lowest 3% of human body temperatures?

  • Short Method:
x = qnorm(0.03, 98.2, 0.73)
x

  • Long Method (through z-score)
z = qnorm(0.03)
z

\[P(X<x) = 0.03 \rightarrow P(Z<\color{red}{-1.88})=0.03\]

\[Z = \frac{\text{obs}-\text{mean}}{\text{SD}} \rightarrow \frac{x-98.2}{0.73}=-1.88\]

\[x = (-1.88 \times 0.73)+98.2 = 96.8^\circ\text{F}\]
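The z-score route and the direct qnorm() call should give the same cutoff. A sketch comparing the two for the body temperature example (variable names are hypothetical):

```r
mu    <- 98.2
sigma <- 0.73

# Through the z-score: find z first, then convert back with x = mu + z * sigma.
z      <- qnorm(0.03)
x_long <- mu + z * sigma

# Direct call: let qnorm() handle the mean and SD.
x_short <- qnorm(0.03, mean = mu, sd = sigma)

print(round(c(z = z, x_long = x_long, x_short = x_short), 2))
```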

Practice

Body temperatures of healthy humans are distributed nearly normally with mean \({98.2^\circ}\)F and standard deviation \({0.73^\circ}\)F. What is the cutoff for the highest 10% of human body temperatures?

  1. \({97.3^\circ}\)F
  2. \({99.1^\circ}\)F
  3. \({99.4^\circ}\)F
  4. \({99.6^\circ}\)F

Practice

Body temperatures of healthy humans are distributed nearly normally with mean \({98.2^\circ}\)F and standard deviation \({0.73^\circ}\)F. What is the cutoff for the highest 10% of human body temperatures?

qnorm(0.90, 98.2, 0.73)
qnorm(0.10, 98.2, 0.73, lower.tail = FALSE)

68 - 95 - 99.7 Rule (Empirical Rule)

- For nearly normally distributed data,

  • about 68% falls within 1 SD of the mean,

  • about 95% falls within 2 SD of the mean,

  • about 99.7% falls within 3 SD of the mean.

  • It is possible for observations to fall 4, 5, or more standard deviations away from the mean, but these occurrences are very rare if the data are nearly normal.

  • For normal distributions, observations more than 2 SD from the mean are said to be unusual.

Note: The percentages in the empirical rule are approximations, not exact normal probabilities.
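The probabilities behind the rule can be computed with pnorm() for the standard normal distribution. A short sketch (variable names are hypothetical):

```r
k <- 1:3   # number of standard deviations from the mean

# P(-k < Z < k) for k = 1, 2, 3.
within_k_sd <- pnorm(k) - pnorm(-k)

print(round(within_k_sd, 4))
```

The rounded values, about 0.6827, 0.9545 and 0.9973, are where the 68, 95 and 99.7 figures come from.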

Example

The following graph shows the normal distributions for women's height \(N(65, 3.5)\) and men's height \(N(70, 4)\) in North America.

Question: Within what interval do about 68% of men's heights fall? 95%? 99.7%?

  • About 68% of men's heights fall within 1 S.D. of the mean: \[(\mu-\sigma,\mu+\sigma)=(70-4, 70+4)=(66, 74) \]

That is, about 68% of men have a height between 66 inches and 74 inches.

  • About 95% of men's heights fall within 2 S.D. of the mean:

\[(\mu-2\sigma,\mu+2\sigma)=(70-2\times 4, 70+2\times 4)=(62, 78)\]

That is, about 95% of men have a height between 62 inches and 78 inches.

  • About 99.7% of men's heights fall within 3 S.D. of the mean:

\[(\mu-3\sigma,\mu+3\sigma)=(70-3\times 4, 70+3\times 4)=(58, 82)\]

That is, about 99.7% of men have a height between 58 inches and 82 inches.

Exercise. Do the same for women's height data.
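The interval arithmetic for the men's heights can be sketched as a small reusable function. The name empirical_interval is hypothetical; the same function could be called with other values of the mean and standard deviation.

```r
# Hypothetical helper: the interval within k standard deviations of the mean.
empirical_interval <- function(mu, sigma, k) {
  c(lower = mu - k * sigma, upper = mu + k * sigma)
}

# Men's heights, N(70, 4): intervals for 1, 2 and 3 SD.
for (k in 1:3) {
  print(empirical_interval(mu = 70, sigma = 4, k = k))
}
```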

Describing variability using the 68 - 95 - 99.7 Rule (end of 4.1)

SAT scores are distributed nearly normally with mean 1500 and standard deviation 300.

  • \(\sim\) 68% of students score between 1200 and 1800 on the SAT.
  • \(\sim\) 95% of students score between 900 and 2100 on the SAT.
  • \(\sim\) 99.7% of students score between 600 and 2400 on the SAT.