
Please type "Your Name" in the interactive R chunk below and run it by clicking Run Code or pressing Ctrl+Enter (Cmd+Enter for Mac users).

Getting Started

Load packages

In this lab, we will continue to explore and visualize data using the tidyverse suite of packages. The data can be found in the companion package for OpenIntro labs, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)

Binomial Distribution

Binomial Experiments: A binomial experiment satisfies the following four criteria:

  • There are \(n\) trials.
  • Each trial has two possible outcomes (usually called success and failure for convenience).
  • The trials are independent of one another.
  • For each trial, the probability of success is \(p\) (which remains constant).

Binomial Distribution: Let \(X\) be the number of successes resulting from a binomial experiment with \(n\) trials. Then, \(X\) has a binomial distribution. We can use the binomial distribution to compute the following probabilities about \(X\):

  • The probability of exactly \(k\) successes is given by
    \(\displaystyle{\mathbb{P}\left[X = k\right] = \binom{n}{k}\cdot p^k\left(1 - p\right)^{n-k} \overset{\text{in R}}{\equiv} \tt{dbinom(k, n, p)}}\)
  • The probability of at most \(k\) successes is given by
    \(\displaystyle{\mathbb{P}\left[X \leq k\right] = \sum_{i=0}^{k}{\binom{n}{i}\cdot p^i\left(1 - p\right)^{n-i}} \overset{\text{in R}}{\equiv} \tt{pbinom(k, n, p)}}\)

In the equations above, \(\binom{n}{k} = \frac{n!}{k!\left(n-k\right)!}\) counts the number of ways to arrange the \(k\) successes among the \(n\) trials. That said, the R functions dbinom() and pbinom() allow us to bypass the formulas, but you will still need to know what each function computes in order to use it correctly!
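
To convince yourself that dbinom() matches the formula, you can compute one binomial probability both ways and check that the results agree. The values n = 10, k = 3, and p = 0.4 below are arbitrary choices for illustration.

n <- 10; k <- 3; p <- 0.4
choose(n, k) * p^k * (1 - p)^(n - k)   # the formula, computed by hand
dbinom(k, n, p)                        # the built-in shortcut; same value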

Tip: We use the binomial distribution to find probabilities about the number of successes (or failures) when we know the number of trials but not which particular trials the successes (or failures) occur on.

For example, suppose we would like to find the probability of getting exactly 30 heads from tossing a fair coin 50 times. We can calculate this probability using the dbinom() command as follows. Note that we have \(n=50\) trials and, since the coin is fair, \(p=0.5\).

dbinom(30,50,0.5)

We obtain \(P(X=30) = 0.04185915\). So according to our answer, there is a 4.19% chance of getting exactly 30 heads from 50 coin tosses.

Now, if we want to find the probability of getting at most 8 tails out of 20 tosses of a fair coin, we can calculate it using the pbinom() command. Here we have \(n=20\) trials and, since the coin is fair, \(p=0.5\).

pbinom(8,20,0.5)

We obtain \(P(X \le 8) = 0.2517223\). Based on the probability, we have a 25.17% chance of getting at most 8 tails out of 20 coin tosses.

On the other hand, if we want to find the probability of getting at least 9 tails out of 20 tosses of a fair coin, we can use the pbinom() command together with the rule of complements, since \(\mathbb{P}\left[X \geq 9\right] = 1 - \mathbb{P}\left[X \leq 8\right]\).

1 - pbinom(8,20,0.5)
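
Equivalently, pbinom() accepts a lower.tail argument; setting lower.tail = FALSE returns the upper-tail probability \(\mathbb{P}\left[X > k\right]\) directly, so the following one-liner gives the same answer:

pbinom(8, 20, 0.5, lower.tail = FALSE)   # P[X > 8] = P[X >= 9]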

Exercises: A student takes an exam with 35 multiple choice questions. Each question has 4 choices. Assume the student didn't study, so the 4 choices on every question are equally likely because the student will be guessing. Based on this information, answer the following questions (a sketch of the relevant pbinom() calls follows the list):

  1. Find the probability that the student got at most 15 questions correct.
  2. Find the probability that the student got at least 10 questions correct.
  3. Find the probability that the student got a maximum of 31 and a minimum of 19 questions correct.
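
As a hint, each part is a pbinom() calculation with \(n = 35\) trials and success probability \(p = 0.25\); one way to set the three parts up is sketched below.

pbinom(15, 35, 0.25)                          # P[X <= 15]
1 - pbinom(9, 35, 0.25)                       # P[X >= 10] = 1 - P[X <= 9]
pbinom(31, 35, 0.25) - pbinom(18, 35, 0.25)   # P[19 <= X <= 31]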

Normal Distribution

The normal distribution is a probability distribution that is used to model continuous random variables. It is unimodal, with a bell-shaped curve that is symmetric around the mean of the data.

Let’s start by exploring the fastfood dataset from the openintro library. In the following code chunk, we will load the dataset and peek at it using the glimpse() command.

data("fastfood")
glimpse(fastfood)

We observe that the fastfood dataset contains 515 observations and 17 variables, many of which are nutritional facts about products of different fast food restaurants.

We will focus on the products of the Dairy Queen restaurant and look into the calories from fat variable (cal_fat) to understand the fat content of the restaurant’s products. To visualize the distribution of cal_fat for Dairy Queen products, we will create both a histogram and a density plot.

In the following code chunk, we first filter() the fastfood dataset to keep only the rows from the “Dairy Queen” restaurant. Then, we use ggplot() to construct a plot for the variable cal_fat. The plot type is set using the geom_histogram() and geom_density() layers.

# keep only the Dairy Queen products
dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

# frequency histogram of calories from fat
dairy_queen %>%
  ggplot(aes(x = cal_fat)) +
  geom_histogram()

# density plot of calories from fat
dairy_queen %>%
  ggplot(aes(x = cal_fat)) +
  geom_density()

The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1. The area of each bar can be calculated as simply the height times the width of the bar.
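
If you want a histogram drawn on the density scale (so that the bar areas sum to 1), you can map the y aesthetic to the computed density. This is a minimal sketch, assuming a reasonably recent ggplot2 (3.3 or later), where after_stat(density) replaces the older ..density.. notation; overlaying geom_density() makes the histogram and the density curve directly comparable.

dairy_queen %>%
  ggplot(aes(x = cal_fat)) +
  geom_histogram(aes(y = after_stat(density))) +   # bar areas now sum to 1
  geom_density(col = "blue")                       # density curve overlaid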

Using the density plot, we notice that the distribution of cal_fat is bimodal and skewed to the right, with outliers, so it is not normally distributed.

Evaluating the Normal Distribution

Eyeballing the shape of the histogram or the density curve is one way to determine if the data appear to be nearly normally distributed, but it can be difficult to decide how close the histogram or density curve is to the normal distribution curve. An alternative approach involves constructing a normal probability plot, also called a normal Q-Q plot for “quantile-quantile”.

dairy_queen <- fastfood %>%
  filter(restaurant == "Dairy Queen")

dairy_queen %>%
  ggplot(aes(sample = cal_fat)) +
  geom_qq() +
  geom_qq_line(col = "red")

In the above code chunk, we used the geom_qq() layer to make the scatterplot and the geom_qq_line() layer to add the red reference line.

It’s important to note that here, instead of using x inside aes(), we used sample, which tells ggplot2 to compute the sample quantiles.

  • The x-axis values correspond to the quantiles of a theoretically normal curve with mean 0 and standard deviation 1 (i.e., the standard normal distribution).

  • The y-axis values correspond to the quantiles of the original unstandardized sample data.

However, even if we were to standardize the sample data values, the Q-Q plot would look identical, since standardizing is just a linear rescaling that changes the axis scale but not the pattern of the points.

A data set that is nearly normal will result in a probability plot where the points closely follow a diagonal line. Any deviations from normality lead to deviations of these points from that line.

The plot for Dairy Queen’s calories from fat shows points that tend to follow the line but with some errant points towards the upper tail. This indicates that the sample data distribution deviates from the normal distribution.
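
To verify that standardizing leaves the Q-Q plot unchanged, here is a small sketch that standardizes cal_fat by hand and re-draws the plot; the column name z_cal_fat is just an illustrative choice. Only the y-axis scale changes, not the pattern of the points.

dairy_queen %>%
  mutate(z_cal_fat = (cal_fat - mean(cal_fat)) / sd(cal_fat)) %>%   # standardize
  ggplot(aes(sample = z_cal_fat)) +
  geom_qq() +
  geom_qq_line(col = "red")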

Finding Probabilities Under the Normal Distribution

Properties of the Normal Distribution: We have the following properties associated with the normal distribution. Consider \(X\sim N\left(\mu, \sigma\right)\).

The area beneath the entire distribution is 1 (since this is equivalent to the probability that \(X\) takes on any of its possible values).

\(\displaystyle{\mathbb{P}\left[X\leq \mu\right] = \mathbb{P}\left[X\geq \mu\right] = 0.5}\) (the area underneath a full half of the distribution is 0.5)

The distribution is symmetric. In symbols, \(\mathbb{P}\left[X\leq \mu - k\right] = \mathbb{P}\left[X \geq \mu + k\right]\) for any \(k\).

\(\displaystyle{\mathbb{P}\left[X = k\right] = 0}\) (the probability that \(X\) takes on any prescribed value exactly is \(0\))
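
These properties are easy to check numerically with pnorm(). As a quick sketch, using the standard normal (mean 0, sd 1) and an arbitrary \(k = 1.3\):

pnorm(0, mean = 0, sd = 1)         # P[X <= mu] is 0.5
pnorm(-1.3, mean = 0, sd = 1)      # P[X <= mu - k] ...
1 - pnorm(1.3, mean = 0, sd = 1)   # ... equals P[X >= mu + k] by symmetry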

For the binomial distribution (a discrete distribution), we used pbinom() to calculate the probability of observing a certain number of successes or fewer for a given \(n\) (i.e., lower-tail probabilities). Similarly, for the normal distribution (a continuous distribution), we will use the pnorm() function to compute lower-tail probabilities under the normal curve.

Let’s find the area under the standard normal curve to the left of \(z = 1.5\), i.e., \(\mathbb{P}\left[Z \leq 1.5\right]\), shown as the shaded region in the plot.

To find this area, we will use the following R code:

pnorm(1.5, mean = 0, sd = 1)   # 0.9331928

Now, let’s use this function to find the following areas (a sketch for the second one follows the list):

  1. Find the area under the curve of the following plot:

  2. Find \(\mathbb{P}\left[-1.89 < Z < 1.89\right]\). In other words, find the area under the curve in the following plot.
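
For the second area, the region between two cutoffs is the difference of two lower-tail probabilities: take the area to the left of 1.89 and subtract the area to the left of -1.89.

pnorm(1.89, mean = 0, sd = 1) - pnorm(-1.89, mean = 0, sd = 1)   # P[-1.89 < Z < 1.89]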

Let’s consider this situation. The average daily high temperature in June in LA is 77\(^{\circ}\)F with a standard deviation of 5\(^{\circ}\)F. Suppose that the temperatures in June closely follow a normal distribution.

  1. What is the probability of observing an 83\(^{\circ}\)F temperature or higher in LA during a randomly chosen day in June?
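
Since pnorm() gives lower-tail areas, one way to answer this is the rule of complements, just as with pbinom() earlier:

1 - pnorm(83, mean = 77, sd = 5)   # P[X >= 83]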

So, we can use a value together with the mean and sd to find the area under the curve. But we can also go the other direction: given an area under the curve, the mean, and the sd, we can recover the exact value. To do so, we will make use of qnorm(p, mean, sd).

Let’s say, for instance, that \(x^*\) is 1.96 on the standard normal curve. Then the area under the curve to the left of \(x^*\) would be

pnorm(1.96, mean = 0, sd = 1)   # 0.9750021

Now, if we feed this number into qnorm(), we will get \(x^*\) back. That is,

qnorm(0.9750, mean = 0, sd = 1)   # 1.959964

We tend to round Z-scores to two decimal places, so the outcome here is 1.96.

Now, let’s answer the following exercises using the temperature example from earlier (a qnorm() sketch follows the list).

  1. How cool are the coldest 10% of the days (days with the lowest average high temperature) during June in LA?

  2. How hot are the hottest 5% of the days (days with the highest average high temperature) during June in LA?
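
As a hint, both questions give you an area and ask for the temperature cutoff, which is exactly what qnorm() computes:

qnorm(0.10, mean = 77, sd = 5)   # cutoff below which the coldest 10% of days fall
qnorm(0.95, mean = 77, sd = 5)   # cutoff above which the hottest 5% of days fall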

Simulations in R

In a simulation, you set the ground rules of a random process and then the computer uses random numbers to generate an outcome that adheres to those rules. As a simple example, you can simulate flipping a fair coin with the following.

coin_outcomes <- c("heads", "tails")
sample(coin_outcomes, size = 1, replace = TRUE)

The vector coin_outcomes can be thought of as a hat with two slips of paper in it: one slip says heads and the other says tails. The function sample() draws one slip from the hat and tells us whether it came up heads or tails.

Run the second command listed above several times. Just like when flipping a coin, sometimes you’ll get a heads, sometimes you’ll get a tails, but in the long run, you’d expect to get roughly equal numbers of each.

If you wanted to simulate flipping a fair coin 100 times, you could either run the function 100 times or, more simply, adjust the size argument, which governs how many samples to draw (the replace = TRUE argument indicates we put the slip of paper back in the hat before drawing again). Save the resulting vector of heads and tails in a new object called sim_fair_coin.

To view the results of this simulation, type the name of the object and then use table to count up the number of heads and tails.

coin_outcomes <- c("heads", "tails")
sim_fair_coin <- sample(coin_outcomes, size = 100, replace = TRUE)
table(sim_fair_coin)

Since there are only two elements in coin_outcomes, the probability that we “flip” a coin and it lands heads is 0.5. Say we’re trying to simulate an unfair coin that we know only lands heads 20% of the time. We can adjust for this by adding an argument called prob, which provides a vector of two probability weights.

coin_outcomes <- c("heads", "tails")
sim_unfair_coin <- sample(coin_outcomes, size = 100, replace = TRUE, 
                          prob = c(0.2, 0.8))
table(sim_unfair_coin)

prob = c(0.2, 0.8) indicates that for the two elements in the outcomes vector, we want to select the first one, heads, with probability 0.2 and the second one, tails, with probability 0.8. Another way of thinking about this is to think of the outcome space as a bag of 10 chips, where 2 chips are labeled “head” and 8 chips “tail”. Therefore, at each draw, the probability of drawing a chip that says “head” is 20%, and “tail” is 80%.

If you don’t provide a prob argument in the sample() function, all elements in the outcomes vector will be assigned an equal probability of being drawn. If you want to learn more about sample or any other function, recall that you can always check out its help file.

?sample

  1. In your simulation of flipping a fair coin 500 times, how many flips came up heads?
  2. Simulate flipping an unfair coin 500 times with a probability of 70% heads and 30% tails. How many flips came up heads? (One possible setup is sketched below.)
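
For the second exercise, one possible setup, reusing the coin_outcomes vector from above (the object name sim_unfair_coin2 is just an illustrative choice), is:

sim_unfair_coin2 <- sample(coin_outcomes, size = 500, replace = TRUE,
                           prob = c(0.7, 0.3))   # 70% heads, 30% tails
table(sim_unfair_coin2)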

A note on setting a seed: Setting a seed will cause R to select the same sample each time you run the code chunk. This will ensure reproducibility of your work (by setting the same seed it will be possible to reproduce your results). You can set a seed like this:

set.seed(35797)                 # make sure to change the seed
coin_outcomes <- c("heads", "tails")
sim_fair_coin <- sample(coin_outcomes, size = 500, replace = TRUE)
table(sim_fair_coin)

The seed above is completely arbitrary. If you need inspiration, you can use your ID, birthday, or just a random string of numbers. The important thing is that you use each seed only once in a document.

Submit

If you have completed this tutorial and are happy with all of your solutions, please click the button below to generate your hash and submit it using the corresponding tutorial assignment tab on Blackboard.


NCAT Blackboard

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
