If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This type of situation requires that you use your sample to make inference on what your population looks like.
Setting a seed: You will take random samples and build sampling distributions in this lab, which means you should set a seed on top of your lab. If this concept is new to you, review the lab concerning probability.
In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.
Let’s load the packages.
To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the openintro package.
You will consider real estate data from the city of Ames, Iowa. This is the same dataset used in the previous lab. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Your particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents your population of interest. In this lab, you will learn about these home sales by taking smaller samples from the full population. Let’s load the data.
In this lab, you’ll start with a simple random sample of size 60 from the population using the rep_sample_n
function from the infer
package. Although rep_sample_n
has more arguments that you will use in the future, you just need to include the size of the sample since we are just taking one sample without replacement.
Remember that if you want to learn more about the arguments of a command, you can do so by typing ?
in front of the command in your console to pull up the help page.
Note that the dataset has information on many housing variables, but for the first portion of the lab, you’ll focus on the size of the house, represented by the variable area
.
Describe the distribution of house area in your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
Return for a moment to the question that first motivated this lab: based on this sample, what can you infer about the population? With just one sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it x_bar
). That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of that estimate. This uncertainty can be quantified using a confidence interval.
A confidence interval for a population mean is of the following form \[ \bar{x} + z^\star \frac{s}{\sqrt{n}} \]
Although you can do this calculate by hand, the infer
package provides a set of commands consistent with the tidyverse
framework, thus making it easy to follow a similar syntax. You can use these commands to find a single confidence interval:
Function | Purpose |
---|---|
specify |
Identify your variable of interest using the formula argument |
generate |
The number of samples you want to generate |
calculate |
The sample statistic you want to do inference with, or you can also think of this as the population parameter you want to do inference for |
get_ci |
Find the confidence interval |
This code will find the 95 percent confidence interval for the average living area of houses sold in Ames.
samp %>%
specify(formula = area ~ NULL) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "mean") %>%
get_ci(level = 0.95)
## # A tibble: 1 x 2
## `2.5%` `97.5%`
## <dbl> <dbl>
## 1 1353. 1614.
Note that within generate
, there are two argument included here. The first, reps
, should be set to a reasonably large number. Here, we used 1000 since it is sufficiently large for this example without sacrificing any speed in the calculation. The second argument is type
. When generating confidence intervals, always set the type to “bootstrap”.
Feel free to research the rest of the arguments for these functions, since these commands will be used together to calculate confidence intervals and solve inference problems for the rest of the semester.
To recap: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the two printed values. There are a few conditions that must be met for this interval to be valid.
In this case, you have the rare luxury of knowing the true population mean since you have data on the entire population. Let’s calculate this value so you can determine if your confidence intervals actually capture it. You can store it in a data frame called params
(short for population parameters), and name it mu
.
Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?
Using R, you will now collect many samples to learn more about how sample means and confidence intervals vary from one sample to another.
The following lines of code takes 50 random samples of size n
from population (and remember we defined \(n = 60\) earlier), and computes the upper and lower bounds of the confidence intervals based on these samples manually using qnorm
to find the corresponding critical value for a 95 percent confidence interval, as opposed to the get_ci
method within infer
, which can only compute a single interval at a time.
Note: qnorm
is set to 0.975 for a two-sided 95 percent confidence interval since the upper bound is at the 97.5 percentile.
ci <- ames %>%
rep_sample_n(size = n, reps = 50, replace = TRUE) %>%
summarise(x_bar = mean(area),
se = sd(area) / sqrt(n),
me = qnorm(0.975) * se,
lower = x_bar - me,
upper = x_bar + me)
Let’s view the first five intervals:
## # A tibble: 5 x 6
## replicate x_bar se me lower upper
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1449. 60.3 118. 1331. 1567.
## 2 2 1455. 62.0 121. 1334. 1577.
## 3 3 1430. 57.3 112. 1318. 1543.
## 4 4 1667. 74.6 146. 1521. 1813.
## 5 5 1405. 50.5 99.0 1306. 1504.
Next we’ll create a plot similar to Figure 4.8 on page 175 of OpenIntro Statistics, 3rd Edition. The first step will be to create a new variable in the ci
data frame that indicates whether the interval does or does not capture the true population mean. Note that capturing this value would mean the lower bound of the confidence interval is below the value and upper bound of the confidence interval is above the value. Remember that you can create new variables using the mutate
function.
The ifelse
function is new. It takes three arguments: first is a logical statement, second is the value we want if the logical statement yields a true result, and the third is the value we want if the logical statement yields a false result.
You now have all the information you need to create the plot, but you need to re-organize your data a bit for easy plotting. Specifically, you need to organize the data in a new data frame where each row represents one bound, as opposed to one interval. So this
lower upper capture_mu
1 1350.540 1544.360 yes
2 1333.441 1584.425 yes
3 1412.133 1663.801 yes
...
should instead look something like
replicate type bound capture_mu
1 1 lower 1350.540 yes
2 2 lower 1333.441 yes
3 3 lower 1412.133 yes
4 1 upper 1544.360 yes
5 2 upper 1584.425 yes
6 3 upper 1663.801 yes
...
You can accomplish this using the following:
And finally you can create the plot using the following:
ggplot(data = ci_data, aes(x = bound, y = replicate,
group = replicate, color = capture_mu)) +
geom_point(size = 2) + # add points at the ends, size = 2
geom_line() + # connect with lines
geom_vline(xintercept = params$mu, color = "darkgray") # draw vertical line
Find a confidence interval with a percentage of your choosing (other than 95%) and interpret it.
Calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals? Make sure to include your plot in your answer.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.