Foundations for statistical inference

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This type of situation requires that you use your sample to make inference on what your population looks like.

Setting a seed: You will take random samples and build sampling distributions in this lab, which means you should set a seed on top of your lab. If this concept is new to you, review the lab concerning probability.

Getting Started

Load packages

In this lab, we will explore and visualize the data using the tidyverse suite of packages, and perform statistical inference using infer. The data can be found in the companion package for OpenIntro resources, openintro.

Let’s load the packages.

library(tidyverse)
library(openintro)
library(infer)

Creating a reproducible lab report

To create your new lab report, start by opening a new R Markdown document… From Template… then select Lab Report from the openintro package.

The data

You will consider real estate data from the city of Ames, Iowa. This is the same dataset used in the previous lab. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Your particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents your population of interest. In this lab, you will learn about these home sales by taking smaller samples from the full population. Let’s load the data.

data(ames)

In this lab, you’ll start with a simple random sample of size 60 from the population using the rep_sample_n function from the infer package. Although rep_sample_n has more arguments that you will use in the future, you just need to include the size of the sample since we are just taking one sample without replacement.

n <- 60
samp <- ames %>%
  rep_sample_n(size = n)

Remember that if you want to learn more about the arguments of a command, you can do so by typing ? in front of the command in your console to pull up the help page.

Note that the dataset has information on many housing variables, but for the first portion of the lab, you’ll focus on the size of the house, represented by the variable area.

Describe the distribution of house area in your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

Confidence intervals

Return for a moment to the question that first motivated this lab: based on this sample, what can you infer about the population? With just one sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it x_bar). That serves as a good point estimate, but it would be useful to also communicate how uncertain you are of that estimate. This uncertainty can be quantified using a confidence interval.

A confidence interval for a population mean is of the following form \[ \bar{x} + z^\star \frac{s}{\sqrt{n}} \]

Although you can do this calculate by hand, the infer package provides a set of commands consistent with the tidyverse framework, thus making it easy to follow a similar syntax. You can use these commands to find a single confidence interval:

Function	Purpose
`specify`	Identify your variable of interest using the `formula` argument
`generate`	The number of samples you want to generate
`calculate`	The sample statistic you want to do inference with, or you can also think of this as the population parameter you want to do inference for
`get_ci`	Find the confidence interval

This code will find the 95 percent confidence interval for the average living area of houses sold in Ames.

samp %>%
  specify(formula = area ~ NULL) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean") %>%
  get_ci(level = 0.95)

## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1  1353.   1614.

Note that within generate, there are two argument included here. The first, reps, should be set to a reasonably large number. Here, we used 1000 since it is sufficiently large for this example without sacrificing any speed in the calculation. The second argument is type. When generating confidence intervals, always set the type to “bootstrap”.

Feel free to research the rest of the arguments for these functions, since these commands will be used together to calculate confidence intervals and solve inference problems for the rest of the semester.

To recap: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the two printed values. There are a few conditions that must be met for this interval to be valid.

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

Confidence levels

What does “95% confidence” mean?

In this case, you have the rare luxury of knowing the true population mean since you have data on the entire population. Let’s calculate this value so you can determine if your confidence intervals actually capture it. You can store it in a data frame called params (short for population parameters), and name it mu.

params <- ames %>%
  summarise(mu = mean(area))

Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

Using R, you will now collect many samples to learn more about how sample means and confidence intervals vary from one sample to another.

Obtain a random sample.
Calculate the sample’s mean and standard deviation, and use these to calculate and store the lower and upper bounds of the confidence intervals.
Repeat these steps 50 times.

The following lines of code takes 50 random samples of size n from population (and remember we defined \(n = 60\) earlier), and computes the upper and lower bounds of the confidence intervals based on these samples manually using qnorm to find the corresponding critical value for a 95 percent confidence interval, as opposed to the get_ci method within infer, which can only compute a single interval at a time.

Note: qnorm is set to 0.975 for a two-sided 95 percent confidence interval since the upper bound is at the 97.5 percentile.

ci <- ames %>%
        rep_sample_n(size = n, reps = 50, replace = TRUE) %>%
        summarise(x_bar = mean(area), 
                  se = sd(area) / sqrt(n),
                  me = qnorm(0.975) * se,
                  lower = x_bar - me,
                  upper = x_bar + me)

Let’s view the first five intervals:

ci %>%
  slice(1:5)

## # A tibble: 5 x 6
##   replicate x_bar    se    me lower upper
##       <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 1449.  60.3 118.  1331. 1567.
## 2         2 1455.  62.0 121.  1334. 1577.
## 3         3 1430.  57.3 112.  1318. 1543.
## 4         4 1667.  74.6 146.  1521. 1813.
## 5         5 1405.  50.5  99.0 1306. 1504.

Next we’ll create a plot similar to Figure 4.8 on page 175 of OpenIntro Statistics, 3rd Edition. The first step will be to create a new variable in the ci data frame that indicates whether the interval does or does not capture the true population mean. Note that capturing this value would mean the lower bound of the confidence interval is below the value and upper bound of the confidence interval is above the value. Remember that you can create new variables using the mutate function.

ci <- ci %>%
  mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no"))

The ifelse function is new. It takes three arguments: first is a logical statement, second is the value we want if the logical statement yields a true result, and the third is the value we want if the logical statement yields a false result.

You now have all the information you need to create the plot, but you need to re-organize your data a bit for easy plotting. Specifically, you need to organize the data in a new data frame where each row represents one bound, as opposed to one interval. So this

     lower    upper capture_mu
1 1350.540 1544.360        yes
2 1333.441 1584.425        yes
3 1412.133 1663.801        yes
...

should instead look something like

  replicate   type     bound capture_mu
1         1  lower  1350.540        yes
2         2  lower  1333.441        yes
3         3  lower  1412.133        yes
4         1  upper  1544.360        yes
5         2  upper  1584.425        yes
6         3  upper  1663.801        yes
...

You can accomplish this using the following:

ci_data <- gather(ci, type, bound, lower:upper)

And finally you can create the plot using the following:

ggplot(data = ci_data, aes(x = bound, y = replicate, 
                           group = replicate, color = capture_mu)) +
  geom_point(size = 2) +  # add points at the ends, size = 2
  geom_line() +           # connect with lines
  geom_vline(xintercept = params$mu, color = "darkgray") # draw vertical line

What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

More Practice

Find a confidence interval with a percentage of your choosing (other than 95%) and interpret it.
Calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals? Make sure to include your plot in your answer.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Foundations for statistical inference - Confidence intervals