If you’ve never coded before (or even if you have), type print("Your Name") in the interactive R chunk below and run it by hitting Ctrl+Enter (or Cmd+Enter for Mac users).
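The chunk should echo the text straight back to you. For example (with a placeholder name; substitute your own):

```r
print("Your Name")
#> [1] "Your Name"
```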
In this workbook, we continue our exploration of statistical inference. Over the past few workbooks you became more comfortable with hypothesis testing and confidence intervals associated with categorical variables; here we extend those ideas to numerical variables. First you’ll be reminded of the normal distribution and be formally introduced to the family of \(t\)-distributions. After that, we’ll work on a pair of applications to sentencing data from the Southern District of New York.
While you were watching my walkthrough video, you probably noticed that the side of the standard error decision tree corresponding to numerical data (inference for the mean, \(\mu\)) is much more involved than the side corresponding to inference for proportions. I mentioned in that video why that side of the tree is more involved: much of it stems from the fact that using a sample standard deviation as an approximation for the population standard deviation adds uncertainty to our approach. To counter this added uncertainty, we utilize a class of penalized normal distributions called the \(t\)-distributions. Watch at least one of the videos below for a bit of history on the \(t\)-distributions.
A Detailed Introduction
A Shorter Introduction
So we’ve identified some scenarios for which we should utilize a \(t\)-distribution instead of the normal (\(z\)) distribution – the simple rule of thumb that I’ve given you is that any time we use a sample standard deviation as a proxy for the population standard deviation in the standard error estimate, we should utilize the \(t\)-distribution.
There are several other rules of thumb that people follow – for example, that even if you utilize the sample standard deviation in place of the population standard deviation, a large enough sample size lets you safely use the normal distribution. Since access to powerful statistical software makes distribution lookup tables unnecessary, my feeling is that these rules of thumb are no longer required. That is, we should use the \(t\)-distribution any time we use the sample standard deviation in place of the population standard deviation.
Okay, so what does the \(t\)-distribution actually look like? As we’ve mentioned before and was cited in the video introductions, the \(t\)-distribution is a family of distributions identified by a parameter called degrees of freedom. Below you can see a standard normal distribution in black, a \(t\)-distribution with 3 degrees of freedom in red, and a \(t\)-distribution with 12 degrees of freedom in blue.
Notice that all three of the distributions are bell-shaped, but that the \(t\)-distributions have fatter tails than the normal distribution does. Also, notice that the \(t\)-distribution with 12 degrees of freedom is more similar to the normal distribution than the \(t\)-distribution with 3 degrees of freedom. As degrees of freedom increase, our \(t\)-distribution becomes closer and closer to the normal distribution.
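If you’d like to recreate a comparison like this yourself, here is a minimal sketch using base R’s dnorm() and dt() density functions (the colors match the description above: black for the standard normal, red for 3 degrees of freedom, blue for 12):

```r
# Grid of values over which to evaluate each density
x <- seq(-4, 4, length.out = 400)

# Standard normal density in black
plot(x, dnorm(x), type = "l", col = "black", lwd = 2,
     xlab = "t", ylab = "density")

# t-distributions with 3 and 12 degrees of freedom
lines(x, dt(x, df = 3), col = "red", lwd = 2)
lines(x, dt(x, df = 12), col = "blue", lwd = 2)

legend("topright", legend = c("normal", "t (df = 3)", "t (df = 12)"),
       col = c("black", "red", "blue"), lwd = 2)
```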
When we introduced the normal distribution, we identified two helper functions:

- The function pnorm(q, mean, sd) can be used to find the probability that a randomly selected observation is less than the boundary value \(q\) from a population which is normally distributed with mean mean and standard deviation sd. That is, pnorm() finds the area to the left of some boundary value under a normal distribution.
- The function qnorm(p, mean, sd) can be used to find the boundary value such that the probability of a randomly selected observation falling below that value is p. That is, qnorm(p, mean, sd) finds the \(p^{th}\) percentile in the normal distribution having the mean and sd indicated.

We have analogous functions for the \(t\)-distribution:

- pt(q, df) can be used to find the probability of falling to the left of the boundary value q in a \(t\)-distribution with df degrees of freedom.
- qt(p, df) can be used to find the cutoff value for which the area to the left of that cutoff, in a \(t\)-distribution with df degrees of freedom, is p.

Notice that our functions for the \(t\)-distribution do not have parameters for the mean or standard deviation. This means that we must always work with standardized variables (see the formula for the test statistic on the standard error decision tree) when working with the \(t\)-distributions.
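Here is a quick sketch of all four functions side by side; the boundary values, percentiles, and distribution parameters below are arbitrary, chosen only for illustration:

```r
# Normal distribution: area to the left of q = 105 when mean = 100, sd = 10
pnorm(105, mean = 100, sd = 10)   # about 0.69

# Normal distribution: the 95th percentile when mean = 100, sd = 10
qnorm(0.95, mean = 100, sd = 10)  # about 116.45

# t-distribution: area to the left of q = 1.5 with 10 degrees of freedom
pt(1.5, df = 10)                  # about 0.92

# t-distribution: the 95th percentile with 10 degrees of freedom
qt(0.95, df = 10)                 # about 1.81
```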
Let’s start with some practice using the pt() and qt() functions. Don’t forget to draw your pictures – you are much more likely to arrive at incorrect answers if you omit this step.
Question 1: Use the code block below to find \(\mathbb{P}\left[t < 1.49\right]\) in a \(t\)-distribution with 20 degrees of freedom.
Question 2: Find \(\mathbb{P}\left[t > -1.9\right]\) in a \(t\)-distribution with 12 degrees of freedom.
Question 3: Find the cutoff value in a \(t\)-distribution with 23 degrees of freedom for which the area to the left of the cutoff value is 0.67.
Question 4: Find the critical value associated with a 99% confidence interval using a \(t\)-distribution with 14 degrees of freedom.
Okay, good – now that you’ve had some practice working with the \(t\)-distribution, let’s move on to some applications.
We’ll work with a dataset on Federal Sentencing from the Southern District of New York. A subset of the data, consisting only of drug-related charges, has been loaded for you as SDNYdrug.
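Before computing anything, it may help to peek at the data. A couple of standard exploratory calls (assuming the SDNYdrug data frame loaded above, and the SentenceMonths column used in the next question):

```r
# First few rows and the dimensions of the data frame
head(SDNYdrug)
dim(SDNYdrug)

# Summary statistics for the sentence lengths (in months)
summary(SDNYdrug$SentenceMonths)
```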
Question 1: Compute a 95% confidence interval for the average sentence length for a drug-related charge in the Southern District of New York (the sentence lengths, in months, are stored in the variable SentenceMonths).
The variable SentenceMonths is a column within the SDNYdrug data frame. Remember that you can access columns in a data frame with the $ operator.
Remember that you can compute the standard deviation in R with the sd() function. Additionally, you can find the number of rows in a data frame by passing the name of the data frame to R’s nrow() function. The sqrt() function will also be useful to you.
Did you try using 1.96? Remember that we can’t use the normal distribution here. You can find the critical value you need by making use of the qt() function.
Did you remember to use the qt() function to determine the critical value? With so many observations, the correct critical value differed very little from the 1.96 value used with the normal distribution. Remember that the critical values provided on the standard error decision tree are for use with the normal distribution only. Any time we use a sample standard deviation as a “stand-in” for the population standard deviation while computing standard error, we should be using critical values from a \(t\)-distribution. We can get the critical value using R’s qt() function – this will make a real difference when sample sizes are smaller.
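If you’d like to check your work, here is a minimal sketch of the full computation, assuming the SDNYdrug data frame and its SentenceMonths column described above:

```r
# Pieces of the confidence interval
xbar <- mean(SDNYdrug$SentenceMonths)   # sample mean
s    <- sd(SDNYdrug$SentenceMonths)     # sample standard deviation
n    <- nrow(SDNYdrug)                  # sample size

# Critical value from a t-distribution with n - 1 degrees of freedom
t_star <- qt(0.975, df = n - 1)         # 95% confidence level

# Standard error and the interval itself
SE <- s / sqrt(n)
c(xbar - t_star * SE, xbar + t_star * SE)
```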
Question 2: Conduct a hypothesis test at the \(\alpha = 0.10\) level of significance to determine whether the sample data provides significant evidence to suggest that the average sentence length for white offenders differs from the average sentence length for non-white offenders in drug-related cases in the Southern District of New York.
I’ve stored the sentence lengths (in months) handed down to white offenders in an object called whiteSentences and the sentence lengths for non-white offenders in an object called nonWhiteSentences. Use the code blocks below to answer the corresponding questions.
Since whiteSentences and nonWhiteSentences are vectors of sentence lengths rather than full data frames, the nrow() function won’t work on them. Try using the length() function instead.
Okay, we’ve answered lots of questions that give us pieces of our analysis. Now, let’s think about putting the pieces together to compute the test statistic, \(p\)-value, and complete the hypothesis test.
Answer the following to complete the hypothesis test.
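If you get stuck, the sketch below outlines one way to assemble the pieces, assuming the whiteSentences and nonWhiteSentences vectors described above. It uses the unpooled standard error for a difference in means, with a conservative degrees-of-freedom choice (the smaller sample size minus one):

```r
# Sample statistics for each group
xbar_w <- mean(whiteSentences)
xbar_n <- mean(nonWhiteSentences)
s_w    <- sd(whiteSentences)
s_n    <- sd(nonWhiteSentences)
n_w    <- length(whiteSentences)
n_n    <- length(nonWhiteSentences)

# Standard error for a difference in means (unpooled)
SE <- sqrt(s_w^2 / n_w + s_n^2 / n_n)

# Test statistic under the null hypothesis of no difference in means
t_stat <- (xbar_w - xbar_n) / SE

# Two-sided p-value, using the conservative degrees-of-freedom choice
df <- min(n_w, n_n) - 1
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value

# R's t.test() automates all of this (it uses the Welch
# degrees-of-freedom approximation instead of the conservative choice)
t.test(whiteSentences, nonWhiteSentences)
```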