Acknowledgement

These notes use content from OpenIntro Statistics Slides by

Mine Cetinkaya-Rundel.

Content Outline of Chapter 2

  • This chapter focuses on summarizing data

    • Numerical data

    • Categorical data

  • There are three sections in this chapter

    • 2.1 Examining numerical data

    • 2.2 Considering categorical data

    • 2.3 Case study

2.1 Examining numerical data

In this section, we explore techniques for summarizing numerical variables.

  • Graphical Summary

    • Scatterplot (for two numerical variables)
    • Dot plot
    • Histogram and shape (modality and skewness)
    • Boxplots & Outliers
  • Numerical Summary

    • Mean
    • Variance and Standard Deviation
    • Median
    • Quartiles and IQR

Scatterplot

A scatterplot (also called scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot using Cartesian coordinates to display values for typically two numerical variables for a set of data.

Scatterplots are useful for visualizing the relationship between two numerical variables.

Example 1. The figure below is a scatterplot for a data set (50 cases) of the total income and borrowed loan amount.

Practice: Read two pair of data and interpret.

Scatterplot

  • When two variables show some connection with one another, they are called associated variables.
    • Linear(positive, negative) association.
    • Nonlinear association.
  • If two variables are not associated, i.e. there is no evident connection between the two, then they are said to independent.

Back to Example 1. Is the association linear or nonlinear? Nonlinear

Practice

Visualize the types of associations.

Dot plots

  • Dot Plot : shows a dot for each observation placed above its value on a number line.

  • Dot Plot is useful for visualizing one numerical variable.

  • Darker colors represent areas where there are more observations.

Stacked dot plot

Higher bars represent areas where there are more observations, makes it a little easier to judge the center and the shape of the distribution.

Dot plots & mean

  • The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.

  • The mean GPA is 3.59.

Mean

  • The sample mean, denoted as \(\bar{x}\), can be calculated as

\[\bar{x} = \frac{x_1+x_2+ \dots + x_n}{n},\]

where \(x_1+x_2+ \dots + x_n\) represent the n observed values.

  • The population mean is also computed the same way but is denotes as \(\mu\). It is often not possible to calculate \(\mu\) since population data are rarely available.
  • The sample mean is a sample statistic, and served as a point estimate of the population mean (parameter).
  • This estimate may not be perfect, but if the sample is good (representative of the population), it is usually a pretty good estimate.

Examples

Example 1. Calculate the mean of a sample with five observations: 5, 3, 8, 5, 6.

\[\bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} = \frac{5+3+8+5+6}{5}=\frac{27}{5}=5.4\] Using R, we can calculate the mean using the mean() command. Notice that we need to put the values in a vector using the c() function which stands for concatenate.

mean(c(5,3,8,5,6))

Examples

Example 2. Given a data set with \(n=234\) and \(\sum_{i=1}^{n}x_i=2019\), find the mean \(\bar{x}\) rounded to 4 decimals.

\[\bar{x} = \frac{\sum_{i=1}^{n}x_i}{n} = \frac{2019}{234} = 8.6282\]

2019 / 234

Discussions

  1. If the data set has 5 observations, with \(\bar{x} = 5.4\), find \(\sum_{i=1}^{5}x_i\).

  2. Continue discussion in 1, if add one more observation 10, will the mean \(\bar{x}\) increase or decrease? What is the new \(\bar{x}\)?

  3. Compare data sets 5, 3, 8, 5, 6 and 5, 3, 80, 5, 6, which one has the higher mean?

Histograms - Extracurricular hours

(Example: Extracurricular hours)

  • Histogram provide a view of the data density. Higher bars represent where the data are relatively more common.

  • Histograms are especially convenient for describing the shape of the data distribution.

  • The chosen bin width can alter the story the histogram is telling.

Bin width

  • Which one(s) of these histograms are useful?
  • Which reveals too much about the data?
  • Which hides too much?

Shape of the distribution: 1) skewness

Is the histogram right skewed, left skewed, or symmetric?

__________________________

Note:

Histograms are said to be skewed to the side of the long tail.

Shape of the distribution: 2) unusual observations

Are there any unusual observations or potential outliers?

Shape of the distribution: 3) modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

______________________
Note:

In order to determine modality, step back and imagine a smooth curve over the histogram. (Continue to see next slide).

Commonly observed shapes of distributions: Modality

  • Unimodal

Commonly observed shapes of distributions: Modality

  • Unimodal
  • Bimodal

Commonly observed shapes of distributions: Modality

  • Unimodal
  • Bimodal
  • Multimodal

Commonly observed shapes of distributions: Modality

  • Unimodal
  • Bimodal
  • Multimodal
  • uniform

Commonly observed shapes of distributions: Skewness

  • Right Skew

Commonly observed shapes of distributions: Skewness

  • Right Skew

  • Left Skew

Commonly observed shapes of distributions: Skewness

  • Right Skew

  • Left Skew

  • Symmetric

Extracurricular activities

How would you describe the shape of the distribution of hours of week students spend on extracurricular activities?

Unimodal and right skewed, with a potentially unusual observation at 60 hours/week.

Variance

Variance is roughly the average squared deviation from the mean.

\[ s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}\\ x : 1,2,5,20 \\ \bar{x} =7\\ (x-\bar{x}): -6,-5,-2,13 \\ (x-\bar{x})^2 : 36,25,4,169 \\ s^2 = \frac{234}{3}= 78 \] We use the squared deviation in the calculation of variance:

  • To get rid of negatives so that observations equally distant from the mean are weighted equally.

  • To weigh larger deviations more heavily.

Variance

Variance is roughly the average squared deviation from the mean.

Example: For the sample data set “Hours of sleep per night”, the sample size is \(n = 217\), sample mean is \(\bar{x} = 6.71,\) and the variance is calculated as follows:

\(s^2 = \frac{(5-6.71)^2+(9-6.71)^2+\dots+(7-6.71)^2}{217-1} = 4.11 \text{ } hours^2\)

Standard Deviation

The standard deviation is the square root of the variance, and has the same units as the data.

\[s = \sqrt{s^2}\] \[ s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}\] Standard deviation measures the variability of data:

  1. if \(\color{blue}{s}\) is small, the data is concentrated around the mean \(\color{blue}{\bar{x}}\);

  2. if \(\color{blue}{s}\) is large, the data is spread further from the mean \(\color{blue}{\bar{x}}\).

Standard Deviation

  • The standard deviation of amount of sleep students get per night can be calculated as: \[s = \sqrt{4.11} = 2.03 \text{ } hours\]

  • We can see that all of the data are within 3 standard deviations of the mean (center), i.e. between 0.62 and 12.8.

\(\bar{x}\) = 6.71, 3s =6.09

\(\bar{x}\) - 3s = 6.71- 6.09 = 0.62

\(\bar{x}\) + 3s = 6.71 + 6.09 = 12.8

Standard deviation

Example: The following are samples of women’s and men’s ideal number of children. Find the standard deviation for each group.

Men: 0, 0, 0, 2, 4, 4, 4       Women: 0, 2, 2, 2, 2, 2, 4

  • Both men and women have a mean of \(\bar{x} = \frac{14}{7}=2\)

  • The deviations \((x_i - \bar{x})\) for men are: −2, −2, −2, 0, 2, 2, 2

  • The standard deviation (SD) for men is

\[s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{3(-2)^2+0+3(2)^2}{7-1}}=\sqrt{\frac{24}{6}}=2\] - The deviations \((x_i - \bar{x})\) for women are: −2, 0, 0 , 0 , 0, 2

  • The standard deviation (SD) for women is

\[s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{(-2)^2+4(0)^2+(2)^2}{7-1}}=\sqrt{\frac{8}{6}}\approx1.15\approx1.2\]

Standard deviation

Example: The following are samples of women’s and men’s ideal number of children. Find the standard deviation for each group.

Men: 0, 0, 0, 2, 4, 4, 4       Women: 0, 2, 2, 2, 2, 2, 4

  • Using R, we can calculate the standard deviation using the sd() command as follows:
sd(c(0, 0, 0, 2, 4, 4, 4))
sd(c(0, 2, 2, 2, 2, 2, 4))

Understand Standard Deviation

\[ s = \sqrt{\frac{\sum_{i=1}^n(x_i - \bar{x})^2}{n-1}}\]

  • The standard derivation 𝑠 represents a type of average distance of observations from the mean.

  • The standard deviation is zero (𝑠 = 0) only when all observations have the same value, otherwise 𝑠 > 0.

  • The larger the value of standard deviation 𝑠, the greater the variability of the data. As the spread of the data increases,𝑠 gets larger.

  • s has the same units of measurement as the original observations, while variance 𝑠^2 does not.

  • The standard deviation𝑠is not resistant. That is, strong skewness or a few outliers can greatly increase𝑠.

Median

  • The median is the value that splits the data in half when ordered in ascending order.

\[0, 1, \textbf{2}, 3, 4\]

  • If there are an even number of observations, then the median is the average of the two values in the middle.

\[0, 1, \underline{2, 3}, 4, 5 \rightarrow \frac{2+3}{2} = \textbf{2.5}\]

  • Since the \(\color{red}{median}\) is the midpoint of the data, 50% of the values are below it. Hence, it is also the \(\color{red}{50^{\textbf{th}}}\) percentile.

Median

Example. the data below gives the per capita CO2 emissions in 9 largest nations measured in metric tons per person. Find the value of the median.

China 5.9; India 1.4; U.S. 16.9; Indonesia 1.8; Brazil 2.1; Pakistan 0.8; Nigeria 0.3; Bangladesh 0.4; Russia 11.6.

Solution.

First, put the 9 observations in ascending order 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9

Since 𝑛 = 9 is odd, the median is the middle observation:

median= 1.8 metric tons.

  • In R, we calculate the median by applying the median() command to the raw (unordered) values:
median(c(5.9, 1.4, 16.9, 1.8, 2.1, 0.8, 0.3, 0.4, 11.6))

Median

  • Unlike the mean, the median is a resistant measure as its value is not sensitive to outliers. For data 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, 16.9

  • If we drop out the U.S. value, what is the new median? New mean? 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6 Now the new median is \(\frac{(1.4+1.8)}2=1.6\)

The new mean is 3.04.

  • If we change the last value, 0.3, 0.4, 0.8, 1.4, 1.8, 2.1, 5.9, 11.6, \(\color{red}{106.9}\)
    • what is the new median? New mean?
    • Now the new median is still 1.8, the new mean is \(\color{red}{13.04}\)

Percentile

  • The pth percentile is a value such that \(p\) percent of the observations fall below or at that value, and \(1-p\) percent of the observations fall above or at that value.

  • The median is a special percentile: 50th percentile

Quartiles and (IQR)

  • The \(25^{th}\) percentile is also called the first quartile, Q1.

  • The \(50^{th}\) percentile is also called the median.

  • The \(75^{th}\) percentile is also called the third quartile, Q3.

  • Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR.

\[IQR = Q3-Q1\]

The Quartiles Split the Distribution into Four Parts

Finding Quartiles

Steps:

  • Arrange the data in order.

  • Find the median. This is the second quartile, Q2.

  • Consider the lower half of the observations (excluding the median itself if n is odd). The median of these observations is the first quartile, Q1.

  • Consider the upper half of the observations (excluding the median itself if n is odd). Their median is the third quartile, Q3.

Five- Number Summary

Example: Cereal Sodium Data

For the sodium values for the 20 breakfast cereals in table,

  • The median of the 20 values is the average of the 10th and 11th observations, 180 and 180, which is Q2 = 180 mg.

  • The first quartile Q1 is the median of the 10 smallest observations (in the top row), which is the average of 130 and 140, Q1 = 135 mg.

  • The third quartile Q3 is the median of the 10 largest observations (in the bottom row), which is the average of 200 and 210, Q3 = 205 mg.

  • Five-Number Summary:
    minimum; first quartile; median; third quartile; maximum

  • In this example, the five number summary is: 0 135 180 205 340

Five- Number Summary in R

Example: Cereal Sodium Data

In R, we can compute the five-number summary using the summary() command as shown below.

Note that the quantile.type = 2 works for both odd and even sample sizes with the exception of n=5 observations in which case we need to adjust quantile.type = 6.

summary(c(0,50,70,100,130,140,140,150,160,180,
          180,180,190,200,200,210,210,220,290,340),
          quantile.type = 2) #change type to 6 if n=5

The Interquartile Range (IQR) and Outliers.

The interquartile range is the distance between the third quartile and first quartile: \[IQR = Q3-Q1\] IQR gives the spread of middle 50% of the data distribution.

The IQR measures the variability.

An observation is a potential outlier if it falls a distance of more than 1.5 x IQR below the first quartile or a distance of more than 1.5 x IQR above the third quartile.

That is, if either

\(x < Q_1 - 1.5*IQR\)

Or

\(x> Q_3+1.5*IQR\),

then the value \(x\) is a potential outlier

Box Plot

Boxplot uses quartiles to draw a box and two whiskers to summarize data. (Boxplot can be horizonal or vertical.)

  1. The box represents the middle 50% of the data, from Q1 to Q3. The thick line in the box is the median.

  2. A line goes from the lower end of the box to the smallest observation that is not a potential outlier (lower whisker).

  3. A line goes from the upper end of the box to the largest observation that is not a potential outlier (upper whisker).

  4. Indicate potential (suspected) outliers if there are any.

Box plot( Example)

Below is a boxplot of interest rate from from loan dataset. From the plot,

  1. Find (estimate) quartiles, IQR and five number summary.
  2. Is the data skewed?.

Guessed answer: Q1=8, Q2=10, Q3=14, IQR=6

Max Whisker reach 14+1.5*6=23, Min=5 , Max= 26.

Potential outliers 25,26

Skewed to the right.

More on Boxplot (Basic Boxplot)

When there is no outliers (or in a simple way), the boxplot is determined by five number summary.

Example. Given a data set with the following boxplot.

  1. Calculate the IQR

  2. Explain why there is no outlier.

  3. What percentage of the data is below 4.

  4. What percentage of the data is above 10.

  5. Can you tell the five summary from the boxplot.

Boxplot

Multiple boxplots may put by side-by-side to compare the data sets.

Example. The following are two box plots for heights (in inches) of boys and girls. Use the boxplots to answer the following questions:

  1. The five number summary for each group (boys, girls):

  2. If a girl is 68 inches tall, what is the percentage of girls who are below or at her height? what is the percentage of boys who are below or at her height?

  3. How much percentage of boys are not shorter than all girls?

Practice

Match the histograms to Boxplot.

Outliers

Why is it important to look for outliers?

  • Identify extreme skew in the distribution.

  • Identify data collection and entry errors.

  • Provide insight into interesting features of the data.

Extreme observations

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

Robust statistics

Robust statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

  • for skewed distributions it is often more helpful to use median and IQR to describe the center and spread.

  • for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread.

If you would like to estimate the typical household income for a student, would you be more interested in the mean or median income?

Median -b/c household income is skewed to the right.

Mean vs. Median

  • If the distribution is symmetric, center is often defined as the mean: mean \(\approx\) median.

  • If the distribution is skewed or has extreme outliers, center is often defined as the median.

    • Right-skewed: mean > median
    • left-skewed: mean < median

Practice

Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter etc.?

  1. Mean > Median
  2. Mean < Median
  3. Mean \(\approx\) Median
  4. Impossible to tell

Practice

Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter etc.?

  1. Mean > Median
  2. Mean < Median
  3. Mean \(\approx\) Median
  4. Impossible to tell

Practice

Which is most likely true for the distribution of percentage of time actually spent taking notes in class versus on Facebook, Twitter etc.?

  • Median: 80%
  • Mean: 76%

Mean and median are both measures of center of numerical (quantitative) data.

Transforming Data (For Extremely skewed data)

When data are extremely skewed, transforming them might make modeling easier. A common transformation is the log transformation.

The histograms on the left shows the distribution of number of basketball games attended by students. The histogram on the right shows the distribution of log of number of games attended.(Here is the natural log)

Pros and Cons of transformations (end of 2.1)

  • Skewed data are easier to model with when they are transformed because outliers tend to become far less prominent after an appropriate transformation. \[ \begin{split} \# of games & 70 & 50 & 25 & \dots \\ log(\# of games) & 4.25 & 3.91 & 3.22 & \dots \\ \end{split} \]
  • However, results of an analysis in log units of the measured variable might be difficult to interpret.

What other variables would you expect to be extremely skewed?

Salary, housing prices, etc.

2.2 Considering categorical data

In this section, we introduce tables and other tools for categorical data.

  • Frequency table, relative frequency table.
  • Bar Plots
  • Contingency Tables, Row and Column Proportions
  • Mosaic plots

Frequency and Relative Frequency Table

Example. A study of aphasia measured a sample of 22 adult aphasics with identified three types of aphasia (loss of ability to understand or express speech). The resulting data is given below:

Frequency and Relative Frequency Table

  • The variable “Type of Aphasia” has 3 categories (also called classes).
  • We can construct the following frequency table to summarize the data.

Frequency Tables in R

We can use the R function table() to construct the frequency table to summarize the data.

  • First, put the data in a vector using the c() function. We will name that vector aphasia_type
    • You may choose any other name but note that names of objects in R cannot contain any spaces;
    • Notice that we used underscore to create a two-word name
    • Also, notice that the values are typed within double quotes because we are working with non-numerical or character data
  • Then, use the table() function to get the frequencies of the different aphasia types
aphasia_type = c("Broca's", "Anomic", "Anomic", "Conduction", "Broca's", "Conduction", "Conduction", "Anomic", 
                 "Conduction", "Anomic", "Conduction", "Broca's", "Anomic", "Broca's", "Anomic", "Anomic", "Anomic",
                 "Conduction", "Broca's", "Anomic", "Conduction", "Anomic")
table(aphasia_type)

Bar plot (Bar graph)

A bar plot is a common way to display a single categorical variable.

  • The categories of a categorical variable are represented by bars.

  • The height of each bar is either the frequency, or the relative frequency (or the percentage), of the represented category.

Here is an example of a bar plot (frequency). (check the annotation and the scale of the vertical axis)

\[ \begin{align} \text{Type of Aphasia} && \text{Frequency} \\ \hline \text{Anomic} && 10 \\ \text{Broca's} && 5 \\ \text{Conduction} && 7 \\ \hline \text{Total} && 22 \\ \hline \end{align} \]

Contingency tables

A table that summarizes data for two categorical variable is called a contingency table.

The contingency table below shows the distribution of survival and ages of passengers on the Titanic.

\[ \begin{align} && && {Survival} & \\ & && Died & & Survived & & Total \\ {Age} & Adult && 1438 && 654 && 2092 \\ & Child && 52 && 57 && 109 \\ & Total && 1490 && 711 && 2201 \\ \end{align} \]

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plots below are frequency bar plots* (left) and relative frequency bar plot (right) for the group of Adult on Titanic.

\[ \begin{align} && && {Survival} & \\ & && Died & & Survived & & Total \\ {Age} & Adult && 1438 && 654 && 2092 \\ & Child && 52 && 57 && 109 \\ & Total && 1490 && 711 && 2201 \\ \end{align} \] check: \(\frac{1438}{2092}\approx 0.69\hspace{0.2cm} \text{or} \hspace{0.2cm} 69%\)

Bar Plots and Histograms

How are Bar plots different than Histograms?

Bar plots are used fro displaying distributions of categorical variables.

Histograms are used for numerical variables.

The x-axis in a histogram is a number line, hence the order of the bars cannot be changed. In a bar plot, the categories can be listed in any order(though some ordering make more sense than others, especially for ordinal variables)

Choosing the appropriate proportion

Does there appear to be a relationship between age and survival for passengers on the Titanic?

\[ \begin{align} && && {Survival} & \\ & && Died & & Survived & & Total \\ {Age} & Adult && 1438 && 654 && 2092 \\ & Child && 52 && 57 && 109 \\ & Total && 1490 && 711 && 2201 \\ \end{align} \]

To answer this question we examine the row proportions:

  • % Adults who survived: 654 / 2091 \(\approx\) 0.31

  • % Children who survived: 57 / 109 \(\approx\) 0.52

What do the column proportion tell us? \(\frac{654}{711}\approx 0.92 \hspace{0.2cm} \text {% adults survived in the total survivors on Titanic}\) \(\frac{57}{711}\approx 0.08 \hspace{0.2cm} \text{(Give a description)}\)

The percentage of passengers who were survived and were adults? \(\frac{654}{2201} \approx 0.30\)

Bar plots with two variables

What are the difference between the three visualizations shown below?

- Stacked bar plot: Graphical display of contingency table information, for counts.
- Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
- Standardized stacked bar plot: Graphical display of contingency table information, for proportions.





\[ \begin{align} && && {Survival} & \\ & && Died & & Survived & & Total \\ {Age} & Adult && 1438 && 654 && 2092 \\ & Child && 52 && 57 && 109 \\ & Total && 1490 && 711 && 2201 \\ \end{align} \]

Mosaic plots

What is the difference between the two visualizations shown below? The Plot on the left is a stacked bar plot for proportions. The Plot on the right is a Mosaic Plot.

\[ \begin{align} && && {Survival} & \\ & && Died & & Survived & & Total \\ {Age} & Adult && 1438 && 654 && 2092 \\ & Child && 52 && 57 && 109 \\ & Total && 1490 && 711 && 2201 \\ \end{align} \]

A Mosaic plot is a special type of stacked bar plot. For two variables, the width of the columns is proportional to the number of observations in each level of the variable plotted on the horizontal axis. (here 2092:109 \(\approx\) 19:1)

Pie charts

Side-by-side box plots

Does there appear to be a relationship between class year and number of clubs students are in?

2.3 Case Study: Gender discrimination

In this section, we consider a case study on gender discrimination.

  • In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each given the same personnel file and asked to judge whether the person should be promoted to a branch manager job that was described as “routine”.

  • The files were identical except that half of the supervisors had files showing the person was male while the other half had files showing the person was female.

  • It was randomly determined which supervisors got “male” applications and which got “female” applications.

  • Of the 48 files reviewed, 35 were promoted.

  • The study is testing whether females are unfairly discriminated against.

Data

Is this an observational study or an experiment?

Experiment

  • as the files marked with male/female were randomly given to supervisors, the results can be used to evaluate a casual relationship between gender and promotion.

At a first glance, does there appear to be a relationship between promotion and gender?

\[ \begin{align} & && Promoted && Not Promoted && Total \\ {Gender} & Male && 21 && 3 && 24 \\ & Female && 14 && 10 && 24 \\ & Total && 35 && 13 && 48 \\ \end{align} \] % of males promoted: 21/24 = 0.875

% of females promoted: 14/24 = 0.583

Practice

We saw a difference of almost 30% (28.2% to be exact) between the proportion of male and female files that are promoted. Based on this information, which of the below is true?

  1. If we were to repeat the experiment we will definitely see that more female files get promoted. This was a fluke.
  2. Promotion is dependent on gender, males are more likely to be promoted, and hence there is gender discrimination against women in promotion decisions.
  3. The difference in the proportions of promoted male and female files is due to chance, this is not evidence of gender discrimination against women in promotion decisions.
  4. Women are less qualified than men, and this is why fewer females get promoted.

( Watch 2.3 Video in learning unit 3)

Practice

We saw a difference of almost 30% (28.2% to be exact) between the proportion of male and female files that are promoted. Based on this information, which of the below is true?

  1. If we were to repeat the experiment we will definitely see that more female files get promoted. This was a fluke.

  2. Promotion is dependent on gender, males are more likely to be promoted, and hence there is gender discrimination against women in promotion decisions.

    Maybe

  3. The difference in the proportions of promoted male and female files is due to chance, this is not evidence of gender discrimination against women in promotion decisions.

    Maybe

  4. Women are less qualified than men, and this is why fewer females get promoted.

Two competing claims

  1. “There is nothing going on.”
    • Promotion and gender are independent , no gender discrimination, observed difference in proportions is simply due to chance. \(\rightarrow\) Null hypothesis
  2. “There is something going on.” - Promotion and gender are dependent, there is gender discrimination, observed difference in proportions is not due to chance. \(\rightarrow\) Alternative hypothesis

A trial as a hypothesis test

  • Hypothesis testing is very much like a court trial.

  • \(H_0\): Defendant is innocent \(H_A\): Defendant is guilty

  • We then present the evidence collect data.

  • Then we judge the evidence - “Could these data plausibly have happened by chance if the null hypothesis were true?”

  • If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt in out minds about the null hypothesis.

  • Ultimately we must make a decision. How unlikely is unlikely?

A trial as a hypothesis test

  • If the evidence is not strong enough to reject then assumption of innocence, the jury returns with a verdict of “not guilty”.

    • The jury does not say that the defendant is innocent, just that there is not enough evidence to convict.

    • The defendant may, in fact, be innocent, but the jury has no way of being sure.

  • Said statistically, we fail to reject the null hypothesis.

    • We never declare the null hypothesis to be true, because we simply do not know whether it’s true or not.

    • Therefore we never “accept the null hypothesis”.

A trial as a hypothesis test

  • In a trial, the burden of proof is on the prosecution.

  • In a hypothesis test, the burden of proof is on the unusual claim.

  • The null hypothesis is the ordinary state of affairs (the status quo), so it’s the alternative hypothesis that we consider unusual and for which we must gather evidence.

Recap: Hypothesis testing framework

  • We start with a null hypothesis (\(H_0\)) that represents the status quo.

  • We also have an alternative hypothesis (\(H_A\)) that represents our research question, i.e. what we’re testing for.

  • We conduct a hypothesis test under the assumption that the null hypothesis is true, either via simulation (today) or theoretical methods (later in the courses).

  • if the test results suggest that the data do not provide convincing evidence for the alternative hypothesis, we stick with the null hypothesis, If they do, then we reject the null hypothesis in favor of the alternative.

Simulating the experiment…

… under the assumption of independence, i.e. leave things up to chance.

If results from the simulations based on the chance model look like the data, then we can determine that the difference between the proportions of promoted files between males and females was simply due to chance (promotion and gender are independent).

If the results from the simulations based on the chance model do not look like the data, then we can determine that the difference between the proportions of promoted filed between males and females was not due to chance, but due to an actual effect of gender (promotion and gender are dependent).

Application activity: simulating the experiment

Use a deck of playing cards to simulate the experiment.

  1. Let a face card represent \(\textit{not}\) promoted and a non-face card represent a \(\textit{promoted}\). Consider aces as face cards.

    • Set aside the jokers.
    • Take out 3 aces \(\rightarrow\) there are exactly 13 face cards left in the deck (face cards: A, K, Q, J).
    • Take out a number card \(\rightarrow\) there are exactly 35 number (non-face) cards left in the deck (number cards: 2-10).
  2. Shuffle the cards and deal them into two groups of size 24, representing males and females.

  3. Count and record how many files in each group are promoted (number cards).

  4. Calculate the proportion of promoted files in each group and take the difference (male - female), and record this value.

  5. Repeat steps 2 - 4 many times.

Step 1

Step 2 - 4

Practice

Do the results of the simulation you just ran provide convincing evidence of gender discrimination against women, i.e. dependence between gender and promotion decisions?

  1. No, the data do not provide convincing evidence for the alternative hypothesis, therefore we can’t reject the null hypothesis of independence between gender and promotion decisions. The observed difference between the two proportions was due to chance.

  2. Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination against women in promotion decisions. The observed difference between the two proportions was due to a real effect of gender.

Practice

Do the results of the simulation you just ran provide convincing evidence of gender discrimination against women, i.e. dependence between gender and promotion decisions?

  1. No, the data do not provide convincing evidence for the alternative hypothesis, therefore we can’t reject the null hypothesis of independence between gender and promotion decisions. The observed difference between the two proportions was due to chance.

  2. Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination against women in promotion decisions. The observed difference between the two proportions was due to a real effect of gender.

Simulation using software

These simulations are tedious and slow to run using the method described earlier. In reality, we use software to generate the simulations. The dot plot below shows the distribution of simulated differences in promotion rates based on 100 simulations.