1.3 Summarizing Categorical Data

1.3.1 Objectives

By the end of this unit, students will be able to:

  • Summarize and describe the distribution of categorical data using contingency tables and various visual displays including bar plots and mosaic plots.
  • Explore the association between a numerical variable and a categorical variable using side-by-side box plots.

1.3.2 Overview

Categorical (qualitative) variables classify observations into groups or labels, such as gender, state of residence, or blood type. Unlike numerical variables, categorical variables do not have meaningful averages. Instead, we describe their distributions using counts and proportions, often displayed in tables or graphs.


Frequency and Relative Frequency Tables

  • Frequency: the number of observations in each category.
  • Relative Frequency: the proportion in each category, calculated as
    \[ \text{Relative Frequency} = \frac{\text{Category Count}}{\text{Total Sample Size}} \]

Example: Student Survey on Favorite Music
Suppose a survey of 50 students asked for their favorite type of music.

| Music Type | Frequency | Relative Frequency |
|---|---|---|
| Pop | 18 | 0.36 |
| Rock | 12 | 0.24 |
| Hip-Hop | 15 | 0.30 |
| Other | 5 | 0.10 |
| Total | 50 | 1.00 |

From this table:
- Pop is the most popular choice (36%), although it falls short of a majority.
- Hip-Hop is the second most common choice (30%).
- Only 10% selected “Other.”
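The relative frequencies in the table can be reproduced in R by dividing each count by the total. The following is a minimal sketch; the object names (`music_counts`, `music_rel`, `music_summary`) are illustrative choices, not part of the original example.

```r
# Counts from the survey table above (50 students in total)
music_counts <- c(Pop = 18, Rock = 12, "Hip-Hop" = 15, Other = 5)

# Relative frequency = category count / total sample size
music_rel <- music_counts / sum(music_counts)

# Combine counts and proportions into one summary table
music_summary <- data.frame(
  Frequency = music_counts,
  Relative_Frequency = music_rel
)
print(music_summary)
```

Because the relative frequencies are computed from the counts, they always sum to 1, which is a quick check on any hand-built table.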

Contingency Tables

When summarizing two categorical variables together, we use a contingency table. This shows the joint distribution of the variables, along with row and column totals.

Example: Gender and Smoking Status (n = 200 students)

| Gender | Smoker | Non-Smoker | Total |
|---|---|---|---|
| Male | 25 | 75 | 100 |
| Female | 15 | 85 | 100 |
| Total | 40 | 160 | 200 |

  • Row proportions: Within males, 25% smoke; within females, 15% smoke.
  • Column proportions: Of all smokers, 62.5% are male and 37.5% are female.

Contingency tables help reveal associations (e.g., males in this survey are more likely to report smoking).
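As a sketch of how the row and column proportions above could be computed in R, the counts can be entered as a two-way table and passed to `prop.table()`. The object names (`smoking`, `row_props`, `col_props`) are assumptions made for illustration.

```r
# Gender-by-smoking counts from the example (n = 200)
smoking <- as.table(matrix(
  c(25, 75,
    15, 85),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"),
                  Status = c("Smoker", "Non-Smoker"))
))

# Counts with row and column totals added
addmargins(smoking)

# Row proportions: each row sums to 1 (smoking status within each gender)
row_props <- prop.table(smoking, margin = 1)

# Column proportions: each column sums to 1 (gender within each smoking status)
col_props <- prop.table(smoking, margin = 2)

row_props
col_props
```

The `margin` argument decides which total each count is divided by: `1` uses row totals and `2` uses column totals.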

Graphical Summaries

  1. Bar Plots (one variable): Bars represent frequencies or proportions.
    • Example: number of students preferring each music genre.
  2. Side-by-Side Bar Plots (two variables): Compare categories across groups.
    • Example: comparing music preference by gender.
  3. Stacked Bar Plots: Stack the categories of one variable within a single bar per group to show counts or proportions.
  4. Mosaic Plots: An extension of stacked bar plots in which the width of each column reflects the proportion of observations in a category of one variable, and the heights of the rectangles within it reflect the proportions of the other variable within that category.
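The four displays can be sketched with base R graphics. The text gives no counts for music preference by gender, so this sketch reuses the music counts and the gender-by-smoking table from the earlier examples instead; the object names and axis labels are illustrative assumptions.

```r
# Data from the earlier examples in this section
music_counts <- c(Pop = 18, Rock = 12, "Hip-Hop" = 15, Other = 5)
smoking <- as.table(matrix(
  c(25, 75, 15, 85), nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"),
                  Status = c("Smoker", "Non-Smoker"))
))

# 1. Bar plot for one categorical variable
barplot(music_counts, xlab = "Music type", ylab = "Frequency")

# barplot() draws one group of bars per column, so transpose to group by gender
by_gender <- t(smoking)

# 2. Side-by-side bar plot: smoking status compared across genders
barplot(by_gender, beside = TRUE, legend.text = TRUE, ylab = "Frequency")

# 3. Stacked bar plot of proportions within each gender
within_gender <- prop.table(by_gender, margin = 2)
barplot(within_gender, legend.text = TRUE, ylab = "Proportion")

# 4. Mosaic plot: column widths follow gender, heights follow status within gender
mosaicplot(smoking, main = "Smoking status by gender")
```

Plotting proportions rather than counts in the stacked version makes groups of different sizes directly comparable.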

Categorical–Numerical Relationships

To examine how a numerical variable differs across the categories of a categorical variable, we use side-by-side box plots.

Example: GPA (numerical) vs. Study Program (categorical: STEM, Social Sciences, Arts).
- Box plots show whether GPA distributions differ by study program.
- We can compare medians and variability across groups and detect outliers within each group.
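A side-by-side box plot for this kind of comparison can be sketched in R with the formula interface of `boxplot()`. The GPA values below are hypothetical, invented only to make the example runnable; they are not data from the text.

```r
# Hypothetical GPA values (made up for illustration only)
gpa_data <- data.frame(
  program = rep(c("STEM", "Social Sciences", "Arts"), each = 6),
  gpa = c(3.1, 3.4, 2.8, 3.6, 3.0, 3.3,
          3.2, 3.5, 3.0, 3.7, 2.9, 3.4,
          3.3, 3.6, 3.1, 3.8, 2.2, 3.5)
)

# One box per study program, drawn on a common GPA axis
boxplot(gpa ~ program, data = gpa_data,
        xlab = "Study program", ylab = "GPA")

# Numerical companions to the plot: median and IQR of GPA in each group
group_medians <- tapply(gpa_data$gpa, gpa_data$program, median)
group_iqrs <- tapply(gpa_data$gpa, gpa_data$program, IQR)
group_medians
group_iqrs
```

The formula `gpa ~ program` reads as "GPA split by program", which places the numerical variable on the vertical axis and one box per category on the horizontal axis.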

R Code Examples

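Frequency and relative frequency tables:

# Small illustrative vector of responses (not the full survey of 50 students)
music <- c("Pop","Pop","Rock","Hip-Hop","Pop","Other","Hip-Hop")

# Frequency table: number of observations in each category
table(music)
## music
## Hip-Hop   Other     Pop    Rock 
##       2       1       3       1

# Relative frequency table: each count divided by the total
prop.table(table(music))
## music
##   Hip-Hop     Other       Pop      Rock 
## 0.2857143 0.1428571 0.4285714 0.1428571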

1.3.3 Knowledge Check

1.3.4 Solved Exercises

Exercise 1.
A survey polled a sample of 400 employees at a tech company regarding a proposed change in remote work policy. The following table summarizes the survey responses.

| Responses | Frequency | Relative Frequency (round to 3 decimals) |
|---|---|---|
| Support | 230 | |
| Neutral | 70 | |
| Oppose | 100 | |
| Total | 400 | |

  1. How many employees support the proposed policy?
  2. Fill in the last column of the table.
  3. What percentage of the sampled employees opposed the proposed policy?

Solution:

  1. The number of employees who support the policy is 230.

  2. Compute each relative frequency as

\[ \text{Relative Frequency} = \frac{\text{Frequency}}{400} \]

| Responses | Frequency | Relative Frequency |
|---|---|---|
| Support | 230 | \(\frac{230}{400} = 0.575\) |
| Neutral | 70 | \(\frac{70}{400} = 0.175\) |
| Oppose | 100 | \(\frac{100}{400} = 0.250\) |
| Total | 400 | 1.000 |

  3. The percentage of employees who opposed the policy is

\[ 0.25 \times 100 = 25\% \]

Exercise 2.
The following data show the favorite coffee types recorded for 30 volunteers at a local café.

Latte, Espresso, Cappuccino, Latte, Espresso, Latte, Americano, Latte, Cappuccino, Espresso
Americano, Latte, Latte, Cappuccino, Americano, Latte, Espresso, Americano, Latte, Latte
Latte, Cappuccino, Espresso, Latte, Americano, Cappuccino, Latte, Espresso, Latte, Latte

| Coffee Type | Frequency | Relative Frequency |
|---|---|---|
| Latte | | |
| Espresso | | |
| Cappuccino | | |
| Americano | | |
| Total | | |

  1. Summarize the data in a frequency table and calculate the relative frequencies.
  2. Draw a bar chart for the frequency of each coffee type.

Solution:

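  1. Count each coffee type in the data, then divide each count by 30.

| Coffee Type | Frequency | Relative Frequency |
|---|---|---|
| Latte | 14 | \(\frac{14}{30} = 0.467\) |
| Espresso | 6 | \(\frac{6}{30} = 0.200\) |
| Cappuccino | 5 | \(\frac{5}{30} = 0.167\) |
| Americano | 5 | \(\frac{5}{30} = 0.167\) |
| Total | 30 | 1.000 (the rounded values sum to 1.001 because of rounding) |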

  2. Bar chart (described conceptually): draw one bar per coffee type, with height equal to its frequency.
    • Latte: 14
    • Espresso: 6
    • Cappuccino: 5
    • Americano: 5

Latte is clearly the most popular coffee choice among participants.
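The counts and the bar chart for this exercise can be checked in R. This is a minimal sketch that types in the 30 responses exactly as listed in the exercise; the object names are illustrative.

```r
# The 30 responses from Exercise 2, in the order given
coffee <- c(
  "Latte", "Espresso", "Cappuccino", "Latte", "Espresso",
  "Latte", "Americano", "Latte", "Cappuccino", "Espresso",
  "Americano", "Latte", "Latte", "Cappuccino", "Americano",
  "Latte", "Espresso", "Americano", "Latte", "Latte",
  "Latte", "Cappuccino", "Espresso", "Latte", "Americano",
  "Cappuccino", "Latte", "Espresso", "Latte", "Latte"
)

# Frequency table, sorted from most to least common
coffee_freq <- sort(table(coffee), decreasing = TRUE)

# Relative frequencies rounded to 3 decimals, as in the solution table
coffee_rel <- round(prop.table(coffee_freq), 3)

coffee_freq
coffee_rel

# Bar chart of the frequencies
barplot(coffee_freq, xlab = "Coffee type", ylab = "Frequency")
```

Sorting the table before plotting orders the bars from tallest to shortest, which makes the most popular category easy to spot.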

Exercise 3.
Four hundred high school students were surveyed about their weekly volunteering hours. The following contingency table summarizes the survey results related to grade level and hours volunteered per week.

| Grade Level | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | 140 | 25 | 15 | 180 |
| Junior or Senior | 130 | 40 | 50 | 220 |
| Total | 270 | 65 | 65 | 400 |

  1. Complete the 2nd and 3rd rows of the row-proportion table (relative frequencies within the Junior or Senior group and among all students).
    (Divide the counts in the 2nd and 3rd rows by their row totals, 220 and 400, respectively.)

Solution to part 1:

| Grade Level | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | 0.778 | 0.139 | 0.083 | 1.000 |
| Junior or Senior | \(\frac{130}{220}=0.591\) | \(\frac{40}{220}=0.182\) | \(\frac{50}{220}=0.227\) | 1.000 |
| All Students | \(\frac{270}{400}=0.675\) | \(\frac{65}{400}=0.1625\) | \(\frac{65}{400}=0.1625\) | 1.000 |

  2. Find the column proportions. Interpret the meaning of the ratios in each column.

| Grade Level | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | \(\frac{140}{270}=0.519\) | \(\frac{25}{65}=0.385\) | \(\frac{15}{65}=0.231\) | \(\frac{180}{400}=0.450\) |
| Junior or Senior | \(\frac{130}{270}=0.481\) | \(\frac{40}{65}=0.615\) | \(\frac{50}{65}=0.769\) | \(\frac{220}{400}=0.550\) |
| Total | 1.000 | 1.000 | 1.000 | 1.000 |

Interpretation:
- Among non-volunteers, 51.9% are Freshmen/Sophomores and 48.1% are Juniors/Seniors.
- Among those who volunteer 5 hours or less, the majority (61.5%) are Juniors/Seniors.
- Among those volunteering more than 5 hours, 76.9% are Juniors/Seniors, well above their 55.0% share of the sample, which suggests that older students are more involved in extensive volunteering.

  3. Find the overall relative frequencies by dividing all counts by 400 (grand total) and interpret.

| Grade Level | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | \(\frac{140}{400}=0.350\) | \(\frac{25}{400}=0.063\) | \(\frac{15}{400}=0.038\) | \(\frac{180}{400}=0.450\) |
| Junior or Senior | \(\frac{130}{400}=0.325\) | \(\frac{40}{400}=0.100\) | \(\frac{50}{400}=0.125\) | \(\frac{220}{400}=0.550\) |
| Total | \(\frac{270}{400}=0.675\) | \(\frac{65}{400}=0.163\) | \(\frac{65}{400}=0.163\) | 1.000 |

Interpretation:
- 35% of all students are underclassmen who do not volunteer.
- 12.5% of all students are upperclassmen who volunteer more than 5 hours per week.
- Overall, about 16.3% of all students volunteer 5 hours or less per week, and another 16.3% volunteer more than 5 hours per week.
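All three sets of proportions in this exercise can be verified in R with `prop.table()`. The sketch below enters the counts from the contingency table; the object names and the shortened column labels are assumptions made for readability.

```r
# Counts from the Exercise 3 contingency table (n = 400)
volunteer <- as.table(matrix(
  c(140, 25, 15,
    130, 40, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Grade = c("Freshman or Sophomore", "Junior or Senior"),
    Hours = c("Not volunteering", "5 hours or less", "More than 5 hours")
  )
))

# Part 1: row proportions (divide each count by its row total)
row_props <- prop.table(volunteer, margin = 1)

# Part 2: column proportions (divide each count by its column total)
col_props <- prop.table(volunteer, margin = 2)

# Part 3: overall proportions (divide each count by the grand total)
overall_props <- prop.table(volunteer)

# Shown to 4 decimals so that values such as 0.0625 are not affected by rounding
round(row_props, 4)
round(col_props, 4)
round(addmargins(overall_props), 4)
```

Calling `addmargins()` on the overall proportions appends the marginal distributions, which correspond to the Total row and Total column of the part 3 table.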