1.3 Summarizing Categorical Data
1.3.1 Objectives
By the end of this unit, students will be able to:
- Summarize and describe the distribution of categorical data using contingency tables and various visual displays including bar plots and mosaic plots.
- Explore the association between a numerical variable and a categorical variable using side-by-side box plots.
1.3.2 Overview
Categorical (qualitative) variables classify observations into groups or labels, such as gender, state of residence, or blood type. Unlike numerical variables, categorical variables do not have meaningful averages. Instead, we describe their distributions using counts and proportions, often displayed in tables or graphs.
Frequency and Relative Frequency Tables
- Frequency: the number of observations in each category.
- Relative Frequency: the proportion in each category, calculated as
\[ \text{Relative Frequency} = \frac{\text{Category Count}}{\text{Total Sample Size}} \]
Example: Student Survey on Favorite Music
Suppose a survey of 50 students asked for their favorite type of music.
| Music Type | Frequency | Relative Frequency |
|---|---|---|
| Pop | 18 | 0.36 |
| Rock | 12 | 0.24 |
| Hip-Hop | 15 | 0.30 |
| Other | 5 | 0.10 |
| Total | 50 | 1.00 |
From this table:
- Pop is the most common choice (36%), although it falls short of a majority.
- Hip-Hop is the second most common choice (30%).
- Only 10% selected “Other.”
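The relative frequencies in the table can be checked with a few lines of R. This is a minimal sketch that types in the counts from the table directly; the object names (`freq`, `rel_freq`, `summary_tbl`) are illustrative choices, not part of the original survey.

```r
# Counts from the music survey table above
freq <- c(Pop = 18, Rock = 12, "Hip-Hop" = 15, Other = 5)

n <- sum(freq)          # total sample size
rel_freq <- freq / n    # relative frequency = category count / total

# Collect counts and proportions in one small data frame
summary_tbl <- data.frame(
  Frequency = freq,
  RelativeFrequency = round(rel_freq, 2)
)
print(summary_tbl)

# Name of the category with the largest count
top_choice <- names(freq)[which.max(freq)]
top_choice
```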
Contingency Tables
When summarizing two categorical variables together, we use a contingency table. This shows the joint distribution of the variables, along with row and column totals.
Example: Gender and Smoking Status (n = 200 students)
| | Smoker | Non-Smoker | Total |
|---|---|---|---|
| Male | 25 | 75 | 100 |
| Female | 15 | 85 | 100 |
| Total | 40 | 160 | 200 |
- Row proportions: Within males, 25% smoke; within females, 15% smoke.
- Column proportions: Of all smokers, 62.5% are male and 37.5% are female.
Contingency tables help reveal associations (e.g., males in this survey are more likely to report smoking).
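As a rough illustration, the row and column proportions quoted above can be reproduced in R with `prop.table()`. The sketch below enters the four cell counts by hand; the object names (`smoking`, `row_prop`, `col_prop`) are assumptions made for this example.

```r
# Cell counts from the gender and smoking table (totals are added below)
smoking <- matrix(
  c(25, 75,
    15, 85),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"),
                  Status = c("Smoker", "Non-Smoker"))
)

# Append row and column totals (labelled "Sum" by addmargins)
with_totals <- addmargins(smoking)
with_totals

# Row proportions: each row sums to 1 (smoking status within each gender)
row_prop <- prop.table(smoking, margin = 1)
row_prop

# Column proportions: each column sums to 1 (gender within each smoking status)
col_prop <- prop.table(smoking, margin = 2)
col_prop
```

Using `margin = 1` conditions on the row variable and `margin = 2` conditions on the column variable, which is why the two tables answer different questions.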
Graphical Summaries
Bar Plots (one variable): Bars represent frequencies or proportions.
- Example: number of students preferring each music genre.
Side-by-Side Bar Plots (two variables): Compare categories across groups.
- Example: comparing music preference by gender.
Stacked Bar Plots: Combine categories in stacked form to show proportions.
Mosaic Plots: An extension of stacked bars where both width and height of rectangles reflect proportions.
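The four displays above can be sketched with base R graphics. Because the section gives no music-by-gender counts, this example reuses the gender and smoking table for the two-variable plots; that substitution, the titles, and the object names are assumptions made for illustration.

```r
# Counts taken from the music survey and smoking tables in this section
music <- c(Pop = 18, Rock = 12, "Hip-Hop" = 15, Other = 5)
smoking <- matrix(
  c(25, 75,
    15, 85),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Male", "Female"),
                  Status = c("Smoker", "Non-Smoker"))
)

# Bar plot of one variable; barplot() returns the bar midpoints
bar_mids <- barplot(music, main = "Favorite Music Type",
                    xlab = "Music type", ylab = "Number of students")

# Side-by-side bars: one group of bars per gender
side_mids <- barplot(t(smoking), beside = TRUE, legend.text = TRUE,
                     main = "Smoking Status by Gender", ylab = "Count")

# Stacked bars of row proportions: each bar sums to 1
stacked <- t(prop.table(smoking, margin = 1))
barplot(stacked, legend.text = TRUE,
        main = "Proportion Smoking by Gender", ylab = "Proportion")

# Mosaic plot: tile widths follow gender totals,
# tile heights follow smoking status within each gender
mosaicplot(smoking, main = "Gender and Smoking Status", color = TRUE)
```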
Categorical–Numerical Relationships
To examine how a numerical variable differs across categories of a categorical variable, we use side-by-side boxplots.
Example: GPA (numerical) vs. Study Program (categorical: STEM, Social Sciences, Arts).
- Boxplots show whether GPA distributions differ by study program.
- We can compare medians and variability across groups and detect outliers within each group.
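A side-by-side boxplot of this kind might be drawn as follows. No GPA data are given in this section, so the sketch simulates hypothetical values with `rnorm()`; the means, standard deviations, group sizes, and the names `grades`, `gpa`, and `program` are all made up for illustration and say nothing about real students.

```r
set.seed(1)  # make the simulated (hypothetical) data reproducible

# Hypothetical GPA values for 30 students in each program
grades <- data.frame(
  program = rep(c("STEM", "Social Sciences", "Arts"), each = 30),
  gpa = c(rnorm(30, mean = 3.1, sd = 0.40),
          rnorm(30, mean = 3.3, sd = 0.30),
          rnorm(30, mean = 3.4, sd = 0.35))
)
grades$gpa <- pmin(pmax(grades$gpa, 0), 4)  # keep values on a 0 to 4 scale

# One box per program; points beyond the whiskers are flagged as outliers
bp <- boxplot(gpa ~ program, data = grades,
              main = "GPA by Study Program (simulated data)",
              xlab = "Study program", ylab = "GPA")

# Numerical companions to the plot: center and spread per group
group_medians <- tapply(grades$gpa, grades$program, median)
group_iqrs <- tapply(grades$gpa, grades$program, IQR)
group_medians
group_iqrs
```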
R Code Examples
Frequency table, followed by the corresponding relative frequency table (R output):
## music
## Hip-Hop Other Pop Rock
## 2 1 3 1
## music
## Hip-Hop Other Pop Rock
## 0.2857143 0.1428571 0.4285714 0.1428571
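The output above appears without the code that produced it. A minimal sketch that yields the same counts and proportions is given here, assuming a character vector named `music` with seven responses; only the counts come from the output, and the order of the values in the vector is an assumption.

```r
# Seven responses chosen to match the counts shown in the output
# (Hip-Hop 2, Other 1, Pop 3, Rock 1); the ordering is assumed
music <- c("Pop", "Rock", "Hip-Hop", "Pop", "Other", "Hip-Hop", "Pop")

# Frequency table: number of observations in each category
freq <- table(music)
freq

# Relative frequency table: each count divided by the total
rel_freq <- prop.table(freq)
rel_freq
```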
1.3.4 Solved Exercises
Exercise 1.
A survey polled a sample of 400 employees at a tech company regarding a proposed change in remote work policy. The following table summarizes the survey responses.
| Responses | Frequency | Relative Frequency (Round to 3 decimals) |
|---|---|---|
| Support | 230 | |
| Neutral | 70 | |
| Oppose | 100 | |
| Total | 400 | |
- How many employees support the proposed policy?
- Fill the last column in the table.
- What percentage of the sampled employees opposed the proposed policy?
Solution:
The number of employees who support the policy is 230.
Compute each relative frequency as
\[ \text{Relative Frequency} = \frac{\text{Frequency}}{400} \]
| Responses | Frequency | Relative Frequency |
|---|---|---|
| Support | 230 | \(\frac{230}{400} = 0.575\) |
| Neutral | 70 | \(\frac{70}{400} = 0.175\) |
| Oppose | 100 | \(\frac{100}{400} = 0.250\) |
| Total | 400 | 1.000 |
- The percentage of employees who opposed the policy is
\[ 0.25 \times 100 = 25\% \]
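The same arithmetic can be checked in R. This is a small sketch using the counts from the exercise; the object names are illustrative.

```r
# Survey counts from Exercise 1
responses <- c(Support = 230, Neutral = 70, Oppose = 100)

total <- sum(responses)                    # sample size
rel_freq <- round(responses / total, 3)    # relative frequencies, 3 decimals
rel_freq

# Percentage of sampled employees who oppose the policy
pct_oppose <- 100 * responses["Oppose"] / total
pct_oppose
```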
Exercise 2.
The following data shows the recorded favorite coffee types of 30 volunteers at a local café.
Latte, Espresso, Cappuccino, Latte, Espresso, Latte, Americano, Latte, Cappuccino, Espresso
Americano, Latte, Latte, Cappuccino, Americano, Latte, Espresso, Americano, Latte, Latte
Latte, Cappuccino, Espresso, Latte, Americano, Cappuccino, Latte, Espresso, Latte, Latte
| Coffee Type | Frequency | Relative Frequency |
|---|---|---|
| Latte | | |
| Espresso | | |
| Cappuccino | | |
| Americano | | |
| Total | | |
- Summarize the data in a frequency table and calculate the relative frequencies.
- Draw a bar chart for the frequency of each coffee type.
Solution:
Count each coffee type from the data.
| Coffee Type | Frequency | Relative Frequency |
|---|---|---|
| Latte | 14 | \(\frac{14}{30} = 0.467\) |
| Espresso | 6 | \(\frac{6}{30} = 0.200\) |
| Cappuccino | 5 | \(\frac{5}{30} = 0.167\) |
| Americano | 5 | \(\frac{5}{30} = 0.167\) |
| Total | 30 | 1.000 (the rounded values sum to 1.001) |
- Bar Chart (conceptually):
| Coffee Type | Frequency (bars) |
|---|---|
| Latte | (14) |
| Espresso | (6) |
| Cappuccino | (5) |
| Americano | (5) |
Latte is clearly the most popular coffee choice among participants.
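Counting by hand is error-prone, so it can help to let R tally the responses. The sketch below enters the 30 responses in the order listed in the exercise and draws the requested bar chart; the object names and plot labels are illustrative choices.

```r
# The 30 recorded responses, in the order given in the exercise
coffee <- c(
  "Latte", "Espresso", "Cappuccino", "Latte", "Espresso",
  "Latte", "Americano", "Latte", "Cappuccino", "Espresso",
  "Americano", "Latte", "Latte", "Cappuccino", "Americano",
  "Latte", "Espresso", "Americano", "Latte", "Latte",
  "Latte", "Cappuccino", "Espresso", "Latte", "Americano",
  "Cappuccino", "Latte", "Espresso", "Latte", "Latte"
)

# Tally the responses, then reorder to match the table in the exercise
freq <- table(coffee)
freq <- freq[c("Latte", "Espresso", "Cappuccino", "Americano")]
freq

# Relative frequencies rounded to three decimals
rel_freq <- round(prop.table(freq), 3)
rel_freq

# Bar chart of the frequencies
barplot(freq, main = "Favorite Coffee Type",
        xlab = "Coffee type", ylab = "Number of volunteers")
```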
Exercise 3.
Four hundred high school students were surveyed about their weekly volunteering hours. The following contingency table summarizes the survey results related to grade level and hours volunteered per week.
| | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | 140 | 25 | 15 | 180 |
| Junior or Senior | 130 | 40 | 50 | 220 |
| Total | 270 | 65 | 65 | 400 |
- Complete the 2nd and 3rd rows of the table of row proportions (relative frequencies within the Junior or Senior group, and for all students combined).
(Divide the counts in the Junior or Senior row by 220 and the counts in the Total row by 400.)
Solution (a):
| | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | 0.778 | 0.139 | 0.083 | 1.000 |
| Junior or Senior | \(\frac{130}{220}=0.591\) | \(\frac{40}{220}=0.182\) | \(\frac{50}{220}=0.227\) | 1.000 |
| All Students | \(\frac{270}{400}=0.675\) | \(\frac{65}{400}=0.1625\) | \(\frac{65}{400}=0.1625\) | 1.000 |
- Find the column proportions. Interpret the meaning of the ratios in each column.
| | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | \(\frac{140}{270}=0.519\) | \(\frac{25}{65}=0.385\) | \(\frac{15}{65}=0.231\) | \(\frac{180}{400}=0.450\) |
| Junior or Senior | \(\frac{130}{270}=0.481\) | \(\frac{40}{65}=0.615\) | \(\frac{50}{65}=0.769\) | \(\frac{220}{400}=0.550\) |
| Total | 1.000 | 1.000 | 1.000 | 1.000 |
Interpretation:
- Among non-volunteers, 51.9% are Freshmen/Sophomores and 48.1% are Juniors/Seniors.
- Among those who volunteer 5 hours or less, the majority (61.5%) are Juniors/Seniors.
- Among those volunteering more than 5 hours, 76.9% are Juniors/Seniors, which suggests that older students are more involved in extensive volunteering.
- Find the overall relative frequencies by dividing all counts by 400 (grand total) and interpret.
| | Not volunteering | Volunteer 5 hours or less | Volunteer more than 5 hours | Total |
|---|---|---|---|---|
| Freshman or Sophomore | \(\frac{140}{400}=0.350\) | \(\frac{25}{400}=0.063\) | \(\frac{15}{400}=0.038\) | \(\frac{180}{400}=0.450\) |
| Junior or Senior | \(\frac{130}{400}=0.325\) | \(\frac{40}{400}=0.100\) | \(\frac{50}{400}=0.125\) | \(\frac{220}{400}=0.550\) |
| Total | \(\frac{270}{400}=0.675\) | \(\frac{65}{400}=0.163\) | \(\frac{65}{400}=0.163\) | 1.000 |
Interpretation:
- 35% of all students are underclassmen who do not volunteer.
- 12.5% of all students are upperclassmen who volunteer more than 5 hours per week.
- Overall, about 16.3% of all students volunteer 5 hours or less per week, and another 16.3% volunteer more than 5 hours per week.
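All three sets of proportions in this exercise (row, column, and overall) can be obtained from one matrix of counts with `prop.table()`. The sketch below is one way to check the hand calculations; the shortened column labels and object names are assumptions for this example, and results are printed to four decimals so that values such as 0.0625 are shown before any rounding to three places.

```r
# Cell counts from the Exercise 3 contingency table (without totals)
volunteer <- matrix(
  c(140, 25, 15,
    130, 40, 50),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Grade = c("Freshman or Sophomore", "Junior or Senior"),
    Hours = c("Not volunteering", "5 hours or less", "More than 5 hours")
  )
)

# Row proportions: divide each count by its row total (180 or 220)
row_prop <- prop.table(volunteer, margin = 1)
round(row_prop, 4)

# Column proportions: divide each count by its column total (270, 65, 65)
col_prop <- prop.table(volunteer, margin = 2)
round(col_prop, 4)

# Overall proportions: divide every count by the grand total (400)
overall <- prop.table(volunteer)
round(addmargins(overall), 4)
```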