1.1 Data Types
1.1.1 Objectives
By the end of this section, students will be able to:
- Understand the importance of statistical methods for answering research questions using data.
- Identify different types of data that can be analyzed using statistical methods.
- Describe basic sampling principles and strategies for the purpose of collecting data for research studies.
- Describe basic principles of designing research experiments.
1.1.2 Overview
In this section, we will delve deeper into the categorization of variables as numerical and categorical. This is an important step, as the type of variable helps us determine what summary statistics to calculate, what type of visualizations to make, and what statistical method will be appropriate to answer the research questions we’re exploring.
There are two types of variables: numerical and categorical.
Numerical, in other words, quantitative, variables take on numerical values. It is sensible to add, subtract, take averages, and so on, with these values.
Categorical, or qualitative, variables take on a limited number of distinct categories. These categories can be identified with numbers. For example, it is customary to see Likert variables (strongly agree to strongly disagree) coded as 1 through 5, but it wouldn’t be sensible to do arithmetic operations with these values. They are merely placeholders for the levels of the categorical variable.
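To make the Likert point concrete, here is a minimal sketch using made-up responses (the data and the `LIKERT_LABELS` name are assumptions for illustration only). It shows that counting levels is a sensible summary, while the mean of the codes can be computed but does not correspond to any level.

```python
from collections import Counter

# Hypothetical survey responses, coded 1 (strongly agree) to 5 (strongly disagree).
LIKERT_LABELS = {
    1: "strongly agree",
    2: "agree",
    3: "neutral",
    4: "disagree",
    5: "strongly disagree",
}
responses = [1, 2, 2, 5, 4, 2, 1, 5, 5, 2]

# Counting how often each level occurs is a sensible summary for a categorical variable.
counts = Counter(LIKERT_LABELS[code] for code in responses)
for label in LIKERT_LABELS.values():
    print(f"{label:>17}: {counts[label]}")

# The arithmetic mean of the codes can be computed, but the result
# does not correspond to any level, so it is not a meaningful summary.
mean_code = sum(responses) / len(responses)
print("mean of codes:", mean_code)
```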
Numerical data
Numerical variables can be further categorized as continuous or discrete.
Continuous numerical variables are usually measured, such as height. These variables can take on an infinite number of values within a given range.
Discrete numerical variables are those that take on one of a specific set of numeric values where we are able to count or enumerate all of the possibilities. One example of a discrete variable is number of pets in a household. In general, count data are an example of discrete variables.
When determining whether a numerical variable is continuous or discrete, it is important to think about the nature of the variable and not just the observed value, as rounding of continuous variables can make them appear to be discrete. For example, height is a continuous variable; however, we tend to report our height rounded to the nearest unit of measure, like inches or centimeters.
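The effect of rounding can be seen in a short sketch. The heights below are invented for illustration; the point is only that six distinct measurements collapse to a handful of whole-number values once they are reported.

```python
# Hypothetical heights in centimeters, as a precise instrument might record them.
measured_cm = [170.18, 169.72, 170.41, 182.88, 183.05, 160.02]

# Reported heights are usually rounded to the nearest whole centimeter.
reported_cm = [round(h) for h in measured_cm]

print("distinct measured values:", len(set(measured_cm)))
print("distinct reported values:", len(set(reported_cm)))
print(reported_cm)
```

The underlying variable is still continuous; only the recorded values look discrete.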
Categorical data
Categorical variables that have ordered levels are called ordinal.
Think about a survey question where you’re asked how satisfied you are with the customer service you received and the options are very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. These levels have an inherent ordering, hence the variable would be called ordinal.
If the levels of a categorical variable do not have an inherent ordering to them, then the variable is simply called categorical. For example, do you consume caffeine or not?
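As a rough illustration of what "inherent ordering" buys us, the sketch below ranks the satisfaction levels from the survey example and finds the middle response. The responses themselves are hypothetical. For a variable without ordered levels, such as caffeine consumption (yes or no), no such ranking would make sense.

```python
# Ordered levels of the satisfaction question from the text, lowest to highest.
LEVELS = ["very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied"]
RANK = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical responses.
responses = ["satisfied", "neutral", "very satisfied", "unsatisfied", "satisfied"]

# Sorting by rank respects the inherent ordering; alphabetical sorting would not.
by_rank = sorted(responses, key=RANK.__getitem__)
middle = by_rank[len(by_rank) // 2]  # median level (odd number of responses)
print(by_rank)
print("median level:", middle)
```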
Data collection principles
Population versus Sample: In statistics, we almost always want to apply generalizations from a small sample to a large population – you might think of this as a sort of stereotyping. The trick here is that for our assertions (generalizations) to be valid, our sample must be representative of our population.
Why not take a census?
First, taking a census requires a lot more resources than collecting data from a sample of the population.
Second, certain individuals in your population might be hard to locate or collect data from. If these individuals that are missed in the census are different from those in the rest of the population, the census data will be biased. For example, in the US census, undocumented immigrants are often not recorded properly since they tend to be reluctant to fill out census forms with the concern that this information could be shared with immigration. However, these individuals might have characteristics different than the rest of the population and hence, not getting information from them might result in unreliable data from geographical regions with high concentrations of undocumented immigrants.
Lastly, populations are constantly changing. Even if you do have the required resources and manage to collect data from everyone in the population, tomorrow your population will be different and so the hard work required to collect such data may not pay off.
If you think about it, sampling is actually quite natural.
Sampling is natural
Think about something you are cooking: we taste, or in other words examine, a small part of what we’re cooking to get an idea about the dish as a whole. After all, we would never eat a whole pot of soup just to check its taste.
When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, what you’re doing is simply exploratory analysis for the sample at hand.
If you then generalize and conclude that your entire soup needs salt, that’s making an inference.
For your inference to be valid, the spoonful you tasted, your sample, needs to be representative of the entire pot, your population.
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not going to be representative of the whole pot.
On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
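The soup analogy can be sketched as a small simulation. Everything here is an assumption made for illustration: the pot is modeled as 1000 spoonfuls whose saltiness increases toward the bottom, and "stirring" is modeled as giving every spoonful the same chance of being tasted.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical pot: 1000 spoonfuls listed top to bottom, saltiness on an
# arbitrary scale. The salt has settled, so lower layers are saltier.
pot = [i / 1000 for i in range(1000)]
true_mean = sum(pot) / len(pot)

# Unstirred: taste only the top 50 spoonfuls.
surface = pot[:50]
surface_mean = sum(surface) / len(surface)

# Stirred: every spoonful is equally likely to end up in the sample.
stirred = rng.sample(pot, 50)
stirred_mean = sum(stirred) / len(stirred)

print(f"whole pot:       {true_mean:.3f}")
print(f"surface only:    {surface_mean:.3f}")
print(f"stirred sample:  {stirred_mean:.3f}")
```

The surface-only sample badly underestimates the saltiness of the pot, while the stirred sample lands much closer to it.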
Sampling data is a bit different than sampling soup though.
Steps to Sampling:
Identify the research question (then determine the population)
Collect data that are reliable and help achieve the research goal (take good samples)
- Population and sample
- A population is the entire group that you want to draw conclusions about.
- A sample is the specific group that you will collect data from.
- Parameter and Statistic
- A descriptive measure (for example, average, median, standard deviation, and percentages) for an entire population is a "parameter."
- A descriptive measure for a sample is referred to as a "sample statistic."
- Observational studies and Experiments
- Observational studies: research processes where researchers collect data in a way that does not directly interfere with how the data arise (examine something without manipulating it)
- Experiment: Researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables
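The distinction between a parameter and a sample statistic can be sketched in a few lines. The population below is simulated (500 hypothetical fitness-center clients and their weekly exercise minutes), so the numbers carry no meaning beyond the illustration.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical population: weekly exercise minutes for 500 clients.
population = [rng.randint(0, 300) for _ in range(500)]

# Parameter: a descriptive measure computed from the entire population.
parameter_mean = statistics.mean(population)

# Statistic: the same measure computed from a sample of 45 clients.
sample = rng.sample(population, 45)
sample_mean = statistics.mean(sample)

print("population mean (parameter):", round(parameter_mean, 1))
print("sample mean (statistic):    ", round(sample_mean, 1))
```

In practice the parameter is usually unknown, and the statistic is what we use to estimate it.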
Four commonly used random sampling techniques:
- Simple random sampling
- Stratified sample
- Cluster sampling
- Multistage sampling
So next, we’ll introduce a few commonly used sampling methods: simple random sampling, stratified sampling, cluster sampling, and multistage sampling.
Sampling Methods
Here we discuss some of the different ways to draw a sample from a population.
Simple random sample
In simple random sampling, we randomly select cases from the population, such that each case is equally likely to be selected. This is similar to randomly drawing names from a hat.
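A minimal sketch of simple random sampling, assuming a hypothetical population of 100 cases identified by ID numbers. Python's `random.sample` draws without replacement, which matches the names-from-a-hat picture.

```python
import random

rng = random.Random(42)

# Hypothetical population of 100 cases, identified by ID.
population = list(range(1, 101))

# random.sample draws without replacement; every case is equally likely
# to be selected, like drawing names from a hat.
srs = rng.sample(population, 10)
print(sorted(srs))
```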

Stratified sample
In stratified sampling, we first divide the population into homogeneous groups, called strata, and then we randomly sample from within each stratum. For example, if we wanted to make sure that people from low, medium, and high socioeconomic status are equally represented in a study, we would first divide our population into three groups as such and then sample from within each group.
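The socioeconomic example can be sketched as follows. The population, the stratum sizes, and the helper name `stratified_sample` are all assumptions for illustration; the idea is simply to group cases by stratum and then take a simple random sample of the same size within each group.

```python
import random

rng = random.Random(7)

# Hypothetical population: (person_id, socioeconomic status).
population = [(i, ses) for i, ses in enumerate(
    ["low"] * 50 + ["medium"] * 120 + ["high"] * 30)]

def stratified_sample(cases, per_stratum, rng):
    """Group cases by stratum, then draw a simple random sample within each."""
    strata = {}
    for case in cases:
        strata.setdefault(case[1], []).append(case)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

sample = stratified_sample(population, per_stratum=10, rng=rng)
for name, members in sample.items():
    print(name, [person_id for person_id, _ in members])
```

Each stratum contributes ten people even though the strata differ in size, which is what guarantees equal representation.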
Cluster sample
In cluster sampling, we divide the population into clusters, randomly sample a few clusters, and then sample all observations within these clusters. The clusters, unlike strata in stratified sampling, are heterogeneous within themselves and each cluster is similar to the others, such that we can get away with sampling from just a few of the clusters.
Multistage sample
Multistage sampling adds another step to cluster sampling. Just like in cluster sampling, we divide the population into clusters, randomly sample a few clusters, and then we randomly sample observations from within those clusters.

Note: Cluster and multistage sampling are often used for economic reasons. For example, one might divide a city into geographic regions that are on average similar to each other and then sample randomly from a few randomly picked regions in order to avoid traveling to all regions.
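The difference between cluster and multistage sampling is easiest to see side by side. The sketch below assumes a hypothetical city with 8 regions of 25 households each; the function names are illustrative, not from any library.

```python
import random

rng = random.Random(3)

# Hypothetical city: 8 regions (clusters) with 25 households each.
clusters = {f"region_{r}": [f"r{r}_h{h}" for h in range(25)] for r in range(8)}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick clusters, then keep every observation in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then randomly sample within each one."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

by_cluster = cluster_sample(clusters, n_clusters=3, rng=rng)
by_stage = multistage_sample(clusters, n_clusters=3, n_per_cluster=5, rng=rng)
print({name: len(units) for name, units in by_cluster.items()})
print({name: len(units) for name, units in by_stage.items()})
```

Both functions share the first stage (choosing regions at random); only the second stage differs.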
Convenience sample
The convenience sample is the most commonly used sampling method. Unfortunately, it is also the worst. When researchers sample from individuals they have “easy access” to, they are conducting a convenience sample. There are always hidden biases in these samples. Do a quick Google search for “FDR versus Alf Landon Sampling Error” to see a very famous example here. In addition, much of the error in predicting the results of the 2016 presidential election may be attributable to convenience sampling.
Sampling strategies, determine which
A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.
Sampling strategies, choose worst
A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.
Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.
Experimental Design
Experiment versus Observational Study: Beyond just sampling, there are multiple methods for collecting data. We can just observe what happens naturally (without manipulating any conditions) or we can run an experiment. In experiments we manipulate one or more conditions, utilizing a control and treatment group(s). The advantage to an experiment is that we can infer cause and effect relationships (this is extremely important in medical studies), but in observational studies we can only discuss an association between variables.
There’s lots more to learn about experimental design, but it is beyond the scope of our course. You should read pages 32 through 35 of OpenIntro Statistics, 4Ed as a starting point.
Explanatory and response variables
Often when one mentions “a relationship between variables” we think of a relationship between just two variables, say a so-called explanatory variable, x, and response variable, y. However, truly understanding the relationship between two variables might require considering other potentially related variables as well. If we don’t, we might find ourselves facing Simpson’s paradox. So, what is Simpson’s paradox?
First, let’s clarify what we mean when we say explanatory and response variables. Labeling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified. We use these labels only to keep track of which variable we suspect affects the other.
Explanatory and response
And these definitions can be expanded to more than just two variables. For example, we could study the relationship between three explanatory variables and a single response variable.
Multivariate relationships
This is often a more realistic scenario since most real world relationships are multivariable. For example, if we’re interested in the relationship between calories consumed daily and heart health, we would probably also want to consider information on variables like age and fitness level of the person as well.
Not considering an important variable when studying a relationship can result in what we call a Simpson’s paradox. This paradox illustrates the effect the omission of an explanatory variable can have on the measure of association between another explanatory variable and the response variable. In other words, the inclusion of a third variable in the analysis can change the apparent relationship between the other two variables.
Consider the eight dots in the scatter plot below (the points happen to fall on the orange and blue lines). The trend describing the points when only considering x1 and y, illustrated by the black dashed line, is reversed when x2, the grouping variable, is also considered. If we don’t consider x2, the relationship between x1 and y is positive. If we do consider x2, we see that within each group the relationship between x1 and y is actually negative.
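The reversal can be reproduced with a small numerical sketch. The eight points below are invented for illustration (they are not the points from the figure), and `slope` is a hand-written least-squares helper. Within each level of x2 the slope of y on x1 is negative, yet the slope across all eight points is positive.

```python
def slope(points):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

# Eight hypothetical (x1, y) points, four per level of the grouping variable x2.
groups = {
    "orange": [(1, 4.5), (2, 4.0), (3, 3.5), (4, 3.0)],
    "blue": [(6, 8.0), (7, 7.5), (8, 7.0), (9, 6.5)],
}
all_points = [p for pts in groups.values() for p in pts]

overall_slope = slope(all_points)
group_slopes = {name: slope(pts) for name, pts in groups.items()}
print("ignoring x2:", overall_slope)
print("within each x2 group:", group_slopes)
```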
We’ll explore Simpson’s paradox further with another dataset, which comes from a study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions. The data come from six departments. For confidentiality we’ll call them A through F. The dataset contains information on whether the applicant identified as male or female, recorded as Gender, and whether they were admitted or rejected, recorded as Admit.
Berkeley admission data
Gender | Admitted | Rejected |
---|---|---|
Male | 1198 | 1493 |
Female | 557 | 1278 |
Note: At the time of this study, gender and sexual identities were not given distinct names. Instead, it was common for a survey to ask for your “gender” and then provide you with the options of “male” and “female.” Today, we better understand how an individual’s gender and sexual identities are different pieces of who they are. To learn more about inclusive language surrounding gender and sexual identities see the gender unicorn.
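As a first look at the admission table above, we can compute the overall admission rate for each group. This sketch uses only the four counts from that table; it ignores department, which is exactly the kind of omitted variable Simpson's paradox warns about, so the overall rates should not be read as the full story.

```python
# Counts from the Berkeley admission table.
counts = {
    "Male": {"admitted": 1198, "rejected": 1493},
    "Female": {"admitted": 557, "rejected": 1278},
}

# Admission rate = admitted / (admitted + rejected) within each group.
admit_rate = {
    group: c["admitted"] / (c["admitted"] + c["rejected"])
    for group, c in counts.items()
}
for group, rate in admit_rate.items():
    print(f"{group}: {rate:.1%} admitted")
```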
- Principles of experimental design: 4 principles
- Controlling (assign treatment and control groups, enforce specific treatment in treatment group)
- Randomization (randomly assign treatment group and control group)
- Replication (large sample, or replicate an entire study to verify earlier findings)
- Blocking
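Randomization and blocking can be sketched together. The subjects, the blocking variable (a hypothetical "risk" level), and the helper name `assign_within_blocks` are assumptions for illustration: subjects are first grouped into blocks, and then each block is split at random between treatment and control so that both groups end up balanced on the blocking variable.

```python
import random

rng = random.Random(11)

# Hypothetical subjects: (subject_id, risk level used as the blocking variable).
subjects = [(i, "high" if i % 3 == 0 else "low") for i in range(24)]

def assign_within_blocks(subjects, rng):
    """Split each block at random into equal treatment and control halves."""
    blocks = {}
    for subject in subjects:
        blocks.setdefault(subject[1], []).append(subject)
    assignment = {"treatment": [], "control": []}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        assignment["treatment"] += shuffled[:half]
        assignment["control"] += shuffled[half:]
    return assignment

groups = assign_within_blocks(subjects, rng)
for name, members in groups.items():
    high = sum(1 for _, risk in members if risk == "high")
    print(name, "n =", len(members), "high-risk =", high)
```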
1.1.3 Solved Problems
Exercises:
Exercise 1. (page 11 #1.2) Researchers studying the effect of antibiotic treatment for acute sinusitis compared to symptomatic treatments randomly assigned 166 adults diagnosed with acute sinusitis to one of two groups: treatment or control. Study participants received either a 10-day course of amoxicillin (an antibiotic) or a placebo similar in appearance and taste. The placebo consisted of symptomatic treatments such as acetaminophen, nasal decongestants, etc. At the end of the 10-day period, patients were asked if they experienced improvement in symptoms. The distribution of responses is summarized below (with some cells missing numbers):
(For (b) and (c), round answers to the nearest hundredth of a percent.)
Self-reported improvement in symptoms
Group | Yes | No | Total |
---|---|---|---|
Treatment | 66 | | 85 |
Control | 65 | | |
Total | | | 166 |
(a). Fill the blank cells in the above table.
(b). What percent of patients in the treatment group experienced improvement in symptoms?
(c). What percent experienced improvement in symptoms in the control group?
(d). In which group did a higher percentage of patients experience improvement in symptoms?
(e). Your findings so far might suggest a real difference in effectiveness of antibiotic and placebo treatments for improving symptoms of sinusitis. However, this is not the only possible conclusion that can be drawn based on your findings so far. What is one other possible explanation for the observed difference between the percentages of patients in the antibiotic and placebo treatment groups that experience improvement in symptoms of sinusitis?
(Answers for reference:
(a).
Self-reported improvement in symptoms
Group | Yes | No | Total |
---|---|---|---|
Treatment | 66 | 19 | 85 |
Control | 65 | 16 | 81 |
Total | 131 | 35 | 166 |
(e). Be careful: Do not generalize the results of this study. It is impossible to tell merely by comparing the sample proportions, because the observed difference could be the result of random error in our sample rather than a real difference in effectiveness.)
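Parts (b) through (d) follow directly from the completed table in answer (a). The short sketch below computes the two percentages, rounded to the nearest hundredth of a percent as the exercise requests; the variable names are illustrative.

```python
# Completed table from answer (a): (improved, total) per group.
table = {"Treatment": (66, 85), "Control": (65, 81)}

# Percent improved = 100 * improved / total, rounded to two decimals.
percent_improved = {
    group: round(100 * improved / total, 2)
    for group, (improved, total) in table.items()
}
print(percent_improved)

# Part (d): the group with the higher percentage.
higher = max(percent_improved, key=percent_improved.get)
print("higher percentage:", higher)
```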
Exercise 2. The following table displays data from a lending company.
loan.amount | interest.rate | term | grade | state | total.income | homeownership |
---|---|---|---|---|---|---|
7500 | 7.34 | 36 | A | MD | 70000 | rent |
25000 | 9.43 | 60 | B | OH | 254000 | mortgage |
14500 | 6.08 | 36 | A | MO | 80000 | mortgage |
… | … | … | … | … | … | … |
3000 | 7.96 | 36 | A | CA | 34000 | rent |
Variable descriptions
loan amount: Amount of the loan received, in US dollars.
interest rate: Interest rate on the loan, in an annual percentage.
term: The length of the loan, which is always set as a whole number of months.
grade: Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid.
state: US state where the borrower resides.
total income: Borrower’s total income, including any second income, in US dollars.
homeownership: Indicates whether the person owns, owns but has a mortgage, or rents.
(a). How many cases are in the data?
(b). Identify the types of variables.
Exercise 3. (page 19 #1.4) The Buteyko method is a shallow breathing technique developed by Konstantin Buteyko, a Russian doctor, in 1952. Anecdotal evidence (evidence based only on personal observation) suggests that the Buteyko method can reduce asthma symptoms and improve quality of life. In a scientific study to determine the effectiveness of this method, researchers recruited 600 asthma patients aged 18-69 who relied on medication for asthma treatment. These patients were randomly split into two research groups: one practiced the Buteyko method and the other did not. Patients were scored on quality of life, activity, asthma symptoms, and medication reduction on a scale from 0 to 10. On average, the participants in the Buteyko group experienced a significant reduction in asthma symptoms and an improvement in quality of life.
(a). Identify the main research question of the study.
(b). Who are the subjects in this study and how many are included?
(c). What are the variables in the study? Identify each variable as numerical or categorical. If numerical, state whether the variable is discrete or continuous.
(Reference answer:
(a). Does the Buteyko method reduce asthma symptoms and improve quality of life?
(b). Asthma patients aged 18-69 who relied on medication for asthma treatment; 600.
(c). The variables and types are: quality of life (categorical), activity (categorical), asthma symptoms (categorical), and medication reduction on a scale from 0 to 10 (numerical discrete).)
Exercise 4. (page 29 #1.13) Exercise 1.3 introduces a study where researchers collected data to examine the relationship between air pollutants and preterm births in Southern California. During the study, air pollution levels were measured by air quality monitoring stations; lengths of gestation data were collected on 143,196 births between the years 1989 and 1993; and air pollution exposure during gestation was calculated for each birth.
(a). Identify the population of interest and the sample in this study.
(b). Comment on whether or not the results of the study can be generalized to the population and if the findings of the study can be used to establish causal relationships.
(Reference answer:
(a). Population: all births in Southern California. Sample: collected length of gestation data of 143,196 births between the years 1989 and 1993.
(b). If the collected lengths of gestation data of births in this time span and geography can be considered representative of all births, then the results are generalizable to the population of Southern California. However, since the study is observational, the findings cannot be used to establish causal relationships.)
Exercise 5. A fitness center is interested in the average amount of time a client exercises in the center each week. Match the vocabulary words (a-f) with their corresponding examples (1-6). (Note: the match is one-to-one.)
Examples:
(1). All 45 exercise times that were recorded from the participants in the study.
(2). The 45 clients from the fitness center who participated in the study.
(3). All clients at the fitness center.
(4). The average amount of time that all clients from the fitness center exercise.
(5). The amount of time that any given client from the fitness center exercises.
(6). The average amount of exercise time for the 45 clients from the fitness center who participated in the study.
Vocabulary words:
(a). Data
(b). Population
(c). Variable
(d). Sample
(e). Parameter
(f). Statistic
Exercise 6. (Observational Study or Experiment)
(a). You would like to investigate whether listening to music while taking exams affects performance. A group of students are told to listen to music while taking a test and their results are compared to a group not listening to music. Is this an experiment or an observational study?
(b). The starting salaries of recent graduates from Ivy League private and public universities are recorded. Is this an experiment or an observational study?
Exercise 7. (page 37 #1.41) In a public health study on the effects of consumption of fruits and vegetables on psychological well-being in young adults, participants were randomly assigned to three groups: (1) diet as usual, (2) an ecological momentary intervention involving text message reminders to increase their fruit and vegetable consumption plus a voucher to purchase them, or (3) a fruit and vegetable intervention in which participants were given two additional daily servings of fresh fruits and vegetables to consume on top of their normal diet. Participants were asked to take a nightly survey on their smartphones. Participants were student volunteers at the University of Otago, New Zealand. At the end of the 14-day study, only participants in the third group showed improvements to their psychological well-being across the 14 days relative to the other groups.
(a). What type of study is this?
(b). Identify the explanatory and response variables.
(c). Comment on whether the results of the study can be generalized to the population.
(d). Comment on whether the results of the study can be used to establish causal relationships.
(e). A newspaper article reporting on the study states, “The results of this study provide proof that giving young adults fresh fruits and vegetables to eat can have psychological benefits even over a brief period of time.” How would you suggest revising this statement so that it can be supported by the study?
(Reference answer:
(a). Experiment
(b). Explanatory: treatment group (categorical with 3 levels). Response variable: Psychological well-being.
(c). No, because the participants were volunteers.
(d). Yes, because it was an experiment.
(e). The statement should say “evidence” instead of “proof”.)