If you’ve never coded before (or even if you have), type "Your Name" in the interactive R chunk below and run it by hitting Ctrl+Enter (or Cmd+Enter for Mac users).
This is the first of a series of interactive tutorials you’ll use to engage content from introductory statistics. This tutorial covers an introduction to data, including some background on experimental design and data collection.
In this tutorial you’ll encounter two datasets – one concerning real estate from King County, WA (houses) and another containing prices and attributes of almost 54,000 diamonds (diamonds). A few rows of the houses data are shown below in a convenient form where the rows of data represent records (sometimes called observations), and the columns represent variables (sometimes called features). We’ll see later in our course that such data is called tidy.
You can see the first six rows of the dataset above. A simplified data dictionary (a map of column names and explanations) appears below.
- id: the property identification number.
- month: the month the property was listed for sale.
- price: the listing price of the property in US dollars ($).
- sqft_living: the finished square footage of the home.
- sqft_lot: the square footage of the property (land).
- waterfront: whether the property is waterfront.
- exterior: a description of the exterior covering of the home.

In this section, we will delve deeper into the categorization of variables as numerical and categorical. This is an important step, as the type of variable helps us determine what summary statistics to calculate, what type of visualizations to make, and what statistical method will be appropriate to answer the research questions we’re exploring.
There are two types of variables: numerical and categorical.
Numerical (in other words, quantitative) variables take on numerical values. It is sensible to add, subtract, take averages, and so on with these values.
Categorical, or qualitative, variables take on a limited number of distinct categories. These categories can be identified with numbers (for example, it is customary to see Likert variables, from strongly agree to strongly disagree, coded as 1 through 5), but it wouldn’t be sensible to do arithmetic operations with these values. They are merely placeholders for the levels of the categorical variable.
Numerical variables can be further categorized as continuous or discrete.
Continuous numerical variables are usually measured, such as height. These variables can take on an infinite number of values within a given range.
Discrete numerical variables are those that take on one of a specific set of numeric values where we are able to count or enumerate all of the possibilities. One example of a discrete variable is number of pets in a household. In general, count data are an example of discrete variables.
When determining whether a numerical variable is continuous or discrete, it is important to think about the nature of the variable and not just the observed value, as rounding of continuous variables can make them appear to be discrete. For example, height is a continuous variable; however, we tend to report our height rounded to the nearest unit of measure, like inches or centimeters.
Categorical variables that have ordered levels are called ordinal.
Think about a survey question where you’re asked how satisfied you are with the customer service you received and the options are very unsatisfied, unsatisfied, neutral, satisfied, and very satisfied. These levels have an inherent ordering, hence the variable would be called ordinal.
If the levels of a categorical variable do not have an inherent ordering to them, then the variable is simply called categorical. For example, do you consume caffeine or not?
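To make this concrete, here is a minimal sketch of how these variable types are typically represented in R. The example vectors are made up for illustration:

```r
# Numerical (quantitative) variables: arithmetic is sensible
height <- c(162.5, 181.0, 175.3)   # continuous: measured values
n_pets <- c(0, 2, 1)               # discrete: counted values
mean(height)                       # averaging makes sense here

# Categorical (qualitative) variables: values are labels, not quantities
caffeine <- factor(c("yes", "no", "yes"))   # no inherent ordering
satisfaction <- factor(
  c("neutral", "satisfied", "very unsatisfied"),
  levels = c("very unsatisfied", "unsatisfied", "neutral",
             "satisfied", "very satisfied"),
  ordered = TRUE                   # ordinal: the levels have an order
)
```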
Answer the following using your knowledge of the dataset and variable types.
The levels of a variable are the different (unique) values that the variable takes on. For example, a dataset on students might include a variable called ClassYear with the levels Freshman, Sophomore, Junior, and Senior. Numerical variables also have levels – usually a numerical variable has many levels, but if there are too few, we may be better off considering the variable to be categorical. For example, if a dataset included a Year variable but the only observed levels were 2008 and 2017, we may be better off thinking about Year as a categorical variable than as a numerical one.
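In R, the levels of a categorical variable are visible with levels(), and a numerical variable with few distinct values can be converted with factor(). A quick sketch using the hypothetical ClassYear and Year examples above:

```r
class_year <- factor(c("Junior", "Freshman", "Senior", "Freshman"),
                     levels = c("Freshman", "Sophomore", "Junior", "Senior"))
levels(class_year)
#> [1] "Freshman"  "Sophomore" "Junior"    "Senior"

# A numeric Year variable with only two observed values is often
# better treated as categorical
year <- c(2008, 2017, 2008, 2017)
factor(year)
#> [1] 2008 2017 2008 2017
#> Levels: 2008 2017
```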
Association, Independence, Correlation: Two variables are associated with one another if a change in the level of one is generally accompanied by a change in the other. That is, larger values of one variable are accompanied by larger (or smaller) values in the other. Think – does knowing something about one of the variables give me any information about the other? If two variables are not associated, then we might say that they are independent of one another. Lastly, correlation is a way to formally measure the strength of a LINEAR association between two variables. Look at the plots below, which consider characteristics of various diamonds.
Use the plots above to answer the following questions.
Since both of the variables in each of the plots above are numerical, we can describe the direction of the association. Notice that there is a positive association in both of the plots you identified above, since an increase in one of the variables is generally accompanied by an increase in the other. If two numerical variables are associated but an increase in one is generally accompanied by a decrease in the other, we say that the association is negative. For those familiar with lines and slopes, the direction of the association corresponds to the sign on the slope of a line of “best fit” (which we will discuss at the end of our course).
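Since the diamonds data ships with the ggplot2 package, you can quantify these associations with cor(). The particular column pairs below are illustrative choices, not necessarily the variables plotted above:

```r
library(ggplot2)   # provides the diamonds dataset

# cor() measures the strength and direction of a LINEAR association:
# values near +1 or -1 are strong, values near 0 are weak
cor(diamonds$carat, diamonds$price)   # strongly positive
cor(diamonds$depth, diamonds$price)   # close to zero
```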
We can also identify whether an association exists between variables when one or more of them are categorical. Consider the plots below, which refer back to our houses dataset from earlier.
Did you get the previous question right on the first try? Think about why the answer is what this tutorial indicates, and ask a question if you need more clarification.
Major Questions In Statistics: Given groups with different characteristics (differing levels) regarding variable X, do they differ with respect to variable Y?
We’ll find that we can’t just answer these questions by looking at plots involving some sample data. Why not?
Population versus Sample: In statistics, we almost always want to apply generalizations from a small sample to a large population – you might think of this as a sort of stereotyping. The trick here is that for our assertions (generalizations) to be valid, our sample must be representative of our population.
Why not conduct a census of the entire population instead? First, taking a census requires a lot more resources than collecting data from a sample of the population.
Second, certain individuals in your population might be hard to locate or collect data from. If the individuals missed in the census differ from those in the rest of the population, the census data will be biased. For example, in the US census, undocumented immigrants are often not recorded properly, since they tend to be reluctant to fill out census forms out of concern that the information could be shared with immigration authorities. These individuals might have characteristics different from the rest of the population, so not getting information from them can result in unreliable data from geographical regions with high concentrations of undocumented immigrants.
Lastly, populations are constantly changing. Even if you do have the required resources and manage to collect data from everyone in the population, tomorrow your population will be different and so the hard work required to collect such data may not pay off.
If you think about it, sampling is actually quite natural.
Think about cooking: we taste, or in other words examine, a small part of what we’re cooking to get an idea about the dish as a whole. After all, we would never eat a whole pot of soup just to check its taste.
When you taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough, what you’re doing is simply exploratory analysis for the sample at hand.
If you then generalize and conclude that your entire soup needs salt, that’s making an inference.
For your inference to be valid, the spoonful you tasted, your sample, needs to be representative of the entire pot, your population.
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not going to be representative of the whole pot.
On the other hand, if you first stir the soup thoroughly before you taste, your spoonful will more likely be representative of the whole pot.
Sampling data is a bit different from sampling soup, though. So next, we’ll introduce a few commonly used sampling methods: simple random sampling, stratified sampling, cluster sampling, and multistage sampling.
Here we discuss some of the different ways to draw a sample from a population.
Note: Cluster and multistage sampling are often used for economic reasons. For example, one might divide a city into geographic regions that are on average similar to each other and then sample randomly from a few randomly picked regions in order to avoid traveling to all regions.
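As a rough sketch of how a few of these methods look in code, assuming the houses data is loaded and using dplyr (the zipcode column is hypothetical, since it isn’t in the simplified data dictionary above):

```r
library(dplyr)

# Simple random sample: every row has an equal chance of selection
srs <- houses %>% slice_sample(n = 100)

# Stratified sample: draw separately within each level of a grouping
# variable so every stratum is represented
stratified <- houses %>%
  group_by(waterfront) %>%
  slice_sample(n = 50) %>%
  ungroup()

# Cluster sample: randomly pick whole groups, then keep every row in them
# (zipcode is a hypothetical clustering variable here)
picked <- sample(unique(houses$zipcode), size = 5)
clustered <- houses %>% filter(zipcode %in% picked)
```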
The convenience sample is the most commonly used sampling method. Unfortunately, it is also the worst. When researchers sample individuals they have “easy access” to, they are taking a convenience sample. There are always hidden biases in these samples. Do a quick Google search for “FDR versus Alf Landon sampling error” to see a very famous example (the 1936 Literary Digest poll, which badly mispredicted the election from a biased sample). In addition, much of the error in predicting the results of the 2016 presidential election may be attributable to convenience sampling.
A consulting company is planning a pilot study on marketing in Boston. They identify the zip codes that make up the greater Boston area, then sample 50 randomly selected addresses from each zip code and mail a coupon to these addresses. They then track whether the coupon was used in the following month.
A school district has requested a survey be conducted on the socioeconomic status of their students. Their budget only allows them to conduct the survey in some of the schools, hence they need to first sample a few schools.
Students living in this district generally attend a school in their neighborhood. The district is broken into many distinct and unique neighborhoods, some including large single-family homes and others with only low-income housing.
Experiment versus Observational Study: Beyond just sampling, there are multiple methods for collecting data. We can just observe what happens naturally (without manipulating any conditions) or we can run an experiment. In experiments we manipulate one or more conditions, utilizing a control and treatment group(s). The advantage to an experiment is that we can infer cause and effect relationships (this is extremely important in medical studies), but in observational studies we can only discuss an association between variables.
There’s lots more to learn about experimental design, but it is beyond the scope of our course. You should read pages 32 through 35 of OpenIntro Statistics (4th Edition) as a starting point.
Often, when one mentions “a relationship between variables,” we think of a relationship between just two variables: a so-called explanatory variable, x, and a response variable, y. However, truly understanding the relationship between two variables might require considering other potentially related variables as well. If we don’t, we might find ourselves facing a Simpson’s paradox. So, what is Simpson’s paradox?
First, let’s clarify what we mean when we say explanatory and response variables. Labeling variables as explanatory and response does not guarantee the relationship between the two is actually causal, even if there is an association identified. We use these labels only to keep track of which variable we suspect affects the other.
And these definitions can be expanded to more than just two variables. For example, we could study the relationship between three explanatory variables and a single response variable.
This is often a more realistic scenario since most real world relationships are multivariable. For example, if we’re interested in the relationship between calories consumed daily and heart health, we would probably also want to consider information on variables like age and fitness level of the person as well.
Not considering an important variable when studying a relationship can result in what we call a Simpson’s paradox. The paradox illustrates the effect that omitting an explanatory variable can have on the measure of association between another explanatory variable and the response variable. In other words, including a third variable in the analysis can change the apparent relationship between the other two variables.
Consider the eight dots in the scatter plot below (the points happen to fall on the orange and blue lines). The trend describing the points when only considering x1 and y, illustrated by the black dashed line, is reversed when x2, the grouping variable, is also considered. If we don’t consider x2, the relationship between x1 and y is positive. If we do consider x2, we see that within each group the relationship between x1 and y is actually negative.
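You can reproduce this reversal with a tiny simulation; the eight points below are invented to mimic the plot (within each group the slope is negative, but pooled it is positive):

```r
x1    <- c(1, 2, 3, 4, 6, 7, 8, 9)
group <- factor(rep(c("A", "B"), each = 4))
y     <- c(5, 4, 3, 2, 10, 9, 8, 7)   # group B sits higher on both axes

coef(lm(y ~ x1))          # pooled slope on x1 is positive
coef(lm(y ~ x1 + group))  # controlling for group, the slope on x1 is negative
```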
We’ll explore Simpson’s paradox further with another dataset, which comes from a study carried out by the Graduate Division of the University of California, Berkeley in the early 1970s to evaluate whether there was a sex bias in graduate admissions. The data come from six departments; for confidentiality we’ll call them A through F. The dataset contains information on whether the applicant identified as male or female, recorded as Gender, and whether they were admitted or rejected, recorded as Admit.
|        | Admitted | Rejected |
|--------|----------|----------|
| Male   | 1198     | 1493     |
| Female | 557      | 1278     |
Note: At the time of this study, gender and sexual identities were not given distinct names. Instead, it was common for a survey to ask for your “gender” and then provide you with the options of “male” and “female.” Today, we better understand how an individual’s gender and sexual identities are different pieces of who they are. To learn more about inclusive language surrounding gender and sexual identities see the gender unicorn.
First, we will evaluate whether the percentage of males admitted is indeed higher than females, overall.
Next, we will calculate the same percentage for each individual department.
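If you’d like to check these numbers outside the tutorial, this exact study is available in base R as the built-in table UCBAdmissions (dimensions Admit x Gender x Dept):

```r
# Overall admission rates by gender (collapse over department)
overall <- margin.table(UCBAdmissions, margin = c(1, 2))  # Admit x Gender
prop.table(overall, margin = 2)   # males are admitted at a higher overall rate

# Proportion admitted within each Gender x Dept combination
round(prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ], 2)
# in most departments the female rate is the higher one
```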
We’ll wrap up the lesson with a recap of our findings.
We saw that, overall, males were more likely to be admitted. But when we consider the department information, within most departments females were actually more likely to be admitted.
So when we control for department, the relationship between gender and admission status was reversed, which is what we call Simpson’s paradox.
One potential reason for this paradox is that women tended to apply to competitive departments with lower rates of admission even among qualified applicants, such as English, whereas men tended to apply to less competitive departments with higher rates of admission among qualified applicants, such as engineering and chemistry.
Note that we were only able to discover the contradictory finding once we incorporated information about the department each applicant applied to. Examples like this highlight the importance of a good study design that considers and collects information on extraneous, but potentially confounding, variables in a study.
Summary: Here’s a quick summary of the most important ideas from this first notebook.