The R Markdown file for this lab can be found here.

The potential outcomes framework is the standard approach to causal inference. In this lab, we will learn how to apply this framework to analyze research papers in our field and, hopefully, gain an appreciation for its power. We will also preview some mathematical basics needed to operationalize the framework.


Part 1

Goals

By the end of this part, you should be able to:

Please sort yourselves into \(3n\) groups, for some \(n \in \mathbb{N}\), so that each paper goes to the same number of groups. Ideally, join a group with others whose major is closest to yours. Each group will be assigned one of the papers below:

  1. Economics: Acemoglu, Johnson & Robinson (2001), “The Colonial Origins of Comparative Development.” American Economic Review.

  2. Sociology: Pager (2003), “The Mark of a Criminal Record.” American Journal of Sociology.

  3. Political Science: Miguel, Satyanath & Sergenti (2004), “Economic Shocks and Civil Conflict: An Instrumental Variables Approach.” Journal of Political Economy.

Each group will discuss its paper and present a 3-5 minute summary as follows:


Part 2

Goals

By the end of this part, you should be able to:

Let’s start by simulating some data. You do not need to know how this simulation works; simply run the code chunk below and treat the resulting data as you would any other dataset. To avoid too much abstraction at this stage, consider the following fake scenario:


The Department of Health wants to answer the following question: Does watching short-form video media cause worse brain development among children? We have enough funding to recruit 600 test subjects, and we face no ethical constraints on how we treat them. We can measure whatever we need, and we already have data on previous exposure to short-form video, actual IQ, number of academic probation incidents, and sex (assume M/F).


set.seed(2025)   # for reproducibility
N <- 600         # number of observations

# Previous exposure to short-form video (log-normal)
exposure <- rlnorm(N, meanlog = 0, sdlog = 1.0)

# IQ (normal)
IQ <- rnorm(N, mean = 95, sd = 15)

# Sex indicator, 1 = male (Bernoulli with p = 0.49)
male <- rbinom(N, 1, 0.49)

# Put into a data.frame
dat <- data.frame(id = 1:N, exposure, IQ, male)
head(dat)
##   id exposure        IQ male
## 1  1 1.860335  82.22982    1
## 2  2 1.036284  72.25608    0
## 3  3 2.166590 110.43226    0
## 4  4 3.569727 129.51451    0
## 5  5 1.449147 105.63320    1
## 6  6 0.849715  96.15045    1

You should now have a data frame dat.

Identification strategy

How should we answer the question? Please identify:

Data exploration

Let’s look at how each of the confounding variables is distributed using hist().

hist(dat$exposure)

# [Your code here]
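If you get stuck, here is one possible approach (a sketch; hist() works for the continuous variables, while a binary variable like male is easier to read as a bar plot of counts):

hist(dat$IQ)

# male is 0/1, so a bar plot of the two counts is more informative than a histogram
barplot(table(dat$male), names.arg = c("Female", "Male"))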

If we randomly assign individuals into treatment groups, what shape should we expect for the within-group distributions of each variable?

Randomization

We now create a treatment indicator by flipping a fair coin for each observation. You don’t need to know how to do this yet. Again, just run the code chunk below.

# 50/50 assignment
group <- sample(c("A","B"), size = N, replace = TRUE)
dat$group <- group

# Quick counts
table(dat$group)
## 
##   A   B 
## 280 320
prop.table(table(dat$group))
## 
##         A         B 
## 0.4666667 0.5333333

Now create two subsets of dat, one for each group.

# [Your code here]
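One possible approach, using logical indexing (a sketch; the names dat_A and dat_B are our choice, not required):

dat_A <- dat[dat$group == "A", ]   # units assigned to group A
dat_B <- dat[dat$group == "B", ]   # units assigned to group B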

Plot the within-subset distributions for each variable. Do they look the same? Are the two groups comparable?

# [Your code here]
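Building on the subsets above, one way to compare the exposure distributions side by side (a sketch; repeat for IQ and male):

par(mfrow = c(1, 2))   # two panels side by side
hist(dat_A$exposure, main = "Group A", xlab = "exposure")
hist(dat_B$exposure, main = "Group B", xlab = "exposure")
par(mfrow = c(1, 1))   # reset plotting layout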

Bonus: Let’s reduce the sample size N. Are the two groups comparable now?

Bonus: In general, how can we assess whether our two groups are comparable?
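One common diagnostic is the standardized mean difference (SMD) for each covariate, with |SMD| below roughly 0.1 often taken as good balance. A minimal sketch, assuming the helper smd defined here (the 0.1 rule of thumb is a convention, not part of this lab):

# Standardized mean difference: difference in group means over the pooled SD
smd <- function(x, g) {
  m <- tapply(x, g, mean)
  s <- tapply(x, g, sd)
  (m[["A"]] - m[["B"]]) / sqrt((s[["A"]]^2 + s[["B"]]^2) / 2)
}
sapply(dat[c("exposure", "IQ", "male")], smd, g = dat$group)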

Covariate Balance

We will compute group means of each covariate using tapply(), which we learned in class. Please calculate the mean of each variable within each subgroup, naming each result following the pattern mean_groupletter_variablename, as in the example below.

# [Your code here]
mean_A_exposure <- tapply(dat$exposure, dat$group, mean)[["A"]]
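Following the same pattern for the remaining covariates and for group B (a sketch; for the binary male variable, the mean is just the share of males):

mean_B_exposure <- tapply(dat$exposure, dat$group, mean)[["B"]]
mean_A_IQ       <- tapply(dat$IQ,       dat$group, mean)[["A"]]
mean_B_IQ       <- tapply(dat$IQ,       dat$group, mean)[["B"]]
mean_A_male     <- tapply(dat$male,     dat$group, mean)[["A"]]
mean_B_male     <- tapply(dat$male,     dat$group, mean)[["B"]]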

Let’s collect the above results into a balance table. You don’t need to know how to write the code below yet; once you have computed the means, uncomment and run it.

# balance_table <- data.frame(
#   covariate = c("Exposure", "IQ", "Male"),
#   mean_A = c(mean_A_exposure, mean_A_IQ, mean_A_male),
#   mean_B = c(mean_B_exposure, mean_B_IQ, mean_B_male)
# )
# balance_table$diff_in_means <- with(balance_table, mean_A - mean_B)
# balance_table

For each covariate, are the A and B means close? Explain why randomization implies they should be close in expectation even if any one draw shows some noise.
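As a hint, in symbols: random assignment makes the group label independent of every covariate, so for any covariate \(X_i\),

\[
E[X_i \mid \text{group}_i = A] = E[X_i \mid \text{group}_i = B] = E[X_i].
\]

The two sample means therefore estimate the same quantity, and any gap in a single draw is sampling noise that shrinks as \(N\) grows.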

Randomization as simulated counterfactuals

The potential outcomes for a unit \(i\) are \(Y_i(A)\) and \(Y_i(B)\). Randomization picks one at random to reveal. If groups are balanced on covariates and unobservables in expectation, then comparing mean outcomes between A and B estimates a causal effect.
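In equation form, independence of assignment and potential outcomes gives

\[
E[\bar{Y}_A] - E[\bar{Y}_B] = E[Y_i(A)] - E[Y_i(B)],
\]

so the observed difference in group means is an unbiased estimate of the average effect of assignment A versus B.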

We illustrate this by simulating an outcome Y that depends on our covariates, but not on group assignment.

# Outcome depends on covariates (nonlinearly), but not on group assignment.
# Note: the original indicator (IQ > 1.5) is always TRUE for IQ ~ N(95, 15),
# so we assume "1.5" meant 1.5 SDs above the mean, i.e. IQ > 95 + 1.5*15.
Y <- 1.5 + 0.4*log(1 + exposure) + 0.6*(IQ > 95 + 1.5*15) - 0.8*IQ + rnorm(N, 0, 0.7)
dat$Y <- Y

What is the difference in mean Y between A and B? What would happen if assignment were correlated with one or more covariates?

# [Your code here]
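A sketch of one approach (the difference should be close to zero here, since Y does not depend on group assignment):

mean_Y <- tapply(dat$Y, dat$group, mean)   # mean outcome in each group
mean_Y[["A"]] - mean_Y[["B"]]              # difference in means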