The R Markdown file for this lab can be found here.

In today’s lab you will learn:

  1. The dplyr pipe, as well as functions filter() and mutate().
  2. The various types of merge() - inner, outer, left, right.
  3. Diagnostic functions for checking merged data.

Data & Variables

We’ll use the gapminder package. If needed, install it with install.packages("gapminder").

Gapminder contains columns:

Review: What is the type of each variable?

We will also create two new variables.

Load Packages

# install.packages("gapminder") # uncomment if not installed
library(dplyr)
library(gapminder)
df <- gapminder

The dplyr pipe (%>%)

The dplyr pipe (%>%, and in modern R also the base pipe |>) lets you express a sequence of data transformations from left to right, reading like a sentence. Instead of nesting function calls, you “pipe” the output of one step into the next, which keeps code legible and encourages small, composable operations.

Given some function function1(input1) returns output 1, and function2(input2), function1(input1) %>% function2() is equivalent to function2(output1). If function2(input2a,input2b) has multiple arguments, you can indicate the argument to replace with .. Otherwise, the pipe will assume the first argument is taking the previous input.

Let’s try with some basic examples. For the vector lifeExp, please provide script to compute the square of the sum and the sum of the squares using the pipe.

# Square of sum
# [Your code here]
# df$lifeExp %>%
#   sum() %>%
#   .^2
# 
# Sum of squares
# [Your code here]
# df$lifeExp %>%
#   .^2 %>%
#   sum()

Filter and Mutate

filter(condition) keeps rows that satisfy logical condition condition==TRUE. These conditions use standard comparison operators (==, >, <=), logical connectors (&, |), and helpers like %in% for set membership and is.na() for missingness.

mutate() creates or modifies columns, computing values rowwise in a vectorized way. The syntax is mutate(new_or_modified = function(column1, column2, etc.)) You can build on columns created earlier in the same mutate() call, overwrite existing columns, and use helpers like if_else(), case_when(), or across() for many-column operations. Here we focus only on if_else().

We want to keep a single year (2007) and three continents (Africa, Europe, Asia), then create gdp_total, the total GDP of the country, and rich, a numeric binary variable that is 1 for countries with GDP per capita over 20,000 and 0 otherwise.

# Create gdp_total, rich
# [Your code here]
df <- df %>%
  filter(year == 2007 & continent %in% c("Africa", "Europe", "Asia")) %>%
  mutate(gdp_total = gdpPercap * pop, rich = ifelse(gdpPercap >= 20000, 1, 0))

Review: check the following diagnostics, using the pipe rather than nesting where appropriate:

# [Your code here]
nrow(df)
df$country %>%
  unique()
df$year %>%
  range()
df$continent %>%
  unique()

Build Two Overlapping Tables

Let’s now create two tables using subset().

# Table A: Africa + Europe with life expectancy
# [Your Code Here]
A <- df %>% subset(continent %in% c("Africa", "Europe"), select = c(country, lifeExp))
A <- df %>%
  filter(continent %in% c("Africa", "Europe")) %>%
  subset(select = c(country, lifeExp))

# Table B: Europe + Asia with GDP per capita (and gdp_total)
# [Your Code Here]
B <- df %>% subset(continent %in% c("Europe", "Asia"), select = c(country, gdp_total, gdpPercap))
B <- df %>%
  filter(continent %in% c("Europe", "Asia")) %>%
  subset(select = c(country, gdp_total, gdpPercap))

Joining tables with merge()

Sometimes we have variables that are split across two separate tables. In those cases we may need to join the tables together so each row can contain all the variables we need.

We can join tables by an indexed variable using the function merge(tableA, tableB, by = "indexvariable"). You can do so with multiple index variables but we refrain from doing so for now as it follows simply from single-variable joins. There are broadly four ways to join tables.

Let’s try each of these in turn. Imagine we do not have the full gapminder dataframe, but only tables A and B above. Which merge gives us the following?

# INNER
# [Your Code Here]
merge(A, B, by = "country")

# LEFT
# [Your Code Here]
merge(A, B, by = "country", all.x = TRUE)

# RIGHT
# [Your Code Here]
merge(A, B, by = "country", all.y = TRUE)


# OUTER
# [Your Code Here]
merge(A, B, by = "country", all = TRUE)