The R Markdown file for this lab can be found here.
In today’s lab you will learn:
- The
dplyrpipe, as well as functionsfilter()andmutate().- The various types of
merge()- inner, outer, left, right.- Diagnostic functions for checking merged data.
We’ll use the gapminder package. If needed, install it
with install.packages("gapminder").
Gapminder contains columns:
country: country namecontinent: continentyear: observation yearlifeExp: life expectancy at birth, in yearspop: populationgdpPercap: GDP per capita (inflation-adjusted,
USD)Review: What is the type of each variable?
We will also create two new variables.
# install.packages("gapminder") # uncomment if not installed
library(dplyr)
library(gapminder)
df <- gapminder
dplyr pipe (%>%)The dplyr pipe (%>%, and in modern R also the base
pipe |>) lets you express a sequence of data
transformations from left to right, reading like a sentence. Instead of
nesting function calls, you “pipe” the output of one step into the next,
which keeps code legible and encourages small, composable
operations.
Given some function function1(input1) returns
output 1, and function2(input2),
function1(input1) %>% function2() is equivalent to
function2(output1). If
function2(input2a,input2b) has multiple arguments, you can
indicate the argument to replace with .. Otherwise, the
pipe will assume the first argument is taking the previous input.
Let’s try with some basic examples. For the vector
lifeExp, please provide script to compute the square of the
sum and the sum of the squares using the pipe.
# Square of sum
# [Your code here]
# df$lifeExp %>%
# sum() %>%
# .^2
#
# Sum of squares
# [Your code here]
# df$lifeExp %>%
# .^2 %>%
# sum()
filter(condition) keeps rows that satisfy logical
condition condition==TRUE. These conditions use standard
comparison operators (==, >,
<=), logical connectors (&,
|), and helpers like %in% for set membership
and is.na() for missingness.
& (AND): TRUE if and only if both
conditions are TRUE. What does
filter(continent == "Asia" & lifeExp > 70)
do?
| (OR): at least one condition must be
TRUE. What does
filter(continent == "Asia" | continent == "Europe")
do?
%in% (set membership): tests whether a value is in a
vector of allowed values (cleaner than chaining many ORs).
What does filter(continent %in% c("Asia", "Europe"))
do?
is.na() (missingness): returns TRUE for missing
values; combine with ! (NOT) to exclude missingness. What does
filter(!is.na(lifeExp)) do?
mutate() creates or modifies columns, computing values
rowwise in a vectorized way. The syntax is
mutate(new_or_modified = function(column1, column2, etc.))
You can build on columns created earlier in the same
mutate() call, overwrite existing columns, and use helpers
like if_else(), case_when(), or
across() for many-column operations. Here we focus only on
if_else().
We want to keep a single year (2007) and three
continents (Africa, Europe, Asia), then create
gdp_total, the total GDP of the country, and
rich, a numeric binary variable that is 1 for countries
with GDP per capita over 20,000 and 0 otherwise.
# Create gdp_total, rich
# [Your code here]
df <- df %>%
filter(year == 2007 & continent %in% c("Africa", "Europe", "Asia")) %>%
mutate(gdp_total = gdpPercap * pop, rich = ifelse(gdpPercap >= 20000, 1, 0))
Review: check the following diagnostics, using the pipe rather than nesting where appropriate:
df now?unique())range())# [Your code here]
nrow(df)
df$country %>%
unique()
df$year %>%
range()
df$continent %>%
unique()
Let’s now create two tables using subset().
A contains observations from Africa and Europe,
and should retain the variables country and
lifeExp.B contains observations from Europe and Asia, and
should retain the variables country,
gdpPercap, and gdp_total.# Table A: Africa + Europe with life expectancy
# [Your Code Here]
A <- df %>% subset(continent %in% c("Africa", "Europe"), select = c(country, lifeExp))
A <- df %>%
filter(continent %in% c("Africa", "Europe")) %>%
subset(select = c(country, lifeExp))
# Table B: Europe + Asia with GDP per capita (and gdp_total)
# [Your Code Here]
B <- df %>% subset(continent %in% c("Europe", "Asia"), select = c(country, gdp_total, gdpPercap))
B <- df %>%
filter(continent %in% c("Europe", "Asia")) %>%
subset(select = c(country, gdp_total, gdpPercap))
merge()Sometimes we have variables that are split across two separate tables. In those cases we may need to join the tables together so each row can contain all the variables we need.
We can join tables by an indexed variable using the function
merge(tableA, tableB, by = "indexvariable"). You can do so
with multiple index variables but we refrain from doing so for now as it
follows simply from single-variable joins. There are broadly four ways
to join tables.
all.x = TRUE)all.y = TRUE)all = TRUE)Let’s try each of these in turn. Imagine we do not have the full
gapminder dataframe, but only tables A and B
above. Which merge gives us the following?
# INNER
# [Your Code Here]
merge(A, B, by = "country")
# LEFT
# [Your Code Here]
merge(A, B, by = "country", all.x = TRUE)
# RIGHT
# [Your Code Here]
merge(A, B, by = "country", all.y = TRUE)
# OUTER
# [Your Code Here]
merge(A, B, by = "country", all = TRUE)