17 Merging

Merging is the process of combining multiple datasets into a single dataset. Examples include adding inflation factors to panel data to adjust income to today’s rates, or adding county-level statistics to individual-level data. We will consider two types of merges: adding columns from one dataset to another, and adding rows (which is also called appending).

Load dplyr.


17.1 Merging: Adding Columns

The actual act of merging two datasets usually involves only one or two lines of code. Most of the work when merging happens before this, in preparing dataframes for merging and in deciding which rows we want in the output.

Let’s create two dataframes from airquality, each with some distinct columns and rows. air_5to8 will contain months 5 to 8 and only the Temp column. air_6to9 will contain months 6 to 9, and the columns Ozone, Solar.R, and Wind.

air_5to8 <- 
  airquality |>
  select(Month, Day, Temp) |>
  filter(Month != 9)

air_6to9 <- 
  airquality |>
  select(-Temp) |>
  filter(Month != 5)

To merge two dataframes, we need a key, an identifier that we can use to match one set of information to another. A common example of a key in the social sciences is a person ID, which can be used to link information across multiple datasets.

In our example, our key is spread across two columns, Month and Day:

  Month Day Temp
1     5   1   67
2     5   2   72
3     5   3   74
4     5   4   62
5     5   5   56
6     5   6   66
  Ozone Solar.R Wind Month Day
1    NA     286  8.6     6   1
2    NA     287  9.7     6   2
3    NA     242 16.1     6   3
4    NA     186  9.2     6   4
5    NA     220  8.6     6   5
6    NA     264 14.3     6   6

Our four measurements (Temp, Ozone, Solar.R, and Wind) are currently spread across two datasets. What we want to accomplish with our merge is to collect them all into a single dataset. The question we have to answer first is, which rows do we want to appear in our merged dataset?

Currently, both datasets have months 6-8. Beyond that, one has month 5, and the other month 9. We have four choices for which months we want in our merged dataset:

  • Months 5-9 (rows in either)
  • Months 6-8 (rows in both)
  • Months 5-8 (rows in air_5to8)
  • Months 6-9 (rows in air_6to9)

To keep rows in either dataset (logical OR), use dplyr’s full_join() or base R’s merge() with the argument all=T:

air_merged <- full_join(air_5to8, air_6to9)
air_merged <- merge(air_5to8, air_6to9, all = T)

To keep rows in both datasets (logical AND), use dplyr’s inner_join() or base R’s merge() without an additional all argument:

air_merged <- inner_join(air_5to8, air_6to9)
air_merged <- merge(air_5to8, air_6to9)

To keep rows from one dataset but not another, we need to pay attention to the order of the datasets we pass to our merging function. With dplyr, the two datasets are termed “left” and “right”, and with base R, they are called “x” and “y”.

We can keep rows from the dataset in the first argument (left, x) with left_join() or with merge(..., all.x = T):

air_merged <- left_join(air_5to8, air_6to9)
air_merged <- merge(air_5to8, air_6to9, all.x = T)

Finally, use right_join() or merge(..., all.y = T) to keep all rows in the second dataset (right, y):

air_merged <- right_join(air_5to8, air_6to9)
air_merged <- merge(air_5to8, air_6to9, all.y = T)

Note that in all of these examples, because the key columns have the same name, we never needed to tell R which columns to use when merging.

17.1.1 Different Key Column Names

Sometimes your key variable will have different names in your dataframes.

Let’s change the Month column in air_5to8 to be lowercase month:

air_5to8_month <- air_5to8 |> rename(month = Month)

To merge air_5to8_month and air_6to9, we can either rename our key column(s) so that they do match by adapating the code above, or we can utilize the by arguments of our merging functions.

With dplyr’s *_join() functions, give the by argument a vector, and name all key columns. Where they differ between the two datasets, provide a pair in the pattern left = right:

air_merged <- 
  full_join(air_5to8_month, air_6to9,
            by = c("month" = "Month", "Day"))

With merge(), use the by.x and by.y arguments to name all key columns. Be sure the order and length of these two argument values match.

air_merged <-
  merge(air_5to8_month, air_6to9, all = T,
        by.x = c("month", "Day"),
        by.y = c("Month", "Day"))

Usually, it is easier to rename columns before merging, especially where we are matching on several key columns.

17.2 Appending: Adding Rows

To append rows from one or more dataframe to another, use dplyr’s bind_rows() function. bind_rows() automatically matches columns by name, so the datasets can differ in their number and order of columns.

air_noSolar <- 
  airquality |> 
  filter(Month %in% 5:6) |> 

air_noTemp <- 
  airquality |> 
  filter(Month %in% 7:8) |> 

air_onlyTemp <-
  airquality |> 
  filter(Month %in% 9) |> 
  select(Month, Day, Temp)

air_appended <- bind_rows(air_noSolar, air_noTemp, air_onlyTemp)

17.2.1 Different Column Types

bind_rows() is picky about matching column types, and it will not automatically coerce columns to be of the same type. Instead, it will return an error. We can fix this by manually coercing column types to character before merging.

First, let’s create some sample data. Imagine we just downloaded two sets of survey data.

survey1 <-
  data.frame(Q1 = c(1, 1),
             Q2 = c(3, 1),
             Q2a = c("Sometimes", NA),
             Q3 = c(6, 7),
             Q3a = c("Never", "Always"))

survey2 <-
  data.frame(Q1 = c(2, 4, 4),
             Q2 = c(1, 5, 3),
             Q2a = c(NA, NA, 2),
             Q3 = c(4, 2, 5),
             Q3a = c(NA, NA, NA))

  Q1 Q2       Q2a Q3    Q3a
1  1  3 Sometimes  6  Never
2  1  1      <NA>  7 Always
  Q1 Q2 Q2a Q3 Q3a
1  2  1  NA  4  NA
2  4  5  NA  2  NA
3  4  3   2  5  NA

When we use bind_rows() for survey1 and survey2, the error message tells us that column Q2a is character in one dataframe and numeric in the other.

bind_rows(survey1, survey2)
Error in `bind_rows()`:
! Can't combine `..1$Q2a` <character> and `..2$Q2a` <double>.

This happened here because c() automatically assigned a type based on the values it observed. When we read in datasets, read.csv() and related functions will also use the values to determine the type. Differences in responses between two datasets, especially for open-response questions, can lead to different types assigned to the two datasets’ columns.

To fix this, simply coerce the problem column in one dataset to match the type of that column in the other dataset. Then, use bind_rows().

survey2b <- 
  survey2 |>
  mutate(Q2a = as.character(Q2a))

surveys <- bind_rows(survey1, survey2b)

  Q1 Q2       Q2a Q3    Q3a
1  1  3 Sometimes  6  Never
2  1  1      <NA>  7 Always
3  2  1      <NA>  4   <NA>
4  4  5      <NA>  2   <NA>
5  4  3         2  5   <NA>

This works well enough if we have only one or two type conflicts, but what if we have many?

We could simplify the coercion process by coercing every column to character, appending the datasets, and then asking R to reinterpret the column types. With this simplification comes the drawback that more complicated column types (namely, factors and lists) will lose some information in this coercion.

To coerce each column in a dataframe to character, use across() to apply a function to all columns. Here, we can apply the function as.character (without trailing parentheses):

survey1 <- survey1 |> mutate(across(.fns = as.character))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `across(.fns = as.character)`.
Caused by warning:
! Using `across()` without supplying `.cols` was deprecated in dplyr 1.1.0.
ℹ Please supply `.cols` instead.
survey2 <- survey2 |> mutate(across(.fns = as.character))

Then, append the datasets:

surveys <- bind_rows(survey1, survey2)

  Q1 Q2       Q2a Q3    Q3a
1  1  3 Sometimes  6  Never
2  1  1      <NA>  7 Always
3  2  1      <NA>  4   <NA>
4  4  5      <NA>  2   <NA>
5  4  3         2  5   <NA>

Since every column is a character, we cannot use functions that take numeric arguments, like mean():

Warning in mean.default(surveys$Q1): argument is not numeric or logical: returning NA
[1] NA

Column Q1 contains only numbers, so it should be of type numeric. We could go through each column and convert the type (surveys$Q1 <- as.numeric(surveys$Q1), etc.), or we could make use of type_convert() from the readr package. This function guesses the type of each column and converts it for us, letting us know which type it chose for each column.


surveys <- type_convert(surveys)

── Column specification ────────────────────────────────────────────────────────────────────────────────────────────────
  Q1 = col_double(),
  Q2 = col_double(),
  Q2a = col_character(),
  Q3 = col_double(),
  Q3a = col_character()
[1] 2.4

17.3 Exercises

  1. Merge the beaver1 and beaver2 datasets on the time column, and include all rows.

    • What do you notice about the column names before and after merging? Which columns gained suffixes, and which did not?
  2. Append the rows of beaver2 to beaver1. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.