13 Merging
We often find we want to combine the data in two separate data sets, in order to do some analysis. This is often referred to as a merge or a join.
There are two very straightforward cases to consider first:
- Adding the observations in one data set as new observations in a second data set. This is sometimes also called “appending” data sets.
- Adding the variables in one data set as new variables to the observations in a second data set. The term “merge” is sometimes reserved to mean this, and is sometimes also called a “match merge”.
13.1 Appending Data Set Observations
Consider an example. We can start with the mtcars
data, and construct
two data frames - one with manual transmission cars, the other with
automatic transmission cars.
manual <- subset(mtcars, am==1)
automatic <- subset(mtcars, am==0)
Here there are 13 observations in the data frame manual
and 19 observations in the data frame automatic
.
Both data frames have 11 variables/columns.
To put both types of cars in one data frame, the simplest approach
is to use the rbind
function.
allcars <- rbind(manual, automatic)
# Check the number of observations
nrow(allcars) == nrow(manual) + nrow(automatic)
[1] TRUE
13.1.1 Some rbind
Details
13.1.1.1 Matching Variables
A limitation of rbind
is that both data frames must include
all the same variables.
manual_disp <- subset(manual, select=c("mpg", "disp"))
automatic_hp <- subset(automatic, select=c("mpg", "hp"))
diffvars <- rbind(manual_disp, automatic_hp)
Error in match.names(clabs, names(xi)): names do not match previous names
(We will see a solution to this problem later, using merge
.)
13.1.1.2 Variable Order
Variables do not need to appear in the same order.
manual <- subset(manual, select=c("mpg", "disp"))
automatic <- subset(automatic, select=c("disp", "mpg"))
lessvars <- rbind(manual, automatic)
head(lessvars)
mpg disp
Mazda RX4 21.0 160.0
Mazda RX4 Wag 21.0 160.0
Datsun 710 22.8 108.0
Fiat 128 32.4 78.7
Honda Civic 30.4 75.7
Toyota Corolla 33.9 71.1
nrow(lessvars)
[1] 32
ncol(lessvars)
[1] 2
13.1.1.3 Variable Types
If variables with the same name are not of the same type, they
will be coerced to the most general type. Here, numeric
values in the manual
data frame are coerced to character
values by rbind
.
automatic$disp <- as.character(automatic$disp)
str(automatic)
'data.frame': 19 obs. of 2 variables:
$ disp: chr "258" "360" "225" "360" ...
$ mpg : num 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 ...
difftypes <- rbind(manual, automatic)
str(difftypes)
'data.frame': 32 obs. of 2 variables:
$ mpg : num 21 21 22.8 32.4 30.4 33.9 27.3 26 30.4 15.8 ...
$ disp: chr "160" "160" "108" "78.7" ...
Notice that you do not get any warning that the type has been changed!
13.2 Merging Data Set Variables
A simple merge adds the data values in one data set as new variables to the observations in another data set.
Consider the classic sleep study, where we have observations on ten subjects. In the first data set, we have data on how much extra sleep they got with drug A, while in the second data set we have data on how much extra sleep they got with drug B.
A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))
# for clarity, rename `extra`
names(A) <- c("ID", "extra_A")
names(B) <- c("ID", "extra_B")
This is especially simple to merge:
merge(A,B)
ID extra_A extra_B
1 1 0.7 1.9
2 10 2.0 3.4
3 2 -1.6 0.8
4 3 -0.2 1.1
5 4 -1.2 0.1
6 5 -0.1 -0.1
7 6 3.4 4.4
8 7 3.7 5.5
9 8 0.8 1.6
10 9 0.0 4.6
Two things make this example simple.
- Every observation in data set A has a matching observation in data set B, and vice versa.
- The only variable the two data sets have in common is a unique identifier (a "key") that can be used to match the observations in one data set to the observations in the other.
13.2.1 Some merge
Details
We can dig into the details of merges by first examining what happens when the conditions above are not met.
13.2.1.1 Common Variables That Are Not Keys
Suppose we had not renamed the extra
variables.
A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))
Now our simple merge
specification no longer produces
the desired data set.
merge(A,B)
ID extra
1 5 -0.1
By default, merge
seeks to match observations based
on all the variables common to both data sets. In the
first example, only ID
was common (because we renamed
extra
), but in this example
both ID
and extra
are common.
Because ID
appears first in the data set, the merge
procedure first looks for a matching ID
. If that is
found, then the procedure looks for an observation with
a matching extra
value. Only one observation matches
both ID
and extra
across both data sets!
Notice that unmatched observations from both data sets were dropped.
We can control which variables are used for matching (and
their order of use) with the by
parameter.
merge(A, B, by="ID")
ID extra.x extra.y
1 1 0.7 1.9
2 10 2.0 3.4
3 2 -1.6 0.8
4 3 -0.2 1.1
5 4 -1.2 0.1
6 5 -0.1 -0.1
7 6 3.4 4.4
8 7 3.7 5.5
9 8 0.8 1.6
10 9 0.0 4.6
Notice that even though extra
appears in both data sets,
it is no longer used to match observations. Instead, both
versions of extra
are kept, but now as extra.x
(from A)
and extra.y
(from B). Any variables not used for matching
are kept - an attempt is made to rename them if necessary.
13.2.1.2 Unmatched Observations
Now suppose that not every observation has a match.
A <- subset(sleep, group==1, select=c("ID","extra"))
B <- subset(sleep, group==2, select=c("ID","extra"))
# just some observations from B
B <- B[1:5, ]
We only want to match on ID
, so we use the by
parameter.
But recall that only observations with matches are kept by
default. We often additionally want to keep all observations
from both data sets.
merge(A, B, by="ID", all=TRUE)
ID extra.x extra.y
1 1 0.7 1.9
2 2 -1.6 0.8
3 3 -0.2 1.1
4 4 -1.2 0.1
5 5 -0.1 -0.1
6 6 3.4 NA
7 7 3.7 NA
8 8 0.8 NA
9 9 0.0 NA
10 10 2.0 NA
Notice that the observations from B have a value for extra.y
,
but the observations from A have no extra.y
where there was
no matching observation in B (ID’s 6 through 10). Instead
these have been filled in with NA
, the missing value.
13.3 Merge to Append
We can use merge
to get around some of the difficulty using rbind
to append data sets when they don’t include all the same variables.
The crux here is to find a unique identifier in both data sets so that none of the observations actually match. The (inobvious) identifier here is the row names in each data set.
manual_disp <- subset(mtcars, am==1, select=c("mpg", "disp"))
automatic_hp <- subset(mtcars, am==0, select=c("mpg", "hp"))
head(manual_disp) # to show row.names
mpg disp
Mazda RX4 21.0 160.0
Mazda RX4 Wag 21.0 160.0
Datsun 710 22.8 108.0
Fiat 128 32.4 78.7
Honda Civic 30.4 75.7
Toyota Corolla 33.9 71.1
morecars <- merge(manual_disp, automatic_hp, by="row.names", all=TRUE)
head(morecars, 10)
Row.names mpg.x disp mpg.y hp
1 AMC Javelin NA NA 15.2 150
2 Cadillac Fleetwood NA NA 10.4 205
3 Camaro Z28 NA NA 13.3 245
4 Chrysler Imperial NA NA 14.7 230
5 Datsun 710 22.8 108.0 NA NA
6 Dodge Challenger NA NA 15.5 150
7 Duster 360 NA NA 14.3 245
8 Ferrari Dino 19.7 145.0 NA NA
9 Fiat 128 32.4 78.7 NA NA
10 Fiat X1-9 27.3 79.0 NA NA
This include all our observations and all our variables. However,
it leaves mpg
divided into two columns. If we realize that
mpg
is unique within car type (row.names
) because car type
is itself a unique identifier, then we can include mpg
in the
by
specification.
morecars <- merge(manual_disp, automatic_hp, by=c("row.names","mpg"), all=TRUE)
head(morecars, 10)
Row.names mpg disp hp
1 AMC Javelin 15.2 NA 150
2 Cadillac Fleetwood 10.4 NA 205
3 Camaro Z28 13.3 NA 245
4 Chrysler Imperial 14.7 NA 230
5 Datsun 710 22.8 108.0 NA
6 Dodge Challenger 15.5 NA 150
7 Duster 360 14.3 NA 245
8 Ferrari Dino 19.7 145.0 NA
9 Fiat 128 32.4 78.7 NA
10 Fiat X1-9 27.3 79.0 NA
13.4 Merging Exercises
- Merge the
beaver1
andbeaver2
datasets on thetime
column, and include all rows.
- Bonus: Add a column called
warmer
that has a 1 whenever beaver1’s temperature was higher, and a 2 whenever beaver2’s temperature was higher. Ignore missing values. (If beaver1 is missing and beaver2 is not, the new column should say 2.) If the two beavers’ temperatures are equal, make the value ofwarmer
missing.
Append the rows of
beaver2
tobeaver1
. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.Combine
state.abb
,state.division
, andstate.name
into a single dataframe.- Drop all rows where the state name is more than one word.
- Merge this dataframe with
state.x77
. - Bonus: Add two columns corresponding to the two objects in the list
state.center
.