Combining datasets is a very common task, and one that’s easy to do if you understand the structure of the datasets you are trying to combine. However, if you’ve misunderstood the structure of the datasets you can end up with a dataset that makes no sense at all. Combining datasets will sometimes reveal errors in the datasets you’re trying to combine.
Terminology Alert
This chapter depends heavily on the concepts and terminology introduced in Hierarchical Data Concepts, so be sure you’ve read that section before proceeding.
Stata always works with one dataset at a time. So when we talk about combining datasets we mean taking a dataset that’s in memory, what Stata calls the master dataset, and combining it with a dataset on disk, what Stata calls the using dataset for reasons that will become obvious when you see the command syntax. When you’re done, you’ll have a single dataset in memory again.
Always think through what the resulting dataset should look like before combining two datasets. If the resulting data set won’t have a consistent, logical structure you probably need to rethink what you’re doing.
How you combine the two data sets depends on what the using data set adds to the master data set. We’ll discuss the most common scenarios, which are:
Adding observations
Adding variables
Adding level one units to existing level two units
Adding variables for different levels
7.1 Setting Up
Start a do file called combine.do. Don’t worry about loading any data yet–in this chapter almost every example will involve different data sets.
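The log shown next would be opened near the top of that do file; a typical setup is something like this (the log file name is an assumption):

```stata
capture log close          // close any log left open by a previous run
log using combine, replace
```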
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/combine.do
log type: smcl
opened on: 29 Dec 2025, 13:22:54
If you haven’t downloaded the example files yet, you may want to now so you don’t have to specify a full URL for every file. You can do that with:
net get dws, from(https://ssc.wisc.edu/sscc/stata/)
That will put the example files in Stata’s working directory, so make sure it’s set to where you want them to be.
7.2 Adding Observations
If the using dataset adds more observations to the master dataset, then this is a job for append. Using append makes sense if the master dataset and the using dataset contain the same kinds of things, but not the same things. The append command simply adds the using dataset to the end of the master data set.
Suppose that instead of a single file containing the ACS sample you were given acs_part1.dta and acs_part2.dta, with each file containing about half of the observations. Load acs_part1, then use append to add acs_part2, combining them into a single dataset:
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1
describe, short
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1.dta
Observations: 13,705
Variables: 9 11 Jan 2023 12:08
Sorted by: household person
append using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part2
(label edu_label already defined)
(label race_label already defined)
(label marital_status_label already defined)
describe, short
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1.dta
Observations: 27,410
Variables: 9 11 Jan 2023 12:08
Sorted by:
Note: Dataset has changed since last saved.
If a variable only exists in one of the two datasets, observations from the other dataset will have missing values for that variable. Make sure variables have the same name in both files before appending them, or append will treat them as different variables. Of course that assumes they actually measure the same thing in the same way. The warning you got about labels already being defined tells you those variables have value labels defined in both files, and you should make sure that they agree about what the values mean.
7.3 Adding Variables
If the using dataset adds more variables to the master dataset and observations represent the same things in both datasets, then this is a job for a one-to-one merge. The merge command combines datasets by combining observations that have the same value of an identifier variable or variables, so the result has all the variables from both files.
Suppose you were given the data files acs_demographics.dta and acs_ses.dta, containing demographic information and socio-economic status (SES) information respectively about the ACS respondents.
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics.dta
Observations: 27,410
Variables: 6 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
age byte %10.0g Age (original)
race byte %25.0g race_label
Race Recode 1
marital_status byte %13.0g marital_status_label
Marital Status
hispanic float %9.0g Person is hispanic
------------------------------------------------------------------------------------------
Sorted by: household person
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_ses.dta
Observations: 27,410
Variables: 4 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
edu byte %24.0g edu_label
Educational Attainment
income long %10.0g Person's Total Income in 1999
------------------------------------------------------------------------------------------
Sorted by: household person
You can use merge to combine them into a single dataset:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics
merge 1:1 household person using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_ses
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
In the merge command, 1:1 means you are doing a one-to-one merge: one respondent’s demographic information will be matched with one respondent’s SES information. Next come the identifier variables, two in this case, that tell merge which observations should be matched: observations with the same value for both household and person so we’re sure they contain information about the same respondent. Because we’ve specified that this is a 1:1 merge, the identifier variable(s) must uniquely identify observations in both data sets. We’ll talk about handling duplicate identifiers later. The identifier variables must exist in both data sets, and have the same names, but in most cases all the other variables should have different names.
If an observation in one dataset does not match any observation in the other dataset, it will get missing values for all the variables from that dataset. How successful you are at matching observations can sometimes affect your entire research agenda, so Stata both gives you a report and creates a new variable, _merge, that tells you whether a given observation matched or not. In this case, all the observations matched and thus got a 3 for _merge. A 1 means the observation came from the master dataset but did not match anything in the using dataset; a 2 means the observation came from the using dataset but did not match anything in the master dataset. Note that you cannot carry out another merge until you drop or rename the _merge variable so Stata can create a new one.
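For example, the simplest way to clear the way for another merge is:

```stata
* remove the old merge result so the next merge can create a fresh _merge
drop _merge
```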
The resulting dataset contains all the variables from both acs_demographics and acs_ses:
describe
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics.dta
Observations: 27,410
Variables: 9 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
age byte %10.0g Age (original)
race byte %25.0g race_label
Race Recode 1
marital_status byte %13.0g marital_status_label
Marital Status
hispanic float %9.0g Person is hispanic
edu byte %24.0g edu_label
Educational Attainment
income long %10.0g Person's Total Income in 1999
_merge byte %23.0g _merge Matching result from merge
------------------------------------------------------------------------------------------
Sorted by: household person
Note: Dataset has changed since last saved.
Exercise
Examine the following pairs of datasets: acs_race and acs_education; and acs_adults and acs_children. Determine the appropriate method for combining each pair into a single dataset and then do so.
Solution
If you’ve read the entire book to this point you’re very familiar with these datasets, so there’s no need to do anything formal to figure out what they contain. Just open the data browser and look. (If you haven’t read the entire book and want to learn about tools for figuring out a dataset, read the chapter First Steps with Your Data.)
acs_race and acs_education contain different variables for the same people, so they need to be merged.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_race
merge 1:1 household person using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_education
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
acs_adults and acs_children contain the same variables for different people, so they need to be appended.
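Assuming the files are stored in the same location as the other examples, that looks like:

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_adults
append using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_children
```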
7.4 Adding Level One Units to Existing Level Two Units
Next consider the datasets nlsy1979 and nlsy1980. They each contain one year’s worth of data from our NLSY extract. Is combining them a job for append or for merge?
The answer depends on whether you want the resulting data set to be in long form or in wide form. In the NLSY, a level two unit is a person and a level one unit is a person-year combination, so adding nlsy1980 to nlsy1979 is adding new level one units (years) to the existing level two units (people). In long form each level one unit gets its own observation, so adding nlsy1980 in long form adds observations. This is a job for append. In wide form, each level two unit gets its own observation, but each level one unit gets its own set of variables. Thus adding nlsy1980 in wide form adds variables; a job for merge.
The only complication is the level one identifier, year. Right now it is found only in the filenames of the two data sets, as is common. In long form, the level one identifier needs to be a variable. In wide form, it needs to be a suffix at the end of the names of all the level one variables. Either way that needs to be done before combining the data sets, or you’ll have no way of knowing whether a value is from 1979 or 1980.
7.4.1 Long Form
First combine the two files using append so the result is in long form. Begin by loading nlsy1980, creating a year variable set to 1980, and saving the results:
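The commands for that step aren't shown above; a sketch of what they would look like (the saved file name nlsy1980_append is the one the append command uses later):

```stata
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1980
gen year = 1980                 // record the level one identifier as a variable
save nlsy1980_append, replace
list in 1/2, ab(30)
```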
+------------------------------------------------------+
| id year_of_birth edu income age year |
|------------------------------------------------------|
1. | 1 58 . . 22 1980 |
2. | 2 59 9TH GRADE 5000 21 1980 |
+------------------------------------------------------+
Next, load nlsy1979 and create a year variable set to 1979:
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1979
gen year = 1979
list in 1/2, ab(30)
+-------------------------------------------------------+
| id year_of_birth edu income age year |
|-------------------------------------------------------|
1. | 1 58 12TH GRADE 4620 21 1979 |
2. | 2 59 9TH GRADE 4000 20 1979 |
+-------------------------------------------------------+
Now combine them with append:
append using nlsy1980_append
(label edulabel already defined)
This will give you a dataset with first all the 1979 observations and then all the 1980 observations. It’s often useful to have all the observations for a given person together and in chronological order, which you can get by sorting:
sort id year
list in 1/4, ab(30)
+-------------------------------------------------------+
| id year_of_birth edu income age year |
|-------------------------------------------------------|
1. | 1 58 12TH GRADE 4620 21 1979 |
2. | 1 58 . . 22 1980 |
3. | 2 59 9TH GRADE 4000 20 1979 |
4. | 2 59 9TH GRADE 5000 21 1980 |
+-------------------------------------------------------+
7.4.2 Wide Form
Now combine the two files using merge so the result is in wide form. Begin by loading nlsy1980, but this time instead of creating a variable to store 1980 you need to add 1980 to the names of all the level one variables: edu, income, and age. You could do that with three rename commands (e.g. rename edu edu1980) but you can also rename them as a group:
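A sketch of that step, assuming the level one variables run from edu through age in the dataset's variable order (the saved file name matches the merge command below, and renaming the 1979 file's variables with a 1979 suffix is an assumed parallel step):

```stata
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1980
rename edu-age =1980            // add the 1980 suffix to edu, income, and age
save nlsy1980_merge, replace

use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1979, clear
rename edu-age =1979            // same treatment for the 1979 level one variables
```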
This rename command first uses variable list syntax to specify the variables to be acted on, edu-age, and then specifies that 1980 should be added to the end of the existing variable names with =1980.
Now you’re ready to combine them with merge. This will again be a one-to-one merge, since one person’s data from 1979 is being combined with one person’s data from 1980.
merge 1:1 id using nlsy1980_merge
(label edulabel already defined)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 12,686 (_merge==3)
-----------------------------------------
Exercise
The files nlsy7980 and nlsy8182 each contain two years of data (two level one units per person). Combine them into either long form or wide form, using reshape to make the two files' structures consistent before combining.
Solution
Again, the NLSY is a familiar dataset, so a look at the data browser will probably tell you everything you need to know about these two files. But check on the identifiers to be absolutely sure you understand the structure properly.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 25372 0
--------------------------------------
nlsy7980 has one observation per person (id) per year, so it’s in long form. But nlsy8182 has just one observation per person:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
A look at the variable names confirms it still has two years of data, just in wide form:
describe
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182.dta
Observations: 12,686
Variables: 8 27 Dec 2022 13:13
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
edu1981 float %24.0g edulabel 1981 edu
income1981 float %9.0g 1981 income
age1981 float %9.0g 1981 age
edu1982 float %24.0g edulabel 1982 edu
income1982 float %9.0g 1982 income
age1982 float %9.0g 1982 age
year_of_birth float %16.0g
------------------------------------------------------------------------------------------
Sorted by: id
If you want the result to be in long form, then you need to reshape nlsy8182 to long form so the two files have the same structure:
reshape long edu income age, i(id) j(year)
(j = 1981 1982)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 12,686 -> 25,372
Number of variables 8 -> 6
j variable (2 values) -> year
xij variables:
edu1981 edu1982 -> edu
income1981 income1982 -> income
age1981 age1982 -> age
-----------------------------------------------------------------------------
Now there’s one observation per person per year just like in nlsy7980:
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 25372 0
--------------------------------------
Combining them means adding observations, so this is a job for append:
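With the reshaped nlsy8182 in memory, a sketch of the append, followed by a sort so each person's years are together and in order (file location assumed):

```stata
append using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
sort id year
list in 1/4, ab(30)
```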
+-------------------------------------------------------+
| id year edu income age year_of_birth |
|-------------------------------------------------------|
1. | 1 1979 12TH GRADE 4620 21 58 |
2. | 1 1980 . . 22 58 |
3. | 1 1981 12TH GRADE 5000 23 58 |
4. | 1 1982 . . 24 58 |
+-------------------------------------------------------+
If you want the result to be in wide form, you need to reshape nlsy7980 to wide form so the two data sets have the same structure.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
reshape wide edu income age, i(id) j(year)
(j = 1979 1980)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 25,372 -> 12,686
Number of variables 6 -> 8
j variable (2 values) year -> (dropped)
xij variables:
edu -> edu1979 edu1980
income -> income1979 income1980
age -> age1979 age1980
-----------------------------------------------------------------------------
Now it has one observation per id (person), just like nlsy8182:
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
nlsy8182 will add new variables to the data set, with one person’s 1979 and 1980 data matching with the same person’s 1981 and 1982 data, so this is a job for a one-to-one merge with id as the matching variable:
merge 1:1 id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182
(label edulabel already defined)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 12,686 (_merge==3)
-----------------------------------------
Since there's just one observation per person, no sorting is needed.
Note that year_of_birth only appears once. This makes sense since it is a person-level variable. But what would happen if the two datasets had different values for year_of_birth? By default, if a variable appears in both datasets the value in the master dataset is kept and the value in the using dataset is discarded. You can reverse that with the update option, i.e. update the values in the master dataset by using the values in the using dataset where they disagree. If the values should be the same (like in this case) but you think there’s a possibility of error, you should give the variables different names before running merge so they’re both preserved in the combined dataset. Then look at any observations where they disagree to try to identify what happened and which value you want to use.
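A sketch of that approach applied to the wide-form merge above (the renamed variable is hypothetical):

```stata
* keep the master's copy under a different name so both survive the merge
rename year_of_birth year_of_birth_79
merge 1:1 id using nlsy1980_merge
* inspect any observations where the two copies disagree
list id year_of_birth_79 year_of_birth if year_of_birth_79 != year_of_birth
```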
7.5 Adding Variables for Different Levels
If you need to combine hierarchical data where the master dataset contains information on level two units and the using dataset contains information on level one units, this is a job for a one-to-many merge. A one-to-many merge combines observations just like a one-to-one merge, but the observation for one level two unit is combined with many observations containing the corresponding level one units. (“Many” here really just means “possibly more than one.”)
The data set acs_households.dta contains information about the households in our 2000 ACS extract (in particular, their household income). There is one observation per household.
To combine it with the rest of the ACS data, the observation for one household is combined with the observations for all the people in it. In practice, this means copying each value of household_income to the rows representing the people in the household.
To tell Stata you’re doing a one-to-many merge, use 1:m. The key variable is now just household, not household and person like in prior merges.
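The household-level file needs to be loaded first (this step isn't shown above):

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_households
```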
merge 1:m household using https://sscc.wisc.edu/sscc/pubs/dws/data/acs
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
sort household person
list household household_income person age race income in 1/5, ab(30)
+--------------------------------------------------------------+
| household household_income person age race income |
|--------------------------------------------------------------|
1. | 37 20000 1 20 White 10000 |
2. | 37 20000 2 19 White 5300 |
3. | 37 20000 3 19 Black 4700 |
4. | 241 32500 1 50 White 32500 |
5. | 242 30000 1 29 White 30000 |
+--------------------------------------------------------------+
If the master dataset contains information about level one units and the using dataset contains information about level two units, then this is a job for a many-to-one merge. Conceptually a many-to-one merge is identical to a one-to-many merge, just swapping which dataset you have in memory and which you’re adding to it.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs
merge m:1 household using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_households
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
list household person age race income household_income in 1/5, ab(30)
+--------------------------------------------------------------+
| household person age race income household_income |
|--------------------------------------------------------------|
1. | 37 1 20 White 10000 20000 |
2. | 37 2 19 White 5300 20000 |
3. | 37 3 19 Black 4700 20000 |
4. | 241 1 50 White 32500 32500 |
5. | 242 1 29 White 30000 30000 |
+--------------------------------------------------------------+
(The fact that this dataset didn’t need to be sorted to keep the households together tells you something about how Stata builds merged datasets, but it’s not very interesting.)
Exercise
nlsy_person contains information about the people in our NLSY extract that does not change over time, while nlsy_person_year contains only variables that change from year to year. Combine them.
Solution
nlsy_person has one observation per person:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
While nlsy_person_year has one observation per person per year:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person_year
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
That makes combining them a job for either a one-to-many or a many-to-one merge, depending on which dataset you start with.
nlsy_person_year is currently in memory and has many observations per person, which must be matched with one observation in nlsy_person. That makes this a many-to-one merge:
merge m:1 id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
Result Number of obs
-----------------------------------------
Not matched 0
Matched 241,034 (_merge==3)
-----------------------------------------
But if you start with nlsy_person, it has one observation per person which must be matched with many observations in nlsy_person_year, making this a one-to-many merge:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
merge 1:m id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person_year
Result Number of obs
-----------------------------------------
Not matched 0
Matched 241,034 (_merge==3)
-----------------------------------------
7.6 Combining Data from Different Sources
Next we'll do an example that illustrates some of the issues that frequently arise when combining data sets from different sources.
One weakness of the NLSY data extract we've been using is that incomes from different time periods are not directly comparable, due to inflation. To adjust them for inflation, we need information about the level of inflation in each year. The fredcpi data set contains the average Consumer Price Index for All Urban Consumers for every year from 1970 to 2019. It was obtained from Federal Reserve Economic Data (FRED). If you have a FRED API key (and if you are interested in the US economy you probably want one), you can obtain the data directly from FRED with:
import fred CPIAUCSL, daterange(1970 2019) aggregate(annual,avg)
If you click File, Import, Federal Reserve Economic Data (FRED) you can search for and download a variety of economic data.
Taking this data and adding nlsy to it is a job for a one-to-many merge: one year’s CPI data will match with many people’s NLSY data for that year. Note how this treats years as the level two units and people as level one units! For most purposes it’s more useful to think of people as the level two units in the NLSY, but it’s just as valid to group person-year combinations by year instead.
The fredcpi data set will need some preparation before merging; load it and take a look using the data browser:
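Assuming fredcpi is stored with the other example files:

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/fredcpi
```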
CPIAUCSL contains the Consumer Price Index we want, but since the subtleties of the different indexes don’t concern us, rename it to just cpi:
rename CPIAUCSL cpi
As expected we have one observation per year. Both datestr and daten are year identifiers, just in different forms. The daten variable is an official Stata date, which consists of a numeric variable recording the number of days since January 1, 1960 and a format which causes that number to be displayed as a human-readable date. If you’re interested in learning more about Stata dates see Working with Dates in Stata.
The trouble is, neither of these match the year variable in the NLSY data so you’ll need to create a variable that does. The year() function takes a Stata date and extracts the year from it:
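Applied to daten, that's:

```stata
gen year = year(daten)
```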
Now that you have year, you no longer need datestr and daten. Since both names start with date, you can drop them both with a single wildcard:
drop date*
You’re now ready to merge in nlsy:
merge 1:m year using nlsy
Result Number of obs
-----------------------------------------
Not matched 30
from master 30 (_merge==1)
from using 0 (_merge==2)
Matched 241,034 (_merge==3)
-----------------------------------------
This time we see something new: not everything matched. Is this a problem? It certainly could be! Take a look at all the observations that didn’t match with:
browse if _merge!=3
For the book we’ll list some of them and also look at the values of year:
First note that all the observations that didn’t match are from the master data set, fredcpi. Next note the years: many of them come from before or after the period of our extract. Others come from the period when the NLSY only collected data every other year (if you hadn’t noticed this before, run use nlsy and tab year to see the years in which data were collected). Putting it all together, the unmatched observations are years from fredcpi that did not match anything in nlsy because there was no NLSY data for that year. This is not a problem at all, which highlights that a “successful” merge is not always one where all the observations match. On the other hand, it’s always worth investigating why observations don’t match. If you had tried to match by the original daten variable nothing would have matched, but that would have been a fixable problem.
The observations that did not match don’t represent people like the other observations do, which could cause a variety of problems down the road. One way to get rid of them would be to simply drop based on _merge:
drop if _merge != 3
(30 observations deleted)
A more efficient way is to tell the merge command you only want observations that match using the keep() option:
merge 1:m year using nlsy, keep(match)
Don’t use this approach until after you’ve looked at the observations that didn’t match and are confident they don’t indicate a problem.
The keep() option will also accept master and using, and you can list more than one. In this case, keep(match using) would mean you want to keep observations that match and observations from the using data set, which would be safer if some of the NLSY data might be from years not contained in fredcpi.
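That safer variant would look like:

```stata
merge 1:m year using nlsy, keep(match using)
```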
To adjust monetary quantities for inflation you need to pick a reference year and convert them to dollars in that year. We'll use the year 2000. To convert dollars in some source year to dollars in a destination year, multiply them by the CPI in the destination year divided by the CPI in the source year. In this data set, "CPI in the source year" is just the cpi variable, but you need a separate variable for "CPI in the year 2000" that contains the same number for all observations. This is like creating a variable based on the value of another variable for a specific observation, except it applies to the entire data set, so you don't need by:
gen cpi_if_2000 = cpi if year==2000
egen cpi2000 = mean(cpi_if_2000)
gen income2000 = income * cpi2000/cpi
To see what a difference adjusting for inflation makes, run:
graph hbar inc*, over(year)
This will show you the mean income in each year, both in the original dollars and in 2000 dollars. Note how a dollar in 1979 was worth more than twice what a dollar was worth in 2000, but the difference decreases until in 2000 income and income2000 are identical.
7.7 Dealing with Duplicate Identifiers
The bane of everyone who does merges is the error message “variable id does not uniquely identify observations in the master data” and its variants. Sometimes this means you made a simple mistake: mixing up 1:m and m:1, for example, or specifying just one identifier when you need two. Sometimes it means you’ve badly misunderstood the structure of the data and need to rethink everything. But often it means there is a problem with the data itself: duplicate IDs.
If you follow the steps we recommend in First Steps with your Data, you’ll find out about any duplicates in the process of finding the data set’s identifiers. (You’ll also avoid badly misunderstanding the structure of the data). However, it’s when you merge that duplicates become a problem that must be resolved. Fortunately, you now have the skills to resolve them.
The one thing you should not do is make the error message go away by changing the type of merge you perform. If your merge should be one-to-one but you have duplicate identifiers in the master data set, changing it to many-to-one may allow the merge to run, but the resulting data set won't make any sense. If you find yourself wanting to run a many-to-many merge (m:m), step away from the computer, go for a walk, and rethink what you're doing. We've never seen a many-to-many merge that wasn't a mistake; it's hard to think of anything in the real world that could be modeled by what Stata does when you specify a many-to-many merge. Even the Stata documentation describes many-to-many merges as "a bad idea."
The data sets we’ve been working with are carefully curated and will probably never have a duplicate identifier. However, they’re very common in administrative and other real-world data. So we’ve prepared a data set that does have some: merge_error.dta. This is a fictitious data set of students, their demographics, and their scores on a standardized test. Pretend you were trying to merge it with another similar data set and got the dreaded “variable id does not uniquely identify observations in the master data.” Now what?
The first thing to do is load the data set that’s causing the error:
clear
use merge_error
describe
Contains data from merge_error.dta
Observations: 103
Variables: 5 1 Oct 2015 09:36
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
id float %9.0g
female float %9.0g
grade int %8.0g
race int %8.0g
score int %8.0g
------------------------------------------------------------------------------------------
Sorted by:
The id variable is clearly intended to be an identifier, but recall that you can use duplicates report to check if it actually uniquely identifies observations:
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 91 0
2 | 12 6
--------------------------------------
It does not–or you would not have gotten that error message when you tried to merge–but just knowing this does not help you identify the problem. To do that, create a variable that tells you how many times each id is duplicated:
bysort id: gen copies = _N
Now you can examine the problem observations with browse or list:
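For example, you can list just the problem observations, using the ab(30) option so variable names aren’t abbreviated (one way to do it; browse if copies>1 works just as well):

list if copies>1, ab(30)

Because bysort already sorted the data by id, each set of duplicates will appear together.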
The first six rows consist of three pairs of observations that are completely identical. This is almost certainly a case of the same observations having been entered into the data set twice, a common data entry error. You can get rid of these duplicates by running:
duplicates drop
Duplicates in terms of all variables
(3 observations deleted)
This is a good outcome: not only is the problem easy to fix, you didn’t actually lose any data. Unfortunately it does not fix all the duplicates in this data set.
At this point copies is no longer accurate: it is still 2 for the observations where we dropped a duplicate observation. Update it before proceeding:
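Since the data are still sorted by id, one way to update it is to mirror the command that created it:

by id: replace copies = _N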
The remaining six rows with copies>1 are three pairs of observations with the same value of id but different values for one or more of the other variables. You should consider the possibility that there is some hierarchy involved that you were not aware of, such as some people taking multiple tests. However, there’s no indication of anything like that in this data set, and it would be unusual for it to affect so few observations. Almost certainly the duplicates are different people who were somehow given the same id, most likely due to a data entry error.
This is a bad outcome: Assuming you only have one person with an id of 64 in the other data set, you don’t know which of the two people with an id of 64 in this data set is the same person. You can resolve this problem by dropping both of the duplicate observations:
drop if copies>1
(6 observations deleted)
However, note that all of the duplicate observations have different values for race. This means that if you merged by id and race, they would be uniquely identified. This very much depends on the particulars of the data set: merging by id and female instead wouldn’t help much at all. On the other hand, if you had enough variables you might be able to match people even if the data sets didn’t have an identifier in common. Linking records without relying on identifiers is a very hot topic, but one we won’t address any further.
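A sketch of what such a merge might look like; the name of the using data set (other_data) is purely hypothetical:

merge 1:1 id race using other_data

This only works because id and race together uniquely identify observations, and only in this particular data set.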
Exercise
nlsy_error.dta is a version of our NLSY extract that has had duplicate identifiers introduced into it. Identify and address the errors so that the combination of id and year is a unique identifier again and could be used to carry out a one-to-one merge.
Solution
First load the data and confirm the problem:
clear
use nlsy_error
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240974 0
2 | 80 40
--------------------------------------
Now create a copies variable that identifies the problem observations and then examine them:
bysort id year: gen copies = _N
list if copies>1, ab(30)
Some of these are completely duplicate observations, like person 501 having two copies of their row for 1985, which can be remedied with duplicates drop:
duplicates drop
Duplicates in terms of all variables
(20 observations deleted)
Now recreate copies so you can examine the remaining problems:
by id year: replace copies = _N
list if copies>1, ab(30)
The remaining observations are not duplicates and suggest different people were assigned the same id by accident. For real work you might investigate further, including reaching out to the data provider for an explanation, but for now you can eliminate all the problem observations with:
drop if copies>1
(40 observations deleted)
Now you could run a one-to-one merge using id and year:
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240994 0
--------------------------------------