Combining datasets is a very common task, and one that’s easy to do if you understand the structure of the datasets you are trying to combine. However, if you’ve misunderstood the structure of the datasets you can end up with a dataset that makes no sense at all. Combining datasets will sometimes reveal errors in the datasets you’re trying to combine.
Terminology Alert
This chapter depends heavily on the concepts and terminology introduced in Hierarchical Data Concepts, so be sure you’ve read that section before proceeding.
Stata always works with one dataset at a time. So when we talk about combining datasets we mean taking a dataset that’s in memory, what Stata calls the master dataset, and combining it with a dataset on disk, what Stata calls the using dataset for reasons that will become obvious when you see the command syntax. When you’re done, you’ll have a single dataset in memory again.
Always think through what the resulting dataset should look like before combining two datasets. If the resulting data set won’t have a consistent, logical structure you probably need to rethink what you’re doing.
How you combine the two data sets depends on what the using data set adds to the master data set. We’ll discuss the most common scenarios, which are:
Adding observations
Adding variables
Adding level one units to existing level two units
Adding variables for different levels
7.1 Setting Up
Start a do file called combine.do. Don’t worry about loading any data yet–in this chapter almost every example will involve different data sets.
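The log shown next would be opened near the top of that do file; a typical setup is something like this (the log file name is an assumption):

```stata
capture log close          // close any log left open by a previous run
log using combine, replace
```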
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/d/dimond/kb/dws/combine.do
log type: smcl
opened on: 29 Dec 2025, 13:22:54
If you haven’t downloaded the example files yet, you may want to now so you don’t have to specify a full URL for every file. You can do that with:
net get dws, from(https://ssc.wisc.edu/sscc/stata/)
That will put the example files in Stata’s working directory, so make sure it’s set to where you want them to be.
7.2 Adding Observations
If the using dataset adds more observations to the master dataset, then this is a job for append. Using append makes sense if the master dataset and the using dataset contain the same kinds of things, but not the same things. The append command simply adds the using dataset to the end of the master data set.
Suppose that instead of a single file containing the ACS sample you were given acs_part1.dta and acs_part2.dta, with each file containing about half of the observations. Load acs_part1, then use append to add acs_part2, combining them into a single dataset:
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1
describe, short
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1.dta
Observations: 13,705
Variables: 9 11 Jan 2023 12:08
Sorted by: household person
append using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part2
(label edu_label already defined)
(label race_label already defined)
(label marital_status_label already defined)
describe, short
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_part1.dta
Observations: 27,410
Variables: 9 11 Jan 2023 12:08
Sorted by:
Note: Dataset has changed since last saved.
If a variable only exists in one of the two datasets, observations from the other dataset will have missing values for that variable. Make sure variables have the same name in both files before appending them, or append will treat them as different variables. Of course that assumes they actually measure the same thing in the same way. The warning you got about labels already being defined tells you those variables have value labels defined in both files, and you should make sure that they agree about what the values mean.
7.3 Adding Variables
If the using dataset adds more variables to the master dataset and observations represent the same things in both datasets, then this is a job for a one-to-one merge. The merge command combines datasets by combining observations that have the same value of an identifier variable or variables, so the result has all the variables from both files.
Suppose you were given the data files acs_demographics.dta and acs_ses.dta, containing demographic information and socio-economic status (SES) information respectively about the ACS respondents.
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics.dta
Observations: 27,410
Variables: 6 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
age byte %10.0g Age (original)
race byte %25.0g race_label
Race Recode 1
marital_status byte %13.0g marital_status_label
Marital Status
hispanic float %9.0g Person is hispanic
------------------------------------------------------------------------------------------
Sorted by: household person
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_ses.dta
Observations: 27,410
Variables: 4 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
edu byte %24.0g edu_label
Educational Attainment
income long %10.0g Person's Total Income in 1999
------------------------------------------------------------------------------------------
Sorted by: household person
You can use merge to combine them into a single dataset:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics
merge 1:1 household person using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_ses
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
In the merge command, 1:1 means you are doing a one-to-one merge: one respondent’s demographic information will be matched with one respondent’s SES information. Next come the identifier variables, two in this case, that tell merge which observations should be matched: observations with the same value for both household and person so we’re sure they contain information about the same respondent. Because we’ve specified that this is a 1:1 merge, the identifier variable(s) must uniquely identify observations in both data sets. We’ll talk about handling duplicate identifiers later. The identifier variables must exist in both data sets, and have the same names, but in most cases all the other variables should have different names.
If an observation in one dataset does not match any observation in the other dataset, it will get missing values for all the variables from that dataset. How successful you are at matching observations can sometimes affect your entire research agenda, so Stata both gives you a report and creates a new variable, _merge, that tells you whether a given observation matched or not. In this case, all the observations matched and thus got a 3 for _merge. A 1 means the observation came from the master dataset but did not match anything in the using dataset; a 2 means the observation came from the using dataset but did not match anything in the master dataset. Note that you cannot carry out another merge until you drop or rename the _merge variable so Stata can create a new one.
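For example, the simplest way to clear the way for another merge is:

```stata
* remove the old merge result so the next merge can create a fresh _merge
drop _merge
```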
The resulting dataset contains all the variables from both acs_demographics and acs_ses:
describe
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/acs_demographics.dta
Observations: 27,410
Variables: 9 11 Jan 2023 12:08
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
household double %8.0g Household serial number
person int %8.0g Person number in sample unit
age byte %10.0g Age (original)
race byte %25.0g race_label
Race Recode 1
marital_status byte %13.0g marital_status_label
Marital Status
hispanic float %9.0g Person is hispanic
edu byte %24.0g edu_label
Educational Attainment
income long %10.0g Person's Total Income in 1999
_merge byte %23.0g _merge Matching result from merge
------------------------------------------------------------------------------------------
Sorted by: household person
Note: Dataset has changed since last saved.
Exercise
Examine the following pairs of datasets: acs_race and acs_education; and acs_adults and acs_children. Determine the appropriate method for combining each pair into a single dataset and then do so.
Solution
If you’ve read the entire book to this point you’re very familiar with these datasets, so there’s no need to do anything formal to figure out what they contain. Just open the data browser and look. (If you haven’t read the entire book and want to learn about tools for figuring out a dataset, read the chapter First Steps with Your Data.)
acs_race and acs_education contain different variables for the same people, so they need to be merged.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_race
merge 1:1 household person using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_education
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
acs_adults and acs_children contain the same variables for different people, so they need to be appended.
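Assuming the files are stored in the same location as the other examples, that looks like:

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_adults
append using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_children
```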
7.4 Adding Level One Units to Existing Level Two Units
Next consider the datasets nlsy1979 and nlsy1980. They each contain one year’s worth of data from our NLSY extract. Is combining them a job for append or for merge?
The answer depends on whether you want the resulting data set to be in long form or in wide form. In the NLSY, a level two unit is a person and a level one unit is a person-year combination, so adding nlsy1980 to nlsy1979 is adding new level one units (years) to the existing level two units (people). In long form each level one unit gets its own observation, so adding nlsy1980 in long form adds observations. This is a job for append. In wide form, each level two unit gets its own observation, but each level one unit gets its own set of variables. Thus adding nlsy1980 in wide form adds variables; a job for merge.
The only complication is the level one identifier, year. Right now it is found only in the filenames of the two data sets, as is common. In long form, the level one identifier needs to be a variable. In wide form, it needs to be a suffix at the end of the names of all the level one variables. Either way that needs to be done before combining the data sets, or you’ll have no way of knowing whether a value is from 1979 or 1980.
7.4.1 Long Form
First combine the two files using append so the result is in long form. Begin by loading nlsy1980, creating a year variable set to 1980, and saving the results:
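The commands for that step aren't shown above; a sketch of what they would look like (the saved file name nlsy1980_append is the one the append command uses later):

```stata
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1980
gen year = 1980                 // record the level one identifier as a variable
save nlsy1980_append, replace
list in 1/2, ab(30)
```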
+------------------------------------------------------+
| id year_of_birth edu income age year |
|------------------------------------------------------|
1. | 1 58 . . 22 1980 |
2. | 2 59 9TH GRADE 5000 21 1980 |
+------------------------------------------------------+
Next, load nlsy1979 and create a year variable set to 1979:
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1979
gen year = 1979
list in 1/2, ab(30)
+-------------------------------------------------------+
| id year_of_birth edu income age year |
|-------------------------------------------------------|
1. | 1 58 12TH GRADE 4620 21 1979 |
2. | 2 59 9TH GRADE 4000 20 1979 |
+-------------------------------------------------------+
Now combine them with append:
append using nlsy1980_append
(label edulabel already defined)
This will give you a dataset with first all the 1979 observations and then all the 1980 observations. It’s often useful to have all the observations for a given person together and in chronological order, which you can get by sorting:
sort id year
list in 1/4, ab(30)
+-------------------------------------------------------+
| id year_of_birth edu income age year |
|-------------------------------------------------------|
1. | 1 58 12TH GRADE 4620 21 1979 |
2. | 1 58 . . 22 1980 |
3. | 2 59 9TH GRADE 4000 20 1979 |
4. | 2 59 9TH GRADE 5000 21 1980 |
+-------------------------------------------------------+
7.4.2 Wide Form
Now combine the two files using merge so the result is in wide form. Begin by loading nlsy1980, but this time instead of creating a variable to store 1980 you need to add 1980 to the names of all the level one variables: edu, income, and age. You could do that with three rename commands (e.g. rename edu edu1980) but you can also rename them as a group:
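A sketch of that step, assuming the level one variables run from edu through age in the dataset's variable order (the saved file name matches the merge command below, and renaming the 1979 file's variables with a 1979 suffix is an assumed parallel step):

```stata
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1980
rename edu-age =1980            // add the 1980 suffix to edu, income, and age
save nlsy1980_merge, replace

use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy1979, clear
rename edu-age =1979            // same treatment for the 1979 level one variables
```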
This rename command first uses variable list syntax to specify the variables to be acted on, edu-age, and then specifies that 1980 should be added to the end of the existing variable names with =1980.
Now you’re ready to combine them with merge. This will again be a one-to-one merge, since one person’s data from 1979 is being combined with one person’s data from 1980.
merge 1:1 id using nlsy1980_merge
(label edulabel already defined)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 12,686 (_merge==3)
-----------------------------------------
Exercise
The files nlsy7980 and nlsy8182 each contain two years of data (two level one units per person). Combine them into either long form or wide form, using reshape to make the two files' structures consistent before combining.
Solution
Again, the NLSY is a familiar dataset, so a look at the data browser will probably tell you everything you need to know about these two files. But check on the identifiers to be absolutely sure you understand the structure properly.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 25372 0
--------------------------------------
nlsy7980 has one observation per person (id) per year, so it’s in long form. But nlsy8182 has just one observation per person:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
A look at the variable names confirms it still has two years of data, just in wide form:
describe
Contains data from https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182.dta
Observations: 12,686
Variables: 8 27 Dec 2022 13:13
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
id float %9.0g ID# (1-12686) 79
edu1981 float %24.0g edulabel 1981 edu
income1981 float %9.0g 1981 income
age1981 float %9.0g 1981 age
edu1982 float %24.0g edulabel 1982 edu
income1982 float %9.0g 1982 income
age1982 float %9.0g 1982 age
year_of_birth float %16.0g
------------------------------------------------------------------------------------------
Sorted by: id
If you want the result to be in long form, then you need to reshape nlsy8182 to long form so the two files have the same structure:
reshape long edu income age, i(id) j(year)
(j = 1981 1982)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 12,686 -> 25,372
Number of variables 8 -> 6
j variable (2 values) -> year
xij variables:
edu1981 edu1982 -> edu
income1981 income1982 -> income
age1981 age1982 -> age
-----------------------------------------------------------------------------
Now there’s one observation per person per year just like in nlsy7980:
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 25372 0
--------------------------------------
Combining them means adding observations, so this is a job for append:
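With the reshaped nlsy8182 in memory, a sketch of the append, followed by a sort so each person's years are together and in order (file location assumed):

```stata
append using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
sort id year
list in 1/4, ab(30)
```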
+-------------------------------------------------------+
| id year edu income age year_of_birth |
|-------------------------------------------------------|
1. | 1 1979 12TH GRADE 4620 21 58 |
2. | 1 1980 . . 22 58 |
3. | 1 1981 12TH GRADE 5000 23 58 |
4. | 1 1982 . . 24 58 |
+-------------------------------------------------------+
If you want the result to be in wide form, you need to reshape nlsy7980 to wide form so the two data sets have the same structure.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy7980
reshape wide edu income age, i(id) j(year)
(j = 1979 1980)
Data Long -> Wide
-----------------------------------------------------------------------------
Number of observations 25,372 -> 12,686
Number of variables 6 -> 8
j variable (2 values) year -> (dropped)
xij variables:
edu -> edu1979 edu1980
income -> income1979 income1980
age -> age1979 age1980
-----------------------------------------------------------------------------
Now it has one observation per id (person), just like nlsy8182:
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
nlsy8182 will add new variables to the data set, with one person’s 1979 and 1980 data matching with the same person’s 1981 and 1982 data, so this is a job for a one-to-one merge with id as the matching variable:
merge 1:1 id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy8182
(label edulabel already defined)
Result Number of obs
-----------------------------------------
Not matched 0
Matched 12,686 (_merge==3)
-----------------------------------------
Since there's just one observation per person, no sorting is needed.
Note that year_of_birth only appears once. This makes sense since it is a person-level variable. But what would happen if the two datasets had different values for year_of_birth? By default, if a variable appears in both datasets the value in the master dataset is kept and the value in the using dataset is discarded. You can reverse that with the update option, i.e. update the values in the master dataset by using the values in the using dataset where they disagree. If the values should be the same (like in this case) but you think there’s a possibility of error, you should give the variables different names before running merge so they’re both preserved in the combined dataset. Then look at any observations where they disagree to try to identify what happened and which value you want to use.
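A sketch of that approach applied to the wide-form merge above (the renamed variable is hypothetical):

```stata
* keep the master's copy under a different name so both survive the merge
rename year_of_birth year_of_birth_79
merge 1:1 id using nlsy1980_merge
* inspect any observations where the two copies disagree
list id year_of_birth_79 year_of_birth if year_of_birth_79 != year_of_birth
```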
7.5 Adding Variables for Different Levels
If you need to combine hierarchical data where the master dataset contains information on level two units and the using dataset contains information on level one units, this is a job for a one-to-many merge. A one-to-many merge combines observations just like a one-to-one merge, but the observation for one level two unit is combined with many observations containing the corresponding level one units. (“Many” here really just means “possibly more than one.”)
The data set acs_households.dta contains information about the households in our 2000 ACS extract (in particular, their household income). There is one observation per household.
To combine it with the rest of the ACS data, the observation for one household is combined with the observations for all the people in it. In practice, this means copying each value of household_income to the rows representing the people in the household.
To tell Stata you’re doing a one-to-many merge, use 1:m. The key variable is now just household, not household and person like in prior merges.
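The household-level file needs to be loaded first (this step isn't shown above):

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs_households
```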
merge 1:m household using https://sscc.wisc.edu/sscc/pubs/dws/data/acs
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
sort household person
list household household_income person age race income in 1/5, ab(30)
+--------------------------------------------------------------+
| household household_income person age race income |
|--------------------------------------------------------------|
1. | 37 20000 1 20 White 10000 |
2. | 37 20000 2 19 White 5300 |
3. | 37 20000 3 19 Black 4700 |
4. | 241 32500 1 50 White 32500 |
5. | 242 30000 1 29 White 30000 |
+--------------------------------------------------------------+
If the master dataset contains information about level one units and the using dataset contains information about level two units, then this is a job for a many-to-one merge. Conceptually a many-to-one merge is identical to a one-to-many merge, just swapping which dataset you have in memory and which you’re adding to it.
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/acs
merge m:1 household using https://sscc.wisc.edu/sscc/pubs/dws/data/acs_households
Result Number of obs
-----------------------------------------
Not matched 0
Matched 27,410 (_merge==3)
-----------------------------------------
list household person age race income household_income in 1/5, ab(30)
+--------------------------------------------------------------+
| household person age race income household_income |
|--------------------------------------------------------------|
1. | 37 1 20 White 10000 20000 |
2. | 37 2 19 White 5300 20000 |
3. | 37 3 19 Black 4700 20000 |
4. | 241 1 50 White 32500 32500 |
5. | 242 1 29 White 30000 30000 |
+--------------------------------------------------------------+
(The fact that this dataset didn’t need to be sorted to keep the households together tells you something about how Stata builds merged datasets, but it’s not very interesting.)
Exercise
nlsy_person contains information about the people in our NLSY extract that does not change over time, while nlsy_person_year contains only variables that change from year to year. Combine them.
Solution
nlsy_person has one observation per person:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 12686 0
--------------------------------------
While nlsy_person_year has one observation per person per year:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person_year
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 241034 0
--------------------------------------
That makes combining them a job for either a one-to-many or a many-to-one merge, depending on which dataset you start with.
nlsy_person_year is currently in memory and has many observations per person, which must be matched with one observation in nlsy_person. That makes this a many-to-one merge:
merge m:1 id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
Result Number of obs
-----------------------------------------
Not matched 0
Matched 241,034 (_merge==3)
-----------------------------------------
But if you start with nlsy_person, it has one observation per person which must be matched with many observations in nlsy_person_year, making this a one-to-many merge:
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person
merge 1:m id using https://sscc.wisc.edu/sscc/pubs/dws/data/nlsy_person_year
Result Number of obs
-----------------------------------------
Not matched 0
Matched 241,034 (_merge==3)
-----------------------------------------
7.6 Combining Data from Different Sources
Next we'll do an example that illustrates some of the issues that frequently arise when combining data sets from different sources.
One weakness of the NLSY data extract we've been using is that incomes from different time periods are not directly comparable, due to inflation. To adjust them for inflation, we need information about the level of inflation in each year. The fredcpi data set contains the average Consumer Price Index for All Urban Consumers for every year from 1970 to 2019. It was obtained from Federal Reserve Economic Data (FRED). If you have a FRED API key (and if you are interested in the US economy you probably want one), you can obtain the data directly from FRED with:
import fred CPIAUCSL, daterange(1970 2019) aggregate(annual,avg)
If you click File, Import, Federal Reserve Economic Data (FRED) you can search for and download a variety of economic data.
Taking this data and adding nlsy to it is a job for a one-to-many merge: one year’s CPI data will match with many people’s NLSY data for that year. Note how this treats years as the level two units and people as level one units! For most purposes it’s more useful to think of people as the level two units in the NLSY, but it’s just as valid to group person-year combinations by year instead.
The fredcpi data set will need some preparation before merging; load it and take a look using the data browser:
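Assuming fredcpi is stored with the other example files:

```stata
clear
use https://sscc.wisc.edu/sscc/pubs/dws/data/fredcpi
```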
CPIAUCSL contains the Consumer Price Index we want, but since the subtleties of the different indexes don’t concern us, rename it to just cpi:
rename CPIAUCSL cpi
As expected we have one observation per year. Both datestr and daten are year identifiers, just in different forms. The daten variable is an official Stata date, which consists of a numeric variable recording the number of days since January 1, 1960 and a format which causes that number to be displayed as a human-readable date. If you’re interested in learning more about Stata dates see Working with Dates in Stata.
The trouble is, neither of these match the year variable in the NLSY data so you’ll need to create a variable that does. The year() function takes a Stata date and extracts the year from it:
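Applied to daten, that's:

```stata
gen year = year(daten)
```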
Now that you have year, you no longer need datestr and daten. Since both names start with date, you can drop them both with a single wildcard:
drop date*
You’re now ready to merge in nlsy:
merge 1:m year using nlsy
Result Number of obs
-----------------------------------------
Not matched 30
from master 30 (_merge==1)
from using 0 (_merge==2)
Matched 241,034 (_merge==3)
-----------------------------------------
This time we see something new: not everything matched. Is this a problem? It certainly could be! Take a look at all the observations that didn’t match with:
browse if _merge!=3
For the book we’ll list some of them and also look at the values of year:
First note that all the observations that didn’t match are from the master data set, fredcpi. Next note the years: many of them come from before or after the period of our extract. Others come from the period when the NLSY only collected data every other year (if you hadn’t noticed this before, run use nlsy and tab year to see the years in which data were collected). Putting it all together, the unmatched observations are years from fredcpi that did not match anything in nlsy because there was no NLSY data for that year. This is not a problem at all, which highlights that a “successful” merge is not always one where all the observations match. On the other hand, it’s always worth investigating why observations don’t match. If you had tried to match by the original daten variable nothing would have matched, but that would have been a fixable problem.
The observations that did not match don’t represent people like the other observations do, which could cause a variety of problems down the road. One way to get rid of them would be to simply drop based on _merge:
drop if _merge != 3
(30 observations deleted)
A more efficient way is to tell the merge command you only want observations that match using the keep() option:
merge 1:m year using nlsy, keep(match)
Don’t use this approach until after you’ve looked at the observations that didn’t match and are confident they don’t indicate a problem.
The keep() option will also accept master and using, and you can list more than one. In this case, keep(match using) would mean you want to keep observations that match and observations from the using data set, which would be safer if some of the NLSY data might be from years not contained in fredcpi.
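That safer variant would look like:

```stata
merge 1:m year using nlsy, keep(match using)
```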
To adjust monetary quantities for inflation you need to pick a reference year and convert them to dollars in that year. We'll use the year 2000. To convert dollars in some source year to dollars in a destination year, multiply them by the CPI in the destination year divided by the CPI in the source year. In this data set, "CPI in the source year" is just the cpi variable, but you need a separate variable for "CPI in the year 2000" that contains the same number for all observations. This is like creating a variable based on the value of another variable for a specific observation, except it applies to the entire data set, so you don't need by:
gen cpi_if_2000 = cpi if year==2000
egen cpi2000 = mean(cpi_if_2000)
gen income2000 = income * cpi2000/cpi
To see what a difference adjusting for inflation makes, run:
graph hbar inc*, over(year)
This will show you the mean income in each year, both in the original dollars and in 2000 dollars. Note how a dollar in 1979 was worth more than twice what a dollar was worth in 2000, but the difference decreases until in 2000 income and income2000 are identical.
7.7 Dealing with Duplicate Identifiers
The bane of everyone who does merges is the error message “variable id does not uniquely identify observations in the master data” and its variants. Sometimes this means you made a simple mistake: mixing up 1:m and m:1, for example, or specifying just one identifier when you need two. Sometimes it means you’ve badly misunderstood the structure of the data and need to rethink everything. But often it means there is a problem with the data itself: duplicate IDs.
If you follow the steps we recommend in First Steps with your Data, you’ll find out about any duplicates in the process of finding the data set’s identifiers. (You’ll also avoid badly misunderstanding the structure of the data). However, it’s when you merge that duplicates become a problem that must be resolved. Fortunately, you now have the skills to resolve them.
The one thing you should not do is make the error message go away by changing the type of merge you perform. If your merge should be one-to-one but you have duplicate identifiers in the master data set, changing it to many-to-one may allow the merge to run, but the resulting data set won't make any sense. If you find yourself wanting to run a many-to-many merge (m:m), step away from the computer, go for a walk, and rethink what you're doing. We've never seen a many-to-many merge that wasn't a mistake; it's hard to think of anything in the real world that could be modeled by what Stata does when you specify a many-to-many merge. Even the Stata documentation describes many-to-many merges as "a bad idea."
The data sets we’ve been working with are carefully curated and will probably never have a duplicate identifier. However, they’re very common in administrative and other real-world data. So we’ve prepared a data set that does have some: merge_error.dta. This is a fictitious data set of students, their demographics, and their scores on a standardized test. Pretend you were trying to merge it with another similar data set and got the dreaded “variable id does not uniquely identify observations in the master data.” Now what?
The first thing to do is load the data set that’s causing the error:
clear
use merge_error
describe
Contains data from merge_error.dta
Observations: 103
Variables: 5 1 Oct 2015 09:36
------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
------------------------------------------------------------------------------------------
id float %9.0g
female float %9.0g
grade int %8.0g
race int %8.0g
score int %8.0g
------------------------------------------------------------------------------------------
Sorted by:
The id variable is clearly intended to be an identifier, but recall that you can use duplicates report to check if it actually uniquely identifies observations:
duplicates report id
Duplicates in terms of id
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 91 0
2 | 12 6
--------------------------------------
It does not–or you would not have gotten that error message when you tried to merge–but just knowing this does not help you identify the problem. To do that, create a variable that tells you how many times each id is duplicated:
bysort id: gen copies = _N
Now you can examine the problem observations with browse or list:
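For example, you can list just the problem observations, using the ab(30) option so variable names aren’t abbreviated (one way to do it; browse if copies>1 works just as well):

list if copies>1, ab(30)

Because bysort already sorted the data by id, each set of duplicates will appear together.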
The first six rows consist of three pairs of observations that are completely identical. This is almost certainly a case of the same observations having been entered into the data set twice, a common data entry error. You can get rid of these duplicates by running:
duplicates drop
Duplicates in terms of all variables
(3 observations deleted)
This is a good outcome: not only is the problem easy to fix, you didn’t actually lose any data. Unfortunately it does not fix all the duplicates in this data set.
At this point copies is no longer accurate: it is still 2 for the observations where we dropped a duplicate observation. Update it before proceeding:
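Since the data are still sorted by id, one way to update it is to mirror the command that created it:

by id: replace copies = _N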
The remaining six rows with copies>1 are three pairs of observations with the same value of id but different values for one or more of the other variables. You should consider the possibility that there is some hierarchy involved that you were not aware of, such as some people taking multiple tests. However, there’s no indication of anything like that in this data set, and it would be unusual for it to affect so few observations. Almost certainly the duplicates are different people who were somehow given the same id, most likely due to a data entry error.
This is a bad outcome: Assuming you only have one person with an id of 64 in the other data set, you don’t know which of the two people with an id of 64 in this data set is the same person. You can resolve this problem by dropping both of the duplicate observations:
drop if copies>1
(6 observations deleted)
However, note that all of the duplicate observations have different values for race. This means that if you merged by id and race, they would be uniquely identified. This very much depends on the particulars of the data set: merging by id and female instead wouldn’t help much at all. On the other hand, if you had enough variables you might be able to match people even if the data sets didn’t have an identifier in common. Linking records without relying on identifiers is a very hot topic, but one we won’t address any further.
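A sketch of what such a merge might look like; the name of the using data set (other_data) is purely hypothetical:

merge 1:1 id race using other_data

This only works because id and race together uniquely identify observations, and only in this particular data set.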
Exercise
nlsy_error.dta is a version of our NLSY extract that has had duplicate identifiers introduced into it. Identify and address the errors so that the combination of id and year is a unique identifier again and could be used to carry out a one-to-one merge.
Solution
First load the data and confirm the problem:
clear
use nlsy_error
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240974 0
2 | 80 40
--------------------------------------
Now create a copies variable that identifies the problem observations and then examine them:
bysort id year: gen copies = _N
list if copies>1, ab(30)
Some of these are completely duplicate observations, like person 501 having two copies of their row for 1985, which can be remedied with duplicates drop:
duplicates drop
Duplicates in terms of all variables
(20 observations deleted)
Now recreate copies so you can examine the remaining problems:
by id year: replace copies = _N
list if copies>1, ab(30)
The remaining observations are not duplicates and suggest different people were assigned the same id by accident. For real work you might investigate further, including reaching out to the data provider for an explanation, but for now you can eliminate all the problem observations with:
drop if copies>1
(40 observations deleted)
Now you could run a one-to-one merge using id and year:
duplicates report id year
Duplicates in terms of id year
--------------------------------------
Copies | Observations Surplus
----------+---------------------------
1 | 240994 0
--------------------------------------