7  Combining Data Sets

Combining data sets is a very common task, and one that’s easy to do if you understand the structure of the data sets you are trying to combine. However, if you’ve misunderstood the structure of the data sets you can end up with a data set that makes no sense at all. Also, combining data sets will often force you to deal with problems in the data set structure.

Always think through what the resulting data set should look like before combining two data sets. If the resulting data set won’t have a consistent, logical structure you probably need to rethink what you’re doing.

We’ll think about combining two data sets in terms of adding new information to an original data set. In reality, which data set is the original and which is new is completely arbitrary, but usually there’s an obvious candidate for each role and it helps for visualizing the process.

How you combine two data sets depends on what the new data is adding to the original data. We’ll discuss the most common scenarios, which are:

  • Adding observations
  • Adding variables
  • Adding level one units to existing level two units
  • Adding variables for different levels

We’ll also carry out an extended example and discuss how to handle duplicates when combining data sets.

Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Combining_Data_Sets_Practice.ipynb. Have it import Pandas.

import pandas as pd

7.1 Adding Observations

If the new data set adds more observations to the original data set, then this is a job for the Pandas concat() function. Using concat() makes sense if the original data set and the new data set contain the same kinds of things, but not the same things. The concat() function simply stacks DataFrames, passed in as a list, on top of each other.

Suppose that instead of a single file containing our 2000 ACS sample you were given acs_part1 and acs_part2, with each file containing half of the observations. Read them in, take a look, and then use concat() to combine them into a single DataFrame:

acs_part1 = pd.read_pickle('acs_part1.pickle')
acs_part1
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
617416 3 3 White NaN None NaN False False
4 3 White NaN None NaN False False
617477 1 43 White Now married Bachelor's degree 82300.0 False False
2 40 White Now married Bachelor's degree 3000.0 True False
3 14 White NaN 7th-8th grade NaN True False

13705 rows × 7 columns

acs_part2 = pd.read_pickle('acs_part2.pickle')
acs_part2
age race marital_status edu income female hispanic
household person
617477 4 11 White NaN 5th-6th grade NaN True False
617720 1 72 White Widowed Nursery school-4th grade 8000.0 True False
618413 1 22 Other Never married High School graduate 24000.0 False True
2 25 Other Never married 7th-8th grade 4000.0 False True
3 27 Other Never married 7th-8th grade 0.0 False True
... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master's degree 69800.0 True False
2 61 White Now married Master's degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

13705 rows × 7 columns

pd.concat([acs_part1, acs_part2])
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master's degree 69800.0 True False
2 61 White Now married Master's degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

If one of the DataFrames does not have a variable that exists in the other DataFrame, observations from that DataFrame will get NaN (missing) for that variable. Generally you want to make sure that the two DataFrames use the same names for the same things before you combine them.
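
For example, here is a minimal sketch using two tiny made-up DataFrames (df1 and df2 are hypothetical, not part of the example files), where only df1 has an income column:

# df1 has age and income; df2 (hypothetical) only has age
df1 = pd.DataFrame({'age': [20, 30], 'income': [10000.0, 20000.0]})
df2 = pd.DataFrame({'age': [40, 50]})

# The rows that came from df2 will have NaN for income
pd.concat([df1, df2])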

7.2 Adding Variables

If the new data set adds more variables to the original data set and observations represent the same things in both data sets, then this is a job for the merge() function or its shortcut join(). The merge() command combines data sets by combining observations that have the same value of an identifier variable or variables, often the row index, so the result has all the variables from both files. The terms merge and join are used pretty much interchangeably in the data science world to describe this operation, though join tends to be more common among people with database backgrounds.

Suppose you were given the data files acs_demographics.pickle and acs_ses.pickle, containing demographic information and socio-economic status (SES) information respectively about the 2000 ACS respondents. Load them and take a look:

acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_demo
age hispanic race marital_status female
household person
37 1 20 False White Never married True
2 19 False White Never married True
3 19 False Black Never married True
241 1 50 False White Never married True
242 1 29 False White Never married True
... ... ... ... ... ... ...
1236624 1 29 False White Now married False
2 26 False White Now married True
1236756 1 58 False White Now married True
2 61 False White Now married False
1236779 1 30 False American Indian Divorced False

27410 rows × 5 columns

acs_ses = pd.read_pickle('acs_ses.pickle')
acs_ses
edu income
household person
37 1 Some college, >=1 year 10000.0
2 Some college, >=1 year 5300.0
3 Some college, >=1 year 4700.0
241 1 Master's degree 32500.0
242 1 Bachelor's degree 30000.0
... ... ... ...
1236624 1 Some college, >=1 year 50100.0
2 High School graduate 12000.0
1236756 1 Master's degree 69800.0
2 Master's degree 40800.0
1236779 1 High School graduate 22110.0

27410 rows × 2 columns

The join() function quickly and easily merges one DataFrame into another if they have the same row index:

acs_demo.join(acs_ses)
age hispanic race marital_status female edu income
household person
37 1 20 False White Never married True Some college, >=1 year 10000.0
2 19 False White Never married True Some college, >=1 year 5300.0
3 19 False Black Never married True Some college, >=1 year 4700.0
241 1 50 False White Never married True Master's degree 32500.0
242 1 29 False White Never married True Bachelor's degree 30000.0
... ... ... ... ... ... ... ... ...
1236624 1 29 False White Now married False Some college, >=1 year 50100.0
2 26 False White Now married True High School graduate 12000.0
1236756 1 58 False White Now married True Master's degree 69800.0
2 61 False White Now married False Master's degree 40800.0
1236779 1 30 False American Indian Divorced False High School graduate 22110.0

27410 rows × 7 columns

However, join() does not have all the tools for checking that a merge worked properly that merge() does. Given how much can go wrong with merging, we suggest you use merge() unless you’ve created the data sets to be merged yourself and know they’ll merge successfully.

The merge() function uses the convention of describing the two DataFrames to be merged as the left DataFrame and the right DataFrame. While in the result the columns of the left DataFrame will go on the left and the columns of the right DataFrame will go on the right, they’re mostly just labels. The DataFrame whose merge function is being called is the left DataFrame, and the one being added is the right DataFrame.

The first argument to pass in is the right DataFrame. Then you need to tell it what to use to match observations. Ideally this will be the indexes, so you can pass in left_index=True and right_index=True. Those are the only arguments required to carry out a merge:

acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
)
age hispanic race marital_status female edu income
household person
37 1 20 False White Never married True Some college, >=1 year 10000.0
2 19 False White Never married True Some college, >=1 year 5300.0
3 19 False Black Never married True Some college, >=1 year 4700.0
241 1 50 False White Never married True Master's degree 32500.0
242 1 29 False White Never married True Bachelor's degree 30000.0
... ... ... ... ... ... ... ... ...
1236624 1 29 False White Now married False Some college, >=1 year 50100.0
2 26 False White Now married True High School graduate 12000.0
1236756 1 58 False White Now married True Master's degree 69800.0
2 61 False White Now married False Master's degree 40800.0
1236779 1 30 False American Indian Divorced False High School graduate 22110.0

27410 rows × 7 columns

However, with a few more arguments we can get much stronger reassurance that the merge is working properly.

This merge should combine one person’s demographic information with the same person’s SES information. Each row should match with exactly one other row: it is a one-to-one merge. However, that would not be the case if the DataFrames had duplicate values of the variables we are matching on. That would be a serious problem, so tell merge() to validate that you are in fact carrying out a one-to-one merge with validate='1:1'.

If an observation in one DataFrame does not match with an observation in the other DataFrame, it will get NaN (missing) for all the variables in the other DataFrame. But sometimes you don’t want to keep such observations at all. There are four different kinds of joins depending on what should be done with observations that don’t match:

  • In an inner join only observations that match are kept; observations that do not match are discarded. This is the default.
  • In a left join observations from the left DataFrame that do not match are kept, but observations from the right DataFrame that do not match are discarded.
  • In a right join observations from the right DataFrame that do not match are kept, but observations from the left DataFrame that do not match are discarded.
  • In an outer join all observations are kept whether they match or not, and no observations are discarded.

There are scenarios where all four of those make sense. However, the first time you merge two DataFrames you need to check whether the merge is working properly. For example, minor differences in the format of the variables to be matched on can lead to nothing matching. To check for that, always start with an outer join even if you plan to discard observations that don’t match later. That will allow you to examine the observations that didn’t match and determine whether they really don’t match and thus can be discarded, or whether they only didn’t match because the merge didn’t work properly. Once you’re confident everything is working you can change to the type of merge you really want to do.

You set the type of merge with the how argument. Start with how='outer'.

Examining whether the merge worked or not is much easier if you can immediately identify the observations that did or did not match. Pass in indicator=True and merge will create a column called _merge telling you which DataFrame each observation came from, with ‘both’ indicating it matched successfully. Running value_counts() on _merge is a good way to get a general sense of how the merge went.

Thus the careful version of the above merge is:

acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True    
)
age hispanic race marital_status female edu income _merge
household person
37 1 20 False White Never married True Some college, >=1 year 10000.0 both
2 19 False White Never married True Some college, >=1 year 5300.0 both
3 19 False Black Never married True Some college, >=1 year 4700.0 both
241 1 50 False White Never married True Master's degree 32500.0 both
242 1 29 False White Never married True Bachelor's degree 30000.0 both
... ... ... ... ... ... ... ... ... ...
1236624 1 29 False White Now married False Some college, >=1 year 50100.0 both
2 26 False White Now married True High School graduate 12000.0 both
1236756 1 58 False White Now married True Master's degree 69800.0 both
2 61 False White Now married False Master's degree 40800.0 both
1236779 1 30 False American Indian Divorced False High School graduate 22110.0 both

27410 rows × 8 columns
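
To run the value_counts() check described above, store the result first (acs_merged is just a name introduced for this sketch):

acs_merged = acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)
# If the merge went well, every observation will be 'both'
acs_merged['_merge'].value_counts()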

Exercise

Examine the following pairs of data sets:

  • acs_race.pickle and acs_education.pickle
  • acs_adults.pickle and acs_children.pickle

Determine the appropriate method for combining each pair into a single data set and then do so. If you do a merge, do it carefully and run value_counts() on _merge.

First load and examine acs_race and acs_education:

acs_race = pd.read_pickle('acs_race.pickle')
acs_race
race hispanic
household person
37 1 White False
2 White False
3 Black False
241 1 White False
242 1 White False
... ... ... ...
1236624 1 White False
2 White False
1236756 1 White False
2 White False
1236779 1 American Indian False

27410 rows × 2 columns

acs_edu = pd.read_pickle('acs_education.pickle')
acs_edu
edu
household person
37 1 Some college, >=1 year
2 Some college, >=1 year
3 Some college, >=1 year
241 1 Master's degree
242 1 Bachelor's degree
... ... ...
1236624 1 Some college, >=1 year
2 High School graduate
1236756 1 Master's degree
2 Master's degree
1236779 1 High School graduate

27410 rows × 1 columns

These DataFrames contain different information about the same things, so combining them will add variables. This is a job for merge():

acs_combined1 = acs_race.merge(
    acs_edu,
    left_index=True, right_index=True,
    how='outer',
    validate='1:1',
    indicator=True
)
acs_combined1
race hispanic edu _merge
household person
37 1 White False Some college, >=1 year both
2 White False Some college, >=1 year both
3 Black False Some college, >=1 year both
241 1 White False Master's degree both
242 1 White False Bachelor's degree both
... ... ... ... ... ...
1236624 1 White False Some college, >=1 year both
2 White False High School graduate both
1236756 1 White False Master's degree both
2 White False Master's degree both
1236779 1 American Indian False High School graduate both

27410 rows × 4 columns

Now check that they actually matched successfully by looking at value_counts() for _merge:

acs_combined1['_merge'].value_counts()
_merge
both          27410
left_only         0
right_only        0
Name: count, dtype: int64

Moving on to acs_adults and acs_children:

acs_adults = pd.read_pickle('acs_adults.pickle')
acs_adults
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master's degree 69800.0 True False
2 61 White Now married Master's degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

20118 rows × 7 columns

acs_children = pd.read_pickle('acs_children.pickle')
acs_children
age race marital_status edu income female hispanic
household person
484 3 4 Asian NaN None NaN False False
4 2 Asian NaN NaN NaN True False
607 3 17 Asian Never married 10th grade 0.0 True False
4 15 Asian Never married 7th-8th grade 0.0 False False
5 13 Asian NaN 7th-8th grade NaN True False
... ... ... ... ... ... ... ... ...
1235851 2 13 White NaN 5th-6th grade NaN True False
3 12 White NaN 9th grade NaN True False
1235861 3 1 American Indian NaN NaN NaN False False
4 0 Indigenous, Unspecified NaN NaN NaN True False
1236287 4 4 White NaN None NaN False False

7292 rows × 7 columns

These DataFrames contain the same information but for different people. Combining them will add observations, so this is a job for concat():

pd.concat([acs_adults, acs_children])
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master's degree 32500.0 True False
242 1 29 White Never married Bachelor's degree 30000.0 True False
... ... ... ... ... ... ... ... ...
1235851 2 13 White NaN 5th-6th grade NaN True False
3 12 White NaN 9th grade NaN True False
1235861 3 1 American Indian NaN NaN NaN False False
4 0 Indigenous, Unspecified NaN NaN NaN True False
1236287 4 4 White NaN None NaN False False

27410 rows × 7 columns

Note that the last people in the DataFrame are now all children. If you want to put the DataFrame back in its original order, where the people in a household are together, sort by household and person.
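
Since household and person make up the row index, sort_index() will do exactly that. A quick sketch:

# Put each household's people back together by sorting the index
pd.concat([acs_adults, acs_children]).sort_index()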

7.3 Adding Level One Units to Existing Level Two Units

Next consider the data sets nlsy_1979.pickle and nlsy_1980.pickle. They each contain one year’s worth of data from our NLSY extract. Is combining them a job for concat() or merge()?

nlsy_1979 = pd.read_pickle('nlsy_1979.pickle')
nlsy_1979
birth_year edu income age
id
1.0 58.0 12TH GRADE 4620.0 21.0
2.0 59.0 9TH GRADE 4000.0 20.0
3.0 61.0 10TH GRADE NaN 18.0
4.0 62.0 9TH GRADE NaN 17.0
5.0 59.0 1ST YEAR COLLEGE 2200.0 20.0
... ... ... ... ...
12682.0 59.0 9TH GRADE 500.0 20.0
12683.0 59.0 12TH GRADE 0.0 20.0
12684.0 60.0 10TH GRADE 0.0 19.0
12685.0 57.0 12TH GRADE 0.0 22.0
12686.0 60.0 12TH GRADE 15000.0 19.0

12686 rows × 4 columns

nlsy_1980 = pd.read_pickle('nlsy_1980.pickle')
nlsy_1980
birth_year edu income age
id
1.0 58.0 NaN NaN 22.0
2.0 59.0 9TH GRADE 5000.0 21.0
3.0 61.0 10TH GRADE 7000.0 19.0
4.0 62.0 NaN NaN 18.0
5.0 59.0 2ND YEAR COLLEGE 2000.0 21.0
... ... ... ... ...
12682.0 59.0 9TH GRADE 0.0 21.0
12683.0 59.0 12TH GRADE 0.0 21.0
12684.0 60.0 12TH GRADE 0.0 20.0
12685.0 57.0 12TH GRADE 3000.0 23.0
12686.0 60.0 12TH GRADE 17000.0 20.0

12686 rows × 4 columns

The answer depends on whether you want the resulting data set to be in long form or in wide form. In the NLSY, a level two unit is a person and a level one unit is a person-year combination, so adding nlsy_1980 to nlsy_1979 is adding new level one units (years) to the existing level two units (people). In long form each level one unit gets its own observation, so adding nlsy_1980 in long form adds observations. This is a job for concat(). In wide form, each level two unit gets its own observation, but each level one unit gets its own set of variables. Thus adding nlsy_1980 in wide form adds variables; a job for merge().

The only complication is the level one identifier, year. Right now it is found only in the filenames of the two data sets, as is common. In long form, the level one identifier needs to be a variable. In wide form with string column names, it needs to be a suffix at the end of the names of all the level one variables. Fortunately, both concat() and merge() have ways to do that easily.

7.3.1 Long Form

First let’s combine them into long form with concat(). The keys argument allows you to pass in a list containing an identifier for each DataFrame. It will be used to create a MultiIndex for the combined DataFrame:

nlsy_long = pd.concat(
    [nlsy_1979, nlsy_1980],
    keys=[1979, 1980]
)
nlsy_long
birth_year edu income age
id
1979 1.0 58.0 12TH GRADE 4620.0 21.0
2.0 59.0 9TH GRADE 4000.0 20.0
3.0 61.0 10TH GRADE NaN 18.0
4.0 62.0 9TH GRADE NaN 17.0
5.0 59.0 1ST YEAR COLLEGE 2200.0 20.0
... ... ... ... ... ...
1980 12682.0 59.0 9TH GRADE 0.0 21.0
12683.0 59.0 12TH GRADE 0.0 21.0
12684.0 60.0 12TH GRADE 0.0 20.0
12685.0 57.0 12TH GRADE 3000.0 23.0
12686.0 60.0 12TH GRADE 17000.0 20.0

25372 rows × 4 columns

At this point the second part of the row MultiIndex has a name but the first part does not. You can rename indexes with rename_axis(); for a MultiIndex pass in a list of names (including the one that already exists):

nlsy_long = nlsy_long.rename_axis(
    ['year', 'id'],
    axis='rows'
)
nlsy_long
birth_year edu income age
year id
1979 1.0 58.0 12TH GRADE 4620.0 21.0
2.0 59.0 9TH GRADE 4000.0 20.0
3.0 61.0 10TH GRADE NaN 18.0
4.0 62.0 9TH GRADE NaN 17.0
5.0 59.0 1ST YEAR COLLEGE 2200.0 20.0
... ... ... ... ... ...
1980 12682.0 59.0 9TH GRADE 0.0 21.0
12683.0 59.0 12TH GRADE 0.0 21.0
12684.0 60.0 12TH GRADE 0.0 20.0
12685.0 57.0 12TH GRADE 3000.0 23.0
12686.0 60.0 12TH GRADE 17000.0 20.0

25372 rows × 4 columns

This looks a little off because it treats people as nested within years. You’ll get something more familiar if you swap the levels of the MultiIndex and then sort by them:

nlsy_long.swaplevel().sort_index()
birth_year edu income age
id year
1.0 1979 58.0 12TH GRADE 4620.0 21.0
1980 58.0 NaN NaN 22.0
2.0 1979 59.0 9TH GRADE 4000.0 20.0
1980 59.0 9TH GRADE 5000.0 21.0
3.0 1979 61.0 10TH GRADE NaN 18.0
... ... ... ... ... ...
12684.0 1980 60.0 12TH GRADE 0.0 20.0
12685.0 1979 57.0 12TH GRADE 0.0 22.0
1980 57.0 12TH GRADE 3000.0 23.0
12686.0 1979 60.0 12TH GRADE 15000.0 19.0
1980 60.0 12TH GRADE 17000.0 20.0

25372 rows × 4 columns

7.3.2 Wide Form

Now let’s combine these data sets into wide form with merge(). The suffixes argument allows you to pass in a tuple of suffixes, one for each DataFrame, which will be appended to the names of the columns that come from that DataFrame:

nlsy_wide = nlsy_1979.merge(
    nlsy_1980,
    left_index=True, right_index=True,
    suffixes=('_1979', '_1980'),
    validate='1:1',
    how='outer',
    indicator=True      
)
nlsy_wide
birth_year_1979 edu_1979 income_1979 age_1979 birth_year_1980 edu_1980 income_1980 age_1980 _merge
id
1.0 58.0 12TH GRADE 4620.0 21.0 58.0 NaN NaN 22.0 both
2.0 59.0 9TH GRADE 4000.0 20.0 59.0 9TH GRADE 5000.0 21.0 both
3.0 61.0 10TH GRADE NaN 18.0 61.0 10TH GRADE 7000.0 19.0 both
4.0 62.0 9TH GRADE NaN 17.0 62.0 NaN NaN 18.0 both
5.0 59.0 1ST YEAR COLLEGE 2200.0 20.0 59.0 2ND YEAR COLLEGE 2000.0 21.0 both
... ... ... ... ... ... ... ... ... ...
12682.0 59.0 9TH GRADE 500.0 20.0 59.0 9TH GRADE 0.0 21.0 both
12683.0 59.0 12TH GRADE 0.0 20.0 59.0 12TH GRADE 0.0 21.0 both
12684.0 60.0 10TH GRADE 0.0 19.0 60.0 12TH GRADE 0.0 20.0 both
12685.0 57.0 12TH GRADE 0.0 22.0 57.0 12TH GRADE 3000.0 23.0 both
12686.0 60.0 12TH GRADE 15000.0 19.0 60.0 12TH GRADE 17000.0 20.0 both

12686 rows × 9 columns

And we’re done!

7.4 Adding Variables for Different Levels

If you need to merge hierarchical data where the left DataFrame contains data on the level one units and the right DataFrame contains data on the level two units, this is a job for a many-to-one merge. A many-to-one merge combines observations just like a one-to-one merge, but many rows representing level one units are combined with one row representing the corresponding level two unit. A one-to-many merge is essentially the same thing, just the left DataFrame contains the level two unit (the “one”) and the right DataFrame contains the level one units (the “many”).

Load our good friend acs.pickle plus acs_households.pickle. The latter has one observation per household and some household level variables:

acs = pd.read_pickle('acs.pickle')
acs_households = pd.read_pickle('acs_households.pickle')
acs_households
household_income max_edu
household
37 20000.0 Some college, >=1 year
241 32500.0 Master's degree
242 30000.0 Bachelor's degree
377 51900.0 None
418 12200.0 12th grade, no diploma
... ... ...
1236119 93200.0 Bachelor's degree
1236287 53400.0 Some college, <1 year
1236624 62100.0 Some college, >=1 year
1236756 110600.0 Master's degree
1236779 22110.0 High School graduate

10565 rows × 2 columns

With acs as the left DataFrame and acs_households as right, this is a many-to-one merge. You can do this pretty much like the other merges we’ve done: the only thing that needs to change is to set validate to 'm:1':

acs.merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age race marital_status edu income female hispanic household_income max_edu _merge
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False 20000.0 Some college, >=1 year both
2 19 White Never married Some college, >=1 year 5300.0 True False 20000.0 Some college, >=1 year both
3 19 Black Never married Some college, >=1 year 4700.0 True False 20000.0 Some college, >=1 year both
241 1 50 White Never married Master's degree 32500.0 True False 32500.0 Master's degree both
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0 Bachelor's degree both
... ... ... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0 Some college, >=1 year both
2 26 White Now married High School graduate 12000.0 True False 62100.0 Some college, >=1 year both
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0 Master's degree both
2 61 White Now married Master's degree 40800.0 False False 110600.0 Master's degree both
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0 High School graduate both

27410 rows × 10 columns

This is subtly dangerous, however. Note how the combined DataFrame has person and household as a MultiIndex. That makes the DataFrame incapable of including a household that has no people in it and thus no value for person. To see the result, try the merge again after removing the first three observations from acs (household 37):

acs.iloc[3:, :].merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age race marital_status edu income female hispanic household_income max_edu _merge
household person
241 1 50 White Never married Master's degree 32500.0 True False 32500.0 Master's degree both
242 1 29 White Never married Bachelor's degree 30000.0 True False 30000.0 Bachelor's degree both
377 1 69 White Never married None 51900.0 True False 51900.0 None both
418 1 59 White Widowed 12th grade, no diploma 12200.0 True False 12200.0 12th grade, no diploma both
465 1 55 Black Separated Associate degree 0.0 False False 2600.0 Associate degree both
... ... ... ... ... ... ... ... ... ... ... ...
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False 62100.0 Some college, >=1 year both
2 26 White Now married High School graduate 12000.0 True False 62100.0 Some college, >=1 year both
1236756 1 58 White Now married Master's degree 69800.0 True False 110600.0 Master's degree both
2 61 White Now married Master's degree 40800.0 False False 110600.0 Master's degree both
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False 22110.0 High School graduate both

27407 rows × 10 columns

Household 37 has disappeared from the combined DataFrame, despite the fact that it exists in acs_households and we specified how='outer'.

One easy-to-implement solution is to not match on the indexes at all. Instead, reset them and then use on= to specify the variable to be used for matching:

acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
)
household person age race marital_status edu income female hispanic household_income max_edu _merge
0 241 1.0 50.0 White Never married Master's degree 32500.0 True False 32500.0 Master's degree both
1 242 1.0 29.0 White Never married Bachelor's degree 30000.0 True False 30000.0 Bachelor's degree both
2 377 1.0 69.0 White Never married None 51900.0 True False 51900.0 None both
3 418 1.0 59.0 White Widowed 12th grade, no diploma 12200.0 True False 12200.0 12th grade, no diploma both
4 465 1.0 55.0 Black Separated Associate degree 0.0 False False 2600.0 Associate degree both
... ... ... ... ... ... ... ... ... ... ... ... ...
27403 1236624 2.0 26.0 White Now married High School graduate 12000.0 True False 62100.0 Some college, >=1 year both
27404 1236756 1.0 58.0 White Now married Master's degree 69800.0 True False 110600.0 Master's degree both
27405 1236756 2.0 61.0 White Now married Master's degree 40800.0 False False 110600.0 Master's degree both
27406 1236779 1.0 30.0 American Indian Divorced High School graduate 22110.0 False False 22110.0 High School graduate both
27407 37 NaN NaN NaN NaN NaN NaN NaN NaN 20000.0 Some college, >=1 year right_only

27408 rows × 12 columns

Now household 37 is in the result, though down at the bottom, with person set to NaN and _merge set to ‘right_only’. You could go ahead and make household and person the MultiIndex again, though that would leave you with a NaN value of person for household 37.
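
Here is a sketch of restoring the MultiIndex with set_index(), assuming you’re willing to live with that NaN:

# Re-create the (household, person) MultiIndex after merging on columns
acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
).set_index(['household', 'person'])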

If you make acs_households left and acs right, then this is a one-to-many merge and validate needs to be set to '1:m'. Otherwise nothing needs to change:

acs_households.reset_index().merge(
    acs.iloc[3:, :].reset_index(),
    on='household',
    validate='1:m',
    how='outer',
    indicator=True
)
household household_income max_edu person age race marital_status edu income female hispanic _merge
0 37 20000.0 Some college, >=1 year NaN NaN NaN NaN NaN NaN NaN NaN left_only
1 241 32500.0 Master's degree 1.0 50.0 White Never married Master's degree 32500.0 True False both
2 242 30000.0 Bachelor's degree 1.0 29.0 White Never married Bachelor's degree 30000.0 True False both
3 377 51900.0 None 1.0 69.0 White Never married None 51900.0 True False both
4 418 12200.0 12th grade, no diploma 1.0 59.0 White Widowed 12th grade, no diploma 12200.0 True False both
... ... ... ... ... ... ... ... ... ... ... ... ...
27403 1236624 62100.0 Some college, >=1 year 1.0 29.0 White Now married Some college, >=1 year 50100.0 False False both
27404 1236624 62100.0 Some college, >=1 year 2.0 26.0 White Now married High School graduate 12000.0 True False both
27405 1236756 110600.0 Master's degree 1.0 58.0 White Now married Master's degree 69800.0 True False both
27406 1236756 110600.0 Master's degree 2.0 61.0 White Now married Master's degree 40800.0 False False both
27407 1236779 22110.0 High School graduate 1.0 30.0 American Indian Divorced High School graduate 22110.0 False False both

27408 rows × 12 columns

Note that I described this as an easy-to-implement solution. There are many different ways to merge in Python, and if you’re working on something where performance is an issue you may find that something more complicated is substantially faster for your particular problem.

Exercise

nlsy_person.pickle contains information about the people in our NLSY extract that does not change over time, while nlsy_person_year.pickle contains information that changes from year to year. Combine them. Note that the person with id 1 is not in nlsy_person and the person with id 12686 (the last person) is not in nlsy_person_year. Make sure the data you do have for them is not lost! To make it easy to check, set id and year as the indexes again after combining.

nlsy_person = pd.read_pickle('nlsy_person.pickle')
nlsy_person
yearOfBirth total_income
id
2.0 59.0 214500.0
3.0 61.0 80500.0
4.0 62.0 180236.0
5.0 59.0 185679.0
6.0 60.0 314683.0
... ... ...
12682.0 59.0 6005.0
12683.0 59.0 0.0
12684.0 60.0 3000.0
12685.0 57.0 53547.0
12686.0 60.0 335560.0

12685 rows × 2 columns

nlsy_person_year = pd.read_pickle('nlsy_person_year.pickle')
nlsy_person_year
edu income age
id year
1.0 1979 12TH GRADE 4620.0 21.0
1980 NaN NaN 22.0
1981 12TH GRADE 5000.0 23.0
1982 NaN NaN 24.0
1983 NaN NaN 25.0
... ... ... ... ...
12685.0 1993 NaN NaN 36.0
1994 NaN NaN 37.0
1996 NaN NaN 39.0
1998 NaN NaN 41.0
2000 NaN NaN 43.0

241015 rows × 3 columns

nlsy_combined = nlsy_person_year.reset_index().merge(
    nlsy_person.reset_index(),
    on='id',
    how='outer',
    validate='m:1',
    indicator=True
).set_index(['id', 'year'])
nlsy_combined.loc[[1, 12686]]
edu income age yearOfBirth total_income _merge
id year
1.0 1979.0 12TH GRADE 4620.0 21.0 NaN NaN left_only
1980.0 NaN NaN 22.0 NaN NaN left_only
1981.0 12TH GRADE 5000.0 23.0 NaN NaN left_only
1982.0 NaN NaN 24.0 NaN NaN left_only
1983.0 NaN NaN 25.0 NaN NaN left_only
1984.0 NaN NaN 26.0 NaN NaN left_only
1985.0 NaN NaN 27.0 NaN NaN left_only
1986.0 NaN NaN 28.0 NaN NaN left_only
1987.0 NaN NaN 29.0 NaN NaN left_only
1988.0 NaN NaN 30.0 NaN NaN left_only
1989.0 NaN NaN 31.0 NaN NaN left_only
1990.0 NaN NaN 32.0 NaN NaN left_only
1991.0 NaN NaN 33.0 NaN NaN left_only
1992.0 NaN NaN 34.0 NaN NaN left_only
1993.0 NaN NaN 35.0 NaN NaN left_only
1994.0 NaN NaN 36.0 NaN NaN left_only
1996.0 NaN NaN 38.0 NaN NaN left_only
1998.0 NaN NaN 40.0 NaN NaN left_only
2000.0 NaN NaN 42.0 NaN NaN left_only
12686.0 NaN NaN NaN NaN 60.0 335560.0 right_only

It doesn’t matter which DataFrame you make left or right as long as your validate matches what you choose.
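
For instance, a sketch of the same combination with nlsy_person as the left DataFrame; note that validate flips to '1:m', and the meanings of ‘left_only’ and ‘right_only’ in _merge swap:

# Same merge with the roles reversed
nlsy_combined = nlsy_person.reset_index().merge(
    nlsy_person_year.reset_index(),
    on='id',
    how='outer',
    validate='1:m',
    indicator=True
).set_index(['id', 'year'])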

7.5 Inflation Example

Next we’ll do an example that illustrates some of the complications that can arise when combining data sets from different sources.

One issue with the NLSY data extract we’ve been using is that incomes from different time periods are not really comparable due to inflation. To adjust them for inflation, we need information about the level of inflation in each year. The data set fred_cpi.pickle contains the average Consumer Price Index for All Urban Consumers for every year from 1970 to 2005. It was obtained from the Federal Reserve Economic Data (FRED). Our task is to merge it with nlsy.pickle.

Start by reading in both data sets:

nlsy = pd.read_pickle('nlsy.pickle')
cpi = pd.read_pickle('fred_cpi.pickle')
cpi
datestr daten CPIAUCSL
0 1970-01-01 1970-01-01 38.841999
1 1971-01-01 1971-01-01 40.483002
2 1972-01-01 1972-01-01 41.807999
3 1973-01-01 1973-01-01 44.424999
4 1974-01-01 1974-01-01 49.317001
5 1975-01-01 1975-01-01 53.825001
6 1976-01-01 1976-01-01 56.932999
7 1977-01-01 1977-01-01 60.617001
8 1978-01-01 1978-01-01 65.241997
9 1979-01-01 1979-01-01 72.583000
10 1980-01-01 1980-01-01 82.383003
11 1981-01-01 1981-01-01 90.932999
12 1982-01-01 1982-01-01 96.532997
13 1983-01-01 1983-01-01 99.583000
14 1984-01-01 1984-01-01 103.932999
15 1985-01-01 1985-01-01 107.599998
16 1986-01-01 1986-01-01 109.692001
17 1987-01-01 1987-01-01 113.616997
18 1988-01-01 1988-01-01 118.275002
19 1989-01-01 1989-01-01 123.942001
20 1990-01-01 1990-01-01 130.658005
21 1991-01-01 1991-01-01 136.167007
22 1992-01-01 1992-01-01 140.307999
23 1993-01-01 1993-01-01 144.475006
24 1994-01-01 1994-01-01 148.225006
25 1995-01-01 1995-01-01 152.382996
26 1996-01-01 1996-01-01 156.858002
27 1997-01-01 1997-01-01 160.524994
28 1998-01-01 1998-01-01 163.007996
29 1999-01-01 1999-01-01 166.582993
30 2000-01-01 2000-01-01 172.192001
31 2001-01-01 2001-01-01 177.042007
32 2002-01-01 2002-01-01 179.867004
33 2003-01-01 2003-01-01 184.000000
34 2004-01-01 2004-01-01 188.908005
35 2005-01-01 2005-01-01 195.266998

Taking this data and adding nlsy to it is a job for a one-to-many merge: one year’s CPI data will match with many people’s NLSY data for that year. Note how this treats year as the level two unit! For most purposes it’s more useful to think of people as the level two unit in the NLSY, but it’s just as logical to group person-year combinations by year instead.

7.5.1 Preparing to Merge

We have some work to do with cpi before it’s ready to merge. The biggest issue is that we need a simple numeric year variable we can use to match with the year index in nlsy. cpi has two different date columns; look at the dtypes to see the difference:

cpi.dtypes
datestr             object
daten       datetime64[ns]
CPIAUCSL           float32
dtype: object

datestr is, unsurprisingly, a string (object), but daten (date number) is a datetime. In Pandas, the .dt accessor for datetime columns has a year attribute containing just the year component of the date. We can put that in its own column with:

cpi['year'] = cpi['daten'].dt.year
cpi
datestr daten CPIAUCSL year
0 1970-01-01 1970-01-01 38.841999 1970
1 1971-01-01 1971-01-01 40.483002 1971
2 1972-01-01 1972-01-01 41.807999 1972
3 1973-01-01 1973-01-01 44.424999 1973
4 1974-01-01 1974-01-01 49.317001 1974
5 1975-01-01 1975-01-01 53.825001 1975
6 1976-01-01 1976-01-01 56.932999 1976
7 1977-01-01 1977-01-01 60.617001 1977
8 1978-01-01 1978-01-01 65.241997 1978
9 1979-01-01 1979-01-01 72.583000 1979
10 1980-01-01 1980-01-01 82.383003 1980
11 1981-01-01 1981-01-01 90.932999 1981
12 1982-01-01 1982-01-01 96.532997 1982
13 1983-01-01 1983-01-01 99.583000 1983
14 1984-01-01 1984-01-01 103.932999 1984
15 1985-01-01 1985-01-01 107.599998 1985
16 1986-01-01 1986-01-01 109.692001 1986
17 1987-01-01 1987-01-01 113.616997 1987
18 1988-01-01 1988-01-01 118.275002 1988
19 1989-01-01 1989-01-01 123.942001 1989
20 1990-01-01 1990-01-01 130.658005 1990
21 1991-01-01 1991-01-01 136.167007 1991
22 1992-01-01 1992-01-01 140.307999 1992
23 1993-01-01 1993-01-01 144.475006 1993
24 1994-01-01 1994-01-01 148.225006 1994
25 1995-01-01 1995-01-01 152.382996 1995
26 1996-01-01 1996-01-01 156.858002 1996
27 1997-01-01 1997-01-01 160.524994 1997
28 1998-01-01 1998-01-01 163.007996 1998
29 1999-01-01 1999-01-01 166.582993 1999
30 2000-01-01 2000-01-01 172.192001 2000
31 2001-01-01 2001-01-01 177.042007 2001
32 2002-01-01 2002-01-01 179.867004 2002
33 2003-01-01 2003-01-01 184.000000 2003
34 2004-01-01 2004-01-01 188.908005 2004
35 2005-01-01 2005-01-01 195.266998 2005

We no longer need datestr and daten. Up until now we’ve gotten rid of unneeded columns using drop(), but since there are only two columns we do need it’s easier to just select them. Also, we don’t care about the subtleties of different inflation measures, so rename CPIAUCSL to just cpi. Just because you can, do both tasks in one function chain:

cpi = cpi.rename(
    columns={'CPIAUCSL' : 'cpi'}
)[['year', 'cpi']]
cpi
year cpi
0 1970 38.841999
1 1971 40.483002
2 1972 41.807999
3 1973 44.424999
4 1974 49.317001
5 1975 53.825001
6 1976 56.932999
7 1977 60.617001
8 1978 65.241997
9 1979 72.583000
10 1980 82.383003
11 1981 90.932999
12 1982 96.532997
13 1983 99.583000
14 1984 103.932999
15 1985 107.599998
16 1986 109.692001
17 1987 113.616997
18 1988 118.275002
19 1989 123.942001
20 1990 130.658005
21 1991 136.167007
22 1992 140.307999
23 1993 144.475006
24 1994 148.225006
25 1995 152.382996
26 1996 156.858002
27 1997 160.524994
28 1998 163.007996
29 1999 166.582993
30 2000 172.192001
31 2001 177.042007
32 2002 179.867004
33 2003 184.000000
34 2004 188.908005
35 2005 195.266998

Notice that because the column subset was specified as ['year', 'cpi'], year comes first in the result even though it was last in the original data.

7.5.2 Merging

Now we’re ready to merge. Because we haven’t set year as an index for cpi, we have to tell merge() what to merge on. Do that by passing in on='year' rather than left_index=True, right_index=True. This tells it to use the year column in both data sets.

nlsy_combined = cpi.merge(
    nlsy, 
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year cpi birth_year edu income age _merge
0 1970 38.841999 NaN NaN NaN NaN left_only
1 1971 40.483002 NaN NaN NaN NaN left_only
2 1972 41.807999 NaN NaN NaN NaN left_only
3 1973 44.424999 NaN NaN NaN NaN left_only
4 1974 49.317001 NaN NaN NaN NaN left_only
... ... ... ... ... ... ... ...
241046 2001 177.042007 NaN NaN NaN NaN left_only
241047 2002 179.867004 NaN NaN NaN NaN left_only
241048 2003 184.000000 NaN NaN NaN NaN left_only
241049 2004 188.908005 NaN NaN NaN NaN left_only
241050 2005 195.266998 NaN NaN NaN NaN left_only

241051 rows × 7 columns

This works, but all the rows we can see have ‘left_only’ for _merge and NaN for all the columns from nlsy. Also, id has completely disappeared!

Let’s start with the problem of id. Since only one of the DataFrames had a MultiIndex, merge() discarded it and replaced it with a column for the variable you matched on, year. But that got rid of id entirely. To fix that, the solution again is to reset the index of nlsy before merging. That makes id an ordinary column which is preserved after the merge.

nlsy_combined = cpi.merge(
    nlsy.reset_index(), 
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year cpi id birth_year edu income age _merge
0 1970 38.841999 NaN NaN NaN NaN NaN left_only
1 1971 40.483002 NaN NaN NaN NaN NaN left_only
2 1972 41.807999 NaN NaN NaN NaN NaN left_only
3 1973 44.424999 NaN NaN NaN NaN NaN left_only
4 1974 49.317001 NaN NaN NaN NaN NaN left_only
... ... ... ... ... ... ... ... ...
241046 2001 177.042007 NaN NaN NaN NaN NaN left_only
241047 2002 179.867004 NaN NaN NaN NaN NaN left_only
241048 2003 184.000000 NaN NaN NaN NaN NaN left_only
241049 2004 188.908005 NaN NaN NaN NaN NaN left_only
241050 2005 195.266998 NaN NaN NaN NaN NaN left_only

241051 rows × 8 columns

Of course id for the rows we can see is now NaN just like the other columns from nlsy. Things are not quite as dire as they look though: take a look at rows from 1979:

nlsy_combined.loc[
    nlsy_combined['year']==1979
]
year cpi id birth_year edu income age _merge
9 1979 72.583 1.0 58.0 12TH GRADE 4620.0 21.0 both
10 1979 72.583 2.0 59.0 9TH GRADE 4000.0 20.0 both
11 1979 72.583 3.0 61.0 10TH GRADE NaN 18.0 both
12 1979 72.583 4.0 62.0 9TH GRADE NaN 17.0 both
13 1979 72.583 5.0 59.0 1ST YEAR COLLEGE 2200.0 20.0 both
... ... ... ... ... ... ... ... ...
12690 1979 72.583 12682.0 59.0 9TH GRADE 500.0 20.0 both
12691 1979 72.583 12683.0 59.0 12TH GRADE 0.0 20.0 both
12692 1979 72.583 12684.0 60.0 10TH GRADE 0.0 19.0 both
12693 1979 72.583 12685.0 57.0 12TH GRADE 0.0 22.0 both
12694 1979 72.583 12686.0 60.0 12TH GRADE 15000.0 19.0 both

12686 rows × 8 columns

They look just fine. Examine the value_counts() for _merge:

nlsy_combined['_merge'].value_counts()
_merge
both          241034
left_only         17
right_only         0
Name: count, dtype: int64

Only 17 rows did not match, all from cpi (which is why they have NaN for columns from nlsy). Those 17 just happen to be at the beginning and end of the data set, so they were all we could see. The data set is sorted by year (a side effect of matching by it), so that gives us a clue where to look for an explanation. Try value_counts() for year and _merge (use sort=False so it doesn’t change the order of the years):

(
    nlsy_combined[['year', '_merge']].
    value_counts(sort=False)
)
year  _merge   
1970  left_only        1
1971  left_only        1
1972  left_only        1
1973  left_only        1
1974  left_only        1
1975  left_only        1
1976  left_only        1
1977  left_only        1
1978  left_only        1
1979  both         12686
1980  both         12686
1981  both         12686
1982  both         12686
1983  both         12686
1984  both         12686
1985  both         12686
1986  both         12686
1987  both         12686
1988  both         12686
1989  both         12686
1990  both         12686
1991  both         12686
1992  both         12686
1993  both         12686
1994  both         12686
1995  left_only        1
1996  both         12686
1997  left_only        1
1998  both         12686
1999  left_only        1
2000  both         12686
2001  left_only        1
2002  left_only        1
2003  left_only        1
2004  left_only        1
2005  left_only        1
Name: count, dtype: int64

Compare that with value_counts() for year in nlsy. Since year is the row index for nlsy we first have to reset its index to make year just another column so we can use value_counts() on it:

(
    nlsy.reset_index()['year'].
    value_counts(sort=False)
)
year
1979    12686
1980    12686
1981    12686
1982    12686
1983    12686
1984    12686
1985    12686
1986    12686
1987    12686
1988    12686
1989    12686
1990    12686
1991    12686
1992    12686
1993    12686
1994    12686
1996    12686
1998    12686
2000    12686
Name: count, dtype: int64

Not every year appears in nlsy: the NLSY started in 1979; in 1994 it switched to collecting data every other year; and our extract ends in 2000. Meanwhile, cpi starts in 1970 and has a row for every year through 2005. The rows that don’t match are just years for which we have CPI data but no data from the NLSY.

We’re interested in the people in the NLSY, not inflation for its own sake, so we don’t need the inflation data for years that don’t appear in nlsy. That makes this a job for a right join, which will discard observations from the left DataFrame, cpi, that don’t match anything from the right DataFrame, nlsy. Inner join would work too now that we know there are no ‘right_only’ observations, but using it right away would risk dropping NLSY observations if we didn’t have inflation data for some NLSY years.

The final merge would be:

nlsy_combined = cpi.merge(
    nlsy.reset_index(), 
    on='year',
    validate='1:m',
    how='right',
    indicator=True
)
nlsy_combined['_merge'].value_counts()
_merge
both          241034
left_only          0
right_only         0
Name: count, dtype: int64

7.5.3 Adjusting For Inflation

To adjust incomes for inflation we convert all of them to dollars from a single reference year: let’s choose 2000, the last year in our extract. The formula for doing so is the income in a given year times the CPI in the year 2000 divided by the CPI in that year.
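
As a quick sanity check of the formula, take the very first observation in the data: an income of 4620.0 in 1979, when the CPI was 72.583. In year-2000 dollars (CPI 172.192) that is:

# income * CPI in 2000 / CPI in 1979
4620 * 172.192 / 72.583    # about 10960.24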

We need the CPI in the year 2000 as a plain old number. You can easily subset it out from cpi with cpi['year']==2000, but the result will be a Series. Python also doesn’t know that the Series will always have a single element. So to get a simple number, extract the first element of the Series with .iloc[0], and then pass it to the float() function to convert it to a floating-point number (i.e. a number with a fractional component):

cpi_2000 = float(cpi.loc[cpi['year']==2000, 'cpi'].iloc[0])
cpi_2000
172.19200134277344

Now create a column for real income with:

nlsy_combined['real_income'] = (
    nlsy_combined['income'] * 
    cpi_2000 / 
    nlsy_combined['cpi']
)
nlsy_combined
year cpi id birth_year edu income age _merge real_income
0 1979 72.583000 1.0 58.0 12TH GRADE 4620.0 21.0 both 10960.239258
1 1980 82.383003 1.0 58.0 NaN NaN 22.0 both NaN
2 1981 90.932999 1.0 58.0 12TH GRADE 5000.0 23.0 both 9468.070312
3 1982 96.532997 1.0 58.0 NaN NaN 24.0 both NaN
4 1983 99.583000 1.0 58.0 NaN NaN 25.0 both NaN
... ... ... ... ... ... ... ... ... ...
241029 1993 144.475006 12686.0 60.0 12TH GRADE 31000.0 33.0 both 36947.234375
241030 1994 148.225006 12686.0 60.0 NaN NaN 34.0 both NaN
241031 1996 156.858002 12686.0 60.0 NaN NaN 36.0 both NaN
241032 1998 163.007996 12686.0 60.0 NaN NaN 38.0 both NaN
241033 2000 172.192001 12686.0 60.0 NaN NaN 40.0 both NaN

241034 rows × 9 columns

To get a sense of how income and real_income are related, take their mean by year. You’ll see that they start out quite different, because the inflation that took place between those early years and the year 2000 was substantial, and then gradually converge.

nlsy_combined[
    ['income', 'real_income', 'year']
].groupby('year').mean()
income real_income
year
1979 2571.444092 6100.355469
1980 3257.446533 6808.518555
1981 3897.210449 7379.812012
1982 4861.282715 8671.377930
1983 5390.464355 9320.816406
1984 6481.504883 10738.295898
1985 7884.684570 12617.840820
1986 9246.562500 14515.042969
1987 11242.212891 17038.113281
1988 12615.267578 18366.078125
1989 13811.738281 19188.578125
1990 15418.368164 20319.609375
1991 16527.351562 20899.908203
1992 17395.074219 21347.982422
1993 18430.376953 21966.175781
1994 19533.679688 22692.144531
1996 22815.503906 25045.884766
1998 25415.785156 26847.732422
2000 30527.400391 30527.400391

In the past we haven’t had to include the groupby() variable in the subset we call groupby() on; what’s different now? The difference is that year isn’t an index in nlsy_combined. When you subset a DataFrame it keeps the same index, and that index can be used by groupby(). But if you want to groupby() something other than an index, it needs to be a column in the DataFrame you call groupby() on.
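
To see the difference, here’s a sketch where year is made the index first; the subset keeps that index, so groupby() can find it without year being in the column subset:

(
    nlsy_combined.set_index('year')
    [['income', 'real_income']].
    groupby('year').mean()
)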

7.6 Handling Duplicates

All the DataFrames we’ve combined thus far have been well behaved, with most of them matching perfectly. Obviously that won’t always be true with real-world data. The most common problem is duplicates, so let’s spend some time discussing how to handle them.

To see duplicates in action, load acs_demographics.pickle and acs_ses_dup.pickle.

acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_ses = pd.read_pickle('acs_ses_dup.pickle')
acs_ses
edu income
household person
37 1 Some college, >=1 year 10000.0
2 Some college, >=1 year 5300.0
2 Some college, >=1 year 4700.0
241 1 Master's degree 32500.0
242 1 Bachelor's degree 30000.0
... ... ... ...
1236624 1 Some college, >=1 year 50100.0
2 High School graduate 12000.0
1236756 1 Master's degree 69800.0
2 Master's degree 40800.0
1236779 1 High School graduate 22110.0

27410 rows × 2 columns

These should be familiar from the Adding Variables section earlier in this chapter. acs_demo contains demographic information about each individual, while acs_ses contains information about their socio-economic status. But if you look closely at acs_ses you’ll see a problem: household 37 has two person 2’s. This kind of error is not uncommon in real-world data.

First off, note that if you carry out the steps described in the ‘First Steps With Your Data’ chapter, you’ll know about this problem long before you try to combine this data set with anything else. As part of deciding if household and person uniquely identify observations (and before making them a MultiIndex), you’ll look at the maximum value of their value_counts():

(
    acs_ses.reset_index()
    [['household', 'person']].
    value_counts().max()
)
2

The fact that it’s greater than one tells you immediately that you have a problem.

But suppose you don’t carry out those steps. Will you find it when you try to merge acs_ses with acs_demo? Only if you’re careful. If you try to do the merge the easy way, you’ll get no indication anything is wrong:

acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True
)
age hispanic race marital_status female edu income
household person
37 1 20 False White Never married True Some college, >=1 year 10000.0
2 19 False White Never married True Some college, >=1 year 5300.0
2 19 False White Never married True Some college, >=1 year 4700.0
241 1 50 False White Never married True Master's degree 32500.0
242 1 29 False White Never married True Bachelor's degree 30000.0
... ... ... ... ... ... ... ... ...
1236624 1 29 False White Now married False Some college, >=1 year 50100.0
2 26 False White Now married True High School graduate 12000.0
1236756 1 58 False White Now married True Master's degree 69800.0
2 61 False White Now married False Master's degree 40800.0
1236779 1 30 False American Indian Divorced False High School graduate 22110.0

27410 rows × 7 columns

You may have spent enough time staring at this data set to spot the problem, but if not compare with:

acs_demo
age hispanic race marital_status female
household person
37 1 20 False White Never married True
2 19 False White Never married True
3 19 False Black Never married True
241 1 50 False White Never married True
242 1 29 False White Never married True
... ... ... ... ... ... ...
1236624 1 29 False White Now married False
2 26 False White Now married True
1236756 1 58 False White Now married True
2 61 False White Now married False
1236779 1 30 False American Indian Divorced False

27410 rows × 5 columns

The third person in household 37 is supposed to be Black. What happened? In acs_ses, the third person in household 37 has a 2 for person, so their row was combined with the demographic information for person 2 in acs_demo, even though that demographic information had already been combined with the real person 2 in acs_ses. In other words, this merge became a one-to-many merge instead of a one-to-one merge, with no indication that that had happened. What’s more, because the default merge is an inner join, person 3 from acs_demo, not having anyone to match with in acs_ses, was dropped.

This is why you always use validate and start with an outer join. If you run:

acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)

You’ll get the error message (at the bottom of a lot of less clear verbiage):

MergeError: Merge keys are not unique in right dataset; not a one-to-one merge

This is a good thing! Far better for code to crash and tell you what’s wrong than to appear to run successfully but give you bogus data.

So what do you do about it? First, we need to be able to identify the problem observations. If you groupby() both household and person, the size() of the resulting groups should be 1. So create a variable called copies containing exactly that:

acs_ses['copies'] = acs_ses.groupby(
    ['household', 'person']
).size()

The next question is how big a problem you have. Find out by running value_counts() on copies:

acs_ses['copies'].value_counts()
copies
1    27408
2        2
Name: count, dtype: int64

This tells you there are only two observations with duplicate values of household and person. That suggests the problem is probably some sort of data entry error or similar mistake. If you find lots of duplicates, that raises the possibility that you have fundamentally misunderstood the structure of the data set: perhaps it’s actually a hierarchical data set of some sort and you didn’t realize it.

Given that there is a small number of problem observations, take a look at them:

acs_ses[acs_ses['copies']>1]
edu income copies
household person
37 2 Some college, >=1 year 5300.0 2
2 Some college, >=1 year 4700.0 2

We see that although the two person 2’s have the same education level, their incomes are different. That suggests they are in fact two different people erroneously given the same person number. If they had been completely identical, that would suggest they are in fact the same person erroneously entered twice.

The problem is, we don’t know which of them is the person 2 in the acs_demo data set so we can’t combine either of them. If matching their SES data with their demographics is essential to your analysis, and it frequently is, you might as well drop them both right now. You can drop all the duplicates with:

acs_ses[acs_ses['copies']==1]
edu income copies
household person
37 1 Some college, >=1 year 10000.0 1
241 1 Master's degree 32500.0 1
242 1 Bachelor's degree 30000.0 1
377 1 None 51900.0 1
418 1 12th grade, no diploma 12200.0 1
... ... ... ... ...
1236624 1 Some college, >=1 year 50100.0 1
2 High School graduate 12000.0 1
1236756 1 Master's degree 69800.0 1
2 Master's degree 40800.0 1
1236779 1 High School graduate 22110.0 1

27408 rows × 3 columns

What if you had found that they were in fact the same person? Then you would want to keep one copy and drop any others. The cumcount() function works with groupby() and carries out a cumulative count, numbering each row in the group consecutively. Of course, being Python, the first row is zero. So you just need to keep the rows where cumcount() is zero:

acs_ses[
    acs_ses.
    groupby(['household', 'person']).
    cumcount()==0
]
edu income copies
household person
37 1 Some college, >=1 year 10000.0 1
2 Some college, >=1 year 5300.0 2
241 1 Master's degree 32500.0 1
242 1 Bachelor's degree 30000.0 1
377 1 None 51900.0 1
... ... ... ... ...
1236624 1 Some college, >=1 year 50100.0 1
2 High School graduate 12000.0 1
1236756 1 Master's degree 69800.0 1
2 Master's degree 40800.0 1
1236779 1 High School graduate 22110.0 1

27409 rows × 3 columns

This allows you to keep the first person 2. Either way, when you match with acs_demo you’ll have some unmatched observations, which you can deal with in a way that is appropriate for your analysis.
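
To wrap up, here is a sketch of the full fix: drop the duplicates (and the helper copies column, which we no longer need), then do the careful merge and check _merge. acs_ses_clean is just a name introduced for this sketch:

# Keep only observations that appear once, then drop the helper column
acs_ses_clean = acs_ses[acs_ses['copies']==1].drop(columns='copies')

# Careful one-to-one merge; the dropped people will show up as 'left_only'
acs_demo.merge(
    acs_ses_clean,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)['_merge'].value_counts()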