7 Combining Data Sets
Combining data sets is a very common task, and one that’s easy to do if you understand the structure of the data sets you are trying to combine. However, if you’ve misunderstood the structure of the data sets you can end up with a data set that makes no sense at all. Also, combining data sets will often force you to deal with problems in the data set structure.
Always think through what the resulting data set should look like before combining two data sets. If the resulting data set won’t have a consistent, logical structure you probably need to rethink what you’re doing.
We’ll think about combining two data sets in terms of adding new information to an original data set. In reality, which data set is the original and which is new is completely arbitrary, but usually there’s an obvious candidate for each role and it helps for visualizing the process.
How you combine two data sets depends on what the new data is adding to the original data. We’ll discuss the most common scenarios, which are:
- Adding observations
- Adding variables
- Adding level one units to existing level two units
- Adding variables for different levels
We’ll also carry out an extended example and discuss how to handle duplicates when combining data sets.
Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Combining_Data_Sets_Practice.ipynb. Have it import Pandas:
import pandas as pd
7.1 Adding Observations
If the new data set adds more observations to the original data set, then this is a job for the Pandas concat() function. Using concat() makes sense if the original data set and the new data set contain the same kinds of things, but not the same things. The concat() function simply stacks DataFrames, passed in as a list, on top of each other.
Suppose that instead of a single file containing our 2000 ACS sample you were given acs_part1
and acs_part2
, with each file containing half of the observations. Read them in, take a look, and then use concat()
to combine them into a single DataFrame:
acs_part1 = pd.read_pickle('acs_part1.pickle')
acs_part1
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
617416 | 3 | 3 | White | NaN | None | NaN | False | False |
4 | 3 | White | NaN | None | NaN | False | False | |
617477 | 1 | 43 | White | Now married | Bachelor's degree | 82300.0 | False | False |
2 | 40 | White | Now married | Bachelor's degree | 3000.0 | True | False | |
3 | 14 | White | NaN | 7th-8th grade | NaN | True | False |
13705 rows × 7 columns
acs_part2 = pd.read_pickle('acs_part2.pickle')
acs_part2
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
617477 | 4 | 11 | White | NaN | 5th-6th grade | NaN | True | False |
617720 | 1 | 72 | White | Widowed | Nursery school-4th grade | 8000.0 | True | False |
618413 | 1 | 22 | Other | Never married | High School graduate | 24000.0 | False | True |
2 | 25 | Other | Never married | 7th-8th grade | 4000.0 | False | True | |
3 | 27 | Other | Never married | 7th-8th grade | 0.0 | False | True | |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
13705 rows × 7 columns
pd.concat([acs_part1, acs_part2])
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
27410 rows × 7 columns
If one of the DataFrames does not have a variable that exists in the other DataFrame, observations from that DataFrame will get NaN
(missing) for that variable. Generally you want to make sure that the two DataFrames use the same names for the same things before you combine them.
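For instance, here is a minimal sketch with two toy DataFrames (made up for illustration, not part of the ACS data), where the second DataFrame lacks the income column:
df1 = pd.DataFrame({'age': [20, 19], 'income': [10000, 5300]})
df2 = pd.DataFrame({'age': [50, 29]})   # no income column
pd.concat([df1, df2])   # rows from df2 get NaN for income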
7.2 Adding Variables
If the new data set adds more variables to the original data set and observations represent the same things in both data sets, then this is a job for the merge()
function or its shortcut join()
. The merge()
command combines data sets by combining observations that have the same value of an identifier variable or variables, often the row index, so the result has all the variables from both files. The terms merge and join are used pretty much interchangeably in the data science world to describe this operation, though join tends to be more common among people with database backgrounds.
Suppose you were given the data files acs_demographics.pickle
and acs_ses.pickle
, containing demographic information and socio-economic status (SES) information respectively about the 2000 ACS respondents. Load them and take a look:
acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_demo
age | hispanic | race | marital_status | female | ||
---|---|---|---|---|---|---|
household | person | |||||
37 | 1 | 20 | False | White | Never married | True |
2 | 19 | False | White | Never married | True | |
3 | 19 | False | Black | Never married | True | |
241 | 1 | 50 | False | White | Never married | True |
242 | 1 | 29 | False | White | Never married | True |
... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False |
2 | 26 | False | White | Now married | True | |
1236756 | 1 | 58 | False | White | Now married | True |
2 | 61 | False | White | Now married | False | |
1236779 | 1 | 30 | False | American Indian | Divorced | False |
27410 rows × 5 columns
acs_ses = pd.read_pickle('acs_ses.pickle')
acs_ses
edu | income | ||
---|---|---|---|
household | person | ||
37 | 1 | Some college, >=1 year | 10000.0 |
2 | Some college, >=1 year | 5300.0 | |
3 | Some college, >=1 year | 4700.0 | |
241 | 1 | Master's degree | 32500.0 |
242 | 1 | Bachelor's degree | 30000.0 |
... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 |
2 | High School graduate | 12000.0 | |
1236756 | 1 | Master's degree | 69800.0 |
2 | Master's degree | 40800.0 | |
1236779 | 1 | High School graduate | 22110.0 |
27410 rows × 2 columns
The join()
function quickly and easily merges one DataFrame into another if they have the same row index:
acs_demo.join(acs_ses)
age | hispanic | race | marital_status | female | edu | income | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
However, join()
does not have all the tools for checking that a merge worked properly that merge()
does. Given how much can go wrong with merge, we suggest you use merge()
unless you’ve created the data sets to be merged yourself and know they’ll merge successfully.
The merge()
function uses the convention of describing the two DataFrames to be merged as the left DataFrame and the right DataFrame. While in the result the columns of the left DataFrame will go on the left and the columns of the right DataFrame will go on the right, they’re mostly just labels. The DataFrame whose merge function is being called is the left DataFrame, and the one being added is the right DataFrame.
The first argument to pass in is the right DataFrame. Then you need to tell it what to use to match observations. Ideally this will be the indexes, so you can pass in left_index=True
and right_index=True
. Those are the only arguments required to carry out a merge:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
)
age | hispanic | race | marital_status | female | edu | income | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
However, with a few more arguments we can get much stronger reassurance that the merge is working properly.
This merge should combine one person’s demographic information with the same person’s SES information. Each row should match with exactly one other row: it is a one-to-one merge. However, that would not be the case if the DataFrames had duplicate values of the variables we are matching on. That would be a serious problem, so tell merge() to validate that you are in fact carrying out a one-to-one merge with validate='1:1'.
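These DataFrames pass the check, but if either contained duplicate index values the validated merge would stop with a pandas MergeError rather than silently producing a bad result. A sketch of catching it explicitly (normally you would just let the error surface):
try:
    acs_demo.merge(
        acs_ses,
        left_index=True, right_index=True,
        validate='1:1'
    )
except pd.errors.MergeError as e:
    print('Merge failed validation:', e)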
If an observation in one DataFrame does not match with an observation in the other DataFrame, it will get NaN
(missing) for all the variables in the other DataFrame. But sometimes you don’t want to keep such observations at all. There are four different kinds of joins, depending on what should be done with observations that don’t match (illustrated in the sketch after this list):
- In an inner join only observations that match are kept; observations that do not match are discarded. This is the default.
- In a left join observations from the left DataFrame that do not match are kept, but observations from the right DataFrame that do not match are discarded.
- In a right join observations from the right DataFrame that do not match are kept, but observations from the left DataFrame that do not match are discarded.
- In an outer join all observations are kept whether they match or not, and no observations are discarded.
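Here is a minimal sketch with toy DataFrames (made up for illustration): left has rows a and b, right has rows b and c, so only b matches:
left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])
for how in ['inner', 'left', 'right', 'outer']:
    print(how)
    print(left.merge(right, left_index=True, right_index=True, how=how))
An inner join keeps only b; a left join keeps a and b; a right join keeps b and c; an outer join keeps all three.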
There are scenarios where all four of those make sense. However, the first time you merge two DataFrames you need to check whether the merge is working properly. For example, minor differences in the format of the variables to be matched on can lead to nothing matching. To check for that, always start with an outer join even if you plan to discard observations that don’t match later. That will allow you to examine the observations that didn’t match and determine if they really don’t match and thus can be discarded, or if they only didn’t match because the merge didn’t work properly. Once you’re confident everything is working you can change to the type of merge you really want to do.
You set the type of merge with the how
argument. Start with how='outer'
.
Examining whether the merge worked or not is much easier if you can immediately identify the observations that did or did not match. Pass in indicator=True
and merge will create a column called _merge
telling you which DataFrame each observation came from, with ‘both’ indicating it matched successfully. Running value_counts()
on _merge
is a good way to get a general sense of how the merge went.
Thus the careful version of the above merge is:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)
age | hispanic | race | marital_status | female | edu | income | _merge | ||
---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 | both |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | both | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | both | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 | both |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 | both |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | both | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 | both |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | both | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 | both |
27410 rows × 8 columns
Examine the following pairs of data sets:
acs_race.pickle
andacs_education.pickle
acs_adults.pickle
andacs_children.pickle
Determine the appropriate method for combining each pair into a single data set and then do so. If you do a merge, do it carefully and run value_counts()
on _merge
.
First load and examine acs_race
and acs_education
:
acs_race = pd.read_pickle('acs_race.pickle')
acs_race
race | hispanic | ||
---|---|---|---|
household | person | ||
37 | 1 | White | False |
2 | White | False | |
3 | Black | False | |
241 | 1 | White | False |
242 | 1 | White | False |
... | ... | ... | ... |
1236624 | 1 | White | False |
2 | White | False | |
1236756 | 1 | White | False |
2 | White | False | |
1236779 | 1 | American Indian | False |
27410 rows × 2 columns
acs_edu = pd.read_pickle('acs_education.pickle')
acs_edu
edu | ||
---|---|---|
household | person | |
37 | 1 | Some college, >=1 year |
2 | Some college, >=1 year | |
3 | Some college, >=1 year | |
241 | 1 | Master's degree |
242 | 1 | Bachelor's degree |
... | ... | ... |
1236624 | 1 | Some college, >=1 year |
2 | High School graduate | |
1236756 | 1 | Master's degree |
2 | Master's degree | |
1236779 | 1 | High School graduate |
27410 rows × 1 columns
These DataFrames contain different information about the same things, so combining them will add variables. This is a job for merge()
:
acs_combined1 = acs_race.merge(
    acs_edu,
    left_index=True, right_index=True,
    how='outer',
    validate='1:1',
    indicator=True
)
acs_combined1
race | hispanic | edu | _merge | ||
---|---|---|---|---|---|
household | person | ||||
37 | 1 | White | False | Some college, >=1 year | both |
2 | White | False | Some college, >=1 year | both | |
3 | Black | False | Some college, >=1 year | both | |
241 | 1 | White | False | Master's degree | both |
242 | 1 | White | False | Bachelor's degree | both |
... | ... | ... | ... | ... | ... |
1236624 | 1 | White | False | Some college, >=1 year | both |
2 | White | False | High School graduate | both | |
1236756 | 1 | White | False | Master's degree | both |
2 | White | False | Master's degree | both | |
1236779 | 1 | American Indian | False | High School graduate | both |
27410 rows × 4 columns
Now check that they actually matched successfully by looking at value_counts()
for _merge
:
acs_combined1['_merge'].value_counts()
_merge
both 27410
left_only 0
right_only 0
Name: count, dtype: int64
Moving on to acs_adults
and acs_children
:
acs_adults = pd.read_pickle('acs_adults.pickle')
acs_adults
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
20118 rows × 7 columns
acs_children = pd.read_pickle('acs_children.pickle')
acs_children
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
484 | 3 | 4 | Asian | NaN | None | NaN | False | False |
4 | 2 | Asian | NaN | NaN | NaN | True | False | |
607 | 3 | 17 | Asian | Never married | 10th grade | 0.0 | True | False |
4 | 15 | Asian | Never married | 7th-8th grade | 0.0 | False | False | |
5 | 13 | Asian | NaN | 7th-8th grade | NaN | True | False | |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1235851 | 2 | 13 | White | NaN | 5th-6th grade | NaN | True | False |
3 | 12 | White | NaN | 9th grade | NaN | True | False | |
1235861 | 3 | 1 | American Indian | NaN | NaN | NaN | False | False |
4 | 0 | Indigenous, Unspecified | NaN | NaN | NaN | True | False | |
1236287 | 4 | 4 | White | NaN | None | NaN | False | False |
7292 rows × 7 columns
These DataFrames contain the same information but for different people. Combining them will add observations, so this is a job for concat()
:
pd.concat([acs_adults, acs_children])
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1235851 | 2 | 13 | White | NaN | 5th-6th grade | NaN | True | False |
3 | 12 | White | NaN | 9th grade | NaN | True | False | |
1235861 | 3 | 1 | American Indian | NaN | NaN | NaN | False | False |
4 | 0 | Indigenous, Unspecified | NaN | NaN | NaN | True | False | |
1236287 | 4 | 4 | White | NaN | None | NaN | False | False |
27410 rows × 7 columns
Note that the last people in the DataFrame are now all children. If you want to put the DataFrame back in its original order, where the people in a household are together, sort by household
and person
.
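That might look like the following sketch (household and person are already the row MultiIndex, so sort_index() sorts by both; the acs_combined2 name is just for illustration):
acs_combined2 = pd.concat([acs_adults, acs_children]).sort_index()
acs_combined2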
7.3 Adding Level One Units to Existing Level Two Units
Next consider the data sets nlsy_1979.pickle
and nlsy_1980.pickle
. They each contain one year’s worth of data from our NLSY extract. Is combining them a job for concat()
or merge()
?
nlsy_1979 = pd.read_pickle('nlsy_1979.pickle')
nlsy_1979
birth_year | edu | income | age | |
---|---|---|---|---|
id | ||||
1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 |
... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 |
12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 |
12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 |
12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 |
12686 rows × 4 columns
nlsy_1980 = pd.read_pickle('nlsy_1980.pickle')
nlsy_1980
birth_year | edu | income | age | |
---|---|---|---|---|
id | ||||
1.0 | 58.0 | NaN | NaN | 22.0 |
2.0 | 59.0 | 9TH GRADE | 5000.0 | 21.0 |
3.0 | 61.0 | 10TH GRADE | 7000.0 | 19.0 |
4.0 | 62.0 | NaN | NaN | 18.0 |
5.0 | 59.0 | 2ND YEAR COLLEGE | 2000.0 | 21.0 |
... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
12686 rows × 4 columns
The answer depends on whether you want the resulting data set to be in long form or in wide form. In the NLSY, a level two unit is a person and a level one unit is a person-year combination, so adding nlsy_1980
to nlsy_1979
is adding new level one units (years) to the existing level two units (people). In long form each level one unit gets its own observation, so adding nlsy_1980
in long form adds observations. This is a job for concat()
. In wide form, each level two unit gets its own observation, but each level one unit gets its own set of variables. Thus adding nlsy_1980
in wide form adds variables; a job for merge()
.
The only complication is the level one identifier, year. Right now it is found only in the filenames of the two data sets, as is common. In long form, the level one identifier needs to be a variable. In wide form with string column names, it needs to be a suffix at the end of the names of all the level one variables. Fortunately, both concat()
and merge()
have ways to do that easily.
7.3.1 Long Form
First let’s combine them into long form with concat()
. The keys
argument allows you to pass in a list containing an identifier for each DataFrame. It will be used to create a MultiIndex for the combined DataFrame:
nlsy_long = pd.concat(
    [nlsy_1979, nlsy_1980],
    keys=[1979, 1980]
)
nlsy_long
birth_year | edu | income | age | ||
---|---|---|---|---|---|
id | |||||
1979 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | |
... | ... | ... | ... | ... | ... |
1980 | 12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
At this point the second part of the row MultiIndex has a name but the first part does not. You can rename indexes with rename_axis()
; for a MultiIndex pass in a list of names (including the one that already exists):
nlsy_long = nlsy_long.rename_axis(
    ['year', 'id'],
    axis='rows'
)
nlsy_long
birth_year | edu | income | age | ||
---|---|---|---|---|---|
year | id | ||||
1979 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | |
... | ... | ... | ... | ... | ... |
1980 | 12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
This looks a little off because it treats people as nested within years. You’ll get something more familiar if you swap the levels of the MultiIndex and then sort by them:
nlsy_long.swaplevel().sort_index()
birth_year | edu | income | age | ||
---|---|---|---|---|---|
id | year | ||||
1.0 | 1979 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
1980 | 58.0 | NaN | NaN | 22.0 | |
2.0 | 1979 | 59.0 | 9TH GRADE | 4000.0 | 20.0 |
1980 | 59.0 | 9TH GRADE | 5000.0 | 21.0 | |
3.0 | 1979 | 61.0 | 10TH GRADE | NaN | 18.0 |
... | ... | ... | ... | ... | ... |
12684.0 | 1980 | 60.0 | 12TH GRADE | 0.0 | 20.0 |
12685.0 | 1979 | 57.0 | 12TH GRADE | 0.0 | 22.0 |
1980 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 1979 | 60.0 | 12TH GRADE | 15000.0 | 19.0 |
1980 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
7.3.2 Wide Form
Now let’s combine these data sets into wide form with merge()
. The suffixes
argument allows you to pass in a tuple of suffixes, one for each DataFrame, which will be appended to the names of the columns that come from that DataFrame:
nlsy_wide = nlsy_1979.merge(
    nlsy_1980,
    left_index=True, right_index=True,
    suffixes=('_1979', '_1980'),
    validate='1:1',
    how='outer',
    indicator=True
)
nlsy_wide
birth_year_1979 | edu_1979 | income_1979 | age_1979 | birth_year_1980 | edu_1980 | income_1980 | age_1980 | _merge | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | 58.0 | NaN | NaN | 22.0 | both |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | 59.0 | 9TH GRADE | 5000.0 | 21.0 | both |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | 61.0 | 10TH GRADE | 7000.0 | 19.0 | both |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | 62.0 | NaN | NaN | 18.0 | both |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | 59.0 | 2ND YEAR COLLEGE | 2000.0 | 21.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 | both |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | both |
12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | both |
12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | both |
12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 | both |
12686 rows × 9 columns
And we’re done!
7.4 Adding Variables for Different Levels
If you need to merge hierarchical data where the left DataFrame contains data on the level one units and the right DataFrame contains data on the level two units, this is a job for a many-to-one merge. A many-to-one merge combines observations just like a one-to-one merge, but many rows representing level one units are combined with one row representing the corresponding level two unit. A one-to-many merge is essentially the same thing, except that the left DataFrame contains the level two unit (the “one”) and the right DataFrame contains the level one units (the “many”).
Load our good friend acs.pickle
plus acs_households.pickle
. The latter has one observation per household and some household level variables:
acs = pd.read_pickle('acs.pickle')
acs_households = pd.read_pickle('acs_households.pickle')
acs_households
household_income | max_edu | |
---|---|---|
household | ||
37 | 20000.0 | Some college, >=1 year |
241 | 32500.0 | Master's degree |
242 | 30000.0 | Bachelor's degree |
377 | 51900.0 | None |
418 | 12200.0 | 12th grade, no diploma |
... | ... | ... |
1236119 | 93200.0 | Bachelor's degree |
1236287 | 53400.0 | Some college, <1 year |
1236624 | 62100.0 | Some college, >=1 year |
1236756 | 110600.0 | Master's degree |
1236779 | 22110.0 | High School graduate |
10565 rows × 2 columns
With acs
as the left DataFrame and acs_households
as right, this is a many-to-one merge. You can do this pretty much like the other merges we’ve done: the only thing that needs to change is to set validate
to 'm:1'
:
acs.merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | ||
---|---|---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False | 20000.0 | Some college, >=1 year | both |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | 20000.0 | Some college, >=1 year | both | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | 20000.0 | Some college, >=1 year | both | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False | 62100.0 | Some college, >=1 year | both |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27410 rows × 10 columns
This is subtly dangerous, however. Note how the combined DataFrame has household and person as a MultiIndex. That makes the DataFrame incapable of including a household that has no people in it and thus no value for person
. To see the result, try the merge again after removing the first three observations from acs
(household 37):
acs.iloc[3:, :].merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | ||
---|---|---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||||
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
377 | 1 | 69 | White | Never married | None | 51900.0 | True | False | 51900.0 | None | both |
418 | 1 | 59 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | 12200.0 | 12th grade, no diploma | both |
465 | 1 | 55 | Black | Separated | Associate degree | 0.0 | False | False | 2600.0 | Associate degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False | 62100.0 | Some college, >=1 year | both |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27407 rows × 10 columns
Household 37 has disappeared from the combined DataFrame, despite the fact that it exists in acs_households
and we specified how='outer'
.
One easy-to-implement solution is to not match on the indexes at all. Instead, reset them and then use on=
to specify the variable to be used for matching:
acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
)
household | person | age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 241 | 1.0 | 50.0 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
1 | 242 | 1.0 | 29.0 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
2 | 377 | 1.0 | 69.0 | White | Never married | None | 51900.0 | True | False | 51900.0 | None | both |
3 | 418 | 1.0 | 59.0 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | 12200.0 | 12th grade, no diploma | both |
4 | 465 | 1.0 | 55.0 | Black | Separated | Associate degree | 0.0 | False | False | 2600.0 | Associate degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27403 | 1236624 | 2.0 | 26.0 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both |
27404 | 1236756 | 1.0 | 58.0 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
27405 | 1236756 | 2.0 | 61.0 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both |
27406 | 1236779 | 1.0 | 30.0 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27407 | 37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20000.0 | Some college, >=1 year | right_only |
27408 rows × 12 columns
Now household 37 is in the result, though down at the bottom, with person
set to NaN and _merge
set to ‘right_only’. You could go ahead and make household
and person
the MultiIndex again, though that would leave you with a ‘person NaN.’
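A sketch of that last step (the acs_combined name is just for illustration):
acs_combined = acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
).set_index(['household', 'person'])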
If you make acs_households
left and acs
right, then this is a one-to-many merge and validate
needs to be set to '1:m'
. Otherwise nothing needs to change:
acs_households.reset_index().merge(
    acs.iloc[3:, :].reset_index(),
    on='household',
    validate='1:m',
    how='outer',
    indicator=True
)
household | household_income | max_edu | person | age | race | marital_status | edu | income | female | hispanic | _merge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 37 | 20000.0 | Some college, >=1 year | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | left_only |
1 | 241 | 32500.0 | Master's degree | 1.0 | 50.0 | White | Never married | Master's degree | 32500.0 | True | False | both |
2 | 242 | 30000.0 | Bachelor's degree | 1.0 | 29.0 | White | Never married | Bachelor's degree | 30000.0 | True | False | both |
3 | 377 | 51900.0 | None | 1.0 | 69.0 | White | Never married | None | 51900.0 | True | False | both |
4 | 418 | 12200.0 | 12th grade, no diploma | 1.0 | 59.0 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27403 | 1236624 | 62100.0 | Some college, >=1 year | 1.0 | 29.0 | White | Now married | Some college, >=1 year | 50100.0 | False | False | both |
27404 | 1236624 | 62100.0 | Some college, >=1 year | 2.0 | 26.0 | White | Now married | High School graduate | 12000.0 | True | False | both |
27405 | 1236756 | 110600.0 | Master's degree | 1.0 | 58.0 | White | Now married | Master's degree | 69800.0 | True | False | both |
27406 | 1236756 | 110600.0 | Master's degree | 2.0 | 61.0 | White | Now married | Master's degree | 40800.0 | False | False | both |
27407 | 1236779 | 22110.0 | High School graduate | 1.0 | 30.0 | American Indian | Divorced | High School graduate | 22110.0 | False | False | both |
27408 rows × 12 columns
Note that I described this as an easy-to-implement solution. There are many different ways to merge in Python, and if you’re doing something where performance is an issue you may find something more complicated will be substantially faster for your particular problem.
nlsy_person.pickle
contains information about the people in our NLSY extract that does not change over time, while nlsy_person_year.pickle
contains information that changes from year to year. Combine them. Note that the person with id
1 is not in nlsy_person
and the person with id
12686 (the last person) is not in nlsy_person_year
. Make sure the data you do have for them is not lost! To make it easy to check, set id
and year
as the indexes again after combining.
nlsy_person = pd.read_pickle('nlsy_person.pickle')
nlsy_person
yearOfBirth | total_income | |
---|---|---|
id | ||
2.0 | 59.0 | 214500.0 |
3.0 | 61.0 | 80500.0 |
4.0 | 62.0 | 180236.0 |
5.0 | 59.0 | 185679.0 |
6.0 | 60.0 | 314683.0 |
... | ... | ... |
12682.0 | 59.0 | 6005.0 |
12683.0 | 59.0 | 0.0 |
12684.0 | 60.0 | 3000.0 |
12685.0 | 57.0 | 53547.0 |
12686.0 | 60.0 | 335560.0 |
12685 rows × 2 columns
nlsy_person_year = pd.read_pickle('nlsy_person_year.pickle')
nlsy_person_year
edu | income | age | ||
---|---|---|---|---|
id | year | |||
1.0 | 1979 | 12TH GRADE | 4620.0 | 21.0 |
1980 | NaN | NaN | 22.0 | |
1981 | 12TH GRADE | 5000.0 | 23.0 | |
1982 | NaN | NaN | 24.0 | |
1983 | NaN | NaN | 25.0 | |
... | ... | ... | ... | ... |
12685.0 | 1993 | NaN | NaN | 36.0 |
1994 | NaN | NaN | 37.0 | |
1996 | NaN | NaN | 39.0 | |
1998 | NaN | NaN | 41.0 | |
2000 | NaN | NaN | 43.0 |
241015 rows × 3 columns
nlsy_combined = nlsy_person_year.reset_index().merge(
    nlsy_person.reset_index(),
    on='id',
    how='outer',
    validate='m:1',
    indicator=True
).set_index(['id', 'year'])
nlsy_combined.loc[[1, 12686]]
edu | income | age | yearOfBirth | total_income | _merge | ||
---|---|---|---|---|---|---|---|
id | year | ||||||
1.0 | 1979.0 | 12TH GRADE | 4620.0 | 21.0 | NaN | NaN | left_only |
1980.0 | NaN | NaN | 22.0 | NaN | NaN | left_only | |
1981.0 | 12TH GRADE | 5000.0 | 23.0 | NaN | NaN | left_only | |
1982.0 | NaN | NaN | 24.0 | NaN | NaN | left_only | |
1983.0 | NaN | NaN | 25.0 | NaN | NaN | left_only | |
1984.0 | NaN | NaN | 26.0 | NaN | NaN | left_only | |
1985.0 | NaN | NaN | 27.0 | NaN | NaN | left_only | |
1986.0 | NaN | NaN | 28.0 | NaN | NaN | left_only | |
1987.0 | NaN | NaN | 29.0 | NaN | NaN | left_only | |
1988.0 | NaN | NaN | 30.0 | NaN | NaN | left_only | |
1989.0 | NaN | NaN | 31.0 | NaN | NaN | left_only | |
1990.0 | NaN | NaN | 32.0 | NaN | NaN | left_only | |
1991.0 | NaN | NaN | 33.0 | NaN | NaN | left_only | |
1992.0 | NaN | NaN | 34.0 | NaN | NaN | left_only | |
1993.0 | NaN | NaN | 35.0 | NaN | NaN | left_only | |
1994.0 | NaN | NaN | 36.0 | NaN | NaN | left_only | |
1996.0 | NaN | NaN | 38.0 | NaN | NaN | left_only | |
1998.0 | NaN | NaN | 40.0 | NaN | NaN | left_only | |
2000.0 | NaN | NaN | 42.0 | NaN | NaN | left_only | |
12686.0 | NaN | NaN | NaN | NaN | 60.0 | 335560.0 | right_only |
It doesn’t matter which DataFrame you make left or right as long as your validate
matches what you choose.
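For example, a sketch of the same combination with nlsy_person as the left DataFrame; the only change is that validate flips to '1:m':
nlsy_person.reset_index().merge(
    nlsy_person_year.reset_index(),
    on='id',
    how='outer',
    validate='1:m',
    indicator=True
).set_index(['id', 'year'])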
7.5 Inflation Example
Next we’ll do an example that illustrates some of the complications that can arise when combining data sets from different sources.
One issue with the NLSY data extract we’ve been using is that incomes from different time periods are not really comparable due to inflation. To adjust them for inflation, we need information about the level of inflation in each year. The data set fred_cpi.pickle
contains the average Consumer Price Index for All Urban Consumers for every year from 1970 to 2005. It was obtained from the Federal Reserve Economic Data (FRED). Our task is to merge it with nlsy.pickle
.
Start by reading in both data sets:
nlsy = pd.read_pickle('nlsy.pickle')
cpi = pd.read_pickle('fred_cpi.pickle')
cpi
datestr | daten | CPIAUCSL | |
---|---|---|---|
0 | 1970-01-01 | 1970-01-01 | 38.841999 |
1 | 1971-01-01 | 1971-01-01 | 40.483002 |
2 | 1972-01-01 | 1972-01-01 | 41.807999 |
3 | 1973-01-01 | 1973-01-01 | 44.424999 |
4 | 1974-01-01 | 1974-01-01 | 49.317001 |
5 | 1975-01-01 | 1975-01-01 | 53.825001 |
6 | 1976-01-01 | 1976-01-01 | 56.932999 |
7 | 1977-01-01 | 1977-01-01 | 60.617001 |
8 | 1978-01-01 | 1978-01-01 | 65.241997 |
9 | 1979-01-01 | 1979-01-01 | 72.583000 |
10 | 1980-01-01 | 1980-01-01 | 82.383003 |
11 | 1981-01-01 | 1981-01-01 | 90.932999 |
12 | 1982-01-01 | 1982-01-01 | 96.532997 |
13 | 1983-01-01 | 1983-01-01 | 99.583000 |
14 | 1984-01-01 | 1984-01-01 | 103.932999 |
15 | 1985-01-01 | 1985-01-01 | 107.599998 |
16 | 1986-01-01 | 1986-01-01 | 109.692001 |
17 | 1987-01-01 | 1987-01-01 | 113.616997 |
18 | 1988-01-01 | 1988-01-01 | 118.275002 |
19 | 1989-01-01 | 1989-01-01 | 123.942001 |
20 | 1990-01-01 | 1990-01-01 | 130.658005 |
21 | 1991-01-01 | 1991-01-01 | 136.167007 |
22 | 1992-01-01 | 1992-01-01 | 140.307999 |
23 | 1993-01-01 | 1993-01-01 | 144.475006 |
24 | 1994-01-01 | 1994-01-01 | 148.225006 |
25 | 1995-01-01 | 1995-01-01 | 152.382996 |
26 | 1996-01-01 | 1996-01-01 | 156.858002 |
27 | 1997-01-01 | 1997-01-01 | 160.524994 |
28 | 1998-01-01 | 1998-01-01 | 163.007996 |
29 | 1999-01-01 | 1999-01-01 | 166.582993 |
30 | 2000-01-01 | 2000-01-01 | 172.192001 |
31 | 2001-01-01 | 2001-01-01 | 177.042007 |
32 | 2002-01-01 | 2002-01-01 | 179.867004 |
33 | 2003-01-01 | 2003-01-01 | 184.000000 |
34 | 2004-01-01 | 2004-01-01 | 188.908005 |
35 | 2005-01-01 | 2005-01-01 | 195.266998 |
Taking this data and adding nlsy
to it is a job for a one-to-many merge: one year’s CPI data will match with many people’s NLSY data for that year. Note how this treats year as the level two unit! For most purposes it’s more useful to think of people as the level two unit in the NLSY, but it’s just as logical to group person-year combinations by year instead.
7.5.1 Preparing to Merge
We have some work to do with cpi
before it’s ready to merge. The biggest issue is that we need a simple numeric year variable we can use to match with the year index in nlsy
. cpi
has two different date columns; look at the dtypes
to see the difference:
cpi.dtypes
datestr object
daten datetime64[ns]
CPIAUCSL float32
dtype: object
datestr is, unsurprisingly, a string (object), but daten (date number) is a Python datetime. A datetime column’s dt accessor has a year attribute containing just the year component of the date. We can put that in its own column with:
cpi['year'] = cpi['daten'].dt.year
cpi
datestr | daten | CPIAUCSL | year | |
---|---|---|---|---|
0 | 1970-01-01 | 1970-01-01 | 38.841999 | 1970 |
1 | 1971-01-01 | 1971-01-01 | 40.483002 | 1971 |
2 | 1972-01-01 | 1972-01-01 | 41.807999 | 1972 |
3 | 1973-01-01 | 1973-01-01 | 44.424999 | 1973 |
4 | 1974-01-01 | 1974-01-01 | 49.317001 | 1974 |
5 | 1975-01-01 | 1975-01-01 | 53.825001 | 1975 |
6 | 1976-01-01 | 1976-01-01 | 56.932999 | 1976 |
7 | 1977-01-01 | 1977-01-01 | 60.617001 | 1977 |
8 | 1978-01-01 | 1978-01-01 | 65.241997 | 1978 |
9 | 1979-01-01 | 1979-01-01 | 72.583000 | 1979 |
10 | 1980-01-01 | 1980-01-01 | 82.383003 | 1980 |
11 | 1981-01-01 | 1981-01-01 | 90.932999 | 1981 |
12 | 1982-01-01 | 1982-01-01 | 96.532997 | 1982 |
13 | 1983-01-01 | 1983-01-01 | 99.583000 | 1983 |
14 | 1984-01-01 | 1984-01-01 | 103.932999 | 1984 |
15 | 1985-01-01 | 1985-01-01 | 107.599998 | 1985 |
16 | 1986-01-01 | 1986-01-01 | 109.692001 | 1986 |
17 | 1987-01-01 | 1987-01-01 | 113.616997 | 1987 |
18 | 1988-01-01 | 1988-01-01 | 118.275002 | 1988 |
19 | 1989-01-01 | 1989-01-01 | 123.942001 | 1989 |
20 | 1990-01-01 | 1990-01-01 | 130.658005 | 1990 |
21 | 1991-01-01 | 1991-01-01 | 136.167007 | 1991 |
22 | 1992-01-01 | 1992-01-01 | 140.307999 | 1992 |
23 | 1993-01-01 | 1993-01-01 | 144.475006 | 1993 |
24 | 1994-01-01 | 1994-01-01 | 148.225006 | 1994 |
25 | 1995-01-01 | 1995-01-01 | 152.382996 | 1995 |
26 | 1996-01-01 | 1996-01-01 | 156.858002 | 1996 |
27 | 1997-01-01 | 1997-01-01 | 160.524994 | 1997 |
28 | 1998-01-01 | 1998-01-01 | 163.007996 | 1998 |
29 | 1999-01-01 | 1999-01-01 | 166.582993 | 1999 |
30 | 2000-01-01 | 2000-01-01 | 172.192001 | 2000 |
31 | 2001-01-01 | 2001-01-01 | 177.042007 | 2001 |
32 | 2002-01-01 | 2002-01-01 | 179.867004 | 2002 |
33 | 2003-01-01 | 2003-01-01 | 184.000000 | 2003 |
34 | 2004-01-01 | 2004-01-01 | 188.908005 | 2004 |
35 | 2005-01-01 | 2005-01-01 | 195.266998 | 2005 |
We no longer need datestr
and daten
. Up until now we’ve gotten rid of unneeded columns using drop()
, but since there are only two columns we do need it’s easier to just select them. Also, we don’t care about the subtleties of different inflation measures, so rename CPIAUCSL
to just cpi
. Just because you can, do both tasks in one function chain:
cpi = cpi.rename(
    columns={'CPIAUCSL' : 'cpi'}
)[['year', 'cpi']]
cpi
year | cpi | |
---|---|---|
0 | 1970 | 38.841999 |
1 | 1971 | 40.483002 |
2 | 1972 | 41.807999 |
3 | 1973 | 44.424999 |
4 | 1974 | 49.317001 |
5 | 1975 | 53.825001 |
6 | 1976 | 56.932999 |
7 | 1977 | 60.617001 |
8 | 1978 | 65.241997 |
9 | 1979 | 72.583000 |
10 | 1980 | 82.383003 |
11 | 1981 | 90.932999 |
12 | 1982 | 96.532997 |
13 | 1983 | 99.583000 |
14 | 1984 | 103.932999 |
15 | 1985 | 107.599998 |
16 | 1986 | 109.692001 |
17 | 1987 | 113.616997 |
18 | 1988 | 118.275002 |
19 | 1989 | 123.942001 |
20 | 1990 | 130.658005 |
21 | 1991 | 136.167007 |
22 | 1992 | 140.307999 |
23 | 1993 | 144.475006 |
24 | 1994 | 148.225006 |
25 | 1995 | 152.382996 |
26 | 1996 | 156.858002 |
27 | 1997 | 160.524994 |
28 | 1998 | 163.007996 |
29 | 1999 | 166.582993 |
30 | 2000 | 172.192001 |
31 | 2001 | 177.042007 |
32 | 2002 | 179.867004 |
33 | 2003 | 184.000000 |
34 | 2004 | 188.908005 |
35 | 2005 | 195.266998 |
Notice that because the column subset was specified as ['year', 'cpi']
, year
comes first in the result even though it was last in the original data.
7.5.2 Merging
Now we’re ready to merge. Because we haven’t set year
as an index for cpi
, we have to tell merge()
what to merge on. Do that by passing in on='year'
rather than left_index=True, right_index=True
. This tells it to use the year
column in both data sets.
nlsy_combined = cpi.merge(
    nlsy,
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year | cpi | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|
0 | 1970 | 38.841999 | NaN | NaN | NaN | NaN | left_only |
1 | 1971 | 40.483002 | NaN | NaN | NaN | NaN | left_only |
2 | 1972 | 41.807999 | NaN | NaN | NaN | NaN | left_only |
3 | 1973 | 44.424999 | NaN | NaN | NaN | NaN | left_only |
4 | 1974 | 49.317001 | NaN | NaN | NaN | NaN | left_only |
... | ... | ... | ... | ... | ... | ... | ... |
241046 | 2001 | 177.042007 | NaN | NaN | NaN | NaN | left_only |
241047 | 2002 | 179.867004 | NaN | NaN | NaN | NaN | left_only |
241048 | 2003 | 184.000000 | NaN | NaN | NaN | NaN | left_only |
241049 | 2004 | 188.908005 | NaN | NaN | NaN | NaN | left_only |
241050 | 2005 | 195.266998 | NaN | NaN | NaN | NaN | left_only |
241051 rows × 7 columns
This works, but all the rows we can see have ‘left_only’ for _merge
and NaN for all the columns from nlsy
. Also, id
has completely disappeared!
Let’s start with the problem of id
. Since only one of the DataFrames had a MultiIndex, merge()
discarded it and replaced it with a column for the variable you matched on, year
. But that got rid of id
entirely. To fix that, the solution again is to reset the index of nlsy
before merging. That makes id
an ordinary column which is preserved after the merge.
nlsy_combined = cpi.merge(
    nlsy.reset_index(),
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year | cpi | id | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|---|
0 | 1970 | 38.841999 | NaN | NaN | NaN | NaN | NaN | left_only |
1 | 1971 | 40.483002 | NaN | NaN | NaN | NaN | NaN | left_only |
2 | 1972 | 41.807999 | NaN | NaN | NaN | NaN | NaN | left_only |
3 | 1973 | 44.424999 | NaN | NaN | NaN | NaN | NaN | left_only |
4 | 1974 | 49.317001 | NaN | NaN | NaN | NaN | NaN | left_only |
... | ... | ... | ... | ... | ... | ... | ... | ... |
241046 | 2001 | 177.042007 | NaN | NaN | NaN | NaN | NaN | left_only |
241047 | 2002 | 179.867004 | NaN | NaN | NaN | NaN | NaN | left_only |
241048 | 2003 | 184.000000 | NaN | NaN | NaN | NaN | NaN | left_only |
241049 | 2004 | 188.908005 | NaN | NaN | NaN | NaN | NaN | left_only |
241050 | 2005 | 195.266998 | NaN | NaN | NaN | NaN | NaN | left_only |
241051 rows × 8 columns
Of course id
for the rows we can see is now NaN just like the other columns from nlsy
. Things are not quite as dire as they look though: take a look at rows from 1979:
nlsy_combined.loc[
    nlsy_combined['year']==1979
]
year | cpi | id | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|---|
9 | 1979 | 72.583 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | both |
10 | 1979 | 72.583 | 2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | both |
11 | 1979 | 72.583 | 3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | both |
12 | 1979 | 72.583 | 4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | both |
13 | 1979 | 72.583 | 5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12690 | 1979 | 72.583 | 12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 | both |
12691 | 1979 | 72.583 | 12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 | both |
12692 | 1979 | 72.583 | 12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 | both |
12693 | 1979 | 72.583 | 12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 | both |
12694 | 1979 | 72.583 | 12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 | both |
12686 rows × 8 columns
They look just fine. Examine the value_counts()
for _merge
:
nlsy_combined['_merge'].value_counts()
_merge
both 241034
left_only 17
right_only 0
Name: count, dtype: int64
Only 17 rows did not match, all from cpi
(which is why they have NaN for columns from nlsy
). Those 17 just happen to be at the beginning and end of the data set so they were all we could see. The data set is sorted by year
(a side effect of matching by it) so that gives us a clue where to look for an explanation. Try value_counts()
for year
and _merge
(use sort=False
so it doesn’t change the order of the years):
(
    nlsy_combined[['year', '_merge']].
    value_counts(sort=False)
)
year _merge
1970 left_only 1
1971 left_only 1
1972 left_only 1
1973 left_only 1
1974 left_only 1
1975 left_only 1
1976 left_only 1
1977 left_only 1
1978 left_only 1
1979 both 12686
1980 both 12686
1981 both 12686
1982 both 12686
1983 both 12686
1984 both 12686
1985 both 12686
1986 both 12686
1987 both 12686
1988 both 12686
1989 both 12686
1990 both 12686
1991 both 12686
1992 both 12686
1993 both 12686
1994 both 12686
1995 left_only 1
1996 both 12686
1997 left_only 1
1998 both 12686
1999 left_only 1
2000 both 12686
2001 left_only 1
2002 left_only 1
2003 left_only 1
2004 left_only 1
2005 left_only 1
Name: count, dtype: int64
Compare that with value_counts()
for year
in nlsy
. Since year
is the row index for nlsy
we first have to reset its index to make year
just another column so we can use value_counts()
on it:
(
    nlsy.reset_index()['year'].
    value_counts(sort=False)
)
year
1979 12686
1980 12686
1981 12686
1982 12686
1983 12686
1984 12686
1985 12686
1986 12686
1987 12686
1988 12686
1989 12686
1990 12686
1991 12686
1992 12686
1993 12686
1994 12686
1996 12686
1998 12686
2000 12686
Name: count, dtype: int64
Not every year appears in nlsy: the NLSY started in 1979; in 1994 it switched to collecting data every other year; and our extract ends in 2000. Meanwhile, cpi
starts in 1970 and has a row for every year until 2005. The rows that don’t match are just years for which we have CPI data but no data from the NLSY.
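If you want to confirm that, a sketch listing the unmatched years:
nlsy_combined.loc[nlsy_combined['_merge']=='left_only', 'year'].tolist()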
We’re interested in the people in the NLSY, not inflation for its own sake, so we don’t need the inflation data for years that don’t appear in nlsy
. That makes this a job for a right join, which will discard observations from the left DataFrame, cpi
, that don’t match anything from the right DataFrame, nlsy
. Inner join would work too now that we know there are no ‘right_only’ observations, but using it right away would risk dropping NLSY observations if we didn’t have inflation data for some NLSY years.
The final merge would be:
nlsy_combined = cpi.merge(
    nlsy.reset_index(),
    on='year',
    validate='1:m',
    how='right',
    indicator=True
)
nlsy_combined['_merge'].value_counts()
_merge
both 241034
left_only 0
right_only 0
Name: count, dtype: int64
7.5.3 Adjusting For Inflation
To adjust income
for inflation we convert all of them to dollars in a given year: let’s choose 2000, the last year in our extract. The formula for doing so is the income
in a given year times the CPI in the year 2000 divided by the CPI in that year.
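As a quick sanity check of the arithmetic, take person 1’s 1979 income of 4,620 with the 1979 CPI of about 72.583 and the 2000 CPI of about 172.192:
4620 * 172.192 / 72.583   # about 10960.24 in year-2000 dollars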
We need the CPI in the year 2000 as a plain old number. You can easily subset it out from cpi
with cpi['year']==2000
, but the result will be a Series. Python also doesn’t know that the Series will always have a single element. So to get a simple number, extract the first element of the Series with .iloc[0]
, and then pass it to the float()
function to convert it to a floating-point number (i.e. a number with a fractional component):
cpi_2000 = float(cpi.loc[cpi['year']==2000, 'cpi'].iloc[0])
cpi_2000
172.19200134277344
Now create a column for real income with:
nlsy_combined['real_income'] = (
    nlsy_combined['income'] *
    cpi_2000 /
    nlsy_combined['cpi']
)
nlsy_combined
year | cpi | id | birth_year | edu | income | age | _merge | real_income | |
---|---|---|---|---|---|---|---|---|---|
0 | 1979 | 72.583000 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | both | 10960.239258 |
1 | 1980 | 82.383003 | 1.0 | 58.0 | NaN | NaN | 22.0 | both | NaN |
2 | 1981 | 90.932999 | 1.0 | 58.0 | 12TH GRADE | 5000.0 | 23.0 | both | 9468.070312 |
3 | 1982 | 96.532997 | 1.0 | 58.0 | NaN | NaN | 24.0 | both | NaN |
4 | 1983 | 99.583000 | 1.0 | 58.0 | NaN | NaN | 25.0 | both | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
241029 | 1993 | 144.475006 | 12686.0 | 60.0 | 12TH GRADE | 31000.0 | 33.0 | both | 36947.234375 |
241030 | 1994 | 148.225006 | 12686.0 | 60.0 | NaN | NaN | 34.0 | both | NaN |
241031 | 1996 | 156.858002 | 12686.0 | 60.0 | NaN | NaN | 36.0 | both | NaN |
241032 | 1998 | 163.007996 | 12686.0 | 60.0 | NaN | NaN | 38.0 | both | NaN |
241033 | 2000 | 172.192001 | 12686.0 | 60.0 | NaN | NaN | 40.0 | both | NaN |
241034 rows × 9 columns
To get a sense of how income
and real_income
are related, take their mean by year. You’ll see how they start out quite different, since the inflation that took place between the early years and the year 2000 was substantial, and then gradually converge.
nlsy_combined[
    ['income', 'real_income', 'year']
].groupby('year').mean()
income | real_income | |
---|---|---|
year | ||
1979 | 2571.444092 | 6100.355469 |
1980 | 3257.446533 | 6808.518555 |
1981 | 3897.210449 | 7379.812012 |
1982 | 4861.282715 | 8671.377930 |
1983 | 5390.464355 | 9320.816406 |
1984 | 6481.504883 | 10738.295898 |
1985 | 7884.684570 | 12617.840820 |
1986 | 9246.562500 | 14515.042969 |
1987 | 11242.212891 | 17038.113281 |
1988 | 12615.267578 | 18366.078125 |
1989 | 13811.738281 | 19188.578125 |
1990 | 15418.368164 | 20319.609375 |
1991 | 16527.351562 | 20899.908203 |
1992 | 17395.074219 | 21347.982422 |
1993 | 18430.376953 | 21966.175781 |
1994 | 19533.679688 | 22692.144531 |
1996 | 22815.503906 | 25045.884766 |
1998 | 25415.785156 | 26847.732422 |
2000 | 30527.400391 | 30527.400391 |
In the past we haven’t had to include the groupby()
variable in the subset passed to groupby()
; what’s different now? The difference is that year
isn’t an index in nlsy_combined
. When you subset a DataFrame it keeps the same index, and that index can be used by groupby()
. But if you want to groupby()
something other than an index it needs to be part of the DataFrame passed to groupby()
.
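For comparison, a sketch of the equivalent computation with year set as the row index first, which groupby() can then find by name:
(
    nlsy_combined.
    set_index('year')
    [['income', 'real_income']].
    groupby('year').mean()
)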
7.6 Handling Duplicates
All the DataFrames we’ve combined thus far have been well behaved, with most of them matching perfectly. Obviously that won’t always be true with real-world data. The most common problem is duplicates, so let’s spend some time discussing how to handle them.
To see duplicates in action, load acs_demographics.pickle
and acs_ses_dup.pickle
.
acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_ses = pd.read_pickle('acs_ses_dup.pickle')
acs_ses
household | person | edu | income
---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 |
37 | 2 | Some college, >=1 year | 5300.0
37 | 2 | Some college, >=1 year | 4700.0
241 | 1 | Master's degree | 32500.0 |
242 | 1 | Bachelor's degree | 30000.0 |
... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 |
1236624 | 2 | High School graduate | 12000.0
1236756 | 1 | Master's degree | 69800.0 |
1236756 | 2 | Master's degree | 40800.0
1236779 | 1 | High School graduate | 22110.0 |
27410 rows × 2 columns
These should be familiar from section 1 of this chapter: `acs_demo` contains demographic information about each individual, while `acs_ses` contains information about their socio-economic status. But if you look closely at `acs_ses` you’ll see a problem: household 37 has two person 2’s. This kind of error is not uncommon in real-world data.
First off, note that if you carry out the steps described in the ‘First Steps With Your Data’ chapter, you’ll know about this problem long before you try to combine this data set with anything else. As part of deciding if `household` and `person` uniquely identify observations (and before making them a MultiIndex), you’ll look at the maximum value of their `value_counts()`:
(
    acs_ses.reset_index()
    [['household', 'person']].
    value_counts().
    max()
)
2
The fact that it’s greater than one tells you immediately that you have a problem.
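Since `household` and `person` already form the MultiIndex of `acs_ses`, an equivalent check (a sketch using the index directly, rather than the steps above) is:

# True only if every (household, person) pair appears exactly once;
# False for this data set
acs_ses.index.is_unique

# count the rows that repeat an earlier index entry; 1 here, since one
# (household, person) pair appears twice
acs_ses.index.duplicated().sum()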
But suppose you don’t carry out those steps. Will you find the problem when you try to merge `acs_ses` with `acs_demo`? Only if you’re careful. If you try to do the merge the easy way, you’ll get no indication anything is wrong:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True
)
household | person | age | hispanic | race | marital_status | female | edu | income
---|---|---|---|---|---|---|---|---
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
37 | 2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0
37 | 2 | 19 | False | White | Never married | True | Some college, >=1 year | 4700.0
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
1236624 | 2 | 26 | False | White | Now married | True | High School graduate | 12000.0
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
1236756 | 2 | 61 | False | White | Now married | False | Master's degree | 40800.0
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
You may have spent enough time staring at this data set to spot the problem, but if not, compare with:
acs_demo
household | person | age | hispanic | race | marital_status | female
---|---|---|---|---|---|---
37 | 1 | 20 | False | White | Never married | True |
37 | 2 | 19 | False | White | Never married | True
37 | 3 | 19 | False | Black | Never married | True
241 | 1 | 50 | False | White | Never married | True |
242 | 1 | 29 | False | White | Never married | True |
... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False |
1236624 | 2 | 26 | False | White | Now married | True
1236756 | 1 | 58 | False | White | Now married | True |
1236756 | 2 | 61 | False | White | Now married | False
1236779 | 1 | 30 | False | American Indian | Divorced | False |
27410 rows × 5 columns
The third person in household 37 is supposed to be Black. What happened? In `acs_ses`, the third person in household 37 has a 2 for `person`, so their row was combined with the demographic information for person 2 in `acs_demo`, even though that information had already been combined with the real person 2 in `acs_ses`. In other words, this merge became a one-to-many merge instead of a one-to-one merge, with no indication that that had happened. What’s more, because the default merge is an inner join, person 3 from `acs_demo`, having no one to match with in `acs_ses`, was dropped.
This is why you always use `validate` and start with an outer join. If you run:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)
You’ll get the error message (at the bottom of a lot of less clear verbiage):
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
This is a good thing! Far better for code to crash and tell you what’s wrong than to appear to run successfully but give you bogus data.
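For reference, `validate` also accepts `'1:m'`, `'m:1'`, and `'m:m'`, so if a one-to-many relationship were actually what you intended, pandas could still verify that assumption. A hypothetical sketch (not appropriate here, since our duplicate is an error rather than a real one-to-many structure):

# '1:m' only requires the keys on the left to be unique, so this
# would run without complaint despite the duplicated pair on the right
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:m',
    how='outer',
    indicator=True
)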
So what do you do about it? First, we need to be able to identify the problem observations. If you `groupby()` both `household` and `person`, the `size()` of the resulting groups should be 1. So create a variable called `copies` containing exactly that:
acs_ses['copies'] = acs_ses.groupby(
    ['household', 'person']
).size()
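This works because `size()` returns a Series indexed by (`household`, `person`), which pandas aligns with the index of `acs_ses` during the assignment, so each row of a duplicated pair gets its group’s size. An equivalent sketch uses `transform('size')`, which returns one value per original row directly (going through `income` here is arbitrary; `'size'` counts rows no matter which column you pick):

# transform('size') gives every row the size of the group it belongs to
acs_ses['copies'] = acs_ses.groupby(
    ['household', 'person']
)['income'].transform('size')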
The next question is how big a problem you have. Find out by running `value_counts()` on `copies`:
acs_ses['copies'].value_counts()
copies
1 27408
2 2
Name: count, dtype: int64
This tells you there are only two observations with duplicate values of `household` and `person`. That suggests the problem is probably some sort of data entry error or similar mistake. If you find lots of duplicates, that raises the possibility that you have fundamentally misunderstood the structure of the data set: perhaps it’s actually a hierarchical data set of some sort and you didn’t realize it.
Given that there is a small number of problem observations, take a look at them:
acs_ses[acs_ses['copies']>1]
household | person | edu | income | copies
---|---|---|---|---
37 | 2 | Some college, >=1 year | 5300.0 | 2 |
37 | 2 | Some college, >=1 year | 4700.0 | 2
We see that although the two person 2’s have the same education level, their incomes are different. That suggests they are in fact two different people erroneously given the same person number. If they had been completely identical, that would suggest they are in fact the same person erroneously entered twice.
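If you’d rather have code make that distinction than your eyes, one sketch is to count complete duplicates, index and values alike:

# rows identical in household, person, AND every column; returns 0 here,
# since the two person 2's differ on income
(
    acs_ses.reset_index()
    .duplicated(keep=False)
    .sum()
)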
The problem is, we don’t know which of them is the person 2 in the `acs_demo` data set, so we can’t combine either of them. If matching their SES data with their demographics is essential to your analysis, and it frequently is, you might as well drop them both right now. You can drop all the duplicates with:
acs_ses[acs_ses['copies']==1]
household | person | edu | income | copies
---|---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 | 1 |
241 | 1 | Master's degree | 32500.0 | 1 |
242 | 1 | Bachelor's degree | 30000.0 | 1 |
377 | 1 | None | 51900.0 | 1 |
418 | 1 | 12th grade, no diploma | 12200.0 | 1 |
... | ... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 | 1 |
1236624 | 2 | High School graduate | 12000.0 | 1
1236756 | 1 | Master's degree | 69800.0 | 1 |
1236756 | 2 | Master's degree | 40800.0 | 1
1236779 | 1 | High School graduate | 22110.0 | 1 |
27408 rows × 3 columns
What if you had found that they were in fact the same person? Then you would want to keep one copy and drop any others. The `cumcount()` function works with `groupby()` and carries out a cumulative count, numbering each row in the group consecutively. Of course, being Python, the count starts at zero. So you just need to keep the rows where `cumcount()` is zero:
acs_ses[
    acs_ses.
    groupby(['household', 'person']).
    cumcount()==0
]
household | person | edu | income | copies
---|---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 | 1 |
37 | 2 | Some college, >=1 year | 5300.0 | 2
241 | 1 | Master's degree | 32500.0 | 1 |
242 | 1 | Bachelor's degree | 30000.0 | 1 |
377 | 1 | None | 51900.0 | 1 |
... | ... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 | 1 |
1236624 | 2 | High School graduate | 12000.0 | 1
1236756 | 1 | Master's degree | 69800.0 | 1 |
1236756 | 2 | Master's degree | 40800.0 | 1
1236779 | 1 | High School graduate | 22110.0 | 1 |
27409 rows × 3 columns
This allows you to keep the first person 2. Either way, when you match with `acs_demo` you’ll have some unmatched observations, which you can deal with in a way that is appropriate for your analysis.
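An equivalent idiom you may see elsewhere (a sketch, not what we used above) does the same thing through the index:

# keep='first' marks every repeat after the first occurrence of an index
# entry, so negating the mask keeps one row per (household, person) pair
acs_ses[~acs_ses.index.duplicated(keep='first')]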