7 Combining Data Sets
Combining data sets is a very common task, and one that’s easy to do if you understand the structure of the data sets you are trying to combine. However, if you’ve misunderstood the structure of the data sets you can end up with a data set that makes no sense at all. Also, combining data sets will often force you to deal with problems in the data set structure.
Always think through what the resulting data set should look like before combining two data sets. If the resulting data set won’t have a consistent, logical structure you probably need to rethink what you’re doing.
We’ll think about combining two data sets in terms of adding new information to an original data set. In reality, which data set is the original and which is new is completely arbitrary, but usually there’s an obvious candidate for each role and it helps for visualizing the process.
How you combine two data sets depends on what the new data is adding to the original data. We’ll discuss the most common scenarios, which are:
- Adding observations
- Adding variables
- Adding level one units to existing level two units
- Adding variables for different levels
We’ll also carry out an extended example and discuss how to handle duplicates when combining data sets.
Start up Jupyter Lab if you haven’t already and navigate to the folder where you put the example files. Then create a new Python Notebook and call it Combining_Data_Sets_Practice.ipynb. Have it import Pandas:
import pandas as pd
7.1 Adding Observations
If the new data set adds more observations to the original data set, then this is a job for the Pandas concat() function. Using concat() makes sense if the original data set and the new data set contain the same kinds of things, but not the same things. The concat() function simply stacks DataFrames, passed in as a list, on top of each other.
Suppose that instead of a single file containing our 2000 ACS sample you were given acs_part1
and acs_part2
, with each file containing half of the observations. Read them in, take a look, and then use concat()
to combine them into a single DataFrame:
acs_part1 = pd.read_pickle('acs_part1.pickle')
acs_part1
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
617416 | 3 | 3 | White | NaN | None | NaN | False | False |
4 | 3 | White | NaN | None | NaN | False | False | |
617477 | 1 | 43 | White | Now married | Bachelor's degree | 82300.0 | False | False |
2 | 40 | White | Now married | Bachelor's degree | 3000.0 | True | False | |
3 | 14 | White | NaN | 7th-8th grade | NaN | True | False |
13705 rows × 7 columns
acs_part2 = pd.read_pickle('acs_part2.pickle')
acs_part2
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
617477 | 4 | 11 | White | NaN | 5th-6th grade | NaN | True | False |
617720 | 1 | 72 | White | Widowed | Nursery school-4th grade | 8000.0 | True | False |
618413 | 1 | 22 | Other | Never married | High School graduate | 24000.0 | False | True |
2 | 25 | Other | Never married | 7th-8th grade | 4000.0 | False | True | |
3 | 27 | Other | Never married | 7th-8th grade | 0.0 | False | True | |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
13705 rows × 7 columns
pd.concat([acs_part1, acs_part2])
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
27410 rows × 7 columns
If one of the DataFrames does not have a variable that exists in the other DataFrame, observations from that DataFrame will get NaN
(missing) for that variable. Generally you want to make sure that the two DataFrames use the same names for the same things before you combine them.
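For instance, here is a minimal sketch with two toy DataFrames (made up for illustration, not part of the ACS data), where the second DataFrame lacks the income column:
df1 = pd.DataFrame({'age': [20, 19], 'income': [10000, 5300]})
df2 = pd.DataFrame({'age': [50, 29]})   # no income column
pd.concat([df1, df2])   # rows from df2 get NaN for income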
7.2 Adding Variables
If the new data set adds more variables to the original data set and observations represent the same things in both data sets, then this is a job for the merge()
function or its shortcut join()
. The merge()
command combines data sets by combining observations that have the same value of an identifier variable or variables, often the row index, so the result has all the variables from both files. The terms merge and join are used pretty much interchangeably in the data science world to describe this operation, though join tends to be more common among people with database backgrounds.
Suppose you were given the data files acs_demographics.pickle
and acs_ses.pickle
, containing demographic information and socio-economic status (SES) information respectively about the 2000 ACS respondents. Load them and take a look:
acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_demo
age | hispanic | race | marital_status | female | ||
---|---|---|---|---|---|---|
household | person | |||||
37 | 1 | 20 | False | White | Never married | True |
2 | 19 | False | White | Never married | True | |
3 | 19 | False | Black | Never married | True | |
241 | 1 | 50 | False | White | Never married | True |
242 | 1 | 29 | False | White | Never married | True |
... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False |
2 | 26 | False | White | Now married | True | |
1236756 | 1 | 58 | False | White | Now married | True |
2 | 61 | False | White | Now married | False | |
1236779 | 1 | 30 | False | American Indian | Divorced | False |
27410 rows × 5 columns
acs_ses = pd.read_pickle('acs_ses.pickle')
acs_ses
edu | income | ||
---|---|---|---|
household | person | ||
37 | 1 | Some college, >=1 year | 10000.0 |
2 | Some college, >=1 year | 5300.0 | |
3 | Some college, >=1 year | 4700.0 | |
241 | 1 | Master's degree | 32500.0 |
242 | 1 | Bachelor's degree | 30000.0 |
... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 |
2 | High School graduate | 12000.0 | |
1236756 | 1 | Master's degree | 69800.0 |
2 | Master's degree | 40800.0 | |
1236779 | 1 | High School graduate | 22110.0 |
27410 rows × 2 columns
The join()
function quickly and easily merges one DataFrame into another if they have the same row index:
acs_demo.join(acs_ses)
age | hispanic | race | marital_status | female | edu | income | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
However, join()
does not have all the tools for checking that a merge worked properly that merge()
does. Given how much can go wrong with merge, we suggest you use merge()
unless you’ve created the data sets to be merged yourself and know they’ll merge successfully.
The merge()
function uses the convention of describing the two DataFrames to be merged as the left DataFrame and the right DataFrame. While in the result the columns of the left DataFrame will go on the left and the columns of the right DataFrame will go on the right, they’re mostly just labels. The DataFrame whose merge function is being called is the left DataFrame, and the one being added is the right DataFrame.
The first argument to pass in is the right DataFrame. Then you need to tell it what to use to match observations. Ideally this will be the indexes, so you can pass in left_index=True
and right_index=True
. Those are the only arguments required to carry out a merge:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
)
age | hispanic | race | marital_status | female | edu | income | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
However, with a few more arguments we can get much stronger reassurance that the merge is working properly.
This merge should combine one person’s demographic information with the same person’s SES information. Each row should match with exactly one other row: it is a one-to-one merge. However, that would not be the case if the DataFrames had duplicate values of the variables we are matching on. That would be a serious problem, so tell merge() to validate that you are in fact carrying out a one-to-one merge with validate='1:1'.
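These DataFrames pass the check, but if either contained duplicate index values the validated merge would stop with a pandas MergeError rather than silently producing a bad result. A sketch of catching it explicitly (normally you would just let the error surface):
try:
    acs_demo.merge(
        acs_ses,
        left_index=True, right_index=True,
        validate='1:1'
    )
except pd.errors.MergeError as e:
    print('Merge failed validation:', e)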
If an observation in one DataFrame does not match with an observation in the other DataFrame, it will get NaN
(missing) for all the variables in the other DataFrame. But sometimes you don’t want to keep such observations at all. There are four different kinds of joins, depending on what should be done with observations that don’t match (illustrated in the sketch after this list):
- In an inner join only observations that match are kept; observations that do not match are discarded. This is the default.
- In a left join observations from the left DataFrame that do not match are kept, but observations from the right DataFrame that do not match are discarded.
- In a right join observations from the right DataFrame that do not match are kept, but observations from the left DataFrame that do not match are discarded.
- In an outer join all observations are kept whether they match or not, and no observations are discarded.
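Here is a minimal sketch with toy DataFrames (made up for illustration): left has rows a and b, right has rows b and c, so only b matches:
left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])
for how in ['inner', 'left', 'right', 'outer']:
    print(how)
    print(left.merge(right, left_index=True, right_index=True, how=how))
An inner join keeps only b; a left join keeps a and b; a right join keeps b and c; an outer join keeps all three.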
There are scenarios where all four of those make sense. However, the first time you merge two DataFrames you need to check whether the merge is working properly. For example, minor differences in the format of the variables to be matched on can lead to nothing matching. To check for that, always start with an outer join even if you plan to discard observations that don’t match later. That will allow you to examine the observations that didn’t match and determine if they really don’t match and thus can be discarded, or if they only didn’t match because the merge didn’t work properly. Once you’re confident everything is working you can change to the type of merge you really want to do.
You set the type of merge with the how
argument. Start with how='outer'
.
Examining whether the merge worked or not is much easier if you can immediately identify the observations that did or did not match. Pass in indicator=True
and merge will create a column called _merge
telling you which DataFrame each observation came from, with ‘both’ indicating it matched successfully. Running value_counts()
on _merge
is a good way to get a general sense of how the merge went.
Thus the careful version of the above merge is:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)
age | hispanic | race | marital_status | female | edu | income | _merge | ||
---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 | both |
2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0 | both | |
3 | 19 | False | Black | Never married | True | Some college, >=1 year | 4700.0 | both | |
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 | both |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 | both |
2 | 26 | False | White | Now married | True | High School graduate | 12000.0 | both | |
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 | both |
2 | 61 | False | White | Now married | False | Master's degree | 40800.0 | both | |
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 | both |
27410 rows × 8 columns
Examine the following pairs of data sets:
acs_race.pickle
andacs_education.pickle
acs_adults.pickle
andacs_children.pickle
Determine the appropriate method for combining each pair into a single data set and then do so. If you do a merge, do it carefully and run value_counts()
on _merge
.
First load and examine acs_race
and acs_education
:
acs_race = pd.read_pickle('acs_race.pickle')
acs_race
race | hispanic | ||
---|---|---|---|
household | person | ||
37 | 1 | White | False |
2 | White | False | |
3 | Black | False | |
241 | 1 | White | False |
242 | 1 | White | False |
... | ... | ... | ... |
1236624 | 1 | White | False |
2 | White | False | |
1236756 | 1 | White | False |
2 | White | False | |
1236779 | 1 | American Indian | False |
27410 rows × 2 columns
acs_edu = pd.read_pickle('acs_education.pickle')
acs_edu
edu | ||
---|---|---|
household | person | |
37 | 1 | Some college, >=1 year |
2 | Some college, >=1 year | |
3 | Some college, >=1 year | |
241 | 1 | Master's degree |
242 | 1 | Bachelor's degree |
... | ... | ... |
1236624 | 1 | Some college, >=1 year |
2 | High School graduate | |
1236756 | 1 | Master's degree |
2 | Master's degree | |
1236779 | 1 | High School graduate |
27410 rows × 1 columns
These DataFrames contain different information about the same things, so combining them will add variables. This is a job for merge()
:
acs_combined1 = acs_race.merge(
    acs_edu,
    left_index=True, right_index=True,
    how='outer',
    validate='1:1',
    indicator=True
)
acs_combined1
race | hispanic | edu | _merge | ||
---|---|---|---|---|---|
household | person | ||||
37 | 1 | White | False | Some college, >=1 year | both |
2 | White | False | Some college, >=1 year | both | |
3 | Black | False | Some college, >=1 year | both | |
241 | 1 | White | False | Master's degree | both |
242 | 1 | White | False | Bachelor's degree | both |
... | ... | ... | ... | ... | ... |
1236624 | 1 | White | False | Some college, >=1 year | both |
2 | White | False | High School graduate | both | |
1236756 | 1 | White | False | Master's degree | both |
2 | White | False | Master's degree | both | |
1236779 | 1 | American Indian | False | High School graduate | both |
27410 rows × 4 columns
Now check that they actually matched successfully by looking at value_counts()
for _merge
:
acs_combined1['_merge'].value_counts()
_merge
both 27410
left_only 0
right_only 0
Name: count, dtype: int64
Moving on to acs_adults
and acs_children
:
acs_adults = pd.read_pickle('acs_adults.pickle')
acs_adults
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False |
20118 rows × 7 columns
acs_children = pd.read_pickle('acs_children.pickle')
acs_children
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
484 | 3 | 4 | Asian | NaN | None | NaN | False | False |
4 | 2 | Asian | NaN | NaN | NaN | True | False | |
607 | 3 | 17 | Asian | Never married | 10th grade | 0.0 | True | False |
4 | 15 | Asian | Never married | 7th-8th grade | 0.0 | False | False | |
5 | 13 | Asian | NaN | 7th-8th grade | NaN | True | False | |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1235851 | 2 | 13 | White | NaN | 5th-6th grade | NaN | True | False |
3 | 12 | White | NaN | 9th grade | NaN | True | False | |
1235861 | 3 | 1 | American Indian | NaN | NaN | NaN | False | False |
4 | 0 | Indigenous, Unspecified | NaN | NaN | NaN | True | False | |
1236287 | 4 | 4 | White | NaN | None | NaN | False | False |
7292 rows × 7 columns
These DataFrames contain the same information but for different people. Combining them will add observations, so this is a job for concat()
:
pd.concat([acs_adults, acs_children])
age | race | marital_status | edu | income | female | hispanic | ||
---|---|---|---|---|---|---|---|---|
household | person | |||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1235851 | 2 | 13 | White | NaN | 5th-6th grade | NaN | True | False |
3 | 12 | White | NaN | 9th grade | NaN | True | False | |
1235861 | 3 | 1 | American Indian | NaN | NaN | NaN | False | False |
4 | 0 | Indigenous, Unspecified | NaN | NaN | NaN | True | False | |
1236287 | 4 | 4 | White | NaN | None | NaN | False | False |
27410 rows × 7 columns
Note that the last people in the DataFrame are now all children. If you want to put the DataFrame back in its original order, where the people in a household are together, sort by household
and person
.
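That might look like the following sketch (household and person are already the row MultiIndex, so sort_index() sorts by both; the acs_combined2 name is just for illustration):
acs_combined2 = pd.concat([acs_adults, acs_children]).sort_index()
acs_combined2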
7.3 Adding Level One Units to Existing Level Two Units
Next consider the data sets nlsy_1979.pickle
and nlsy_1980.pickle
. They each contain one year’s worth of data from our NLSY extract. Is combining them a job for concat()
or merge()
?
nlsy_1979 = pd.read_pickle('nlsy_1979.pickle')
nlsy_1979
birth_year | edu | income | age | |
---|---|---|---|---|
id | ||||
1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 |
... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 |
12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 |
12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 |
12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 |
12686 rows × 4 columns
nlsy_1980 = pd.read_pickle('nlsy_1980.pickle')
nlsy_1980
birth_year | edu | income | age | |
---|---|---|---|---|
id | ||||
1.0 | 58.0 | NaN | NaN | 22.0 |
2.0 | 59.0 | 9TH GRADE | 5000.0 | 21.0 |
3.0 | 61.0 | 10TH GRADE | 7000.0 | 19.0 |
4.0 | 62.0 | NaN | NaN | 18.0 |
5.0 | 59.0 | 2ND YEAR COLLEGE | 2000.0 | 21.0 |
... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
12686 rows × 4 columns
The answer depends on whether you want the resulting data set to be in long form or in wide form. In the NLSY, a level two unit is a person and a level one unit is a person-year combination, so adding nlsy_1980
to nlsy_1979
is adding new level one units (years) to the existing level two units (people). In long form each level one unit gets its own observation, so adding nlsy_1980
in long form adds observations. This is a job for concat()
. In wide form, each level two unit gets its own observation, but each level one unit gets its own set of variables. Thus adding nlsy_1980
in wide form adds variables; a job for merge()
.
The only complication is the level one identifier, year. Right now it is found only in the filenames of the two data sets, as is common. In long form, the level one identifier needs to be a variable. In wide form with string column names, it needs to be a suffix at the end of the names of all the level one variables. Fortunately, both concat()
and merge()
have ways to do that easily.
7.3.1 Long Form
First let’s combine them into long form with concat()
. The keys
argument allows you to pass in a list containing an identifier for each DataFrame. It will be used to create a MultiIndex for the combined DataFrame:
nlsy_long = pd.concat(
    [nlsy_1979, nlsy_1980],
    keys=[1979, 1980]
)
nlsy_long
birth_year | edu | income | age | ||
---|---|---|---|---|---|
id | |||||
1979 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | |
... | ... | ... | ... | ... | ... |
1980 | 12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
At this point the second part of the row MultiIndex has a name but the first part does not. You can rename indexes with rename_axis()
; for a MultiIndex pass in a list of names (including the one that already exists):
nlsy_long = nlsy_long.rename_axis(
    ['year', 'id'],
    axis='rows'
)
nlsy_long
birth_year | edu | income | age | ||
---|---|---|---|---|---|
year | id | ||||
1979 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | |
... | ... | ... | ... | ... | ... |
1980 | 12682.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | |
12684.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | |
12685.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
This looks a little off because it treats people as nested within years. You’ll get something more familiar if you swap the levels of the MultiIndex and then sort by them:
nlsy_long.swaplevel().sort_index()
birth_year | edu | income | age | ||
---|---|---|---|---|---|
id | year | ||||
1.0 | 1979 | 58.0 | 12TH GRADE | 4620.0 | 21.0 |
1980 | 58.0 | NaN | NaN | 22.0 | |
2.0 | 1979 | 59.0 | 9TH GRADE | 4000.0 | 20.0 |
1980 | 59.0 | 9TH GRADE | 5000.0 | 21.0 | |
3.0 | 1979 | 61.0 | 10TH GRADE | NaN | 18.0 |
... | ... | ... | ... | ... | ... |
12684.0 | 1980 | 60.0 | 12TH GRADE | 0.0 | 20.0 |
12685.0 | 1979 | 57.0 | 12TH GRADE | 0.0 | 22.0 |
1980 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | |
12686.0 | 1979 | 60.0 | 12TH GRADE | 15000.0 | 19.0 |
1980 | 60.0 | 12TH GRADE | 17000.0 | 20.0 |
25372 rows × 4 columns
7.3.2 Wide Form
Now let’s combine these data sets into wide form with merge()
. The suffixes
argument allows you to pass in a tuple of suffixes, one for each DataFrame, which will be appended to the names of the columns that come from that DataFrame:
nlsy_wide = nlsy_1979.merge(
    nlsy_1980,
    left_index=True, right_index=True,
    suffixes=('_1979', '_1980'),
    validate='1:1',
    how='outer',
    indicator=True
)
nlsy_wide
birth_year_1979 | edu_1979 | income_1979 | age_1979 | birth_year_1980 | edu_1980 | income_1980 | age_1980 | _merge | |
---|---|---|---|---|---|---|---|---|---|
id | |||||||||
1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | 58.0 | NaN | NaN | 22.0 | both |
2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | 59.0 | 9TH GRADE | 5000.0 | 21.0 | both |
3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | 61.0 | 10TH GRADE | 7000.0 | 19.0 | both |
4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | 62.0 | NaN | NaN | 18.0 | both |
5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | 59.0 | 2ND YEAR COLLEGE | 2000.0 | 21.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 | 59.0 | 9TH GRADE | 0.0 | 21.0 | both |
12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 | 59.0 | 12TH GRADE | 0.0 | 21.0 | both |
12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 | 60.0 | 12TH GRADE | 0.0 | 20.0 | both |
12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 | 57.0 | 12TH GRADE | 3000.0 | 23.0 | both |
12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 | 60.0 | 12TH GRADE | 17000.0 | 20.0 | both |
12686 rows × 9 columns
And we’re done!
7.4 Adding Variables for Different Levels
If you need to merge hierarchical data where the left DataFrame contains data on the level one units and the right DataFrame contains data on the level two units, this is a job for a many-to-one merge. A many-to-one merge combines observations just like a one-to-one merge, but many rows representing level one units are combined with one row representing the corresponding level two unit. A one-to-many merge is essentially the same thing, except that the left DataFrame contains the level two unit (the “one”) and the right DataFrame contains the level one units (the “many”).
Load our good friend acs.pickle
plus acs_households.pickle
. The latter has one observation per household and some household level variables:
acs = pd.read_pickle('acs.pickle')
acs_households = pd.read_pickle('acs_households.pickle')
acs_households
household_income | max_edu | |
---|---|---|
household | ||
37 | 20000.0 | Some college, >=1 year |
241 | 32500.0 | Master's degree |
242 | 30000.0 | Bachelor's degree |
377 | 51900.0 | None |
418 | 12200.0 | 12th grade, no diploma |
... | ... | ... |
1236119 | 93200.0 | Bachelor's degree |
1236287 | 53400.0 | Some college, <1 year |
1236624 | 62100.0 | Some college, >=1 year |
1236756 | 110600.0 | Master's degree |
1236779 | 22110.0 | High School graduate |
10565 rows × 2 columns
With acs
as the left DataFrame and acs_households
as right, this is a many-to-one merge. You can do this pretty much like the other merges we’ve done: the only thing that needs to change is to set validate
to 'm:1'
:
acs.merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | ||
---|---|---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||||
37 | 1 | 20 | White | Never married | Some college, >=1 year | 10000.0 | True | False | 20000.0 | Some college, >=1 year | both |
2 | 19 | White | Never married | Some college, >=1 year | 5300.0 | True | False | 20000.0 | Some college, >=1 year | both | |
3 | 19 | Black | Never married | Some college, >=1 year | 4700.0 | True | False | 20000.0 | Some college, >=1 year | both | |
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False | 62100.0 | Some college, >=1 year | both |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27410 rows × 10 columns
This is subtly dangerous, however. Note how the combined DataFrame has household and person as a MultiIndex. That makes the DataFrame incapable of including a household that has no people in it and thus no value for person
. To see the result, try the merge again after removing the first three observations from acs
(household 37):
acs.iloc[3:, :].merge(
    acs_households,
    left_index=True, right_index=True,
    validate='m:1',
    how='outer',
    indicator=True
)
age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | ||
---|---|---|---|---|---|---|---|---|---|---|---|
household | person | ||||||||||
241 | 1 | 50 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
242 | 1 | 29 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
377 | 1 | 69 | White | Never married | None | 51900.0 | True | False | 51900.0 | None | both |
418 | 1 | 59 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | 12200.0 | 12th grade, no diploma | both |
465 | 1 | 55 | Black | Separated | Associate degree | 0.0 | False | False | 2600.0 | Associate degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | White | Now married | Some college, >=1 year | 50100.0 | False | False | 62100.0 | Some college, >=1 year | both |
2 | 26 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both | |
1236756 | 1 | 58 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
2 | 61 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both | |
1236779 | 1 | 30 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27407 rows × 10 columns
Household 37 has disappeared from the combined DataFrame, despite the fact that it exists in acs_households
and we specified how='outer'
.
One easy-to-implement solution is to not match on the indexes at all. Instead, reset them and then use on=
to specify the variable to be used for matching:
acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
)
household | person | age | race | marital_status | edu | income | female | hispanic | household_income | max_edu | _merge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 241 | 1.0 | 50.0 | White | Never married | Master's degree | 32500.0 | True | False | 32500.0 | Master's degree | both |
1 | 242 | 1.0 | 29.0 | White | Never married | Bachelor's degree | 30000.0 | True | False | 30000.0 | Bachelor's degree | both |
2 | 377 | 1.0 | 69.0 | White | Never married | None | 51900.0 | True | False | 51900.0 | None | both |
3 | 418 | 1.0 | 59.0 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | 12200.0 | 12th grade, no diploma | both |
4 | 465 | 1.0 | 55.0 | Black | Separated | Associate degree | 0.0 | False | False | 2600.0 | Associate degree | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27403 | 1236624 | 2.0 | 26.0 | White | Now married | High School graduate | 12000.0 | True | False | 62100.0 | Some college, >=1 year | both |
27404 | 1236756 | 1.0 | 58.0 | White | Now married | Master's degree | 69800.0 | True | False | 110600.0 | Master's degree | both |
27405 | 1236756 | 2.0 | 61.0 | White | Now married | Master's degree | 40800.0 | False | False | 110600.0 | Master's degree | both |
27406 | 1236779 | 1.0 | 30.0 | American Indian | Divorced | High School graduate | 22110.0 | False | False | 22110.0 | High School graduate | both |
27407 | 37 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 20000.0 | Some college, >=1 year | right_only |
27408 rows × 12 columns
Now household 37 is in the result, though down at the bottom, with person
set to NaN and _merge
set to ‘right_only’. You could go ahead and make household
and person
the MultiIndex again, though that would leave you with a ‘person NaN.’
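A sketch of that last step (the acs_combined name is just for illustration):
acs_combined = acs.iloc[3:, :].reset_index().merge(
    acs_households.reset_index(),
    on='household',
    validate='m:1',
    how='outer',
    indicator=True
).set_index(['household', 'person'])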
If you make acs_households
left and acs
right, then this is a one-to-many merge and validate
needs to be set to '1:m'
. Otherwise nothing needs to change:
acs_households.reset_index().merge(
    acs.iloc[3:, :].reset_index(),
    on='household',
    validate='1:m',
    how='outer',
    indicator=True
)
household | household_income | max_edu | person | age | race | marital_status | edu | income | female | hispanic | _merge | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 37 | 20000.0 | Some college, >=1 year | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | left_only |
1 | 241 | 32500.0 | Master's degree | 1.0 | 50.0 | White | Never married | Master's degree | 32500.0 | True | False | both |
2 | 242 | 30000.0 | Bachelor's degree | 1.0 | 29.0 | White | Never married | Bachelor's degree | 30000.0 | True | False | both |
3 | 377 | 51900.0 | None | 1.0 | 69.0 | White | Never married | None | 51900.0 | True | False | both |
4 | 418 | 12200.0 | 12th grade, no diploma | 1.0 | 59.0 | White | Widowed | 12th grade, no diploma | 12200.0 | True | False | both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
27403 | 1236624 | 62100.0 | Some college, >=1 year | 1.0 | 29.0 | White | Now married | Some college, >=1 year | 50100.0 | False | False | both |
27404 | 1236624 | 62100.0 | Some college, >=1 year | 2.0 | 26.0 | White | Now married | High School graduate | 12000.0 | True | False | both |
27405 | 1236756 | 110600.0 | Master's degree | 1.0 | 58.0 | White | Now married | Master's degree | 69800.0 | True | False | both |
27406 | 1236756 | 110600.0 | Master's degree | 2.0 | 61.0 | White | Now married | Master's degree | 40800.0 | False | False | both |
27407 | 1236779 | 22110.0 | High School graduate | 1.0 | 30.0 | American Indian | Divorced | High School graduate | 22110.0 | False | False | both |
27408 rows × 12 columns
Note that I described this as an easy-to-implement solution. There are many different ways to merge in Python, and if you’re doing something where performance is an issue you may find something more complicated will be substantially faster for your particular problem.
nlsy_person.pickle
contains information about the people in our NLSY extract that does not change over time, while nlsy_person_year.pickle
contains information that changes from year to year. Combine them. Note that the person with id
1 is not in nlsy_person
and the person with id
12686 (the last person) is not in nlsy_person_year
. Make sure the data you do have for them is not lost! To make it easy to check, set id
and year
as the indexes again after combining.
nlsy_person = pd.read_pickle('nlsy_person.pickle')
nlsy_person
yearOfBirth | total_income | |
---|---|---|
id | ||
2.0 | 59.0 | 214500.0 |
3.0 | 61.0 | 80500.0 |
4.0 | 62.0 | 180236.0 |
5.0 | 59.0 | 185679.0 |
6.0 | 60.0 | 314683.0 |
... | ... | ... |
12682.0 | 59.0 | 6005.0 |
12683.0 | 59.0 | 0.0 |
12684.0 | 60.0 | 3000.0 |
12685.0 | 57.0 | 53547.0 |
12686.0 | 60.0 | 335560.0 |
12685 rows × 2 columns
nlsy_person_year = pd.read_pickle('nlsy_person_year.pickle')
nlsy_person_year
edu | income | age | ||
---|---|---|---|---|
id | year | |||
1.0 | 1979 | 12TH GRADE | 4620.0 | 21.0 |
1980 | NaN | NaN | 22.0 | |
1981 | 12TH GRADE | 5000.0 | 23.0 | |
1982 | NaN | NaN | 24.0 | |
1983 | NaN | NaN | 25.0 | |
... | ... | ... | ... | ... |
12685.0 | 1993 | NaN | NaN | 36.0 |
1994 | NaN | NaN | 37.0 | |
1996 | NaN | NaN | 39.0 | |
1998 | NaN | NaN | 41.0 | |
2000 | NaN | NaN | 43.0 |
241015 rows × 3 columns
nlsy_combined = nlsy_person_year.reset_index().merge(
    nlsy_person.reset_index(),
    on='id',
    how='outer',
    validate='m:1',
    indicator=True
).set_index(['id', 'year'])
nlsy_combined.loc[[1, 12686]]
edu | income | age | yearOfBirth | total_income | _merge | ||
---|---|---|---|---|---|---|---|
id | year | ||||||
1.0 | 1979.0 | 12TH GRADE | 4620.0 | 21.0 | NaN | NaN | left_only |
1980.0 | NaN | NaN | 22.0 | NaN | NaN | left_only | |
1981.0 | 12TH GRADE | 5000.0 | 23.0 | NaN | NaN | left_only | |
1982.0 | NaN | NaN | 24.0 | NaN | NaN | left_only | |
1983.0 | NaN | NaN | 25.0 | NaN | NaN | left_only | |
1984.0 | NaN | NaN | 26.0 | NaN | NaN | left_only | |
1985.0 | NaN | NaN | 27.0 | NaN | NaN | left_only | |
1986.0 | NaN | NaN | 28.0 | NaN | NaN | left_only | |
1987.0 | NaN | NaN | 29.0 | NaN | NaN | left_only | |
1988.0 | NaN | NaN | 30.0 | NaN | NaN | left_only | |
1989.0 | NaN | NaN | 31.0 | NaN | NaN | left_only | |
1990.0 | NaN | NaN | 32.0 | NaN | NaN | left_only | |
1991.0 | NaN | NaN | 33.0 | NaN | NaN | left_only | |
1992.0 | NaN | NaN | 34.0 | NaN | NaN | left_only | |
1993.0 | NaN | NaN | 35.0 | NaN | NaN | left_only | |
1994.0 | NaN | NaN | 36.0 | NaN | NaN | left_only | |
1996.0 | NaN | NaN | 38.0 | NaN | NaN | left_only | |
1998.0 | NaN | NaN | 40.0 | NaN | NaN | left_only | |
2000.0 | NaN | NaN | 42.0 | NaN | NaN | left_only | |
12686.0 | NaN | NaN | NaN | NaN | 60.0 | 335560.0 | right_only |
It doesn’t matter which DataFrame you make left or right as long as your validate
matches what you choose.
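For example, a sketch of the same combination with nlsy_person as the left DataFrame; the only change is that validate flips to '1:m':
nlsy_person.reset_index().merge(
    nlsy_person_year.reset_index(),
    on='id',
    how='outer',
    validate='1:m',
    indicator=True
).set_index(['id', 'year'])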
7.5 Inflation Example
Next we’ll do an example that illustrates some of the complications that can arise when combining data sets from different sources.
One issue with the NLSY data extract we’ve been using is that incomes from different time periods are not really comparable due to inflation. To adjust them for inflation, we need information about the level of inflation in each year. The data set fred_cpi.pickle
contains the average Consumer Price Index for All Urban Consumers for every year from 1970 to 2005. It was obtained from the Federal Reserve Economic Data (FRED). Our task is to merge it with nlsy.pickle
.
Start by reading in both data sets:
nlsy = pd.read_pickle('nlsy.pickle')
cpi = pd.read_pickle('fred_cpi.pickle')
cpi
datestr | daten | CPIAUCSL | |
---|---|---|---|
0 | 1970-01-01 | 1970-01-01 | 38.841999 |
1 | 1971-01-01 | 1971-01-01 | 40.483002 |
2 | 1972-01-01 | 1972-01-01 | 41.807999 |
3 | 1973-01-01 | 1973-01-01 | 44.424999 |
4 | 1974-01-01 | 1974-01-01 | 49.317001 |
5 | 1975-01-01 | 1975-01-01 | 53.825001 |
6 | 1976-01-01 | 1976-01-01 | 56.932999 |
7 | 1977-01-01 | 1977-01-01 | 60.617001 |
8 | 1978-01-01 | 1978-01-01 | 65.241997 |
9 | 1979-01-01 | 1979-01-01 | 72.583000 |
10 | 1980-01-01 | 1980-01-01 | 82.383003 |
11 | 1981-01-01 | 1981-01-01 | 90.932999 |
12 | 1982-01-01 | 1982-01-01 | 96.532997 |
13 | 1983-01-01 | 1983-01-01 | 99.583000 |
14 | 1984-01-01 | 1984-01-01 | 103.932999 |
15 | 1985-01-01 | 1985-01-01 | 107.599998 |
16 | 1986-01-01 | 1986-01-01 | 109.692001 |
17 | 1987-01-01 | 1987-01-01 | 113.616997 |
18 | 1988-01-01 | 1988-01-01 | 118.275002 |
19 | 1989-01-01 | 1989-01-01 | 123.942001 |
20 | 1990-01-01 | 1990-01-01 | 130.658005 |
21 | 1991-01-01 | 1991-01-01 | 136.167007 |
22 | 1992-01-01 | 1992-01-01 | 140.307999 |
23 | 1993-01-01 | 1993-01-01 | 144.475006 |
24 | 1994-01-01 | 1994-01-01 | 148.225006 |
25 | 1995-01-01 | 1995-01-01 | 152.382996 |
26 | 1996-01-01 | 1996-01-01 | 156.858002 |
27 | 1997-01-01 | 1997-01-01 | 160.524994 |
28 | 1998-01-01 | 1998-01-01 | 163.007996 |
29 | 1999-01-01 | 1999-01-01 | 166.582993 |
30 | 2000-01-01 | 2000-01-01 | 172.192001 |
31 | 2001-01-01 | 2001-01-01 | 177.042007 |
32 | 2002-01-01 | 2002-01-01 | 179.867004 |
33 | 2003-01-01 | 2003-01-01 | 184.000000 |
34 | 2004-01-01 | 2004-01-01 | 188.908005 |
35 | 2005-01-01 | 2005-01-01 | 195.266998 |
Taking this data and adding nlsy
to it is a job for a one-to-many merge: one year’s CPI data will match with many people’s NLSY data for that year. Note how this treats year as the level two unit! For most purposes it’s more useful to think of people as the level two unit in the NLSY, but it’s just as logical to group person-year combinations by year instead.
7.5.1 Preparing to Merge
We have some work to do with cpi
before it’s ready to merge. The biggest issue is that we need a simple numeric year variable we can use to match with the year index in nlsy
. cpi
has two different date columns; look at the dtypes
to see the difference:
cpi.dtypes
datestr object
daten datetime64[ns]
CPIAUCSL float32
dtype: object
datestr is, unsurprisingly, a string (object), but daten (date number) is a Python datetime. A datetime column’s dt accessor has a year attribute containing just the year component of the date. We can put that in its own column with:
cpi['year'] = cpi['daten'].dt.year
cpi
datestr | daten | CPIAUCSL | year | |
---|---|---|---|---|
0 | 1970-01-01 | 1970-01-01 | 38.841999 | 1970 |
1 | 1971-01-01 | 1971-01-01 | 40.483002 | 1971 |
2 | 1972-01-01 | 1972-01-01 | 41.807999 | 1972 |
3 | 1973-01-01 | 1973-01-01 | 44.424999 | 1973 |
4 | 1974-01-01 | 1974-01-01 | 49.317001 | 1974 |
5 | 1975-01-01 | 1975-01-01 | 53.825001 | 1975 |
6 | 1976-01-01 | 1976-01-01 | 56.932999 | 1976 |
7 | 1977-01-01 | 1977-01-01 | 60.617001 | 1977 |
8 | 1978-01-01 | 1978-01-01 | 65.241997 | 1978 |
9 | 1979-01-01 | 1979-01-01 | 72.583000 | 1979 |
10 | 1980-01-01 | 1980-01-01 | 82.383003 | 1980 |
11 | 1981-01-01 | 1981-01-01 | 90.932999 | 1981 |
12 | 1982-01-01 | 1982-01-01 | 96.532997 | 1982 |
13 | 1983-01-01 | 1983-01-01 | 99.583000 | 1983 |
14 | 1984-01-01 | 1984-01-01 | 103.932999 | 1984 |
15 | 1985-01-01 | 1985-01-01 | 107.599998 | 1985 |
16 | 1986-01-01 | 1986-01-01 | 109.692001 | 1986 |
17 | 1987-01-01 | 1987-01-01 | 113.616997 | 1987 |
18 | 1988-01-01 | 1988-01-01 | 118.275002 | 1988 |
19 | 1989-01-01 | 1989-01-01 | 123.942001 | 1989 |
20 | 1990-01-01 | 1990-01-01 | 130.658005 | 1990 |
21 | 1991-01-01 | 1991-01-01 | 136.167007 | 1991 |
22 | 1992-01-01 | 1992-01-01 | 140.307999 | 1992 |
23 | 1993-01-01 | 1993-01-01 | 144.475006 | 1993 |
24 | 1994-01-01 | 1994-01-01 | 148.225006 | 1994 |
25 | 1995-01-01 | 1995-01-01 | 152.382996 | 1995 |
26 | 1996-01-01 | 1996-01-01 | 156.858002 | 1996 |
27 | 1997-01-01 | 1997-01-01 | 160.524994 | 1997 |
28 | 1998-01-01 | 1998-01-01 | 163.007996 | 1998 |
29 | 1999-01-01 | 1999-01-01 | 166.582993 | 1999 |
30 | 2000-01-01 | 2000-01-01 | 172.192001 | 2000 |
31 | 2001-01-01 | 2001-01-01 | 177.042007 | 2001 |
32 | 2002-01-01 | 2002-01-01 | 179.867004 | 2002 |
33 | 2003-01-01 | 2003-01-01 | 184.000000 | 2003 |
34 | 2004-01-01 | 2004-01-01 | 188.908005 | 2004 |
35 | 2005-01-01 | 2005-01-01 | 195.266998 | 2005 |
We no longer need datestr
and daten
. Up until now we’ve gotten rid of unneeded columns using drop()
, but since there are only two columns we do need it’s easier to just select them. Also, we don’t care about the subtleties of different inflation measures, so rename CPIAUCSL
to just cpi
. Just because you can, do both tasks in one function chain:
cpi = cpi.rename(
    columns={'CPIAUCSL' : 'cpi'}
)[['year', 'cpi']]
cpi
year | cpi | |
---|---|---|
0 | 1970 | 38.841999 |
1 | 1971 | 40.483002 |
2 | 1972 | 41.807999 |
3 | 1973 | 44.424999 |
4 | 1974 | 49.317001 |
5 | 1975 | 53.825001 |
6 | 1976 | 56.932999 |
7 | 1977 | 60.617001 |
8 | 1978 | 65.241997 |
9 | 1979 | 72.583000 |
10 | 1980 | 82.383003 |
11 | 1981 | 90.932999 |
12 | 1982 | 96.532997 |
13 | 1983 | 99.583000 |
14 | 1984 | 103.932999 |
15 | 1985 | 107.599998 |
16 | 1986 | 109.692001 |
17 | 1987 | 113.616997 |
18 | 1988 | 118.275002 |
19 | 1989 | 123.942001 |
20 | 1990 | 130.658005 |
21 | 1991 | 136.167007 |
22 | 1992 | 140.307999 |
23 | 1993 | 144.475006 |
24 | 1994 | 148.225006 |
25 | 1995 | 152.382996 |
26 | 1996 | 156.858002 |
27 | 1997 | 160.524994 |
28 | 1998 | 163.007996 |
29 | 1999 | 166.582993 |
30 | 2000 | 172.192001 |
31 | 2001 | 177.042007 |
32 | 2002 | 179.867004 |
33 | 2003 | 184.000000 |
34 | 2004 | 188.908005 |
35 | 2005 | 195.266998 |
Notice that because the column subset was specified as ['year', 'cpi']
, year
comes first in the result even though it was last in the original data.
7.5.2 Merging
Now we’re ready to merge. Because we haven’t set year
as an index for cpi
, we have to tell merge()
what to merge on. Do that by passing in on='year'
rather than left_index=True, right_index=True
. This tells it to use the year
column in both data sets.
nlsy_combined = cpi.merge(
    nlsy,
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year | cpi | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|
0 | 1970 | 38.841999 | NaN | NaN | NaN | NaN | left_only |
1 | 1971 | 40.483002 | NaN | NaN | NaN | NaN | left_only |
2 | 1972 | 41.807999 | NaN | NaN | NaN | NaN | left_only |
3 | 1973 | 44.424999 | NaN | NaN | NaN | NaN | left_only |
4 | 1974 | 49.317001 | NaN | NaN | NaN | NaN | left_only |
... | ... | ... | ... | ... | ... | ... | ... |
241046 | 2001 | 177.042007 | NaN | NaN | NaN | NaN | left_only |
241047 | 2002 | 179.867004 | NaN | NaN | NaN | NaN | left_only |
241048 | 2003 | 184.000000 | NaN | NaN | NaN | NaN | left_only |
241049 | 2004 | 188.908005 | NaN | NaN | NaN | NaN | left_only |
241050 | 2005 | 195.266998 | NaN | NaN | NaN | NaN | left_only |
241051 rows × 7 columns
This works, but all the rows we can see have ‘left_only’ for _merge
and NaN for all the columns from nlsy
. Also, id
has completely disappeared!
Let’s start with the problem of id
. Since only one of the DataFrames had a MultiIndex, merge()
discarded it and replaced it with a column for the variable you matched on, year
. But that got rid of id
entirely. To fix that, the solution again is to reset the index of nlsy
before merging. That makes id
an ordinary column which is preserved after the merge.
nlsy_combined = cpi.merge(
    nlsy.reset_index(),
    on='year',
    validate='1:m',
    how='outer',
    indicator=True
)
nlsy_combined
year | cpi | id | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|---|
0 | 1970 | 38.841999 | NaN | NaN | NaN | NaN | NaN | left_only |
1 | 1971 | 40.483002 | NaN | NaN | NaN | NaN | NaN | left_only |
2 | 1972 | 41.807999 | NaN | NaN | NaN | NaN | NaN | left_only |
3 | 1973 | 44.424999 | NaN | NaN | NaN | NaN | NaN | left_only |
4 | 1974 | 49.317001 | NaN | NaN | NaN | NaN | NaN | left_only |
... | ... | ... | ... | ... | ... | ... | ... | ... |
241046 | 2001 | 177.042007 | NaN | NaN | NaN | NaN | NaN | left_only |
241047 | 2002 | 179.867004 | NaN | NaN | NaN | NaN | NaN | left_only |
241048 | 2003 | 184.000000 | NaN | NaN | NaN | NaN | NaN | left_only |
241049 | 2004 | 188.908005 | NaN | NaN | NaN | NaN | NaN | left_only |
241050 | 2005 | 195.266998 | NaN | NaN | NaN | NaN | NaN | left_only |
241051 rows × 8 columns
Of course id
for the rows we can see is now NaN just like the other columns from nlsy
. Things are not quite as dire as they look though: take a look at rows from 1979:
nlsy_combined.loc[
    nlsy_combined['year']==1979
]
year | cpi | id | birth_year | edu | income | age | _merge | |
---|---|---|---|---|---|---|---|---|
9 | 1979 | 72.583 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | both |
10 | 1979 | 72.583 | 2.0 | 59.0 | 9TH GRADE | 4000.0 | 20.0 | both |
11 | 1979 | 72.583 | 3.0 | 61.0 | 10TH GRADE | NaN | 18.0 | both |
12 | 1979 | 72.583 | 4.0 | 62.0 | 9TH GRADE | NaN | 17.0 | both |
13 | 1979 | 72.583 | 5.0 | 59.0 | 1ST YEAR COLLEGE | 2200.0 | 20.0 | both |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12690 | 1979 | 72.583 | 12682.0 | 59.0 | 9TH GRADE | 500.0 | 20.0 | both |
12691 | 1979 | 72.583 | 12683.0 | 59.0 | 12TH GRADE | 0.0 | 20.0 | both |
12692 | 1979 | 72.583 | 12684.0 | 60.0 | 10TH GRADE | 0.0 | 19.0 | both |
12693 | 1979 | 72.583 | 12685.0 | 57.0 | 12TH GRADE | 0.0 | 22.0 | both |
12694 | 1979 | 72.583 | 12686.0 | 60.0 | 12TH GRADE | 15000.0 | 19.0 | both |
12686 rows × 8 columns
They look just fine. Examine the value_counts()
for _merge
:
nlsy_combined['_merge'].value_counts()
_merge
both 241034
left_only 17
right_only 0
Name: count, dtype: int64
Only 17 rows did not match, all from cpi
(which is why they have NaN for columns from nlsy
). Those 17 just happen to be at the beginning and end of the data set so they were all we could see. The data set is sorted by year
(a side effect of matching by it) so that gives us a clue where to look for an explanation. Try value_counts()
for year
and _merge
(use sort=False
so it doesn’t change the order of the years):
(
    nlsy_combined[['year', '_merge']].
    value_counts(sort=False)
)
year _merge
1970 left_only 1
1971 left_only 1
1972 left_only 1
1973 left_only 1
1974 left_only 1
1975 left_only 1
1976 left_only 1
1977 left_only 1
1978 left_only 1
1979 both 12686
1980 both 12686
1981 both 12686
1982 both 12686
1983 both 12686
1984 both 12686
1985 both 12686
1986 both 12686
1987 both 12686
1988 both 12686
1989 both 12686
1990 both 12686
1991 both 12686
1992 both 12686
1993 both 12686
1994 both 12686
1995 left_only 1
1996 both 12686
1997 left_only 1
1998 both 12686
1999 left_only 1
2000 both 12686
2001 left_only 1
2002 left_only 1
2003 left_only 1
2004 left_only 1
2005 left_only 1
Name: count, dtype: int64
Compare that with value_counts()
for year
in nlsy
. Since year
is the row index for nlsy
we first have to reset its index to make year
just another column so we can use value_counts()
on it:
(
    nlsy.reset_index()['year'].
    value_counts(sort=False)
)
year
1979 12686
1980 12686
1981 12686
1982 12686
1983 12686
1984 12686
1985 12686
1986 12686
1987 12686
1988 12686
1989 12686
1990 12686
1991 12686
1992 12686
1993 12686
1994 12686
1996 12686
1998 12686
2000 12686
Name: count, dtype: int64
Not every year appears in nlsy: the NLSY started in 1979; in 1994 it switched to collecting data every other year; and our extract ends in 2000. Meanwhile, cpi
starts in 1970 and has a row for every year until 2005. The rows that don’t match are just years for which we have CPI data but no data from the NLSY.
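If you want to confirm that, a sketch listing the unmatched years:
nlsy_combined.loc[nlsy_combined['_merge']=='left_only', 'year'].tolist()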
We’re interested in the people in the NLSY, not inflation for its own sake, so we don’t need the inflation data for years that don’t appear in nlsy
. That makes this a job for a right join, which will discard observations from the left DataFrame, cpi
, that don’t match anything from the right DataFrame, nlsy
. Inner join would work too now that we know there are no ‘right_only’ observations, but using it right away would risk dropping NLSY observations if we didn’t have inflation data for some NLSY years.
The final merge would be:
nlsy_combined = cpi.merge(
    nlsy.reset_index(),
    on='year',
    validate='1:m',
    how='right',
    indicator=True
)
nlsy_combined['_merge'].value_counts()
_merge
both 241034
left_only 0
right_only 0
Name: count, dtype: int64
7.5.3 Adjusting For Inflation
To adjust income
for inflation we convert all of them to dollars in a given year: let’s choose 2000, the last year in our extract. The formula for doing so is the income
in a given year times the CPI in the year 2000 divided by the CPI in that year.
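As a quick sanity check of the arithmetic, take person 1’s 1979 income of 4,620 with the 1979 CPI of about 72.583 and the 2000 CPI of about 172.192:
4620 * 172.192 / 72.583   # about 10960.24 in year-2000 dollars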
We need the CPI in the year 2000 as a plain old number. You can easily subset it out from cpi
with cpi['year']==2000
, but the result will be a Series. Python also doesn’t know that the Series will always have a single element. So to get a simple number, extract the first element of the Series with .iloc[0]
, and then pass it to the float()
function to convert it to a floating-point number (i.e. a number with a fractional component):
cpi_2000 = float(cpi.loc[cpi['year']==2000, 'cpi'].iloc[0])
cpi_2000
172.19200134277344
Now create a column for real income with:
nlsy_combined['real_income'] = (
    nlsy_combined['income'] *
    cpi_2000 /
    nlsy_combined['cpi']
)
nlsy_combined
year | cpi | id | birth_year | edu | income | age | _merge | real_income | |
---|---|---|---|---|---|---|---|---|---|
0 | 1979 | 72.583000 | 1.0 | 58.0 | 12TH GRADE | 4620.0 | 21.0 | both | 10960.239258 |
1 | 1980 | 82.383003 | 1.0 | 58.0 | NaN | NaN | 22.0 | both | NaN |
2 | 1981 | 90.932999 | 1.0 | 58.0 | 12TH GRADE | 5000.0 | 23.0 | both | 9468.070312 |
3 | 1982 | 96.532997 | 1.0 | 58.0 | NaN | NaN | 24.0 | both | NaN |
4 | 1983 | 99.583000 | 1.0 | 58.0 | NaN | NaN | 25.0 | both | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
241029 | 1993 | 144.475006 | 12686.0 | 60.0 | 12TH GRADE | 31000.0 | 33.0 | both | 36947.234375 |
241030 | 1994 | 148.225006 | 12686.0 | 60.0 | NaN | NaN | 34.0 | both | NaN |
241031 | 1996 | 156.858002 | 12686.0 | 60.0 | NaN | NaN | 36.0 | both | NaN |
241032 | 1998 | 163.007996 | 12686.0 | 60.0 | NaN | NaN | 38.0 | both | NaN |
241033 | 2000 | 172.192001 | 12686.0 | 60.0 | NaN | NaN | 40.0 | both | NaN |
241034 rows × 9 columns
To get a sense of how income
and real_income
are related, take their mean by year. You’ll see how they start out quite different, since the inflation that took place between the early years and the year 2000 was substantial, and then gradually converge.
nlsy_combined[
    ['income', 'real_income', 'year']
].groupby('year').mean()
income | real_income | |
---|---|---|
year | ||
1979 | 2571.444092 | 6100.355469 |
1980 | 3257.446533 | 6808.518555 |
1981 | 3897.210449 | 7379.812012 |
1982 | 4861.282715 | 8671.377930 |
1983 | 5390.464355 | 9320.816406 |
1984 | 6481.504883 | 10738.295898 |
1985 | 7884.684570 | 12617.840820 |
1986 | 9246.562500 | 14515.042969 |
1987 | 11242.212891 | 17038.113281 |
1988 | 12615.267578 | 18366.078125 |
1989 | 13811.738281 | 19188.578125 |
1990 | 15418.368164 | 20319.609375 |
1991 | 16527.351562 | 20899.908203 |
1992 | 17395.074219 | 21347.982422 |
1993 | 18430.376953 | 21966.175781 |
1994 | 19533.679688 | 22692.144531 |
1996 | 22815.503906 | 25045.884766 |
1998 | 25415.785156 | 26847.732422 |
2000 | 30527.400391 | 30527.400391 |
In the past we haven’t had to include the groupby()
variable in the subset passed to groupby()
; what’s different now? The difference is that year
isn’t an index in nlsy_combined
. When you subset a DataFrame it keeps the same index, and that index can be used by groupby()
. But if you want to groupby()
something other than an index it needs to be part of the DataFrame passed to groupby()
.
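For comparison, a sketch of the equivalent computation with year set as the row index first, which groupby() can then find by name:
(
    nlsy_combined.
    set_index('year')
    [['income', 'real_income']].
    groupby('year').mean()
)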
7.6 Handling Duplicates
All the DataFrames we’ve combined thus far have been well behaved, with most of them matching perfectly. Obviously that won’t always be true with real-world data. The most common problem is duplicates, so let’s spend some time discussing how to handle them.
To see duplicates in action, load acs_demographics.pickle
and acs_ses_dup.pickle
.
acs_demo = pd.read_pickle('acs_demographics.pickle')
acs_ses = pd.read_pickle('acs_ses_dup.pickle')
acs_ses
household | person | edu | income
---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 |
37 | 2 | Some college, >=1 year | 5300.0
37 | 2 | Some college, >=1 year | 4700.0
241 | 1 | Master's degree | 32500.0 |
242 | 1 | Bachelor's degree | 30000.0 |
... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 |
1236624 | 2 | High School graduate | 12000.0
1236756 | 1 | Master's degree | 69800.0 |
1236756 | 2 | Master's degree | 40800.0
1236779 | 1 | High School graduate | 22110.0 |
27410 rows × 2 columns
These should be familiar from section 1 of this chapter: `acs_demo` contains demographic information about each individual, while `acs_ses` contains information about their socio-economic status. But if you look closely at `acs_ses` you’ll see a problem: household 37 has two person 2’s. This kind of error is not uncommon in real-world data.
First off, note that if you carry out the steps described in the ‘First Steps With Your Data’ chapter, you’ll know about this problem long before you try to combine this data set with anything else. As part of deciding if `household` and `person` uniquely identify observations (and before making them a MultiIndex), you’ll look at the maximum value of their `value_counts()`:
(
    acs_ses.reset_index()
    [['household', 'person']].
    value_counts().
    max()
)
2
The fact that it’s greater than one tells you immediately that you have a problem.
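Since `household` and `person` already form the MultiIndex of `acs_ses`, an equivalent check (a sketch using the index directly, rather than the steps above) is:

# True only if every (household, person) pair appears exactly once;
# False for this data set
acs_ses.index.is_unique

# count the rows that repeat an earlier index entry; 1 here, since one
# (household, person) pair appears twice
acs_ses.index.duplicated().sum()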
But suppose you don’t carry out those steps. Will you find the problem when you try to merge `acs_ses` with `acs_demo`? Only if you’re careful. If you try to do the merge the easy way, you’ll get no indication anything is wrong:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True
)
household | person | age | hispanic | race | marital_status | female | edu | income
---|---|---|---|---|---|---|---|---
37 | 1 | 20 | False | White | Never married | True | Some college, >=1 year | 10000.0 |
37 | 2 | 19 | False | White | Never married | True | Some college, >=1 year | 5300.0
37 | 2 | 19 | False | White | Never married | True | Some college, >=1 year | 4700.0
241 | 1 | 50 | False | White | Never married | True | Master's degree | 32500.0 |
242 | 1 | 29 | False | White | Never married | True | Bachelor's degree | 30000.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False | Some college, >=1 year | 50100.0 |
1236624 | 2 | 26 | False | White | Now married | True | High School graduate | 12000.0
1236756 | 1 | 58 | False | White | Now married | True | Master's degree | 69800.0 |
1236756 | 2 | 61 | False | White | Now married | False | Master's degree | 40800.0
1236779 | 1 | 30 | False | American Indian | Divorced | False | High School graduate | 22110.0 |
27410 rows × 7 columns
You may have spent enough time staring at this data set to spot the problem, but if not, compare with:
acs_demo
household | person | age | hispanic | race | marital_status | female
---|---|---|---|---|---|---
37 | 1 | 20 | False | White | Never married | True |
37 | 2 | 19 | False | White | Never married | True
37 | 3 | 19 | False | Black | Never married | True
241 | 1 | 50 | False | White | Never married | True |
242 | 1 | 29 | False | White | Never married | True |
... | ... | ... | ... | ... | ... | ... |
1236624 | 1 | 29 | False | White | Now married | False |
1236624 | 2 | 26 | False | White | Now married | True
1236756 | 1 | 58 | False | White | Now married | True |
1236756 | 2 | 61 | False | White | Now married | False
1236779 | 1 | 30 | False | American Indian | Divorced | False |
27410 rows × 5 columns
The third person in household 37 is supposed to be Black. What happened? In `acs_ses`, the third person in household 37 has a 2 for `person`, so their row was combined with the demographic information for person 2 in `acs_demo`, even though that information had already been combined with the real person 2 in `acs_ses`. In other words, this merge became a one-to-many merge instead of a one-to-one merge, with no indication that that had happened. What’s more, because the default merge is an inner join, person 3 from `acs_demo`, having no one to match with in `acs_ses`, was dropped.
This is why you always use `validate` and start with an outer join. If you run:
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:1',
    how='outer',
    indicator=True
)
You’ll get the error message (at the bottom of a lot of less clear verbiage):
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
This is a good thing! Far better for code to crash and tell you what’s wrong than to appear to run successfully but give you bogus data.
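For reference, `validate` also accepts `'1:m'`, `'m:1'`, and `'m:m'`, so if a one-to-many relationship were actually what you intended, pandas could still verify that assumption. A hypothetical sketch (not appropriate here, since our duplicate is an error rather than a real one-to-many structure):

# '1:m' only requires the keys on the left to be unique, so this
# would run without complaint despite the duplicated pair on the right
acs_demo.merge(
    acs_ses,
    left_index=True, right_index=True,
    validate='1:m',
    how='outer',
    indicator=True
)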
So what do you do about it? First, we need to be able to identify the problem observations. If you `groupby()` both `household` and `person`, the `size()` of the resulting groups should be 1. So create a variable called `copies` containing exactly that:
acs_ses['copies'] = acs_ses.groupby(
    ['household', 'person']
).size()
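This works because `size()` returns a Series indexed by (`household`, `person`), which pandas aligns with the index of `acs_ses` during the assignment, so each row of a duplicated pair gets its group’s size. An equivalent sketch uses `transform('size')`, which returns one value per original row directly (going through `income` here is arbitrary; `'size'` counts rows no matter which column you pick):

# transform('size') gives every row the size of the group it belongs to
acs_ses['copies'] = acs_ses.groupby(
    ['household', 'person']
)['income'].transform('size')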
The next question is how big a problem you have. Find out by running `value_counts()` on `copies`:
acs_ses['copies'].value_counts()
copies
1 27408
2 2
Name: count, dtype: int64
This tells you there are only two observations with duplicate values of `household` and `person`. That suggests the problem is probably some sort of data entry error or similar mistake. If you find lots of duplicates, that raises the possibility that you have fundamentally misunderstood the structure of the data set: perhaps it’s actually a hierarchical data set of some sort and you didn’t realize it.
Given that there is a small number of problem observations, take a look at them:
acs_ses[acs_ses['copies']>1]
household | person | edu | income | copies
---|---|---|---|---
37 | 2 | Some college, >=1 year | 5300.0 | 2 |
37 | 2 | Some college, >=1 year | 4700.0 | 2
We see that although the two person 2’s have the same education level, their incomes are different. That suggests they are in fact two different people erroneously given the same person number. If they had been completely identical, that would suggest they are in fact the same person erroneously entered twice.
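If you’d rather have code make that distinction than your eyes, one sketch is to count complete duplicates, index and values alike:

# rows identical in household, person, AND every column; returns 0 here,
# since the two person 2's differ on income
(
    acs_ses.reset_index()
    .duplicated(keep=False)
    .sum()
)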
The problem is, we don’t know which of them is the person 2 in the `acs_demo` data set, so we can’t combine either of them. If matching their SES data with their demographics is essential to your analysis, and it frequently is, you might as well drop them both right now. You can drop all the duplicates with:
acs_ses[acs_ses['copies']==1]
household | person | edu | income | copies
---|---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 | 1 |
241 | 1 | Master's degree | 32500.0 | 1 |
242 | 1 | Bachelor's degree | 30000.0 | 1 |
377 | 1 | None | 51900.0 | 1 |
418 | 1 | 12th grade, no diploma | 12200.0 | 1 |
... | ... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 | 1 |
1236624 | 2 | High School graduate | 12000.0 | 1
1236756 | 1 | Master's degree | 69800.0 | 1 |
1236756 | 2 | Master's degree | 40800.0 | 1
1236779 | 1 | High School graduate | 22110.0 | 1 |
27408 rows × 3 columns
What if you had found that they were in fact the same person? Then you would want to keep one copy and drop any others. The `cumcount()` function works with `groupby()` and carries out a cumulative count, numbering each row in the group consecutively. Of course, being Python, the count starts at zero. So you just need to keep the rows where `cumcount()` is zero:
acs_ses[
    acs_ses.
    groupby(['household', 'person']).
    cumcount()==0
]
household | person | edu | income | copies
---|---|---|---|---
37 | 1 | Some college, >=1 year | 10000.0 | 1 |
37 | 2 | Some college, >=1 year | 5300.0 | 2
241 | 1 | Master's degree | 32500.0 | 1 |
242 | 1 | Bachelor's degree | 30000.0 | 1 |
377 | 1 | None | 51900.0 | 1 |
... | ... | ... | ... | ... |
1236624 | 1 | Some college, >=1 year | 50100.0 | 1 |
1236624 | 2 | High School graduate | 12000.0 | 1
1236756 | 1 | Master's degree | 69800.0 | 1 |
1236756 | 2 | Master's degree | 40800.0 | 1
1236779 | 1 | High School graduate | 22110.0 | 1 |
27409 rows × 3 columns
This allows you to keep the first person 2. Either way, when you match with `acs_demo` you’ll have some unmatched observations, which you can deal with in a way that is appropriate for your analysis.
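An equivalent idiom you may see elsewhere (a sketch, not what we used above) does the same thing through the index:

# keep='first' marks every repeat after the first occurrence of an index
# entry, so negating the mask keeps one row per (household, person) pair
acs_ses[~acs_ses.index.duplicated(keep='first')]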