3 Introduction to DataFrames

Most data wrangling work in Python revolves around DataFrames, a class introduced by the Pandas package. A DataFrame represents a classic data set organized as a table or matrix, with rows representing observations and columns representing variables. In this chapter we’ll learn the basics of working with DataFrames.

3.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to DataFrames.

We’ll start by importing the Pandas package and then reading in a simple data set from disk. acs_household.csv is one of the example files and contains a single household from the 2000 American Community Survey. It’s a comma-separated values (CSV) file, one of the most popular formats for data files. The read_csv() function in the Pandas package reads CSV files. The first argument for read_csv() is the file to be read, usually in the form of a string containing the file name. Since the file is in the same folder as your new Notebook, you don’t have to tell Python where to find it. read_csv() has many keyword arguments you can pass in to control how the file is read, and we’ll talk about how and why to use the most important ones in the next chapter. However, by default read_csv() will try to figure out the best way to read the file, and since this file is simple and has already been cleaned up, it will succeed.

import pandas as pd
acs = pd.read_csv("acs_household.csv")
acs
household person age race marital_status edu income female hispanic
0 132531 1 47 Black Now married Associate degree 20000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
5 132531 6 1 Black NaN NaN NaN 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0

The variables in this data set are:

  • household: identifies the household
  • person: identifies a person within the household
  • age: the person’s age in years
  • race: the person’s race
  • marital_status: the person’s marital status
  • edu: the person’s educational attainment
  • income: the person’s income in dollars
  • female: an indicator for whether the person is female or not, with 1 meaning yes, they are female
  • hispanic: an indicator for whether the person is Hispanic or not, with 1 meaning yes, they are Hispanic

3.1.1 Indexes

Note that the list of variables above does not include the column of numbers you see on the left in the printout of acs. That column is the row index. One of the key features of a Pandas DataFrame is that it always has indexes for both the rows and the columns. Like the index of a book, the indexes of a DataFrame tell you where to find a particular piece of information.
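If you want to peek at the two indexes directly, they’re available as attributes of the DataFrame. This is just a quick sketch and changes nothing about acs:

print(acs.index)    # the row index: RangeIndex(start=0, stop=7, step=1) by default
print(acs.columns)  # the column index: the column names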

Suppose I wanted to draw your attention to the number 19 in the acs DataFrame. One way I could do it would be to tell you to look at the number in the third row and the third column, what Python calls row 2 and column 2 since it numbers from 0. However, I could also tell you to look at the age variable of the row where person is 3. That’s because the variable names act as a column index, and in this data set person acts as a row index. While identifying rows and columns by number is sometimes useful, keep in mind that many operations will change those numbers. For example, if you dropped all the adults in the data set or sorted the data set by age, then person 3 would no longer be on row 2. Thus you’ll usually specify subsets of the data using indexes. We’ll talk more about identifying indexes and what they tell you about the structure of your data in First Steps With Your Data.

Note that to identify a person in the full ACS you need to know both their person number and their household number, because person only uniquely identifies a person within a household (i.e. every household has a person 1). Pandas handles that with a MultiIndex, an index with multiple levels. I extracted just one household from the ACS so we don’t have to deal with that yet.
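If you’re curious what that would look like, here is a quick sketch of a two-level index. With a single household it isn’t very interesting, but the same call works on the full ACS. We don’t store the result, so acs is not changed (more on that shortly):

acs.set_index(['household', 'person'])   # a two-level (Multi) index; not stored, so acs is unchanged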

When we read in this data set using read_csv(), it created a default row index for us based on the row numbers. However, once the index is created we can change the order of the observations (using sort_values()) and the same index will still be associated with the same observation.

acs = acs.sort_values('age')
acs
household person age race marital_status edu income female hispanic
5 132531 6 1 Black NaN NaN NaN 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
0 132531 1 47 Black Now married Associate degree 20000.0 0 0

This data set came with a row index: the person variable. We can tell Python to use it as the index with set_index():

acs = acs.set_index('person')
acs
household age race marital_status edu income female hispanic
person
6 132531 1 Black NaN NaN NaN 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
1 132531 47 Black Now married Associate degree 20000.0 0 0

Note that the original row numbers have disappeared, and person has been moved to the far left and down a line. That’s because it’s now the row index. Indexes are not variables: now that person is an index you can’t create a new variable that’s equal to person times 2, for example. That’s okay, because identifiers like person generally don’t contain information. However, you can still do some things with an index, like sort by it.
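If you’re curious, here is a quick sketch of that distinction (the commented-out line shows what would fail):

# person now lives in the row index, not in a column, so this would fail:
# acs['person'] * 2        # KeyError: 'person'
# but the person numbers are still available through the index:
acs.index

Sorting by the index, on the other hand, works fine: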

acs = acs.sort_values('person')
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

This put the data back in the original order.

What if you really needed to do math with person? You can convert it back from an index to a variable by resetting the index:

acs.reset_index()
person household age race marital_status edu income female hispanic
0 1 132531 47 Black Now married Associate degree 20000.0 0 0
1 2 132531 38 White Now married High School graduate 18500.0 1 0
2 3 132531 19 Black Never married 10th grade 5000.0 0 0
3 4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
4 5 132531 16 Black Never married 9th grade 0.0 0 0
5 6 132531 1 Black NaN NaN NaN 0 0
6 7 132531 18 White Never married 9th grade 7900.0 1 0

This converts person to a variable and creates a new index based on row numbers, just like read_csv() did originally.
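If you don’t want the old index turned into a column at all, reset_index() also takes a drop argument. Here’s a quick sketch; again the result is not stored, so acs is unchanged:

acs.reset_index(drop=True)   # discard person entirely and just number the rows 0 through 6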

3.1.2 Assignment vs. Changing in Place

Hopefully you noticed that most of the code cells thus far have started with acs = but the last one did not. That’s why if I ask Python to print the acs DataFrame, person is still the index:

acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

When I ran acs.reset_index(), the reset_index() function returned a new copy of acs with the index reset. But I did not store this new DataFrame as acs as I did previously. Because I did not store it, Jupyter Notebook carried out an implicit print and showed me the modified copy. But the original acs was unchanged.

You can use the inplace=True argument to tell a function to change the DataFrame it acts on. To demonstrate, sort the acs DataFrame by age using inplace=True, and just to keep things interesting also add ascending=False so it’s sorted in descending order. Since this function call is getting long, put each argument on its own line:

acs.sort_values(
    'age', 
    ascending=False, 
    inplace=True
)
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

You might think that using inplace=True would save memory because the function can change the existing DataFrame rather than making a new copy. In reality, most of the time the function creates a new copy and then puts it in the place of the old one anyway. So we’ll rarely use inplace=True. But we will frequently use DataFrame functions without changing the original DataFrame.
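For reference, here is the same sort written with the assignment pattern we’ve been using instead of inplace=True; it produces an identical result:

acs = acs.sort_values(
    'age',
    ascending=False
)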

3.1.3 Copies vs. Views

Consider the following code cell:

acs2 = acs.copy(deep=True)
acs3 = acs2.copy(deep=False)
acs3
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

The copy() function does what you’d expect: it makes a copy of the DataFrame. But what’s the difference between a deep copy and a shallow copy?

To see the difference, tell Python to make everyone in the acs3 DataFrame Hispanic. (We’ll talk more about changing data shortly.)

acs3['hispanic'] = 1
acs3
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

Look what this did to acs2:

acs2
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

But it did not change the original acs:

acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

So what’s going on here? When we defined acs2 with acs2 = acs.copy(deep=True) we told Python to create a completely new DataFrame with its own copy of the data in acs. However, when we defined acs3 with acs3 = acs2.copy(deep=False) we told Python to create a DataFrame that shares its underlying data with acs2 rather than getting its own copy. Thus changing acs3 also changed acs2, but not the original acs, because acs really does have its own data. Shallow copies are often called ‘views’, as they are just another way of viewing the same data.
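If you want to check for yourself whether two DataFrames share data, one way is NumPy’s shares_memory() function. This is just a sketch, and the exact results can depend on your pandas version and its copy-on-write settings:

import numpy as np
# acs3 is a shallow copy of acs2, so (in the pandas used here) their columns share memory
np.shares_memory(acs2['age'].values, acs3['age'].values)    # True
# acs2 is a deep copy of acs, so it has its own data:
# np.shares_memory(acs['age'].values, acs2['age'].values)   # False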

Now try a different change:

acs3.reset_index(inplace=True)
acs3
person household age race marital_status edu income female hispanic
0 1 132531 47 Black Now married Associate degree 20000.0 0 1
1 2 132531 38 White Now married High School graduate 18500.0 1 1
2 3 132531 19 Black Never married 10th grade 5000.0 0 1
3 4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
4 7 132531 18 White Never married 9th grade 7900.0 1 1
5 5 132531 16 Black Never married 9th grade 0.0 0 1
6 6 132531 1 Black NaN NaN NaN 0 1
acs2
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

Why didn’t the change to acs3 carry over to acs2 this time? Because the reset_index() function, despite inplace=True, created a new DataFrame with the index reset and then called it acs3. This broke the connection between acs2 and acs3.

Views can be useful because they save memory, but they’re also confusing. It’s hard to know what Python is doing behind the scenes. We’ll run into this again in the future, but the lesson for now is that if you want to create a copy of a DataFrame, you almost always want deep=True.

3.2 Specifying Subsets of DataFrames

A critical Pandas skill is specifying a subset of a DataFrame. For better or worse, you have lots of options for doing so.

3.2.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to Subsets. Then import pandas, read in the single household from the ACS, and set its index to person:

import pandas as pd
acs = pd.read_csv("acs_household.csv")
acs = acs.set_index('person')
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

3.2.2 Subsetting with Square Brackets

Recall that with a list, you could specify a subset of the list by putting square brackets after the list name:

my_list = [1, 2, 3]
my_list[0]
1

Similarly, you can use square brackets to select a column from a DataFrame by putting the column name in the brackets:

acs['age']
person
1    47
2    38
3    19
4    18
5    16
6     1
7    18
Name: age, dtype: int64

Why acs['age'] rather than acs[age]? Because age without quotes would refer to a variable called age. You can use variables in subsets, and sometimes it’s very useful:

column_to_select = 'age'
acs[column_to_select]
person
1    47
2    38
3    19
4    18
5    16
6     1
7    18
Name: age, dtype: int64

But most of the time you’ll refer to a column name directly by putting it in quotes. Get used to typing lots of quotes in Python.

To select multiple columns, put them in a list:

acs[['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
5 16 9th grade
6 1 NaN
7 18 9th grade

Note that there’s no special meaning to double square brackets ([[ and ]]). The outer [] is for identifying a subset, and the inner [] is for defining a list.

Why does this look so much prettier than when we only selected age? Remember that each column in a DataFrame is a Series. If you select one column from a DataFrame, the result is a Series. When you select two or more columns, the result is a DataFrame. Jupyter Notebook puts extra effort into printing DataFrames so they look nice, but it doesn’t bother with a Series. Note that both the DataFrame and the Series get an index that is copied from the original DataFrame.
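You can confirm the difference by checking the types. A quick sketch:

print(type(acs['age']))            # <class 'pandas.core.series.Series'>
print(type(acs[['age', 'edu']]))   # <class 'pandas.core.frame.DataFrame'>

One consequence is that acs[['age']], with a one-item list, gives you a one-column DataFrame rather than a Series.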

3.2.2.1 Exercise

Select and print the race, female, and hispanic columns from the acs DataFrame. (Yes, implicit print is fine.)

3.2.3 Subsetting with iloc

The iloc function allows you to select subsets by row and column number. It’s not quite a conventional function in that the arguments go in square brackets rather than parentheses.

If you give iloc a single number, it will select the corresponding row:

acs.iloc[0]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

The result is a Series again, but since it’s going across a row the index of the Series is the column names from the original DataFrame.

If you give iloc two numbers, it will select the value of the corresponding row and column:

acs.iloc[0, 0]
132531

This gave us the household number of the first person, because that is the value in the first row and first column (remember Python counts from zero and does not count the row index).

You can also use slices with iloc:

acs.iloc[1:3, 0:4]
household age race marital_status
person
2 132531 38 White Now married
3 132531 19 Black Never married

Slices do not include their endpoint, so this gave us rows one and two (i.e. skipping the first row, row zero) and columns zero through three.

3.2.4 Exercise

Select and print the race of the first person in the acs DataFrame using iloc.

3.2.5 Subsetting with loc

The loc function allows you to subset using the DataFrame’s indexes. If you give it a value of the row index it will select that row:

acs.loc[1]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

Compare that with selecting row 1 (i.e. the second row) with iloc:

acs.iloc[1]
household                       132531
age                                 38
race                             White
marital_status             Now married
edu               High School graduate
income                         18500.0
female                               1
hispanic                             0
Name: 2, dtype: object

If you give it a value of the row index and a value of the column index, i.e. a column name, it will select that value:

acs.loc[1, 'household']
132531

You can select multiple rows and/or columns with lists:

acs.loc[[1, 2, 3], ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade

If the row indexes are integers, as they often are, you can also use a slice. Just to make things confusing, with loc the endpoint of the slice is included:

acs.loc[1:3, 'age']
person
1    47
2    38
3    19
Name: age, dtype: int64

To select all rows or all columns, use a colon (:) by itself:

acs.loc[:, ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
5 16 9th grade
6 1 NaN
7 18 9th grade
acs.loc[[1, 2], :]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0

But remember that passing in just a row is equivalent to selecting all columns:

acs.loc[[1, 2]]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0

3.2.5.1 Exercise

Select and print the income and marital_status of persons 3-5 in the acs DataFrame using loc.

3.2.6 Subsetting by Conditions

Both plain square brackets and loc can subset the rows that meet a condition. For example, to select the children in the household, use:

acs[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Let’s take that apart a bit. acs['age']<18 gives us a Series:

acs['age']<18
person
1    False
2    False
3    False
4    False
5     True
6     True
7    False
Name: age, dtype: bool

You could store and pass in that Series:

is_child = acs['age']<18
acs[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

loc works the same way:

acs.loc[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
acs.loc[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

The logical operators you can use to create conditions are listed below, with a couple of quick examples after the list:

  • == (equals)
  • != (not equals)
  • > (greater than)
  • < (less than)
  • >= (greater than or equal to)
  • <= (less than or equal to)
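To make those concrete, here is a quick sketch using two of them; run the lines one at a time so each result prints:

acs.loc[acs['age'] != 18]          # everyone who is not exactly 18 years old
acs.loc[acs['income'] >= 18500]    # the two highest incomes in the household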

~ means ‘not’, and can be used to negate the following condition. Thus:

acs.loc[~is_child]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Comparison operators have precedence rules just like arithmetic operators, and sometimes the result is confusing. Take:

acs.loc[~acs['age']<18]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Why isn’t ~acs['age']<18 equivalent to ~is_child? (Not to mention that the result of ~acs['age']<18 seems to make no sense at all.) Because ~ is evaluated before <, and when ~ is applied to a number rather than a condition it acts as a bitwise ‘not’, which in effect makes the number negative (47 becomes -48, for example).

~acs['age']
person
1   -48
2   -39
3   -20
4   -19
5   -17
6    -2
7   -19
Name: age, dtype: int64

Once age is negative, everyone’s age is less than 18.

The solution is to use parentheses generously to make sure everything is evaluated in the order you want.

acs.loc[~(acs['age']<18)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

To combine conditions, use logical ‘and’, written &, or logical ‘or’, written |. With logical and, the result is True if and only if both conditions are True; in terms of set theory, it is the intersection. With logical or, the result is True if either condition is True (or both); in terms of set theory, it is the union.

For example, to identify the people who are adults and have an income greater than $5,000, use:

acs.loc[(acs['age']>=18) & (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Yes, the parentheses are essential.

To identify the people who are either adults or have an income greater than $5,000, use:

acs.loc[(acs['age']>=18) | (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

3.2.6.1 Exercise

Select and print edu for the Black people in the acs DataFrame. (Hint: race is a string and values of a string variable, like ‘Black’, go in quotes.)

3.2.7 Which Subsetting Method Should You Use?

Plain square brackets, loc, and iloc all do basically the same thing, so which should you use?

Subsets specified by iloc are always precise and unambiguous, but they make the code hard to read because you have to know exactly what the data set looks like to know what they mean. We’ll rarely use iloc.

Plain square brackets are simple and convenient, but they can’t do as much. Also, because they’re used with so many different kinds of objects, the underlying code isn’t optimized for DataFrames. Avoid them whenever speed is a major consideration.

loc is almost as easy to use as plain square brackets, is much more capable, and is optimized for performance. Plus, the resulting code is easier to understand than with iloc.

In practice we’ll use plain square brackets for simple subsets, especially selecting a single column, and loc for anything more complicated.

3.3 Changing DataFrames

You can change the data in a DataFrame by assigning new values to it. You’ll use the subsetting tools from the previous section to select which parts of the DataFrame will be changed. To create a new column, select it as a subset of the DataFrame as if it already existed and then assign a value to it.

3.3.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to Changing DataFrames. Then import pandas as usual, read in the single household from the ACS, and set its index to person:

import pandas as pd
acs = pd.read_csv("acs_household.csv")
acs = acs.set_index('person')
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

3.3.2 Numeric Variables

The incomes in this data set date from the year 2000, so to correct for inflation and convert them to January 2021 dollars we need to multiply them by 1.55:

acs['income_2021'] = acs['income']*1.55
acs
household age race marital_status edu income female hispanic income_2021
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0
6 132531 1 Black NaN NaN NaN 0 0 NaN
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0

Note how the income of person 6 is NaN, or ‘Not a Number’. This is how Python stores missing values. Attempting to multiply NaN by 1.55 results in NaN, so income_2021 is automatically missing as well. This is what we want and expect.
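You can see the same propagation in a quick sketch with plain Python:

float('nan') * 1.55    # nan: arithmetic involving NaN always produces NaN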

3.3.2.1 Exercise

Go to the BLS inflation calculator and see what you need to multiply incomes from 2000 by to get dollars in the most recent month that has data (assume for simplicity that the incomes are from January 2000). Create a column in the acs DataFrame to store income in current dollars.

3.3.3 Indicator (Boolean) Variables

Now create an indicator variable for ‘this person is Black’. Since you’re setting the new variable equal to a condition, it will get True if the condition is true and False if it is false:

acs['black'] = (acs['race']=='Black')
acs
household age race marital_status edu income female hispanic income_2021 black
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0 True
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0 False
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True
6 132531 1 Black NaN NaN NaN 0 0 NaN True
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False

In this case the parentheses are purely for readability.

By setting the variable black equal to a condition, we get a column that is either True or False.

3.3.3.1 Exercise

Create an indicator variable for ‘this person is an adult.’

3.3.4 Missing Values (NaN)

Now create an indicator called has_income for whether the person has non-zero income.

acs['has_income'] = (acs['income'] > 0)
acs
household age race marital_status edu income female hispanic income_2021 black has_income
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0 True True
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0 False True
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
6 132531 1 Black NaN NaN NaN 0 0 NaN True False
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True

Note how person 6 got False for has_income. Given that person 6 is only one year old that’s almost certainly accurate, but Python didn’t put False in has_income because it knew that. Almost any comparison involving NaN evaluates to False (the one exception is !=, which evaluates to True).

NaN is formally defined in the NumPy package. The Pandas package can store NaN values but if we want to talk about them directly we need to import NumPy, which is normally imported as np. Do that, and then see what you get when you compare income to NaN:

import numpy as np
acs['income']==np.NaN
person
1    False
2    False
3    False
4    False
5    False
6    False
7    False
Name: income, dtype: bool

The result is still False for person 6. Even comparing np.NaN to itself gives False:

np.NaN==np.NaN
False

To detect missing values, use the isna() function:

acs['income'].isna()
person
1    False
2    False
3    False
4    False
5    False
6     True
7    False
Name: income, dtype: bool

To fix the has_income column, we need to change it to NaN for the rows where income is NaN. This requires a subset that specifies both rows and columns, so we’ll need loc.

acs.loc[acs['income'].isna(), 'has_income'] = np.NaN
acs
household age race marital_status edu income female hispanic income_2021 black has_income
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0 True True
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0 False True
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
6 132531 1 Black NaN NaN NaN 0 0 NaN True NaN
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True

But in general you don’t want to fix variables: you want to go back and fix the code that created them incorrectly in the first place. So let’s drop has_income from the acs DataFrame using the drop() function so we can try again:

acs = acs.drop('has_income', axis=1)
acs
household age race marital_status edu income female hispanic income_2021 black
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0 True
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0 False
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True
6 132531 1 Black NaN NaN NaN 0 0 NaN True
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False

The drop() function can drop either rows or columns from the DataFrame. Rows are axis 0 and columns are axis 1, so to drop a column we pass in the column name and axis=1. We could drop multiple columns by passing in a list of column names.
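For example, dropping both of the columns we just created would look something like this sketch (we don’t store the result, so acs keeps them):

acs.drop(['income_2021', 'black'], axis=1)    # drop two columns at once
# acs.drop(columns=['income_2021', 'black'])  # equivalent, using the columns keyword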

Now let’s create has_income properly. We still need to set it equal to a condition, but we know that condition will give the wrong answer for rows where income is missing. So we’ll exclude those rows from the subset to be created. Those rows will still get something in has_income, but since no value is set for them they’ll get NaN, which is what we want.

acs.loc[
    ~acs['income'].isna(),
    'has_income'
] = acs['income']>0
acs
household age race marital_status edu income female hispanic income_2021 black has_income
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 31000.0 True True
2 132531 38 White Now married High School graduate 18500.0 1 0 28675.0 False True
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
6 132531 1 Black NaN NaN NaN 0 0 NaN True NaN
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True

Why did we have to drop the old has_income before creating it again? Wouldn’t the code we just ran replace the old values? No, not all of them. That code excluded row 6, where income is NaN, from the subset it acted on. When creating a new column, an excluded row gets NaN. But if the column already exists, an excluded row is left unchanged. Thus the incorrect False we had before would have been left alone.

If this had been real code rather than a teaching exercise, as soon as you noticed has_income was incorrect you’d fix the code that created it. But simply running the fixed code would not fix has_income. Instead, you’d need to run all your code from the beginning so it reloads acs from the file on disk, which does not contain the bad version of has_income, and then the code can create a new has_income column. This illustrates both the value of running all of your code from the beginning and why you never change your original data. Never write your output over your input!

3.3.4.1 Exercise

Create an indicator variable for “this person is married.” For this exercise, do not assume that the one-year-old cannot possibly be married.

3.3.5 Copies vs. Views (Again)

Make a DataFrame containing just the teenagers from our example household:

teens = acs[(acs['age']>12) & (acs['age']<20)]
teens
household age race marital_status edu income female hispanic income_2021 black has_income
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True

Now try creating a new variable in that DataFrame.

teens['x'] = 1
teens
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
household age race marital_status edu income female hispanic income_2021 black has_income x
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True 1
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False 1
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True 1

So it worked, but what’s with the warning message? The suggestion to switch to loc is something of a red herring (try it–it doesn’t help). The real problem is copies and views again.

When you create a new DataFrame that is a subset of an existing DataFrame, Python will decide whether the new DataFrame should be a copy or a view based on what it thinks will give better performance, and that depends on things that you don’t know and cannot easily find out, like the structure of the DataFrame in memory. The bottom line is that you don’t know if the new DataFrame is a copy or a view and that’s fine–until you start making changes to it. Then you don’t know if the changes will carry over to the original DataFrame or not and that’s not fine. That’s when you get this warning message.

The solution is to explicitly make the new DataFrame a copy using copy(deep=True):

teens = acs[
    (acs['age']>12) & (acs['age']<20)
].copy(deep=True)

teens.loc[:,'x'] = 1
teens
household age race marital_status edu income female hispanic income_2021 black has_income x
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True 1
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False 1
7 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True 1

Now you can make changes to teens without getting that warning.

3.3.6 Function Chaining

The above is an example of function chaining. The first step in the chain is to take a subset of acs with acs[(acs['age']>12) & (acs['age']<20)]. The key insight is that the result of the first step is a DataFrame and so it has a copy() function, which we call as the second step in the chain. Chains can be as long as you like, though they’re good candidates for splitting across lines. For example, recreate the teens DataFrame and reset the index so it goes from 0 to 3 rather than using the person number from the (no longer relevant) full household:

teens = (
    acs[(acs['age']>12) & (acs['age']<20)].
    copy(deep=True).
    reset_index().
    drop('person', axis=1)
)
teens
household age race marital_status edu income female hispanic income_2021 black has_income
0 132531 19 Black Never married 10th grade 5000.0 0 0 7750.0 True True
1 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3410.0 False True
2 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
3 132531 18 White Never married 9th grade 7900.0 1 0 12245.0 False True

3.3.6.1 Exercise

Create a DataFrame containing just the people we know had income (note framing it as ‘people we know had income’ tells you what to do with missing values), re-index it, drop the old index, and sort it in descending order of income. Carry out all these tasks in a single function chain.