2  DataFrames

Most data wrangling work in Python revolves around DataFrames, a class introduced by the Pandas package. A DataFrame represents a classic data set organized as a table or matrix, with rows representing observations and columns representing variables. In this chapter we’ll learn the basics of working with DataFrames.

2.1 DataFrame Fundamentals

We’ll begin by examining some of the fundamental properties of DataFrames.

2.1.1 Setting Up

Start up Jupyter Lab if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click File, New, and choose the Python 3 kernel to create a new Notebook. It will start out as Untitled.ipynb, so right-click on its tab and choose Rename Notebook to change its name to DataFrames_Practice.ipynb.

We’ll start by importing the Pandas package and then reading in a simple data set from disk. acs_household.csv is one of the example files and contains a single household from the 2000 American Community Survey. It’s a comma-separated values (CSV) file, one of the most popular types of data files. The read_csv() function in the Pandas package reads CSV files. The first argument for read_csv() is the file to be read, usually in the form of a string containing the file name. Since the file is in the same folder as your new Notebook, you don’t have to tell Python where to find it. read_csv() has many keyword arguments you can pass in to control how the file is read, and we’ll talk about how and why to use the most important ones in the next chapter. However, by default read_csv() will try to figure out the best way to read the file, and since this file is simple and has already been cleaned up it will be successful.

import pandas as pd
acs = pd.read_csv('acs_household.csv')
acs
household person age race marital_status edu income female hispanic
0 132531 1 47 Black Now married Associate degree 20000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
5 132531 6 1 Black NaN NaN NaN 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0
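
As an aside, keyword arguments for read_csv() are passed just like any other keyword arguments. For instance, nrows limits how many rows are read, which can be handy for taking a quick look at a large file (a throwaway sketch; the name acs_preview is just for illustration, and in this book we’ll rely on the defaults):

acs_preview = pd.read_csv('acs_household.csv', nrows=3)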

The variables in this data set are:

  • household: identifies the household
  • person: identifies a person within the household
  • age: the person’s age in years
  • race: the person’s race
  • marital_status: the person’s marital status
  • edu: the person’s educational attainment
  • income: the person’s income
  • female: an indicator for whether the person is female or not, with 1 meaning yes, they are female
  • hispanic: an indicator for whether the person is Hispanic or not, with 1 meaning yes, they are Hispanic

2.1.2 Indexes

Note that I did not include in the list of variables the column of numbers you see on the left in the printout of acs. This column is the row index. One of the key features of a Pandas DataFrame is that it always has indexes for both the rows and columns. Like the index of a book, the indexes of a DataFrame tell you where to find a particular piece of information.

Suppose I wanted to draw your attention to the number 19 in the acs DataFrame. One way I could do it would be to tell you to look at the number in the third row and the third column, what Python calls row 2 and column 2 since it numbers from 0. However, I could also tell you to look at the age variable of the row where person is 3. That’s because the variable names act as a column index, and in this data set person acts as a row index. While identifying rows and columns by number is sometimes useful, keep in mind that many operations will change those numbers. For example, if you dropped all the adults in the data set or sorted the data set by age, then person 3 would no longer be on row 2. Thus you’ll usually specify subsets of the data using indexes. We’ll talk more about identifying indexes and what they tell you about the structure of your data in First Steps With Your Data.

Note that to identify a person in the full ACS you need to know both their person number and their household number, because person only uniquely identifies a person within a household (i.e. every household has a person 1). This is called MultiIndexing, and I extracted just one household from the ACS so we don’t have to deal with it yet.
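
If you’re curious, setting a MultiIndex looks just like setting a single index with the set_index() function we’ll meet in a moment, except that you pass it a list of columns (a quick sketch only; we won’t use it in this chapter):

acs.set_index(['household', 'person'])

This would move both household and person out of the columns and into a two-level row index.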

When we read in this data set using read_csv(), it created a default row index for us based on the row numbers. However, once the index is created we can change the order of the observations (using sort_values()) and the same index will still be associated with the same observation.

acs = acs.sort_values('age')
acs
household person age race marital_status edu income female hispanic
5 132531 6 1 Black NaN NaN NaN 0 0
4 132531 5 16 Black Never married 9th grade 0.0 0 0
3 132531 4 18 Two or more races Never married 11th grade 2200.0 0 0
6 132531 7 18 White Never married 9th grade 7900.0 1 0
2 132531 3 19 Black Never married 10th grade 5000.0 0 0
1 132531 2 38 White Now married High School graduate 18500.0 1 0
0 132531 1 47 Black Now married Associate degree 20000.0 0 0

This data set came with a row index: the person variable. We can tell Python to use it as the index with set_index():

acs = acs.set_index('person')
acs
household age race marital_status edu income female hispanic
person
6 132531 1 Black NaN NaN NaN 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
1 132531 47 Black Now married Associate degree 20000.0 0 0

Note that the original row numbers have disappeared, and person has been moved to the far left and its name moved down. That’s because it’s now the row index. Indexes are not variables. Now that person is an index you can’t create a new variable that’s equal to person times 2, for example. That’s okay: identifiers like person generally don’t contain information. However, you can still do some things with an index, like sort by it.

acs = acs.sort_values('person')
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

This put the data back in the original order.
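
Since person is now the row index, sorting by the index with sort_index() is an equivalent way to do this:

acs = acs.sort_index()

The result is the same: the rows are ordered by person.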

What if you really needed to do math with person? You can convert it from an index back to a variable by resetting the index:

acs.reset_index()
person household age race marital_status edu income female hispanic
0 1 132531 47 Black Now married Associate degree 20000.0 0 0
1 2 132531 38 White Now married High School graduate 18500.0 1 0
2 3 132531 19 Black Never married 10th grade 5000.0 0 0
3 4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
4 5 132531 16 Black Never married 9th grade 0.0 0 0
5 6 132531 1 Black NaN NaN NaN 0 0
6 7 132531 18 White Never married 9th grade 7900.0 1 0

This converts person to a variable and creates a new index based on row numbers, just like read_csv() did originally.
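
For example, after resetting the index you could do arithmetic with person (a throwaway sketch; note that we’re not storing the result):

acs.reset_index()['person'] * 2
0     2
1     4
2     6
3     8
4    10
5    12
6    14
Name: person, dtype: int64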

2.1.3 Assignment vs. Changing in Place

Hopefully you noticed that most of the code cells thus far have started with acs = but the last one did not. That’s why if I ask Python to print the acs DataFrame, person is still the index:

acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

When I ran acs.reset_index(), the reset_index() function returned a new copy of acs with the index reset. But I did not store this new DataFrame as acs like I did previously. Because I did not store it, Jupyter Lab carried out an implicit print and showed me the modified copy. But the original acs was unchanged.

You can use the inplace=True argument to tell a function to change the DataFrame it acts on. To demonstrate, sort the acs DataFrame by age using inplace=True, and just to keep things interesting also add ascending=False so it’s sorted in descending order. Since this function call is getting long, put each argument on its own line:

acs.sort_values(
    'age', 
    ascending=False, 
    inplace=True
)
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

You might think that using inplace=True would save memory because the function can change the existing DataFrame rather than making a new copy. In reality, most of the time the function creates a new copy and then puts it in the place of the old one anyway. So we’ll rarely use inplace=True. But we will frequently use DataFrame functions without changing the original DataFrame.
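
When we do want to keep a change, we’ll write it as an assignment instead, just like the earlier cells:

acs = acs.sort_values('age', ascending=False)

This does the same thing as the inplace=True call above, so running it here changes nothing.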

2.1.4 Copies vs. Views

If you create a new DataFrame by setting it equal to another DataFrame, the result may not be what you expect.

acs2 = acs
acs2['hispanic'] = 1
acs
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

The line acs2['hispanic'] = 1 told Python to make everyone in the acs2 DataFrame Hispanic, but it made everyone in the original acs DataFrame Hispanic as well. Why?

The reason is that the line acs2 = acs did not actually create a new DataFrame. It just gave the DataFrame acs a second name, acs2. Both names point to the same location in the computer’s memory. Thus if you make a change while referring to the DataFrame by one name it automatically affects the DataFrame when you refer to it by the other name as well.
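
You can confirm that the two names refer to the same object with Python’s is operator:

acs2 is acs
True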

Usually this is not what you want! If you want to make a new DataFrame that is a copy of an existing DataFrame but separate from it, use the copy() function. You will almost always want to pass in the argument deep=True. This tells copy() to make a deep copy, meaning that it should copy both the structure and values of the original DataFrame.

To demonstrate, make acs3 a deep copy of the original acs DataFrame, change everyone in acs back to being not Hispanic, and observe that that did not change acs3:

acs3 = acs.copy(deep=True)
acs['hispanic'] = 0
acs3
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 1
2 132531 38 White Now married High School graduate 18500.0 1 1
3 132531 19 Black Never married 10th grade 5000.0 0 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 1
7 132531 18 White Never married 9th grade 7900.0 1 1
5 132531 16 Black Never married 9th grade 0.0 0 1
6 132531 1 Black NaN NaN NaN 0 1

That’s because acs and acs3 are now separate DataFrames. The change did affect acs2, because it’s still just another name for acs:

acs2
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

DataFrames like acs2 are sometimes called views: they’re another way of viewing the same data. Views can sometimes be useful because they save memory, but they’re also confusing. We mention them here mostly as something to avoid.

2.2 Specifying Subsets of DataFrames

A critical Pandas skill is specifying a subset of a DataFrame. For better or worse, you have lots of options for doing so.

2.2.1 Subsetting with Square Brackets

Recall that you could select items from lists, tuples, and dictionaries using square brackets. You can use them with DataFrames as well.

If you put the name of a column in the square brackets it will select that column:

acs['age']
person
1    47
2    38
3    19
4    18
7    18
5    16
6     1
Name: age, dtype: int64

Why acs['age'] rather than acs[age]? Because age without quotes would refer to a variable called age. You can use variables in subsets, and sometimes it’s very useful:

column_to_select = 'age'
acs[column_to_select]
person
1    47
2    38
3    19
4    18
7    18
5    16
6     1
Name: age, dtype: int64

But most of the time you’ll refer to a column name directly by putting it in quotes. Get used to typing lots of quotes in Python.

To select multiple columns, put them in a list:

acs[['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
7 18 9th grade
5 16 9th grade
6 1 NaN

Note that there’s no special meaning to double square brackets ([[ and ]]). The outer [] is for identifying a subset, and the inner [] is for defining a list.

Why does this look so much prettier than when we only selected age? Remember that each column in a DataFrame is a Series. If you select one column from a DataFrame, the result is a Series. When you select two or more columns, the result is a DataFrame. JupyterLab puts extra effort into printing DataFrames so they look nice, but it doesn’t bother with a Series. Note that both the DataFrame and the Series get an index that is copied from the original DataFrame. This allows Pandas to align the result with the original DataFrame if you use them together.
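
You can verify this by checking the types:

print(type(acs['age']))
print(type(acs[['age', 'edu']]))
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>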

Exercise

Select and print the race, female, and hispanic columns from the acs DataFrame. (Yes, implicit print is fine.)

2.2.2 Subsetting with iloc

The iloc function allows you to select subsets by row and column number. It’s not quite a conventional function in that the arguments go in square brackets rather than parentheses.

If you give iloc a single number, it will select the corresponding row:

acs.iloc[0]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

The result is a Series again, but since it’s going across a row the index of the Series is the column names from the original DataFrame.

If you give iloc two numbers, it will select the value of the corresponding row and column:

acs.iloc[0, 0]
132531

This gave us the household number of the first person, because that is the value in the first row and first column (remember Python counts from zero and does not count the row index).

You can also use slices with iloc:

acs.iloc[1:3, 0:4]
household age race marital_status
person
2 132531 38 White Now married
3 132531 19 Black Never married

Slices do not include their endpoint, so this gave us rows one and two (i.e. skipping the first row, row zero) and columns zero through three.
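
Like Python lists, iloc also accepts negative numbers, which count from the end. For example, acs.iloc[-1] selects the last row, which is currently person 6 because the data is sorted in descending order of age:

acs.iloc[-1]
household         132531
age                    1
race               Black
marital_status       NaN
edu                  NaN
income               NaN
female                 0
hispanic               0
Name: 6, dtype: object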

Exercise

Select and print the race of the first person in the acs DataFrame using iloc.

2.2.3 Subsetting with loc

The loc function allows you to subset using the DataFrame’s indexes. If you give it a value of the row index it will select that row:

acs.loc[1]
household                   132531
age                             47
race                         Black
marital_status         Now married
edu               Associate degree
income                     20000.0
female                           0
hispanic                         0
Name: 1, dtype: object

Compare that with selecting row 1 (i.e. the second row) with iloc:

acs.iloc[1]
household                       132531
age                                 38
race                             White
marital_status             Now married
edu               High School graduate
income                         18500.0
female                               1
hispanic                             0
Name: 2, dtype: object

If you give it a value of the row index and a value of the column index, i.e. a column name, it will select that value:

acs.loc[1, 'household']
132531

You can select multiple rows and/or columns with lists:

acs.loc[[1, 2, 3], ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade

If the row indexes are integers, as they often are, you can also use a slice. Just to make things confusing, with loc the endpoint of the slice is included:

acs.loc[1:3, 'age']
person
1    47
2    38
3    19
Name: age, dtype: int64

To select all rows or all columns, use a colon (:) by itself:

acs.loc[:, ['age', 'edu']]
age edu
person
1 47 Associate degree
2 38 High School graduate
3 19 10th grade
4 18 11th grade
7 18 9th grade
5 16 9th grade
6 1 NaN
acs.loc[[1, 2], :]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0

But remember that passing in just a row is equivalent to selecting all columns:

acs.loc[[1, 2]]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
Exercise

Select and print the income and marital_status of persons 3-5 in the acs DataFrame using loc.

2.2.4 Subsetting by Conditions

Both plain square brackets and loc can select the rows that meet a condition. For example, to select the children in the household, use:

acs[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Let’s take that apart a bit. acs['age']<18 gives us a Series:

acs['age']<18
person
1    False
2    False
3    False
4    False
7    False
5     True
6     True
Name: age, dtype: bool

You could store and pass in that Series:

is_child = acs['age']<18
acs[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Pandas used the indexes to ensure the right rows of is_child were aligned with the right rows of acs.

loc works the same way:

acs.loc[acs['age']<18]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0
acs.loc[is_child]
household age race marital_status edu income female hispanic
person
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

However, with loc you can still add a second argument to select columns:

acs.loc[acs['age']<18, ['age', 'edu']]
age edu
person
5 16 9th grade
6 1 NaN

2.2.4.1 Comparison Operators

The comparison operators you can use to create conditions are:

  • == (equals)
  • != (not equals)
  • > (greater than)
  • < (less than)
  • >= (greater than or equal to)
  • <= (less than or equal to)
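
For example, == selects the rows that exactly match a value:

acs[acs['race']=='White']
household age race marital_status edu income female hispanic
person
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0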

~ means ‘not’, and can be used to negate the following condition. Thus:

acs[~is_child]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Comparison operators have rules for which operators are evaluated first just like arithmetic operators, and sometimes they may not be what you expect. Take:

acs[~acs['age']<18]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
5 132531 16 Black Never married 9th grade 0.0 0 0
6 132531 1 Black NaN NaN NaN 0 0

Why isn’t ~acs['age']<18 equivalent to ~is_child? Because ~ is evaluated before <.

~is_child

is equivalent to:

~(acs['age']<18)

but:

~acs['age']<18

is equivalent to:

(~acs['age'])<18

Applying the ~ operator to a number rather than a condition gives results that are unintuitive to say the least:

~acs['age']
person
1   -48
2   -39
3   -20
4   -19
7   -19
5   -17
6    -2
Name: age, dtype: int64

Once age is negative, it’s always less than 18.

The solution is to use parentheses generously to make sure everything is evaluated in the order you want.

acs[~(acs['age']<18)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
Exercise

Select and print edu for the Black people in the acs DataFrame. (Hint: race is a string and values of a string variable, like ‘Black’, go in quotes.)

2.2.4.2 Combining Conditions

To combine conditions, use logical and (&) or logical or (|). With logical and, the result is True if and only if both conditions are true; in terms of set theory, it is the intersection. With logical or, the result is True if either condition is true (or both); in terms of set theory, it is the union.

For example, to identify the people who are adults and have an income greater than $5,000, use:

acs[(acs['age']>=18) & (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
7 132531 18 White Never married 9th grade 7900.0 1 0

Yes, the parentheses are essential.

To identify the people who are either adults or have an income greater than $5,000, use:

acs[(acs['age']>=18) | (acs['income']>5000)]
household age race marital_status edu income female hispanic
person
1 132531 47 Black Now married Associate degree 20000.0 0 0
2 132531 38 White Now married High School graduate 18500.0 1 0
3 132531 19 Black Never married 10th grade 5000.0 0 0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0
7 132531 18 White Never married 9th grade 7900.0 1 0
Exercise

Select and print the adult females in the data set.

2.2.5 Which Subsetting Method Should You Use?

Plain square brackets, loc, and iloc all do basically the same thing, so which should you use?

Subsets specified by iloc are always precise and unambiguous, but they make the code hard to read because you have to know exactly what the data set looks like to know what they mean. We’ll rarely use iloc.

Plain square brackets are simple and convenient, but they can’t do as much. Also, because they’re used with so many different kinds of objects, the underlying code isn’t optimized for DataFrames. Avoid them whenever speed is a major consideration.

loc is almost as easy to use as plain square brackets, is much more capable, and is optimized for performance. Plus, the resulting code is easier to understand than with iloc.

In this book we’ll normally use plain square brackets for simple subsets, especially selecting a single column, and loc for anything more complicated.

Important

When you use plain square brackets, the argument inside the brackets answers the question “Which columns?” When you use loc (or iloc) with one argument, it answers the question “Which rows?” If you use two arguments with loc (or iloc) they answer the questions “Which rows?” and “Which columns?” respectively.

2.3 Changing DataFrames

You can change the data in a DataFrame by assigning new values to it. You’ll use the subsetting tools from the previous section to select which parts of the DataFrame will be changed. To create a new column, select it as a subset of the DataFrame as if it already existed and then assign a value to it.

2.3.1 Numeric Variables

The incomes in this data set date from the year 2000, so to correct for inflation and convert them to January 2023 dollars we need to multiply them by 1.77:

acs['income_2023'] = acs['income']*1.77
acs
household age race marital_status edu income female hispanic income_2023
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0
6 132531 1 Black NaN NaN NaN 0 0 NaN

Note how the income of person 6 is NaN, or ‘Not a Number’. This is how Python stores missing values. Attempting to multiply NaN by 1.77 results in NaN, so income_2023 is automatically missing as well. This is what we want and expect.

Exercise

Go to the BLS inflation calculator and see what you need to multiply incomes from 2000 by to get dollars in the most recent month that has data (assume for simplicity that the incomes are from January 2000). Create a column in the acs DataFrame to store income in current dollars.

2.3.2 Indicator (Boolean) Variables

Now create an indicator variable for ‘this person is black’. Since you’re setting the new variable equal to a condition, it will get True if the condition is true and False if it is false:

acs['black'] = (acs['race']=='Black')
acs
household age race marital_status edu income female hispanic income_2023 black
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True
6 132531 1 Black NaN NaN NaN 0 0 NaN True

In this case the parentheses are purely for readability.

By setting the variable black equal to a condition, we get a column that is either True or False.
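
As an aside, if you ever wanted black stored as 0/1 like female and hispanic, you could convert it with astype() (a quick sketch; we don’t need it here):

acs['black'].astype(int)

This returns 1 where black is True and 0 where it is False.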

Exercise

Create an indicator variable for ‘this person is an adult.’

2.3.3 Missing Values (NaN)

Next suppose we want to create an indicator variable has_income which tells us if the person’s income is greater than zero. This seems like a straightforward extension of what we just did:

acs['has_income'] = (acs['income'] > 0)

The problem is person 6 and their missing value for income. Because we don’t know their income, we don’t know what their value of has_income should be. (We could make a pretty good guess based on their age, but we don’t know.) So the right value of has_income for person 6 is NaN. There are just two problems with that: Python won’t automatically give us that answer, and it won’t automatically store it.

What happens if we do run that code?

acs['has_income'] = (acs['income'] > 0)
acs
household age race marital_status edu income female hispanic income_2023 black has_income
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True True
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False True
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
6 132531 1 Black NaN NaN NaN 0 0 NaN True False

has_income looks fine for all the observations except person 6, where we got False rather than NaN. The problem here is that the result of a comparison like (acs['income'] > 0) must be either True or False. NaN is not an option.

Different languages handle missing values in conditions in different ways, and they all have their advantages and disadvantages. Python’s is at least easy to remember: any condition involving NaN returns False.

NaN is defined by the numpy package, so to talk about it directly we need to import numpy. The standard abbreviation for it is np:

import numpy as np

Now compare acs['income'] to np.NaN and note the results:

acs['income']==np.NaN
person
1    False
2    False
3    False
4    False
7    False
5    False
6    False
Name: income, dtype: bool

income is NaN for person 6, but it doesn’t matter: the rule that any condition involving NaN returns False takes precedence. Even comparing NaN to itself returns False:

np.NaN==np.NaN
False

If you think of NaN as representing unknown quantities this makes some sense: two unknown quantities are very unlikely to be identical.

To detect missing values, instead use the isna() function:

acs['income'].isna()
person
1    False
2    False
3    False
4    False
7    False
5    False
6     True
Name: income, dtype: bool

You can also use notna() to identify valid values.
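
For example:

acs['income'].notna()
person
1     True
2     True
3     True
4     True
7     True
5     True
6    False
Name: income, dtype: bool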

With isna() in your toolkit, you can tell Pandas to change the value of has_income to np.NaN for all the rows where income is missing with code like:

acs.loc[acs['income'].isna(), 'has_income'] = np.NaN

There’s just one problem: a strictly Boolean variable can only store True or False, not NaN. The solution is to convert has_income from the default bool type to the boolean type. boolean is a nullable Boolean type, meaning it can store NaN as well.

In previous versions of Pandas, putting NaN in a variable of type bool would prompt Pandas to silently promote the variable to type boolean, but the Pandas developers decided this was convenient but dangerous. After all, bool and boolean sometimes behave differently. So they want you to explicitly convert has_income to boolean using the astype() function before you put NaN in it:

acs['has_income'] = acs['has_income'].astype('boolean')
acs.loc[acs['income'].isna(), 'has_income'] = np.NaN
acs
household age race marital_status edu income female hispanic income_2023 black has_income
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True True
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False True
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False
6 132531 1 Black NaN NaN NaN 0 0 NaN True <NA>

As of Pandas 2.1, skipping the conversion works (has_income will still be automatically promoted) but you’ll get a warning. In a future version it will crash.

There’s actually a similar issue with integer variable types, as np.NaN is technically a floating point number. However, converting integers to floating point numbers is not considered a promotion, so setting some values of an integer variable to NaN results in the variable being converted to floating point without any fuss.
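
To see this in action, you could set one value of female (currently an integer column) to NaN on a throwaway copy and check the resulting type (a quick sketch; demo is just a scratch name, and it assumes the silent conversion described above):

demo = acs.copy(deep=True)       # scratch copy so acs itself is unchanged
demo.loc[6, 'female'] = np.NaN   # make person 6's value missing
demo['female'].dtype
dtype('float64')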

Creating has_income with some incorrect values and then immediately correcting them is not bad. But let’s talk about a couple of ways you could create it with all the right values immediately.

2.3.3.1 Subsets and New Variables

One takes advantage of the fact that a brand new variable essentially starts out with all its values set to NaN. If you use a subset to only set values for some observations when you create it, the observations that are left out will get NaN:

acs.loc[acs['income'].notna(), 'has_income2'] = (acs['income']>0)
acs
household age race marital_status edu income female hispanic income_2023 black has_income has_income2
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True True True
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False True True
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False
6 132531 1 Black NaN NaN NaN 0 0 NaN True <NA> NaN

Because the new variable is only set to the result of (acs['income']>0) for rows where income has a valid value (acs['income'].notna()), rows where income is missing get NaN for has_income2. Just remember that this only works for brand new variables. If has_income2 already exists, the subset means that rows where income is missing keep whatever value of has_income2 they had before.

2.3.3.2 The where() Function

Another way to do this relies on a very useful NumPy function called where(). It takes three arguments:

  1. A condition (something that is either true or false)
  2. The result if the condition is true
  3. The result if the condition is false

For example:

np.where(
    acs['age']>=18,
    'Adult',
    'Child'
)
array(['Adult', 'Adult', 'Adult', 'Adult', 'Adult', 'Child', 'Child'],
      dtype='<U5')

The result is an array rather than a Series because Series is a Pandas class, but Pandas will take care of converting arrays to Series as needed.

Our rules for has_income are:

  • if income is not missing, set it to the result of acs['income']>0
  • if income is missing, set it to np.NaN

We can do this using where():

acs['has_income3'] = np.where(
    acs['income'].notna(),
    acs['income']>0,
    np.NaN
)
acs
household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True True True 1.0
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False True True 1.0
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True 1.0
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True 1.0
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True 1.0
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False 0.0
6 132531 1 Black NaN NaN NaN 0 0 NaN True <NA> NaN NaN

There’s just one wrinkle: has_income3 was set to 1.0 or 0.0 instead of True or False! It’s the problem of variable type again: NumPy sees that the where() function is going to return True, False, and NaN, but it doesn’t know about the Pandas boolean type. So it decides it can store all three as floating point numbers if it converts True to 1.0 and False to 0.0. Converting True and False to 1 and 0 is quite useful at times (we’ll use it a lot in Working With Hierarchical Data), but not what we want here.

The solution is to tell where() to use the boolean type:

acs['has_income3'] = np.where(
    acs['income'].notna(),
    (acs['income']>0).astype('boolean'),
    np.NaN
)
acs
household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3
person
1 132531 47 Black Now married Associate degree 20000.0 0 0 35400.0 True True True True
2 132531 38 White Now married High School graduate 18500.0 1 0 32745.0 False True True True
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True True
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False False
6 132531 1 Black NaN NaN NaN 0 0 NaN True <NA> NaN NaN
Exercise

Create an indicator variable for “this person is married.” For this exercise, do not assume that the one-year-old cannot possibly be married.

2.3.3.3 Three-Valued Logic

One difference between bool and boolean is that boolean uses three-valued logic (specifically, Kleene logic) for combining conditions that could include NaN. It’s three-valued because the result could be True, False, or NaN. Conceptually, you replace each NaN value first with True and then with False. If the result is the same either way, then that’s the result. If it is not the same, then the result is NaN. For example:

acs['black'] | acs['has_income']
person
1    True
2    True
3    True
4    True
7    True
5    True
6    True
dtype: boolean

The result is True for person 6 because black is true and with | (logical or) only one of the two values needs to be True for the result to be True. Compare with:

acs['black'] & acs['has_income']
person
1     True
2    False
3     True
4    False
7    False
5    False
6     <NA>
dtype: boolean

Now the result for person 6 is NaN, because the result would be True if has_income were True, but False if has_income were False. Thus we don’t know the true result.

2.3.4 Copies vs. Views (Again)

Make a DataFrame containing just the teenagers from our example household:

teens = acs[(acs['age']>12) & (acs['age']<20)]
teens
household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True True
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True True
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True True
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False False

Now try creating a new variable in that DataFrame.

teens['x'] = 1
teens
C:\Users\local_rdimond\Temp\54\ipykernel_54244\2513810258.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  teens['x'] = 1
household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3 x
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True True 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True True 1
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True True 1
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False False 1

So it worked, but what’s with the warning message? The suggestion to switch to loc is something of a red herring (try it–it doesn’t help). The real problem is copies and views again.

When you create a new DataFrame that is a subset of an existing DataFrame, Python will decide whether the new DataFrame should be a copy or a view based on what it thinks will give better performance, and that depends on things that you don’t know and cannot easily find out, like the structure of the DataFrame in memory. The bottom line is that you don’t know if the new DataFrame is a copy or a view and that’s fine–until you start making changes to it. Then you don’t know if the changes will carry over to the original DataFrame or not, and that’s not fine. That’s when you get this warning message.

The solution is to explicitly make the new DataFrame a copy using copy(deep=True):

teens = acs[
    (acs['age']>12) & (acs['age']<20)
].copy(deep=True)

teens['x'] = 1
teens
household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3 x
person
3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True True 1
4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True True 1
7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True True 1
5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False False 1

Now you can make changes to teens without getting that warning.

2.3.5 Function Chaining

The above is an example of function chaining. The first step in the chain is to take a subset of acs with acs[(acs['age']>12) & (acs['age']<20)]. The key insight is that the result of the first step is a DataFrame and so it has a copy() function, which we call as the second step in the chain. Chains can be as long as you like, though they’re good candidates for splitting across lines. For example, recreate the teens DataFrame and reset the index so it goes from 0 to 3 rather than using the person number from the (no longer relevant) full household:

teens = (
    acs[(acs['age']>12) & (acs['age']<20)].
    copy(deep=True).
    reset_index()
)
teens
person household age race marital_status edu income female hispanic income_2023 black has_income has_income2 has_income3
0 3 132531 19 Black Never married 10th grade 5000.0 0 0 8850.0 True True True True
1 4 132531 18 Two or more races Never married 11th grade 2200.0 0 0 3894.0 False True True True
2 7 132531 18 White Never married 9th grade 7900.0 1 0 13983.0 False True True True
3 5 132531 16 Black Never married 9th grade 0.0 0 0 0.0 True False False False
Exercise

Create a DataFrame containing just the people we know had income (note framing it as ‘people we know had income’ tells you what to do with missing values), reset its index, and sort it in descending order of income. Carry out all these tasks in a single function chain.