Most data wrangling work in Python revolves around DataFrames, a class introduced by the Pandas package. A DataFrame represents a classic data set organized as a table or matrix, with rows representing observations and columns representing variables. In this chapter we’ll learn the basics of working with DataFrames.
3.1 DataFrame Fundamentals
We’ll begin by examining some of the fundamental properties of DataFrames.
3.1.1 Setting Up
Start up JupyterLab if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click File, New, Notebook and choose the Python 3 kernel to create a new Notebook. It will start out as Untitled.ipynb, so right-click on its tab, choose Rename Notebook, and change the name to DataFrames_Practice.ipynb.
We’ll start by importing the Pandas package and then reading in a simple data set from disk. acs_household.csv is one of the example files and contains a single household from the 2000 American Community Survey. It’s a comma-separated values (CSV) file, one of the most popular types of data files. The read_csv() function in the Pandas package reads CSV files. The first argument for read_csv() is the file to be read, usually in the form of a string containing the file name. Since the file is in the same folder as your new Notebook, you don’t have to tell Python where to find it. read_csv() has many keyword arguments you can pass in to control how the file is read, and we’ll talk about how and why to use the most important ones in the next chapter. However, by default read_csv() will try to figure out the best way to read the file, and since this file is simple and has already been cleaned up it will be successful.
import pandas as pd

acs = pd.read_csv('acs_household.csv')
acs
| | household | person | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 132531 | 1 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 1 | 132531 | 2 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 2 | 132531 | 3 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 3 | 132531 | 4 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 4 | 132531 | 5 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 5 | 132531 | 6 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 6 | 132531 | 7 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
The variables in this data set are:
household: identifies the household
person: identifies a person within the household
age: the person’s age in years
race: the person’s race
marital_status: the person’s marital status
edu: the person’s educational attainment
income: the person’s income
female: an indicator for whether the person is female or not, with 1 meaning yes, they are female
hispanic: an indicator for whether the person is Hispanic or not, with 1 meaning yes, they are Hispanic
3.1.2 Indexes
Note that I did not include in the list of variables the column of numbers you see on the left in the printout of acs. This column is the row index. One of the key features of a Pandas DataFrame is that it always has indexes for both the rows and columns. Like the index of a book, the indexes of a DataFrame tell you where to find a particular piece of information.
Suppose I wanted to draw your attention to the number 19 in the acs DataFrame. One way I could do it would be to tell you to look at the number in the third row and the third column, what Python calls row 2 and column 2 since it numbers from 0. However, I could also tell you to look at the age variable of the row where person is 3. That’s because the variable names act as a column index, and in this data set person acts as a row index. While identifying rows and columns by number is sometimes useful, keep in mind that many operations will change those numbers. For example, if you dropped all the adults in the data set or sorted the data set by age, then person 3 would no longer be on row 2. Thus you’ll usually specify subsets of the data using indexes. We’ll talk more about identifying indexes and what they tell you about the structure of your data in First Steps With Your Data.
Note that to identify a person in the full ACS you need to know both their person number and their household number, because person only uniquely identifies a person within a household (i.e. every household has a person 1). This is called MultiIndexing, and I extracted just one household from the ACS so we don’t have to deal with it yet.
When we read in this data set using read_csv(), it created a default row index for us based on the row numbers. However, once the index is created we can change the order of the observations (using sort_values()) and the same index will still be associated with the same observation.
acs = acs.sort_values('age')
acs
| | household | person | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 132531 | 6 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 4 | 132531 | 5 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 3 | 132531 | 4 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 6 | 132531 | 7 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
| 2 | 132531 | 3 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 1 | 132531 | 2 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 0 | 132531 | 1 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
This data set came with a row index: the person variable. We can tell Python to use it as the index with set_index():
acs = acs.set_index('person')
acs
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
Note that the original row numbers have disappeared, and person has been moved to the far left with its name dropped down a line. That’s because it’s now the row index. Indexes are not variables: now that person is an index you can’t, for example, create a new variable that’s equal to person times 2. That’s okay; identifiers like person generally don’t contain information. However, you can still do some things with an index, like sort by it.
acs = acs.sort_values('person')
acs
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
This put the data back in the original order.
What if you really needed to do math with person? You can convert it from an index back to a variable by resetting the index:
acs.reset_index()
| | person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 1 | 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 2 | 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 3 | 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 4 | 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 5 | 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 6 | 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
This converts person to a variable and creates a new index based on row numbers, just like read_csv() did originally.
3.1.3 Assignment vs. Changing in Place
Hopefully you noticed that most of the code cells thus far have started with acs = but the last one did not. That’s why if I ask Python to print the acs DataFrame, person is still the index:
acs
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
When I ran acs.reset_index(), the reset_index() function returned a new copy of acs with the index reset. But I did not store this new DataFrame as acs like I did previously. Because I did not store it, Jupyter Lab carried out an implicit print and showed me the modified copy. But the original acs was unchanged.
You can use the inplace=True argument to tell a function to change the DataFrame it acts on. To demonstrate, sort the acs DataFrame by age using inplace=True, and just to keep things interesting also add ascending=False so it’s sorted in descending order. Since this function call is getting long, put each argument on its own line:
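One way to write that call:

acs.sort_values(
    'age',
    ascending=False,
    inplace=True
)

Since inplace=True modifies acs directly, the call itself returns nothing to print.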
You might think that using inplace=True would save memory because the function can change the existing DataFrame rather than making a new copy. In reality, most of the time the function creates a new copy and then puts it in the place of the old one anyway. So we’ll rarely use inplace=True. But we will frequently use DataFrame functions without changing the original DataFrame.
3.1.4 Copies vs. Views
If you create a new DataFrame by setting it equal to another DataFrame, the result may not be what you expect.
acs2 = acs
acs2['hispanic'] = 1
acs
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 1 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 1 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 1 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 1 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 1 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 1 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 1 |
The line acs2['hispanic'] = 1 told Python to make everyone in the acs2 DataFrame Hispanic, but it made everyone in the original acs DataFrame Hispanic as well. Why?
The reason is that the line acs2 = acs did not actually create a new DataFrame. It just gave the DataFrame acs a second name, acs2. Both names point to the same location in the computer’s memory. Thus if you make a change while referring to the DataFrame by one name it automatically affects the DataFrame when you refer to it by the other name as well.
Usually this is not what you want! If you want to make a new DataFrame that is a copy of an existing DataFrame but separate from it, use the copy() function. You will almost always want to pass in the argument deep=True. This tells copy() to make a deep copy, meaning that it should copy both the structure and values of the original DataFrame.
To demonstrate, make acs3 a deep copy of the original acs DataFrame, change everyone in acs back to being not Hispanic, and observe that that did not change acs3:
acs3 = acs.copy(deep=True)
acs['hispanic'] = 0
acs3
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 1 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 1 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 1 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 1 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 1 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 1 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 1 |
That’s because acs and acs3 are now separate DataFrames. The change did affect acs2, because it’s still just another name for acs:
acs2
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
DataFrames like acs2 are sometimes called views: they’re another way of viewing the same data. Views can sometimes be useful because they save memory, but they’re also confusing. We mention them here mostly as something to avoid.
3.2 Specifying Subsets of DataFrames
A critical Pandas skill is specifying a subset of a DataFrame. For better or worse, you have lots of options for doing so.
3.2.1 Subsetting with Square Brackets
Recall that you could select items from lists, tuples, and dictionaries using square brackets. You can use them with DataFrames as well.
If you put the name of a column in the square brackets it will select that column:
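For example:

acs['age']

The result is a plain Series of ages, indexed by person (47, 38, 19, 18, 18, 16, and 1 in the current sort order).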
Why acs['age'] rather than acs[age]? Because age without quotes would refer to a variable called age. You can use variables in subsets, and sometimes it’s very useful:
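For instance (the variable name my_column here is just for illustration):

# Store the column name in a regular Python variable, then use it to subset
my_column = 'age'
acs[my_column]

This selects the same age column as before, because Python first evaluates my_column to the string 'age'.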
But most of the time you’ll refer to a column name directly by putting it in quotes. Get used to typing lots of quotes in Python.
To select multiple columns, put them in a list:
acs[['age', 'edu']]
| person | age | edu |
|---|---|---|
| 1 | 47 | Associate degree |
| 2 | 38 | High School graduate |
| 3 | 19 | 10th grade |
| 4 | 18 | 11th grade |
| 7 | 18 | 9th grade |
| 5 | 16 | 9th grade |
| 6 | 1 | NaN |
Note that there’s no special meaning to double square brackets ([[ and ]]). The outer [] is for identifying a subset, and the inner [] is for defining a list.
Why does this look so much prettier than when we only selected age? Remember that each column in a DataFrame is a Series. If you select one column from a DataFrame, the result is a Series. When you select two or more columns, the result is a DataFrame. JupyterLab puts extra effort into printing DataFrames so they look nice, but it doesn’t bother with a Series. Note that both the DataFrame and the Series get an index that is copied from the original DataFrame. This allows Pandas to align the result with the original DataFrame if you use them together.
Exercise
Select and print the race, female, and hispanic columns from the acs DataFrame. (Yes, implicit print is fine.)
3.2.2 Subsetting with iloc
The iloc function allows you to select subsets by row and column number. It’s not quite a conventional function in that the arguments go in square brackets rather than parentheses.
If you give iloc a single number, it will select the corresponding row:
acs.iloc[0]
household 132531
age 47
race Black
marital_status Now married
edu Associate degree
income 20000.0
female 0
hispanic 0
Name: 1, dtype: object
The result is a Series again, but since it’s going across a row the index of the Series is the column names from the original DataFrame.
If you give iloc two numbers, it will select the value of the corresponding row and column:
acs.iloc[0, 0]
132531
This gave us the household number of the first person, because that is the value in the first row and first column (remember Python counts from zero and does not count the row index).
You can also use slices with iloc:
acs.iloc[1:3, 0:4]
| person | household | age | race | marital_status |
|---|---|---|---|---|
| 2 | 132531 | 38 | White | Now married |
| 3 | 132531 | 19 | Black | Never married |
Slices do not include their endpoint, so this gave us rows one and two (i.e. skipping the first row, row zero) and columns zero through three.
Exercise
Select and print the race of the first person in the acs DataFrame using iloc.
3.2.3 Subsetting with loc
The loc function allows you to subset using the DataFrame’s indexes. If you give it a value of the row index it will select that row:
acs.loc[1]
household 132531
age 47
race Black
marital_status Now married
edu Associate degree
income 20000.0
female 0
hispanic 0
Name: 1, dtype: object
Compare that with selecting row 1 (i.e. the second row) with iloc:
acs.iloc[1]
household 132531
age 38
race White
marital_status Now married
edu High School graduate
income 18500.0
female 1
hispanic 0
Name: 2, dtype: object
If you give it a value of the row index and a value of the column index, i.e. a column name, it will select that value:
acs.loc[1, 'household']
132531
You can select multiple rows and/or columns with lists:
acs.loc[[1, 2, 3], ['age', 'edu']]
| person | age | edu |
|---|---|---|
| 1 | 47 | Associate degree |
| 2 | 38 | High School graduate |
| 3 | 19 | 10th grade |
If the row indexes are integers, as they often are, you can also use a slice. Just to make things confusing, with loc the endpoint of the slice is included:
acs.loc[1:3, 'age']
person
1 47
2 38
3 19
Name: age, dtype: int64
To select all rows or all columns, use a colon (:) by itself:
acs.loc[:, ['age', 'edu']]
| person | age | edu |
|---|---|---|
| 1 | 47 | Associate degree |
| 2 | 38 | High School graduate |
| 3 | 19 | 10th grade |
| 4 | 18 | 11th grade |
| 7 | 18 | 9th grade |
| 5 | 16 | 9th grade |
| 6 | 1 | NaN |
acs.loc[[1, 2], :]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
But remember that passing in just a row is equivalent to selecting all columns:
acs.loc[[1, 2]]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
Exercise
Select and print the income and marital_status of persons 3-5 in the acs DataFrame using loc.
3.2.4 Subsetting by Conditions
Both plain square brackets and loc can select the rows that meet a condition. For example, to select the children in the household, use:
acs[acs['age']<18]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
Let’s take that apart a bit. acs['age']<18 gives us a Series:
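Store it in a variable so we can look at it and reuse it below (is_child is the name used in the rest of this section):

# A Boolean Series: True for each person under 18, False otherwise
is_child = acs['age'] < 18
is_child

It’s a Boolean Series, indexed by person: True for persons 5 and 6 and False for everyone else.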
Pandas used the indexes to ensure the right rows of is_child were aligned with the right rows of acs.
loc works the same way:
acs.loc[acs['age']<18]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
acs.loc[is_child]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
However, with loc you can still add a second argument to select columns:
acs.loc[acs['age']<18, ['age', 'edu']]
| person | age | edu |
|---|---|---|
| 5 | 16 | 9th grade |
| 6 | 1 | NaN |
3.2.4.1 Comparison Operators
The comparison operators you can use to create conditions are:
== (equals)
!= (not equals)
> (greater than)
< (less than)
>= (greater than or equal to)
<= (less than or equal to)
~ means ‘not’, and can be used to negate the following condition. Thus:
acs[~is_child]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
Comparison operators have rules for which operators are evaluated first just like arithmetic operators, and sometimes they may not be what you expect. Take:
acs[~acs['age']<18]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 |
Why isn’t ~acs['age']<18 equivalent to ~is_child? Because ~ is evaluated before <.
~is_child
is equivalent to:
~(acs['age']<18)
but:
~acs['age']<18
is equivalent to:
(~acs['age'])<18.
Applying the ~ operator to a number rather than a condition gives results that are unintuitive to say the least:
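For integers, ~ is the bitwise complement, so ~x is -(x + 1):

~acs['age']

Every age becomes a negative number (47 becomes -48, 1 becomes -2), every negative number is less than 18, and that is why the subset above returned every row.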
The solution is to use parentheses generously to make sure everything is evaluated in the order you want.
acs[~(acs['age']<18)]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
Exercise
Select and print edu for the Black people in the acs DataFrame. (Hint: race is a string and values of a string variable, like ‘Black’, go in quotes.)
3.2.4.2 Combining Conditions
To combine conditions, use logical and (&) or logical or (|). With logical and, the result is True if and only if both conditions are true; in terms of set theory, it is the intersection. With logical or, the result is True if either condition is true (or both); in terms of set theory, it is the union.
For example, to identify the people who are adults and have an income greater than $5,000, use:
acs[(acs['age']>=18) & (acs['income']>5000)]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
Yes, the parentheses are essential.
To identify the people who are either adults or have an income greater than $5,000, use:
acs[(acs['age']>=18) | (acs['income']>5000)]
| person | household | age | race | marital_status | edu | income | female | hispanic |
|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 |
Exercise
Select and print the adult females in the data set.
3.2.5 Which Subsetting Method Should You Use?
Plain square brackets, loc, and iloc all do basically the same thing, so which should you use?
Subsets specified by iloc are always precise and unambiguous, but they make the code hard to read because you have to know exactly what the data set looks like to know what they mean. We’ll rarely use iloc.
Plain square brackets are simple and convenient, but they can’t do as much. Also, because they’re used with so many different kinds of objects, the underlying code isn’t optimized for DataFrames. Avoid them whenever speed is a major consideration.
loc is almost as easy to use as plain square brackets, is much more capable, and is optimized for performance. Plus, the resulting code is easier to understand than with iloc.
In this book we’ll normally use plain square brackets for simple subsets, especially selecting a single column, and loc for anything more complicated.
Important
When you use plain square brackets, the argument inside the brackets answers the question “Which columns?” When you use loc (or iloc) with one argument, it answers the question “Which rows?” If you use two arguments with loc (or iloc), they answer the questions “Which rows?” and “Which columns?” respectively.
3.3 Changing DataFrames
You can change the data in a DataFrame by assigning new values to it. You’ll use the subsetting tools from the previous section to select which parts of the DataFrame will be changed. To create a new column, select it as a subset of the DataFrame as if it already existed and then assign a value to it.
3.3.1 Numeric Variables
The incomes in this data set date from the year 2000, so to correct for inflation and convert them to January 2023 dollars we need to multiply them by 1.77:
acs['income_2023'] = acs['income'] * 1.77
acs
| person | household | age | race | marital_status | edu | income | female | hispanic | income_2023 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 | 35400.0 |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 | 32745.0 |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 | 8850.0 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 | 3894.0 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 | 13983.0 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 | 0.0 |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 | NaN |
Note how the income of person 6 is NaN, or ‘Not a Number’. This is how Python stores missing values. Attempting to multiply NaN by 1.77 results in NaN, so income_2023 is automatically missing as well. This is what we want and expect.
Exercise
Go to the BLS inflation calculator and see what you need to multiply incomes from 2000 by to get dollars in the most recent month that has data (assume for simplicity that the incomes are from January 2000). Create a column in the acs DataFrame to store income in current dollars.
3.3.2 Indicator (Boolean) Variables
Now create an indicator variable for ‘this person is Black’. Since you’re setting the new variable equal to a condition, it will get True if the condition is true and False if it is false:
acs['black'] = (acs['race'] == 'Black')
acs
| person | household | age | race | marital_status | edu | income | female | hispanic | income_2023 | black |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 | 35400.0 | True |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 | 32745.0 | False |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 | 8850.0 | True |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 | 3894.0 | False |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 | 13983.0 | False |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 | 0.0 | True |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 | NaN | True |
In this case the parentheses are purely for readability.
By setting the variable black equal to a condition, we get a column that is either True or False.
Exercise
Create an indicator variable for ‘this person is an adult.’
3.3.3 Missing Values (NaN)
Now create an indicator for has_income, as in the person has a non-zero income.
acs['has_income'] = (acs['income'] > 0)
acs
| person | household | age | race | marital_status | edu | income | female | hispanic | income_2023 | black | has_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 132531 | 47 | Black | Now married | Associate degree | 20000.0 | 0 | 0 | 35400.0 | True | True |
| 2 | 132531 | 38 | White | Now married | High School graduate | 18500.0 | 1 | 0 | 32745.0 | False | True |
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 | 8850.0 | True | True |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 | 3894.0 | False | True |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 | 13983.0 | False | True |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 | 0.0 | True | False |
| 6 | 132531 | 1 | Black | NaN | NaN | NaN | 0 | 0 | NaN | True | False |
Note how person 6 got False for has_income. Given that person 6 is only one year old that’s almost certainly accurate, but Python didn’t put False in has_income because it knew that. Any comparison involving NaN will be false.
NaN is formally defined in the NumPy package. The Pandas package can store NaN values but if we want to talk about them directly we need to import NumPy, which is normally imported as np. Do that, and then see what you get when you compare income to NaN:
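For example:

import numpy as np

acs['income'] == np.nan

Every value in the result is False, even for person 6, because NaN is never equal to anything, not even another NaN.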
You could fix the has_income column by changing it to NaN for just the rows where income is NaN. This requires a subset that specifies both rows and columns, so you’ll need loc.
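A sketch of that fix, using isna() to find the rows where income is missing:

# Set has_income to NaN in just the rows where income itself is missing
acs.loc[acs['income'].isna(), 'has_income'] = np.nan
acs['has_income']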
There’s just one wrinkle: has_income is now stored as numbers (1 and 0) instead of True and False! The problem is that a true Boolean Series can only contain True or False, not NaN. So when Pandas saw that the has_income Series would contain NaN, it converted the whole thing into numbers. Automatically converting between True/False and 1/0 is a useful thing we’ll take advantage of later on, but it’s not what we want here.
Many objects have an astype() function that will convert them to different kinds of objects. Just pass in the name of the type to convert to. You can fix this problem by converting the result of acs['income']>0 back to Boolean. This will prompt Pandas to use a generalized type of Boolean series that can contain NaN as well as True and False.
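Since has_income currently holds the (numeric) result of acs['income'] > 0, one sketch of that conversion, assuming Pandas’ nullable 'boolean' dtype is the generalized type in question:

# Convert the numeric 1/0/NaN column to the nullable Boolean type
acs['has_income'] = acs['has_income'].astype('boolean')
acs['has_income']

Now the column contains True, False, and a missing value for person 6.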
There is an alternative way to create this variable. It relies on the fact that if you create a new column in a DataFrame but only set it to something for a subset of the rows, the rows outside the subset will be left with NaN.
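A sketch of that approach:

# Assign values only where income is known; everyone else gets NaN
acs.loc[acs['income'].notna(), 'has_income2'] = acs['income'] > 0
acs['has_income2']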
The loc on the left side of the equals sign specifies that only rows where income is not missing (acs['income'].notna()) should have has_income2 set to the result of the condition acs['income']>0. The other rows just get NaN.
Note that this only works when creating new variables. If has_income2 already existed, rows outside the subset would keep whatever values they had before rather than being set to NaN.
I’ve found most students find this second method more confusing, but it is sometimes useful.
Exercise
Create an indicator variable for “this person is married.” For this exercise, do not assume that the one-year-old cannot possibly be married.
3.3.4 Copies vs. Views (Again)
Make a DataFrame containing just the teenagers from our example household:
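Using the subsetting-by-condition tools from earlier:

teens = acs[(acs['age'] > 12) & (acs['age'] < 20)]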
Now try creating a new variable in that DataFrame.
teens['x'] = 1
teens
/tmp/ipykernel_256484/2513810258.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
teens['x'] = 1
| person | household | age | race | marital_status | edu | income | female | hispanic | income_2023 | black | has_income | has_income2 | x |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 132531 | 19 | Black | Never married | 10th grade | 5000.0 | 0 | 0 | 8850.0 | True | True | True | 1 |
| 4 | 132531 | 18 | Two or more races | Never married | 11th grade | 2200.0 | 0 | 0 | 3894.0 | False | True | True | 1 |
| 7 | 132531 | 18 | White | Never married | 9th grade | 7900.0 | 1 | 0 | 13983.0 | False | True | True | 1 |
| 5 | 132531 | 16 | Black | Never married | 9th grade | 0.0 | 0 | 0 | 0.0 | True | False | False | 1 |
So it worked, but what’s with the warning message? The suggestion to switch to loc is something of a red herring (try it–it doesn’t help). The real problem is copies and views again.
When you create a new DataFrame that is a subset of an existing DataFrame, Python will decide whether the new DataFrame should be a copy or a view based on what it thinks will give better performance, and that depends on things that you don’t know and cannot easily find out, like the structure of the DataFrame in memory. The bottom line is that you don’t know if the new DataFrame is a copy or a view and that’s fine–until you start making changes to it. Then you don’t know if the changes will carry over to the original DataFrame or not, and that’s not fine. That’s when you get this warning message.
The solution is to explicitly make the new DataFrame a copy using copy(deep=True):
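For example:

teens = acs[(acs['age'] > 12) & (acs['age'] < 20)].copy(deep=True)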
Now you can make changes to teens without getting that warning.
3.3.5 Function Chaining
The above is an example of function chaining. The first step in the chain is to take a subset of acs with acs[(acs['age']>12) & (acs['age']<20)]. The key insight is that the result of the first step is a DataFrame and so it has a copy() function, which we call as the second step in the chain. Chains can be as long as you like, though they’re good candidates for splitting across lines. For example, recreate the teens DataFrame and reset the index so it goes from 0 to 3 rather than using the person number from the (no longer relevant) full household:
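A sketch of that chain, split across lines for readability:

teens = (
    acs[(acs['age'] > 12) & (acs['age'] < 20)]
    .copy(deep=True)
    .reset_index()
)
teens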
Exercise
Create a DataFrame containing just the people we know had income (note that framing it as ‘people we know had income’ tells you what to do with missing values), reset its index, and sort it in descending order of income. Carry out all these tasks in a single function chain.