5 First Steps With Your Data

Once you’ve read in a new data set, your first goals are to understand the data set and to clean it up. Logically these are two separate processes, but in practice they are intertwined: you can’t clean your data set until you understand it at some level, but you won’t fully understand it until it’s clean. Thus this section will cover both.

We’ll also introduce a lot of data science concepts in this section. This makes for lengthy discussions of tasks that in practice you can complete very quickly.

In this section, we’ll primarily use the file 2000_acs_sample.csv. This file started as the 1% unweighted sample of the 2000 American Community Survey available from IPUMS, but then we took a 1% random sample of the households in that dataset just to make it easier to work with. This data set uses the ‘source’ variables directly from the Census Bureau rather than the ‘harmonized’ variables created by IPUMS, which are much cleaner. For real work you can usually use the harmonized variables, but we’re here to learn how to do the kinds of things IPUMS does to create them.

By using this data set in the examples, you’ll also gain some basic understanding of the U.S. population and some experience with a real and important data set, but you would not want to use this particular data file for research.

5.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to First Steps Part 1. Then import pandas and use pd.read_csv() to read in 2000_acs_sample.csv as acs.

import pandas as pd
acs = pd.read_csv('2000_acs_sample.csv')

5.2 Read the Documentation

When you download a data set, you’ll be tempted to open it up and go to work right away. Resist! Time spent reading the data set’s documentation (assuming there is some) can save you much more time down the road. Data providers may give you files containing documentation along with the data itself, or it may be on their web site. Feel free to skim what’s not relevant to you—this section will give you a better sense of what information is most important.

Unfortunately, not all data sets have good documentation, or any documentation at all, so figuring out the nature of a data set by looking at the data set itself is a vital skill. You also can’t assume that the documentation is completely accurate, so you need to check what it says.

The ACS has lots of good documentation, but for practice we’ll make minimal use of it (just the codebook) and figure out everything we can for ourselves. We’d still do all the same things if we were using the documentation, we’d just understand what we were looking at much more quickly.

5.3 Identify the Variables

To see what variables the data set contains, look at its dtypes:

acs.dtypes
year                 int64
datanum              int64
serial               int64
hhwt                 int64
gq                  object
us2000c_serialno     int64
pernum               int64
perwt                int64
us2000c_pnum         int64
us2000c_sex          int64
us2000c_age          int64
us2000c_hispan       int64
us2000c_race1        int64
us2000c_marstat      int64
us2000c_educ         int64
us2000c_inctot      object
dtype: object

The primary goal of looking at the dtypes is to see what variables you have and what they’re called. But it will frequently let you start a “to do list” of issues you need to address before analyzing the data. Here are some issues brought out by looking at the dtypes of this data set:

  1. The data set seems to have an excess of identifiers: serial, us2000c_serialno, pernum, and us2000c_pnum.

  2. Since you’re using a single data set from a single year, you don’t need year and datanum to tell you where each observation comes from.

  3. For the same reason, you also don’t need us2000c_ (‘US 2000 Census’) in your variable names to tell you where those variables come from.

  4. You have both household weight (hhwt) and person weight (perwt) variables even though this is supposed to be an unweighted sample.

  5. us2000c_inctot is stored as an object, probably a string, even though it should presumably be numeric. (gq is also stored as an object; we’ll have to investigate it as well.)

5.3.1 Exercise

Read in 2000_acs_harm.csv and examine its dtypes. This is a similar sample but with the IPUMS ‘harmonized’ variables. What issues did IPUMS resolve? What issues remain?

5.4 Look at the Data

Unless your data set is very small, you can’t possibly read all of it. But just looking at what Jupyter Notebook’s implicit print shows you may let you immediately spot patterns that would be difficult to detect using code:

acs
year datanum serial hhwt gq us2000c_serialno pernum perwt us2000c_pnum us2000c_sex us2000c_age us2000c_hispan us2000c_race1 us2000c_marstat us2000c_educ us2000c_inctot
0 2000 4 37 100 Households under 1970 definition 365663 1 100 1 2 20 1 1 5 11 0010000
1 2000 4 37 100 Households under 1970 definition 365663 2 100 2 2 19 1 1 5 11 0005300
2 2000 4 37 100 Households under 1970 definition 365663 3 100 3 2 19 1 2 5 11 0004700
3 2000 4 241 100 Households under 1970 definition 2894822 1 100 1 2 50 1 1 5 14 0032500
4 2000 4 242 100 Households under 1970 definition 2896802 1 100 1 2 29 1 1 5 13 0030000
28167 2000 4 1236624 100 Households under 1970 definition 7055193 1 100 1 1 29 1 1 1 11 0050100
28168 2000 4 1236624 100 Households under 1970 definition 7055193 2 100 2 2 26 1 1 1 9 0012000
28169 2000 4 1236756 100 Households under 1970 definition 8489120 1 100 1 2 58 1 1 1 14 0069800
28170 2000 4 1236756 100 Households under 1970 definition 8489120 2 100 2 1 61 1 1 1 14 0040800
28171 2000 4 1236779 100 Households under 1970 definition 8733299 1 100 1 1 30 1 3 3 9 0022110

28172 rows × 16 columns

Some things to notice:

  • year and datanum seem to have one value each, suggesting that we don’t need them.

  • hhwt and perwt (household and person weights) seem to always be 100, which makes sense given that this is supposed to be an unweighted sample.

  • pernum and us2000c_pnum appear to be identical.

  • pernum seems to count observations, starting over from 1 every time serial changes.

  • us2000c_sex, us2000c_hispan, us2000c_race1, and us2000c_marstat are clearly describing categories. We will have to refer to the codebook to find out what the numbers mean. (This also applies to us2000c_educ, it’s just not as obvious at this point.)

You can’t be sure that these patterns hold for the entire data set until you check, but now you know some things to check for. For example, value_counts() will give you both a list of all the values of a variable and their frequencies, so you can use it to verify that year is always 2000:

acs['year'].value_counts()
2000    28172
Name: year, dtype: int64

5.4.1 Exercise

Examine acs_harm in the same way. What issues do you see? Be sure to take a close look at inc_tot.

5.5 Identify the Indexes and Investigate the Structure of the Data

Logically, row indexes should uniquely identify a row. (Python does not actually enforce this, but duplicate indexes can cause performance problems. More importantly, they don’t make sense.) Back in chapter three, we used person as the row index in the small extract from this data set that we examined. In this not-yet-cleaned-up data set it’s called pernum, but it clearly does not uniquely identify rows. That’s an important clue about the structure of the data: there’s more going on here than just people.

Suppose that I gave you some ‘student data’ and you found that student_id uniquely identified the rows. Then you could reasonably conclude that each row represents a student.

But suppose you found instead that each value of student_id was associated with multiple rows, but a combination of student_id and class_id uniquely identified the rows. That would tell you that each row represents a student-class combination, or a class taken by a particular student. Loosely speaking we might just call each row a class, as long as it’s understood that two students taking the same class will be two rows.

This is critical information, and if you come to the SSCC’s statistical consultants for help you’re likely to be asked what an observation represents in your data set. You might be surprised by how often the answer turns out to be wrong. Ideally the data documentation will tell you what the structure of the data set is, but it’s easy to check and well worth doing.

One easy way to check whether pernum is a unique identifier is with value_counts():

acs['pernum'].value_counts()
1     11327
2      7786
3      4374
4      2689
5      1210
6       450
7       177
8        81
9        34
10       21
11       11
12        8
13        1
14        1
15        1
16        1
Name: pernum, dtype: int64

There are over 11,000 person 1’s, so clearly not. How about serial?

acs['serial'].value_counts()
1115231    16
562736     12
962652     12
1122678    12
1110995    12
           ..
790922      1
227103      1
227189      1
790501      1
1236779     1
Name: serial, Length: 11327, dtype: int64

We can’t see all the values of serial, but one of them is associated with 16 rows, so once again we know it is not a unique identifier.

How about the combination of serial and pernum? If we give value_counts() two columns to act on, it will tell us the frequency of each combination of them (i.e. a crosstab). So select both columns by subsetting with a list:

acs[['serial', 'pernum']].value_counts()
serial   pernum
37       1         1
829137   2         1
830078   2         1
         1         1
829441   1         1
                  ..
428384   1         1
428351   8         1
         7         1
         6         1
1236779  1         1
Length: 28172, dtype: int64

All the combinations we can see have just one row, but that’s a tiny fraction of the data set. How can we be sure?

The max() function finds the maximum value of a list, including a list of frequencies created by value_counts(). Try it first on the frequencies of serial:

acs['serial'].value_counts().max()
16

Since that gave us 16 as expected, now try it with the combination of serial and pernum:

acs[['serial', 'pernum']].value_counts().max()
1

With a maximum frequency of one, we now know that the combination of serial and pernum uniquely identifies the rows within this data set. In fact serial is a household identifier and pernum identifies a person within that household. We now know that this data set consists of people who are grouped into households.
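
As an aside, once you know which combination of variables ought to be unique, the duplicated() function gives you a quicker one-line check. Here’s a minimal sketch:

# duplicated() flags every row whose combination of serial and pernum
# has already appeared; any() then asks whether there are any such rows.
# False confirms the combination uniquely identifies rows.
acs.duplicated(subset=['serial', 'pernum']).any()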

When a combination of two or more variables uniquely identifies observations it is often known as a compound identifier, but Pandas calls it a MultiIndex. MultiIndex is actually a class in the Pandas package, though we’ll normally only use it in the context of a DataFrame.

It would make sense to set serial and pernum as the row indexes now, but we want to rename them first and we don’t yet know how to do that. So hold that thought.

5.5.1 Exercise

Read in the data set atus.csv. This is a selection from the American Time Use Survey, which measures how much time people spend on various activities. Find the identifiers in this data set. What does an observation represent? What was the first activity recorded for person 20170101170012?

5.6 Get Rid of Data You Won’t Use

Understanding data takes time. Even skipping past data you don’t care about to get to what you do care about takes time. So if you won’t use parts of a data set, get rid of those parts sooner rather than later. Doing so will also reduce the amount of memory needed to analyze your data and the amount of disk space needed to store it, and make anything you do with it run that much faster. If you change your mind about what you need, you can always change your code later.

We’ve determined that year and datanum contain no useful information (to us), us2000c_pnum is identical to pernum, and us2000c_serialno appears to be an alternative version of serial that we don’t need. Recall that the drop() function will drop columns from the data set as long as you set axis=1, so pass in those four as a list:

acs = acs.drop(
    [
        'year',
        'datanum',
        'us2000c_pnum',
        'us2000c_serialno'
    ],
    axis=1
)
acs
serial hhwt gq pernum perwt us2000c_sex us2000c_age us2000c_hispan us2000c_race1 us2000c_marstat us2000c_educ us2000c_inctot
0 37 100 Households under 1970 definition 1 100 2 20 1 1 5 11 0010000
1 37 100 Households under 1970 definition 2 100 2 19 1 1 5 11 0005300
2 37 100 Households under 1970 definition 3 100 2 19 1 2 5 11 0004700
3 241 100 Households under 1970 definition 1 100 2 50 1 1 5 14 0032500
4 242 100 Households under 1970 definition 1 100 2 29 1 1 5 13 0030000
28167 1236624 100 Households under 1970 definition 1 100 1 29 1 1 1 11 0050100
28168 1236624 100 Households under 1970 definition 2 100 2 26 1 1 1 9 0012000
28169 1236756 100 Households under 1970 definition 1 100 2 58 1 1 1 14 0069800
28170 1236756 100 Households under 1970 definition 2 100 1 61 1 1 1 14 0040800
28171 1236779 100 Households under 1970 definition 1 100 1 30 1 3 3 9 0022110

28172 rows × 12 columns

You can use drop() with axis=0 to drop observations based on the index. For example, if you wanted to drop the first observation you could do:

acs.drop(0, axis=0)
serial hhwt gq pernum perwt us2000c_sex us2000c_age us2000c_hispan us2000c_race1 us2000c_marstat us2000c_educ us2000c_inctot
1 37 100 Households under 1970 definition 2 100 2 19 1 1 5 11 0005300
2 37 100 Households under 1970 definition 3 100 2 19 1 2 5 11 0004700
3 241 100 Households under 1970 definition 1 100 2 50 1 1 5 14 0032500
4 242 100 Households under 1970 definition 1 100 2 29 1 1 5 13 0030000
5 296 100 Other group quarters 1 100 2 20 1 6 5 9 0003000
28167 1236624 100 Households under 1970 definition 1 100 1 29 1 1 1 11 0050100
28168 1236624 100 Households under 1970 definition 2 100 2 26 1 1 1 9 0012000
28169 1236756 100 Households under 1970 definition 1 100 2 58 1 1 1 14 0069800
28170 1236756 100 Households under 1970 definition 2 100 1 61 1 1 1 14 0040800
28171 1236779 100 Households under 1970 definition 1 100 1 30 1 3 3 9 0022110

28171 rows × 12 columns

But you don’t want to drop it, which is why we didn’t store the result as acs.

It’s much more common to want to drop observations based on a condition. Consider the gq variable:

acs['gq'].value_counts()
Households under 1970 definition               27339
Group quarters--Institutions                     406
Other group quarters                             356
Additional households under 1990 definition       71
Name: gq, dtype: int64

Group quarters are very different from ‘normal’ households, so if your research is about normal households you probably want to exclude them from your analysis. Use the following to keep just the households and drop the group quarters:

acs = acs[
    (acs['gq']=='Households under 1970 definition') | 
    (acs['gq']=='Additional households under 1990 definition')
]

5.7 Change Variable Names

A good variable name tells you clearly what the variable contains. Good variable names make code easier to understand, easier to debug, and easier to write. If you have to choose between making a variable name short and making it clear, go with clear.

Many good variable names contain (or should contain) multiple words. Pandas will let you put spaces in column names, but doing so can cause problems for some tasks and should generally be avoided. There are two competing conventions for making multi-word variable names readable. Camel case capitalizes the first letter of each word after the first: householdIncome, mothersEducation, etc. Snake case uses underscores instead of spaces: household_income, mothers_education, etc. You’ve probably noticed that we prefer snake case, but which one you use is less important than that you choose one and stick with it: don’t force yourself to remember whether you called your variable householdIncome or household_income this time!

When you use abbreviations use the same abbreviation every time, even across projects. This data set abbreviates education as educ, and there’s nothing wrong with that, but if you use edu in your other projects that’s sufficient reason to change it.

You can change the names of variables with the rename() function. The columns argument takes a dictionary, with each element of the dictionary containing the old name and desired new name of a variable as a key : value pair. Give the variables we care about in this data set better names with:

acs = acs.rename(
    columns = 
    {
        'serial' : 'household',
        'pernum' : 'person',
        'us2000c_sex' : 'sex',
        'us2000c_age' : 'age',
        'us2000c_hispan' : 'hispanic',
        'us2000c_race1' : 'race',
        'us2000c_marstat' : 'marital_status',
        'us2000c_educ' : 'edu',
        'us2000c_inctot' : 'income'
    }
)
acs.dtypes
household          int64
hhwt               int64
gq                object
person             int64
perwt              int64
sex                int64
age                int64
hispanic           int64
race               int64
marital_status     int64
edu                int64
income            object
dtype: object

5.8 Set Indexes

Now that we’ve given the identifiers the variable names we want, it’s time to set them as indexes. You’ll still do that with set_index() like before, except that you’ll pass in a list containing the two variable names:

acs = acs.set_index(['household', 'person'])
acs
hhwt gq perwt sex age hispanic race marital_status edu income
household person
37 1 100 Households under 1970 definition 100 2 20 1 1 5 11 0010000
2 100 Households under 1970 definition 100 2 19 1 1 5 11 0005300
3 100 Households under 1970 definition 100 2 19 1 2 5 11 0004700
241 1 100 Households under 1970 definition 100 2 50 1 1 5 14 0032500
242 1 100 Households under 1970 definition 100 2 29 1 1 5 13 0030000
1236624 1 100 Households under 1970 definition 100 1 29 1 1 1 11 0050100
2 100 Households under 1970 definition 100 2 26 1 1 1 9 0012000
1236756 1 100 Households under 1970 definition 100 2 58 1 1 1 14 0069800
2 100 Households under 1970 definition 100 1 61 1 1 1 14 0040800
1236779 1 100 Households under 1970 definition 100 1 30 1 3 3 9 0022110

27410 rows × 10 columns

The row index for acs is now a MultiIndex, which means Python now understands the hierarchical structure of the data. But how can you use it?

If you pass in a number to loc it will select that household:

acs.loc[37]
hhwt gq perwt sex age hispanic race marital_status edu income
person
1 100 Households under 1970 definition 100 2 20 1 1 5 11 0010000
2 100 Households under 1970 definition 100 2 19 1 1 5 11 0005300
3 100 Households under 1970 definition 100 2 19 1 2 5 11 0004700

To really take advantage of the MultiIndex, use xs(). With xs() you can pass in a value and then tell it whether it is for a household or a person with the level argument. To select a household, use:

acs.xs(37, level='household')
hhwt gq perwt sex age hispanic race marital_status edu income
person
1 100 Households under 1970 definition 100 2 20 1 1 5 11 0010000
2 100 Households under 1970 definition 100 2 19 1 1 5 11 0005300
3 100 Households under 1970 definition 100 2 19 1 2 5 11 0004700

To select all the people who are person one in their household (much more useful than you may think at this point) use:

acs.xs(1, level='person')
hhwt gq perwt sex age hispanic race marital_status edu income
household
37 100 Households under 1970 definition 100 2 20 1 1 5 11 0010000
241 100 Households under 1970 definition 100 2 50 1 1 5 14 0032500
242 100 Households under 1970 definition 100 2 29 1 1 5 13 0030000
377 100 Households under 1970 definition 100 2 69 1 1 5 1 0051900
418 100 Households under 1970 definition 100 2 59 1 1 2 8 0012200
1236119 100 Households under 1970 definition 100 1 51 1 1 1 11 0062200
1236287 100 Households under 1970 definition 100 1 41 1 3 1 10 0015000
1236624 100 Households under 1970 definition 100 1 29 1 1 1 11 0050100
1236756 100 Households under 1970 definition 100 2 58 1 1 1 14 0069800
1236779 100 Households under 1970 definition 100 1 30 1 3 3 9 0022110

10565 rows × 10 columns

To specify both a household number and a person number, pass in a tuple:

acs.xs((37,1))
hhwt                                           100
gq                Households under 1970 definition
perwt                                          100
sex                                              2
age                                             20
hispanic                                         1
race                                             1
marital_status                                   5
edu                                             11
income                                     0010000
Name: (37, 1), dtype: object

You can also pass a tuple (or list) to level, specifying which order the levels are in. This allows you to change the order:

acs.xs((1,37), level=('person', 'household'))
hhwt gq perwt sex age hispanic race marital_status edu income
household person
37 1 100 Households under 1970 definition 100 2 20 1 1 5 11 0010000

Note how specifying level causes the result to be a DataFrame rather than a Series even though it’s a single row.

5.9 Convert Strings Containing Numbers to Numbers

The income column in acs looks like it contains numbers, but examining the dtypes shows it contains objects, specifically strings.

acs.dtypes
hhwt               int64
gq                object
perwt              int64
sex                int64
age                int64
hispanic           int64
race               int64
marital_status     int64
edu                int64
income            object
dtype: object

Unfortunately this is not unusual. In order to do math with income we need to convert it to an actual numeric variable. The Pandas function to_numeric() can do that for us, but we’ll need to grapple with why income is a string in the first place.

Convert the income column to numbers with to_numeric() but store the result as a separate Series rather than as part of the acs DataFrame for now. Also pass in errors='coerce':

income_test = pd.to_numeric(
    acs['income'],
    errors='coerce'
)
income_test
household  person
37         1         10000.0
           2          5300.0
           3          4700.0
241        1         32500.0
242        1         30000.0
                      ...   
1236624    1         50100.0
           2         12000.0
1236756    1         69800.0
           2         40800.0
1236779    1         22110.0
Name: income, Length: 27410, dtype: float64

This looks like it worked, but try removing errors='coerce' and see what happens. The important part of the error message you’ll get is: ValueError: Unable to parse string “BBBBBBB”. This tells you that some values of income are set to BBBBBBB, which can’t be converted into a number. With errors='coerce' these are converted to NaN (non-numeric values are ‘coerced’ into becoming numbers by converting them to NaN).

Before we decide what to do about non-numeric values of income, we need to know if there are non-numeric values other than BBBBBBB. The rows where income is non-numeric have NaN for income_test, so we can use that to select them:

acs.loc[
    income_test.isna(),
    'income'
].value_counts()
BBBBBBB    6144
Name: income, dtype: int64

Note how it’s no problem to use a condition based on income_test to subset acs even though income_test isn’t part of acs. That’s because income_test has the same number of rows as acs and the same indexes, which allows Python to align the two successfully.

The result tells us that BBBBBBB is in fact the only non-numeric value of income, but doesn’t tell us what it means. If you’re guessing that it’s a code for missing you’re right, but to know that for sure you’d need to read the data documentation.

Since BBBBBBB is in fact a code for missing, setting it to NaN like errors='coerce' did is entirely appropriate. So go ahead and convert the income column in acs to numeric using to_numeric() and errors='coerce':

acs['income'] = pd.to_numeric(
    acs['income'],
    errors='coerce'
)
acs.dtypes
hhwt                int64
gq                 object
perwt               int64
sex                 int64
age                 int64
hispanic            int64
race                int64
marital_status      int64
edu                 int64
income            float64
dtype: object

The code you used to learn what non-numeric values were in income has served its purpose and does not need to be part of your data wrangling workflow. Given that writing a program and making it work usually involves running it many times, if this code took a non-trivial amount of time to run you definitely wouldn’t want to have to wait for it every time.

You could just remove it, but it’s better to document that you checked for other non-numeric values, what you found, and what you did about it. You could do that by putting a # in front of each line so it becomes a comment, or by turning the code cells into Markdown cells. You can tell Markdown to display the code verbatim as a code block by putting ``` before it and ``` after it, and then add text explaining what you found.
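
For example, after commenting it out, the cell that ran the check might end up looking something like this (the exact wording of the note is up to you):

# Checked the non-numeric values of income before converting it:
# the only one was 'BBBBBBB', which the codebook says means missing,
# so letting errors='coerce' turn it into NaN is appropriate.
# acs.loc[income_test.isna(), 'income'].value_counts()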

5.9.1 Exercise

Load in the student survey data used as an exercise in the previous chapter, which you can do with:

survey = pd.read_csv(
    'qualtrics_survey.csv',
    skiprows=[1,2],
    usecols=['Q1', 'Q17', 'Q3', 'Q4']
)

Convert Q1 to numeric. You may treat ‘Between 20-25’ as a missing value but make sure that’s the only non-numeric value.

5.10 Identify the Type of Each Variable

The most common variable types are continuous variables, categorical variables, string variables, and identifier variables. Categorical variables can be further divided into unordered categorical variables, ordered categorical variables, and indicator variables. (There are other variable types, such as date/time variables, but we’ll focus on these for now.) Often it’s obvious what type a variable is, but it’s worth taking a moment to consider each variable and make sure you know its type.

Continuous variables can, in principle, take on an infinite number of values. They can also be changed by very small amounts (i.e. they’re differentiable). In practice, all continuous variables must be rounded, as part of the data collection process or just because computers do not have infinite precision. As long as the underlying quantity is continuous, it doesn’t matter how granular the available measurements of that quantity are. You may have a data set where the income variable is measured in thousands of dollars and all the values are integers, but it’s still a continuous variable.

Continuous variables are sometimes referred to as quantitative variables, emphasizing that the numbers they contain correspond to some quantity in the real world. Thus it makes sense to do math with them.

Categorical variables, also called factor variables, take on a finite set of values, often called levels. The levels are sometimes stored as numbers (1=White, 2=Black, 3=Hispanic, for example), but it’s important to remember that the numbers don’t actually represent quantities. Categorical variables can also be stored as strings, but Python has a class specifically designed for storing them that we’ll discuss shortly.

With unordered categorical variables, the numbers assigned are completely arbitrary. Nothing would change if you assigned different numbers to each level (1=Black, 2=Hispanic, 3=White). Thus it makes no sense to do any math with them, like finding the mean.

With ordered categorical variables, the levels have some natural order. Likert scales are examples of ordered categorical variables (e.g. 1=Very Dissatisfied, 2=Dissatisfied, 3=Neither Satisfied nor Dissatisfied, 4=Satisfied, 5=Very Satisfied). The numbers assigned to the levels should reflect their ordering, but beyond that they are still arbitrary: multiply them all by 2, or subtract 10 from the lowest level and add 5 to the highest level, and as long as they stay in the same order nothing real has changed. You will see people report means for ordered categorical variables and do other math with them, but you should be aware that doing so imposes assumptions that may or may not be true. Usually the scale has the same numeric interval between levels so moving one person from Satisfied to Very Satisfied and moving one person from Very Dissatisfied to Dissatisfied have exactly the same effect on the mean, but are you really willing to assume that those are equivalent changes?

Indicator variables, also called binary variables, dummy variables, or boolean variables, are just categorical variables with two levels. In principle they can be ordered or unordered, but with only two levels it rarely matters. Often they answer the question “Is some condition true for this observation?” Indicator variables are occasionally referred to as flags, and more commonly, flagging observations where a condition is true means creating an indicator variable for that condition.

String variables contain text. Sometimes the text is just labels for categories, and they can be treated like categorical variables. Other times they contain actual information.

Identifier variables allow you to find observations rather than containing information about them, though some compound identifiers blur the line between identifier variables and categorical variables. You’ll normally turn them into indexes as soon as you identify them.

An easy way to identify the type of a variable is to look at its value_counts(). Continuous variables will have many, many values (but they won’t all be printed), categorical variables will have fewer, and indicator variables will have just two. Try it with race and income:

acs['race'].value_counts()
1    20636
2     3202
8     1587
6      939
9      728
3      202
5       64
7       37
4       15
Name: race, dtype: int64
acs['income'].value_counts()
0.0         2579
30000.0      371
20000.0      318
25000.0      279
40000.0      278
            ... 
20330.0        1
19950.0        1
53140.0        1
130300.0       1
69800.0        1
Name: income, Length: 2613, dtype: int64

With just nine unique values, race is pretty clearly categorical. But income has 2,613, so it’s pretty clearly continuous.

Writing out a line of code like this for each variable would get tedious, so let’s talk about a better way.

5.10.1 For Loops

A for loop consists of a list and some code that is executed once for each item in the list. In this case the list will be the columns in acs, so the code will be executed once for each column. The code will print the value_counts() for that column, but let’s also have it print the name of the column above the list of counts.

A good place to start when writing a loop is to pick one item from the list and make the code work for that item. Let’s go with race, not that it matters much. Use the print() function to print the word ‘race’ followed by the value_counts() for race, followed by a new line character ('\n'), with a new line as the separator. Printing a new line character has the effect of putting in a blank line, which will be important when we repeat this for many columns.

print(
    'race',
    acs['race'].value_counts(),
    '\n',
    sep='\n'
)
race
1    20636
2     3202
8     1587
6      939
9      728
3      202
5       64
7       37
4       15
Name: race, dtype: int64

Now that we’ve figured out the code we want for race, what needs to change to do the same thing with a different column? The word ‘race’ appears twice in the code: once in the first argument of print() and once in subsetting acs. So we’ll replace ‘race’ wherever it appears with a variable that will eventually contain the name of the column the loop is currently working on:

column = 'race'
print(
    column,
    acs[column].value_counts(),
    '\n',
    sep='\n'
)
race
1    20636
2     3202
8     1587
6      939
9      728
3      202
5       64
7       37
4       15
Name: race, dtype: int64

Next consider the list that the loop will loop over. In this case the list we want is all the columns in acs, which we can get with the columns attribute:

acs.columns
Index(['hhwt', 'gq', 'perwt', 'sex', 'age', 'hispanic', 'race',
       'marital_status', 'edu', 'income'],
      dtype='object')

Now all we need to do is put the list and the code together as a for loop:

for column in acs.columns:
    print(
        column,
        acs[column].value_counts(),
        '\n',
        sep='\n'
    )
hhwt
100    27410
Name: hhwt, dtype: int64


gq
Households under 1970 definition               27339
Additional households under 1990 definition       71
Name: gq, dtype: int64


perwt
100    27410
Name: perwt, dtype: int64


sex
2    14084
1    13326
Name: sex, dtype: int64


age
12    461
41    461
40    459
36    458
38    451
     ... 
86     64
87     59
88     38
89     27
92      1
Name: age, Length: 92, dtype: int64


hispanic
1     23863
2      2073
24      583
3       319
4       129
5       111
11       73
16       46
19       41
7        40
17       33
8        17
12       16
13       15
23       14
6         9
15        7
9         7
21        5
10        4
20        3
22        2
Name: hispanic, dtype: int64


race
1    20636
2     3202
8     1587
6      939
9      728
3      202
5       64
7       37
4       15
Name: race, dtype: int64


marital_status
5    11750
1    11643
3     2177
2     1405
4      435
Name: marital_status, dtype: int64


edu
9     5763
11    3031
13    2920
2     2500
4     1590
10    1495
1     1290
3     1284
12    1202
0     1123
14    1057
6     1021
5      891
7      875
8      860
15     343
16     165
Name: edu, dtype: int64


income
0.0         2579
30000.0      371
20000.0      318
25000.0      279
40000.0      278
            ... 
20330.0        1
19950.0        1
53140.0        1
130300.0       1
69800.0        1
Name: income, Length: 2613, dtype: int64

Now we have value_counts() for all the variables in acs. So what can we learn from them?

  • hhwt and perwt are always 100, so we can drop them.
  • gq now has just two levels, making it an indicator variable for which definition of household applies. We don’t care, so we’ll drop it too.
  • sex is an indicator variable, but with the levels 1 and 2. We’ll have to look up what they mean.
  • age has 92 unique values and a range that looks like actual years, so we can be confident it’s a continuous variable.
  • You might think hispanic would be an indicator variable, but with 22 unique values it must be a categorical variable.
  • race and marital_status are categorical. Again, we’ll have to look up what the numbers mean.
  • With 17 unique values and examples that are plausible numbers of years in school, edu could be a quantitative variable. We’ll examine this variable more closely.
  • With 2,613 unique values and values that look like plausible incomes we can be confident income is a continuous variable.

Go ahead and drop hhwt, perwt, and gq:

acs = acs.drop(['hhwt', 'perwt', 'gq'], axis=1)
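
As an aside, if all you need is the number of distinct values in each column rather than the full frequency tables, the nunique() function returns that in a single call. Here’s a minimal sketch:

# Number of unique (non-missing) values in each column: one or two suggests
# a constant or an indicator, many suggests a continuous variable.
acs.nunique()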

5.10.2 Exercise

Load atus.csv and get value_counts() for all of its variables. Identify the variable types of famincome, hispan, asian.

5.11 Recode Indicator Variables

It is highly convenient to store indicator variables as actual boolean variables containing True or False, with the variable name telling you what it is that is either true or false. Consider the sex variable: right now it contains either 1 or 2, but we can only guess which of those numbers means male and which means female. If we had a variable called female and it contained either True or False, the meaning would be immediately obvious. (Of course this only works in a world where sex is binary. That’s true for the vast majority of data sets right now, but expect data to become more nuanced over time.)

There is a file called 2000_acs_codebook.txt among the example files that contains the actual codebook for this data set. Open it and find the description of Sex, and you’ll see:

 US2000C_1009       Sex
1       Male
2       Female

Now that we know what the codes mean, create a variable called female containing True or False:

acs['female'] = (acs['sex']==2)
acs
sex age hispanic race marital_status edu income female
household person
37 1 2 20 1 1 5 11 10000.0 True
2 2 19 1 1 5 11 5300.0 True
3 2 19 1 2 5 11 4700.0 True
241 1 2 50 1 1 5 14 32500.0 True
242 1 2 29 1 1 5 13 30000.0 True
1236624 1 1 29 1 1 1 11 50100.0 False
2 2 26 1 1 1 9 12000.0 True
1236756 1 2 58 1 1 1 14 69800.0 True
2 1 61 1 1 1 14 40800.0 False
1236779 1 1 30 1 3 3 9 22110.0 False

27410 rows × 8 columns

A good way to check your work is to run a crosstab of sex and female by passing both of them to value_counts(). Combinations that don’t make sense (e.g. sex=1 and female=True) should not appear in the list:

acs[['sex', 'female']].value_counts()
sex  female
2    True      14084
1    False     13326
dtype: int64
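
An equivalent check, if you prefer the counts laid out as a grid, is the pd.crosstab() function. Here’s a minimal sketch:

# Rows are values of sex, columns are values of female; the cells for
# combinations that shouldn't exist (e.g. sex=1 and female=True) should be 0.
pd.crosstab(acs['sex'], acs['female'])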

Now that you’re confident female is right you can drop sex:

acs = acs.drop('sex', axis=1)

5.11.1 Exercise

Convert hispanic to an indicator for ‘this person is Hispanic’. Look in the codebook to see which values of the existing hispanic variable mean ‘this person is Hispanic’ and which do not. So you can check your work, first create it as hisp, run a crosstab of hispanic and hisp, then drop the original hispanic and rename hisp to hispanic.

Since this exercise changes the data in ways that the rest of the code relies on we’ll include the solution, but avoid the temptation to read it until you’ve tried to solve it yourself.

acs['hisp'] = (acs['hispanic'] > 1)
acs[['hispanic', 'hisp']].value_counts()
acs = (
    acs.drop('hispanic', axis=1).
    rename(columns={'hisp' : 'hispanic'})
)
acs
age race marital_status edu income female hispanic
household person
37 1 20 1 5 11 10000.0 True False
2 19 1 5 11 5300.0 True False
3 19 2 5 11 4700.0 True False
241 1 50 1 5 14 32500.0 True False
242 1 29 1 5 13 30000.0 True False
1236624 1 29 1 1 11 50100.0 False False
2 26 1 1 9 12000.0 True False
1236756 1 58 1 1 14 69800.0 True False
2 61 1 1 14 40800.0 False False
1236779 1 30 3 3 9 22110.0 False False

27410 rows × 7 columns

5.12 Define Categories for Categorical Variables

If you look in the codebook for the race variable, you’ll find the following:

 US2000C_1022       Race Recode 1
1       White alone
2       Black or African American alone
3       American Indian alone
4       Alaska Native alone
5       American Indian and Alaska Native tribes specified, and American Indian or Alaska Native, not specified, and no other races
6       Asian alone
7       Native Hawaiian and Other Pacific Islander alone
8       Some other race alone
9       Two or more major race groups

Each number is associated with a text (string) description that tells us what the number means. The data set uses numbers because storing a number for each row uses much less memory than storing a string for each row. But to do anything useful you have to translate the numbers into their text descriptions.

A Python categorical variable does the translating for you. You work with the text descriptions (i.e. strings) and it quietly keeps track of the numbers behind them so you don’t have to. To turn a numeric categorical variable like we have now into a Python categorical variable, you’ll first convert it into a string variable and then have Python convert it into a categorical variable. (We’ll use shorter descriptions than the codebook.)

For the first step we’ll use the replace() function. To use it, we pass in a dictionary of dictionaries that describes the replacing to be done. That sounds scarier than it is, so let’s just look at one:

changes = {
    'race' : {
        1 : 'White',
        2 : 'Black',
        3 : 'American Indian',
        4 : 'Alaska Native',
        5 : 'Indigenous, Unspecified',
        6 : 'Asian',
        7 : 'Pacific Islander',
        8 : 'Other',
        9 : 'Two or more races'
    },
           
    'edu' : {
        0 : 'Not in universe',
        1 : 'None',               
        2 : 'Nursery school-4th grade',
        3 : '5th-6th grade',
        4 : '7th-8th grade',
        5 : '9th grade',
        6 : '10th grade',                  
        7 : '11th grade',
        8 : '12th grade, no diploma',
        9 : 'High School graduate',
        10 : 'Some college, <1 year',
        11 : 'Some college, >=1 year',
        12 : 'Associate degree',
        13 : "Bachelor's degree",
        14 : "Master's degree",
        15 : 'Professional degree',
        16 : 'Doctorate degree'
    }
}

For the ‘outer’ dictionary, the keys are column names, and the values are ‘inner’ dictionaries describing what needs to be done with that column. For the inner dictionaries, the keys are the old levels (i.e. what’s in the column now) and the values are the corresponding new levels (i.e. what they should be replaced with).

Now pass that into replace():

acs = acs.replace(changes)
acs
age race marital_status edu income female hispanic
household person
37 1 20 White 5 Some college, >=1 year 10000.0 True False
2 19 White 5 Some college, >=1 year 5300.0 True False
3 19 Black 5 Some college, >=1 year 4700.0 True False
241 1 50 White 5 Master’s degree 32500.0 True False
242 1 29 White 5 Bachelor’s degree 30000.0 True False
1236624 1 29 White 1 Some college, >=1 year 50100.0 False False
2 26 White 1 High School graduate 12000.0 True False
1236756 1 58 White 1 Master’s degree 69800.0 True False
2 61 White 1 Master’s degree 40800.0 False False
1236779 1 30 American Indian 3 High School graduate 22110.0 False False

27410 rows × 7 columns

5.12.1 Exercise

Do the same with marital_status, looking up the meanings in the codebook.

Since this exercise changes the data in ways that the rest of the code relies on we’ll include the solution, but avoid the temptation to read it until you’ve tried to solve it yourself.

marital_changes = {
    'marital_status': {
        1 : 'Now married',
        2 : 'Widowed',
        3 : 'Divorced',
        4 : 'Separated',
        5 : 'Never married'
    }
}
acs = acs.replace(marital_changes)
acs
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master’s degree 32500.0 True False
242 1 29 White Never married Bachelor’s degree 30000.0 True False
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master’s degree 69800.0 True False
2 61 White Now married Master’s degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

At this point, race, marital_status, and edu are all strings. Convert them to categories using the astype() function:

acs[['race', 'marital_status', 'edu']] = (
    acs[['race', 'marital_status', 'edu']].
    astype('category')
)
acs
age race marital_status edu income female hispanic
household person
37 1 20 White Never married Some college, >=1 year 10000.0 True False
2 19 White Never married Some college, >=1 year 5300.0 True False
3 19 Black Never married Some college, >=1 year 4700.0 True False
241 1 50 White Never married Master’s degree 32500.0 True False
242 1 29 White Never married Bachelor’s degree 30000.0 True False
1236624 1 29 White Now married Some college, >=1 year 50100.0 False False
2 26 White Now married High School graduate 12000.0 True False
1236756 1 58 White Now married Master’s degree 69800.0 True False
2 61 White Now married Master’s degree 40800.0 False False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

27410 rows × 7 columns

It may not look like anything has changed, but this DataFrame now takes up much less memory. You still refer to the string descriptions when writing code. For example, to get a subset with just the people who are High School graduates, use:

acs[acs['edu']=="High School graduate"]
age race marital_status edu income female hispanic
household person
484 1 33 Asian Now married High School graduate 16800.0 False False
894 1 72 White Now married High School graduate 22500.0 False False
930 1 47 White Divorced High School graduate 23010.0 True False
2 36 White Never married High School graduate 6800.0 True False
3 46 White Now married High School graduate 6400.0 True False
1235861 1 19 American Indian Now married High School graduate 12000.0 False False
1236287 2 42 White Now married High School graduate 27000.0 True False
3 23 White Divorced High School graduate 11400.0 True False
1236624 2 26 White Now married High School graduate 12000.0 True False
1236779 1 30 American Indian Divorced High School graduate 22110.0 False False

5763 rows × 7 columns

How about selecting people with a Master’s degree? Here we have a complication: we normally start and end strings with ', but the string Master's degree contains '. If we’re not careful, Python will think that the string ends after Master.

The solution is to start and end this string with ". Python is fine with either ' or " as a way to mark strings. When a string starts with ", Python knows the string doesn’t end until it sees " again.

acs[acs['edu']=="Master's degree"]
age race marital_status edu income female hispanic
household person
241 1 50 White Never married Master’s degree 32500.0 True False
894 2 69 White Now married Master’s degree 11300.0 True False
1014 1 52 White Now married Master’s degree 97000.0 False False
2 54 White Now married Master’s degree 23750.0 True False
1509 1 67 White Widowed Master’s degree 8000.0 False False
1232927 2 56 White Now married Master’s degree 40000.0 True False
1233756 4 23 White Never married Master’s degree 0.0 True False
1235506 1 35 White Divorced Master’s degree 33000.0 True False
1236756 1 58 White Now married Master’s degree 69800.0 True False
2 61 White Now married Master’s degree 40800.0 False False

1057 rows × 7 columns
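
Incidentally, if you want to verify that converting to categories really did shrink the DataFrame, the memory_usage() function reports how much memory each column uses. Here’s a minimal sketch; running it before and after astype('category') shows the difference:

# deep=True measures the actual memory used by object (string) columns
# rather than just the size of the references to them.
acs.memory_usage(deep=True)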

A categorical variable has an attribute called cat, an object that describes its categorical nature. cat in turn has an attribute called categories that lists the categories themselves:

acs['edu'].cat.categories
Index(['10th grade', '11th grade', '12th grade, no diploma', '5th-6th grade',
       '7th-8th grade', '9th grade', 'Associate degree', 'Bachelor's degree',
       'Doctorate degree', 'High School graduate', 'Master's degree', 'None',
       'Not in universe', 'Nursery school-4th grade', 'Professional degree',
       'Some college, <1 year', 'Some college, >=1 year'],
      dtype='object')

If you pass that to the enumerate() function and then turn the result into a dictionary with the dict() function, you’ll get a list of the numbers that correspond to the descriptions:

dict(enumerate(acs['edu'].cat.categories))
{0: '10th grade',
 1: '11th grade',
 2: '12th grade, no diploma',
 3: '5th-6th grade',
 4: '7th-8th grade',
 5: '9th grade',
 6: 'Associate degree',
 7: "Bachelor's degree",
 8: 'Doctorate degree',
 9: 'High School graduate',
 10: "Master's degree",
 11: 'None',
 12: 'Not in universe',
 13: 'Nursery school-4th grade',
 14: 'Professional degree',
 15: 'Some college, <1 year',
 16: 'Some college, >=1 year'}

The numbers were assigned by putting the descriptions in alphabetical order, which in this case means nothing. edu is supposed to be an ordered categorical variable, but we have some more work to do in order to make it one.

You can change the order of the categories using the reorder_categories() function of cat. Just pass in a list with the category names in the order you want, plus the argument ordered=True:

acs['edu'] = acs['edu'].cat.reorder_categories(
    [
        'Not in universe',
        'None',               
        'Nursery school-4th grade',
        '5th-6th grade',
        '7th-8th grade',
        '9th grade',
        '10th grade',                  
        '11th grade',
        '12th grade, no diploma',
        'High School graduate',
        'Some college, <1 year',
        'Some college, >=1 year',
        'Associate degree',
        "Bachelor's degree",
        "Master's degree",
        'Professional degree',
        'Doctorate degree'        
    ],
    ordered = True
)
dict(enumerate (acs['edu'].cat.categories))
{0: 'Not in universe',
 1: 'None',
 2: 'Nursery school-4th grade',
 3: '5th-6th grade',
 4: '7th-8th grade',
 5: '9th grade',
 6: '10th grade',
 7: '11th grade',
 8: '12th grade, no diploma',
 9: 'High School graduate',
 10: 'Some college, <1 year',
 11: 'Some college, >=1 year',
 12: 'Associate degree',
 13: "Bachelor's degree",
 14: "Master's degree",
 15: 'Professional degree',
 16: 'Doctorate degree'}

The fact that edu is now ordered lets us do things that depend on that order, like sorting or finding the maximum value:

acs['edu'].max()
'Doctorate degree'
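
Comparison operators also respect the category order, which makes it easy to build indicator variables from an ordered categorical. Here’s a minimal sketch:

# With an ordered categorical, >= compares positions in the category order,
# so this counts everyone with at least a bachelor's degree.
(acs['edu'] >= "Bachelor's degree").sum()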

5.13 Examine Variable Distributions

Understanding the distributions of your variables is important for both data cleaning and analysis.

5.13.1 Continuous Variables

For continuous variables, the describe() function is a great place to start for understanding their distribution:

acs['income'].describe()
count     21266.000000
mean      27724.104580
std       39166.397371
min      -10000.000000
25%        6000.000000
50%       18000.000000
75%       35800.000000
max      720000.000000
Name: income, dtype: float64

First off, note that the count (21,266) is much smaller than the number of rows in the data set (27,410). This is a good reminder that we have a substantial number of missing values.

The mean (27,724) is substantially higher than the 50th percentile, or median (18,000). This tells us that the distribution is right-skewed rather than normal, so our intuition about normal distributions (like the mean being in the center of the distribution) does not apply. The percentiles are helpful for understanding non-normal distributions, but a picture is even better.
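
If you’d like a single number summarizing the asymmetry, the skew() function computes the sample skewness (positive values indicate a right skew, and it ignores NaN). Here’s a minimal sketch:

# Sample skewness of income; a clearly positive value confirms the right skew.
acs['income'].skew()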

5.13.2 A Digression on Data Visualization

Most Python users use the Matplotlib package for creating data visualizations. Seaborn is another popular alternative. But most of our target audience either has done some work in R or will, so we’ll use plotnine, a Python version of Hadley Wickham’s ggplot2 R package.

import plotnine as p9

ggplot2 was based on The Grammar of Graphics by Leland Wilkinson, which defined a systematic way of describing data visualizations. We’ll keep things very simple (we’re making data visualizations to help us understand the data, not for publication), so we’ll only need two components. An aesthetic defines the relationships between the variables used and properties of the graph, for example the x and y coordinates. A geometry defines how that relationship is to be represented. In plotnine (and ggplot) you define a graph by giving it a data set and aesthetics and then add geometries to it.

plotnine is so similar to ggplot that you can almost copy ggplot code from R to Python. The main difference is that while R treats the ggplot functions like built-in functions (which can cause problems when different packages use the same name), Python treats them as package functions. So where R just refers to ggplot(), aes(), and the various geom functions, Python needs p9.ggplot(), p9.aes(), and p9.geom_histogram() and the like.

5.13.3 Back to Continuous Variables

A histogram will show you the distribution of a continuous variable. To create a histogram, call p9.ggplot(), passing in the data and a p9.aes() object that defines the aesthetics. In this case we want the x coordinate to correspond to income. Then add to it a p9.geom_histogram() that defines the width of each bin and plotnine will take it from there:

p9.ggplot(
    acs,
    p9.aes(x = 'income')
) + p9.geom_histogram(binwidth = 1000)
C:\ProgramData\Anaconda3\lib\site-packages\plotnine\layer.py:372: PlotnineWarning: stat_bin : Removed 6144 rows containing non-finite values.

[Histogram of income]

The warning about ‘non-finite values’ is just telling us that missing values were ignored, which is what we want.

The distribution certainly is skewed, to the point that it’s hard to see much. Try plotting just the incomes below $100,000 but greater than 0:

p9.ggplot(
    acs[(acs['income']<100000) & (acs['income']>0)],
    p9.aes(x = 'income')
) + p9.geom_histogram(binwidth = 1000)

5.13.4 Exercise

Create a histogram for age with binwidth set to 5 (years). Then create one where binwidth is set to 1 (i.e. every age gets its own bin). What do you see now that you couldn’t see before?

The age variable has been top-coded in this data set: 93 really means ‘93 or older’. Larger bin widths are arguably better for giving you an overall sense of a variable’s distribution: they avoid ‘missing the forest for the trees’. But small bin widths will sometimes show you very important details.

Let’s consider some of the anomalies we found in income. First, let’s try to identify who has a missing value for income. Start by looking at summary statistics for age for that group:

acs.loc[
    acs['income'].isna(),
    'age'
].describe()

Everyone with a missing value for income is less than 15 years old. Now turn that around and get summary statistics for income for everyone who is less than 15 years old:

acs.loc[
    acs['age']<15,
    'income'
].describe()

This tells us no one under the age of 15 has a valid value for income; they’re all NaN. We’ve discovered something about how the Census Bureau collected these data: they didn’t even ask about the income of anyone under the age of 15. Of course a thorough reading of the data documentation would have told us that without us having to discover it.

More puzzling is people with negative numbers for income. Let’s see what we can learn about them. First, look at their age distribution:

acs.loc[
    acs['income']<0,
    'age'
].describe()

Note that there aren’t very many of them (just 29), but there’s no obvious pattern to their ages. How about education? edu is a categorical variable, so use value_counts() rather than describe():

acs.loc[
    acs['income']<0,
    'edu'
].value_counts()

Again, there’s no obvious pattern like there was with missing values. We’ll think more about this.

5.13.5 Categorical Variables

To examine the distribution of categorical variables, use the familiar value_counts(). Start with edu. The table will be more useful if the education categories stay in their inherent order rather than being sorted by frequency, so pass in sort=False:

acs['edu'].value_counts(sort=False)

Things to note:

  • There are no missing values, or at least none coded as such.
  • The category “Not in universe” needs some investigation.
  • There are more people in lower education categories than you might expect.
  • There may be more categories here than are useful, so you might consider combining them for analysis.

There are a lot of numbers to read in that table, but a bar chart would allow you to see the patterns in it immediately. Creating one is almost identical to creating a histogram: it’s exactly the same aesthetic, but add p9.geom_bar() instead of p9.geom_histogram():

p9.ggplot(
    acs,
    p9.aes(x='edu')
) + p9.geom_bar()

Well that’s not very useful! There’s not enough space on the x-axis for all the labels, which is very common. But this problem has a very simple solution: switch to a horizontal bar graph. All you need to do is add p9.coord_flip() to it:

p9.ggplot(
    acs,
    p9.aes(x='edu')
) + p9.geom_bar() + p9.coord_flip()

This makes horizontal bars the format of choice for bar graphs.

Now let’s consider the meaning of ‘Not in universe’. We can be fairly confident that the Census Bureau is not collecting data on beings from other dimensions, but what does it mean? Start by examining ages of the people who have ‘Not in universe’ for education:

acs.loc[
    acs['edu']=='Not in universe',
    'age'
].describe()

This time they are all less than three years old. Now reverse it and examine the education of everyone who is less than three years old:

acs.loc[acs['age']<3, 'edu'].value_counts()

We’ve uncovered another data collection practice of the Census Bureau: they do not even ask about the education level of children under three. ‘Not in universe’ is a picturesque way of saying ‘Not in the population of people we ask that question.’

5.13.6 Exercise

Examine the distributions of race and hispanic in the same way. Who are the people who chose ‘Other’ for race? The 2010 Census (i.e. the census after the one we’re examining) added a clarifying note to the questions about Hispanic origin: ‘For this census, Hispanic Origins are not races.’ Can you see why? What problems did not having that note create in the 2000 data?

5.14 Recode Values that Mean Missing to NaN

Python uses NaN to denote missing values. Functions like mean() expect that and act accordingly, typically by omitting NaN from their calculations.

Data sets often use different codes to indicate missing values. We saw that with income: ‘BBBBBBB’ meant missing. This was automatically converted to a missing value when you converted income from a string variable to a numeric variable, but it won’t always be that simple. It’s very common to use negative numbers to mean missing, especially -9. Functions like mean() will not recognize that -9 means missing and will include it in calculations, giving incorrect results.
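To see why that matters, here's a tiny made-up example; the numbers are hypothetical, not from the ACS:

import numpy as np
ages = pd.Series([25, 40, -9, 33])        # -9 is meant to be 'missing'
print(ages.mean())                        # 22.25, dragged down by the -9
print(ages.replace(-9, np.nan).mean())    # about 32.67, the mean of the real values

Once the -9 is recoded to NaN, mean() skips it automatically.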

The solution is to identify values that really mean missing, and then change them to NaN. For example, ‘Not in universe’ for edu means missing, so change it with:

import numpy as np
acs.loc[
    acs['edu']=='Not in universe',
    'edu'
] = np.nan
acs.loc[484]

Are the negative values for income codes for missing? Take a look at them with:

acs.loc[
    acs['income']<0,
    'income'
].value_counts()

No. If these were codes for missing, we'd expect just one or two distinct values (something like -9) repeated over and over, and they'd be round numbers. These look like actual quantities: most likely they are people who lost money on investments and ended up with a negative net income.

But losing money on investments requires a certain level of wealth. That's problematic if you are using income primarily as a measure of socio-economic status. It's a judgement call, but let's change the negative values to NaN:

acs.loc[
    acs['income']<0,
    'income'
] = np.nan
acs.loc[1179490]
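To confirm the recode worked, you could look at the summary statistics for income one more time (a quick check):

acs['income'].describe()

The count should have dropped by 29, and the minimum should no longer be negative (most likely it's now 0, since plenty of people report no income at all).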

5.14.1 Exercise

The income and marital_status variables treat children differently. Children under 15 do not have a zero for income; they have a missing value (now NaN). But children under 15 all have ‘Never Married’ for marital_status, even though it's assumed rather than asked. Recode marital_status so they have NaN instead. How will that change subsequent analysis of the marital_status variable?

Since this exercise changes the data in ways the rest of the code relies on, we'll include the solution, but avoid the temptation to read it until you've tried to solve it yourself.

acs.loc[
    acs['age']<15,
    'marital_status'
] = np.nan
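One way to check your work is value_counts() with dropna=False, which includes NaN in the table (a quick sketch):

acs['marital_status'].value_counts(dropna=False)

The count for NaN should match the number of children under 15.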

5.15 Examine Missing Data

Now that missing data is all coded as NaN, isna() will be True for all of it. If you ask Python to do math with True/False values, it will convert True to 1 and False to 0. So summing isna() gives the number of observations with a missing value for each variable, which is a very convenient way to see how much missing data you have:

acs.isna().sum()

It’s good to know you don’t have to worry about missing values for age, race, female, or hispanic, but you definitely have to think carefully about marital_status, edu, and income.
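Raw counts can be hard to judge without knowing the size of the data set. Since mean() treats True as 1 and False as 0 in the same way, the mean of isna() gives you the proportion of missing values for each variable, which you may find easier to read:

acs.isna().mean()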

What these tables can't tell you is why the data are missing, and in particular whether they are missing completely at random. If they are not, their presence is likely to bias your analysis.

Often you can answer this question by examining the relationships between missing values and other variables, as we did when we looked at the distributions of the variables. For example, edu is clearly not missing at random: it is missing if and only if the person is less than three years old. The distribution of edu for these children is presumably made up mostly of ‘None’ and perhaps some ‘Nursery school-4th grade’, which is very different from the distribution of edu for the observations where edu is not missing.

You should also consider relationships between missing values. In this data set, the apparent relationship between the missingness of income and the missingness of marital_status is driven by their individual relationships with age. But in many data sets there are direct relationships. For example, if a subject could not be located for one wave of a survey then they will probably have missing values for all the variables for that survey wave.
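A cross-tabulation of two isna() results shows that kind of relationship directly. Here is a quick sketch, assuming you completed the earlier exercise and recoded marital_status for children:

pd.crosstab(
    acs['income'].isna(),
    acs['marital_status'].isna()
)

In this data set the two line up exactly, but only because both are determined by age, not because one causes the other.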

Another form of missingness is the observations that aren't in the data set at all but should be. If the probability of not being in the data set is correlated with quantities of interest, then your estimates of those quantities will be biased. The classic example of this problem is polling before the 2016 and 2020 US presidential elections: voting for Trump was apparently correlated with a lower probability of answering surveys, leading polls to underestimate his vote share. It wasn't a big difference in absolute terms, but it was enough that the 2020 election was far closer than the polls predicted, and in 2016 the polls predicted the wrong winner.

Unfortunately, questions about missing data cannot generally be answered by looking at the data (the data that would allow you to answer those questions is missing). Instead you’ll need to rely on your understanding of the world, your intuition about human behavior, etc.

5.15.1 Exercise

What is the likely distribution of income for people with a missing value for income? How is that different from the distribution of income for those where income is known? How would the mean of income change if all its missing values became known?

5.16 Save Your Work

This completes the first steps with your data. We will use this cleaned-up data set in examples for much of the rest of the book.

If we saved it as a CSV, we'd have to redo much of the work we've done every time we loaded it. Instead we'll save the data in a format designed for storing Python DataFrames. There are several, but the pickle format is simple and easy to use. You can save a DataFrame as a pickle by calling its to_pickle() function, and read it back with the pandas read_pickle() function. In both cases you pass in the desired file name.

acs.to_pickle('acs_clean.pickle')

However, a version of this file was included with the example files, so if you want to keep that version do not save your own over it. If you change your mind later, acs_clean_example.pickle is a second copy of the cleaned-up data set.

Read the data set again and you’ll see that everything is exactly how it was before you saved it:

pd.read_pickle('acs_clean.pickle')
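One concrete way to see what the pickle preserves is to check the dtypes of what you just read back (a quick check):

pd.read_pickle('acs_clean.pickle').dtypes

The categorical variables are still categorical and the indexes are still set, neither of which would survive a round trip through CSV without extra work.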

5.17 Review

We end this chapter on First Steps With Your Data here, not because the data set is necessarily all ready to be analyzed, but because the next steps depend on what you plan to do with the data. Perhaps the edu variable has too many categories for your analysis and you need to combine some of them. Perhaps you want to treat Hispanic as a race, since so many of the respondents clearly did, and put everyone who said they were Hispanic in a new race category.

But the steps we have described are steps you should take with every data set you work with. Avoid the temptation to skip them! Because they’re so important, we’ll review them here. Consider this a ‘First Steps With Your Data Cheat Sheet.’

5.17.1 Read the Documentation

5.17.2 Identify the Variables

5.17.3 Look at the Data

5.17.4 Set the Indexes and Identify the Structure of the Data

5.17.5 Get Rid of Data You Won’t Use

5.17.6 Change Variable Names

5.17.7 Convert Strings Containing Numbers to Numbers

5.17.8 Identify the Type of Each Variable

5.17.9 Recode Indicator Variables

5.17.10 Define Categories for Categorical Variables

5.17.11 Examine Variable Distributions

5.17.12 Recode Values that Mean Missing to NaN

5.17.13 Examine Missing Data