4 Reading Data

Now that you know some basics about Python and DataFrames, it’s time to start learning how to wrangle data. The first step is to read your data into Python.

4.1 File Systems

In order to read in files you have to be able to tell Python where to find them, so we need to start by talking about file systems.

4.1.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to File Systems.

4.1.2 Structure of a File System

Most data sets you’ll work with will be files that are stored in a file system. You may never have given file systems much thought—most applications these days try to keep you from having to. But research projects with their many interdependent files require that you pay attention to where the files are located and how they’re organized.

A file system consists of folders (also known as directories) that can contain files and/or other folders, all organized in a hierarchy or “tree.” A folder that is contained in another folder is also called a subfolder. In Windows the root of the tree is usually a drive letter, like C: or U:. In Linux or on a Mac, it’s just called the root directory.

Python keeps track of a working directory. Think of it as your current location in the file system hierarchy. If you try to load a file without specifying its location, Python will look for it in the working directory. If you save a file without specifying a location, Python will put it in the working directory.

When you open a notebook in Jupyter Notebook, the working directory will be set to the location of the notebook. You can change the working directory with the os.chdir() (change directory) function, by passing in a new path as a string.

A path specifies the location of a file or folder. An absolute path starts at the root and lists all the folders and subfolders needed to get from there to the location of the file or folder.

In Windows an absolute path typically starts with a drive letter followed by a colon, and then puts a backslash between each folder. For example:

C:\users\bbadger\dissertation\data\raw_data.csv

An absolute path in Linux or MacOS starts with a forward slash, representing the root directory, and puts a forward slash between each folder:

/users/bbadger/dissertation/data/raw_data.csv

A relative path starts with just the name of a folder, which is assumed to be a subfolder of the working directory. If the working directory is c:, the relative path to the same file is:

data\raw_data.csv

The Linux or MacOS equivalent is:

data/raw_data.csv

Of course this wouldn’t work if the working directory were c:\users\bbadger\ instead.

In specifying a path, .. means ‘go up one level.’ If the working directory were c:\users\bbadger\dissertation\programs then the path to the data file would be:

..\data\raw_data.csv

This means ’go up one level, from c:\users\bbadger\dissertation\programs to c:\users\bbadger\dissertation, then go down into the data folder to find the raw_data.csv file.

A single period, . represents the working directory. In Linux or MacOS, ~ is short for your home directory.

If you put all the files related to a given project in a single folder (or its subfolders), always make that folder the working directory before you do anything with the project, and use relative paths in all your programs, then your project will be portable. That means you can move it to a different location or send it to someone else, and as long as they make the project’s folder the working directory all your code will work without any changes.

4.2 CSV and Excel Files

Now that you know how to find files, we’ll learn how to read the most common data file formats.

4.2.1 Setting Up

Start up Jupyter Notebook if you haven’t already, then navigate to the folder where you put the example files. Once you’re in that folder, click New, Python 3 to create a new Notebook. Click on the name at the top to change it from Untitled to CSV and Excel. Then import pandas.

import pandas as pd

4.2.2 Delimited Files and Excel Spreadsheets

Delimited files and Excel spreadsheets are very popular ways of storing data, but importing them into Python sometimes requires telling Python how to interpret them. Since the same issues arise with both kinds of files, we’ll address them together.

A delimited file is a text file that contains data, where a special character (the delimiter) separates one value from the next. The most common form is CSV (comma separated value) files, where the delimiter is a comma, but the delimiter could be a space, a tab, or in theory just about any other character or set of characters. Your computer may import CSV files into Excel by default, but CSV files are not really Excel spreadsheets.

Pandas gives you the read_csv() function for reading CSV files (as you’ve seen before), and the read_excel() function for reading Excel spreadsheets. They will do their best to understand the file and figure out how to turn it into a DataFrame, but sometimes they’ll need some help.

If you have Excel on your computer, open the sscc_traing.xlsx in the example files. This contains information about the graduate students (only) trained by the SSCC in FY 2018. It also illustrates the most common challenges in reading spreadsheets and CSV files into Python: finding the variable names, and handling content that is not data.

Note that the first row is a title, while the second row contains the variable names. This will confused read_excel() as it by default expects the variable names to be on the first row. Also note that the data end with row 14, as the rows after that contain summary statistics and a note rather than more observations.

4.2.3 Reading Excel Spreadsheets

Now try having Python read the file:

training = pd.read_excel('sscc_training.xlsx')
training
FY 2018 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Workshop Name Number of Grad Students Registered Workshop Length (Hours) Number of Times Offered Instructor-Hours Student-Hours
1 Dynamic Documents with Stata Markdown 11 1.5 1 1.5 16.5
2 Introduction to R 85 3.5 3 10.5 297.5
3 Introduction to SAS 17 3 1 3 51
4 Introduction to SPSS 8 3 1 3 24
5 Introduction to Stata 69 3.5 3 10.5 241.5
6 R Programming and Concepts 20 10 2 20 200
7 R for Researchers - Condensed 113 7 3 21 791
8 SAS for Researchers 11 10 1 10 110
9 SW: Automated Reports 37 1.25 1 1.25 46.25
10 SW: Publication-Quality Tables 34 1.25 2 2.5 42.5
11 Stata Programming 62 1.25 3 3.75 77.5
12 Stata for Researchers 146 8 6 48 1168
13 Total 613 NaN 27 135 3065.75
14 NaN NaN NaN NaN NaN NaN
15 Total Grad Students Taught 323 Note that many graduate students take more tha… NaN NaN NaN
16 Average Grad Students per Workshop 22.709259 NaN NaN NaN NaN

Python assumed that the first row contained the variable names, and the actual variable names were interpreted as the first row of the data. Not only did this create an extraneous observation, it forced Python to treat all the columns as strings so it could store the string values it found in row two of the spreadsheet. Note that the number 11 and the string ‘11’ may look the same when printed in a DataFrame, but you can’t do math with the string ‘11’. The summary statistics and notes were also turned into extraneous observations, some of which also contain strings.

Every DataFrame has a dtypes attribute that tells you the data type of each column:

training.dtypes
FY 2018       object
Unnamed: 1    object
Unnamed: 2    object
Unnamed: 3    object
Unnamed: 4    object
Unnamed: 5    object
dtype: object

While an object can be just about anything, in this case they’re all strings. You’ll see what numbers look like in a moment.

We can fix many of the problems by telling read_excel() that the header is on row 1 using the header argument, keeping in mind that row 1 is the second row as far as Python is concerned.

training = pd.read_excel('sscc_training.xlsx', header=1)
training
Workshop Name Number of Grad Students Registered Workshop Length (Hours) Number of Times Offered Instructor-Hours Student-Hours
0 Dynamic Documents with Stata Markdown 11.000000 1.5 1.0 1.50 16.50
1 Introduction to R 85.000000 3.5 3.0 10.50 297.50
2 Introduction to SAS 17.000000 3 1.0 3.00 51.00
3 Introduction to SPSS 8.000000 3 1.0 3.00 24.00
4 Introduction to Stata 69.000000 3.5 3.0 10.50 241.50
5 R Programming and Concepts 20.000000 10 2.0 20.00 200.00
6 R for Researchers - Condensed 113.000000 7 3.0 21.00 791.00
7 SAS for Researchers 11.000000 10 1.0 10.00 110.00
8 SW: Automated Reports 37.000000 1.25 1.0 1.25 46.25
9 SW: Publication-Quality Tables 34.000000 1.25 2.0 2.50 42.50
10 Stata Programming 62.000000 1.25 3.0 3.75 77.50
11 Stata for Researchers 146.000000 8 6.0 48.00 1168.00
12 Total 613.000000 NaN 27.0 135.00 3065.75
13 NaN NaN NaN NaN NaN NaN
14 Total Grad Students Taught 323.000000 Note that many graduate students take more tha… NaN NaN NaN
15 Average Grad Students per Workshop 22.709259 NaN NaN NaN NaN

This already looks better, but you can confirm that most of the columns of numbers are indeed numbers by looking at the dtypes:

training.dtypes
Workshop Name                          object
Number of Grad Students Registered    float64
Workshop Length (Hours)                object
Number of Times Offered               float64
Instructor-Hours                      float64
Student-Hours                         float64
dtype: object

Here float64 means a floating-point number with 64 bytes. Floating-point means it can contain decimals, not just integers. In principle we could save some memory by storing integers as integers and using less precision overall, but we won’t worry about it.

The workshop length is still an object (string) because of the note on row 14. That row isn’t even data, so the next step is to tell Python to skip the last four rows in the spreadsheet with the skipfooter argument.

training = pd.read_excel(
    'sscc_training.xlsx', 
    header=1,
    skipfooter=4
)
training
Workshop Name Number of Grad Students Registered Workshop Length (Hours) Number of Times Offered Instructor-Hours Student-Hours
0 Dynamic Documents with Stata Markdown 11 1.50 1 1.50 16.50
1 Introduction to R 85 3.50 3 10.50 297.50
2 Introduction to SAS 17 3.00 1 3.00 51.00
3 Introduction to SPSS 8 3.00 1 3.00 24.00
4 Introduction to Stata 69 3.50 3 10.50 241.50
5 R Programming and Concepts 20 10.00 2 20.00 200.00
6 R for Researchers - Condensed 113 7.00 3 21.00 791.00
7 SAS for Researchers 11 10.00 1 10.00 110.00
8 SW: Automated Reports 37 1.25 1 1.25 46.25
9 SW: Publication-Quality Tables 34 1.25 2 2.50 42.50
10 Stata Programming 62 1.25 3 3.75 77.50
11 Stata for Researchers 146 8.00 6 48.00 1168.00

Now that looks like a proper data set, and we have some even better news waiting for us in the dtypes:

training.dtypes
Workshop Name                          object
Number of Grad Students Registered      int64
Workshop Length (Hours)               float64
Number of Times Offered                 int64
Instructor-Hours                      float64
Student-Hours                         float64
dtype: object

With those rows of summary statistics out of the way, read_excel() realized some of the columns only contained integers and stored them accordingly.

4.2.4 Reading CSV Files

Reading in the CSV version of the data set results in exactly the same issues:

training = pd.read_csv('sscc_training.csv')
training
FY 2018 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 Workshop Name Number of Grad Students Registered Workshop Length (Hours) Number of Times Offered Instructor-Hours Student-Hours
1 Dynamic Documents with Stata Markdown 11 1.5 1 1.5 16.5
2 Introduction to R 85 3.5 3 10.5 297.5
3 Introduction to SAS 17 3 1 3 51
4 Introduction to SPSS 8 3 1 3 24
5 Introduction to Stata 69 3.5 3 10.5 241.5
6 R Programming and Concepts 20 10 2 20 200
7 R for Researchers - Condensed 113 7 3 21 791
8 SAS for Researchers 11 10 1 10 110
9 SW: Automated Reports 37 1.25 1 1.25 46.25
10 SW: Publication-Quality Tables 34 1.25 2 2.5 42.5
11 Stata Programming 62 1.25 3 3.75 77.5
12 Stata for Researchers 146 8 6 48 1168
13 Total 613 NaN 27 135 3065.75
14 NaN NaN NaN NaN NaN NaN
15 Total Grad Students Taught 323 Note that many graduate students take more tha… NaN NaN NaN
16 Average Grad Students per Workshop 22.7 NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN NaN

The solutions are almost identical too. Note that the CSV file has a blank line at the end (row 17), so we need skipfooter to skip five rows instead of four.

But there’s also a catch. There are actually two versions or engines for read_csv(): one written in C that’s very fast, and one written directly in Python that is slower but more capable. One of the capabilities the Python version has that the C version does not is skipfooter, so in order to use it we must also tell read_csv() to use the Python engine.

training = pd.read_csv(
    'sscc_training.csv',
    header=1,
    skipfooter=5,
    engine='python'
)
training
Workshop Name Number of Grad Students Registered Workshop Length (Hours) Number of Times Offered Instructor-Hours Student-Hours
0 Dynamic Documents with Stata Markdown 11 1.50 1 1.50 16.50
1 Introduction to R 85 3.50 3 10.50 297.50
2 Introduction to SAS 17 3.00 1 3.00 51.00
3 Introduction to SPSS 8 3.00 1 3.00 24.00
4 Introduction to Stata 69 3.50 3 10.50 241.50
5 R Programming and Concepts 20 10.00 2 20.00 200.00
6 R for Researchers - Condensed 113 7.00 3 21.00 791.00
7 SAS for Researchers 11 10.00 1 10.00 110.00
8 SW: Automated Reports 37 1.25 1 1.25 46.25
9 SW: Publication-Quality Tables 34 1.25 2 2.50 42.50
10 Stata Programming 62 1.25 3 3.75 77.50
11 Stata for Researchers 146 8.00 6 48.00 1168.00

4.2.5 More Useful Arguments

read_excel() and read_csv() have many more arguments for handling particular situations. Here are a few of the more commonly used ones:

4.2.5.1 skiprows

If you give skiprows a number, it will skip that number of rows before beginning to read the data (starting with the variable names). If you give it a list of numbers, it will skip those rows. The latter can be very useful if your data file has extraneous stuff in between the variable names and the actual data.

4.2.5.2 header=None

Use this if the file has no variable names.

4.2.5.3 names

Pass in a list of names to override the variable names in the file, or if the file has no variable names.

4.2.5.4 index_col

Specify a column to use as the row index.

4.2.5.5 usecols

Pass in a list of columns to only read those columns. The list can contain either column numbers or names.

4.2.5.6 thousands

If the file has numeric values with commas in them to separate thousands, pass in thousands=','.

4.2.5.7 na_values

read_excel() and read_csv() will recognize certain values as missing and replace them with NaN, including ‘NA’, ‘NAN, ’NULL’, lower case versions of the preceding, and no value at all. You can add to that list with the na_values argument, which takes a string or a list of strings. For example, if you were reading Stata-style data where . is missing, you would add na_values='.'. If you were reading in survey data, you might add na_values=['Did Not Apply', 'Refused to Answer', 'Unknown'], assuming those are the strings that appear in the data.

4.2.5.8 Exercise

qualitrics_survey.csv is a data file created by Qualtrics, containing results from a student survey. Only four columns contain actual answers to the survey questions: Q1, Q17, Q3, and Q4. Read in just those columns and just the rows that contain the answers. Print out the values of Q1 for rows 30-50, and explain why Q1 is still an object (string) even after you read everything in perfectly.

4.3 Other Data File Formats

While CSV and Excel files are extremely popular, there are a huge number of data file formats out there. In this section we’ll describe some of them briefly so that you’ll recognize them if you do run into them and know what you need to learn to be able to work with them.

4.3.1 Fixed Width Text Files

Fixed width files, also known as fixed format files, are text files, but unlike CSV files there is no separator between variables. Instead, a data dictionary tells you which variables are stored in which columns. The advantage of fixed format files is that they are smaller (no space is wasted on separators). The disadvantage is that they are completely useless without the corresponding data dictionary. For this reason they’ve become less popular as data storage has gotten cheaper.

The read_fwf() function can read fixed width files. The tedious part is that you need to turn the information in the data dictionary into arguments (typically a list of tuples) read_fwf() can use to identify the variables.

4.3.2 Data Files From Statistical Packages

If you have a data file in Stata, SAS, or SPSS format, you can read it with read_stata(), read_sas(), or read_spss(). There are read functions for a variety of other file types as well, including some you might not expect like HTML (it will turn any tables a web page contains into DataFrames).

You’ll notice ‘Python Pickle Format’ in that list of things Python can read. Pickles are rarely used for distributing data because it’s possible to put malicious code in them. But they allow you to save all the fancy features of a DataFrame that can’t be stored in a CSV, and they’re often smaller. We’ll discuss how to use them as part of your data wrangling workflow in the next chapter.

If you have a data file in a more obscure format, Stat/Transfer cam probably convert it into something you can read. The SSCC has Stat/Transfer available on our Winstat and Linstat servers. Only use it if you really need it though, as it is quite expensive and the cost increases with every person that runs it. Also, using Python functions makes it easier to make your project reproducible.

4.3.3 Data Plus Program

Some data providers will give you data in a text file plus a program for reading it. The key to using these programs is to read the instructions carefully. Often you need to put the files in a particular place or modify the code to say where the files are. Also, these programs will typically read the data into memory but leave it to you to save the results. Add that to the code so the entire process is reproducible.

If the data provider doesn’t provide a Python program, you can use one of the other options and then have Python read the results. If you’re not familiar with any of the statistical packages they provide, Stata code is fairly easy to read. And we’ll forgive you for just clicking ‘Save’ when it’s done rather than figuring out how to add that to the code.