Data Wrangling in Stata: Reading in Data

This is part two of Data Wrangling in Stata.

In this section we'll discuss how to get data sets into Stata.

File Systems

Most data sets you'll work with will be files that are stored in a file system. You may never have given file systems much thought—most applications these days try to keep you from having to. But research projects with their many interdependent files require that you pay attention to where the files are located and how they're organized.

A file system consists of folders (also known as directories) that can contain files and/or other folders, all organized in a hierarchy or "tree." A folder that is contained in another folder is also called a subfolder. In Windows the root of the tree is usually a drive letter, like C: or U:. In Linux or on a Mac, it's just called the root directory.

Stata keeps track of a working directory, which is shown in the lower left corner of the Stata window. Think of it as your current location in the file system hierarchy. If you try to load a file without specifying its location, Stata will look for it in the working directory. If you save a file without specifying a location, Stata will put it in the working directory. You can change the working directory with the cd (change directory) command:

cd path

where path should be replaced by the path to the folder you want to make the new working directory.

In Windows, if you start Stata by double-clicking on a Stata file, the working directory will be set to the location of that file, making this easiest way to get to work in Stata.

A path specifies the location of a file or folder. An absolute path starts at the root and lists all the folders and subfolders needed to get from there to the location of the file or folder.

In Windows an absolute path typically starts with a drive letter followed by a colon, and then puts a backslash between each folder. For example:

C:\users\bbadger\dissertation\data\rawdata.dta

An absolute path in Linux or MacOS starts with a forward slash, representing the root directory, and puts a forward slash between each folder:

/users/bbadger/dissertation/data/rawdata.dta

A relative path starts with just the name of a folder, which is assumed to be a subfolder of the working directory. If the working directory is c:\users\bbadger\dissertation, the relative path to the same file is:

data\rawdata.dta

The Linux or MacOS equivalent is:

data/rawdata.dta

Of course this wouldn't work if the working directory were c:\users\bbadger\ instead.

In specifying a path, ".." means "go up one level." If the working directory were c:\users\bbadger\dissertation\dofiles then the path to the data file would be:

..\data\rawdata.dta

This means "go up one level, from c:\users\bbadger\dissertation\dofiles to c:\users\bbadger\dissertation, then go down into the data folder to find the rawdata.dta file."

A single period, "." represents the working directory. In Linux or MacOS, "~" is short for your home directory.

If you put all the files related to a given project in a single folder (or its subfolders), always make that folder the working directory before you do anything with the project, and use relative paths in all your do files, then your project will be portable. That means you can move it to a different location or send it to someone else, and as long as they make the project's folder the working directory all your code will work without any changes. Stata will automatically convert between forward slashes and backslashes depending on what kind of computer it's running on, so your code will even work if it's run on a different kind of computer. On the SSCC network, it's easy to write code that can be run on either our Windows servers or our Linux servers.

Exercise: Decide where you want to put a folder containing the example files for this class, then use the cd command to make that the working directory. Create a folder called dws by running mkdir dws. Make that folder the working directory. Then run:

net get dws, from(https://ssc.wisc.edu/sscc/stata/)

to get the files and put them in that folder.

For example, if I wanted to put the example files in a folder on my desktop, and my user name were bbadger on a Windows computer, I would run:

cd c:\users\bbadger\desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)

On a Mac I would run:

cd /users/bbadger/desktop
mkdir dws
cd dws
net get dws, from(https://ssc.wisc.edu/sscc/stata/)

Stata Data Files

Stata has its own format for storing data sets, .dta files. These are highly convenient for Stata users because Stata can use them immediately without any need for interpretation. They can also contain metadata such as value labels. The disadvantage is that non-Stata users may or may not be able to use them, depending on the tools they have available.

The command to load a Stata data set into memory so Stata can use it is simply:

use dataset

where dataset should be replaced by the actual name of your data set. The data set specification can include the path to the file; if it does not Stata will assume it is in the working directory.

Stata will assume files you try to use are in Stata format. Thus you don't need to put .dta at the end of the file name, but it won't hurt if you do.

Load one of the example data sets, 2000_acs_sample_harm.dta with:

use 2000_acs_sample_harm.dta

The above command will work if the working directory is set to the location of the example files. If it fails, you probably need to use the cd command to change the working directory.

This data set is a sample from the 2000 American Community Survey, with "harmonized" variables created by IPUMS. We'll do a lot more with the ACS in the next section.

Stata can also load data sets directly from the web if you give the use command a URL. For example, the auto data set that comes with Stata is also available by running:

use http://www.stata-press.com/data/r16/auto.dta

Stata will not allow you to load a new data set if there is a data set in memory that has unsaved changes. You can first run a clear command to remove the data set from memory, or add the clear option to your use command (use dataset, clear).

Exercise: Load the data set 2000_acs_sample.dta, which contains the ACS "source" data created by the Census Bureau

Delimited Files and Excel Spreadsheets

Delimited files and Excel spreadsheets are very popular ways of storing data, but importing them into Stata sometimes requires telling Stata how to interpret them. Since the same issues arise with both kinds of files, we'll address them together.

A delimited file is a text file that contains data, where a special character (the delimiter) separates one value from the next. The most common form is CSV (comma separated value) files, where the delimiter is a comma, but the delimiter could be a space, a tab, or in theory just about any other character or set of characters. Your computer may import CSV files into Excel by default, but CSV files are not really Excel spreadsheets.

The command to import a delimited file into Stata is import delimited, and the command to import an Excel spreadsheet is import excel. Sometimes it takes some experimentation to find the right options to properly read in a given file. This is a case where Stata's graphical user interface can help you: it includes a preview of what the data set will look like after it is imported as well as the options for reading it. The preview will be updated as you choose different options, so you can experiment until it looks right. Then you can actually carry out the import, look over the results to make sure it was successful, and then copy the import command that was generated into your do file.

The two most common issues you'll have to deal with in importing delimited files or Excel spreadsheets are header rows containing variable names and content that is not actually data. Stata will try to identify whether the file has a header row or not, but may get confused. And it will definitely be confused if it tries to import content that is not data.

If you have Excel on your computer, open sscc_training.xlsx from the example files, which contains data on the number of graduate students trained by SSCC in FY 2018. Note that row 1 is a title, while row 2 contains the variable names. The data end with row 14, as the rows after that contain summary statistics and a note.

Now return to Stata and click File, Import, Text Data. Then click Browse and find sscc_training.csv, a CSV version of the same data. Looking at the preview, you'll see that the presence of the title prevented Stata from recognizing that row 2 contains variable names. It also thinks it should interpret rows 15-19 as data. Note that all the data are in red, indicating that they are strings (text). Because Stata believes the text in row 2 is data, it must make all the variables string variables to store the values in row 2.

Under Use first row for variable names, choose Custom and then enter 2, meaning that row 2 contains the variable names. Most of the variables will change from red to black, meaning that Stata now recognizes they are numeric. However, workshoplength remains red because Stata thinks the note in row 17 is data.

Correct this by clicking on Set ranges.... Check the box next to Last and enter 14. This tells Stata to ignore everything after row 14. It will also ignore row 1 since it comes before the row with the variable names. Now all the variables except workshopname are in black (workshopname really is text and should be in red). Click Okay and Stata will successfully import the data.

The command Stata used will to do so will appear in both the History window and the Results Window. It will look something like:

import delimited c:\users\bbadger\dws\sscc_training.csv, varnames(2) rowrange(:14)

The next step would be to copy this command into your do file. Since it uses an absolute path, we'd suggest editing it to:

import delimited sscc_training.csv, varnames(2) rowrange(:14)

on the assumption that the data set will be in the working directory when you run the do file.

Now try importing the Excel file sscc_training.xlsx (File, Import, Excel Spreadsheet). The preview is less useful here, in that it doesn't use color. Click the button by Cell Range to tell Stata to only read lines 2-14, and check Import first row as variable names.

The resulting command will again use an absolute path, so we'd suggest editing it to:

import excel sscc_training.xlsx, sheet("Sheet1") cellrange(A2:F14) firstrow

for your do file.

So what happens if you just import one of these files without setting the proper options for the import? (Try it and see!) You'll still get a usable data set, but it will contain extraneous rows that will need to be dropped, and most of the variables will need to be converted from string to numeric. These problems can be fixed, but it's easier to import the file properly in the first place.

Exercise: qualtrics_survey.csv is a data file created by the survey program Qualtrics. It was a very simple survey with just four questions (labeled by Qualtrics as Q1, Q17, Q3, and Q4) but Qualtrics includes lots of other information about the respondent and their experience taking the survey that you don't care about. It also puts information about each question in rows 2-3. Import just the data that you care about into Stata. Yes, you'll have to count all the columns you want to ignore. If the import is done properly, Q3 and Q4 will be numeric variables but Q1 will not. Why?

(For real work it's easier to have Qualtrics give you an SPSS data set, which Stata can read, than to have it give you a CSV file. For one thing, that way you'll get labels for the values of Q3 and Q4.)

Fixed Format Files

Fixed format files are text files, but unlike CSV files there is no separator between variables. Instead, a data dictionary tells you which variables are stored in which columns. The advantage of fixed format files is that they are smaller (no space is wasted on separators). The disadvantage is that they are completely useless without the corresponding data dictionary. For this reason they've become less popular as data storage has gotten cheaper and we won't cover how to import them, though you will see an example shortly. If you need to read a fixed format file, read the help file for infix. It's not hard, but translating a data dictionary into specifications for infix can be tedious.

Files in the Formats of Other Statistical Software

If you've got an SPSS .sav file or a SAS .sas7bdat file, Stata can import them with import spss or import sas. If you have a file in the format of a different statistical program, Stat/Transfer can probably convert it to Stata format for you. Stat/Transfer is available on Winstat and Linstat or in SSCC's computer labs. It's very easy to use, but there are instructions in the SSCC Knowledge Base.

If you have a choice of file formats, avoid SAS. SAS stores value labels separately from the data they label, which complicates converting them.

Do Files Generated by the Data Provider

Some data providers, like IPUMS, give you data in text format plus a program you can run to read it into Stata. These generally work, but often challenge new Stata users.

The key to using them is to read the instructions. Often these are placed in comments inside the do file itself. For example, here are the first ten lines of such a do file from IPUMS:

* NOTE: You need to set the Stata working directory to the path
* where the data file is located.

set more off

clear
quietly infix             ///
  int     year     1-4    ///
  byte    datanum  5-6    ///
  double  serial   7-14   ///

Later on the do file refers to the data file by name without specifying a location, so you really do need to set the working directory to the location of the data file before running this or it will not work. Other do files may require you to edit the do file and specify the location of the data file or make other changes.

Then there's what this do file does not include (and this is common): a save command. When the do file finishes running the data will have been imported into Stata, but not saved as a Stata data set. You could just tell Stata to save it, but since you want your workflow to be completely reproducible you should add a save command to the do file they gave you instead.

IPUMS data files are in fixed format, so this also demonstrates what it looks like to import fixed format data. The last three lines of the code displayed specify that columns 1-4 in the text data contain the year variable, columns 5-6 contain a variable called datanum, and columns 7-14 contain a variable called serial.

Saving Files in Stata Format

Reading files in Stata format will always be faster than importing other types of files. So unless your data set is extremely small, you don't want to import it over and over again. If your research project starts with a big file in text format, your first do file should probably just import the data, drop any parts of it you won't actually use (more on that in the next section), and then save the result as a Stata data file. The next do file can then read that Stata data file and go to work. For example, if you were working with the SSCC training data your complete first do file might be:

capture log close
log using read.log, replace

clear all
import excel sscc_training.xlsx, sheet("Sheet1") cellrange(A2:F14) firstrow

save sscc_training, replace
log close

Next: First Steps With Your Data

Last Revised: 11/12/2020