2 Structure of a Stata Data Set

The auto data set has been included with Stata for many years. It contains information about cars from 1978. Every Stata user has access to it, so it is frequently used for examples, and we’ll use it today. To load it, type:

sysuse auto
(1978 automobile data)

Normally the use command loads data from disk into memory. The sysuse command is a variation of use that loads data sets installed with Stata. You’ll probably never use it for anything other than this data set. (There’s also a webuse command that opens example data sets from Stata’s web site.) To see what’s in the data set, type browse or click the button at the top that looks like a magnifying glass over a spreadsheet.
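If you would rather get a quick text overview than open a window, the following minimal sketch may help. It assumes nothing beyond the auto data set that ships with Stata; the clear option is added here so the command also works when other data are already in memory.

```stata
* Load the example data, replacing anything already in memory
sysuse auto, clear

* Summarize the structure: number of observations, variable names,
* storage types, and labels
describe

* Print the first five observations in the Results window
list in 1/5
```

The describe output is a handy companion to browse because it shows the storage type and any value label attached to each variable.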

Figure: A Stata data set in the Data Editor, in browse mode

This opens Stata’s Data Editor in browse mode, which shows your data set in a spreadsheet-like form. You can also open the Data Editor in edit mode by typing edit or clicking the button that looks like a pencil writing in a spreadsheet; in edit mode it will let you make changes. You might use edit mode for data entry, but since you should never change your data interactively, get in the habit of using browse mode so you don’t make changes by accident.

2.1 Observations and Variables

A Stata data set is a matrix, with one row for each observation and one column for each variable. This raises the question “What is an observation in this data set?” The values of the make variable suggest they are cars, but are they individual cars or kinds of cars? The fact that there is just one row for each value of make suggests kinds of cars. We’ll discuss this much more in Data Wrangling in Stata, but you should always know what an observation represents in your data set.
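One way to check the claim that there is just one row per value of make, rather than eyeballing the browser, is sketched below. This is only an illustration using standard commands on the shipped auto data.

```stata
sysuse auto, clear

* Number of observations (rows) in the data set
count

* Exits with an error if make does not uniquely identify the observations;
* prints nothing if it does
isid make

* Reports how many values of make appear once, twice, and so on
duplicates report make
```

If isid runs silently, each value of make appears exactly once, which is consistent with an observation being a kind of car.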

2.2 Variable Types

The variable make contains text or, in Stata’s terminology, strings (as in strings of characters). Obviously you can’t do math with text, but Stata can do many other useful things with string variables.
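As a small illustration of what Stata can do with strings, the sketch below pulls out the first word of make and counts its characters. The variable names brand and make_len are hypothetical, chosen just for this example, and treating the first word as the manufacturer is an assumption that will not hold for every make.

```stata
sysuse auto, clear

* First word of make, often the manufacturer (hypothetical variable name)
generate brand = word(make, 1)

* Number of characters in each value of make (hypothetical variable name)
generate make_len = strlen(make)

* Compare the original and derived variables for the first three cars
list make brand make_len in 1/3
```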

Variables like price and mpg are continuous or quantitative variables. They can, in principle, take on an infinite number of values (though they’ve been recorded as integers) and represent quantities in the real world.
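A quick way to look at quantitative variables like these is sketched below, again assuming only the shipped auto data. The describe output shows how the values are stored, and summarize gives the usual summary statistics.

```stata
sysuse auto, clear

* Storage type, display format, and label for each variable
describe price mpg

* Number of observations, mean, standard deviation, minimum, and maximum
summarize price mpg
```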

The variable rep78 is a categorical variable. It can only take on certain values, or levels. It is an ordered categorical variable because 5 is better than 4, 4 is better than 3, etc. But those numbers don’t represent actual quantities: a 5 is not five times better than a 1. Other categorical variables are unordered, and in that case the numbers used to represent the categories are completely arbitrary.
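For a categorical variable, a frequency table is usually more informative than a mean. A minimal sketch using the shipped auto data:

```stata
sysuse auto, clear

* Frequency table: one row per level of rep78
tabulate rep78

* List the distinct levels that actually occur in the data
levelsof rep78
```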

The variable foreign is an indicator or binary or dummy variable. Indicator variables are just categorical variables with two levels.

2.3 Value Labels

The foreign variable appears to contain text, like make. But note that it’s a different color, and if you click on a cell in that column what appears at the top of the browser is a 0 or a 1. This tells you foreign is really a numeric variable with a set of value labels applied. Comparing the numbers at the top with the words in the table, you’ll see that this set of value labels associates the number 0 with the word “Domestic” and the number 1 with the word “Foreign.” We’ll talk about creating value labels in Creating and Changing Variables. But for now, the important thing to remember is that if you write code referring to the foreign variable, your code must use the values 0 and 1, not the labels “Domestic” and “Foreign.”
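The same information can be checked from the command line. The sketch below assumes the value label attached to foreign is named origin, as it is in the copy of auto shipped with recent versions of Stata; the describe output shows the actual name if yours differs.

```stata
sysuse auto, clear

* The "value label" column shows which set of labels is attached to foreign
describe foreign

* Show the number-to-text mapping stored in that set of value labels
label list origin

* Tabulate the underlying numbers rather than the labels
tabulate foreign, nolabel

* Conditions must use the numeric value
count if foreign == 1

* By contrast, comparing foreign to the text "Foreign" is a type mismatch
* error, because foreign is numeric
```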

Note that a 1 means “Yes, this car is foreign” and a 0 means “No, this car is not foreign.” Stata generally uses 1 for true and 0 for false, and if you follow that convention indicator variables will be clear even without value labels.
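One practical consequence of coding an indicator as 0 and 1 is that its mean is the proportion of observations for which it is true. A minimal sketch with the shipped auto data:

```stata
sysuse auto, clear

* With 0/1 coding, the mean of foreign is the share of foreign cars
summarize foreign

* summarize stores the mean in r(mean) for later use
display r(mean)
```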

2.4 Missing Values

Several cars have dots in the rep78 column rather than numbers. These indicate missing values. A Stata data set is a rectangular matrix, so every observation must have something for every variable. If no actual data are available, Stata stores a code for “missing.” While this data set just uses the “generic” missing value, ., there are 26 others you can use: .a through .z. Stata treats all of them as missing, but you can assign meanings to them. For example, if you were working with a survey you might decide to code “the question did not apply” as .a and “the respondent refused to answer” as .b.
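The sketch below finds the missing values of rep78 and then illustrates an extended missing value. The variable rep78_coded and the meaning given to .a are hypothetical, introduced only for this example; the auto data set does not say why these values are missing.

```stata
sysuse auto, clear

* Count and list the observations where rep78 is missing
count if missing(rep78)
list make rep78 if missing(rep78)

* Work on a copy so the original variable is left alone
* (rep78_coded is a hypothetical name)
generate rep78_coded = rep78

* Recode generic missing to .a, pretending it means "not rated"
replace rep78_coded = .a if missing(rep78)

* missing() is true for . and for .a through .z alike
count if missing(rep78_coded)

* The missing option adds a row for each missing value code
tabulate rep78_coded, missing
```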