Introduction to Stata
The auto data set has been included with Stata for many, many years. It contains information about 1978 cars. Every Stata user has access to it, so it is frequently used for examples, as we’ll use it today. To load it, type:
sysuse auto
(1978 automobile data)
Normally the use command loads data from disk into memory. The sysuse command is a variation of the normal use command that loads data that was installed with Stata. You’ll probably never use it for anything other than this data set. (There’s also a webuse command that opens example data sets from Stata’s web site.) To see what’s in the data set, type browse or click the button at the top that looks like a magnifying glass over a spreadsheet.
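As a rough illustration of how these loading commands relate, the sketch below lists the example data sets that ship with Stata and then reloads the auto data. The clear option matters only when the data already in memory have unsaved changes, and the file name in the last comment is hypothetical.

```stata
* List the example data sets installed with Stata
sysuse dir

* Load the auto data, replacing whatever is currently in memory
sysuse auto, clear

* A regular data set on disk would be loaded with use instead, for example:
* use "mydata.dta", clear      (hypothetical file name)
```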
This opens Stata’s Data Editor, which shows you your data set in a spreadsheet-like form, in browse mode. You can also invoke the Data Editor in edit mode by typing edit or clicking the button that looks like a pencil writing in a spreadsheet. Then it will allow you to make changes. You might use edit mode for data entry, but since you should never change your data interactively, get in the habit of using browse mode so you don’t make changes by accident.
A Stata data set is a matrix, with one row for each observation and one column for each variable. This raises the question “What is an observation in this data set?” The values of the make variable suggest they are cars, but are they individual cars or kinds of cars? The fact that there is just one row for each value of make suggests kinds of cars. We’ll discuss this much more in Data Wrangling in Stata, but you should always know what an observation represents in your data set.
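One way to check the claim that there is just one row per value of make, rather than scanning the browser by eye, is sketched below. It uses only built-in commands and assumes the auto data are loaded as shipped.

```stata
sysuse auto, clear

* Number of observations in the data set
count

* isid exits with an error unless make uniquely identifies the observations
isid make

* duplicates report shows how many observations share a value of make
duplicates report make
```

If isid finishes silently, each value of make appears exactly once, which is consistent with an observation being a kind of car.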
The variable make contains text or, in Stata’s terminology, strings (as in strings of characters). Obviously you can’t do math with text, but Stata can do many other useful things with string variables.
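A small sketch of what working with a string variable can look like follows. The variable name make_length is made up for illustration, and the if condition uses syntax that is only touched on later in this material.

```stata
sysuse auto, clear

* The storage type of make begins with "str", marking it as a string
describe make

* String functions work on string variables: length of each name
generate make_length = strlen(make)

* List the cars whose make begins with "Buick"
list make make_length if strpos(make, "Buick") == 1
```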
Variables like price and mpg are continuous or quantitative variables. They can, in principle, take on an infinite number of values (though they’ve been recorded as integers) and represent quantities in the real world.
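Because quantitative variables represent real quantities, summary statistics such as means are meaningful for them. A minimal sketch, assuming the auto data are loaded:

```stata
sysuse auto, clear

* Observation count, mean, standard deviation, minimum, and maximum
summarize price mpg

* The detail option adds percentiles, including the median
summarize price, detail
```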
The variable rep78 is a categorical variable. It can only take on certain values, or levels. It is an ordered categorical variable because 5 is better than 4, 4 is better than 3, etc. But those numbers don’t represent actual quantities: a 5 is not five times better than a 1. Other categorical variables are unordered, and in that case the numbers used to represent the categories are completely arbitrary.
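Since the levels of a categorical variable are not quantities, a frequency table is usually a more natural summary than a mean. A brief sketch, assuming the auto data are loaded:

```stata
sysuse auto, clear

* How many cars fall in each level of rep78
tabulate rep78

* Cross-tabulate rep78 against another categorical variable
tabulate rep78 foreign
```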
The variable foreign is an indicator or binary or dummy variable. Indicator variables are just categorical variables with two levels.
The foreign variable appears to contain text, like make. But note that it’s a different color, and if you click on a cell in that column what appears at the top of the browser is a 0 or a 1. This tells you foreign is really a numeric variable with a set of value labels applied. Comparing the numbers at the top with the words in the table, you’ll see that this set of value labels associates the number 0 with the word “Domestic” and the number 1 with the word “Foreign.” We’ll talk about creating value labels in Creating and Changing Variables. But for now, the important thing to remember is that if you write code referring to the foreign variable, your code must use the values 0 and 1, not the labels “Domestic” and “Foreign.”
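The sketch below shows one way to see the numbers behind the labels without opening the Data Editor, and what a condition on foreign has to look like. It assumes the auto data as shipped with Stata.

```stata
sysuse auto, clear

* Show which set of value labels is attached to foreign
describe foreign

* Display the underlying numbers instead of the labels
tabulate foreign, nolabel

* Conditions must use the numeric values
list make price if foreign == 1

* This would fail with a type mismatch error, since foreign is numeric:
* list make price if foreign == "Foreign"
```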
Note that a 1 means “Yes, this car is foreign” and a 0 means “no, this car is not foreign.” Stata generally uses 1 for true and 0 for false, and if you follow that convention indicator variables will be clear even without value labels.
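One practical payoff of the 1/0 convention is that an indicator can serve directly as a condition. The sketch below relies on the fact that Stata treats any nonzero value, including missing, as true, so the shortcut is only safe when the indicator has no missing values.

```stata
sysuse auto, clear

* A bare indicator works as a condition: nonzero values count as true
count if foreign

* The mean of a 0/1 indicator is the proportion of observations coded 1
summarize foreign
```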
Several cars have dots in the rep78 column rather than numbers. These indicate missing values. A Stata data set is a rectangular matrix, so every observation must have something for every variable. If no actual data are available, Stata stores a code for “missing.” While this data set just uses the “generic” missing value, a single period (.), there are 26 others you can use: .a through .z. Stata treats them all the same, but you can assign meanings to them. For example, if you were working with a survey you might decide to code “the question did not apply” as .a and “the respondent refused to answer” as .b.
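A short sketch of finding missing values, and of assigning one of the extended codes, might look like the following. The variable rep78_coded and the decision to recode every missing value as .a are purely illustrative; the auto data do not record why a value is missing.

```stata
sysuse auto, clear

* Count and list the observations where rep78 is missing
count if missing(rep78)
list make rep78 if missing(rep78)

* Include missing values as a row in the frequency table
tabulate rep78, missing

* Hypothetical recoding on a copy, leaving the original variable untouched
generate rep78_coded = rep78
replace rep78_coded = .a if missing(rep78)
count if rep78_coded == .a
```

The missing() function is true for the generic missing value and for .a through .z alike, which makes it a convenient test when more than one code is in use.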