This is part three of Introduction to Stata. If you're new to Stata we highly recommend starting from the beginning.
If you haven't already, load the automobile data set that comes with Stata by running:
sysuse auto
Elements of Stata Syntax
Almost all Stata commands use a standard syntax. This syntax allows you to control what part of the data set the command acts on, modify what the command does, and more.
We'll discuss five syntax elements:
- Commands
- Variable Lists
- If Conditions
- Options
- By Groups
Stata Commands
Stata is a command-based language. Most Stata commands are verbs. They tell Stata to do something: summarize, tabulate, regress, etc. Normally the command itself comes first, followed by the details of what you want it to do.
Many commands can be abbreviated: sum instead of summarize, tab instead of tabulate, reg instead of regress. Commands that can destroy data, like replace, cannot be abbreviated.
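As a quick check that an abbreviation runs the same command, the sketch below runs summarize under both names and compares a result that summarize leaves behind. The stored result r(mean) and the scalar name full_mean are used only for this illustration; stored results are not covered in this part of the series.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* Full command name
summarize price
scalar full_mean = r(mean)

* Abbreviated name: runs the same command
sum price

* Displays 1 (true) because both runs produce the same mean
display r(mean) == full_mean
```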
We'll explore the elements of Stata syntax using a command that makes it easy to see what they do:
browse
The browse command opens the Data Editor in browse mode, which is what you should always use unless you're doing data entry. Browse mode won't let you accidentally change your data.
Looking at your data is a great way to get a basic understanding of it, but even with this small data set you can't see all of it. The key to using the data browser effectively is being able to view the parts of the data set you care about, and the next two syntax elements will help us do that.
Variable Lists
Listing one or more variables after a command tells the command it should only act on the variables listed:
browse make
browse make price mpg
There are shortcuts for creating long lists of variables without typing them all, or variable lists containing variables that match a pattern, but we'll discuss them in Data Wrangling in Stata.
Exercise: browse the make and weight of each car.
If Conditions
An if condition tells a command which observations it should act on. It will only act on those observations where the condition is true. This allows you to do things with subsets of the data. An if condition comes after a variable list:
browse make foreign if foreign==1
Note the two equals signs! In Stata you use one equals sign when you're setting something equal to something else (see Creating and Changing Variables) and two equals signs when you're asking if two things are equal. Other operators you can use are:
Operator | Meaning |
== | Equal |
> | Greater than |
< | Less than |
>= | Greater than or equal to |
<= | Less than or equal to |
!= | Not equal |
! all by itself means "not" and reverses whatever condition follows it.
Internally, Stata equates true and false with one and zero. That means you can write:
browse make foreign if foreign
or:
browse make foreign if !foreign
This makes for simple and readable code. Just be careful: anything other than zero will also be interpreted as true, including missing.
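To see the warning about missing values in action, here is a minimal sketch using count, a command not introduced above that simply reports how many observations satisfy a condition. It assumes the automobile data, where rep78 is missing for some cars and is never zero.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* foreign is 0 for domestic cars and 1 for foreign cars
count if foreign
count if !foreign

* rep78 is never 0, so every car counts as true,
* including the cars where rep78 is missing
count if rep78

* Only the cars with an observed value of rep78
count if rep78 < .
```

The third count matches the total number of observations, while the fourth is smaller by the number of cars with a missing rep78.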
Combining Conditions
You can combine conditions with & (logical and) or | (logical or). The character used for logical or is called the "pipe" character. On most US keyboards you type it by pressing Shift-Backslash, the key right above Enter. Try:
browse make price mpg if mpg>25 & price<5000
This shows you cars that get more than 25 miles per gallon and cost less than $5,000 (in 1978 dollars). In set theory terms it is the intersection of the two sets. Now try:
browse make price mpg if mpg>25 | price<5000
This shows you cars that get more than 25 miles per gallon or cost less than $5,000. A car must meet only one of the two conditions to be shown. In set theory terms it is the union of the two sets.
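The intersection and union can also be checked by counting rather than browsing. The sketch below is illustrative only: count, scalar, and the scalar names n_mpg, n_price, n_both, and n_either are not part of the material above.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* Size of each set on its own
count if mpg>25
scalar n_mpg = r(N)
count if price<5000
scalar n_price = r(N)

* Intersection: both conditions must be true
count if mpg>25 & price<5000
scalar n_both = r(N)

* Union: at least one condition must be true
count if mpg>25 | price<5000
scalar n_either = r(N)

* The union is the two sets added together minus their overlap,
* so this displays 1 (true)
display n_either == n_mpg + n_price - n_both
```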
All the conditions to be combined must be complete. If you wanted to browse the cars that have a 1 or a 2 for rep78, you should not use:
browse make rep78 if rep78==1 | 2
(Why this does what it does is left as an exercise for the reader, but it's not what you want.) Instead you should use:
browse make rep78 if rep78==1 | rep78==2
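When the list of values gets longer, writing out every complete condition becomes tedious. One possible alternative, not covered above, is Stata's built-in inlist() function, which is true when its first argument equals any of the arguments that follow. The sketch below uses count to confirm that both forms select the same cars; the scalar name n_or is just for illustration.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* Each condition written out in full
count if rep78==1 | rep78==2
scalar n_or = r(N)

* The same test using inlist()
count if inlist(rep78, 1, 2)

* Displays 1 (true) because both conditions pick the same observations
display r(N) == n_or
```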
Missing Values
If you have missing values in your data, you need to keep them in mind when writing if conditions. Internally, missing values are stored using the 27 largest possible numbers, starting with the generic missing value (.) and the extended missing values (.a, .b, etc.) after that in alphabetical order, so the following inequalities hold:
any observed value < . < .a < .b < .c ... < .x < .y < .z
If you want a list of cars that are known to have good repair records, you won't get it with:
browse make rep78 if rep78>3
An easy shortcut is to think of missing values as (positive) infinity. Since infinity is greater than 3, cars with a missing value for rep78 are included in the list. So add a second condition to exclude them:
browse make rep78 if rep78>3 & rep78<.
Why <. rather than !=. ? In this data set it makes no difference. But if the data set included extended missing values, the condition !=. would not exclude them. The condition <. excludes them because extended missing values are greater than the generic missing value. Thus using <. ensures you're excluding all missing values.
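The effect of the extra condition is easy to see with counts. This is a minimal sketch: count and the missing() function are not introduced above (missing() is true for both the generic and the extended missing values), and the scalar names n_all and n_known are made up for the example.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* Includes cars with a missing rep78, since missing is greater than 3
count if rep78>3
scalar n_all = r(N)

* Only cars with an observed rep78 above 3
count if rep78>3 & rep78<.
scalar n_known = r(N)

* Cars with any kind of missing value for rep78
count if missing(rep78)

* The two counts differ by exactly the number of missing values,
* so this displays 1 (true)
display n_all - n_known == r(N)
```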
Exercise: Browse domestic cars that get more than 25 miles per gallon and are known to have good repair records (rep78 greater than 3). Then browse foreign cars that cost less than $5,000 and are not known to have poor repair records (rep78 less than or equal to 3). Include the variables used in the conditions so you can spot-check your results. Explain why you handled missing values the way you did in both cases.
Options
Options change how a command works. They go after any variable list or if condition, following a comma. The comma means "everything after this is options" so you only type one comma no matter how many options you're using.
Consider:
browse make foreign
We know that value labels have been applied to the foreign variable, so the words "Domestic" and "Foreign" are not the actual values. We can see the values instead of the labels by adding the nolabel option:
browse make foreign, nolabel
Options must always be one word. Here the words "no" and "label" are combined because otherwise Stata would think they were two different options.
Many options require additional information, such as a number or a variable they apply to. This additional information goes in parentheses directly after the option name. To illustrate that we need to use a command other than browse, because nolabel is the only option it has.
The list command is very similar to browse, but it just lists the data in the Results window. If you have a log open, the list output will be stored in the log, which is sometimes useful. Try:
list make
The string() option tells the list command to truncate string variables after a given number of characters, with the number going in the parentheses:
list make, string(5)
You might use the string() option to save space, or if the first part of the string contains all the information you really need. But it's mostly here as an example of the "additional information goes in parentheses" syntax you'll use regularly.
Stata reuses option names wherever it makes sense. Thus many commands take a nolabel option that prompts them to ignore value labels. Other common options include gen() to create a new variable (with the name of the new variable going in parentheses), by() to act on groups, and vce() to tell regression commands how to estimate the variance-covariance matrix.
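As a small illustration of these rules, the sketch below reuses nolabel with a different command and then combines two options after a single comma. It assumes only the automobile data and options that tabulate and list already provide.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* The same nolabel option, this time with tabulate:
* the table shows 0 and 1 instead of Domestic and Foreign
tabulate foreign, nolabel

* Two options after one comma: nolabel needs nothing extra,
* while string() takes its number in parentheses
list make foreign, nolabel string(5)
```

The order of the options after the comma does not matter here; string(5) nolabel gives the same listing.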
By Groups
By groups allow you to execute a command separately for subgroups within your data. Try:
by foreign: list make
The by foreign: prefix tells Stata to:
- Identify the unique values of foreign (in this case, 0 and 1 or "Domestic" and "Foreign")
- Temporarily split the data set into groups based on their value of foreign
- Run the subsequent command (list make) separately for each group
You'll see how powerful by is later.
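The by prefix is not limited to list. As a minimal sketch, the example below puts it in front of summarize, so the summary statistics are calculated once for domestic cars and once for foreign cars. Like the by foreign: example above, it relies on the freshly loaded data already being in order by foreign.

```stata
* Load a fresh copy of the example data, which arrives sorted by foreign
sysuse auto, clear

* summarize runs separately for each value of foreign
by foreign: summarize price mpg
```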
In order for by to work, the data must be sorted by the same variable. You can do that with the sort command:
sort rep78
by rep78: list make
Alternatively, you can have by do this for you either by adding the sort option to the by prefix or just saying bysort:
by rep78, sort: list make
bysort rep78: list make
Of course we don't need to have Stata sort the data three times. Once the data are sorted, you can just say:
by rep78: list make
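To see what happens when the data are not sorted, you can start over from a fresh copy of the data. The sketch below uses capture, a command not covered above that lets Stata carry on after an error and stores the error code in _rc, so treat it as an illustration rather than something you need at this stage.

```stata
* Load a fresh copy of the example data, which is not sorted by rep78
sysuse auto, clear

* by stops with a "not sorted" error; capture suppresses the message
* and stores the error code in _rc
capture by rep78: list make
display _rc

* After sorting, the same command runs
sort rep78
by rep78: list make
```

A nonzero value of _rc means the captured command failed; here it is the code for the "not sorted" error.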
You can have more than one variable in the by list. In that case, Stata will split the data set up into one group for each unique combination of the variables. The data set must still be sorted in the same order.
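A minimal sketch of a by list with two variables, using the automobile data: the data are first sorted by both variables, in the same order as they appear in the by list, and then summarize runs once for each combination of foreign and rep78 that occurs in the data.

```stata
* Load a fresh copy of the example data
sysuse auto, clear

* Sort by both variables, in the same order as the by list
sort foreign rep78

* One set of summary statistics per combination of foreign and rep78
by foreign rep78: summarize price
```

Cars with a missing value of rep78 form groups of their own, since missing counts as a distinct value when Stata builds the groups.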
Next: Do Files
Last Revised: 5/27/2020