This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.
Stata tries very hard to make all its commands work the same way. Spending a little time learning the syntax itself will make it much easier to use commands later.
To carry out the examples in this section, you'll need to have created an SFS folder and downloaded the gss_sample data set as described in Managing Stata Files. Create a new do file in that folder called syntax.do, as described in Doing Your Work Using Do Files. To start with it should contain:
capture log close
log using syntax.log, replace
clear all
set more off
use gss_sample
// work will go here
log close
The example commands will go after use gss_sample and before log close. Add the example commands to this do file as you go, and run it frequently to see the results.
Commands
Most Stata commands are verbs. They tell Stata to do something: summarize, tabulate, regress, etc. Normally the command itself comes first, followed by the details of what you want it to do.
Many commands can be abbreviated: sum instead of summarize, tab instead of tabulate, reg instead of regress. Commands that can destroy data, like replace, cannot be abbreviated.
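As a quick check that an abbreviation really runs the same command, the sketch below compares the mean stored by summarize with the mean stored by sum. It assumes the auto data set that ships with Stata rather than gss_sample, and the scalar names full_mean and abbr_mean are made up for this illustration. Run it in a separate do file, since it replaces the data in memory.

```stata
// Assumption: uses Stata's built-in auto data, not gss_sample
sysuse auto, clear

summarize price                  // full command name
scalar full_mean = r(mean)       // save the mean that summarize stored

sum price                        // abbreviated form of the same command
scalar abbr_mean = r(mean)

// displays 1 if the two means are identical
display scalar(full_mean) == scalar(abbr_mean)
```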
Variable Lists
A list of variables after a command tells the command which variables to act on. First try sum (summarize) all by itself, and then followed by age:
sum
sum age
If you don't specify which variables sum should act on it will give you summary statistics for all the variables in the data set. In this case that's a pretty long list. Putting age after sum tells it to only give you summary statistics for the age variable.
If you list more than one variable, the command will act on all of them:
sum age yearsjob prestg10
This gives you summary statistics for age, years on the job, and a rating of the respondent's job's prestige.
If Conditions
An if condition tells a command which observations it should act on. It will only act on those observations where the condition is true. This allows you to do things with subsets of the data. An if condition comes after a variable list:
sum yearsjob if sex==1
This gives you summary statistics for years on the job for just the male respondents (in the GSS 1 is male and 2 is female).
Note the two equals signs! In Stata you use one equals sign when you're setting something equal to something else (see Creating Variables) and two equals signs when you're asking if two things are equal. Other operators you can use are:
Operator | Meaning |
== | Equal to |
> | Greater than |
< | Less than |
>= | Greater than or equal to |
<= | Less than or equal to |
!= | Not equal to |
! all by itself means "not" and reverses whatever condition follows it.
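To see != and ! side by side, here is a minimal sketch that counts the same observations two ways. It assumes Stata's built-in auto data set (not gss_sample) and its 0/1 variable foreign; run it separately from syntax.do because it replaces the data in memory.

```stata
// Assumption: uses Stata's built-in auto data, not gss_sample
sysuse auto, clear

count if foreign != 1       // cars where foreign is not equal to 1
count if !(foreign == 1)    // the same condition, written by negating foreign==1
```

Both commands should report the same count. The parentheses matter: ! is applied before ==, so !foreign == 1 would negate foreign first and then compare the result to 1.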
Combining Conditions
You can combine conditions with & (logical and) or | (logical or). The character used for logical or is called the "pipe" character and you type it by pressing Shift-Backslash; on most US keyboards that is the key right above Enter. Try:
sum yearsjob if sex==1 & income>=9
sum yearsjob if sex==1 | income>=9
The first gives you summary statistics for years on the job for respondents who are male and have a household income of $10,000 or more (income is recorded in categories, and codes 9 and above correspond to $10,000 or more). The second gives you summary statistics for years on the job for respondents who are male or have a household income of $10,000 or more, a very different group.
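One way to get a feel for how different the two groups are is to count them. The sketch below assumes Stata's built-in auto data set rather than gss_sample, with foreign and mpg standing in for sex and income; run it separately because it replaces the data in memory.

```stata
// Assumption: uses Stata's built-in auto data, not gss_sample
sysuse auto, clear

count if foreign == 1 & mpg >= 25    // must satisfy both conditions
count if foreign == 1 | mpg >= 25    // must satisfy at least one condition
```

Every observation that satisfies both conditions also satisfies at least one of them, so the first count can never be larger than the second.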
Any conditions you combine must be complete. If you want summary statistics for years on the job for respondents who are either black (race==2) or "other" (race==3) you cannot use:
sum yearsjob if race==2 | 3 // don't do this
(What this does and why is left as an exercise for the reader, but it's not what you want.) Instead you should use:
sum yearsjob if race==2 | race==3 // do this instead
Missing Values
If you have missing values in your data, you need to keep them in mind when writing if conditions. Recall that the generic missing value (.) acts like positive infinity, and the extended missing values (.a, .b, etc.) are even bigger. So if you type:
sum yearsjob if age>65
you are not just getting summary statistics for years on the job for respondents who are older than 65. Anyone with a missing value for age is also included. Assuming you're interested in people who are known to be older than 65, you should exclude the people with missing values for age with a second condition:
sum yearsjob if age>65 & age<.
It makes a difference!
Why age<. rather than age!=.? For the age variable, the GSS uses .c for missing and age!=. would not exclude .c. Other variables use different extended missing values, and some use more than one. Using age<. guarantees you're excluding all missing values, even if you don't know ahead of time which ones the data set uses.
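The difference between the three ways of writing the condition is easy to see on a tiny data set. The values below are made up purely for illustration (they are not GSS data), and the example clears the data in memory, so run it in a separate do file.

```stata
// Made-up data for illustration only (not from the GSS)
clear
input age yearsjob
70 30
68 25
40 10
.  5
.c 8
end

count if age > 65              // 4: the two missing ages are counted too
count if age > 65 & age != .   // 3: drops . but still counts .c
count if age > 65 & age < .    // 2: only the known ages above 65
```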
Binary Variables
If you have a binary variable coded as 0 or 1, you can take advantage of the fact that to Stata 1 is true and 0 is false. Imagine that instead of a variable called sex coded 1/2, you had a variable called female coded 0/1. Then you could do things like:
sum yearsjob if female
sum yearsjob if !female // meaning "not female"
Just one thing to be careful of: to Stata everything except 0 is true, including missing. If female had missing values you would need to use:
sum yearsjob if female & female<. // exclude missing values
or:
sum yearsjob if female==1 // automatically excludes missing values
Unfortunately the GSS does not code its binary variables 0/1 so you can't actually run these four commands. But many data sets do, and if you have to create your own binary variables you can make them easy to use by coding them 0/1.
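If you want to try these conditions anyway, a minimal sketch is to build a 0/1 variable yourself. The data below are made up for illustration (not from the GSS), the variable name female is the hypothetical one used above, and generate is the command covered in Creating Variables. Run this separately, since it clears the data in memory.

```stata
// Made-up data for illustration only (not from the GSS)
clear
input sex yearsjob
1 12
2 7
2 20
1 3
. 9
end

// female is 1 when sex is 2, 0 when sex is 1, and missing when sex is missing
generate female = (sex == 2) if sex < .

count if female          // 3: the missing value counts as true
count if female == 1     // 2: the missing value is excluded
count if !female         // 2: "not missing" is false, so it is excluded here too
```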
Options
Options change how a command works. They go after any variable list or if condition, following a comma. The comma means "everything after this is options" so you only type one comma no matter how many options you're using.
The detail option tells summarize to calculate percentiles (including the 50th percentile, or median) and some additional moments.
sum yearsjob, detail
Many options can be abbreviated, just like commands. In this case just d would do.
Some options require additional information, like the name of a variable or a number. Any additional information an option needs goes in parentheses directly after the option itself.
Recall that when we did sum all by itself and it gave us summary statistics for all the variables, it put a separator line after every five variables. You can change that with the separator (or just sep) option:
sum, sep(10)
The (10) in parentheses tells the separator option to put a separator between every ten variables. You'll learn more useful options that need additional information in the articles on statistical commands.
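Putting the pieces together, the sketch below shows the order in which the parts of a command appear: command, variable list, if condition, then a single comma followed by all the options. It assumes Stata's built-in auto data set rather than gss_sample, and it adds summarize's format option (which displays the statistics using each variable's display format) just to have a second option after the comma. Run it separately because it replaces the data in memory.

```stata
// Assumption: uses Stata's built-in auto data, not gss_sample
sysuse auto, clear

// command  variable list            if condition     , options
summarize   price mpg weight length  if foreign == 1  , separator(2) format
```

Note that there is only one comma even though two options follow it.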
By
By allows you to execute a command separately for subgroups within your data. Try:
bysort sex: sum yearsjob
This gives you summary statistics for years on the job for both males and females, calculated separately.
By is a prefix, so it comes before the command itself. It's followed by the variable (or variables) that identifies the subgroups of interest, then a colon. The data must be sorted for by to work, so bysort is a shortcut that first sorts the data and then executes the by command. Now that the data set is sorted by sex, you can just use by in subsequent commands:
by sex: sum prestg10
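To see why the sort order matters, the sketch below deliberately sorts the data by a different variable before using by. It assumes Stata's built-in auto data set rather than gss_sample, and it uses capture noisily (not covered in this article) so that the do file keeps running after the expected error. Run it separately because it replaces the data in memory.

```stata
// Assumption: uses Stata's built-in auto data, not gss_sample
sysuse auto, clear
sort price                                  // data are now ordered by price, not by foreign

capture noisily by foreign: summarize mpg   // fails with a "not sorted" error
display _rc                                 // a nonzero return code signals the error

bysort foreign: summarize mpg               // sorts by foreign, then runs by
by foreign: summarize price                 // works now that the data are sorted by foreign
```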
Complete Do File
The following is a do file containing all the example commands in this section:
capture log close
log using syntax.log, replace
clear all
set more off
use gss_sample
sum
sum age
sum age yearsjob prestg10
sum yearsjob if sex==1
sum yearsjob if sex==1 & income>=9
sum yearsjob if sex==1 | income>=9
sum yearsjob if race==2 | 3 // don't do this
sum yearsjob if race==2 | race==3 // do this instead
sum yearsjob if age>65
sum yearsjob if age>65 & age<. // exclude missing values
/* Things you could do if you had female coded 0/1
instead of sex coded 1/2:
sum yearsjob if female
sum yearsjob if !female // meaning "not female"
sum yearsjob if female & female<. // exclude missing values
sum yearsjob if female==1 // automatically excludes missing values
*/
sum yearsjob, detail
sum, sep(10)
bysort sex: sum yearsjob
by sex: sum prestg10
log close
Last Revised: 6/24/2016