This chapter will teach you the basics of making new variables, modifying existing variables, and creating labels. To set up, create a do file that loads the auto data set.
capture log close
log using vars.log, replace
clear all
sysuse auto
-------------------------------------------------------------------------------
name: <unnamed>
log: /home/r/rdimond/kb/stata_intro/vars.log
log type: text
opened on: 3 Jan 2025, 10:42:00
(1978 automobile data)
It’s especially important to use do files when creating and changing variables.
7.1 Generate and Replace
The primary commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:
gen variable = expression
or
replace variable = expression
where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. You can explore what functions are available by typing help functions. If the expression depends on a missing value at any point, the result is missing. Usually this is exactly what you’d expect and want.
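To see how missing values propagate, here is a minimal sketch that assumes the auto data set is already loaded. The variable name rep78_doubled is hypothetical and exists only for this illustration:

```stata
* Assumes the auto data are in memory (sysuse auto).
* rep78 is missing for 5 cars, so the result is missing for those same 5 cars.
gen rep78_doubled = rep78*2
count if missing(rep78_doubled)
```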
The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to January 2024 dollars. To do so you need to multiply the prices by a conversion factor which is the Consumer Price Index in January 2024 divided by the Consumer Price Index in 1978, or about 5. The code will be:
gen price2024 = price*5
Add this line to your do file, run it, and examine the results with:
sum price price2024
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
price2024 | 74 30826.28 14747.48 16455 79530
The mean price is still lower than you’d see at a car dealership, but that’s probably because today’s cars are much nicer than 1978 cars. This is a good example of how to check your work: compare what you got to what you expected, and if they don’t match make sure you know why!
Internally, Stata executed a loop: it calculated price*5 for the first observation and stored the result in price2024 for the first observation, then calculated price*5 for the second observation and stored the result in price2024 for the second observation, and so forth for all the observations in the data set. You’ll learn how to stretch this one-observation-at-a-time paradigm in Data Wrangling in Stata, but tasks that break it (like calculating means) require a different approach that we’ll talk about soon.
Suppose you changed your mind and wanted to be a little more precise and use 4.93 as the conversion factor. You might be tempted to try to add code that “fixes” the price2024 variable (say, multiplying it by 4.93/5) but it’s simpler and cleaner to fix the code that created it in the first place. Change:
gen price2024 = price*5
to:
gen price2024 = price*4.93
If you try to select and run just that line, it will crash because price2024 already exists. But if you run your entire do file, it will first load a fresh copy of the data that does not contain the old version of price2024 and the do file will run just fine.
Exercise
Go to the Bureau of Labor Statistics’ CPI inflation calculator. Enter 1 for the amount, set the first date to January 1978 and the second date to the last month for which data are available. The result you’ll get is the conversion factor to convert 1978 prices to current dollars, so use it to create price_current.
Solution
As of this writing, the most current data is for November 2024, and the conversion factor is 5.05.
gen price_current = price*5.05
sum price*
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
price | 74 6165.257 2949.496 3291 15906
price2024 | 74 30826.28 14747.48 16455 79530
price_curr~t | 74 31134.55 14894.95 16619.55 80325.3
(price* is just a lazy way of saying “all variables that start with price.” In public we say it’s “efficient.”)
Note that this represents 2.4% inflation over an 11 month period, which is pretty close to the Fed’s 2%/year inflation target, yet a car buyer needs about $300 more to buy the average car in November than they did at the start of the year. Inflation adds up even in normal times.
Fortunately, you now know how to correct any value for inflation.
7.2 Creating Variables with If Conditions
If a gen command has an if condition, the resulting variable will (and must) still exist for all observations. However it will be assigned a missing value for observations where the if condition is not true. If a replace command has an if condition, observations where the if condition is not true will be left unchanged. This allows you to set variables to different values for different groups of observations.
Suppose you wanted to collapse the five-point scale of the rep78 variable into a three-point scale. The first step is to lay out exactly how you want to do that in your native language, because if it’s not clear to you, you’ll never be able to explain it to Stata. We’ll declare that cars with a rep78 of one or two will get a one for the new variable rep3, cars with a three for rep78 will get a two, and cars with a four or five will get a three.
You can implement that with:
gen rep3 = 1 if rep78<3
replace rep3 = 2 if rep78==3
replace rep3 = 3 if rep78>3 & rep78<.
(64 missing values generated)
(30 real changes made)
(29 real changes made)
The first line creates the new variable rep3, but only sets it to one for cases where rep78 is less than three. The others get missing. The second line changes some of those missings to twos, and the third changes more of them to threes. Run the do file, note the number of observations changed by each line, and compare that to the total number of observations in the data set.
What will the value of rep3 be for observations where rep78 is missing? Missing, as it should be, because it was never set to anything else. The five observations where rep78 is missing were implicitly or explicitly excluded from all three commands, so they started out with a missing value for rep3 and were never changed. (If you forgot to exclude missing values from the last command then rep3 would be three for cars where rep78 is missing, an all-too-common mistake. Remember, missing is essentially infinity.)
Exercise
Combining the ones and twos makes sense because there are so few of them, but there was no particular need to combine the fours and fives. Create a rep4 variable that combines the ones and twos and renumbers the other categories accordingly (i.e. rep4 should go from one to four).
Solution
There are lots of ways to do this, but here’s one:
gen rep4 = 1 if rep78<3
replace rep4 = rep78 - 1 if rep78>=3
(64 missing values generated)
(59 real changes made)
You could absolutely set rep4 to 2, 3, and 4 using one command each like we did with rep3. This way is more “efficient.”
7.3 Recode
The recode command gives you an alternative way of creating rep3. It is designed solely for recoding tasks and is much less flexible than gen and replace. But it’s very easy to use. The syntax is:
recode var (rule 1) (rule 2) (more rules as needed...), gen(newvar)
The gen option at the end is not required—if it’s not there then the original variable will be changed rather than Stata creating a new variable containing the new values. You can also have recode work on a list of variables, recoding them all in the same way.
The core of the recode command is a list of rules, each in parentheses, that tell it how a variable is to be recoded. They take the form (input_value = output_value). The input_value can be a single number, a list of numbers separated by spaces, or a range of numbers specified with start/end. The output_value will always be a single number. Anything not covered by a rule is left unchanged, so you can use recode to change just a few values of a variable or completely redefine it as we do here. Here’s a recode version of converting rep78 to a three-point scale:
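A minimal sketch of that recode, following the same rules used to build rep3 with gen and replace. The name rep3b is an assumption, chosen only because rep3 already exists:

```stata
* Assumes the auto data are in memory; rep3b is a hypothetical variable name.
* Missing values of rep78 are not covered by any rule, so they stay missing.
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3b)
```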
7.4 Indicator Variables
In creating indicator variables you can take advantage of the fact that Stata treats true as one and false as zero by setting the new variable equal to a condition. Consider:
gen low_mpg = (mpg<20)
(The parentheses are optional, but make the command easier to read.) This creates an indicator variable called low_mpg which is one (true) for cars where mpg is less than twenty and zero (false) where mpg is greater than or equal to twenty. To see the results run:
bysort low_mpg: sum mpg
-------------------------------------------------------------------------------
-> low_mpg = 0
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 39 25.4359 4.80567 20 41
-------------------------------------------------------------------------------
-> low_mpg = 1
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
mpg | 35 16.68571 2.12508 12 19
No car has a missing value for mpg, but if one did, the above code would assign it a zero for low_mpg as if it were known to have good gas mileage. (Remember, missing is essentially infinity, which is definitely not less than 20.) The low_mpg variable should be missing for such cases, which you could do with:
gen lowMPG = (mpg<20) if mpg<.
Exercise
Create an indicator variable called good_rep that identifies cars with good repair records (defined as rep78 greater than 3). How would your code change if the indicator variable needed to identify cars that are known to have good repair records?
Solution
gen good_rep = (rep78>3) if rep78<.
(5 missing values generated)
if rep78<. is essential or cars whose repair record is unknown will get a one as if they were known to have good repair records.
If we need to identify cars that are known to have good repair records, then cars with a missing value for rep78 should get a zero because they are not known to have a good repair record. (We don’t know what their repair record is, but we know that it is not known to be good.) You can do that with:
gen known_good_rep = (rep78>3 & rep78<.)
You can check your work by running crosstabs between first rep78 and good_rep, and then rep78 and known_good_rep, being sure to include the missing option both times.
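The crosstabs described above can be sketched as follows, assuming good_rep and known_good_rep were created as in the solution:

```stata
* The missing option adds a row for cars whose rep78 is missing
* and a column for missing values of the indicator.
tab rep78 good_rep, missing
tab rep78 known_good_rep, missing
```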
This clarifies the difference: with good_rep the five cars with missing values for rep78 get missing, while with known_good_rep they get 0.
7.5 String Variables
You can create and change string variables with gen and replace just like numeric variables. One difference is that string values go in quotes; another is that for a string variable missing is "", i.e. a string that contains nothing. For example:
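The commands themselves do not appear here, so the following is a sketch reconstructed from the logged output and the description that follows it. It assumes the auto data are loaded and that the summary table comes from tab with the sum() option:

```stata
* Create a string variable; cars with price <= 10000 get "" (missing).
gen cost = "High" if price>10000
* Fill in the remaining (empty) values.
replace cost = "Low" if cost==""
* Summarize price within each cost category.
tab cost, sum(price)
```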
(64 missing values generated)
(64 real changes made)
| Summary of Price
cost | Mean Std. dev. Freq.
------------+------------------------------------
High | 12,607.6 1,808.02 10
Low | 5,158.641 1,412.852 64
------------+------------------------------------
Total | 6,165.257 2,949.496 74
The first command creates a variable called cost which Stata knows should be a string since there are quotes around the value to be stored in it. Observations where the condition price>10000 is not true get a missing value, which for a string variable is "". The second command then changes all those missing values to “Low”. In practice an indicator variable will usually be more useful than a string like this.
You can’t do math with strings, but there are a variety of useful functions for working with them. One of them is strpos(). Given two strings (string values or string variables), it will return the position of the second string within the first string, or a zero if the first string does not contain the second string. This makes it very useful in if conditions. For example, you can select all Volkswagen cars with:
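The command is not shown here; a minimal sketch, assuming the auto data are loaded, would be:

```stata
* strpos() returns 0 (false) when make does not contain "VW",
* and a positive position (true) when it does.
list make if strpos(make, "VW")
```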
Another useful function is word(). Given a string and a number n, word will return the nth word in the string. For example, you can make a variable containing the manufacturer of each car with:
gen manufacturer = word(make, 1)
list make manufacturer if strpos(make, "VW")
There are many more string functions. Working with Text Data in Stata will introduce you to some of the most useful ones and show you how to use them.
One common problem is numeric variables accidentally stored as strings. If you create x with:
gen x = "1"
then x is a string variable, not a number, and you can’t do math with it. For example:
sum x
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
x | 0
doesn’t do anything useful. You can convert a string variable to numeric with the destring command, adding either the gen() option with the name for a new variable, or the replace option to change the existing variable:
destring x, replace
sum x
x: all characters numeric; replaced as byte
Variable | Obs Mean Std. dev. Min Max
-------------+---------------------------------------------------------
x | 74 1 0 1 1
You should always think about why the variable was a string in the first place. The most common reason is that the data was imported from a spreadsheet and contains some non-numeric values. You can force destring to ignore the problem and convert the non-numeric values to missing with the force option, but you may lose important information in the process. Data Wrangling in Stata describes techniques for identifying and dealing with non-numeric values.
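To see what force does, here is a small sketch on made-up data. The variables text_price and num_price are hypothetical, created only for this illustration, and the sketch assumes the auto data are loaded:

```stata
* Hypothetical setup: a text copy of price with one non-numeric entry.
gen text_price = string(price)
replace text_price = "unknown" in 1
* gen() keeps the original string; force converts "unknown" to missing.
destring text_price, gen(num_price) force
* Show which observations lost their value in the conversion.
list make text_price num_price if missing(num_price)
```

Listing the affected observations like this before relying on force shows exactly which values would be lost.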
7.6 Egen
The egen command, short for “extended generate,” gives you access to another library of functions. It’s a bit of a hodge-podge, but the egen functions you’ll use the most calculate summary statistics:
Function Name    Description
min()            Minimum value
max()            Maximum value
mean()           Mean
median()         Median
sd()             Standard Deviation
total()          Total
These are examples of aggregate functions: they take multiple numbers as input and return a single number as output. They also work across observations, and thus can’t be easily done using gen since it works one observation at a time. The syntax looks almost identical to gen:
egen variable = function()
The big difference with egen is that you’re not writing your own mathematical expression; you’re just using a function from the library. For example, if you needed to set a variable to a mean divided by two, you could not say:
egen y = mean(x)/2
You’d instead say:
egen y = mean(x)
replace y = y/2
Another important difference is how missing values are handled. Recall that with gen, if any part of the input was missing the output would be missing. However, egen simply ignores missing values and calculates the result using the data that are available. Usually this is what you want, but you need to be aware of what egen is doing and be sure it makes sense for your particular task.
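The contrast can be sketched with rep78, which is missing for five cars. The variable names rep78_plus1 and mean_rep78 are hypothetical, and the sketch assumes the auto data are loaded:

```stata
* gen: the result is missing wherever rep78 is missing (5 cars).
gen rep78_plus1 = rep78 + 1
* egen: the mean is computed from the 69 known values and stored for all 74 cars.
egen mean_rep78 = mean(rep78)
count if missing(rep78_plus1)
count if missing(mean_rep78)
```

Here the five cars with an unknown repair record still receive the overall mean, which may or may not be what a given analysis needs.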
The egen functions for calculating summary statistics are very commonly combined with by to calculate summary statistics for groups. Calculate the mean car price for each manufacturer with:
bysort manufacturer: egen mean_price = mean(price)
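One way to inspect the result is sketched below. It assumes the manufacturer variable created earlier with word() and borrows the tab with sum() pattern used elsewhere in this chapter:

```stata
* Summarize mean_price within each manufacturer.
tab manufacturer, sum(mean_price)
```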
The zeros for standard deviation reflect the fact that all the cars for a given manufacturer have the same value for mean_price. mean_price really describes a manufacturer, not an individual car. In fact that’s the definition of a variable that describes a group: every observation within the same group must have the same value of the variable.
If we had only wanted to see the mean value of price for each manufacturer, we could have just run:
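The command is not shown here; a likely candidate, assuming the same tab with sum() pattern, is:

```stata
* Display the mean of price for each manufacturer without storing it.
tab manufacturer, sum(price)
```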
But having run egen we now have the mean price in a variable, available for later use.
Exercise
Create a variable containing the maximum value of mpg, calculated separately for foreign and domestic cars.
Solution
bysort foreign: egen max_mpg = max(mpg)
tab foreign, sum(max_mpg)
| Summary of max_mpg
Car origin | Mean Std. dev. Freq.
------------+------------------------------------
Domestic | 34 0 52
Foreign | 41 0 22
------------+------------------------------------
Total | 36.081081 3.2213192 74
7.7 Labels
Good labels make your data much easier to understand and work with. While Stata has many kinds of labels, we’ll focus on the most common and most useful: variable labels and value labels.
7.7.1 Variable Labels
Variable labels convey information about a variable, and can be a substitute for long variable names. This data set already has a good set of variable labels, as you can see in the Variables window, but let’s make the label on price more specific. The syntax to set a variable label is:
label variable variable_name "label"
So type:
label variable price "Price in 1978 Dollars"
You can use the describe command to get information about a variable, including its variable label:
describe price
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------
price int %8.0gc Price in 1978 Dollars
7.7.2 Value Labels
Value labels are used with categorical variables to tell you what the categories mean. We’ve seen one in action with the foreign variable: it was the value labels that told us that a zero means “Domestic” and a one means “Foreign.”
Let’s explore value labels by labeling the values of rep3, the new variable we recoded to collapse rep78 from a five point scale to a three point scale. Value labels are a mapping from a set of integers to a set of text descriptions, so the first step is to define the map. To do so, use the label define command:
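The definition itself is not shown here. A sketch consistent with the map name rep_label and the labels that appear in the tabulation of rep3 would be:

```stata
* Map each integer value to its text description.
label define rep_label 1 "Bad" 2 "Average" 3 "Good"
```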
This creates a mapping called rep_label but does not apply it to anything. Before it does anything useful you have to tell Stata to label the values of the rep3 variable using the rep_label mapping you just defined. The syntax is:
label values variable map
And thus:
label values rep3 rep_label
To see the results, run:
tab rep3
rep3 | Freq. Percent Cum.
------------+-----------------------------------
Bad | 10 14.49 14.49
Average | 30 43.48 57.97
Good | 29 42.03 100.00
------------+-----------------------------------
Total | 69 100.00
Once a map is defined you can apply it to any number of variables: just replace the single variable in the label values command above with a list of variables. Suppose you’re working with survey data and your variables include the gender of the respondent, the gender of the respondent’s spouse, and the genders of all the respondent’s children. You could define just one map called gender and then use it to label the values of all the gender variables.
Three commands for managing value labels: label dir gives you a list of all the defined labels, and label list tells you what they mean. The describe command we used earlier to see the variable label also tells you the name of the value label associated with the variable (and other useful things).
rep4 | Freq. Percent Cum.
------------+-----------------------------------
Bad | 10 14.49 14.49
Okay | 30 43.48 57.97
Pretty Good | 18 26.09 84.06
Very Good | 11 15.94 100.00
------------+-----------------------------------
Total | 69 100.00
7.7.2.1 Labels via Recode
When you use recode to create a new variable, Stata will automatically create a variable label for it (“RECODE of …”). You can also define value labels for it by putting the desired label for each value at the end of the rule that defines it. Create yet another version of rep3, this time with labels right from its creation, with:
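A minimal sketch of such a command, reusing the three-point rules and the Bad/Average/Good labels from earlier. The name rep3c is an assumption, chosen only to avoid clashing with existing variables:

```stata
* Assumes the auto data are in memory; rep3c is a hypothetical variable name.
* Each rule ends with the label to attach to its output value.
recode rep78 (1 2 = 1 "Bad") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep3c)
tab rep3c
```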
If you’re really determined, you can create a new version of rep4 as an exercise. But I think you’ve got this.
This do file changes the data set it uses, so it should save the new version. Remember, never save your output over your input, so don’t save the new data as auto. If you did, you could not run this do file again: it would crash when it tried to create price2024 because that variable would already exist in the modified data set. Instead, save the data as autoV2, as in “version 2 of the automobile data set.”
save autoV2, replace
file autoV2.dta saved
Finally, close the log:
log close
name: <unnamed>
log: /home/r/rdimond/kb/stata_intro/vars.log
log type: text
closed on: 3 Jan 2025, 10:42:01
-------------------------------------------------------------------------------