This is part six of Introduction to Stata. If you're new to Stata we highly recommend starting from the beginning.
This article will teach you the basics of making new variables, modifying existing variables, and creating labels.
The primary commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:
gen variable = expression
replace variable = expression
where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. You can explore what functions are available by typing help functions. If the expression depends on a missing value at any point, the result is missing. Usually this is exactly what you'd expect and want.
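As a small illustration of how a missing value propagates through an expression, here is a standalone sketch. It assumes the auto data set that ships with Stata, and the variable names logPrice and repPlusOne are made up for this example:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* An expression can use a function such as ln()
gen logPrice = ln(price)

* rep78 has some missing values, so rep78 + 1 is missing for those cars
gen repPlusOne = rep78 + 1

* The two counts should match
count if missing(rep78)
count if missing(repPlusOne)
```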
It's especially important to use do files when you change your data, so start by creating a do file that loads the auto data set:
capture log close
log using vars.log, replace
clear all
sysuse auto
The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2020 dollars. To do so you need to multiply the prices by a conversion factor which is the Consumer Price Index in 2020 divided by the Consumer Price Index in 1978, or about 4. The code will be:
gen price2020 = price*4
Add this line to your do file, run it, and examine the results with:
browse make price price2020
The prices are still generally lower than you'd see at a car dealership, but that's probably because today's cars are much nicer than 1978 cars. This is a good example of how to check your work: compare what you got to what you expected, and if they don't match make sure you know why!
Internally, Stata executed a loop: it calculated price*4 for the first observation and stored the result in price2020 for the first observation, then calculated price*4 for the second observation and stored the result in price2020 for the second observation, and so forth for all the observations in the data set. You'll learn how to stretch this one-observation-at-a-time paradigm in Data Wrangling in Stata, but tasks that break it (like calculating means) require a different approach that we'll talk about soon.
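To make the implicit loop concrete, the sketch below writes it out by hand with forvalues and checks that it matches the one-line gen. This is for illustration only: you would never write it this way in practice, and priceLoop is a hypothetical variable name.

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* The normal, one-line version
gen price2020 = price*4

* The same calculation as an explicit loop over observations
gen priceLoop = .
forvalues i = 1/`=_N' {
    quietly replace priceLoop = price*4 in `i'
}

* Stops with an error if the two versions ever differ
assert priceLoop == price2020
```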
Suppose we wanted to be a little more precise and use 4.14 as the conversion factor. You might be tempted to try to add code that "fixes" the price2020 variable (say, multiplying it by 4.14/4). But it's simpler and cleaner to fix the code that created it in the first place. Change:
gen price2020 = price*4
to:
gen price2020 = price*4.14
and run the do file again. Because your do file loads the original data from disk every time it is run, it can simply create the price2020 variable the way it should be.
Having both price and price2020 allowed you to compare their values and check your work. But if you only want to work with 2020 dollars and are confident you've got the formula right, you can use the replace command to change the existing price variable instead of creating a new one:
replace price = price*4.14
Run this version and you'll get the message (74 real changes made). Given that the data set has 74 observations this tells you that all of them were changed, as you'd expect. Once you start including if conditions, how many observations were actually changed can be very useful information.
Exercise: Outside the United States, fuel efficiency is frequently measured in liters per kilometer (note that because the fuel used is in the numerator, a low number is good). To convert miles per gallon to liters per kilometer, multiply the reciprocal of mpg (1/mpg) by 2.35. Create a variable that stores the fuel efficiency of each car in liters per kilometer.
Creating Variables with If Conditions
If a gen command has an if condition, the resulting variable will (and must) still exist for all observations. However, it will be assigned a missing value for observations where the if condition is not true. If a replace command has an if condition, observations where the if condition is not true will be left unchanged. This allows you to set variables to different values for different observations.
Suppose you wanted to collapse the five-point scale of the rep78 variable into a three-point scale. Add the following code to your do file to do so:
gen rep3 = 1 if rep78<3
replace rep3 = 2 if rep78==3
replace rep3 = 3 if rep78>3 & rep78<.
The first line creates the new variable rep3, but only sets it to one for cases where rep78 is less than three. The others get missing. The second line changes some of those missings to twos, and the third changes more of them to threes. Run the do file, note the number of observations changed by each line, and compare that to the total number of observations in the data set.
What will the value of rep3 be for observations where rep78 is missing? Missing, as it should be, because it was never set to anything else. The five observations where rep78 is missing were implicitly or explicitly excluded from all three commands, so they started out with a missing value for rep3 and were never changed. (If you forgot to exclude missing values from the last command then rep3 would be three for cars where rep78 is missing, an all-too-common mistake. Remember, missing is essentially infinity.)
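One way to check this kind of recoding is to cross-tabulate the old and new variables, including missing values. The sketch below is standalone and assumes the auto data set; it repeats the three commands above so it can be run on its own.

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

gen rep3 = 1 if rep78<3
replace rep3 = 2 if rep78==3
replace rep3 = 3 if rep78>3 & rep78<.

* Each value of rep78 should land in exactly one column of rep3,
* and the missing row should land in the missing column
tab rep78 rep3, missing
```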
Exercise: Combining the ones and twos makes sense because there are so few of them, but there was no particular need to combine the fours and fives. Create a rep4 variable that combines the ones and twos and renumbers the other categories accordingly (i.e. rep4 should go from one to four).
The recode command gives you an alternative way of creating rep3. It is designed solely for recoding tasks and is much less flexible than gen and replace. But it's very easy to use. The syntax is:
recode var (rule 1) (rule 2) (more rules as needed...), gen(newvar)
The gen option at the end is not required—if it's not there then the original variable will be changed rather than creating a new variable with the new values. You can also have recode work on a list of variables, recoding them all in the same way.
The core of the recode command is a list of rules, each in parentheses, that tell it how a variable is to be recoded. They take the form (inputValue = outputValue). The inputValue can be a single number, a list of numbers separated by spaces, or a range of numbers specified with start/end. The outputValue will always be a single number. Anything not covered by a rule is left unchanged, so you can use recode to change just a few values of a variable or completely redefine it as we do here. Here's a recode version of converting rep78 to a three-point scale:
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3b)
Missing values required no special handling: since missing was not listed in the input values of any rule, observations with missing values are not changed.
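The start/end range form mentioned above can be sketched as follows. The variable names rep3list and rep3range are hypothetical, chosen so this standalone example does not collide with rep3b:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* Input values given as lists
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3list)

* The same rules with input values given as ranges
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), gen(rep3range)

* Both versions should agree, including on the missing values
assert rep3list == rep3range
```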
Exercise: Create rep4b, combining only the ones and twos as above, using recode.
In creating indicator variables you can take advantage of the fact that Stata treats true as one and false as zero by setting the new variable equal to a condition. Consider:
gen lowMPG = (mpg<20)
(The parentheses are optional, but make it easier to read.) This creates an indicator variable called lowMPG which is one (true) for cars where mpg is less than twenty and zero (false) where mpg is greater than or equal to twenty. To see the results run the do file and then type browse make mpg if lowMPG.
No car has a missing value for mpg, but if one did, the above code would assign it a zero for lowMPG as if it were known to have good gas mileage. The lowMPG variable should be missing for such cases, which you can do by changing the command to:
gen lowMPG = (mpg<20) if mpg<.
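Because mpg has no missing values, the difference between the two versions is invisible in the real data. The sketch below artificially blanks out one value of mpg purely to show the difference; the names lowNaive and lowSafe are made up for this example, and you would not alter mpg like this in real work.

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* For illustration only: make mpg missing for the first car
replace mpg = . in 1

* Without the if condition, the car with missing mpg gets a 0
gen lowNaive = (mpg<20)

* With the if condition, the car with missing mpg gets a missing value
gen lowSafe = (mpg<20) if mpg<.

list make mpg lowNaive lowSafe in 1/3
```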
Exercise: Create an indicator variable that identifies cars with good repair records (defined as rep78 greater than 3). How would your code change if the indicator variable needed to identify cars that are known to have good repair records?
The gen and replace commands work with string variables too. The expressions on the right side of the equals sign are not mathematical, but they follow similar rules. String values always go in quotes, so if you wanted to store the letter x in a variable called x you'd say gen x = "x". Stata would not find this confusing (though you might) because x in quotes ("x") means the letter x and x without quotes means the variable x.
Addition for strings is defined as putting one string after the other, so "abc" + "def" = "abcdef". But most work with strings is done by special-purpose functions that take strings as input (either string values or variables containing strings) and return strings as output.
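A quick sketch of string addition, first with two string values and then with a string value and a string variable. It assumes the auto data set, and tagged is a hypothetical variable name:

```stata
* Adding two string values
display "abc" + "def"

* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* Adding a string value to a string variable
gen tagged = "Car: " + make
list make tagged in 1/3
```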
The make variable really records two pieces of information: the name of the company that produced the car, and the name of the car model. You can easily extract the company name using the word() function:
gen company = word(make,1)
To see the results, run the do file and type browse make company. The first input, or argument, for the word() function is the string to act on (in this case a variable containing strings). The second is a number telling it which word you want. The function breaks the input string into words based on the spaces it contains, and returns the one you asked for, in this case the first.
We'll say much more about string functions in Text Data (forthcoming), but if you're eager to get started you can do a great deal with just the following functions:
| Function | Description |
|---|---|
| word() | Extracts a word from a string |
| strpos() | Tells you if a string contains another string, and if so its position |
| substr() | Extracts parts of a string |
| subinstr() | Replaces part of a string with something else |
| length() | Tells you how long a string is (how many characters it contains) |
Type help and then the name of a function in the main Stata window to learn how it works.
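Here is a minimal sketch of the other four functions applied to make. It assumes the auto data set, and all the new variable names are made up for this example:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* Position of the first space in make (0 if there is none)
gen spacePos = strpos(make, " ")

* The first three characters of make
gen first3 = substr(make, 1, 3)

* make with every period removed (the final . means "replace all")
gen noDots = subinstr(make, ".", "", .)

* Number of characters in make
gen nChars = length(make)

list make spacePos first3 noDots nChars in 1/5
```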
Exercise: Create a model variable containing the name of the car model (i.e. the rest of make). Your code must be able to handle model names that are either one or two words long.
Converting String Variables to Numeric Variables
Sometimes a variable that should be numeric gets entered into Stata as a string. You can fix that with the destring command, which converts a string variable that contains numbers to a numeric variable. The syntax is just destring variable, replace, where variable should be replaced by the name of the variable (or variables) to be destringed. If the string variable contains anything but numbers, you can add the force option to tell Stata to convert it anyway, but observations with any non-numeric characters will get a missing value. Note that "non-numeric characters" include dollar signs and commas!
In general, if you have to use the force option it's because Stata isn't sure what you're doing is a good idea, and you should think carefully before doing it. In this case you should examine the non-numeric characters to see if it would make sense to remove them first (like those dollar signs and commas) or if the variable isn't really just numbers after all.
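If the only non-numeric characters are things like dollar signs and commas, one alternative to force is destring's ignore() option, which strips the listed characters before converting. The sketch below builds a tiny made-up data set to show it; the variable names amount and amountNum and the values are invented for this example.

```stata
* Standalone sketch: builds a three-observation data set from scratch
clear
set obs 3
gen amount = "1,200"
replace amount = "950" in 2
replace amount = "3,400" in 3

* Put a dollar sign in front of each value
replace amount = "$" + amount

* Strip dollar signs and commas, then convert to a number
destring amount, gen(amountNum) ignore("$,")
list
```

Unlike force, this keeps the numbers rather than turning them into missing values, so it is only appropriate when the ignored characters carry no information.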
This data set doesn't have any variables that need to be destringed, so let's make one:
gen x = "5"
Note how the quotes around "5" mean that x is a string variable containing the character 5, not a numeric variable containing the value 5. Just to make things complicated, let's change some of the values of x to actual text:
replace x = "missing" if foreign
Now try to destring x:
destring x, replace
Stata will refuse, because some of the values of x can't be converted to numbers. But the values which can't be converted are "missing" so it is entirely appropriate to convert them to missing values. So try again with the force option:
destring x, replace force
Now Stata will convert x to a numeric variable, with some missing values.
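A short sketch of how you might confirm that the conversion did what you intended. It is standalone, assumes the auto data set, and repeats the commands above so it can be run on its own:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

gen x = "5"
replace x = "missing" if foreign
destring x, replace force

* x should now have a numeric storage type
describe x

* x should be missing for exactly the foreign cars
assert missing(x) == foreign
tab x foreign, missing
```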
The egen command, short for "extended generate," gives you access to another library of functions. It's a bit of a hodge-podge, but the egen functions you'll use the most calculate summary statistics, such as min(), max(), mean(), median(), sd(), and total().
These are examples of aggregate functions: they take multiple numbers as input and return a single number as output. They also work across observations, and thus can't be easily done using gen since it works one observation at a time. The syntax looks almost identical to gen:
egen variable = function()
The big difference with egen is that you're not writing your own mathematical expression; you're just using a function from the library. For example, if you needed to set a variable to a mean divided by two, you could not say egen y = mean(x)/2. You'd instead first run egen y = mean(x) and then replace y = y/2.
Another important difference is how missing values are handled. Recall that with gen, if any part of the input was missing the output would be missing. However, egen simply ignores missing values and calculates the result using the data that are available. Usually this is what you want, but you need to be aware of what egen is doing and be sure it makes sense for your particular task.
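Both differences can be sketched with the auto data. The variable names halfMean and meanRep are hypothetical:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

* A mean divided by two takes two steps: egen first, then replace
egen halfMean = mean(price)
replace halfMean = halfMean/2

* rep78 has missing values, but egen ignores them when computing the mean
egen meanRep = mean(rep78)

* Every car gets the mean, including cars where rep78 itself is missing
count if missing(rep78)
count if missing(meanRep)
```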
The egen functions for calculating summary statistics are very commonly combined with by to calculate summary statistics for groups. Calculate the mean car price for each company and then view the results with:
bysort company: egen meanPrice = mean(price)
tab company, sum(meanPrice)
Recall that bysort company: first sorts the data by company and then runs the following egen command separately for each company.
The zeros for standard deviation reflect the fact that every car produced by the same company has the same value of meanPrice. That's because meanPrice describes the company, not the car. In fact that's the definition of a variable that describes a group: every unit within the same group must have the same value of the variable.
If we had only wanted to see the mean value of price for each company, we could have just run:
tab company, sum(price)
But having run egen we now have the mean in a variable, available for use.
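For example, with the company mean stored in a variable you can calculate how far each car's price is from its company's average. This is a sketch that assumes the auto data set; priceVsCompany is a made-up name:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

gen company = word(make,1)
bysort company: egen meanPrice = mean(price)

* Difference between each car's price and its company's mean price
gen priceVsCompany = price - meanPrice

list make price meanPrice priceVsCompany in 1/5
```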
Exercise: Create a variable containing the mean value of rep78 for each company. Then examine the frequencies of rep78 within each company by creating a two-way table with tab. Be sure to include missing values. Lincoln and Olds have the same mean; how well do you think this summarizes the distribution of rep78 for the two companies? Next consider the missing values: suppose the actual value of rep78 for the cars with missing values were revealed. What would they have to be in order for these means to not change? How plausible is that?
Good labels make your data much easier to understand and work with. While Stata has many kinds of labels, we'll focus on the most common and most useful: variable labels and value labels.
Variable labels convey information about a variable, and can be a substitute for long variable names. This data set already has a good set of variable labels, as you can see in the Variables window. The only one that's confusing is the label on foreign, so change it using the label variable command. The syntax to set a variable label is:
label variable variableName "label"
So to relabel foreign, run:
label variable foreign "Car Origin"
Look at the Variables window again to see the results.
Value labels are used with categorical variables to tell you what the categories mean. We've seen one in action with the foreign variable: it was the value labels that told us that a 0 means "Domestic" and a 1 means "Foreign."
Let's explore value labels by labeling the values of rep3, the new variable we recoded to collapse rep78 from a five point scale to a three point scale. Value labels are a mapping from a set of integers to a set of text descriptions, so the first step is to define the map. To do so, use the label define command:
label define mapName value1 "label1" value2 "label2"...
For rep3, that becomes:
label define replabel 1 "Bad" 2 "Average" 3 "Good"
This creates a mapping called replabel but does not apply it to anything. Before it does anything useful you have to tell Stata to label the values of the rep3 variable using the replabel mapping you just defined. The syntax is:
label values variable map
In this case:
label values rep3 replabel
To see the results, run:
list make rep3
Once a map is defined you can apply it to any number of variables: just replace the single variable in the label values command above with a list of variables. Suppose you're working with survey data and your variables include the gender of the respondent, the gender of the respondent's spouse, and the genders of all the respondent's children. You could define just one map called gender and then use it to label the values of all the gender variables.
Three commands help you manage value labels. label dir lists all the defined value labels, label list shows what each one means, and describe tells you the name of the value label associated with each variable (among many other useful things).
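The sketch below applies one map to two variables and then inspects the result with these management commands. It is standalone and assumes the auto data set; it recreates rep3 and rep3b with recode so it can be run on its own.

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3)
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3b)

* Define the map once, then apply it to a list of variables
label define replabel 1 "Bad" 2 "Average" 3 "Good"
label values rep3 rep3b replabel

* List the defined value labels, then show what replabel contains
label dir
label list replabel

* The value label column shows which map each variable uses
describe rep3 rep3b
```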
Exercise: Create value labels for rep4 and apply them. Feel free to decide how to describe the levels.
Labels via Recode
When you use recode to create a new variable, Stata will automatically create a variable label for it ("RECODE of ..."). You can also define value labels for it by putting the desired label for each value at the end of the rule that defines it. Create yet another version of rep3, this time with labels right from its creation, with:
recode rep78 (1 2 = 1 "Bad") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep3c)
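To see what recode set up, you can describe the new variable and tabulate it against the original. A standalone sketch, assuming the auto data set:

```stata
* Standalone sketch: loads a fresh copy of the auto data
sysuse auto, clear

recode rep78 (1 2 = 1 "Bad") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep3c)

* Shows the automatic variable label and the name of the value label
describe rep3c

* The columns are shown with their labels rather than 1, 2, and 3
tab rep78 rep3c, missing
```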
Exercise: Create a rep4c using recode, setting value labels for it.
This do file changes the data set it uses, so it should save the new version. Remember, never save your output over your input, so don't save the new data as auto. If you did, you could not run this do file again: it would crash when it tried to create price2020 because that variable would already exist in the modified data set. Instead, save the data as autoV2, as in "version 2 of the automobile data set."
save autoV2, replace
Finally, close the log:
log close
This brings us to the end of Introduction to Stata. We hope it has been helpful to you. To learn more, consider reading Data Wrangling in Stata, or the other contents of the SSCC's Statistical Computing Knowledge Base.
Last Revised: 5/27/2020