Data Wrangling in Stata: Variable Transformations

This is part four of Data Wrangling in Stata.

Skip to the content following this video

While the syntax for transforming variables in Stata is simple, the process of converting an idea for how to change a variable into actual code can take some getting used to. In this section we'll work through an example, showing several ways of accomplishing the same task. In addition to teaching you how to transform variables, this section will show you some programming tools you can use in a variety of contexts, introduce you to more complicated tasks that require multiple steps, and invite you to consider what makes code "good" or "bad," beyond whether it runs or not.

For this example we'll use the automobile data set that comes with Stata. Start a do file that does the usual set-up and then loads the data:

capture log close
log using transforms.log, replace
clear all
sysuse auto

Suppose the cost of manufacturing a car is the sum of the following:

  • $1.50 per pound of weight
  • $0.25 per pound to ship if it is foreign
  • $100 if its rep78 is 5

Your task is to calculate the profit (price minus cost) from selling each car.

In laying out the exercise I'm doing the first step of a variable transformation for you: describing clearly and precisely what you need to do. If you can't explain to another person what you're doing in your native language you'll have an even harder time explaining it to a computer using Stata code.

Your task description should include what to do with unusual or problem cases. The rep78 variable is sometimes missing; what should be done with those cases? The safe thing would be to declare that if rep78 is missing profit must be missing as well. However, to simplify this exercise we'll take the problem statement literally and declare that since missing is not 5 we won't add $100 to the cost for cars with a missing value for rep78 (i.e. we'll treat them the same as if we knew rep78 was not 5).

New Stata users often try to carry out this task by defining three variables containing the three components of cost, and then adding them up:

// first try--doesn't work
gen weightCost = 1.5*weight
gen shipCost = .25*weight if foreign
gen qualityCost = 100 if rep78==5
gen totalCost = weightCost + shipCost + qualityCost
gen profit = price - totalCost

The problem is that this gives mostly missing values for profit. Why? Look at the feedback Stata gave after running gen shipCost = .25*weight if foreign: "(52 missing values generated)". Cars that are not foreign are getting a missing value for shipCost, not a zero. (Recall that any time you use an if condition with gen, observations that do not meet the condition get a missing value.) The same happens with cars that don't have a 5 for rep78. Then those missing values cause totalCost and profit to be missing.

One solution is to calculate totalCost using the egen function rowtotal(). Like most egen functions, rowtotal() ignores missing values and simply adds up the numbers that are available. Comment out the first try in your do file and try this instead:

// second try--works, but not ideal
gen weightCost = 1.5*weight
gen shipCost = .25*weight if foreign
gen qualityCost = 100 if rep78==5
egen totalCost = rowtotal(weightCost shipCost qualityCost)
gen profit = price - totalCost

The first problem with this method is that shipCost and qualityCost are still wrong: they should be zero for cars that do not meet their conditions, not missing. That might be acceptable if they are only intermediate results required to calculate profit and you drop them immediately after that is done. The second problem is that the method depends on very specific behavior of rowtotal(). Someone who is reading the code may not see that immediately, including yourself reading this code again six months from now. So here's a method that sets shipCost and weightCost properly the first time.

// third try--a little long but works and is clear
gen weightCost = 1.5*weight
gen shipCost = .25*weight if foreign
replace shipCost = 0 if shipCost==.
gen qualityCost = 100 if rep78==5
replace qualityCost = 0 if qualityCost==.
gen totalCost = weightCost + shipCost + qualityCost
gen profit = price - totalCost

Note that the command:

replace shipCost = 0 if shipCost==.

could have been:

replace shipCost = 0 if !foreign

The advantage of using if shipCost==. is that you don't have to come up with the inverse of the prior if condition, which can be tricky if the condition is complicated. However, converting all missing values of shipCost to zero would be incorrect if shipCost can be missing for reasons other than the car not being foreign. If some foreign cars had missing values for weight, then it would be important that their shipCost be missing as well, and you'd have to use if !foreign.

The cond() (condition) function can make this code shorter. It takes three inputs: a condition, the desired result if the condition is true, and the desired result if the condition is false.

// fourth try--clear if the reader knows cond()
gen weightCost = 1.5*weight
gen shipCost = cond(foreign, .25*weight, 0)
gen qualityCost = cond(rep78==5, 100, 0)
gen totalCost = weightCost + shipCost + qualityCost
gen profit = price - totalCost

This is shorter, and clear if the reader knows cond().

A clever alternative takes advantage of the fact that multiplying a number by an indicator variable gives zero if the indicator is zero.

// fifth try--clever, but may be confusing
gen weightCost = 1.5*weight
gen shipCost = .25*weight*foreign
gen qualityCost = 100*(rep78==5)
gen totalCost = weightCost + shipCost + qualityCost
gen profit = price - totalCost

With this method you can easily turn it all into a single line:

// sixth try--shortest of all
gen profit = price - (1.5*weight + .25*weight*foreign + 100*(rep78==5))

If you put a condition in a mathematical expression, like (rep78==5), it will be evaluated and the result will be either 1 or 0 depending on whether the condition is true or false. Once you get used to the way Stata handles conditions this is handy, but it will confuse anyone who is not used to it.

An different approach creates just one variable for cost and then adds each cost component to it:

// seventh try--simplest?
gen cost = 1.5*weight
replace cost = cost + .25*weight if foreign
replace cost = cost + 100 if rep78==5
gen profit = price - cost

The construction replace variable = variable + something is extremely useful and comes up all the time.

Every line of this code is simple and easy to read, even for someone who doesn't know Stata.

But isn't the shortest way to do something the best way to do it? Not necessarily! The short solution may save you a few seconds of typing, but those savings will be lost if you later have to spend a few minutes figuring out how it works or a few hours debugging it. In some programming communities "clever" is not a compliment.

Which solution is the best for this particular task is a matter of opinion, and also depends on the intended audience. For a Stata beginner, consider the seventh method, but once you get comfortable with the logic involved the sixth has its advantages. But they all contain useful programming techniques you're likely to need as you work with data.

Exercise: Suppose an automobile's Completely Arbitrary Rating (CAR score) is defined by the following:

  • Start by multiplying the car's weight by its mpg and dividing by 1,000
  • Subtract the car's price divided by 1,000 unless the following exception applies:
  • For foreign cars weighing more than 2,800 pounds, subtract the car's price divided by 1,200
  • Add 10 if the car's rep78 is greater than 3
  • If any of the inputs to the CAR score are missing, the CAR score is missing

Create a variable for the CAR score.

To wrap up, save the modified data set as autoV3 (just in case you have autoV2 left over from Introduction to Stata) and close your log.

save autoV3, replace
log close

Next: Hierarchical Data

Last Revised: 9/29/2020