use https://sscc.wisc.edu/sscc/pubs/stata_text/auto_strings.dta
(1978 automobile data)
While working with text data (known as strings in programming jargon) can get very complicated, it’s possible to do a lot with a small number of simple functions. In this article we’ll carry out some common tasks using those functions.
Note: when we talk about “text data” in this article we mean data made up of letters and words rather than numbers. (There is a section on removing non-numeric text from numeric data.) That’s different from text files, like CSVs, that can contain both text and numeric data.
Load the auto_strings
data set from the SSCC web site (copying and pasting the following code block is highly recommended).
This is a modified version of the auto
data set that comes with Stata. The make
variable is a text variable giving the make of the car, and price
is the price of the car–but currently stored as a string.
Stata’s string functions are all case sensitive, but in many data sets case is not important. For example “AMC Concord”, “amc concord” and “AMC CONCORD” would presumably all refer to the same car. A valuable step in preparing many text data sets for analysis is to standardize the case, usually by converting everything to lower case. This is especially true if your data contains answers written by subjects themselves.
The strlower()
function will standardize case for you: you pass in a string and it returns the string converted to lower case. Like almost all string functions, the argument can be either a variable containing strings or a string literal, i.e. just some text in quotes:
(74 real changes made)
+---------------+
| make |
|---------------|
1. | amc concord |
2. | amc pacer |
3. | amc spirit |
4. | buick century |
5. | buick electra |
+---------------+
While standardizing to lower case is the most common, the strupper()
function converts strings to upper case, while strproper
capitalizes the first letter of each word:
Often strings need to be cleaned up before they are used, such as standardizing abbreviations or correcting misspellings.
subinstr() takes four arguments: a string to act on, a string to replace, a string to replace it with, and the number of replacements to make. (The fourth argument is very easy to forget.) For example:
Note that only the first instance of “cat” was replaced by “fish” because the fourth argument said to only replace one instance of “cat”. If the fourth argument is a missing value (.
) then all instances will be replaced:
Four of the cars in this data set were built by Volkswagen, but the company name is abbreviated:
+-------------+
| make |
|-------------|
70. | vw dasher |
71. | vw diesel |
72. | vw rabbit |
73. | vw scirocco |
+-------------+
If you want to use the full company name instead, you can do that with subinstr()
:
variable make was str18 now str19
(4 real changes made)
+---------------------+
| make |
|---------------------|
70. | volkswagen dasher |
71. | volkswagen diesel |
72. | volkswagen rabbit |
73. | volkswagen scirocco |
+---------------------+
Six of the cars in this data set were built by Chevrolet, but there are several different abbreviations used:
+-------------------+
| make |
|-------------------|
14. | chev chevette |
15. | chev impala |
16. | chev. malibu |
17. | chev. monte carlo |
18. | chevy monza |
|-------------------|
19. | chevy nova |
+-------------------+
This is common in real-world data, especially data collected directly from subjects. If you’re interested in talking about the companies in this data set, an important step in cleaning it will be to standardize the company names. That means each company should only appear in one form: either the full name or a standard abbreviation.
In this case, let’s replace all the various abbreviations with the full name (chevrolet). Here’s how not do do that:
+-------------------------+
| make_oops |
|-------------------------|
14. | chevrolet chevroletette |
15. | chevrolet impala |
16. | chevrolet. malibu |
17. | chevrolet. monte carlo |
18. | chevrolety monza |
|-------------------------|
19. | chevrolety nova |
+-------------------------+
Oops.
There are actually two problems here. The first is using a missing value for the last subinstr()
argument, meaning that all instances of “chev” will be replaced. That’s what turned the car model “chevette” into “chevroletette”. The company name only appears once in make
, so you only want to replace one instance of its abbreviation. Get in the habit of thinking how many times a replace should happen and use that number for the last argument of subinstr()
rather than lazily always passing in .
.
The second is that “chev” was replaced by “chevrolet” wherever it appeared, so “chev.” became “chevrolet.” and “chevy” became “chevrolety”.
We want to replace “chev” with “chevrolet” only when it appears as a complete word, not when it’s part of a word. The subinword()
function can do that for us. It takes the exact same arguments as subinstr()
and does the same thing, except that it will only replace complete words:
(2 real changes made)
+--------------------+
| make |
|--------------------|
14. | chevrolet chevette |
15. | chevrolet impala |
16. | chev. malibu |
17. | chev. monte carlo |
18. | chevy monza |
|--------------------|
19. | chevy nova |
+--------------------+
Now you can finish the job:
replace make = subinword(make, "chev.", "chevrolet", 1)
replace make = subinword(make, "chevy", "chevrolet", 1)
list make in 14/19
variable make was str19 now str21
(2 real changes made)
(2 real changes made)
+-----------------------+
| make |
|-----------------------|
14. | chevrolet chevette |
15. | chevrolet impala |
16. | chevrolet malibu |
17. | chevrolet monte carlo |
18. | chevrolet monza |
|-----------------------|
19. | chevrolet nova |
+-----------------------+
You can also use subinstr()
or subinword()
to correct misspellings. Unfortunately, there’s no Stata function that can identify all the misspelled words in a data set and tell you what they are supposed to be. Nor is there a function that can identify all the words that are abbreviations for “Chevrolet.” AI language models may be relevant here, but they are well beyond the scope of this chapter. For now, you’ll have to read your data to find all the things that need to be changed.
Another common task is cleaning up numeric data that’s been stored as text so it can be converted into numbers. The price
variable in this data set is a typical example
+------------------------+
| make price |
|------------------------|
1. | amc concord $4,099 |
2. | amc pacer $4,749 |
3. | amc spirit $3,799 |
4. | buick century $4,816 |
5. | buick electra $7,827 |
+------------------------+
price
should be a numeric variable. But if you try to use destring
on it, it fails because of the $
and ,
characters:
The solution is to remove those characters. You can do this with subinstr()
: you just need to replace $
and ,
with nothing. If the third argument for subinstr()
is ""
, an opening quote immediately followed by a closing quote with not even a space in between, then the second argument will simply be removed.
(74 real changes made)
+-----------------------+
| make price |
|-----------------------|
1. | amc concord 4,099 |
2. | amc pacer 4,749 |
3. | amc spirit 3,799 |
4. | buick century 4,816 |
5. | buick electra 7,827 |
+-----------------------+
All of the prices in this data set only have one comma in them, but if a car’s price had two commas (i.e. it cost a million dollars or more) you’d want to remove both of them. In fact you want to remove all commas no matter how many there are. That makes this a time when you legitimately should use a missing value for the last subinstr()
argument.
(74 real changes made)
+-----------------------+
| make price |
|-----------------------|
1. | amc concord 4099 |
2. | amc pacer 4749 |
3. | amc spirit 3799 |
4. | buick century 4816 |
5. | buick electra 7827 |
+-----------------------+
You can now successfully destring price
as it now only contains numbers:
A common task with text data is identifying whether one string contains another. For example, if make
contains the word amc
(after standardizing case to lower) then the car was built by AMC.
The strpos()
(string position) function takes two strings as arguments. The result will be the position of the second string within the first string, or zero if the first string does not contain the second string. This lets you use the result of a strpos()
to say if one string contains another. Some examples:
Note that the position of “concord” within “amc concord” is defined as the position of the first character of “concord” (i.e. the first “c”).
If strpos(string1, string2)
is greater than 0, then string2
is in string1
. This lets you easily create indicator variables:
+---------------------+
| make amc |
|---------------------|
1. | amc concord 1 |
2. | amc pacer 1 |
3. | amc spirit 1 |
4. | buick century 0 |
5. | buick electra 0 |
+---------------------+
This gives you an indicator variable for “This car was made by AMC.”
Frequently strings contain multiple pieces of information. Parsing is the task of identifying the meaning of each piece of information, and often splitting the string up into separate variables. For example, make
contains first the name of the company that built the car and then the model of the car.
The word()
function takes a string, and the number of the word you want out of it. It then returns that word:
This makes it very easy to parse strings by word. For example, you can create a variable containing the manufacturer of each car with:
+------------------------------+
| make manufacturer |
|------------------------------|
1. | amc concord amc |
2. | amc pacer amc |
3. | amc spirit amc |
4. | buick century buick |
5. | buick electra buick |
+------------------------------+
If all the model names were just one word, you could easily use word()
to identify model as well. Unfortunately they are not:
+-----------------------+
| make |
|-----------------------|
17. | chevrolet monte carlo |
+-----------------------+
You can easily create a model
variable if you recognize that the car model is everything in make
except the manufacturer, so all you need to do is remove the manufacturer.
(1 missing value generated)
+-----------------------------------------------------+
| make manufacturer model |
|-----------------------------------------------------|
17. | chevrolet monte carlo chevrolet monte carlo |
+-----------------------------------------------------+
All done? Not quite. There’s a problem with model
, one that’s common but very hard to see. One way to make it visible is to put a star at the beginning and end of model
. Putting one string after another is called concatenation, but in Stata it’s done using the plus sign:
+-------------+
| model_stars |
|-------------|
1. | * concord* |
2. | * pacer* |
3. | * spirit* |
4. | * century* |
5. | * electra* |
+-------------+
Note how there’s a space after the first star, but not before the last star. That’s because model
actually starts with a space. The original make
variable contains the name of the manufacturer followed by a space followed by the name of the model, so removing manufacturer
left the space. This can sometimes cause major headaches, as things that seem like they ought to match just don’t. For example, you might think you could identify the AMC Concord with if model=="concord"
but it doesn’t work:
This gives no results because no observation has “concord” for model
. You actually need:
Don’t leave your data like this, just waiting to trip up an unwary programmer (like yourself six months from now). The strtrim()
function will remove any spaces at the beginning or end of a string:
replace model = strtrim(model)
replace model_stars = "*" + model + "*"
list model_stars in 1/5, ab(15)
(73 real changes made)
(73 real changes made)
+-------------+
| model_stars |
|-------------|
1. | *concord* |
2. | *pacer* |
3. | *spirit* |
4. | *century* |
5. | *electra* |
+-------------+
After using strtrim()
, recreating model_stars
shows us there are no extraneous spaces around model
.
Close relatives of strtrim()
include strltrim()
, which only removes spaces before the string (i.e. on the left), strrtrim()
which only removes spaces after the string (i.e. on the right) and stritrim()
which replaces multiple spaces inside a string with a single space:
What makes parsing make
into manufacturer
and model
relatively easy is that manufacturer
is the first word of make
, so we can rely on the word()
function to identify it. But what if the string you’re parsing isn’t made up of words separated by spaces?
Consider the problem of trying to parse out the area code from a telephone number. Start by creating a variable phone
containing the phone number of the SSCC Help Desk:
The area code is separated from the rest of the phone number by a dash. But if you replace all dashes with spaces, you can use word()
again:
gen temp = subinstr(phone, "-", " ", .)
gen area_code = word(temp, 1)
list phone temp area_code in 1, ab(15)
+-----------------------------------------+
| phone temp area_code |
|-----------------------------------------|
1. | 608-262-9917 608 262 9917 608 |
+-----------------------------------------+
This method, replacing the actual separator with spaces and using word(), is easy to use but will cause trouble if the bits of information themselves contain spaces. So we’ll look at a couple of alternatives that don’t rely on special characters later.
The split
command breaks up a string into all its component words, creating a new variable for each:
variables created as string:
temp1 temp2 temp3
+--------------------------------------+
| temp temp1 temp2 temp3 |
|--------------------------------------|
1. | 608 262 9917 608 262 9917 |
+--------------------------------------+
We could now rename temp1
to area_code
if we wanted to. But split
can also break up a string based on a character other than spaces using the parse()
option, allowing us to split phone
directly:
variables created as string:
phone1 phone2 phone3
+-----------------------------------------+
| phone phone1 phone2 phone3 |
|-----------------------------------------|
1. | 608-262-9917 608 262 9917 |
+-----------------------------------------+
Since the resulting strings contain numbers, we could add the destring
option and split
would create the new variables as numeric.
Area codes are exactly three digits long, so another easy method is to just extract the first three characters of the phone number. This can be done with substr()
(substring). The substr()
function takes three arguments: the string to act on, the starting point of the substring to extract, and the number of characters to extract. Thus:
+---------------------------+
| phone area_code2 |
|---------------------------|
1. | 608-262-9917 608 |
+---------------------------+
If you pass in a missing value for the number of characters to extract, you’ll get the rest of the string:
+-------------------------+
| phone local |
|-------------------------|
1. | 608-262-9917 262-9917 |
+-------------------------+
If the starting point is negative, it will is interpreted as that many characters from the end of the string:
The combination of substr()
and strpos()
is very powerful because you can select a substring based on the content of the original string. That allows you parse just about anything.
Consider the famous opening of Charles Dickens’ A Tale of Two Cities:
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair…
That’s a lot of things that “it was”! If you’re reading this web book you are presumably a data geek who would much prefer to have all those things in a nice neat data set. So let’s do that.
The easy way would be to do a bit of clean up and then use split
with the parse string , it was
. But you already know how to use split
so we’ll use a more general method.
Load the data set from the SSCC web site:
+------------------------------------------------------------------------+
| tale |
|------------------------------------------------------------------------|
1. | It was the best of times, it was the worst of times, it was the age .. |
+------------------------------------------------------------------------+
The first step is to make a copy of the tale
variable called temp
. As you’ll see, it will be highly convenient to parse out one item from the list and then delete it before moving on to the next. Consuming each item in turn makes for easy repetition, eventually in the form of a loop. But you want to be able to check your work, so keep tale
around in its original form. Making a copy of the original string is also a convenient time to standardize the case to lower.
Now take another look at the text:
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | it was the best of times, it was the worst of times, it was the age .. |
+------------------------------------------------------------------------+
We have here a list of items separated by the phrase “it was.” But there’s no need for the string to start with a separator and it will cause problems for what we do next, so remove the first instance of “it was” by using subinstr()
to replace it with nothing. Be sure to include the space after “was” in the text to remove.
(1 real change made)
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | the best of times, it was the worst of times, it was the age of wisd.. |
+------------------------------------------------------------------------+
Now the first item in the list goes from the start of the string to the first instance of “it was”. Put it in the variable it_was1
with:`
+-------------------+
| it_was1 |
|-------------------|
1. | the best of times |
+-------------------+
The first two arguments of substr()
specify that the result will be part of the string temp
, starting from character 1. The third argument is then the number of characters to include, which is calculated by strpos()
. strpos(temp, "it was")
gives the location of the start of the first instance of “it was” within temp. We subtract three from that because we don’t want our substring to include the “i” from “it was” or the comma and space that come before it. The result is the first item, it_was1
.
Now remove it_was1
plus the following comma and space from temp:
(1 real change made)
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | it was the worst of times, it was the age of wisdom, it was the age .. |
+------------------------------------------------------------------------+
Now temp
looks just like it did when we started, other than the second item now being first. So we can repeat the exact same commands to get it_was2
: remove the initial “it was”, select out it_was2
, and then remove it from temp
.
replace temp = subinstr(temp, "it was ", "", 1)
gen it_was2 = substr(temp, 1, strpos(temp, "it was")-3)
replace temp = subinstr(temp, it_was2+", ", "", 1)
list it_was2
list temp
(1 real change made)
(1 real change made)
+--------------------+
| it_was2 |
|--------------------|
1. | the worst of times |
+--------------------+
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | it was the age of wisdom, it was the age of foolishness, it was the .. |
+------------------------------------------------------------------------+
And we can do it again for it_was3
:
replace temp = subinstr(temp, "it was ", "", 1)
gen it_was3 = substr(temp, 1, strpos(temp, "it was")-3)
replace temp = subinstr(temp, it_was3+", ", "", 1)
list it_was3
list temp
(1 real change made)
(1 real change made)
+-------------------+
| it_was3 |
|-------------------|
1. | the age of wisdom |
+-------------------+
+------------------------------------------------------------------------+
| temp |
|------------------------------------------------------------------------|
1. | it was the age of foolishness, it was the epoch of belief, it was th.. |
+------------------------------------------------------------------------+
We could continue repeating this code until we ran out of items. Note that the last item does not end with “it was” so you need to just store it as-is.
This task is practically begging to be done using a loop. If you haven’t learned about loops and macros, come back to this section once you have.
Since we don’t know ahead of time how many items there are to parse, this is a job for a while
loop. Have it keep going until temp
is empty with while (temp!="")
.
But we also need to number the items, so we’ll create a macro i
and increment it (increase it by 1) each time through the loop. It is then added to the it_was
variable name.
If temp
still contains “it was”, then we have not yet reached the last item and the next item can be parsed out just like we’ve done it before. If temp
does not contain “it was”, then the final item can be set to the remaining content of temp
.
clear
use tale
gen temp = lower(tale)
local i 1
while (temp!="") {
replace temp = subinstr(temp, "it was ", "", 1)
if (strpos(temp, "it was")>0) {
gen it_was`i' = substr(temp, 1, strpos(temp, "it was")-3)
}
else {
gen it_was`i' = temp
}
replace temp = substr(temp, strpos(temp, "it was"), .)
local i = `i'+1
}
list
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
+------------------------------------------------------------------------+
1. | tale |
| It was the best of times, it was the worst of times, it was the age .. |
|------------------------------------------------------------------------|
| temp | it_was1 | it_was2 | it_was3 |
| | the best of times | the worst of times | the age of wisdom |
|------------------------------------------------------------------------|
| it_was4 | it_was5 |
| the age of foolishness | the epoch of belief |
|------------------------------------------------------------------------|
| it_was6 | it_was7 |
| the epoch of incredulity | the season of light |
|------------------------------------------------------------------------|
| it_was8 | it_was9 | it_was10 |
| the season of darkness | the spring of hope | the winter of despair |
+------------------------------------------------------------------------+
Wide form data sets are always ugly, so let’s reshape this to long form. Reshape needs an level one identifier even though the data set has just one observation, so we’ll need to create one and just set it to 1. Then we can reshape and drop the variables that are no longer needed to get the nice, neat data set promised.
(j = 1 2 3 4 5 6 7 8 9 10)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 1 -> 10
Number of variables 13 -> 5
j variable (10 values) -> j
xij variables:
it_was1 it_was2 ... it_was10 -> it_was
-----------------------------------------------------------------------------
+--------------------------+
| it_was |
|--------------------------|
1. | the best of times |
2. | the worst of times |
3. | the age of wisdom |
4. | the age of foolishness |
5. | the epoch of belief |
|--------------------------|
6. | the epoch of incredulity |
7. | the season of light |
8. | the season of darkness |
9. | the spring of hope |
10. | the winter of despair |
+--------------------------+
The advantage of this method of looping over the string and consuming one item at a time is that the conditions for identifying an item can be complex. If they’re not complex, then split
can do it much more easily.
The names of Santa’s reindeer are given in Clement C. Moore’s Account of a Visit from St. Nicholas (better known as The Night Before Christmas) when Santa calls:
Now Dasher! now, Dancer! now, Prancer and Vixen!
On, Comet! On, Cupid! on, Donner and Blitzen!
This text is available in the data set https://sscc.wisc.edu/sscc/pubs/stata_text/reindeer.dta
. Turn it into a data set of reindeer with eight observations. The easy way is to remove all punctuation and words other than reindeer names (you can use loops to make this easier) and then use split
.
Make sure the reindeer names all end up with the proper case!
Start by loading the data and taking a look.
+------------------------------------------------------------------------+
| reindeer |
|------------------------------------------------------------------------|
1. | Now Dasher! now, Dancer! now, Prancer and Vixen! On, Comet! On, Cupi.. |
+------------------------------------------------------------------------+
Next standardize the case and removing everything from the text that isn’t a reindeer name. That last step can be done using two loops: one that loops over all the punctuation characters and removes them with subinstr()
, and one that loops over all the extraneous words and removes them with subinword()
.
replace reindeer = lower(reindeer)
foreach remove in ! , {
replace reindeer = subinstr(reindeer, "`remove'", "", .)
}
foreach remove in now and on {
replace reindeer = subinword(reindeer, "`remove'", "", .)
}
list
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
(1 real change made)
+----------------------------------------------------------------+
| reindeer |
|----------------------------------------------------------------|
1. | dasher dancer prancer vixen comet cupid donner blitzen |
+----------------------------------------------------------------+
In removing the word “on” you need to be careful not to turn “donner” into “dner”. That’s why you need to use subinword()
to remove only the complete word “on” and not “on” within other words. Remove the punctuation first so “on” (and “now”) always appears by itself.
Now you’re ready to split
what’s left into eight reindeer variables, and then reshape
from wide to long.
split reindeer
drop reindeer
gen i = 1
reshape long reindeer, i(i) j(j)
drop i j
replace reindeer = proper(reindeer)
list
variables created as string:
reindeer1 reindeer3 reindeer5 reindeer7
reindeer2 reindeer4 reindeer6 reindeer8
(j = 1 2 3 4 5 6 7 8)
Data Wide -> Long
-----------------------------------------------------------------------------
Number of observations 1 -> 8
Number of variables 9 -> 3
j variable (8 values) -> j
xij variables:
reindeer1 reindeer2 ... reindeer8 -> reindeer
-----------------------------------------------------------------------------
(8 real changes made)
+----------+
| reindeer |
|----------|
1. | Dasher |
2. | Dancer |
3. | Prancer |
4. | Vixen |
5. | Comet |
|----------|
6. | Cupid |
7. | Donner |
8. | Blitzen |
+----------+
If you don’t standardize the case with lower()
at the beginning you don’t need to fix it with proper()
at the end–but then you have to remove “Now”, “now”, “On”, and “on” separately.