7 Character
A third fundamental type of data is character data (also called string data). In R, character vectors may be used as data for analysis, and also as names of data objects or elements. (R is its own macro language, and we use the same functions to manipulate language elements as we use to manipulate data values for analysis. In R, it is all “data”.)
As data for analysis, character values can signify categories (see Chapter 9). An example might be a variable that classifies people as “Democrat”, “Green”, “Independent”, “Libertarian”, or “Republican” (American political affiliations).
affiliations <- c("Dem", "Dem", "Rep", "Rep", "Ind", "Lib")
table(affiliations)
affiliations
Dem Ind Lib Rep
2 1 1 2
A single character value might also represent multiple categorical variables. The FIPS code “55025” is a combination of state (“55” for Wisconsin) and county (“025” for Dane) codes. And the date “2020-12-23” is a combination of a year, a month, and a day code (see Chapter 8).
You will want to be able to combine character values into a single value, and to separate a single value into parts.
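As a minimal sketch of both directions, assume the FIPS code above is stored as a character value. The variable names are illustrative only; `substr()` and `paste0()` are both covered later in this chapter.

```r
# A FIPS code stored as character data (this keeps the leading zero in "025")
fips <- "55025"

# Separate the value into its state and county parts by position
state  <- substr(fips, start = 1, stop = 2)
county <- substr(fips, start = 3, stop = 5)

# Combine the parts back into a single value
rebuilt <- paste0(state, county)
```

Storing the code as a number instead would make the split harder, since a county part such as "025" depends on its leading zero.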
Another aspect of working with character data is that they may represent the raw input from multiple people. If you have ever used social media you will appreciate that people’s views of acceptable capitalization, spelling, and punctuation vary enormously. Cleaning raw data is another important part of working with character data.
7.1 Combining Character Values
One basic task when working with character data is to combine elements from two or more vectors. This is useful whenever you need to construct a single variable to represent a value identified by multiple other variables. For example, you might have data about calendar dates given as separate month, day, and year variables. To combine these into a single vector, use the paste() function (see help(paste)).
month <- c("Apr", "Dec", "Jan")
day <- c(3, 13, 23)
year <- c(2001, 2009, 1997)
date_str <- paste(year, month, day, sep="-")
date_str
[1] "2001-Apr-3" "2009-Dec-13" "1997-Jan-23"
The paste() operation is vectorized in much the same way that numeric operations are. Notice that the results are character values.
The sep argument specifies a character value to place between the data elements being combined. The default separator is a space. To have nothing added between the elements being combined, we can either specify a null string, sep="" (quotes with no space between), or we can use the paste0() function.
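The following sketch compares the three spellings side by side. The input vectors are arbitrary and the object names are made up for illustration.

```r
first  <- c("a", "b")
second <- c("x", "y")

with_space <- paste(first, second)            # default separator is a space
no_space   <- paste(first, second, sep = "")  # null string as the separator
also_none  <- paste0(first, second)           # same result as sep = ""
```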
You might also use this if you were constructing a set of variable names with a common prefix. Notice the recycling in this example.
paste("Q", 1:4, sep="")
[1] "Q1" "Q2" "Q3" "Q4"
paste0("Q", 3, c("a", "b", "c"))
[1] "Q3a" "Q3b" "Q3c"
paste() and paste0() recycle each argument so that it matches the length of the longest argument, and then concatenate element-wise. In the paste() statement above, the longest argument (1:4) is four elements long, so all others (here, just "Q") are recycled to length four (c("Q", "Q", "Q", "Q")). In the paste0() statement, the longest argument (c("a", "b", "c")) has three elements, so the others ("Q" and 3) are recycled until they are three elements long (c("Q", "Q", "Q") and c(3, 3, 3)). Then, they are concatenated element-by-element (the first element of each vector, the second element of each, and so on). Note that paste() will recycle an argument a non-whole number of times without a warning. Try paste0(c("a", "b", "c"), 1:2, "z") and notice how 1:2 is recycled to c(1, 2, 1) to have a length of three.
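Running the suggested statement makes the uneven recycling visible. In this sketch, the second call writes the recycled arguments out by hand to show that the two calls are equivalent; the object names are illustrative.

```r
# 1:2 is recycled to c(1, 2, 1) and "z" to c("z", "z", "z"), with no warning
uneven <- paste0(c("a", "b", "c"), 1:2, "z")

# The same call with the recycling written out explicitly
explicit <- paste0(c("a", "b", "c"), c(1, 2, 1), c("z", "z", "z"))
```

Both objects hold "a1z", "b2z", and "c1z".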
7.2 Working Within Character Values
A character value is an ordered collection of characters drawn from some alphabet. R is capable of working within a “local” alphabet, converting locales, or working in Unicode (a universal alphabet). The details of switching alphabets get complicated quickly, so we will skip that here.
The most basic manipulations of character data values are selecting specific characters in a value (matching), removing selected characters, or adding characters.
Matching can be done either by position within a value, or by character.
In the character value “12:08pm” we could operate on
the fourth and fifth characters (to find “08”), or we can
specify that we want to
operate on the character pair “08” (finding the fourth position). We
are either looking for an arbitrary character that occupies a specific
position, or we are looking for an arbitrary position occupied by
a specific character.
7.3 Position Indexing
The substr() and substring() functions use positions and return character values. The regexpr() function matches characters and returns starting positions (it is an index function).
x <- c("12:08pm", "12:10pm")
substr(x, start=4, stop=5)
[1] "08" "10"
regexpr("08", x)
[1] 4 -1
attr(,"match.length")
[1] 2 -1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
Here the character string sought is found at the fourth position in the first value, and not at all (-1) in the second value. (Everything after the first line of output is metadata, which makes this output hard to read.)
Although this example of regexpr() is very simple, a word of warning is in order. Character matching functions in R rely on regular expressions, a system of specifying character patterns that includes literal characters (as in the example above), wildcards, and position anchors. We’ll come back to this below.
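To see why the warning matters, consider searching for a period. The sketch below uses a made-up value. Because the dot is a wildcard in a regular expression, the first call matches the very first character rather than the literal period. The fixed=TRUE argument used for comparison comes up again at the end of the discussion of metacharacters.

```r
amount <- "12.95"

# As a regular expression, "." matches any character, so the match is at position 1
wild_pos <- as.integer(regexpr(".", amount))

# With fixed = TRUE the pattern is taken literally, finding the period at position 3
literal_pos <- as.integer(regexpr(".", amount, fixed = TRUE))
```

Wrapping the result in as.integer() drops the metadata attributes and leaves only the position.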
To work with positions, it is often useful to know the length of a character value, for which we have the nchar() function.
If we wanted the last two characters of each character value, we could specify
x <- c("12:08pm", "12:10pm", "1:08pm")
x_len <- nchar(x)
substr(x, start=x_len-1, stop=x_len)
[1] "pm" "pm" "pm"
By using substr() on the left-hand side of the assignment operator, we can substitute in new substrings by position.
substr(x, start=x_len-1, stop=x_len) <- "am"
x
[1] "12:08am" "12:10am" "1:08am"
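The substring() function named at the start of this section works much the same way, except that its last argument defaults to a very large number, so it reads through to the end of each value. A short sketch with a fresh copy of the time values (the object names are illustrative):

```r
times <- c("12:08pm", "12:10pm", "1:08pm")

# Last two characters: start two from the end and let `last` take its default
suffix <- substring(times, first = nchar(times) - 1)

# Everything before the suffix, using explicit start and stop positions
clock <- substr(times, start = 1, stop = nchar(times) - 2)
```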
7.4 Character Matching
While some character value manipulations are easily handled by position indexing, many others are handled more gracefully through character matching.
7.4.1 Global Wildcards
You may already be familiar with the concept of wildcards to specify patterns - computer operating systems all allow wildcards in searching for file names. These are sometimes referred to as global or glob wildcards.
For example, on a Windows computer you could open a Command window and type the following command
dir /b 2*.csv
Or on Mac or Linux, run this command in a terminal:
ls 2*.csv
to get a list of all the CSV files in your current folder beginning with the character “2” (if there are any).
The asterisk (*) wildcard matches any characters (zero or more) after the literal “2” at the beginning of file names, while the “.csv” literal matches only files which end with that file extension. Similarly, a question mark matches a single arbitrary character.
With global wildcards, the pattern to match always specifies a string from beginning to end. So
- "02*" matches any character string beginning with “02”
- "*02" matches any string ending with “02”
- "*02*" matches any string containing “02”
R does not use global wildcards directly, but the glob2rx() function can translate this type of wildcard into a regular expression for you.
glob2rx("02*")
[1] "^02"
glob2rx("*02")
[1] "^.*02$"
glob2rx("*02*")
[1] "^.*02"
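As a sketch of how a translated pattern might be used, the following applies the file-name wildcard from the command-line example to a made-up vector of file names (these are hypothetical names, not real files). The grep() function used here is introduced in the next section.

```r
files <- c("2020.csv", "2021.csv", "1999.csv", "2020.txt")

# Translate the glob; the literal period is escaped in the resulting pattern
pattern <- glob2rx("2*.csv")

# Keep the elements that match the translated pattern
matches <- grep(pattern, files, value = TRUE)
```

Only "2020.csv" and "2021.csv" both begin with "2" and end with ".csv".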
7.4.2 Regular Expression Wildcards
Regular expressions expand on the concept of wildcards, and allow us to match elements of arbitrary character vectors with more precise patterns than global wildcards. We expand the concept of a wildcard by separating
- what characters to match
- how many characters to match
- where to match (what position)
A single arbitrary character is specified as a period, “.”, much like the global question mark, “?”. For example, one way to get a vector of column names that are at least four letters long using a regular expression would be
cars <- mtcars
grep("....", names(cars), value=TRUE)
[1] "disp" "drat" "qsec" "gear" "carb"
The grep() function searches a character vector for elements that match a pattern. It returns position indexes by default, or values that contain a match with the value=TRUE argument. The grepl() (grep logical) function returns a logical vector indicating which elements matched. These two functions give us all three methods of specifying indexes along a vector.
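The three forms can be compared side by side. This sketch uses the column names of the built-in mtcars data and an arbitrary pattern, "r"; the object names are illustrative.

```r
cols <- names(mtcars)

by_position <- grep("r", cols)                # integer positions
by_name     <- grep("r", cols, value = TRUE)  # matching values (column names)
by_logical  <- grepl("r", cols)               # TRUE or FALSE for every element

# Each form can index the data frame, and each picks out the same columns
same_columns <- identical(mtcars[, by_position], mtcars[, by_name]) &&
  identical(mtcars[, by_position], mtcars[, by_logical])
```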
In addition to wildcard characters, we can also match literal characters, and literal substrings.
grep("a", names(cars), value=TRUE)
[1] "drat" "am" "gear" "carb"
grep("ar", names(cars), value=TRUE)
[1] "gear" "carb"
7.4.2.1 Position
In contrast to global wildcards, these patterns match anywhere within a character value - they are position-less. To specify positions, we have two regular expression anchors, which tie a pattern to the beginning (“^”) or the end (“$”) of a string.
grep("m", names(cars), value=TRUE) # any m
[1] "mpg" "am"
grep("^m", names(cars), value=TRUE) # begins with m
[1] "mpg"
grep("m$", names(cars), value=TRUE) # ends with m
[1] "am"
Although we have added position qualifiers to our patterns, notice that we are still specifying partial strings, not whole strings. To specify a complete string, we use both anchors! One way to find column names that are exactly two characters long would be
grep("^..$", names(cars), value=TRUE)
[1] "hp" "wt" "vs" "am"
Without both anchors, this example would find all column names at least two characters long, including those with three and four characters.
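A quick check of that claim, as a sketch using the same column names (the object names are illustrative):

```r
cols <- names(mtcars)

# No anchors: any name containing at least two characters matches
unanchored <- grep("..", cols, value = TRUE)

# Both anchors: the two characters must make up the whole name
anchored <- grep("^..$", cols, value = TRUE)
```

Every one of the eleven column names has at least two characters, so the unanchored pattern filters nothing out.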
7.4.2.2 Repetition
So far we have been specifying one character at a time, but the regular expression syntax also includes the concept of repetition. There are six ways to specify how many matches are required:
- a question mark, “?”, matches zero or one time, making a character specification optional
- an asterisk, “*”, matches zero or more times, a character is optional but also may be repeated
- a plus, “+”, matches one or more times, a character is required and may be repeated
- braces with a number, “{n}” matches exactly n times
- braces with two numbers, “{n,m}”, matches at least n times and no more than m times
- no repetition qualifier means match exactly once
So another way to get two-letter column names would be to specify
grep("^.{2}$", names(cars))
[1] 4 6 8 9
While the global wildcard “?” is replaced by the dot in regular expressions, the global wildcard “*” is replaced with the regular expression “.*”.
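The repetition qualifiers are easiest to compare on a small made-up vector. In this sketch each pattern is anchored at both ends, so the repetition count applies to the whole value; the words and object names are illustrative.

```r
words <- c("gd", "god", "good", "goood")

zero_or_one  <- grepl("^go?d$", words)      # "o" is optional
zero_or_more <- grepl("^go*d$", words)      # any number of "o", including none
one_or_more  <- grepl("^go+d$", words)      # at least one "o"
exactly_two  <- grepl("^go{2}d$", words)    # exactly two "o"
two_to_three <- grepl("^go{2,3}d$", words)  # two or three "o"
```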
7.4.2.3 Character Class
So far we have introduced arbitrary matches and literal matches, but regular expressions are able to work between these two extremes as well. We can specify classes (sets) of characters to match, and we can do this by itemizing the whole class, or using a shortcut name for some classes.
Square brackets, “[ ]”, are used to itemize classes, and to specify shortcut names. As an arbitrary example, column names that begin with “a” or “b” or “c” could be specified
grep("^[abc]", names(cars), value=TRUE)
[1] "cyl" "am" "carb"
Notice that this interacts with the repetition qualifiers. To require the first two characters to belong to the same character class we would specify
grep("^[abc]{2}", names(cars), value=TRUE)
[1] "carb"
The twelve shortcut names that are predefined in R are documented on the help("regex") page, and they include
- [:alpha:], alphabetic characters
- [:digit:], numerals
- [:punct:], punctuation marks
These shortcuts are POSIX character classes rather than something unique to R, and must themselves be used within class brackets. Contrast the first (correct) example with the second (incorrect) example. Why does the second line pick out “def”?
grep("[[:digit:]]", c("abc", "123", "def"), value=TRUE)
[1] "123"
grep("[:digit:]", c("abc", "123", "def"), value=TRUE)
[1] "def"
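A few more uses of the shortcut names, as a sketch on made-up values. Note the doubled brackets in every pattern: the outer pair opens the class and the inner pair names the shortcut.

```r
values <- c("abc", "123", "a1!")

has_alpha  <- grepl("[[:alpha:]]", values)     # contains a letter
has_punct  <- grepl("[[:punct:]]", values)     # contains a punctuation mark
all_digits <- grepl("^[[:digit:]]+$", values)  # digits only, from start to end

# Two shortcut names inside one class: three letters or digits in a row
three_alnum <- grepl("[[:alpha:][:digit:]]{3}", values)
```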
7.4.2.4 Regular Expression Metacharacters
We can match arbitrary characters, specified classes of characters, and most literal characters. However, we are using some characters as metacharacters with special meaning in regular expression patterns. The dot (period, .), asterisk (*), question mark (?), plus sign (+), caret (^), dollar sign ($), square brackets ([, ]), braces ({, }), dash (-), and a few more we haven’t discussed, all have a non-literal meaning. What if we want to use these as literal characters?
There are generally two ways to take a metacharacter and use it as a literal. We can specify it within a square bracket class, or we can escape it. Either method comes with caveats.
To “escape” a character (ignore its special meaning and use it as a literal) we typically think of preceding it with a backslash, “\”. However, a backslash is also the escape character inside R character strings, so to pass a backslash through to the regular expression we double it. That is, to use an escape character, we need to first escape the escape character!
For example, to find a literal dollar sign, contrast the correct specification with two mistakes.
grep("\\$", c("$12.95", "40502"), value=TRUE) # correct
grep("$", c("$12.95", "40502"), value=TRUE) # wrong: no slash = end of string
grep("\$", c("$12.95", "40502"), value=TRUE) # wrong: error
Error: '\$' is an unrecognized escape in character string (<text>:3:8)
An alternative is to write (most) metacharacters within a class.
grep("[$]", c("$12.95", "40502"), value=TRUE) # correct
[1] "$12.95"
The caveat here is that the caret, dash, and backslash all have special meaning within character classes.
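A sketch of what those special meanings look like in practice. The behaviors shown (a leading caret negates the class, a dash between two characters defines a range) are standard regular expression rules rather than anything covered above, so treat this as a preview on made-up values.

```r
symbols <- c("$", "^", "-", "b")

# Caret first: negation, so this matches any character that is NOT "$"
not_dollar <- grepl("[^$]", symbols)

# Caret not first: a literal caret, so this matches "$" or "^"
dollar_or_caret <- grepl("[$^]", symbols)

# Dash between two characters: a range from "a" to "c"
a_to_c <- grepl("[a-c]", symbols)

# Dash last: a literal dash, so this matches "a", "c", or "-"
a_c_or_dash <- grepl("[ac-]", symbols)
```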
(A third approach is to turn off regular expression matching and use only literal matching. Use the fixed=TRUE argument.)
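A minimal sketch of literal matching, reusing the dollar-sign values from above (the object names are illustrative):

```r
prices <- c("$12.95", "40502")

# With fixed = TRUE, "$" is just a dollar sign, not an end-of-string anchor
with_dollar <- grep("$", prices, value = TRUE, fixed = TRUE)

# The same applies to the dot: only values containing a literal period match
with_period <- grep(".", prices, value = TRUE, fixed = TRUE)
```

The trade-off is that wildcards, anchors, and classes are all unavailable in a fixed pattern.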
7.5 Substitution
Simply identifying matching values is useful for creating indicator variables (grepl()) or for creating sets of variable names (grep()). But for data values, we often want to manipulate the values we identify. One of the main tools we will use for this is substitution (sub()) and repeated substitution (gsub()).
Substitution where a regular expression pattern appears at most once in each value is straightforward.
Returning to the time example, which we previously solved by positional substitution, we can use the very simple regular expression “pm” to identify matching characters to replace with “am”.
x <- c("12:08pm", "12:10pm", "1:08pm")
sub("pm", "am", x)
[1] "12:08am" "12:10am" "1:08am"
Substitution also works as a method of deleting matched characters, when the replacement is the null string (quotes with no space).
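A short sketch of deletion, and of the difference between sub() and gsub() when a pattern appears more than once in a value. The values are taken from earlier examples in this chapter; the object names are illustrative.

```r
times <- c("12:08pm", "12:10pm", "1:08pm")

# Delete the matched characters by replacing them with the null string
no_suffix <- sub("pm", "", times)

# sub() replaces only the first match in each value; gsub() replaces all of them
date_str   <- "2020-12-23"
first_only <- sub("-", "", date_str)
all_gone   <- gsub("-", "", date_str)
```

Here first_only is "202012-23", with the second dash left in place, while all_gone is "20201223".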
7.6 Exercises
Percent to proportion: Given a character vector with values in percent form, convert these to numerical proportions, values between 0 and 1.
x <- sprintf("%4.2f%%", runif(5)*100)
x
[1] "72.91%" "29.23%" "91.91%" "86.25%" "65.51%"
Currency is sometimes denoted with both a currency symbol and commas. Convert these to numeric values.
x <- c("$10", "$11.99", "$1,011.01")
Inconsistent capitalization is a problem with some alphabets. The output of table(colors2) indicates that we have four unique values, when there are really only two colors. Standardize the capitalization with either tolower() or toupper() so that table() correctly tabulates our values.
colors2 <- c("Red", "blue", "red", "blue", "RED")
7.7 Advanced Exercises
Some countries use a comma rather than a period to separate the decimal, and a period as the thousands delimiter. For example, instead of writing one thousand two hundred thirty-four dollars and fifty-six cents as $1,234.56, they may write it as $1.234,56. The currency symbol may also be placed after the amount, such as 20$ rather than $20. Convert these alternative currency expressions into numeric values:
currency <- c("$1.234,56", "20$", "$12,99", "5.555 $")
Translating wildcard patterns
The glob2rx() function translates character strings with wildcards (* for any string, ? for a single character) into regular expressions. We can translate “Merc *” (a string starting with “Merc” and a space, followed by anything) into “^Merc ”. Combining this with grep() allows us to select rows from our mtcars dataframe. (Note that value = TRUE returns values, while the default value = FALSE returns positions.)
glob2rx("Merc *")
[1] "^Merc "
grep(glob2rx("Merc *"), row.names(mtcars), value=TRUE)
[1] "Merc 240D" "Merc 230" "Merc 280" "Merc 280C" "Merc 450SE" "Merc 450SL" "Merc 450SLC"
grep(glob2rx("Merc *"), row.names(mtcars))
[1] 8 9 10 11 12 13 14
mtcars[grep(glob2rx("Merc *"), row.names(mtcars)), ]
             mpg cyl  disp  hp drat   wt qsec vs am gear carb
Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3
Now, try selecting rows from mtcars where the row name…
- starts with “Toyota” and a space
- starts with any four characters and then a space
- ends with 0
- ends with a space and then any three characters