Working with Character Data
Besides numeric data, the other fundamental data type in SAS is character data - data values composed of strings of alpha- numeric character.
Character data can come in many language encodings, representing Western Latin character sets, double-byte character sets, and UTF-8 character sets. SAS has extensive capabilities to work with data in many different encodings, SSCC users have successfully worked with data in half a dozen languages. This is a big topic that is well beyond the scope of these notes. Here we will just discuss a few features of working with the SSCC default “wlatin” encoding.
Lengths
Automatically determined
Unlike numeric data, which is always stored as an 8 byte double by default, the length of character variables is determined automatically by the expressions that create them.
For example, using the cars
data we might want to combine the Make and Model variables into a single string (this can be a good strategy for creating efficient identifiers in large data sets). Initially, these variables can hold up to 13 and 40 characters, respectively.
ods select variables;
proc contents data=sashelp.cars(keep=make model);
run;
The CONTENTS Procedure
Alphabetic List of Variables and Attributes
# Variable Type Len
1 Make Char 13
2 Model Char 40
Concatenating these variables, SAS allows room for 13 + 40 = 53 characters in the new variable.
(We TRIM make
first, otherwise it is padded to 13 characters - too many spaces!)
data cars;
set sashelp.cars;
mm = trim(make) || model;
run;
ods select variables;
proc contents data=cars(keep=make model mm);
run;
The CONTENTS Procedure
Alphabetic List of Variables and Attributes
# Variable Type Len
1 Make Char 13
2 Model Char 40
16 mm Char 53
proc print data=cars(obs=3) noobs;
var make model mm;
run;
Make Model mm
Acura MDX Acura MDX
Acura RSX Type S 2dr Acura RSX Type S 2dr
Acura TSX 4dr Acura TSX 4dr
Too short
Occasionally a variable will be too short, and we’ll want to lengthen it. This is especially common when creating new variables based on character literals (character constants).
Suppose we wanted to create a variable that spells out sex in the class
data. Consider the difference in two different constructions.
data class;
set sashelp.class;
if sex = 'M' then sex2 = 'male';
else if sex = 'F' then sex2 = 'female';
if sex = 'F' then sex3 = 'female';
else if sex = 'M' then sex3 = 'male';
run;
Notice that sex2
is truncated at four characters.
proc print data=class(obs=2);
var name sex sex2 sex3;
run;
Obs Name Sex sex2 sex3
1 Alfred M male male
2 Alice F fema female
This is because, during the construction of the PDV in the compile phase of the DATA step, the first time SAS encountered the variable named sex2
, the expression on the right-hand side of the assignment had a length of 4 characters.
ods select variables;
proc contents data=class(keep=sex sex2 sex3);
run;
The CONTENTS Procedure
Alphabetic List of Variables and Attributes
# Variable Type Len
2 Sex Char 1
6 sex2 Char 4
7 sex3 Char 6
One fix, then, is to ensure that the longest data value appears first. Another, more common fix is to simply initialize the variable with a LENGTH statement.
To declare the variable is a character variable, include the $
symbol after the variable name.
data class;
set sashelp.class;
length sex2 $ 6;
if sex = 'M' then sex2 = 'male';
else if sex = 'F' then sex2 = 'female';
run;
proc print data=class(obs=2);
var name sex sex2;
run;
Obs Name Sex sex2
1 Alfred M male
2 Alice F female
Too long
Occasionally a character variable may be too long for our purposes - this is common where character values force a table of statistical results to be too wide or otherwise difficult to read. Here we have two approaches we can take to shortening the data values.
- extract a substring
- use a format
We’ll discuss extracting a substring below.
Using a format is often a great solution to display problems. However, keep in mind that formats also serve to recode the data.
Suppose in the class
data we wanted a display of names and ages, but we only wanted the first three letters of each name. We could use a character format in PROC PRINT.
proc print data=class(obs=5);
format name $3.;
var name age;
run;
Obs Name Age
1 Alf 14
2 Ali 13
3 Bar 13
4 Car 14
5 Hen 14
However, keep in mind that truncating character values can have statistical consequences, whether in the form of a new variable or a format.
proc freq data=class;
format name $1.;
tables name / nocum;
run;
The FREQ Procedure
Name Frequency Percent
-----------------------------
A 2 10.53
B 1 5.26
C 1 5.26
H 1 5.26
J 7 36.84
L 1 5.26
M 1 5.26
P 1 5.26
R 2 10.53
T 1 5.26
W 1 5.26
Some Character Functions
We have already encountered the TRIM function, for removing spaces at the end of a data value. There are several other basic character functions that are endlessly useful: LENGTH, INDEX, and SUBSTR. There are numerous other character functions not discussed here. SAS also has functions to work with Perl regular expressions.
In extracting and constructing new character values, it is often helpful to
- find the position at which a character is found
- extract a substring
- replace a substring
- determine how long a particular data value is
Index
In the cars
data, the make
variable holds a lot of information that is not encoded elsewhere.
proc print data=sashelp.cars(obs=3) noobs;
var model;
run;
Model
MDX
RSX Type S 2dr
TSX 4dr
Suppose we wanted to create a variable which indicated whether a particular model was a four-door or not. The INDEX function would help. It takes the form
INDEX(source, to-find)
where source evaluates as a character string in which we seek the string to-find. If to-find is found, INDEX returns the position of the first character of the first place to-find occurs in source. If source is not found the function returns a zero (0).
(The data values in model
incidentally have a leading blank which we will trim with the LEFT (“left-align”) function, for the sake of counting characters in our output.)
data cars;
set sashelp.cars;
fdrpos = index(left(model), "4dr");
fourdoor = (fdrpos gt 0);
run;
proc print data=cars(obs=3) noobs;
var model fdrpos fourdoor;
run;
Model fdrpos fourdoor
MDX 0 0
RSX Type S 2dr 0 0
TSX 4dr 5 1
We find that many of the cars in this data set are four-door.
proc freq data=cars;
tables fourdoor /nocum;
run;
The FREQ Procedure
fourdoor Frequency Percent
---------------------------------
0 231 53.97
1 197 46.03
Substrings
Suppose we were working with just the Audi cars. The first “word” (the first two characters) in model
tells us the model series of each observation.
(There is also a WORD function that would be even easier to use were it not for the second observation.)
proc print data=sashelp.cars(obs=3) noobs;
where make eq "Audi";
var make model;
run;
Make Model
Audi A4 1.8T 4dr
Audi A41.8T convertible 2dr
Audi A4 3.0 4dr
We could extract that into a separate variable with the SUBSTR function. This takes the form
SUBSTR(source, start-position, length)
where source is again an expression that evaluates to a character string, start-position is the place where we wish to begin extracting a string from source, and length is the number of characters to extract.
data audi;
set sashelp.cars(where=(make eq "Audi"));
series = substr(left(model), 1, 2);
run;
proc print data=audi(obs=3) noobs;
var model series;
run;
Model series
A4 1.8T 4dr A4
A41.8T convertible 2dr A4
A4 3.0 4dr A4
proc freq data=audi;
tables series /nocum;
run;
The FREQ Procedure
series Frequency Percent
-------------------------------
A4 7 36.84
A6 5 26.32
A8 1 5.26
RS 1 5.26
S4 2 10.53
TT 3 15.79
Replacement
In the Audi data, we see that many of the model
values include information about the engine size - but we already have a separate variable for that.
proc print data=sashelp.cars(obs=3) noobs;
where make eq "Audi";
var make model enginesize;
run;
Engine
Make Model Size
Audi A4 1.8T 4dr 1.8
Audi A41.8T convertible 2dr 1.8
Audi A4 3.0 4dr 3.0
We might want to simplify the model
values by removing the engine size. We could to this by replacing the any engine size string with a space. This also makes use of the SUBSTR function, but in this case SUBSTR appears on the left side of the assignment symbol
SUBSTR(source, start-position, length) = "replacement-string";
(In this example, we have to convert the numeric engine size into a character string in order to index it’s position. The result of the replacement has multiple consecutive spaces, which we can compress with the COMPBL function.)
data audi;
set sashelp.cars;
where make eq "Audi";
keep make model enginesize es es_pos;
es = left(put(enginesize, f3.1));
es_pos = index(model, es);
if (es_pos > 0) then do;
substr(model, es_pos, 3) = " ";
model = compbl(model);
end;
run;
proc print data=audi(obs=5) noobs;
var model enginesize;
run;
Engine
Model Size
A4 T 4dr 1.8
A4 T convertible 2dr 1.8
A4 4dr 3.0
A4 Quattro 4dr manual 3.0
A4 Quattro 4dr auto 3.0
Length
It is often handy to know how long a character string is. We might use this information to shorten a character variable and save storage space (particularly with an already-large data set). And it is often useful in calculating where “start” positions are, or substring lengths. The LENGTH function itself is quite simple
LENGTH(character-string)
where character-string is any expression that evaluates as a character string.
For example, the model
values that resulted from our replacement above were
data audi;
set audi;
model_length = length(model);
run;
proc means data=audi n min max;
var model_length;
run;
The MEANS Procedure
Analysis Variable : model_length
N Minimum Maximum
----------------------------------
19 7.0000000 29.0000000
----------------------------------
Model
is being stored with a length of 40 characters, so this suggests we could save disk space with this variable.