Working with Character Data

Besides numeric data, the other fundamental data type in SAS is character data - data values composed of strings of alpha- numeric character.

Character data can come in many language encodings, representing Western Latin character sets, double-byte character sets, and UTF-8 character sets. SAS has extensive capabilities to work with data in many different encodings, SSCC users have successfully worked with data in half a dozen languages. This is a big topic that is well beyond the scope of these notes. Here we will just discuss a few features of working with the SSCC default “wlatin” encoding.

Lengths

Automatically determined

Unlike numeric data, which is always stored as an 8 byte double by default, the length of character variables is determined automatically by the expressions that create them.

For example, using the cars data we might want to combine the Make and Model variables into a single string (this can be a good strategy for creating efficient identifiers in large data sets). Initially, these variables can hold up to 13 and 40 characters, respectively.

ods select variables;
proc contents data=sashelp.cars(keep=make model);
run;
                          The CONTENTS Procedure

                Alphabetic List of Variables and Attributes
 
                       #    Variable    Type    Len

                       1    Make        Char     13
                       2    Model       Char     40

Concatenating these variables, SAS allows room for 13 + 40 = 53 characters in the new variable.

(We TRIM make first, otherwise it is padded to 13 characters - too many spaces!)

data cars;
  set sashelp.cars;
  mm = trim(make) || model;
  run;
ods select variables;
proc contents data=cars(keep=make model mm);
run;
                          The CONTENTS Procedure

                Alphabetic List of Variables and Attributes
 
                        #    Variable    Type    Len

                        1    Make        Char     13
                        2    Model       Char     40
                       16    mm          Char     53
proc print data=cars(obs=3) noobs;
  var make model mm;
run;
              Make     Model             mm

              Acura    MDX               Acura MDX           
              Acura    RSX Type S 2dr    Acura RSX Type S 2dr
              Acura    TSX 4dr           Acura TSX 4dr       

Too short

Occasionally a variable will be too short, and we’ll want to lengthen it. This is especially common when creating new variables based on character literals (character constants).

Suppose we wanted to create a variable that spells out sex in the class data. Consider the difference in two different constructions.

data class;
  set sashelp.class;

  if sex = 'M' then sex2 = 'male';
  else if sex = 'F' then sex2 = 'female';

  if sex = 'F' then sex3 = 'female';
  else if sex = 'M' then sex3 = 'male';
  run;

Notice that sex2 is truncated at four characters.

proc print data=class(obs=2);
  var name sex sex2 sex3;
run;
                  Obs     Name     Sex    sex2     sex3

                    1    Alfred     M     male    male  
                    2    Alice      F     fema    female

This is because, during the construction of the PDV in the compile phase of the DATA step, the first time SAS encountered the variable named sex2, the expression on the right-hand side of the assignment had a length of 4 characters.

ods select variables;
proc contents data=class(keep=sex sex2 sex3);
run;
                          The CONTENTS Procedure

                Alphabetic List of Variables and Attributes
 
                       #    Variable    Type    Len

                       2    Sex         Char      1
                       6    sex2        Char      4
                       7    sex3        Char      6

One fix, then, is to ensure that the longest data value appears first. Another, more common fix is to simply initialize the variable with a LENGTH statement.

To declare the variable is a character variable, include the $ symbol after the variable name.

data class;
  set sashelp.class;
  length sex2 $ 6;
  if sex = 'M' then sex2 = 'male';
  else if sex = 'F' then sex2 = 'female';
  run;
proc print data=class(obs=2);
  var name sex sex2;
run;
                      Obs     Name     Sex     sex2

                        1    Alfred     M     male  
                        2    Alice      F     female

Too long

Occasionally a character variable may be too long for our purposes - this is common where character values force a table of statistical results to be too wide or otherwise difficult to read. Here we have two approaches we can take to shortening the data values.

  • extract a substring
  • use a format

We’ll discuss extracting a substring below.

Using a format is often a great solution to display problems. However, keep in mind that formats also serve to recode the data.

Suppose in the class data we wanted a display of names and ages, but we only wanted the first three letters of each name. We could use a character format in PROC PRINT.

proc print data=class(obs=5);
  format name $3.;
  var name age;
run;
                            Obs    Name    Age

                              1    Alf      14
                              2    Ali      13
                              3    Bar      13
                              4    Car      14
                              5    Hen      14

However, keep in mind that truncating character values can have statistical consequences, whether in the form of a new variable or a format.

proc freq data=class;
  format name $1.;
  tables name / nocum;
run;
                            The FREQ Procedure

                       Name    Frequency     Percent
                       -----------------------------
                       A              2       10.53 
                       B              1        5.26 
                       C              1        5.26 
                       H              1        5.26 
                       J              7       36.84 
                       L              1        5.26 
                       M              1        5.26 
                       P              1        5.26 
                       R              2       10.53 
                       T              1        5.26 
                       W              1        5.26 

Some Character Functions

We have already encountered the TRIM function, for removing spaces at the end of a data value. There are several other basic character functions that are endlessly useful: LENGTH, INDEX, and SUBSTR. There are numerous other character functions not discussed here. SAS also has functions to work with Perl regular expressions.

In extracting and constructing new character values, it is often helpful to

  • find the position at which a character is found
  • extract a substring
  • replace a substring
  • determine how long a particular data value is

Index

In the cars data, the make variable holds a lot of information that is not encoded elsewhere.

proc print data=sashelp.cars(obs=3) noobs;
  var model;
run;
                              Model

                              MDX           
                              RSX Type S 2dr
                              TSX 4dr       

Suppose we wanted to create a variable which indicated whether a particular model was a four-door or not. The INDEX function would help. It takes the form

INDEX(source, to-find)

where source evaluates as a character string in which we seek the string to-find. If to-find is found, INDEX returns the position of the first character of the first place to-find occurs in source. If source is not found the function returns a zero (0).

(The data values in model incidentally have a leading blank which we will trim with the LEFT (“left-align”) function, for the sake of counting characters in our output.)

data cars;
  set sashelp.cars;
  fdrpos = index(left(model), "4dr");
  fourdoor = (fdrpos gt 0);
run;
proc print data=cars(obs=3) noobs;
  var model fdrpos fourdoor;
run;
                   Model             fdrpos    fourdoor

                   MDX                  0          0   
                   RSX Type S 2dr       0          0   
                   TSX 4dr              5          1   

We find that many of the cars in this data set are four-door.

proc freq data=cars;
  tables fourdoor /nocum;
run;
                            The FREQ Procedure

                     fourdoor    Frequency     Percent
                     ---------------------------------
                            0         231       53.97 
                            1         197       46.03 

Substrings

Suppose we were working with just the Audi cars. The first “word” (the first two characters) in model tells us the model series of each observation.

(There is also a WORD function that would be even easier to use were it not for the second observation.)

proc print data=sashelp.cars(obs=3) noobs;
  where make eq "Audi";
  var make model;
  run;
                      Make    Model

                      Audi    A4 1.8T 4dr           
                      Audi    A41.8T convertible 2dr
                      Audi    A4 3.0 4dr            

We could extract that into a separate variable with the SUBSTR function. This takes the form

SUBSTR(source, start-position, length)

where source is again an expression that evaluates to a character string, start-position is the place where we wish to begin extracting a string from source, and length is the number of characters to extract.

data audi;
  set sashelp.cars(where=(make eq "Audi"));
  series = substr(left(model), 1, 2);
run;
proc print data=audi(obs=3) noobs;
  var model series;
  run;
                     Model                     series

                     A4 1.8T 4dr                 A4  
                     A41.8T convertible 2dr      A4  
                     A4 3.0 4dr                  A4  
proc freq data=audi;
  tables series /nocum;
run;
                            The FREQ Procedure

                      series    Frequency     Percent
                      -------------------------------
                      A4               7       36.84 
                      A6               5       26.32 
                      A8               1        5.26 
                      RS               1        5.26 
                      S4               2       10.53 
                      TT               3       15.79 

Replacement

In the Audi data, we see that many of the model values include information about the engine size - but we already have a separate variable for that.

proc print data=sashelp.cars(obs=3) noobs;
  where make eq "Audi";
  var make model enginesize;
  run;
                                                   Engine
                 Make    Model                      Size

                 Audi    A4 1.8T 4dr                 1.8 
                 Audi    A41.8T convertible 2dr      1.8 
                 Audi    A4 3.0 4dr                  3.0 

We might want to simplify the model values by removing the engine size. We could to this by replacing the any engine size string with a space. This also makes use of the SUBSTR function, but in this case SUBSTR appears on the left side of the assignment symbol

SUBSTR(source, start-position, length) = "replacement-string";

(In this example, we have to convert the numeric engine size into a character string in order to index it’s position. The result of the replacement has multiple consecutive spaces, which we can compress with the COMPBL function.)

data audi;
  set sashelp.cars;
  where make eq "Audi";
  keep make model enginesize es es_pos;
  es = left(put(enginesize, f3.1));
  es_pos = index(model, es);
  if (es_pos > 0) then do;

    substr(model, es_pos, 3) = " ";

    model = compbl(model);
    end;
run;
proc print data=audi(obs=5) noobs;
  var model enginesize;
  run;
                                               Engine
                      Model                     Size

                      A4 T 4dr                   1.8 
                      A4 T convertible 2dr       1.8 
                      A4 4dr                     3.0 
                      A4 Quattro 4dr manual      3.0 
                      A4 Quattro 4dr auto        3.0 

Length

It is often handy to know how long a character string is. We might use this information to shorten a character variable and save storage space (particularly with an already-large data set). And it is often useful in calculating where “start” positions are, or substring lengths. The LENGTH function itself is quite simple

LENGTH(character-string)

where character-string is any expression that evaluates as a character string.

For example, the model values that resulted from our replacement above were

data audi;
  set audi;
  model_length = length(model);
run;

proc means data=audi n min max;
  var model_length;
  run;
                            The MEANS Procedure

                     Analysis Variable : model_length 
 
                     N         Minimum         Maximum
                    ----------------------------------
                    19       7.0000000      29.0000000
                    ----------------------------------

Model is being stored with a length of 40 characters, so this suggests we could save disk space with this variable.