Using Compressed Data in SAS

SAS has a variety of tools for working with compressed data. This article will describe how to use them, and why.

Compression programs look for patterns in the data, and then replace the original file with a file that describes those patterns. Nothing is lost--the description contains all the information needed to recreate the original file. Normally the description is smaller than the original file, but how much smaller depends on the data itself and the compression scheme used. With the compression schemes built into SAS, the “compressed” file can sometimes be bigger than the original!

Tip

It takes CPU time to compress or uncompress a file. Compression trades CPU power for disk space. For files you use constantly, this may not be a good trade.

We strongly encourage you to compress any data sets you are not using on a regular basis.

Compressing SAS Data Sets

You can turn on SAS file compression at three levels.

Level      Code example
system     options compress=binary;
library    libname z "Z:/SAS" compress=binary;
data set   data output-data(compress=binary);

Setting the COMPRESS system option means any data set you create will be compressed. Setting the LIBRARY option means all data created in that library will be compressed. And the DATA step option means that particular data set will be compressed.

The COMPRESS option can take one of two values: BINARY or CHAR (YES is equivalent to CHAR). CHAR compression is likely to be the most efficient if your variables are largely character, while BINARY compression works with both numeric and character data.

Compression works observation by observation, so results depend on how much data is in each observation. It is very difficult to predict which scheme will work better for a particular data set, so you may want to experiment and see.
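One simple way to experiment is to write the same data set under each scheme and compare the compression notes SAS prints in the log (WORK.MYDATA here is a hypothetical stand-in for your own data set):

```sas
/* Write the same data set under each compression scheme and
   compare the "NOTE: Compressing data set ..." messages in the log.
   WORK.MYDATA is a stand-in for your own data set. */
data work.try_char(compress=char);
   set work.mydata;
run;

data work.try_binary(compress=binary);
   set work.mydata;
run;
```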

Using a compressed data set requires no special syntax whatsoever. SAS recognizes that the data set is compressed and uncompresses each observation automatically as it reads it.
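For example, a procedure reads a compressed data set exactly as it would read an uncompressed one (WKCMP.MYDATA here is a hypothetical compressed data set):

```sas
/* No special options are needed to read compressed data;
   SAS uncompresses each observation as it is read. */
proc means data=wkcmp.mydata;
run;
```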

Examples

To illustrate, let’s set up a library with disk access equivalent to the WORK library, and try a few examples.

First let’s see how the (uncompressed) WORK library is set up.

2          libname work list;
NOTE: Libref=   WORK 
      Scope=    Kernel  
      Engine=   V9
      Access=   TEMP
      Physical Name= C:\Users\hemken\AppData\Local\Temp\SAS Temporary 
      Files\_TD12828_SULPHUROUS_
      Filename= C:\Users\hemken\AppData\Local\Temp\SAS Temporary 
      Files\_TD12828_SULPHUROUS_
      Owner Name= PRIMO\hemken
      File Size=              0KB
      File Size (bytes)= 0

Next let’s set up a compressed library with the same disk access speed.

2          libname wkcmp "C:/temp" compress=binary;
NOTE: Libref WKCMP was successfully assigned as follows: 
      Engine:        V9 
      Physical Name: C:\temp

“Compression” can increase size

Using the COMPRESS option is not automatically a good thing! In this example, it actually increases the size of the data set.

2          data wkcmp.class;
3           set sashelp.class;
4           run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WKCMP.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WKCMP.CLASS increased size by 100.00 percent. 
      Compressed is 2 pages; un-compressed would require 1 pages.

The class data has only a few variables, and they are mostly numeric. There is not a lot here to compress!

Character data can compress well

In an example with only a few variables, but mainly composed of character data, we get good compression. Here is what a few observations of the data look like.

proc print data=sashelp.eismsg(obs=5);
run;
     Obs    MSGID    MNEMONIC                    LINENO    LEVEL

       1      24     IN_NOT_AVAILABLE               1        N  
       2      29     IN_FETCH_MESSAGE_FAILED        1        E  
       3      36     IO_DS_NOT_REGISTERED           1        E  
       4      37     IO_DS_NOT_EXIST_WARN           1        W  
       5      39     IO_MBNAME_SET_TO_CURRENT       1        W  

     Obs    TEXT

       1    This option is not yet available.                         
       2    %LAttempt to fetch message: %$ failed.                    
       3    %1$ is not registered.                                    
       4    %L%1$ does not exist.                                     
       5    %1$ is missing. It has been set to the current repository.

     Obs          PBUTTONS

       1    SASHELP.FSP.OK.SLIST
       2    SASHELP.FSP.OK.SLIST
       3    SASHELP.FSP.OK.SLIST
       4    SASHELP.FSP.OK.SLIST
       5    SASHELP.FSP.OK.SLIST

And it compresses very well.

2          data wkcmp.eismsg;
3           set sashelp.eismsg;
4           run;

NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
NOTE: The data set WKCMP.EISMSG has 1470 observations and 6 variables.
NOTE: Compressing data set WKCMP.EISMSG decreased size by 42.86 percent. 
      Compressed is 4 pages; un-compressed would require 7 pages.

Compression is relative to observation length

Consider a data set with 10,000 variables and 10,000 observations. All the data values are numeric (and randomly generated).
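The code that generated this example data set is not shown; a sketch of how such a data set might be created (the variable names x1-x10000 are assumed) is:

```sas
/* Sketch: 10,000 observations, each with 10,000 random numeric values. */
data work.example1;
   array x{10000};               /* creates numeric variables x1-x10000 */
   do rownum = 1 to 10000;
      do i = 1 to 10000;
         x{i} = rand("uniform"); /* randomly generated values */
      end;
      output;
   end;
   drop rownum i;
run;
```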

Tip

If you are experimenting with compressing a large data set, a reasonably big subset can tell you what you need to know.

Compressing all the observations gives us some reduction in file size.

2          data wkcmp.example1;
3           set work.example1;
4           run;

NOTE: There were 10000 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.EXAMPLE1 has 10000 observations and 10000 
      variables.
NOTE: Compressing data set WKCMP.EXAMPLE1 decreased size by 24.87 percent. 
      Compressed is 2507 pages; un-compressed would require 3337 pages.

Taking the same number of variables but fewer observations, we see about the same relative compression.

2          data wkcmp.example1;
3           set work.example1(obs=500);
4           run;

NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.EXAMPLE1 has 500 observations and 10000 variables.
NOTE: Compressing data set WKCMP.EXAMPLE1 decreased size by 23.98 percent. 
      Compressed is 130 pages; un-compressed would require 171 pages.

More variables can mean more compression

Whether or not compression provides a benefit depends very much on the specific data set.

With this data set, keeping only a small number of variables means compression increases the file size (but compare the character data example, above).

2          data wkcmp.lessvars;
3           set work.example1(obs=500 keep=x1-x10);
4           run;

NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 10 variables.
NOTE: Compressing data set WKCMP.LESSVARS increased size by 100.00 
      percent. 
      Compressed is 2 pages; un-compressed would require 1 pages.

With more variables, the compressed data set is about the same size as the original data set - we just break even.

2          data wkcmp.lessvars;
3           set work.example1(obs=500 keep=x1-x75);
4           run;

NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 75 variables.
NOTE: Compressing data set WKCMP.LESSVARS decreased size by 0.00 percent. 
      Compressed is 5 pages; un-compressed would require 5 pages.

With even more variables - more data values in each observation - we finally see a reduced file size from compression.

2          data wkcmp.lessvars;
3           set work.example1(obs=500 keep=x1-x1000);
4           run;

NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 1000 variables.
NOTE: Compressing data set WKCMP.LESSVARS decreased size by 12.50 percent. 
      Compressed is 14 pages; un-compressed would require 16 pages.

Compression == slower processing

To see the trade-off in file size versus processing speed, consider how long it takes to generate 15 data sets like the example above, first in library WORK, and then in library wkcmp.
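The timing code itself is not shown; a minimal sketch of one such comparison (a single replication rather than the 15 used here, with hypothetical output data set names) might look like:

```sas
/* Time one uncompressed write, then one compressed write,
   using the datetime() function before and after each step. */
%let t0 = %sysfunc(datetime());
data work.copy1;                  /* WORK is uncompressed */
   set work.example1;
run;
%let t1 = %sysfunc(datetime());
%put Uncompressed elapsed: %sysevalf(&t1 - &t0) seconds;

%let t0 = %sysfunc(datetime());
data wkcmp.copy1;                 /* WKCMP has compress=binary */
   set work.example1;
run;
%let t1 = %sysfunc(datetime());
%put Compressed elapsed: %sysevalf(&t1 - &t0) seconds;
```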

When we compare the elapsed times, we see that using compressed data is slower: writing to the compressed library took about 1.73 seconds longer per data set than the 0.66-second uncompressed baseline, roughly 3.6 times as long.

                             The GLM Procedure
 
               Dependent Variable: time   Elapsed time (sec)

                                               Standard
Parameter                      Estimate           Error  t Value  Pr > |t|

Intercept                   0.660866674 B    0.00406925   162.41    <.0001
library   compressed        1.726066653 B    0.00575478   299.94    <.0001
library   not compressed    0.000000000 B     .              .       .    

[Boxplot of elapsed times omitted]