Using Compressed Data in SAS
SAS has a variety of tools for working with compressed data. This article will describe how to use them, and why.
Compression programs look for patterns in the data, and then replace the original file with a file that describes those patterns. Nothing is lost: the description contains all the information needed to recreate the original file. Normally the description is smaller than the original file, but how much smaller will depend on the data itself and the compression scheme used. With the compression schemes built into SAS, the “compressed” file can sometimes be bigger than the original!
It takes CPU time to compress or uncompress a file. Compression trades CPU power for disk space. For files you use constantly, this may not be a good trade.
We strongly encourage you to compress any data sets you are not using on a regular basis.
Compressing SAS Data Sets
You can turn on SAS file compression at three levels.
Level | Code example |
---|---|
system | options compress=binary; |
library | libname z "Z:/SAS" compress=binary; |
data set | data output-data(compress=binary); |
Setting the COMPRESS system option means any data set you create will be compressed. Setting the LIBRARY option means all data created in that library will be compressed. And the DATA step option means that particular data set will be compressed.
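For example, a complete DATA step using the data-set-level option might look like this (a minimal sketch; sashelp.cars and the output name cars_cmp are just stand-ins for your own data):

/* Create a binary-compressed copy of a data set */
data work.cars_cmp(compress=binary);
   set sashelp.cars;
run;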
The COMPRESS option can take one of two values: binary or char (yes is equivalent to char). The char compression is likely to be the most efficient if your variables are largely character, while binary compression works with both numeric and character data.
Compression works observation by observation, so the results depend on how much data is in each observation. However, it is very difficult to predict which scheme will work better for a particular data set. You may want to experiment and see.
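One way to experiment (a sketch, using sashelp.cars as a stand-in for your own data) is to write the same data twice, once with each scheme, and compare the compression notes in the log:

/* Character (RLE) compression */
data work.cars_char(compress=char);
   set sashelp.cars;
run;

/* Binary (RDC) compression */
data work.cars_binary(compress=binary);
   set sashelp.cars;
run;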
To use a compressed data set takes no special syntax whatsoever. SAS recognizes that the data set is compressed and uncompresses each observation automatically as it reads it.
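For instance, an ordinary PROC step reads a compressed data set exactly as it would an uncompressed one (wkcmp.class here refers to the compressed copy created in the examples below):

/* No special syntax: SAS uncompresses each observation as it reads it */
proc means data=wkcmp.class;
   var age height weight;
run;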
Examples
To illustrate, let’s set up a library with disk access equivalent to the WORK library, and try a few examples.
First let’s see how the (uncompressed) WORK library is set up.
2 libname work list;
NOTE: Libref= WORK
Scope= Kernel
Engine= V9
Access= TEMP
Physical Name= C:\Users\hemken\AppData\Local\Temp\SAS Temporary
Files\_TD12828_SULPHUROUS_
Filename= C:\Users\hemken\AppData\Local\Temp\SAS Temporary
Files\_TD12828_SULPHUROUS_
Owner Name= PRIMO\hemken
File Size= 0KB
File Size (bytes)= 0
Next let’s set up a compressed library with the same disk access speed.
2 libname wkcmp "C:/temp" compress=binary;
NOTE: Libref WKCMP was successfully assigned as follows:
Engine: V9
Physical Name: C:\temp
“Compression” can increase size
Using the COMPRESS option is not automatically a good thing! In this example, it actually increases the size of the data set.
2 data wkcmp.class;
3 set sashelp.class;
4 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WKCMP.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WKCMP.CLASS increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
The class data has only a few variables, and they are mostly numeric. There is not a lot here to compress!
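If you want to check how a data set was stored, PROC CONTENTS reports the compression scheme in its output (a quick check, not part of the original example):

proc contents data=wkcmp.class;
run;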
Character data can compress well
In an example with only a few variables, but mainly composed of character data, we get good compression. Here is what a few observations of the data look like.
proc print data=sashelp.eismsg(obs=5);
run;
Obs MSGID MNEMONIC LINENO LEVEL
1 24 IN_NOT_AVAILABLE 1 N
2 29 IN_FETCH_MESSAGE_FAILED 1 E
3 36 IO_DS_NOT_REGISTERED 1 E
4 37 IO_DS_NOT_EXIST_WARN 1 W
5 39 IO_MBNAME_SET_TO_CURRENT 1 W
Obs TEXT
1 This option is not yet available.
2 %LAttempt to fetch message: %$ failed.
3 %1$ is not registered.
4 %L%1$ does not exist.
5 %1$ is missing. It has been set to the current repository.
Obs PBUTTONS
1 SASHELP.FSP.OK.SLIST
2 SASHELP.FSP.OK.SLIST
3 SASHELP.FSP.OK.SLIST
4 SASHELP.FSP.OK.SLIST
5 SASHELP.FSP.OK.SLIST
And it compresses very well.
2 data wkcmp.eismsg;
3 set sashelp.eismsg;
4 run;
NOTE: There were 1470 observations read from the data set SASHELP.EISMSG.
NOTE: The data set WKCMP.EISMSG has 1470 observations and 6 variables.
NOTE: Compressing data set WKCMP.EISMSG decreased size by 42.86 percent.
Compressed is 4 pages; un-compressed would require 7 pages.
Compression is relative to observation length
Consider a data set with 10,000 variables and 10,000 observations. All the data values are numeric (and randomly generated).
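The code that generated this data set is not shown; something along these lines would produce a data set of that shape (a sketch: the name work.example1 matches the log below, but the random-number details are assumptions):

/* 10,000 observations of 10,000 random numeric variables */
data work.example1;
   array x{10000} x1-x10000;
   do i = 1 to 10000;
      do j = 1 to 10000;
         x{j} = rannor(0);
      end;
      output;
   end;
   drop i j;
run;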
If you are experimenting with compressing a large data set, a reasonably big subset can tell you what you need to know.
Compressing all the observations gives us some reduction in file size.
2 data wkcmp.example1;
3 set work.example1;
4 run;
NOTE: There were 10000 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.EXAMPLE1 has 10000 observations and 10000
variables.
NOTE: Compressing data set WKCMP.EXAMPLE1 decreased size by 24.87 percent.
Compressed is 2507 pages; un-compressed would require 3337 pages.
Taking the same number of variables but with fewer observations, we see the same (relative) compression.
2 data wkcmp.example1;
3 set work.example1(obs=500);
4 run;
NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.EXAMPLE1 has 500 observations and 10000 variables.
NOTE: Compressing data set WKCMP.EXAMPLE1 decreased size by 23.98 percent.
Compressed is 130 pages; un-compressed would require 171 pages.
More variables can mean more compression
Whether or not compression provides a benefit depends very much on the specific data set.
With this data set, compressing only a small number of variables actually increases the data set size (but compare the character data example, above).
2 data wkcmp.lessvars;
3 set work.example1(obs=500 keep=x1-x10);
4 run;
NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 10 variables.
NOTE: Compressing data set WKCMP.LESSVARS increased size by 100.00
percent.
Compressed is 2 pages; un-compressed would require 1 pages.
With more variables, the compressed data set is about the same size as the original data set; we just break even.
2 data wkcmp.lessvars;
3 set work.example1(obs=500 keep=x1-x75);
4 run;
NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 75 variables.
NOTE: Compressing data set WKCMP.LESSVARS decreased size by 0.00 percent.
Compressed is 5 pages; un-compressed would require 5 pages.
With even more variables, and so more data values in each observation, we finally see a reduced file size from compression.
2 data wkcmp.lessvars;
3 set work.example1(obs=500 keep=x1-x1000);
4 run;
NOTE: There were 500 observations read from the data set WORK.EXAMPLE1.
NOTE: The data set WKCMP.LESSVARS has 500 observations and 1000 variables.
NOTE: Compressing data set WKCMP.LESSVARS decreased size by 12.50 percent.
Compressed is 14 pages; un-compressed would require 16 pages.
Increased compression == slower processing
To see the trade-off in file size versus processing speed, consider how long it takes to generate 15 data sets like the example above, first in library WORK, and then in library wkcmp.
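The benchmark code itself is not shown; a sketch of one way to run such a timing experiment is below (the macro name, data set sizes, and use of rannor are all assumptions):

%macro timegen(lib=, reps=15);
   %do r = 1 %to &reps;
      %let start = %sysfunc(datetime());
      /* Generate one data set of random numeric values in the given library */
      data &lib..bench&r;
         array x{1000} x1-x1000;
         do i = 1 to 500;
            do j = 1 to 1000;
               x{j} = rannor(0);
            end;
            output;
         end;
         drop i j;
      run;
      %put Elapsed %sysevalf(%sysfunc(datetime()) - &start) seconds for &lib..bench&r;
   %end;
%mend timegen;

%timegen(lib=work)
%timegen(lib=wkcmp)

Each %put line records the elapsed wall-clock time for one data set, which can then be collected and compared across the two libraries.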
When we compare the elapsed times, we see that using compressed data is slower.
The GLM Procedure
Dependent Variable: time (Elapsed time, in seconds)

Parameter                 Estimate         Standard Error   t Value   Pr > |t|
Intercept                 0.660866674 B    0.00406925       162.41    <.0001
library compressed        1.726066653 B    0.00575478       299.94    <.0001
library not compressed    0.000000000 B    .                .         .

The intercept is the mean time to create a data set in the uncompressed library (about 0.66 seconds), and the compressed library adds about 1.73 seconds per data set (roughly 2.4 seconds in total), so the smaller files come at the cost of extra CPU time on every write.