SAS optimization tips (2): reducing data storage space, LENGTH, COMPRESS, REUSE, DATA step views
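1: Controlling the storage size of SAS data
1.1: Reducing the storage space of character variables
How does SAS store character variables?
For a character variable created by assignment, the length is taken from the first value assigned.
For example, name='yi' gives name a length of 2; a later assignment name='can' then stores only 'ca'.
For data read with DATALINES or from an external file, SAS gives character variables a default length of 8 when no length or informat is specified.
Use the LENGTH statement to change the length. In a DATA step it only takes effect if it appears before the first reference to the variable.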
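The truncation described above can be reproduced with a minimal sketch. The data set names work.truncated and work.fixed are hypothetical, chosen only for illustration.

```sas
/* Without LENGTH: the first assignment fixes the length of name at 2. */
data work.truncated;
   name = 'yi';
   output;
   name = 'can';      /* stored as 'ca' because name has length 2 */
   output;
run;

/* With LENGTH declared before the first reference to name. */
data work.fixed;
   length name $ 3;
   name = 'yi';
   output;
   name = 'can';      /* stored in full */
   output;
run;

/* The variable list shows the length each data set ended up with. */
proc contents data=work.truncated;
run;

proc contents data=work.fixed;
run;
```

Moving the LENGTH statement below the first assignment in work.fixed would leave the length at 2, which is why its position matters.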
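1.2: Reducing the storage space of numeric variables
How does SAS store numeric variables?
By default, SAS numeric variables have a length of 8 bytes.
Changing the length with LENGTH, and where the change applies: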
Numeric variables always have a length of 8 bytes in the program data vector and during processing.
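In other words, the LENGTH statement does not affect numeric values while they are read and processed; in the program data vector they always occupy 8 bytes.
Keep in mind that the LENGTH statement affects the length of a numeric variable only in the output data set.
It only changes the length of the data that is written out.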
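You should never use the LENGTH statement to reduce the length of your numeric variables if the values are not integers.
Shortening a non-integer value truncates its stored representation and loses precision, so LENGTH should only be used to shorten variables that hold integers.
Each length also has a limited range of integers that it can store exactly, so large integers can lose precision too. On Windows and UNIX, for example, a length of 3 bytes is exact only up to 8,192.
To check whether precision was lost after using LENGTH, compare the shortened data set with the original using PROC COMPARE.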
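A minimal sketch of the PROC COMPARE check follows. The data set names and the choice of length 4 are assumptions made for illustration; the values are deliberately non-integers so that shortening them should show up as differences.

```sas
/* Original data: numeric values stored at the default length of 8. */
data work.full;
   do id = 1 to 5;
      amount = id + 0.1;    /* non-integer values */
      output;
   end;
run;

/* Shortened copy: amount is written to the output data set with length 4. */
data work.short;
   length amount 4;         /* hypothetical length, chosen for illustration */
   set work.full;
run;

/* Compare the two data sets; variables are matched by name. */
proc compare base=work.full compare=work.short;
run;
```

If the report lists unequal values for amount, the shorter length is not safe for that variable. Repeating the test with integer values that fit the chosen length should report no value differences.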
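2: Compressing data files
Compressing a data file makes it smaller, which reduces the number of I/O operations. However, every read requires a decompression step, which increases CPU consumption. Also, compression is not guaranteed
to shrink the file; it can make it larger.
What does a data file look like before compression?
1: Every value of a given variable occupies the same number of bytes, and every observation occupies the same number of bytes.
2: Character values are padded with blanks; numeric values are padded with binary zeros.
3: Each page has a 16-byte overhead.
4: The descriptor portion is stored at the end of the first page.
What does it look like after compression? Compressed data files: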
1: Treat an observation as a single string of bytes by ignoring variable types and boundaries.
2: Collapse consecutive repeating characters and numbers into fewer bytes.
3: Contain a 28-byte overhead at the beginning of each page.
4: Contain a 12-byte- or 24-byte-per-observation overhead following the page overhead. This space is used for deletion status, compressed length, pointers, and flags.
Which data sets are good candidates for compression?
It is large.
It contains many long character values.
It contains many values that have repeated characters or binary zeros.
It contains many missing values.
It contains repeated values in variables that are physically stored next to one another.
Which data sets are poor candidates for compression? Those with:
Few repeated characters.
Small physical size.
Few missing values.
Short text strings.
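How COMPRESS= is specified
COMPRESS=YES or CHAR suits data sets with many numeric variables that contain many zero values, or with many character variables whose values contain many blanks.
COMPRESS=BINARY is especially suited to medium-to-large data sets that are mostly numeric.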
Use binary compression only if the observation length is several hundred bytes or more.
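It is best to avoid direct access (access by observation number) on compressed data; the POINTOBS= data set option controls whether such access is available.
POINTOBS= only has an effect on compressed data sets.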
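The two compression settings can be tried with a sketch like the one below. The data set names, variable names, and row counts are hypothetical; the data is generated only so that the example is self-contained.

```sas
/* Long character variable holding short values: mostly blank padding, */
/* which is the kind of data COMPRESS=CHAR is meant for.               */
data work.notes (compress=char);
   length comment $ 200;
   do id = 1 to 100000;
      comment = 'n/a';
      output;
   end;
run;

/* Wide numeric observations (several hundred bytes each) full of zeros, */
/* compressed with COMPRESS=BINARY. POINTOBS=NO turns off access by      */
/* observation number for this compressed data set.                      */
data work.measures (compress=binary pointobs=no);
   array x{100} x1-x100;
   do id = 1 to 100000;
      do i = 1 to 100;
         x{i} = 0;
      end;
      output;
   end;
   drop i;
run;
```

After each step the SAS log reports by what percentage compression decreased or increased the size of the data set, which is the simplest way to check whether compression paid off for a particular file.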
The REUSE= option, like COMPRESS=, comes in two forms (a system option and a data set option).
Purpose: if the REUSE= option is set to YES, observations that are added to the SAS data set are inserted wherever enough free space exists, instead of at the end of the SAS data set.
Description: it tracks and reuses free space within the data set when you delete or update observations.
Setting REUSE=YES implies POINTOBS=NO.
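A minimal sketch of REUSE=YES on a compressed data set follows. The data set and variable names are hypothetical, and the example only sets up the situation described above (deleted observations leaving free space, then new observations being added).

```sas
/* Compressed data set that tracks and reuses freed space. */
data work.customers (compress=yes reuse=yes);
   length name $ 50;
   do id = 1 to 1000;
      name = cats('customer', id);
      output;
   end;
run;

/* Deleting observations frees space inside the data set. */
proc sql;
   delete from work.customers where mod(id, 2) = 0;
quit;

/* A new observation to add (same variables as work.customers). */
data work.new_customers;
   length name $ 50;
   name = 'customer1001';
   id = 1001;
   output;
run;

/* With REUSE=YES the new observation can be placed in freed space */
/* rather than at the end of the data set.                         */
proc append base=work.customers data=work.new_customers;
run;
```

Because new observations may land anywhere there is room, the physical order of the data set should not be relied on after appending.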
3: Views
A view contains only descriptor information about the data and instructions on how to retrieve data values that are stored elsewhere.
A DATA step view can be created only in a DATA step. A DATA step view cannot contain global statements, host-specific data set options, or most host-specific FILE and INFILE statements. Also, a DATA step view cannot be indexed or compressed.
The VIEW= option tells SAS to compile, but not to execute, the source program and to store the compiled code in the input DATA step view that is named in the option.
Describing a view
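As a minimal sketch of creating and using a DATA step view, assuming hypothetical data set names work.sales and work.big_sales:

```sas
/* Source data that the view will read from. */
data work.sales;
   do id = 1 to 10;
      amount = id * 100;
      output;
   end;
run;

/* VIEW= names the view to create; the view name must also appear */
/* in the DATA statement. This step compiles and stores the       */
/* program but does not read work.sales yet.                      */
data work.big_sales / view=work.big_sales;
   set work.sales;
   where amount > 500;
run;

/* The stored program runs here, when the view is referenced. */
proc print data=work.big_sales;
run;
```

Since the view stores instructions rather than values, changes to work.sales are picked up the next time the view is read.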
data view=company.newdata;
describe;
run;