sas优化技巧(3)排序
1:防止不必要的排序
下面四种方式可以防止排序的进行
1.1:BY-group processing with an index to avoid a sort
在以下情况下,by语句不会用索引
the BY statement includes the DESCENDING or NOTSORTED option or if SAS detects that the data file is physically stored in sorted order on the BY variables
索引列用来进行排序的利弊
弊:
1:It is generally less efficient than sequentially reading a sorted data set because
processing BY groups typically means retrieving the entire file.
2:It requires storage space for the index.
1.2:BY-group processing with the NOTSORTED option/GROUPFORMAT option
by variable option;
The NOTSORTED option specifies that observations that have the same BY value are grouped together but are not necessarily sorted in alphabetical or numeric order.
The NOTSORTED option works best when observations that have the same BY value are stored together.
注意事项:
The NOTSORTED option turns off sequence checking. If your data is not grouped,using the NOTSORTED option can produce a large amount of output.
The NOTSORTED option cannot be used with the MERGE or UPDATE statements
Groupformat option
The GROUPFORMAT option uses the formatted values of a variable instead of the internal values to determine where a BY group begins and ends
意思就是用格式化后的变量值来进行分组,而不是用原来的数据集的值
by order_date groupformat notsorted;
1.3:a CLASS statement
提前对变量进行排序对class语句几乎没有什么帮助,但是对by语句有很大的帮助
1.4: the SORTEDBY= data set option.
If you are working with input data that is already sorted, you can specify how the data is ordered by using the SORTEDBY= data set option.
Although the SORTEDBY= option does not sort a data set, it sets the value of the Sorted flag. It does not set the value of the Validated sort flag. (PROC SORT sets the Validated sort flag.)
data company.transactions (sortedby=invoice);invoice为排序好的列,这个选项表示这个列已被排好序。
排序对空间的要求
When data is sorted, SAS requires enough space in the data library for two copies ofthe data ?le that is being sorted as well as additional workspace,等于是原数据集*4的空间,这针对的是use disk space in order to sort the data
2:多线程排序
PROC SORT SAS-data-set-name THREADS | NOTHREADS;
进行多线程排序的策略
When a threaded sort is used, the observations in the input data set are divided intoequal temporary subsets, based on the number of processors that are allocated to theSORT procedure. Each subset is then sorted on a different processor. The sortedsubsets are then interleaved to re-create the sorted version of the input data set.
设置多余实际CPU数量会降低运行效率
CPUCOUNT= n | ACTUAL;
3:大数据集排序
对于大数据集的排序,如果空间不够可以分块进行
合并时,如果是用obs进行分割,则不能用append来合并
五种分割在advance上面看吧。。。。
用tagsort进行排序,不支持多线程
PROC SORT DATA=SAS-data-set-name TAGSORT;
原理:The TAGSORT option stores only the BY variables and the observation numbers in temporary files. The BY variables and the observation numbers are called tags. At the completion of the sorting process, PROC SORT uses the tags to retrieve records from the input data set in sorted order.
时间多,空间少,对比第一种
和正常排序比较,如果数据集序列混乱则用到的时间多很多,I/O也是
但是如果基本排好序,那么时间 I/O都只会多一点点
PROC SORT DATA=SAS-data-set-name TAGSORT;
The TAGSORT optionstores only the BY variables and the observation numbers in temporary ?les. The BYvariables and the observation numbers are called tags.At the completion of the sortingprocess, PROC SORT uses the tags to retrieve records from the input data set in sortedorder.
4:高效删除重复值
4.1:Using the NODUPKEY Option
PROC SORT compares all BY-variable values for each observation to those for the previous observation that was written to the output data set
PROC SORT DATA=SAS-data-set-name NODUPKEY;
4.2:Using the NODUPRECS /nodup Option
the NODUPRECS option compares all of the variable values for each observation to those for the previous observation that was written to the output data set.
PROC SORT DATA=SAS-data-set-name NODUPRECS;
Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove allduplicates with this option by sorting on all variables.(这个选项只对连续的重复值有效,不连续的就不会消除)
4.3:Using the EQUALS | NOEQUALS Option
EQUALS maintains the order from the input data set in the output data set.NOEQUALS does not necessarily preserve this order in the output data set.NOEQUALS can save CPU time and memory resources.
这里要这样理解,对于这样两条数据
1 2
1 3
进行这样的程序 proc sort data=old out=new nodupkey equal/unequal; by ,..;run;
如果是equal那么会保留1 2
如果是unequal则会保留1 3
5:host sort utility
Host sort utilities are third-party sort packages that are available in some operating environments. In some cases, using a host sort utility with PROC SORT might be more efficient than using the SAS sort utility with PROC SORT.(是一个第三方包,对比特定数据集用起来效果会比proc sort好)
5.1:Using the SORTPGM= System Option
tells SAS whether to use the SAS sort, to use the host sort, or to determine which sort utility is best for the data set.
指定用哪种排序策略或者让sas自己选择最好的
5.2:Using the SORTCUTP= System Option
The SORTCUTP= system option specifies the number of bytes above which the host sort utility is used instead of the SAS sort utility.
OPTIONS SORTCUTP=n / nK / nM / nG / MIN / MAX / hexX;
5.3:Using the SORTCUT= System Option
Beginning with SAS 9, the SORTCUT= system option can be used to specify the number of observations above which the host sort utility is used instead of the SAS sortutility.
OPTIONS SORTCUT=n / nK / nM / nG / MIN / MAX / hexX;
5.4:Using the SORTNAME= System Option
The SORTNAME= option specifies the host sort utility that will be used if the value of SORTPGM= is BEST or HOST.
OPTIONS SORTNAME=host-sort-utility name;
options sortpgm=best sortcutp=10000 sortname=syncsort;