Weka Data Mining: Attribute Selection


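1 Attribute selection

Attribute selection searches through all possible combinations of attributes in the data to find the subset that predicts best. Selecting attributes by hand is tedious and error-prone, so Weka provides a Select attributes panel to automate the task. Automatic selection requires two objects to be set up: an attribute evaluator and a search method.

[Figure: the Select attributes panel]

The attribute evaluator determines which method is used to assign a worth to each attribute, and the search method determines what style of search is performed.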

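2 Overview of the attribute selection algorithms

2-1 Attribute subset evaluators

An attribute subset evaluator takes a subset of attributes and returns a numeric measure that guides the search.
The CfsSubsetEval evaluator assesses the predictive ability of each attribute and the degree of redundancy among the attributes, preferring sets of attributes that are highly correlated with the class but have low correlation with each other. An option iteratively adds the attributes most highly correlated with the class, provided the subset does not already contain an attribute whose correlation with the attribute in question is even higher. Missing values can be treated as a separate value, or their counts can be distributed among the other values in proportion to frequency.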
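The trade-off between class correlation and redundancy can be made concrete with the merit heuristic from the CFS literature, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean attribute-class correlation and r_ff the mean attribute-attribute correlation. The sketch below is only an illustration: the function name and the correlation values are made up, and it does not show how Weka computes the correlations internally.

```python
from math import sqrt

def cfs_merit(class_corrs, pair_corrs):
    """CFS-style merit of a subset (illustrative helper, not Weka's code).

    class_corrs: one attribute-class correlation per attribute in the subset
    pair_corrs:  one attribute-attribute correlation per unordered pair
    """
    k = len(class_corrs)
    if k == 0:
        return 0.0
    r_cf = sum(class_corrs) / k
    r_ff = sum(pair_corrs) / len(pair_corrs) if pair_corrs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Two attributes, equally correlated with the class (made-up values).
redundant = cfs_merit([0.6, 0.6], [0.9])      # strongly inter-correlated
complementary = cfs_merit([0.6, 0.6], [0.1])  # nearly independent
print(redundant, complementary)
```

With identical class correlations, the nearly independent pair gets the higher merit, which is the preference described above.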
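WrapperSubsetEval is the wrapper method. It uses a classifier to evaluate attribute sets, estimating the accuracy of the learning scheme on each subset by cross-validation.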
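The wrapper idea can be sketched in a few lines: the score of a subset is the cross-validated accuracy of a classifier that sees only those attributes. Everything here is an assumption made for illustration: a 1-nearest-neighbour rule stands in for the real learning scheme, folds are assigned by index rather than stratified, and the data is a made-up toy set in which attribute 0 is informative and attribute 1 is noise.

```python
def nearest_label(train, query, subset):
    """Predict with 1-nearest-neighbour, using only the attributes in subset."""
    def dist(item):
        x, _ = item
        return sum((x[j] - query[j]) ** 2 for j in subset)
    return min(train, key=dist)[1]

def wrapper_merit(rows, labels, subset, folds=5):
    """Cross-validated accuracy of the stand-in classifier on one subset."""
    data = list(zip(rows, labels))
    correct = 0
    for f in range(folds):
        # Simple index-based folds (hypothetical choice, not stratified).
        test = [d for i, d in enumerate(data) if i % folds == f]
        train = [d for i, d in enumerate(data) if i % folds != f]
        correct += sum(nearest_label(train, x, subset) == y for x, y in test)
    return correct / len(data)

# Toy data: attribute 0 separates the classes, attribute 1 does not.
rows = [(1.0, 5.0), (1.2, 0.1), (0.9, 3.0), (1.1, 9.0), (0.8, 7.0),
        (3.0, 5.1), (3.2, 0.2), (2.9, 3.1), (3.1, 9.1), (2.8, 7.1)]
labels = ["bad"] * 5 + ["good"] * 5

print(wrapper_merit(rows, labels, [0]))
print(wrapper_merit(rows, labels, [1]))
```

Because every candidate subset costs a full cross-validation, the wrapper is much slower than a filter such as CFS, but it scores subsets with the same kind of model that will be used afterwards.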

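2-2 Single-attribute evaluators

Single-attribute evaluators are used together with the Ranker search method, which produces a ranked list of a given number of attributes after discarding the rest.
ReliefFAttributeEval is an instance-based evaluator: it samples instances at random and examines neighbouring instances of the same and of different classes. It works on both discrete and continuous data. Its parameters specify the number of instances to sample, the number of neighbours to check, whether to weight neighbours by distance, and an exponential function that controls how quickly the weights decay with distance.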
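A stripped-down version of the Relief idea may help here. The sketch below assumes numeric attributes and two classes, checks a single nearest hit and a single nearest miss per sampled instance, and applies no distance weighting, so it is far simpler than the configurable evaluator described above. The function name, the seed, and the toy data are hypothetical.

```python
import random

def relief_weights(rows, labels, n_samples=None, seed=1):
    """Basic Relief weights: reward attributes that differ on the nearest
    miss and penalise attributes that differ on the nearest hit."""
    n, n_attrs = len(rows), len(rows[0])
    # Scale each attribute difference to [0, 1] by the attribute's range.
    spans = [(max(r[a] for r in rows) - min(r[a] for r in rows)) or 1.0
             for a in range(n_attrs)]

    def diff(a, x, y):
        return abs(x[a] - y[a]) / spans[a]

    def distance(x, y):
        return sum(diff(a, x, y) for a in range(n_attrs))

    m = n if n_samples is None else n_samples
    sampled = random.Random(seed).sample(range(n), m)
    weights = [0.0] * n_attrs
    for i in sampled:
        others = [j for j in range(n) if j != i]
        hit = min((j for j in others if labels[j] == labels[i]),
                  key=lambda j: distance(rows[i], rows[j]))
        miss = min((j for j in others if labels[j] != labels[i]),
                   key=lambda j: distance(rows[i], rows[j]))
        for a in range(n_attrs):
            weights[a] += (diff(a, rows[i], rows[miss])
                           - diff(a, rows[i], rows[hit])) / m
    return weights

# Toy data: attribute 0 separates the classes, attribute 1 does not.
rows = [(1.0, 5.0), (1.2, 0.1), (0.9, 3.0), (1.1, 9.0), (0.8, 7.0),
        (3.0, 5.1), (3.2, 0.2), (2.9, 3.1), (3.1, 9.1), (2.8, 7.1)]
labels = ["bad"] * 5 + ["good"] * 5
weights = relief_weights(rows, labels)
print(weights)
```

On this toy set the informative attribute ends up with a positive weight and the noise attribute with a negative one.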

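InfoGainAttributeEval evaluates attributes by measuring their information gain with respect to the class. It first discretizes numeric attributes using the MDL-based (minimum description length) discretization method; it can be set to binarize them instead.
GainRatioAttributeEval evaluates attributes by measuring their gain ratio with respect to the class.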
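The two measures differ only in a normalisation step, which a small sketch makes visible. It assumes nominal (already discretized) attributes with no missing values; the function names and the four-instance data are made up for illustration. Information gain is H(class) - H(class | attribute), and gain ratio divides that by H(attribute).

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def info_gain(attr, cls):
    """H(class) minus the expected class entropy after splitting on attr."""
    n = len(cls)
    remainder = 0.0
    for v in set(attr):
        part = [c for a, c in zip(attr, cls) if a == v]
        remainder += len(part) / n * entropy(part)
    return entropy(cls) - remainder

def gain_ratio(attr, cls):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(attr)
    return info_gain(attr, cls) / split_info if split_info > 0 else 0.0

cls = ["bad", "bad", "good", "good"]
two_valued = ["x", "x", "y", "y"]   # splits the classes cleanly
id_like = ["1", "2", "3", "4"]      # unique value per instance

print(info_gain(two_valued, cls), gain_ratio(two_valued, cls))
print(info_gain(id_like, cls), gain_ratio(id_like, cls))
```

Both attributes reach the same information gain, but the ID-like attribute is penalised by gain ratio because its many values carry a high split entropy.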

The other evaluators can be looked into when they are needed.

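2-3 Search methods

A search method traverses the attribute space looking for a good subset, with quality measured by the chosen attribute subset evaluator.
BestFirst performs greedy hill climbing with backtracking; the user can specify how many consecutive non-improving nodes must be encountered before the system backtracks. It can search forward from the empty set of attributes, backward from the full set, or start at an intermediate point and search in both directions (adding or deleting a single attribute at a time). Subsets that have already been evaluated can be cached for efficiency.
GreedyStepwise searches the space of attribute subsets greedily and does not backtrack.
Ranker is a scheme for ranking individual attributes.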
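The difference between a subset search and a ranking can be sketched as follows. The forward search is a minimal greedy hill climber without backtracking, which stops as soon as no single addition improves the merit; the stopping rule, the function names and the toy merit function are assumptions for illustration and are not taken from Weka's source. The ranking helper simply sorts per-attribute scores and keeps the top ones.

```python
def greedy_forward(n_attrs, merit):
    """Forward greedy search: add the best attribute until nothing improves."""
    selected, best = [], merit([])
    while True:
        candidates = [a for a in range(n_attrs) if a not in selected]
        if not candidates:
            break
        top_score, top_attr = max((merit(selected + [a]), a) for a in candidates)
        if top_score <= best:
            break  # no improvement, and no backtracking
        selected.append(top_attr)
        best = top_score
    return sorted(selected), best

def rank(scores, keep):
    """Ranker-style selection: attribute indices ordered by descending score."""
    order = sorted(range(len(scores)), key=lambda a: scores[a], reverse=True)
    return order[:keep]

def toy_merit(subset):
    """Hypothetical merit: attributes 0 and 2 help, every other one costs 0.1."""
    useful = {0, 2}
    hits = sum(1 for a in subset if a in useful)
    return hits - 0.1 * (len(subset) - hits)

print(greedy_forward(4, toy_merit))
print(rank([0.1, 0.7, 0.3], keep=2))
```

A best-first search would differ from `greedy_forward` by keeping a list of already visited subsets and returning to an earlier one when the current path stops improving.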

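3 Weka attribute selection: a worked example

The usual purpose of attribute selection is better classification, because attributes differ in how strongly they relate to the target class attribute.

Using the labor dataset labor.arff
CfsSubsetEval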

=== Run information ===

Evaluator:    weka.attributeSelection.CfsSubsetEval -P 1 -E 1
Search:       weka.attributeSelection.GreedyStepwise -T -1.7976931348623157E308 -N -1 -num-slots 1
Relation:     labor-neg-data
Instances:    57
Attributes:   17
              duration
              wage-increase-first-year
              wage-increase-second-year
              wage-increase-third-year
              cost-of-living-adjustment
              working-hours
              pension
              standby-pay
              shift-differential
              education-allowance
              statutory-holidays
              vacation
              longterm-disability-assistance
              contribution-to-dental-plan
              bereavement-assistance
              contribution-to-health-plan
              class
Evaluation mode:    evaluate on all training data



=== Attribute Selection on all input data ===

Search Method:
    Greedy Stepwise (forwards).
    Start set: no attributes
    Merit of best subset found:    0.363

Attribute Subset Evaluator (supervised, Class (nominal): 17 class):
    CFS Subset Evaluator
    Including locally predictive attributes

Selected attributes: 2,3,5,11,12,13,14 : 7
                     wage-increase-first-year
                     wage-increase-second-year
                     cost-of-living-adjustment
                     statutory-holidays
                     vacation
                     longterm-disability-assistance
                     contribution-to-dental-plan

WrapperSubsetEval evaluator

=== Run information ===

Evaluator:    weka.attributeSelection.WrapperSubsetEval -B weka.classifiers.trees.J48 -F 5 -T 0.01 -R 1 -E DEFAULT -- -C 0.25 -M 2
Search:       weka.attributeSelection.BestFirst -D 1 -N 5
Relation:     labor-neg-data
Instances:    57
Attributes:   17
              duration
              wage-increase-first-year
              wage-increase-second-year
              wage-increase-third-year
              cost-of-living-adjustment
              working-hours
              pension
              standby-pay
              shift-differential
              education-allowance
              statutory-holidays
              vacation
              longterm-disability-assistance
              contribution-to-dental-plan
              bereavement-assistance
              contribution-to-health-plan
              class
Evaluation mode:    evaluate on all training data



=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 138
    Merit of best subset found:    0.842

Attribute Subset Evaluator (supervised, Class (nominal): 17 class):
    Wrapper Subset Evaluator
    Learning scheme: weka.classifiers.trees.J48
    Scheme options: -C 0.25 -M 2 
    Subset evaluation: classification accuracy
    Number of folds for accuracy estimation: 5

Selected attributes: 1,2,4,6,11,12 : 6
                     duration
                     wage-increase-first-year
                     wage-increase-third-year
                     working-hours
                     statutory-holidays
                     vacation
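The classification runs further down apply the selections through Weka's Remove filter, whose -R option lists the 1-based indices of the attributes to drop. Deriving that list from the selected attributes is easy to get wrong by hand, so here is a small sketch; the helper names are hypothetical, and the class attribute (index 17) is always kept.

```python
def indices_to_remove(selected, n_attrs, class_index):
    """1-based indices that are neither selected nor the class attribute."""
    keep = set(selected) | {class_index}
    return [i for i in range(1, n_attrs + 1) if i not in keep]

def as_range_string(indices):
    """Collapse sorted indices into a range list such as '1,4,6-10'."""
    parts, start, prev = [], None, None
    for i in indices:
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)

cfs_selected = [2, 3, 5, 11, 12, 13, 14]
wrapper_selected = [1, 2, 4, 6, 11, 12]

cfs_range = as_range_string(indices_to_remove(cfs_selected, 17, 17))
wrapper_range = as_range_string(indices_to_remove(wrapper_selected, 17, 17))
print(cfs_range)
print(wrapper_range)
```

The two strings match the Remove-R suffixes that appear in the Relation lines of the filtered runs.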

Comparison: use the J48 classifier with 10-fold cross-validation to compare the CfsSubsetEval and WrapperSubsetEval evaluators.
Using the full attribute set directly

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     labor-neg-data
Instances:    57
Attributes:   17
              duration
              wage-increase-first-year
              wage-increase-second-year
              wage-increase-third-year
              cost-of-living-adjustment
              working-hours
              pension
              standby-pay
              shift-differential
              education-allowance
              statutory-holidays
              vacation
              longterm-disability-assistance
              contribution-to-dental-plan
              bereavement-assistance
              contribution-to-health-plan
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

wage-increase-first-year <= 2.5: bad (15.27/2.27)
wage-increase-first-year > 2.5
|   statutory-holidays <= 10: bad (10.77/4.77)
|   statutory-holidays > 10: good (30.96/1.0)

Number of Leaves  :     3

Size of the tree :  5


Time taken to build model: 0.04 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          42               73.6842 %
Incorrectly Classified Instances        15               26.3158 %
Kappa statistic                          0.4415
Mean absolute error                      0.3192
Root mean squared error                  0.4669
Relative absolute error                 69.7715 %
Root relative squared error             97.7888 %
Coverage of cases (0.95 level)          91.2281 %
Mean rel. region size (0.95 level)      85.9649 %
Total Number of Instances               57     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.700    0.243    0.609      0.700    0.651      0.444    0.695     0.559     bad
                 0.757    0.300    0.824      0.757    0.789      0.444    0.695     0.738     good
Weighted Avg.    0.737    0.280    0.748      0.737    0.740      0.444    0.695     0.675     

=== Confusion Matrix ===

  a  b   <-- classified as
 14  6 |  a = bad
  9 28 |  b = good

Using the Cfs result (attributes filtered first)

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     labor-neg-data-weka.filters.unsupervised.attribute.Remove-R1,4,6-10,15-16
Instances:    57
Attributes:   8
              wage-increase-first-year
              wage-increase-second-year
              cost-of-living-adjustment
              statutory-holidays
              vacation
              longterm-disability-assistance
              contribution-to-dental-plan
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

wage-increase-first-year <= 2.5: bad (15.27/2.27)
wage-increase-first-year > 2.5
|   longterm-disability-assistance = yes
|   |   statutory-holidays <= 10
|   |   |   wage-increase-first-year <= 3: bad (2.0)
|   |   |   wage-increase-first-year > 3: good (3.99)
|   |   statutory-holidays > 10: good (25.67)
|   longterm-disability-assistance = no
|   |   vacation = below_average: bad (5.09/1.09)
|   |   vacation = average: good (2.64/1.0)
|   |   vacation = generous: good (2.34)

Number of Leaves  :     7

Size of the tree :  12


Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          44               77.193  %
Incorrectly Classified Instances        13               22.807  %
Kappa statistic                          0.4935
Mean absolute error                      0.2787
Root mean squared error                  0.441 
Relative absolute error                 60.9191 %
Root relative squared error             92.3655 %
Coverage of cases (0.95 level)          89.4737 %
Mean rel. region size (0.95 level)      78.0702 %
Total Number of Instances               57     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.650    0.162    0.684      0.650    0.667      0.494    0.737     0.586     bad
                 0.838    0.350    0.816      0.838    0.827      0.494    0.733     0.777     good
Weighted Avg.    0.772    0.284    0.770      0.772    0.771      0.494    0.735     0.710     

=== Confusion Matrix ===

  a  b   <-- classified as
 13  7 |  a = bad
  6 31 |  b = good

Using the Wrapper result (attributes filtered first)

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     labor-neg-data-weka.filters.unsupervised.attribute.Remove-R3,5,7-10,13-16
Instances:    57
Attributes:   7
              duration
              wage-increase-first-year
              wage-increase-third-year
              working-hours
              statutory-holidays
              vacation
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------

wage-increase-first-year <= 2.5: bad (15.27/2.27)
wage-increase-first-year > 2.5
|   statutory-holidays <= 10
|   |   vacation = below_average: bad (7.54/1.54)
|   |   vacation = average: bad (0.0)
|   |   vacation = generous: good (3.23)
|   statutory-holidays > 10: good (30.96/1.0)

Number of Leaves  :     5

Size of the tree :  8


Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          46               80.7018 %
Incorrectly Classified Instances        11               19.2982 %
Kappa statistic                          0.5905
Mean absolute error                      0.2593
Root mean squared error                  0.4162
Relative absolute error                 56.6868 %
Root relative squared error             87.1592 %
Coverage of cases (0.95 level)          92.9825 %
Mean rel. region size (0.95 level)      78.9474 %
Total Number of Instances               57     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.800    0.189    0.696      0.800    0.744      0.594    0.775     0.608     bad
                 0.811    0.200    0.882      0.811    0.845      0.594    0.775     0.808     good
Weighted Avg.    0.807    0.196    0.817      0.807    0.810      0.594    0.775     0.738     

=== Confusion Matrix ===

  a  b   <-- classified as
 16  4 |  a = bad
  7 30 |  b = good

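Summary:
First: after attribute selection, classification accuracy improves, from 73.68% on the full attribute set to 77.19% with the Cfs subset and 80.70% with the Wrapper subset.
Second: in this example, Wrapper outperforms Cfs.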
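The accuracy and Kappa figures in the three runs can be rechecked from the confusion matrices alone. The sketch below uses the standard definitions (accuracy is the diagonal share, Cohen's kappa corrects it for chance agreement); the function name is made up, and the matrices are copied from the output above.

```python
def accuracy_and_kappa(matrix):
    """Accuracy and Cohen's kappa from a square confusion matrix."""
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column totals, per class.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(size)
    ) / total ** 2
    return observed, (observed - expected) / (1 - expected)

runs = {
    "full set": [[14, 6], [9, 28]],
    "Cfs": [[13, 7], [6, 31]],
    "Wrapper": [[16, 4], [7, 30]],
}
for name, matrix in runs.items():
    acc, kappa = accuracy_and_kappa(matrix)
    print(f"{name}: accuracy {acc:.4%}, kappa {kappa:.4f}")
```

All three runs use the same 57 instances, so a difference of two correctly classified instances moves the accuracy by about 3.5 percentage points; the ranking of the three variants rests on small counts.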

Posted 2016-01-23 21:19 by snowwolf101
