Feature Selection

 


Why

1. Curse of dimensionality: as the number of features grows, the data becomes sparse in the feature space and models generalize poorly, so we want to keep only the informative features.

 

 

 

Method

1. Filter: score each feature independently and keep the highest-scoring ones (criterion: information content).

  1.1 Chi-square test
  1.2 Correlation test
  1.3 Information gain

Comments: computationally cheap, but the selected features are usually not as good as a wrapper's. A minimal chi-square scoring sketch follows below.
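To make the filter idea concrete, here is a minimal chi-square scoring sketch in base R; the iris data and the 3-bin discretization are stand-in choices of mine, not from the original notes.

# Filter-style scoring: chi-square statistic of each (discretized) feature
# against the class label; a higher score means a more informative feature.
data(iris)
features <- setdiff(names(iris), "Species")

scores <- sapply(features, function(f) {
  binned <- cut(iris[[f]], breaks = 3)  # discretize the numeric feature
  unname(suppressWarnings(chisq.test(table(binned, iris$Species))$statistic))
})

sort(scores, decreasing = TRUE)  # keep the top-scoring features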

2. Wrapper: search over candidate feature subsets and evaluate a model on each subset (criterion: accuracy).
  2.1 Recursive feature elimination with cross-validation

Comments: the selected features are better, but the computational cost is higher and the search can overfit. A brute-force subset-search sketch follows below.
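For intuition, a minimal wrapper sketch that exhaustively fits a plain linear model on every subset of a few candidate features and compares adjusted R^2; the mtcars data, the candidate names, and the use of adjusted R^2 instead of cross-validated accuracy are simplifications of mine.

# Wrapper-style search: fit a model on every candidate subset, keep the best.
data(mtcars)
candidates <- c("wt", "hp", "qsec", "drat")

# All non-empty subsets of the candidate features.
subsets <- unlist(lapply(seq_along(candidates),
                         function(k) combn(candidates, k, simplify = FALSE)),
                  recursive = FALSE)

scores <- sapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = mtcars)
  summary(fit)$adj.r.squared
})

subsets[[which.max(scores)]]  # best-scoring feature subset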

3. Embedded: measure the contribution of each feature while the model itself is being fit.
  3.1 Lasso (see the sketch below)
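A minimal embedded-selection sketch with the Lasso, assuming the glmnet package; the package choice and the mtcars data are mine, not the original notes'. Coefficients shrunk to exactly zero correspond to dropped features.

library(glmnet)

data(mtcars)
x <- as.matrix(mtcars[, setdiff(names(mtcars), "mpg")])
y <- mtcars$mpg

cv_fit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 selects the Lasso penalty
coef(cv_fit, s = "lambda.min")        # zero coefficients = dropped features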

 

 

In Practice

1. XGBoost

print(xgb.importance(model = bst))  # bst is a trained xgb.Booster, not a raw matrix
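For context, a minimal end-to-end sketch with the classic xgboost() interface; the mtcars data, the objective string, and the parameter values are placeholder choices of mine.

library(xgboost)

data(mtcars)
x <- as.matrix(mtcars[, setdiff(names(mtcars), "mpg")])
y <- mtcars$mpg

# Train a small booster, then rank features by gain / cover / frequency.
bst <- xgboost(data = x, label = y, nrounds = 20,
               objective = "reg:squarederror", verbose = 0)
print(xgb.importance(feature_names = colnames(x), model = bst))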

 

2. Decision Tree

Information gain: splits are chosen by information gain, so features used near the top of the tree carry the most information about the target (a hand-rolled sketch follows below).
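A hand-rolled information gain sketch in base R; the iris data and the 3-bin discretization are stand-in choices of mine.

# Information gain = entropy of the labels minus the weighted entropy
# of the labels after splitting on the feature.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty classes to avoid 0 * log(0)
  -sum(p * log2(p))
}

info_gain <- function(feature, labels) {
  parts <- split(labels, feature)
  weighted <- sum(sapply(parts, function(s) length(s) / length(labels) * entropy(s)))
  entropy(labels) - weighted
}

data(iris)
binned <- cut(iris$Petal.Length, breaks = 3)  # discretize a numeric feature
info_gain(binned, iris$Species)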

 

3. Random Forest

library(party)
# Fit a conditional-inference random forest; inputData and ozone_reading come
# from the original example and are assumed to be defined in the session.
cf1 <- cforest(ozone_reading ~ ., data = inputData,
               control = cforest_unbiased(mtry = 2, ntree = 50))
varimp(cf1)  # permutation-based variable importance

4. Case 1: using caret

 4.1 cor(dataset[])

Compute the correlation matrix of the variables and pick out the coefficients >= 0.75; such highly correlated variables are candidates for removal (see the sketch below).
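A minimal sketch of this step with caret::findCorrelation, which returns the columns recommended for removal at the given cutoff; the mtcars data is a stand-in of mine.

library(caret)

data(mtcars)
cor_matrix <- cor(mtcars[, setdiff(names(mtcars), "mpg")])

# Columns whose pairwise correlation with another column exceeds 0.75.
high_cor <- findCorrelation(cor_matrix, cutoff = 0.75)
colnames(cor_matrix)[high_cor]  # candidates for removal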

 4.2 varImp

Compute the importance of each variable; it can be measured with different models, a random forest being a common choice (see the sketch below).
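A minimal sketch with caret::train fitting a random forest and caret::varImp reading off the importances; it assumes the caret and randomForest packages, and mtcars is a stand-in dataset of mine.

library(caret)

data(mtcars)
set.seed(7)
model <- train(mpg ~ ., data = mtcars, method = "rf",
               trControl = trainControl(method = "cv", number = 5))
print(varImp(model))  # importance of each predictor under the fitted model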

 4.3 Recursive Feature Elimination
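A minimal RFE sketch with caret::rfe and its random-forest helper functions; the data, subset sizes, and fold count below are placeholder choices of mine, assuming the caret and randomForest packages are installed.

library(caret)

data(mtcars)
x <- mtcars[, setdiff(names(mtcars), "mpg")]
y <- mtcars$mpg

set.seed(7)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
results <- rfe(x, y, sizes = c(2, 4, 6), rfeControl = ctrl)
predictors(results)  # the selected feature subset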

 

 

 

 
