Original article link

here

Summary of the original article

Personally, I think the most valuable part of this article is that it provides the automated feature selection methods mentioned above, all from sklearn:

  1. RFE: recursive feature elimination (requires a given K)
  2. RFECV: recursive feature elimination with cross-validation (given the number of CV folds, K is computed automatically)
  3. SelectKBest: select the K best features (requires a given K)

Second, to judge each feature's impact on the target during EDA, the article uses the following plots (essentially, all of them check whether a feature's distribution differs across target classes; the bigger the difference, the better):

  1. violinplot
  2. boxplot
  3. swarmplot
  4. kdeplot (covered earlier)

A few small plotting tools for EDA

Plot the first 10 features

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# first ten features
data = x
data_n_2 = (data - data.mean()) / (data.std())   # standardize each feature
data = pd.concat([y, data_n_2.iloc[:, 0:10]], axis=1)
# melt into long form so each row is (diagnosis, feature, value)
data = pd.melt(data, id_vars="diagnosis",
               var_name="features",
               value_name="value")
plt.figure(figsize=(10, 10))
sns.violinplot(x="features", y="value", hue="diagnosis",
               data=data, split=True, inner="quart")
# swap in a boxplot:
# sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
# or a swarmplot:
# sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)
plt.show()
(Figures omitted: violin plot and swarm plot of the first ten features.)
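
The list above also mentions kdeplot. A minimal sketch of that variant, assuming seaborn >= 0.11 and that x contains a radius_mean column (the column name is only for illustration):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# hypothetical column name 'radius_mean' -- substitute any feature in x
sns.kdeplot(data=pd.concat([y, x], axis=1), x="radius_mean",
            hue="diagnosis", fill=True, common_norm=False)
plt.show()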

Automated feature selection

See sklearn's dedicated feature selection page

SelectKBest (univariate statistical tests)

Similar to manually analyzing a single feature: we look at how it affects the target distribution, and the bigger the difference between classes, the higher the score.

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2  # chi-squared test statistic

# find the 5 best-scoring features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)
print('Score list:', select_feature.scores_)
print('Feature list:', x_train.columns.tolist())
# the features actually kept by the selector
print('Selected features:', x_train.columns[select_feature.get_support()].tolist())
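
One caveat: chi2 only accepts non-negative feature values, so if the data may be negative (e.g. after standardization), rescale it first. A minimal sketch using MinMaxScaler:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative values, so rescale each feature to [0, 1]
scaler = MinMaxScaler()
x_train_scaled = pd.DataFrame(scaler.fit_transform(x_train),
                              columns=x_train.columns)
select_feature = SelectKBest(chi2, k=5).fit(x_train_scaled, y_train)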

Recursive feature elimination

Essentially, it fits a classifier, ranks the features by importance, drops the lowest-ranked ones, and repeats this recursively until only the K features we specified remain.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train)
print('Chosen best 5 features by rfe:', x_train.columns[rfe.support_])
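
RFE also exposes a ranking_ attribute (rank 1 = kept; larger ranks were eliminated in earlier rounds), which shows the full elimination order:

import pandas as pd

# rank 1 = selected; higher ranks were dropped earlier in the recursion
ranking = pd.Series(rfe.ranking_, index=x_train.columns).sort_values()
print(ranking)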

Recursive feature elimination with cross validation

This adds cross-validation on top of the RFE above: RFE runs inside a CV loop, each candidate number of remaining features is scored by its mean CV score, the count with the best score wins, and RFE is then refit on the full training set with that count. So K is found automatically instead of being specified up front.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier()
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5, scoring='accuracy')  # 5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)

print('Optimal number of features:', rfecv.n_features_)
print('Best features:', x_train.columns[rfecv.support_])
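
To see how the CV score changes with the number of features, we can plot the mean score at each step. A minimal sketch, assuming sklearn >= 1.0, where RFECV exposes cv_results_ (older versions used grid_scores_ instead):

import matplotlib.pyplot as plt
import numpy as np

# mean CV accuracy for each candidate number of features
mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure()
plt.plot(np.arange(1, len(mean_scores) + 1), mean_scores)
plt.xlabel('Number of features selected')
plt.ylabel('Mean cross-validation accuracy')
plt.show()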