Original article link
Summary of the original article
Personally, I think the most valuable part of this article is that it introduces several of the automated feature selection methods mentioned earlier, all from sklearn:
- RFE: recursive feature elimination (requires a given K)
- RFECV: recursive feature elimination with cross-validation (given the number of CV folds, it determines K automatically)
- SelectKBest: select the K best features (requires a given K)
Secondly, to judge a feature's influence on the target during EDA, the article uses the following plots (essentially they all check how different the feature's distribution is across target classes; the larger the difference, the better):
- violinplot
- boxplot
- swarmplot
- kdeplot (covered earlier in the series; a minimal sketch follows this list)
Handy plotting tools for EDA
Plot the first 10 features
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# x: feature DataFrame, y: "diagnosis" target Series
data_n_2 = (x - x.mean()) / x.std()  # standardization
data = pd.concat([y, data_n_2.iloc[:, 0:10]], axis=1)  # first ten features
data = pd.melt(data, id_vars="diagnosis",
               var_name="features",
               value_name="value")
plt.figure(figsize=(10, 10))
sns.violinplot(x="features", y="value", hue="diagnosis", data=data,
               split=True, inner="quart")
# swap in a boxplot:
# sns.boxplot(x="features", y="value", hue="diagnosis", data=data)
# or a swarmplot:
# sns.swarmplot(x="features", y="value", hue="diagnosis", data=data)
plt.xticks(rotation=90)
plt.show()
[Figures: violin plot and swarm plot of the first ten features, split by diagnosis]
Automated feature selection
SelectKBest (univariate statistical tests)
This is analogous to manually analyzing one factor at a time: we look at how the feature's distribution varies with the target, and a larger difference yields a higher score.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2  # chi-squared test (requires non-negative features)

# find the 5 best-scored features
select_feature = SelectKBest(chi2, k=5).fit(x_train, y_train)
print('Score list:', select_feature.scores_)
print('Chosen best 5 features:', x_train.columns[select_feature.get_support()])
Recursive feature elimination
In essence, it uses a classifier to rank features by importance, drops the lowest-ranked ones, and repeats this recursively until only the K features we specified remain.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Create the RFE object and rank each feature
clf_rf_3 = RandomForestClassifier()
rfe = RFE(estimator=clf_rf_3, n_features_to_select=5, step=1)
rfe = rfe.fit(x_train, y_train)
print('Chosen best 5 features by RFE:', x_train.columns[rfe.support_])
Recursive feature elimination with cross-validation
This adds cross-validation on top of the RFE above: instead of us fixing K, RFECV evaluates the cross-validated score at each candidate number of features during the elimination and keeps the number that scores best.
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier

# The "accuracy" scoring is proportional to the number of correct classifications
clf_rf_4 = RandomForestClassifier()
rfecv = RFECV(estimator=clf_rf_4, step=1, cv=5, scoring='accuracy')  # 5-fold cross-validation
rfecv = rfecv.fit(x_train, y_train)
print('Optimal number of features:', rfecv.n_features_)
print('Best features:', x_train.columns[rfecv.support_])
Contact: clarence_wu12@outlook.com