特征选择

1, 去掉取值变化小的特征(Removing features with low variance)

 

sklearn.feature_selection.VarianceThreshold(threshold=0.0)

 

2, 单变量特征选择(Univariate feature selection)

 

sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, k=10)

选择前k个分数较高的特征,去掉其他的特征

sklearn.feature_selection.SelectPercentile(score_func=<function f_classif>, percentile=10)

 

f_regression(单因素线性回归试验)用作回归 
chi2卡方检验,f_classif(方差分析的F值)等用作分类

 

选择一定百分比的最高的评分的特征。

sklearn.feature_selection.SelectFpr(score_func=<function f_classif>, alpha=0.05)

 

根据配置的参选搜索

sklearn.feature_selection.GenericUnivariateSelect(score_func=<function f_classif>, mode='percentile', param=1e-05

 

 

3,递归特征消除Recursive feature elimination (RFE)

递归特征消除的主要思想是反复的构建模型(如SVM或者回归模型)然后选出最好的(或者最差的)的特征(可以根据系数来选),把选出来的特征选择出来,然后在剩余的特征上重复这个过程,直到所有特征都遍历了。这个过程中特征被消除的次序就是特征的排序。因此,这是一种寻找最优特征子集的贪心算法。
RFE的稳定性很大程度上取决于在迭代的时候底层用哪种模型。例如,假如RFE采用的普通的回归,没有经过正则化的回归是不稳定的,那么RFE就是不稳定的;假如采用的是Ridge,而用Ridge正则化的回归是稳定的,那么RFE就是稳定的。

 

 

class sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, estimator_params=None, verbose=0)

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.feature_selection import RFE
import matplotlib.pyplot as plt

# Load the digits dataset
digits = load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Create the RFE object and rank each pixel
svc = SVC(kernel="linear", C=1)
rfe = RFE(estimator=svc, n_features_to_select=1, step=1)
rfe.fit(X, y)
ranking = rfe.ranking_.reshape(digits.images[0].shape)

# Plot pixel ranking
plt.matshow(ranking)
plt.colorbar()
plt.title("Ranking of pixels with RFE")
plt.show()

4, Feature selection using SelectFromModel

SelectFromModel 是一个 meta-transformer,可以和在训练完后有一个coef_ 或者 feature_importances_ 属性的评估器(机器学习算法)一起使用。
如果相应的coef_ 或者feature_importances_ 的值小于设置的阀值参数,这些特征可以视为不重要或者删除。除了指定阀值参数外,也可以通过设置一个字符串参数,使用内置的启发式搜索找到夜歌阀值。可以使用的字符串参数包括:“mean”, “median” 以及这两的浮点乘积,例如“0.1*mean”.

sklearn.feature_selection.SelectFromModel(estimator, threshold=None, prefit=False)

 

与Lasso一起使用,从boston数据集中选择最好的两组特征值。

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
clf = LassoCV()

# Set a minimum threshold of 0.25
sfm = SelectFromModel(clf, threshold=0.25)
sfm.fit(X, y)
n_features = sfm.transform(X).shape[1]

# Reset the threshold till the number of features equals two.
# Note that the attribute can be set directly instead of repeatedly
# fitting the metatransformer.
while n_features > 2:
sfm.threshold += 0.1
X_transform = sfm.transform(X)
n_features = X_transform.shape[1]

# Plot the selected two features from X.
plt.title(
"Features selected from Boston using SelectFromModel with "
"threshold %0.3f." % sfm.threshold)
feature1 = X_transform[:, 0]
feature2 = X_transform[:, 1]
plt.plot(feature1, feature2, 'r.')
plt.xlabel("Feature number 1")
plt.ylabel("Feature number 2")
plt.ylim([np.min(feature2), np.max(feature2)])
plt.show()

 

4.1,L1-based feature selection

 

L1正则化将系数w的l1范数作为惩罚项加到损失函数上,由于正则项非零,这就迫使那些弱的特征所对应的系数变成0。因此L1正则化往往会使学到的模型很稀疏(系数w经常为0),

这个特性使得L1正则化成为一种很好的特征选择方法。

 

from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
#(150, 4)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
X_new.shape
#(150, 3)

 

 

4.2, 随机稀疏模型Randomized sparse models

面临一些相互关联的特征是基于L1的稀疏模型的限制,因为模型只选择其中一个特征。为了减少这个问题,可以使用随机特征选择方法,通过打乱设计的矩阵或者子采样的数据并,多次重新估算稀疏模型,并且统计有多少次一个特定的回归量是被选中。

RandomizedLasso使用Lasso实现回归设置

sklearn.linear_model.RandomizedLasso(alpha='aic', scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, fit_intercept=True, verbose=False, normalize=True, precompute='auto', max_iter=500, eps=2.2204460492503131e-16, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))
1
RandomizedLogisticRegression 使用逻辑回归 logistic regression,适合分类任务

sklearn.linear_model.RandomizedLogisticRegression(C=1, scaling=0.5, sample_fraction=0.75, n_resampling=200, selection_threshold=0.25, tol=0.001, fit_intercept=True, verbose=False, normalize=True, random_state=None, n_jobs=1, pre_dispatch='3*n_jobs', memory=Memory(cachedir=None))

 

 

4.3, 基于树的特征选择Tree-based feature selection
基于树的评估器 (查看sklearn.tree 模块以及在sklearn.ensemble模块中的树的森林) 可以被用来计算特征的重要性,根据特征的重要性去掉无关紧要的特征 (当配合sklearn.feature_selection.SelectFromModel meta-transformer):

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
iris = load_iris()
X, y = iris.data, iris.target
X.shape
#(150, 4)
clf = ExtraTreesClassifier()
clf = clf.fit(X, y)
clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
model = SelectFromModel(clf, prefit=True)
X_new = model.transform(X)
X_new.shape
#(150, 2)

 

 

5, Feature selection as part of a pipeline
在进行学习之前,特征选择通常被用作预处理步骤。在scikit-learn中推荐使用的处理的方法是sklearn.pipeline.Pipeline

sklearn.pipeline.Pipeline(steps)
1
Pipeline of transforms with a final estimator.
Sequentially 应用一个包含 transforms and a final estimator的列表 ,pipeline中间的步骤必须是‘transforms’, 也就是它们必须完成fit 以及transform 方法s. final estimator 仅仅只需要完成 fit方法.

使用pipeline是未来组合多个可以在设置不同参数时进行一起交叉验证的步骤 。因此,它允许设置不同步骤中的参数事使用参数名,这些参数名使用‘__’进行分隔。如下实例中所示:

from sklearn import svm
from sklearn.datasets import samples_generator
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = samples_generator.make_classification(
... n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
# You can set the parameters using the names issued
# For instance, fit using a k of 10 in the SelectKBest
# and a parameter 'C' of the svm
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
...
Pipeline(steps=[...])
prediction = anova_svm.predict(X)
anova_svm.score(X, y)
0.77...
# getting the selected features chosen by anova_filter
anova_svm.named_steps['anova'].get_support()
#array([ True, True, True, False, False, True, False, True, True, True,
False, False, True, False, True, False, False, False, False,
True], dtype=bool)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
简单语法示例:

clf = Pipeline([
('feature_selection', SelectFromModel(LinearSVC(penalty="l1"))),
('classification', RandomForestClassifier())
])
clf.fit(X, y)

 

作者:面向未来的历史
来源:CSDN
原文:https://blog.csdn.net/a1368783069/article/details/52048349

 

posted @ 2019-04-24 14:36  gao_jian  阅读(844)  评论(0编辑  收藏  举报