




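       This article uses the Iris dataset that ships with sklearn to illustrate feature processing. The Iris dataset was compiled by Fisher in 1936 and contains four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all of which are positive floating-point values measured in centimeters. The target value is the class of iris: Iris Setosa, Iris Versicolour, or Iris Virginica. The code to import the Iris dataset is as follows: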

from sklearn.datasets import load_iris
# Import the Iris dataset
iris = load_iris()
# Feature matrix
iris.data
# Target vector
iris.target




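       Information redundancy: for some quantitative features, the useful information is an interval split, for example exam scores. If you only care whether a student passed or failed, the quantitative score needs to be converted to "1" and "0", meaning pass and fail. Binarization solves this problem.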
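As a small illustration of the pass/fail case above, the sketch below binarizes a handful of made-up exam scores. The scores and the pass mark of 59 are assumptions chosen for the example; sklearn's Binarizer maps values strictly greater than the threshold to 1 and everything else to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical exam scores, one student per row (illustrative values only)
scores = np.array([[45.0], [59.0], [60.0], [90.0]])

# Scores strictly greater than 59 become 1 (pass); the rest become 0 (fail)
passed = Binarizer(threshold=59).fit_transform(scores)
print(passed.ravel())
```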











from sklearn.preprocessing import StandardScaler
# Standardization: returns the standardized data
StandardScaler().fit_transform(iris.data)
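A minimal sketch of what standardization does, assuming the usual column-wise definition (subtract the feature's mean, then divide by its population standard deviation). The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual column-wise standardization (np.std uses the population form, ddof=0, by default)
manual = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = StandardScaler().fit_transform(X)

# The manual result should agree with StandardScaler
print(np.allclose(manual, scaled))
```

After the transform every column has mean 0 and standard deviation 1, so features measured on different scales become comparable.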



                                                                      \(x^{\prime}=\frac{x-\mathrm{Min}}{\mathrm{Max}-\mathrm{Min}}\)


from sklearn.preprocessing import MinMaxScaler
# Interval scaling: returns the data scaled to the [0, 1] interval
MinMaxScaler().fit_transform(iris.data)
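The interval-scaling formula can be checked by hand. The sketch below applies it column by column, with Min and Max taken as each feature's minimum and maximum, and compares the result with MinMaxScaler; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# x' = (x - Min) / (Max - Min), computed separately for each feature column
col_min, col_max = X.min(axis=0), X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```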



                                                                       \(x^{\prime}=\frac{x}{\sqrt{\sum_{j}^{m} x[j]^{2}}}\)


from sklearn.preprocessing import Normalizer
# Normalization: returns the normalized data
Normalizer().fit_transform(iris.data)
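Unlike the two scalers, normalization works on rows (samples) rather than columns. The following sketch applies the L2 formula to each sample and compares it with Normalizer, assuming its default `norm="l2"` setting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import Normalizer

X = load_iris().data

# Divide every sample (row) by its own Euclidean length
manual = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))
normalized = Normalizer().fit_transform(X)

print(np.allclose(manual, normalized))
```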



                                                                       \(x^{\prime}=\left\{\begin{array}{l}1, x>\text { threshold } \\ 0, x \leq \text { threshold }\end{array}\right.\)


from sklearn.preprocessing import Binarizer
# Binarization with the threshold set to 3: returns the binarized data
Binarizer(threshold=3).fit_transform(iris.data)



from sklearn.preprocessing import OneHotEncoder
# Dummy encoding of the Iris target values: returns the dummy-encoded data
OneHotEncoder().fit_transform(iris.target.reshape((-1, 1)))
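To see what the dummy encoding produces, the sketch below encodes the Iris target and prints one row per class. OneHotEncoder returns a sparse matrix by default, so `.toarray()` is used here purely for display.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2-D input, so the target vector becomes a single column
y = load_iris().target.reshape(-1, 1)
encoded = OneHotEncoder().fit_transform(y).toarray()

print(encoded.shape)
# Samples 0, 50 and 100 belong to classes 0, 1 and 2 respectively
print(encoded[[0, 50, 100]])
```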



from numpy import vstack, array, nan
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
# Missing-value imputation: returns the data with the missing values filled in
# Parameter missing_values is the representation of a missing value (default NaN)
# Parameter strategy is the fill method (default "mean")
SimpleImputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))
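A short check of the mean strategy, written against `SimpleImputer` from `sklearn.impute` (the current home of this functionality): a row of NaN values is stacked on top of Iris, and after imputation that row should hold the column means of the real data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

X = load_iris().data

# Prepend one sample whose four features are all missing
with_gap = np.vstack((np.full(4, np.nan), X))
filled = SimpleImputer(strategy="mean").fit_transform(with_gap)

# The formerly missing row now contains each feature's mean
print(filled[0])
```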



                \(\left(x_{1}^{\prime}, x_{2}^{\prime}, x_{3}^{\prime}, x_{4}^{\prime}, x_{5}^{\prime}, x_{6}^{\prime}, x_{7}^{\prime}, x_{8}^{\prime}, x_{9}^{\prime}, x_{10}^{\prime}, x_{11}^{\prime}, x_{12}^{\prime}, x_{13}^{\prime}, x_{14}^{\prime}, x_{15}^{\prime}\right)\)
                                   \(=\left(1, x_{1}, x_{2}, x_{3}, x_{4}, x_{1}^{2}, x_{1} * x_{2}, x_{1} * x_{3}, x_{1} * x_{4}, x_{2}^{2}, x_{2} * x_{3}, x_{2} * x_{4}, x_{3}^{2}, x_{3} * x_{4}, x_{4}^{2}\right)\)


from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation
# Parameter degree is the polynomial degree (default 2)
PolynomialFeatures().fit_transform(iris.data)
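The degree-2 expansion of four features into fifteen is easier to follow on a single sample. The input values below are arbitrary, picked so that every product is easy to recognize.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with four features (x1, x2, x3, x4) = (2, 3, 5, 7); values are illustrative
sample = np.array([[2.0, 3.0, 5.0, 7.0]])
expanded = PolynomialFeatures(degree=2).fit_transform(sample)

# Order: 1, x1..x4, x1^2, x1*x2, x1*x3, x1*x4, x2^2, x2*x3, x2*x4, x3^2, x3*x4, x4^2
print(expanded.ravel())
```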


from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
# Custom transformation: here the data is transformed with a logarithmic function
# The first parameter is a univariate function
FunctionTransformer(log1p).fit_transform(iris.data)



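  • Whether the feature diverges: if a feature does not diverge, that is, its variance is close to zero, the samples barely differ on it, so it is of no use for telling samples apart.
  • Correlation between the feature and the target: this one is more obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, the other methods introduced in this article all work from correlation.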


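  • Filter: scores each feature by divergence or correlation, then selects features using a score threshold or a target number of features.
  • Wrapper: selects or excludes several features at a time according to an objective function (usually a predictive performance score).
  • Embedded: first trains a machine learning algorithm or model to obtain a weight coefficient for each feature, then selects features in descending order of coefficient. It is similar to the Filter method, except that training is what determines how good each feature is.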




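       The variance selection method first computes the variance of each feature and then keeps the features whose variance is greater than a threshold. The code for selecting features with the VarianceThreshold class of the feature_selection library is as follows: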

from sklearn.feature_selection import VarianceThreshold
# Variance selection method: returns the data after feature selection
# Parameter threshold is the variance threshold
VarianceThreshold(threshold=3).fit_transform(iris.data)
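To make the effect of the threshold concrete, the sketch below prints the per-feature variances of Iris and then applies a threshold of 3. On this dataset only the petal-length column has a variance above 3, so a single feature should remain.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold

X = load_iris().data

# Population variance of each of the four features
print(X.var(axis=0).round(3))

# Keep only the features whose variance exceeds 3
selected = VarianceThreshold(threshold=3).fit_transform(X)
print(selected.shape)
```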



from numpy import array
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# Select the K best features; returns the data after feature selection
# The first parameter is the scoring function: it takes the feature matrix and the target vector
# and returns a (scores, p-values) pair of arrays whose i-th entries belong to the i-th feature.
# Here the score is defined as the Pearson correlation coefficient.
def pearson_scores(X, y):
    scores, pvalues = zip(*(pearsonr(x, y) for x in X.T))
    return array(scores), array(pvalues)
# Parameter k is the number of features to select
SelectKBest(pearson_scores, k=2).fit_transform(iris.data, iris.target)



                                                                       \(\chi^{2}=\sum \frac{(A-E)^{2}}{E}\)

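       It is not hard to see that this statistic simply measures the association between the independent variable and the dependent variable (see the Wikipedia article on the chi-squared test). The code for selecting features with the SelectKBest class of the feature_selection library combined with the chi-squared test is as follows: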

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
#Select K best features, return the data after selecting the feature
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)
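The chi-squared formula can be reproduced by hand as a sanity check. The sketch below assumes scikit-learn's convention of treating non-negative feature values as counts: the observed value A is the summed feature value within each class, and the expected value E is the class frequency times the feature total.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

iris = load_iris()
X, y = iris.data, iris.target

# One-hot class indicators, shape (150, 3)
Y = LabelBinarizer().fit_transform(y)
# A: summed feature value per class, shape (3, 4)
observed = Y.T @ X
# E: class frequency times the feature's total, shape (3, 4)
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
# Sum (A - E)^2 / E over the classes to get one score per feature
manual = ((observed - expected) ** 2 / expected).sum(axis=0)

scores, pvalues = chi2(X, y)
print(np.allclose(manual, scores))
```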



                                                       \(I(X ; Y)=\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}\)


from numpy import array
from sklearn.feature_selection import SelectKBest
from minepy import MINE
# MINE's API is not functional in style, so mic() wraps it and returns a (score, p-value) pair; the p-value is fixed at 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)
# Score every feature column and return (scores, p-values) as SelectKBest expects
def mic_scores(X, y):
    scores, pvalues = zip(*(mic(x, y) for x in X.T))
    return array(scores), array(pvalues)
# Select the K best features; returns the data after feature selection
SelectKBest(mic_scores, k=2).fit_transform(iris.data, iris.target)
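For discrete variables the mutual information formula can be evaluated directly. The sketch below is a plain implementation under that assumption (probabilities estimated as relative frequencies, natural logarithm); the helper name `mutual_information` and the toy data are made up for illustration, and the result is compared with sklearn's `mutual_info_score`.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mutual_information(x, y):
    """I(X;Y) in nats for two discrete sequences, straight from the formula."""
    x, y = np.asarray(x), np.asarray(y)
    total = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:  # terms with p(x, y) = 0 contribute nothing
                total += p_xy * np.log(p_xy / (np.mean(x == xv) * np.mean(y == yv)))
    return total

# Toy discrete data (illustrative values only)
x = [0, 0, 1, 1, 2, 2]
y = [0, 0, 1, 1, 1, 0]
print(mutual_information(x, y), mutual_info_score(x, y))
```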




from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
#Recursive feature elimination method, returning the data after feature selection
#Parameter estimator is the base model
#Parameter n_features_to_select is the number of features selected
RFE(estimator=LogisticRegression(),n_features_to_select=2).fit_transform(iris.data, iris.target)
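Beyond the transformed data, a fitted RFE object also records which features survived and in what order the others were dropped. A brief sketch follows; `max_iter=1000` is an assumption added here only so the base model converges cleanly.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = load_iris()
selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(iris.data, iris.target)

# Boolean mask of the kept features
print(selector.support_)
# Rank 1 marks a kept feature; larger ranks were eliminated earlier
print(selector.ranking_)
```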




from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Feature selection with L1-penalized logistic regression as the base model
# (the liblinear solver is specified because it supports the L1 penalty)
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)


from sklearn.linear_model import LogisticRegression

class LR(LogisticRegression):
    def __init__(self, threshold=0.01, dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='liblinear', max_iter=100,
                 multi_class='ovr', verbose=0, warm_start=False, n_jobs=1):

        self.threshold = threshold
        LogisticRegression.__init__(self, penalty='l1', dual=dual, tol=tol, C=C,
                 fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight=class_weight,
                 random_state=random_state, solver=solver, max_iter=max_iter,
                 multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)
        #Create L2 logistic regression using the same parameters
        self.l2 = LogisticRegression(penalty='l2', dual=dual, tol=tol, C=C, fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight = class_weight, random_state=random_state, solver=solver, max_iter=max_iter, multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)

    def fit(self, X, y, sample_weight=None):
        # Training L1 logistic regression
        super(LR, self).fit(X, y, sample_weight=sample_weight)
        self.coef_old_ = self.coef_.copy()
        # L2 logistic regression training
        self.l2.fit(X, y, sample_weight=sample_weight)

        cntOfRow, cntOfCol = self.coef_.shape
        # The number of rows in the coefficient matrix equals the number of target classes
        for i in range(cntOfRow):
            for j in range(cntOfCol):
                coef = self.coef_[i][j]
                # The weight coefficient of L1 logistic regression is not 0.
                if coef != 0:
                    idx = [j]
                    #correspond to the weight coefficient in L2 logistic regression
                    coef1 = self.l2.coef_[i][j]
                    for k in range(cntOfCol):
                        coef2 = self.l2.coef_[i][k]
                        # In the L2 model the two weights differ by less than the threshold, and the feature's weight in the L1 model is 0
                        if abs(coef1-coef2) < self.threshold and j != k and self.coef_[i][k] == 0:
                            idx.append(k)
                    # Share the L1 weight equally across this group of similar features
                    mean = coef / len(idx)
                    self.coef_[i][idx] = mean
        return self


from sklearn.feature_selection import SelectFromModel
# Feature selection with logistic regression using both L1 and L2 penalty terms as the base model
# Parameter threshold is the threshold on the difference between weight coefficients
SelectFromModel(LR(threshold=0.5, C=0.1)).fit_transform(iris.data, iris.target)



from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
#GBDT as the feature selection of the base model
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)





from sklearn.decomposition import PCA
# Principal component analysis: returns the data after dimensionality reduction
# Parameter n_components is the number of principal components
PCA(n_components=2).fit_transform(iris.data)
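One way to judge how much information two components retain is the fitted model's `explained_variance_ratio_` attribute. The sketch below reduces Iris to two dimensions and prints the share of variance captured by each component.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

print(reduced.shape)
# Fraction of the total variance explained by each principal component
print(pca.explained_variance_ratio_)
```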



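# The sklearn.lda module was removed from scikit-learn; the class now lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Linear discriminant analysis: returns the data after dimensionality reduction
# Parameter n_components is the number of dimensions after reduction
LDA(n_components=2).fit_transform(iris.data, iris.target)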






       Original article: https://medium.com/ml-research-lab/chapter-6-how-to-learn-feature-engineering-49f4246f0d41

posted on 2021-06-11 10:01  雾恋过往
