What Is Feature Engineering in Machine Learning

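1. What is feature engineering?

There is a saying widely circulated in the industry: data and features determine the upper limit of machine learning, and models and algorithms merely approach that limit. So what is feature engineering? As the name implies, it is essentially an engineering activity that aims to extract as many useful features as possible from raw data for algorithms and models to use. Summarizing common practice, feature engineering is generally considered to include the following aspects:

Feature processing is the core part of feature engineering. sklearn provides a fairly complete set of feature processing methods, including data preprocessing, feature selection, and dimensionality reduction. People usually first come to sklearn for its rich and convenient library of algorithm models, but the feature processing library described here is also very powerful!

This article uses the IRIS (iris flower) dataset in sklearn to illustrate the feature processing functions. The IRIS dataset was compiled by Fisher in 1936 and contains four features (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width), all of which are positive floating-point values in centimeters. The target value is the iris species: Iris Setosa, Iris Versicolour, or Iris Virginica. The code for importing the IRIS dataset is as follows: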

from sklearn.datasets import load_iris
# Import the IRIS dataset
iris = load_iris()
# Feature matrix
iris.data
# Target vector
iris.target

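2. Data preprocessing

Feature extraction yields unprocessed features, which may have the following problems:

Not on the same scale: the features are measured in different units or ranges and cannot be compared with one another. Nondimensionalization solves this problem.

Information redundancy: for some quantitative features, the useful information is an interval division, such as exam scores. If we only care whether a student passed or failed, the quantitative score needs to be converted into "1" and "0" to represent pass and fail. Binarization solves this problem.

Qualitative features cannot be used directly: some machine learning algorithms and models accept only quantitative features as input, so qualitative features must be converted into quantitative ones. The simplest way is to assign a quantitative value to each qualitative value, but this approach is too flexible and adds tuning work. Qualitative features are usually converted with dummy encoding instead: if there are N qualitative values, the feature is expanded into N features. When the original feature takes the i-th qualitative value, the i-th expanded feature is set to 1 and the other expanded features are set to 0. Compared with direct assignment, dummy encoding adds no tuning work. For linear models, dummy-encoded features can achieve a nonlinear effect.

Missing values: missing values need to be filled in.

Low information utilization: different machine learning algorithms and models exploit different information in the data. As mentioned above, in linear models, dummy-encoding qualitative features can achieve a nonlinear effect. Similarly, polynomial or other transformations of quantitative variables can achieve a nonlinear effect.

We use the preprocessing library in sklearn to preprocess the data and address the problems above.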

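2.1 Nondimensionalization

Nondimensionalization converts data on different scales to the same scale. Common methods are standardization and interval scaling. Standardization presumes that the feature values follow a normal distribution; after standardization they follow the standard normal distribution. Interval scaling uses the boundary values to scale the range of a feature to a specific range, such as [0, 1].

2.1.1 Standardization

Standardization requires computing the mean \(\bar{X}\) and the standard deviation \(S\) of the feature, and is expressed as: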

\(x^{\prime}=\frac{x-\bar{X}}{S}\)

The code for standardizing data with the StandardScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import StandardScaler
# Standardization; returns the standardized data
StandardScaler().fit_transform(iris.data)
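
To make the standardization formula concrete, here is a minimal pure-Python sketch that standardizes one feature column by hand. The helper name `standardize` and the sample values are illustrative assumptions, and the population standard deviation is used, which to my understanding matches what StandardScaler computes.

```python
from statistics import mean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    return [(x - mu) / sigma for x in column]

# Made-up column with mean 5 and population standard deviation 2
column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(column)
print(z)
```

After the transformation the column has mean 0 and standard deviation 1, regardless of its original unit.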

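2.1.2 Interval scaling

There are many approaches to interval scaling. A common one scales by the two extreme values, the minimum and the maximum. The formula is:

\(x^{\prime}=\frac{x-Min}{Max-Min}\)

The code for interval scaling with the MinMaxScaler class of the preprocessing library is as follows:

from sklearn.preprocessing import MinMaxScaler
# Interval scaling; returns the data scaled to the [0, 1] interval
MinMaxScaler().fit_transform(iris.data)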
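
As a quick check of the min-max formula, the following pure-Python sketch scales one column by hand. The function name `min_max_scale` and the sample values are assumptions made for illustration, and the sketch assumes the column is not constant (otherwise the denominator would be zero).

```python
def min_max_scale(column):
    """Scale a column to [0, 1] using its minimum and maximum."""
    lo, hi = min(column), max(column)
    # Assumes hi > lo; a constant column would need special handling
    return [(x - lo) / (hi - lo) for x in column]

scaled = min_max_scale([2.0, 4.0, 6.0, 10.0])
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.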

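2.1.3 The difference between standardization and normalization

Simply put, standardization processes data by the columns of the feature matrix: it converts the feature values of the samples to the same scale with the z-score method. Normalization processes data by the rows of the feature matrix. Its purpose is to give the sample vectors a uniform standard when similarity is computed with a dot product or another kernel function, that is, to convert each of them into a "unit vector". The normalization formula with the l2 norm is as follows: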

\(x^{\prime}=\frac{x}{\sqrt{\sum_{j}^{m} x[j]^{2}}}\)

The code for normalizing data with the Normalizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Normalizer
# Normalization; returns the normalized data
Normalizer().fit_transform(iris.data)
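
To illustrate that normalization works row by row, here is a small pure-Python sketch of l2 normalization. The name `l2_normalize` and the two sample rows are illustrative assumptions, and rows are assumed to be nonzero.

```python
from math import sqrt

def l2_normalize(row):
    """Divide every entry of a sample vector by the vector's l2 norm."""
    norm = sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

rows = [[3.0, 4.0], [1.0, 0.0]]
unit_rows = [l2_normalize(r) for r in rows]
print(unit_rows)
```

Each output row has length 1, so a dot product between two rows becomes their cosine similarity.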

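2.2 Binarizing quantitative features

The core of binarizing a quantitative feature is setting a threshold. Values greater than the threshold become 1, and values less than or equal to the threshold become 0. The formula is as follows: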

\(x^{\prime}=\left\{\begin{array}{l}1, x>\text { threshold } \\ 0, x \leq \text { threshold }\end{array}\right.\)

The code for binarizing data with the Binarizer class of the preprocessing library is as follows:

from sklearn.preprocessing import Binarizer
# Binarization with the threshold set to 3; returns the binarized data
Binarizer(threshold=3).fit_transform(iris.data)
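
The threshold rule is simple enough to write out directly. The sketch below is a pure-Python illustration with a hypothetical helper `binarize` and made-up values; note that a value exactly equal to the threshold maps to 0.

```python
def binarize(values, threshold):
    """Map each value to 1 if it is strictly greater than the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]

flags = binarize([5.1, 3.0, 1.4, 3.5], threshold=3)
print(flags)
```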

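2.3 Dummy encoding qualitative features

Since all features of the IRIS dataset are quantitative, its target values are used to demonstrate dummy encoding (which is not actually needed for them). The code for dummy encoding data with the OneHotEncoder class of the preprocessing library is as follows:

from sklearn.preprocessing import OneHotEncoder
# Dummy encoding of the IRIS target values; returns the dummy-encoded data
OneHotEncoder().fit_transform(iris.target.reshape((-1,1)))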
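
The expansion of one qualitative feature into N indicator features can be sketched in plain Python as follows. The helper `one_hot` and the label list are illustrative assumptions; categories are ordered by sorting, which is a choice made for this sketch.

```python
def one_hot(labels):
    """Expand a list of qualitative values into one indicator column per category."""
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1  # only the column of this label's category is set
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot([0, 2, 1, 2])
print(categories)
print(encoded)
```

Every row contains exactly one 1, in the column of the category the sample belongs to.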

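2.4 Missing value imputation

Since the IRIS dataset has no missing values, a new sample whose four features are all NaN is added to the dataset to represent missing data. The code for imputing missing data with the SimpleImputer class of the impute module (the replacement for the older preprocessing.Imputer class, which was removed in scikit-learn 0.22) is as follows:

from numpy import vstack, array, nan
from sklearn.impute import SimpleImputer
# Missing value imputation; returns the data after the missing values are filled in
# The parameter missing_values is the representation of a missing value; the default is NaN
# The parameter strategy is the filling method; the default is mean
SimpleImputer().fit_transform(vstack((array([nan, nan, nan, nan]), iris.data)))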
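
Mean imputation itself is a one-line idea: replace each missing entry with the mean of the observed entries in the same column. The pure-Python sketch below shows this for a single column; `impute_mean` and the sample values are illustrative assumptions, and the column is assumed to contain at least one observed value.

```python
from math import isnan, nan

def impute_mean(column):
    """Replace NaN entries with the mean of the non-missing entries."""
    observed = [v for v in column if not isnan(v)]
    fill = sum(observed) / len(observed)
    return [fill if isnan(v) else v for v in column]

filled = impute_mean([nan, 5.0, 7.0, 9.0])
print(filled)
```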

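2.5 Data transformation

Common data transformations are based on polynomial, exponential, and logarithmic functions. The degree-2 polynomial transformation of four features is as follows: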

\(\left(x_{1}^{\prime}, x_{2}^{\prime}, x_{3}^{\prime}, x_{4}^{\prime}, x_{5}^{\prime}, x_{6}^{\prime}, x_{7}^{\prime}, x_{8}^{\prime}, x_{9}^{\prime}, x_{10}^{\prime}, x_{11}^{\prime}, x_{12}^{\prime}, x_{13}^{\prime}, x_{14}^{\prime}, x_{15}^{\prime}\right)\)
\(=\left(1, x_{1}, x_{2}, x_{3}, x_{4}, x_{1}^{2}, x_{1} * x_{2}, x_{1} * x_{3}, x_{1} * x_{4}, x_{2}^{2}, x_{2} * x_{3}, x_{2} * x_{4}, x_{3}^{2}, x_{3} * x_{4}, x_{4}^{2}\right)\)

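The code for the polynomial transformation of data with the PolynomialFeatures class of the preprocessing library is as follows:

from sklearn.preprocessing import PolynomialFeatures
# Polynomial transformation
# The parameter degree is the degree of the polynomial; the default is 2
PolynomialFeatures().fit_transform(iris.data)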
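
To see where the 15 terms come from, the following pure-Python sketch enumerates all products of at most two of the four features. The function `polynomial_features` and the sample row are illustrative assumptions; it is a sketch of the expansion, not sklearn's implementation.

```python
from itertools import combinations_with_replacement
from math import prod

def polynomial_features(row, degree=2):
    """Return all monomials of the inputs up to the given degree, constant term first."""
    features = []
    for d in range(degree + 1):
        # Each combination of d inputs (with repetition) gives one monomial
        for combo in combinations_with_replacement(row, d):
            features.append(prod(combo))
    return features

expanded = polynomial_features([1.0, 2.0, 3.0, 4.0])
print(len(expanded))
print(expanded)
```

For four inputs this yields 1 constant, 4 linear terms, and 10 degree-2 terms, in the same order as the formula above.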

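Data transformations based on a single-argument function can all be handled in a uniform way. The code for a logarithmic transformation of data with the FunctionTransformer class of the preprocessing library is as follows:

from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
# Custom transformation; here the data is transformed with a logarithmic function
# The first parameter is a function of a single variable
FunctionTransformer(log1p).fit_transform(iris.data)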

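3. Feature selection

After data preprocessing is complete, we need to select meaningful features to feed into machine learning algorithms and models for training. Generally, features are selected from two perspectives:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to zero, the samples barely differ on this feature, so it is of no use for telling samples apart.
Correlation between the feature and the target: this one is more obvious; features that are highly correlated with the target should be preferred. Apart from the variance method, the other methods introduced in this article are all based on correlation.

According to the form of feature selection, the methods fall into three types:

Filter: the filter method scores each feature according to its divergence or correlation, sets a threshold or the number of features to select, and then selects the features.
Wrapper: the wrapper method selects several features at a time, or excludes several features, according to an objective function (usually a prediction performance score).
Embedded: the embedded method first trains some machine learning algorithm or model to obtain a weight coefficient for each feature, then selects features by coefficient from largest to smallest. It is similar to the Filter method, except that the quality of a feature is determined through training.

We use the feature_selection library in sklearn for feature selection.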

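3.1 Filter

3.1.1 Variance threshold method

With the variance threshold method, first compute the variance of each feature, then select the features whose variance is greater than a threshold. The code for selecting features with the VarianceThreshold class of the feature_selection library is as follows:

from sklearn.feature_selection import VarianceThreshold
# Variance threshold method; returns the data after feature selection
# The parameter threshold is the variance threshold
VarianceThreshold(threshold=3).fit_transform(iris.data)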
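
The selection rule can be reproduced by hand in a few lines. The sketch below is a pure-Python illustration with a hypothetical helper `variance_threshold` and made-up columns; it uses the population variance and keeps only columns whose variance is strictly greater than the threshold.

```python
from statistics import pvariance

def variance_threshold(columns, threshold):
    """Return the indices of the columns whose population variance exceeds the threshold."""
    return [i for i, col in enumerate(columns) if pvariance(col) > threshold]

columns = [
    [1.0, 1.0, 1.0, 1.0],  # constant column, variance 0
    [1.0, 3.0, 5.0, 7.0],  # variance 5
    [2.0, 2.0, 3.0, 3.0],  # variance 0.25
]
kept = variance_threshold(columns, threshold=1.0)
print(kept)
```

Only the second column varies enough to be kept; the constant column carries no information for distinguishing samples.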

3.1.2 Correlation coefficient method

With the correlation coefficient method, first compute the correlation coefficient between each feature and the target value, together with the P value of that coefficient. The code for selecting features with the SelectKBest class of the feature_selection library combined with the correlation coefficient is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from scipy.stats import pearsonr
# Select the K best features; returns the data after feature selection
# The first parameter is a scoring function: it takes the feature matrix and the target vector and returns a tuple (scores, P values) whose i-th entries are the score and P value of the i-th feature. Here the score is the Pearson correlation coefficient (take its absolute value if strongly negative correlations should also rank high)
# The parameter k is the number of features to select
SelectKBest(lambda X, Y: tuple(array([tuple(pearsonr(x, Y)) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)
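
For readers who want to see what the score means, here is a minimal pure-Python computation of the Pearson correlation coefficient between each feature column and the target. The function `pearson`, the feature names, and all numbers are made up for illustration; P values are not computed in this sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

target = [0.0, 1.0, 2.0, 3.0]
features = {
    "rising": [1.0, 2.0, 3.0, 4.0],   # perfectly positively correlated
    "falling": [8.0, 6.0, 4.0, 2.0],  # perfectly negatively correlated
    "mixed": [1.0, 3.0, 2.0, 4.0],    # partially correlated
}
scores = {name: pearson(col, target) for name, col in features.items()}
print(scores)
```

A coefficient near 1 or -1 indicates a strong linear relationship with the target, while a value near 0 indicates a weak one.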

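3.1.3 Chi-squared test

The classic chi-squared test tests the correlation between a qualitative independent variable and a qualitative dependent variable. Suppose the independent variable takes N values and the dependent variable takes M values. Consider the difference between the observed frequency A and the expected frequency E of the samples for which the independent variable equals i and the dependent variable equals j, and construct the statistic: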

\(\chi^{2}=\sum \frac{(A-E)^{2}}{E}\)

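It is easy to see that this statistic simply measures how strongly the independent variable is correlated with the dependent variable. The code for selecting features with the SelectKBest class of the feature_selection library combined with the chi-squared test is as follows:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Select the K best features; returns the data after feature selection
SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)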
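
The statistic in the formula can be computed directly from a contingency table of observed counts. The pure-Python sketch below does this for the classic form described in the text; the function `chi_squared` and the two tables are illustrative assumptions, and as far as I understand sklearn's `chi2` derives its counts from non-negative feature values rather than from an explicit table like this one.

```python
def chi_squared(table):
    """Chi-squared statistic of a contingency table (rows: values of X, columns: values of Y)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

independent = chi_squared([[10, 10], [10, 10]])
dependent = chi_squared([[20, 0], [0, 20]])
print(independent, dependent)
```

When the observed counts match the counts expected under independence the statistic is 0, and it grows as the two variables become more strongly associated.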

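3.1.4 Mutual information method

Classic mutual information is also used to evaluate the correlation between a qualitative independent variable and a qualitative dependent variable. Mutual information is computed as follows: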

\(I(X ; Y)=\sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}\)

To handle quantitative data, the maximal information coefficient (MIC) method was proposed. The code for selecting features with the SelectKBest class of the feature_selection library combined with the maximal information coefficient method is as follows:

from numpy import array
from sklearn.feature_selection import SelectKBest
from minepy import MINE

# MINE is not designed in a functional style, so the mic method is wrapped as a function that returns a pair; the second item of the pair is a fixed P value of 0.5
def mic(x, y):
    m = MINE()
    m.compute_score(x, y)
    return (m.mic(), 0.5)
# Select the K best features; returns the data after feature selection
SelectKBest(lambda X, Y: tuple(array([mic(x, Y) for x in X.T]).T), k=2).fit_transform(iris.data, iris.target)
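
The classic mutual information formula for two qualitative variables can be evaluated by counting. The sketch below is a pure-Python illustration of that formula, not of MIC; the function `mutual_information` and the sample sequences are assumptions, and the natural logarithm is used, so the result is in nats.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information I(X; Y) of two discrete sequences, estimated from frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

mi_same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])   # Y is fully determined by X
mi_indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # X tells nothing about Y
print(mi_same, mi_indep)
```

Identical binary variables share log 2 nats of information, while independent variables share none.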

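3.2 Wrapper

3.2.1 Recursive feature elimination

Recursive feature elimination uses a base model for multiple rounds of training. After each round, several features are eliminated according to their weight coefficients, and the next round is trained on the new feature set. The code for selecting features with the RFE class of the feature_selection library is as follows:

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Recursive feature elimination; returns the data after feature selection
# The parameter estimator is the base model
# The parameter n_features_to_select is the number of features to select
RFE(estimator=LogisticRegression(),n_features_to_select=2).fit_transform(iris.data, iris.target)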

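3.3 Embedded

3.3.1 Penalty-based feature selection

Using a base model with a penalty term not only filters out features but also reduces dimensionality. The code for selecting features with the SelectFromModel class of the feature_selection library combined with a logistic regression model with an L1 penalty is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
# Logistic regression with an L1 penalty term as the base model for feature selection
# solver="liblinear" is set explicitly because the default solver does not support the L1 penalty
SelectFromModel(LogisticRegression(penalty="l1", C=0.1, solver="liblinear")).fit_transform(iris.data, iris.target)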

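In fact, the L1 penalty reduces dimensionality by keeping only one of several features that are equally correlated with the target value, so a feature that was not selected is not necessarily unimportant. The selection can therefore be refined by combining it with an L2 penalty. Concretely: if a feature has a nonzero weight under L1, the features whose L2 weights differ little from its own and whose L1 weights are 0 form a homogeneous set together with it, and the L1 weight is split equally among the features in that set. This requires building a new logistic regression model: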

from sklearn.linear_model import LogisticRegression

class LR(LogisticRegression):
    def __init__(self, threshold=0.01, dual=False, tol=1e-4, C=1.0,
                 fit_intercept=True, intercept_scaling=1, class_weight=None,
                 random_state=None, solver='liblinear', max_iter=100,
                 multi_class='ovr', verbose=0, warm_start=False, n_jobs=1):

        # Threshold on the difference between L2 weight coefficients
        self.threshold = threshold
        LogisticRegression.__init__(self, penalty='l1', dual=dual, tol=tol, C=C,
                 fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight=class_weight,
                 random_state=random_state, solver=solver, max_iter=max_iter,
                 multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)
        # Create an L2 logistic regression with the same parameters
        self.l2 = LogisticRegression(penalty='l2', dual=dual, tol=tol, C=C, fit_intercept=fit_intercept, intercept_scaling=intercept_scaling, class_weight = class_weight, random_state=random_state, solver=solver, max_iter=max_iter, multi_class=multi_class, verbose=verbose, warm_start=warm_start, n_jobs=n_jobs)

    def fit(self, X, y, sample_weight=None):
        # Train the L1 logistic regression
        super(LR, self).fit(X, y, sample_weight=sample_weight)
        self.coef_old_ = self.coef_.copy()
        # Train the L2 logistic regression
        self.l2.fit(X, y, sample_weight=sample_weight)

        cntOfRow, cntOfCol = self.coef_.shape
        # The number of rows of the coefficient matrix corresponds to the number of target classes
        for i in range(cntOfRow):
            for j in range(cntOfCol):
                coef = self.coef_[i][j]
                # The weight coefficient in the L1 logistic regression is not 0
                if coef != 0:
                    idx = [j]
                    # The corresponding weight coefficient in the L2 logistic regression
                    coef1 = self.l2.coef_[i][j]
                    for k in range(cntOfCol):
                        coef2 = self.l2.coef_[i][k]
                        # In the L2 logistic regression the difference between the weight coefficients is below the threshold, and the corresponding weight in L1 is 0
                        if abs(coef1-coef2) < self.threshold and j != k and self.coef_[i][k] == 0:
                            idx.append(k)
                    # Split the L1 weight coefficient equally among the features of this set
                    mean = coef / len(idx)
                    self.coef_[i][idx] = mean
        return self

The code for selecting features with the SelectFromModel class of the feature_selection library combined with the logistic regression model with L1 and L2 penalty terms is as follows:

from sklearn.feature_selection import SelectFromModel

# Logistic regression with L1 and L2 penalty terms as the base model for feature selection
# The parameter threshold is the threshold on the difference between weight coefficients
SelectFromModel(LR(threshold=0.5, C=0.1)).fit_transform(iris.data, iris.target)
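
The weight-sharing rule described above can be illustrated on plain Python lists, without training any model. In the sketch below the function `share_l1_weights` and the coefficient values are hypothetical; unlike the class above, it reads from the original L1 coefficients and writes to a copy, which is a simplification chosen for clarity.

```python
def share_l1_weights(l1_coef, l2_coef, threshold):
    """Split each nonzero L1 weight equally with zero-weight features whose L2 weight is close."""
    shared = list(l1_coef)
    for j, coef in enumerate(l1_coef):
        if coef == 0:
            continue
        group = [j]
        for k, other in enumerate(l1_coef):
            # Feature k joins the set if L1 dropped it but L2 weights it like feature j
            if k != j and other == 0 and abs(l2_coef[j] - l2_coef[k]) < threshold:
                group.append(k)
        for idx in group:
            shared[idx] = coef / len(group)
    return shared

# Hypothetical coefficients for one class: L1 kept only feature 0,
# but L2 weights features 0 and 1 almost identically
l1_coef = [0.9, 0.0, 0.0]
l2_coef = [0.50, 0.48, 0.10]
shared = share_l1_weights(l1_coef, l2_coef, threshold=0.05)
print(shared)
```

Feature 1 now receives half of the weight that L1 assigned to feature 0, so a selector that thresholds on weights no longer discards it.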

3.3.2 Feature selection based on tree models

Among tree models, GBDT can also serve as the base model for feature selection. The code for selecting features with the SelectFromModel class of the feature_selection library combined with a GBDT model is as follows:

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import GradientBoostingClassifier
# GBDT as the base model for feature selection
SelectFromModel(GradientBoostingClassifier()).fit_transform(iris.data, iris.target)

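4. Dimensionality reduction

Once feature selection is complete, the model can be trained directly, but the feature matrix may be so large that computation is heavy and training takes a long time. It is therefore also necessary to reduce the dimensionality of the feature matrix. Besides the L1-penalty-based model mentioned above, common dimensionality reduction methods include principal component analysis (PCA) and linear discriminant analysis (LDA). Linear discriminant analysis is itself also a classification model. PCA and LDA have much in common: both essentially map the original samples into a lower-dimensional sample space. Their mapping objectives differ, however: PCA aims to give the mapped samples the largest possible variance, whereas LDA aims to give the mapped samples the best classification performance. PCA is therefore an unsupervised dimensionality reduction method, while LDA is a supervised one.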

4.1 Principal component analysis (PCA)

The code for dimensionality reduction with the PCA class of the decomposition library is as follows:

from sklearn.decomposition import PCA
# Principal component analysis; returns the data after dimensionality reduction
# The parameter n_components is the number of principal components
PCA(n_components=2).fit_transform(iris.data)
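
To illustrate the idea of projecting onto the direction of largest variance, here is a small pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. This is not how sklearn computes PCA; the function `first_principal_component`, the start vector of ones, and the sample points are assumptions made for this sketch, and the start vector is assumed not to be orthogonal to the leading direction.

```python
from math import sqrt

def first_principal_component(points, iterations=100):
    """Return the unit direction of largest variance and the 1-D projection of the points."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    vec = [1.0] * dims
    for _ in range(iterations):
        # Multiply by the covariance matrix and renormalize (power iteration)
        nxt = [sum(cov[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    projected = [sum(r[d] * vec[d] for d in range(dims)) for r in centered]
    return vec, projected

# Made-up points lying exactly on the line y = 2x
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, projected = first_principal_component(points)
print(direction)
print(projected)
```

Because the points lie on a single line, one component is enough to describe them: the recovered direction is proportional to (1, 2).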

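4.2 Linear discriminant analysis (LDA)

The code for dimensionality reduction with the LinearDiscriminantAnalysis class of the discriminant_analysis library (which replaced the older sklearn.lda module) is as follows:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Linear discriminant analysis; returns the data after dimensionality reduction
# The parameter n_components is the number of dimensions after reduction
LinearDiscriminantAnalysis(n_components=2).fit_transform(iris.data, iris.target)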

Original link: https://medium.com/ml-research-lab/chapter-6-how-to-learn-feature-engineering-49f4246f0d41

posted on 2021-06-11 10:01 雾恋过往