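Feature Engineering (2): Data Transformation

Data science projects almost always involve machine learning algorithms, and each algorithm places its own requirements on the data. Some require discrete features, others require categorical ones. A dataset's features do not necessarily meet these requirements, so the data must be transformed according to certain principles and methods.

Feature Transformation

2.1 Feature Types

The type of a feature is determined by the set of all values it can take. The usual types are:

Categorical: for example, gender (male, female) or occupation (scholar, farmer, artisan, merchant)
Binary: 0 and 1
Ordinal: for example, academic rank (lecturer, associate professor, professor)
Numeric: integers and floating-point numbers

2.2 Numeric Encoding of Features

Basics

Consider an example of predicting illness from gene sequencing data: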

import pandas as pd
df = pd.DataFrame({"gene_segA": [1, 0, 0, 1, 1, 1, 0, 0, 1, 0],
                   "gene_segB": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
                   "hypertension": ["Y", 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'Y', 'N'],
                   "Gallstones": ['Y', 'N', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y']
                  })
df

	gene_segA	gene_segB	hypertension	Gallstones
0	1	1	Y	Y
1	0	0	N	N
2	0	1	N	N
3	1	0	N	N
4	1	1	N	Y
5	1	1	N	Y
6	0	0	Y	Y
7	0	0	N	N
8	1	1	Y	N
9	0	0	N	Y

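This data cannot be used to train a model as it stands, because a learning algorithm cannot interpret strings such as Y and N.

The Y and N values therefore have to be replaced with numbers, for example 1 for Y and 0 for N.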

df.replace({"N": 0, 'Y': 1})

scikit-learn also provides a dedicated class for this:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(df['hypertension'])

array([1, 0, 0, 0, 0, 0, 1, 0, 1, 0])

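Instantiating LabelEncoder gives an encoder object. Fitting it on a feature collects the distinct values of that feature, and transforming replaces each value with an integer, starting from 0. Here hypertension is a categorical (or binary) feature, so N becomes 0 and Y becomes 1. The same works for numeric values: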

le.fit_transform([1, 3, 3, 7, 3, 1])

array([0, 1, 1, 2, 1, 0], dtype=int64)

A LabelEncoder instance also has a method for mapping the integers back to the original values:

le.inverse_transform([0, 1, 1, 2, 1, 0])

array([1, 3, 3, 7, 3, 1])

Project Example

Suppose we have the data ['white', 'green', 'red', 'green', 'white']. The task is to build an encoder from this data and then use it to transform another dataset.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()    
le.fit(['white', 'green', 'red', 'green', 'white'])    
le.classes_

array(['green', 'red', 'white'], dtype='<U5')

le.transform(["green", 'green', 'green', 'white'])

array([0, 0, 0, 2])

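However, if transform receives a value that was not among the fitted classes, it raises an error.

le.transform(["green", 'green', 'green', 'blue'])    # raises ValueError: 'blue' was not seen during fit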
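If unseen categories are expected at transform time, one option is OrdinalEncoder, the scikit-learn counterpart of LabelEncoder for input features. The sketch below assumes a scikit-learn version that supports handle_unknown="use_encoded_value" (0.24 or later) and uses -1 as a placeholder code for unknown values; both choices are illustrative assumptions rather than part of the original example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Assumption: unseen categories should map to the placeholder code -1.
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(np.array(["white", "green", "red", "green", "white"]).reshape(-1, 1))

# 'blue' was not present during fit, so it is encoded as -1 instead of raising.
codes = enc.transform(np.array(["green", "white", "blue"]).reshape(-1, 1))
print(codes.ravel())
```

Whether a placeholder code is appropriate depends on the downstream model; in some cases failing loudly, as LabelEncoder does, is the safer behavior.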

2.3 Feature Binarization

Both continuous and discrete features can be binarized.

Basics

import pandas as pd

pm25 = pd.read_csv("datasets/pm2.csv")
pm25.head()

	RANK	CITY_ID	CITY_NAME	Exposed days
0	1	594	拉萨	2
1	2	579	玉溪	7
2	3	263	厦门	8
3	4	267	泉州	9
4	5	271	漳州	10

Next, binarize the feature Exposed days, using its mean as the threshold:

import numpy as np
pm25['bdays'] = np.where(pm25["Exposed days"] > pm25["Exposed days"].mean(), 1, 0)
pm25.sample(10)

	RANK	CITY_ID	CITY_NAME	Exposed days	bdays
3	4	267	泉州	9	0
243	266	350	滨州	203	1
252	275	367	新乡	216	1
11	12	64	鄂尔多斯	18	0
229	252	419	随州	186	1
196	219	324	潍坊	144	1
127	139	616	平凉	96	0
221	244	238	合肥	175	1
123	135	128	白城	95	0
182	205	451	娄底	132	1

The new feature bdays is the binary feature obtained by binarizing Exposed days. Besides np.where, scikit-learn's Binarizer class can do the same job.

from sklearn.preprocessing import Binarizer
bn = Binarizer(threshold=pm25["Exposed days"].mean())  
result = bn.fit_transform(pm25[["Exposed days"]])   
pm25['sk-bdays'] = result
pm25.sample(10)
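As a quick sanity check, the two approaches can be compared on a small table. The values below are a hypothetical stand-in for the Exposed days column, not the real pm2.csv data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import Binarizer

# Hypothetical stand-in for the "Exposed days" column.
days = pd.DataFrame({"Exposed days": [2, 7, 96, 144, 216]})
threshold = days["Exposed days"].mean()   # 93.0

by_numpy = np.where(days["Exposed days"] > threshold, 1, 0)
by_sklearn = Binarizer(threshold=threshold).fit_transform(days[["Exposed days"]])

# Binarizer returns a 2-D array, so flatten it before comparing.
print(by_numpy, by_sklearn.ravel())
```

Both use a strict "greater than" comparison, so a value exactly equal to the threshold is mapped to 0.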

Project Example

Binarize an image that has been read from disk.

%matplotlib inline
import matplotlib.pyplot as plt
import cv2
# Helper for displaying an image in Jupyter (OpenCV stores color images as BGR)
def show_img(img):    
    if len(img.shape) == 3:
        b, g, r = cv2.split(img)   
        img = cv2.merge([r, g, b])
        plt.imshow(img)
    else:
        plt.imshow(img, cmap="gray")
    plt.axis("off")
    plt.show()

cat = cv2.imread("datasets/cat.png")
show_img(cat)

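Binarizing an image means dividing its content into two parts, foreground and background. A threshold is chosen, and each pixel value is compared with it to decide whether the pixel belongs to the foreground or the background.

First convert the image to grayscale: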

gray_cat = cv2.cvtColor(cat, cv2.COLOR_BGR2GRAY)
show_img(gray_cat)

Then apply the binarization.

Pixels whose value exceeds the threshold 127 are set to the maximum value 255; all others are set to 0.

ret,thr = cv2.threshold(gray_cat, 127, 255, cv2.THRESH_BINARY)
show_img(thr)
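The thresholding rule itself is simple enough to sketch in plain NumPy. The tiny array below is a made-up grayscale image used only to show which pixels end up as 0 and which as 255.

```python
import numpy as np

# Hypothetical 2x3 grayscale image with 8-bit pixel values.
gray = np.array([[0, 100, 127],
                 [128, 200, 255]], dtype=np.uint8)

# Same rule as cv2.THRESH_BINARY: value > thresh -> maxval, otherwise 0.
binary = np.where(gray > 127, 255, 0).astype(np.uint8)
print(binary)
```

Note that a pixel equal to the threshold (127 here) is treated as background, because the comparison is strict.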

2.4 One-Hot Encoding

Basics

import pandas as pd
g = pd.DataFrame({"gender": ["man", 'woman', 'woman', 'man', 'woman']})
g

	gender
0	man
1	woman
2	woman
3	man
4	woman

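The feature gender takes only the values man and woman. Section 2.2 handled this kind of feature by numeric encoding, but numeric encoding introduces an ordering between values that the original data does not have. To avoid this side effect, a different approach is used below.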

pd.get_dummies(g)

	gender_man	gender_woman
0	1	0
1	0	1
2	0	1
3	1	0
4	0	1

get_dummies converts a categorical feature into dummy variables.

persons = pd.DataFrame({"name":["Newton", "Andrew Ng", "Jodan", "Bill Gates"], 'color':['white', 'yellow', 'black', 'white']})
persons

	name	color
0	Newton	white
1	Andrew Ng	yellow
2	Jodan	black
3	Bill Gates	white

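Here color has three values: it is categorical but no longer binary. One-hot encoding creates one new feature per value. In row 0, where the color is white, the white feature is 1 (the "hot" bit, of which there is exactly one per row) and all the other features are 0.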

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
features = ohe.fit_transform(persons[['color']])
features.toarray()

array([[0., 1., 0.],[0., 0., 1.], [1., 0., 0.],[0., 1., 0.]])

Project Example

Create the following data:

df = pd.DataFrame({
    "color": ['green', 'red', 'blue', 'red'],
    "size": ['M', 'L', 'XL', 'L'],
    "price": [29.9, 69.9, 99.9, 59.9],
    "classlabel": ['class1', 'class2', 'class1', 'class1']
})
df

	color	size	price	classlabel
0	green	M	29.9	class1
1	red	L	69.9	class2
2	blue	XL	99.9	class1
3	red	L	59.9	class1

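The task is to do the following with this dataset:

Numerically encode the features that need it
One-hot encode the features that need it

The feature size is ordinal (M < L < XL), so it is encoded with an explicit mapping that preserves the order: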

size_mapping = {'XL': 3, 'L': 2, 'M': 1}
df['size'] = df['size'].map(size_mapping)    
df

	color	size	price	classlabel
0	green	1	29.9	class1
1	red	2	69.9	class2
2	blue	3	99.9	class1
3	red	2	59.9	class1

One-hot encode the feature color:

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
fs = ohe.fit_transform(df[['color']])
fs_ohe = pd.DataFrame(fs.toarray(),columns=['color_blue', 'color_green','color_red'])
df = pd.concat([df, fs_ohe], axis=1)
df

	color	size	price	classlabel	color_blue	color_green	color_red
0	green	1	29.9	class1	0.0	1.0	0.0
1	red	2	69.9	class2	0.0	0.0	1.0
2	blue	3	99.9	class1	1.0	0.0	0.0
3	red	2	59.9	class1	0.0	0.0	1.0

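2.5 Data Transformation

Basics

To uncover latent relationships between features in a dataset, it sometimes helps to apply a function to the features so that the relationship becomes easier to see.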

import pandas as pd

data = pd.read_csv("datasets/freefall.csv", index_col=0)
data.describe()

	time	location
count	100.000000	1.000000e+02
mean	250.000000	4.103956e+05
std	146.522832	3.709840e+05
min	0.000000	0.000000e+00
25%	124.997500	7.658593e+04
50%	250.000000	3.062812e+05
75%	375.002500	6.890859e+05
max	500.000000	1.225000e+06

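This dataset records an object falling from a sufficiently great height, with the distance fallen at each moment in time. The goal is to find the functional relationship between time and distance fallen (pretending that the formula for free fall is unknown).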

%matplotlib inline
import seaborn as sns
ax = sns.scatterplot(x='time', y='location', data=data)

Apply a logarithmic transformation to the time and location features, and draw a scatter plot of the transformed data:

import numpy as np
data.drop([0], inplace=True)    # drop the row with time 0, since log10(0) is undefined
data['logtime'] = np.log10(data['time'])    
data['logloc'] = np.log10(data['location'])   
data.head()

	time	location	logtime	logloc
1	5.05	124.99	0.703291	2.096875
2	10.10	499.95	1.004321	2.698927
3	15.15	1124.89	1.180413	3.051110
4	20.20	1999.80	1.305351	3.300987
5	25.25	3124.68	1.402261	3.494806

ax2 = sns.scatterplot(x='logtime', y='logloc', data=data)

The plot shows that the transformed features logtime and logloc are linearly related.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(data['logtime'].values.reshape(-1, 1), data['logloc'].values.reshape(-1, 1))
(reg.coef_, reg.intercept_)

(array([[1.99996182]]), array([0.69028797]))

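Fitting scikit-learn's linear regression model to the transformed data gives a slope of about 2 and an intercept of about 0.69, that is:

$\lg L = 2\lg t + 0.69$

Raising 10 to the power of both sides, with $10^{0.69} \approx 4.9$, gives:

$L = 4.9t^2$

This agrees with the law of free fall.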
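The same log-log fit can be reproduced without the CSV file. The sketch below generates synthetic free-fall data under the assumption g = 9.8 m/s² and uses np.polyfit in place of LinearRegression; the time grid is arbitrary.

```python
import numpy as np

# Synthetic free-fall data (assumption: g = 9.8 m/s^2, no air resistance).
t = np.linspace(1, 100, 50)
L = 0.5 * 9.8 * t**2

# Fit a straight line to the log-transformed data.
slope, intercept = np.polyfit(np.log10(t), np.log10(L), deg=1)

# Undo the logarithm: L = 10**intercept * t**slope
coef = 10 ** intercept
print(round(slope, 2), round(coef, 2))
```

The slope recovers the exponent of t and 10 raised to the intercept recovers the coefficient, which is the general pattern for any power law fitted on log-log axes.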

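How exactly should data be transformed, and which function should be chosen? This usually depends on the data and the business context. The transformation used above is logarithmic; other common choices include exponential, polynomial, and Box-Cox transformations.

Project Example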

dc_data = pd.read_csv('datasets/sample_data.csv')
dc_data.head()

	MONTH	AIR_TIME
0	1	28
1	1	29
2	1	29
3	1	29
4	1	29

Before processing the data, first look at its distribution:

%matplotlib inline
import matplotlib.pyplot as plt
h = plt.hist(dc_data['AIR_TIME'], bins=100)

The data in dc_data['AIR_TIME'] is clearly not normally distributed, so we transform it:

from sklearn.preprocessing import power_transform
dft2 = power_transform(dc_data[['AIR_TIME']], method='box-cox')   
hbcs = plt.hist(dft2, bins=100)

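Box-Cox belongs to the family of power transformations, which is reflected in the name of the power_transform function in sklearn.preprocessing.
Besides Box-Cox, power_transform also implements the Yeo-Johnson transformation.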
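Box-Cox is defined only for strictly positive values, so Yeo-Johnson is the usual fallback when a feature contains zeros or negatives. The sketch below uses the PowerTransformer class on a synthetic skewed feature; the data and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
# Hypothetical right-skewed feature, shifted so that it contains negative values.
skewed = rng.exponential(scale=30.0, size=(500, 1)) - 5.0

# Box-Cox would reject this input because it requires strictly positive values.
pt = PowerTransformer(method="yeo-johnson")   # standardize=True by default
transformed = pt.fit_transform(skewed)

print(pt.lambdas_)                            # fitted power parameter, one per feature
print(transformed.mean(), transformed.std())
```

Using the class rather than the power_transform function keeps the fitted lambdas_, so the same transformation can later be applied to new data with pt.transform.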

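2.6 Feature Discretization

Discrete: there is a countable number of values between any two values
Continuous: there is an infinite number of values between any two values

Some machine learning algorithms, for example certain forms of decision trees, naive Bayes, and logistic regression, expect or work better with discretized variables.

Discretizing a continuous feature also reduces the influence of outliers. For example, suppose an age feature is discretized so that values greater than 50 become 1 and all others 0. If an outlier of 500 appears in this feature, its effect is removed by the discretization, because it simply becomes 1. Compared with continuous features, discrete features can also offer advantages in computation speed, expressiveness, and model stability.

Discretization methods are usually divided into two groups, supervised and unsupervised.

Discretization is also called "binning".

Unsupervised Discretization

Basics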

import pandas as pd
ages = pd.DataFrame({'years':[10, 14, 30, 53, 67, 32, 45], 'name':['A', 'B', 'C', 'D', 'E', 'F', 'G']})
ages

	years	name
0	10	A
1	14	B
2	30	C
3	53	D
4	67	E
5	32	F
6	45	G

To discretize the feature years, you can use the pandas function cut:

pd.cut(ages['years'],3)

0 (9.943, 29.0]
1 (9.943, 29.0]
2 (29.0, 48.0]
3 (48.0, 67.0]
4 (48.0, 67.0]
5 (29.0, 48.0]
6 (29.0, 48.0]
Name: years, dtype: category
Categories (3, interval[float64]): [(9.943, 29.0] < (29.0, 48.0] < (48.0, 67.0]]

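The second argument of cut, 3, splits ages['years'] into three equal-width intervals: (9.943, 29.0], (29.0, 48.0], and (48.0, 67.0].

Because discretization is also called binning, this operation is known as equal-width binning.
Equal-width binning, however, often runs into trouble when the data contains outliers.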

ages2 = pd.DataFrame({'years':[10, 14, 30, 53, 300, 32, 45], 'name':['A', 'B', 'C', 'D', 'E', 'F', 'G']})
klass2 = pd.cut(ages2['years'], 3, labels=['Young', 'Middle', 'Senior'])
ages2['label'] = klass2
ages2

	years	name	label
0	10	A	Young
1	14	B	Young
2	30	C	Young
3	53	D	Young
4	300	E	Senior
5	32	F	Young
6	45	G	Young

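The outlier in the sample at index 4 (300) causes every other record to be labeled Young.

Specifying the bin edges explicitly avoids the influence of the outlier.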

ages2 = pd.DataFrame({'years':[10, 14, 30, 53, 300, 32, 45], 'name':['A', 'B', 'C', 'D', 'E', 'F', 'G']})
klass2 = pd.cut(ages2['years'], bins=[9, 30, 50, 300], labels=['Young', 'Middle', 'Senior'])
ages2['label'] = klass2
ages2

	years	name	label
0	10	A	Young
1	14	B	Young
2	30	C	Young
3	53	D	Senior
4	300	E	Senior
5	32	F	Middle
6	45	G	Middle
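Another way to keep a single extreme value from dominating the bins is equal-frequency binning, where the edges are quantiles of the data. The sketch below uses pd.qcut on the same years values; it is an alternative to hand-picked edges, not the approach used in the example above.

```python
import pandas as pd

ages2 = pd.DataFrame({"years": [10, 14, 30, 53, 300, 32, 45]})

# Equal-frequency binning: the edges are quantiles of the data,
# so one extreme value cannot stretch the bins.
ages2["qlabel"] = pd.qcut(ages2["years"], 3, labels=["Young", "Middle", "Senior"])
print(ages2)
```

With quantile edges the labels depend on the sample itself, so the cut points change whenever the data changes; fixed edges are easier to interpret when domain knowledge is available.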

sklearn provides the class KBinsDiscretizer for unsupervised discretization:

from sklearn.preprocessing import KBinsDiscretizer
kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')   
trans = kbd.fit_transform(ages[['years']])   
ages['kbd'] = trans[:, 0]    
ages

	years	name	kbd
0	10	A	0.0
1	14	B	0.0
2	30	C	1.0
3	53	D	2.0
4	67	E	2.0
5	32	F	1.0
6	45	G	1.0

KBinsDiscretizer(
    n_bins=5,
    *,
    encode='onehot',
    strategy='quantile',
    dtype=None,
)

Parameters
----------
n_bins : int or array-like of shape (n_features,), default=5
    The number of bins to produce. Raises ValueError if ``n_bins < 2``.

encode : {'onehot', 'onehot-dense', 'ordinal'}, default='onehot'
    Method used to encode the transformed result.

    - 'onehot': Encode the transformed result with one-hot encoding
      and return a sparse matrix. Ignored features are always
      stacked to the right.
    - 'onehot-dense': Encode the transformed result with one-hot encoding
      and return a dense array. Ignored features are always
      stacked to the right.
    - 'ordinal': Return the bin identifier encoded as an integer value.

strategy : {'uniform', 'quantile', 'kmeans'}, default='quantile'
    Strategy used to define the widths of the bins.

    - 'uniform': All bins in each feature have identical widths.
    - 'quantile': All bins in each feature have the same number of points.
    - 'kmeans': Values in each bin have the same nearest center of a 1D
      k-means cluster.

dtype : {np.float32, np.float64}, default=None
    The desired data-type for the output. If None, output dtype is
    consistent with input dtype. Only np.float32 and np.float64 are
    supported.

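The strategy parameter of KBinsDiscretizer takes three values, which correspond to three common unsupervised discretization methods.

'uniform': equal-width binning; all bins of a feature have the same width.
'quantile': binning by quantiles; every bin of a feature holds approximately the same number of samples.
'kmeans': binning based on the centroids of a one-dimensional k-means clustering of each feature.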
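The effect of the three strategies is easiest to see by comparing the bin edges they produce. The sketch below fits each strategy on the years values used earlier and reads the fitted bin_edges_ attribute; depending on the scikit-learn version, the quantile strategy may emit a warning about its defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

years = np.array([10, 14, 30, 53, 67, 32, 45], dtype=float).reshape(-1, 1)

edges = {}
for strategy in ("uniform", "quantile", "kmeans"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    kbd.fit(years)
    # bin_edges_ holds one array of edges per feature
    edges[strategy] = kbd.bin_edges_[0]
    print(strategy, edges[strategy])
```

All three share the same outer edges (the minimum and maximum of the feature) and differ only in where the inner cut points are placed.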

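Project Example

The features of the iris dataset are continuous. The task is to train a classifier on this dataset and compare its performance on the discretized features with its performance on the raw values.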

import numpy as np 
from sklearn.datasets import load_iris
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
iris = load_iris()

The iris dataset has four features, which can be listed as follows:

iris.feature_names

['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']

To keep things simple, only two features (petal length and petal width) are used below:

X = iris.data
y = iris.target
X = X[:, [2, 3]]

First, visualize the distribution of the data.

%matplotlib inline
import matplotlib.pyplot as plt
# X[:, 0] holds all values of the first feature, X[:, 1] all values of the second
plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.3,  cmap=plt.cm.RdYlBu, edgecolor='black')

Then discretize the data and visualize the distribution after discretization:

Xd = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform').fit_transform(X)
plt.scatter(Xd[:, 0], Xd[:, 1], c=y, cmap=plt.cm.RdYlBu, edgecolor='black')

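After discretization the classes are more clearly separated, which helps the classification algorithm.
Next, both versions of the data are fed to a decision tree classifier to compare the results.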

dtc = DecisionTreeClassifier(random_state=0)   
score1 = cross_val_score(dtc, X, y, cv=5)   
score2 = cross_val_score(dtc, Xd, y, cv=5)

np.mean(score1), np.std(score1)

(0.9466666666666667, 0.039999999999999994)

np.mean(score2), np.std(score2)

(0.96, 0.03265986323710903)

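The means and standard deviations suggest that discretization does help the model's performance on this dataset, although the gain is small relative to the spread across folds.

Discretizing with the k-means strategy gives an even better score in this case.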

km = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans').fit_transform(X)
s = cross_val_score(dtc, km, y, cv=5)
np.mean(s), np.std(s)

(0.9733333333333334, 0.02494438257849294)
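In the comparison above the discretizer is fitted on the whole dataset before cross-validation, so the bin edges have already seen the test folds. A stricter variant, sketched below, wraps the discretizer and the classifier in a pipeline so that the bins are refitted on the training part of each fold. The scores may differ slightly from those shown above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, [2, 3]], iris.target

# The discretizer is refitted on the training part of every fold.
pipe = make_pipeline(
    KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans"),
    DecisionTreeClassifier(random_state=0),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean(), scores.std())
```

For a dataset as small and clean as iris the difference is usually minor, but the pipeline form is the safer habit for any preprocessing step that learns from the data.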

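Supervised Discretization

Supervised discretization, like supervised learning, uses the sample labels to carry out the discretization.

This section describes supervised discretization based on entropy and information gain. Taking the table below as an example, the values column is discretized according to the results column, which serves as the label.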

values	results
1	Y
1	Y
2	N
3	Y
3	N

In the table, 3 samples have results equal to Y and 2 have results equal to N.
The entropy formula
$Entropy = -\sum_{i=1}^m p_i \log_2 p_i$ gives

$E(R) = -\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5}\log_2 \frac{2}{5} = 0.97$

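Suppose the feature values is split at the integer 2. Count the samples with results equal to Y and to N on each side of the split.

Split	Y	N	Total
Values less than or equal to 2	2	1	3
Values greater than 2	1	1	2

Now compute the entropy after the split:

$E(R,V) = \frac{3}{5}E(2,1)+ \frac{2}{5}E(1,1) = 0.95$

Then compute the information gain: $G = E(R) - E(R,V) = 0.02$

In the same way, if values is split at 1, the entropy is 0.55 and the corresponding information gain is 0.42.
Splitting at 1 clearly gives the larger information gain, so after discretization the feature values becomes [0,0,1,1,1].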
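The arithmetic above can be checked with a few lines of NumPy. The helper names entropy and information_gain are illustrative and not part of any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels, split):
    """Gain from splitting into values <= split and values > split."""
    values, labels = np.asarray(values), np.asarray(labels)
    left, right = labels[values <= split], labels[values > split]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

values = [1, 1, 2, 3, 3]
results = ["Y", "Y", "N", "Y", "N"]

print(round(entropy(results), 2))                       # 0.97
print(round(information_gain(values, results, 2), 2))   # 0.02
print(round(information_gain(values, results, 1), 2))   # 0.42
```

Choosing the split with the largest gain, here 1, yields the discretized feature [0, 0, 1, 1, 1].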

This procedure can be carried out in code with the entropy_based_binning package (pip install entropy_based_binning):

import entropy_based_binning as ebb
A = np.array([[1,1,2,3,3], [1,1,0,1,0]])
ebb.bin_array(A, nbins=2, axis=1)

array([[0, 0, 1, 1, 1],[1, 1, 0, 1, 0]])

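2.7 Feature Scaling

Feature scaling covers operations such as standardization, min-max scaling, and normalization.

Standardization

The standardization formula is

$x_{std}^{(i)} = \frac{x^{(i)}- \mu_x}{\sigma_x}$

Using the iris dataset, create a StandardScaler instance and standardize the data: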

from sklearn import datasets
from sklearn.preprocessing import StandardScaler 
iris = datasets.load_iris()
iris_std = StandardScaler().fit_transform(iris.data)

# Original values of the first 5 samples
iris['data'][:5]

array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]])

# Z-scores of the same 5 samples after standardization
iris_std[:5]

array([[-0.90068117, 1.01900435, -1.34022653, -1.3154443 ],
[-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
[-1.38535265, 0.32841405, -1.39706395, -1.3154443 ],
[-1.50652052, 0.09821729, -1.2833891 , -1.3154443 ],
[-1.02184904, 1.24920112, -1.34022653, -1.3154443 ]])

import numpy as np
np.mean(iris_std, axis=0)

array([-1.69031455e-15, -1.84297022e-15, -1.69864123e-15, -1.40924309e-15])

np.std(iris_std, axis=0)

array([1., 1., 1., 1.])

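After standardization, every feature in the dataset has mean 0 and standard deviation 1. This suits many machine learning algorithms, such as logistic regression and SVM.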
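To connect the formula with the library call, the z-scores can be computed by hand and compared with the output of StandardScaler. This is a minimal check on the same iris data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Column-wise z-score; np.std uses the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(manual, StandardScaler().fit_transform(X)))
```

Note that pandas' DataFrame.std defaults to the sample standard deviation (ddof=1), so a hand computation done in pandas would differ slightly unless ddof=0 is passed.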

Min-Max Scaling

The min-max scaling formula is

$x_{scaled}^{(i)} = \frac{x^{(i)}- x_{min}}{x_{max}- x_{min}}$

As with standardization, sklearn implements this computation in MinMaxScaler:

from sklearn.preprocessing import MinMaxScaler
iris_mm = MinMaxScaler().fit_transform(iris.data)    
iris_mm[:5]

array([[0.22222222, 0.625 , 0.06779661, 0.04166667],
[0.16666667, 0.41666667, 0.06779661, 0.04166667],
[0.11111111, 0.5 , 0.05084746, 0.04166667],
[0.08333333, 0.45833333, 0.08474576, 0.04166667],
[0.19444444, 0.66666667, 0.06779661, 0.04166667]])

np.mean(iris_mm, axis=0)

array([0.4287037 , 0.44055556, 0.46745763, 0.45805556])

np.std(iris_mm, axis=0)

array([0.22925036, 0.18100457, 0.29820408, 0.31653859])

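sklearn also has a class with a role similar to MinMaxScaler, RobustScaler. As its name suggests, it is designed to be robust to outliers.
With its default settings, the scaling it performs is the following, where $Q_2$ is the median and $Q_1$, $Q_3$ are the first and third quartiles:

$x_{nor}^{(i)} = \frac{x^{(i)}- Q_{2}(x)}{Q_3(x)- Q_1(x)}$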
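With its default settings RobustScaler subtracts the median and divides by the interquartile range. The sketch below checks this against a hand computation on a synthetic feature with outliers; the data and seed are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Hypothetical feature: 1000 regular values plus 25 outliers.
x = np.concatenate([rng.normal(20, 1, 1000), rng.normal(1, 1, 25)]).reshape(-1, 1)

q1, q2, q3 = np.percentile(x, [25, 50, 75])
manual = (x - q2) / (q3 - q1)

print(np.allclose(manual, RobustScaler().fit_transform(x)))
```

Because the median and quartiles barely move when a few extreme values are added, the scaled bulk of the data stays in a similar range with or without the outliers.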

import pandas as pd
X = pd.DataFrame({
    'x1': np.concatenate([np.random.normal(20, 1, 1000), np.random.normal(1, 1, 25)]),
    'x2': np.concatenate([np.random.normal(30, 1, 1000), np.random.normal(50, 1, 25)]),
})
X.sample(10)

	x1	x2
1008	-0.031074	50.628570
713	20.951094	29.837530
87	19.292356	30.371313
847	21.128625	29.077542
208	20.384260	30.683392
603	20.620852	30.198629
763	18.410347	31.632454
901	21.100977	30.285051
370	21.244946	28.534997
939	20.800052	29.687268

This creates a dataset X with two features. Compared with x1, the feature x2 varies over a slightly wider range and has a slightly larger variance.

np.std(X, axis=0)

x1 3.110426
x2 3.236057
dtype: float64

Scale the dataset X with both MinMaxScaler and RobustScaler:

from sklearn.preprocessing import RobustScaler, MinMaxScaler
robust = RobustScaler()
robust_scaled = robust.fit_transform(X)
robust_scaled = pd.DataFrame(robust_scaled, columns=['x1', 'x2'])

minmax = MinMaxScaler()
minmax_scaled = minmax.fit_transform(X)
minmax_scaled = pd.DataFrame(minmax_scaled, columns=['x1', 'x2'])

To compare the effect of the scalers visually, plot the distributions of the three versions of the data.

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(9, 5))

ax1.set_title('Before Scaling')
sns.kdeplot(X['x1'], ax=ax1)
sns.kdeplot(X['x2'], ax=ax1)

ax2.set_title('After Robust Scaling')
sns.kdeplot(robust_scaled['x1'], ax=ax2)
sns.kdeplot(robust_scaled['x2'], ax=ax2)

ax3.set_title('After Min-Max Scaling')
sns.kdeplot(minmax_scaled['x1'], ax=ax3)
sns.kdeplot(minmax_scaled['x2'], ax=ax3)

Normalization

from sklearn.preprocessing import Normalizer 
# By default, normalize by the l2 norm
norma = Normalizer()    
norma.fit_transform([[3, 4]])

array([[0.6, 0.8]])

Calculation:

$\frac{3}{ \sqrt{3^2 +4^2}}= 0.6 ,\frac{4}{ \sqrt{3^2 +4^2}}= 0.8$

# Normalize by the l1 norm
norma1 = Normalizer(norm='l1')
norma1.fit_transform([[3, 4]])

Calculation:

$\frac{3}{|3|+|4|}= 0.42857 ,\frac{4}{ |3|+|4|}= 0.57143$

Besides these two, norm can also be set to max, which normalizes by the maximum value in the vector.

norma_max = Normalizer(norm='max')
norma_max.fit_transform([[3, 4]])

array([[0.75, 1. ]])

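Note that normalization with Normalizer and (0, 1) scaling with MinMaxScaler both shrink values into a small range, but they differ considerably. Normalizer rescales each sample (row) to unit norm, whereas MinMaxScaler rescales each feature (column) to the range 0 to 1. The example below shows the difference.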

from mpl_toolkits.mplot3d import Axes3D

df = pd.DataFrame({
    'x1': np.random.randint(-100, 100, 1000).astype(float),
    'y1': np.random.randint(-80, 80, 1000).astype(float),
    'z1': np.random.randint(-150, 150, 1000).astype(float),
})

scaler = Normalizer()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

fig = plt.figure(figsize=(9, 5))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')
ax1.scatter(df['x1'], df['y1'], df['z1'])
ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])

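The dataset has three features. The data was originally spread through three-dimensional space; after l2 normalization, every point lies on the surface of the unit sphere.
What happens if MinMaxScaler is used to scale the same data to the range (0, 1)?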

scaler = MinMaxScaler()
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=df.columns)

fig = plt.figure(figsize=(9, 5))
ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')
ax1.scatter(df['x1'], df['y1'], df['z1'])
ax2.scatter(scaled_df['x1'], scaled_df['y1'], scaled_df['z1'])
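The difference can also be checked numerically instead of visually. The sketch below uses synthetic points on three different ranges (an illustrative assumption) and inspects row norms after Normalizer and column extremes after MinMaxScaler.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer

rng = np.random.default_rng(0)
# Hypothetical data: 1000 samples, 3 features on different ranges.
pts = rng.uniform(low=[-100, -80, -150], high=[100, 80, 150], size=(1000, 3))

row_norms = np.linalg.norm(Normalizer().fit_transform(pts), axis=1)
mm = MinMaxScaler().fit_transform(pts)

print(row_norms.min(), row_norms.max())    # every row has l2 norm 1
print(mm.min(axis=0), mm.max(axis=0))      # every column spans 0 to 1
```

Normalizer therefore changes the relationship between features within a sample, while MinMaxScaler preserves the shape of each feature's distribution and only shifts and rescales it.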

Project Example

Standardize and min-max scale the features Alcohol and Malic_acid from wine_data.csv, then compare the distributions of the original and scaled data in a plot.

import pandas as pd
import numpy as np

df = pd.read_csv("datasets/wine_data.csv",usecols=[0,1,2])
df.head()

	Class_label	Alcohol	Malic_acid
0	1	14.23	1.71
1	1	13.20	1.78
2	1	13.16	2.36
3	1	14.37	1.95
4	1	13.24	2.59

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

std_scaler = StandardScaler()
df_std = std_scaler.fit_transform(df[['Alcohol', 'Malic_acid']])

mm_scaler = MinMaxScaler()
df_mm = mm_scaler.fit_transform(df[['Alcohol', 'Malic_acid']])

Draw scatter plots of Alcohol and Malic_acid for the original, standardized, and min-max scaled data:

%matplotlib inline

import matplotlib.pyplot as plt

plt.figure(figsize=(8, 6))
plt.scatter(df['Alcohol'], df['Malic_acid'],
            color='green', label='input scale', alpha=0.5)

plt.scatter(df_std[:,0], df_std[:,1], color='black',
            label='Standardized', alpha=0.3)

plt.scatter(df_mm[:,0], df_mm[:,1],
            color='blue', label='min-max scaled', alpha=0.3)

plt.title('Alcohol and Malic Acid content of the wine dataset')
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.legend(loc='upper left')
plt.grid()

plt.tight_layout()

posted @ 2022-06-09 16:20 王陸
