2. Data normalization
Mean-variance (Z-score) normalization and range (min-max) normalization
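Because variables or indicators are measured in different units, some take very large values while others take very small ones. During model computation the large-valued features can swamp the small-valued ones and distort the model. The data therefore need to be normalized, in other words made dimensionless.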
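To make the "swamping" effect concrete, here is a minimal sketch with two made-up samples (the values are illustrative, not taken from the dataset). It shows how much each feature contributes to the squared Euclidean distance when the features live on very different scales.

```python
import numpy as np

# Two hypothetical samples with two features on very different scales:
# feature 0 is in the thousands, feature 1 is in single or double digits.
a = np.array([1600.0, 3.0])
b = np.array([1300.0, 13.0])

# Per-feature contribution to the squared Euclidean distance.
contrib = (a - b) ** 2

# Fraction of the total squared distance owed to each feature.
share = contrib / contrib.sum()

print(contrib)
print(share)
```

Almost all of the distance comes from feature 0, so a distance-based model would effectively ignore feature 1 unless the data are rescaled first.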
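Mean-variance normalization (Z-score): subtract the mean of a variable or indicator from each value, then divide by its standard deviation. The resulting data have mean 0 and variance 1. The formula is:
$$x^* = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$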
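The formula can be applied by hand with NumPy. The sketch below uses a small made-up matrix (not the article's data) and normalizes each column separately, then checks that the result has mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are features.
x = np.array([[17.0, 1600.0],
              [8.0, 140.0],
              [16.0, 1300.0],
              [2.0, 0.4]])

# Column-wise mean and standard deviation.
# NumPy's default is ddof=0, i.e. the population standard deviation.
mean = x.mean(axis=0)
std = x.std(axis=0)

# x* = (x - mean(x)) / std(x), applied to every column via broadcasting.
x_star = (x - mean) / std

print(x_star.mean(axis=0))  # approximately 0 for every column
print(x_star.std(axis=0))   # approximately 1 for every column
```

A column with a single repeated value has a standard deviation of 0, so this hand-written version would divide by zero there. Treat it as an illustration of the formula rather than production code.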
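Range normalization (min-max): subtract the minimum of a variable or indicator from each value, then divide by the difference between its maximum and minimum. The resulting data lie in the range [0, 1]. The formula is:
$$x^* = \frac{x - \min(x)}{\max(x) - \min(x)}$$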
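The min-max formula can also be written out by hand. This is a minimal sketch on a made-up matrix (the values are illustrative only), scaling each column independently.

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are features.
x = np.array([[17.0, 1600.0],
              [8.0, 140.0],
              [16.0, 1300.0],
              [2.0, 0.4]])

# Column-wise minimum and maximum.
col_min = x.min(axis=0)
col_max = x.max(axis=0)

# x* = (x - min(x)) / (max(x) - min(x)), applied to every column.
# A constant column would make the denominator 0; that case is not handled here.
x_star = (x - col_min) / (col_max - col_min)

print(x_star.min(axis=0))  # 0 for every column
print(x_star.max(axis=0))  # 1 for every column
```

Because the result depends only on the minimum and maximum, a single extreme value can squeeze all the other values into a narrow band near 0.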
1. Read the data
# Read the data
import numpy as np
data = np.load('data.npy')
# Keep every column except the first
data = data[:, 1:]
data
array([[ 17. , 66.17647059, 32. , 1614.96618125,
13.15625 ],
[ 8. , 68.6875 , 36. , 143.56458056,
3.80555556],
[ 16. , 65.84375 , 43. , 1344.13137674,
12.69767442],
...,
[ 10. , 67.95 , 24. , 115.87417083,
2.79166667],
[ 21. , 66.5 , 41. , 538.71289268,
20.31707317],
[ 11. , 78.27272727, 9. , 62.98323333,
9.44444444]])
# Personal preference: a DataFrame is easier to read in Jupyter
import pandas as pd
data = pd.DataFrame(data)
data
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 17.0 | 66.176471 | 32.0 | 1614.966181 | 13.156250 |
1 | 8.0 | 68.687500 | 36.0 | 143.564581 | 3.805556 |
2 | 16.0 | 65.843750 | 43.0 | 1344.131377 | 12.697674 |
3 | 2.0 | 75.000000 | 2.0 | 0.365700 | 1.000000 |
4 | 27.0 | 65.740741 | 60.0 | 991.953787 | 11.100000 |
... | ... | ... | ... | ... | ... |
830 | 35.0 | 66.057143 | 44.0 | 127.945364 | 12.250000 |
831 | 14.0 | 69.714286 | 7.0 | 32.219643 | 15.571429 |
832 | 10.0 | 67.950000 | 24.0 | 115.874171 | 2.791667 |
833 | 21.0 | 66.500000 | 41.0 | 538.712893 | 20.317073 |
834 | 11.0 | 78.272727 | 9.0 | 62.983233 | 9.444444 |
835 rows × 5 columns
2. Import the preprocessing libraries
# Import the preprocessing libraries
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
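3. Handle missing values
# Fill missing values with the column mean
# Create the imputer object imp_mean with SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform fits the imputer and returns the filled data in one step;
# from here on imp_mean refers to the imputed array, not the imputer object
imp_mean = imp_mean.fit_transform(data)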
imp_mean
array([[ 17. , 66.17647059, 32. , 1614.96618125,
13.15625 ],
[ 8. , 68.6875 , 36. , 143.56458056,
3.80555556],
[ 16. , 65.84375 , 43. , 1344.13137674,
12.69767442],
...,
[ 10. , 67.95 , 24. , 115.87417083,
2.79166667],
[ 21. , 66.5 , 41. , 538.71289268,
20.31707317],
[ 11. , 78.27272727, 9. , 62.98323333,
9.44444444]])
# Show as a DataFrame for readability
pd.DataFrame(imp_mean)
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 17.0 | 66.176471 | 32.0 | 1614.966181 | 13.156250 |
1 | 8.0 | 68.687500 | 36.0 | 143.564581 | 3.805556 |
2 | 16.0 | 65.843750 | 43.0 | 1344.131377 | 12.697674 |
3 | 2.0 | 75.000000 | 2.0 | 0.365700 | 1.000000 |
4 | 27.0 | 65.740741 | 60.0 | 991.953787 | 11.100000 |
... | ... | ... | ... | ... | ... |
830 | 35.0 | 66.057143 | 44.0 | 127.945364 | 12.250000 |
831 | 14.0 | 69.714286 | 7.0 | 32.219643 | 15.571429 |
832 | 10.0 | 67.950000 | 24.0 | 115.874171 | 2.791667 |
833 | 21.0 | 66.500000 | 41.0 | 538.712893 | 20.317073 |
834 | 11.0 | 78.272727 | 9.0 | 62.983233 | 9.444444 |
835 rows × 5 columns
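The rows displayed above look the same before and after imputation, so the effect of mean filling is easier to see on a tiny array. The sketch below uses made-up values with two deliberately missing entries. `statistics_` is the fitted attribute in which `SimpleImputer` stores the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy)

# Column means computed from the non-missing entries only.
print(imputer.statistics_)
# Each NaN is replaced by the mean of its own column.
print(filled)
```

Keeping the fitted imputer in its own variable (here `imputer`) makes it possible to call `imputer.transform(...)` later on new data with the same fill values.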
4. Mean-variance normalization (Z-score normalization)
X1 = imp_mean
scaled_x = preprocessing.scale(X1)
pd.DataFrame(scaled_x)
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0.200258 | -0.827606 | 0.055546 | 2.843538 | 0.769541 |
1 | -0.689187 | -0.092243 | 0.206625 | -0.185541 | -0.651561 |
2 | 0.101431 | -0.925045 | 0.471013 | 2.285988 | 0.699848 |
3 | -1.282151 | 1.756395 | -1.077545 | -0.480335 | -1.077944 |
4 | 1.188531 | -0.955211 | 1.113098 | 1.560983 | 0.457036 |
... | ... | ... | ... | ... | ... |
830 | 1.979150 | -0.862552 | 0.508783 | -0.217695 | 0.631811 |
831 | -0.096223 | 0.208455 | -0.888696 | -0.414760 | 1.136596 |
832 | -0.491533 | -0.308222 | -0.246611 | -0.242546 | -0.805650 |
833 | 0.595568 | -0.732860 | 0.395474 | 0.627925 | 1.857831 |
834 | -0.392705 | 2.714824 | -0.813157 | -0.351429 | 0.205428 |
835 rows × 5 columns
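As a sanity check, `preprocessing.scale` should agree with the hand-written formula when the population standard deviation (`ddof=0`) is used. The sketch below runs on randomly generated data, not the article's `data.npy`. It also shows `StandardScaler`, which stores the fitted mean and standard deviation so the same transform can be applied to new rows; `new_rows` is a hypothetical example.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical data: 100 samples, two features on different scales.
rng = np.random.default_rng(0)
toy = rng.normal(loc=[10.0, 500.0], scale=[5.0, 400.0], size=(100, 2))

# One-shot function, as used in the article.
scaled = preprocessing.scale(toy)

# The same computation by hand (population standard deviation, ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
print(np.allclose(scaled, manual))

# StandardScaler keeps the fitted statistics for later reuse.
scaler = preprocessing.StandardScaler().fit(toy)
new_rows = np.array([[12.0, 300.0]])  # hypothetical unseen sample
new_scaled = scaler.transform(new_rows)
print(new_scaled)
```

When a model is trained on one split and evaluated on another, fitting the scaler on the training split and reusing it on the other split avoids leaking statistics from the evaluation data.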
5. Range normalization (min-max normalization)
X2 = imp_mean
# Rescale the data to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()
minmax_x = min_max_scaler.fit_transform(X2)
pd.DataFrame(minmax_x)
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
0 | 0.380952 | 0.044063 | 0.251969 | 0.339418 | 0.138139 |
1 | 0.166667 | 0.171583 | 0.283465 | 0.030155 | 0.031881 |
2 | 0.357143 | 0.027166 | 0.338583 | 0.282493 | 0.132928 |
3 | 0.023810 | 0.492158 | 0.015748 | 0.000057 | 0.000000 |
4 | 0.619048 | 0.021935 | 0.472441 | 0.208472 | 0.114773 |
... | ... | ... | ... | ... | ... |
830 | 0.809524 | 0.038003 | 0.346457 | 0.026872 | 0.127841 |
831 | 0.309524 | 0.223728 | 0.055118 | 0.006753 | 0.165584 |
832 | 0.214286 | 0.134130 | 0.188976 | 0.024335 | 0.020360 |
833 | 0.476190 | 0.060493 | 0.322835 | 0.113208 | 0.219512 |
834 | 0.238095 | 0.658361 | 0.070866 | 0.013218 | 0.095960 |
835 rows × 5 columns
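A fitted `MinMaxScaler` remembers each column's minimum and maximum, which is useful in two ways: the scaling can be undone, and new data can be scaled with the same parameters. The sketch below uses made-up values, and `new_rows` is a hypothetical unseen sample.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: rows are samples, columns are features.
toy = np.array([[2.0, 0.4],
                [8.0, 140.0],
                [17.0, 1600.0]])

scaler = preprocessing.MinMaxScaler()  # default feature_range is (0, 1)
scaled = scaler.fit_transform(toy)

# Per-column minimum and maximum learned during fit.
print(scaler.data_min_, scaler.data_max_)

# inverse_transform maps the scaled values back to the original units.
restored = scaler.inverse_transform(scaled)

# A new value above the fitted maximum is mapped above 1.
new_rows = np.array([[20.0, 800.0]])
new_scaled = scaler.transform(new_rows)
print(new_scaled)
```

The [0, 1] range is therefore only guaranteed for the data the scaler was fitted on; values outside the fitted minimum and maximum fall outside that range.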