2. Data Normalization

Mean-variance normalization and range (min-max) normalization

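Because variables or indicators are measured in different units, some take very large values and others very small ones. During model computation the large-valued features dominate the small-valued ones, which distorts the model. These data therefore need to be normalized, that is, made dimensionless.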

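Mean-variance normalization: subtract the mean of the variable or indicator from each value, then divide by its standard deviation. The resulting data have mean 0 and variance 1. The formula is: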

                    $$x^* = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$$
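
To make the formula concrete, here is a minimal NumPy sketch that applies it column by column. The array `toy` is made-up illustration data, not the data set used in this post, and the sketch assumes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Column-wise mean and population standard deviation (ddof=0).
col_mean = toy.mean(axis=0)
col_std = toy.std(axis=0)

# Apply x* = (x - mean(x)) / std(x) to every column via broadcasting.
z = (toy - col_mean) / col_std

print(z.mean(axis=0))  # each column mean is approximately 0
print(z.std(axis=0))   # each column standard deviation is approximately 1
```

A column whose values are all identical has a standard deviation of 0, so this plain formula would divide by zero there; such columns need separate handling.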

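Range normalization: subtract the minimum of the variable or indicator from each value, then divide by the difference between its maximum and minimum. The resulting values lie in [0, 1]. The formula is: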

                    $$x^* = \frac{x - \min(x)}{\max(x) - \min(x)}$$
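
The same idea can be sketched by hand for the range formula. Again, `toy_mm` is made-up illustration data, and the variable names are only for this example.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
toy_mm = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [5.0, 900.0]])

# Column-wise minimum and maximum.
col_min = toy_mm.min(axis=0)
col_max = toy_mm.max(axis=0)

# Apply x* = (x - min(x)) / (max(x) - min(x)) to every column.
scaled_mm = (toy_mm - col_min) / (col_max - col_min)

print(scaled_mm)
# Both columns become 0, 0.25, 0.5, 1: the minimum maps to 0 and the maximum to 1.
```

Because the result depends only on the minimum and maximum, a single extreme outlier compresses all other values into a narrow part of [0, 1].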

1. Read the data

# Read the data
import numpy as np
data=np.load('data.npy')
data=data[:,1:]
data
array([[  17.        ,   66.17647059,   32.        , 1614.96618125,
          13.15625   ],
       [   8.        ,   68.6875    ,   36.        ,  143.56458056,
           3.80555556],
       [  16.        ,   65.84375   ,   43.        , 1344.13137674,
          12.69767442],
       ...,
       [  10.        ,   67.95      ,   24.        ,  115.87417083,
           2.79166667],
       [  21.        ,   66.5       ,   41.        ,  538.71289268,
          20.31707317],
       [  11.        ,   78.27272727,    9.        ,   62.98323333,
           9.44444444]])
# Personal preference: a DataFrame is easier to read inside Jupyter
import pandas as pd
data = pd.DataFrame(data)
data
0 1 2 3 4
0 17.0 66.176471 32.0 1614.966181 13.156250
1 8.0 68.687500 36.0 143.564581 3.805556
2 16.0 65.843750 43.0 1344.131377 12.697674
3 2.0 75.000000 2.0 0.365700 1.000000
4 27.0 65.740741 60.0 991.953787 11.100000
... ... ... ... ... ...
830 35.0 66.057143 44.0 127.945364 12.250000
831 14.0 69.714286 7.0 32.219643 15.571429
832 10.0 67.950000 24.0 115.874171 2.791667
833 21.0 66.500000 41.0 538.712893 20.317073
834 11.0 78.272727 9.0 62.983233 9.444444

835 rows × 5 columns

2. Import the preprocessing libraries

# Import the preprocessing libraries
from sklearn.impute import SimpleImputer
from sklearn import preprocessing

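3. Handle missing values

# Fill missing values with the column mean
# Use SimpleImputer to create the imputer object imp_mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit_transform fits the imputer and returns the filled array in one step;
# after this line imp_mean refers to that array, not to the imputer
imp_mean = imp_mean.fit_transform(data)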
imp_mean
array([[  17.        ,   66.17647059,   32.        , 1614.96618125,
          13.15625   ],
       [   8.        ,   68.6875    ,   36.        ,  143.56458056,
           3.80555556],
       [  16.        ,   65.84375   ,   43.        , 1344.13137674,
          12.69767442],
       ...,
       [  10.        ,   67.95      ,   24.        ,  115.87417083,
           2.79166667],
       [  21.        ,   66.5       ,   41.        ,  538.71289268,
          20.31707317],
       [  11.        ,   78.27272727,    9.        ,   62.98323333,
           9.44444444]])
# Easier to read as a DataFrame
pd.DataFrame(imp_mean)
0 1 2 3 4
0 17.0 66.176471 32.0 1614.966181 13.156250
1 8.0 68.687500 36.0 143.564581 3.805556
2 16.0 65.843750 43.0 1344.131377 12.697674
3 2.0 75.000000 2.0 0.365700 1.000000
4 27.0 65.740741 60.0 991.953787 11.100000
... ... ... ... ... ...
830 35.0 66.057143 44.0 127.945364 12.250000
831 14.0 69.714286 7.0 32.219643 15.571429
832 10.0 67.950000 24.0 115.874171 2.791667
833 21.0 66.500000 41.0 538.712893 20.317073
834 11.0 78.272727 9.0 62.983233 9.444444

835 rows × 5 columns
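
The output above looks identical to the input, so the effect of the imputer is hard to see on this data. As an illustration, the sketch below runs the same `SimpleImputer` call on a small made-up array that does contain `np.nan`; the names `toy_nan`, `imputer`, and `filled` are hypothetical and chosen so that the fitted imputer and its output stay under separate names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
toy_nan = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])

# Mean imputation: each NaN is replaced by the mean of the
# non-missing values in the same column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(toy_nan)

print(imputer.statistics_)  # per-column fill values: 2.0 and 15.0
print(filled)               # NaNs replaced; all other values unchanged
```

Keeping the fitted `imputer` object around makes it possible to apply the same fill values to new data later with `imputer.transform(...)`.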

4. Mean-variance normalization (Z-score normalization)

X1 = imp_mean
scaled_x = preprocessing.scale(X1)
pd.DataFrame(scaled_x)
0 1 2 3 4
0 0.200258 -0.827606 0.055546 2.843538 0.769541
1 -0.689187 -0.092243 0.206625 -0.185541 -0.651561
2 0.101431 -0.925045 0.471013 2.285988 0.699848
3 -1.282151 1.756395 -1.077545 -0.480335 -1.077944
4 1.188531 -0.955211 1.113098 1.560983 0.457036
... ... ... ... ... ...
830 1.979150 -0.862552 0.508783 -0.217695 0.631811
831 -0.096223 0.208455 -0.888696 -0.414760 1.136596
832 -0.491533 -0.308222 -0.246611 -0.242546 -0.805650
833 0.595568 -0.732860 0.395474 0.627925 1.857831
834 -0.392705 2.714824 -0.813157 -0.351429 0.205428

835 rows × 5 columns
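
As a sanity check, the sketch below compares `preprocessing.scale` with the formula from the start of this post on randomly generated data (hypothetical, seeded for reproducibility). It also shows `preprocessing.StandardScaler`, an alternative not used in this post, which stores the fitted mean and standard deviation so the same transformation can be applied to rows that arrive later.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical data: 100 samples, 3 features.
rng = np.random.default_rng(0)
toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# preprocessing.scale standardizes each column in one call.
scaled = preprocessing.scale(toy)

# The same result by hand, using the population standard deviation (ddof=0).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)
print(np.allclose(scaled, manual))

# StandardScaler remembers the column statistics learned in fit().
scaler = preprocessing.StandardScaler().fit(toy)
new_rows = np.array([[50.0, 50.0, 50.0]])  # hypothetical unseen sample
print(scaler.transform(new_rows))
```

The function form is convenient for a one-off transformation; the scaler object is the usual choice when a training set and a test set must share the same statistics.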

5. Range normalization (min-max normalization)

X2 = imp_mean

# Scale each column of the data to the range [0, 1]
min_max_scaler = preprocessing.MinMaxScaler()

minmax_x = min_max_scaler.fit_transform(X2)
pd.DataFrame(minmax_x)
0 1 2 3 4
0 0.380952 0.044063 0.251969 0.339418 0.138139
1 0.166667 0.171583 0.283465 0.030155 0.031881
2 0.357143 0.027166 0.338583 0.282493 0.132928
3 0.023810 0.492158 0.015748 0.000057 0.000000
4 0.619048 0.021935 0.472441 0.208472 0.114773
... ... ... ... ... ...
830 0.809524 0.038003 0.346457 0.026872 0.127841
831 0.309524 0.223728 0.055118 0.006753 0.165584
832 0.214286 0.134130 0.188976 0.024335 0.020360
833 0.476190 0.060493 0.322835 0.113208 0.219512
834 0.238095 0.658361 0.070866 0.013218 0.095960

835 rows × 5 columns
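
A fitted `MinMaxScaler` also keeps the per-column minimum and maximum it learned, which allows the scaling to be undone. The following sketch illustrates this on made-up data; `toy_range`, `mm_scaler`, and `restored` are hypothetical names for this example only.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples, 2 features.
toy_range = np.array([[1.0, 100.0],
                      [2.0, 300.0],
                      [5.0, 900.0]])

# feature_range=(0, 1) is the default; it is spelled out here for clarity.
mm_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
toy_scaled = mm_scaler.fit_transform(toy_range)

# Per-column minimum and maximum seen during fit.
print(mm_scaler.data_min_, mm_scaler.data_max_)

# inverse_transform maps the scaled values back to the original units.
restored = mm_scaler.inverse_transform(toy_scaled)
print(np.allclose(restored, toy_range))
```

Passing a different `feature_range`, for example `(-1, 1)`, rescales to that interval instead of [0, 1].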


posted @ 2022-04-07 21:14  AubeLiang