Data Encoding

Methods for converting text and character data into numeric values.

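Label Encoding (LabelEncoder)

As the name suggests, this encodes labels; in practice it can be applied to any one-dimensional sequence of values, not only labels.

The approach is simple: each distinct value in the sequence is assigned an integer index, and that index stands in for the original string value.

Example code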

import numpy as np
from sklearn.preprocessing import LabelEncoder  # intended for labels

label = np.array(['好','不好','不好','好','不好','好','不好'])  # '好' = good, '不好' = bad
encoding = LabelEncoder()
classes = encoding.fit_transform(label)  # convert the string values to integers
print(encoding.classes_)            # ['不好' '好']  (classes_ is sorted)
print(classes)                      # [1 0 0 1 0 1 0]

label = encoding.inverse_transform(classes)     # restore the original values
print(label)

The code above maps '不好' (bad) to 0 and '好' (good) to 1.
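
The mapping that label encoding produces can be reproduced with NumPy alone, which makes the mechanism easier to see. This is only a sketch of the idea (sorted distinct values, then the index of each element among them), not a description of scikit-learn's internals; the English strings are stand-ins for the values used above.

```python
import numpy as np

labels = np.array(['good', 'bad', 'bad', 'good', 'bad', 'good', 'bad'])

# np.unique returns the sorted distinct values and, with return_inverse=True,
# the index of each original element within that sorted array.
classes, codes = np.unique(labels, return_inverse=True)

print(classes)  # ['bad' 'good']
print(codes)    # [1 0 0 1 0 1 0]

# Indexing the sorted values with the codes restores the original data.
restored = classes[codes]
print(restored)
```

Because the distinct values are sorted, the integer assigned to a value depends on its sort position, not on the order in which it first appears.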

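This method is simple, but it has a serious problem.

Consider the following cases:

If the feature values are ['chicken', 'duck', 'goose'], they are completely independent of one another, with no relationship and no order.

If the feature values are ['primary school', 'junior high', 'senior high'], they are not fully independent: they are ordered (primary school < junior high < senior high), but arithmetic on them is meaningless.

If the feature values are ['<45', '<90', '<135'], they are not fully independent either: they are ordered, and arithmetic on them is also meaningful.

Mapping all of these to 0, 1, 2 ignores the relationships among the values of each feature, conveys inaccurate information to the model, and hurts modeling results.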
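
For the ordered case (education levels), one way to keep the order in the integer codes is to state it explicitly. The sketch below uses scikit-learn's OrdinalEncoder with its `categories` parameter; the English level names and the sample data are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical English stand-ins for the education levels, in ascending order.
levels = ['primary', 'junior_high', 'senior_high']

# One feature column, four samples (illustrative data).
X = np.array([['junior_high'], ['primary'], ['senior_high'], ['primary']])

# Passing the category order explicitly makes the codes follow
# primary < junior_high < senior_high instead of alphabetical order.
encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(X)

print(codes.ravel())  # [1. 0. 2. 0.]
```

Without the explicit list, the categories would be sorted alphabetically and 'junior_high' would receive the smallest code, which misstates the order.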

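One-Hot Encoding (OneHotEncoder)

Also called dummy encoding: each value of a feature is represented by its own binary indicator column, which produces a sparse matrix.

Purpose

1. Convert the data to numeric values

2. Solve the problems of label encoding

3. Increase the number of dimensions

For features

When features are one-hot encoded, multicollinearity can arise among the resulting columns, because the indicator columns of one feature always sum to 1 in every row;

to avoid this, one feasible approach is to drop one column: of the n indicator columns of a feature, any n-1 determine the n-th;

this also explains why a feature with only 2 values needs no one-hot encoding: it is equivalent to encoding it and then dropping one column, which leaves a single column.

Example code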
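
The column-sum property and the drop-one-column remedy can be checked directly. This is a minimal sketch assuming scikit-learn 0.21 or later, where OneHotEncoder has a `drop` parameter; the animal names are illustrative data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three values (illustrative data).
X = np.array([['chicken'], ['duck'], ['goose'], ['duck']])

# Full encoding: one indicator column per category.
full = OneHotEncoder().fit_transform(X).toarray()
# Reduced encoding: the first category's column is dropped.
reduced = OneHotEncoder(drop='first').fit_transform(X).toarray()

print(full)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
print(full.sum(axis=1))  # [1. 1. 1. 1.]  every row sums to 1
print(reduced)
# [[0. 0.]
#  [1. 0.]
#  [0. 1.]
#  [1. 0.]]
```

In the reduced matrix, a row of all zeros stands for the dropped category ('chicken' here), so no information is lost.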

from sklearn.preprocessing import OneHotEncoder
import numpy as np

train = np.array([  [0, 1, 2],
                    [1, 1, 0],
                    [2, 0, 1],
                    [3, 1, 1]])
one_hot = OneHotEncoder()
one_hot = one_hot.fit(train)
result = one_hot.transform([[1, 0, 1]]).toarray()
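print(result)                                       # [[0. 1. 0. 0. 1. 0. 0. 1. 0.]]

# print(one_hot.transform([[1, 0, 1]]).todense())   # transform returns a sparse matrix; todense() converts it to a dense one
# print(one_hot.inverse_transform(result))          # restore the original values
# print(one_hot.get_feature_names_out())            # get_feature_names() in older scikit-learn versions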

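### Explanation
# column 1 takes the values 0 1 2 3 --> 4 indicator columns
# column 2 takes the values 0 1     --> 2 indicator columns
# column 3 takes the values 0 1 2   --> 3 indicator columns

# [1, 0, 1] --> 1 ==> 0 1 0 0
#               0 ==> 1 0
#               1 ==> 0 1 0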


### test1
train = np.array([  ['0', 1, 2],
                    ['1', 1, 0],
                    ['2', 0, 1],
                    ['3', 1, 1]])
one_hot = OneHotEncoder()
one_hot.fit(train)
# np.array coerces the mixed str/int row to strings, matching how train was built
print(one_hot.transform(np.array([['1', 0, 1]])).toarray())    # [[0. 1. 0. 0. 1. 0. 0. 1. 0.]]

### test2
train = np.array([  ['0', 'a', 2],
                    ['1', 'a', 0],
                    ['2', 'b', 1],
                    ['3', 'a', 1]])
one_hot = OneHotEncoder()
one_hot.fit(train)
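# np.array coerces the mixed str/int row to strings, matching how train was built
print(one_hot.transform(np.array([['1', 'b', 1]])).toarray())    # [[0. 1. 0. 0. 0. 1. 0. 1. 0.]] with scikit-learn >= 0.20
# scikit-learn < 0.20 raised: ValueError: could not convert string to float: a

Before version 0.20, the one-hot encoder in sklearn could not handle non-numeric strings directly; the data had to be encoded with LabelEncoder first and then one-hot encoded. From 0.20 on, OneHotEncoder accepts string categories directly.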

For labels

Convert integer-encoded labels to one-hot form

import numpy as np
from sklearn.preprocessing import label_binarize

y = np.array([0,0,1,2,2])
y = label_binarize(y, classes=[0, 1, 2])
print(y)
# [[1 0 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 0 1]]
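
The same conversion can be sketched with NumPy alone, which shows what the one-hot form of a label actually is. This assumes the labels are already integers in the range 0 to num_classes - 1; `num_classes` is a name introduced here for illustration.

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2])
num_classes = 3  # assumed to be known in advance

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the label array yields one row per sample.
one_hot = np.eye(num_classes, dtype=int)[y]
print(one_hot)

# argmax along each row recovers the integer labels.
recovered = one_hot.argmax(axis=1)
print(recovered)
```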

Binarization

Very simple: set a threshold; values greater than it become 1, and values less than or equal to it become 0.

from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import KBinsDiscretizer

For detailed usage, please search online.
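
As a starting point, here is a minimal sketch of how the two classes are typically called. The ages and the threshold are made-up values. Note that KBinsDiscretizer is not strictly binarization: it generalizes the idea by cutting a continuous feature into several bins rather than two.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, KBinsDiscretizer

# One continuous feature, five samples (illustrative data).
ages = np.array([[10.0], [30.0], [50.0], [70.0], [90.0]])

# Binarizer: values strictly greater than the threshold become 1, the rest 0.
binary = Binarizer(threshold=50.0).fit_transform(ages)
print(binary.ravel())  # [0. 0. 0. 1. 1.]

# KBinsDiscretizer: split the range [10, 90] into 3 equal-width bins
# and return the bin index of each value.
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
bins = binner.fit_transform(ages)
print(bins.ravel())    # [0. 0. 1. 2. 2.]
```

With encode='onehot' (the default), KBinsDiscretizer would return the bin membership as a sparse one-hot matrix instead of bin indices.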

Posted 2019-04-15 11:55 by 努力的孔子
