Handling Categorical Data with sklearn
Reposted from: https://blog.csdn.net/sinat_29957455/article/details/79452141
1. Mapping ordinal features
import pandas as pd
if __name__ == "__main__":
    # Define the mapping from clothing size to an ordered integer
    size_mapping = {"S": 1, "M": 2, "X": 3, "XL": 4}
    # Build a DataFrame
    data = pd.DataFrame([
        ["green", "S", 100],
        ["blue", "M", 110],
        ["red", "X", 120],
        ["black", "XL", 130]
    ])
    # Set the column names
    data.columns = ["color", "size", "price"]
    # Map the categorical values in the size column to integers
    data["size"] = data["size"].map(size_mapping)
    print(data)
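As an alternative sketch of the same idea, pandas' ordered `Categorical` type can carry the size order itself, so the integer codes come from a declared order rather than a hand-written dictionary. The list `size_order`, the column name `size_code`, and the `+ 1` offset (to match the 1-based `size_mapping`) are assumptions made for illustration.

```python
import pandas as pd

# Assumed order of sizes, smallest to largest (mirrors size_mapping)
size_order = ["S", "M", "X", "XL"]

data = pd.DataFrame(
    [["green", "S", 100], ["blue", "M", 110], ["red", "X", 120], ["black", "XL", 130]],
    columns=["color", "size", "price"],
)

# An ordered Categorical stores each value as a 0-based integer code
size_cat = pd.Categorical(data["size"], categories=size_order, ordered=True)

# Shift by 1 so the codes match the 1-based dictionary mapping
data["size_code"] = size_cat.codes + 1
print(data)
```

Note that a value missing from `size_order` receives the code -1 (0 after the shift), whereas `Series.map` with a dictionary produces NaN for a value missing from the dictionary. Either way, unmapped sizes should be checked for before training.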
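2. Encoding class labels
Many machine learning algorithms require class labels to be converted to integers. Encoding class labels differs from the ordinal feature mapping above: class labels have no inherent order, so it does not matter which integer is assigned to a particular string label. We can therefore simply enumerate the labels, starting from 0.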
import pandas as pd
import numpy as np
if __name__ == "__main__":
    # Build a DataFrame
    data = pd.DataFrame([
        ["green", "S", 100, "label1"],
        ["blue", "M", 110, "label2"],
        ["red", "X", 120, "label3"],
        ["black", "XL", 130, "label4"]
    ])
    # Set the column names
    data.columns = ["color", "size", "price", "label"]
    # Enumerate the unique labels to get the label-to-integer mapping
    label_mapping = {label: idx for idx, label in enumerate(np.unique(data["label"]))}
    print(label_mapping)
    # Map the label column to integers
    data["label"] = data["label"].map(label_mapping)
    print(data)
The integer class labels can be mapped back to the original strings by inverting the dictionary:
inv_label_mapping = {v:k for k,v in label_mapping.items()}
data["label"] = data["label"].map(inv_label_mapping)
print(data)
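The enumerate-based mapping and its inverse can also be obtained in a single call. The sketch below uses `pandas.factorize` with `sort=True`, which returns the integer codes together with the unique labels; the variable names and the sample labels are illustrative only.

```python
import pandas as pd

labels = pd.Series(["label1", "label2", "label3", "label4", "label2"])

# codes: one integer per row; uniques: the label behind each integer
# (sort=True orders the uniques, so the codes do not depend on row order)
codes, uniques = pd.factorize(labels, sort=True)
print(codes)
print(uniques)

# Restore the original strings by indexing the uniques with the codes
restored = uniques[codes]
print(list(restored))
```

Because `uniques` already holds the label for each integer, no separate inverse dictionary is needed.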
Class labels can also be encoded with sklearn's LabelEncoder class:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
if __name__ == "__main__":
    # Build a DataFrame
    data = pd.DataFrame([
        ["green", "S", 100, "label1"],
        ["blue", "M", 110, "label2"],
        ["red", "X", 120, "label3"],
        ["black", "XL", 130, "label4"]
    ])
    # Set the column names
    data.columns = ["color", "size", "price", "label"]
    # Fit the encoder on the label column and replace it with integer codes
    class_label = LabelEncoder()
    data["label"] = class_label.fit_transform(data["label"].values)
    print(data)
LabelEncoder's inverse_transform method restores the integer class labels to the original strings:
data["label"] = class_label.inverse_transform(data["label"])
print(data)
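To see what a fitted encoder has learned, its `classes_` attribute can be inspected. The short sketch below also shows that a label not present during fitting is rejected; the label `"label5"` and the flag `unseen_failed` are hypothetical names introduced for the example.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["label1", "label2", "label3", "label4"])

# classes_ lists the known labels; a label's position is its integer code
print(encoder.classes_)
print(encoded)

# A label that was not seen during fit cannot be encoded
try:
    encoder.transform(["label5"])
    unseen_failed = False
except ValueError as err:
    unseen_failed = True
    print("unseen label:", err)
```

This matters when the same encoder is reused on new data: it should be fitted on every label that can occur, or unseen labels should be handled before calling `transform`.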
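3. One-hot encoding for nominal features
Suppose we encode the color feature of the clothing data above by mapping the colors to {"green":0,"blue":1,"red":2,"black":3}. This mapping looks harmless, but it is not: it imposes an ordering on the colors, namely black > red > blue > green, which colors do not have. A learning algorithm may treat that ordering as meaningful, so the results are unlikely to be optimal. One-hot encoding solves this kind of problem. It creates a new dummy feature for each distinct value of the nominal feature. For example, color takes four values, green, blue, red, and black, so one-hot encoding represents each color with four binary digits. With the column order [green, blue, red, black], green is represented as [1,0,0,0]: the position for the sample's color is 1, and all others are 0.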
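The representation described above can be sketched by hand with NumPy before turning to a library. The helper `one_hot` and the dictionary `color_to_index` are hypothetical names introduced only for this illustration.

```python
import numpy as np

# Column order of the one-hot vector, as in the text
colors = ["green", "blue", "red", "black"]
color_to_index = {color: idx for idx, color in enumerate(colors)}

def one_hot(color):
    """Return a vector with a 1 in the position of `color` and 0 elsewhere."""
    vector = np.zeros(len(colors), dtype=int)
    vector[color_to_index[color]] = 1
    return vector

print(one_hot("green"))  # [1 0 0 0]
print(one_hot("red"))    # [0 0 1 0]
```

Every vector has the same length and exactly one nonzero entry, so no color is numerically larger than another.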
One-hot encoding with sklearn's OneHotEncoder
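import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
if __name__ == "__main__":
    # Build a DataFrame
    data = pd.DataFrame([
        ["green", "S", 100, "label1"],
        ["blue", "M", 110, "label2"],
        ["red", "X", 120, "label3"],
        ["black", "XL", 130, "label4"]
    ])
    # Set the column names
    data.columns = ["color", "size", "price", "label"]
    X = data[["color", "price"]].values
    # Convert the colors to integers with a label encoder
    color_label = LabelEncoder()
    X[:, 0] = color_label.fit_transform(X[:, 0])
    # One-hot encode the color column (index 0) and pass price through unchanged.
    # The categorical_features argument of OneHotEncoder was deprecated in
    # scikit-learn 0.20 and later removed, so ColumnTransformer selects the column.
    one_hot = ColumnTransformer(
        [("color", OneHotEncoder(), [0])],
        remainder="passthrough",
        sparse_threshold=0,
    )
    print(one_hot.fit_transform(X))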
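Note: in older versions of sklearn (before 0.20), OneHotEncoder required strings to be converted to integers before encoding and raised an error otherwise. Newer versions accept string categories directly.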
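For comparison, here is a minimal sketch of encoding the string values directly, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string categories. The array `colors` is a stand-in for the color column of the example data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single string-valued column, shaped (n_samples, 1)
colors = np.array([["green"], ["blue"], ["red"], ["black"]])

encoder = OneHotEncoder()
# The default output is a sparse matrix; toarray() makes it dense for printing
encoded = encoder.fit_transform(colors).toarray()

# categories_ holds the sorted categories, one array per input column
print(encoder.categories_[0])
print(encoded)
```

The output columns follow the sorted categories (black, blue, green, red), not the order in which the colors first appear in the data.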
One-hot encoding with pandas
import pandas as pd
if __name__ == "__main__":
    # Build a DataFrame
    data = pd.DataFrame([
        ["green", "S", 100, "label1"],
        ["blue", "M", 110, "label2"],
        ["red", "X", 120, "label3"],
        ["black", "XL", 130, "label4"]
    ])
    # Set the column names
    data.columns = ["color", "size", "price", "label"]
    # By default get_dummies converts only string (object or category dtype)
    # columns; other columns are left unchanged
    print(pd.get_dummies(data[["color", "price"]]))
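Two optional arguments of `get_dummies` are often useful here; the sketch below is an illustration and not part of the original example. `columns` names the features to encode explicitly, `dtype=int` requests 0/1 values, and `drop_first=True` omits one dummy column, since the dropped category is implied when all the remaining columns are 0.

```python
import pandas as pd

data = pd.DataFrame(
    [["green", 100], ["blue", 110], ["red", 120], ["black", 130]],
    columns=["color", "price"],
)

# Encode only the named column and emit integers instead of booleans
full = pd.get_dummies(data, columns=["color"], dtype=int)

# drop_first=True omits the first category (here "black", first alphabetically)
reduced = pd.get_dummies(data, columns=["color"], drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column removes the exact linear dependence among the dummy columns, which can matter for models such as unregularized linear regression.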