Handling categorical labels
One-hot encoding with pandas
import pandas as pd

data = pd.Series(["java", "python", "c", "python", "html", "java", "linux"])
# dtype=int keeps the 0/1 output; recent pandas versions default to True/False
data1 = pd.get_dummies(data, dtype=int)
print(data1)
Output:
   c  html  java  linux  python
0  0     0     1      0       0
1  0     0     0      0       1
2  1     0     0      0       0
3  0     0     0      0       1
4  0     1     0      0       0
5  0     0     1      0       0
6  0     0     0      1       0
One-hot encoding with scikit-learn
import pandas as pd
from sklearn.preprocessing import label_binarize

data = pd.DataFrame([["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]], columns=["name"])
# Unique labels; a set has no guaranteed order, so the column order may vary
classes = list(set(data["name"].values.tolist()))
# One indicator column per entry in classes
data2 = label_binarize(data["name"], classes=classes)
data = pd.DataFrame(data2, columns=classes)
print(data)
Output (the column order follows classes and may differ between runs):
   java  linux  html  c  python
0     1      0     0  0       0
1     0      0     0  0       1
2     0      0     0  1       0
3     0      0     0  0       1
4     0      0     1  0       0
5     1      0     0  0       0
6     0      1     0  0       0
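Because the class list above is built from a set, the column order is not fixed. One way to make it reproducible is to sort the unique labels before passing them to label_binarize. The following is only a sketch of that idea; the variable names encoded and result are illustrative and do not come from the original snippet.

```python
import pandas as pd
from sklearn.preprocessing import label_binarize

data = pd.DataFrame(
    [["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]],
    columns=["name"],
)

# Sorting the unique labels gives the same column order on every run.
classes = sorted(set(data["name"]))

# label_binarize produces one indicator column per entry in classes.
encoded = label_binarize(data["name"], classes=classes)
result = pd.DataFrame(encoded, columns=classes)
print(result)
```

With sorted classes, the columns come out in alphabetical order, which matches the column order pd.get_dummies produces for this data.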
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame([["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]], columns=["name"])
# Step 1: map each label to an integer code
l = LabelEncoder()
d = l.fit_transform(data["name"])
# Step 2: one-hot encode the integer codes (OneHotEncoder expects 2-D input)
o = OneHotEncoder()
# toarray() turns the sparse result into a dense ndarray
data3 = pd.DataFrame((o.fit_transform(d.reshape(-1, 1))).toarray(), columns=l.classes_)
print(data3)
Output:
     c  html  java  linux  python
0  0.0   0.0   1.0    0.0     0.0
1  0.0   0.0   0.0    0.0     1.0
2  1.0   0.0   0.0    0.0     0.0
3  0.0   0.0   0.0    0.0     1.0
4  0.0   1.0   0.0    0.0     0.0
5  0.0   0.0   1.0    0.0     0.0
6  0.0   0.0   0.0    1.0     0.0
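The LabelEncoder step may not be needed. Assuming scikit-learn 0.20 or later, OneHotEncoder accepts string categories directly. The sketch below shows that variant; the names encoder, sparse_matrix, columns, and data4 are illustrative and not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame(
    [["java"], ["python"], ["c"], ["python"], ["html"], ["java"], ["linux"]],
    columns=["name"],
)

encoder = OneHotEncoder()

# fit_transform expects 2-D input, so pass a one-column DataFrame
# (double brackets) rather than a Series.
sparse_matrix = encoder.fit_transform(data[["name"]])

# categories_ holds one array of category values per input column.
columns = encoder.categories_[0]

# Convert the sparse result to a dense array before building the DataFrame.
data4 = pd.DataFrame(sparse_matrix.toarray(), columns=columns)
print(data4)
```

The result should match the two-step version above, since both derive the columns from the sorted unique labels.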
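label_binarize returns a numpy.ndarray.
OneHotEncoder returns a scipy.sparse.csr.csr_matrix by default; call toarray() to convert it to a numpy.ndarray.
To deduplicate ndarray data, convert it to a list with tolist() and then apply list(set(list_data)); the order of the result is not guaranteed.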
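The notes above can be checked with a short script. This is a minimal sketch: the variable names are illustrative, and the dict.fromkeys step is an addition not found in the original notes, shown as one way to deduplicate while keeping first-seen order.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder, label_binarize

labels = ["java", "python", "c", "python", "html", "java", "linux"]

# label_binarize returns a dense NumPy array.
dense = label_binarize(labels, classes=sorted(set(labels)))
print(type(dense))

# With default settings, OneHotEncoder returns a SciPy sparse matrix.
encoded = OneHotEncoder().fit_transform(np.array(labels).reshape(-1, 1))
print(sparse.issparse(encoded))

# toarray() converts the sparse matrix to a dense NumPy array.
as_array = encoded.toarray()
print(type(as_array))

# tolist() converts an ndarray to a plain Python list.
as_list = np.array(labels).tolist()

# set() removes duplicates but does not keep the original order.
unique_unordered = list(set(as_list))

# dict.fromkeys() removes duplicates and keeps first-seen order.
unique_in_order = list(dict.fromkeys(as_list))
print(unique_in_order)
```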