Encoding
One-hot encoding
Tree-based algorithms generally do not require one-hot encoding.
1. pandas method
import pandas as pd

# Create the dataset
data = pd.DataFrame({'one': [1, 2, 3], 'two': [2, 3, 4], 'city': [3, 4, 5]})

# One-hot encode the city column
data = pd.get_dummies(data, columns=['city'])
data
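The same call also works when the column holds strings rather than numbers. The sketch below uses a hypothetical toy DataFrame (the `price` column and the city names are assumptions for illustration, not part of the notes above) to show which indicator columns `get_dummies` produces.

```python
import pandas as pd

# Hypothetical toy data: 'city' holds string categories.
df = pd.DataFrame({
    "price": [10, 20, 30],
    "city": ["Beijing", "Shanghai", "Beijing"],
})

# One indicator column is created per distinct city,
# and the original 'city' column is dropped.
encoded = pd.get_dummies(df, columns=["city"])

print(list(encoded.columns))
print(encoded)
```

Each new column is named `<original column>_<category>`, so the result stays readable when several columns are encoded at once.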
2. sklearn method
from sklearn import preprocessing

enc = preprocessing.OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  # fit learns the encoding
enc.transform([[0, 1, 3]]).toarray()  # apply the encoding
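To see what `fit` actually learned, it can help to inspect `categories_`. The sketch below reuses the same training rows and additionally sets `handle_unknown="ignore"`, which is an assumption added here for illustration: with it, a value not seen during `fit` is encoded as all zeros for that column instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# handle_unknown="ignore" is an illustrative choice, not part of the original snippet.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# One array of learned category values per input column.
cats = [c.tolist() for c in enc.categories_]
print(cats)

# A row made of values seen during fit: one 1 per input column.
row = enc.transform([[0, 1, 3]]).toarray()
print(row)

# The value 5 was never seen in the second column, so its block is all zeros.
unknown_row = enc.transform([[0, 5, 3]]).toarray()
print(unknown_row)
```

The output width is the total number of learned categories across all columns, here 2 + 3 + 4 = 9.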
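Label encoding
sklearn's decision tree algorithms cannot work with string features directly; the strings must first be encoded as numbers.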
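from sklearn import preprocessing

# X is assumed to be a pandas DataFrame with a string column 'column_a'
column_a = preprocessing.LabelEncoder()
column_a.fit(X['column_a'])
column_a.classes_  # categories learned during fit
X['column_a'] = column_a.transform(X['column_a'])  # transform categories into integers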
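A minimal end-to-end sketch of label encoding followed by a decision tree might look like the following. The DataFrame `X`, the `size` column, and the labels `y` are hypothetical toy data made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: 'column_a' is a string feature.
X = pd.DataFrame({
    "column_a": ["red", "green", "blue", "green"],
    "size": [1, 2, 3, 4],
})
y = [0, 1, 0, 1]

# Fit the encoder and replace the strings with integer codes in one step.
encoder = LabelEncoder()
X["column_a"] = encoder.fit_transform(X["column_a"])

classes = encoder.classes_.tolist()  # categories in sorted order
codes = X["column_a"].tolist()       # integer code assigned to each row

# With every feature numeric, the tree can now be trained.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# inverse_transform maps the integer codes back to the original strings.
original = encoder.inverse_transform(codes).tolist()
print(classes, codes, original)
```

The scikit-learn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` as the counterpart for feature columns, so the latter may be a better fit when several columns need encoding at once.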
Without a summary, you can't master it.