Hands-On Machine Learning with sklearn: KNN
KNN Classification
KNN is a lazy learning model, also known as an instance-based learning model.
Simple linear regression, by contrast, is an eager learning model: its training phase consumes computing resources, but prediction is cheap. For KNN it is the other way around: fit does very little, and the cost is paid at prediction time.
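A tiny sketch to make that concrete (the synthetic data below is an illustrative assumption, not from these notes): for a lazy learner like KNN, fit mostly just stores/indexes the data, while predict carries the cost of the neighbor search.
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
# made-up data, only to show where KNN spends its time
rng = np.random.default_rng(0)
X = rng.normal(size=(20000, 10))
y = (X[:, 0] > 0).astype(int)
clf = KNeighborsClassifier(n_neighbors=3)
t0 = time.perf_counter()
clf.fit(X, y)          # "training": store the data (and build an index)
t1 = time.perf_counter()
clf.predict(X[:2000])  # the real work (neighbor search) happens here
t2 = time.perf_counter()
print(f"fit: {t1 - t0:.3f}s  predict: {t2 - t1:.3f}s")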
The first step is to binarize the labels (for a multiclass task, consider one-hot encoding instead).
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','female','female','female','female','female'])
# print(y_train.shape)
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
# print(y_train_binarized.shape)
print(y_train_binarized)
[[1]
[1]
[1]
[1]
[0]
[0]
[0]
[0]
[0]]
Worth noting here: sklearn.preprocessing is the data-preprocessing module, and LabelBinarizer is used to binarize the labels. Note that the data below is the same as above, except the last value has been changed to a misspelled 'femal', so three distinct labels now exist and LabelBinarizer falls back to a three-column one-hot encoding:
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','female','female','female','female','femal'])
# print(y_train.shape)
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
# print(y_train_binarized.shape)
print(y_train_binarized)
[[0 0 1]
[0 0 1]
[0 0 1]
[0 0 1]
[0 1 0]
[0 1 0]
[0 1 0]
[0 1 0]
[1 0 0]]
One point to clarify: for a scaler, fit computes the statistics used for feature scaling (min and max for MinMaxScaler, mean and standard deviation for StandardScaler), whereas in the earlier simple linear regression section, fit was a method of LinearRegression that estimated the model's coefficients. In general, fit is called only on the training set, and transform is then applied to both the training set and the test set.
You can think of it as: fit learns, transform applies what was learned.
Typical usage:
from sklearn.preprocessing import StandardScaler
scaler_ss = StandardScaler()
# training set: fit the scaler and transform in one step
new_train_x = scaler_ss.fit_transform(train_x)
# test set: only transform, reusing the statistics learned from the training set
new_test_x = scaler_ss.transform(test_x)
If you call fit_transform on both the training set and the test set, the statistics computed from the test set may differ from those of the training set, so the two sets end up scaled inconsistently.
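A quick sketch of that inconsistency (the arrays are made-up values for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler
train_x = np.array([[1.0], [2.0], [3.0], [4.0]])  # mean 2.5, std ~1.118
test_x = np.array([[3.0], [4.0]])                 # mean 3.5, std 0.5
scaler = StandardScaler().fit(train_x)
print(scaler.transform(test_x).ravel())                 # scaled with the training stats: [0.447 1.342]
print(StandardScaler().fit_transform(test_x).ravel())   # scaled with the test set's own stats: [-1.  1.]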
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','female','female','female','female','female'])
lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train,y_train_binarized.reshape(-1))
prediction = clf.predict(np.array([155,70]).reshape(1,-1))
predict_label = lb.inverse_transform(prediction)
array(['female'], dtype='<U6')
For accuracy we generally use accuracy_score, which is easy to understand: it simply checks, position by position, whether each prediction matches the true label.
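A quick illustration with made-up labels (the same toy labels reused in the recall/precision examples below):
from sklearn.metrics import accuracy_score
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
print(accuracy_score(x_true,x_predict))  # 6 of 8 positions match -> 0.75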
So how should recall and precision be understood?
from sklearn.metrics import recall_score
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
print(recall_score(x_true,x_predict))
0.75
from sklearn.metrics import precision_score
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
print(precision_score(x_true,x_predict))
1.0
Both metrics count the positions where the true label and the prediction are both 1, i.e. the true positives (6 of them here), then divide by one side's total number of 1s. Precision divides by the count of predicted 1s (6, giving 1.0), while recall divides by the count of true 1s (8, giving 0.75). That makes intuitive sense: recall asks "how many of the actual positives did I cover", so it only cares about the true side's total.
This matches the precision and recall formulas in the "watermelon book" (Zhou Zhihua's Machine Learning):
$P = \frac{TP}{TP+FP}$
$R = \frac{TP}{TP+FN}$
Here TP (True Positive) is the number of positions where both sides are 1; FP (False Positive) counts samples predicted positive that are actually negative; FN (False Negative) counts actual positives that were predicted negative.
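These four counts can be read directly off a confusion matrix; a small check with the same toy labels:
from sklearn.metrics import confusion_matrix
x_true = [1,1,1,1,1,1,1,1]
x_predict = [1,1,1,1,1,1,0,0]
# rows = true class, columns = predicted class: [[TN, FP], [FN, TP]]
print(confusion_matrix(x_true,x_predict,labels=[0,1]))
# [[0 0]
#  [2 6]]  -> P = 6/(6+0) = 1.0, R = 6/(6+2) = 0.75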
KNN Regression
Probably nobody would actually use KNN for regression, right? Still, here is how it works.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score
X_train = np.array([[158,1],[170,1],[183,1],[191,1],[155,0],[163,0],[180,0],[158,0],[170,0]])
y_train = [64,86,84,80,49,59,67,54,67]
X_test = np.array([[168,1],[180,1],[160,0],[169,0]])
y_test = [65,96,52,67]
clf = KNeighborsRegressor(n_neighbors=3)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("R2_score:")
print(r2_score(y_test,predictions))
print("MAE:")
print(mean_absolute_error(y_test,predictions))
print("MSE:")
print(mean_squared_error(y_test,predictions))
R2_score:
0.6290565226735438
MAE:
8.333333333333336
MSE:
95.8888888888889
r2_score doesn't feel very commonly used either; I looked at the formula but it didn't really stick.
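For the record, the standard definition (a textbook fact, not from these notes): R² compares the model's squared error to that of a baseline that always predicts the mean $\bar{y}$,
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$
so 1 means a perfect fit and 0 means no better than predicting the mean.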
StandardScaler will come up again later, so it isn't demonstrated further here.
This subsection mainly used LabelBinarizer for binary label conversion. If you accidentally feed it three distinct values, the code still runs, but it silently switches to a three-column one-hot encoding, which downstream code may not expect.
We have now been through the main workflow (a consolidated sketch follows the list):
- Choose a model, e.g. LinearRegression
- Possibly preprocess the data, e.g. with MinMaxScaler or another tool from sklearn.preprocessing
- Instantiate the model (or preprocessor), call fit, then possibly transform
- Then typically call predict
- Finally pick a suitable metric to judge the fit, e.g. accuracy_score for classification
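A minimal end-to-end sketch of that workflow, reusing this section's height/weight data (the held-out test set here is my own assumption, invented purely for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# the section's height/weight training data
X_train = np.array([[158,64],[170,86],[183,84],[191,80],[155,49],[163,59],[180,67],[158,54],[170,67]])
y_train = np.array(['male','male','male','male','female','female','female','female','female'])
# hypothetical held-out test set
X_test = np.array([[168,65],[180,96],[160,52],[169,67]])
y_test = np.array(['female','male','female','female'])
lb = LabelBinarizer()                       # preprocess labels
y_train_bin = lb.fit_transform(y_train).ravel()
scaler = StandardScaler()                   # preprocess features: fit on train, transform both
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
clf = KNeighborsClassifier(n_neighbors=3)   # instantiate and fit the model
clf.fit(X_train_s, y_train_bin)
y_pred = clf.predict(X_test_s)              # predict
print(accuracy_score(lb.transform(y_test).ravel(), y_pred))  # evaluate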