Notes on Machine Learning (Zhou Zhihua): Model Evaluation and Selection (8) -- Evaluation Methods, Classification Metrics, and Regression Metrics in Practice
2 Model Evaluation and Selection
2.1 Evaluation Methods
2.1.1 Training and Test Sets
Example 1: the Iris dataset
The Iris dataset is a classic benchmark. It contains 150 records in 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. From these 4 features we can predict which of the three species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)

array = data.values
X = array[:, :-1]
Y = array[:, -1]

# Hold out 30% of the records as the test set.
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LogisticRegression(max_iter=200)  # raise max_iter so the solver converges cleanly
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print('Accuracy on the test set: %.2f' % (result * 100))
```
2.1.2 K-Fold Cross-Validation
```python
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)

array = data.values
X = array[:, :-1]
Y = array[:, -1]

seed = 7
num_folds = 10
# shuffle=True is required when random_state is set, and it also matters here:
# the raw iris.csv records are sorted by species.
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression(max_iter=200)
result = cross_val_score(model, X, Y, cv=kfold)
print(result)
print('Mean cross-validation accuracy: %.2f' % (result.mean() * 100))
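Because the Iris records are sorted by species, a plain unshuffled K-fold can end up with folds dominated by a single class. A minimal sketch of the stratified variant, which keeps the 50/50/50 class ratio inside every fold (scikit-learn's bundled copy of Iris is assumed here so the sketch runs without iris.csv):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Bundled copy of the same 150 Iris records (assumed stand-in for iris.csv).
X, Y = load_iris(return_X_y=True)

# StratifiedKFold preserves the per-class proportions in each of the 10 folds.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, Y, cv=skf)
print('Stratified 10-fold accuracy: %.2f' % (scores.mean() * 100))
```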
2.1.3 Leave-One-Out Cross-Validation (LOOCV)
```python
import pandas as pd
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)

array = data.values
X = array[:, :-1]
Y = array[:, -1]

loocv = LeaveOneOut()  # one fold per record: 150 train/test rounds for Iris
model = LogisticRegression(max_iter=200)
result = cross_val_score(model, X, Y, cv=loocv)
print(result)
print('Mean leave-one-out accuracy: %.2f' % (result.mean() * 100))
```
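Leave-one-out is K-fold taken to the limit K = n: every fold's test set holds exactly one record. A small sketch on six hypothetical samples, showing that `LeaveOneOut` produces the same splits as `KFold(n_splits=n)`:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

# Six toy samples (hypothetical values, just to show the splits).
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)  # one fold per sample

# K-fold with K equal to the number of samples yields identical test folds.
kf = KFold(n_splits=len(X))
loo_tests = [tuple(test) for _, test in loo.split(X)]
kf_tests = [tuple(test) for _, test in kf.split(X)]
print(n_splits, loo_tests == kf_tests)
```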
2.2 Performance Metrics for Classification
Accuracy, confusion matrix, precision, recall, AUC, and F1 score.
Computing these metrics with sklearn.metrics:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
    f1_score, roc_auc_score, confusion_matrix

y_true = [1, 0, 0, 1]
y_predict = [1, 0, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8]  # continuous scores, used only for the AUC

print(accuracy_score(y_true, y_predict))
print(precision_score(y_true, y_predict))
print(recall_score(y_true, y_predict))
print(f1_score(y_true, y_predict))
print(roc_auc_score(y_true, y_score))  # AUC takes scores, not hard 0/1 labels
print(confusion_matrix(y_true, y_predict))
```
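The library calls can be checked against the definitions: tally the four cells of the binary confusion matrix by hand and derive each metric from them. A sketch on the same toy labels:

```python
import numpy as np

y_true = np.array([1, 0, 0, 1])
y_predict = np.array([1, 0, 1, 0])

# The four cells of the binary confusion matrix.
tp = np.sum((y_true == 1) & (y_predict == 1))  # hit
tn = np.sum((y_true == 0) & (y_predict == 0))  # correct rejection
fp = np.sum((y_true == 0) & (y_predict == 1))  # false alarm
fn = np.sum((y_true == 1) & (y_predict == 0))  # miss

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of all positive predictions, how many are right
recall = tp / (tp + fn)     # of all actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(accuracy, precision, recall, f1)
```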
Example 2: the Pima Indians Diabetes dataset
The goal is to diagnostically predict whether a patient has diabetes from the measurements in the dataset (768 records in total). Each record has 8 attributes:
1. Pregnancies: number of times pregnant
2. Glucose: plasma glucose concentration
3. BloodPressure: blood pressure (mm Hg)
4. SkinThickness: skin fold thickness (mm)
5. Insulin: 2-hour serum insulin (mu U/ml)
6. BMI: body mass index (weight / height^2)
7. DiabetesPedigreeFunction: diabetes pedigree function
8. Age: age (years)
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'pima_data.csv'
names = ['preg', 'plas', 'blood', 'skin', 'insulin', 'bmi', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)

array = data.values
X = array[:, :-1]
Y = array[:, -1]

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test

# Confusion matrix plus per-class precision/recall/F1 in one report.
print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
```
Output:
```
[[138  14]
 [ 30  49]]
             precision    recall  f1-score   support

        0.0       0.82      0.91      0.86       152
        1.0       0.78      0.62      0.69        79

avg / total       0.81      0.81      0.80       231
```
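The metric list above includes AUC, yet the example reports only the confusion matrix and the classification report. A hedged sketch of computing ROC AUC from predicted probabilities, using scikit-learn's bundled breast-cancer set as a stand-in binary task (assumed only so the sketch runs without pima_data.csv):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in binary classification data (assumption: not the Pima dataset).
X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=4)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, Y_train)

# roc_auc_score needs a continuous score, not hard 0/1 labels:
# use the predicted probability of the positive class.
y_score = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(Y_test, y_score)
print('AUC: %.2f' % auc)
```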
The same evaluation on the three-class Iris task:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)

array = data.values
X = array[:, :-1]
Y = array[:, -1]

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LogisticRegression(max_iter=200)
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test

print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
```
Output:
```
[[21  0  0]
 [ 0  8  2]
 [ 0  1 13]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.89      0.80      0.84        10
 Iris-virginica       0.87      0.93      0.90        14

    avg / total       0.93      0.93      0.93        45
```
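With more than two classes, precision, recall, and F1 need an averaging rule when reported as a single number: `average='macro'` gives every class equal weight, while `average='weighted'` weights each class by its support (the report's bottom row is the weighted average). A small sketch on hypothetical three-class labels:

```python
from sklearn.metrics import precision_score

# Hypothetical three-class labels, chosen so the two averages differ.
y_true = [0, 0, 1, 1, 1, 2]
y_predict = [0, 0, 1, 1, 2, 2]

# Per-class precision: class 0 -> 1.0, class 1 -> 1.0, class 2 -> 0.5.
p_macro = precision_score(y_true, y_predict, average='macro')        # mean of per-class values
p_weighted = precision_score(y_true, y_predict, average='weighted')  # weighted by support (2, 3, 1)
print(p_macro, p_weighted)
```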
2.3 Performance Metrics for Regression
Mean absolute error (MAE)
Mean squared error (MSE)
Coefficient of determination (R2)
```python
import numpy as np

def mse_score(y_predict, y_true):
    # mean squared error
    mse = np.mean((y_predict - y_true) ** 2)
    return mse

def rmse_score(y_predict, y_true):
    # root mean squared error
    rmse = np.sqrt(np.mean((y_predict - y_true) ** 2))
    return rmse

def mae_score(y_predict, y_true):
    # mean absolute error
    mae = np.mean(np.abs(y_predict - y_true))
    return mae

def r2_score(y_predict, y_true):
    # coefficient of determination: 1 - MSE / variance of the targets
    r2 = 1 - mse_score(y_predict, y_true) / np.var(y_true)
    return r2

y_true = np.array([0.7, 0.2, 1.8, 0.4, 1.4])
y_predict = np.array([0.7, -0.8, 3.8, 0.9, 2.9])
print(rmse_score(y_predict, y_true))
print(mse_score(y_predict, y_true))
print(mae_score(y_predict, y_true))
print(r2_score(y_predict, y_true))
```
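The hand-written functions can be cross-checked against sklearn's implementations on the same arrays; note that R2 can go negative when the model does worse than always predicting the mean of the targets, as it does here:

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0.7, 0.2, 1.8, 0.4, 1.4])
y_predict = np.array([0.7, -0.8, 3.8, 0.9, 2.9])

# Library equivalents of the manual formulas.
mse = metrics.mean_squared_error(y_true, y_predict)
mae = metrics.mean_absolute_error(y_true, y_predict)
r2 = metrics.r2_score(y_true, y_predict)
print(mse, mae, r2)  # R2 is negative: worse than predicting the mean
```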
Example 3: the Boston Housing dataset
The Boston Housing dataset contains 506 records; each record consists of 13 numeric features of a house together with the target house price.
```python
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

boston = load_boston()
X = boston['data']
Y = boston['target']

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LinearRegression()
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(Y_predict - Y_test)  # residuals on the test set

mae = mean_absolute_error(Y_test, Y_predict)
mse = mean_squared_error(Y_test, Y_predict)  # returns MSE, not RMSE
r2 = r2_score(Y_test, Y_predict)
print("MAE:%.2f" % mae)
print("MSE:%.2f" % mse)
print("R2:%.2f" % r2)
```
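The regression metrics also plug into K-fold cross-validation from section 2.1.2 via `cross_val_score` scoring strings; error metrics come back negated so that "higher is better" holds for every scorer. A sketch using scikit-learn's bundled diabetes regression set as a stand-in (assumed because `load_boston` is unavailable in scikit-learn 1.2 and later):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# Stand-in regression data (assumption: not the Boston dataset).
X, Y = load_diabetes(return_X_y=True)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LinearRegression()

# Error scorers are negated ('neg_...') so higher is always better.
neg_mse = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
r2 = cross_val_score(model, X, Y, cv=kfold, scoring='r2')
print('MSE: %.2f' % (-neg_mse.mean()))
print('R2: %.2f' % r2.mean())
```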