Notes on *Machine Learning* (Zhou Zhihua) -- Model Evaluation and Selection (8): Implementations of Evaluation Methods, Classification Metrics, and Regression Metrics

2 Model Evaluation and Selection

2.1 Evaluation Methods

2.1.1 Training Set and Test Set

Example 1: The Iris Dataset

The Iris dataset is a classic dataset. It contains 150 records in 3 classes, 50 per class. Each record has 4 features: sepal length, sepal width, petal length, and petal width. The goal is to predict, from these 4 features, which of the three species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Iris data from a local CSV file (no header row)
filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]   # the 4 feature columns
Y = array[:, -1]    # the species label

# Hold out 30% of the data as a test set
test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

# Fit a logistic regression model and report its accuracy on the held-out test set
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print('Accuracy on the test set: %.2f' % (result * 100))
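If the local iris.csv file is not available, the same dataset ships with scikit-learn and can be loaded directly. A minimal sketch with the same hold-out evaluation (variable names mirror the block above; the max_iter value is only an assumption to avoid convergence warnings):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# load_iris() returns the features as a (150, 4) array and the labels as integers 0/1/2
iris = load_iris()
X, Y = iris.data, iris.target

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=4)
model = LogisticRegression(max_iter=200)
model.fit(X_train, Y_train)
print('Accuracy on the test set: %.2f' % (model.score(X_test, Y_test) * 100))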


2.1.2 K-Fold Cross-Validation

import pandas as pd
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]

# 10-fold cross-validation; shuffle=True is required when a random_state is given
seed = 7
num_folds = 10
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()

# cross_val_score fits and scores the model once per fold and returns the 10 accuracies
result = cross_val_score(model, X, Y, cv=kfold)
print(result)
print('Mean cross-validation accuracy: %.2f' % (result.mean() * 100))
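To make the mechanics explicit, the same evaluation can be written by iterating over the folds by hand. A minimal sketch equivalent to the cross_val_score call above (it assumes the same iris.csv file):

import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('iris.csv', names=['sepal length', 'sepal width',
                                      'petal length', 'petal width', 'species'])
X = data.values[:, :-1]
Y = data.values[:, -1]

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = []
# split() yields index arrays; each fold trains on 9/10 of the data and tests on the rest
for train_idx, test_idx in kfold.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], Y[train_idx])
    scores.append(model.score(X[test_idx], Y[test_idx]))
print('Mean cross-validation accuracy: %.2f' % (np.mean(scores) * 100))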

2.1.3 Leave-One-Out Cross-Validation

import pandas as pd
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]

# Leave-one-out: each of the 150 samples is used once as a single-element test set,
# so no random seed is involved
loocv = LeaveOneOut()
model = LogisticRegression()
result = cross_val_score(model, X, Y, cv=loocv)
print(result)
print('Mean leave-one-out accuracy: %.2f' % (result.mean() * 100))
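Leave-one-out is simply k-fold cross-validation with k equal to the number of samples. A quick check on a toy array (the array is only for illustration):

import numpy as np
from sklearn.model_selection import LeaveOneOut, KFold

X_toy = np.arange(12).reshape(6, 2)   # 6 samples, 2 features

loocv = LeaveOneOut()
print(loocv.get_n_splits(X_toy))                       # 6 splits, one per sample
print(KFold(n_splits=len(X_toy)).get_n_splits(X_toy))  # the same number of splits

# Every test fold contains exactly one sample
for train_idx, test_idx in loocv.split(X_toy):
    print('train:', train_idx, 'test:', test_idx)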

 

2.2 Performance Metrics for Classification Tasks

Accuracy, confusion matrix, precision, recall, AUC, and F1-score.

Computing these metrics with sklearn.metrics on a small binary example:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, \
                            roc_auc_score, confusion_matrix

y_true =    [1, 0, 0, 1]
y_predict = [1, 0, 1, 0]            # hard class predictions
y_score   = [0.1, 0.4, 0.35, 0.8]   # predicted scores/probabilities, used only for AUC

print(accuracy_score(y_true, y_predict))
print(precision_score(y_true, y_predict))
print(recall_score(y_true, y_predict))
print(f1_score(y_true, y_predict))
print(roc_auc_score(y_true, y_score))   # AUC is computed from scores, not hard labels
print(confusion_matrix(y_true, y_predict))
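The definitions behind these calls can be checked by hand. With TP, FP, FN, TN read off the confusion matrix: accuracy = (TP + TN) / total, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is the harmonic mean of precision and recall. A minimal sketch for the toy example above:

# Confusion matrix for y_true = [1, 0, 0, 1], y_predict = [1, 0, 1, 0]
TP, FP = 1, 1   # one positive predicted correctly, one negative predicted as positive
FN, TN = 1, 1   # one positive missed, one negative predicted correctly

accuracy  = (TP + TN) / (TP + FP + FN + TN)                  # 0.5
precision = TP / (TP + FP)                                   # 0.5
recall    = TP / (TP + FN)                                   # 0.5
f1        = 2 * precision * recall / (precision + recall)    # 0.5
print(accuracy, precision, recall, f1)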

Example 2: Pima Indians Diabetes Dataset

The goal is to predict, from a set of diagnostic measurements, whether a patient has diabetes. The dataset contains 768 records with 8 attributes:

1. Pregnancies: number of pregnancies
2. Glucose: plasma glucose concentration
3. BloodPressure: blood pressure (mm Hg)
4. SkinThickness: skin fold thickness (mm)
5. Insulin: 2-hour serum insulin (mu U/ml)
6. BMI: body mass index (weight / height^2)
7. DiabetesPedigreeFunction: diabetes pedigree function
8. Age: age (years)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'pima_data.csv'
names = ['preg', 'plas', 'blood', 'skin', 'insulin', 'bmi', 'pedi', 'age', 'class']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LogisticRegression()
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test

# Confusion matrix plus the per-class precision/recall/F1 report
print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
Output:
[[138  14]
 [ 30  49]]
              precision    recall  f1-score   support

         0.0       0.82      0.91      0.86       152
         1.0       0.78      0.62      0.69        79

 avg / total       0.81      0.81      0.80       231
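As a sanity check, the class-1 row of the report follows directly from the confusion matrix: with the positive class in the second row and column, TP = 49, FP = 14, FN = 30, so precision = 49 / 63 ≈ 0.78 and recall = 49 / 79 ≈ 0.62, matching the report. A minimal sketch:

# Reading the binary confusion matrix: rows are true classes, columns are predictions
#            pred 0   pred 1
# true 0       138       14      (14 false positives for class 1)
# true 1        30       49      (30 false negatives, 49 true positives)
TP, FP, FN = 49, 14, 30
print('precision(1) = %.2f' % (TP / (TP + FP)))   # 0.78
print('recall(1)    = %.2f' % (TP / (TP + FN)))   # 0.62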
The same report for the multiclass Iris dataset:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

filename = 'iris.csv'
names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'species']
data = pd.read_csv(filename, names=names)
array = data.values
X = array[:, :-1]
Y = array[:, -1]

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LogisticRegression()
model.fit(X_train, Y_train)
y_predict = model.predict(X_test)
y_true = Y_test

# For a 3-class problem the confusion matrix is 3x3 and the report has one row per class
print(confusion_matrix(y_true, y_predict))
print(classification_report(y_true, y_predict))
Output:
[[21  0  0]
 [ 0  8  2]
 [ 0  1 13]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.89      0.80      0.84        10
 Iris-virginica       0.87      0.93      0.90        14

    avg / total       0.93      0.93      0.93        45
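For a multiclass problem, the single-number metric functions need an averaging strategy: precision_score, recall_score and f1_score accept average='macro' (unweighted mean over classes), 'weighted' (support-weighted mean), or 'micro'. A minimal sketch, reusing y_true and y_predict from the block above (the default average='binary' would raise an error here):

from sklearn.metrics import precision_score, recall_score, f1_score

# Pick an averaging strategy explicitly for the 3-class labels
print(precision_score(y_true, y_predict, average='macro'))
print(recall_score(y_true, y_predict, average='weighted'))
print(f1_score(y_true, y_predict, average='micro'))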

 

2.3 Performance Metrics for Regression Tasks

Mean absolute error (MAE)

Mean squared error (MSE) and its square root (RMSE)

Coefficient of determination (R^2)

import numpy as np

def mse_score(y_predict, y_true):
    # Mean squared error
    mse = np.mean((y_predict - y_true) ** 2)
    return mse

def rmse_score(y_predict, y_true):
    # Root mean squared error: the square root of the MSE
    rmse = np.sqrt(np.mean((y_predict - y_true) ** 2))
    return rmse

def mae_score(y_predict, y_true):
    # Mean absolute error
    mae = np.mean(np.abs(y_predict - y_true))
    return mae

def r2_score(y_predict, y_true):
    # Coefficient of determination: 1 - MSE / variance of the true values
    r2 = 1 - mse_score(y_predict, y_true) / np.var(y_true)
    return r2

y_true=np.array([0.7,0.2,1.8,0.4,1.4])
y_predict=np.array([0.7,-0.8,3.8,0.9,2.9])
print(rmse_score(y_predict,y_true))
print(mse_score(y_predict,y_true))
print(mae_score(y_predict,y_true))
print(r2_score(y_predict,y_true))
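The hand-rolled functions should agree with the corresponding functions in sklearn.metrics (np.var uses the same 1/n normalisation as the denominator in scikit-learn's r2_score, so the R^2 values match). A minimal cross-check on the same arrays:

import numpy as np
from sklearn import metrics

y_true = np.array([0.7, 0.2, 1.8, 0.4, 1.4])
y_predict = np.array([0.7, -0.8, 3.8, 0.9, 2.9])

print(metrics.mean_absolute_error(y_true, y_predict))   # same as mae_score
print(metrics.mean_squared_error(y_true, y_predict))    # same as mse_score
print(metrics.r2_score(y_true, y_predict))              # same as the custom r2_score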

Example 3: Boston Housing Dataset

The Boston housing dataset contains 506 records; each record consists of 13 numerical features of a house together with the target house price.

import numpy as np
from sklearn.datasets import load_boston   # removed in scikit-learn 1.2; requires an older version
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

boston = load_boston()
X = boston['data']
Y = boston['target']

test_size = 0.3
seed = 4
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

model = LinearRegression()
model.fit(X_train, Y_train)
Y_predict = model.predict(X_test)
print(Y_predict - Y_test)   # residuals on the test set

mae = mean_absolute_error(Y_test, Y_predict)
rmse = np.sqrt(mean_squared_error(Y_test, Y_predict))   # mean_squared_error returns the MSE
r2 = r2_score(Y_test, Y_predict)

print("MAE:%.2f" % mae)
print("RMSE:%.2f" % rmse)
print("R2:%.2f" % r2)

 

 

 
