Cross-Validation
Why cross-validation
- Once a model is built, the first task is to evaluate how good it is, and cross-validation plays a crucial role in that evaluation.
- Cross-validation randomly splits the dataset into a training set and a test set, which gives an effective estimate of a model's ability to generalize.
How to cross-validate
- Import sklearn.model_selection.train_test_split
- X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42), where X holds all the features of the dataset and y holds all the labels; a minimal sketch follows this list.
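A minimal sketch of the split on toy data (the arrays below are just illustrative assumptions, not part of the original notes):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus 10 labels
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
# 70% of the rows go to the training set, 30% to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)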
Applying cross-validation
- Cross-validation applied to a decision tree (classification)
A classification model is evaluated on how well the predicted values match the actual values, which I understand as accuracy: for example, if I predict labels for 10 rows of features and 8 of the 10 predicted labels match the actual labels, the accuracy is 0.8 (see the quick check after the code below).
Using the Iris dataset, we can construct a tree as follows:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Hold out 30% of the data as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit a decision tree classifier on the training set
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
# Predict on the test set and report accuracy
pred = clf.predict(X_test)
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred)))
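To connect this back to the "match rate" idea above, accuracy_score is simply the fraction of predicted labels that equal the true labels; a quick manual check (assuming the pred and y_test arrays from the code above) is:

import numpy as np
# Fraction of test samples whose predicted label equals the true label
manual_accuracy = np.mean(pred == y_test)
print('{:.2%}'.format(manual_accuracy))   # same value as metrics.accuracy_score(y_test, pred)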
- Cross-validation applied to linear regression (regression)
A regression model is evaluated on the error between the predicted and actual values, for example the mean squared error, together with the R² score, which measures how much of the variance in the target the model explains.
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target, test_size=0.3, random_state=42)
# Create linear regression object
reg = linear_model.LinearRegression()
# Train the model using the training set
reg.fit(X_train, y_train)
# Make predictions using the testing set
pred = reg.predict(X_test)
print('Coefficients: \n', reg.coef_)
print("Mean squared error: %.2f" % mean_squared_error(y_test, pred))
print('Variance score: %.2f' % r2_score(y_test, pred))
- k-fold cv
Suppose I randomly split the dataset into 10 equal folds. In the first round, fold 1 is the test set and the other folds are the training set, which yields a score; in the second round, fold 2 is the test set and the rest are the training set, yielding another score; and so on, for 10 rounds of training in total.
This way all of the data ends up being used for training while the generalization requirement is still satisfied; k-fold cv is implemented on exactly this idea (a sketch follows this item).
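A minimal sketch of this loop using sklearn's KFold and cross_val_score on the same Iris decision tree (10 folds chosen here only to mirror the description above):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn import tree

X, y = load_iris(return_X_y=True)
clf = tree.DecisionTreeClassifier()
# shuffle=True randomly partitions the data into 10 equal folds
kf = KFold(n_splits=10, shuffle=True, random_state=42)
# Each round holds out one fold as the test set and trains on the other nine
scores = cross_val_score(clf, X, y, cv=kf)
print(scores)          # one accuracy score per fold
print(scores.mean())   # average estimate of generalization performance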
- GridSearchCV
sklearn.model_selection.GridSearchCV runs this kind of k-fold evaluation over a grid of candidate hyperparameter values and keeps the combination with the best cross-validated score.
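A minimal GridSearchCV sketch for the decision tree above (the parameter grid here is only an illustrative assumption, not from the original notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn import tree

X, y = load_iris(return_X_y=True)
# Candidate hyperparameter values to try (illustrative choices)
param_grid = {'max_depth': [2, 3, 4, 5], 'criterion': ['gini', 'entropy']}
# For every combination, run 10-fold cross-validation and keep the best one
search = GridSearchCV(tree.DecisionTreeClassifier(random_state=42), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_)   # best hyperparameter combination found
print(search.best_score_)    # its mean cross-validated accuracy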