Grid Search and Cross-Validation
I. Grid Search
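sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
Note: this is the signature from an older scikit-learn release. The fit_params and iid arguments have since been removed from the constructor, so passing either of them to a current version raises a TypeError.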
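2. Common methods and attributes
- grid.fit(): runs the grid search
- best_params_: the parameter combination that achieved the best result
- best_score_: the best mean cross-validated score observed during the search
- feature_importances_: the importance score of each feature. This attribute belongs to the fitted model, not to GridSearchCV itself; for tree-based models, read it from grid.best_estimator_.feature_importances_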
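The methods and attributes above can be sketched end to end as follows. This is a minimal illustration, not the original author's code: the synthetic dataset from make_classification, the tiny parameter grid, and the variable names are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration (not from the original text).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=0)

grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30], "max_depth": [2, 4]},  # hypothetical grid
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)  # run the search; refit=True retrains the best model on all of X, y

print(grid.best_params_)  # dict with the winning value for each searched parameter
print(grid.best_score_)   # mean cross-validated ROC AUC of that combination

# feature_importances_ lives on the refitted forest, not on the search object.
importances = grid.best_estimator_.feature_importances_
print(importances)
```

Because refit=True is the default, best_estimator_ is already trained on the full data passed to fit() and can be used for prediction directly.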
3. Usage example (using RandomForestClassifier; other classification models can be tuned the same way)
- 1. First, search for the best n_estimators for the random forest
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    param_test1 = {'n_estimators': [50, 120, 160, 200, 250]}
    gsearch1 = GridSearchCV(
        estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20,
                                         max_depth=8, max_features='sqrt', random_state=10),
        param_grid=param_test1, scoring='roc_auc', cv=5)
    gsearch1.fit(x_train, y_train)
    print(gsearch1.best_params_, gsearch1.best_score_)  # best n_estimators and its score
- 2. Next, search for the best maximum tree depth, max_depth
    # 'min_samples_split': [100, 120, 150, 180, 200, 300] could be added to this grid as well
    param_test2 = {'max_depth': [1, 2, 3, 5, 7, 9, 11, 13]}
    gsearch2 = GridSearchCV(
        estimator=RandomForestClassifier(n_estimators=50, min_samples_split=100, min_samples_leaf=20,
                                         max_features='sqrt', oob_score=True, random_state=10),
        param_grid=param_test2, scoring='roc_auc', cv=5)  # iid=False dropped: removed in scikit-learn 0.24
    gsearch2.fit(x_train, y_train)
    print(gsearch2.best_params_, gsearch2.best_score_)  # best max_depth and its score
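- 3. For a random forest classifier, check the model's out-of-bag (OOB) score at this point
    rf1 = RandomForestClassifier(n_estimators=50, max_depth=2, min_samples_split=100, min_samples_leaf=20,
                                 max_features='sqrt', oob_score=True, random_state=10)
    rf1.fit(x_train, y_train)
    print(rf1.oob_score_)  # print the OOB score
    # Suppose the output is 0.984, versus 0.972 with the default parameters.
    # The OOB score is higher than with the defaults, which suggests the model generalizes better.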
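- 4. Keep tuning the remaining parameters one at a time in the same way to arrive at the final parameter combination. Note that tuning one parameter at a time yields a good combination, not necessarily the best one over the full grid.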
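As an alternative to tuning one parameter per step, several parameters can be placed in the same param_grid so that every combination is evaluated. The sketch below is illustrative only: the synthetic data stands in for x_train and y_train, and the grid values are hypothetical and kept small so the search finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for x_train / y_train (an assumption for illustration).
x_train, y_train = make_classification(n_samples=400, n_features=10, random_state=10)

# Small hypothetical grid: every combination of the two lists is evaluated.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, 5, 7]}

gsearch = GridSearchCV(
    estimator=RandomForestClassifier(max_features="sqrt", random_state=10),
    param_grid=param_grid,
    scoring="roc_auc",
    cv=5,
)
gsearch.fit(x_train, y_train)

n_candidates = len(gsearch.cv_results_["params"])  # 2 * 3 parameter combinations
print(n_candidates, gsearch.best_params_, gsearch.best_score_)
```

The trade-off is cost: a joint grid fits one model per combination per fold, so the step-by-step approach above is cheaper when the grid is large, while the joint search can catch interactions between parameters.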
II. Cross-Validation
- Example
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    k_range = [1, 5, 9, 15]
    cv_scores = []
    for k in k_range:
        knn = KNeighborsClassifier(n_neighbors=k)
        scores = cross_val_score(knn, X_train, y_train, cv=5)  # accuracy on each of the 5 folds
        cv_score = np.mean(scores)
        print('k={}, cross-validation accuracy={:.3f}'.format(k, cv_score))
        cv_scores.append(cv_score)
    # k=1, cross-validation accuracy=0.947
    # k=5, cross-validation accuracy=0.955
    # k=9, cross-validation accuracy=0.964
    # k=15, cross-validation accuracy=0.964

    # Take the k with the best mean cross-validation score, refit on the full training set, then score on the test set
    best_k = k_range[np.argmax(cv_scores)]
    best_knn = KNeighborsClassifier(n_neighbors=best_k)
    best_knn.fit(X_train, y_train)
    print('Test accuracy:', best_knn.score(X_test, y_test))
    # Test accuracy: 0.9736842105263158
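The manual loop above (score each k with 5-fold cross-validation, pick the best, refit, evaluate on the test set) is roughly what GridSearchCV automates. The following is a sketch under stated assumptions: the original does not say which dataset it used, so the iris data and the train/test split here are stand-ins, and the printed numbers will not necessarily match the ones quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is an assumed stand-in dataset; the original does not name its data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 9, 15]},
    cv=5,  # same 5-fold setup as the manual loop; default scoring for classifiers is accuracy
)
search.fit(X_train, y_train)  # refit=True retrains the best k on all of X_train

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_test, y_test)  # scored with the refitted best model
print(best_k, search.best_score_, test_accuracy)
```

The per-candidate mean scores that the loop collected in cv_scores are available here as search.cv_results_["mean_test_score"].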