Model Ensembling --- A Summary of Stacking and Parameter Tuning
1. Regression
We trained two regressors, a GBDT and an XGBoost model, and use them as the base learners for stacking.
We reuse the regressors whose hyperparameters were tuned earlier:
```python
from sklearn.ensemble import GradientBoostingRegressor
from xgboost import XGBRegressor

gbdt_nxf = GradientBoostingRegressor(learning_rate=0.06, n_estimators=250, min_samples_split=700,
                                     min_samples_leaf=70, max_depth=6, max_features='sqrt',
                                     subsample=0.8, random_state=75)
xgb_nxf = XGBRegressor(learning_rate=0.06, max_depth=6, n_estimators=200, random_state=75)
```
First, pre-allocate the matrices that will hold the stacking features:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import r2_score

# Note: StratifiedKFold requires a discrete target; this assumes the regression
# label takes a small set of values. For a continuous target, use KFold instead.
kf = StratifiedKFold(n_splits=5, random_state=75, shuffle=True)

train_proba = np.zeros((len(gbdt_train_data), 2))
train_proba = pd.DataFrame(train_proba)
train_proba.columns = ['gbdt_nxf', 'xgb_nxf']

test_proba = np.zeros((len(gbdt_test_data), 2))
test_proba = pd.DataFrame(test_proba)
test_proba.columns = ['gbdt_nxf', 'xgb_nxf']
```
```python
reg_names = ['gbdt_nxf', 'xgb_nxf']

for i, reg in enumerate([gbdt_nxf, xgb_nxf]):
    pred_list = []
    col = reg_names[i]
    for train_index, val_index in kf.split(gbdt_train_data, gbdt_train_label):
        x_train = gbdt_train_data.loc[train_index, :].values
        y_train = gbdt_train_label[train_index]
        x_val = gbdt_train_data.loc[val_index, :].values
        y_val = gbdt_train_label[val_index]

        reg.fit(x_train, y_train)
        y_vali = reg.predict(x_val)
        train_proba.loc[val_index, col] = y_vali  # out-of-fold predictions become meta-features
        print('%s cv r2 %s' % (col, r2_score(y_val, y_vali)))

        y_testi = reg.predict(gbdt_test_data.values)
        pred_list.append(y_testi)
    test_proba.loc[:, col] = np.mean(np.array(pred_list), axis=0)  # average the 5 test-set predictions
```
The best r2 value is 0.79753, which is still not particularly good.
The scheme is 5-fold cross-validation: in each fold, the model trained on the other four folds predicts the entire test set, and the five resulting test-set predictions are averaged to form the new test feature. The new training feature is the collection of out-of-fold predictions, i.e., each fold is predicted by the model trained on the remaining four folds.
Because there are two base regressors, GBDT and XGBoost, train_proba ends up with two columns.
Finally, we fit a meta-model on the new training set and predict on the new test set to obtain the stacked result:
```python
# Use logistic regression as the stacking meta-model.
# Caveat: LogisticRegression (like StratifiedKFold above) expects a discrete target,
# so this only runs if the regression label takes integer values.
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
scaler.fit(train_proba)
train_proba = scaler.transform(train_proba)
test_proba = scaler.transform(test_proba)

lr = LogisticRegression(tol=0.0001, C=0.5, random_state=24, max_iter=10)
kf = StratifiedKFold(n_splits=5, random_state=75, shuffle=True)

r2_list = []
pred_list = []
for train_index, val_index in kf.split(train_proba, gbdt_train_label):
    # the labels are still the original training labels
    x_train = train_proba[train_index]
    y_train = gbdt_train_label[train_index]
    x_val = train_proba[val_index]
    y_val = gbdt_train_label[val_index]

    lr.fit(x_train, y_train)
    y_vali = lr.predict(x_val)
    print('lr stacking cv r2 %s' % (r2_score(y_val, y_vali)))
    r2_list.append(r2_score(y_val, y_vali))

    y_testi = lr.predict(test_proba)
    pred_list.append(y_testi)

print(lr.coef_, lr.n_iter_)  # overfitting is severe
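For a genuinely continuous target, a linear regressor is the more natural meta-model. Below is a minimal sketch using Ridge in place of LogisticRegression; Ridge, its alpha value, and the final_pred variable are illustrative choices rather than part of the original post, and the sketch assumes the train_proba, test_proba, and gbdt_train_label defined above:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

ridge = Ridge(alpha=1.0)  # alpha is an illustrative value, not a tuned one
kf = KFold(n_splits=5, random_state=75, shuffle=True)

pred_list = []
for train_index, val_index in kf.split(train_proba):
    ridge.fit(train_proba[train_index], gbdt_train_label[train_index])
    y_vali = ridge.predict(train_proba[val_index])
    print('ridge stacking cv r2 %s' % r2_score(gbdt_train_label[val_index], y_vali))
    pred_list.append(ridge.predict(test_proba))

# average the five fold models' test-set predictions, as in the base-model stage
final_pred = np.mean(np.array(pred_list), axis=0)
```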
2. Classification
After tuning each single model, we can combine the models into a stacking ensemble.
(The stacking diagram from the original post is not included.) We split the training data evenly into five parts for cross-training: train on four parts, then use the trained model to predict the held-out part as well as the test set. After the five CV rounds, every training sample has an out-of-fold prediction, and the test set has five predictions, which we average. With k base models we obtain k such groups of predictions. Finally, a second-level model is trained on these predictions to produce the stacking result; a linear model is the usual choice for this stage.
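For reference, scikit-learn (0.22+) ships a built-in implementation of this scheme. A minimal sketch, assuming binary-classification arrays X, y and X_test (the two base models here are illustrative):

```python
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Out-of-fold predictions from the base models (cv=5) train the meta-model,
# mirroring the manual procedure described above.
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=24)),
                ('gbc', GradientBoostingClassifier(random_state=10))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X, y)
test_pred = stack.predict_proba(X_test)[:, 1]
```

One difference from the manual recipe: StackingClassifier refits each base model on the full training set before predicting the test set, rather than averaging the five fold models.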
Stacking is somewhat like a neural network: the base models act like the lower layers, extracting features from the input data.
First, we define DataFrames to store the intermediate predictions:
```python
train_proba = np.zeros((len(train), 6))
train_proba = pd.DataFrame(train_proba)
train_proba.columns = ['rf', 'ada', 'etc', 'gbc', 'sk_xgb', 'sk_lgb']

test_proba = np.zeros((len(test), 6))
test_proba = pd.DataFrame(test_proba)
test_proba.columns = ['rf', 'ada', 'etc', 'gbc', 'sk_xgb', 'sk_lgb']
```
Define the base models and run the cross-training predictions:
```python
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

rf = RandomForestClassifier(n_estimators=700, max_depth=13, min_samples_split=30,
                            min_weight_fraction_leaf=0.0, random_state=24, verbose=0)
ada = AdaBoostClassifier(n_estimators=450, learning_rate=0.1, random_state=24)
gbc = GradientBoostingClassifier(learning_rate=0.08, n_estimators=150, max_depth=9,
                                 min_samples_leaf=70, min_samples_split=900,
                                 max_features='sqrt', subsample=0.8, random_state=10)
etc = ExtraTreesClassifier(n_estimators=290, max_depth=12, min_samples_split=30, random_state=24)
sk_xgb = XGBClassifier(learning_rate=0.05, n_estimators=400, min_child_weight=20,
                       max_depth=3, subsample=0.8, colsample_bytree=0.8,
                       reg_lambda=1., random_state=10)
sk_lgb = LGBMClassifier(num_leaves=31, max_depth=3, learning_rate=0.03, n_estimators=600,
                        subsample=0.8, colsample_bytree=0.9, objective='binary',
                        min_child_weight=0.001, subsample_freq=1, min_child_samples=10,
                        reg_alpha=0.0, reg_lambda=0.0, random_state=10,
                        n_jobs=-1, silent=True, importance_type='split')

kf = StratifiedKFold(n_splits=5, random_state=233, shuffle=True)

clf_name = ['rf', 'ada', 'etc', 'gbc', 'sk_xgb', 'sk_lgb']
for i, clf in enumerate([rf, ada, etc, gbc, sk_xgb, sk_lgb]):
    pred_list = []
    col = clf_name[i]
    for train_index, val_index in kf.split(train, label):
        X_train = train.loc[train_index, :].values
        y_train = label[train_index]
        X_val = train.loc[val_index, :].values
        y_val = label[val_index]

        clf.fit(X_train, y_train)
        y_vali = clf.predict_proba(X_val)[:, 1]
        train_proba.loc[val_index, col] = y_vali  # out-of-fold probabilities become meta-features
        print("%s cv auc %s" % (col, roc_auc_score(y_val, y_vali)))

        y_testi = clf.predict_proba(test.values)[:, 1]
        pred_list.append(y_testi)
    test_proba.loc[:, col] = np.mean(np.array(pred_list), axis=0)  # average the 5 test-set predictions
```
Finally, use logistic regression for the stacking layer:
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
train_proba = train_proba.values
test_proba = test_proba.values
scaler.fit(train_proba)
train_proba = scaler.transform(train_proba)
test_proba = scaler.transform(test_proba)

lr = LogisticRegression(tol=0.0001, C=0.5, random_state=24, max_iter=10)
kf = StratifiedKFold(n_splits=5, random_state=244, shuffle=True)

auc_list = []
pred_list = []
for train_index, val_index in kf.split(train_proba, label):
    X_train = train_proba[train_index]
    y_train = label[train_index]
    X_val = train_proba[val_index]
    y_val = label[val_index]

    lr.fit(X_train, y_train)
    y_vali = lr.predict_proba(X_val)[:, 1]
    print("lr stacking cv auc %s" % (roc_auc_score(y_val, y_vali)))
    auc_list.append(roc_auc_score(y_val, y_vali))

    y_testi = lr.predict_proba(test_proba)[:, 1]
    pred_list.append(y_testi)

print(lr.coef_, lr.n_iter_)
```
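The loop above collects five test-set probability vectors in pred_list but does not combine them; averaging them yields the final stacked prediction (the final_test_pred name is illustrative):

```python
final_test_pred = np.mean(np.array(pred_list), axis=0)  # final stacked test-set probabilities
print('mean cv auc %s' % np.mean(auc_list))
```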
The final AUC scores of the base models and of the stacking model are 0.8415, 0.8506, 0.8511, 0.8551, 0.8572, 0.8580, and 0.8584, respectively.