Hyperparameter Tuning Summary
Grid search and Bayesian optimization are covered well elsewhere, so I won't repeat them here.
This is a quick record of the tuning experience I have accumulated at work.
n_estimators: interacts with several other parameters, learning_rate in particular. As a rule, the simpler each tree, the larger n_estimators should be; the more complex each tree, the smaller. "Simple" here means coarse splits, i.e. a small max_depth and a large min_child_weight. A value of >= 30 is generally recommended.
min_child_weight: roughly the minimum number of samples a leaf must contain (strictly, XGBoost measures it as the minimum sum of instance hessian weights, which equals the sample count only for squared-error regression). The larger the value, the less prone to overfitting; a common rule of thumb is about 5% of the training sample size.
max_depth: the number of levels in each tree. 2-5 is the usual recommendation; the smaller, the less prone to overfitting.
scale_pos_weight: anything from 1 to 10 is workable; 1-3 is most common.
The remaining parameters rarely repay much tuning; the commonly used values are fine. A baseline sketch applying these rules of thumb follows below.
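As a concrete starting point, here is a minimal baseline sketch that applies the rules of thumb above. The synthetic data, the exact 5% fraction, and the learning_rate are my illustrative assumptions, not values prescribed by these notes:

```python
import xgboost as xgb
from sklearn.datasets import make_classification

# Synthetic stand-in for your own training data (hypothetical).
X_train, y_train = make_classification(n_samples=10_000, n_features=20,
                                       weights=[0.95], random_state=2023)

baseline_params = {
    'max_depth': 3,                                # 2-5; smaller trees overfit less
    'min_child_weight': int(len(X_train) * 0.05),  # rule of thumb: ~5% of training rows
    'n_estimators': 100,                           # >= 30; trades off against learning_rate
    'learning_rate': 0.05,                         # illustrative choice
    'scale_pos_weight': 2,                         # 1-3 is common; up to 10 is workable
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'binary:logistic',
    'random_state': 2023,
    'n_jobs': -1,
}

clf = xgb.XGBClassifier(**baseline_params)
clf.fit(X_train, y_train)
```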
The main parameters I tune are n_estimators and min_child_weight. The yardstick is the KS statistic on the training, test, and OOT (out-of-time) sets, together with the gaps between them (a KS helper is sketched below).
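For reference, KS here is the maximum gap between the cumulative TPR and FPR curves. A minimal helper, assuming labels and predicted probabilities as inputs:

```python
import numpy as np
from sklearn.metrics import roc_curve

def ks_stat(y_true, y_prob):
    """KS statistic: maximum |TPR - FPR| over all score thresholds."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return np.abs(tpr - fpr).max()
```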
1. The approach below mainly applies when every variable in the data is weak (the highest IV is under 2%) and the first model run shows a large KS gap between the training and test sets.
```python
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_curve, roc_auc_score

def xgb_train(xtrain, ytrain, xtest, ytest, xoot, yoot, n_estimators, min_child_weight):
    xgb_param1 = {'max_depth': 3,
                  'n_estimators': n_estimators,
                  'learning_rate': 0.02,
                  'min_child_weight': min_child_weight,
                  'subsample': 0.7,
                  'colsample_bytree': 0.7,
                  'objective': 'binary:logistic',
                  # 'alpha': 6.04,
                  # 'lambda': 20.21,
                  # 'gamma': 0.001,
                  'random_state': 2023,
                  'n_jobs': -1}
    xgb_clf = xgb.XGBClassifier(**xgb_param1)
    xgb_clf.fit(xtrain, ytrain)

    # KS and AUC on the training set
    xgb_train_pred = xgb_clf.predict_proba(xtrain)[:, 1]
    fpr_train, tpr_train, _ = roc_curve(ytrain, xgb_train_pred)
    train_ks = abs(fpr_train - tpr_train).max()
    train_auc = roc_auc_score(ytrain, xgb_train_pred)

    # KS and AUC on the test set
    xgb_val_pred = xgb_clf.predict_proba(xtest)[:, 1]
    fpr_val, tpr_val, _ = roc_curve(ytest, xgb_val_pred)
    val_ks = abs(fpr_val - tpr_val).max()
    val_auc = roc_auc_score(ytest, xgb_val_pred)

    # KS and AUC on the OOT set
    xgb_oot_pred = xgb_clf.predict_proba(xoot)[:, 1]
    fpr_oot, tpr_oot, _ = roc_curve(yoot, xgb_oot_pred)
    oot_ks = abs(fpr_oot - tpr_oot).max()
    oot_auc = roc_auc_score(yoot, xgb_oot_pred)

    val_PSI = cal_psi(xgb_train_pred, xgb_val_pred)  # cal_psi: external PSI helper, sketched below
    return train_ks, val_ks, oot_ks

list1, list2, list3, list4, list5 = [], [], [], [], []
# Exhaustive grid: step 1 gives 80 x 270 = 21,600 fits; coarsen the steps if too slow.
for n in range(20, 100, 1):
    for k in range(30, 300, 1):
        # l is the feature-column list; train_xgb / val_xgb / oot_xgb are the three datasets.
        train_ks, val_ks, oot_ks = xgb_train(train_xgb[l], train_xgb['target'],
                                             val_xgb[l], val_xgb['target'],
                                             oot_xgb[l], oot_xgb['target'], n, k)
        # if (val_ks >= 0.32) and (train_ks - val_ks) < 0.04:
        #     print(n, k)
        # else:
        #     continue
        list4.append(n)
        list5.append(k)
        list1.append(train_ks)
        list2.append(val_ks)
        list3.append(oot_ks)

xgb_df = pd.DataFrame({'n_estimators': list4, 'min_child_weight': list5,
                       'train_ks': list1, 'test_ks': list2, 'oot_ks': list3})
xgb_df['ks_diff_test'] = xgb_df['train_ks'] - xgb_df['test_ks']
xgb_df['ks_diff_oot'] = xgb_df['train_ks'] - xgb_df['oot_ks']
```
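The snippet above calls cal_psi, which is not defined in these notes. A minimal sketch of one common PSI formulation, assuming the inputs are probability scores in [0, 1] and using equal-width bins; both assumptions are mine, not the author's original helper:

```python
import numpy as np

def cal_psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between two score distributions.

    PSI = sum((a% - e%) * ln(a% / e%)) over bins; equal-width bins on
    [0, 1], which assumes the scores are probabilities. The eps term
    guards against log(0) in empty bins.
    """
    edges = np.linspace(0, 1, n_bins + 1)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```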
Finally, select the better parameter combinations based on test-set performance and on the train-test and train-OOT KS gaps.
```python
# The thresholds here should be set according to your own data.
xgb_df_ks = xgb_df[(xgb_df['test_ks'] >= 0.16)
                   & (xgb_df['ks_diff_test'] < 0.04)
                   & (xgb_df['ks_diff_oot'] < 0.04)]
xgb_df_ks
```
From the filtered candidates, pick the final, relatively optimal parameters; one way to finish the loop is sketched below.
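For example, one could rank the surviving grid points by OOT KS and refit with the winning pair. Ranking on oot_ks is my assumption, not part of the original recipe, and any tie-breaking or business constraint would override it:

```python
# Rank the surviving combinations by OOT KS and take the best one (illustrative choice).
best = xgb_df_ks.sort_values('oot_ks', ascending=False).iloc[0]

final_params = {'max_depth': 3,
                'n_estimators': int(best['n_estimators']),
                'learning_rate': 0.02,
                'min_child_weight': int(best['min_child_weight']),
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'objective': 'binary:logistic',
                'random_state': 2023,
                'n_jobs': -1}

final_clf = xgb.XGBClassifier(**final_params)
final_clf.fit(train_xgb[l], train_xgb['target'])
```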