Hyperparameter Tuning
When tuning with GridSearchCV, the valid values for the `scoring` parameter are not listed clearly in the scikit-learn docs, so I looked them up myself. Usage looks like this:
```python
from sklearn.cluster import DBSCAN
from sklearn.model_selection import GridSearchCV

parameters = {'eps': [0.3, 0.4, 0.5, 0.6], 'min_samples': [20, 30, 40]}
db = DBSCAN(metric='cosine', algorithm='brute')  # pass the unfitted estimator; GridSearchCV clones and fits it
grid = GridSearchCV(db, parameters, cv=5, scoring='adjusted_rand_score')
```

The full list of `scoring` strings:
Scoring | Function | Comment
---|---|---
**Classification** | |
'accuracy' | metrics.accuracy_score |
'average_precision' | metrics.average_precision_score |
'f1' | metrics.f1_score | for binary targets
'f1_micro' | metrics.f1_score | micro-averaged
'f1_macro' | metrics.f1_score | macro-averaged
'f1_weighted' | metrics.f1_score | weighted average
'f1_samples' | metrics.f1_score | by multilabel sample
'neg_log_loss' | metrics.log_loss | requires predict_proba support
'precision' etc. | metrics.precision_score | suffixes apply as with 'f1'
'recall' etc. | metrics.recall_score | suffixes apply as with 'f1'
'roc_auc' | metrics.roc_auc_score |
**Clustering** | |
'adjusted_rand_score' | metrics.adjusted_rand_score |
**Regression** | |
'neg_mean_absolute_error' | metrics.mean_absolute_error |
'neg_mean_squared_error' | metrics.mean_squared_error |
'neg_median_absolute_error' | metrics.median_absolute_error |
'r2' | metrics.r2_score |
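Each string maps to the metrics function in the middle column; a custom metric can also be wrapped with `sklearn.metrics.make_scorer`. A minimal sketch of both forms (the LogisticRegression estimator and the synthetic data are placeholders of my own, not from the original example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# scoring='f1' selects the same metric as scoring=make_scorer(f1_score)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.1, 1.0, 10.0]},
                    cv=5,
                    scoring='f1')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```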
However, in a later course the instructor advised against grid search for models with many features: it is time-consuming, and parameters that score well on the training set will not necessarily hold up on an out-of-time validation set.
He recommended designing the tuning objective yourself instead: maximize the KS on the out-of-time validation set, while keeping the gap between the out-of-time KS and the training KS as small as possible.
Tuning method
- Maximize `offks + 0.8 * (offks - devks)`, where `devks` is the KS on the training (development) set and `offks` is the KS on the out-of-time validation set. Since `devks` is usually the larger of the two, the second term penalizes overfitting to the training window (see the sketch below).
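In code the objective is a one-liner. The helper names `ks_stat` and `tuning_objective` are my own; the KS computation (maximum gap between TPR and FPR) and the 0.8 weight come from the full script further down:

```python
from sklearn.metrics import roc_curve

def ks_stat(y_true, y_prob):
    # KS is the maximum vertical gap between the TPR and FPR curves.
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return abs(tpr - fpr).max()

def tuning_objective(devks, offks):
    # Reward out-of-time KS, penalize the train/out-of-time gap.
    return offks + 0.8 * (offks - devks)
```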
```python
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

data = pd.read_csv('Acard.txt')
# Months before 2018-11-30 are the development sample; that month is the out-of-time set.
train = data[data.obs_mth != '2018-11-30'].reset_index().copy()
val = data[data.obs_mth == '2018-11-30'].reset_index().copy()

feature_lst = ['person_info', 'finance_info', 'credit_info', 'act_info']
x = train[feature_lst]
y = train['bad_ind']
val_x = val[feature_lst]
val_y = val['bad_ind']
train_x, test_x, train_y, test_y = train_test_split(x, y, random_state=0, test_size=0.2)

def lgb_test(train_x, train_y, test_x, test_y, value):
    clf = lgb.LGBMClassifier(boosting_type='gbdt',
                             objective='binary',
                             metric='auc',
                             learning_rate=0.1,
                             n_estimators=value,
                             max_depth=5,
                             num_leaves=20,
                             max_bin=45,
                             min_data_in_leaf=6,
                             bagging_fraction=0.6,
                             bagging_freq=0,
                             feature_fraction=0.8,
                             silent=True)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (test_x, test_y)],
            eval_metric='auc')
    return clf, clf.best_score_['valid_1']['auc']

# Point the parameter we want to tune at `value` and set the search range.
min_value = 40
max_value = 60

best_omd = -1    # initialize once, before the loop; resetting it per iteration would discard the best result
best_value = -1
best_ks = []
for value in range(min_value, max_value + 1):
    lgb_model, lgb_auc = lgb_test(train_x, train_y, test_x, test_y, value)

    y_pred = lgb_model.predict_proba(x)[:, 1]
    fpr_lgb_train, tpr_lgb_train, _ = roc_curve(y, y_pred)
    train_ks = abs(fpr_lgb_train - tpr_lgb_train).max()   # devks

    y_pred = lgb_model.predict_proba(val_x)[:, 1]
    fpr_lgb, tpr_lgb, _ = roc_curve(val_y, y_pred)
    val_ks = abs(fpr_lgb - tpr_lgb).max()                 # offks

    omd = val_ks + 0.8 * (val_ks - train_ks)
    if omd > best_omd:
        best_omd = omd
        best_value = value
        best_ks = [train_ks, val_ks]

print('best_value:', best_value)
print('best_ks:', best_ks)
```
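Note the design choice: only one hyperparameter (`n_estimators`, via `value`) varies per loop. Tuning the others means repeating the same loop for `max_depth`, `num_leaves`, and so on, fixing the best values found so far each time. The cost therefore grows linearly with the number of parameters rather than multiplicatively, as a full grid search would.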