Original article
Preface
Although I initially expected the focus of this article to be outlier treatment, reading through it the author's emphasis seems to be on tuning XGBoost and LightGBM and on tuning the ensemble weights.
Outlier handling is covered by little more than a simple threshold. Even so, there is still a lot worth learning here.
Main text
Below, following the order of the original article, is a rundown of the parts worth learning and some useful operations.
Converting the dtype of multiple columns
import numpy as np
import pandas as pd

prop = pd.read_csv('../input/properties_2016.csv')
# downcast every float64 column to float32 to save memory
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)
Personally, I think the same thing can be done without the explicit loop by selecting the float64 columns first:
float_cols = prop.select_dtypes(include=[np.float64]).columns
prop[float_cols] = prop[float_cols].astype(np.float32)
NaN handling: just fill everything in one shot
df_train.fillna(df_train.median(),inplace = True)
General LightGBM workflow
import lightgbm as lgb
LightGBM's dataset structure
d_train = lgb.Dataset(x_train, label=y_train)
Hyperparameters
params = {}
params['max_bin'] = 10
params['learning_rate'] = 0.0021 # shrinkage_rate
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['metric'] = 'l1' # or 'mae'
params['sub_feature'] = 0.5 # feature_fraction -- OK, back to .5, but maybe later increase this
params['bagging_fraction'] = 0.85 # sub_row
params['bagging_freq'] = 40
params['num_leaves'] = 512 # num_leaf
params['min_data'] = 500 # min_data_in_leaf
params['min_hessian'] = 0.05 # min_sum_hessian_in_leaf
params['verbose'] = 0
Training
import gc

clf = lgb.train(params, d_train, 430)  # 430 boosting rounds
del d_train; gc.collect()
del x_train; gc.collect()
Prediction
# num_threads > 1 will predict very slowly in the kernel
clf.reset_parameter({"num_threads":1})
p_test = clf.predict(x_test)
del x_test; gc.collect()
Simple outlier handling
# drop outliers
train_df=train_df[ train_df.logerror > -0.4 ]
train_df=train_df[ train_df.logerror < 0.418 ]
General XGBoost training workflow
import xgboost as xgb
Setting the parameters
# xgboost params
xgb_params = {
'eta': 0.037,
'max_depth': 5,
'subsample': 0.80,
'objective': 'reg:linear',
'eval_metric': 'mae',
'lambda': 0.8,
'alpha': 0.4,
'base_score': y_mean,  # y_mean: mean of the training target, computed earlier in the original kernel
'silent': 1
}
XGBoost's data structure
dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test)
Training + prediction
# num_boost_rounds = 150
num_boost_rounds = 242
print("\nXGBoost tuned with CV in:")
print(" https://www.kaggle.com/aharless/xgboost-without-outliers-tweak ")
print("num_boost_rounds="+str(num_boost_rounds))
# train model
print( "\nTraining XGBoost ...")
model = xgb.train(dict(xgb_params, silent=1), dtrain, num_boost_round=num_boost_rounds)
print( "\nPredicting with XGBoost ...")
xgb_pred = model.predict(dtest)
print( "\nXGBoost predictions:" )
print( pd.DataFrame(xgb_pred).head() )
Simple weighted combination of the results
xgb + lgb + naive(0.0115) model
# Parameters
XGB_WEIGHT = 0.6500
BASELINE_WEIGHT = 0.0056
BASELINE_PRED = 0.0115
lgb_weight = 1 - XGB_WEIGHT - BASELINE_WEIGHT
pred = XGB_WEIGHT*xgb_pred + BASELINE_WEIGHT*BASELINE_PRED + lgb_weight*p_test
How were these ensemble weights chosen?
- The original article says:
- To tune lgb_weight, I've been following a strategy of repeatedly fitting a quadratic to the last 3 submissions to approximate the value that minimizes the LB score
It was only from a question in the comments that I roughly figured out what the author was doing:
lgb_weight -> LB score (LB score = the Kaggle leaderboard score)
0.202 -> 0.0644421
0.207 -> 0.0644408
0.212 -> 0.0644403
The author treats these three pairs as (x, y) points, fits a quadratic (a one-variable second-degree polynomial) through them, and reads off the x at which the parabola bottoms out; that is the "quadratic approximation" he refers to (a minimal sketch of the idea follows).
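A minimal sketch of that quadratic approximation, assuming nothing beyond the three (lgb_weight, LB score) pairs listed above; this is my own reconstruction of the idea, not code from the original kernel:
import numpy as np

# the three (lgb_weight, LB score) pairs quoted above
weights = np.array([0.202, 0.207, 0.212])
scores = np.array([0.0644421, 0.0644408, 0.0644403])

# fit y = a*x^2 + b*x + c exactly through the three points
a, b, c = np.polyfit(weights, scores, 2)

# the fitted parabola is minimized at x = -b / (2a), provided a > 0
best_lgb_weight = -b / (2 * a)
print(best_lgb_weight)
Each new submission then replaces the oldest of the three points and the fit is repeated, which is what "repeatedly fitting a quadratic to the last 3 submissions" refers to.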
Personal thoughts: it is a clever trick to optimize the LB score directly, but from a data point of view I do not think it is very sound.
A more natural idea is to give a higher weight to whichever model validates better,
or to grid-search the three weights on a validation set (a sketch of this follows).
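A rough sketch of that grid-search idea, evaluated on a held-out validation set rather than on leaderboard submissions. The arrays y_val, xgb_val_pred and lgb_val_pred are hypothetical placeholders (random data here for illustration), not variables from the original kernel; the blend formula mirrors the one above with BASELINE_PRED = 0.0115:
import numpy as np

# hypothetical validation targets and model predictions (placeholders, not from the kernel)
rng = np.random.default_rng(0)
y_val = rng.normal(0.0, 0.1, 1000)
xgb_val_pred = y_val + rng.normal(0.0, 0.05, 1000)
lgb_val_pred = y_val + rng.normal(0.0, 0.05, 1000)

BASELINE_PRED = 0.0115

def blend_mae(xgb_w, base_w):
    # the remaining weight goes to LightGBM, mirroring the kernel's formula
    lgb_w = 1 - xgb_w - base_w
    blend = xgb_w * xgb_val_pred + base_w * BASELINE_PRED + lgb_w * lgb_val_pred
    return np.mean(np.abs(blend - y_val))

# grid over the two free weights; the LightGBM weight is determined by them
best = min(
    ((blend_mae(xw, bw), xw, bw)
     for xw in np.arange(0.55, 0.76, 0.05)
     for bw in np.arange(0.0, 0.021, 0.005)),
    key=lambda t: t[0],
)
print("best (val MAE, XGB_WEIGHT, BASELINE_WEIGHT):", best)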
Contact: clarence_wu12@outlook.com