Original article

here

Preface

When I picked this article I expected its focus to be outlier treatment, but having read through it, the author's real focus seems to be tuning xgb and lgb and tuning the blending weights.
Outlier handling is dealt with by a simple threshold and little more. Even so, there is still plenty worth learning here.

Main body

Following the order of the original, here are the techniques and operations worth learning.

Converting the dtype of multiple columns

import numpy as np
import pandas as pd

prop = pd.read_csv('../input/properties_2016.csv')
# downcast every float64 column to float32 to halve memory usage
for c, dtype in zip(prop.columns, prop.dtypes):
    if dtype == np.float64:
        prop[c] = prop[c].astype(np.float32)

Personally, I think this would also work:

float_cols = prop.dtypes[prop.dtypes == np.float64].index
prop[float_cols] = prop[float_cols].astype(np.float32)
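A quick way to confirm the downcast actually shrinks the frame (assuming prop has been loaded as above):

print(prop.memory_usage(deep=True).sum() / 1024**2, "MB")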

NaN handling: just fill everything at once

df_train.fillna(df_train.median(), inplace=True)   # fill each column's NaNs with that column's median
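Note that median() only produces values for numeric columns, so any non-numeric columns keep their NaNs. A quick check of what is left over, assuming df_train as above:

# columns that still contain NaN after the median fill
remaining = df_train.isnull().sum()
print(remaining[remaining > 0])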

General LightGBM workflow

import lightgbm as lgb

LightGBM's dataset structure

d_train = lgb.Dataset(x_train, label=y_train)

Hyperparameters

params = {}
params['max_bin'] = 10
params['learning_rate'] = 0.0021 # shrinkage_rate
params['boosting_type'] = 'gbdt'
params['objective'] = 'regression'
params['metric'] = 'l1'          # or 'mae'
params['sub_feature'] = 0.5      # feature_fraction -- OK, back to .5, but maybe later increase this
params['bagging_fraction'] = 0.85 # sub_row
params['bagging_freq'] = 40
params['num_leaves'] = 512        # num_leaf
params['min_data'] = 500         # min_data_in_leaf
params['min_hessian'] = 0.05     # min_sum_hessian_in_leaf
params['verbose'] = 0

Training

clf = lgb.train(params, d_train, 430)

import gc

del d_train; gc.collect()
del x_train; gc.collect()
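The kernel trains for a fixed 430 rounds. A minimal sketch of how that round count could be chosen with a hold-out split and early stopping instead (x_tr/y_tr/x_val/y_val are a hypothetical split of the training data, and lgb.early_stopping needs a reasonably recent LightGBM):

import lightgbm as lgb

# hypothetical hold-out split (e.g. via sklearn.model_selection.train_test_split)
d_tr = lgb.Dataset(x_tr, label=y_tr)
d_val = lgb.Dataset(x_val, label=y_val, reference=d_tr)

clf = lgb.train(
    params,
    d_tr,
    num_boost_round=2000,                                # generous upper bound
    valid_sets=[d_val],
    callbacks=[lgb.early_stopping(stopping_rounds=50)],  # stop once validation l1 stalls
)
print(clf.best_iteration)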

Prediction

# num_threads > 1 will predict very slowly in the kernel
clf.reset_parameter({"num_threads":1})
p_test = clf.predict(x_test)

del x_test; gc.collect()

Simple outlier treatment

# drop outliers
train_df = train_df[train_df.logerror > -0.4]
train_df = train_df[train_df.logerror < 0.418]
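The cut-offs above are hard-coded. A hedged alternative is to derive them from quantiles of logerror (the 0.5/99.5 percentiles here are my assumption, not from the kernel):

lo, hi = np.percentile(train_df.logerror, [0.5, 99.5])   # assumed percentile choice
train_df = train_df[(train_df.logerror > lo) & (train_df.logerror < hi)]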

General XGBoost workflow

import xgboost as xgb

Setting the parameters

# xgboost params
xgb_params = {
    'eta': 0.037,
    'max_depth': 5,
    'subsample': 0.80,
    'objective': 'reg:linear',
    'eval_metric': 'mae',
    'lambda': 0.8,
    'alpha': 0.4,
    'base_score': y_mean,   # mean of the training target, computed earlier in the kernel
    'silent': 1
}

XGBoost's data structure

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test)

Training + prediction

# num_boost_rounds = 150
num_boost_rounds = 242
print("\nXGBoost tuned with CV in:")
print("   https://www.kaggle.com/aharless/xgboost-without-outliers-tweak ")
print("num_boost_rounds="+str(num_boost_rounds))

# train model
print( "\nTraining XGBoost ...")
model = xgb.train(dict(xgb_params, silent=1), dtrain, num_boost_round=num_boost_rounds)

print( "\nPredicting with XGBoost ...")
xgb_pred = model.predict(dtest)

print( "\nXGBoost predictions:" )
print( pd.DataFrame(xgb_pred).head() )
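The kernel says num_boost_rounds was tuned with CV. A minimal sketch of doing that with xgboost's built-in cross-validation (the nfold and early_stopping_rounds values here are my assumptions):

cv_result = xgb.cv(
    xgb_params,
    dtrain,
    num_boost_round=1000,        # generous upper bound
    nfold=5,                     # assumed fold count
    early_stopping_rounds=50,    # stop once CV mae stalls
    verbose_eval=50,
)
num_boost_rounds = len(cv_result)   # one row per surviving boosting round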

Simple weighted blending of the results

xgb + lgb + naive (0.0115) models

# Parameters
XGB_WEIGHT = 0.6500
BASELINE_WEIGHT = 0.0056

BASELINE_PRED = 0.0115

lgb_weight = 1 - XGB_WEIGHT - BASELINE_WEIGHT
pred = XGB_WEIGHT*xgb_pred + BASELINE_WEIGHT*BASELINE_PRED + lgb_weight*p_test

How were these blending weights chosen??

  • The original article says:
    • To tune lgb_weight, I've been following a strategy of repeatedly fitting a quadratic to the last 3 submissions to approximate the value that minimizes the LB score
  • From a question in the comments, I pieced together what the author is actually doing:
    lgb_weight -> LB score (the Kaggle leaderboard score)
    0.202 -> 0.0644421,
    0.207 -> 0.0644408,
    0.212 -> 0.0644403
    The author treats these three pairs as (x, y) points, fits a quadratic through them, and reads off the x where it bottoms out. That is the "quadratic approximation" (see the sketch after this list).
  • My own thoughts:
    It is a clever idea, attacking the LB score directly, but from a data point of view it does not feel very sound to me.
    A more natural approach is to give higher weight to the model that validates better,
    or to grid-search the three weights.
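A minimal numpy sketch of that quadratic approximation, using the three (lgb_weight, LB score) pairs quoted above (np.polyfit is my choice here; the author does not say how the fit was done):

import numpy as np

w = np.array([0.202, 0.207, 0.212])                 # lgb_weight values
lb = np.array([0.0644421, 0.0644408, 0.0644403])    # corresponding LB scores

a, b, c = np.polyfit(w, lb, 2)    # lb ≈ a*w**2 + b*w + c
w_opt = -b / (2 * a)              # vertex of the parabola (a minimum only if a > 0)
print(w_opt)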