Data Mining Competitions - A Brief Summary of Experience
1. EDA of Individual Features
- For binary and categorical features, plot the value counts as a bar chart: train['feature_name'].value_counts().sort_index().plot(kind='bar')
- For continuous numerical features, plot the empirical CDF:

import numpy as np
import matplotlib.pyplot as plt

def cdf_plot(data_series):
    """Plot the empirical CDF of a numerical pd.Series."""
    data_size = len(data_series)
    data_set = sorted(set(data_series))
    bins = np.append(data_set, data_set[-1] + 1)
    counts, bin_edges = np.histogram(data_series, bins=bins, density=False)
    counts = counts.astype(float) / data_size
    cdf = np.cumsum(counts)
    plt.plot(bin_edges[0:-1], cdf, linestyle='--', marker='o', color='b')
    plt.ylim((0, 1))
    plt.ylabel("CDF")
    plt.grid(True)
    plt.show()
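A minimal usage sketch (train.csv and the column names are hypothetical placeholders):

import pandas as pd

train = pd.read_csv('train.csv')                     # hypothetical data file
cdf_plot(train['feature_name'])                      # continuous numerical feature
train['cat_feature'].value_counts().sort_index().plot(kind='bar')  # categorical feature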
2. Handling Categorical Features
https://github.com/scikit-learn-contrib/categorical-encoding
There are three main approaches:
- If the values of the categorical feature are ordinal, i.e. they can be compared with each other, the feature can be mapped directly to a numerical feature.
- One-hot encoding. The most common approach. You may encode only the values that occur frequently enough: set a threshold on the occurrence count and either drop the values that fall below it or map them all to one special value (all rare values share a single code). The hashing trick can be used to reduce memory usage.
- Statistical encoding. The most common form is the count feature: for each category, count how many times it occurs in the training set (optionally plus the test set); see the sketch below. Encodings built from the sample labels, such as Target Encoding or Leave-One-Out Encoding, can leak information.
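A minimal sketch of count encoding plus rare-value grouping, on a toy DataFrame (the column name cat and the threshold are made up):

import pandas as pd

df = pd.DataFrame({'cat': ['a', 'a', 'b', 'c', 'c', 'c', 'd']})   # toy data

# Count encoding: replace each category by how often it appears.
counts = df['cat'].value_counts()
df['cat_count'] = df['cat'].map(counts)

# Before one-hot encoding, merge categories that appear fewer than `threshold`
# times into a single special value to keep the dimensionality under control.
threshold = 2
rare_values = counts[counts < threshold].index
df['cat_grouped'] = df['cat'].where(~df['cat'].isin(rare_values), 'RARE')
one_hot = pd.get_dummies(df['cat_grouped'], prefix='cat')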
One-hot encoding may produce high-dimensional sparse features, and different models use them differently:
- LR and linear SVM learn how much each individual feature affects the outcome, i.e. a linear relationship with the prediction target.
- FM and FFM learn the effect of second-order feature interactions.
- Tree models such as GBDT can learn even higher-order representations of the features.
- DeepFM, or GBDT leaf nodes + LR (FFM), combine low-order and high-order feature effects (see the GBDT-leaves + LR sketch below).
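The GBDT-leaves + LR combination can be sketched with scikit-learn as follows (the synthetic dataset and hyper-parameters are placeholders; a real pipeline would use the competition data):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# 1) Fit a GBDT; every sample is then described by the leaf it falls into in each tree.
gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)
leaves = gbdt.apply(X_lr)[:, :, 0]            # shape: (n_samples, n_trees)

# 2) One-hot encode the leaf indices and train an LR on top; the leaf features
#    act as learned high-order feature crosses.
enc = OneHotEncoder(handle_unknown='ignore')
lr = LogisticRegression(max_iter=1000)
lr.fit(enc.fit_transform(leaves), y_lr)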
On Target Encoding:
import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,    # Revised to encode validation series
                  val_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    val_series : validation categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean and count per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing factor
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Global prior: mean of the target over the whole training set
    prior = target.mean()
    # The bigger the count, the less the prior is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply the per-category averages to the train series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_trn_series.index = trn_series.index
    # Apply the per-category averages to the validation series
    ft_val_series = pd.merge(
        val_series.to_frame(val_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=val_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_val_series.index = val_series.index
    # Apply the per-category averages to the test series
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index, so restore it
    ft_tst_series.index = tst_series.index
    # Add Gaussian noise to fight overfitting
    return add_noise(ft_trn_series, noise_level), add_noise(ft_val_series, noise_level), add_noise(ft_tst_series, noise_level)
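A usage sketch (the train/valid/test DataFrames and the column names category_col / target are hypothetical):

trn_enc, val_enc, tst_enc = target_encode(trn_series=train['category_col'],
                                          val_series=valid['category_col'],
                                          tst_series=test['category_col'],
                                          target=train['target'],
                                          min_samples_leaf=100,
                                          smoothing=10,
                                          noise_level=0.01)
train['category_col_mean'] = trn_enc      # encoded training feature
valid['category_col_mean'] = val_enc      # encoded validation feature
test['category_col_mean'] = tst_enc       # encoded test feature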
On Bayesian smoothing:
import random
import numpy as np
import pandas as pd
import scipy.special as special

class HyperParam(object):
    """Beta-Binomial (Bayesian) smoothing of a rate such as CTR."""
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def sample_from_beta(self, alpha, beta, num, imp_upperbound):
        # Generate synthetic impression/click data for testing
        sample = np.random.beta(alpha, beta, num)
        I = []
        C = []
        for click_ratio in sample:
            imp = random.random() * imp_upperbound
            # imp = imp_upperbound
            click = imp * click_ratio
            I.append(imp)
            C.append(click)
        return pd.Series(I), pd.Series(C)

    def update_from_data_by_FPI(self, tries, success, iter_num, epsilon):
        # Update alpha and beta by fixed-point iteration until convergence
        for i in range(iter_num):
            new_alpha, new_beta = self.__fixed_point_iteration(tries, success, self.alpha, self.beta)
            if abs(new_alpha - self.alpha) < epsilon and abs(new_beta - self.beta) < epsilon:
                break
            self.alpha = new_alpha
            self.beta = new_beta

    def __fixed_point_iteration(self, tries, success, alpha, beta):
        # One fixed-point iteration step for the Beta-Binomial likelihood
        sumfenzialpha = (special.digamma(success + alpha) - special.digamma(alpha)).sum()
        sumfenzibeta = (special.digamma(tries - success + beta) - special.digamma(beta)).sum()
        sumfenmu = (special.digamma(tries + alpha + beta) - special.digamma(alpha + beta)).sum()
        return alpha * (sumfenzialpha / sumfenmu), beta * (sumfenzibeta / sumfenmu)
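A usage sketch for smoothing a click-through rate, with made-up per-category impression and click counts:

import pandas as pd

stats = pd.DataFrame({'impressions': [500, 40, 3, 1200],
                      'clicks':      [ 25,  1, 1,   90]})    # toy counts

hp = HyperParam(1, 1)                                        # initial alpha, beta
hp.update_from_data_by_FPI(stats['impressions'], stats['clicks'],
                           iter_num=1000, epsilon=1e-8)
# The smoothed rate pulls categories with few impressions toward the global
# prior alpha / (alpha + beta) instead of trusting their raw click/impression ratio.
stats['ctr_smoothed'] = ((stats['clicks'] + hp.alpha)
                         / (stats['impressions'] + hp.alpha + hp.beta))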
3. Feature Engineering and Feature Selection
Train a GBDT or RF and rank the features of the training set by importance, from high to low. Then:
- Build cross features directly: multiply or divide pairs of the high-importance features.
- Split the features into two parts, use one part as the training features, predict each feature of the other part in turn, and use the predictions as new features.
- Feature Aggregation: cross-statistics of the high-importance features. Concretely, pick two high-importance features each time, use one as the grouping variable, and compute the min/max, mean, median, variance, etc. of the other, e.g. new_feature = features.groupby('feature1')['feature2'].mean() (see the sketch after this list).
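A minimal sketch of such aggregation features (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({'feature1': ['a', 'a', 'b', 'b', 'b'],
                   'feature2': [1.0, 3.0, 2.0, 2.5, 4.0]})   # toy data

# Group by one important feature and aggregate another one.
agg = df.groupby('feature1')['feature2'].agg(['mean', 'median', 'std', 'min', 'max'])
agg.columns = ['feature2_' + c + '_by_feature1' for c in agg.columns]

# Merge the statistics back so every row gets the group-level values as new features.
df = df.merge(agg, left_on='feature1', right_index=True, how='left')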
Common feature selection methods:
- Exhaustive search: guaranteed to find the optimal subset of the full feature set, but needs O(2^n) time.
- Random search: heuristic; repeatedly train on a randomly chosen subset of features. Much cheaper to compute.
- mRMR (minimum Redundancy Maximum Relevance) feature selection; a rough sketch follows.
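A rough greedy mRMR sketch, assuming relevance and redundancy are both measured with scikit-learn's mutual-information estimators on a numpy feature matrix (real mRMR implementations differ in how they estimate these quantities):

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Greedily pick k features maximizing relevance to y minus the
    average redundancy with the features already selected."""
    relevance = mutual_info_classif(X, y, random_state=0)    # I(feature; label)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_score, best_j = -np.inf, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            # Redundancy: average MI between candidate j and the selected features.
            redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, s],
                                                         random_state=0)[0]
                                  for s in selected])
            if relevance[j] - redundancy > best_score:
                best_score, best_j = relevance[j] - redundancy, j
        selected.append(best_j)
    return selected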
4. XGBoost Parameter Tuning
- Initialize the parameters:
  eta = 0.1, max_depth = 10, subsample = 1.0, min_child_weight = 5,
  colsample_bytree = 0.5 (the fraction of features randomly sampled when constructing each tree; set it according to how many features the data has).
  Apart from eta = 0.1, the initial values of the other parameters depend on the specific problem.
  Choose a suitable objective and eval_metric. In xgboost.train(), the obj and feval arguments take a custom objective (loss) function and a custom evaluation function respectively, and the maximize argument tells whether the evaluation function should be maximized.
- Hold out 20% of the data as a validation set, set a large num_rounds, and stop training once the validation error starts to rise (see the sketch at the end of this section).
- Tune max_depth.
- Tune subsample (subsample ratio of the training instances).
- Tune min_child_weight.
- Tune colsample_bytree (subsample ratio of columns when constructing each tree).
- Finally, lower eta to 0.02 and find the most suitable num_rounds.
- Use the parameters obtained from the steps above as a baseline, and make small changes on top of it to push the model as close as possible to a (local) optimum.
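A sketch of the corresponding xgboost.train call with early stopping (the data is synthetic and the objective/metric are placeholders for a binary-classification problem):

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_trn, label=y_trn)
dvalid = xgb.DMatrix(X_val, label=y_val)

params = {'eta': 0.1,                  # lower to 0.02 at the very end
          'max_depth': 10,
          'subsample': 1.0,
          'min_child_weight': 5,
          'colsample_bytree': 0.5,
          'objective': 'binary:logistic',
          'eval_metric': 'auc'}
bst = xgb.train(params, dtrain,
                num_boost_round=10000,                       # large num_rounds
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=100,                   # stop when valid score stops improving
                verbose_eval=50)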
5. Ensembling
https://mlwave.com/kaggle-ensembling-guide/
- Voting: take the result that most models vote for.
  - Uniform voting: a lower correlation between ensemble members tends to improve the error-correcting capability.
  - Weighted voting: give better models more weight; the only way for the inferior models to overrule the best model (expert) is for them to collectively (and confidently) agree on an alternative.
- Averaging: bagging-style averaging, reduces overfitting.
  - Plain averaging: average the submissions from multiple models.
  - Rank averaging: first turn the predictions into ranks, then average these ranks (see the sketch below).
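A minimal rank-averaging sketch for two hypothetical submission files with columns id and pred:

import pandas as pd
from scipy.stats import rankdata

subs = [pd.read_csv(f) for f in ['sub_model1.csv', 'sub_model2.csv']]   # hypothetical files

# Replace each model's raw predictions by their normalized ranks, then average:
# this keeps only the ordering and ignores differences in calibration/scale.
ranks = [rankdata(s['pred']) / len(s) for s in subs]
blend = subs[0][['id']].copy()
blend['pred'] = sum(ranks) / len(ranks)
blend.to_csv('sub_rank_avg.csv', index=False)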
- Stacking and blending
  - Stacked generalization: use a pool of base classifiers, then another classifier to combine their predictions, with the aim of reducing the generalization error. The ensembling methods above combine the models' predictions with a fixed formula; stacking instead learns the combination with another algorithm (classifier).
  - Blending: skip cross-validation and use only a holdout set; the stacker model is trained on the base models' predictions on the holdout data.
    Compared with stacking:
    Pros: simple and fast, and there is no information leakage (e.g. with a 10% holdout, the first layer is trained on 90% of the data and the second layer on the remaining 10%).
    Cons: less training data is used, it is easy to overfit the holdout set, and the CV estimate is less reliable.
- Stacking with logistic regression: use a logistic regression as the stacker (see the sketch at the end of this section).
- Stacking with non-linear algorithms: popular non-linear stackers are GBM, KNN, NN, RF and ET. Non-linear stacking with the original features on multiclass problems gives surprising gains.
- Feature weighted linear stacking: first have each model predict on the extracted features, then use a linear model to learn which model is best for which samples, and combine the models' predictions with the learned weights.
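A sketch of stacking with out-of-fold predictions and a logistic-regression stacker (the base models and the synthetic data are placeholders):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
base_models = [RandomForestClassifier(n_estimators=200, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# Level-1 features: out-of-fold predicted probabilities of every base model, so the
# stacker never sees predictions made on samples the base model was trained on.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
oof = np.column_stack([cross_val_predict(m, X, y, cv=cv, method='predict_proba')[:, 1]
                       for m in base_models])

stacker = LogisticRegression()
stacker.fit(oof, y)
# At test time, refit each base model on all training data, build the same
# prediction matrix for the test set, and feed it to the stacker.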