XGBoost Algorithm Overview

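Topics covered:

1. CART trees

2. Algorithm principle

3. Loss function

4. Node-splitting algorithm

5. Regularization

6. Handling missing values

7. Advantages and disadvantages

8. Application scenarios

9. sklearn parameters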

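1. CART trees

　　CART is a binary recursive partitioning technique: it splits the current sample set into two subsets, so every non-leaf node has exactly two branches and the resulting decision tree is a compact binary tree. Because the tree is binary, each decision is a "yes" or "no" test; even when a feature takes several values, the data is still divided into two parts. The CART algorithm consists of two main steps:

Build the tree by recursively partitioning the samples.
Prune the tree using validation data.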

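2. Algorithm principle

　　Input: training dataset $D$ and the stopping conditions;

　　Output: a CART decision tree.

　　Starting from the root node, apply the following operations recursively to each node to build a binary decision tree from the training dataset:

　　1) Let the training dataset at the node be $D$, and compute the Gini index of each available feature on this dataset. For each feature $A$ and each value $a$ it can take, split $D$ into $D_1$ and $D_2$ according to whether a sample's test $A = a$ is "yes" or "no", and compute the Gini index for $A = a$.

　　2) Among all features $A$ and all their candidate split points $a$, choose the feature and split point with the smallest Gini index as the optimal feature and optimal split point. Use them to create two child nodes from the current node, and distribute the training data to the two children according to the feature.

　　3) Recursively apply 1) and 2) to the two child nodes until the stopping condition is met.

　　4) The CART decision tree is generated.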
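　　Steps 1) to 4) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the `max_depth` stopping rule, and the toy weather data are all assumptions made up for this example, and pruning is left out.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Steps 1-2: scan every feature A and value a, and return the (A, a)
    whose yes/no test A == a gives the smallest weighted Gini index."""
    best = None  # (weighted_gini, feature_index, value)
    for feat in range(len(rows[0])):
        for value in sorted({row[feat] for row in rows}):
            yes = [y for row, y in zip(rows, labels) if row[feat] == value]
            no = [y for row, y in zip(rows, labels) if row[feat] != value]
            if not yes or not no:
                continue  # this test does not actually split the node
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feat, value)
    return best

def build_tree(rows, labels, max_depth=3):
    """Steps 3-4: recurse on the two children until a stop condition holds."""
    can_split = max_depth > 0 and gini(labels) > 0
    split = best_split(rows, labels) if can_split else None
    if split is None:
        # Leaf: predict the majority class of the samples that reached it.
        return Counter(labels).most_common(1)[0][0]
    _, feat, value = split
    yes_idx = [i for i, row in enumerate(rows) if row[feat] == value]
    no_idx = [i for i, row in enumerate(rows) if row[feat] != value]
    return {
        "feature": feat,
        "value": value,
        "yes": build_tree([rows[i] for i in yes_idx],
                          [labels[i] for i in yes_idx], max_depth - 1),
        "no": build_tree([rows[i] for i in no_idx],
                         [labels[i] for i in no_idx], max_depth - 1),
    }

def predict(tree, row):
    """Follow the yes/no tests from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] == tree["value"] else tree["no"]
    return tree

# Toy data (made up for illustration): features are (outlook, windy).
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("cloudy", "no"), ("cloudy", "yes")]
labels = ["play", "play", "play", "stay", "play", "stay"]
tree = build_tree(rows, labels)
```

　　On this toy data the root tests the second feature (windy), because that test yields the lowest weighted Gini index; the impure branch is then split again on the first feature.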

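3. Loss function

　　 $L = \sum\limits_{x_i \in R_m} (y_i - f(x_i))^2 + \sum\limits_{k=1}^K \Omega (f_k)$

　　The first term is the squared-error training loss over the samples $x_i$ that fall in region $R_m$; the second term sums the complexity penalty $\Omega$ over the $K$ trees $f_k$.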

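4. Node-splitting algorithm

　　CART classification trees use the Gini index as the node-splitting criterion.

　　The Gini index of a probability distribution is defined as: $Gini(p) = \sum\limits_{k=1}^K p_k (1-p_k) = 1 - \sum\limits_{k=1}^K p_k^2$

　　If the sample set $D$ is split into $D_1$ and $D_2$ according to whether feature $A$ takes a given value $a$, that is, $D_1 = \{(x,y) \in D | A(x) = a \} , D_2 = D - D_1$, then the Gini index of $D$ given feature $A$ is $Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$

　　The larger the Gini index, the greater the uncertainty of the sample set.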
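　　A small worked example may help (the counts are made up for illustration). Suppose $D$ holds 10 samples, 6 positive and 4 negative, and the test $A = a$ sends 4 positive samples to $D_1$ and the remaining 2 positive and 4 negative samples to $D_2$. Taking the Gini index of the split to be the size-weighted average of the Gini indices of the two subsets:

　　 $Gini(D) = 1 - (0.6^2 + 0.4^2) = 0.48$

　　 $Gini(D_1) = 1 - (1^2 + 0^2) = 0$

　　 $Gini(D_2) = 1 - \left( (\tfrac{1}{3})^2 + (\tfrac{2}{3})^2 \right) = \tfrac{4}{9} \approx 0.444$

　　 $Gini(D, A) = \tfrac{4}{10} \cdot 0 + \tfrac{6}{10} \cdot \tfrac{4}{9} = \tfrac{4}{15} \approx 0.267$

　　The split lowers the Gini index from 0.48 to about 0.267, so the two subsets are purer than the original set.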

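5. Regularization

　　The standard GBM implementation has no regularization step like the one in XGBoost. Regularization also helps reduce overfitting; in fact, XGBoost is known as a "regularized boosting" technique.

　　 $\Omega (f) = \gamma T + \frac{1}{2} \lambda ||\omega||^2$

　　Here $T$ is the number of leaves in the tree, $\omega$ is the vector of leaf weights, and $\gamma$ and $\lambda$ control the strength of the two penalties.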
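　　To make the penalty concrete, the sketch below evaluates $\Omega$ for a small hypothetical ensemble and adds it to the squared-error loss from section 3. The leaf weights, predictions, and the choice $\gamma = \lambda = 1$ are invented for illustration, and each tree is represented only by its list of leaf weights.

```python
def omega(leaf_weights, gamma, lam):
    """Complexity penalty of one tree: gamma * T + 0.5 * lambda * ||w||^2."""
    num_leaves = len(leaf_weights)
    return gamma * num_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)

def objective(y_true, y_pred, trees, gamma, lam):
    """Squared-error training loss plus the penalty summed over all trees."""
    loss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return loss + sum(omega(leaves, gamma, lam) for leaves in trees)

# Hypothetical ensemble: two trees with 2 and 3 leaves respectively.
trees = [[0.5, -0.5], [0.2, 0.1, -0.3]]
y_true = [1.0, 0.0, 1.0]
y_pred = [0.7, 0.1, 0.6]

penalty = sum(omega(leaves, gamma=1.0, lam=1.0) for leaves in trees)
total = objective(y_true, y_pred, trees, gamma=1.0, lam=1.0)
```

　　With these numbers the penalty (about 5.32) dominates the training loss (about 0.26), which shows how a large $\gamma$ discourages trees with many leaves.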

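6. Handling missing values

　　XGBoost has built-in rules for handling missing values. The user supplies a value that differs from all ordinary sample values and passes it in as a parameter (missing) so that XGBoost knows which entries are missing. At each node XGBoost decides which branch samples with a missing value should follow, and it learns this default direction during training so that missing values encountered later are routed the same way.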
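　　The idea of learning a direction for missing values at a node can be illustrated with a simplified sketch. It is not XGBoost's actual implementation, which compares split gains computed from gradient statistics; here the assumption is a regression split scored by the sum of squared errors, and the function names and data are made up for the example.

```python
import math

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def choose_default_direction(feature, target, threshold):
    """For the split `feature < threshold`, try sending the samples whose
    feature is missing (NaN) to each side and keep the cheaper option."""
    left = [t for f, t in zip(feature, target)
            if not math.isnan(f) and f < threshold]
    right = [t for f, t in zip(feature, target)
             if not math.isnan(f) and f >= threshold]
    missing = [t for f, t in zip(feature, target) if math.isnan(f)]
    cost_if_left = sse(left + missing) + sse(right)
    cost_if_right = sse(left) + sse(right + missing)
    return "left" if cost_if_left <= cost_if_right else "right"

nan = float("nan")
feature = [1.0, 2.0, nan, 8.0, 9.0, nan]
target = [1.0, 1.2, 5.1, 5.0, 5.2, 4.9]
direction = choose_default_direction(feature, target, threshold=5.0)
```

　　In this toy data the samples with a missing feature have targets close to those on the right-hand side, so the right branch becomes the default direction for this split.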

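7. Advantages and disadvantages

Advantages:

　　XGBoost supports parallel processing, which makes it much faster than GBM. LightGBM, released by Microsoft, is another algorithm aimed at faster training. XGBoost also supports running on Hadoop.

　　XGBoost lets users define custom objective functions and evaluation metrics; the only requirement is that the objective function be twice differentiable.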
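　　The twice-differentiable requirement exists because a custom objective has to supply both the first derivative (gradient) and the second derivative (Hessian) of the loss with respect to the prediction. Below is a minimal sketch for the logistic loss. It uses plain lists to stay dependency-free; the function names are hypothetical, and the exact callable signature and array types that a given xgboost version expects should be checked against its documentation.

```python
import math

def logistic_objective(y_true, raw_pred):
    """Return per-sample (grad, hess) of the log loss w.r.t. the raw score."""
    grad, hess = [], []
    for y, z in zip(y_true, raw_pred):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the score to a probability
        grad.append(p - y)              # first derivative of the loss
        hess.append(p * (1.0 - p))      # second derivative of the loss
    return grad, hess

def log_loss(y, z):
    """Log loss of one sample, used only to check the derivatives numerically."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

grad, hess = logistic_objective([1.0, 0.0], [0.3, -1.2])
```

　　A quick way to validate a hand-written objective like this is to compare `grad` and `hess` against finite differences of the loss itself.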

8. Application scenarios

　　Scoring systems, intelligent spam detection, and ad recommendation systems.

9. sklearn parameters

　　　　　　class xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, importance_type='gain', **kwargs)

　　max_depth: type (int) – Maximum tree depth for base learners.

　　learning_rate: type (float) – Boosting learning rate (xgb's "eta").

　　n_estimators: type (int) – Number of boosted trees to fit.

　　silent: type (boolean) – Whether to print messages while running boosting.

　　objective: type (string or callable) – Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).

　　booster: type (string) – Specify which booster to use: gbtree, gblinear or dart.

　　nthread: type (int) – Number of parallel threads used to run xgboost. (Deprecated, please use n_jobs.)

　　n_jobs: type (int) – Number of parallel threads used to run xgboost. (Replaces nthread.)

　　gamma: type (float) – Minimum loss reduction required to make a further partition on a leaf node of the tree.

　　min_child_weight: type (int) – Minimum sum of instance weight (hessian) needed in a child.

　　max_delta_step: type (int) – Maximum delta step we allow each tree's weight estimation to be. (This bounds the size of each weight update; it is not a maximum number of iterations.)

　　subsample: type (float) – Subsample ratio of the training instances.

　　colsample_bytree: type (float) – Subsample ratio of columns when constructing each tree.

　　colsample_bylevel: type (float) – Subsample ratio of columns for each split, in each level.

　　reg_alpha: type (float (xgb's alpha)) – L1 regularization term on weights.

　　reg_lambda: type (float (xgb's lambda)) – L2 regularization term on weights.

　　scale_pos_weight: type (float) – Balancing of positive and negative weights.

　　base_score: The initial prediction score of all instances, global bias.

　　seed: type (int) – Random number seed. (Deprecated, please use random_state.)

　　random_state: type (int) – Random number seed. (Replaces seed.)

　　missing: type (float, optional) – Value in the data that is to be treated as a missing value. If None, defaults to np.nan.

　　importance_type: type (string, default "gain") – The feature importance type for the feature_importances_ property: either "gain", "weight", "cover", "total_gain" or "total_cover".

　　**kwargs: type (dict, optional) – Keyword arguments for XGBoost Booster object. Full documentation of parameters can be found here:

posted @ 2019-03-04 21:07 burton_shi
