GBDT Feature Importance Calculation

GBDT principles and derivation: https://blog.csdn.net/yangxudong/article/details/53872141

PySpark classification, regression, and clustering examples:

https://blog.csdn.net/littlely_ll/article/details/78151964

https://blog.csdn.net/littlely_ll/article/details/78161574?utm_source=blogxgwz2

https://blog.csdn.net/littlely_ll/article/details/78155192

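Computing feature importance
Friedman proposed the following method in his GBM paper: the global importance of feature j is measured as the average of feature j's importance over the individual trees.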
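The averaging rule above is usually written as follows. This is a sketch in the notation commonly used in textbook presentations of Friedman's measure, not a quotation from this post: M is the number of trees, L is the number of leaves of tree T (so it has L-1 internal nodes), v_t is the feature used to split internal node t, and \hat{i}_t^2 is the reduction in squared error obtained by that split.

```latex
\hat{J}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_j^2(T_m),
\qquad
\hat{J}_j^2(T) = \sum_{t=1}^{L-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j)
```

The indicator \mathbf{1}(v_t = j) keeps only the splits made on feature j, so a feature that is never chosen for a split has importance zero in that tree.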

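Implementation code snippet

To make the calculation easier to follow, the scikit-learn implementation is shown below, with unrelated parts removed.

The following code is how the feature_importances_ attribute of a GradientBoostingClassifier object is computed: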

def feature_importances_(self):
    # Accumulate each tree's importance vector (one entry per feature).
    total_sum = np.zeros((self.n_features, ), dtype=np.float64)
    for tree in self.estimators_:
        total_sum += tree.feature_importances_
    # Average over the number of trees.
    importances = total_sum / len(self.estimators_)
    return importances

Here, self.estimators_ is the array of decision trees built by the algorithm, and tree.feature_importances_ is the feature importance vector of a single tree, computed as follows:

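cpdef compute_feature_importances(self, normalize=True):
    """Computes the importance of each feature (aka variable)."""
    # Declarations are omitted in this excerpt; importance_data refers to
    # the underlying buffer of the importances array.

    while node != end_node:
        if node.left_child != _TREE_LEAF:
            # ... and node.right_child != _TREE_LEAF:
            left = &nodes[node.left_child]
            right = &nodes[node.right_child]

            # Weighted impurity decrease of this split, credited to the
            # feature the node splits on.
            importance_data[node.feature] += (
                node.weighted_n_node_samples * node.impurity -
                left.weighted_n_node_samples * left.impurity -
                right.weighted_n_node_samples * right.impurity)
        node += 1

    # Scale by the weighted number of samples at the root.
    importances /= nodes[0].weighted_n_node_samples

    # The optional normalize step is omitted in this excerpt.
    return importances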

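The code above has been simplified but keeps the core idea: for every non-leaf node, compute the reduction in weighted impurity produced by its split. The larger the reduction, the more important the feature.

The reduction in impurity is exactly the gain of that split, so it can also be read this way: the larger the gain when a node splits, the more important the feature that node splits on.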
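The two steps described above (per-tree weighted impurity decrease, then averaging over trees) can be sketched in plain Python. This is a minimal illustration, not scikit-learn's code: the dict-based node layout, the function names tree_importances and average_importances, and the two toy trees are all assumptions made up for this example.

```python
TREE_LEAF = -1  # "no child" marker, mirroring _TREE_LEAF in the excerpt


def tree_importances(nodes, n_features):
    """Weighted impurity decrease per feature for a single tree.

    `nodes` is a list of dicts with keys feature, left, right, n, impurity
    (a hypothetical layout chosen for this sketch).
    """
    importances = [0.0] * n_features
    for node in nodes:
        if node["left"] == TREE_LEAF:
            continue  # leaves do not split, so they contribute nothing
        left = nodes[node["left"]]
        right = nodes[node["right"]]
        # Gain of the split, credited to the feature the node splits on.
        importances[node["feature"]] += (
            node["n"] * node["impurity"]
            - left["n"] * left["impurity"]
            - right["n"] * right["impurity"]
        )
    root_n = nodes[0]["n"]  # scale by the sample count at the root
    return [value / root_n for value in importances]


def average_importances(per_tree):
    """Average per-tree importance vectors over all trees."""
    n_trees = len(per_tree)
    return [sum(values) / n_trees for values in zip(*per_tree)]


def leaf(n):
    """Helper for this sketch: a pure leaf holding n samples."""
    return {"feature": -1, "left": TREE_LEAF, "right": TREE_LEAF,
            "n": n, "impurity": 0.0}


# Toy tree A: the root splits on feature 0, its left child on feature 1.
tree_a = [
    {"feature": 0, "left": 1, "right": 2, "n": 10, "impurity": 0.5},
    {"feature": 1, "left": 3, "right": 4, "n": 6, "impurity": 0.25},
    leaf(4), leaf(3), leaf(3),
]
# Toy tree B: a single split on feature 1.
tree_b = [
    {"feature": 1, "left": 1, "right": 2, "n": 10, "impurity": 0.4},
    leaf(5), leaf(5),
]

per_tree = [tree_importances(t, n_features=2) for t in (tree_a, tree_b)]
global_importances = average_importances(per_tree)
print(per_tree)
print(global_importances)
```

In tree A the root split contributes 10*0.5 - 6*0.25 - 4*0 = 3.5 to feature 0 and the second split contributes 6*0.25 = 1.5 to feature 1. Dividing by the 10 samples at the root gives 0.35 and 0.15 (up to floating-point rounding). Averaging with tree B, whose only split gives feature 1 a score of 0.4, makes feature 1 the more important feature overall.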

Original article: https://blog.csdn.net/yangxudong/article/details/53899260

posted @ 2020-07-29 10:14 静悟生慧
