LightGBM
Optimization in speed and memory usage
Many boosting tools use pre-sort-based algorithms [1][2] (e.g., the default algorithm in XGBoost) for decision tree learning. It is a simple solution, but not easy to optimize.
Most existing GBDT tools are based on the pre-sorted decision tree algorithm (e.g., XGBoost: one thread handles one feature; before tree building starts, all data are sorted by that feature's values and the sort result is recorded; then at each level's split, the data in every leaf are scanned one by one, computing the gain of splitting at each data point's feature value). The basic idea of this tree-construction algorithm is:
First, pre-sort every feature by its numeric values.
Second, when enumerating split points, find the best split point of a feature at O(#data) cost (a split gain is computed at every data point's feature value; the cost of each individual gain computation is not counted here).
Finally, once the best split is found (here the threads communicate to agree on the best split feature and threshold, then split the node with them), partition the data into left and right child nodes.
The advantage of this pre-sorted algorithm is that it finds split points exactly.
Its disadvantages are also obvious:
First, high memory consumption. The algorithm must store not only the feature values but also the result of sorting them (e.g., the sorted indices, for quickly locating split points later), consuming roughly twice the memory of the training data.
Second, high time cost: a split-gain computation is required at every candidate split point, which is expensive.
Finally, it is cache-unfriendly. After pre-sorting, each feature accesses the gradients in a random order, and different features access them in different orders, so the cache cannot be optimized. Likewise, when growing each tree level, a row-index-to-leaf-index array must be accessed randomly, again in a different order per feature, causing many cache misses.
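The pre-sorted scan described above can be sketched in a few lines of Python. This is purely illustrative, not LightGBM's code: the gain is a simplified squared-error reduction, and all names are ours.

```python
# Illustrative sketch of pre-sorted split finding for ONE feature.
# The gain here is a simplified squared-error reduction, not LightGBM's formula.

def best_split_presorted(values, targets):
    """Scan every boundary between sorted feature values: O(#data) gain evaluations."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # the pre-sort step
    n, total = len(values), sum(targets)
    left_sum = 0.0
    best_gain, best_thr = float("-inf"), None
    for rank in range(n - 1):
        left_sum += targets[order[rank]]
        if values[order[rank]] == values[order[rank + 1]]:
            continue  # cannot split between identical feature values
        k = rank + 1
        right_sum = total - left_sum
        # between-group variance term (larger is better; constants dropped)
        gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
        if gain > best_gain:
            best_gain = gain
            best_thr = (values[order[rank]] + values[order[rank + 1]]) / 2
    return best_gain, best_thr

print(best_split_presorted([3.0, 1.0, 4.0, 1.5, 9.0, 2.6],
                           [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]))  # best threshold: 3.5
```

Note how the inner loop visits every data point: this is the O(#data) cost per feature that the histogram algorithm below removes.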
When splitting on one feature, Figure 2a shows that sometimes the split point can be chosen within a certain range without affecting the accuracy. In Figure 2b, we bin the data points into two levels on the horizontal (i.e., feature) axis. Suppose we choose the quantization shown in Figure 2b; then the accuracy will not be affected either.
Of course, we would not know ahead of time how to bin the data to avoid losing accuracy. Therefore, we suggest an adaptive quantization scheme, pictured in Figure 2c, to keep the accuracy loss as small as possible. In the pre-processing stage, for each feature, the training data points are sorted according to the feature value, and we bin the feature values in the sorted order. We start with a very small initial bin length, e.g. 10^-8. As shown in Figure 2c, we only create bins where there are indeed data, because the boosting tree algorithm will not consider the areas without data anyway. We set an allowed maximum number of bins, denoted by B. If the bin length is so small that we need more than B bins, we simply increase the bin length and re-do the quantization. After the quantization, we replace the original feature values by the bin labels (0, 1, 2, ...). Note that since we start with a small bin length, ordinal categorical features are naturally taken care of.
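The adaptive quantization loop just described can be sketched as follows. One liberty taken: the bin length is doubled on each retry, while the text only says it is "incremented"; the function and variable names are ours.

```python
import bisect

# Sketch of the adaptive quantization idea described above. One liberty taken:
# the bin length is DOUBLED on each retry, while the text only says "increment".

def quantize(values, max_bins, init_len=1e-8):
    xs = sorted(set(values))
    length = init_len
    while True:
        bin_starts, start = [], None
        for v in xs:
            if start is None or v - start > length:
                start = v  # open a new bin only where data actually exists
                bin_starts.append(start)
        if len(bin_starts) <= max_bins:
            break
        length *= 2  # too many bins needed -> enlarge bin length, redo
    # replace original feature values by bin labels 0, 1, 2, ...
    return [bisect.bisect_right(bin_starts, v) - 1 for v in values], bin_starts

codes, bin_starts = quantize([0.01, 0.02, 0.5, 0.51, 10.0, 10.0], max_bins=3)
print(codes)  # [0, 0, 1, 1, 2, 2] -- nearby values share a bin label
```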
This simple binning scheme is very effective particularly for the boosting tree algorithm:
The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers and build a histogram of width k. While scanning the data, the discretized value is used as an index to accumulate statistics in the histogram; after one pass over the data, the histogram holds the required statistics (how many instances fall at each discretized value, used to compute the split gain). The best split point is then found by scanning the histogram's discrete values.
The histogram algorithm has many advantages. The most obvious is lower memory consumption: it needs no extra storage for pre-sort results, and it can keep just the discretized feature values, for which an 8-bit integer is usually sufficient, so memory consumption can drop to about 1/8 of the pre-sorted layout (a 4-byte float value plus a 4-byte sorted index replaced by a 1-byte bin label).
Computation cost also drops sharply: the pre-sorted algorithm computes a split gain at every feature value it visits, while the histogram algorithm needs only k computations (k can be treated as a constant), reducing the split-finding complexity from O(#data × #features) to O(k × #features). (Before discretization, every instance may have a distinct feature value, so a split gain must be computed at each one.)
Of course, the histogram algorithm is not perfect. Because features are discretized, the split points found are not exact, which can affect the result. But experiments on different datasets show that discretized split points have little impact on final accuracy, and sometimes even improve it. The reason is that a decision tree is a weak learner anyway, so exact split points are not that important; coarser split points also act as regularization and effectively prevent over-fitting; and even if a single tree's training error is slightly higher than with exact splitting, it matters little within the gradient boosting framework.
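The histogram scan just described can be sketched as below, using a simplified squared-error gain as a stand-in for the real split-gain formula; names are ours, not LightGBM's.

```python
# Sketch of histogram-based split finding, with a simplified squared-error
# gain standing in for the real split-gain formula.

def best_split_histogram(bin_codes, targets, n_bins):
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for b, t in zip(bin_codes, targets):  # ONE O(#data) pass: sum-up only
        counts[b] += 1
        sums[b] += t
    n, total = len(targets), sum(targets)
    left_n, left_s = 0, 0.0
    best_gain, best_bin = float("-inf"), None
    for b in range(n_bins - 1):           # only O(#bins) candidate splits
        left_n += counts[b]
        left_s += sums[b]
        if left_n == 0 or left_n == n:
            continue
        gain = (left_s ** 2 / left_n
                + (total - left_s) ** 2 / (n - left_n)
                - total ** 2 / n)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin  # an instance goes left if its bin <= best_bin

print(best_split_histogram([0, 0, 1, 1, 2, 2], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0], 3))
```

The expensive part (one pass over #data) is pure accumulation, while the gain formula is evaluated only #bins - 1 times.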
LightGBM uses histogram-based algorithms [3][4][5], which bucket continuous feature (attribute) values into discrete bins, to speed up training and reduce memory usage. The advantages of histogram-based algorithms are:
- Reduce calculation cost of split gain
- Pre-sort-based algorithms need O(#data) gain calculations
- Histogram-based algorithms only need O(#bins) calculations, and #bins is far smaller than #data
- Constructing the histogram still costs O(#data), but it involves only sum-up operations
- Use histogram subtraction for further speed-up
- To get one leaf's histograms in a binary tree, use the subtraction of its parent's and its sibling's histograms
- So histograms must be constructed for only one leaf (the one with smaller #data than its sibling); the sibling's histograms are then obtained by subtraction at small cost (O(#bins))
- Reduce memory usage
- Continuous values can be replaced with discrete bins; if #bins is small, a small data type, e.g. uint8_t, can store the training data
- No need to store additional information for pre-sorting feature values
- Reduce communication cost for parallel learning
Sparse optimization
- Only O(2 x #non_zero_data) operations are needed to construct the histogram for a sparse feature
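The histogram-subtraction trick above can be shown in a few lines (made-up numbers, purely illustrative, not LightGBM internals):

```python
# Toy illustration of histogram subtraction (made-up numbers, not LightGBM code).

def build_hist(bin_codes, grads, n_bins):
    hist = [0.0] * n_bins
    for b, g in zip(bin_codes, grads):
        hist[b] += g
    return hist

codes = [0, 1, 2, 2, 1, 0]
grads = [0.5, -1.0, 2.0, 0.5, 1.0, -0.5]
goes_left = [True, True, False, False, False, False]  # parent's split decision

parent = build_hist(codes, grads, 3)
# build the smaller child's histogram directly (cost: O(its #data))...
left = build_hist([c for c, l in zip(codes, goes_left) if l],
                  [g for g, l in zip(grads, goes_left) if l], 3)
# ...and get the sibling's histogram by subtraction at O(#bins) cost
right = [p - s for p, s in zip(parent, left)]
print(right)  # identical to building it directly from the right child's data
```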
Optimization in accuracy
Most decision tree learning algorithms grow trees level (depth)-wise, like the following image:
LightGBM grows trees leaf-wise (best-first) [7]. It chooses the leaf with the maximum delta loss to grow. Holding #leaf fixed, the leaf-wise algorithm can reduce more loss than the level-wise algorithm.
Leaf-wise may cause over-fitting when #data is small, so LightGBM provides the max_depth parameter to limit tree depth and avoid over-fitting (the tree still grows leaf-wise).
On top of the histogram algorithm, LightGBM optimizes further. First, it abandons the level-wise tree-growth strategy used by most GBDT tools in favor of a leaf-wise algorithm with a depth limit. Level-wise can split all leaves of the same level in one pass over the data, which is easy to multi-thread, makes model complexity easy to control, and resists over-fitting. In practice, however, level-wise is inefficient: it treats all leaves of a level indiscriminately, incurring much unnecessary overhead, since many leaves have low split gain and are not worth searching and splitting.
Leaf-wise is a more efficient strategy: at each step it finds, among all current leaves, the one with the highest split gain, splits it, and repeats. Compared with level-wise, leaf-wise therefore achieves lower error and better accuracy for the same number of splits. Its drawback is that it may grow very deep trees and over-fit, so LightGBM adds a maximum-depth limit on top of leaf-wise growth, preserving efficiency while preventing over-fitting.
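Leaf-wise growth, which always expands the leaf with the largest estimated loss reduction, can be sketched with a max-heap. The gain function below is a hypothetical stand-in for "evaluate this leaf's best split"; leaf ids and the decay schedule are ours.

```python
import heapq

# Toy illustration of leaf-wise (best-first) growth: always expand the leaf
# whose best split promises the largest loss reduction, within a leaf budget.
# gain_of is a hypothetical stand-in for "evaluate this leaf's best split".

def grow_leaf_wise(root_gain, gain_of, num_leaves):
    heap = [(-root_gain, 0)]      # max-heap via negated gains; leaf ids are ours
    next_id, split_order = 1, []
    while heap and len(split_order) + 1 < num_leaves:  # each split adds one leaf
        neg_gain, leaf = heapq.heappop(heap)
        if -neg_gain <= 0:
            break                 # no remaining leaf is worth splitting
        split_order.append(leaf)
        for child in (next_id, next_id + 1):
            heapq.heappush(heap, (-gain_of(child), child))
        next_id += 2
    return split_order            # leaves in the order they were split

# hypothetical gains that decay with the leaf id
print(grow_leaf_wise(10.0, lambda leaf: 10.0 / (leaf + 1), num_leaves=4))
```

In LightGBM itself this leaf budget corresponds to the num_leaves parameter, with max_depth as the depth cap.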
Optimization in network communication
Parallel learning in LightGBM needs only a few collective communication algorithms, such as "All reduce", "All gather" and "Reduce scatter". LightGBM implements the state-of-the-art algorithms described in this paper [6]. These collective communication algorithms can provide much better performance than point-to-point communication.
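To illustrate what these collectives compute, here is a toy single-process simulation of "Reduce Scatter" followed by "All Gather" (which together form an "All Reduce"). Real implementations exchange chunks over the network; here the "workers" are just lists.

```python
# Toy single-process simulation of "Reduce Scatter" + "All Gather", which
# together form an "All Reduce". Real implementations exchange chunks over
# the network; here the "workers" are just lists.

def reduce_scatter(worker_vectors):
    """After this step, worker k owns the global sum of chunk k."""
    n = len(worker_vectors)
    chunk = len(worker_vectors[0]) // n  # assume length divisible by n
    return [[sum(v[k * chunk + j] for v in worker_vectors) for j in range(chunk)]
            for k in range(n)]

def all_gather(owned_chunks):
    """Every worker receives every owned chunk -> each holds the full result."""
    full = [x for chunk in owned_chunks for x in chunk]
    return [list(full) for _ in owned_chunks]

hists = [[1, 2, 3, 4], [10, 20, 30, 40]]  # 2 workers, 4 histogram bins each
print(all_gather(reduce_scatter(hists))[0])  # [11, 22, 33, 44]
```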
Optimization in parallel learning
LightGBM provides the following parallel learning algorithms.
Feature Parallel
Traditional algorithm
Feature parallel aims to parallelize the "Find Best Split" step in decision tree learning (parallelism at the feature level: finding the best split feature and split point). The procedure of traditional feature parallel is:
- Partition data vertically (different machines have different feature set)
- Workers find the local best split point {feature, threshold} on their local feature set
- Communicate the local best splits with each other and get the global best one
- The worker that owns the best split performs it, then sends the resulting data partition to the other workers
- The other workers split their data according to the received partition
Shortcomings of traditional feature parallel:
- Computation overhead: it cannot speed up "split", whose time complexity is O(#data), so feature parallel does not scale well when #data is large.
- Communication of the split result, which costs about O(#data / 8) (one bit per data point).
Feature parallel in LightGBM
Since feature parallel cannot speed up well when #data is large, we make a small change here: instead of partitioning data vertically, every worker holds the full data. Thus LightGBM needs no communication of the data split result, since every worker knows how to split the data. And #data won't be larger, so it is reasonable to hold the full data on every machine.
The procedure of feature parallel in LightGBM:
- Workers find the local best split point {feature, threshold} on their local feature set
- Communicate local best splits with each other and get the best one
- Perform the best split (in XGBoost only one worker can perform the split, because only that worker's feature subset contains the best split feature; in LightGBM every worker holds the full data, so every worker can perform the split)
However, this feature parallel algorithm still suffers from the computation overhead of "split" when #data is large, so data parallel is the better choice when #data is large.
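LightGBM's feature-parallel procedure can be sketched as below. The split finder toy_split and its gain are made up for illustration; the point is that each "worker" searches a disjoint feature subset over the full data, and only the winning (gain, feature, threshold) triple is synced.

```python
# Sketch of LightGBM-style feature parallel: every "worker" sees the FULL
# data but searches only its own feature subset; only the winning
# (gain, feature, threshold) triple is synced. toy_split is made up,
# not LightGBM's split finder.

def toy_split(data, targets, f):
    """Hypothetical gain: split at the mean of feature f,
    score = |left target mean - right target mean|."""
    vals = [row[f] for row in data]
    thr = sum(vals) / len(vals)
    left = [t for v, t in zip(vals, targets) if v <= thr]
    right = [t for v, t in zip(vals, targets) if v > thr]
    if not left or not right:
        return (float("-inf"), f, thr)
    return (abs(sum(left) / len(left) - sum(right) / len(right)), f, thr)

def feature_parallel(data, targets, feature_partition):
    local_bests = [max((toy_split(data, targets, f) for f in feats),
                       key=lambda s: s[0])
                   for feats in feature_partition]  # "workers", run in sequence
    return max(local_bests, key=lambda s: s[0])     # sync the global best split

data = [[1.0, 5.0], [2.0, 6.0], [3.0, 1.0], [4.0, 2.0]]
y = [0.0, 0.0, 1.0, 1.0]
print(feature_parallel(data, y, [[0], [1]]))  # (gain, feature, threshold)
```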
Data Parallel
Traditional algorithm
Data parallel aims to parallelize the whole decision tree learning. The procedure of data parallel is:
- Partition data horizontally
- Workers use local data to construct local histograms
- Merge all local histograms into global histograms
- Find best split from merged global histograms, then perform splits
Shortcomings of traditional data parallel:
- High communication cost. With a point-to-point communication algorithm, the communication cost for one machine is about O(#machine * #feature * #bin). With a collective communication algorithm (e.g. "All Reduce"), the cost is about O(2 * #feature * #bin) (see the cost of "All Reduce" in chapter 4.5 of this paper).
Data parallel in LightGBM
We reduce the communication cost of data parallel in LightGBM:
- Instead of "merging global histograms from all local histograms", LightGBM uses "Reduce Scatter" to merge the histograms of different (non-overlapping) features on different workers. Workers then find the local best split on their merged histograms and sync up the global best split.
- As mentioned above, LightGBM uses histogram subtraction to speed up training. Based on this, we can communicate the histograms for only one leaf and get its sibling's histograms by subtraction as well.
Altogether, communication cost is reduced to O(0.5 * #feature * #bin) for data parallel in LightGBM.
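The Reduce-Scatter merge above can be sketched as follows (feature names, the owner map and the numbers are ours, purely illustrative): each worker contributes local histograms for all features, but each feature's merged histogram lands on exactly one owner worker, which then searches only its own features.

```python
# Toy sketch of the Reduce-Scatter histogram merge in data parallel.
# Feature names, the owner map and the numbers are ours, purely illustrative.

def merge_via_reduce_scatter(local_hists, feature_owner):
    """local_hists[w][f] is worker w's local histogram for feature f.
    Each feature's merged histogram lands on exactly one owner worker."""
    n_workers = len(local_hists)
    merged = [dict() for _ in range(n_workers)]
    for f, owner in feature_owner.items():
        hist = [0.0] * len(local_hists[0][f])
        for w in range(n_workers):
            for b, v in enumerate(local_hists[w][f]):
                hist[b] += v
        merged[owner][f] = hist  # only the owner stores the merged result
    return merged

local_hists = [
    {"f0": [1.0, 0.0], "f1": [0.0, 2.0]},  # worker 0's horizontal shard
    {"f0": [0.0, 3.0], "f1": [4.0, 0.0]},  # worker 1's horizontal shard
]
merged = merge_via_reduce_scatter(local_hists, {"f0": 0, "f1": 1})
print(merged)  # each worker now searches only its own merged features
```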
Voting parallel
Voting parallel further reduces the communication cost of data parallel to a constant. It uses two-stage voting to reduce the communication cost of the feature histograms. For more details, please refer to this paper.
LightGBM natively supports parallel learning, currently in two flavors: feature parallel and data parallel. The main idea of feature parallel is that different machines search for the best split point on different feature subsets and then synchronize the best split across machines. In data parallel, machines first build histograms locally, then merge them globally, and finally find the best split on the merged histograms. LightGBM optimizes both methods: in feature parallel, each machine keeps the full data locally to avoid communicating the data split result; in data parallel, Reduce Scatter distributes the histogram-merging work across machines, lowering both communication and computation, and histogram subtraction further halves the communication volume. Voting-based data parallel then reduces the communication cost of data parallel to a constant. With very large data, voting parallel achieves very good speed-ups.
References:
https://github.com/Microsoft/LightGBM/wiki/Features
https://github.com/Microsoft/LightGBM/wiki
http://www.msra.cn/zh-cn/news/blogs/2017/01/lightgbm-20170105.aspx