Bagging - Bootstrap AGGregatING

Learn n base learners in parallel, then combine them to reduce model variance.

Each of the n models (base learners) is trained independently of the others, which is what allows training to run in parallel.

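Combine learners by averaging the outputs (regression) or by majority voting (classification).

For regression, the final prediction is the mean of all models' outputs; for classification, it is the class that receives the most votes.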
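The two combination rules can be sketched in a few lines of plain Python. The function names and the toy predictions below are made up for illustration; each inner list holds one learner's outputs for the same two examples.

```python
from collections import Counter
from statistics import fmean

def combine_regression(predictions):
    """Average the per-learner outputs for each example."""
    # predictions[i][j] is learner i's output for example j;
    # zip(*predictions) regroups the values per example.
    return [fmean(column) for column in zip(*predictions)]

def combine_classification(predictions):
    """Return the most common label for each example (majority vote)."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

# Three hypothetical learners, two examples each
reg_preds = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
clf_preds = [["cat", "dog"], ["cat", "cat"], ["dog", "dog"]]

print(combine_regression(reg_preds))      # [2.0, 4.0]
print(combine_classification(clf_preds))  # ['cat', 'dog']
```

With this sketch a tied vote goes to the label that was seen first, so an odd number of learners is a common choice for binary classification.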

Each base learner is trained on a bootstrap sample

Given a dataset of m examples, create a sample by randomly drawing m examples with replacement.

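Around \(1 - 1/e \approx 63\%\) of the original examples appear in each bootstrap sample; the remaining 37% of its m slots are filled by duplicates. The roughly 37% of examples that were never drawn are the out-of-bag examples, which can serve as a validation set to check model quality.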
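The 63% figure follows from the chance that a given example is missed by all m draws, \((1 - 1/m)^m\), which tends to \(1/e\) as m grows. A quick numerical check, assuming NumPy is available (the dataset size and seed are arbitrary choices):

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
m = 10_000                      # hypothetical dataset size

# One bootstrap sample: m row indices drawn with replacement
indices = rng.integers(0, m, size=m)

in_bag = np.unique(indices)                # examples drawn at least once
oob = np.setdiff1d(np.arange(m), in_bag)   # out-of-bag: never drawn

unique_fraction = len(in_bag) / m
print(f"unique in-bag fraction: {unique_fraction:.3f}")
print(f"1 - 1/e:                {1 - 1 / math.e:.3f}")
print(f"out-of-bag examples:    {len(oob)}")
```

The `oob` indices are the rows one would hold out to validate the learner trained on this particular sample.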

Bagging Code (scikit-learn)

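import numpy as np
from sklearn.base import clone

class Bagging:
	def __init__(self, base_learner, n_learners):
		# Independent, unfitted copies of the base learner
		self.learners = [clone(base_learner) for _ in range(n_learners)]

	def fit(self, X, y):
		# X and y are expected to be pandas objects (rows are selected with .iloc)
		for learner in self.learners:
			# Bootstrap sample: len(X) row indices drawn with replacement
			examples = np.random.choice(len(X), len(X), replace=True)
			learner.fit(X.iloc[examples, :], y.iloc[examples])
		return self

	def predict(self, X):
		preds = [learner.predict(X) for learner in self.learners]
		return np.array(preds).mean(axis=0)  # average the outputs (regression)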
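The class above does not use the out-of-bag examples. As a rough illustration of out-of-bag validation, scikit-learn's own BaggingRegressor can report an out-of-bag score when oob_score=True is set. The synthetic dataset and the hyperparameters below are arbitrary assumptions, not values from these notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

model = BaggingRegressor(
    DecisionTreeRegressor(),  # base learner, passed positionally
    n_estimators=50,          # number of bootstrap-trained learners
    oob_score=True,           # evaluate each example with learners that never saw it
    random_state=0,
)
model.fit(X, y)

# R^2 computed from out-of-bag predictions; no separate validation split needed
print(f"out-of-bag R^2: {model.oob_score_:.3f}")
```

Because every example is out-of-bag for roughly a third of the learners, this gives a validation estimate without setting data aside.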

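Random Forest

Use decision trees as the base learners.

In addition to bootstrap sampling of the rows, a random subset of features is often selected as well, i.e. the columns are sampled too. The benefits: it mitigates overfitting to some extent and increases the diversity among the individual decision trees.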
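A minimal sketch of this idea, assuming NumPy and scikit-learn: each tree gets its own bootstrap sample of rows plus a random subset of columns. The names n_trees, n_cols and predict_forest are hypothetical, and the data is synthetic. Note that scikit-learn's RandomForestRegressor draws a feature subset at every split (via max_features) rather than once per tree, so this is a simplified variant.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

n_trees, n_cols = 20, 4  # hypothetical settings: trees, and features per tree

forest = []
for _ in range(n_trees):
    # Rows: bootstrap sample, drawn with replacement
    rows = rng.integers(0, len(X), size=len(X))
    # Columns: random feature subset, drawn without replacement
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)
    tree = DecisionTreeRegressor(random_state=0)
    tree.fit(X[rows][:, cols], y[rows])
    forest.append((tree, cols))  # remember which columns this tree uses

def predict_forest(forest, X):
    # Each tree sees only its own columns; outputs are averaged (regression)
    preds = [tree.predict(X[:, cols]) for tree, cols in forest]
    return np.mean(preds, axis=0)

y_hat = predict_forest(forest, X)
print(y_hat.shape)
```

Storing the column indices next to each tree matters: at prediction time every tree must receive the same features it was trained on.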