Ensemble Methods (Part 2): Boosting and AdaBoost
October 13, 2017
Ensemble methods were introduced in a previous tutorial. In this tutorial we explore another approach to ensemble learning called boosting, and a specific boosting algorithm called AdaBoost.
Boosting
Boosting aims to build a strong learner from a set of weak learners. A weak learner is a predictive model that performs only slightly better than random guessing, whereas a strong learner is a model whose predictions are close to perfect.
Boosting proceeds by repeatedly modifying the training set based on the performance of the earlier predictors, and can therefore be seen as a sequential ensemble method. It assigns greater importance to the data points that were previously misclassified, leading subsequent models to focus on those tougher data points. AdaBoost (adaptive boosting) is a popular boosting algorithm.
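To see boosting in action before deriving it, scikit-learn provides a ready-made AdaBoost implementation; a minimal usage sketch on a synthetic dataset (the dataset and parameter values here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 sequentially trained weak learners (decision stumps by default)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```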
AdaBoost
Let there be a binary classification dataset of n training samples (xi, yi), where yi is either +1 or -1. We seek to successively train m binary classifiers hj and later combine them to predict new instances x by their weighted majority vote:
$$\hat{y}(x) = \mathrm{sign}\left( \sum_{j=1}^{m} \alpha_j h_j(x) \right) \tag{1}$$

The classifier weights αj are computed following the premise of giving a greater influence to the more accurate classifiers.
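Assuming the weak learners are available as fitted models whose predict method returns labels in {-1, +1} (the names models and alphas below are illustrative), equation (1) amounts to:

```python
import numpy as np

def weighted_majority_vote(models, alphas, X):
    # Weighted sum of the individual predictions h_j(x) in {-1, +1}, as in (1)
    scores = sum(alpha * model.predict(X) for model, alpha in zip(models, alphas))
    return np.sign(scores)
```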
At each model iteration, the misclassification error εj on the training set is calculated:

$$\epsilon_j = \frac{\sum_{i=1}^{n} w_j(i)\, I(y_i \neq h_j(x_i))}{\sum_{i=1}^{n} w_j(i)} = \frac{\sum_{y_i \neq h_j(x_i)} w_j(i)}{\sum_{i=1}^{n} w_j(i)} \tag{2}$$

and used to compute αj:
$$\alpha_j = \frac{1}{2} \ln\left( \frac{1 - \epsilon_j}{\epsilon_j} \right) > 0 \tag{3}$$

which is positive as long as the weak learner does better than random guessing, i.e. εj < 1/2.
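In numpy, with w, y and y_pred holding the instance weights, true labels and predictions of hj (illustrative names), equations (2) and (3) translate to:

```python
import numpy as np

def classifier_weight(w, y, y_pred):
    # Equation (2): weighted fraction of misclassified training instances
    eps = np.sum(w * (y != y_pred)) / np.sum(w)
    # Equation (3): more accurate classifiers receive a larger weight
    return 0.5 * np.log((1 - eps) / eps)
```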
In addition, probability weights Wj = (wj(1), …, wj(n)) are used to assign a different importance to each training instance. They are individually modified after each model iteration, before the training set is resampled for the next classifier hj+1:

$$w_{j+1}(i) = w_j(i)\, e^{-\alpha_j y_i h_j(x_i)} \tag{4}$$

where, for binary classification,
$$y_i h_j(x_i) = \begin{cases} +1 & \text{if } x_i \text{ is well classified} \\ -1 & \text{if } x_i \text{ is misclassified} \end{cases}$$

Since αj > 0, the weights of instances misclassified by hj are increased (scaled up by the exponential term), while those of well-classified instances are decreased. Misclassified points thus take up a larger share of the training set for hj+1. Finally, to make Wj+1 a distribution whose values sum to 1, the weights are normalized:
$$w_{j+1}(i) \leftarrow \frac{w_{j+1}(i)}{\sum_{k} w_{j+1}(k)}$$

Note that there are two types of weights: the instance weights wj(i), used to train the models, and the classifier weights αj, used to combine the weak learners when predicting a future instance.
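A minimal numpy sketch of the weight update and normalization (again with illustrative array names):

```python
import numpy as np

def update_weights(w, alpha, y, y_pred):
    # Equation (4): misclassified points are scaled by exp(+alpha) > 1,
    # well-classified points by exp(-alpha) < 1
    w = w * np.exp(-alpha * y * y_pred)
    # Normalize so the new weights form a probability distribution
    return w / np.sum(w)
```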
Pseudocode
# Probability weights initialization
for i = 1, ..., n
    w1(i) = 1/n

# Model iteration
for j = 1, ..., m
    resample training set using Wj and fit model hj
    calculate misclassification error εj using (2)
    calculate weight αj by applying (3)
    for i = 1, ..., n
        compute wj+1(i) by applying (4)
    normalize all wj+1(i)

# Termination
output the weighted majority vote ŷ using (1)
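The pseudocode can be turned into a small runnable sketch. The version below follows the resampling variant described above and uses decision stumps from scikit-learn as weak learners; all names and details such as the stump depth are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m=50, random_state=0):
    """Train m weak learners on numpy arrays X, y with labels in {-1, +1}."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    w = np.full(n, 1.0 / n)                    # probability weights initialization
    models, alphas = [], []
    for _ in range(m):
        # Resample the training set according to w and fit a decision stump hj
        idx = rng.choice(n, size=n, replace=True, p=w)
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = h.predict(X)
        # Misclassification error (2) and classifier weight (3)
        eps = np.sum(w * (pred != y)) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Update and normalize the instance weights (4)
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted majority vote (1)
    scores = sum(a * h.predict(X) for h, a in zip(models, alphas))
    return np.sign(scores)
```

Note that many library implementations pass the weights directly to the weak learner (e.g. through a sample_weight argument) instead of resampling; resampling is used here simply to match the pseudocode above.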