Ensemble Methods (Part 2): Boosting and AdaBoost
October 13, 2017
Ensemble methods were introduced in a previous tutorial. In this tutorial we explore another approach to ensemble learning called boosting, and a specific boosting algorithm called AdaBoost.
Boosting
Boosting aims to build a strong learner from a set of weak learners. A weak learner is a predictive model that performs only slightly better than random guessing, whereas a strong learner is a model whose predictions are close to perfect.
Boosting proceeds by repeatedly modifying the training set based on the performance of the earlier predictors, and can therefore be seen as a sequential ensemble method. It assigns greater importance to the data points that were previously misclassified, leading subsequent models to focus on those tougher data points. AdaBoost (adaptive boosting) is a popular boosting algorithm.
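To see boosting in action before deriving it, scikit-learn provides a ready-made AdaBoost implementation; a minimal usage sketch on a synthetic dataset (the dataset and parameter values here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Toy binary classification problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 sequentially trained weak learners (decision stumps by default)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```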
AdaBoost
Let there be a binary classification dataset of n training samples (xi, yi), where yi is either +1 or -1. We seek to successively train m binary classifiers hj and later combine them to predict new instances x by their weighted majority vote:
$$\hat{y}(x) = \mathrm{sign}\left( \sum_{j=1}^{m} \alpha_j h_j(x) \right) \tag{1}$$

The classifier weights αj are computed following the premise of giving a greater influence to the more accurate classifiers.
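Assuming the weak learners are available as fitted models whose predict method returns labels in {-1, +1} (the names models and alphas below are illustrative), equation (1) amounts to:

```python
import numpy as np

def weighted_majority_vote(models, alphas, X):
    # Weighted sum of the individual predictions h_j(x) in {-1, +1}, as in (1)
    scores = sum(alpha * model.predict(X) for model, alpha in zip(models, alphas))
    return np.sign(scores)
```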
At each model iteration, the misclassification error εj on the training set is calculated:

$$\epsilon_j = \frac{\sum_{i=1}^{n} w_j(i)\, I(y_i \neq h_j(x_i))}{\sum_{i=1}^{n} w_j(i)} = \frac{\sum_{y_i \neq h_j(x_i)} w_j(i)}{\sum_{i=1}^{n} w_j(i)} \tag{2}$$

and used to compute αj:
$$\alpha_j = \frac{1}{2} \ln\left( \frac{1 - \epsilon_j}{\epsilon_j} \right) > 0 \tag{3}$$

which is positive as long as the weak learner does better than random guessing, i.e. εj < 1/2.
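In numpy, with w, y and y_pred holding the instance weights, true labels and predictions of hj (illustrative names), equations (2) and (3) translate to:

```python
import numpy as np

def classifier_weight(w, y, y_pred):
    # Equation (2): weighted fraction of misclassified training instances
    eps = np.sum(w * (y != y_pred)) / np.sum(w)
    # Equation (3): more accurate classifiers receive a larger weight
    return 0.5 * np.log((1 - eps) / eps)
```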
In addition, probability weights Wj = (wj(1), …, wj(n)) are used to assign a different importance to each training instance. They are individually modified after each model iteration, before the training set is resampled for the next classifier hj+1:

$$w_{j+1}(i) = w_j(i)\, e^{-\alpha_j y_i h_j(x_i)} \tag{4}$$

where, for binary classification,
$$y_i h_j(x_i) = \begin{cases} +1 & \text{if } x_i \text{ is well classified} \\ -1 & \text{if } x_i \text{ is misclassified} \end{cases}$$

Since αj > 0, the weights of instances misclassified by hj are increased (scaled up by the exponential term), while those of well-classified instances are decreased. Misclassified points thus take up a larger share of the training set for hj+1. Finally, to make Wj+1 a distribution whose values sum to 1, the weights are normalized:
$$w_{j+1}(i) \leftarrow \frac{w_{j+1}(i)}{\sum_{k} w_{j+1}(k)}$$

Note that there are two types of weights: the instance weights wj(i), used to train the models, and the classifier weights αj, used to combine the weak learners when predicting a future instance.
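A minimal numpy sketch of the weight update and normalization (again with illustrative array names):

```python
import numpy as np

def update_weights(w, alpha, y, y_pred):
    # Equation (4): misclassified points are scaled by exp(+alpha) > 1,
    # well-classified points by exp(-alpha) < 1
    w = w * np.exp(-alpha * y * y_pred)
    # Normalize so the new weights form a probability distribution
    return w / np.sum(w)
```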
Pseudocode
# Probability weights initialization
for i = 1, ..., n
    w1(i) = 1/n

# Model iteration
for j = 1, ..., m
    resample training set using Wj and fit model hj
    calculate misclassification error εj using (2)
    calculate weight αj by applying (3)
    for i = 1, ..., n
        compute wj+1(i) by applying (4)
    normalize all wj+1(i)

# Termination
output the weighted majority vote ŷ using (1)
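The pseudocode can be turned into a small runnable sketch. The version below follows the resampling variant described above and uses decision stumps from scikit-learn as weak learners; all names and details such as the stump depth are assumptions for illustration, not a reference implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, m=50, random_state=0):
    """Train m weak learners on numpy arrays X, y with labels in {-1, +1}."""
    rng = np.random.default_rng(random_state)
    n = len(y)
    w = np.full(n, 1.0 / n)                    # probability weights initialization
    models, alphas = [], []
    for _ in range(m):
        # Resample the training set according to w and fit a decision stump hj
        idx = rng.choice(n, size=n, replace=True, p=w)
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        pred = h.predict(X)
        # Misclassification error (2) and classifier weight (3)
        eps = np.sum(w * (pred != y)) / np.sum(w)
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Update and normalize the instance weights (4)
        w = w * np.exp(-alpha * y * pred)
        w = w / np.sum(w)
        models.append(h)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted majority vote (1)
    scores = sum(a * h.predict(X) for h, a in zip(models, alphas))
    return np.sign(scores)
```

Note that many library implementations pass the weights directly to the weak learner (e.g. through a sample_weight argument) instead of resampling; resampling is used here simply to match the pseudocode above.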