Mathematical mysteries behind AdaBoost

Debayan Bose
6 min read · Jul 28, 2021


Adaptive boosting, a.k.a. AdaBoost, is a very popular boosting algorithm and was first introduced by Freund and Schapire back in 1996. There has been a lot of mathematical research around the performance of AdaBoost, along with extensive numerical evidence of its robustness to overfitting. In this blog we will try to understand the mathematical intuition behind AdaBoost and look at one of the most popular conjectures, proposed by Breiman in 2001, that AdaBoost is a variant of Random Forest.

Any boosting algorithm places heavy emphasis on increasing accuracy by focusing on the residuals (the errors) produced by “weak learners”.

What is meant by weak learners?

We often talk about weak learners or “base line” models but tend to overlook the rationale behind the term. If XGBoost is not performing well on a learning set L of N samples, can we really call XGBoost a weak learning algorithm solely based on its performance?

Perhaps NO, we can’t. As per Freund and Schapire, boosting takes a weak learning algorithm, one that “performs just slightly better than random guessing”, and turns it into one with arbitrarily high accuracy. So a decision tree with one level (also called a decision stump) can fairly be called a weak learner. Even if the decision stump happens to perform fairly well, we would still call it a weak learner.
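
To make this concrete, here is a minimal sketch (using scikit-learn and a made-up toy dataset, not anything from the original papers) of a decision stump acting as a weak learner:

```python
# A decision stump is a depth-1 decision tree: one split, two leaves.
# The toy dataset below is purely illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # labels in {-1, +1}

stump = DecisionTreeClassifier(max_depth=1)      # the "weak learner"
stump.fit(X, y)
print("stump accuracy:", stump.score(X, y))      # better than 0.5, far from perfect
```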

So, with this knowledge, let’s proceed further and take a quick look at the algorithm (A Brief Introduction to Boosting, Schapire, 1999).

Given: (x_1, y_1), . . . , (x_m, y_m), where each x_i belongs to the input space X and each label y_i is in {-1, +1}.

The initial weight given to each observation will be equal (initialize the sampling weight) and can be defined as D_1(i) = 1/m.

For t = 1, . . . , T:

  1. Train the weak learner with sampling weight distribution D_t.
  2. Develop a weak hypothesis h_t with error e_t = Pr_{i ~ D_t}[h_t(x_i) ≠ y_i], i.e. the D_t-weighted probability of misclassification.
  3. Choose alpha_t = (1/2) * ln((1 - e_t) / e_t).
  4. Update: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t.

Here Z_t is a normalization factor, chosen so that D_{t+1} is again a probability distribution.

Output of the final hypothesis will be: H(x) = sign( sum_{t=1..T} alpha_t * h_t(x) ).

Based on the theory described above, we can go a little deeper to understand what these steps mean.

During initialization, we don’t know which observations are more prone to misclassification, and hence we start with an equal sampling weight. If we have m observations in the data, our initial weight would be 1/m. The next step is to train a weak learner. As described previously, we can use a decision stump, which is nothing but a decision tree with one node and two leaves. Here we pass our sampling weights as part of the parameter list. Next, we must calculate the misclassification error.

Strictly speaking, this misclassification error is a weighted sum over the misclassified observations, the weights being the sampling weights: e_t = sum of D_t(i) over all i where h_t(x_i) ≠ y_i.
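
A minimal numpy/scikit-learn sketch of these two steps (the toy data and variable names are my own, not from the paper): start with D_1(i) = 1/m, fit a stump with those sampling weights, and compute the weighted misclassification error e_t.

```python
# Equal initial weights, a weighted stump, and the weighted error e_t.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

m = len(y)
D = np.full(m, 1.0 / m)                  # D_1(i) = 1/m for every observation

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=D)         # sampling weights passed as a parameter
pred = stump.predict(X)

e_t = np.sum(D * (pred != y))            # weighted sum of misclassifications
print("weighted error e_t:", e_t)
```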

Now it’s time to calculate the performance of the decision stump (also referred to as alpha_t in the original paper): alpha_t = (1/2) * ln((1 - e_t) / e_t).

Let’s try to develop some intuition behind alpha_t. When e_t approaches 0 or 1, the magnitude of alpha_t blows up towards infinity, which is why implementations usually guard against those extremes. If the stump’s predictions mostly agree with the actual labels, e_t becomes small, leading to a higher value of alpha_t. This makes sense, because a higher value of alpha_t implies that the performance of our decision stump is superior. Whenever our decision stump misclassifies a record, we should give more weight to that observation during the weight update to take care of the error.
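
In code, with a small epsilon guard (my addition, not part of the original formula) so the logarithm stays finite at e_t = 0 or 1:

```python
import numpy as np

def stump_performance(e_t, eps=1e-10):
    """alpha_t = 0.5 * ln((1 - e_t) / e_t); eps keeps it finite at the extremes."""
    e_t = np.clip(e_t, eps, 1 - eps)
    return 0.5 * np.log((1 - e_t) / e_t)

print(stump_performance(0.10))   # small error        -> large positive alpha (~1.10)
print(stump_performance(0.50))   # coin-flip stump    -> alpha = 0
print(stump_performance(0.90))   # worse than chance  -> negative alpha (~-1.10)
```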

We will follow the equation described previously to update the weights. Let’s take a closer look at its mathematical form. When a record is predicted correctly, y_i * h_t(x_i) is positive, and thus the new weight becomes old_weight * exp(-performance). Conversely, when it is predicted incorrectly, y_i * h_t(x_i) is negative, leading to new weight = old_weight * exp(performance).

Now, if the stump’s performance value alpha_t is large, a misclassified record sees a major change in its new weight compared to the old weight, whereas if the performance value is small, the new weight stays closer to the old weight.

Once the weights are updated, we must normalize them: we take the sum of all the new weights (this sum is essentially Z_t) and divide each weight by this normalizing constant.
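
To see the mechanics of the update and the normalization on a handful of made-up numbers (the labels, predictions, and alpha value below are arbitrary, chosen only for illustration):

```python
import numpy as np

# Toy values just to show the mechanics of the update.
y       = np.array([ 1, -1,  1,  1])      # true labels
pred    = np.array([ 1, -1, -1,  1])      # stump predictions (the third one is wrong)
D       = np.full(4, 0.25)                # current sampling weights
alpha_t = 0.8                             # performance of the stump

D_new = D * np.exp(-alpha_t * y * pred)   # shrink correct, grow incorrect
Z_t   = D_new.sum()                       # normalization factor
D_new = D_new / Z_t                       # weights form a distribution again

print(D_new)   # the misclassified observation now carries the largest weight
```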

The above steps continue for T iterations, and we finally combine all the predictions by taking the sign of the weighted sum (the weights being alpha_t) of all the predictions.
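
Putting the whole loop together, here is a compact from-scratch sketch in numpy/scikit-learn. The function names, the epsilon guard, and the toy dataset are my own choices; this mirrors the description above rather than any official implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train T decision stumps with the AdaBoost re-weighting scheme. y must be in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                        # equal sampling weights
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)           # 1. weighted weak learner
        pred = stump.predict(X)
        e_t = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)   # 2. weighted error
        alpha_t = 0.5 * np.log((1 - e_t) / e_t)    # 3. performance of the stump
        D = D * np.exp(-alpha_t * y * pred)        # 4. re-weight the observations
        D = D / D.sum()                            #    normalize by Z_t
        stumps.append(stump)
        alphas.append(alpha_t)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Sign of the alpha-weighted sum of stump predictions."""
    agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(agg >= 0, 1, -1)               # avoid sign(0) returning 0

# Usage on toy data similar to the earlier snippets:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
stumps, alphas = adaboost_fit(X, y, T=50)
print("ensemble accuracy:", np.mean(adaboost_predict(X, stumps, alphas) == y))
```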

In 2001, Breiman conjectured that AdaBoost is a random forest under certain assumptions. Let’s dig into the maths behind the conjecture. But before that, let’s understand why we are curious about the rationale behind AdaBoost’s performance. In general, a linear combination of weak hypotheses increases the complexity of the model and should therefore result in an increase in generalization error. In contrast, the generalization error tends to decrease as we increase the number of iterations. This implies that a linear combination of 1000 trees generalizes better than a simpler combination of 10 trees (https://arxiv.org/pdf/1212.1108.pdf).

To explain this phenomenon, let’s look at margin theory. In 1998, Schapire et al. showed that the “generalization error of any convex combination of functions can be bounded by a function of their margins on the training examples, independent of the number of classifiers in the ensemble. AdaBoost provably produces reasonably large margins, and tends to continue to improve the margins even after the training error has reached zero.”
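
For completeness, the “margin” in this statement is the normalized, signed confidence of the weighted vote on a training example (x_i, y_i); in the notation used above it reads:

```latex
\text{margin}(x_i, y_i) \;=\; \frac{y_i \sum_{t=1}^{T} \alpha_t h_t(x_i)}{\sum_{t=1}^{T} \alpha_t}
```

It lies in [-1, +1] and is positive exactly when the ensemble classifies the example correctly; margin theory bounds the generalization error in terms of how large these margins are, not in terms of how many weak learners are combined.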

Now the bigger question here is the performance of the AdaBoost model on test data. In practice, the test error doesn’t fluctuate much and doesn’t approach the bound induced by the margins; it seems to converge to a stationary value instead. This behaviour makes the algorithm rather puzzling.

Now we can take a look at what Breiman conjectured back in 2001. On the basis of his conjecture, AdaBoost can be considered to be an ergodic dynamical system.

What’s an ergodic process?

A process is called ergodic if the ensemble average of its observables equals the time average.
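
Written out (a standard statement of this property, added here for context, with w_t denoting the state of the process at time t and f an observable): if the dynamics are ergodic with invariant probability measure \mu, then for almost every starting point

```latex
\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f(w_t) \;=\; \int f(w)\, d\mu(w)
```

The left-hand side is the time average; the right-hand side is the ensemble average over \mu.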

So now, if we go back to the output formulation of AdaBoost, we can rewrite the normalized output as a time average over the iterations, roughly of the form

H_T(x) = (1/T) * sum_{t=1..T} alpha(w_t) * h_{w_t}(x),

where w_t is the vector of sampling weights at iteration t, h_{w_t} is the weak hypothesis trained on those weights, and Q(w_t) denotes the normalization of the weight update (so Q(w_t) is analogous to Z_t in the AdaBoost formulation). Breiman proposed that if the sequence of weight vectors follows an ergodic process with an invariant measure mu, the time average above converges, as T grows, to the ensemble average

E_mu[ alpha(w) * h_w(x) ],

i.e. an average of classifiers whose weights are drawn from one fixed distribution mu. The invariant measure ensures that the weights are selected from the same probability distribution, which in turn leads to the view that AdaBoost is a random forest.

Strictly speaking, if we know the distribution P(X, Y) for a learning set L, we can compute the Bayes risk. So, theoretically, the generalization error of AdaBoost should converge to the Bayes risk as the sample size grows; statistically, an optimal AdaBoost should be Bayes-consistent. But as per empirical results, most of the time AdaBoost doesn’t converge in this sense and instead follows the behaviour of an ergodic process. This makes the claims of convergence questionable. In fact, research shows that all the observed convergence is on finite-sized data.
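
For reference (a standard definition, not something specific to AdaBoost): with eta(x) = P(Y = 1 | X = x), the Bayes risk is the error rate of the best possible classifier,

```latex
R^{*} \;=\; \mathbb{E}_{X}\bigl[\min\{\eta(X),\, 1 - \eta(X)\}\bigr]
```

and a learning algorithm is called Bayes-consistent if its generalization error converges to R^{*} as the sample size grows.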

Joshua Belanich et al., in 2015, published research arguing that the assumption of an invariant measure may not be correct for an infinite sample space; rather, it is a measure-preserving property that ensures the stability. But there is still a lot of research happening around the rationale behind the convergence of the generalization error.

Reference:

  1. https://link.springer.com/content/pdf/10.1023/A:1010933404324.pdf
  2. https://www.kdnuggets.com/2020/12/implementing-adaboost-algorithm-from-scratch.html
  3. https://arxiv.org/pdf/1212.1108.pdf
  4. https://arxiv.org/pdf/1504.07676v1.pdf


Debayan Bose

I am a machine learning lead working @ Thoughtworks. I love solving problems mathematically first and programmatically next.