AdaBoost Algorithm in Machine Learning


So far, we have seen various implementations of the gradient boosting algorithm, such as XGBoost. In this article, we will discuss one of the boosting ensemble techniques: AdaBoost. We will look at an introduction to boosting ensembles, the AdaBoost algorithm, and how it works.

In this article from PythonGeeks, we will discuss AdaBoost and how it boosts the performance of decision trees. We will also look at how a learned AdaBoost model makes predictions. Towards the end of the article, we will look at the best ways to prepare datasets for learning with AdaBoost. So, let's dive straight into the introduction section to learn more about AdaBoost.

Boosting Ensemble Method

Boosting is a general ensemble technique in which we create a strong classifier from a sequence of weak classifiers. We achieve this by building a model from the training data and then fabricating a second model to rectify the errors of the first. We keep adding subsequent models until the training dataset is predicted accurately enough, enhancing the performance of the overall model.

AdaBoost is an example of such a boosting algorithm and was developed primarily for binary classification. It is also one of the clearest algorithms for understanding the concept of boosting.

AdaBoost also laid the groundwork for modern boosting algorithms such as stochastic gradient boosting machines.

Introduction to AdaBoost Algorithm

AdaBoost is an acronym for Adaptive Boosting, a Machine Learning technique used as an ensemble method. The most widely used base learner with AdaBoost is a decision tree with one level, which means each tree makes only a single split. In the AdaBoost algorithm, these one-level trees are known as Decision Stumps.

At the beginning of training, the model gives equal weights to all the data points in the dataset. When the model finishes the first iteration over the training data, it identifies the data points that were misclassified and assigns them higher weights. So, when we pass the data to the next model, it gives preference to the data points with higher weights. This process continues to iterate until we achieve a low error.

Learning of an AdaBoost Model from Data

AdaBoost yields the best results when we use it to boost the performance of decision trees on binary classification problems. It was originally formulated for binary classification, although it has since been extended to multi-class classification and regression.

Though we primarily use AdaBoost with decision trees, we can use it to boost the performance of any Machine Learning algorithm by combining the two. It gives the best possible outcome when used alongside weak learners. As we know, weak learners are models that are only slightly better at predicting a classification output than a random guess.

Because of this, the most widely used and well-suited learner to pair with AdaBoost is a decision tree with a single level. As these trees are very short and contain only a single level, they make just one decision for classification. Due to this, we call these trees decision stumps.

Each instance of our training dataset is weighted beforehand. We initialize the weight of every instance as
weight(x) = 1/n
where n is the number of training instances.
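As a quick, illustrative sketch (not part of any library API), the uniform starting weights could be set up with NumPy as follows; n is an assumed variable holding the number of training instances.

import numpy as np

n = 10                       # assumed number of training instances
weights = np.full(n, 1 / n)  # every instance starts with weight 1/n
print(weights.sum())         # the weights always sum to 1.0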

How Does AdaBoost Work?

Before we move on to the working of AdaBoost, let us first discuss how boosting works in general. Boosting builds 'n' decision trees sequentially during training. After the first tree/model is built, the records it misclassified are given priority by assigning them higher weights, so they carry more influence in the data passed to the second model. The process continues until we have built the number of base learners specified beforehand or the results are accurate enough. Remember that, in all boosting techniques, records can appear repeatedly across rounds.

As stated earlier, the records that were incorrectly classified are emphasized in the input for the next model. We repeat this process until the specified stopping condition is met. All types of boosting models work on this same principle.

Now that we know the boosting principle, the AdaBoost algorithm is easy to understand; let's walk through its working with the following steps. For contrast, a random forest builds 'n' independent trees, each growing from a root node into several leaf nodes. Some trees may be bigger than others, since a random forest imposes no fixed depth. AdaBoost, on the other hand, builds its trees sequentially and keeps each one to a single level.

The working of the AdaBoost model follows the below-mentioned path; a rough code sketch of these steps is given after the list:

  • Creation of the Base Learner (a decision stump)
  • Calculation of the Total Error of the stump on the weighted data
  • Calculation of the Performance (stage value) of the decision stump
  • Updating the Weights according to the misclassified points
  • Creation of a New Dataset for the next learner
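To make these steps concrete, here is a minimal, hedged sketch of one possible implementation in Python. It assumes NumPy arrays X and y with class labels coded as -1/+1, uses scikit-learn's DecisionTreeClassifier as the stump, and keeps all samples while reweighting them (rather than literally building a new dataset); the names adaboost_sketch and n_rounds are illustrative, not part of any library.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_sketch(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1 / n)                     # equal starting weights
    stumps, stages = [], []
    for _ in range(n_rounds):
        # 1. create the base learner: a one-level decision stump
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # 2. total (weighted) error of the stump
        terror = (pred != y).astype(float)
        M = np.sum(w * terror) / np.sum(w)
        M = np.clip(M, 1e-10, 1 - 1e-10)      # avoid division by zero in the log
        # 3. performance (stage value) of the stump
        S = np.log((1 - M) / M)
        # 4. update weights: misclassified points become heavier
        w = w * np.exp(S * terror)
        # 5. normalize the weights, which plays the role of the new dataset
        w = w / w.sum()
        stumps.append(stump)
        stages.append(S)
    return stumps, stages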

How Does the Algorithm Decide the Output for a Test Record?

Suppose that, with the newly created dataset, the algorithm constructs 3 decision stumps. A test record then passes through all the stumps the model has constructed. Passing through the 1st stump, the output is 1; through the 2nd stump, the output is 1 again; and through the 3rd stump, the output is 0. In AdaBoost, a vote then takes place between the stumps, weighted by each stump's performance, much as a vote takes place between the trees of a random forest. In the case above, the final output is 1.
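To mirror this example, here is a tiny illustrative snippet of a plain majority vote over the three stump outputs; keep in mind that AdaBoost's actual vote is weighted by each stump's performance, as described later in the article.

# outputs of the three stumps for one test record, as in the example above
stump_outputs = [1, 1, 0]

# plain majority vote: the value predicted by most stumps wins
final_output = max(set(stump_outputs), key=stump_outputs.count)
print(final_output)  # -> 1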

How to Train One Model

With the help of weighted samples, we train the weak classifiers on the training data. Each decision tree has only a single level, so the decision stump makes just one decision on a single input variable. Since we are dealing with binary classification, the output of this stump is either +1 or -1, corresponding to the first or second class value.

After each iteration, we calculate the misclassification rate with the formula

M = (N - correct) / N

Here, M is the misclassification rate, correct denotes the number of training instances that the model predicts accurately, and N denotes the total number of training instances.

As an example, if our model predicts 78 out of 100 training instances correctly, then the misclassification rate is (100 - 78)/100 = 0.22.
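The arithmetic in this example can be checked with a couple of lines of Python; correct and N are simply the illustrative numbers from the example.

correct, N = 78, 100
M = (N - correct) / N   # misclassification rate
print(M)                # -> 0.22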

The above equation was later modified to make it compatible with the weighting of the training instances:

M= sum(w(i) * terror(i))/ sum (w)

Here, w(i) is the weight of training instance i, and terror(i) denotes the prediction error for that instance (1 if it was misclassified, 0 otherwise).

As an example, suppose we have 3 training instances with weights 0.01, 0.5, and 0.2, predicted values of -1, -1, and -1, and actual output values of -1, 1, and -1. The terror values are then 0, 1, and 0.

Then the misclassification rate is
M = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2), or M ≈ 0.704
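The same weighted-error calculation can be reproduced with NumPy; the weights and terror values below are the ones assumed in the example.

import numpy as np

w = np.array([0.01, 0.5, 0.2])     # instance weights from the example
terror = np.array([0, 1, 0])       # 1 where the prediction was wrong
M = np.sum(w * terror) / np.sum(w)
print(round(float(M), 3))          # -> 0.704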

We then calculate a stage value for the trained weak learner, which provides a weighting for any predictions that learner makes. We can calculate the stage value as

S= ln((1-M)/M)

Here, S is the stage value used to weight predictions from the model, ln denotes the natural logarithm, and M is the misclassification rate. The advantage of using the stage value is that it ensures the more accurate learners contribute more to the final prediction.
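As a quick check of the stage-value formula, using the weighted misclassification rate from the previous example:

import math

M = 0.704                   # weighted misclassification rate from above
S = math.log((1 - M) / M)   # stage value; negative because M is above 0.5
print(round(S, 3))          # -> -0.866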

After each iteration, the model updates the weights of the data points, increasing the weights of the incorrectly predicted instances so that they carry relatively more influence than the correctly predicted ones.

We update these weights by using the formula

w = w * exp(S*terror)

Here, w denotes the weight of a specific training instance, exp() is the exponential function, S is the stage value of the weak classifier, and terror is the error the model made on that instance:
terror = 0 if (y = p), and 1 in all other cases.

Here, y is the actual output value for the instance, while p is the model's prediction.
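To see the effect of the weight-update formula, here is a small illustrative calculation with an assumed weight of 0.2 and an assumed stage value of 0.5:

import math

w, S = 0.2, 0.5            # assumed instance weight and stage value
for terror in (0, 1):      # correct vs. incorrect prediction
    print(round(w * math.exp(S * terror), 3))
# -> 0.2   (weight unchanged when the prediction was correct)
# -> 0.33  (weight increased when the prediction was wrong)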

AdaBoost Ensemble

In the ensemble technique, we add the weak models sequentially and then train them using weighted training data.

We continue this process until we have created a pre-set number of weak learners or we can no longer observe improvement on the dataset. At the end of the algorithm, we are left with a pool of weak learners, each with an associated stage value.

Making Predictions with AdaBoost

We perform classification with AdaBoost by calculating the weighted sum of the weak classifiers' predictions.
When we feed a new instance in for classification, each weak learner produces a predicted value of either +1 or -1. These predictions are then weighted by each weak learner's stage value. The overall output of the ensemble is the sum of all the weighted predictions: if the result is positive, the instance is classified as the first class (+1), and if it is negative, it is classified as the second class (-1).

As an example, suppose we have 5 weak classifiers that predict the values 1, 1, -1, 1, and -1. By simple majority voting, the first class wins and the model would output 1. Now, suppose the same 5 weak learners have stage values of 0.2, 0.5, 0.8, 0.2, and 0.9. When we calculate the weighted sum of these predictions, the result is -0.8, so the model instead produces an output of -1, the second class.
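The weighted vote from this example can be reproduced in a few lines of NumPy:

import numpy as np

preds  = np.array([1, 1, -1, 1, -1])           # outputs of the 5 weak learners
stages = np.array([0.2, 0.5, 0.8, 0.2, 0.9])   # their stage values
weighted_sum = float(np.sum(stages * preds))
print(round(weighted_sum, 2))                  # -> -0.8
print(1 if weighted_sum > 0 else -1)           # -> -1, the second class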

Python Code for AdaBoost

Programming AdaBoost in Python with scikit-learn is quick and convenient: the core of the algorithm takes only 3-4 lines of code. Before coding, we have to make sure that we have already split our data into training and test sets. After splitting the data, we import the necessary libraries, fit the model, and check the accuracy of the results.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

ad = AdaBoostClassifier()                     # creation of the classifier object
pred = ad.fit(xtrain, ytrain).predict(xtest)  # fit on the training split, predict on the test split
accuracy_score(ytest, pred)                   # accuracy of the predictions
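If you want a self-contained sketch that also performs the train/test split mentioned above, something like the following works; the toy dataset and hyperparameter values are illustrative choices, not recommendations.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# toy binary-classification dataset, used purely for illustration
X, y = load_breast_cancer(return_X_y=True)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=42)

ad = AdaBoostClassifier(n_estimators=100)   # the default base learner is a decision stump
pred = ad.fit(xtrain, ytrain).predict(xtest)
print(accuracy_score(ytest, pred))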

Data Preparation for AdaBoost

The following section of the article discusses the best ways to prepare your data for AdaBoost.

1. Quality Data:

Because the ensemble tries to correct the misclassifications of the preceding models, we have to make sure that the training data is of high quality and accurately labeled.

2. Outliers:

Outliers force the models to work hard to correct residual errors in the training data. Such outliers often add little value, and we can remove them from the training dataset.

3. Noisy Data:

Noisy data can lead to noise in the output, which can be quite troublesome to deal with. As a preventive measure, we should attempt to isolate the noisy parts and clean them up to obtain a high-quality training dataset.

Conclusion

With this, we have reached the end of the article on the AdaBoost algorithm. We covered an introduction to boosting ensembles and the working of AdaBoost. Furthermore, we came across the ways in which we can effectively prepare data for AdaBoost and boost the algorithm's performance. Hope this article from PythonGeeks was able to clear up your concepts about AdaBoost and its working.
