
Saturday, January 25, 2014

Python: Regression with AdaBoost


Hello Readers,


Today we will discuss regression with AdaBoost, part of the scikit-learn module for Python. We shall compare how this boosting technique allows the regressor to fit with less prediction error than a single decision tree.

We aim to create the final graphic below, showing the fits of a single decision tree and of a decision tree boosted with AdaBoost.


Start Python (I am using 2.7.5) and let us get started!


The Modules



We require a few modules to run the script: numpy, pylab, sklearn.tree, and sklearn.ensemble. Specifically, from sklearn.tree and sklearn.ensemble we will use the DecisionTreeRegressor and AdaBoostRegressor classes, respectively.
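
A minimal sketch of the imports (module names as in scikit-learn at the time of writing):

import numpy as np
import pylab as pl
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor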



Creating the Data



We will create a sinusoidal dataset using the cos() function, and add some Gaussian noise with the normal() function from numpy's random number generator.


Data Creation
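
A sketch of that step, assuming a fixed random seed, 100 sample points over [0, 6], and a noise scale of 0.1 (the exact values in the original screenshot may differ):

# seeded random number generator for reproducible noise
rng = np.random.RandomState(1)
# 100 x values between 0 and 6, shaped as a column for sklearn
X = np.linspace(0, 6, 100)[:, np.newaxis]
# cosine signal plus Gaussian noise
y = np.cos(X).ravel() + rng.normal(0, 0.1, X.shape[0])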
After we have the X and y, we can create the regression models from DecisionTreeRegressor and AdaBoostRegressor. Note that we use 300 estimators for the AdaBoost regression, which amounts to 299 boosts: 299 additional fits on the same data set, each with the instance weights adjusted according to the error of the previous prediction. Next we fit both models to X and y using fit(), and then call predict() on the X values to obtain the predicted y values.


Regression Modeling, Fitting, and Predicting
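
A sketch of the modeling step; the max_depth of 4 for the base trees is an assumption borrowed from the scikit-learn example this post follows:

# a single decision tree regressor
regr_1 = DecisionTreeRegressor(max_depth=4)
# an AdaBoost ensemble of 300 decision tree regressors
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                           n_estimators=300, random_state=rng)
# fit both models to the data
regr_1.fit(X, y)
regr_2.fit(X, y)
# predict on the same X values
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)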



Plotting the Predicted Values



Naturally, to visualize the two sets of predicted y values, we plot them over the original y data. Using the pylab module, we plot the original y values with scatter(), and the predicted y values with plot().

After adding x and y labels, a title, and a legend, we display the plot using show().


Plotting the Actual and Predicted Values
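
A sketch of the plotting step with pylab; the colors match the description below (green for the single tree, red for AdaBoost), while the exact labels and title are assumptions:

pl.figure()
# original noisy data as a scatter plot
pl.scatter(X, y, c="k", label="training samples")
# single decision tree fit in green
pl.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)
# AdaBoost fit in red
pl.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)
pl.xlabel("data")
pl.ylabel("target")
pl.title("Boosted Decision Tree Regression")
pl.legend()
pl.show()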

Which yields the graphic below.




Note how the green line (the single decision tree) has a rough fit as it tries to regress along the noisy cos() points. See how the red AdaBoost regression with 299 boosts fits the sinusoidal data more closely, by altering the instance weights according to the error of the current prediction at each boost. Increasing the number of boosts further refines the regression fit. For more about AdaBoost, see the scikit-learn documentation.


Thanks for reading,


Wayne
@beyondvalence