Bagging
Bagging (Bootstrap aggregating) is a meta-algorithm introduced by Breiman [1] that trains multiple versions of a predictor on bootstrap samples of the training set and aggregates them into a single predictor.
The Random Subspace Method is another meta-algorithm, proposed by Ho [2], that applies the same idea to the feature space: each predictor is trained on a random subset of the features.
The combination of these two methods is called SubBag and was proposed by Pance Panov and Saso Dzeroski [3].
Here, BaggingClassifier and BaggingRegressor implement the SubBag meta-estimator.
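The sampling that SubBag performs for each base learner can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not the library's implementation: the object `SubBagSamplingSketch` and its functions are hypothetical names, and the fixed-size sampling used here is an assumption (a distributed implementation may sample each row independently and so produce samples of varying size).

```scala
import scala.util.Random

object SubBagSamplingSketch {

  // Draw row indices for one base learner.
  // With replacement this is a bootstrap sample; without, a random subset.
  def sampleRows(numRows: Int, subsampleRatio: Double, replacement: Boolean, rng: Random): Seq[Int] = {
    val size = math.max(1, math.round(numRows * subsampleRatio).toInt)
    if (replacement) Seq.fill(size)(rng.nextInt(numRows))
    else rng.shuffle((0 until numRows).toList).take(size)
  }

  // Draw a random subset of feature indices (random subspace), without replacement.
  def sampleFeatures(numFeatures: Int, subspaceRatio: Double, rng: Random): Seq[Int] = {
    val size = math.max(1, math.round(numFeatures * subspaceRatio).toInt)
    rng.shuffle((0 until numFeatures).toList).take(size).sorted
  }

  // Project a dataset onto the sampled rows and features.
  def subBag(data: Vector[Vector[Double]], rows: Seq[Int], features: Seq[Int]): Vector[Vector[Double]] =
    rows.map(r => features.map(f => data(r)(f)).toVector).toVector
}
```

Each base learner would be trained on its own `subBag` output, so the learners differ both in the examples and in the features they see.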
For classification, BaggingClassificationModel aggregates the base model predictions by majority vote. The vote can be either hard or soft, using the predicted classes or the predicted probabilities of each base model, respectively.
For regression, BaggingRegressionModel uses the average of the base model predictions.
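The difference between the aggregation rules is easiest to see on a small example. The following is a minimal sketch in plain Scala, assuming each base model outputs one probability per class; `AggregationSketch` and its functions are hypothetical names for illustration and do not reflect how the models are implemented internally.

```scala
object AggregationSketch {

  // Index of the largest value.
  private def argmax(xs: Seq[Double]): Int = xs.indexOf(xs.max)

  // Hard voting: each base model casts one vote for its predicted class.
  def hardVote(probabilities: Seq[Seq[Double]]): Int = {
    val numClasses = probabilities.head.length
    val votes = probabilities.map(argmax)
    (0 until numClasses).maxBy(c => votes.count(_ == c))
  }

  // Soft voting: sum the predicted probabilities per class, then pick the largest.
  def softVote(probabilities: Seq[Seq[Double]]): Int = {
    val numClasses = probabilities.head.length
    val summed = (0 until numClasses).map(c => probabilities.map(_(c)).sum)
    argmax(summed)
  }

  // Regression: plain mean of the base model predictions.
  def average(predictions: Seq[Double]): Double =
    predictions.sum / predictions.length
}
```

With base model outputs `Seq(Seq(0.55, 0.45), Seq(0.55, 0.45), Seq(0.05, 0.95))`, hard voting picks class 0 (two votes to one), while soft voting picks class 1, because the third model is far more confident than the other two.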
Parameters
The parameters available for Bagging control the number of base learners and the randomness of the SubBag method, that is, how examples and features are sampled.
import org.apache.spark.ml.classification.{BaggingClassifier, DecisionTreeClassifier}
import org.apache.spark.ml.regression.{BaggingRegressor, DecisionTreeRegressor}
new BaggingClassifier()
  .setBaseLearner(new DecisionTreeClassifier()) // Base learner used by the meta-estimator.
  .setNumBaseLearners(10) // Number of base learners.
  .setSubsampleRatio(0.8) // Sampling ratio of examples.
  .setReplacement(true) // Whether examples are drawn with replacement.
  .setSubspaceRatio(0.8) // Sampling ratio of features.
  .setVotingStrategy("soft") // "soft" or "hard" majority vote.
  .setParallelism(4) // Number of base learners trained simultaneously.
new BaggingRegressor()
  .setBaseLearner(new DecisionTreeRegressor()) // Base learner used by the meta-estimator.
  .setNumBaseLearners(10) // Number of base learners.
  .setSubsampleRatio(0.8) // Sampling ratio of examples.
  .setReplacement(true) // Whether examples are drawn with replacement.
  .setSubspaceRatio(0.8) // Sampling ratio of features.
  .setParallelism(4) // Number of base learners trained simultaneously.