Ensemble Learning

1. Voting Classifiers

  • Aggregate the predictions of each classifier and predict the class that gets the most votes (a hard voting classifier)
  • Ensemble methods work best when the classifiers are as independent from one another as possible
  • Hard voting: aggregate the predictions of each classifier and predict the class that gets the most votes
  • Soft voting: predict the class with the highest class probability, averaged over all the individual classifiers; it often achieves higher performance than hard voting, but requires every classifier to be able to estimate class probabilities (see the soft-voting sketch after the hard-voting results below)
In [2]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
In [4]:
log_clf = LogisticRegression(solver="lbfgs", random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale", random_state=42)

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
In [5]:
from sklearn.metrics import accuracy_score

for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
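A minimal soft-voting sketch, reusing the same three classifiers; note that SVC needs probability=True so it can estimate class probabilities:
soft_voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(solver="lbfgs", random_state=42)),
                ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
                ('svc', SVC(gamma="scale", probability=True, random_state=42))],
    voting='soft')  # average class probabilities instead of counting votes
soft_voting_clf.fit(X_train, y_train)
print(accuracy_score(y_test, soft_voting_clf.predict(X_test)))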

2. Bagging and Pasting

  • Use the same model for every predictor and train each one on a different random subset of the training set
  • Bagging: sampling is performed with replacement
  • Pasting: sampling is performed without replacement (see the pasting sketch after the bagging cell below)
  • Bagging has slightly higher bias and lower variance than pasting, and usually results in better models
In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42) # bootstrap = True, samples with replacement
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.904
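For comparison, a pasting sketch: identical to the cell above except that bootstrap=False samples without replacement.
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=False, random_state=42) # bootstrap = False, samples without replacement
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))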
In [7]:
# Out-of-Bag Evaluation
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True, oob_score=True)
bag_clf.fit(X_train, y_train)
print(bag_clf.oob_score_)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.928
0.92
  • Random Patches method: sample both training instances and features (max_samples with bootstrap=True, and max_features with bootstrap_features=True)
  • Random Subspaces method: keep all training instances but sample features (bootstrap=False, max_samples=1.0, bootstrap_features=True, and max_features below 1.0, e.g. max_features=0.6)
  • A Random Forest is an ensemble of Decision Trees trained via the bagging method (typically with max_samples=1.0); either build it with BaggingClassifier or use the RandomForestClassifier class
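Both sampling schemes can be sketched with BaggingClassifier; the hyperparameter values below are illustrative:
# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=100, bootstrap=True,             # sample instances with replacement
    max_features=0.6, bootstrap_features=True,   # also sample features
    random_state=42)

# Random Subspaces: keep all training instances, sample only features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=500,
    max_samples=1.0, bootstrap=False,            # keep every training instance
    max_features=0.6, bootstrap_features=True,   # sample features
    random_state=42)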
In [8]:
# Random Forest with BaggingClassifier
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16, random_state=42),
    n_estimators=500, max_samples=1.0, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
0.92
In [9]:
# Random Forest with RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_rf))
rnd_clf.feature_importances_ # feature importance
0.912
Out[9]:
array([0.42253629, 0.57746371])
  • Extra-Trees: use random thresholds for each feature when splitting a node, rather than searching for the best possible thresholds
  • This trades more bias for lower variance
In [10]:
from sklearn.ensemble import ExtraTreesClassifier
extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
extra_clf.fit(X_train, y_train)

y_pred_extra = extra_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_extra))

extra_clf.feature_importances_ # feature importance
0.912
Out[10]:
array([0.42504529, 0.57495471])

3. Boosting

  • Train predictors sequentially, each trying to correct its predecessor
  • AdaBoost (see the reweighting sketch after this list)
    • train a base classifier and use it to make predictions on the training set
    • increase the relative weight of misclassified training instances, train a second classifier using the updated weights, and make predictions on the training set again
    • and so on
    • the algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found
    • to make a prediction, compute the predictions of all the predictors and weight them using the predictor weights $\alpha_{j}$; the predicted class is the one that receives the majority of weighted votes
    • too many estimators can make a boosting model overfit, whereas bagging does not
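A minimal sketch of the reweighting idea with decision stumps; this follows the standard AdaBoost update for two classes with learning rate 1, and is illustrative rather than the library implementation:
import numpy as np

m = len(X_train)
sample_weights = np.ones(m) / m          # start with uniform instance weights
stumps, alphas = [], []

for _ in range(5):                       # a handful of predictors, for illustration
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_train, y_train, sample_weight=sample_weights)
    misclassified = stump.predict(X_train) != y_train

    r = sample_weights[misclassified].sum()         # weighted error rate
    alpha = np.log((1 - r) / r)                     # predictor weight (assumes 0 < r < 1)
    sample_weights[misclassified] *= np.exp(alpha)  # boost misclassified instances
    sample_weights /= sample_weights.sum()          # renormalize

    stumps.append(stump)
    alphas.append(alpha)

# weighted vote: each predictor contributes alpha_j to the class it predicts
votes = np.zeros((len(X_test), 2))
for stump, alpha in zip(stumps, alphas):
    votes[np.arange(len(X_test)), stump.predict(X_test)] += alpha
print(accuracy_score(y_test, votes.argmax(axis=1)))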
In [11]:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(random_state=42),
    n_estimators=200, algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)

y_pred_ada = ada_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_ada))

ada_clf.feature_importances_ # feature importance
0.856
Out[11]:
array([0.43562701, 0.56437299])
  • Gradient Boosting
    • add predictors sequentially to an ensemble
    • instead of tweaking the instance weights, fit each new predictor to the residual errors made by its predecessor (see the sketch below)
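The residual-fitting idea can be sketched with plain regression trees; here the 0/1 labels are treated as regression targets purely for illustration:
from sklearn.tree import DecisionTreeRegressor

tree1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree1.fit(X_train, y_train)

residuals = y_train - tree1.predict(X_train)      # errors left by the first tree
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree2.fit(X_train, residuals)

residuals2 = residuals - tree2.predict(X_train)   # errors left by the first two trees
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree3.fit(X_train, residuals2)

# the ensemble's prediction is the sum of the individual trees' predictions
y_pred_sum = sum(tree.predict(X_test) for tree in (tree1, tree2, tree3))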
In [12]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=0, n_estimators = 200, learning_rate=0.05)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

clf.feature_importances_ # feature importance
0.888
Out[12]:
array([0.42212231, 0.57787769])
In [13]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import mean_squared_error
import numpy as np

clf = GradientBoostingClassifier(random_state=0, n_estimators = 200, learning_rate=0.05)
clf.fit(X_train, y_train)

# search for the best number of estimators
errors = [mean_squared_error(y_test, y_pred) for y_pred in clf.staged_predict(X_test)]
best_n = np.argmin(errors)+1

clf = GradientBoostingClassifier(random_state=0, n_estimators = best_n, learning_rate=0.05)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

clf.feature_importances_ # feature importance
0.92
Out[13]:
array([0.35868635, 0.64131365])
  • Stochastic Gradient Boosting
    • train each tree on a random fraction of the training instances (the subsample hyperparameter; see the sketch after this list)
  • XGBoost
    • Extreme Gradient Boosting, an optimized implementation of gradient boosting
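Stochastic gradient boosting is available through the subsample hyperparameter of GradientBoostingClassifier; the fraction below is illustrative:
sgb_clf = GradientBoostingClassifier(random_state=0, n_estimators=200,
                                     learning_rate=0.05, subsample=0.25)  # each tree sees 25% of the instances
sgb_clf.fit(X_train, y_train)
print(accuracy_score(y_test, sgb_clf.predict(X_test)))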
In [18]:
import xgboost
xgb_clf = xgboost.XGBClassifier(random_state=42)
xgb_clf.fit(X_train, y_train)

y_pred = xgb_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))

xgb_clf.feature_importances_ # feature importance

4. Stacking

  • Train the first-layer predictors on the first half of the training set
  • Use these trained predictors to make predictions on the second half of the training set
  • Use the predicted values as new input features
  • Train a final predictor (the blender) on these new features and the corresponding target values
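A minimal hold-out stacking sketch following the steps above (scikit-learn also provides a StackingClassifier class; the base models and blender chosen here are illustrative):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# split the training set into two halves
X_a, X_b, y_a, y_b = train_test_split(X_train, y_train, test_size=0.5, random_state=42)

# layer 1: train the base predictors on the first half
base_clfs = [
    LogisticRegression(solver="lbfgs", random_state=42),
    RandomForestClassifier(n_estimators=100, random_state=42),
    SVC(gamma="scale", random_state=42),
]
for clf in base_clfs:
    clf.fit(X_a, y_a)

# layer 2: their predictions on the second half become the new features
X_blend = np.column_stack([clf.predict(X_b) for clf in base_clfs])
blender = LogisticRegression(solver="lbfgs", random_state=42)
blender.fit(X_blend, y_b)

# at prediction time, run the base predictors first, then the blender
X_test_blend = np.column_stack([clf.predict(X_test) for clf in base_clfs])
print(accuracy_score(y_test, blender.predict(X_test_blend)))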