Decision Trees

Pros

  • The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree
  • Able to handle both numerical and categorical data
  • Able to handle multi-output problems

Cons

  • Decision-tree learners can create over-complex trees that do not generalize well to unseen data (overfitting)
  • The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts
  • There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems
  • Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree

Tips

  • Decision trees tend to overfit on data with a large number of features. Getting the right ratio of samples to number of features is important, since a tree with few samples in high dimensional space is very likely to overfit
  • Consider performing dimensionality reduction (PCA, ICA, or Feature selection) beforehand to give your tree a better chance of finding features that are discriminative
  • Balance your dataset before training to prevent the tree from being biased toward the dominant classes (resample, or reweight classes as in the sketch below)
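
Reweighting is a simple alternative to resampling; a minimal sketch, assuming a synthetic imbalanced dataset built with make_classification:

# Sketch: reweight rare classes instead of resampling (synthetic imbalanced data assumed)
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_imb, y_imb = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
balanced_clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
balanced_clf.fit(X_imb, y_imb)

class_weight="balanced" scales sample weights inversely proportional to class frequencies, which is roughly equivalent to oversampling the minority class.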

Get the Data

In [27]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Decision Tree

In [28]:
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_train, y_train)
Out[28]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=2, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

Visualization

In [29]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file="iris_tree.dot",
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

Convert the .dot file to an image from a shell (or prefix the command with ! in a notebook cell):

dot -Tpng iris_tree.dot -o tree.png

Gini

  • Gini measures impurity: a node is pure (gini = 0) if all the training instances it applies to belong to the same class (a quick numerical check follows below)
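
A quick numerical check of this definition, assuming the iris training split from above; the root value should match the gini = 0.665 shown in the tree plots:

import numpy as np

# Gini of the root node: 1 minus the sum of squared class proportions
counts = np.bincount(y_train)      # class counts in the training set, e.g. [37, 34, 41]
p = counts / counts.sum()
print(1 - np.sum(p ** 2))          # ~0.665
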
In [31]:
from sklearn import tree
tree.plot_tree(tree_clf, feature_names=iris.feature_names[2:],class_names=iris.target_names, rounded=True, filled=True)
Out[31]:
[Text(133.92000000000002, 181.2, 'petal length (cm) <= 2.35\ngini = 0.665\nsamples = 112\nvalue = [37, 34, 41]\nclass = virginica'),
 Text(66.96000000000001, 108.72, 'gini = 0.0\nsamples = 37\nvalue = [37, 0, 0]\nclass = setosa'),
 Text(200.88000000000002, 108.72, 'petal length (cm) <= 4.95\ngini = 0.496\nsamples = 75\nvalue = [0, 34, 41]\nclass = virginica'),
 Text(133.92000000000002, 36.23999999999998, 'gini = 0.153\nsamples = 36\nvalue = [0, 33, 3]\nclass = versicolor'),
 Text(267.84000000000003, 36.23999999999998, 'gini = 0.05\nsamples = 39\nvalue = [0, 1, 38]\nclass = virginica')]
In [32]:
from sklearn.tree import export_text
r = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(r)
|--- petal length (cm) <= 2.35
|   |--- class: 0
|--- petal length (cm) >  2.35
|   |--- petal length (cm) <= 4.95
|   |   |--- class: 1
|   |--- petal length (cm) >  4.95
|   |   |--- class: 2


Methods and Attributes

In [33]:
print(iris.target_names)
tree_clf.predict([[5, 1.5]])[0]
['setosa' 'versicolor' 'virginica']
Out[33]:
2
In [34]:
tree_clf.predict_proba([[5, 1.5]]) # 0 for setosa, 0.026 for versicolor, 0.974 for virginica
Out[34]:
array([[0.        , 0.02564103, 0.97435897]])
In [35]:
tree_clf.feature_importances_ # feature importance, petal length is more important than width
Out[35]:
array([1., 0.])

Feature Importance

  • feature importance $$f_{i} = \frac{\sum_{j=1}^{m}n_{j}}{\sum_{k=1}^{M}n_{k}}$$ where the numerator sums node importances over the $m$ nodes that split on feature $i$ and the denominator sums over all $M$ nodes of the tree
  • node importance $$n_{k} = w_{k}G_{k} - w_{left}G_{left} - w_{right}G_{right}$$ where $w$ is the weighted number of samples reaching a node and $G$ is its Gini impurity
  • Normalized feature importance $$normf_{i} = \frac{f_{i}}{\sum_{i}f_{i}}$$ (this is what feature_importances_ returns; a check follows below)
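
As a sanity check of these formulas, a sketch that recomputes feature_importances_ from the fitted tree's public tree_ attribute (assuming tree_clf and X_train from above):

import numpy as np

t = tree_clf.tree_
node_importance = np.zeros(t.node_count)
for k in range(t.node_count):
    left, right = t.children_left[k], t.children_right[k]
    if left == -1:                 # leaf node: no split, no importance
        continue
    node_importance[k] = (t.weighted_n_node_samples[k] * t.impurity[k]
                          - t.weighted_n_node_samples[left] * t.impurity[left]
                          - t.weighted_n_node_samples[right] * t.impurity[right])

f_imp = np.zeros(X_train.shape[1])
for k in range(t.node_count):
    if t.children_left[k] != -1:   # accumulate importance on the feature used for the split
        f_imp[t.feature[k]] += node_importance[k]
f_imp /= f_imp.sum()               # normalize so the importances sum to 1
print(f_imp)                       # should match tree_clf.feature_importances_ -> [1., 0.]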

Classification and Regression Tree (CART)

  • Split the training set into two subsets: search for the pair (k, $t_{k}$) that produces the purest subsets, where k is a single feature and $t_{k}$ is a threshold $$J(k, t_{k}) = \frac{m_{left}}{m} G_{left} + \frac{m_{right}}{m} G_{right}$$ where $G_{left/right}$ measures the impurity of the left/right subset and $m_{left/right}$ is the number of instances in the left/right subset $$G_{i} = 1 - \sum_{k=1}^{n}p_{i,k}^{2}$$ where $p_{i,k}$ is the ratio of class k instances among the training instances in the $i^{th}$ node. CART splits the subsets recursively with the same logic until it reaches the maximum depth or cannot find a split that reduces impurity (controlled by max_depth, min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes). A small cost-function sketch follows this list
  • Gini impurity is slightly faster to compute than entropy impurity
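
A minimal sketch of the cost function, assuming the iris training split from above, evaluated at the root split CART actually found (petal length <= 2.35):

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    p = np.bincount(labels) / len(labels)
    return 1 - np.sum(p ** 2)

def cart_cost(X, y, k, t_k):
    # Weighted impurity J(k, t_k) of splitting feature k at threshold t_k
    left, right = y[X[:, k] <= t_k], y[X[:, k] > t_k]
    m = len(y)
    return len(left) / m * gini(left) + len(right) / m * gini(right)

print(cart_cost(X_train, y_train, k=0, t_k=2.35))   # cost of the root split found above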

Computational Complexity

  • Training is roughly $O(n \times m \log(m))$ for $m$ training instances and $n$ features; prediction only walks one root-to-leaf path, roughly $O(\log(m))$ (a rough timing sketch follows this list)
  • Presorting the data can speed up training for small data sets but slows it down for larger training sets (scikit-learn's presort option shows as deprecated in the estimator output above)
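
A rough timing sketch on assumed synthetic data (absolute numbers vary by machine) to watch training time grow with the number of instances m:

import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
for m in (1_000, 10_000, 100_000):
    X_big = rng.rand(m, 10)
    y_big = (X_big[:, 0] + 0.3 * rng.randn(m) > 0.5).astype(int)   # noisy, learnable labels
    start = time.perf_counter()
    DecisionTreeClassifier(random_state=0).fit(X_big, y_big)
    print(m, round(time.perf_counter() - start, 3), "seconds")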

Regularization Hyperparameters, train the model with restrictions

  • An unconstrained decision tree fits the training data very closely and most likely overfits it; such a model is called a nonparametric model because the number of parameters is not determined prior to training
  • A parametric model has a predetermined number of parameters, which limits its degrees of freedom and reduces the risk of overfitting
  • regularization hyperparameters
    • max_depth, the maximum depth of the tree
    • min_samples_split, the minimum number of samples required to split an internal node
    • min_samples_leaf, the minimum number of samples required to be at a leaf node
    • min_weight_fraction_leaf, the minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node
    • max_features, the number of features to consider when looking for the best split
    • max_leaf_nodes, the maximum number of leaf nodes
    • min_impurity_decrease, a node will be split if this split induces a decrease of the impurity greater than or equal to this value
    • min_impurity_split, threshold for early stopping in tree growth
  • Increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model (see the comparison sketch below)
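
A comparison sketch, assuming a make_moons dataset, of the same tree with and without a min_samples_leaf restriction:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=1000, noise=0.3, random_state=42)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(Xm_train, ym_train)
regularized = DecisionTreeClassifier(min_samples_leaf=20, random_state=42).fit(Xm_train, ym_train)

print(unrestricted.score(Xm_test, ym_test))  # typically lower: the unrestricted tree overfits
print(regularized.score(Xm_test, ym_test))   # typically higher on held-out data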

Pruning, train the model without restrictions, then delete unnecessary nodes

  • Cost complexity pruning
In [37]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
In [38]:
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities # get alpha values and their impurities
In [39]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
Out[39]:
Text(0.5, 1.0, 'Total Impurity vs effective alpha for training set')
In [45]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
In [46]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
Out[46]:
[<matplotlib.lines.Line2D at 0x1a16a9bfd0>]
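
A short follow-up sketch: refit with the alpha that scores best on the held-out split above (with more data, alpha would normally be chosen by cross-validation instead):

import numpy as np

best_alpha = ccp_alphas[np.argmax(test_scores)]     # alpha with the highest test accuracy above
pruned_clf = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha)
pruned_clf.fit(X_train, y_train)
print(best_alpha, pruned_clf.score(X_test, y_test))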

Regression

  • The prediction is the average target value of the training instances associated with this leaf node $$J(k, t_{k}) = \frac{m_{left}}{m} MSE_{left} + \frac{m_{right}}{m} MSE_{right}$$
In [49]:
from sklearn.tree import DecisionTreeRegressor
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = DecisionTreeRegressor()
clf = clf.fit(X, y)
clf.predict([[1, 1]])
Out[49]:
array([0.5])
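
Regression trees need the same regularization as classification trees: without a depth limit every training point can be memorized. A sketch on assumed noisy sine data:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
Xr = np.sort(5 * rng.rand(200, 1), axis=0)
yr = np.sin(Xr).ravel() + 0.2 * rng.randn(200)      # noisy targets

shallow = DecisionTreeRegressor(max_depth=3).fit(Xr, yr)
deep = DecisionTreeRegressor().fit(Xr, yr)          # unrestricted depth
print(shallow.score(Xr, yr))                        # < 1: averages over larger leaves
print(deep.score(Xr, yr))                           # 1.0: every noisy point memorized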

Multi-output

  • If there is no correlation between the outputs, build n independent models and predict each of the n outputs independently
  • If the output values are themselves correlated, build a single model capable of predicting all n outputs simultaneously
In [52]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
In [62]:
y[::5, :] += (0.5 - rng.rand(20, 2)) # add noise
In [63]:
regr = DecisionTreeRegressor(max_depth=6)
regr.fit(X, y)

X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_test = regr.predict(X_test)
In [71]:
plt.scatter(y[:, 0], y[:, 1], c="cornflowerblue", s=40, edgecolor="black")
plt.scatter(y_test[:, 0], y_test[:, 1], c="cornflowerblue", s=40, edgecolor="red")
Out[71]:
<matplotlib.collections.PathCollection at 0x1a1aa62048>

Instability

  • Trees favor orthogonal decision boundaries (every split is perpendicular to an axis), which makes them sensitive to rotations of the training data
  • One way to limit this problem is to use Principal Component Analysis (PCA) to rotate the data first, as sketched below
  • Random Forests can limit the instability by averaging predictions over many trees
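
A minimal sketch of the PCA idea: rotate the data onto its principal axes first, then let the tree make its axis-aligned splits in the rotated space (iris used here as an assumed example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
pca_tree = make_pipeline(PCA(n_components=2),
                         DecisionTreeClassifier(max_depth=2, random_state=42))
pca_tree.fit(iris.data, iris.target)
print(pca_tree.score(iris.data, iris.target))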

Bagging

  • Take many training sets from the population, build a separate prediction model on each training set, and average the resulting predictions
  • These trees are grown deep and are not pruned, so each individual tree has high variance but low bias; averaging the B trees reduces the variance
  • For classification, record the class predicted by each of the trees and take a majority vote (see the sketch below)
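
A bagging sketch, assuming the breast cancer data from above: 100 unpruned trees grown on bootstrap samples, combined by majority vote:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

Xb, yb = load_breast_cancer(return_X_y=True)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0)
print(cross_val_score(bag, Xb, yb, cv=5).mean())    # averaged 5-fold accuracy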

Random Forest

  • Build a number of decision trees on bootstrapped training samples
  • Each time a split in a tree is considered, a random subset of features is chosen as split candidates from the full set of n features, typically of size sqrt(n)
  • Increasing the number of trees does not cause overfitting; the test error simply levels off (see the sketch below)
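
A random forest sketch on the same assumed data; max_features="sqrt" gives each split a random sqrt(n)-sized subset of candidate features:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

Xb, yb = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
print(cross_val_score(rf, Xb, yb, cv=5).mean())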

Boosting

  • Trees are grown sequentially: each tree is grown using information from previously grown trees and is fit on a modified version of the original data set
  • Boosting can overfit if the number of trees is too large; use cross-validation to select the number of trees
  • The shrinkage parameter $\lambda$ controls the rate at which boosting learns; typical values are 0.01 or 0.001
  • The number d of splits in each tree controls the complexity of the boosted ensemble (a sketch follows below)
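
A boosting sketch (gradient boosting as one concrete variant, assumed breast cancer data): learning_rate plays the role of the shrinkage parameter $\lambda$, and max_depth stands in for the split count d:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

Xb, yb = load_breast_cancer(return_X_y=True)
gb = GradientBoostingClassifier(n_estimators=500, learning_rate=0.01,
                                max_depth=2, random_state=0)
print(cross_val_score(gb, Xb, yb, cv=5).mean())     # tune n_estimators by cross-validation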