from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
iris = load_iris()
X = iris.data[:, 2:] # petal length and width
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_train, y_train)
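# Quick sanity check (not in the original walkthrough): mean accuracy on the
# held-out split; the exact value depends on the random train/test split
tree_clf.score(X_test, y_test)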
from sklearn.tree import export_graphviz
export_graphviz(
    tree_clf,
    out_file="iris_tree.dot",
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
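# The .dot file can be rendered to an image with the Graphviz CLI, e.g.:
#   dot -Tpng iris_tree.dot -o iris_tree.png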
from sklearn import tree
tree.plot_tree(tree_clf, feature_names=iris.feature_names[2:],
               class_names=iris.target_names, rounded=True, filled=True)
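# plot_tree draws the same diagram with matplotlib alone, so no Graphviz install is needed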
from sklearn.tree import export_text  # sklearn.tree.export was removed in newer scikit-learn
r = export_text(tree_clf, feature_names=iris.feature_names[2:])
print(r)  # the fitted rules as nested if/else threshold tests, one line per node
print(iris.target_names)
tree_clf.predict([[5, 1.5]])[0]
tree_clf.predict_proba([[5, 1.5]])  # class fractions in the reached leaf: 0 for setosa, 0.026 for versicolor, 0.974 for virginica
tree_clf.feature_importances_  # importances sum to 1; petal length matters more than petal width here
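# Minimal cost-complexity pruning on the breast cancer dataset: pruning
# minimizes R_alpha(T) = R(T) + alpha * (number of leaves), so a larger alpha
# trades training impurity for a smaller tree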
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities # get alpha values and their impurities
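# ccp_alphas is increasing, and total leaf impurity rises with alpha; the
# largest alpha prunes the tree all the way down to a single root node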
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
clfs = clfs[:-1]  # drop the last tree: its alpha leaves only the root node
ccp_alphas = ccp_alphas[:-1]
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
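# A minimal sketch (not from the original): pick the alpha with the best
# held-out accuracy; in practice prefer cross-validation, e.g. GridSearchCV
# over ccp_alpha
import numpy as np
best_idx = int(np.argmax(test_scores))
print(ccp_alphas[best_idx], test_scores[best_idx])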
fig, ax = plt.subplots()
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test", drawstyle="steps-post")
ax.legend()
from sklearn.tree import DecisionTreeRegressor
X = [[0, 0], [2, 2]]  # toy regression set: two samples, two features
y = [0.5, 2.5]
regr = DecisionTreeRegressor()
regr = regr.fit(X, y)
regr.predict([[1, 1]])  # a leaf predicts the mean target of its training samples
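# With only two training points there is a single split; [1, 1] falls in the
# leaf containing [0, 0], so the prediction is that leaf's mean target, 0.5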
import numpy as np
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)  # 100 sorted inputs in [-100, 100]
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T  # two outputs tracing a circle of radius pi
y[::5, :] += 0.5 - rng.rand(20, 2)  # add noise to every 5th sample
regr = DecisionTreeRegressor(max_depth=6)
regr.fit(X, y)
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_test = regr.predict(X_test)
plt.scatter(y[:, 0], y[:, 1], c="cornflowerblue", s=40, edgecolor="black", label="data")
plt.scatter(y_test[:, 0], y_test[:, 1], c="cornflowerblue", s=40, edgecolor="red", label="prediction")
plt.legend()
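# A rough illustration (assumed, not from the original): training R^2, averaged
# over both outputs, grows with depth, showing how max_depth controls the fit
for depth in (2, 6, 12):
    m = DecisionTreeRegressor(max_depth=depth).fit(X, y)
    print(depth, m.score(X, y))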