
Decision Trees

  1. Builds classification or regression models in the form of a tree structure
  2. Breaks the dataset down into smaller and smaller subsets while the associated decision tree is incrementally built
  3. The resulting tree consists of decision nodes (attribute tests) and leaf nodes (class labels)

Growing A Tree

  1. Feature choice
  2. Conditions for splitting
  3. Stopping condition
  4. Pruning

Decision tree induction

  1. Hunt's Algorithm
  2. CART
  3. ID3, C4.5
  4. SLIQ, SPRINT

Hunt's Algorithm

  1. Grows the tree recursively by successively partitioning the training records into purer subsets
  2. It is the basis of many existing decision tree induction algorithms

Algorithm:

  • Let \(D_t\) be the set of training records that reach a node \(t\)
  • If \(D_t\) contains records that all belong to the same class \(y_t\), \(t\) is a leaf node labeled \(y_t\)
  • If \(D_t\) is an empty set then \(t\) is a leaf node labeled \(y_d\) (default)
  • If \(D_t\) contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recurse on each subset
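
A minimal Python sketch of this recursion, assuming categorical attributes and a multi-way split on the chosen attribute; `choose_attribute` is a hypothetical placeholder for the attribute-test selection discussed in the next sections.

```python
from collections import Counter

def hunts(records, attributes, default=None):
    """records: list of (attribute-dict, class label) pairs."""
    # Empty set: leaf labelled with the default class (e.g. majority class of the parent)
    if not records:
        return default
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    # All records share one class (or no attributes left to test): leaf node
    if len(set(labels)) == 1 or not attributes:
        return majority
    # Otherwise: pick an attribute test and recurse on each outcome
    attr = choose_attribute(records, attributes)
    remaining = [a for a in attributes if a != attr]
    node = {"attribute": attr, "children": {}}
    for v in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == v]
        node["children"][v] = hunts(subset, remaining, default=majority)
    return node

def choose_attribute(records, attributes):
    # Placeholder: a real implementation would pick the attribute minimising impurity
    return attributes[0]
```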

Attribute Test

Greedy strategy: split the records on the attribute test that optimises a chosen impurity metric.

Nominal/Ordinal Attributes

  1. Multi-Way Split: as many partitions as there are distinct attribute values in \(D_t\)
  2. Binary Split: divide the attribute values into two subsets ("something" vs. "not something"). Need to find the optimal partitioning
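
For a nominal attribute with \(k\) distinct values there are \(2^{k-1} - 1\) non-trivial binary partitions, so the optimal one is found by searching over subsets. A small illustrative sketch (the attribute values are hypothetical):

```python
from itertools import combinations

values = ["single", "married", "divorced"]  # hypothetical nominal attribute values

# Enumerate every non-trivial {S, not S} partition; each is one candidate binary split
partitions = []
for r in range(1, len(values)):
    for subset in combinations(values, r):
        left = set(subset)
        right = set(values) - left
        if (right, left) not in partitions:  # skip mirrored duplicates
            partitions.append((left, right))

for left, right in partitions:
    print(left, "vs", right)  # 2^(k-1) - 1 = 3 partitions for k = 3
```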

Continuous Attributes

  1. Discretisation:

    • static - discretise once at the beginning
    • dynamic - form ranges by equal-interval bucketing, equal-frequency bucketing, or clustering
  2. Binary Decision: split on \((A < v)\) and \((A \ge v)\); find the optimal cut point \(v\)
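
A sketch of the binary-decision search for a continuous attribute: sort the values, take midpoints between consecutive distinct values as candidate cut points, and keep the one with the lowest weighted Gini index (defined formally below). The toy data is made up.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Return (cut point, weighted Gini) for the best binary split on a continuous attribute."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                               # equal values: no valid cut here
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint candidate
        left = [y for v, y in pairs if v < cut]
        right = [y for v, y in pairs if v >= cut]
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        if weighted < best[1]:
            best = (cut, weighted)
    return best

# Toy example: annual income (in thousands) vs. class label
print(best_cut([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
               ["N", "N", "N", "Y", "Y", "Y", "N", "N", "N", "N"]))
```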

Homogeneous Split

Greedy strategy: prefer the split that produces the most homogeneous (purest) child nodes.

Measure of node impurity

  1. Gini Index

    • Based on the class probabilities \(p(j|t)\) at node \(t\): the probability of misclassifying a record drawn from \(t\) if it is labelled randomly according to the class distribution at \(t\) (a worked comparison of all three measures follows this list)
    • \(\text{GINI}(t) = 1 - \sum\limits_{j} p^2(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
    • Max: \(1 - 1/n_c\) when the records are equally distributed among all classes, implying least interesting info
    • Min (0): When all records belong to one class, most interesting info
    • CART, SLIQ, SPRINT
    • When a node \(t\) is split into \(k\) parts,
    • \(\text{GINI}_{split} = \sum\limits_{i} \cfrac{n_i}{n} \text{GINI}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
  2. Entropy

    • \(\text{Entropy}(t) = -\sum\limits_{j} p(j|t) \ln p(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
    • Max (\(\ln n_c\)) when records equally distributed, least interesting info
    • Min (0) When all records belong to one class, most interesting info
    • When a node \(t\) is split into \(k\) parts,
    • \(\text{GAIN}_{split} = \text{Entropy(t)} - \sum\limits_{i} \cfrac{n_i}{n} \text{Entropy}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
    • \(\text{SplitINFO} = - \sum\limits_{i} \cfrac{n_i}{n} \ln \cfrac{n_i}{n}\)
    • \(\text{GainRATIO}_{split} = \cfrac{\text{GAIN}_{split}}{\text{SplitINFO}}\) (higher entropy partitions are penalised)
  3. Misclassification Error

    • \(\text{Error}(t) = 1 - \max\limits_{j} p(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
    • Max: \(1 - 1/n_c\) when the records are equally distributed among all classes, implying least interesting info
    • Min (0): When all records belong to one class, most interesting info
    • When a node \(t\) is split into \(k\) parts,
    • \(\text{Error}_{split} = \sum\limits_{i} \cfrac{n_i}{n} \text{Error}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
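
The three measures can be compared directly on a single candidate split. A minimal sketch, assuming a parent node with two classes split into two children (the counts are made up); it follows the formulas above, including the \(\ln\)-based entropy:

```python
from math import log

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c)

def error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# Hypothetical split: parent node [7 of class A, 5 of class B] into two children
parent, children = [7, 5], [[6, 1], [1, 4]]
n = sum(parent)

def weighted(measure):
    # Weight each child's impurity by its share of the parent's records
    return sum(sum(c) / n * measure(c) for c in children)

gini_split = weighted(gini)
gain_split = entropy(parent) - weighted(entropy)
split_info = -sum(sum(c) / n * log(sum(c) / n) for c in children)
gain_ratio = gain_split / split_info
error_split = weighted(error)

print(f"GINI_split  = {gini_split:.3f}")
print(f"GAIN_split  = {gain_split:.3f}, GainRATIO = {gain_ratio:.3f}")
print(f"Error_split = {error_split:.3f}")
```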

Stopping Criteria

  1. Stop when all records at a node have the same class label
  2. Stop when all records have identical attribute values (no further split is possible)
  3. Early Termination

Pros

  1. Simple to understand, interpret, visualise
  2. Categorical and Numerical data
  3. Extremely fast
  4. Accuracy is comparable to other techniques for simple datasets
  5. Non-linear relationships between variables do not affect performance

Cons

  1. Prone to overfitting
  2. Unstable: a small variation in the data can produce a very different tree
  3. The greedy algorithm does not guarantee a globally optimal decision tree