Decision Trees
- Builds classification or regression models in the form of a tree structure
- Breaks the dataset down into smaller and smaller subsets while the associated decision tree is built incrementally
- Consists of decision nodes (attribute tests) and leaf nodes (class labels)
Growing A Tree
- Feature choice
- Conditions for splitting
- Stopping condition
- Pruning
Decision tree induction
- Hunt's Algorithm
- CART
- ID3, C4.5
- SLIQ, SPRINT
Hunt's Algorithm
- Grows the tree recursively by partitioning the training records successively into purer subsets
- It is the basis of many existing decision tree induction algorithms
Algorithm:
- Let \(D_t\) be the set of training records that reach a node \(t\)
- If \(D_t\) contains records that all belong to the same class \(y_t\), \(t\) is a leaf node labeled \(y_t\)
- If \(D_t\) is an empty set then \(t\) is a leaf node labeled \(y_d\) (default)
- If \(D_t\) contains records that belong to more than one class, use an attribute test to split the data into smaller subsets and recurse on each subset (a minimal sketch follows this list)
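A minimal Python sketch of the recursion above, assuming records are `(features, label)` pairs and that `choose_split` / `apply_split` are hypothetical helpers supplied by the caller (e.g. based on the impurity measures described later in these notes):

```python
from collections import Counter

def hunts(records, default_label, choose_split, apply_split):
    # Empty set of records -> leaf labelled with the default class y_d
    if not records:
        return {"leaf": default_label}
    labels = [y for _, y in records]
    # All records belong to the same class y_t -> leaf labelled y_t
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    majority = Counter(labels).most_common(1)[0][0]
    test = choose_split(records)      # greedy choice of attribute test
    if test is None:                  # no useful split left -> leaf
        return {"leaf": majority}
    children = {}
    # apply_split partitions the records by outcome of the attribute test
    for outcome, subset in apply_split(records, test).items():
        children[outcome] = hunts(subset, majority, choose_split, apply_split)
    return {"test": test, "children": children}
```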
Attribute Test
Greedy strategy: choose the attribute test that splits the records so as to optimise a certain metric (an impurity measure) at the current node.
Nominal/Ordinal Attributes
- Multi-way split: as many partitions as there are distinct attribute values in \(D_t\)
- Binary split: group the attribute values into two subsets; need to find the optimal partitioning (an enumeration sketch follows this list)
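A small sketch of enumerating the candidate binary partitions of a nominal attribute (the value set below is hypothetical). A nominal attribute with \(k\) values has \(2^{k-1}-1\) non-trivial binary partitions; for ordinal attributes only order-preserving partitions are usually considered.

```python
from itertools import combinations

def binary_partitions(values):
    # Enumerate all unordered two-way partitions of a set of nominal values
    values = sorted(values)
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            # yield each unordered pair once (skip mirrored duplicates)
            if left < right:
                yield set(left), set(right)

print(list(binary_partitions({"small", "medium", "large"})))
# 2^(3-1) - 1 = 3 candidate partitions for k = 3
```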
Continuous Attributes
- Discretisation:
- static: discretise once at the beginning
- dynamic: form ranges by equal-interval bucketing, equal-frequency bucketing, or clustering
- Binary decision: \((A < v)\) versus \((A \ge v)\); find the optimal cutting point \(v\) (a small sketch follows this list)
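A minimal sketch, assuming `values` holds the attribute values observed at a node, of static discretisation and of enumerating candidate cutting points as midpoints between consecutive distinct values:

```python
import numpy as np

values = np.array([2.3, 5.1, 5.1, 7.8, 9.0, 12.4, 15.6])  # hypothetical data

# Static discretisation: equal-interval buckets (fixed-width bins)
equal_width_edges = np.linspace(values.min(), values.max(), num=4)

# Static discretisation: equal-frequency buckets (quantile bins)
equal_freq_edges = np.quantile(values, [0.0, 1/3, 2/3, 1.0])

# Binary decision (A < v) vs (A >= v): candidate cutting points are commonly
# taken as midpoints between consecutive distinct sorted values; each candidate
# is then scored with an impurity measure (see the next section).
distinct = np.unique(values)
candidates = (distinct[:-1] + distinct[1:]) / 2
print(equal_width_edges, equal_freq_edges, candidates)
```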
Homogeneous Split
Greedy strategy: choose the split whose child nodes are as homogeneous (pure) as possible, as quantified by a measure of node impurity (a worked example follows the list of measures below).
Measure of node impurity
- Gini Index
- Based on \(p(j|t)\), the probability that a record at node \(t\) belongs to class \(j\)
- \(\text{GINI}(t) = 1 - \sum\limits_{j} p^2(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
- Max: \(1 - 1/n_c\) when the records are equally distributed among all classes, implying least interesting info
- Min (0): When all records belong to one class, most interesting info
- Used by CART, SLIQ, SPRINT
- When a node \(t\) is split into \(k\) parts,
- \(\text{GINI}_{split} = \sum\limits_{i} \cfrac{n_i}{n} \text{GINI}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
- Entropy
- \(\text{Entropy}(t) = -\sum\limits_{j} p(j|t) \ln p(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
- Max (\(\ln n_c\)): when the records are equally distributed among all classes, implying least interesting info
- Min (0): When all records belong to one class, most interesting info
- When a node \(t\) is split into \(k\) parts,
- \(\text{GAIN}_{split} = \text{Entropy}(t) - \sum\limits_{i} \cfrac{n_i}{n} \text{Entropy}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
- \(\text{SplitINFO} = - \sum\limits_{i} \cfrac{n_i}{n} \ln \cfrac{n_i}{n}\)
- \(\text{GainRATIO}_{split} = \cfrac{\text{GAIN}_{split}}{\text{SplitINFO}}\) (splits into many small partitions have high SplitINFO and are penalised)
- Misclassification Error
- \(\text{Error}(t) = 1 - \max\limits_{j} p(j|t)\) where \(p(j|t)\) is the relative freq of class \(j\) at node \(t\)
- Max: \(1 - 1/n_c\) when the records are equally distributed among all classes, implying least interesting info
- Min (0): When all records belong to one class, most interesting info
- When a node \(t\) is split into \(k\) parts,
- \(\text{Error}_{split} = \sum\limits_{i} \cfrac{n_i}{n} \text{Error}(i)\) where \(n_i\) is the num records at child \(i\) and \(n\) is the num records at node \(t\)
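A worked sketch of the three impurity measures and the corresponding split criteria, using made-up class counts for a parent node and one candidate split:

```python
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

parent = [7, 3]                      # hypothetical class counts at node t
children = [[6, 1], [1, 2]]          # class counts after a candidate split
n = sum(parent)

# Weighted child impurities: GINI_split, GAIN_split, GainRATIO, Error_split
gini_split = sum(sum(c) / n * gini(c) for c in children)
gain_split = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
split_info = -sum(sum(c) / n * math.log(sum(c) / n) for c in children)
gain_ratio = gain_split / split_info
error_split = sum(sum(c) / n * error(c) for c in children)

print(f"GINI_split  = {gini_split:.3f}")
print(f"GAIN_split  = {gain_split:.3f}, GainRATIO = {gain_ratio:.3f}")
print(f"Error_split = {error_split:.3f}")
```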
Stopping Criteria
- Stop when all records at a node have the same class label
- Stop when all records have identical attribute values, so no further split is possible
- Early Termination
Pros
- Simple to understand, interpret, visualise
- Categorical and Numerical data
- Extremely fast
- Accuracy is comparable to other techniques for simple datasets
- Non-linear relations between variables do not affect performance
Cons
- Prone to overfitting
- Unstable, small variation in data gives different tree
- The greedy algorithm does not guarantee a globally optimal decision tree