Ensemble
Overfitting¶
- Model too complex
- Fits noise
- Fits one-off patterns
- Overfit if
- another model \(F'\) can be found with more training errors but fewer test errors
Underfitting¶
- Model too simple
- Does not capture salient patterns
- Underfit if
- another model \(F'\) can be found with fewer training and fewer test errors
Generalisation Error¶
- Training error: \(1/N \sum e(t)\)
- Generalisation error: \(1/N \sum e'(t)\)
- Optimistic approach: \(e'(t) = e(t)\)
- Pessimistic approach: \(e'(t) = e(t) + 0.5\) (a penalty of 0.5 for each leaf node \(t\))
- Reduced Error Pruning (REP): Use actual test data to estimate generalisation error
The actual generalisation error can be lower than the optimistic estimate because it depends entirely on the test data. The pessimistic estimate, however, can never be lower than the optimistic one, since it is literally the optimistic error \(+\ 0.5k/N\) (\(k\) = number of leaf nodes, \(N\) = total number of records).
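A quick worked example with assumed numbers: suppose a tree has \(N = 100\) training records, \(k = 4\) leaf nodes, and makes 10 training errors. The optimistic estimate is \(10/100 = 0.10\), while the pessimistic estimate is \((10 + 0.5 \cdot 4)/100 = 0.12\); the penalty grows with the number of leaves, so bushier trees are charged for their extra complexity.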
Occam's Razor¶
- Given two models of similar generalisation error, choose the simpler one
- For complex models, there is a higher chance of overfitting
- Model complexity should be included in the metrics for evaluation
Addressing Overfitting¶
- Pre-Pruning (Early stopping rule); a code sketch follows this list
- Stop the algorithm before it becomes a fully grown tree
- Typical stopping conditions for a node:
- Stop if all instances belong to the same label
- Stop if all attribute values are the same (use the majority class for the label)
- Stop if expanding the current node does not improve impurity measures
- Post-Pruning
- Grow the tree fully
- Trim nodes bottom-up: if the estimated generalisation error improves, replace the subtree with a leaf and use the majority class as its label
- Lazy solution
- Try all depths of tree
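As a rough illustration of the pre-pruning idea, the stopping conditions above map onto hyperparameters of scikit-learn's `DecisionTreeClassifier`. A minimal sketch, assuming scikit-learn is installed; the dataset and threshold values are arbitrary placeholders, not recommendations.

```python
# Minimal pre-pruning sketch with scikit-learn (assumed available).
# The stopping conditions from the list map onto tree hyperparameters;
# the specific values below are arbitrary illustrations.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(
    max_depth=4,                 # stop growing past a fixed depth
    min_samples_leaf=10,         # don't create tiny leaves
    min_impurity_decrease=1e-3,  # stop if a split barely improves impurity
    random_state=0,
)
pruned.fit(X_train, y_train)
print("train acc:", pruned.score(X_train, y_train))
print("test acc: ", pruned.score(X_test, y_test))
```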
Ensemble Method¶
- Construct a set of classifiers from the training data (an odd number, to avoid ties)
- Predict the class label by majority prediction from all the classifiers
- Split training data into \(n\) sets, and use each to generate a DT. Combine the resulting classifiers
- \(n\) can be chosen e.g. with the elbow method; don't make \(n\) too big, or the divided sets become too small and each model underfits
Why does this work?¶
Supposing there are \(N = 25\) base classifiers, each with an independent error rate \(\varepsilon = 0.35\), then \(\mathbb{P}(\text{misclassification}) = \mathbb{P}(X \geq 13)\), where \(X\) is the number of individual misclassifications
\(\mathbb{P}(\text{misclassification}) = \sum\limits_{i = N//2+1}^{N} \displaystyle \binom{N}{i} \varepsilon^i (1-\varepsilon)^{N-i} = 0.06\)
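The 0.06 figure can be reproduced directly from the binomial tail sum above; a minimal check using only the Python standard library:

```python
# Sketch of the ensemble-error calculation above, standard library only.
from math import comb

N, eps = 25, 0.35
# Majority vote is wrong when at least N//2 + 1 = 13 base classifiers are wrong.
p_ensemble_error = sum(
    comb(N, i) * eps**i * (1 - eps)**(N - i) for i in range(N // 2 + 1, N + 1)
)
print(round(p_ensemble_error, 3))  # ~0.06, well below the individual rate of 0.35
```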
Bagging¶
- Bootstrap Aggregation (Bagging)
- The above, but the \(D_i\) are generated by random sampling with replacement.
- Each record has a probability of \(1 - (1-\cfrac{1}{n})^n\) of being selected at least once in a bag of size \(n\) (1 \(-\) the probability of it never being selected)
- \(\to 1- \cfrac{1}{e} \approx 0.632\) as \(n \to \infty\)
- Why is it not done without replacement? Drawing \(n\) of \(n\) records without replacement just reproduces the full training set, so every bag (and every tree) would be identical and the aggregate would be no different from a single full tree!
- Why is repetition within the same bag OK? A repeated record follows the same decision-tree path anyway (and there is a low chance of a record repeating many times in one bag when \(N\) is large)
- Why can the model not be different for each bag? It can, but that no longer falls under the bagging approach: if the models are different there is no need to divide the training data, a voting ensemble of the different models on the full dataset is enough. Then why not do the same with identical models? Because identical models trained on identical data give identical predictions, so nothing changes
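A minimal sketch of bagging under these assumptions (scikit-learn decision trees, an arbitrary toy dataset, 25 bags): each bag is drawn with replacement, and the per-bag coverage printed at the end should land near the \(1 - 1/e \approx 0.632\) fraction derived above.

```python
# Minimal bagging sketch: bootstrap samples (with replacement) + majority vote.
# Assumes scikit-learn and NumPy; dataset and bag count are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n = len(X_tr)
trees, coverage = [], []
for _ in range(25):                           # odd number of bags, as above
    idx = rng.integers(0, n, size=n)          # sample n indices WITH replacement
    coverage.append(len(np.unique(idx)) / n)  # fraction of distinct records in the bag
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Majority vote over the 25 trees.
votes = np.stack([t.predict(X_te) for t in trees])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)

print("mean bag coverage:", round(float(np.mean(coverage)), 3))  # ~0.632
print("bagged accuracy:  ", round(float((y_hat == y_te).mean()), 3))
```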
Random Forest¶
- Train many DTs and use bagging (typically also considering only a random subset of features at each split)
- Wisdom of the crowd
- Uncorrelated trees are preferred; the random feature selection helps decorrelate them
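A minimal random-forest sketch, assuming scikit-learn; `n_estimators` is the number of bagged trees and `max_features` controls the random feature subset that keeps the trees decorrelated. The values are illustrative only.

```python
# Minimal random-forest sketch, assuming scikit-learn; values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features = size of the random
# feature subset tried at each split (this is what decorrelates the trees).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```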
Boosting¶
- An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
- Split into \(n\) bags sampling randomly with replacement
- For each bag \(i\), initially all points have the same weight
- Train the \(n\) models and classify the points
- For each model, put the misclassified points into a new bag and fill the rest with random samples from the dataset (the details depend on the boosting algorithm)
- Records that are wrongly classified have their weights increased
- Records that are correctly classified have their weights decreased
- Retrain. The new combined \(\text{model} = \text{model}_1 + \text{model}_2 + \dots\)
- Why is this not done on the entire dataset? Training every round on the full dataset would certainly overfit
- Tendency to overfit
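In practice boosting is usually run through a library. A minimal sketch with scikit-learn's `AdaBoostClassifier` (the algorithm detailed in the next section), using its default decision-stump base learner on an arbitrary toy dataset.

```python
# Library-level boosting sketch, assuming scikit-learn; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)  # each round reweights the training points
print("test accuracy:", boost.score(X_te, y_te))
```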
AdaBoost¶
- Initialise weights \(w_j^{(1)} = \cfrac{1}{N}\) for each of the \(N\) training records \((x_j, y_j)\)
- Initialise \(k\) base classifiers \(C_i\)
- For each classifier \(C_i\):
- Train \(C_i\)
- \(\varepsilon_i = \cfrac{1}{N} \sum\limits_{j = 1}^{N} w_j \delta(C_i(x_j) \neq y_j)\) (\(\delta = 1\) if condition true)
- Compute the importance of \(C_i\):
- \(\alpha_i = \cfrac{1}{2} \ln \left(\cfrac{1-\varepsilon_i}{\varepsilon_i}\right)\) (if the error rate > 0.5, this is negative)
- Update \(w_j^{(i+1)} = \cfrac{w_j^{(i)}}{Z_i} \begin{cases} e^{-\alpha_i} \text{ if correct} \\ e^{\alpha_i} \text{ if incorrect} \end{cases}\) (\(e^{-\alpha_i} < 1\) shrinks the weight on a correct classification while \(e^{\alpha_i} > 1\) grows it on a misclassification; \(Z_i\) is a normalisation factor so the weights sum to 1)
- Final classifier: \(C^*(x) = \arg \max\limits_{y} \sum\limits_{i = 1}^{k} \alpha_i \delta(C_i (x) = y)\)
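To tie the formulas together, here is a hedged from-scratch sketch of AdaBoost on 0/1 labels, using scikit-learn decision stumps as the base classifiers \(C_i\). The error is computed as the standard weighted misclassification rate (weights kept summing to one), and the loop simply stops if \(\varepsilon_i \geq 0.5\); a teaching sketch, not a production implementation.

```python
# From-scratch AdaBoost sketch following the formulas above (decision stumps
# as base classifiers via scikit-learn; labels assumed to be 0/1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

N, k = len(X_tr), 25
w = np.full(N, 1.0 / N)          # w_j = 1/N initially
classifiers, alphas = [], []

for _ in range(k):
    # Train C_i on the weighted data (stumps keep each round weak).
    stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr, sample_weight=w)
    wrong = stump.predict(X_tr) != y_tr

    eps = np.sum(w * wrong)                 # weighted error rate (weights sum to 1)
    if eps >= 0.5 or eps == 0:              # stop on useless / perfect rounds
        break
    alpha = 0.5 * np.log((1 - eps) / eps)   # importance of C_i

    # Increase weights of misclassified records, decrease the rest, renormalise.
    w = w * np.exp(np.where(wrong, alpha, -alpha))
    w /= w.sum()                            # Z_i normalisation

    classifiers.append(stump)
    alphas.append(alpha)

# Final classifier: weighted vote, arg max over the classes {0, 1}.
votes = np.zeros((len(X_te), 2))
for alpha, clf in zip(alphas, classifiers):
    votes[np.arange(len(X_te)), clf.predict(X_te)] += alpha
y_hat = votes.argmax(axis=1)
print("AdaBoost test accuracy:", round(float((y_hat == y_te).mean()), 3))
```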