Ensemble
Overfitting¶
- Model too complex
- Fits noise
- Fits one-off patterns
- Overfit if
- another model \(F'\) can be found with more training errors but fewer test errors
Underfitting¶
- Model too simple
- Does not capture salient patterns
- Underfit if
- another model \(F'\) can be found with fewer training and fewer test errors
Generalisation Error¶
- Training error: \(1/N \sum e(t)\)
- Generalisation error: \(1/N \sum e'(t)\)
- Optimistic approach: \(e'(t) = e(t)\)
- Pessimistic approach: \(e'(t) = e(t) + 0.5\) (a penalty of 0.5 for each leaf node \(t\))
- Reduced Error Pruning (REP): Use actual test data to estimate generalisation error
The actual generalisation error can be lower than the optimistic estimate because it depends entirely on the test data. The pessimistic estimate, however, can never be lower than the optimistic one, since it is literally the optimistic error \(+\ 0.5k/N\) (\(k\) = number of leaf nodes, \(N\) = total number of records).
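A quick worked example with assumed numbers: suppose a tree has \(N = 100\) training records, \(k = 4\) leaf nodes, and makes 10 training errors. The optimistic estimate is \(10/100 = 0.10\), while the pessimistic estimate is \((10 + 0.5 \cdot 4)/100 = 0.12\); the penalty grows with the number of leaves, so bushier trees are charged for their extra complexity.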
Occam's Razor¶
- Given two models of similar generalisation error, choose the simpler one
- For complex models, there is a higher chance of overfitting
- Model complexity should be included in the metrics for evaluation
Addressing Overfitting¶
- Pre-Pruning (Early stopping rule); a code sketch follows this list
- Stop the algorithm before it becomes a fully grown tree
- Typical stopping conditions for a node:
- Stop if all instances belong to the same label
- Stop if all attribute values are the same (use the majority class for the label)
- Stop if expanding the current node does not improve impurity measures
- Post-Pruning
- Grow the tree fully
- Trim nodes bottom-up: if the estimated generalisation error improves, replace the subtree with a leaf and use the majority class as its label
- Lazy solution
- Try all depths of tree
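As a rough illustration of the pre-pruning idea, the stopping conditions above map onto hyperparameters of scikit-learn's `DecisionTreeClassifier`. A minimal sketch, assuming scikit-learn is installed; the dataset and threshold values are arbitrary placeholders, not recommendations.

```python
# Minimal pre-pruning sketch with scikit-learn (assumed available).
# The stopping conditions from the list map onto tree hyperparameters;
# the specific values below are arbitrary illustrations.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(
    max_depth=4,                 # stop growing past a fixed depth
    min_samples_leaf=10,         # don't create tiny leaves
    min_impurity_decrease=1e-3,  # stop if a split barely improves impurity
    random_state=0,
)
pruned.fit(X_train, y_train)
print("train acc:", pruned.score(X_train, y_train))
print("test acc: ", pruned.score(X_test, y_test))
```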
Ensemble Method¶
- Construct a set of classifiers from the training data (an odd number, to avoid ties)
- Predict the class label by majority prediction from all the classifiers
- Split training data into \(n\) sets, and use each to generate a DT. Combine the resulting classifiers
- \(n\) can be chosen e.g. with the elbow method; don't make \(n\) too big, or the divided sets become too small and each model underfits
Why does this work?¶
Supposing there are \(N = 25\) base classifiers, each with an independent error rate \(\varepsilon = 0.35\), then \(\mathbb{P}(\text{misclassification}) = \mathbb{P}(X \geq 13)\), where \(X\) is the number of individual misclassifications
\(\mathbb{P}(\text{misclassification}) = \sum\limits_{i = N//2+1}^{N} \displaystyle \binom{N}{i} \varepsilon^i (1-\varepsilon)^{N-i} = 0.06\)
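The 0.06 figure can be reproduced directly from the binomial tail sum above; a minimal check using only the Python standard library:

```python
# Sketch of the ensemble-error calculation above, standard library only.
from math import comb

N, eps = 25, 0.35
# Majority vote is wrong when at least N//2 + 1 = 13 base classifiers are wrong.
p_ensemble_error = sum(
    comb(N, i) * eps**i * (1 - eps)**(N - i) for i in range(N // 2 + 1, N + 1)
)
print(round(p_ensemble_error, 3))  # ~0.06, well below the individual rate of 0.35
```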
Bagging¶
- Bootstrap Aggregation (Bagging)
- The above, but the \(D_i\) are generated by random sampling with replacement.
- Each record has a probability of \(1 - (1-\cfrac{1}{n})^n\) of being selected at least once in a bag of size \(n\) (1 \(-\) the probability of it never being selected)
- \(\to 1- \cfrac{1}{e} \approx 0.632\) as \(n \to \infty\)
- Why is it not done without replacement? Drawing \(n\) of \(n\) records without replacement just reproduces the full training set, so every bag (and every tree) would be identical and the aggregate would be no different from a single full tree!
- Why is repetition within the same bag OK? A repeated record follows the same decision-tree path anyway (and there is a low chance of a record repeating many times in one bag when \(N\) is large)
- Why can the model not be different for each bag? It can, but that no longer falls under the bagging approach: if the models are different there is no need to divide the training data, a voting ensemble of the different models on the full dataset is enough. Then why not do the same with identical models? Because identical models trained on identical data give identical predictions, so nothing changes
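A minimal sketch of bagging under these assumptions (scikit-learn decision trees, an arbitrary toy dataset, 25 bags): each bag is drawn with replacement, and the per-bag coverage printed at the end should land near the \(1 - 1/e \approx 0.632\) fraction derived above.

```python
# Minimal bagging sketch: bootstrap samples (with replacement) + majority vote.
# Assumes scikit-learn and NumPy; dataset and bag count are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

n = len(X_tr)
trees, coverage = [], []
for _ in range(25):                           # odd number of bags, as above
    idx = rng.integers(0, n, size=n)          # sample n indices WITH replacement
    coverage.append(len(np.unique(idx)) / n)  # fraction of distinct records in the bag
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx]))

# Majority vote over the 25 trees.
votes = np.stack([t.predict(X_te) for t in trees])
y_hat = (votes.mean(axis=0) > 0.5).astype(int)

print("mean bag coverage:", round(float(np.mean(coverage)), 3))  # ~0.632
print("bagged accuracy:  ", round(float((y_hat == y_te).mean()), 3))
```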
Random Forest¶
- Train many DTs and use bagging (typically also considering only a random subset of features at each split)
- Wisdom of the crowd
- Uncorrelated trees are preferred; the random feature selection helps decorrelate them
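A minimal random-forest sketch, assuming scikit-learn; `n_estimators` is the number of bagged trees and `max_features` controls the random feature subset that keeps the trees decorrelated. The values are illustrative only.

```python
# Minimal random-forest sketch, assuming scikit-learn; values are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators = number of bagged trees; max_features = size of the random
# feature subset tried at each split (this is what decorrelates the trees).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```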
Boosting¶
- An iterative procedure to adaptively change distribution of training data by focusing more on previously misclassified records
- Split into \(n\) bags sampling randomly with replacement
- For each bag \(i\), initially all points have the same weight
- Train the \(n\) models and classify the points
- For each model, put the misclassified points into a new bag and fill the rest with random samples from the dataset (the details depend on the boosting algorithm)
- Records that are wrongly classified have their weights increased
- Records that are correctly classified have their weights decreased
- Retrain. The new combined \(\text{model} = \text{model}_1 + \text{model}_2 + \dots\)
- Why is this not done on the entire dataset? Training every round on the full dataset would certainly overfit
- Tendency to overfit
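In practice boosting is usually run through a library. A minimal sketch with scikit-learn's `AdaBoostClassifier` (the algorithm detailed in the next section), using its default decision-stump base learner on an arbitrary toy dataset.

```python
# Library-level boosting sketch, assuming scikit-learn; settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

boost = AdaBoostClassifier(n_estimators=50, random_state=0)
boost.fit(X_tr, y_tr)  # each round reweights the training points
print("test accuracy:", boost.score(X_te, y_te))
```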
AdaBoost¶
- Initialise weights \(w_j^{(1)} = \cfrac{1}{N}\) for each of the \(N\) training records \((x_j, y_j)\)
- Initialise \(k\) base classifiers \(C_i\)
- For each classifier \(C_i\):
- Train \(C_i\)
- \(\varepsilon_i = \cfrac{1}{N} \sum\limits_{j = 1}^{N} w_j \delta(C_i(x_j) \neq y_j)\) (\(\delta = 1\) if condition true)
- Compute the importance of \(C_i\):
- \(\alpha_i = \cfrac{1}{2} \ln \left(\cfrac{1-\varepsilon_i}{\varepsilon_i}\right)\) (if the error rate > 0.5, this is negative)
- Update \(w_j^{(i+1)} = \cfrac{w_j^{(i)}}{Z_i} \begin{cases} e^{-\alpha_i} \text{ if correct} \\ e^{\alpha_i} \text{ if incorrect} \end{cases}\) (\(e^{-\alpha_i} < 1\) shrinks the weight on a correct classification while \(e^{\alpha_i} > 1\) grows it on a misclassification; \(Z_i\) is a normalisation factor so the weights sum to 1)
- Final classifier: \(C^*(x) = \arg \max\limits_{y} \sum\limits_{i = 1}^{k} \alpha_i \delta(C_i (x) = y)\)
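To tie the formulas together, here is a hedged from-scratch sketch of AdaBoost on 0/1 labels, using scikit-learn decision stumps as the base classifiers \(C_i\). The error is computed as the standard weighted misclassification rate (weights kept summing to one), and the loop simply stops if \(\varepsilon_i \geq 0.5\); a teaching sketch, not a production implementation.

```python
# From-scratch AdaBoost sketch following the formulas above (decision stumps
# as base classifiers via scikit-learn; labels assumed to be 0/1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

N, k = len(X_tr), 25
w = np.full(N, 1.0 / N)          # w_j = 1/N initially
classifiers, alphas = [], []

for _ in range(k):
    # Train C_i on the weighted data (stumps keep each round weak).
    stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr, sample_weight=w)
    wrong = stump.predict(X_tr) != y_tr

    eps = np.sum(w * wrong)                 # weighted error rate (weights sum to 1)
    if eps >= 0.5 or eps == 0:              # stop on useless / perfect rounds
        break
    alpha = 0.5 * np.log((1 - eps) / eps)   # importance of C_i

    # Increase weights of misclassified records, decrease the rest, renormalise.
    w = w * np.exp(np.where(wrong, alpha, -alpha))
    w /= w.sum()                            # Z_i normalisation

    classifiers.append(stump)
    alphas.append(alpha)

# Final classifier: weighted vote, arg max over the classes {0, 1}.
votes = np.zeros((len(X_te), 2))
for alpha, clf in zip(alphas, classifiers):
    votes[np.arange(len(X_te)), clf.predict(X_te)] += alpha
y_hat = votes.argmax(axis=1)
print("AdaBoost test accuracy:", round(float((y_hat == y_te).mean()), 3))
```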