How will you select the model?

Model Selection

1. The central issue in all of machine learning is: how do we extrapolate what has been learnt from a finite amount of data to all possible inputs 'of the same kind'?
2. We build models from some training data. However, the training data is always finite.
3. On the other hand, the model is expected to have learnt 'enough' about the entire domain from which the data points can possibly come.
Let us understand some of the key concerns in selecting an appropriate model for a task.

a) Occam's Razor

A predictive model has to be as simple as possible, but no simpler. Often referred to as Occam's Razor, this is a fundamental tenet of all of machine learning.
Occam's Razor is therefore a simple rule of thumb: given two models that show similar performance on the finite training or test data, we should pick the one that makes fewer assumptions about the data that is yet to be seen.
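
As a purely illustrative sketch of this rule, the snippet below compares a plain linear model with a high-degree polynomial model using cross-validated scores and keeps the simpler one when the scores are comparable. The synthetic data, the scikit-learn estimators, and the 0.01 "comparable" tolerance are all assumptions made for illustration.

```python
# An illustrative sketch of Occam's Razor as a decision rule.
# Assumptions: synthetic data, scikit-learn models, and a 0.01 score tolerance.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)   # the true relationship is linear

simple_model = LinearRegression()                                                  # 2 parameters
flexible_model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())   # 10+ parameters

simple_score = cross_val_score(simple_model, X, y, cv=5).mean()
flexible_score = cross_val_score(flexible_model, X, y, cv=5).mean()

# If performance is comparable, keep the model that makes fewer assumptions.
chosen = flexible_model if flexible_score - simple_score > 0.01 else simple_model
print(simple_score, flexible_score, chosen)
```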

b) Over-fitting

Over-fitting is a phenomenon where a model becomes far more complex than is warranted for the task at hand and, as a result, generalizes poorly.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the model's performance on new data.
In other words, overfitting occurs when a statistical model or machine learning algorithm captures the noise of the data; intuitively, the model fits the training data too well.
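
A small sketch of this effect, assuming synthetic data and an arbitrarily chosen degree-15 polynomial: the model fits the training points closely but performs noticeably worse on held-out points.

```python
# A small illustration of overfitting on synthetic data: a high-degree polynomial
# achieves a very low training error but a much higher error on unseen test data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE: {train_mse:.3f}, test MSE: {test_mse:.3f}")  # test MSE typically far exceeds train MSE
```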

c) Regularization

Regularization is a simplification applied by the training algorithm to control the model's complexity.
In regression, it constrains (regularizes, or shrinks) the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting; a small sketch follows the list of roles below.

Roles of Regularization:

  1. It significantly reduces the variance of the model without a substantial increase in the bias.
  2. It is used to counter overfitting.
  3. It shrinks and regularizes the coefficients for better prediction, without losing the important properties of the data.
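
As a minimal sketch (ridge regression is assumed here, since the text does not commit to a specific regularizer, and the data is synthetic), the snippet below shows the coefficient estimates shrinking towards zero as the regularization strength alpha grows.

```python
# A minimal sketch of regularization with ridge regression (an assumed choice).
# As alpha increases, the coefficient estimates shrink towards zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=1.0, size=50)

unregularized = LinearRegression().fit(X, y)
print("OLS coefficients:   ", np.round(unregularized.coef_, 3))

for alpha in (0.1, 10.0, 1000.0):
    ridge = Ridge(alpha=alpha).fit(X, y)
    print(f"ridge (alpha={alpha}):", np.round(ridge.coef_, 3))  # coefficients shrink as alpha grows
```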

d) Bias-Variance Trade-off

• Error due to bias is the difference between the expected (or average) model prediction and the correct or true value:
  Bias = E[Y'] - Y, where Y' is the predicted value and Y is the actual value.
• Imagine training the model several times; the runs will produce a range of predictions.
• Error due to variance is the variability in the results of a model when the dataset is changed.
• High variance increases the spread of the predictions, which results in less accurate predictions.
• A low-bias, high-variance model is an overfitted model.
• Variance is how much the predictions for a given point vary between different samples of the training data.

• A model with high variance pays a lot of attention to the training data and does not generalize well to the test data.
• A low-bias algorithm is not easy to learn but is highly flexible; because of this, it has higher predictive performance.
• A high-bias, low-variance model is an underfitted model.
• A high-bias algorithm is easy to learn but less flexible; because of this, it has lower predictive performance.
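
To make the trade-off concrete, here is a rough sketch (synthetic data, bootstrap resampling, and a decision tree regressor are all illustrative choices of mine) that estimates the bias and the spread of predictions at a single query point when the model is retrained on different samples of the training data.

```python
# A rough sketch of estimating bias and variance at one query point by retraining
# a flexible model on bootstrap resamples of the training data. All choices
# (data, model, number of resamples) are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
x0 = np.array([[1.0]])          # fixed query point
true_y0 = np.sin(1.0)           # true value at the query point

preds = []
for _ in range(200):            # retrain on bootstrap resamples of the data
    idx = rng.randint(0, len(X), size=len(X))
    model = DecisionTreeRegressor().fit(X[idx], y[idx])
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
bias = preds.mean() - true_y0   # Bias = E[Y'] - Y
variance = preds.var()          # spread of predictions across retrainings
print(f"bias: {bias:.3f}, variance: {variance:.3f}")
```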

e) Model Complexity

Complexity

The number of parameters required to specify the model completely. For example, in a linear regression of the response attribute y on the explanatory attributes x1, x2, x3, the model y = ax1 + bx2 is 'simpler' than the model y = ax1 + bx2 + cx3: the latter requires 3 parameters compared to the 2 required for the first model.
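
A tiny sketch of this parameter-count comparison, using synthetic data and scikit-learn purely for illustration:

```python
# Comparing the number of fitted parameters of the two models described above.
# The data is synthetic and only serves to make the comparison runnable.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))                      # columns: x1, x2, x3
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + 0.2 * X[:, 2]

simpler = LinearRegression(fit_intercept=False).fit(X[:, :2], y)   # y = a*x1 + b*x2
richer = LinearRegression(fit_intercept=False).fit(X, y)           # y = a*x1 + b*x2 + c*x3

print("simpler model parameters:", simpler.coef_.size)   # 2
print("richer model parameters:", richer.coef_.size)     # 3
```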

f) Cross Validation

The dataset is randomly partitioned into k equal-sized subsamples.
Out of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data.
This process is repeated k times, so that each subsample is used exactly once for validation, and the k results are averaged to produce a single estimate.
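
A minimal sketch of the procedure described above, assuming k = 5, synthetic data, and a linear regression model:

```python
# A minimal k-fold cross-validation sketch (k = 5 chosen arbitrarily).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # random partition into k folds
scores = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])   # train on k-1 folds
    preds = model.predict(X[val_idx])                            # validate on the held-out fold
    scores.append(mean_squared_error(y[val_idx], preds))

print("per-fold MSE:", np.round(scores, 2))
print("averaged estimate:", np.mean(scores))
```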

g) Hold-Out Strategy

Hold-out is when you split up your dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data.
A common split when using the hold-out method is 80% of the data for training and the remaining 20% for testing.
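
A minimal sketch of the hold-out strategy with the common 80/20 split, using a synthetic dataset and a logistic regression model as placeholders:

```python
# A minimal hold-out sketch: 80% of the data for training, 20% held out for testing.
# The dataset and model are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy on unseen data:", model.score(X_test, y_test))
```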

Thank you and Keep Learning 🙂