In the age of data, it is imperative for organizations to use their data in the most effective and efficient manner. In the quest to exploit the power of data, every company is turning to Machine Learning. Machine Learning has grown by leaps and bounds over the last few years, and every aspiring Data Scientist should understand the algorithms, the mathematics, and the assumptions behind them. To learn these algorithms and appreciate the power of ML, a Data Scientist should be well versed in the basics. Linear Regression is considered one of the most basic algorithms and, most of the time, serves as the base model. In this three-part series, we discuss Linear Regression end to end. In this part, we cover the assumptions of Linear Regression, as it is vital for every Data Scientist to check them before finalizing a model. Before discussing the assumptions, let’s briefly look at what Linear Regression is; we will deep dive into it in the second part of this series.
Linear Regression is a technique for modeling a linear relationship between a dependent variable and a number of independent variables. The equation of Linear Regression is as follows:
Y = b0 + b1X1 + b2X2 + … + bnXn
Here Y is the dependent variable, X1 to Xn are the predictors (independent variables), b0 is the constant (intercept), and b1 to bn are the coefficients.
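For illustration, here is a minimal sketch of fitting such a model with scikit-learn; the data below is synthetic and all numbers are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples with 3 predictors (X1, X2, X3)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 5.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
print(model.intercept_)  # b0, the constant term
print(model.coef_)       # b1 to bn, one coefficient per predictor
```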
To establish a relationship between Y and the X’s, the following assumptions should be satisfied:
Linear Relationship:
There should be a linear relationship between the dependent variable (Y) and the independent variables (X’s). This can easily be checked with a scatter plot.
It is shown below:
Sales Variation with TV Advertisement
Sales increase linearly with TV advertising spend. The relationship should look like this; otherwise, Linear Regression will not give a fruitful output. If a linear relationship is not present, then depending on the shape of the relationship, a transformation such as a square root, logarithm, or square can be applied.
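A minimal sketch of this check is shown below; the file name and column names are assumptions standing in for your own advertising dataset:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical advertising dataset; file and column names are illustrative
df = pd.read_csv("advertising.csv")

plt.scatter(df["TV"], df["Sales"], alpha=0.6)
plt.xlabel("TV advertising spend")
plt.ylabel("Sales")
plt.show()
```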
No Multicollinearity:
There should not be any relationship among the independent variables; that is, no independent variable should be expressible as a combination of the others, e.g. X1 = func(X2, …, Xn). If two variables are dependent on each other, we can remove one of them.

To detect multicollinearity, check the Variance Inflation Factor (VIF). Ideally, the VIF of every variable should be lower than 4; if a variable’s VIF is more than 4, it should be dropped from the model. The VIF of a predictor is obtained by regressing it on the remaining predictors; in Python, the statsmodels library provides a function to compute it (see the sketch after the pair plot below).

Another way to detect multicollinearity is to compare R^2 and Adjusted R^2: a large difference between them indicates multicollinearity. We will discuss R^2 in detail in the next part of this series. You can also check the correlation among the variables before applying regression; it should not be high, and a cut-off can be decided depending on the business. Pair plots can also reveal relationships between independent variables, but they are difficult to read when there are too many variables in the dataset.
Pair Plot Among Different Variables
The pair plot shows a relationship among the TV, Newspaper, and Radio marketing spends, so there is multicollinearity in this case.
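As a hedged sketch of the VIF check mentioned above, using statsmodels (the file name and column names are again illustrative assumptions):

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical advertising dataset; file and column names are illustrative
df = pd.read_csv("advertising.csv")
X = add_constant(df[["TV", "Radio", "Newspaper"]])  # VIF needs an intercept column

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)  # predictors with VIF above ~4 are candidates for removal
```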
No Auto-Correlation:
This assumption states that there should not be any correlation between the residuals, which are the differences between the actual and predicted values. Autocorrelation can be detected with the Durbin-Watson test, which is one of the outputs of a regression summary. Ideally, the value should be 2; the nearer to 2, the better. Autocorrelation generally appears in time series data.
Durbin Watson Test in Linear Regression
The DW statistic is near 2, so it is in the acceptable range for the base model.
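The statistic can also be computed directly with statsmodels; here is a minimal sketch on synthetic data standing in for a real dataset:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
dw = durbin_watson(model.resid)  # ranges from 0 to 4; ~2 means no autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")
```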
Homoscedasticity:
This assumption states that the error terms should have the same variance across the independent variables. There should not be a pattern in their distribution; if there is, it is called heteroscedasticity. It can be checked by plotting a scatter plot of the predicted values against the residuals.
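A minimal sketch of this plot, reusing a synthetic fit in place of your own model:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Synthetic fit, standing in for your own model
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()

# A random, patternless cloud around zero suggests homoscedasticity;
# a funnel or curve suggests heteroscedasticity.
plt.scatter(model.fittedvalues, model.resid, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```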
Normal Distribution of the Error Terms:
The error terms should be normally distributed. A Quantile-Quantile (Q-Q) plot is used to check this. This assumption matters mainly when the dataset is small; when the dataset is large, the error terms are generally close to normally distributed.
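A Q-Q plot can be drawn with statsmodels; the sketch below again uses a synthetic fit in place of your own model:

```python
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm

# Synthetic fit, standing in for your own model
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 1.0 + X @ np.array([2.0, -1.5]) + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(X)).fit()

# If the residuals are normal, the points hug the reference line
sm.qqplot(model.resid, line="s")
plt.show()
```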
The points in the Q-Q plot lie close to the reference line, so the error terms satisfy the normality criterion. We have tried to explain the basic assumptions that you need to check while using Linear Regression.
If there is any mistake, please let us know in the comment section. We will update it as soon as possible.