In this blog, we take a critical look at the assumptions of a linear regression model, how to detect and fix violations of them, and how well they hold up in the real world. We will check some of these assumptions with tests in Python using well-known libraries, which will provide a blueprint for other cases. We will also examine the model's shortcomings and how its assumptions limit its use.
In the first blog of this series, we deconstructed the linear regression model, its various aliases and types. In the second installment, we looked at the application of linear regression on market data in Python and R.
Our coverage proceeds on the following lines:
- What is linear regression? A brief recap
- Assumptions of linear regression
  - Linear relationship
  - No multicollinearity
  - Gaussian distribution of the error terms
  - No autocorrelation of the error terms
  - Homoskedasticity of the error terms
  - Zero conditional mean of the error terms
- Limitations of linear regression
  - Simplistic in some cases
  - Sensitivity to outliers
  - Prone to underfitting
  - Overfitting of complex models
What is Linear Regression? A brief recap
Linear regression models the linear relationship between a response (or dependent) variable (Y) and one or more explanatory (independent) variables (X).
We can express it in the form of the following equation:
Yi = β0 + β1Xi + ϵi
In the case of a single explanatory variable, it is called simple linear regression, and if there is more than one explanatory variable, it is multiple linear regression.
In regression analysis, we aim to draw inferences about the population at large by finding the relationships between the dependent and independent variables for the sample. Usually, the OLS (Ordinary Least Squares) method is used to estimate the regression coefficients. OLS finds the best coefficients by minimizing the sum of the squares of the errors.
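As a minimal sketch of what OLS does, the snippet below fits a simple linear regression with NumPy on synthetic data. The data-generating values (intercept 2, slope 3) and the variable names are assumptions made purely for illustration; they are not taken from the series' market data.

```python
import numpy as np

# Synthetic data (illustrative only): y = 2 + 3x + noise
rng = np.random.default_rng(seed=0)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=200)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# OLS: the betas that minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
sse = float(residuals @ residuals)

print("intercept, slope:", beta_hat)
print("sum of squared residuals:", sse)
```

Any other choice of coefficients gives a larger sum of squared residuals on the same data, which is what "best" means here.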
The Gauss-Markov theorem states that under certain conditions, the Ordinary Least Squares (OLS) estimators are the Best Linear Unbiased Estimators (BLUE). This means that when those conditions are met, the OLS estimators have the smallest variance among all estimators that are linear and unbiased.
Let’s examine the terms ‘linear’ and ‘unbiased’.
- Linear – A linear estimator is a linear function of the observed values of the dependent variable. This makes it easier to understand and implement.
- Unbiased – An unbiased estimator is one whose expected value equals the true parameter value. If we re-estimated the model on many repeated samples, the estimates would be correct on average.
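Both properties can be illustrated with a small simulation. This is a sketch under assumed values: the "true" coefficients, the noise level and the number of repetitions are all hypothetical choices, not figures from the article.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
true_beta = np.array([1.0, 0.5])      # assumed true intercept and slope
x = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x), x])

# "Linear": the OLS estimate is a fixed matrix A applied to y
A = np.linalg.pinv(X)                 # equals (X'X)^-1 X' for full-rank X

estimates = []
for _ in range(2000):
    # Draw a fresh sample from the same model and re-estimate
    y = X @ true_beta + rng.normal(0, 1, size=x.size)
    estimates.append(A @ y)
estimates = np.array(estimates)

# "Unbiased": the average estimate should sit close to true_beta
mean_estimate = estimates.mean(axis=0)
print("true betas:      ", true_beta)
print("average estimate:", mean_estimate)

# Check that A @ y matches a standard least-squares solve on the last sample
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
same_as_lstsq = np.allclose(A @ y, b_lstsq)
```

Individual estimates scatter around the true values; it is only their average that converges to them.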
We now look at the conditions (i.e. the assumptions) mentioned earlier, which form the core of the Gauss-Markov theorem.
Assumptions of Linear Regression
We can divide the basic assumptions of linear regression into two categories based on whether the assumptions are about the explanatory variables (i.e. features) or the residuals.
Assumptions about the explanatory variables (features):
- Linear relationship
- No multicollinearity
Assumptions about the error terms (residuals):
- Gaussian distribution
- No autocorrelation
- Homoskedasticity
- Zero conditional mean
Linear relationship
The basic assumption of the linear regression model, as the name suggests, is that of a linear relationship between the dependent and independent variables. Here, linearity is required only with respect to the parameters. There is no such restriction on the degree or form of the explanatory variables themselves.
So both the following equations represent linear regression:
Here, the model is linear in parameters as well as linear in the explanatory variable(s).
This model is linear in parameters and non-linear in the explanatory variable(s).
The explanatory variables can be exponentiated, quadratic, cubic, etc. and it can still be framed as a linear regression problem.
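To make this concrete, here is a short sketch that fits a model with a squared regressor using ordinary least squares. The synthetic data and coefficient values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
x = rng.uniform(-3, 3, size=300)
# Assumed relationship: y depends on x squared, not on x itself
y = 1.0 + 2.0 * x**2 + rng.normal(0, 0.5, size=300)

# The regressor is x**2, but the model is still linear in beta0 and beta1,
# so the same least-squares machinery applies unchanged
X_quad = np.column_stack([np.ones_like(x), x**2])
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

print("estimated intercept and coefficient on x^2:", beta_quad)
```

The transformation happens in the design matrix before estimation; the estimation step itself does not change.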
The following equation is NOT linear regression:
Linear regression estimates the unknown betas by minimizing the mean squared error, which reduces to solving a set of linear equations.
When the betas enter the model non-linearly, minimizing the error no longer yields a set of linear equations, so the methods we mentioned (but did not derive!) earlier no longer apply. Hence, we cannot use linear regression in the case of equation 3, which is why the linearity (of parameters) assumption is important.
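The three equations discussed in this section are not reproduced in this text. The forms below are assumed examples that merely match the descriptions given (linear in both, linear in parameters only, and non-linear in parameters); they should not be read as the exact equations of the original article.

```latex
\begin{align}
Y_i &= \beta_0 + \beta_1 X_i + \epsilon_i \tag{1} \\
Y_i &= \beta_0 + \beta_1 X_i^{2} + \epsilon_i \tag{2} \\
Y_i &= \beta_0 + X_i^{\beta_1} + \epsilon_i \tag{3}
\end{align}
```

In (1) and (2), differentiating the sum of squared errors with respect to each beta gives equations that are linear in the betas. In (3), the derivative with respect to \beta_1 contains terms such as X_i^{\beta_1} \ln X_i, so no such linear system exists.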
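How to detect non-linearity?
A residual plot (residuals against fitted values) helps us identify poor or incorrect curve fitting between the data and the regression model. It is probably the simplest way to check for linearity or the lack of it. Residuals scattered evenly around zero with no discernible pattern are indicative of linearity, while a curved or otherwise systematic pattern suggests that the functional form is wrong.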
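A minimal sketch of this check follows, on synthetic data with assumed coefficients. The helper names `fit_and_residuals`, `plot_residuals` and `curvature_signal` are hypothetical, and `curvature_signal` is only a crude numeric stand-in for eyeballing the plot, not a standard test.

```python
import numpy as np

def fit_and_residuals(x, y):
    """Fit y = b0 + b1*x by OLS and return (fitted values, residuals)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def plot_residuals(fitted, residuals):
    """Scatter residuals against fitted values (requires matplotlib)."""
    import matplotlib.pyplot as plt
    plt.scatter(fitted, residuals, s=10)
    plt.axhline(0, color="black", linewidth=1)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()

def curvature_signal(x, residuals):
    """Absolute correlation between residuals and the squared, centred x."""
    xc = x - x.mean()
    return abs(np.corrcoef(xc**2, residuals)[0, 1])

rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 10, size=300)
y_linear = 1.0 + 2.0 * x + rng.normal(0, 1, size=300)      # truly linear
y_curved = 1.0 + 0.5 * x**2 + rng.normal(0, 1, size=300)   # truly quadratic

fit_lin, res_lin = fit_and_residuals(x, y_linear)
fit_cur, res_cur = fit_and_residuals(x, y_curved)

print("curvature signal, linear data:", curvature_signal(x, res_lin))
print("curvature signal, curved data:", curvature_signal(x, res_cur))
```

Calling `plot_residuals(fit_cur, res_cur)` should show a U-shaped pattern, whereas `plot_residuals(fit_lin, res_lin)` should look like an even band around zero.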
How to fix non-linearity?
The tricky part now is to get the functional form of the equation right.
- We can try reframing it by applying a non-linear transformation to the independent and/or the dependent term(s), for example by taking logs of the original values. This can bring the relationship closer to linear.
- We can also try adding another independent variable to the equation (like X^2, the square of an existing variable).
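The effect of a log transformation can be sketched as follows. The exponential relationship, the noise level and the helper `r_squared` are assumptions chosen for illustration.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple OLS fit of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(seed=4)
x = rng.uniform(0, 5, size=300)
# Assumed exponential relationship with multiplicative noise,
# so that log(y) is linear in x
y = 2.0 * np.exp(0.8 * x) * np.exp(rng.normal(0, 0.2, size=300))

r2_raw = r_squared(x, y)            # straight line through curved data
r2_log = r_squared(x, np.log(y))    # straight line after the transformation

print("R^2 on raw y:", r2_raw)
print("R^2 on log y:", r2_log)
```

The two R² values describe different dependent variables, so they are only a rough guide. A residual plot of each fit is the more reliable comparison.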
No multicollinearity
Another assumption is that the independent variables are not correlated with each other. If there is a linear relationship between two or more explanatory variables, it adds to the complexity of the model without letting us delineate the impact of each explanatory variable on the response variable.
Suppose we were to model the salaries of a group of professionals based on their ages and years of experience:
salaryi = β0 + β1(years of experience)i + β2(age in years)i + ϵi
Linear regression studies the effect of each of the independent variables (X) on the dependent variable (Y). But when the independent variables are correlated, as in this case, it is difficult to isolate the impact of a single factor on the dependent variable. If you increase the years of experience, the age also will increase.
So did the salary increase due to the experience or the age?
Multicollinearity makes the coefficient estimates unstable and inflates their standard errors.
How to detect multicollinearity?
- Check the pairwise correlations among the independent variables.
- Compute the Variance Inflation Factor (VIF) of each independent variable.
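Both checks can be sketched with NumPy alone. The data below is synthetic and loosely mirrors the salary example: `age` and `experience` are built to move together, and `hours` is a hypothetical unrelated regressor added for contrast. The helper `vif` is written out by hand here; statsmodels offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(seed=5)
age = rng.uniform(25, 60, size=500)
experience = age - 22 + rng.normal(0, 2, size=500)  # tracks age closely
hours = rng.uniform(30, 50, size=500)               # unrelated (hypothetical)

features = np.column_stack([experience, age, hours])
corr = np.corrcoef(features, rowvar=False)
vifs = vif(features)

print("correlation matrix:\n", corr.round(3))
print("VIF (experience, age, hours):", vifs.round(2))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity, though the threshold is a convention rather than a hard rule.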
How to fix multicollinearity?
One way to deal with multicollinearity among the independent variables is to do dimensionality reduction using techniques like PCA to create uncorrelated features with the maximum variance.
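A minimal sketch of this idea follows, with PCA implemented through NumPy's SVD on synthetic, deliberately correlated features. The data and variable names are assumptions; in practice a library implementation such as scikit-learn's `PCA` would typically be used.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
age = rng.uniform(25, 60, size=500)
experience = age - 22 + rng.normal(0, 2, size=500)  # strongly tied to age
features = np.column_stack([experience, age])

# Standardize each column, then find the principal axes via SVD
Z = (features - features.mean(axis=0)) / features.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

scores = Z @ Vt.T                    # principal component scores
explained = S**2 / np.sum(S**2)      # share of variance per component
pc_corr = np.corrcoef(scores, rowvar=False)

print("correlation of original features:",
      np.corrcoef(features, rowvar=False)[0, 1].round(3))
print("correlation of principal components:", pc_corr[0, 1].round(6))
print("explained variance ratio:", explained.round(3))
```

The components are uncorrelated by construction, and here the first one carries almost all of the variance. The trade-off is interpretability: a component is a blend of age and experience, so its coefficient no longer describes either variable on its own.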
Visit QuantInsti to read the full article: https://blog.quantinsti.com/linear-regression-assumptions-limitations/.
This material is from QuantInsti and is being posted with its permission.