# Linear Regression: Assumptions and Limitations – Part I Articles From: QuantInsti
Website: QuantInsti

In this blog, we take a critical look at the assumptions of a linear regression model, how to detect and fix them, and how much water they hold in the real world. We will check some of these assumptions and tests in Python, which will provide a blueprint for other cases using well-known libraries. We will also examine its shortcomings and how its assumptions limit its use.

In the first blog of this series, we deconstructed the linear regression model, its various aliases and types. In the second installment, we looked at the application of linear regression on market data in Python and R.

Our coverage proceeds on the following lines:

• What is linear regression? A brief recap
• Assumptions of linear regression
• Linear relationship
• No Multicollinearity
• Gaussian distribution of the error terms
• No Autocorrelation of the error terms
• Homoskedasticity of the error terms
• Zero conditional mean of the error terms
• Limitations of linear regression
• Simplistic in some cases
• Sensitivity to outliers
• Prone to underfitting
• Overfitting of complex models
• References

## What is Linear Regression? A brief recap

Linear regression models the linear relationship between a response (or dependent) variable (Y) and one or more explanatory (independent) variables (X).

We can express it in the form of the following equation:

Yi = β0 + β1Xi + ϵi

In the case of a single explanatory variable, it is called simple linear regression, and if there is more than one explanatory variable, it is multiple linear regression.

In regression analysis, we aim to draw inferences about the population at large by finding the relationships between the dependent and independent variables for the sample. Usually, the OLS (Ordinary Least Squares) method is used to estimate the regression coefficients. OLS finds the best coefficients by minimizing the sum of the squares of the errors.

## Gauss-Markov theorem

The Gauss-Markov theorem states that under certain conditions, the Ordinary Least Squares (OLS) estimators are the Best Linear Unbiased Estimators (BLUE). This means that when those conditions are met in the dataset, the variance of the OLS model is the smallest out of all the estimators that are linear and unbiased.

Let’s examine the terms ‘linear’ and ‘unbiased’.

• Linear – Linear estimators imply that they have a linear relationship with the dependent variable. This makes them easier to understand and implement.
• Unbiased – Unbiased estimators imply that when applying a model repeatedly, on average, the estimators will attain their true value.

We now look at the “under certain conditions” (i.e. the assumptions) mentioned earlier that form the core of the Gauss-Markov theorem.

## Assumptions of Linear Regression

We can divide the basic assumptions of linear regression into two categories based on whether the assumptions are about the explanatory variables (i.e. features) or the residuals.

Assumptions about the explanatory variables (features):

• Linearity
• No multicollinearity

Assumptions about the error terms (residuals):

• Gaussian distribution
• Homoskedasticity
• No autocorrelation
• Zero conditional mean

## Linearity

The basic assumption of the linear regression model, as the name suggests, is that of a linear relationship between the dependent and independent variables. Here the linearity is only with respect to the parameters. Oddly enough, there’s no such restriction on the degree or form of the explanatory variables themselves.

So both the following equations represent linear regression:

Here, the model is linear in parameters as well as linear in the explanatory variable(s).

This model is linear in parameters and non-linear in the explanatory variable(s).

The explanatory variables can be exponentiated, quadratic, cubic, etc. and it can still be framed as a linear regression problem.

The following equation is NOT linear regression:

Linear regression minimizes the error (mean-squared error) to estimate the unknown betas by solving a set of linear equations.

When betas take non-linear forms, things get harder and we cannot use the methods we’d mentioned (but not derived!) earlier. Hence, we cannot use linear regression in the case of equation 3. Hence, the linearity (of parameters) assumption is important.

### How to detect linearity?

A residual plot helps us identify poor or incorrect curve fitting between the data and the regression model. It is probably the simplest way to check for linearity or lack thereof. A nice even spread is indicative of linearity.

### How to fix linearity?

The tricky part now is to get the functional form of the equation right.

• We can try reframing it by applying a non-linear transformation on the independent and/or the dependent term(s). We can transform messy data by normalizing them, taking logs of the original values, etc. This would make the data linear.
• We can also try adding another independent variable to the equation (like X2).

## No Multicollinearity

Another assumption is that the independent variables are not correlated with each other. If there is a linear relationship between one or more explanatory variables, it adds to the complexity of the model without being able to delineate the impact of each explanatory variable on the response variable.

If we were to model the salaries of a group of professionals based on their ages and years of experience.

salaryi = β0 + β1(years of experience)i + β2(age in years)i + ϵi

Linear regression studies the effect of each of the independent variables (X) on the dependent variable (Y). But when the independent variables are correlated, as in this case, it is difficult to isolate the impact of a single factor on the dependent variable. If you increase the years of experience, the age also will increase.

So did the salary increase due to the experience or the age?
This will affect the accuracy of the coefficients and also the standard errors.

### How to fix multicollinearity?

One way to deal with multicollinearity among the independent variables is to do dimensionality reduction using techniques like PCA to create uncorrelated features with the maximum variance. 