Assumptions in Regression Models:
Assumptions play a key role in our everyday life and so it does in regression models. Let us take an example. You decided to do overtime work in this week to take a break in life and visit Goa. What are your assumptions here? First, you will still be breathing after a week. Secondly, a flood won’t hit Goa in anytime soon. And lastly, you’ll get cheap hotel rooms( which is not going happen). Similarly, we have a few undeniable regression assumptions while preparing our regression model. All the assumptions are to be made on the available data-set which varies between different regression models.
- Linearity: Well obviously we shall expect our data point to be linear. The model must be an accurate description of the true relationship between the variables yi = β0 + β1*x1,i + β2*x2,i + … + βk−1,i *xk−1,i + ϵ .Here’s a hard-coded example of a linear data points fit with a linear regression model.
#Linear Function demonstration import numpy as np import matplotlib.pyplot as plt x = np.linspace(1, 10, 20) y_data = np.random.randint(-2, 2, size=20)+2*x y_func = 2*x plt.scatter(x, y_data, label="Real Data Points") plt.plot(x, y_func, 'r', label="Regression Model Predictions") plt.legend()
- Consistency: We can conclude an estimate is consistent if as the sample size gets very large, the sample estimates for the coefficients approach the true population coefficients. If the residuals are not independent, this most likely indicates you wrongly specified the model.
- Independence of error term: This assumption states that an error from one observation (ϵi) is independent of the error from another observation (ϵj). Though satisfying this assumption is not necessary for Ordinary Least Square (OLS) results to be consistent. But, better methods than OLS are possible.
- Normality of error term: The normality assumption is one of the most misunderstood in all of statistics. In multiple regression, the assumption requiring a normal distribution applies only to the disturbance term, not to the independent variables. Perhaps the confusion about this assumption derives from difficulty understanding what this disturbance term refers to. It is the random error in the relationship between the independent variables and the dependent variable in a regression model.
- Stationary Variance: The error term should have a constant variance. The variance should not increase with the increase in Xi that is the error must be independent of the input value.
Suppose you are predicting how much people spend on food and luxury. This will not have a constant error variance since the error is going to be more for the rich.
- Independence of explanatory variables: For consistent results, the X variables must be independent of the error terms. That is, the errors made in the regression cannot be related to your variables. This problem can arise when there are possible explanatory variables (that may not even be measurable) not included in the regression that is correlated with the included explanatory variables.
- Multicollinearity: Linear regression assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are not independent of each other. A second important independence assumption is that the error of the mean has to be independent of the independent variables. This effect multicollinearity can be computed by calculating the tolerance. Tolerance measures the influence of one independent variable on all other independent variables, the tolerance is calculated with an initial linear regression analysis. Tolerance is defined as T = 1 – R² for these first step regression analysis.
If multicollinearity is found in the data centering the data, that is deducting the mean score might help to solve the problem. Other alternatives to tackle the problems is conducting a factor analysis and rotating the factors to ensure the independence of the factors in the linear regression analysis.
- Autocorrelation: Now comes the most difficult problem to handle, we have to assume that the Explanatory Variables are Stationary. That is the Variables should time invariant. Economics data like GDP, income, price level, wages are often not stationary, rather they grow as time goes on. In other words when the value of y(x+1) is not independent of the value of y(x).
This can be visualized with Scatterplot.
Let’s take an example of Petrol Prices in Kolkata over last 10 Days
# An example of data with non-stationary explanatory variables import matplotlib.pyplot as plt import numpy as np # Petrol Prices in Kolkata from 29 Sep to 8 Oct petrol_price = [83.87, 83.66, 83.52, 83.3, 85.8, 85.65, 85.65, 85.53, 85.3, 85.21] plt.plot(petrol_price)
There are a few advantages of logistic regression. Firstly, it does not need a linear relationship between the dependent and independent variables but assumes linearity of independent variables and log odds. Logistic regression can handle all sorts of relationships because it applies a non-linear log transformation to the predicted odds ratio. Secondly, the independent variables do not need to be multivariate normal – although multivariate normality yields a more stable solution. Also, the error terms (the residuals) do not need to be multivariate normally distributed. Thirdly, homoscedasticity is not needed. Logistic regression does not need variances to be heteroscedastic for each level of the independent variables. Lastly, it can handle ordinal and nominal data as independent variables. The independent variables do not need to be metric.
Still few assumptions are there to take care of:
- In the case of binary logistic regression, it requires the dependent variable to be binary and ordinal logistic regression requires the dependent variable to be ordinal. Reduction of an ordinal or even metric variable to dichotomous level loses a lot of information, which makes this test inferior compared to ordinal logistic regression in these cases.
- For a binary Logistic Regression since the output is in 0 or 1 the dependent variable must be transformed to the same or should be One Hot encoded.
- The model should be devoid of any overfitting nor underfitting. Only the meaningful variables should be included. Using a step-wise method to estimate the logistic regression can solve this problem.
- Just like Linear Regression, there should be little to no Multicollinearity and the error terms must be independent.
- Most importantly it requires quite large sample sizes. Because maximum likelihood estimates are less powerful than ordinary least squares. Thus to maintain a healthy expectancy large data sets are crucial.
Taking care of these aspects whilst designing our Regression model can significantly reduce our debugging time. As we are going to have a clear insight into our dataset based on which we can choose an adequate model which will serve our purpose.