Linear Regression is the basic algorithm a machine learning engineer should know. No matter how many algorithms you know, the one that will always work will be Linear Regression. The aim of linear regression is to predict the outcome Y on the basis of the one or more predictors X and establish a leaner relationship between them. Therefore, Y can be calculated if all the X are known.
Linear Regression supports Supervised learning(The outcome is known to us and on that basis, we predict the future values).
The equation used in Simple Linear Regression is –
Y = b0 + b1*X
The equation is the same as we studied for the equation of a line – Y = a*X + b. Thus b0 is the intercept and b1 is the slope.
Multiple Linear Regression
This is a leveled up version of Simple Linear Regression. As Simple Linear involves only one predictor, Multiple involves two or more predictors over one target variable. The equation is –
Y = b0 + b1*X1 + b2*X2 + … + bn*Xn
b0 – the common intercept
bi – slope for each Xi predictor.
Now let’s look at two examples, one for Simple Linear and one for Multiple Linear simultaneously. Create 2 files for each Linear Regression in the RStudio. For Simple Linear, we will use the ‘cars’ dataset and for Multiple Linear we will use ‘iris’ dataset.
Like in the above image, create 2 files and 2 data frames ‘dataset_cars’ and ‘dataset_iris’ for differentiating between them.
If we need to see the first six or the last 6 observations in the dataset then we use the head() and the tail() functions respectively. If we need to see the whole dataset, left click on the required dataset in the environment window(top right side) once(give it a few seconds) and the dataset will open. Below the code shows how to use the head() function.
head(dataset_cars) # to print 1st 6 obsevations.
# speed dist
#1 4 2
#2 4 10
#3 7 4
#4 7 22
#5 8 16
#6 9 10
Let’s understand the datasets. The ‘cars’ dataset contains 50 rows and 2 columns(speed and distance, respectively). It provides different speeds covering a certain distance.
Now let’s understand the ‘iris’ dataset. This contains 150 rows and 5 columns which refer to the iris plant with three different species. The columns are Sepal length, Sepal width, Petal length, Petal width and their species(‘setosa’, ‘versicolor’, and ‘virginica’).
Data preprocessing is the most important step in any Machine Learning program. This is done to remove any null values in data, encode the categorical variables, change text according to the required specification and many other problems which are needed to be solved before running any model on the data.
The Simple Linear problem does not need any preprocessing as there is no problem with the dataset.
The Multiple Linear problem requires preprocessing as it contains categorical variables because the machine does not understand or cannot determine the relationship between the species of the flower. Encoding in R is done by changing the categorical variable into a factor and giving each variety a value.
dataset_iris$Species = factor(dataset_iris$Species, levels = c(‘setosa’, ‘versicolor’, ‘virginica’), labels = c(1, 2, 3))
Let’s perform some analysis before performing the Linear Regression algorithm.
A statistical measure used to determine a linear relationship between two variables that occur in a pair. Correlation takes a value between +1 to -1, a high relation is shown by a value closer to +1 and an inverse relation is shown with a value close to -1.
The values between -0.2 and +0.2 show unexplainable behavior between Y and predictor X and another predictor will be needed to calculate the outcome, thus showing a weak relation.
Now let’s execute the cor() function one by one in each file.
# For Simple Linear Regression
#  0.8068949
Now for Multiple Linear, species column is excluded because it is a categorical variable and these variables don’t have any correlation with any variable.
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
#Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
#Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
#Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
Through correlation, Petal Length is preferred as a dependent variable, as it has good correlation with other variables.
Graphical Analysis helps to understand to visualize the behavior of the variables. There are 3 ways to visualize the behavior.
- Scatter plot
- Density Plot
It is used to visualize the linear relation between predictor and outcome. If multiple predictors, a relation is developed with each predictor.
scatter.smooth(x = dataset_cars$speed, y = dataset_cars$dist, main = “Dist ~ Speed”)
Now for ‘iris’ dataset, we’ll make a common plot as for every variable there will be many plots.
Therefore, the axes differ according to the variables written on the left or below.
The density plot is used to check the normality of the variables. The preferable normal distribution is bell-shaped, where there is no skewness(neither left nor right). Thus, it tells us whether the data presides lesser or greater than the median.
There are only two parameters in the ‘cars’ dataset, thus it is easy to construct them.
par(mfrow = c(1, 2)) # divide area into 2 columns
plot(density(dataset_cars$dist), main = “Distance”, round(e1071::skewness(dataset_cars$dist), 2))
plot(density(dataset_cars$speed), main = “Speed”, round(e1071::skewness(dataset_cars$speed), 2))
The values below the graph are the skewness of the variables, closer to zero, the better.
Similarly for multiple linear,
par(mfrow = c(1,2))
plot(density(dataset_iris$Sepal.Length), round(e1071::skewness(dataset_iris$Sepal.Length), 2),
main = “Sepal Length”)
plot(density(dataset_iris$Sepal.Width), round(e1071::skewness(dataset_iris$Sepal.Width), 2),
main = “Sepal Width”)
plot(density(dataset_iris$Petal.Length), round(e1071::skewness(dataset_iris$Petal.Length), 2),
main = “Petal Length”)
plot(density(dataset_iris$Petal.Width), round(e1071::skewness(dataset_iris$Petal.Width), 2),
main = “Petal Width”)
Splitting the data
Before building the model, we first need to split the data into training and test set, so that we can estimate the accuracy of our linear model.
Generally, the splitting is 75% to the training set and rest to test set, but the datasets are very small in size, we’ll take that number to 80%. First, we need to install the package called ‘caTools’. It contains the split() function which will split the data into required sizes. To install the package we need to use the following command(If you have it already then no need to install it again) –
library(caTools) # to import the package
The caTools library has the split() function which will split the data into the required size. For random splitting, we use the seed() function. split() takes two arguments, first is the dataset and second is the split ratio. So let’s see the code!
set.seed(1234) # seed can have any value, whatever you like, but do same seed for comparing the answers.
split = sample.split(dataset_cars$speed, SplitRatio = 0.8)
training_cars = subset(dataset_cars, split == TRUE)
test_cars = subset(dataset_cars, split == FALSE)
The above code will divide the 50 lines dataset into 40 and 10 lines as training and test set respectively.
Likewise in Multiple Linear,
split = sample.split(dataset_iris$Petal.Length, SplitRatio = 0.8)
training_iris = subset(dataset_iris, split == TRUE)
test_iris = subset(dataset_iris, split == FALSE)
This will consequently divide the dataset into 122 and 28 lines respectively.
Now let’s finally build the model!
The function used for building linear models is lm() function. It mainly takes two arguments, first the formula and the second is the dataset. Therefore, let’s see what the equation it will form.
Simple Linear Regression Model
regressor_cars = lm(formula = speed ~ dist, data = training_cars)
#lm(formula = speed ~ dist, data = training_cars)
# 7.8278 0.1728
As a result, the equation formed is –
speed = 7.8278 + 0.1728*dist
Now, we’ll predict the values for the test set(There are very fewer values in the training set, so some observations will have very less accuracy, bigger the data, more accurate the values).
y_pred_cars = predict(regressor_cars, newdata = test_cars)
# row_numbers – 9 14 16 21 26 29 34 36 39 47
#speed_predict – 14 12 12 14 17 13 21 14 13 24
#comparing – 10 12 13 14 15 17 18 19 20 24
As seen from the result, some values are certain, some are close and a few are less accurate.
Multiple Linear Regression Model
Similarly, for multiple linear regression, only the formula will change as it contains 4 independent variables.
regressor_iris = lm(formula = Petal.Length ~., data = training_iris)
#lm(formula = Petal.Length ~ ., data = training_iris) # ‘.’ means taking all the variables or (formula = Petal.Length ~ Sepal.Length + Sepal.Width + Petal.Width + Species)
#(Intercept) Sepal.Length Sepal.Width Petal.Width Species2 Species3
#-1.1989 0.6276 -0.1881 0.6253 1.4090 1.9117
Therefore, the equation comes out to be –
Petal.Length = -1.1989 + 0.6276*Sepal.Length – 0.1881*Sepal.Width + 1.4090*Species2 + 1.9117*Species3
Now, the question arises that the species column got divided into three parts and that too ‘Species1’ is not included, this is done because the species has three categories, so it got broken into those three and then removed ‘Species1’ to avoid the dummy variable trap(to avoid multicollinearity because the other variables can predict the value of the removed ones).
Let’s run the model on the test set.
y_pred_iris = predict(regressor_iris, newdata = test_iris)
round(y_pred_iris, 1) # There are 28 values, so you can compare them on your console, I’ll just put the first 5 #here.
# row number – 6 9 11 29 31
#predicted values – 1.7 1.1 1.6 1.6 1.4
#original values – 1.7 1.4 1.5 1.4 1.6
The rest values are quite accurate as well. Now let’s diagnose our models!
The model creation and prediction is one part, but to use the model we need to analyze it first, due to which we can confirm, that our model is good. For this, we will use the summary() function.
Let’s see it on Simple Linear first.
# lm(formula = speed ~ dist, data = training_cars)
# Min 1Q Median 3Q Max
# -7.6479 -1.8285 0.3667 2.5346 5.8436
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 7.82784 0.93963 8.331 4.22e-10 ***
# dist 0.17275 0.01873 9.225 3.06e-11 ***
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 3.125 on 38 degrees of freedom
# Multiple R-squared: 0.6913, Adjusted R-squared: 0.6832
# F-statistic: 85.09 on 1 and 38 DF, p-value: 3.062e-11
Let’s understand a few diagnosis variables.
The p-values(model’s p-value at the bottom and individual in Pr(>|t|) column) are important because they tell us whether the variables are statistically significant(variables having high relation with dependent variable) or not. The default value is 0.05. Any variable with a p-value below 0.05 is statistically significant.
R-squared and Adjusted R-squared
R-squared tells us about the proportion of variation in the dependent variable given by the model.
SSE – the sum of squared error calculated by
SST – the sum of squared total
yi^ – the fitted value of observation and yi(bar) – mean of Y.
Now, let’s see understand what Adjusted R-squared is. Adjusted R-squared corrects total value for the number of terms in the model. Therefore, Adj R-squared is much better to look at the R-squared.
where n is the number of observations and q is the number of coefficients in the model. Higher the R-squared and Adjusted R-squared, the better.
Now we know about some analysis variables, we’ll check Multiple Linear Regression summary().
# lm(formula = Petal.Length ~ ., data = training_iris)
# Min 1Q Median 3Q Max
# -0.79849 -0.17488 0.00084 0.15366 0.67231
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.19893 0.32060 -3.740 0.000288 ***
# Sepal.Length 0.62761 0.05864 10.703 < 2e-16 ***
# Sepal.Width -0.18805 0.09472 -1.985 0.049474 *
# Petal.Width 0.62527 0.14676 4.261 4.17e-05 ***
# Species2 1.40903 0.20463 6.886 3.14e-10 ***
# Species3 1.91174 0.28984 6.596 1.32e-09 ***
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.2791 on 116 degrees of freedom
# Multiple R-squared: 0.9766, Adjusted R-squared: 0.9756
# F-statistic: 969.3 on 5 and 116 DF, p-value: < 2.2e-16
If there would have been a variable with a p-value greater than 0.05 then we will remove it from the formula and then run the regressor again. Therefore it may affect the accuracy and Adjusted R-squared accordingly(If Adj R-squared increases or remains the same, the better else don’t remove the variable). This step is called Backward Elimination.
We have learned about the basics of Linear regression, the types and an example to show how this all works. Get the code from this link.