Simple Linear Regression: The foundation of Artificial Intelligence

Introduction to linear regression

For many machine learning engineers and data scientists, linear regression is the starting point of statistical modelling. It is a basic and commonly used type of predictive analysis.

Linear regression is used for finding a linear relationship between the target and one or more predictors. There are two types of linear regression – Simple and Multiple. The scope of this article is confined to Simple Linear Regression.

Simple Linear Regression

Simple linear regression (SLR) is used for finding the relationship between two continuous variables. One is the predictor, or independent variable, and the other is the dependent, or target, variable. SLR looks for a statistical relationship, not a deterministic one. In a deterministic relationship, one variable exactly determines the value of another. For example,

We can determine the area of a circle from its radius:

Deterministic Equation - Area of the Circle

$$A = \pi r^2$$

Area of the Circle vs. Radius

Whereas, in statistical relationships, the relationship between the variables is not clearly defined. When you plot the Predictor vs. Target graph, it exhibits some trend, but it also exhibits some scatter. For example, as Experience increases, you’d expect Salary to increase, but not perfectly. So, we focus on capturing the trend between two quantitative variables.
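The contrast between the two kinds of relationship can be sketched in a few lines of Python. This is only an illustration: the `simulated_salary` function and its numbers (a base of 25,000, 9,000 per year, and scatter of 5,000) are made up for the example and are not taken from the data set used later.

```python
import math
import random

random.seed(0)  # fixed seed so the illustration is repeatable


def circle_area(radius):
    """Deterministic: the same radius always gives exactly the same area."""
    return math.pi * radius ** 2


def simulated_salary(years):
    """Statistical: a linear trend plus random scatter (made-up numbers)."""
    trend = 25_000 + 9_000 * years
    noise = random.gauss(0, 5_000)
    return trend + noise


# The deterministic relationship has no scatter at all.
areas = [circle_area(2.0) for _ in range(3)]

# The statistical relationship gives a different salary each time,
# even though the experience is the same.
salaries = [simulated_salary(5) for _ in range(3)]

print(areas)
print(salaries)
```

Running it prints three identical areas and three different salaries that all sit near the same trend value.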

Linear regression

Modelling Simple Linear Regression

Since we are interested in developing a hypothesis or capturing the trend between two quantitative variables, we need to find the best fitting line. Before we move further, let’s represent this best fitting line in mathematical terms to gain some intuition:

Equation 1

$$\hat{y}_i = b_0 + b_1 x_i$$

Here, ŷi is the predicted value of the dependent variable (the variable we are trying to predict), xi is the independent variable (the variable we use to make predictions), and b0 and b1 are the intercept and slope of the line, respectively. We use this equation to predict the actual response yi. We then calculate the prediction error, which is:

Equation 2

$$e_i = y_i - \hat{y}_i$$

A line that fits the data “best” is the one for which the n prediction errors, one for each observed data point, are as small as possible in an overall sense. So we end up with a minimization problem. One way to minimize the error is the “least squares criterion”, which says to “minimize the sum of the squared prediction errors”.

Equation 3

$$Q = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

All we have to do is find the values of b0 and b1 that minimize the sum of the squared prediction errors.

For more math, click here.
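For a single predictor, the least squares problem has a well-known closed-form solution: the slope is b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and the intercept is b0 = ȳ − b1·x̄, where x̄ and ȳ are the sample means. The article does not spell this out, so the following is a minimal sketch in plain Python; the function name `fit_simple_linear_regression` and the toy data are made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared prediction errors."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n

    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)

    b1 = s_xy / s_xx           # slope
    b0 = y_mean - b1 * x_mean  # intercept
    return b0, b1


# Made-up toy data that roughly follows y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(xs, ys)
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```

Note that the formula requires at least two distinct x values; otherwise the denominator is zero and no unique line exists.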

Core Idea

The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error over all data points is as small as possible. The error for a point is the vertical distance from that point to the regression line.
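One way to see this is to compare the total squared error of the least squares line with that of slightly different lines. The sketch below uses made-up toy data and NumPy's `polyfit` to obtain the least squares line; the helper `total_squared_error` is a hypothetical name introduced only for this example.

```python
import numpy as np

# Made-up toy data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def total_squared_error(b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    residuals = y - (b0 + b1 * x)
    return float(np.sum(residuals ** 2))


# polyfit with deg=1 returns the coefficients highest degree first:
# slope, then intercept.
b1_hat, b0_hat = np.polyfit(x, y, deg=1)

best_error = total_squared_error(b0_hat, b1_hat)
steeper_error = total_squared_error(b0_hat, b1_hat + 0.2)  # tilt the line
shifted_error = total_squared_error(b0_hat + 0.5, b1_hat)  # move it up

print(best_error, steeper_error, shifted_error)
```

Any change to the slope or the intercept increases the total squared error, which is exactly what "best fit" means under the least squares criterion.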

Python Implementation

Python is a widely used language in statistical modelling, since it covers a wide range of tasks, from data visualization to modelling. It also supports many machine learning libraries, such as scikit-learn, NumPy, and SciPy, which ease our task by providing functions to load, fit, and visualize data and eliminate a lot of boilerplate code.

We will now try to develop an intelligent system that finds the relationship between employees' years of experience and their salaries in a company. Management can make better decisions when setting salaries for new hires by using this model to estimate the salary for a given level of experience, and can keep updating the model to find the best fitting line. Here, we have data on the experience and salaries of 30 employees, and we will try to estimate the salaries of new hires by building a simple linear regression model.

Click here to access the data-set. So, without further delay, let’s get started!

  1. Data Pre-processing

The data is provided as a .csv file. To structure it in a more readable, meaningful way, we need to pre-process it. We also need to extract the independent variable (X) and the dependent variable (y), and split the data set into a training set and a test set. We use the train_test_split function from scikit-learn's model_selection module to split the data set.

# Data Preprocessing Template

# Importing the libraries

import matplotlib.pyplot as plt # Used for plotting data
import pandas as pd             # Used for loading data

# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')  # Reading csv file
X = dataset.iloc[:, :-1].values           # Experience
y = dataset.iloc[:, 1].values             # Salary

# Splitting the dataset into the Training set and Test set.
# We are splitting 2/3 of dataset as training set and 1/3 of the dataset as test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
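A quick sanity check after splitting is to look at the shapes of the resulting arrays. The sketch below does not read Salary_Data.csv; it builds 30 made-up rows as a stand-in (the values are assumptions, not the real data) so that it runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for Salary_Data.csv: 30 made-up rows, NOT the real file contents
rng = np.random.default_rng(0)
experience = np.round(rng.uniform(1, 10, size=30), 1)
salary = 25_000 + 9_000 * experience + rng.normal(0, 5_000, size=30)

X = experience.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = salary                     # the target can stay 1-D

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

print("Training set:", X_train.shape, y_train.shape)
print("Test set:    ", X_test.shape, y_test.shape)
```

With 30 rows and test_size=1/3, the split leaves 20 rows for training and 10 for testing. Fixing random_state makes the split reproducible between runs.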

Figure: Data Frame, X (Employee Experience), and Y (Salary)

  2. Fit the model and Predict the values

We use the LinearRegression class from scikit-learn's linear_model module to fit the training data and predict the values for the test set.

# Fitting Simple Linear Regression to the training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train,y_train)

# Predicting the Test set results

y_Pred = regressor.predict(X_test)
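Once the model is fitted, the intercept b0 and slope b1 of the line are available as the `intercept_` and `coef_` attributes of the regressor, and `predict` can estimate the salary of a new hire. The sketch below uses made-up training data that lies exactly on the line salary = 30000 + 9000 × years, so the recovered values are easy to check; the 6.5 years of experience for the new hire is also just an assumed example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on salary = 30000 + 9000 * years
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = 30_000 + 9_000 * X_train.ravel()

regressor = LinearRegression()
regressor.fit(X_train, y_train)

b0 = regressor.intercept_  # intercept of the fitted line
b1 = regressor.coef_[0]    # slope: salary change per extra year

# Estimate the salary for a hypothetical new hire with 6.5 years of experience
new_hire = np.array([[6.5]])  # must be 2-D: one row, one feature
estimated_salary = regressor.predict(new_hire)[0]

print(f"b0 = {b0:.0f}, b1 = {b1:.0f}, estimate = {estimated_salary:.0f}")
```

On real data the points will not lie exactly on a line, so the slope and intercept will be estimates rather than exact values.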
  3. Visualizing the results from the training set and test set

# Visualising the Training set results

plt.scatter(X_train,y_train,color ='red')
plt.plot(X_train,regressor.predict(X_train),color ='blue')
plt.title(" Salary vs Experience ( Training Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

# Visualising the Test set results

plt.scatter(X_test,y_test,color ='red')
plt.plot(X_test,y_Pred,color ='blue')
plt.title(" Salary vs Experience ( Test Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

Results   

Predicted vs. Actual Values

Training Set Results

Test Set Predictions

When you look at the distance between the data points and the fitted line in the Training Set results, you can see that the regression model has tried to find the best fitting line for the data points in the training set.

Later, when we applied the trained model to the test set, it predicted salaries that are quite close to the actual values, and the data points lie close to the plotted line, suggesting that the fitted line also describes data the model has not seen.
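"Quite close" can also be put into numbers. A minimal sketch, assuming made-up data in place of Salary_Data.csv, is to compute the R² score and the mean absolute error on the test set with scikit-learn's `r2_score` and `mean_absolute_error`. These metrics are an addition for illustration and are not part of the original walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 30 rows, NOT the real Salary_Data.csv values
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(30, 1))
y = 25_000 + 9_000 * X.ravel() + rng.normal(0, 5_000, size=30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# R^2: share of the variance in salary explained by the line (1.0 is perfect)
r2 = r2_score(y_test, y_pred)
# MAE: average size of the prediction error, in salary units
mae = mean_absolute_error(y_test, y_pred)

print(f"R^2 on the test set: {r2:.3f}")
print(f"Mean absolute error: {mae:,.0f}")
```

Evaluating on the test set rather than the training set matters here, because the line was chosen to minimize the error on the training points.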

Conclusion

As you can see, the model is intelligent because it tries to find the best possible relationship between an employee’s experience and salary, and uses that relationship to estimate a salary from experience, rather than just substituting values into a formula without taking the statistical relationship between the factors into account. These estimates will help management offer salaries within a certain range of them and make the right decisions. So, this simple regression model has potential business value.

They can later improve this model’s accuracy. If experience alone is not enough to estimate salary and other factors also need to be taken into account, they can include additional predictors (such models are called multiple linear regression models), experiment with those models, and choose the one that best fits the bill.
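As a rough sketch of what that extension might look like, the same LinearRegression class accepts more than one predictor column. The second feature below, a count of certifications, is purely hypothetical, and the salaries are generated from a made-up formula so that the fitted coefficients are easy to verify.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictors: years of experience and number of certifications
X = np.array([
    [1.0, 0],
    [2.0, 1],
    [3.0, 0],
    [4.0, 2],
    [5.0, 1],
    [6.0, 3],
])

# Made-up salaries: exactly 30000 + 9000 * experience + 2000 * certifications
y = 30_000 + 9_000 * X[:, 0] + 2_000 * X[:, 1]

model = LinearRegression().fit(X, y)

print("Intercept:", model.intercept_)
print("Coefficients (experience, certifications):", model.coef_)
```

Each coefficient is the estimated change in salary for a one-unit change in that predictor while the other is held fixed.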

 

 

 
