Introduction to linear regression
For a myriad of machine learning engineers or data scientists, linear regression is the starting point of many statistical modelling projects. It is a basic and commonly used type of predictive analysis.
Linear regression is used for finding a linear relationship between the target and one or more predictors. There are two types of linear regression – Simple and Multiple. The scope of this article is confined to Simple Linear Regression.
Simple Linear Regression
Simple linear regression (SLR) is used for finding the relationship between two continuous variables. One is the predictor or independent variable and other is dependent or target variable.SLR looks for statistical relationship but not deterministic relationship. In a deterministic relationship, a variable is used to determine the value of another. For example,
We can determine the area of a circle from its radius:
Whereas, in statistical relationships, the relationship between the variables is not clearly defined. When you plot the Predictor vs. Target graph, it exhibits some trend, but it also exhibits some scatter. For example, as Experience increases, you’d expect Salary to increase, but not perfectly. So, we focus on capturing the trend between two quantitative variables.
Modelling Simple Linear Regression
Since we are interested in developing a hypothesis or capturing the trend between two quantitative variables, we need to find the best fitting line. Before we move further, let’s represent this best fitting line in mathematical terms to gain some intuition:
Here, ŷi is the dependent variable or the variable we are trying to predict, xi is the independent variable or the variable we use to make predictions, b0 and b1 are intercept and slope of the line respectively. We use this equation to predict the actual response yi .We then calculate the prediction error which is:
A line that fits the data “best” will be the one for which n prediction errors – one for each observed data point – are as small as possible in an overall sense. So, we end up with a minimization problem . One way minimize the error is through “least squares criterion” , which says to “minimize the sum of the squared prediction errors”.
All we have to do is , find the values of b0 and b1 that minimizes the sum of the squared prediction errors to the maximum extent possible.
For more math, click here .
The core idea is to obtain a line that best fits the data. The best fit line is the one for which total prediction error for all data points are as small as possible. Error is the distance between the point to the regression line.
Python is widely used language in statistical modelling, since we can perform wide ranging tasks from data visualization to modelling etc. It also support lot of machine learning libraries such as scikit-learn ,numpy,SciPy etc. which ease our task by providing functions to load, train, fit, visualize data and eliminate lot of boiler plate code .
We will now try to develop an intelligent system which helps in finding the correlation between the years of experience of employees and their salaries in a company .The management can make better decisions while setting salaries for new hires by using this model to estimate the salary for their respective experience and keep updating the model for finding the best fitting line .Here,we have data pertaining to 30 employees and their salaries and we will try to estimate the salaries for new hires by building a simple linear regression model .
Click here, to access the data-set.So, without further delay, let’s get started!
The data which is provided is a .csv file, in order to structure this data in a more readable, meaning full way we need to pre-process our data. We also need to extract the independent variable (X) and dependent variable (Y) and also split the data set into training set and test set. We call the model selection class from scikit-learn library to split the data set.
# Data Preprocessing Template # Importing the libraries import matplotlib.pyplot as plt # Used for plotting data import pandas as pd # Used for loading data # Importing the dataset dataset = pd.read_csv('Salary_Data.csv') # Reading csv file X = dataset.iloc[:, :-1].values # Experience y = dataset.iloc[:, 1].values # Salary # Splitting the dataset into the Training set and Test set. # We are splitting 2/3 of dataset as training set and 1/3 of the dataset as test set. from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
DataFrame X ( Experience) Y( Salary)
Fit the model and Predict the values
We call the linear model class from the scikit-learn library to fit the data and predict the values from test set .
# Fitting Single Linear Regression to the training set from sklearn.linear_model import LinearRegression regressor = LinearRegression() regressor.fit(X_train,y_train) # Predicting the Test set results y_Pred = regressor.predict(X_test)
Visualizing the results from the training set and test set
# Visualising the Training set results plt.scatter(X_train,y_train,color ='red') plt.plot(X_train,regressor.predict(X_train),color ='blue') plt.title(" Salary vs Experience ( Training Set)") plt.xlabel("Years of Experience") plt.ylabel("Salary") plt.show() # Visualising the Test set results plt.scatter(X_test,y_test,color ='red') plt.plot(X_test,y_Pred,color ='blue') plt.title(" Salary vs Experience ( Test Set)") plt.xlabel("Years of Experience") plt.ylabel("Salary") plt.show()
When you pay attention to the distance between the data points and the fitting line in Training Set results, you can see that the regression model tried to obtain the best fitting line for the given data points in training set .
Later, when we used the trained model , it has predicted salaries which are quite close to the actual values and plotted line has data points not far from it, indicating that it is the best fitting line for the model .
As you can see the model is intelligent because it tries to find the best possible correlation between an employee’s experience and their salaries and use this model to estimate a salary from an employee’s experience , rather than just substituting values in a formula and obtaining some value without taking into account the statistical relationship between various factors.These estimates will be useful for the management in offering salaries with a certain range from them and help make right decisions.So,this simple regression model has a potential business value and huge implications for their business .
They can later improve this model’s accuracy,if they feel it isn’t enough to use just experience to estimate the salary and there are other factors which also needed to be taken into account,they can do so by including additional parameters (these models are defined as multiple linear regression models )and experiment with those models and chose the one which best fits their bill.