Classification problems occur often, perhaps even more often than regression problems. There are many classification models; this article is confined to one of them, the logistic regression model, and takes an in-depth look at logistic regression in R.
Classification differs from regression in that a regression model predicts a quantitative value. For example, in simple linear regression we were trying to predict an employee’s salary, which is a quantitative variable. In many cases, however, the response variable is qualitative. Some examples include:
- Classifying fraudulent and genuine transactions.
- Identifying cancer cells from non-cancerous ones.
- Classifying colors etc.
These qualitative variables are often referred to as categorical. The process of predicting these qualitative responses from observation is known as classification.
Why not use regression for a qualitative response?
Suppose that we are trying to identify the medical condition of a patient based on his/her symptoms, and that there are three possible outcomes: malaria, stroke and depression. We could encode these outcomes as a quantitative variable Y:
$$Y = \begin{cases} 0 & \text{if malaria} \\ 1 & \text{if stroke} \\ 2 & \text{if depression} \end{cases}$$
Here, the codes 0, 1 and 2 turn the outcomes into a quantitative variable Y. If we now use the least squares method to fit a linear regression model that predicts Y from relevant features, the coding implies an ordering of the outcomes and treats the difference between malaria and stroke as equal to the difference between stroke and depression. This isn’t the case in reality, and a different but equally reasonable coding would produce a different model. In general, it is difficult to convert a qualitative response into a quantitative one. Hence, a classification method is the better way to deal with qualitative responses.
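The effect of an arbitrary coding can be illustrated with a small sketch. The data below are made up for illustration (a hypothetical `symptom_score` feature and nine invented patients), and the two codings differ only in the order assigned to the outcomes. The fitted lines come out different even though the patients are the same.

```r
# Made-up data: one hypothetical numeric feature and three outcomes.
symptom_score <- c(1.2, 0.8, 1.1, 2.9, 3.2, 3.0, 5.1, 4.8, 5.3)
condition <- rep(c("malaria", "stroke", "depression"), each = 3)

# Two equally arbitrary numeric codings of the same outcomes.
coding_a <- c(malaria = 0, stroke = 1, depression = 2)
coding_b <- c(stroke = 0, malaria = 1, depression = 2)
y_a <- unname(coding_a[condition])
y_b <- unname(coding_b[condition])

# Least squares fits: the coefficients depend on the coding chosen.
fit_a <- lm(y_a ~ symptom_score)
fit_b <- lm(y_b ~ symptom_score)
print(coef(fit_a))
print(coef(fit_b))
```

Since nothing in the data justifies preferring one coding over the other, neither fitted line can be trusted as a description of the outcomes.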
The main idea of logistic regression is to estimate the probability that the dependent variable belongs to a category given certain features. For example, suppose we are trying to predict whether an email is spam or not. The categorical variable is then a binary variable, encoded as either 1 or 0 (yes or no, true or false).
So, logistic regression models the probability that a dependent variable (Y) belongs to a particular category. For example, the probability that a dependent variable “y” belongs to the category “1” given a feature “x” is represented as:
$$P(y = 1 \mid x)$$
If we tried to model the relationship between P(y=1|x) and X using linear regression, these probabilities would be represented as:
$$p(X) = \beta_0 + \beta_1 X$$
where p(X) = P(y=1|x). Unfortunately, we would end up obtaining probabilities below zero for some values of X and above one for others. This shouldn’t be the case, since the probability of an event should fall between 0 and 1.
To avoid the problem of invalid probabilities, we use the sigmoid (logistic) function, which keeps these probabilities between 0 and 1. It is the inverse of the logit function, under which the log-odds are linear in X:
$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
After exponentiating both sides and solving for p(X), we obtain the sigmoid function:
$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
This sigmoid function always produces an S-shaped curve and probabilities between 0 and 1. The coefficients β0 and β1 are unknown. Since the relationship between p(X) and X is not a straight line, β1 does not correspond to the change in p(X) for a unit change in X; instead, a unit change in X changes the log-odds by β1.
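These properties can be checked numerically with a minimal sketch. The values chosen for `beta0` and `beta1` are arbitrary and purely illustrative; `plogis()` is R's built-in logistic function and is used only as a cross-check.

```r
# Sigmoid (logistic) function for a single feature x.
sigmoid <- function(x, beta0, beta1) {
  1 / (1 + exp(-(beta0 + beta1 * x)))
}

# Illustrative coefficients (assumed, not estimated from any data).
beta0 <- -1
beta1 <- 0.8

x <- seq(-10, 10, by = 1)
p <- sigmoid(x, beta0, beta1)

# Probabilities stay strictly between 0 and 1.
print(range(p))

# The change in p(X) per unit of X is not constant ...
print(diff(p))

# ... but the change in the log-odds per unit of X is always beta1.
log_odds <- log(p / (1 - p))
print(diff(log_odds))

# Cross-check against R's built-in logistic function.
print(all.equal(p, plogis(beta0 + beta1 * x)))
```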
Because the coefficients are unknown, we must estimate them from the available training data. We can do so using a method called maximum likelihood.
So, we will use maximum likelihood to fit a logistic regression model, estimating β0 and β1 such that the predicted probability for each observation is as close as possible to its actual observed outcome.
For example, we will estimate β0 and β1 so that plugging them in yields a probability close to one for all emails which are actually spam and close to zero for all emails which are not spam.
After obtaining the β0 and β1 estimates, all we have to do is plug in these values to obtain the probability that an observation belongs to a particular category.
The math for maximum likelihood is beyond the scope of this article; the basic intuition behind it is sufficient to experiment with this model, since most of the math is handled by packages in programming languages such as R.
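For readers who want to see the intuition in action, here is a rough sketch under stated assumptions: the data are simulated, the "true" coefficients (-0.5 and 2) are invented, and the general-purpose optimizer `optim()` stands in for the fitting routine that `glm()` uses internally. It only illustrates that maximizing the likelihood by hand lands on essentially the same estimates as `glm()`.

```r
set.seed(42)

# Simulated data with assumed coefficients beta0 = -0.5 and beta1 = 2.
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))

# Negative log-likelihood of the logistic model:
# log L = sum(y * eta - log(1 + exp(eta))), with eta = beta0 + beta1 * x.
neg_log_lik <- function(beta) {
  eta <- beta[1] + beta[2] * x
  -sum(y * eta - log1p(exp(eta)))
}

# Minimizing the negative log-likelihood maximizes the likelihood.
fit_optim <- optim(c(0, 0), neg_log_lik, method = "BFGS")

# Reference fit with the standard tool.
fit_glm <- glm(y ~ x, family = binomial)

print(fit_optim$par)
print(coef(fit_glm))
```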
Logistic Regression in R: Social Network Advertisements
First, R is a programming language and free software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and for data analysis.
We will now develop a logistic regression model for a social networking site, Connections. They have an automaker as their client. The client has launched a brand new car and is advertising it on their site.
So, Connections Inc. now wants to find the relationship between a user’s age, salary and his/her response to the advertisement. The response here is categorized as Purchased or Not Purchased. They would use this logistic regression model to understand that relationship and thus serve the ad to the users who are most likely to purchase the car.
So, let’s get started.
The data is provided as a .csv file. To structure it in a more readable, meaningful way, we need to pre-process it: extract the features, encode the response variable (Y) as 0 for not purchasing and 1 for purchasing, and split the data set into a training set and a test set.
    # Importing the dataset
    dataset = read.csv('Social_Network_Ads.csv')
    # Keep columns 3 to 5: Age, EstimatedSalary and Purchased
    dataset = dataset[3:5]

    # Encoding the target feature as factor
    dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

    # Splitting the dataset into the Training set and Test set
    # install.packages('caTools')
    library(caTools)
    set.seed(123)
    split = sample.split(dataset$Purchased, SplitRatio = 0.75)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)
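To make the idea of a class-preserving split concrete, here is a base R sketch on a made-up data frame. It is not the `caTools` implementation, only an illustration of the principle: sample 75% of the rows within each class so that the training set keeps the same share of purchasers as the full data. The `toy` data frame and its values are invented.

```r
set.seed(123)

# Invented data: 12 non-purchasers (0) and 8 purchasers (1).
toy <- data.frame(
  Age = round(runif(20, min = 18, max = 60)),
  Purchased = factor(rep(c(0, 1), times = c(12, 8)), levels = c(0, 1))
)

# Row indices grouped by class.
rows_by_class <- split(seq_len(nrow(toy)), toy$Purchased)

# Draw 75% of the rows from each class separately.
train_rows <- unlist(
  lapply(rows_by_class, function(rows) {
    sample(rows, size = round(0.75 * length(rows)))
  }),
  use.names = FALSE
)

training_toy <- toy[train_rows, ]
test_toy <- toy[-train_rows, ]

# The class proportions match in the full data and the training set.
print(prop.table(table(toy$Purchased)))
print(prop.table(table(training_toy$Purchased)))
```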
Visualizing the data set
Now, we shall look at some of the records. You will find two features, the age and the estimated salary of the user, together with a flag indicating whether that user has purchased the car, where 1 stands for purchased and 0 for not purchased.
As you can see, the features age and estimated salary are in very different ranges. So, it is advisable to bring them onto a common scale; R's scale() function does this by standardizing each feature to zero mean and unit standard deviation. Sometimes this also helps speed up the calculations in an algorithm. So, let’s feature scale the training set as well as the test set.
    # Feature Scaling (column 3 is the response, so it is left out)
    training_set[-3] = scale(training_set[-3])
    test_set[-3] = scale(test_set[-3])
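The listing above scales the two sets independently. A common alternative, sketched here with made-up numbers, reuses the training set's means and standard deviations for the test set so that both are transformed in exactly the same way. The `train` and `test` data frames are hypothetical stand-ins for the feature columns.

```r
# Made-up feature values standing in for the real columns.
train <- data.frame(
  Age = c(19, 35, 26, 27, 45),
  EstimatedSalary = c(19000, 20000, 43000, 57000, 76000)
)
test <- data.frame(
  Age = c(32, 47),
  EstimatedSalary = c(150000, 25000)
)

# scale() stores the centering and scaling values it used as attributes.
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales <- attr(train_scaled, "scaled:scale")

# Apply the training set's parameters to the test set.
test_scaled <- scale(test, center = centers, scale = scales)

print(train_scaled)
print(test_scaled)
```

With this approach a given age or salary maps to the same scaled value regardless of which set it appears in.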
Visualizing the training set and test set after feature scaling
[Figure: the test set and the training set after feature scaling]
Fit the Logistic Regression model and Predict values
We call the generalized linear model function glm() in R with a binomial family to fit the logistic regression model on the training set, and then estimate the probabilities for the observations in the test set. We then use a probability threshold of 0.5 to classify all the observations with a probability greater than 0.5 as 1 (purchased) and the rest as 0 (not purchased).
    # Fitting Logistic Regression to the Training set
    classifier = glm(formula = Purchased ~ .,
                     family = binomial,
                     data = training_set)

    # Predicting the Test set results
    prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
    y_pred = ifelse(prob_pred > 0.5, 1, 0)
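What `type = 'response'` returns can be seen in a small self-contained sketch. The data are simulated and the coefficient 1.5 is an arbitrary assumption; the point is only that the response-scale prediction is the sigmoid of the linear predictor, so a 0.5 probability threshold is the same as asking whether the linear predictor is positive.

```r
set.seed(1)

# Simulated data with an assumed slope of 1.5.
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, size = 1, prob = plogis(1.5 * toy$x))

fit <- glm(y ~ x, family = binomial, data = toy)

# Linear predictor (log-odds) and probability for each observation.
link <- predict(fit, type = "link")
resp <- predict(fit, type = "response")

# The probability is the sigmoid of the linear predictor.
print(all.equal(plogis(link), resp))

# Thresholding the probability at 0.5 gives the predicted class.
pred_class <- ifelse(resp > 0.5, 1, 0)
print(table(pred_class))
```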
Accuracy of the model: Confusion Matrix
We can now test the accuracy of the model by building the confusion matrix, which will help us understand how many observations are incorrectly classified.
    # Making the Confusion Matrix (rows: actual class, columns: predicted class)
    cm = table(test_set[, 3], y_pred)
    cm
Here, out of the 64 users who haven’t purchased the car (0), 57 are correctly classified, and out of the 36 users who have purchased the car (1), 26 are correctly classified. Hence, the accuracy of the logistic regression model on these 100 observations is (57 + 26) / 100 = 0.83, that is, 83% accuracy.
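The arithmetic can be reproduced from the counts quoted above. In this sketch the off-diagonal cells are derived rather than taken from the text (64 - 57 = 7 non-purchasers predicted as purchasers, 36 - 26 = 10 purchasers predicted as non-purchasers), and the per-class rates are an extra the text does not report.

```r
# Confusion matrix rebuilt from the counts in the text
# (rows: actual class, columns: predicted class; filled column by column).
cm <- matrix(
  c(57, 10,
     7, 26),
  nrow = 2,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)
print(cm)

# Accuracy: correctly classified observations over all observations.
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)

# Share of each actual class that was classified correctly.
per_class <- diag(cm) / rowSums(cm)
print(per_class)
```

The per-class rates show that the model identifies non-purchasers more reliably (57 of 64) than purchasers (26 of 36), which a single accuracy figure hides.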
Visualizing the training set and test set results
    # Visualizing the Training set results
    # No extra package is needed: only base R graphics are used.
    set = training_set
    # Grid covering the scaled feature space in steps of 0.01
    X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
    X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
    grid_set = expand.grid(X1, X2)
    colnames(grid_set) = c('Age', 'EstimatedSalary')
    # Predicted class for every grid point
    prob_set = predict(classifier, type = 'response', newdata = grid_set)
    y_grid = ifelse(prob_set > 0.5, 1, 0)
    plot(set[, -3],
         main = 'Logistic Regression (Training set)',
         xlab = 'Age', ylab = 'Estimated Salary',
         xlim = range(X1), ylim = range(X2))
    contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
    # Background: predicted regions; foreground: actual observations
    points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
    points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

    # Visualizing the Test set results
    set = test_set
    X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
    X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
    grid_set = expand.grid(X1, X2)
    colnames(grid_set) = c('Age', 'EstimatedSalary')
    prob_set = predict(classifier, type = 'response', newdata = grid_set)
    y_grid = ifelse(prob_set > 0.5, 1, 0)
    plot(set[, -3],
         main = 'Logistic Regression (Test set)',
         xlab = 'Age', ylab = 'Estimated Salary',
         xlim = range(X1), ylim = range(X2))
    contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
    points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
    points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
We can observe that a straight line separates the green and red regions, where the green region is associated with purchasing the car and the red region with not purchasing it.
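Why the boundary is a straight line can be sketched as follows. With two features the predicted probability equals 0.5 exactly where the linear predictor β0 + β1·Age + β2·EstimatedSalary is zero, which is the equation of a line. The data here are simulated with invented coefficients, and β2 is simply the second feature's coefficient, an extension of the single-feature notation used earlier.

```r
set.seed(7)

# Simulated, already-scaled features with assumed coefficients.
toy <- data.frame(Age = rnorm(200), EstimatedSalary = rnorm(200))
toy$Purchased <- rbinom(
  200, size = 1,
  prob = plogis(-0.5 + 2 * toy$Age + 1 * toy$EstimatedSalary)
)

fit <- glm(Purchased ~ Age + EstimatedSalary, family = binomial, data = toy)
b <- coef(fit)

# Solve b0 + b1 * Age + b2 * Salary = 0 for Salary.
boundary_intercept <- -b[["(Intercept)"]] / b[["EstimatedSalary"]]
boundary_slope <- -b[["Age"]] / b[["EstimatedSalary"]]

# Points on that line receive a predicted probability of exactly 0.5.
on_line <- data.frame(Age = c(-1, 0, 1))
on_line$EstimatedSalary <- boundary_intercept + boundary_slope * on_line$Age
p_on_line <- predict(fit, newdata = on_line, type = "response")
print(p_on_line)
```

On a plot of the two features, `abline(a = boundary_intercept, b = boundary_slope)` would draw this line directly.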
As we have seen in the confusion matrix, some of the observations are incorrectly classified as purchased or not purchased when the users actually responded otherwise. These incorrectly classified data points can be seen in the plots of both the training set and the test set.
So, Connections Inc. could use this model to target ads to users based on their age and salary, with the expectation, based on the test set, that about 83% of its predictions are correct. This could considerably improve their advertisement business. They could try to improve the logistic regression model’s accuracy by adding more features and mining more data about users, or they could experiment with other classification models, such as K-Nearest Neighbors or Support Vector Machines, which can be more flexible classification techniques, then compare the results and use the best model.