K-Nearest Neighbor (KNN) Classification: building a KNN classifier with Python's sklearn package.
KNN is a method that simply looks at what kind of data lies nearest to the point it's trying to predict, then classifies that point based on the majority class of its neighbors. Let's see this algorithm in action with the help of a simple example. Here, we shall implement KNN with Python's sklearn library and see how to get the best results. Suppose you have a dataset with two classes of points (stars and hexagons) which, when plotted, looks like the one in the following figure, and you have to classify a new point (the circle).
Here we can see that the circle with k=1 contains the point to be classified plus one pre-existing point whose class is known. With k=3 we have three pre-existing points (don't forget to include the hexagon inside the k=1 circle).
k-NN is a classification algorithm. It uses the value of k (hang on, we will discuss this; for now just keep in mind that k is a positive integer) and measures the distance from the new point to its nearest neighbors. It then selects the k nearest data points, and assigns the new point the class held by the majority of them.
What is “k”?
k is simply the number of neighbors to be considered. Fitting a model with k=3 means that the three closest neighbors are used to classify a given point. Data scientists generally choose an odd k when the number of classes is even, so that a vote cannot tie. You can also generate the model for different values of k and compare their performance.
A small value of k means that noise has a higher influence on the result, while a large value makes the algorithm computationally more expensive. We will use multiple values of k in the implementation part to determine the best one among several values.
- Supervised Technique
- Used for Classification or Regression
- The target attribute/variable is usually known beforehand for the training data.
- kNN needs labelled points
- k in the k-NN algorithm is the number of nearest neighbours' labels used to assign a label to the current point.
- It is a lazy learning algorithm.
Now, why is k-NN a lazy algorithm?
We now know that k-NN just calculates distances to the nearest neighbours in order to classify a new point, so it does not include any training phase. "Training phase" here refers to the phase where the model learns some function or pattern from the input data, to be used later for classification.
K-NN is a lazy learner because it doesn’t learn a discriminative function from the training data but “memorizes” the training dataset instead.
For example, the logistic regression algorithm learns its model weights (parameters) during training time. In contrast, there is no training time in K-NN.
KNN has the following basic steps:
- Calculate distance
- Find closest neighbors
- Vote for labels
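The three steps above can be sketched from scratch in a few lines of NumPy. This is an illustrative toy implementation with made-up "star"/"hexagon" points, not the sklearn version we use later:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify one point by majority vote among its k nearest training points."""
    # Step 1: calculate the Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: find the indices of the k closest neighbors
    nearest = np.argsort(distances)[:k]
    # Step 3: vote for labels -- the most common class among the neighbors wins
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two "star" points and two "hexagon" points
X_train = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.0], [5.5, 4.8]])
y_train = np.array(["star", "star", "hexagon", "hexagon"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.1]), k=3))  # prints "star"
```

Note that with k=3 the new point near the stars still gets one hexagon vote from the third-nearest neighbor, but the star majority wins.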
That's enough introduction, so let's start the implementation.
We will make use of the widely known iris dataset, which is freely available online.
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. Here we shall use it with the sklearn library for better understanding of KNN.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
Here we will use a small portion of that dataset.
So, let's import the general libraries and read the iris dataset from its URL.
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset into a pandas dataframe
dataset = pd.read_csv(url, names=names)
```
Now, let's check how much data we have and what it looks like.
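The inspection code itself isn't shown in the original; a snippet along these lines would produce the output below (the dataset is re-read here so the snippet runs on its own):

```python
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(url, names=names)

print(dataset.shape)   # the (rows, columns) tuple
print(dataset.head())  # the first five rows
```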
```
(150, 5)
   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
```
Step 2: Preprocessing, train-test split and feature scaling
First of all, we imported the libraries and the dataset. Now we divide the data into input features and output labels, and split it into training and test sets.
```python
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
```
Before making any actual predictions, it is always good practice to scale the features so that all of them are evaluated uniformly. Wikipedia explains the reasoning pretty well:
Since the range of values of raw data varies widely, in some machine learning algorithms objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
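A tiny numeric sketch of that point, using made-up age and income values (one narrow-range feature, one broad-range feature):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical points: feature 0 is age (years), feature 1 is income (dollars)
a = np.array([25.0, 50_000.0])
b = np.array([45.0, 52_000.0])

# Unscaled: the Euclidean distance is governed almost entirely by income
raw_dist = np.linalg.norm(a - b)
print(raw_dist)  # ~2000.1 -- the 20-year age gap barely registers

# After standardizing each feature, both contribute proportionately
X_scaled = StandardScaler().fit_transform(np.vstack([a, b]))
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
print(scaled_dist)  # ~2.83 -- both features now carry equal weight
```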
So, moving on to feature-scaling.
```python
from sklearn.preprocessing import StandardScaler

s = StandardScaler()
s.fit(X_train)
X_train = s.transform(X_train)
X_test = s.transform(X_test)
```
Step 3: Creating a model and fitting the data to it
We are taking k=5 here for now. We create the model, fit our training data to it, and display the output.
```python
from sklearn.neighbors import KNeighborsClassifier

knn1 = KNeighborsClassifier(n_neighbors=5)
knn1.fit(X_train, y_train)
```
```
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                     weights='uniform')
```
Step 4: Predictions and Scoring
In the end, let's check how well our model is performing.
```python
y_pred = knn1.predict(X_test)
print("Score is", knn1.score(X_test, y_test))
```
That's all for the basic implementation. Now, as I mentioned earlier, I will show the performance for different values of k. So, here it is.
```python
for i in range(1, 9, 2):
    knn2 = KNeighborsClassifier(n_neighbors=i)
    knn2.fit(X_train, y_train)
    print("For k = %d accuracy is" % i, knn2.score(X_test, y_test))
```
```
For k = 1 accuracy is 0.9666666666666667
For k = 3 accuracy is 0.9666666666666667
For k = 5 accuracy is 0.9333333333333333
For k = 7 accuracy is 0.9
```
As we can see, in this run the score drops as k grows, so we have to be smart about choosing the value of k. (Since the train-test split is random, your exact numbers may differ.)
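A single train-test split is a noisy way to compare values of k. A more robust sketch uses 5-fold cross-validation; here it relies on sklearn's bundled copy of the iris data (rather than the UCI download) so it runs on its own:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # same iris data, shipped with sklearn

for k in range(1, 10, 2):
    # The pipeline scales inside each fold, so the held-out fold never
    # leaks into the scaler's statistics
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print("For k = %d mean accuracy is" % k, scores.mean())
```

Averaging over five folds smooths out the luck of any one split, so the comparison between values of k is more trustworthy.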
To summarize this article, let us look at the pros and cons of k-NN.
Pros:
- It is simple to implement.
- Training is easy.
- It has few parameters.
- It is a lazy learning algorithm, so there is no separate training phase.

Cons:
- High prediction cost.
- Doesn't play well with numerous features.
- Doesn't handle categorical features well, because it is difficult to define a distance between categorical values.
I hope I was able to convey what k-NN is and how to implement it with sklearn. I would advise you to apply the KNN algorithm to a different classification dataset. Vary the test and training sizes along with the value of k to see how your results differ and how you can improve the accuracy of your algorithm.