The beauty of data science is that it can deal with almost any situation a programmer may face. This has led to a growing demand for data scientists across every field of technology: e-commerce, fintech, automation and more. A good data scientist, therefore, has to be ready for all of these challenges, which means we need tools in our arsenal to deal with all kinds of problems with data.
Today we look at four machine learning algorithms that will help you face almost any problem statement put in front of you.
1. Decision Tree
A decision tree is a very simple learning model that can walk through almost any classification problem. It is one of the simplest machine learning algorithms out there. The algorithm asks a series of questions of the data, based on which the class of an instance is predicted. The intelligence of the system lies in the selection of those questions.
Consider the following example where we are trying to determine whether someone will turn up at an election or not.
In this algorithm, the first question asked is whether a person is from a suburb, a village or a city. Based on the answer, we ask more questions and come to a decision. How these questions are chosen is the real issue. The key criterion for this purpose is entropy. Entropy is a measure of the randomness of a variable. As we reduce the entropy of a variable, we get closer to an answer. The reduction in entropy on asking a question is the information gain from that question. Therefore, a decision tree algorithm is trained to ask the questions with the highest information gain at the earliest stages.
The entropy is calculated by:

H(X) = -Σ p(x) log2 p(x)

where p(x) is the proportion of instances taking the value x. This formula gives us the entropy for each variable. There are many algorithms that determine how a decision tree is structured: ID3, C4.5 etc. Here is the code to implement a decision tree with the scikit-learn library in Python.
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
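The entropy calculation above can be sketched in plain Python. This is a minimal illustration, not part of scikit-learn; the function name and toy labels are made up for the example:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# A 50/50 split is maximally random: entropy is 1 bit
print(entropy(['yes', 'no', 'yes', 'no']))  # 1.0

# A pure set (all one class) has zero entropy
print(entropy(['yes', 'yes', 'yes', 'yes']))
```

A decision tree algorithm asks, at each node, the question whose answer reduces this quantity the most.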
2. K-nearest neighbours
Another important classification algorithm is KNN (k-nearest neighbours). When data belonging to the same class is concentrated in an area, we call upon this algorithm. It works on the principle that the class of a data point is the same as the class of its closest neighbours. The algorithm takes a parameter, k, from the user and selects the k closest training data points. Then it counts how many of these neighbours belong to each class. The class with the highest number of neighbours is selected as the class of the point in question.
>>> from sklearn.neighbors import NearestNeighbors
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
>>> nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
>>> distances, indices = nbrs.kneighbors(X)
>>> indices
array([[0, 1],
       [1, 0],
       [2, 1],
       [3, 4],
       [4, 3],
       [5, 4]]...)
>>> distances
array([[ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.        ],
       [ 0.        ,  1.41421356]])
>>> nbrs.kneighbors_graph(X).toarray()
array([[ 1.,  1.,  0.,  0.,  0.,  0.],
       [ 1.,  1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  1.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  1.]])
The algorithm is a go-to method for many complex problems in data science. When data points of the same class cluster together, KNN can work wonders. The above code explains the working of the algorithm. You can look up the Lung Cancer and Glass Identification datasets on the UCI Machine Learning Repository and try it out on your own.
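The NearestNeighbors code above only looks up neighbours; for the voting step described earlier, scikit-learn provides KNeighborsClassifier. Here is a minimal sketch on made-up toy data with two clusters, one per class:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one cluster per class
X = np.array([[-2, -1], [-1, -1], [-2, -2], [1, 1], [2, 1], [1, 2]])
y = [0, 0, 0, 1, 1, 1]

# k = 3: each prediction is a majority vote among the 3 nearest points
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [-1.5, -1.5]]))  # [1 0]
```

The point (0.5, 0.5) sits nearest the class-1 cluster and (-1.5, -1.5) nearest the class-0 cluster, so the vote assigns them accordingly.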
3. Polynomial Regression
Regression is the most widely used machine learning technique, and it is sometimes used wrongly. One should know when it is appropriate to fit a polynomial and extrapolate; a very common mistake is applying this method to data it does not suit. There are many other methods of regression, but polynomial regression is among the most important.
Polynomial regression simply finds a curve that fits the training data and extrapolates the predictions for test data. The theory is as simple as drawing a line through 2 given points using the formula y = mx + c and finding the values of m and c. For more complex data, the degree of the polynomial changes. The intelligence of the system lies in selecting the correct degree of the polynomial so that the bias-variance trade-off is favourable.
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
For regression problems, we mostly use the above function, poly.fit_transform(), to generate the polynomial terms, and then fit a linear model on the transformed features. It is one of the widely used functions from the scikit-learn library in Python. Regression is a very simple way of making predictions mathematically and is used in many industries for both internal and external purposes.
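Putting the pieces together, PolynomialFeatures can be chained with a linear model to perform the curve fitting described above. This is a minimal sketch on made-up, noise-free data following y = x²:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical training data drawn from y = x^2 (no noise)
X = np.arange(-3, 4).reshape(-1, 1)
y = (X ** 2).ravel()

# Degree 2 matches the underlying curve, so the fit is essentially exact
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[5]]))  # close to [25.]
```

Since the chosen degree matches the true curve, extrapolating to x = 5 recovers 25; with a wrong degree, the same extrapolation would go badly astray, which is exactly the bias-variance concern mentioned above.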
4. Neural networks
A neural network is a model that mimics the functioning of the human nervous system. Its smallest unit is called a perceptron, which is analogous to a neuron in a human. A perceptron performs a hard-limit function: it takes some inputs, each with an attached weight, and computes the weighted sum. If this sum is greater than the threshold value assigned to the perceptron, we get a 1 as output, and 0 otherwise. A neural network is a collection of such perceptrons arranged into an input layer, an output layer, and hidden layers, which do the computation needed to reach the final output.
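The perceptron rule just described can be sketched in a few lines of plain Python. The function name, weights and threshold here are made up for illustration; they happen to implement an AND gate:

```python
def perceptron(inputs, weights, threshold):
    """Hard-limit unit: output 1 if the weighted sum exceeds the threshold."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0

# Hypothetical weights and threshold that realise an AND gate
print(perceptron([1, 1], [0.5, 0.5], 0.7))  # 1
print(perceptron([1, 0], [0.5, 0.5], 0.7))  # 0
```

A network stacks many such units and, during training, learns the weights and thresholds instead of hand-picking them as we did here.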
>>> from sklearn.neural_network import MLPClassifier
>>> X = [[0., 0.], [1., 1.]]
>>> y = [0, 1]
>>> clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
...                     hidden_layer_sizes=(5, 2), random_state=1)
>>> clf.fit(X, y)
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=1, shuffle=True,
       solver='lbfgs', tol=0.0001, validation_fraction=0.1, verbose=False,
       warm_start=False)
>>> clf.predict([[2., 2.], [-1., -2.]])
array([1, 0])
The neural network is among the most ingenious and effective machine learning models. It can be applied to a huge range of problems and has many uses in real life: translators, diagnostics, routing and more.
The trick is to learn when to use a certain algorithm. Machine learning problems can be exceedingly confusing, so it is very important that you learn the theory behind an algorithm, understand your dataset, and use both well to build intelligent systems. You can find many datasets to work with on the UCI Machine Learning Repository.