Introduction to Decision Trees
A decision tree is essentially a flowchart-like tree, often binary, in which each internal node splits a group of observations according to the value of some feature variable. In this article we cover the basics of decision trees and then implement one with sklearn.
The goal of a decision tree is to make the best possible split at each node, so it needs an algorithm capable of doing just that. The classic algorithm for this is Hunt’s algorithm, which is both greedy and recursive: it picks the locally best split at the current node and then repeats the process on each resulting subset. Because it is greedy, it does not guarantee that the finished tree is globally optimal.
Decision tree learning is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
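To make the difference between the two kinds of tree concrete, here is a minimal sketch of how a leaf might turn the training samples it holds into a prediction. The function names and the sample values are hypothetical, and predicting the mean in a regression leaf is a common convention rather than the only option.

```python
from collections import Counter
from statistics import mean


def classification_leaf(labels):
    """A classification leaf predicts the most common class among its samples."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf(values):
    """A regression leaf commonly predicts the mean of its target values."""
    return mean(values)


# Made-up samples that ended up in one leaf of each kind of tree.
class_prediction = classification_leaf(["fit", "fit", "unfit"])
value_prediction = regression_leaf([1.4, 1.3, 1.5])

print(class_prediction)
print(value_prediction)
```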
Considering a simple example
IS A PERSON FIT?
Many factors, such as age, sex, and weight, could be considered when deciding whether a person is fit.
To see how a decision tree works, we take a few common ones.
First, we consider age: is the person younger than 30? This is the first decision.
If yes, we check whether the person eats a lot of pizza. If they do, we classify them as unfit; if they do not, we classify them as fit. This is the second decision on that branch.
If no, we check whether the person exercises regularly. If they do, we classify them as fit; if they do not, we classify them as unfit.
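The fitness example can be written as nested conditions, which is all a trained decision tree really is at prediction time. This is only an illustrative sketch: the function name and its parameters are made up for this article.

```python
def is_fit(age, eats_lots_of_pizza, exercises_regularly):
    """Walk the example tree from the root to a leaf and return the label."""
    if age < 30:
        # Younger than 30: the next question is about diet.
        return "unfit" if eats_lots_of_pizza else "fit"
    # 30 or older: the next question is about exercise.
    return "fit" if exercises_regularly else "unfit"


print(is_fit(25, eats_lots_of_pizza=True, exercises_regularly=False))
print(is_fit(45, eats_lots_of_pizza=True, exercises_regularly=True))
```

Note that each person only ever answers two questions: the root decides which second question is asked, and the other feature is ignored on that branch.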
Algorithm for Decision tree:
- Place the best attribute of the dataset, the one that separates the classes most cleanly, at the root.
- Split the training set into subsets, one for each outcome of the test on that attribute.
- Repeat the first two steps on each subset until every branch of the tree ends in a leaf node.
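The steps above can be sketched in plain Python. This is a simplified, Hunt-style illustration and not how sklearn implements its trees: it assumes numeric features, uses Gini impurity to decide which split is "best", and every function name and the toy data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Greedy step: try every (feature, threshold) pair and keep the purest split."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a useful split sends samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best


def build_tree(rows, labels):
    """Recursive step: split until a node is pure or cannot be split further."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority  # pure node becomes a leaf
    split = best_split(rows, labels)
    if split is None:
        return majority  # identical rows with mixed labels: fall back to majority
    _, f, t = split
    left = [(row, y) for row, y in zip(rows, labels) if row[f] <= t]
    right = [(row, y) for row, y in zip(rows, labels) if row[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }


def predict(tree, row):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Made-up toy data: [petal length, petal width] for six flowers.
rows = [[1.4, 0.2], [1.3, 0.2], [4.7, 1.4], [4.5, 1.5], [6.0, 2.5], [5.9, 2.1]]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, [6.1, 2.3]))
```

The sketch grows the tree until every leaf is pure, which is also why unrestricted trees tend to overfit; real implementations add stopping rules such as a maximum depth.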
Now, we have a basic idea of a decision tree. Let us start with the implementation.
Sklearn Decision Tree
We will use the well-known Iris dataset, which is freely available online.
The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis. Here we use it to train a decision tree with the sklearn library.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
Here we load the full dataset of 150 samples and later hold out 20% of it for testing.
Step 1: Importing the libraries and the dataset
We start by importing the general libraries and loading the Iris dataset from its UCI URL.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)
Now, let’s check how much data we have and what it looks like by printing dataset.shape and dataset.head():
(150, 5)
   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Step 2: Preprocessing and Train-test split
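So far, we have imported the libraries and loaded the dataset.
Next, we separate the data into input features (X) and output labels (y), and then divide it into training and testing sets. With test_size=0.20, 20% of the samples (30 of 150) are held out for testing. Because no random_state is passed to train_test_split, the split, and therefore the final score, can differ from run to run.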
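# Features: every column except the last; labels: the 'Class' column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)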
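To show what a random train-test split does conceptually, here is a rough pure-Python sketch. It is not sklearn's implementation: the function name, the seed parameter, and the placeholder data are assumptions made for illustration.

```python
import random


def simple_train_test_split(X, y, test_size=0.20, seed=None):
    """Shuffle the row indices, then hold out the first `test_size` fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Placeholder data with the same number of rows as the Iris dataset.
X_demo = [[i] for i in range(150)]
y_demo = [i % 3 for i in range(150)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_size=0.20, seed=0)
print(len(X_tr), len(X_te))
```

Fixing the seed makes the split repeatable, which is the same reason one would pass random_state to sklearn's train_test_split.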
Step 3: Feature scaling
Before making any predictions, it is common practice to scale the features so that they can all be evaluated on a comparable scale. Note that a decision tree splits on one feature at a time by comparing it with a threshold, so scaling is not strictly required for it; scaling matters most for distance-based models. Wikipedia explains the reasoning pretty well:
Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
With that in mind, let’s move on to feature scaling.
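from sklearn.preprocessing import StandardScaler

s = StandardScaler()

# Learn the mean and standard deviation from the training data only
s.fit(X_train)

# Apply the same transformation to both sets
X_train = s.transform(X_train)
X_test = s.transform(X_test)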
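What standardization computes for a single feature can be sketched by hand: subtract the training mean and divide by the training standard deviation. The helper names and the two test values below are hypothetical; the training values are the first five sepal lengths from the dataset preview.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(column), pstdev(column)


def transform(column, mu, sigma):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mu) / sigma for x in column]


train_sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
test_sepal_length = [5.4, 4.4]  # made-up values for illustration

mu, sigma = fit_scaler(train_sepal_length)
train_scaled = transform(train_sepal_length, mu, sigma)
test_scaled = transform(test_sepal_length, mu, sigma)  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

The test values are transformed with the statistics learned from the training values, so no information from the test set leaks into preprocessing.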
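Step 4: Creating the model and fitting it to the data
Now we create the model, fit it to the training data, and print it. The exact printed representation depends on the scikit-learn version; recent versions show only the parameters that differ from their defaults.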
from sklearn.tree import DecisionTreeClassifier

# Create decision tree classifier object
clf = DecisionTreeClassifier(random_state=0)

# Train model
model = clf.fit(X_train, y_train)
print(model)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
Step 5: Predictions and Scoring
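Finally, let’s check how well our model performs on the test set.
# Predict the class of each test sample
pred = model.predict(X_test)

# score() returns the mean accuracy on the given test data and labels
print("Score is", model.score(X_test, y_test))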
Score is 0.9666666666666667
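For a classifier, the score is the mean accuracy: the fraction of test samples whose predicted label matches the true label. With 30 test samples, the value above is consistent with 29 correct predictions. The sketch below computes accuracy by hand on hypothetical labels; the function name and the label lists are made up and are not the actual predictions from this run.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels: 30 test samples with exactly one mistake.
y_true = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 10 + ["Iris-virginica"] * 10
y_pred = list(y_true)
y_pred[15] = "Iris-virginica"  # one versicolor sample predicted as virginica

demo_score = accuracy(y_true, y_pred)
print("Score is", demo_score)  # 29 correct out of 30
```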