# Introduction to Decision Trees

A decision tree is essentially a binary tree flowchart in which each node splits a group of observations according to some feature variable. We will cover the basics of decision trees and implement one with sklearn.

The goal of a decision tree is to make the optimal choice at each node, so it needs an algorithm capable of doing just that. A classic approach is Hunt's algorithm, which is both greedy and recursive.

It is one of the predictive modeling approaches used in statistics, data mining, and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

## Considering a simple example

IS A PERSON FIT?

There are many factors, such as age, sex, and weight, to consider when determining whether a person is fit.

Let's take a few common factors to understand how a decision tree works.

First, we consider age: is the person younger than 30? A decision is made here.

If yes, then we check whether he eats a lot of pizza. If he does, we say he is unfit; if he does not, we say he is fit. Another decision is made here.

If no, then we check whether he exercises regularly. If he does not, we say he is unfit; if he does, we say he is fit.
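The fitness example above is just a small set of nested conditionals. A minimal sketch in Python (the threshold and feature names are illustrative, not taken from a real dataset):

```
# A decision tree is, at prediction time, a chain of if/else checks.
def is_fit(age: int, eats_lots_of_pizza: bool, exercises_regularly: bool) -> bool:
    if age < 30:
        # Young people: fitness hinges on diet in this toy example
        return not eats_lots_of_pizza
    # Older people: fitness hinges on regular exercise
    return exercises_regularly

print(is_fit(25, eats_lots_of_pizza=False, exercises_regularly=False))  # True
print(is_fit(45, eats_lots_of_pizza=False, exercises_regularly=False))  # False
```

Learning a tree means discovering these questions and thresholds from data, which is what the algorithm below does.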

## Algorithm for a decision tree

1. Place the best attribute of the dataset at the root.
2. Split the training set into subsets based on that attribute's values.
3. Repeat steps 1 and 2 on each subset until you reach leaf nodes in all branches of the tree.
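To make "best attribute" concrete: a common criterion (and sklearn's default, as the model output later shows) is Gini impurity. A minimal sketch of the greedy split search, with illustrative names; sklearn's real implementation is far more optimized:

```
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature):
    """Try every threshold on one feature; return (threshold, weighted impurity)."""
    best = (None, float("inf"))
    for t in sorted(set(r[feature] for r in rows)):
        left = [l for r, l in zip(rows, labels) if r[feature] < t]
        right = [l for r, l in zip(rows, labels) if r[feature] >= t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

rows = [{"age": 25}, {"age": 28}, {"age": 45}, {"age": 52}]
labels = ["fit", "fit", "unfit", "unfit"]
print(best_split(rows, labels, "age"))  # (45, 0.0): splitting at age 45 separates the classes
```

Hunt's algorithm applies this search greedily at each node, then recurses on the resulting subsets.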

Now, we have a basic idea of a decision tree. Let us start with the implementation.

## Step 1: Importing libraries and the dataset

We will make use of the widely known iris dataset. It's freely available online.

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. We shall use a decision tree from the sklearn library for better understanding.

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Here we will use a small portion of that dataset.

So, let's import the general libraries and load the iris dataset from a URL.

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)
```

Now, let's check how much data we have and what it looks like.

```
print(dataset.shape)
print(dataset.head())
```

Output:

```
(150, 5)
   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
```

## Step 2: Preprocessing and train-test split

We have already imported the libraries and the dataset. Next, we separate the data into input features and output labels, then split it into training and testing sets.

```
from sklearn.model_selection import train_test_split

# Features are the first four columns; labels are the Class column
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
```

## Step 3: Feature scaling

Before making any actual predictions, it is always good practice to scale the features so that all of them can be evaluated uniformly. Wikipedia explains the reasoning well:

Since the range of values of raw data varies widely, in some machine learning algorithms objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by that particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
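A quick numeric illustration of that point (the values below are made up for the example): with one broad-range feature, Euclidean distance is dominated by it until the features are standardized.

```
import numpy as np

a = np.array([1.0, 50000.0])   # e.g. a small-range feature and a large-range one
b = np.array([3.0, 51000.0])

# Raw distance: sqrt(2^2 + 1000^2) -- almost entirely the second feature
raw_dist = np.linalg.norm(a - b)
print(raw_dist)  # ~1000.002

# After standardizing each feature (toy two-point standardization),
# both features contribute comparably to the distance.
stacked = np.vstack([a, b])
scaled = (stacked - stacked.mean(axis=0)) / stacked.std(axis=0)
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~2.828, split evenly between features
```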

Hence, moving on to feature scaling.

```
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then apply it to both sets
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```

## Step 4: Creating the model and fitting the data

Now we will create the model, fit our data to it, and display the result.

```
from sklearn.tree import DecisionTreeClassifier

# Create decision tree classifier object
clf = DecisionTreeClassifier(random_state=0)

# Train the model
model = clf.fit(X_train, y_train)
print(model)
```

Output:

```
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=0, splitter='best')
```

## Step 5: Predictions and Scoring

Finally, let's check how well our model is performing.

```
pred = model.predict(X_test)
print("Score is", model.score(X_test, y_test))
```

Output:

```
Score is 0.9666666666666667
```
