A major facet of machine learning is classification: the process of predicting the class of an object based on previously seen examples. Many algorithms are used for classification, such as decision trees and logistic regression. In addition, there are meta-algorithms that are used on top of existing algorithms to make better predictions. The random forest is one of them. Before we start with random forests, we must know a few fundamental things. Then we shall move on to the algorithm and its implementation in R.
What is a meta-algorithm?
A meta-algorithm is an algorithm that builds on an existing learning algorithm to create a new model, typically by training it several times and combining the results. This model is expected to make better predictions than the original algorithm on its own. The original algorithm is called the base learner. In the case of random forests, the base learner is a decision tree.
What is a decision tree?
A decision tree is one of the simplest machine learning algorithms there is. It is a model that learns a series of questions from the training data and arranges them as a flow chart. To make a prediction for a new observation, we follow the flow chart from the top until we reach a leaf. You can read more about decision trees under the first point of this post.
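To make the flow chart idea concrete, here is a minimal sketch that fits a single decision tree on the built-in iris data. It assumes the rpart package, which is not part of this post and is only one of several ways to grow a tree in R.

```r
library(rpart)  # assumption: rpart is used here only for illustration

# Fit a single classification tree on the built-in iris data
tree <- rpart(Species ~ ., data = iris, method = "class")

# Printing the fitted tree lists the questions (splits) it asks
print(tree)

# Follow the flow chart for the first few flowers to get their predicted class
tree_pred <- predict(tree, newdata = head(iris), type = "class")
print(tree_pred)
```

Each line of the printed tree is one question about an attribute, and the leaves carry the predicted species.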
What is ensemble learning?
Ensemble learning means training several models and combining their predictions. A common way of doing this is to use the same algorithm over and over again with different random samples of the training data. This often improves the quality of predictions noticeably, which is why ensembles are in demand in industry. Ensemble learning is therefore an important topic to learn if your long-term goal is to work on industrial applications. A random forest is an ensemble of decision trees.
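The idea of reusing one algorithm on different samples of the data can be sketched by hand. The snippet below is only an illustration under our own assumptions: it uses the rpart package for the trees, 25 trees, and bootstrap samples (rows drawn with replacement), and it combines the trees by majority vote. It is not what the randomForest package does internally.

```r
library(rpart)  # assumption: rpart supplies the base learner in this sketch

set.seed(42)
n_trees <- 25          # illustrative ensemble size
n <- nrow(iris)

# Train each tree on its own bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  rows <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[rows, ], method = "class")
})

# Collect every tree's vote for every flower: one row per flower, one column per tree
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})

# Majority vote across the trees
ensemble_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the known labels (measured on the training data, so it is optimistic)
agreement <- mean(ensemble_pred == iris$Species)
print(agreement)
```

Because the agreement is measured on the same rows the trees were trained on, it should not be read as a test accuracy.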
How does the algorithm work?
The forest consists of n decision trees, where n is a positive integer that we choose. Each tree is trained on its own random sample of the training data, drawn with replacement (a bootstrap sample), and at each split it considers only a random subset of the attributes. To make the final prediction, the predictions of all the trees are combined: by majority vote for classification and by averaging for regression. This usually gives a much better prediction than a single tree, and the reason is interesting. If we take a number of decision trees, some of them will overfit, just as an ordinary decision tree does. However, the noise that one tree has fitted is mostly absent from the data seen by the other trees in the ensemble. So, when we combine the results of all the trees, much of the effect of overfitting averages out.
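In the standard formulation of the algorithm, the randomness comes from two places: the rows each tree sees and the attributes each split may use. The base R sketch below shows both for a single hypothetical tree. The variable names are our own, and the subset size of floor(sqrt(p)) mirrors the default the randomForest package uses for classification.

```r
set.seed(1)
n <- nrow(iris)
predictors <- setdiff(names(iris), "Species")

# 1. Bootstrap sample: draw n row numbers with replacement for one tree
boot_rows <- sample(n, n, replace = TRUE)
oob_rows <- setdiff(seq_len(n), boot_rows)   # rows this tree never sees ("out of bag")
oob_fraction <- length(oob_rows) / n         # roughly 0.37 on average

# 2. Random attribute subset: the candidates considered at one split
mtry <- floor(sqrt(length(predictors)))      # 2 for the four iris predictors
split_candidates <- sample(predictors, mtry)

print(oob_fraction)
print(split_candidates)
```

The rows a tree never sees are what the package later uses to estimate the error without a separate test set.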
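To add to that, the bias of the ensemble stays low: each tree is grown deep, so it has little bias of its own, and averaging leaves the bias roughly where a single tree has it. What averaging mainly does is reduce the variance. As a result, a random forest strikes a good bias-variance trade-off and often gives very good results. After all, much of machine learning is about managing the bias-variance trade-off of a learner.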
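A small simulation can illustrate why combining many models reduces variance. It is a deliberately simplified sketch: each hypothetical model's error is drawn as independent noise with zero bias, whereas the trees in a real forest are correlated, so the reduction in practice is smaller than what this shows.

```r
set.seed(7)
n_repeats <- 5000   # how many times we repeat the experiment
n_models <- 25      # size of the hypothetical ensemble

# Each entry is one model's prediction error: zero bias, standard deviation 1
errors <- matrix(rnorm(n_repeats * n_models), nrow = n_repeats, ncol = n_models)

var_single <- var(errors[, 1])         # spread of one model on its own
var_average <- var(rowMeans(errors))   # spread of the 25-model average

print(c(single = var_single, averaged = var_average))
```

The averaged error stays centred on zero while its spread shrinks, which is the behaviour the paragraph above describes.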
How to implement a random forest in R?
Implementing this algorithm is fairly easy given how good its results are. You first have to install the randomForest package in R and load it:
install.packages("randomForest")
library(randomForest)
The dataset we are using is iris, which comes pre-loaded with R. It has the following attributes: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species. You can have a look at it by using the head function. Then we shall divide the data set into training and test data sets, using a random 80% of the rows for training.
head(iris)
set.seed(123)
# pick a random 80% of the row numbers for training
index <- sample(nrow(iris), 0.8 * nrow(iris))
train <- iris[index, ]
test <- iris[-index, ]
Now, we shall create a random forest on this dataset. For this, we use the randomForest function of the randomForest package. We have to give it a formula that names the attribute to be predicted and the attributes used to make the predictions.
myForest <- randomForest(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = train)
print(myForest)
The print statement will display a summary of the trained model, including the out-of-bag (OOB) estimate of the error rate and a confusion matrix. After this, we can use the predict function to make predictions on the test dataset.
pred <- predict(myForest, newdata = test)
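To turn the predictions into an accuracy figure, we can compare them with the true species of the test rows. The sketch below repeats the steps of this section in one self-contained piece, assuming a random 80/20 split. The names conf_matrix and accuracy are our own.

```r
library(randomForest)

set.seed(123)
index <- sample(nrow(iris), 0.8 * nrow(iris))   # random 80% of the rows
train <- iris[index, ]
test <- iris[-index, ]

myForest <- randomForest(Species ~ ., data = train)
pred <- predict(myForest, newdata = test)

# Confusion matrix: predicted class against actual class on the test rows
conf_matrix <- table(Predicted = pred, Actual = test$Species)
print(conf_matrix)

# Fraction of test rows classified correctly
accuracy <- mean(pred == test$Species)
print(accuracy)
```

The exact numbers depend on the seed and on which rows end up in the test set.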
We can then compare pred with the actual species in test to see the results and the accuracy of the model. Ensemble models usually reach a noticeably higher accuracy than their base learner alone. The best thing about tree-based models such as random forests is that they are applicable almost anywhere. When k-nearest neighbors is not applicable and when we do not know how to design a neural network for a particular problem, a random forest is often a good way out.
Try this algorithm on any classification example. You can try out the following datasets: Glass Identification and Annealing. Further, you can control the number of trees in the forest with the ntree argument and play around with your classification model. Note that a random forest may not be the best choice when the number of attributes in your dataset is very high, especially if only a few of them are informative. The random subset of attributes considered at each split will then often contain nothing useful, which leads to many poor splits and can result in a poorer fit. In such cases, SVMs for classification and neural networks for regression may be a better way to go. Try practicing more machine learning problems in order to learn when to use which algorithm.
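As a starting point for playing with ntree, the sketch below fits forests of several sizes on iris and reads the out-of-bag error of each from the err.rate component of the fitted object. The particular tree counts are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(123)
tree_counts <- c(10, 50, 100, 500)   # illustrative values for ntree

# Fit one forest per tree count and keep its final out-of-bag error
oob_error <- sapply(tree_counts, function(k) {
  forest <- randomForest(Species ~ ., data = iris, ntree = k)
  # err.rate has one row per tree; the "OOB" column is the running OOB error
  forest$err.rate[k, "OOB"]
})
names(oob_error) <- tree_counts
print(oob_error)
```

On a small, easy dataset like iris the error tends to level off quickly, so do not expect large differences here. Harder datasets usually show the effect more clearly.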