How to read a boxplot: Visualization with statistics
We observe our environment and make decisions based on how our cognitive abilities interact with it. Statistics works in a similar way: data is collected from the environment, then processed and analyzed into usable information. Statistics is the study, interpretation, and presentation of data, with the aim of deriving information under some predefined conditions. Graphs and plots help us understand data better, because visualization is one of the best ways to pass information on to users and stakeholders. The boxplot is one such plot: it combines statistical summaries with visualization so that we can make effective observations. John W. Tukey introduced the boxplot in 1969 in an article and later in his book, Exploratory Data Analysis. So how do we read a boxplot? Here we are going to study how to read this visually appealing plot.
The greatest value of a picture is when it forces us to notice what we never expected to see.
John W. Tukey, 1977
How to read a boxplot: Study of the distribution
Statistics is the study and analysis of the distribution of data. We analyze the distribution and, based on that analysis, arrive at decisions. Statistics is broadly divided into two branches: descriptive statistics and inferential statistics. Descriptive statistics summarises the sample data using measures such as the mean and the standard deviation. Inferential statistics draws conclusions about a larger population from a random sample of data. The boxplot is one of the plots used to display the descriptive statistics of a sample.
What is a boxplot?
We use different plots to study the data elements in a given sample of a dataset. Basic plots such as the histogram, bar chart, and line chart are available to draw patterns out of the data. Similarly, we use the boxplot to study the pattern of data: it shows the spread of the data and the shape of its distribution. It is used especially in exploratory data analysis. With a boxplot we can view, and then compare, the interquartile range of different categories. The plot also tells us whether outliers are present in each of the selected categories. Thus it gives an idea of the variability of a data distribution.
How to read a boxplot: Usage
A boxplot is a figure for graphically analyzing the spread of data. The box spans from the first quartile (Q1, the 25th percentile) to the third quartile (Q3, the 75th percentile). The line inside the box marks the second quartile (Q2, the 50th percentile), which is the median of that particular category; it divides the sample into two equal halves. The length of the rectangular box is the interquartile range (IQR = Q3 - Q1) of the sample. The whisker extending from the box toward the lowest point represents the minimum value, and the whisker toward the highest point represents the maximum. When outliers are drawn as separate points, as in Tukey's version of the plot, each whisker instead ends at the most extreme value that lies within 1.5 times the IQR of the box.
Its usefulness lies in its simplicity: as a data visualization, it gives a quick view of five statistical figures, namely the minimum, the lower quartile, the median, the upper quartile, and the maximum.
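The five figures can be computed directly, which helps when checking what a boxplot is showing. The following is a minimal sketch using only Python's standard library; the sample values and the function name five_number_summary are invented for illustration, and the "inclusive" quartile method is just one of several conventions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the five figures a boxplot displays for one sample."""
    data = sorted(values)
    # method="inclusive" treats the sample minimum and maximum as the
    # 0th and 100th percentiles; other quartile conventions exist and
    # can give slightly different Q1 and Q3 on small samples.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}


# Hypothetical sample, invented for illustration.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(sample)
iqr = summary["q3"] - summary["q1"]  # length of the box

print(summary)
print("IQR:", iqr)
```

Plotting libraries may use a different quartile convention, so the box edges they draw can differ slightly from these numbers on small samples.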
How to read a boxplot: Analysis
As discussed, the boxplot displays the descriptive statistics of a sample dataset. The length of the box tells us about the variability of the data, and the line across the box tells us where the center of the data lies. The relative lengths of the two whiskers (left and right, or bottom and top), taken together with the position of the line inside the box, can also give us a fair idea of the shape of the distribution, provided the sample is large enough (with a small sample the plot can give a misleading reading). The box gives only an indirect idea of the standard deviation and variance: it shows the IQR, which for roughly normal data is about 1.35 standard deviations.
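How the length of the box relates to the standard deviation can be checked numerically. For normally distributed data the IQR is about 1.35 standard deviations, so dividing the IQR by that factor gives a rough estimate of the standard deviation. The sketch below assumes simulated normal data (mean 50, standard deviation 10, chosen arbitrarily); the relationship does not hold for skewed or heavy-tailed samples.

```python
import random
from statistics import NormalDist, quantiles, stdev

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical data: 10,000 draws from a normal distribution (mean 50, sd 10).
sample = [random.gauss(50, 10) for _ in range(10_000)]

q1, _, q3 = quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1  # length of the box

# For a normal distribution, IQR = 2 * z(0.75) * sigma, about 1.349 * sigma.
iqr_per_sigma = 2 * NormalDist().inv_cdf(0.75)
sigma_from_box = iqr / iqr_per_sigma

print(f"standard deviation from the data: {stdev(sample):.2f}")
print(f"standard deviation from the box:  {sigma_from_box:.2f}")
```

The two printed values should be close for this simulated sample; on real data a large gap between them is itself a hint that the distribution is not normal.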
For instance, if the top or right whisker (depending on whether the plot is drawn vertically or horizontally) is much longer than the bottom or left whisker and the central line sits toward the bottom or left of the box, the sample distribution is right-skewed. In the mirror-image case, with a longer bottom or left whisker and the line toward the top or right, it is left-skewed. If both whiskers are of comparable length and the line is placed near the center of the box, we can judge the distribution to be roughly symmetric about its center. A bell curve produces such a symmetric boxplot, but symmetry alone does not prove that the data is normal, since a uniform or a bimodal distribution can look similar. The strength of the boxplot lies in this simplicity: the length of the tails in conjunction with the line in the box tells us about the type of distribution. Long whiskers relative to the box, or many points beyond them, also hint at heavy tails, although the boxplot gives only a rough idea of the kurtosis of the distribution.
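The position of the median line inside the box can be turned into a number. One common choice, which the discussion above does not name, is Bowley's quartile skewness, (Q3 + Q1 - 2*Q2) / (Q3 - Q1): it is positive when the median sits closer to Q1 (right skew) and negative when it sits closer to Q3 (left skew). The sketch below looks only at the box, not the whiskers; the samples and the 0.1 tolerance are arbitrary assumptions.

```python
from statistics import quantiles


def quartile_skewness(values):
    """Bowley's quartile skewness: (Q3 + Q1 - 2*Q2) / (Q3 - Q1)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    if q3 == q1:
        return 0.0  # the box has zero length, so the ratio is undefined; report 0
    return (q3 + q1 - 2 * q2) / (q3 - q1)


def describe_shape(values, tolerance=0.1):
    """Label a sample by where the median sits in the box (tolerance is arbitrary)."""
    skew = quartile_skewness(values)
    if skew > tolerance:
        return "right-skewed"
    if skew < -tolerance:
        return "left-skewed"
    return "roughly symmetric"


# Hypothetical samples, invented for illustration.
long_right_tail = [1, 2, 2, 3, 3, 4, 5, 8, 15]
balanced = [1, 2, 3, 4, 5, 6, 7, 8, 9]
long_left_tail = [-x for x in long_right_tail]

for name, data in [("long right tail", long_right_tail),
                   ("balanced", balanced),
                   ("long left tail", long_left_tail)]:
    print(name, "->", describe_shape(data))
```

As with the visual reading, this label is unreliable on small samples and should be treated as a hint rather than a test.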
Inferences from the visual
Every dataset has some values that lie far from the typical observations and that can affect the statistical analysis. They may be due to natural variation in the distribution or to experimental error. Any data point lying more than 1.5 times the IQR below Q1 or above Q3 needs to be analyzed further and, if warranted, discarded from the sample, because such values can distort the range and other statistics of the dataset. For example, they can affect clustering algorithms if the data is going to be used for predictive modeling. The boxplot is thus one of the simplest ways to identify outliers.
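The 1.5 times IQR rule can be sketched in a few lines. The limits Q1 - 1.5*IQR and Q3 + 1.5*IQR are often called Tukey's fences. In the example below the sample, the function names, and the quartile convention are assumptions made for illustration, and a flagged point is a candidate for review, not automatically an error.

```python
from statistics import quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, k=1.5):
    """Separate the points inside the fences from those outside them."""
    lower, upper = tukey_fences(values, k)
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


# Hypothetical sample with one suspicious value.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 30]
inliers, outliers = split_outliers(sample)

# With outliers drawn separately, the whiskers end at the most
# extreme values that are still inside the fences.
whiskers = (min(inliers), max(inliers))

print("fences:", tukey_fences(sample))
print("outliers:", outliers)
print("whisker ends:", whiskers)
```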
We use parallel boxplots for comparative analysis of different categories or samples. Because each box condenses a sample into a few numbers, they often make differences between categories or samples easier to see than plots like histograms or bar charts do. These side-by-side boxplots provide summarised information for each category of the categorical variable in the plot.
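The comparison that parallel boxplots make visually can also be done on the numbers behind each box. The sketch below computes the median and IQR per category and picks out the category with the widest box and the one with the highest median; the courier names and delivery times are hypothetical data invented for illustration.

```python
from statistics import quantiles

# Hypothetical delivery times in minutes for three couriers.
groups = {
    "A": [21, 23, 24, 25, 26, 27, 29],
    "B": [18, 22, 27, 31, 35, 40, 44],
    "C": [30, 31, 31, 32, 33, 33, 34],
}


def box_stats(values):
    """Median (the line in the box) and IQR (the length of the box)."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1}


stats = {name: box_stats(values) for name, values in groups.items()}
for name, s in stats.items():
    print(f"{name}: median={s['median']}, IQR={s['iqr']}")

# The category whose box would be longest, and the one whose line sits highest.
widest = max(stats, key=lambda name: stats[name]["iqr"])
highest = max(stats, key=lambda name: stats[name]["median"])
print("most variable:", widest)
print("highest median:", highest)
```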
The boxplot is thus a simple yet powerful visual map that helps on the path of exploration in statistics and data science. We should make ourselves well versed in studying data with boxplots in order to make data preparation and data analysis useful and effective.