Hierarchical clustering



Hierarchical methods form the backbone of cluster analysis. As the name suggests, hierarchical clustering builds a hierarchy of clusters. The algorithm starts with every data point assigned to its own cluster. At each step, the two nearest clusters are merged, and the algorithm terminates when only a single cluster remains.

The need for hierarchical clustering arises naturally in domains where we must not only discover similarity-based groups but also organize them.
It is an alternative approach to k-means clustering for identifying groups in a dataset, and it does not require the number of clusters to be specified in advance.

What exactly is Hierarchical clustering?

It refers to a family of clustering algorithms that build tree-like clusters by successively merging or splitting them. This hierarchical structure is represented using a tree.
Hierarchical clustering methods use a distance (or similarity) measure to decide which clusters to combine or split. The recursive process continues until only one cluster remains, or until no cluster can be split further. We can use a dendrogram to represent the hierarchy of clusters.

In general, the merges and splits are determined in a greedy manner, and the results of hierarchical clustering are usually presented in a dendrogram.
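As a small illustration of this greedy process, here is a sketch using scipy on made-up 2-D points (the data and parameters are assumptions for the example, not from the article). The `linkage` function performs the successive merges, and its output is exactly what `dendrogram` draws:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Six made-up 2-D points forming two loose groups (illustrative only).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Greedily merge the two nearest clusters at each step (Ward linkage here).
Z = linkage(X, method="ward")

# Z encodes the hierarchy: n - 1 rows, one per merge, holding the ids of
# the two merged clusters, the merge distance, and the new cluster's size.
print(Z.shape)        # (5, 4)
print(int(Z[-1, 3]))  # 6 -- the final merge contains all points

# dendrogram(Z) would draw the tree (requires matplotlib); omitted here.
```

Each row of `Z` corresponds to one greedy merge, so reading it top to bottom replays the whole clustering.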


A few complexity details:

The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of O(n³) and requires O(n²) memory, which makes it too slow for even medium-sized data sets. However, for some special cases, optimal efficient agglomerative methods of complexity O(n²) are known: SLINK for single-linkage and CLINK for complete-linkage clustering.

With a heap, the runtime of the general case can be reduced to O(n² log n) at the cost of further increasing the memory requirements. In many programming languages, the memory overhead of this approach is too large to make it practically usable.

Except for the special case of single-linkage, none of these algorithms (apart from exhaustive search in O(2ⁿ)) is guaranteed to find the optimum solution.

Divisive clustering with an exhaustive search runs in O(2ⁿ), but it is common to use faster heuristics, such as k-means, to choose the splits.


Difference between K Means and Hierarchical clustering

  • Hierarchical clustering does not scale to big data as well as K-means does, because K-means runs in linear time, O(n), while hierarchical clustering is at least quadratic, O(n²).
  • K-means starts from a random choice of initial clusters, so the results of running the algorithm multiple times may differ. Results are reproducible in hierarchical clustering.
  • K-means is found to work well when the clusters are hyperspherical (like a circle in 2-D or a sphere in 3-D).
  • K-means requires prior knowledge of K, the number of clusters to divide the data into. In hierarchical clustering, you can instead stop at whatever number of clusters you find appropriate by interpreting the dendrogram.
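That last point, stopping at a chosen number of clusters by cutting the dendrogram, can be sketched with scipy's `fcluster` (the 1-D data below is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D data: two well-separated groups.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
Z = linkage(X, method="single")

# Cut the dendrogram at distance 1.0: points that merge below that
# height end up in the same flat cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(len(set(labels)))   # 2

# Alternatively, cap the number of flat clusters directly.
labels_k = fcluster(Z, t=3, criterion="maxclust")
```

The same tree `Z` can be cut at different heights, yielding different numbers of clusters without re-running the algorithm.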


A dendrogram is a tree-like structure frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

Hierarchical classifications are produced by either of two approaches:

  • Agglomerative
  • Divisive

Either route, agglomerative or divisive, may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each stage of the analysis.

In a hierarchical classification, the data are not partitioned into a particular number of classes or clusters at a single step. Instead, the classification consists of a series of partitions, which may run from a single cluster containing all individuals to n clusters each containing a single individual.

These clustering algorithms can be either bottom-up or top-down.


Agglomerative clustering

Agglomerative clustering is a bottom-up technique: it starts by considering each data point as its own cluster, then merges clusters into larger and larger groups until a single cluster containing all points remains.
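A minimal bottom-up example, assuming scikit-learn and made-up data (the points and parameters are illustrative, not from the article):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Made-up points: two visually obvious groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 5.1], [5.1, 5.3]])

# Each point starts in its own cluster; the two closest clusters are
# merged repeatedly until n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)   # e.g. [0 0 0 1 1 1] (cluster ids are arbitrary)
```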


Divisive clustering

Divisive clustering is the opposite: it starts with one cluster, which is then divided into two according to the similarities or distances in the data. These new clusters are then divided further, and so on, until each case forms its own cluster.
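As noted earlier, k-means is a common heuristic for choosing the splits. Here is a top-down sketch using recursive 2-means (a "bisecting k-means" heuristic; the helper function and data below are illustrative assumptions, not a standard library API):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    """Top-down sketch: repeatedly split the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]   # start with one cluster holding everything
    while len(clusters) < n_clusters:
        # heuristic: split the largest remaining cluster
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        members = clusters.pop(i)
        split = KMeans(n_clusters=2, n_init=10,
                       random_state=0).fit_predict(X[members])
        clusters.append(members[split == 0])
        clusters.append(members[split == 1])
    return clusters

# Made-up data: three pairs of nearby points.
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
              [5.1, 5.2], [10.0, 10.0], [10.1, 10.2]])
parts = bisecting_kmeans(X, 3)
print(len(parts))   # 3
```

Stopping the recursion at `n_clusters` instead of running until every case is alone is the usual practical shortcut.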


Clustering linkage comparison

Describing the bottom-up approach, i.e. the agglomerative algorithm, in more detail:

  1. Start with each point in its own cluster.
  2. Compare each pair of data points using a distance metric, such as Euclidean distance.
  3. Use a linkage criterion to merge data points (at the first stage) or clusters (in subsequent stages), where the linkage is defined by a function such as:
    • Maximum or complete linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the largest (maximum) of these dissimilarities as the distance between the two clusters. It tends to produce compact clusters.
    • Minimum or single linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the smallest of these dissimilarities as the linkage criterion. It tends to produce long, “loose” clusters.
    • Mean or average linkage clustering: computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the average of these dissimilarities as the distance between the two clusters.
    • Centroid linkage clustering: computes the dissimilarity between the centroid of cluster 1 (a mean vector of length p, the number of variables) and the centroid of cluster 2.
    • Ward’s minimum variance method: minimizes the total within-cluster variance. At each step, the pair of clusters whose merger yields the minimum increase in total within-cluster variance is merged.
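The criteria above can be compared on the same data with scipy's `linkage`, which accepts each of them by name (the random points below are made up; only the merge heights are inspected):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))    # made-up random points

# Same data, different linkage criteria -> different merge heights.
heights = {}
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)
    heights[method] = Z[-1, 2]  # height of the final merge
    print(method, round(heights[method], 3))
```

Running this shows how the choice of linkage changes the scale (and often the shape) of the resulting tree even though the data is identical.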

To summarize this article, let's consider the advantages and disadvantages.

Advantages:

  • It can produce an ordering of the objects, which may be informative for data display.

  • Smaller clusters are generated, which may be helpful for discovery.


Disadvantages:

  • No provision can be made for relocating objects that were ‘incorrectly’ grouped at an early stage, so the result should be examined closely to ensure it makes sense.

  • Using different distance metrics for measuring distances between clusters may generate different results. It is therefore recommended to run multiple experiments and compare the results to confirm that the original findings hold.


