Hierarchical clustering
Hierarchical methods form the backbone of cluster analysis. As the name suggests, Hierarchical clustering is an algorithm that builds hierarchy of clusters.This algorithm starts with all the data points assigned to a cluster of their own. Then two nearest clusters are merged into the same cluster. In the end, this algorithm terminates when there is only a single cluster left.
The need for hierarchical clustering naturally emerges in domains where it is not only required to discover similaritybased groups but also need to organize them.
This clustering is an alternative approach to kmeans clustering . It is used for identifying groups in the dataset. And does not require to prespecify the number of clusters to generate.
What exactly is Hierarchical clustering?
It refers to a set of clustering algorithms that build treelike clusters by successively splitting or merging them. This hierarchical structure is represented using a tree.
Hclustering methods use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left .Or we cannot split more clusters. We can use a dendrogram to represent the hierarchy of clusters.
In general
the merges and splits are determined in a greedy manner. The results of hierarchical clustering are usually presented in a dendrogram.
A little complex details :
The standard algorithm for hierarchical agglomerative clustering (HAC) has a time complexity of .It requires memory, which makes it too slow for even medium data sets. However, for some special cases, optimal efficient agglomerative methods (of complexity ) are known. These are SLINK for singlelinkage and CLINK for completelinkage clustering.
With a the runtime of the general case can be reduced to at the cost of further increasing the memory requirements. In many programming languages, the memory overheads of this approach are too large to make it practically usable.
Except for the special case of singlelinkage, none of the algorithms (except exhaustive search in ) can be guaranteed to find the optimum solution.
Divisive clustering with an exhaustive search is , but it is common to use faster heuristics to choose splits, such as kmeans.
CLUSTERING TECHNIQUES:
Difference between K Means and Hierarchical clustering
 Hclustering can’t handle big data well but K Means clustering can. This is because the time complexity of K Means is linear i.e. O(n). While that of hierarchical clustering is quadratic i.e. O(n^{2}).
 In K Means clustering, since we start with random choice of clusters. The results produced by running the algorithm multiple times might differ. While results are reproducible in Hclustering.
 K Means is found to work well when the shape of the clusters is hyper spherical (like circle in 2D, sphere in 3D).
 K Means clustering requires prior knowledge of K i.e. no. of clusters you want to divide your data into. But, you can stop at whatever number of clusters you find appropriate. In hierarchical clustering it can be done by interpreting the dendrogram.
Dendrogram
A dendrogram is a treelike structure frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.
Hierarchical classifications produced by either
 Agglomerative
 Divisive
The agglomerative or divisive route may be represented by a twodimensional diagram known as a dendrogram. It illustrates the fusions or divisions made at each stage of the analysis . Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes in the cluster.
In a hierarchical classification, the data are not partitioned into a particular number of classes or clusters at a single step. Instead, the classification consists of a series of partitions, which may run from a single cluster containing all individuals to n clusters each containing a single individual.
These clustering algorithms can be either bottomup or topdown.
Agglomerative clustering
Agglomerative clustering is Bottomup technique start by considering each data point as its own cluster .And merging them together into larger groups from the bottom up into a single giant cluster.
To learn Machine learning from End to End check here
Divisive clustering
Divisive clustering is the opposite, it starts with one cluster, which is then divided in two as a function of the similarities or distances in the data. These new clusters are then divided, and so on until each case is a cluster.
Clustering linkage comparison
Describing the bottomup approach in the detailed manner i.e. agglomerative algorithm
 Start with each point in its own cluster.
 Compare each pair of data points using a distance metric. We can use any of the methods discussed above.
 Use a linkage criterion to merge data points (at the first stage) or clusters (in subsequent phases). Where the linkage is represented by a function such as:
 Maximum or complete linkage clustering : It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2. And considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
 Minimum or single linkage clustering : It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2. And considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
 Mean or average linkage clustering : It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2. And considers the average of these dissimilarities as the distance between the two clusters.
 Centroid linkage clustering : It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2. .
 Ward’s minimum variance method : It minimizes the total withincluster variance. The pair of clusters with minimum betweencluster distance are merged at each step.
To summarize this article let’s consider the advantages and disadvantages.
Advantages

It can produce an ordering of the objects, which may be informative for data display.

Smaller clusters are generated, which may be helpful for discovery.
Disadvantages

We can make no provision for a relocation of objects that may have been ‘incorrectly’ grouped at an early stage. To ensure the result makes sense it should be examined closely.

Use of different distance metrics for measuring distances between clusters may generate different results. It is recommended to perform multiple experiments. After that, comparing the results to support the efficiency of the original results.
For more knowledge and implementation one can refer to following:
https://github.com/ZwEin27/HierarchicalClustering