Understand Hierarchical Clustering, a method that builds clusters step‑by‑step to reveal data structure. This lesson explains dendrograms, linkage methods, and how to identify natural groupings without predefining the number of clusters.
Hierarchical clustering is an unsupervised machine learning technique that builds a hierarchy of clusters by either merging or splitting data points progressively. Unlike K-Means, hierarchical clustering does not require you to specify the number of clusters beforehand, making it particularly useful for exploratory data analysis.
The result of hierarchical clustering is a tree-like diagram called a dendrogram, which visualizes how clusters are formed at different levels of similarity. This provides valuable insights into the natural structure of your data.
Hierarchical clustering can be performed using two opposite approaches, each with distinct advantages.
Agglomerative clustering is the most common approach: every data point starts as its own cluster, and at each step the two closest clusters are merged. This bottom-up approach builds clusters by progressively combining smaller groups into larger ones.
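As a rough illustration only (not how the libraries implement it), the bottom-up procedure can be sketched with a naive O(n³) loop using single linkage; the function name and toy data are made up for this example:

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Toy bottom-up clustering: repeatedly merge the two closest
    clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best, best_dist = (0, 1), np.inf
        # Find the pair of clusters with the smallest single-linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_dist, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_agglomerative(pts, 2))  # two tight pairs: [[0, 1], [2, 3]]
```

Real implementations avoid the repeated pairwise scans; this sketch only exists to make the merge loop concrete.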
Divisive clustering works in the opposite direction: all points begin in a single cluster, which is recursively split into smaller clusters. While theoretically valid, divisive clustering is computationally expensive and less commonly used in practice.
When merging clusters, you need a method to measure the distance between them. The choice of linkage method significantly affects the resulting cluster structure.
Single linkage measures the distance between the two closest points in different clusters. It can follow elongated shapes but is prone to "chaining," where a cluster grows by absorbing nearby points one at a time.
Complete linkage measures the distance between the two farthest points in different clusters, which tends to produce compact clusters of similar diameter.
Average linkage calculates the average distance between all pairs of points in two clusters, a compromise between the single and complete extremes.
Ward's method merges the pair of clusters that minimizes the increase in total within-cluster variance, and tends to produce balanced, roughly spherical clusters.
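To make the differences concrete, here is a small sketch computing all four distances between two hand-made one-line clusters (the data is illustrative; Ward is shown as the increase in within-cluster variance if the two clusters were merged):

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # cluster B

# All pairwise distances between points of A and points of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

print("single  :", pairwise.min())   # closest pair        -> 3.0
print("complete:", pairwise.max())   # farthest pair       -> 6.0
print("average :", pairwise.mean())  # mean over all pairs -> 4.5

# Ward: increase in total within-cluster variance caused by merging A and B
ca, cb = A.mean(axis=0), B.mean(axis=0)
ward = len(A) * len(B) / (len(A) + len(B)) * np.sum((ca - cb) ** 2)
print("ward    :", ward)             # -> 20.25
```

Note how single and complete linkage look only at one extreme pair, while average linkage and Ward's method use all points in both clusters.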
Hierarchical clustering is particularly valuable in scenarios where understanding relationships between groups matters, such as building taxonomies in biology, analyzing gene expression patterns, or segmenting customers into nested market tiers.
Let's implement agglomerative hierarchical clustering using scikit-learn and SciPy.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Data shape: {X_scaled.shape}")
We create a small dataset with 50 samples for clearer dendrogram visualization. Scaling ensures fair distance calculations.
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute the linkage matrix using Ward's method
linkage_matrix = linkage(X_scaled, method='ward')
print(f"Linkage matrix shape: {linkage_matrix.shape}")
The linkage function computes the hierarchical clustering encoded as a linkage matrix. Each row contains information about which clusters were merged and at what distance.
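With the linkage matrix in hand, drawing the dendrogram is a single call. A minimal sketch (the `Agg` backend and output filename are arbitrary choices for a headless environment; in a notebook you would use `plt.show()` instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; swap for plt.show() in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
linkage_matrix = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix)
plt.title("Hierarchical Clustering Dendrogram (Ward linkage)")
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.savefig("dendrogram.png")
```

The height at which two branches join is the merge distance, so long vertical gaps in the tree suggest natural places to cut.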
# First few merges in the linkage matrix
# Columns: [cluster1, cluster2, distance, num_points]
print("First 5 merges:")
print(linkage_matrix[:5])
Each row in the linkage matrix represents a merge operation, showing which two clusters combined, the distance at which they merged, and the resulting cluster size.
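One detail worth noting: in each row, labels 0 through n−1 refer to the original points, while labels n and above refer to clusters created by earlier merges. A small sketch on a four-point toy dataset makes this visible:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0], [0.2], [5.0], [5.3]])  # two tight pairs on a line
Z = linkage(pts, method='single')
n = len(pts)

def describe(c):
    """Labels < n are original points; labels >= n are earlier merges."""
    c = int(c)
    return f"point {c}" if c < n else f"cluster from step {c - n}"

for step, (c1, c2, dist, size) in enumerate(Z):
    print(f"step {step}: merge {describe(c1)} + {describe(c2)} "
          f"at distance {dist:.2f} -> {int(size)} points")
```

The final row always merges everything into one cluster of n points.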
from sklearn.cluster import AgglomerativeClustering
# Create hierarchical clustering model
hierarchical = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward'
)
# Fit and predict cluster labels
labels = hierarchical.fit_predict(X_scaled)
print(f"Cluster assignments: {labels[:10]}")
print(f"Points per cluster: {np.bincount(labels)}")
The AgglomerativeClustering class from scikit-learn provides a straightforward interface for applying hierarchical clustering when you know the desired number of clusters.
A key advantage of hierarchical clustering is the flexibility to choose the number of clusters after building the hierarchy.
from scipy.cluster.hierarchy import fcluster
# Cut dendrogram at a specific distance threshold
distance_threshold = 5.0
labels_by_distance = fcluster(linkage_matrix,
                              distance_threshold,
                              criterion='distance')
print(f"Clusters at distance {distance_threshold}: "
      f"{len(np.unique(labels_by_distance))}")
Cutting by distance creates clusters where all points within a cluster are closer than the specified threshold.
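scikit-learn supports the same idea directly: pass `n_clusters=None` together with a `distance_threshold`, and the number of clusters is determined by where the hierarchy is cut (the threshold value here is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# No fixed cluster count: merging stops once it would exceed the threshold
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=5.0,
                                linkage='ward')
labels = model.fit_predict(X_scaled)
print(f"Clusters found: {model.n_clusters_}")
```

After fitting, the discovered count is available as the `n_clusters_` attribute.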
# Cut dendrogram to get exactly 4 clusters
labels_by_count = fcluster(linkage_matrix,
                           4,
                           criterion='maxclust')
print(f"Labels when cut for 4 clusters: "
      f"{np.unique(labels_by_count)}")
This approach lets you specify exactly how many clusters you want, and the algorithm finds the appropriate cut point.
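Because the full hierarchy is computed once, you can scan many candidate cuts essentially for free; the threshold grid below is arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
Z = linkage(X_scaled, method='ward')

# Raising the distance threshold can only merge clusters, never split them
for t in [1.0, 2.0, 5.0, 10.0]:
    labels = fcluster(Z, t, criterion='distance')
    print(f"threshold {t:5.1f}: {len(np.unique(labels))} clusters")
```

The cluster count is non-increasing as the threshold grows, which makes this a cheap way to explore how granular the grouping should be.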
Different linkage methods produce different cluster structures:
from sklearn.cluster import AgglomerativeClustering
linkage_methods = ['ward', 'complete', 'average', 'single']
for method in linkage_methods:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X_scaled)
    # Count points in each cluster
    counts = np.bincount(labels)
    print(f"{method:10} linkage: cluster sizes = {counts}")
This comparison helps you understand how each linkage method distributes data points across clusters differently.
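Beyond eyeballing cluster sizes, one quantitative way to compare linkage methods is a score such as the silhouette coefficient (a sketch; higher is better, but which method wins depends entirely on your data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters)
for method in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=3,
                                     linkage=method).fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"{method:10} linkage: silhouette = {score:.3f}")
```

On well-separated blobs like these, most methods score similarly; the differences become pronounced on elongated or noisy data.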