Understand Hierarchical Clustering, a method that builds clusters step‑by‑step to reveal data structure. This lesson explains dendrograms, linkage methods, and how to identify natural groupings without predefining the number of clusters.
Hierarchical clustering is an unsupervised machine learning technique that builds a hierarchy of clusters by either merging or splitting data points progressively. Unlike K-Means, hierarchical clustering does not require you to specify the number of clusters beforehand, making it particularly useful for exploratory data analysis.
The result of hierarchical clustering is a tree-like diagram called a dendrogram, which visualizes how clusters are formed at different levels of similarity. This provides valuable insights into the natural structure of your data.
Hierarchical clustering can be performed using two opposite approaches, each with distinct advantages.
Agglomerative clustering is the most common approach: every data point starts as its own cluster, and at each step the two closest clusters are merged. This bottom-up approach builds clusters by progressively combining smaller groups into larger ones.
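As a rough illustration only (not how the libraries implement it), the bottom-up procedure can be sketched with a naive O(n³) loop using single linkage; the function name and toy data are made up for this example:

```python
import numpy as np

def naive_agglomerative(points, n_clusters):
    """Toy bottom-up clustering: repeatedly merge the two closest
    clusters (single linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best, best_dist = (0, 1), np.inf
        # Find the pair of clusters with the smallest single-linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_dist, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(naive_agglomerative(pts, 2))  # two tight pairs: [[0, 1], [2, 3]]
```

Real implementations avoid the repeated pairwise scans; this sketch only exists to make the merge loop concrete.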
Divisive clustering works in the opposite direction: all points begin in a single cluster, which is recursively split into smaller clusters. While theoretically valid, divisive clustering is computationally expensive and less commonly used in practice.
When merging clusters, you need a method to measure the distance between them. The choice of linkage method significantly affects the resulting cluster structure.
Single linkage measures the distance between the two closest points in different clusters. It can follow elongated shapes but is prone to "chaining," where a cluster grows by absorbing nearby points one at a time.
Complete linkage measures the distance between the two farthest points in different clusters, which tends to produce compact clusters of similar diameter.
Average linkage calculates the average distance between all pairs of points in two clusters, a compromise between the single and complete extremes.
Ward's method merges the pair of clusters that minimizes the increase in total within-cluster variance, and tends to produce balanced, roughly spherical clusters.
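To make the differences concrete, here is a small sketch computing all four distances between two hand-made one-line clusters (the data is illustrative; Ward is shown as the increase in within-cluster variance if the two clusters were merged):

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])  # cluster A
B = np.array([[4.0, 0.0], [6.0, 0.0]])  # cluster B

# All pairwise distances between points of A and points of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

print("single  :", pairwise.min())   # closest pair        -> 3.0
print("complete:", pairwise.max())   # farthest pair       -> 6.0
print("average :", pairwise.mean())  # mean over all pairs -> 4.5

# Ward: increase in total within-cluster variance caused by merging A and B
ca, cb = A.mean(axis=0), B.mean(axis=0)
ward = len(A) * len(B) / (len(A) + len(B)) * np.sum((ca - cb) ** 2)
print("ward    :", ward)             # -> 20.25
```

Note how single and complete linkage look only at one extreme pair, while average linkage and Ward's method use all points in both clusters.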
Hierarchical clustering is particularly valuable in scenarios where understanding relationships between groups matters, such as building taxonomies in biology, analyzing gene expression patterns, or segmenting customers into nested market tiers.
Let's implement agglomerative hierarchical clustering using scikit-learn and SciPy.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate sample data
X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Data shape: {X_scaled.shape}")
We create a small dataset with 50 samples for clearer dendrogram visualization. Scaling ensures fair distance calculations.
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute the linkage matrix using Ward's method
linkage_matrix = linkage(X_scaled, method='ward')
print(f"Linkage matrix shape: {linkage_matrix.shape}")
The linkage function computes the hierarchical clustering encoded as a linkage matrix. Each row contains information about which clusters were merged and at what distance.
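With the linkage matrix in hand, drawing the dendrogram is a single call. A minimal sketch (the `Agg` backend and output filename are arbitrary choices for a headless environment; in a notebook you would use `plt.show()` instead):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; swap for plt.show() in a notebook
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
linkage_matrix = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 5))
dendrogram(linkage_matrix)
plt.title("Hierarchical Clustering Dendrogram (Ward linkage)")
plt.xlabel("Sample index")
plt.ylabel("Merge distance")
plt.savefig("dendrogram.png")
```

The height at which two branches join is the merge distance, so long vertical gaps in the tree suggest natural places to cut.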
# First few merges in the linkage matrix
# Columns: [cluster1, cluster2, distance, num_points]
print("First 5 merges:")
print(linkage_matrix[:5])
Each row in the linkage matrix represents a merge operation, showing which two clusters combined, the distance at which they merged, and the resulting cluster size.
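One detail worth noting: in each row, labels 0 through n−1 refer to the original points, while labels n and above refer to clusters created by earlier merges. A small sketch on a four-point toy dataset makes this visible:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

pts = np.array([[0.0], [0.2], [5.0], [5.3]])  # two tight pairs on a line
Z = linkage(pts, method='single')
n = len(pts)

def describe(c):
    """Labels < n are original points; labels >= n are earlier merges."""
    c = int(c)
    return f"point {c}" if c < n else f"cluster from step {c - n}"

for step, (c1, c2, dist, size) in enumerate(Z):
    print(f"step {step}: merge {describe(c1)} + {describe(c2)} "
          f"at distance {dist:.2f} -> {int(size)} points")
```

The final row always merges everything into one cluster of n points.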
from sklearn.cluster import AgglomerativeClustering
# Create hierarchical clustering model
hierarchical = AgglomerativeClustering(
    n_clusters=3,
    linkage='ward'
)
# Fit and predict cluster labels
labels = hierarchical.fit_predict(X_scaled)
print(f"Cluster assignments: {labels[:10]}")
print(f"Points per cluster: {np.bincount(labels)}")
The AgglomerativeClustering class from scikit-learn provides a straightforward interface for applying hierarchical clustering when you know the desired number of clusters.
A key advantage of hierarchical clustering is the flexibility to choose the number of clusters after building the hierarchy.
from scipy.cluster.hierarchy import fcluster
# Cut dendrogram at a specific distance threshold
distance_threshold = 5.0
labels_by_distance = fcluster(linkage_matrix,
                              distance_threshold,
                              criterion='distance')
print(f"Clusters at distance {distance_threshold}: "
      f"{len(np.unique(labels_by_distance))}")
Cutting by distance creates clusters where all points within a cluster are closer than the specified threshold.
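scikit-learn supports the same idea directly: pass `n_clusters=None` together with a `distance_threshold`, and the number of clusters is determined by where the hierarchy is cut (the threshold value here is illustrative):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# No fixed cluster count: merging stops once it would exceed the threshold
model = AgglomerativeClustering(n_clusters=None,
                                distance_threshold=5.0,
                                linkage='ward')
labels = model.fit_predict(X_scaled)
print(f"Clusters found: {model.n_clusters_}")
```

After fitting, the discovered count is available as the `n_clusters_` attribute.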
# Cut dendrogram to get exactly 4 clusters
labels_by_count = fcluster(linkage_matrix,
                           4,
                           criterion='maxclust')
print(f"Labels when cut for 4 clusters: "
      f"{np.unique(labels_by_count)}")
This approach lets you specify exactly how many clusters you want, and the algorithm finds the appropriate cut point.
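Because the full hierarchy is computed once, you can scan many candidate cuts essentially for free; the threshold grid below is arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)
Z = linkage(X_scaled, method='ward')

# Raising the distance threshold can only merge clusters, never split them
for t in [1.0, 2.0, 5.0, 10.0]:
    labels = fcluster(Z, t, criterion='distance')
    print(f"threshold {t:5.1f}: {len(np.unique(labels))} clusters")
```

The cluster count is non-increasing as the threshold grows, which makes this a cheap way to explore how granular the grouping should be.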
Different linkage methods produce different cluster structures:
from sklearn.cluster import AgglomerativeClustering
linkage_methods = ['ward', 'complete', 'average', 'single']
for method in linkage_methods:
    model = AgglomerativeClustering(n_clusters=3, linkage=method)
    labels = model.fit_predict(X_scaled)
    # Count points in each cluster
    counts = np.bincount(labels)
    print(f"{method:10} linkage: cluster sizes = {counts}")
This comparison helps you understand how each linkage method distributes data points across clusters differently.
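Beyond eyeballing cluster sizes, one quantitative way to compare linkage methods is a score such as the silhouette coefficient (a sketch; higher is better, but which method wins depends entirely on your data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=50, centers=3,
                  cluster_std=1.0, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Silhouette ranges from -1 (poor) to +1 (dense, well-separated clusters)
for method in ['ward', 'complete', 'average', 'single']:
    labels = AgglomerativeClustering(n_clusters=3,
                                     linkage=method).fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    print(f"{method:10} linkage: silhouette = {score:.3f}")
```

On well-separated blobs like these, most methods score similarly; the differences become pronounced on elongated or noisy data.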