How to run hierarchical clustering with SciPy

Hierarchical clustering groups observations by repeatedly merging the nearest clusters, leaving a tree that can be cut into flat labels. In SciPy, the scipy.cluster.hierarchy module offers a compact way to check whether numeric observations form clear groups before plotting a dendrogram or passing the labels to later analysis.

linkage() builds the hierarchy from an observation array or a condensed distance vector. Each row in the linkage matrix records the two cluster ids that were merged, the merge distance, and the number of original observations in the new cluster.
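
As a rough sketch of both input forms, the snippet below builds a hierarchy once from a small observation array and once from the condensed vector returned by scipy.spatial.distance.pdist, then decodes the merge rows. The four toy points and the choice of average linkage are assumptions made for illustration. Ids 0 to n - 1 refer to original observations, and id n + i refers to the cluster formed in row i.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: four points on a line.
points = np.array([[0.0], [1.0], [5.0], [7.0]])
n = len(points)

# Build the same hierarchy from raw observations and from a condensed vector.
from_observations = linkage(points, method="average", metric="euclidean")
condensed = pdist(points, metric="euclidean")  # length n * (n - 1) / 2
from_condensed = linkage(condensed, method="average")

print("same hierarchy:", np.allclose(from_observations, from_condensed))

for i, (left, right, distance, count) in enumerate(from_condensed):
    # Ids below n are original observations; id n + i is the cluster made in row i.
    print(
        f"cluster {n + i}: merged {int(left)} and {int(right)} "
        f"at distance {distance:.2f}, size {int(count)}"
    )
```

The last row always has a count equal to the number of observations, because the final merge joins everything into one cluster.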

A small labeled feature matrix keeps the merge rows and final cluster labels easy to inspect. Ward linkage is correctly defined only for Euclidean distances, which are sensitive to column scale, so scale real features before clustering when columns use different units or ranges.

Steps to run hierarchical clustering with SciPy:

Create a Python script named hierarchical_clustering.py with labeled observations.

hierarchical_clustering.py

```
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage


# Readable names, one per row of features.
labels = np.array(
    [
        "web-1",
        "web-2",
        "web-3",
        "cache-1",
        "cache-2",
        "cache-3",
        "db-1",
        "db-2",
        "db-3",
    ]
)

# Rows are observations, columns are numeric features.
features = np.array(
    [
        [0.0, 0.1],
        [0.2, -0.1],
        [-0.2, 0.0],
        [0.1, 4.8],
        [-0.1, 5.1],
        [0.3, 5.0],
        [5.1, 4.9],
        [4.8, 5.2],
        [5.0, 5.1],
    ]
)

# Build the hierarchy, cut it into at most three flat clusters,
# and compute the dendrogram layout without drawing it.
linkage_matrix = linkage(features, method="ward", metric="euclidean")
cluster_ids = fcluster(linkage_matrix, t=3, criterion="maxclust")
tree = dendrogram(linkage_matrix, labels=labels, no_plot=True)

print(f"observations: {len(labels)}")
print(f"linkage_shape: {linkage_matrix.shape}")
print("first_merges:")
# Each linkage row holds: left id, right id, merge distance, observation count.
for merge_index, row in enumerate(linkage_matrix[:4]):
    left, right, distance, count = row
    print(
        f"  {merge_index}: left={int(left)} right={int(right)} "
        f"distance={distance:.3f} count={int(count)}"
    )

print("flat_clusters:")
for cluster_id in sorted(set(cluster_ids)):
    members = labels[cluster_ids == cluster_id]
    print(f"  {cluster_id}: {', '.join(members)}")

# Compare memberships as sets so the arbitrary cluster numbering does not matter.
expected_groups = {
    frozenset(["web-1", "web-2", "web-3"]),
    frozenset(["cache-1", "cache-2", "cache-3"]),
    frozenset(["db-1", "db-2", "db-3"]),
}
observed_groups = {
    frozenset(labels[cluster_ids == cluster_id]) for cluster_id in set(cluster_ids)
}

print(f"dendrogram_leaf_order: {', '.join(tree['ivl'])}")
print(f"expected_groups_match: {observed_groups == expected_groups}")
```

The rows in features are observations, and the columns are numeric features. The label array is separate so cluster membership can be printed with readable names.

Run the script to compute the linkage matrix and flat cluster labels.

```
$ python hierarchical_clustering.py
observations: 9
linkage_shape: (8, 4)
first_merges:
  0: left=6 right=8 distance=0.224 count=2
  1: left=0 right=2 distance=0.224 count=2
  2: left=3 right=5 distance=0.283 count=2
  3: left=7 right=9 distance=0.370 count=3
flat_clusters:
  1: web-1, web-2, web-3
  2: db-1, db-2, db-3
  3: cache-1, cache-2, cache-3
dendrogram_leaf_order: web-2, web-1, web-3, db-2, db-1, db-3, cache-2, cache-1, cache-3
expected_groups_match: True
```

Check the linkage matrix dimensions.

Nine observations produce eight merge rows, one fewer than the number of observations, so linkage_shape: (8, 4) is expected. The four columns are the left cluster id, the right cluster id, the merge distance, and the number of original observations in the merged cluster.

Use fcluster() to cut the hierarchy into flat labels.
```
cluster_ids = fcluster(linkage_matrix, t=3, criterion="maxclust")
```
The maxclust criterion asks SciPy for no more than three flat clusters. Treat the numeric cluster ids as labels; their order is not a quality ranking.
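
A minimal sketch of the cut, assuming a hypothetical six-point data set made of three tight pairs. It contrasts criterion="maxclust" with criterion="distance", which cuts at a cophenetic distance threshold instead of a cluster count. The threshold 1.0 is an illustrative assumption, not a recommended setting.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical data: three tight pairs that sit far apart from each other.
points = np.array(
    [
        [0.0, 0.0], [0.1, 0.0],
        [5.0, 0.0], [5.1, 0.0],
        [0.0, 5.0], [0.1, 5.0],
    ]
)
hierarchy = linkage(points, method="ward")

# Ask for at most three flat clusters.
by_count = fcluster(hierarchy, t=3, criterion="maxclust")

# Alternatively, keep observations together only when their
# cophenetic distance is no greater than the threshold.
by_distance = fcluster(hierarchy, t=1.0, criterion="distance")

print("maxclust:", by_count)
print("distance:", by_distance)
print("cluster counts:", len(np.unique(by_count)), len(np.unique(by_distance)))
```

A count-based cut is convenient when the number of groups is known in advance; a distance-based cut may suit cases where only a notion of "close enough" is available.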
Compare the observed members with the expected groups.

expected_groups_match: True confirms that the flat labels separate the three visible groups in the input data.

Use the dendrogram leaf order only for display planning.

dendrogram(…, no_plot=True) returns a dictionary describing the tree layout without drawing a plot; its ivl entry lists the leaf labels in display order. The leaf order controls how labels appear in a dendrogram, while cluster_ids controls flat membership.
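
To illustrate the returned dictionary, the sketch below uses hypothetical single-letter labels and four toy points. It reads the leaves and ivl entries, which hold the leaf indices and the matching labels in display order.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical labels and one-dimensional toy features.
names = ["a", "b", "c", "d"]
points = np.array([[0.0], [10.0], [0.5], [10.5]])

layout = dendrogram(linkage(points, method="ward"), labels=names, no_plot=True)

# "leaves" holds observation indices and "ivl" the matching labels,
# both in left-to-right display order.
print("leaves:", layout["leaves"])
print("ivl:", layout["ivl"])
print("consistent:", [names[i] for i in layout["leaves"]] == layout["ivl"])
```

Observations that merge early, such as a and c here, end up next to each other in the leaf order, but adjacency alone does not say which flat cluster a leaf belongs to.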
Scale real input features before using Ward linkage on mixed units.

Large-range columns can dominate Euclidean distances. Standardize or otherwise scale columns before clustering when one feature uses counts, another uses milliseconds, and another uses percentages.
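
One possible way to standardize columns is a z-score computed with NumPy, sketched below. The metric names and values are invented for illustration, and z-scoring is only one of several reasonable scaling choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical metrics per host: request count, latency in ms, CPU percent.
raw = np.array(
    [
        [12000.0, 20.0, 35.0],
        [11800.0, 22.0, 33.0],
        [12100.0, 180.0, 90.0],
        [11900.0, 175.0, 88.0],
    ]
)

# Z-score each column so no single unit dominates Euclidean distances.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

labels_raw = fcluster(linkage(raw, method="ward"), t=2, criterion="maxclust")
labels_scaled = fcluster(linkage(scaled, method="ward"), t=2, criterion="maxclust")

print("raw:   ", labels_raw)
print("scaled:", labels_scaled)
```

After scaling, the two low-latency hosts group together and the two high-latency hosts group together, because the request count no longer outweighs the other columns. A column with zero variance would cause a division by zero in this sketch and would need to be dropped or handled separately.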
Remove the demo script after adapting the clustering pattern.
```
$ rm hierarchical_clustering.py
```