Hierarchical clustering groups observations by repeatedly merging nearby clusters, leaving a tree that can be cut into flat labels. In SciPy, scipy.cluster.hierarchy is a compact way to check whether numeric observations form clear groups before plotting a dendrogram or using the labels in later analysis.
linkage() builds the hierarchy from an observation array or a condensed distance vector. Each row in the linkage matrix records the two cluster ids that were merged, the merge distance, and the number of original observations in the new cluster.
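As a rough sketch of the two input forms, the snippet below passes the same points to linkage() once as an observation array and once as a condensed distance vector built with scipy.spatial.distance.pdist. The `points` array and the choice of average linkage are illustrative assumptions, not part of the walkthrough script.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical toy data: four 2-D observations forming two tight pairs.
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.5]])

# Form 1: pass the 2-D observation array and let linkage() compute distances.
from_observations = linkage(points, method="average", metric="euclidean")

# Form 2: pass the 1-D condensed distance vector (one entry per pair).
condensed = pdist(points, metric="euclidean")
from_distances = linkage(condensed, method="average")

print(condensed.shape)  # n * (n - 1) / 2 pairwise distances
print(np.allclose(from_observations, from_distances))
```

linkage() treats a 1-D input as condensed distances and a 2-D input as observations, so a square distance matrix should be converted first, for example with scipy.spatial.distance.squareform.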
A small labeled feature matrix keeps the merge rows and final cluster labels easy to inspect. Ward linkage is defined for Euclidean geometry, so scale real features before clustering when columns use different units or ranges.
```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

labels = np.array(
    [
        "web-1",
        "web-2",
        "web-3",
        "cache-1",
        "cache-2",
        "cache-3",
        "db-1",
        "db-2",
        "db-3",
    ]
)
features = np.array(
    [
        [0.0, 0.1],
        [0.2, -0.1],
        [-0.2, 0.0],
        [0.1, 4.8],
        [-0.1, 5.1],
        [0.3, 5.0],
        [5.1, 4.9],
        [4.8, 5.2],
        [5.0, 5.1],
    ]
)

linkage_matrix = linkage(features, method="ward", metric="euclidean")
cluster_ids = fcluster(linkage_matrix, t=3, criterion="maxclust")
tree = dendrogram(linkage_matrix, labels=labels, no_plot=True)

print(f"observations: {len(labels)}")
print(f"linkage_shape: {linkage_matrix.shape}")
print("first_merges:")
for merge_index, row in enumerate(linkage_matrix[:4]):
    left, right, distance, count = row
    print(
        f"  {merge_index}: left={int(left)} right={int(right)} "
        f"distance={distance:.3f} count={int(count)}"
    )

print("flat_clusters:")
for cluster_id in sorted(set(cluster_ids)):
    members = labels[cluster_ids == cluster_id]
    print(f"  {cluster_id}: {', '.join(members)}")

expected_groups = {
    frozenset(["web-1", "web-2", "web-3"]),
    frozenset(["cache-1", "cache-2", "cache-3"]),
    frozenset(["db-1", "db-2", "db-3"]),
}
observed_groups = {
    frozenset(labels[cluster_ids == cluster_id])
    for cluster_id in set(cluster_ids)
}
print(f"dendrogram_leaf_order: {', '.join(tree['ivl'])}")
print(f"expected_groups_match: {observed_groups == expected_groups}")
```
The rows in features are observations, and the columns are numeric features. The label array is separate so cluster membership can be printed with readable names.
```console
$ python hierarchical_clustering.py
observations: 9
linkage_shape: (8, 4)
first_merges:
  0: left=6 right=8 distance=0.224 count=2
  1: left=0 right=2 distance=0.224 count=2
  2: left=3 right=5 distance=0.283 count=2
  3: left=7 right=9 distance=0.370 count=3
flat_clusters:
  1: web-1, web-2, web-3
  2: db-1, db-2, db-3
  3: cache-1, cache-2, cache-3
dendrogram_leaf_order: web-2, web-1, web-3, db-2, db-1, db-3, cache-2, cache-1, cache-3
expected_groups_match: True
```
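Nine observations produce eight merge rows, so linkage_shape: (8, 4) is expected. The four columns are left cluster id, right cluster id, merge distance, and merged observation count. Ids 0 through 8 refer to the original observations, and the cluster created by merge row i receives id 9 + i. That is why row 3 shows right=9: it joins db-2 (id 7) to the db-1/db-3 pair formed in row 0, giving a cluster of three.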
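To make the id columns concrete, here is a minimal sketch that expands every merge row into the original observation indices it covers. It relies on SciPy's convention that the cluster created by row i of an n-observation linkage matrix receives id n + i. The `points` array and the `members` dictionary are hypothetical names used only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs and one distant point.
points = np.array(
    [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.5], [20.0, 0.0]]
)
n = len(points)
merges = linkage(points, method="ward")

# members[k] lists the original observation indices inside cluster id k.
members = {i: [i] for i in range(n)}
for step, (left, right, distance, count) in enumerate(merges):
    new_id = n + step  # id assigned to the cluster created by this row
    members[new_id] = sorted(members[int(left)] + members[int(right)])
    # The fourth column should equal the size of the expanded member list.
    assert len(members[new_id]) == int(count)
    print(f"row {step}: id {new_id} = {members[new_id]} at {distance:.3f}")
```

The same bookkeeping applies to the walkthrough's linkage_matrix, with n equal to 9.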
```python
cluster_ids = fcluster(linkage_matrix, t=3, criterion="maxclust")
```
The maxclust criterion asks SciPy for no more than three flat clusters. Treat the numeric cluster ids as labels; their order is not a quality ranking.
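For comparison, fcluster() can also cut the tree at a merge height with criterion="distance". The sketch below, on made-up data, cuts one hierarchy both ways and compares the results as partitions so that the arbitrary id numbering does not matter. The threshold of 2.0 and the helper `as_partition` are assumptions chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two well-separated groups of three points each.
points = np.array(
    [[0.0, 0.0], [0.3, 0.1], [0.1, 0.4], [8.0, 8.0], [8.2, 7.9], [7.9, 8.3]]
)
merges = linkage(points, method="ward")

# Cut by cluster count: ask for at most two flat clusters.
by_count = fcluster(merges, t=2, criterion="maxclust")

# Cut by height: observations share a cluster only if they were merged
# at a distance of 2.0 or less (an arbitrary illustrative threshold).
by_height = fcluster(merges, t=2.0, criterion="distance")


def as_partition(ids):
    """Convert a label vector into a set of frozensets of row indices."""
    return {
        frozenset(np.flatnonzero(ids == k).tolist()) for k in np.unique(ids)
    }


print(as_partition(by_count) == as_partition(by_height))
```

Comparing partitions rather than raw id arrays avoids reading meaning into which group happens to be numbered 1.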
expected_groups_match: True confirms that the flat labels separate the three visible groups in the input data.
dendrogram(…, no_plot=True) returns a dictionary describing the tree without drawing a plot; its "ivl" entry lists the leaf labels in plotting order. That leaf order controls how labels appear along the dendrogram axis, while cluster_ids controls flat membership.
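A short sketch of that return value follows, assuming made-up names and one-dimensional positions. The "leaves" entry holds row indices in plotting order and "ivl" holds the matching label strings, so indexing the labels by "leaves" reproduces "ivl".

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical names and 1-D positions, chosen only for illustration.
names = ["a", "b", "c", "d"]
points = np.array([[0.0], [10.0], [0.5], [10.4]])

merges = linkage(points, method="average")
tree = dendrogram(merges, labels=names, no_plot=True)

# "leaves" holds observation indices; "ivl" holds the corresponding labels.
print(tree["leaves"])
print(tree["ivl"])
print([names[i] for i in tree["leaves"]] == tree["ivl"])
```

Leaves that merged early, such as "a" and "c" here, end up next to each other in the ordering.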
Large-range columns can dominate Euclidean distances. Standardize or otherwise scale columns before clustering when one feature uses counts, another uses milliseconds, and another uses percentages.
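The effect can be sketched with a made-up two-column table in which a request count in the thousands sits next to an error percentage. The values, the column meanings, and the choice of z-scoring are all assumptions for illustration; other scalings may suit real data better.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical measurements: column 0 is a request count, column 1 is an
# error percentage. The intended groups differ in column 1.
raw = np.array(
    [
        [1000.0, 1.0],
        [5000.0, 1.2],
        [9000.0, 0.9],
        [1200.0, 9.0],
        [5200.0, 9.3],
        [8800.0, 8.8],
    ]
)

# Z-score each column: subtract its mean, divide by its standard deviation.
scaled = (raw - raw.mean(axis=0)) / raw.std(axis=0)

raw_ids = fcluster(linkage(raw, method="ward"), t=2, criterion="maxclust")
scaled_ids = fcluster(linkage(scaled, method="ward"), t=2, criterion="maxclust")

print("raw:   ", raw_ids)
print("scaled:", scaled_ids)
```

On the raw values the split follows the request count almost entirely, while after scaling the first three rows separate from the last three along the error percentage.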
```console
$ rm hierarchical_clustering.py
```