How to run text clustering with Sentence Transformers

Short text collections often contain themes that keyword sorting misses, such as support notes that describe the same account, release, or backup work with different words. Sentence Transformers converts each text item into an embedding, and a clustering algorithm can group nearby embeddings so unlabeled items can be reviewed by topic.

A small Python script can load sentence-transformers/all-MiniLM-L6-v2, embed each text, and pass the vectors to KMeans from scikit-learn. KMeans partitions the data into a number of clusters chosen in advance, so it suits cases where the corpus should be split into a fixed number of themes before manual review.
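
To make "nearby" concrete: the script in the steps below requests unit-length embeddings, and for unit-length vectors the squared Euclidean distance that KMeans minimizes equals 2 minus twice the cosine similarity. The following is a minimal sketch of that relationship. It uses hypothetical three-dimensional vectors in place of real model output, and the helper names are illustrative only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def cosine_similarity(a, b):
    """For unit-length vectors, cosine similarity is the dot product."""
    return sum(x * y for x, y in zip(a, b))


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical toy vectors, not real embeddings.
password_reset = normalize([0.9, 0.1, 0.2])
password_email = normalize([1.8, 0.3, 0.3])  # similar direction, longer before scaling
database_backup = normalize([0.1, 0.2, 0.9])

pairs = [("password_email", password_email), ("database_backup", database_backup)]
for name, other in pairs:
    cos = cosine_similarity(password_reset, other)
    dist = squared_distance(password_reset, other)
    # On unit vectors: squared distance == 2 - 2 * cosine similarity.
    print(f"{name}: cosine={cos:.3f} squared={dist:.3f} 2-2cos={2 - 2 * cos:.3f}")
```

The vector pointing in a similar direction scores a higher cosine similarity and a smaller distance, regardless of its length before scaling.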

The sample corpus keeps three visible themes so the output can be checked by eye before the same pattern is applied to a larger dataset. Cluster numbers are labels assigned by the algorithm, not permanent category names, so inspect the grouped text before naming or saving the clusters.
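
Because the numbers are arbitrary, two runs (for example with different random_state values) can describe the same grouping with different numbers. The sketch below shows one way to compare groupings rather than numbers. canonical_labels is a hypothetical helper, and the two label lists are made up for illustration, not output from the script.

```python
def canonical_labels(labels):
    """Renumber cluster labels by order of first appearance."""
    mapping = {}
    result = []
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        result.append(mapping[label])
    return result


# Hypothetical label lists from two runs over the same nine texts.
run_a = [2, 2, 2, 0, 0, 0, 1, 1, 1]
run_b = [1, 1, 1, 2, 2, 2, 0, 0, 0]

# The groupings match even though the numbers differ.
print(canonical_labels(run_a) == canonical_labels(run_b))  # prints True
```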

Steps to run text clustering with Sentence Transformers:

  1. Create a Python script that embeds sample texts and clusters the embeddings.
    cluster_themes.py
    from collections import defaultdict
     
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
     
    texts = [
        "Reset a customer password from the account portal.",
        "Resend the password reset email to the customer.",
        "Review two-factor authentication recovery codes.",
        "Build the web release container image.",
        "Deploy the web release to production.",
        "Roll back the web release from production.",
        "Back up the customer database overnight.",
        "Restore the database backup into staging.",
        "Verify the database backup checksum.",
    ]
     
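    # Load the model on CPU and encode every text as a unit-length embedding.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")
    embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=False)

    # Fit three clusters; random_state makes the run repeatable.
    kmeans = KMeans(n_clusters=3, random_state=7, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    # Collect the texts under the label KMeans assigned to each one.
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[int(label)].append(text)

    # Order clusters by their first member so the printed numbering
    # does not depend on the label numbers KMeans happened to assign.
    for cluster_number, label in enumerate(sorted(groups, key=lambda key: groups[key][0]), start=1):
        print(f"Cluster {cluster_number}:")
        for text in groups[label]:
            print(f"- {text}")
        print()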

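    Set n_clusters to the number of groups the review needs. The script normalizes embeddings before clustering so KMeans groups by embedding direction rather than raw vector length. The script needs the sentence-transformers and scikit-learn packages installed, and the first run downloads the model if it is not already cached.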

  2. Run the clustering script.
    $ python3 cluster_themes.py
    Cluster 1:
    - Back up the customer database overnight.
    - Restore the database backup into staging.
    - Verify the database backup checksum.
    
    Cluster 2:
    - Build the web release container image.
    - Deploy the web release to production.
    - Roll back the web release from production.
    
    Cluster 3:
    - Reset a customer password from the account portal.
    - Resend the password reset email to the customer.
    - Review two-factor authentication recovery codes.

  3. Review the grouped text under each cluster.

    In the sample output, Cluster 1 is database backup work, Cluster 2 is web release work, and Cluster 3 is account access work. KMeans assigns arbitrary numeric labels, so name clusters after inspecting their members.

  4. Remove the sample script when the clustering check is finished.
    $ rm cluster_themes.py
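
When reviewing clusters in step 3 on a larger corpus, a short list of frequent words per cluster can help when choosing a name. The sketch below is one possible aid, not part of the script above: top_terms is a hypothetical helper, the stop-word list is an assumption tuned to this sample, and the groups are copied from the sample output.

```python
import re
from collections import Counter

# Assumed stop-word list for this sample only.
STOP_WORDS = {"a", "the", "to", "from", "into", "up"}


def top_terms(texts, limit=3):
    """Return the most frequent non-stop-word terms in a cluster's texts."""
    counts = Counter(
        word
        for text in texts
        for word in re.findall(r"[a-z]+", text.lower())
        if word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(limit)]


# Groups copied from the sample output in step 2.
groups = {
    1: [
        "Back up the customer database overnight.",
        "Restore the database backup into staging.",
        "Verify the database backup checksum.",
    ],
    2: [
        "Build the web release container image.",
        "Deploy the web release to production.",
        "Roll back the web release from production.",
    ],
    3: [
        "Reset a customer password from the account portal.",
        "Resend the password reset email to the customer.",
        "Review two-factor authentication recovery codes.",
    ],
}

for number, members in groups.items():
    print(f"Cluster {number}: {', '.join(top_terms(members))}")
```

Frequent words are only a hint. Read the member texts before settling on a name, since short clusters can share common words across themes.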