How to evaluate an embedding model on STS data with Sentence Transformers

Semantic textual similarity (STS) tests check whether an embedding model assigns higher similarity to sentence pairs that human annotators rated as closer in meaning. In a Sentence Transformers project, a small STS evaluation run can catch a model that looks acceptable on hand-picked examples but ranks labeled pairs poorly.

EmbeddingSimilarityEvaluator takes two aligned sentence lists plus one numeric score list, encodes both text columns, and reports Pearson and Spearman correlations between the model's pairwise similarities and the human scores for the selected similarity function. Cosine similarity is the usual scoring function when sentence embeddings feed retrieval, so selecting it keeps the check focused on whether higher model similarity follows higher human similarity labels.
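
To make the two reported numbers concrete, here is a minimal standard-library sketch of the arithmetic: cosine similarity per pair, Pearson on the raw values, and Spearman as Pearson on ranks. It is not the library's implementation. The helper names (cosine, pearson, ranks, spearman) and the two-dimensional toy vectors are assumptions made up for illustration; a real run uses model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy stand-ins for embeddings (illustrative only, not model output).
left = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
right = [(1.0, 0.1), (1.0, 0.5), (1.0, 1.0), (0.0, 1.0)]
labels = [0.9, 0.8, 0.5, 0.1]

similarities = [cosine(u, v) for u, v in zip(left, right)]
print(f"pearson={pearson(similarities, labels):.3f}")
print(f"spearman={spearman(similarities, labels):.3f}")  # prints spearman=1.000
```

In this toy data the similarities fall in the same order as the labels, so Spearman is 1.0 even though the values are not linearly related. That difference is why the two correlations can diverge on the same set of pairs.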

Six labeled pairs are enough for a local smoke test, while a release decision should use a larger held-out STS or domain validation set. Treat a failed correlation threshold as a signal to review label scaling, domain mismatch, or the selected embedding model before relying on the vectors in search or ranking code.
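
The reason six pairs only support a smoke test can be sketched with the closed-form Spearman formula for untied ranks, 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The rankings below are invented for illustration and are not model output, and spearman_no_ties is a hypothetical helper name.

```python
def spearman_no_ties(rank_a, rank_b):
    """Closed-form Spearman's rho; valid only when neither ranking has ties."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Human ranking of six pairs, most similar first.
human = [1, 2, 3, 4, 5, 6]

# Hypothetical model rankings of the same six pairs.
one_adjacent_swap = [2, 1, 3, 4, 5, 6]
two_adjacent_swaps = [2, 1, 4, 3, 5, 6]
ends_swapped = [6, 2, 3, 4, 5, 1]

for label, model_ranks in [
    ("one adjacent swap", one_adjacent_swap),
    ("two adjacent swaps", two_adjacent_swaps),
    ("ends swapped", ends_swapped),
]:
    print(f"{label}: {spearman_no_ties(human, model_ranks):.3f}")
# one adjacent swap: 0.943
# two adjacent swaps: 0.886
# ends swapped: -0.429
```

With six untied pairs, the sum of squared rank differences is always even, and a 0.75 threshold tolerates a sum of at most 8 (0.771); a sum of 10 already gives 0.714. A single misplaced pair can therefore flip the result, which is the case for a larger held-out set before a release decision.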

Steps to evaluate a Sentence Transformers embedding model on STS data:

  1. Create a Python script with labeled STS sentence pairs.
    evaluate_sts.py
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import (
        EmbeddingSimilarityEvaluator,
    )
     
    sentences1 = [
        "A support agent reset the user's password.",
        "The database backup completed overnight.",
        "The release was rolled back after the deploy.",
        "The customer changed the billing address.",
        "The web server returned an SSL certificate warning.",
        "A new search index was built for documents.",
    ]
     
    sentences2 = [
        "The account password was reset by support.",
        "A nightly backup of the database finished successfully.",
        "The deployment was reverted after the release.",
        "A user updated payment and billing details.",
        "The office printer ran out of toner.",
        "The kitchen refrigerator needs cleaning.",
    ]
     
    scores = [0.95, 0.90, 0.86, 0.55, 0.05, 0.03]
     
    model = SentenceTransformer(
        "sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
    )
     
    evaluator = EmbeddingSimilarityEvaluator(
        sentences1=sentences1,
        sentences2=sentences2,
        scores=scores,
        name="support-sts",
        main_similarity="cosine",
        show_progress_bar=False,
        write_csv=False,
    )
     
    results = evaluator(model)
    primary_metric = evaluator.primary_metric
     
    print("Pairs evaluated:", len(sentences1))
    print("Primary metric:", primary_metric)
    print(f"Primary score: {results[primary_metric]:.3f}")
    print(f"Pearson cosine: {results['support-sts_pearson_cosine']:.3f}")
    print(f"Spearman cosine: {results['support-sts_spearman_cosine']:.3f}")
     
    if results[primary_metric] < 0.75:
        raise SystemExit("STS evaluator score is below the acceptance threshold")
     
    print("check: STS evaluation correlation is above 0.75")

    Keep each row aligned by index: sentences1[i], sentences2[i], and scores[i] describe one labeled pair. Replace the sample rows with a held-out validation split before using the threshold in a project test.

  2. Run the STS evaluation script.
    $ python3 evaluate_sts.py
    Pairs evaluated: 6
    Primary metric: support-sts_spearman_cosine
    Primary score: 0.771
    Pearson cosine: 0.971
    Spearman cosine: 0.771
    check: STS evaluation correlation is above 0.75

    The first model load can print Hugging Face download or weight-loading messages before the metric output. Keep those environment messages out of saved test assertions.

  3. Check that the primary score meets the local threshold.

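    EmbeddingSimilarityEvaluator prefixes each metric key with the evaluator name, so support-sts_spearman_cosine combines name="support-sts", the Spearman correlation, and main_similarity="cosine". The script compares only this primary score against the threshold; in the run above, 0.771 clears 0.75, while the Pearson value is printed for reference. A failed threshold should stop the release or training run until the model and labeled pairs are reviewed.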

  4. Remove the temporary evaluator script after the smoke test.
    $ rm evaluate_sts.py
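
For the advice in step 1 to replace the sample rows with a held-out validation split, one possible loader is sketched below. It assumes a CSV file with sentence1, sentence2, and score columns; that layout, the load_pairs name, and the file name in the usage note are assumptions for illustration, not part of Sentence Transformers. Reading all three values from the same row keeps the lists aligned by index.

```python
import csv
import io

REQUIRED_COLUMNS = ("sentence1", "sentence2", "score")


def load_pairs(handle):
    """Read labeled pairs from an open CSV handle into three aligned lists."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    sentences1, sentences2, scores = [], [], []
    for row_number, row in enumerate(reader, start=1):
        # Short rows yield None for absent fields; treat them as empty.
        values = [(row[c] or "").strip() for c in REQUIRED_COLUMNS]
        if not all(values):
            raise ValueError(f"data row {row_number}: empty field")
        first, second, score_text = values
        sentences1.append(first)
        sentences2.append(second)
        scores.append(float(score_text))  # raises ValueError on a non-numeric label
    return sentences1, sentences2, scores


# In-memory sample so the sketch runs without a file on disk.
sample = io.StringIO(
    "sentence1,sentence2,score\n"
    "The database backup completed overnight.,"
    "A nightly backup of the database finished successfully.,0.90\n"
    "A new search index was built for documents.,"
    "The kitchen refrigerator needs cleaning.,0.03\n"
)

sentences1, sentences2, scores = load_pairs(sample)
print(len(sentences1), scores)  # prints 2 [0.9, 0.03]
```

In a project, the handle could come from open("validation_pairs.csv", newline="", encoding="utf-8"), where the file name is hypothetical. The three returned lists map directly onto the sentences1, sentences2, and scores arguments used in the script from step 1.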