Semantic textual similarity tests check whether an embedding model places sentence pairs near each other when human labels say the meanings match. In a Sentence Transformers project, a small STS evaluation run can catch a model that looks acceptable on hand-picked examples but ranks labeled pairs poorly.
EmbeddingSimilarityEvaluator takes two aligned sentence lists plus one numeric score list, encodes both text columns, and reports Pearson and Spearman correlations for the selected similarity function. Using cosine similarity matches the common sentence-embedding retrieval path and keeps the check focused on whether higher model similarity follows higher human similarity labels.
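To make the measurement concrete, the sketch below reproduces the same kind of calculation in plain Python, without loading a model. The three-dimensional vectors, the labels, and the helper names (`cosine`, `pearson`, `ranks`, `spearman`) are made up for illustration. They are not real embeddings or library functions, and the evaluator's own implementation may differ in details such as tie handling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank positions starting at 1; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy vectors standing in for the two encoded text columns (hypothetical values).
left = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
right = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9], [1.0, 0.0, 0.0]]
labels = [0.9, 0.6, 0.7, 0.1]  # hypothetical human similarity labels

model_scores = [cosine(u, v) for u, v in zip(left, right)]
print(f"pearson={pearson(model_scores, labels):.3f}")
print(f"spearman={spearman(model_scores, labels):.3f}")
```

In this toy data the vectors rank the second pair above the third while the labels rank them the other way round, so the Spearman value is 0.8 rather than 1.0. That kind of ordering disagreement is what the evaluator is meant to expose.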
Six labeled pairs are enough for a local smoke test, while a release decision should use a larger held-out STS or domain validation set. Treat a failed correlation threshold as a signal to review the labels (for example, scores mixed from different scales), domain mismatch, or the selected embedding model before relying on the vectors in search or ranking code.
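One reason six pairs only suit a smoke test is that rank correlation over so few items moves in coarse steps. The sketch below uses the textbook Spearman formula for untied ranks, rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), to show how much a single swap of two adjacent ranks costs at different set sizes. The function name `spearman_from_rank_diffs` is illustrative, and the formula assumes there are no tied scores.

```python
def spearman_from_rank_diffs(sum_sq_diffs: int, n: int) -> float:
    """Spearman rho for n untied pairs, given the sum of squared rank differences."""
    return 1.0 - 6.0 * sum_sq_diffs / (n * (n * n - 1))


# Swapping two adjacent ranks shifts two rank differences by 1 each, so sum(d^2) = 2.
for n in (6, 50, 500):
    drop = 1.0 - spearman_from_rank_diffs(2, n)
    print(f"n={n}: one adjacent swap lowers rho by {drop:.6f}")
```

With six pairs, one adjacent swap costs about 0.057, so a handful of small ordering disagreements is enough to move the score across a 0.75 threshold. Larger validation sets make the same threshold far less sensitive to any single pair.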
```python
from sentence_transformers import SentenceTransformer
# If this import path is unavailable in your installed version, the class is
# also exposed as sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.
from sentence_transformers.sentence_transformer.evaluation import (
    EmbeddingSimilarityEvaluator,
)

# Row i of sentences1, sentences2, and scores describes one labeled pair.
sentences1 = [
    "A support agent reset the user's password.",
    "The database backup completed overnight.",
    "The release was rolled back after the deploy.",
    "The customer changed the billing address.",
    "The web server returned an SSL certificate warning.",
    "A new search index was built for documents.",
]
sentences2 = [
    "The account password was reset by support.",
    "A nightly backup of the database finished successfully.",
    "The deployment was reverted after the release.",
    "A user updated payment and billing details.",
    "The office printer ran out of toner.",
    "The kitchen refrigerator needs cleaning.",
]
# Human similarity labels: near 1.0 for matching meaning, near 0.0 for unrelated.
scores = [0.95, 0.90, 0.86, 0.55, 0.05, 0.03]

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
)

evaluator = EmbeddingSimilarityEvaluator(
    sentences1=sentences1,
    sentences2=sentences2,
    scores=scores,
    name="support-sts",
    main_similarity="cosine",
    show_progress_bar=False,
    write_csv=False,
)

# Calling the evaluator encodes both columns and returns a dict of metrics.
results = evaluator(model)
primary_metric = evaluator.primary_metric

print("Pairs evaluated:", len(sentences1))
print("Primary metric:", primary_metric)
print(f"Primary score: {results[primary_metric]:.3f}")
print(f"Pearson cosine: {results['support-sts_pearson_cosine']:.3f}")
print(f"Spearman cosine: {results['support-sts_spearman_cosine']:.3f}")

# Exit with an error when the primary correlation misses the threshold.
if results[primary_metric] < 0.75:
    raise SystemExit("STS evaluator score is below the acceptance threshold")

print("check: STS evaluation correlation is above 0.75")
```
Keep each row aligned by index: sentences1[i], sentences2[i], and scores[i] describe one labeled pair. Replace the sample rows with a held-out validation split before using the threshold in a project test.
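One way to keep the three lists aligned is to store each labeled pair as a single row and split the columns only at load time. The sketch below does this with the standard `csv` module. The column names (`sentence1`, `sentence2`, `score`), the `load_pairs` helper, the 0 to 1 label range, and the inline sample data are assumptions for illustration, not part of Sentence Transformers.

```python
import csv
import io


def load_pairs(handle):
    """Read labeled pairs row by row and return three index-aligned lists."""
    sentences1, sentences2, scores = [], [], []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        score = float(row["score"])
        # Assumed label range for this sketch: 0.0 (unrelated) to 1.0 (same meaning).
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"line {line_no}: score {score} is outside 0..1")
        sentences1.append(row["sentence1"].strip())
        sentences2.append(row["sentence2"].strip())
        scores.append(score)
    return sentences1, sentences2, scores


# Inline sample standing in for a held-out validation file.
sample = io.StringIO(
    "sentence1,sentence2,score\n"
    "The database backup completed overnight.,"
    "A nightly backup of the database finished successfully.,0.90\n"
    "The web server returned an SSL certificate warning.,"
    "The office printer ran out of toner.,0.05\n"
)
sentences1, sentences2, scores = load_pairs(sample)
assert len(sentences1) == len(sentences2) == len(scores)
print(len(scores), scores)
```

Because every append happens inside the same loop iteration, a malformed row raises an error instead of silently shifting one column against the others.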
```console
$ python3 evaluate_sts.py
Pairs evaluated: 6
Primary metric: support-sts_spearman_cosine
Primary score: 0.771
Pearson cosine: 0.971
Spearman cosine: 0.771
check: STS evaluation correlation is above 0.75
```
The first model load can print Hugging Face download or weight-loading messages before the metric output. Keep those environment messages out of saved test assertions.
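If a test does compare captured script output, one option is to pick out only the metric lines and ignore everything else. The sketch below assumes the `Label: value` line format printed by the script above. `extract_metrics` is a hypothetical helper, and the first line of the sample text is a placeholder rather than a real library message.

```python
import re

# Matches lines such as "Pearson cosine: 0.971" and captures the label and number.
METRIC_LINE = re.compile(
    r"^(Primary score|Pearson cosine|Spearman cosine): (-?\d+\.\d+)$"
)


def extract_metrics(stdout: str) -> dict[str, float]:
    """Return only the numeric metric lines from captured script output."""
    metrics = {}
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


captured = """\
(placeholder for download or weight-loading messages)
Pairs evaluated: 6
Primary metric: support-sts_spearman_cosine
Primary score: 0.771
Pearson cosine: 0.971
Spearman cosine: 0.771
"""
metrics = extract_metrics(captured)
print(metrics)
```

When the test can call the evaluator in-process, asserting directly on the returned dictionary avoids output parsing altogether.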
EmbeddingSimilarityEvaluator prefixes metric keys with the evaluator name, so the primary key support-sts_spearman_cosine combines name="support-sts", the Spearman correlation, and main_similarity="cosine". A failed threshold should stop the release or training run until the model and labeled pairs are reviewed.
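The key pattern and the stop-on-failure rule can be wrapped in small helpers so a test does not hard-code the same strings in several places. The sketch below is an illustration under stated assumptions: `metric_key` mirrors the name, metric, similarity pattern visible in the output above, `require_score` is a hypothetical helper, and the `results` dictionary is filled with the sample run's numbers rather than a fresh evaluation.

```python
def metric_key(name: str, metric: str, similarity: str) -> str:
    """Build a key in the '<name>_<metric>_<similarity>' pattern seen in the results."""
    return f"{name}_{metric}_{similarity}"


def require_score(results: dict[str, float], key: str, threshold: float) -> float:
    """Return the score, or raise so the calling test or pipeline step fails."""
    if key not in results:
        raise KeyError(f"{key!r} not found; available keys: {sorted(results)}")
    score = results[key]
    if score < threshold:
        raise ValueError(f"{key}={score:.3f} is below the threshold {threshold:.2f}")
    return score


# Values copied from the sample run shown earlier, standing in for evaluator(model).
results = {
    "support-sts_pearson_cosine": 0.971,
    "support-sts_spearman_cosine": 0.771,
}
key = metric_key("support-sts", "spearman", "cosine")
print(key, require_score(results, key, threshold=0.75))
```

Raising on a missing key as well as on a low score means that renaming the evaluator fails loudly instead of letting the check pass by accident.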
```console
$ rm evaluate_sts.py
```