Retrieval models need a search-shaped check, not only pairwise similarity scores. Sentence Transformers provides InformationRetrievalEvaluator, which takes a set of queries, a corpus of documents, and a mapping from each query to its relevant documents, so a developer can measure whether the expected document ranks near the top of the result list.
InformationRetrievalEvaluator embeds every query and corpus entry with the selected model, scores the query-document pairs, and reports metrics such as accuracy@k, recall@k, MRR@k, nDCG@k, and MAP@k. The support-search dataset used here is small enough to inspect by hand; the same dictionary structure can later be filled from a validation split.
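To make the metric names concrete, here is a minimal pure-Python sketch of accuracy@k, recall@k, MRR@k, and nDCG@k for a single ranked result list, using the common textbook definitions with binary relevance. It is an illustration rather than the library's implementation; the function names and the example ranking are hypothetical. The evaluator reports these values averaged over all queries.

```python
import math


def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of relevant documents found in the top k."""
    hits = sum(1 for d in ranked_ids[:k] if d in relevant_ids)
    return hits / len(relevant_ids)  # assumes at least one relevant document


def mrr_at_k(ranked_ids, relevant_ids, k):
    """Return 1/rank of the first relevant document in the top k, else 0.0."""
    for rank, d in enumerate(ranked_ids[:k], start=1):
        if d in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Return binary-relevance nDCG: DCG of this ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked_ids[:k], start=1)
        if d in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Hypothetical ranking for one query: the labeled document "d1" lands second.
ranked = ["d3", "d1", "d2", "d4"]
relevant = {"d1"}

print(accuracy_at_k(ranked, relevant, 1))        # 0.0
print(recall_at_k(ranked, relevant, 3))          # 1.0
print(mrr_at_k(ranked, relevant, 3))             # 0.5
print(round(ndcg_at_k(ranked, relevant, 3), 3))  # 0.631
```

In this made-up ranking the labeled document is found within the top three, so recall@3 is perfect, but MRR@3 and nDCG@3 drop because it is not in first place.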
The Python environment should already have sentence-transformers installed. Because the support-search data contains exactly one relevant document for each query, an accuracy@1 of 1.000 means every labeled document ranked first, and the recall@3, MRR@3, and nDCG@3 scores of 1.000 follow from that for this development check.
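When the hand-written data is later swapped for a validation split, the evaluator still needs the same three dictionaries: query ID to query text, document ID to document text, and query ID to a set of relevant document IDs. The following is a minimal sketch of that conversion. The row layout and the field names (query_id, query, doc_id, doc) are assumptions for illustration, as is the helper name build_ir_inputs; adapt them to the actual split.

```python
def build_ir_inputs(rows):
    """Build the queries, corpus, and relevant_docs dictionaries from labeled rows.

    Each row is assumed to pair one query with one relevant document.
    A query that appears in several rows collects several relevant IDs.
    """
    queries, corpus, relevant_docs = {}, {}, {}
    for row in rows:
        queries[row["query_id"]] = row["query"]
        corpus[row["doc_id"]] = row["doc"]
        relevant_docs.setdefault(row["query_id"], set()).add(row["doc_id"])
    return queries, corpus, relevant_docs


# Hypothetical validation rows reusing the support-search texts.
rows = [
    {
        "query_id": "q1",
        "query": "How do I reset a forgotten password?",
        "doc_id": "d1",
        "doc": "Reset a lost account password from the profile security page.",
    },
    {
        "query_id": "q2",
        "query": "How can I export invoices as a CSV file?",
        "doc_id": "d2",
        "doc": "Export paid invoices from the billing dashboard as a CSV file.",
    },
]

queries, corpus, relevant_docs = build_ir_inputs(rows)

# Unlabeled distractor documents can be added to the corpus afterwards.
corpus["d3"] = "Change the color theme for the analytics workspace."

# Sanity check: every labeled document ID must exist in the corpus.
missing = {d for ids in relevant_docs.values() for d in ids} - corpus.keys()
print(f"queries: {len(queries)}, corpus: {len(corpus)}, missing labels: {len(missing)}")
```

Storing the relevant IDs as sets keeps the structure valid if a query later gains more than one relevant document.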
Steps to run retrieval evaluation with Sentence Transformers:
- Create a Python script that defines the queries, corpus, relevance mapping, and evaluator.
retrieval_evaluator_run.py

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.sentence_transformer.evaluation import InformationRetrievalEvaluator

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    queries = {
        "q1": "How do I reset a forgotten password?",
        "q2": "How can I export invoices as a CSV file?",
    }

    corpus = {
        "d1": "Reset a lost account password from the profile security page.",
        "d2": "Export paid invoices from the billing dashboard as a CSV file.",
        "d3": "Change the color theme for the analytics workspace.",
        "d4": "Archive an inactive user without deleting historical records.",
    }

    relevant_docs = {
        "q1": {"d1"},
        "q2": {"d2"},
    }

    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="support-search-dev",
        accuracy_at_k=[1, 3],
        precision_recall_at_k=[1, 3],
        mrr_at_k=[3],
        ndcg_at_k=[3],
        map_at_k=[3],
        show_progress_bar=False,
        write_csv=False,
    )

    results = evaluator(model)

    for metric in (
        "support-search-dev_cosine_accuracy@1",
        "support-search-dev_cosine_recall@3",
        "support-search-dev_cosine_mrr@3",
        "support-search-dev_cosine_ndcg@3",
    ):
        print(f"{metric}: {results[metric]:.3f}")

    print(f"primary metric: {evaluator.primary_metric}")
    print(f"primary score: {results[evaluator.primary_metric]:.3f}")
queries and corpus use string IDs. relevant_docs maps each query ID to the document IDs that should count as correct.
- Run the evaluator script.
    $ python retrieval_evaluator_run.py
    support-search-dev_cosine_accuracy@1: 1.000
    support-search-dev_cosine_recall@3: 1.000
    support-search-dev_cosine_mrr@3: 1.000
    support-search-dev_cosine_ndcg@3: 1.000
    primary metric: support-search-dev_cosine_ndcg@3
    primary score: 1.000
The first model load can print Hugging Face download or rate-limit messages before the metrics. The evaluator metric lines are the output to compare across model changes.
- Remove the temporary evaluator script.
$ rm retrieval_evaluator_run.py
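Since the metric lines from the run step are the values to compare across model changes, one option is to keep the results dictionary from each run and print the differences. The following is a minimal sketch under that assumption. The helper name metric_deltas is hypothetical, and the numbers in both dictionaries are placeholders chosen for illustration, not measured results.

```python
def metric_deltas(baseline, candidate, keys):
    """Return candidate minus baseline for each key present in both dictionaries."""
    return {
        key: candidate[key] - baseline[key]
        for key in keys
        if key in baseline and key in candidate
    }


keys = [
    "support-search-dev_cosine_accuracy@1",
    "support-search-dev_cosine_ndcg@3",
]

# Placeholder values standing in for two saved evaluator results.
baseline = {
    "support-search-dev_cosine_accuracy@1": 1.0,
    "support-search-dev_cosine_ndcg@3": 1.0,
}
candidate = {
    "support-search-dev_cosine_accuracy@1": 0.5,
    "support-search-dev_cosine_ndcg@3": 0.75,
}

deltas = metric_deltas(baseline, candidate, keys)
for key, delta in deltas.items():
    print(f"{key}: {delta:+.3f}")
```

A negative delta means the candidate model ranked the labeled documents lower than the baseline did on the same queries and corpus, so both runs should use identical evaluator inputs.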
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.