How to run retrieval evaluation with Sentence Transformers

Retrieval models need a search-shaped check, not only pairwise similarity scores. Sentence Transformers provides InformationRetrievalEvaluator, which takes queries and corpus documents keyed by ID plus a mapping from each query to its relevant documents, so a developer can measure whether the expected document rises near the top of a ranked result list.

InformationRetrievalEvaluator embeds every query and corpus entry with the selected model, scores each query against every corpus document, and reports metrics such as accuracy@k, precision@k, recall@k, MRR@k, nDCG@k, and MAP@k. The support-search dataset used in this guide is small enough to inspect by hand; the same dictionary structure can later be filled from a real validation split.
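To make the ranking step concrete, here is a minimal sketch of what a retrieval metric measures. It is not the library's implementation: the helper names (cosine, rank_corpus, accuracy_at_k, mrr_at_k) are hypothetical, and the two-dimensional vectors are hand-written stand-ins for model embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_corpus(query_vec, corpus_vecs):
    """Return corpus IDs sorted from most to least similar to the query."""
    return sorted(
        corpus_vecs,
        key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
        reverse=True,
    )


def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant ID appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant ID within the top k, else 0.0."""
    for position, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / position
    return 0.0


# Toy vectors (assumed values, not produced by any model).
query_vec = [1.0, 0.2]
corpus_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.1, 1.0],
    "d3": [-1.0, 0.0],
}

ranked = rank_corpus(query_vec, corpus_vecs)
print("ranking:", ranked)
print("accuracy@1 if d1 is relevant:", accuracy_at_k(ranked, {"d1"}, k=1))
print("mrr@3 if d2 is relevant:", mrr_at_k(ranked, {"d2"}, k=3))
```

The evaluator performs the same kind of ranking for every query at once and averages the per-query values into the reported scores.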

The Python environment should already have sentence-transformers installed. Because the support-search data contains exactly one relevant document for each query, a score of 1.000 on a metric means every labeled document ranked inside that metric's cutoff for this development check; 1.000 on accuracy@1 or MRR@3 specifically means every labeled document ranked first.
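With a single relevant document per query, each metric reduces to a simple function of that document's rank. The sketch below uses the standard textbook definitions with binary relevance; single_relevant_metrics is a hypothetical helper, not a library function, and the library's internals may differ in detail.

```python
import math


def single_relevant_metrics(rank, k):
    """Per-query metrics when the only relevant document sits at 1-based `rank`."""
    hit = rank <= k
    return {
        "accuracy": 1.0 if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        "mrr": 1.0 / rank if hit else 0.0,
        # Ideal DCG is 1.0 here because one relevant document at rank 1
        # contributes 1 / log2(2).
        "ndcg": 1.0 / math.log2(rank + 1) if hit else 0.0,
    }


for rank in (1, 2, 4):
    print(rank, single_relevant_metrics(rank, k=3))
```

A relevant document at rank 2 still counts as a hit for accuracy@3 and recall@3, but MRR@3 and nDCG@3 fall below 1.0, which is why the rank-sensitive metrics are the ones to watch once the perfect scores disappear on a larger dataset.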

Steps to run retrieval evaluation with Sentence Transformers:

Create a Python script that defines the queries, corpus, relevance mapping, and evaluator.

retrieval_evaluator_run.py

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
 
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
 
queries = {
    "q1": "How do I reset a forgotten password?",
    "q2": "How can I export invoices as a CSV file?",
}
 
corpus = {
    "d1": "Reset a lost account password from the profile security page.",
    "d2": "Export paid invoices from the billing dashboard as a CSV file.",
    "d3": "Change the color theme for the analytics workspace.",
    "d4": "Archive an inactive user without deleting historical records.",
}
 
relevant_docs = {
    "q1": {"d1"},
    "q2": {"d2"},
}
 
evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="support-search-dev",
    accuracy_at_k=[1, 3],
    precision_recall_at_k=[1, 3],
    mrr_at_k=[3],
    ndcg_at_k=[3],
    map_at_k=[3],
    show_progress_bar=False,
    write_csv=False,
)
 
results = evaluator(model)
 
for metric in (
    "support-search-dev_cosine_accuracy@1",
    "support-search-dev_cosine_recall@3",
    "support-search-dev_cosine_mrr@3",
    "support-search-dev_cosine_ndcg@3",
):
    print(f"{metric}: {results[metric]:.3f}")
 
print(f"primary metric: {evaluator.primary_metric}")
print(f"primary score: {results[evaluator.primary_metric]:.3f}")

queries and corpus map string IDs to text. relevant_docs maps each query ID to the set of document IDs that should count as correct.
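When the hand-written dictionaries are later replaced with a validation split, the same three structures can be built from labeled rows. The following is a sketch under stated assumptions: the row keys (query_id, query, doc_id, doc) and both helper names are hypothetical and should be adapted to the actual dataset.

```python
def build_ir_inputs(rows):
    """Build queries, corpus, and relevant_docs from labeled (query, doc) rows."""
    queries, corpus, relevant_docs = {}, {}, {}
    for row in rows:
        queries[row["query_id"]] = row["query"]
        corpus[row["doc_id"]] = row["doc"]
        # Each query ID collects the set of document IDs labeled relevant.
        relevant_docs.setdefault(row["query_id"], set()).add(row["doc_id"])
    return queries, corpus, relevant_docs


def check_ir_inputs(queries, corpus, relevant_docs):
    """Raise ValueError if a relevance label points at a missing ID."""
    for query_id, doc_ids in relevant_docs.items():
        if query_id not in queries:
            raise ValueError(f"unknown query ID in relevant_docs: {query_id}")
        missing = sorted(doc_ids - corpus.keys())
        if missing:
            raise ValueError(f"unknown document IDs for {query_id}: {missing}")


# Assumed row format; the text is taken from the support-search example.
rows = [
    {
        "query_id": "q1",
        "query": "How do I reset a forgotten password?",
        "doc_id": "d1",
        "doc": "Reset a lost account password from the profile security page.",
    },
    {
        "query_id": "q2",
        "query": "How can I export invoices as a CSV file?",
        "doc_id": "d2",
        "doc": "Export paid invoices from the billing dashboard as a CSV file.",
    },
]

queries, corpus, relevant_docs = build_ir_inputs(rows)
check_ir_inputs(queries, corpus, relevant_docs)
print(relevant_docs)
```

Documents that are not relevant to any query, such as d3 and d4 in the script above, would still need to be added to corpus separately so the model has distractors to rank against.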

Run the evaluator script.

$ python retrieval_evaluator_run.py
support-search-dev_cosine_accuracy@1: 1.000
support-search-dev_cosine_recall@3: 1.000
support-search-dev_cosine_mrr@3: 1.000
support-search-dev_cosine_ndcg@3: 1.000
primary metric: support-search-dev_cosine_ndcg@3
primary score: 1.000

The first model load can print Hugging Face download or rate-limit messages before the metrics. The evaluator metric lines are the output to compare across model changes.
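One way to compare runs across model changes is to diff the result dictionaries returned by the evaluator. The sketch below assumes two such dictionaries are already available; metric_deltas is a hypothetical helper, and the numbers are placeholders chosen for illustration, not measured results.

```python
def metric_deltas(baseline, candidate):
    """Return candidate minus baseline for every metric present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {name: candidate[name] - baseline[name] for name in shared}


# Placeholder values only; real dictionaries would come from evaluator(model).
baseline = {
    "support-search-dev_cosine_accuracy@1": 0.5,
    "support-search-dev_cosine_ndcg@3": 0.75,
}
candidate = {
    "support-search-dev_cosine_accuracy@1": 1.0,
    "support-search-dev_cosine_ndcg@3": 1.0,
}

for name, delta in metric_deltas(baseline, candidate).items():
    print(f"{name}: {delta:+.3f}")
```

Keeping the evaluator name and cutoffs fixed between runs keeps the metric keys identical, so the comparison stays a plain dictionary lookup.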

Remove the temporary evaluator script.
```
$ rm retrieval_evaluator_run.py
```