How to run retrieval evaluation with Sentence Transformers

Retrieval models need a search-shaped check, not only pairwise similarity scores. Sentence Transformers provides InformationRetrievalEvaluator, which takes queries keyed by ID, corpus documents keyed by ID, and a relevance mapping between them, so a developer can measure whether the expected document rises near the top of a ranked result list.

InformationRetrievalEvaluator embeds every query and corpus entry with the selected model, scores the query-document pairs, and reports metrics such as accuracy@k, precision@k, recall@k, MRR@k, nDCG@k, and MAP@k. The support-search dataset used here is small enough to inspect by hand; the same dictionary structure can later be filled from a validation split.
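
To make the rank-based metrics concrete, the following is a minimal sketch of how accuracy@k, recall@k, MRR@k, and nDCG@k can be computed for a single query from an already ranked list of document IDs. It follows the standard binary-relevance definitions and is not the library's own implementation; the function name rank_metrics and the example ranking are hypothetical. The evaluator reports each metric averaged over all queries.

```python
import math


def rank_metrics(ranked_ids, relevant_ids, k):
    """Per-query accuracy@k, recall@k, MRR@k and nDCG@k (binary relevance)."""
    top_k = ranked_ids[:k]
    hits = [doc_id in relevant_ids for doc_id in top_k]

    # accuracy@k: did at least one relevant document make the cutoff?
    accuracy = 1.0 if any(hits) else 0.0

    # recall@k: share of the relevant documents found inside the cutoff.
    recall = sum(hits) / len(relevant_ids)

    # MRR@k: reciprocal rank of the first relevant document, 0 if none.
    mrr = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            mrr = 1.0 / rank
            break

    # nDCG@k: discounted gain divided by the gain of the best possible order.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, hit in enumerate(hits, start=1)
        if hit
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg if idcg else 0.0

    return {"accuracy": accuracy, "recall": recall, "mrr": mrr, "ndcg": ndcg}


# Hypothetical ranking in which the relevant document "d1" lands second.
example = rank_metrics(["d3", "d1", "d2", "d4"], {"d1"}, k=3)
print(example)
```

In this hypothetical ranking the relevant document is still inside the top three, so accuracy and recall stay at 1.0, while MRR and nDCG drop below 1.0 because the document is not in first place.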

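The Python environment should already have sentence-transformers installed. Because the support-search data contains exactly one relevant document for each query, a score of 1.000 on accuracy@1, MRR@3, or nDCG@3 means every labeled document ranked first, and a score of 1.000 on recall@3 means every labeled document appeared in the top three. Precision@3 cannot exceed 0.333 on this data, since at most one of the three retrieved documents is relevant.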
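
When the hand-written dictionaries are later replaced with a validation split, the three inputs can be generated from rows of labeled pairs. The sketch below assumes a hypothetical row format of (query text, relevant document text) plus an optional list of distractor documents; the helper name build_ir_inputs and the generated q1/d1 style IDs are illustrative choices, not part of the library.

```python
# Hypothetical validation rows: (query text, text of its one relevant document).
rows = [
    (
        "How do I reset a forgotten password?",
        "Reset a lost account password from the profile security page.",
    ),
    (
        "How can I export invoices as a CSV file?",
        "Export paid invoices from the billing dashboard as a CSV file.",
    ),
]

# Extra corpus entries that no query should retrieve.
distractors = [
    "Change the color theme for the analytics workspace.",
    "Archive an inactive user without deleting historical records.",
]


def build_ir_inputs(rows, distractors=()):
    """Build queries, corpus and relevant_docs dictionaries with string IDs."""
    queries, corpus, relevant_docs = {}, {}, {}
    doc_ids = {}  # document text -> ID, so repeated texts share one ID

    def doc_id_for(text):
        if text not in doc_ids:
            doc_ids[text] = f"d{len(doc_ids) + 1}"
            corpus[doc_ids[text]] = text
        return doc_ids[text]

    for index, (query_text, doc_text) in enumerate(rows, start=1):
        query_id = f"q{index}"
        queries[query_id] = query_text
        relevant_docs[query_id] = {doc_id_for(doc_text)}

    for text in distractors:
        doc_id_for(text)

    return queries, corpus, relevant_docs


queries, corpus, relevant_docs = build_ir_inputs(rows, distractors)

# Consistency checks: every query is labeled, every label exists in the corpus.
assert set(relevant_docs) == set(queries)
assert all(ids <= set(corpus) for ids in relevant_docs.values())
```

Running the consistency checks before the evaluator is a cheap way to catch a relevance label that points at a document ID missing from the corpus.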

Steps to run retrieval evaluation with Sentence Transformers:

  1. Create a Python script that defines the queries, corpus, relevance mapping, and evaluator.
    retrieval_evaluator_run.py
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import InformationRetrievalEvaluator
     
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
     
    queries = {
        "q1": "How do I reset a forgotten password?",
        "q2": "How can I export invoices as a CSV file?",
    }
     
    corpus = {
        "d1": "Reset a lost account password from the profile security page.",
        "d2": "Export paid invoices from the billing dashboard as a CSV file.",
        "d3": "Change the color theme for the analytics workspace.",
        "d4": "Archive an inactive user without deleting historical records.",
    }
     
    relevant_docs = {
        "q1": {"d1"},
        "q2": {"d2"},
    }
     
    evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name="support-search-dev",
        accuracy_at_k=[1, 3],
        precision_recall_at_k=[1, 3],
        mrr_at_k=[3],
        ndcg_at_k=[3],
        map_at_k=[3],
        show_progress_bar=False,
        write_csv=False,
    )
     
    results = evaluator(model)
     
    for metric in (
        "support-search-dev_cosine_accuracy@1",
        "support-search-dev_cosine_recall@3",
        "support-search-dev_cosine_mrr@3",
        "support-search-dev_cosine_ndcg@3",
    ):
        print(f"{metric}: {results[metric]:.3f}")
     
    print(f"primary metric: {evaluator.primary_metric}")
    print(f"primary score: {results[evaluator.primary_metric]:.3f}")

    Both queries and corpus map string IDs to text. relevant_docs maps each query ID to the set of document IDs that count as correct for that query.

  2. Run the evaluator script.
    $ python retrieval_evaluator_run.py
    support-search-dev_cosine_accuracy@1: 1.000
    support-search-dev_cosine_recall@3: 1.000
    support-search-dev_cosine_mrr@3: 1.000
    support-search-dev_cosine_ndcg@3: 1.000
    primary metric: support-search-dev_cosine_ndcg@3
    primary score: 1.000

    The first model load can print Hugging Face download or rate-limit messages before the metrics. The evaluator metric lines are the output to compare across model changes.

  3. Remove the temporary evaluator script.
    $ rm retrieval_evaluator_run.py
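
Since the metric lines are the values to compare across model changes, it can help to store each run's results dictionary and diff two runs. The following is a rough sketch of such a comparison using plain dictionaries; the helper names compare_results and regressions are hypothetical, the baseline values mirror the output shown in step 2, and the candidate values are made-up placeholders, not measured results.

```python
def compare_results(baseline, candidate):
    """Return {metric: (baseline, candidate, delta)} for metrics in both runs."""
    report = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[metric] - baseline[metric]
        report[metric] = (baseline[metric], candidate[metric], delta)
    return report


def regressions(report, tolerance=0.0):
    """List the metrics whose score dropped by more than the tolerance."""
    return [
        metric
        for metric, (_, _, delta) in report.items()
        if delta < -tolerance
    ]


# Baseline values taken from the example output in step 2.
baseline = {
    "support-search-dev_cosine_accuracy@1": 1.0,
    "support-search-dev_cosine_ndcg@3": 1.0,
}

# Placeholder numbers for a hypothetical second model, for illustration only.
candidate = {
    "support-search-dev_cosine_accuracy@1": 0.5,
    "support-search-dev_cosine_ndcg@3": 0.75,
}

report = compare_results(baseline, candidate)
for metric, (old, new, delta) in report.items():
    print(f"{metric}: {old:.3f} -> {new:.3f} ({delta:+.3f})")

print("regressed:", regressions(report))
```

With only two queries, a single changed ranking moves accuracy@1 by 0.5, so differences on a dataset this small are best treated as a smoke test and not as a ranking of models.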