Ranking tests catch the search failure that matters most to a retrieval system: a relevant passage appears below distractors for the same query. Sentence Transformers includes RerankingEvaluator for this query-positive-negative shape, so a small labeled set can show whether an embedding model orders candidates well enough before a retrieval workflow depends on it.

The evaluator receives samples with one query, one or more positive documents, and one or more negative documents. It encodes the query and candidate texts with a SentenceTransformer model, sorts the candidates by similarity to the query, and reports mean average precision (MAP), mean reciprocal rank (MRR@k), and normalized discounted cumulative gain (NDCG@k).
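To make the three reported metrics concrete, the sketch below computes them by hand for one ranked list of binary relevance labels. It is a minimal illustration under textbook definitions, not the library's code: the function names are hypothetical, and the evaluator's own implementation may differ in detail. MAP is the mean of the per-query average precision over all samples.

```python
import math


def average_precision(relevance):
    """Average precision for one ranked list of 0/1 labels (best candidate first)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / hits if hits else 0.0


def reciprocal_rank_at_k(relevance, k):
    """1 / rank of the first relevant candidate within the top k, else 0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(relevance, k):
    """DCG of the top k divided by the DCG of an ideal ordering of the same labels."""

    def dcg(labels):
        return sum(
            label / math.log2(rank + 1)
            for rank, label in enumerate(labels, start=1)
        )

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal else 0.0


# Hypothetical ranking: the single positive landed second out of four candidates.
ranked = [0, 1, 0, 0]
ap = average_precision(ranked)
rr = reciprocal_rank_at_k(ranked, k=3)
ndcg = ndcg_at_k(ranked, k=3)
print(f"AP={ap:.4f} RR@3={rr:.4f} NDCG@3={ndcg:.4f}")
# prints: AP=0.5000 RR@3=0.5000 NDCG@3=0.6309
```

A score of 1.0 on all three metrics means every positive was ranked above every negative for each query. Any distractor that outranks a positive pulls the scores down, as in the example ranking.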

Recent Sentence Transformers releases expose this evaluator from the sentence_transformers.sentence_transformer.evaluation module. The older top-level evaluation import can still print a deprecation warning, so the script uses the newer import path and sets show_progress_bar=False to keep warnings and progress bars out of saved metric logs.
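If the same script has to run against more than one installed release, one option is to try the newer import path first and fall back to the older one. The sketch below is a hypothetical helper, not part of Sentence Transformers. The first path is the one the script in step 2 imports from, and sentence_transformers.evaluation is assumed to be the older top-level import mentioned above.

```python
import importlib


def import_first_available(module_paths, attribute):
    """Return `attribute` from the first module path that imports cleanly."""
    errors = []
    for path in module_paths:
        try:
            module = importlib.import_module(path)
            return getattr(module, attribute)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{path}: {exc}")
    raise ImportError(f"could not import {attribute}; tried: " + "; ".join(errors))


# Candidate paths, newest first. Which one exists depends on the installed
# release; the second entry is an assumption about the older import location.
EVALUATOR_PATHS = [
    "sentence_transformers.sentence_transformer.evaluation",
    "sentence_transformers.evaluation",
]

try:
    RerankingEvaluator = import_first_available(EVALUATOR_PATHS, "RerankingEvaluator")
except ImportError as exc:
    # The package is missing or neither path exists in this environment.
    RerankingEvaluator = None
    print(f"RerankingEvaluator not importable: {exc}")
```

Trying the newer path first means a recent release never touches the deprecated import, while an older release that lacks the newer module falls through to the path it does provide.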

Steps to run Sentence Transformers reranking evaluation:

  1. Install Sentence Transformers in the active Python environment.
    $ python -m pip install --upgrade sentence-transformers

    The first model load may download model files from Hugging Face before evaluation metrics print.

  2. Create a reranking evaluator script with labeled query, positive, and negative samples.
    reranking_eval.py
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.sentence_transformer.evaluation import (
        RerankingEvaluator,
    )
     
     
    samples = [
        {
            "query": "How do I reset a forgotten password?",
            "positive": [
                "Reset a lost account password from the profile security page.",
            ],
            "negative": [
                "Generate quarterly revenue charts from a CSV export.",
                "Tune the database connection pool for a busy API server.",
                "Schedule a sales demo with the accounts team.",
            ],
        },
        {
            "query": "How can I rotate API keys?",
            "positive": [
                "Rotate API tokens before sharing a new integration.",
            ],
            "negative": [
                "Download a password reset email from account settings.",
                "Create a dashboard with monthly revenue charts.",
                "Archive old support tickets after closing a case.",
            ],
        },
    ]
     
    model = SentenceTransformer(
        "sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
    )
     
    evaluator = RerankingEvaluator(
        samples=samples,
        name="support-smoke",
        at_k=3,
        write_csv=True,
        show_progress_bar=False,
    )
     
    results = evaluator(model, output_path="reranking-results")
    threshold = 0.80
     
    print(f"primary metric: {evaluator.primary_metric}")
    print(f"primary score: {results[evaluator.primary_metric]:.4f}")
    for key in sorted(results):
        print(f"{key}: {results[key]:.4f}")
     
    if results[evaluator.primary_metric] < threshold:
        raise SystemExit(
            f"primary score below threshold {threshold:.2f}: "
            f"{results[evaluator.primary_metric]:.4f}"
        )
     
    print(f"verification: PASS threshold {threshold:.2f} reached")

    Each sample needs at least one positive and one negative document. Replace the inline samples with held-out labels before comparing real models.

  3. Run the evaluator script.
    $ python reranking_eval.py
    primary metric: support-smoke_ndcg@3
    primary score: 1.0000
    support-smoke_map: 1.0000
    support-smoke_mrr@3: 1.0000
    support-smoke_ndcg@3: 1.0000
    verification: PASS threshold 0.80 reached

    Each metric key is prefixed with the evaluator name, and the MRR and NDCG keys also include the at_k value. A first run can print model download or cache messages before these metric lines.

  4. Check the CSV written by RerankingEvaluator.
    $ cat reranking-results/*.csv
    epoch,steps,MAP,MRR@3,NDCG@3
    -1,-1,1.0,1.0,1.0

    Keep the CSV with the model, sample set, and at_k value when comparing later training runs or model candidates.

  5. Remove the smoke-test script if it is not part of the project test suite.
    $ rm reranking_eval.py
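Step 2 suggests replacing the inline samples with held-out labels, and notes that each sample needs at least one positive and one negative document. A minimal sketch of loading such a set from a JSON Lines file and filtering out unusable rows might look like the following. The file name heldout_samples.jsonl, the helper names, and the one-object-per-line layout are assumptions for illustration; the keys mirror the inline samples in the script.

```python
import json
import tempfile
from pathlib import Path


def is_text_list(value):
    """True for a non-empty list of non-blank strings."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(isinstance(item, str) and item.strip() for item in value)
    )


def load_reranking_samples(path):
    """Read one JSON object per line; return (usable samples, skipped line numbers)."""
    samples, skipped = [], []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        usable = (
            isinstance(record, dict)
            and isinstance(record.get("query"), str)
            and record["query"].strip()
            and is_text_list(record.get("positive"))
            and is_text_list(record.get("negative"))
        )
        if usable:
            samples.append(
                {
                    "query": record["query"],
                    "positive": record["positive"],
                    "negative": record["negative"],
                }
            )
        else:
            skipped.append(line_number)
    return samples, skipped


# Demo data written to a temporary file; the second row has no negatives.
demo_rows = [
    {
        "query": "How do I reset a forgotten password?",
        "positive": ["Reset a lost account password from the profile security page."],
        "negative": ["Generate quarterly revenue charts from a CSV export."],
    },
    {
        "query": "How can I rotate API keys?",
        "positive": ["Rotate API tokens before sharing a new integration."],
        "negative": [],
    },
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "heldout_samples.jsonl"
    demo_path.write_text(
        "\n".join(json.dumps(row) for row in demo_rows) + "\n",
        encoding="utf-8",
    )
    samples, skipped = load_reranking_samples(demo_path)

print(f"usable samples: {len(samples)}, skipped lines: {skipped}")
# prints: usable samples: 1, skipped lines: [2]
```

The returned list can be passed as the samples argument in the script from step 2. Reporting the skipped line numbers makes it visible when a labeled set is smaller than expected, so metric changes are not mistaken for model changes.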