How to run reranking evaluation with Sentence Transformers

Ranking tests catch the search failure that matters most to a retrieval system: a relevant passage appears below distractors for the same query. Sentence Transformers includes RerankingEvaluator for this query-positive-negative shape, so a small labeled set can show whether an embedding model orders candidates well enough before a retrieval workflow depends on it.

The evaluator receives samples that each contain one query, one or more positive documents, and one or more negative documents. It encodes the query and candidate texts with a SentenceTransformer model, sorts the candidates by similarity to the query, and reports MAP, MRR@k, and NDCG@k.
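
To make the three metrics concrete, here is a minimal pure-Python sketch of the standard definitions for a single ranked candidate list with binary relevance labels. It illustrates what the numbers mean and is not the library's implementation: the helper names are hypothetical, and RerankingEvaluator may differ in details such as tie handling. MAP is the mean of the per-query average precision.

```python
import math


def average_precision(labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank_at_k(labels, k):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(labels, k):
    """DCG of the top k divided by the DCG of an ideal ordering."""

    def dcg(items):
        return sum(
            rel / math.log2(rank + 1)
            for rank, rel in enumerate(items[:k], start=1)
        )

    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0


# Labels are listed in ranked order: 1 = positive, 0 = negative.
perfect = [1, 0, 0, 0]  # positive ranked first
demoted = [0, 1, 0, 0]  # positive ranked second, below one distractor

for name, labels in (("perfect", perfect), ("demoted", demoted)):
    print(
        name,
        average_precision(labels),
        reciprocal_rank_at_k(labels, 3),
        round(ndcg_at_k(labels, 3), 4),
    )
```

With one positive per query, a single distractor ranked above it halves both average precision and reciprocal rank, which is why a small labeled set is enough to expose ordering problems.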

Recent Sentence Transformers releases keep this evaluator in the sentence_transformers.sentence_transformer.evaluation module. The older top-level evaluation import can still print a deprecation warning, so the script imports from the package-reference path to keep that warning out of saved metric logs. First-run model download or cache messages are separate and can still appear before the metric lines.

Steps to run Sentence Transformers reranking evaluation:

Install Sentence Transformers in the active Python environment.
```
$ python -m pip install --upgrade sentence-transformers
```
The first model load may download model files from Hugging Face before evaluation metrics print.

Create a reranking evaluator script with labeled query, positive, and negative samples.

reranking_eval.py

```
from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.evaluation import (
    RerankingEvaluator,
)

# Each sample pairs one query with relevant ("positive") and
# irrelevant ("negative") candidate documents.
samples = [
    {
        "query": "How do I reset a forgotten password?",
        "positive": [
            "Reset a lost account password from the profile security page.",
        ],
        "negative": [
            "Generate quarterly revenue charts from a CSV export.",
            "Tune the database connection pool for a busy API server.",
            "Schedule a sales demo with the accounts team.",
        ],
    },
    {
        "query": "How can I rotate API keys?",
        "positive": [
            "Rotate API tokens before sharing a new integration.",
        ],
        "negative": [
            "Download a password reset email from account settings.",
            "Create a dashboard with monthly revenue charts.",
            "Archive old support tickets after closing a case.",
        ],
    },
]

# Load the embedding model on the CPU.
model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
)

# MRR and NDCG are cut off at rank 3; write_csv also saves the metrics
# to a CSV file under the output path given below.
evaluator = RerankingEvaluator(
    samples=samples,
    name="support-smoke",
    at_k=3,
    write_csv=True,
    show_progress_bar=False,
)

# Run the evaluation; the result maps metric names to scores.
results = evaluator(model, output_path="reranking-results")
threshold = 0.80

print(f"primary metric: {evaluator.primary_metric}")
print(f"primary score: {results[evaluator.primary_metric]:.4f}")
for key in sorted(results):
    print(f"{key}: {results[key]:.4f}")

# Exit with an error message when the primary score misses the threshold.
if results[evaluator.primary_metric] < threshold:
    raise SystemExit(
        f"primary score below threshold {threshold:.2f}: "
        f"{results[evaluator.primary_metric]:.4f}"
    )

print(f"verification: PASS threshold {threshold:.2f} reached")
```

Each sample needs at least one positive and one negative document. Replace the inline samples with held-out labels before comparing real models.
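
One way to enforce that rule when moving to held-out labels is to validate the samples before passing them to the evaluator. The following is a minimal sketch that assumes the labels are stored as one JSON object per line with the same query, positive, and negative keys as the inline samples. The function names and the heldout.jsonl file name are hypothetical.

```python
import json


def validate_samples(samples):
    """Return the samples unchanged, or raise ValueError on the first bad one."""
    for index, sample in enumerate(samples):
        if not sample.get("query"):
            raise ValueError(f"sample {index}: missing query")
        for field in ("positive", "negative"):
            docs = sample.get(field)
            if not isinstance(docs, list) or not docs:
                raise ValueError(
                    f"sample {index}: needs at least one {field} document"
                )
    return samples


def load_samples(lines):
    """Parse one JSON object per non-empty line, then validate the result."""
    parsed = [json.loads(line) for line in lines if line.strip()]
    return validate_samples(parsed)


# Illustrative line only; in practice the lines would come from a file, e.g.
# with open("heldout.jsonl", encoding="utf-8") as f: samples = load_samples(f)
lines = [
    '{"query": "How can I rotate API keys?", '
    '"positive": ["Rotate API tokens before sharing a new integration."], '
    '"negative": ["Archive old support tickets after closing a case."]}',
]
samples = load_samples(lines)
print(len(samples), "valid samples")
```

Failing fast on a malformed sample keeps a labeling mistake from silently skewing the reported metrics.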

Run the evaluator script.

```
$ python reranking_eval.py
primary metric: support-smoke_ndcg@3
primary score: 1.0000
support-smoke_map: 1.0000
support-smoke_mrr@3: 1.0000
support-smoke_ndcg@3: 1.0000
verification: PASS threshold 0.80 reached
```

The metric key includes the evaluator name and the at_k value. A first run can print model download or cache messages before these metric lines.
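
When another script needs to look up a specific metric, the keys can be rebuilt from the evaluator name and at_k value. The sketch below infers the naming pattern from the output shown above rather than from the library's documentation, so treat the pattern as an assumption. The metric_keys helper is hypothetical.

```python
def metric_keys(name, at_k):
    """Build result-dict keys following the pattern seen in the run above."""
    return {
        "map": f"{name}_map",
        "mrr": f"{name}_mrr@{at_k}",
        "ndcg": f"{name}_ndcg@{at_k}",
    }


# Scores copied from the run above.
results = {
    "support-smoke_map": 1.0,
    "support-smoke_mrr@3": 1.0,
    "support-smoke_ndcg@3": 1.0,
}

keys = metric_keys("support-smoke", 3)
for short_name, key in keys.items():
    print(f"{short_name}: {key} = {results[key]:.4f}")
```

Reading the key from evaluator.primary_metric, as the script does, avoids hard-coding the pattern when only the primary score matters.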

Check the CSV written by RerankingEvaluator.
```
$ cat reranking-results/*.csv
epoch,steps,MAP,MRR@3,NDCG@3
-1,-1,1.0,1.0,1.0
```
Keep the CSV with the model, sample set, and at_k value when comparing later training runs or model candidates.
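
A later comparison can be as simple as reading the last row of each saved CSV and subtracting the metric columns. The following is a minimal sketch that assumes both files share the header shown above. The helper names are hypothetical, and the candidate row is made up purely to illustrate the arithmetic; it is not a measured result.

```python
import csv
import io


def read_last_row(csv_text):
    """Return the last data row of an evaluator CSV as a dict of floats."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("CSV has no data rows")
    return {column: float(value) for column, value in rows[-1].items()}


def compare_runs(baseline, candidate, columns):
    """Return candidate minus baseline for each metric column."""
    return {column: candidate[column] - baseline[column] for column in columns}


# Baseline text matches the CSV shown above.
baseline_csv = "epoch,steps,MAP,MRR@3,NDCG@3\n-1,-1,1.0,1.0,1.0\n"
# Invented candidate values, for illustration only.
candidate_csv = "epoch,steps,MAP,MRR@3,NDCG@3\n-1,-1,0.75,0.75,0.8155\n"

deltas = compare_runs(
    read_last_row(baseline_csv),
    read_last_row(candidate_csv),
    columns=("MAP", "MRR@3", "NDCG@3"),
)
for column, delta in deltas.items():
    print(f"{column}: {delta:+.4f}")
```

The deltas are only meaningful when both runs used the same sample set and at_k value, which is why those details belong next to the CSV.
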
Remove the smoke-test script if it is not part of the project test suite.
```
$ rm reranking_eval.py
```

Author: Mohd Shakir Zakaria