Ranking tests catch the search failure that matters most to a retrieval system: a relevant passage appears below distractors for the same query. Sentence Transformers includes RerankingEvaluator for this query-positive-negative shape, so a small labeled set can show whether an embedding model orders candidates well enough before a retrieval workflow depends on it.
The evaluator receives samples with one query, one or more positive documents, and negative documents. It encodes the query and candidate texts with a SentenceTransformer model, sorts candidates by similarity, and reports MAP, MRR@k, and NDCG@k.
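The three metrics can be illustrated with a small standalone sketch. It is not the library's implementation: the helper names below are hypothetical, relevance is assumed to be binary (1 for a positive document, 0 for a negative), and details such as tie handling may differ from what RerankingEvaluator does internally. The evaluator reports these values averaged over all samples; the sketch scores a single query.

```python
import math


def rank_labels(scores, labels):
    """Order relevance labels by descending similarity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def reciprocal_rank_at_k(ranked_labels, k):
    """1 / rank of the first relevant candidate within the top k, else 0."""
    for rank, is_relevant in enumerate(ranked_labels[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_labels):
    """Mean of precision values taken at each relevant candidate's rank."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def ndcg_at_k(ranked_labels, k):
    """Discounted gain of the top k divided by the gain of an ideal ordering."""
    def dcg(labels):
        return sum(
            rel / math.log2(rank + 1)
            for rank, rel in enumerate(labels[:k], start=1)
        )

    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal else 0.0


# Hypothetical similarity scores: the only positive document scores lowest.
ranked = rank_labels(scores=[0.2, 0.9, 0.5], labels=[1, 0, 0])
print(ranked)
print(reciprocal_rank_at_k(ranked, k=3))
print(average_precision(ranked))
print(ndcg_at_k(ranked, k=3))
```

With the positive document ranked last of three, all three per-query values fall well below 1.0, which is the pattern a failing model would show in the averaged metrics.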
Recent Sentence Transformers releases keep this evaluator in the SentenceTransformer evaluation module. The older top-level evaluation import can still print a deprecation warning, so the script uses the import path from the package reference and keeps first-run model download or cache messages out of saved metric logs.
Steps to run Sentence Transformers reranking evaluation:
- Install Sentence Transformers in the active Python environment.
$ python -m pip install --upgrade sentence-transformers
The first model load may download model files from Hugging Face before evaluation metrics print.
Related: How to install Sentence Transformers with pip
- Create a reranking evaluator script with labeled query, positive, and negative samples.
reranking_eval.py
from sentence_transformers import SentenceTransformer
from sentence_transformers.sentence_transformer.evaluation import (
    RerankingEvaluator,
)

# Each sample pairs one query with relevant (positive) and
# irrelevant (negative) candidate documents.
samples = [
    {
        "query": "How do I reset a forgotten password?",
        "positive": [
            "Reset a lost account password from the profile security page.",
        ],
        "negative": [
            "Generate quarterly revenue charts from a CSV export.",
            "Tune the database connection pool for a busy API server.",
            "Schedule a sales demo with the accounts team.",
        ],
    },
    {
        "query": "How can I rotate API keys?",
        "positive": [
            "Rotate API tokens before sharing a new integration.",
        ],
        "negative": [
            "Download a password reset email from account settings.",
            "Create a dashboard with monthly revenue charts.",
            "Archive old support tickets after closing a case.",
        ],
    },
]

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
)

evaluator = RerankingEvaluator(
    samples=samples,
    name="support-smoke",
    at_k=3,
    write_csv=True,
    show_progress_bar=False,
)

# Calling the evaluator returns a dict of metrics and writes the CSV
# under output_path.
results = evaluator(model, output_path="reranking-results")

threshold = 0.80
print(f"primary metric: {evaluator.primary_metric}")
print(f"primary score: {results[evaluator.primary_metric]:.4f}")
for key in sorted(results):
    print(f"{key}: {results[key]:.4f}")

# Exit with a non-zero status when the primary metric misses the threshold.
if results[evaluator.primary_metric] < threshold:
    raise SystemExit(
        f"primary score below threshold {threshold:.2f}: "
        f"{results[evaluator.primary_metric]:.4f}"
    )
print(f"verification: PASS threshold {threshold:.2f} reached")
Each sample needs at least one positive and one negative document. Replace the inline samples with held-out labels before comparing real models.
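A quick shape check on the samples can catch labeling gaps before any model is loaded. The sketch below is one possible approach, not part of Sentence Transformers: validate_samples is a hypothetical helper that assumes the same "query", "positive", and "negative" keys used in the script and returns a list of problems instead of raising.

```python
def validate_samples(samples):
    """Return a list of problems; an empty list means the shape looks usable."""
    problems = []
    for index, sample in enumerate(samples):
        query = sample.get("query")
        if not isinstance(query, str) or not query.strip():
            problems.append(f"sample {index}: missing or empty query")
        for field in ("positive", "negative"):
            docs = sample.get(field)
            if not isinstance(docs, list) or not docs:
                problems.append(
                    f"sample {index}: needs at least one {field} document"
                )
            elif not all(isinstance(doc, str) and doc.strip() for doc in docs):
                problems.append(
                    f"sample {index}: {field} has a non-text or empty entry"
                )
    return problems


# Hypothetical sample with its negatives accidentally left out.
demo_samples = [
    {
        "query": "How can I rotate API keys?",
        "positive": ["Rotate API tokens before sharing a new integration."],
        "negative": [],
    },
]
for problem in validate_samples(demo_samples):
    print(problem)
```

Returning the problems as a list keeps the check easy to call from a test suite, where an assertion on an empty list gives a readable failure message.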
- Run the evaluator script.
$ python reranking_eval.py
primary metric: support-smoke_ndcg@3
primary score: 1.0000
support-smoke_map: 1.0000
support-smoke_mrr@3: 1.0000
support-smoke_ndcg@3: 1.0000
verification: PASS threshold 0.80 reached
The metric key includes the evaluator name and the at_k value. A first run can print model download or cache messages before these metric lines.
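Because the keys change whenever the evaluator name or at_k changes, a gate script can rebuild the expected keys instead of hard-coding them. The sketch below infers the key pattern only from the output shown above; expected_metric_keys is a hypothetical helper, and the results dict is a stand-in copied from that output, not a fresh evaluator run.

```python
def expected_metric_keys(name, at_k):
    """Build metric keys in the pattern seen in the output above."""
    return [
        f"{name}_map",
        f"{name}_mrr@{at_k}",
        f"{name}_ndcg@{at_k}",
    ]


# Stand-in for the dict returned by the evaluator call in the script.
results = {
    "support-smoke_map": 1.0,
    "support-smoke_mrr@3": 1.0,
    "support-smoke_ndcg@3": 1.0,
}

keys = expected_metric_keys("support-smoke", 3)
missing = [key for key in keys if key not in results]
if missing:
    raise SystemExit(f"missing metric keys: {missing}")
print("all expected metric keys present")
```

Checking for missing keys first turns a renamed evaluator or a changed at_k into a clear message instead of a KeyError later in the script.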
- Check the CSV written by RerankingEvaluator.
$ cat reranking-results/*.csv
epoch,steps,MAP,MRR@3,NDCG@3
-1,-1,1.0,1.0,1.0
Keep the CSV with the model, sample set, and at_k value when comparing later training runs or model candidates.
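One way to keep those pieces together is to fold the CSV row into a small record that also names the model, the at_k value, and a fingerprint of the sample set. The sketch below is an assumption about how such a record could look, not a Sentence Transformers feature: the helper names and record layout are hypothetical, and the CSV text is copied from the output above rather than read from disk.

```python
import csv
import hashlib
import io
import json


def sample_set_fingerprint(samples):
    """Short, stable hash so later runs can confirm they used the same labels."""
    payload = json.dumps(samples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def load_metric_rows(csv_text):
    """Parse evaluator CSV text into dicts with numeric values."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rows.append(
            {
                key: int(value) if key in ("epoch", "steps") else float(value)
                for key, value in row.items()
            }
        )
    return rows


def build_run_record(csv_text, model_name, at_k, samples):
    """Combine the latest CSV row with the context needed to compare runs."""
    return {
        "model": model_name,
        "at_k": at_k,
        "sample_set": sample_set_fingerprint(samples),
        "metrics": load_metric_rows(csv_text)[-1],
    }


# CSV content copied from the output above; samples shortened for the sketch.
csv_text = "epoch,steps,MAP,MRR@3,NDCG@3\n-1,-1,1.0,1.0,1.0\n"
samples = [{"query": "q", "positive": ["p"], "negative": ["n"]}]
record = build_run_record(
    csv_text, "sentence-transformers/all-MiniLM-L6-v2", 3, samples
)
print(json.dumps(record, indent=2))
```

Two records are only comparable when their sample_set and at_k fields match; a differing fingerprint signals that the labels changed between runs.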
- Remove the smoke-test script if it is not part of the project test suite.
$ rm reranking_eval.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.