Evaluating a sparse encoder checks whether its weighted vocabulary terms rank the labeled document above distractors for each query. Sparse retrieval behaves differently from dense embedding similarity because only active dimensions contribute to the dot product, so a retrieval-shaped test catches failures that inspecting a single encoded vector would miss.
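To make the "only active dimensions contribute" point concrete, here is a minimal sketch in plain Python. The term weights are invented for illustration and `sparse_dot` is a hypothetical helper, not part of sentence_transformers; a real SparseEncoder produces vocabulary-sized vectors in which most entries are zero.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Only terms that are active (non-zero) in BOTH vectors add to the score.
    shared = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[term] * doc_weights[term] for term in shared)


# Hypothetical weights, chosen only to illustrate the mechanics.
query = {"reset": 1.5, "password": 2.0, "forgotten": 1.0}
doc_relevant = {"reset": 1.0, "password": 1.5, "security": 0.5}
doc_distractor = {"color": 2.0, "theme": 1.8, "workspace": 1.2}

score_relevant = sparse_dot(query, doc_relevant)      # shares "reset" and "password"
score_distractor = sparse_dot(query, doc_distractor)  # shares no active term

print(score_relevant, score_distractor)
```

The distractor scores zero because it shares no active term with the query. Looking at one vector in isolation would not show this; the ranking only appears when query and document vectors are scored against each other.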
SparseInformationRetrievalEvaluator accepts queries, a corpus, and relevance labels, each keyed by ID, then encodes them with a SparseEncoder model and reports information retrieval metrics under the model's score function (dot product in this example). A small support-search set keeps the labels readable while preserving the same dictionary shape used for larger validation data.
The evaluator can compare the full sparse output with a 64-active-dimension cap through max_active_dims. Matching scores on the tiny sample confirm the evaluator wiring and active-dimension setting; model selection still needs a held-out validation set with enough hard negatives to expose ranking mistakes.
```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import (
    SparseInformationRetrievalEvaluator,
)

model = SparseEncoder(
    "naver/splade-cocondenser-ensembledistil",
    device="cpu",
)

queries = {
    "q1": "How do I reset a forgotten password?",
    "q2": "How can I export invoices as a CSV file?",
}

corpus = {
    "d1": "Reset a lost account password from the profile security page.",
    "d2": "Export paid invoices from the billing dashboard as a CSV file.",
    "d3": "Change the color theme for the analytics workspace.",
    "d4": "Archive an inactive user without deleting historical records.",
}

relevant_docs = {
    "q1": {"d1"},
    "q2": {"d2"},
}


def run_evaluator(name, max_active_dims):
    evaluator = SparseInformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=name,
        accuracy_at_k=[1, 3],
        precision_recall_at_k=[1, 3],
        mrr_at_k=[3],
        ndcg_at_k=[3],
        map_at_k=[3],
        max_active_dims=max_active_dims,
        show_progress_bar=False,
        write_csv=False,
    )
    results = evaluator(model)
    print(f"{name} primary metric: {evaluator.primary_metric}")
    print(f"{name} primary score: {results[evaluator.primary_metric]:.3f}")
    for metric in (
        f"{name}_dot_accuracy@1",
        f"{name}_dot_recall@3",
        f"{name}_dot_mrr@3",
        f"{name}_dot_ndcg@3",
    ):
        print(f"{metric}: {results[metric]:.3f}")


run_evaluator("support-sparse-full", None)
run_evaluator("support-sparse-64", 64)
```
queries and corpus map string IDs to text. relevant_docs maps each query ID to the set of document IDs that should count as correct.
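Because the three dictionaries are linked only by their IDs, a mistyped ID can skew the metrics without raising an obvious error. The following is a small sanity check one could run before evaluating. `check_labels` is a hypothetical helper written for this sketch, not a sentence_transformers function, and the sample data is invented.

```python
def check_labels(queries, corpus, relevant_docs):
    """Return a list of human-readable problems found in the label dicts."""
    problems = []
    for query_id, doc_ids in relevant_docs.items():
        if query_id not in queries:
            problems.append(f"relevant_docs has unknown query ID {query_id!r}")
        for doc_id in sorted(doc_ids):
            if doc_id not in corpus:
                problems.append(
                    f"query {query_id!r} points at unknown document ID {doc_id!r}"
                )
    for query_id in queries:
        # A query with no labeled document cannot be scored meaningfully.
        if not relevant_docs.get(query_id):
            problems.append(f"query {query_id!r} has no relevant documents")
    return problems


# Invented sample data with the same shape as the dictionaries above.
sample_queries = {"q1": "reset password", "q2": "export invoices"}
sample_corpus = {"d1": "Reset a lost password.", "d2": "Export invoices as CSV."}

good_labels = {"q1": {"d1"}, "q2": {"d2"}}
bad_labels = {"q1": {"d9"}}  # "d9" is not in the corpus, and "q2" is unlabeled

print(check_labels(sample_queries, sample_corpus, good_labels))
print(check_labels(sample_queries, sample_corpus, bad_labels))
```

An empty list means every label resolves to a real query and document. Anything else is worth fixing before trusting the scores.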
```
$ python sparse_encoder_evaluate.py
support-sparse-full primary metric: support-sparse-full_dot_ndcg@3
support-sparse-full primary score: 1.000
support-sparse-full_dot_accuracy@1: 1.000
support-sparse-full_dot_recall@3: 1.000
support-sparse-full_dot_mrr@3: 1.000
support-sparse-full_dot_ndcg@3: 1.000
support-sparse-64 primary metric: support-sparse-64_dot_ndcg@3
support-sparse-64 primary score: 1.000
support-sparse-64_dot_accuracy@1: 1.000
support-sparse-64_dot_recall@3: 1.000
support-sparse-64_dot_mrr@3: 1.000
support-sparse-64_dot_ndcg@3: 1.000
```
The first model load can print Hugging Face download, rate-limit, or weight-loading messages before the metrics. The metric lines are the output to compare across sparse model changes.
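To read the metric lines, it can help to see how a single query's ranking turns into accuracy@k, MRR@k, and NDCG@k. The sketch below uses the textbook definitions with binary relevance. The function names and both rankings are hypothetical, and the evaluator's internal implementation may differ in detail (for example, it averages over all queries).

```python
import math


def accuracy_at_k(ranked, relevant, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant for doc_id in ranked[:k]) else 0.0


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


relevant = {"d1"}
perfect = ["d1", "d3", "d4"]  # hypothetical ranking: labeled document first
swapped = ["d3", "d1", "d4"]  # hypothetical ranking: a distractor outranks it

for label, ranked in (("perfect", perfect), ("swapped", swapped)):
    print(
        label,
        accuracy_at_k(ranked, relevant, 1),
        mrr_at_k(ranked, relevant, 3),
        round(ndcg_at_k(ranked, relevant, 3), 3),
    )
```

With the labeled document in first place, all three values are 1.0, which is the pattern in the sample output. Moving it to second place drops accuracy@1 to 0.0, MRR@3 to 0.5, and NDCG@3 to about 0.631.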
```
support-sparse-64 primary metric: support-sparse-64_dot_ndcg@3
support-sparse-64 primary score: 1.000
```
max_active_dims=64 limits the number of active vocabulary dimensions used during evaluation. Lower values can reduce sparse vector cost, but a real validation set should prove that the cap does not hide relevant documents.
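The risk of a cap hiding relevant documents can be shown with a conceptual sketch. It assumes the cap keeps the highest-weighted dimensions and drops the rest. `cap_active_dims` is a hypothetical plain-Python helper and the weights are invented; the library itself operates on tensors and may handle details differently.

```python
def cap_active_dims(weights, max_active_dims):
    """Keep the max_active_dims largest weights; None leaves the vector as-is."""
    if max_active_dims is None or len(weights) <= max_active_dims:
        return dict(weights)
    top = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(top[:max_active_dims])


def overlap(query_terms, doc_weights):
    """Terms from the query that are still active in the document vector."""
    return sorted(query_terms & doc_weights.keys())


# Invented document weights: the query-matching terms have the lowest weights.
doc = {"account": 1.2, "security": 1.0, "profile": 0.9, "reset": 0.4, "password": 0.3}
query_terms = {"reset", "password"}

full = cap_active_dims(doc, None)
capped = cap_active_dims(doc, 3)

print(overlap(query_terms, full))    # both query terms are still active
print(overlap(query_terms, capped))  # the cap removed every matching term
```

In this made-up case the cap removes exactly the terms the query depends on, so the document's score for that query falls to zero. A validation set with hard negatives is what shows whether a given cap causes this on real data.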
```
$ rm sparse_encoder_evaluate.py
```