How to evaluate a sparse encoder with Sentence Transformers

Evaluating a sparse encoder checks whether its weighted vocabulary terms rank the labeled document above distractors for each query. Sparse retrieval behaves differently from dense embedding similarity because only active dimensions contribute to the dot product, so a retrieval-style test catches failures that inspecting a single encoded vector would miss.
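To make the "only active dimensions contribute" point concrete, here is a minimal sketch of a sparse dot product over term-weight dictionaries. The `sparse_dot` helper and the weights are hypothetical illustrations, not output from a real model or part of the Sentence Transformers API.

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller dict; only terms present in both add to the score.
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(
        weight * doc_weights[term]
        for term, weight in query_weights.items()
        if term in doc_weights
    )


# Hypothetical term weights chosen for illustration only.
query = {"reset": 1.5, "password": 2.0, "forgotten": 1.0}
doc_match = {"reset": 1.0, "password": 2.0, "security": 0.5}
doc_other = {"color": 1.5, "theme": 2.0, "analytics": 1.0}

score_match = sparse_dot(query, doc_match)  # shares "reset" and "password"
score_other = sparse_dot(query, doc_other)  # shares no terms with the query
print(score_match, score_other)
```

A document that shares no active terms with the query scores zero regardless of how large its own weights are, which is why ranking has to be tested against distractors rather than judged from one vector.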

SparseInformationRetrievalEvaluator accepts query IDs, corpus IDs, and relevant document IDs, then encodes the texts with a SparseEncoder model and reports information retrieval metrics such as accuracy, recall, MRR, NDCG, and MAP, scored here with the dot product. A small support-search set keeps the labels readable while preserving the same dictionary shape used for larger validation data.

Running the evaluator twice, once with the full sparse output and once with a 64-active-dimension cap set through max_active_dims, compares uncapped and capped retrieval. Matching scores on the tiny sample confirm the evaluator wiring and active-dimension setting; model selection still needs a held-out validation set with enough hard negatives to expose ranking mistakes.
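As a rough picture of what an active-dimension cap does, the sketch below keeps only the largest-weight entries of a sparse vector. It assumes the cap retains the highest-weight dimensions; `cap_active_dims` and the weights are hypothetical and do not reproduce the library's internal code.

```python
import heapq


def cap_active_dims(weights, max_active_dims):
    """Keep at most max_active_dims entries, preferring the largest weights."""
    # None means "no cap", mirroring the None passed for the full run.
    if max_active_dims is None or len(weights) <= max_active_dims:
        return dict(weights)
    top = heapq.nlargest(
        max_active_dims, weights.items(), key=lambda item: item[1]
    )
    return dict(top)


# Hypothetical document weights chosen for illustration only.
doc = {"export": 1.8, "invoices": 2.25, "csv": 2.0, "billing": 0.75, "file": 0.5}

capped = cap_active_dims(doc, 3)
uncapped = cap_active_dims(doc, None)
print(sorted(capped))
```

Terms dropped by the cap can no longer match a query, so a cap that is too aggressive may lower recall even when the tiny sample still scores perfectly.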

Steps to evaluate a SparseEncoder model with Sentence Transformers:

Create a sparse retrieval evaluator script.

sparse_encoder_evaluate.py

```
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import (
    SparseInformationRetrievalEvaluator,
)


model = SparseEncoder(
    "naver/splade-cocondenser-ensembledistil",
    device="cpu",
)

queries = {
    "q1": "How do I reset a forgotten password?",
    "q2": "How can I export invoices as a CSV file?",
}

corpus = {
    "d1": "Reset a lost account password from the profile security page.",
    "d2": "Export paid invoices from the billing dashboard as a CSV file.",
    "d3": "Change the color theme for the analytics workspace.",
    "d4": "Archive an inactive user without deleting historical records.",
}

relevant_docs = {
    "q1": {"d1"},
    "q2": {"d2"},
}


def run_evaluator(name, max_active_dims):
    evaluator = SparseInformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=name,
        accuracy_at_k=[1, 3],
        precision_recall_at_k=[1, 3],
        mrr_at_k=[3],
        ndcg_at_k=[3],
        map_at_k=[3],
        max_active_dims=max_active_dims,
        show_progress_bar=False,
        write_csv=False,
    )

    results = evaluator(model)
    print(f"{name} primary metric: {evaluator.primary_metric}")
    print(f"{name} primary score: {results[evaluator.primary_metric]:.3f}")
    for metric in (
        f"{name}_dot_accuracy@1",
        f"{name}_dot_recall@3",
        f"{name}_dot_mrr@3",
        f"{name}_dot_ndcg@3",
    ):
        print(f"{metric}: {results[metric]:.3f}")


run_evaluator("support-sparse-full", None)
run_evaluator("support-sparse-64", 64)
```

queries and corpus use string IDs. relevant_docs maps each query ID to the document IDs that should count as correct.
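Because the three dictionaries are linked only by their string IDs, a quick consistency check before running the model can catch label typos. The following is a minimal sketch; `check_labels` is a hypothetical helper, not part of Sentence Transformers.

```python
def check_labels(queries, corpus, relevant_docs):
    """Return a list of human-readable problems found in the label dictionaries."""
    problems = []
    # Every query should have at least one labeled relevant document.
    for query_id in queries:
        if not relevant_docs.get(query_id):
            problems.append(f"query {query_id!r} has no relevant documents")
    # Every label should point at a known query and known corpus documents.
    for query_id, doc_ids in relevant_docs.items():
        if query_id not in queries:
            problems.append(f"labels refer to unknown query {query_id!r}")
        for doc_id in sorted(doc_ids):
            if doc_id not in corpus:
                problems.append(
                    f"query {query_id!r} refers to unknown document {doc_id!r}"
                )
    return problems


# Shortened copies of the sample data, with one deliberate typo ("d9").
sample_queries = {"q1": "reset password", "q2": "export invoices"}
sample_corpus = {"d1": "Reset a password.", "d2": "Export invoices."}
good_labels = {"q1": {"d1"}, "q2": {"d2"}}
bad_labels = {"q1": {"d9"}}

print(check_labels(sample_queries, sample_corpus, good_labels))
print(check_labels(sample_queries, sample_corpus, bad_labels))
```

An empty list means the IDs line up; any reported problem is worth fixing before trusting the metric values.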

Run the evaluator script.

```
$ python sparse_encoder_evaluate.py
support-sparse-full primary metric: support-sparse-full_dot_ndcg@3
support-sparse-full primary score: 1.000
support-sparse-full_dot_accuracy@1: 1.000
support-sparse-full_dot_recall@3: 1.000
support-sparse-full_dot_mrr@3: 1.000
support-sparse-full_dot_ndcg@3: 1.000
support-sparse-64 primary metric: support-sparse-64_dot_ndcg@3
support-sparse-64 primary score: 1.000
support-sparse-64_dot_accuracy@1: 1.000
support-sparse-64_dot_recall@3: 1.000
support-sparse-64_dot_mrr@3: 1.000
support-sparse-64_dot_ndcg@3: 1.000
```

The first model load can print Hugging Face download, rate-limit, or weight-loading messages before the metrics. The metric lines are the output to compare across sparse model changes.
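For readers who want to interpret those metric lines, the sketch below computes per-query accuracy@k, recall@k, MRR@k, and NDCG@k from a ranked list of document IDs with binary relevance. It follows the standard definitions and assumes the evaluator averages such per-query values over all queries; `retrieval_metrics` is a hypothetical helper, and the library's own implementation may differ in detail.

```python
import math


def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Binary-relevance metrics for one query, given doc IDs ranked best first."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = [doc_id in relevant_ids for doc_id in ranked_ids[:k]]

    # accuracy@k: 1 if any relevant document appears in the top k.
    accuracy = 1.0 if any(hits) else 0.0
    # recall@k: fraction of relevant documents found in the top k.
    recall = sum(hits) / len(relevant_ids)
    # MRR@k: reciprocal rank of the first relevant document, 0 if none.
    mrr = 0.0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            mrr = 1.0 / rank
            break
    # NDCG@k: discounted gain divided by the gain of an ideal ranking.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, hit in enumerate(hits, start=1)
        if hit
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return {"accuracy": accuracy, "recall": recall, "mrr": mrr, "ndcg": dcg / idcg}


perfect = retrieval_metrics(["d1", "d3", "d4"], {"d1"}, k=3)
second = retrieval_metrics(["d3", "d1", "d4"], {"d1"}, k=3)
print(perfect)
print(second)
```

When the relevant document drops from rank 1 to rank 2, accuracy@3 and recall@3 stay at 1.0 while MRR@3 and NDCG@3 fall, which is why the rank-sensitive metrics are the ones to watch when comparing sparse model changes.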

Check that the capped run still reports the expected primary metric.
```
support-sparse-64 primary metric: support-sparse-64_dot_ndcg@3
support-sparse-64 primary score: 1.000
```

max_active_dims=64 limits the number of active vocabulary dimensions used during evaluation. Lower values can reduce the storage and scoring cost of sparse vectors, but a real validation set should confirm that the cap does not push relevant documents out of the top results.

Remove the temporary evaluator script if it is not part of the project test suite.
```
$ rm sparse_encoder_evaluate.py
```