How to evaluate embeddings with MTEB and Sentence Transformers

MTEB gives embedding teams a shared benchmark runner for comparing models on named tasks instead of private ad hoc examples. Pairing it with Sentence Transformers is useful when a model shortlist needs a repeatable score before a retrieval, clustering, or semantic similarity workflow depends on it.

The MTEB CLI can load public Sentence Transformers model IDs directly and write JSON result files under a chosen output folder. A small semantic textual similarity task such as STSBenchmark confirms the package install, model loader, dataset fetch, evaluation split, and score file layout before a larger benchmark run.

The local run evaluates sentence-transformers/all-MiniLM-L6-v2 on the STSBenchmark test split and reads the saved JSON for the main Spearman-based score. Replace the model ID and task name only after the small run succeeds, because full benchmark suites can download much larger datasets and run for far longer.
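To make the "Spearman-based score" concrete, the sketch below computes a rank correlation between cosine similarities and gold similarity labels in plain Python. It is a simplified illustration of the idea, not MTEB's implementation; the helper names, toy vectors, and gold scores are all made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side has no variation
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: one embedding pair per sentence pair, plus a gold label.
pair_embeddings = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
gold_scores = [4.8, 2.5, 0.2]

predicted = [cosine(a, b) for a, b in pair_embeddings]
print(f"spearman={spearman(predicted, gold_scores):.6f}")
```

Because the metric depends only on rank order, a model scores well when it orders sentence pairs the way the annotators did, regardless of the absolute similarity values.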

Steps to evaluate Sentence Transformers embeddings with MTEB:

Activate the Python environment that should run the benchmark.
```
$ source .venv/bin/activate
```
Use the same project environment that loads the model and stores evaluation dependencies.
Related: How to install Sentence Transformers with pip
Install MTEB and Sentence Transformers in the active environment.
```
(.venv) $ python -m pip install --upgrade mteb sentence-transformers
##### snipped #####
Successfully installed mteb-2.16.1 sentence-transformers-5.6.0
```
If pip builds retrieval-metric dependencies from source, install the platform compiler and Python headers for that environment before repeating the package install.
Confirm the exact MTEB task name before launching the run.
```
(.venv) $ mteb available-tasks --tasks STSBenchmark --languages eng
STS
    - STSBenchmark, t2t
```
MTEB task names are case-sensitive. Use ISO 639-3 language codes such as eng when filtering tasks by language.

Run the STSBenchmark test split with the selected Sentence Transformers model.

```
(.venv) $ mteb run --model sentence-transformers/all-MiniLM-L6-v2 --tasks STSBenchmark --languages eng --eval-splits test --output-folder mteb-results --batch-size 32 --no-co2-tracker
Running task STSBenchmark (split='test', hf_subset='default')
Running semantic similarity - Finished.
Finished evaluation for STSBenchmark
```

MTEB writes model metadata, run settings, and task scores under mteb-results/results/, relative to the directory where the command ran. Increase --batch-size only when memory use stays stable.
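When the same run has to be repeated for several shortlisted models, the command line can be assembled in a small script. The sketch below only rebuilds the flags already shown in the command above; build_mteb_command and the shortlist variable are hypothetical names, and the launch line is left commented out so that nothing is downloaded by accident.

```python
import shlex


def build_mteb_command(model_id, task, output_folder="mteb-results", batch_size=32):
    """Return the argument list for one `mteb run` call (hypothetical helper)."""
    return [
        "mteb", "run",
        "--model", model_id,
        "--tasks", task,
        "--languages", "eng",
        "--eval-splits", "test",
        "--output-folder", output_folder,
        "--batch-size", str(batch_size),
        "--no-co2-tracker",
    ]


# Extend this list once the small run with the first model succeeds.
shortlist = ["sentence-transformers/all-MiniLM-L6-v2"]

for model_id in shortlist:
    command = build_mteb_command(model_id, "STSBenchmark")
    print(shlex.join(command))
    # To launch the run instead of printing it:
    # import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a model ID or folder name ever contains spaces.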

Create a reader for the saved task score.

read_mteb_score.py

```
import json
from pathlib import Path

# Find the task result anywhere under the output folder; the path includes
# a model directory and a revision directory that differ per model.
result_file = next(Path("mteb-results").rglob("STSBenchmark.json"))
result = json.loads(result_file.read_text())

# Each split holds a list with one entry per subset; STSBenchmark has one.
score = result["scores"]["test"][0]

print(f"task={result['task_name']}")
print("model=sentence-transformers/all-MiniLM-L6-v2")
print("split=test")
print(f"subset={score['hf_subset']}")
print(f"main_score={score['main_score']:.6f}")
print(f"spearman={score['spearman']:.6f}")
print(f"result_file={result_file}")
```

The same pattern works for another task JSON file when the task stores scores under the test split.
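For tasks that may not have a test split, or that report more than one subset, the lookup can be made a little more defensive. The sketch below assumes only the JSON layout used by the reader above (scores, then split name, then a list of entries with hf_subset and main_score); main_scores_by_subset is a hypothetical helper, and the example dictionary simply mirrors the values from this run.

```python
def main_scores_by_subset(result, split="test"):
    """Map each subset name to its main score for the requested split."""
    scores = result["scores"]
    if split not in scores:
        available = ", ".join(sorted(scores))
        raise KeyError(f"split {split!r} not found; available splits: {available}")
    return {entry["hf_subset"]: entry["main_score"] for entry in scores[split]}


# Minimal stand-in for a loaded result file, using the values from this run.
example = {
    "task_name": "STSBenchmark",
    "scores": {
        "test": [
            {"hf_subset": "default", "main_score": 0.820325, "spearman": 0.820325}
        ]
    },
}

for subset, value in main_scores_by_subset(example).items():
    print(f"subset={subset} main_score={value:.6f}")
```

Raising an error that lists the available splits makes it obvious when a task was evaluated on a different split than the one being read.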

Print the saved score and result path.

```
(.venv) $ python read_mteb_score.py
task=STSBenchmark
model=sentence-transformers/all-MiniLM-L6-v2
split=test
subset=default
main_score=0.820325
spearman=0.820325
result_file=mteb-results/results/sentence-transformers__all-MiniLM-L6-v2/8b3219a92973c328a8e22fadcfa821b5dc75636a/STSBenchmark.json
```

The exact score can change when the model revision, task dataset revision, or MTEB version changes. Keep the generated JSON result with the experiment notes so later model comparisons use the same task and split.
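Once several models have been run into the same output folder, the saved files can be gathered into one ranked list. The sketch below assumes the directory layout seen in the result path above (a model directory, then a revision directory, then the task JSON); collect_main_scores is a hypothetical helper, not part of MTEB.

```python
import json
from pathlib import Path


def collect_main_scores(output_folder, task, split="test"):
    """Gather main scores for one task from every model under output_folder."""
    root = Path(output_folder)
    if not root.is_dir():
        return []
    rows = []
    for result_file in sorted(root.rglob(f"{task}.json")):
        result = json.loads(result_file.read_text())
        # Skip files that have no scores for the requested split.
        for entry in result.get("scores", {}).get(split, []):
            rows.append(
                {
                    "model_dir": result_file.parent.parent.name,
                    "revision": result_file.parent.name,
                    "subset": entry["hf_subset"],
                    "main_score": entry["main_score"],
                }
            )
    # Highest score first, so the strongest candidate is at the top.
    return sorted(rows, key=lambda row: row["main_score"], reverse=True)


for row in collect_main_scores("mteb-results", "STSBenchmark"):
    print(f"{row['model_dir']} {row['revision'][:8]} {row['main_score']:.6f}")
```

Keeping the revision directory in each row records which model revision produced a score, which matters when comparing results collected at different times.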