MTEB gives embedding teams a shared benchmark runner for comparing models on named tasks instead of private ad hoc examples. Pairing it with Sentence Transformers is useful when a model shortlist needs a repeatable score before a retrieval, clustering, or semantic similarity workflow depends on it.
The MTEB CLI can load public Sentence Transformers model IDs directly and write JSON result files under a chosen output folder. A small semantic textual similarity task such as STSBenchmark confirms the package install, model loader, dataset fetch, evaluation split, and score file layout before a larger benchmark run.
The local run evaluates sentence-transformers/all-MiniLM-L6-v2 on the STSBenchmark test split and reads the saved JSON for the main Spearman-based score. Replace the model ID and task name only after the small run succeeds, because full benchmark suites can download much larger datasets and run for far longer.
$ source .venv/bin/activate
Use the same project environment that loads the model and stores evaluation dependencies.
(.venv) $ python -m pip install --upgrade mteb sentence-transformers
##### snipped #####
Successfully installed mteb-2.16.1 sentence-transformers-5.6.0
If pip builds retrieval-metric dependencies from source, install the platform compiler and Python headers for that environment before repeating the package install.
(.venv) $ mteb available-tasks --tasks STSBenchmark --languages eng
STS
- STSBenchmark, t2t
MTEB task names are case-sensitive. Use ISO 639-3 language codes such as eng when filtering tasks by language.
(.venv) $ mteb run --model sentence-transformers/all-MiniLM-L6-v2 --tasks STSBenchmark --languages eng --eval-splits test --output-folder mteb-results --batch-size 32 --no-co2-tracker
Running task STSBenchmark (split='test', hf_subset='default')
Running semantic similarity - Finished.
Finished evaluation for STSBenchmark
MTEB writes model metadata, run settings, and task scores under mteb-results/results/, relative to the directory where the command ran. Increase --batch-size only when memory use stays stable.
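Once the small run has succeeded and the same evaluation needs repeating for other shortlisted models, the command can be assembled in Python rather than retyped. The following is a minimal sketch, not part of MTEB: build_mteb_command and run_benchmark are hypothetical helper names, and the flags are only those already used in the command above.

```python
import shlex
import subprocess
from typing import List


def build_mteb_command(
    model: str,
    task: str,
    output_folder: str = "mteb-results",
    split: str = "test",
    language: str = "eng",
    batch_size: int = 32,
) -> List[str]:
    """Assemble the `mteb run` invocation from this guide as an argument list."""
    return [
        "mteb", "run",
        "--model", model,
        "--tasks", task,
        "--languages", language,
        "--eval-splits", split,
        "--output-folder", output_folder,
        "--batch-size", str(batch_size),
        "--no-co2-tracker",
    ]


def run_benchmark(model: str, task: str) -> int:
    """Run one evaluation and return the CLI exit code (hypothetical helper)."""
    completed = subprocess.run(build_mteb_command(model, task), check=False)
    return completed.returncode


# Print the command instead of running it, so it can be reviewed first.
command = build_mteb_command("sentence-transformers/all-MiniLM-L6-v2", "STSBenchmark")
print(shlex.join(command))
```

Passing an argument list to subprocess.run avoids shell quoting problems when a model ID or folder name contains unusual characters. Calling run_benchmark starts a real evaluation, so it is left uncalled here.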
import json
from pathlib import Path

# Locate the saved task result anywhere under the output folder.
result_file = next(Path("mteb-results").rglob("STSBenchmark.json"))
result = json.loads(result_file.read_text())

# Each split holds a list of per-subset score entries; take the first one.
score = result["scores"]["test"][0]

print(f"task={result['task_name']}")
print("model=sentence-transformers/all-MiniLM-L6-v2")
print("split=test")
print(f"subset={score['hf_subset']}")
print(f"main_score={score['main_score']:.6f}")
print(f"spearman={score['spearman']:.6f}")
print(f"result_file={result_file}")
The same pattern works for another task JSON file when the task stores scores under the test split.
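For reuse across tasks, the reading logic can be wrapped in a function that reports a clear error when the file or split is missing. This is a sketch separate from read_mteb_score.py: read_main_score is a hypothetical name, and it assumes only the JSON fields the script above already reads (task_name, scores, hf_subset, main_score).

```python
import json
from pathlib import Path


def read_main_score(output_folder: str, task_name: str, split: str = "test") -> dict:
    """Return the first score entry for a task and split as a small summary dict."""
    matches = sorted(Path(output_folder).rglob(f"{task_name}.json"))
    if not matches:
        raise FileNotFoundError(f"no {task_name}.json found under {output_folder}")

    result_file = matches[0]
    result = json.loads(result_file.read_text())

    splits = result["scores"]
    if split not in splits:
        raise KeyError(f"{task_name} has no '{split}' split; found {sorted(splits)}")

    # Each split holds a list of per-subset entries; take the first one.
    entry = splits[split][0]
    return {
        "task": result["task_name"],
        "split": split,
        "subset": entry["hf_subset"],
        "main_score": entry["main_score"],
        "result_file": str(result_file),
    }
```

Calling read_main_score("mteb-results", "STSBenchmark") would read the same file as the script above. When several models have results for the same task, the function returns only the first match in sorted path order, so point output_folder at one model's directory to be specific.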
(.venv) $ python read_mteb_score.py
task=STSBenchmark
model=sentence-transformers/all-MiniLM-L6-v2
split=test
subset=default
main_score=0.820325
spearman=0.820325
result_file=mteb-results/results/sentence-transformers__all-MiniLM-L6-v2/8b3219a92973c328a8e22fadcfa821b5dc75636a/STSBenchmark.json
The exact score can change when the model revision, task dataset revision, or MTEB version changes. Keep the generated JSON result with the experiment notes so later model comparisons use the same task and split.
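When several models have been evaluated into the same output folder, the saved files can be gathered into one ranked list. The sketch below is illustrative only: collect_scores is a hypothetical helper, it assumes the directory layout shown in the result_file path above (model directory, then revision directory, then task file), and it treats the mteb_version field as optional because this guide does not show it.

```python
import json
from pathlib import Path


def collect_scores(output_folder: str, task_name: str, split: str = "test") -> list:
    """Collect one row per saved result of a task, sorted by main score (best first)."""
    rows = []
    for result_file in sorted(Path(output_folder).rglob(f"{task_name}.json")):
        result = json.loads(result_file.read_text())
        entries = result.get("scores", {}).get(split)
        if not entries:
            continue  # this result has no scores for the requested split

        rows.append({
            # Assumed layout: <model directory>/<revision>/<task>.json
            "model": result_file.parent.parent.name,
            "revision": result_file.parent.name,
            "main_score": entries[0]["main_score"],
            # Assumption: the field may be absent, so fall back to "unknown".
            "mteb_version": result.get("mteb_version", "unknown"),
        })
    return sorted(rows, key=lambda row: row["main_score"], reverse=True)


def print_comparison(rows: list) -> None:
    """Print one line per model with the score first for easy scanning."""
    for row in rows:
        print(f"{row['main_score']:.6f}  {row['model']}  "
              f"rev={row['revision'][:12]}  mteb={row['mteb_version']}")
```

Printing the revision and version next to each score makes it easier to spot a comparison between runs that did not share the same model revision or MTEB release.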