How to train a Matryoshka embedding model with Sentence Transformers

A standard embedding model produces vectors of one fixed width, so every downstream index, query, and reranker handoff has to use that width. Matryoshka training with Sentence Transformers teaches the leading dimensions of each embedding to carry the ranking signal, so the saved model can still compare related text after its vectors are truncated to a smaller width.

The current trainer workflow wraps a normal supervised loss with MatryoshkaLoss. The wrapper applies the base loss at each listed embedding prefix, such as 768, 256, 128, and 64 dimensions, while the same SentenceTransformerTrainer handles batching, logging, saving, and evaluator calls.
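The idea of applying one loss at several prefixes can be sketched without any library. The snippet below is a conceptual illustration only, not the MatryoshkaLoss implementation: `base_loss` is a toy squared-error stand-in for the wrapped loss, the 8-value vectors are made up, and the `weights` argument is an assumption added for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def base_loss(emb1, emb2, target):
    """Toy stand-in for the wrapped loss: squared error of cosine vs. label."""
    return (cosine(emb1, emb2) - target) ** 2


def matryoshka_loss(emb1, emb2, target, dims, weights=None):
    """Sum the base loss over the first d values of each embedding, for d in dims."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for d, w in zip(dims, weights):
        total += w * base_loss(emb1[:d], emb2[:d], target)
    return total


# Made-up 8-dimensional embeddings for one labeled pair.
emb_a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
emb_b = [0.8, 0.2, 0.1, -0.3, 0.10, -0.5, 0.3, -0.2]

loss_full_only = matryoshka_loss(emb_a, emb_b, 0.95, dims=[8])
loss_all_prefixes = matryoshka_loss(emb_a, emb_b, 0.95, dims=[8, 4, 2])
print(f"full only: {loss_full_only:.4f}, all prefixes: {loss_all_prefixes:.4f}")
```

Because every prefix contributes its own term, the optimizer is penalized when a short prefix ranks pairs badly even if the full vector ranks them well.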

Production training should use task-specific pairs or triplets from the retrieval system that will consume the model. A tiny support-ticket dataset keeps the local run short; acceptance comes from saving the model, comparing evaluator scores at multiple dimensions, and checking that a 64-dimension query still ranks the related document above an unrelated one.
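The acceptance criteria can be written down as a small check. This is a minimal sketch under stated assumptions: the function name `passes_smoke_test` and the `max_drop` tolerance are hypothetical, and the numbers are copied from the sample run shown in this guide rather than produced by this snippet.

```python
def passes_smoke_test(dim_scores, retrieval_scores, related_index=0, max_drop=0.1):
    """Return True when truncated scores stay close to the full-size score
    and the related document gets the highest retrieval score.

    dim_scores: mapping of dimension -> evaluator score (largest dimension included).
    retrieval_scores: similarity of the query to each candidate document.
    max_drop: hypothetical tolerance for score lost at any smaller dimension.
    """
    full_score = dim_scores[max(dim_scores)]
    dims_ok = all(full_score - score <= max_drop for score in dim_scores.values())
    best_index = max(range(len(retrieval_scores)), key=lambda i: retrieval_scores[i])
    return dims_ok and best_index == related_index


# Values copied from the sample run in this guide.
sample_dim_scores = {768: 0.5000, 256: 0.5952, 128: 0.5714, 64: 0.5476}
sample_retrieval = [0.7809, -0.1884]
print(passes_smoke_test(sample_dim_scores, sample_retrieval))
```

A real acceptance gate would use a held-out evaluation set and a tolerance chosen for the retrieval system, not the training pairs.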

Steps to train a Matryoshka embedding model with Sentence Transformers:

Install Sentence Transformers with its training dependencies.
```
$ python -m pip install --upgrade "sentence-transformers[train]"
```
Use a virtual environment for training work. The training extra installs the trainer stack, including datasets and accelerate.

Create a training script that wraps CoSENTLoss with MatryoshkaLoss.

train_matryoshka.py

```
from pathlib import Path

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.sentence_transformer import losses
from sentence_transformers.sentence_transformer.evaluation import (
    EmbeddingSimilarityEvaluator,
    SimilarityFunction,
)

output_dir = Path("models/support-matryoshka")

# Base model with 768-dimensional embeddings.
model = SentenceTransformer(
    "sentence-transformers/paraphrase-albert-small-v2",
    model_kwargs={"torch_dtype": "float32"},
)

# Tiny scored-pair dataset: two text columns plus a similarity score.
train_dataset = Dataset.from_dict(
    {
        "sentence1": [
            "reset a forgotten admin password",
            "restore a deleted customer record",
            "create a read only database user",
            "rotate an expired api token",
            "download the latest invoice",
            "configure daily database backups",
            "invite a new support agent",
            "archive a completed support case",
        ],
        "sentence2": [
            "help an administrator regain account access",
            "recover a customer profile that was removed",
            "add a user that can only read from the database",
            "replace a token that has expired",
            "retrieve the newest billing invoice",
            "schedule backups for the database every day",
            "add another agent to the support team",
            "close and archive a resolved case",
        ],
        "score": [0.95, 0.92, 0.9, 0.91, 0.88, 0.89, 0.86, 0.84],
    }
)

# Wrap the base loss so it is applied at every listed embedding prefix.
base_loss = losses.CoSENTLoss(model)
loss = losses.MatryoshkaLoss(
    model=model,
    loss=base_loss,
    matryoshka_dims=[768, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir=str(output_dir),
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    warmup_steps=0,
    save_strategy="no",
    report_to="none",
    disable_tqdm=True,
    logging_steps=1,
)

# The evaluator reuses the training pairs, which is only suitable for a smoke test.
evaluator = EmbeddingSimilarityEvaluator(
    sentences1=train_dataset["sentence1"],
    sentences2=train_dataset["sentence2"],
    scores=train_dataset["score"],
    main_similarity=SimilarityFunction.COSINE,
    name="support-pairs",
    write_csv=False,
)

print(f"matryoshka dimensions: {loss.matryoshka_dims}")
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()
model.save_pretrained(output_dir / "final")
print(f"saved model: {output_dir / 'final'}")

# Reload the saved model and score it at each truncated dimension.
trained_model = SentenceTransformer(str(output_dir / "final"))
for dim in [768, 256, 128, 64]:
    dim_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=train_dataset["sentence1"],
        sentences2=train_dataset["sentence2"],
        scores=train_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name=f"support-pairs-{dim}",
        truncate_dim=dim,
        write_csv=False,
    )
    results = dim_evaluator(trained_model)
    print(f"{dim} dimensions cosine_spearman: {results[dim_evaluator.primary_metric]:.4f}")

# Retrieval check at 64 dimensions: the first document is the related one.
query = ["reset admin login"]
documents = [
    "help an administrator regain account access",
    "retrieve the newest billing invoice",
]
embeddings = trained_model.encode(query + documents, normalize_embeddings=True, truncate_dim=64)
scores = trained_model.similarity(embeddings[:1], embeddings[1:])[0]
print(f"64 dimensions retrieval scores: {scores[0]:.4f}, {scores[1]:.4f}")
```

The first dimension should match the model's native embedding size. Keep every listed dimension compatible with the vector database schemas that will store or query the trained model.
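A pre-flight check can catch a mismatched dimension list before a training run starts. The sketch below is an illustration under assumptions: `validate_matryoshka_dims` and the `index_dims` argument are hypothetical names, and the check compares the largest listed value with the native size rather than relying on list order.

```python
def validate_matryoshka_dims(dims, native_dim, index_dims):
    """Return a list of problems; an empty list means the dimensions look usable.

    dims: dimensions planned for training.
    native_dim: the model's full embedding size (for a loaded model this can be
        read with model.get_sentence_embedding_dimension()).
    index_dims: vector widths that downstream indexes expect (hypothetical input).
    """
    if not dims:
        return ["no dimensions listed"]
    problems = []
    if max(dims) != native_dim:
        problems.append(f"largest dimension {max(dims)} != native size {native_dim}")
    if len(set(dims)) != len(dims):
        problems.append("duplicate dimensions")
    for d in dims:
        if d <= 0 or d > native_dim:
            problems.append(f"dimension {d} is outside 1..{native_dim}")
    missing = sorted(set(index_dims) - set(dims))
    if missing:
        problems.append(f"index widths not trained: {missing}")
    return problems


# The dimension list from this guide, checked against two assumed index widths.
print(validate_matryoshka_dims([768, 256, 128, 64], native_dim=768, index_dims=[256, 64]))
```

Returning a list of messages instead of raising on the first problem makes it easier to report every mismatch in one pass.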

Run the training script.

```
$ python train_matryoshka.py
matryoshka dimensions: (768, 256, 128, 64)
{'loss': '24.37', 'grad_norm': '1349', 'learning_rate': '2e-05', 'epoch': '0.5'}
{'loss': '13.64', 'grad_norm': '801.7', 'learning_rate': '1e-05', 'epoch': '1'}
{'train_runtime': '79.81', 'train_samples_per_second': '0.1', 'train_steps_per_second': '0.025', 'train_loss': '19', 'epoch': '1'}
saved model: models/support-matryoshka/final
768 dimensions cosine_spearman: 0.5000
256 dimensions cosine_spearman: 0.5952
128 dimensions cosine_spearman: 0.5714
64 dimensions cosine_spearman: 0.5476
64 dimensions retrieval scores: 0.7809, -0.1884
```

The exact scores vary by model, dataset, hardware, and random seed. For this local smoke test, the important signals are the saved model path, one score per Matryoshka dimension, and a higher 64-dimension similarity score for the related support document than for the unrelated one. Rebuild vector indexes when changing the truncation dimension because stored document vectors and query vectors must use the same width.
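The width requirement can be illustrated with a toy in-memory index. This is a sketch under assumptions, not a vector database client: `FixedWidthIndex`, `truncate_and_normalize`, and the 8-value vectors are hypothetical and only show why a change of truncation dimension forces a rebuild.

```python
import math


def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale the prefix to unit length."""
    if dim > len(vector):
        raise ValueError(f"cannot truncate width {len(vector)} to {dim}")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm else list(prefix)


class FixedWidthIndex:
    """Toy index that rejects any vector whose width differs from its own."""

    def __init__(self, dim):
        self.dim = dim
        self.rows = []

    def _check(self, vector):
        if len(vector) != self.dim:
            raise ValueError(f"expected width {self.dim}, got {len(vector)}")

    def add(self, doc_id, vector):
        self._check(vector)
        self.rows.append((doc_id, vector))

    def search(self, query):
        """Return document ids ordered by dot product, highest first."""
        self._check(query)
        scored = [(sum(q * v for q, v in zip(query, vec)), doc_id) for doc_id, vec in self.rows]
        return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Made-up full-width vectors standing in for model output.
doc_vectors = {
    "password-reset": [0.9, 0.1, 0.0, 0.2, 0.3, -0.1, 0.0, 0.1],
    "invoice": [-0.2, 0.8, 0.1, -0.3, 0.0, 0.2, 0.4, -0.1],
}
query_full = [0.8, 0.0, 0.1, 0.3, -0.2, 0.1, 0.0, 0.2]

# Build the index at width 4, then query it at the same width.
index = FixedWidthIndex(dim=4)
for doc_id, vec in doc_vectors.items():
    index.add(doc_id, truncate_and_normalize(vec, 4))
ranking = index.search(truncate_and_normalize(query_full, 4))
print(ranking)
```

Querying this index with the untruncated 8-value vector raises a `ValueError`, which is the toy equivalent of a schema mismatch in a real vector store.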

Author: Mohd Shakir Zakaria