How to choose a Sentence Transformers inference backend

Sentence Transformers can run embedding inference through PyTorch, ONNX Runtime, or OpenVINO. Choosing the backend before wiring the model into an application keeps the first deployment aligned with the available hardware, installed runtime packages, and export format.

The backend argument controls the inference runtime used by SentenceTransformer. torch is the default path and usually fits development, GPU-backed experiments, and the first working local baseline. onnx and openvino add runtime-specific dependencies and may export the model on first load when a matching exported file is not already present.

Confirm the backend choice with a short smoke test, but do not treat that test as a substitute for benchmarking. A backend is ready for the workload when the runtime package is present, the model loads with that backend name, and a sample encode call returns the expected embedding shape.

Steps to choose a Sentence Transformers inference backend:

Check the installed runtime packages and accelerator state.

$ python - <<'PY'
import importlib.util
import torch

available = ["torch"]
if importlib.util.find_spec("onnxruntime"):
    available.append("onnx")
if importlib.util.find_spec("openvino"):
    available.append("openvino")

print("available_backends=" + ",".join(available))
print(f"cuda_available={torch.cuda.is_available()}")
PY
available_backends=torch
cuda_available=False

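torch is always available when Sentence Transformers and PyTorch are installed. onnx is listed when the onnxruntime package is importable, and openvino is listed when the openvino package is importable. Treat this as a quick signal only: both backends are normally loaded through Hugging Face Optimum, which the matching sentence-transformers extra installs, so a listed runtime does not by itself guarantee that the backend will load.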

Select the backend for the deployment target.

Choose torch for the first local baseline, GPU-backed PyTorch deployments, or model work that still changes often. Choose onnx when the service already standardizes on ONNX Runtime. Choose openvino for CPU-focused deployments on hardware where OpenVINO is the intended inference runtime.
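
The selection rules above can be written down as a small helper so the choice is explicit in deployment scripts. This is a minimal sketch, not part of Sentence Transformers: the function name choose_backend and its keyword arguments are hypothetical, and the available list is assumed to come from a package check like the one in the previous step.

```python
def choose_backend(available, *, standardizes_on_onnx=False, openvino_cpu_target=False):
    """Return (backend, needs_install) following the selection rules above.

    All parameter names are illustrative assumptions, not library options.
    """
    if standardizes_on_onnx:
        # The service already runs ONNX Runtime elsewhere.
        choice = "onnx"
    elif openvino_cpu_target:
        # CPU-focused deployment where OpenVINO is the intended runtime.
        choice = "openvino"
    else:
        # Local baseline, GPU-backed PyTorch, or a model that still changes often.
        choice = "torch"
    # Flag the case where the matching runtime package still has to be installed.
    needs_install = choice not in available
    return choice, needs_install


print(choose_backend(["torch"]))                             # ('torch', False)
print(choose_backend(["torch"], standardizes_on_onnx=True))  # ('onnx', True)
```

When needs_install is true, the install step that follows applies before the backend can be loaded.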

Install the backend extra when the selected backend is not torch.

```
$ pip install -U "sentence-transformers[onnx]"
```
Use sentence-transformers[onnx-gpu] when ONNX Runtime needs GPU providers, or sentence-transformers[openvino] when the selected backend is openvino. Skip this step for the default torch backend.
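
For provisioning scripts, the mapping from backend name to pip extra can be kept in one place. The sketch below only builds the command string from the extras named above; the helper name install_command and its gpu flag are hypothetical, and nothing is executed.

```python
# Extras as named in this guide; torch needs no extra.
BACKEND_EXTRAS = {"torch": None, "onnx": "onnx", "openvino": "openvino"}


def install_command(backend, gpu=False):
    """Return the pip command for a backend, or None when no extra is needed."""
    if backend not in BACKEND_EXTRAS:
        raise ValueError(f"unknown backend: {backend!r}")
    extra = BACKEND_EXTRAS[backend]
    if backend == "onnx" and gpu:
        # ONNX Runtime with GPU providers uses a separate extra.
        extra = "onnx-gpu"
    if extra is None:
        return None
    return f'pip install -U "sentence-transformers[{extra}]"'


print(install_command("onnx"))      # pip install -U "sentence-transformers[onnx]"
print(install_command("openvino"))  # pip install -U "sentence-transformers[openvino]"
print(install_command("torch"))     # None
```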

Related: How to use the ONNX backend with Sentence Transformers
Related: How to use the OpenVINO backend with Sentence Transformers

Verify that the selected backend loads the model and encodes text.

$ BACKEND=torch python - <<'PY'
import os
from sentence_transformers import SentenceTransformer

backend = os.environ["BACKEND"]
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", backend=backend)
embeddings = model.encode(["backend choice check", "short text"], convert_to_numpy=True)
print(f"selected_backend={backend}")
print(f"embedding_shape={embeddings.shape}")
PY
selected_backend=torch
embedding_shape=(2, 384)

Replace BACKEND=torch with BACKEND=onnx or BACKEND=openvino after installing the matching extra. The first onnx or openvino load can take longer when Sentence Transformers must export the model before running inference.
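
To make the slower first load visible, the same check can be wrapped in a function that also records the load time. This is a rough sketch under a few assumptions: the names smoke_test and shape_matches are hypothetical, the expected dimension of 384 is taken from the output shown above for all-MiniLM-L6-v2, and the timing is a single wall-clock measurement rather than a benchmark.

```python
import time


def shape_matches(shape, n_texts, expected_dim):
    """True when the embedding array has one row per text and the expected width."""
    return tuple(shape) == (n_texts, expected_dim)


def smoke_test(backend, model_name="sentence-transformers/all-MiniLM-L6-v2", expected_dim=384):
    """Load the model with the given backend, encode two texts, and report the result."""
    # Imported lazily so the helper above can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    texts = ["backend choice check", "short text"]
    start = time.perf_counter()
    # The first onnx or openvino load may include a model export.
    model = SentenceTransformer(model_name, backend=backend)
    load_seconds = time.perf_counter() - start
    embeddings = model.encode(texts, convert_to_numpy=True)
    return {
        "backend": backend,
        "load_seconds": round(load_seconds, 2),
        "shape_ok": shape_matches(embeddings.shape, len(texts), expected_dim),
    }
```

Calling smoke_test("onnx") twice in a row gives a simple way to compare the first load against a later one. A result with shape_ok set to False suggests the model or backend is not the one expected.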

Author: Mohd Shakir Zakaria