How to select a Sentence Transformers inference device

Local embedding workloads can land on different hardware depending on the PyTorch build, installed drivers, and visible accelerators. For Sentence Transformers, selecting the inference device explicitly keeps development runs, GPU servers, and Apple Silicon laptops from silently using a different runtime than expected.

SentenceTransformer accepts a device value such as cpu, cuda:0, or mps when the model is loaded. The selected runtime is also visible through model.device after loading, which gives a direct check before a larger embedding job starts.

Use cpu as the fallback that works everywhere, and request cuda:0 or mps only after PyTorch reports that accelerator as available. A failed explicit request should stop early instead of falling back silently, because a batch job that misses the intended accelerator can be much slower and harder to diagnose after it starts.

Steps to select a Sentence Transformers inference device:

Create a device selection script that detects PyTorch accelerators and loads the embedding model on the requested device.

select_inference_device.py

```python
import argparse

import torch
from sentence_transformers import SentenceTransformer


def detect_devices():
    """Return usable device names in preference order, ending with cpu."""
    devices = []

    # One entry per visible CUDA device: cuda:0, cuda:1, ...
    if torch.cuda.is_available():
        devices.extend(f"cuda:{index}" for index in range(torch.cuda.device_count()))

    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        devices.append("mps")

    # cpu is always available and is the last choice for "auto".
    devices.append("cpu")
    return devices


parser = argparse.ArgumentParser()
parser.add_argument("--device", default="auto", help="auto, cpu, cuda:0, or mps")
args = parser.parse_args()

available_devices = detect_devices()
requested_device = args.device

if requested_device == "auto":
    # Take the first detected device: CUDA, then MPS, then CPU.
    selected_device = available_devices[0]
elif requested_device in available_devices:
    selected_device = requested_device
else:
    # Stop early instead of silently falling back to another device.
    raise SystemExit(
        f"{requested_device} is not available. Available devices: "
        f"{', '.join(available_devices)}"
    )

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device=selected_device,
)

# Encode two short texts as a quick end-to-end check on the selected device.
embeddings = model.encode(
    [
        "Route embedding inference to the selected accelerator.",
        "Keep a CPU fallback for development machines.",
    ],
    show_progress_bar=False,
)

print(f"requested device: {requested_device}")
print(f"available devices: {', '.join(available_devices)}")
print(f"selected device: {model.device}")
print(f"embedding shape: {embeddings.shape}")
```

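torch.cuda.is_available() reports whether PyTorch can use CUDA at all, and torch.cuda.device_count() gives the number of visible CUDA devices that become the cuda:N names. torch.backends.mps.is_available() reports whether the Apple Metal (MPS) backend is usable in the current PyTorch runtime.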

Run the script with the explicit CPU device.
```
$ python select_inference_device.py --device cpu
requested device: cpu
available devices: cpu
selected device: cpu
embedding shape: (2, 384)
```

The shape confirms that the model encoded two texts into 384-dimensional embeddings while model.device stayed on cpu.

Repeat the run with an accelerator value only when it appears in the available devices line.
```
$ python select_inference_device.py --device cuda:0
```
Use cuda:0 for the first NVIDIA CUDA device or mps for Apple Silicon. If the value is missing from the available list, keep cpu or fix the PyTorch accelerator installation before running production batches.
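
The selection script compares the requested value to the available list by exact string match, so a bare cuda is rejected even when cuda:0 is listed. The following is a minimal sketch of one way to tolerate that spelling, assuming two hypothetical helpers named `normalize_device` and `resolve_device` that are not part of the original script or of Sentence Transformers:

```python
def normalize_device(requested: str) -> str:
    """Map loose spellings onto the names used in the available-devices list.

    Hypothetical helper for illustration. Treating a bare "cuda" as "cuda:0"
    is an assumption of this sketch.
    """
    value = requested.strip().lower()
    if value == "cuda":
        return "cuda:0"
    return value


def resolve_device(requested: str, available: list[str]) -> str:
    """Return the device to load on, or raise if the request cannot be met."""
    if requested.strip().lower() == "auto":
        # Same rule as the script: take the first detected device.
        return available[0]
    candidate = normalize_device(requested)
    if candidate not in available:
        # Fail early rather than fall back silently.
        raise ValueError(
            f"{requested} is not available. "
            f"Available devices: {', '.join(available)}"
        )
    return candidate


print(resolve_device("CUDA", ["cuda:0", "cuda:1", "cpu"]))
print(resolve_device("auto", ["mps", "cpu"]))
```

The helper still raises when the normalized value is missing from the list, so the fail-early behavior of the script is preserved.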

Copy the selected device into the application code that loads the model.

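```python
from sentence_transformers import SentenceTransformer

# Use "cpu" or "mps" when that is the selected runtime.
device = "cuda:0"

# Placeholder input; replace with the application's own texts.
documents = ["First text to embed.", "Second text to embed."]

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device=device,
)

embeddings = model.encode(documents, show_progress_bar=False)
print(f"embedding device: {model.device}")
```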

If the same model load also passes model_kwargs={"device_map": "auto"}, that device map controls placement instead of the top-level device argument.
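
When the same application runs on several machines, the device string can come from configuration instead of a hard-coded literal. A minimal sketch, assuming a hypothetical environment variable named `EMBEDDING_DEVICE` (the name is illustrative; Sentence Transformers does not read it on its own) and a hypothetical helper `device_from_env`:

```python
import os


def device_from_env(default: str = "cpu") -> str:
    """Read the device name from EMBEDDING_DEVICE, falling back to `default`.

    EMBEDDING_DEVICE is a hypothetical variable name chosen for this sketch.
    An unset or empty variable yields the default.
    """
    value = os.environ.get("EMBEDDING_DEVICE", "").strip()
    return value or default


device = device_from_env()
print(f"configured device: {device}")
# Pass the result to the loader, e.g. SentenceTransformer(..., device=device)
```

Defaulting to cpu keeps development machines working, while a GPU server can export the variable with the value confirmed by the selection script.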

Remove the temporary selection script after the application prints the expected device during its own smoke test.
```
$ rm select_inference_device.py
```
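
The application's smoke test can be reduced to two comparisons: the device the model reports and the shape of the returned embeddings. A minimal sketch, assuming a hypothetical helper named `check_smoke_result` that receives `str(model.device)` and `embeddings.shape`. It compares only the device type (the part before the colon), because a request such as cuda is reported back with an index, as cuda:0:

```python
def check_smoke_result(reported_device, shape, expected_device,
                       expected_rows, expected_dim):
    """Return a list of problems; an empty list means the check passed.

    Hypothetical helper for illustration, not a Sentence Transformers API.
    """
    problems = []

    # Compare device types only: "cuda:0" and "cuda" both count as "cuda".
    reported_type = str(reported_device).split(":")[0]
    expected_type = expected_device.split(":")[0]
    if reported_type != expected_type:
        problems.append(
            f"model is on {reported_device}, expected {expected_device}"
        )

    # One row per input text, one column per embedding dimension.
    if tuple(shape) != (expected_rows, expected_dim):
        problems.append(
            f"embedding shape is {tuple(shape)}, "
            f"expected {(expected_rows, expected_dim)}"
        )
    return problems


# Values as they would come from str(model.device) and embeddings.shape.
print(check_smoke_result("cuda:0", (2, 384), "cuda", 2, 384))
print(check_smoke_result("cpu", (2, 384), "cuda:0", 2, 384))
```

A stricter variant could also compare the index when the expected device names one, which matters on servers with more than one GPU.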

Author: Mohd Shakir Zakaria