Local embedding workloads can land on different hardware depending on the PyTorch build, installed drivers, and visible accelerators. For Sentence Transformers, selecting the inference device explicitly keeps development runs, GPU servers, and Apple Silicon laptops from silently using a different runtime than expected.
SentenceTransformer accepts a device value such as cpu, cuda:0, or mps when the model is loaded. The selected runtime is also visible through model.device after loading, which gives a direct check before a larger embedding job starts.
Use cpu as the fallback that works everywhere, and request cuda:0 or mps only after PyTorch reports that accelerator as available. A failed explicit request should stop early instead of falling back silently, because a batch job that misses the intended accelerator can be much slower and harder to diagnose after it starts.
import argparse

import torch
from sentence_transformers import SentenceTransformer


def detect_devices():
    devices = []
    if torch.cuda.is_available():
        devices.extend(f"cuda:{index}" for index in range(torch.cuda.device_count()))
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        devices.append("mps")
    devices.append("cpu")
    return devices


parser = argparse.ArgumentParser()
parser.add_argument("--device", default="auto", help="auto, cpu, cuda:0, or mps")
args = parser.parse_args()

available_devices = detect_devices()
requested_device = args.device

if requested_device == "auto":
    selected_device = available_devices[0]
elif requested_device in available_devices:
    selected_device = requested_device
else:
    raise SystemExit(
        f"{requested_device} is not available. Available devices: "
        f"{', '.join(available_devices)}"
    )

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device=selected_device,
)
embeddings = model.encode(
    [
        "Route embedding inference to the selected accelerator.",
        "Keep a CPU fallback for development machines.",
    ],
    show_progress_bar=False,
)

print(f"requested device: {requested_device}")
print(f"available devices: {', '.join(available_devices)}")
print(f"selected device: {model.device}")
print(f"embedding shape: {embeddings.shape}")
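torch.cuda.is_available() returns True when PyTorch can use at least one CUDA device, and torch.cuda.device_count() gives the number of visible devices, which the script lists as cuda:0, cuda:1, and so on. torch.backends.mps.is_available() reports whether the Apple Metal (MPS) backend is usable in the current PyTorch runtime; the getattr guard keeps the script working on PyTorch builds that do not ship the mps module.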
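Before running the full script, it can help to print what PyTorch sees on the machine. The following is a minimal sketch, not part of the script above: describe_accelerators is a hypothetical helper name, and the function assumes it is acceptable to report a CPU-only result when PyTorch is not installed at all.

```python
import importlib.util


def describe_accelerators():
    """Return a dict describing which inference backends this interpreter can use."""
    report = {"torch_installed": False, "cuda_devices": [], "mps": False, "cpu": True}

    # Skip the probe cleanly when PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True

    # List every visible CUDA device together with its reported name.
    if torch.cuda.is_available():
        report["cuda_devices"] = [
            f"cuda:{index} ({torch.cuda.get_device_name(index)})"
            for index in range(torch.cuda.device_count())
        ]

    # Older PyTorch builds have no torch.backends.mps module, so guard the lookup.
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None:
        report["mps"] = bool(mps_backend.is_available())

    return report


for key, value in describe_accelerators().items():
    print(f"{key}: {value}")
```

An empty cuda_devices list on a GPU server usually points at the PyTorch build or the driver installation, not at Sentence Transformers.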
$ python select_inference_device.py --device cpu
requested device: cpu
available devices: cpu
selected device: cpu
embedding shape: (2, 384)
The shape (2, 384) confirms that the model encoded two texts into 384-dimensional embeddings, and the selected device line confirms that model.device stayed on cpu.
$ python select_inference_device.py --device cuda:0
Use cuda:0 for the first NVIDIA CUDA device or mps for Apple Silicon. If the value is missing from the available list, keep cpu or fix the PyTorch accelerator installation before running production batches.
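The choice between keeping cpu and stopping can be made explicit in code. This is a sketch under stated assumptions: resolve_device and its allow_cpu_fallback flag are hypothetical names introduced here for illustration, and the available list is assumed to be ordered by preference with cpu last, as in the detection script.

```python
def resolve_device(requested, available, allow_cpu_fallback=False):
    """Pick a device string from `available`, failing loudly on a bad request."""
    if not available:
        raise ValueError("available must list at least one device, e.g. ['cpu']")

    # "auto" takes the first entry, which is the most preferred detected device.
    if requested == "auto":
        return available[0]

    if requested in available:
        return requested

    # Falling back to cpu is opt-in, so a missing accelerator is never silent.
    if allow_cpu_fallback and "cpu" in available:
        return "cpu"

    raise ValueError(
        f"{requested} is not available. Available devices: {', '.join(available)}"
    )


print(resolve_device("auto", ["cuda:0", "cpu"]))
print(resolve_device("mps", ["cpu"], allow_cpu_fallback=True))
```

Leaving allow_cpu_fallback off for production batches keeps the fail-early behavior, while a development machine can turn it on deliberately.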
device = "cuda:0"  # use "cpu" or "mps" when that is the selected runtime

model = SentenceTransformer(
    "sentence-transformers/all-MiniLM-L6-v2",
    device=device,
)
embeddings = model.encode(documents, show_progress_bar=False)
print(f"embedding device: {model.device}")
If the same model load also passes model_kwargs={"device_map": "auto"}, that device map controls placement instead of the top-level device argument.
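One way to avoid mixing the two placement mechanisms is to build the loader arguments in a single place and reject ambiguous combinations. The sketch below is illustrative only: build_load_kwargs is a hypothetical helper, and it assumes the loading code wants exactly one of device or device_map set on every call.

```python
def build_load_kwargs(device=None, device_map=None):
    """Return keyword arguments for SentenceTransformer(...) using one placement method."""
    if device is not None and device_map is not None:
        raise ValueError("Pass either device or device_map, not both.")
    if device is None and device_map is None:
        raise ValueError("Pass an explicit device or a device_map.")

    # A device map is forwarded to the underlying model loader via model_kwargs.
    if device_map is not None:
        return {"model_kwargs": {"device_map": device_map}}

    # Otherwise use the top-level device argument of SentenceTransformer.
    return {"device": device}


print(build_load_kwargs(device="cuda:0"))
print(build_load_kwargs(device_map="auto"))
```

The result can then be unpacked into the constructor, for example SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", **build_load_kwargs(device="cuda:0")). In typical Transformers setups a device map also expects the accelerate package to be installed, so treat that as a prerequisite to verify in your environment.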
$ rm select_inference_device.py