Local embedding workloads can land on different hardware depending on the PyTorch build, installed drivers, and visible accelerators. For Sentence Transformers, selecting the inference device explicitly keeps development runs, GPU servers, and Apple Silicon laptops from silently using a different runtime than expected.

SentenceTransformer accepts a device value such as cpu, cuda:0, or mps when the model is loaded. The selected runtime is also visible through model.device after loading, which gives a direct check before a larger embedding job starts.

Use cpu as the fallback that works everywhere, and request cuda:0 or mps only after PyTorch reports that accelerator as available. A failed explicit request should stop early instead of falling back silently, because a batch job that misses the intended accelerator can be much slower and harder to diagnose after it starts.
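
The fallback policy above can be isolated as a small pure function, so it can be tested on machines without an accelerator. The sketch below is illustrative rather than definitive: resolve_device is a hypothetical helper name, and the list of available devices is assumed to come from PyTorch checks such as torch.cuda.is_available().

```python
def resolve_device(requested, available):
    """Pick a device string, failing fast on an unavailable explicit request.

    `available` is assumed to be ordered by preference and to end with "cpu".
    """
    if requested == "auto":
        # Automatic selection takes the most preferred usable device.
        return available[0]
    if requested in available:
        return requested
    # No silent fallback: the caller sees the mismatch before the job starts.
    raise ValueError(
        f"{requested} is not available. Available devices: {', '.join(available)}"
    )


# A CPU-only machine: "auto" resolves to cpu, an explicit cuda:0 request fails.
print(resolve_device("auto", ["cpu"]))
try:
    resolve_device("cuda:0", ["cpu"])
except ValueError as error:
    print(f"stopped early: {error}")
```

Raising an exception instead of substituting cpu keeps the failure at startup, where it is cheap to notice.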

Steps to select a Sentence Transformers inference device:

  1. Create a device selection script that detects PyTorch accelerators and loads the embedding model on the requested device.
    select_inference_device.py
    import argparse
     
    import torch
    from sentence_transformers import SentenceTransformer
     
     
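    def detect_devices():
        """Return usable device strings in priority order: CUDA, then MPS, then CPU."""
        devices = []

        # One entry per visible CUDA device, for example cuda:0 and cuda:1.
        if torch.cuda.is_available():
            devices.extend(f"cuda:{index}" for index in range(torch.cuda.device_count()))

        # Some PyTorch builds do not expose torch.backends.mps, so look it up defensively.
        mps_backend = getattr(torch.backends, "mps", None)
        if mps_backend is not None and mps_backend.is_available():
            devices.append("mps")

        # CPU is always usable and goes last so that "auto" prefers an accelerator.
        devices.append("cpu")
        return devices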
     
     
    parser = argparse.ArgumentParser()
    parser.add_argument("--device", default="auto", help="auto, cpu, cuda:0, or mps")
    args = parser.parse_args()
     
    available_devices = detect_devices()
    requested_device = args.device
     
    if requested_device == "auto":
        selected_device = available_devices[0]
    elif requested_device in available_devices:
        selected_device = requested_device
    else:
        raise SystemExit(
            f"{requested_device} is not available. Available devices: "
            f"{', '.join(available_devices)}"
        )
     
    model = SentenceTransformer(
        "sentence-transformers/all-MiniLM-L6-v2",
        device=selected_device,
    )
     
    embeddings = model.encode(
        [
            "Route embedding inference to the selected accelerator.",
            "Keep a CPU fallback for development machines.",
        ],
        show_progress_bar=False,
    )
     
    print(f"requested device: {requested_device}")
    print(f"available devices: {', '.join(available_devices)}")
    print(f"selected device: {model.device}")
    print(f"embedding shape: {embeddings.shape}")

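    torch.cuda.is_available() reports whether PyTorch can use CUDA in the current process, and torch.cuda.device_count() gives the number of visible CUDA devices. torch.backends.mps.is_available() reports whether the Apple Metal (MPS) backend is usable in the current PyTorch runtime.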

  2. Run the script with the explicit CPU device.
    $ python select_inference_device.py --device cpu
    requested device: cpu
    available devices: cpu
    selected device: cpu
    embedding shape: (2, 384)

    The shape confirms that the model encoded two texts into 384-dimensional embeddings, and the selected device line confirms that model.device stayed on cpu.

  3. Repeat the run with an accelerator value only when it appears in the available devices line.
    $ python select_inference_device.py --device cuda:0

    Use cuda:0 for the first NVIDIA CUDA device or mps for Apple Silicon. If the value is missing from the available list, keep cpu or fix the PyTorch accelerator installation before running production batches.

  4. Copy the selected device into the application code that loads the model.
    from sentence_transformers import SentenceTransformer

    device = "cuda:0"  # use "cpu" or "mps" when that is the selected runtime

    model = SentenceTransformer(
        "sentence-transformers/all-MiniLM-L6-v2",
        device=device,
    )

    # documents is the application's own list of strings to embed.
    embeddings = model.encode(documents, show_progress_bar=False)
    print(f"embedding device: {model.device}")

    If the same model load also passes model_kwargs={"device_map": "auto"}, that device map controls placement instead of the top-level device argument.

  5. Remove the temporary selection script after the application prints the expected device during its own smoke test.
    $ rm select_inference_device.py
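
The application smoke test in the last step can compare the configured device with the reported one automatically. PyTorch may report a device with an index even when the configured string has none (for example mps:0 for a request of mps), so a plain string comparison can fail spuriously. The following is a minimal sketch, assuming device strings of the form type or type:index; split_device and device_matches are hypothetical helper names, not part of PyTorch or Sentence Transformers.

```python
def split_device(name):
    """Split a device string such as "cuda:0" into ("cuda", 0)."""
    kind, _, index = str(name).partition(":")
    return kind, (int(index) if index else None)


def device_matches(expected, reported):
    """Return True when both values name the same device.

    An index is compared only when both sides specify one, so "mps"
    matches "mps:0" but "cuda:1" does not match "cuda:0".
    """
    expected_kind, expected_index = split_device(expected)
    reported_kind, reported_index = split_device(reported)
    if expected_kind != reported_kind:
        return False
    if expected_index is None or reported_index is None:
        return True
    return expected_index == reported_index


# Example checks with plain strings; no accelerator is needed to run them.
for expected, reported in [("cuda:0", "cuda:0"), ("mps", "mps:0"), ("cuda:0", "cpu")]:
    status = "ok" if device_matches(expected, reported) else "MISMATCH"
    print(f"expected {expected}, reported {reported}: {status}")
```

In the application, the same check could run once after the model loads, for example by raising an error when device_matches(device, model.device) is false. Both arguments are converted with str(), so a torch.device value can be passed directly.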