Large embedding jobs can spend most of their runtime feeding text through one model process. Sentence Transformers can split encode() work across several target devices, which helps when a corpus is too large for a single CPU or GPU process to handle comfortably.
The reusable pool path starts worker processes with start_multi_process_pool() and passes that pool into encode(). A target-device list such as ["cuda:0", "cuda:1"] uses separate GPUs, while repeated cpu entries create CPU worker processes for environments without GPUs.
A small local corpus is enough to prove the worker count, returned embedding shape, normalized vector length, and pool shutdown before moving the pattern into a larger corpus job. Replace the CPU list with one device per GPU for production and keep the stop call in a finally block so workers shut down when an encode call raises an error.
from sentence_transformers import SentenceTransformer def main() -> None: model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2") documents = [ "Reset a forgotten password from the profile security page.", "Export paid invoices from the billing dashboard.", "Rotate API tokens before sharing an integration.", "Change the notification email address for alerts.", "Create a vector index for semantic document search.", "Archive old support tickets after the retention period.", "Review failed background jobs from the worker dashboard.", "Update the workspace theme for a user profile.", ] target_devices = ["cpu", "cpu"] pool = model.start_multi_process_pool(target_devices=target_devices) try: embeddings = model.encode( documents, pool=pool, batch_size=2, chunk_size=4, normalize_embeddings=True, show_progress_bar=False, ) print(f"target devices: {', '.join(target_devices)}") print(f"worker processes: {len(pool['processes'])}") print(f"documents encoded: {len(documents)}") print(f"embedding shape: {embeddings.shape}") print(f"first vector norm: {float((embeddings[0] ** 2).sum() ** 0.5):.3f}") finally: model.stop_multi_process_pool(pool) print("pool stopped: True") if __name__ == "__main__": main()
Use one target per GPU, such as ["cuda:0", "cuda:1"]. Repeating cpu gives a CPU-only check when GPU devices are not available. chunk_size controls how many input texts are sent to each process, while batch_size controls the per-process model batch.
Related: How to select a Sentence Transformers inference device
$ python encode_multiprocess.py target devices: cpu, cpu worker processes: 2 documents encoded: 8 embedding shape: (8, 384) first vector norm: 1.000 pool stopped: True
PyTorch shares model weights through system shared memory. In a small container, the pool can fail with No space left on device when /dev/shm is too small; increase --shm-size or reduce the worker count.
documents encoded should equal the number of source strings. A smaller count means the script changed the corpus before encoding or failed before all chunks returned.
The selected model returns 384 columns, and first vector norm near 1.000 confirms that normalize_embeddings=True produced unit-length vectors for cosine-style scoring.
$ rm encode_multiprocess.py