Large embedding jobs can spend most of their runtime feeding text through one model process. Sentence Transformers can split encode() work across several target devices, which helps when a corpus is too large for a single CPU or GPU process to handle comfortably.
The reusable pool path starts worker processes with start_multi_process_pool() and passes that pool into encode(). A target-device list such as ["cuda:0", "cuda:1"] uses separate GPUs, while repeated cpu entries create CPU worker processes for environments without GPUs.
A small local corpus is enough to verify the worker count, the returned embedding shape, the normalized vector length, and pool shutdown before moving the pattern into a larger corpus job. For production, replace the CPU list with one device per GPU, and keep the stop call in a finally block so the workers shut down even when an encode call raises an error.
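One way to pick the target-device list is to derive it from the number of visible GPUs and fall back to repeated cpu entries. The sketch below is only an illustration: build_target_devices() is a hypothetical helper, not part of Sentence Transformers, and the GPU count is passed in as an argument so the helper does not depend on PyTorch. In a real job the count could come from torch.cuda.device_count().

```python
def build_target_devices(gpu_count: int, cpu_workers: int = 2) -> list[str]:
    """Return one CUDA target per GPU, or repeated "cpu" entries when none exist.

    Hypothetical helper for illustration; not part of Sentence Transformers.
    """
    if gpu_count < 0 or cpu_workers < 1:
        raise ValueError("gpu_count must be >= 0 and cpu_workers must be >= 1")
    if gpu_count > 0:
        return [f"cuda:{index}" for index in range(gpu_count)]
    return ["cpu"] * cpu_workers


if __name__ == "__main__":
    # In a real job, gpu_count could come from torch.cuda.device_count().
    print(build_target_devices(gpu_count=2))  # ['cuda:0', 'cuda:1']
    print(build_target_devices(gpu_count=0))  # ['cpu', 'cpu']
```

The returned list can be passed as target_devices to start_multi_process_pool(). The default of two CPU workers is an assumption chosen to match the check in this guide.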
Steps to encode Sentence Transformers embeddings with multiple processes:
- Create the multi-process encoding script in the project.
- encode_multiprocess.py
from sentence_transformers import SentenceTransformer


def main() -> None:
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    documents = [
        "Reset a forgotten password from the profile security page.",
        "Export paid invoices from the billing dashboard.",
        "Rotate API tokens before sharing an integration.",
        "Change the notification email address for alerts.",
        "Create a vector index for semantic document search.",
        "Archive old support tickets after the retention period.",
        "Review failed background jobs from the worker dashboard.",
        "Update the workspace theme for a user profile.",
    ]

    target_devices = ["cpu", "cpu"]
    pool = model.start_multi_process_pool(target_devices=target_devices)

    try:
        embeddings = model.encode(
            documents,
            pool=pool,
            batch_size=2,
            chunk_size=4,
            normalize_embeddings=True,
            show_progress_bar=False,
        )

        print(f"target devices: {', '.join(target_devices)}")
        print(f"worker processes: {len(pool['processes'])}")
        print(f"documents encoded: {len(documents)}")
        print(f"embedding shape: {embeddings.shape}")
        print(f"first vector norm: {float((embeddings[0] ** 2).sum() ** 0.5):.3f}")
    finally:
        model.stop_multi_process_pool(pool)
        print("pool stopped: True")


if __name__ == "__main__":
    main()
On a GPU host, list one target per GPU, such as ["cuda:0", "cuda:1"]. Repeating cpu gives a CPU-only check when no GPU devices are available. chunk_size controls how many input texts are sent to a worker process at a time, while batch_size controls the model batch size inside each process.
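To make the difference between chunk_size and batch_size concrete, the following sketch mimics the two levels of splitting in plain Python. It only illustrates the idea and assumes simple sequential slicing. The helpers split() and plan() are hypothetical and do not reproduce the library's internal scheduling.

```python
from typing import Iterator


def split(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def plan(documents: list[str], chunk_size: int, batch_size: int) -> list[list[list[str]]]:
    """Return chunks (one unit of work per worker), each split into model batches."""
    return [list(split(chunk, batch_size)) for chunk in split(documents, chunk_size)]


if __name__ == "__main__":
    documents = [f"doc-{n}" for n in range(8)]
    layout = plan(documents, chunk_size=4, batch_size=2)
    print(f"chunks: {len(layout)}")                                  # chunks: 2
    print(f"batches per chunk: {[len(chunk) for chunk in layout]}")  # batches per chunk: [2, 2]
```

With the script's values (eight documents, chunk_size=4, batch_size=2), this yields two chunks of two batches each, which matches one chunk per CPU worker in the example.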
Related: How to select a Sentence Transformers inference device
- Run the script.
$ python encode_multiprocess.py
target devices: cpu, cpu
worker processes: 2
documents encoded: 8
embedding shape: (8, 384)
first vector norm: 1.000
pool stopped: True
PyTorch shares model weights through system shared memory. In a small container, the pool can fail with No space left on device when /dev/shm is too small; increase --shm-size or reduce the worker count.
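Before starting a pool inside a container, it can help to see how much shared memory is free. The sketch below reports the free space under /dev/shm using only the standard library. The helper shm_free_mib() is hypothetical, and no minimum size is implied, because the amount a pool needs depends on the model and the worker count.

```python
import os
import shutil
from typing import Optional


def shm_free_mib(path: str = "/dev/shm") -> Optional[float]:
    """Return free space at `path` in MiB, or None when the path does not exist."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free / (1024 * 1024)


if __name__ == "__main__":
    free = shm_free_mib()
    if free is None:
        print("/dev/shm not found on this system")
    else:
        print(f"/dev/shm free: {free:.0f} MiB")
```

If the reported value is small compared with the model size, raising --shm-size or lowering the worker count are the two options named above.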
- Confirm that the document count matches the input list.
documents encoded should equal the number of source strings. A smaller count means the script changed the corpus before encoding or failed before all chunks returned.
- Confirm that the embedding matrix has one row per document.
The selected model returns 384 columns, and first vector norm near 1.000 confirms that normalize_embeddings=True produced unit-length vectors for cosine-style scoring.
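The shape and norm checks can also be automated over every row instead of only the first vector. The following sketch assumes a hypothetical helper, check_embeddings(), that accepts any row-indexable matrix (nested lists here, and it should also work with the NumPy array returned by encode()). The tolerance of 1e-3 is an assumption chosen to match the three-decimal norm printed by the script.

```python
import math
from typing import Sequence


def check_embeddings(
    embeddings: Sequence[Sequence[float]],
    expected_rows: int,
    expected_dim: int,
    tolerance: float = 1e-3,
) -> bool:
    """Return True when the matrix has the expected shape and unit-length rows."""
    if len(embeddings) != expected_rows:
        return False
    for row in embeddings:
        if len(row) != expected_dim:
            return False
        # Euclidean norm of the row; normalized embeddings should be close to 1.
        norm = math.sqrt(sum(float(value) ** 2 for value in row))
        if abs(norm - 1.0) > tolerance:
            return False
    return True


if __name__ == "__main__":
    sample = [[0.6, 0.8], [1.0, 0.0]]
    print(check_embeddings(sample, expected_rows=2, expected_dim=2))  # True
    print(check_embeddings(sample, expected_rows=3, expected_dim=2))  # False
```

In the script, a call such as check_embeddings(embeddings, len(documents), 384) would cover both the row count and the unit-length check in one step.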
- Remove the temporary encoding script after copying the pool pattern into the larger corpus job.
$ rm encode_multiprocess.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.