How to quantize embeddings with Sentence Transformers

Dense text embeddings are usually stored as 32-bit floating-point arrays, which can dominate memory and disk usage once a retrieval corpus grows. Sentence Transformers can convert those embeddings to lower precision so a vector search experiment keeps the same number of rows while using fewer bytes per vector; scalar precisions such as int8 also keep the same number of dimensions.

Embedding quantization is separate from model quantization. The model still produces semantic vectors, while quantize_embeddings() changes the stored embedding array to int8, uint8, binary, or ubinary for downstream search code that supports that representation.
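
To make the difference between the scalar and binary representations concrete, here is a minimal NumPy-only sketch of bit packing. It assumes the binary precisions keep one bit per dimension (1 where the value is positive) and pack eight dimensions into each byte; the random array is a synthetic stand-in for model output, and the snippet illustrates the idea rather than reproducing the library's implementation.

```python
import numpy as np

# Synthetic stand-in for model output: 4 embeddings with 384 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((4, 384)).astype(np.float32)

# Assumed scheme: one bit per dimension, set where the value is positive,
# then packed eight dimensions per byte.
bits = embeddings > 0
ubinary = np.packbits(bits, axis=1)  # unsigned bytes, 0..255

# Signed variant: the same bytes shifted into the int8 range -128..127.
binary = (ubinary.astype(np.int16) - 128).astype(np.int8)

print(embeddings.shape, embeddings.dtype, embeddings.nbytes)
print(ubinary.shape, ubinary.dtype, ubinary.nbytes)
print(binary.shape, binary.dtype, binary.nbytes)
```

Under this assumption the packed arrays have 48 columns instead of 384, so downstream search code has to compare them with a bit-level measure such as Hamming distance rather than a dot product over the original dimensions.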

Scalar int8 quantization needs consistent bucket ranges across corpus and query embeddings. Build those ranges from representative calibration data or reuse saved ranges from the corpus build; calculating buckets from a tiny query batch can shift values enough to change retrieval behavior.
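
The effect of inconsistent ranges can be sketched with NumPy alone. The helper scalar_quantize_int8 below is hypothetical and only approximates the bucket idea (its rounding and clipping are choices made for this illustration, not a statement about the library's internals); the data is random, and the file name int8_ranges.npy is likewise an assumption.

```python
import os
import tempfile

import numpy as np


def scalar_quantize_int8(embeddings, ranges):
    """Hypothetical helper: map each dimension onto 256 buckets stored as int8."""
    starts = ranges[0]
    steps = (ranges[1] - ranges[0]) / 255
    buckets = np.round((embeddings - starts) / steps)
    # Clip so values outside the calibrated range land in the edge buckets.
    return (np.clip(buckets, 0, 255) - 128).astype(np.int8)


rng = np.random.default_rng(0)
calibration = rng.normal(0.0, 0.05, size=(500, 8)).astype(np.float32)
queries = rng.normal(0.0, 0.05, size=(2, 8)).astype(np.float32)

# Corpus build: derive per-dimension ranges once and save them.
ranges = np.vstack((calibration.min(axis=0), calibration.max(axis=0)))
path = os.path.join(tempfile.mkdtemp(), "int8_ranges.npy")
np.save(path, ranges)

# Query time: reload the saved ranges instead of recomputing them.
saved_ranges = np.load(path)
consistent = scalar_quantize_int8(queries, saved_ranges)

# Anti-pattern: ranges computed from the two-query batch itself.
tiny_ranges = np.vstack((queries.min(axis=0), queries.max(axis=0)))
shifted = scalar_quantize_int8(queries, tiny_ranges)

print("saved ranges:", consistent[0])
print("tiny ranges: ", shifted[0])
```

With only two queries, each dimension's minimum and maximum come from the queries themselves, so every value is pushed to an edge bucket and no longer lines up with corpus vectors that were quantized against the calibration ranges.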

Steps to quantize embeddings with Sentence Transformers:

  1. Create a script that loads a model, builds calibration ranges, and quantizes matching corpus and query embeddings.
    quantize_embeddings.py
    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util.quantization import quantize_embeddings
    import numpy as np
     
    corpus = [
        "Int8 vectors use less storage.",
        "Cross encoders rerank candidate pairs.",
        "Image search compares pictures and captions.",
        "Fine-tuning updates model weights.",
    ]
    query = "Int8 vectors use less storage."
    calibration_sentences = corpus + [
        f"Calibration sentence {i} about compact vector search."
        for i in range(120)
    ]
     
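    # Encode corpus, query, and calibration text as unit-length float32 vectors.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    corpus_float = model.encode(corpus, normalize_embeddings=True)
    query_float = model.encode([query], normalize_embeddings=True)
    calibration = model.encode(calibration_sentences, normalize_embeddings=True)
    # Row 0 holds the per-dimension minimum, row 1 the per-dimension maximum.
    ranges = np.vstack((calibration.min(axis=0), calibration.max(axis=0)))
     
    # Quantize corpus and query with the same ranges so their buckets line up.
    corpus_int8 = quantize_embeddings(corpus_float, precision="int8", ranges=ranges)
    query_int8 = quantize_embeddings(query_float, precision="int8", ranges=ranges)
    # Cast to int32 before the dot product so the int8 products cannot overflow.
    scores = corpus_int8.astype(np.int32) @ query_int8[0].astype(np.int32)
    best = int(np.argmax(scores))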
     
    print(f"float32: {corpus_float.shape}, {corpus_float.dtype}, {corpus_float.nbytes} bytes")
    print(f"int8: {corpus_int8.shape}, {corpus_int8.dtype}, {corpus_int8.nbytes} bytes")
    print(f"top match: {corpus[best]}")

    ranges is a two-row array with minimum and maximum values for each embedding dimension. Reuse the same ranges for corpus and query vectors that will be compared in the same index.

  2. Run the quantization script.
    $ python quantize_embeddings.py
    float32: (4, 384), float32, 6144 bytes
    int8: (4, 384), int8, 1536 bytes
    top match: Int8 vectors use less storage.

  3. Confirm that the int8 output keeps the same shape, (4, 384), and uses one quarter of the float32 storage (1536 bytes instead of 6144).

    The sample search casts the signed 8-bit values to int32 before the dot product so integer multiplication does not overflow during the check.
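
The overflow concern in the note above can be reproduced with a few lines of NumPy. The vectors here are made-up values chosen to make the wraparound obvious; they are not taken from the script's output.

```python
import numpy as np

# Made-up int8 vectors with large components.
a = np.array([100, 100, 100], dtype=np.int8)
b = np.array([100, 100, 100], dtype=np.int8)

# The result stays int8, so the accumulated sum wraps around silently.
wrapped = a @ b

# Widening first gives the true dot product: 3 * 100 * 100.
exact = a.astype(np.int32) @ b.astype(np.int32)

print(wrapped.dtype, int(wrapped))
print(exact.dtype, int(exact))
```

For 384-dimensional int8 vectors the largest possible magnitude of a dot product is 384 × 128 × 128 = 6,291,456, which fits comfortably in int32.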