Dense text embeddings are usually stored as 32-bit floating-point arrays, which can dominate memory and disk usage once a retrieval corpus grows. Sentence Transformers can convert those embeddings to lower precision so a vector search experiment keeps the same number of rows while using fewer bytes per vector. The int8 and uint8 formats keep one value per dimension, and the binary formats pack eight dimensions into each byte.
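To make the savings concrete, the arithmetic can be sketched as below. The helper name embedding_bytes and the corpus size of one million 384-dimensional vectors are illustrative assumptions, not figures from a real dataset; the bits-per-dimension values follow directly from the dtype widths.

```python
# Back-of-the-envelope storage estimate for one embedding matrix.
# embedding_bytes is a hypothetical helper, not part of Sentence Transformers.

BITS_PER_DIMENSION = {
    "float32": 32,
    "int8": 8,
    "uint8": 8,
    "binary": 1,   # eight dimensions packed into each stored byte
    "ubinary": 1,
}


def embedding_bytes(num_vectors: int, dim: int, precision: str) -> int:
    """Return the raw array size in bytes, ignoring any index overhead."""
    bits = BITS_PER_DIMENSION[precision]
    # Round up so a dimension count that is not a multiple of 8 still fits.
    bytes_per_vector = (dim * bits + 7) // 8
    return num_vectors * bytes_per_vector


if __name__ == "__main__":
    # Assumed example: one million vectors from a 384-dimensional model.
    for precision in BITS_PER_DIMENSION:
        size_mib = embedding_bytes(1_000_000, 384, precision) / 2**20
        print(f"{precision:>8}: {size_mib:8.1f} MiB")
```

For 4 vectors of 384 dimensions, the helper returns 6144 bytes for float32 and 1536 bytes for int8, a 4x reduction; the binary formats reduce the same matrix by 32x.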
Embedding quantization is separate from model quantization. The model still produces semantic vectors, while quantize_embeddings() changes the stored embedding array to int8, uint8, binary, or ubinary for downstream search code that supports that representation.
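The binary formats are the least obvious of these representations, so a NumPy-only sketch may help. It assumes the common convention of thresholding each dimension at zero and packing eight sign bits per byte. The function names pack_sign_bits and hamming_distance, the random stand-in vectors, and the choice of Hamming distance are illustrative assumptions; the library's own implementation may differ in details.

```python
import numpy as np


def pack_sign_bits(embeddings: np.ndarray) -> np.ndarray:
    """Hypothetical helper: keep one bit per dimension (1 where value > 0)."""
    bits = embeddings > 0
    # np.packbits stores eight booleans per uint8 along the last axis.
    return np.packbits(bits, axis=-1)


def hamming_distance(packed_a: np.ndarray, packed_b: np.ndarray) -> int:
    """Count the differing bits between two packed vectors."""
    differing = np.bitwise_xor(packed_a, packed_b)
    return int(np.unpackbits(differing).sum())


rng = np.random.default_rng(0)
vectors = rng.standard_normal((3, 384)).astype(np.float32)  # stand-in embeddings
packed = pack_sign_bits(vectors)

print(packed.shape, packed.dtype, packed.nbytes)
print(hamming_distance(packed[0], packed[0]))
```

Each 384-dimensional row becomes 48 bytes here. Search code has to compare such vectors with a bit-level measure, such as Hamming distance, rather than a floating-point dot product, which is why the downstream index must support the chosen representation.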
Scalar int8 quantization needs consistent bucket ranges across corpus and query embeddings. Build those ranges from representative calibration data or reuse saved ranges from the corpus build; calculating buckets from a tiny query batch can shift values enough to change retrieval behavior.
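The effect of mismatched ranges can be seen in a small hand-rolled approximation of scalar bucketing. This is a sketch, not the library's implementation: the function name scalar_quantize_int8, the rounding and clipping choices, and the uniform random data are all assumptions made for illustration.

```python
import numpy as np


def scalar_quantize_int8(embeddings: np.ndarray, ranges: np.ndarray) -> np.ndarray:
    """Hand-rolled sketch: map each dimension's [min, max] onto -128..127."""
    starts = ranges[0]
    steps = (ranges[1] - ranges[0]) / 255.0
    steps = np.where(steps == 0, 1.0, steps)  # avoid dividing by zero
    buckets = np.round((embeddings - starts) / steps) - 128
    # Clip so values outside the calibrated range saturate instead of wrapping.
    return np.clip(buckets, -128, 127).astype(np.int8)


rng = np.random.default_rng(0)
calibration = rng.uniform(-1.0, 1.0, size=(1000, 8)).astype(np.float32)
query = calibration[:2]  # a tiny batch drawn from the same distribution

# Row 0 holds per-dimension minimums, row 1 per-dimension maximums.
shared_ranges = np.vstack((calibration.min(axis=0), calibration.max(axis=0)))
tiny_ranges = np.vstack((query.min(axis=0), query.max(axis=0)))

with_shared = scalar_quantize_int8(query, shared_ranges)
with_tiny = scalar_quantize_int8(query, tiny_ranges)

print("shared ranges:    ", with_shared[0])
print("tiny-batch ranges:", with_tiny[0])
```

With ranges taken from only the two query vectors, each value is the minimum or maximum of its own dimension, so it lands on an extreme bucket (-128 or 127). The shared ranges spread the same values across the scale, which keeps query codes comparable with corpus codes. The library example that follows passes one shared ranges array to both quantize_embeddings() calls.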

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.util.quantization import quantize_embeddings
    import numpy as np

    corpus = [
        "Int8 vectors use less storage.",
        "Cross encoders rerank candidate pairs.",
        "Image search compares pictures and captions.",
        "Fine-tuning updates model weights.",
    ]
    query = "Int8 vectors use less storage."
    calibration_sentences = corpus + [
        f"Calibration sentence {i} about compact vector search." for i in range(120)
    ]

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    corpus_float = model.encode(corpus, normalize_embeddings=True)
    query_float = model.encode([query], normalize_embeddings=True)
    calibration = model.encode(calibration_sentences, normalize_embeddings=True)

    # Per-dimension minimums (row 0) and maximums (row 1) from the calibration set.
    ranges = np.vstack((calibration.min(axis=0), calibration.max(axis=0)))

    # Quantize corpus and query with the same ranges so their buckets line up.
    corpus_int8 = quantize_embeddings(corpus_float, precision="int8", ranges=ranges)
    query_int8 = quantize_embeddings(query_float, precision="int8", ranges=ranges)

    # Widen to int32 before the dot product to avoid 8-bit overflow.
    scores = corpus_int8.astype(np.int32) @ query_int8[0].astype(np.int32)
    best = int(np.argmax(scores))

    print(f"float32: {corpus_float.shape}, {corpus_float.dtype}, {corpus_float.nbytes} bytes")
    print(f"int8: {corpus_int8.shape}, {corpus_int8.dtype}, {corpus_int8.nbytes} bytes")
    print(f"top match: {corpus[best]}")

The ranges array has two rows: the first holds the minimum and the second the maximum value for each embedding dimension. Reuse the same ranges for every corpus and query vector that will be compared in the same index.

    $ python quantize_embeddings.py
    float32: (4, 384), float32, 6144 bytes
    int8: (4, 384), int8, 1536 bytes
    top match: Int8 vectors use less storage.

The sample search casts the signed 8-bit values to int32 before the dot product so integer multiplication does not overflow during the check.
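The need for the cast can be checked in isolation. The sketch below uses two toy int8 vectors whose values are assumptions chosen to make the problem visible; it relies only on NumPy keeping the input dtype for the result of an integer dot product.

```python
import numpy as np

# Two toy int8 vectors with large components (assumed values for illustration).
a = np.array([100, 100, 100], dtype=np.int8)
b = np.array([100, 100, 100], dtype=np.int8)

# Staying in int8 keeps the result in int8, so it cannot hold the true score.
wrapped = a @ b

# Widening first gives the products and their sum room for the true value.
exact = a.astype(np.int32) @ b.astype(np.int32)

print("int8 dot product: ", int(wrapped))
print("int32 dot product:", int(exact))
```

The true score is 30000, which an int8 result cannot represent, so the narrow version silently returns a wrapped value. For 384 dimensions, the largest possible magnitude is 384 × 128 × 128 = 6,291,456, which fits comfortably in int32.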