How to quantize an OpenVINO model with Sentence Transformers

OpenVINO model quantization converts the transformer component of a Sentence Transformers backend model into a lower-precision int8 artifact. Use it after the model already runs with the OpenVINO backend and the deployment needs a smaller CPU inference file.

Sentence Transformers exposes this workflow through its OpenVINO static quantization export helper. The model must be loaded with backend="openvino", saved locally, calibrated with a representative text sample, and then reloaded with the quantized OpenVINO XML filename.

The result is still a Sentence Transformers model directory, not just a standalone OpenVINO file. Sentence Transformers keeps the sentence-level pooling and normalization around the exported transformer graph, so validate the reloaded model with the same kind of text that the service will encode.
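
One way to carry out that validation is to encode the same texts with the unquantized and the quantized model, then compare the embeddings row by row. The sketch below is a minimal, dependency-free illustration rather than part of the Sentence Transformers API: `cosine_similarity` and `min_pairwise_similarity` are hypothetical helper names, the toy vectors stand in for real `encode` output, and the acceptance threshold is left to the deployment.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def min_pairwise_similarity(baseline, quantized) -> float:
    """Lowest row-wise similarity between two embedding batches."""
    if len(baseline) != len(quantized):
        raise ValueError("batches must contain the same number of rows")
    return min(cosine_similarity(b, q) for b, q in zip(baseline, quantized))


# Toy rows standing in for the two models' encode() output on the same texts.
baseline_rows = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
quantized_rows = [[0.9, 0.1, 0.0], [0.0, 2.0, 0.0]]
print(min_pairwise_similarity(baseline_rows, quantized_rows))
```

Reporting the minimum rather than the mean surfaces the single worst-affected text, which tends to matter more for retrieval than an average that hides outliers.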

Steps to quantize an OpenVINO model with Sentence Transformers:

Create a requirements file for the OpenVINO quantization dependencies.
requirements.txt
```
datasets
sentence-transformers[openvino]
```
The datasets package loads the calibration rows for post-training static quantization. The openvino extra installs the OpenVINO, Optimum Intel, and NNCF packages used by the Sentence Transformers export helper.

Install the requirements in the active Python environment.
```
$ pip install -r requirements.txt
```

Create the quantization script.

quantize_openvino_model.py

```
from pathlib import Path

import optimum.intel as oi
import sentence_transformers as st


def main() -> None:
    model_id = "sentence-transformers/all-MiniLM-L6-v2"
    base_file = "openvino/openvino_model.xml"
    qint8_name = "openvino_model_qint8_quantized.xml"
    qint8_file = "openvino/" + qint8_name
    output_dir = Path("minilm-ov-int8")
    quantized_file = output_dir / "openvino" / qint8_name

    # Refuse to overwrite an existing export.
    if output_dir.exists():
        raise SystemExit("Choose a new output directory.")

    # Load the unquantized OpenVINO model and save it locally.
    model = st.SentenceTransformer(
        model_id,
        backend="openvino",
        model_kwargs={"file_name": base_file},
    )
    model.save_pretrained(str(output_dir))

    # A small calibration sample keeps the verification run short.
    config = oi.OVQuantizationConfig(num_samples=16)

    # Write the int8 model files into the saved model directory.
    st.export_static_quantized_openvino_model(
        model=model,
        quantization_config=config,
        model_name_or_path=str(output_dir),
    )

    # Reload the directory, selecting the quantized XML file.
    reloaded = st.SentenceTransformer(
        str(output_dir),
        backend="openvino",
        model_kwargs={"file_name": qint8_file},
    )
    texts = ["OpenVINO quantization reduces model precision."]
    embeddings = reloaded.encode(texts)

    print(f"model directory: {output_dir}")
    print("qint8 file:")
    print(quantized_file.name)
    print(f"saved: {quantized_file.exists()}")
    print(f"embedding shape: {embeddings.shape}")
    print(f"embedding dtype: {embeddings.dtype}")


if __name__ == "__main__":
    main()
```

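num_samples=16 keeps this verification run small. The script passes no dataset arguments, so the export helper falls back to its default calibration data. Calibrate with a larger sample of text that resembles production input before comparing retrieval quality, latency, or model size.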
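
Choosing the calibration text explicitly might look like the sketch below. It assumes the installed Sentence Transformers version accepts the `dataset_name`, `dataset_config_name`, `dataset_split`, and `column_name` keyword arguments on `export_static_quantized_openvino_model`, and it passes all four together; check the signature in your environment before relying on it. `CalibrationSource` and `quantize_with` are hypothetical names, and the GLUE SST-2 values are examples only.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CalibrationSource:
    """Hypothetical container for the export helper's dataset arguments."""

    dataset_name: str
    dataset_config_name: str
    dataset_split: str
    column_name: str

    def as_kwargs(self) -> dict:
        """Return the fields as keyword arguments for the export helper."""
        return asdict(self)


def quantize_with(source: CalibrationSource, model_dir: str, num_samples: int) -> None:
    """Quantize a saved OpenVINO model directory with explicit calibration data."""
    # Imported lazily so the settings can be inspected without OpenVINO installed.
    from optimum.intel import OVQuantizationConfig
    from sentence_transformers import (
        SentenceTransformer,
        export_static_quantized_openvino_model,
    )

    model = SentenceTransformer(
        model_dir,
        backend="openvino",
        model_kwargs={"file_name": "openvino/openvino_model.xml"},
    )
    export_static_quantized_openvino_model(
        model=model,
        quantization_config=OVQuantizationConfig(num_samples=num_samples),
        model_name_or_path=model_dir,
        # Assumption: these keyword names match the installed helper.
        **source.as_kwargs(),
    )


# Example values only; replace them with a dataset that resembles production text.
source = CalibrationSource(
    dataset_name="glue",
    dataset_config_name="sst2",
    dataset_split="validation",
    column_name="sentence",
)
print(source.as_kwargs())
```

Calling `quantize_with(source, "minilm-ov-int8", num_samples=...)` on a freshly saved model directory would then run the export with that dataset. Keeping the dataset settings in one object makes it easier to record which calibration data produced a given quantized file.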

Run the script and confirm the quantized model reloads.
```
$ python quantize_openvino_model.py
##### snipped #####
model directory: minilm-ov-int8
qint8 file:
openvino_model_qint8_quantized.xml
saved: True
embedding shape: (1, 384)
embedding dtype: float32
```
The saved: True line confirms that the qint8 OpenVINO XML file exists. The embedding shape and dtype confirm that Sentence Transformers can reload the quantized file and complete an encode call; they do not measure embedding quality or file size.
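
To see how much smaller the quantized model is, the file sizes can be compared directly. The sketch below is a rough check under one assumption about layout: each OpenVINO model is stored as an .xml graph file with a sibling .bin weights file of the same stem. `ir_size_bytes` and `size_ratio` are hypothetical helper names, and the paths match the directory produced by the script above.

```python
from pathlib import Path


def ir_size_bytes(xml_path: Path) -> int:
    """Size of an OpenVINO .xml file plus its sibling .bin file, if present."""
    xml_path = Path(xml_path)
    bin_path = xml_path.with_suffix(".bin")
    total = xml_path.stat().st_size
    if bin_path.exists():
        total += bin_path.stat().st_size
    return total


def size_ratio(base_xml: Path, quantized_xml: Path) -> float:
    """Quantized size divided by baseline size; below 1.0 means smaller."""
    return ir_size_bytes(quantized_xml) / ir_size_bytes(base_xml)


model_dir = Path("minilm-ov-int8") / "openvino"
base = model_dir / "openvino_model.xml"
quantized = model_dir / "openvino_model_qint8_quantized.xml"

# Only report when both models are present, so the snippet is safe to run anywhere.
if base.exists() and quantized.exists():
    print(f"baseline bytes:  {ir_size_bytes(base)}")
    print(f"quantized bytes: {ir_size_bytes(quantized)}")
    print(f"size ratio:      {size_ratio(base, quantized):.2f}")
```

Counting the .bin file matters because the weights, not the graph description, account for most of the bytes on disk.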

Author: Mohd Shakir Zakaria