OpenVINO model quantization converts the transformer component of a Sentence Transformers model into a lower-precision int8 artifact. Use it once the model already runs with the OpenVINO backend and the deployment needs a smaller file for CPU inference.
Sentence Transformers exposes this workflow through its OpenVINO static quantization export helper. The model must be loaded with backend="openvino", saved locally, calibrated with a representative text sample, and then reloaded with the quantized OpenVINO XML filename.
The result is still a Sentence Transformers model directory, not just a standalone OpenVINO file. Sentence Transformers keeps the sentence-level pooling and normalization around the exported transformer graph, so validate the reloaded model with the same kind of text that the service will encode.
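One way to validate the reloaded model is to encode the same texts with the unquantized and the quantized model and compare the two embedding sets row by row. The sketch below is a minimal, dependency-free illustration of that comparison; the helper names and the sample vectors are hypothetical stand-ins for the output of two encode calls, and any acceptance threshold is a choice the service owner has to make.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def min_pairwise_similarity(
    baseline: Sequence[Sequence[float]],
    quantized: Sequence[Sequence[float]],
) -> float:
    """Lowest row-wise similarity between two embedding sets for the same texts."""
    if len(baseline) != len(quantized):
        raise ValueError("both embedding sets must cover the same texts")
    return min(cosine_similarity(b, q) for b, q in zip(baseline, quantized))


# Hypothetical stand-ins for embeddings of the same two texts,
# not real model output.
baseline = [[0.10, 0.20, 0.30], [0.50, -0.10, 0.05]]
quantized = [[0.11, 0.19, 0.31], [0.49, -0.12, 0.06]]

worst = min_pairwise_similarity(baseline, quantized)
print(f"worst cosine similarity: {worst:.4f}")
```

In practice the two inputs would be the arrays returned by encode on the original and the quantized model for an identical list of service-like texts. The lowest value is more informative than the mean because a single badly shifted embedding can change retrieval results.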
datasets
sentence-transformers[openvino]
datasets provides calibration rows for post-training static quantization. The openvino extra installs the OpenVINO, Optimum Intel, and NNCF packages used by the Sentence Transformers export helper.
$ pip install -r requirements.txt
from pathlib import Path

import optimum.intel as oi
import sentence_transformers as st


def main() -> None:
    model_id = (
        "sentence-transformers/"
        "all-MiniLM-L6-v2"
    )
    base_file = "openvino/openvino_model.xml"
    qint8_name = (
        "openvino_model_"
        "qint8_quantized.xml"
    )
    qint8_file = (
        "openvino/" + qint8_name
    )
    output_dir = Path("minilm-ov-int8")
    quantized_file = (
        output_dir / "openvino" / qint8_name
    )

    if output_dir.exists():
        raise SystemExit("Choose a new output directory.")

    model = st.SentenceTransformer(
        model_id,
        backend="openvino",
        model_kwargs={
            "file_name": base_file
        },
    )
    model.save_pretrained(str(output_dir))

    config = oi.OVQuantizationConfig(
        num_samples=16
    )
    st.export_static_quantized_openvino_model(
        model=model,
        quantization_config=config,
        model_name_or_path=str(output_dir),
    )

    reloaded = st.SentenceTransformer(
        str(output_dir),
        backend="openvino",
        model_kwargs={
            "file_name": qint8_file
        },
    )
    texts = [
        "OpenVINO quantization reduces "
        "model precision."
    ]
    embeddings = reloaded.encode(texts)

    print(f"model directory: {output_dir}")
    print("qint8 file:")
    print(quantized_file.name)
    print(f"saved: {quantized_file.exists()}")
    print(f"embedding shape: {embeddings.shape}")
    print(f"embedding dtype: {embeddings.dtype}")


if __name__ == "__main__":
    main()
num_samples=16 keeps this verification run small. Use more representative calibration text before comparing production retrieval quality, latency, or model size.
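For the model size comparison, a rough sketch is to sum the bytes of each OpenVINO IR pair, since an IR model is stored as an XML topology file plus a sibling .bin weights file. The helper names below are hypothetical, and the paths assume that the unquantized openvino_model.xml was saved next to the quantized file in the output directory used by the script above.

```python
from pathlib import Path


def ir_size_bytes(xml_path: Path) -> int:
    """Total size of an OpenVINO IR pair: the .xml file plus its sibling .bin."""
    bin_path = xml_path.with_suffix(".bin")
    return sum(p.stat().st_size for p in (xml_path, bin_path) if p.exists())


def size_ratio(base_xml: Path, quantized_xml: Path) -> float:
    """Quantized size divided by baseline size (below 1.0 means smaller)."""
    base_size = ir_size_bytes(base_xml)
    if base_size == 0:
        raise ValueError(f"no baseline files found for {base_xml}")
    return ir_size_bytes(quantized_xml) / base_size


# Assumed layout: both IR pairs live in the "openvino" subdirectory.
ov_dir = Path("minilm-ov-int8") / "openvino"
base_xml = ov_dir / "openvino_model.xml"
quantized_xml = ov_dir / "openvino_model_qint8_quantized.xml"

if base_xml.exists() and quantized_xml.exists():
    print(f"baseline bytes:  {ir_size_bytes(base_xml)}")
    print(f"quantized bytes: {ir_size_bytes(quantized_xml)}")
    print(f"size ratio:      {size_ratio(base_xml, quantized_xml):.2f}")
else:
    print("Run the quantization script first.")
```

Counting the .bin file matters because the weights live there; comparing only the XML files would understate the difference.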
$ python quantize_openvino_model.py
##### snipped #####
model directory: minilm-ov-int8
qint8 file:
openvino_model_qint8_quantized.xml
saved: True
embedding shape: (1, 384)
embedding dtype: float32
The saved: True line confirms the qint8 OpenVINO XML file exists. The embedding shape and dtype confirm Sentence Transformers can reload the quantized file for an encode call.
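The same checks can be made programmatic instead of read from the printed output. The following is a minimal sketch, assuming the embeddings can be iterated row by row (a NumPy array returned by encode behaves this way); check_embeddings is a hypothetical helper name, and the zero-filled row is only a stand-in for real model output.

```python
import math
from typing import Sequence


def check_embeddings(
    rows: Sequence[Sequence[float]],
    expected_rows: int,
    expected_dim: int,
) -> None:
    """Raise ValueError if the embedding set has the wrong shape or non-finite values."""
    if len(rows) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(rows)}")
    for index, row in enumerate(rows):
        if len(row) != expected_dim:
            raise ValueError(
                f"row {index}: expected {expected_dim} values, got {len(row)}"
            )
        if not all(math.isfinite(float(value)) for value in row):
            raise ValueError(f"row {index}: contains NaN or infinite values")


# Stand-in for the (1, 384) result of the encode call above.
stand_in = [[0.0] * 384]
check_embeddings(stand_in, expected_rows=1, expected_dim=384)
print("embedding checks passed")
```

A shape and finiteness check only shows that the quantized file loads and runs; it says nothing about embedding quality.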