How to generate multimodal embeddings with Sentence Transformers

Multimodal embedding models place different input types, such as images and text, into one vector space. In Sentence Transformers, a CLIP-based checkpoint can encode an image and a set of candidate text prompts into vectors that are directly comparable, which makes a quick sanity check possible before building a larger image search or cross-modal retrieval system.

The smoke test uses sentence-transformers/clip-ViT-B-32 because it reports both text and image support and accepts a Pillow image through the normal encode() method. A generated red square keeps the check self-contained, so no external image download or dataset fixture is required.

Install image support in the same Python environment that owns torch and torchvision. If an existing project pins PyTorch wheels, keep torchvision from the same CPU or CUDA wheel source instead of letting a fresh install silently replace the runtime stack.
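
Before running pip, it can help to record what the environment already contains. The following is a minimal sketch that only reads installed package metadata; the helper name installed_versions and the package list are illustrative assumptions, not part of Sentence Transformers. Wheels installed from the PyTorch index usually carry a local suffix such as +cpu or +cu121, so comparing the torch and torchvision lines before and after the install is one way to notice a replaced runtime stack.

```python
from importlib import metadata

# Distributions worth recording before the install (illustrative list).
PACKAGES = ("torch", "torchvision", "sentence-transformers", "pillow")


def installed_versions(names=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}={version or 'not installed'}")
```

Running it once before and once after the install gives two listings that can be compared line by line.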

Steps to generate multimodal embeddings with Sentence Transformers:

Install Sentence Transformers image support in the active Python environment.
```
$ python -m pip install --upgrade "sentence-transformers[image]"
```
Use a project virtual environment when possible. The sentence-transformers[image] extra installs the image dependencies needed for CLIP-style image inputs.
Related: How to install Sentence Transformers with pip

Create the multimodal embedding smoke-test script.

embed_mm.py

```
from PIL import Image, ImageDraw
from sentence_transformers import SentenceTransformer

model_id = "sentence-transformers/clip-ViT-B-32"

# Draw a red square on a white canvas so the test needs no external image.
image = Image.new("RGB", (224, 224), "white")
draw = ImageDraw.Draw(image)
draw.rectangle((48, 48, 176, 176), fill=(220, 30, 30))

model = SentenceTransformer(model_id)

print(f"model_id={model_id}")

# These lines are printed only when the installed release exposes the attributes.
if hasattr(model, "modalities"):
    print(f"modalities={model.modalities}")
if hasattr(model, "supports"):
    print(f"supports_image={model.supports('image')}")

# Encode the image and the candidate prompts with the same model.
image_embedding = model.encode(
    image,
    normalize_embeddings=True,
    convert_to_numpy=True,
    show_progress_bar=False,
)
text_labels = ["a red square", "a blue circle"]
text_embeddings = model.encode(
    text_labels,
    normalize_embeddings=True,
    convert_to_numpy=True,
    show_progress_bar=False,
)

# similarity() returns a 1 x 2 score matrix; take the single image row.
scores = model.similarity(image_embedding, text_embeddings)[0]
best_index = int(scores.argmax())

print(f"image_embedding_shape={image_embedding.shape}")
print(f"text_embedding_shape={text_embeddings.shape}")
for label, score in zip(text_labels, scores):
    print(f"score[{label}]={float(score):.4f}")
print(f"best_text_match={text_labels[best_index]}")

# Exit with an error if the matching prompt is not ranked first.
if text_labels[best_index] != "a red square":
    raise SystemExit("unexpected text match")
```

The image is created in memory. Replace it with Image.open("image.png").convert("RGB") when testing a real file.

Run the smoke-test script.

```
$ python embed_mm.py
model_id=sentence-transformers/clip-ViT-B-32
modalities=['text', 'image']
supports_image=True
image_embedding_shape=(512,)
text_embedding_shape=(2, 512)
score[a red square]=0.2739
score[a blue circle]=0.2454
best_text_match=a red square
```

The first run may download the model from Hugging Face before printing the output.

Check that the model reports image support.

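The modalities line should include both text and image, and supports_image should print True. If neither line is printed, the installed release does not expose those attributes, and the shape and ranking checks that follow are the evidence to rely on.

Check that the image and text vectors share the same embedding dimension.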

The image vector shape is (512,) and the text matrix shape is (2, 512) for sentence-transformers/clip-ViT-B-32, so both outputs can be compared by model.similarity() or stored in a vector index with a 512-dimension schema.
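
To make the fixed-dimension point concrete, here is a minimal sketch of an in-memory index that rejects vectors of the wrong length and ranks entries by cosine similarity. TinyIndex is a hypothetical class written for illustration, not a Sentence Transformers or vector-database API, and it assumes cosine similarity is the scoring function in use. It relies only on the standard library, so it accepts plain lists as well as NumPy rows.

```python
import math


class TinyIndex:
    """Hypothetical in-memory index that enforces one vector dimension."""

    def __init__(self, dim):
        self.dim = dim
        self.items = []  # list of (key, unit-length vector) pairs

    def _unit(self, vector):
        # Validate the length, then scale the vector to unit length.
        values = [float(x) for x in vector]
        if len(values) != self.dim:
            raise ValueError(f"expected {self.dim} values, got {len(values)}")
        norm = math.sqrt(sum(x * x for x in values))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in values]

    def add(self, key, vector):
        self.items.append((key, self._unit(vector)))

    def search(self, query, top_k=1):
        # For unit vectors, the dot product equals the cosine similarity.
        unit = self._unit(query)
        scored = [
            (key, sum(a * b for a, b in zip(unit, vec)))
            for key, vec in self.items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```

With the variables from embed_mm.py, one possible use is to create TinyIndex(512), call add(label, vector) for each label and row of text_embeddings, and then call search(image_embedding). A vector from a model with a different output size would raise ValueError instead of being stored silently.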

Confirm that the image ranks the matching prompt first.

The exact scores can change slightly across runtime versions, but best_text_match should remain a red square for the generated red-square image.

Remove the smoke-test script when the project has its own embedding path.
```
$ rm embed_mm.py
```
Keep the file instead if it will become a regression check for the project's chosen multimodal model.
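
If the script is kept as a regression check, one option is to validate its printed output without pinning the exact scores. The sketch below is a hedged example: the function names parse_smoke_output, check_smoke_output, and run_smoke_test are hypothetical, and the expected values are the ones shown for sentence-transformers/clip-ViT-B-32 in this guide.

```python
import subprocess
import sys


def parse_smoke_output(text):
    """Parse the key=value lines printed by the smoke test into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without "=", such as a shell prompt line
            fields[key.strip()] = value.strip()
    return fields


def check_smoke_output(text, expected_match="a red square", expected_dim=512):
    """Return a list of problems; an empty list means the output looks right."""
    fields = parse_smoke_output(text)
    problems = []
    if fields.get("best_text_match") != expected_match:
        problems.append(f"best_text_match={fields.get('best_text_match')!r}")
    if fields.get("image_embedding_shape") != f"({expected_dim},)":
        problems.append("unexpected image embedding shape")
    if not fields.get("text_embedding_shape", "").endswith(f", {expected_dim})"):
        problems.append("unexpected text embedding shape")
    return problems


def run_smoke_test(script="embed_mm.py"):
    """Run the script with the current interpreter and return its stdout."""
    completed = subprocess.run(
        [sys.executable, script], capture_output=True, text=True, check=True
    )
    return completed.stdout
```

Calling check_smoke_output(run_smoke_test()) from a test runner would then fail only when the best match or the embedding dimension changes, not when the scores drift slightly.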