How to build image search with Sentence Transformers

Image search with Sentence Transformers works by embedding pictures and text descriptions into the same vector space. A CLIP-based model can compare a text query such as "a red square" against image vectors from a small catalog, which is enough to prove the retrieval pattern before adding a database or web interface.

The prototype uses sentence-transformers/clip-ViT-B-32 because the model accepts both Pillow images and text strings through the normal encode() interface. It creates three sample images, embeds them, embeds three text queries, and ranks the images by the dot product of the normalized embeddings, which is equivalent to cosine similarity.
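The ranking step can be tried without downloading a model. The following is a minimal sketch using NumPy only; the toy 3-dimensional vectors and the helper names normalize_rows and cosine_scores are illustrative assumptions, not part of Sentence Transformers.

```python
import numpy as np


def normalize_rows(vectors):
    """Scale each row to unit length so dot products become cosine similarities."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms  # assumes no all-zero rows


def cosine_scores(query_vectors, image_vectors):
    """Return a (queries x images) matrix of cosine similarities."""
    return normalize_rows(query_vectors) @ normalize_rows(image_vectors).T


# Toy stand-ins for real embeddings; values are made up for illustration.
toy_images = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
toy_queries = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]

scores = cosine_scores(toy_queries, toy_images)
best = scores.argmax(axis=1)  # index of the best image for each query
print(best.tolist())  # prints [0, 2]
```

Because every row has unit length after normalization, vector magnitude does not affect the ranking; only direction does.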

Install the image extra into the same Python environment that holds PyTorch and torchvision. In an existing pinned PyTorch environment, install matching vision wheels from the same wheel source before loading CLIP; a fresh project environment can usually use the sentence-transformers[image] extra directly.
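One way to check what the active environment contains before loading CLIP is to read the installed package metadata. This is a sketch using only the standard library; the installed_versions helper and the list of package names are assumptions chosen for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(
    ["torch", "torchvision", "sentence-transformers", "Pillow"]
)
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Wheels from the PyTorch package index often carry a local version tag such as +cpu or a CUDA tag, so comparing the torch and torchvision lines can hint at whether both came from the same wheel source. Treat that as a quick check, not a guarantee of compatibility.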

Steps to build image search with Sentence Transformers:

Install image support in the active Python environment.
```
$ python -m pip install --upgrade "sentence-transformers[image]"
```
Use a project virtual environment when possible. If torch is already pinned, keep torchvision from the same CPU or CUDA wheel source.
Related: How to install Sentence Transformers with pip

Create the image-search script.

build_image_search.py

```
from pathlib import Path

import numpy as np
from PIL import Image, ImageDraw
from sentence_transformers import SentenceTransformer


def make_sample(path, color, label):
    """Draw one filled shape on a white 224x224 canvas and save it."""
    image = Image.new("RGB", (224, 224), "white")
    draw = ImageDraw.Draw(image)

    if label == "red square":
        draw.rectangle((46, 46, 178, 178), fill=color)
    elif label == "blue circle":
        draw.ellipse((42, 42, 182, 182), fill=color)
    elif label == "green triangle":
        draw.polygon([(112, 34), (38, 188), (186, 188)], fill=color)

    image.save(path)


image_dir = Path("demo-images")
image_dir.mkdir(exist_ok=True)

# Each sample is (filename, label, RGB fill color).
samples = [
    ("red-square.png", "red square", (220, 30, 30)),
    ("blue-circle.png", "blue circle", (30, 80, 220)),
    ("green-triangle.png", "green triangle", (40, 155, 75)),
]

for filename, label, color in samples:
    make_sample(image_dir / filename, color, label)

model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")

# Embed the images; normalized vectors let a dot product act as cosine similarity.
images = [Image.open(image_dir / filename) for filename, _, _ in samples]
image_embeddings = model.encode(
    images,
    normalize_embeddings=True,
    convert_to_numpy=True,
)

# Each query is (text, expected label).
queries = [
    ("a red square", "red square"),
    ("a blue circle", "blue circle"),
    ("a green triangle", "green triangle"),
]
query_embeddings = model.encode(
    [query for query, _ in queries],
    normalize_embeddings=True,
    convert_to_numpy=True,
)

# Rows are queries, columns are images.
scores = np.matmul(query_embeddings, image_embeddings.T)
labels = [label for _, label, _ in samples]

print("model: sentence-transformers/clip-ViT-B-32")
print(f"indexed images: {len(samples)}")
print(f"embedding dimension: {image_embeddings.shape[1]}")
print("top matches:")

all_expected = True
for row, (query, expected) in enumerate(queries):
    # The highest-scoring column is the best-matching image for this query.
    best_index = int(np.argmax(scores[row]))
    matched = labels[best_index]
    score = scores[row][best_index]
    print(f"- query={query!r} match={matched} score={score:.4f}")
    all_expected = all_expected and matched == expected

print(f"all expected matches: {all_expected}")
```

The sample images are generated locally so the script can run without external image files. Replace the samples list with real image paths and labels when moving from the smoke test to an application index.
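When swapping in real files, the list of samples could be built from a directory instead of being written by hand. The sketch below is one possible approach under stated assumptions: the extension set, the helper names select_images and collect_samples, and the choice to derive a placeholder label from the file name are all hypothetical, and it returns (path, label) pairs instead of the three-item tuples used in the script.

```python
from pathlib import Path

# Assumed extension set; adjust to the formats Pillow can open in your setup.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def select_images(paths):
    """Keep paths with a known image extension and pair each with a label."""
    kept = sorted(
        Path(p) for p in paths if Path(p).suffix.lower() in IMAGE_EXTENSIONS
    )
    # Placeholder label: the file stem with hyphens turned into spaces.
    return [(p, p.stem.replace("-", " ")) for p in kept]


def collect_samples(image_dir):
    """Walk image_dir recursively and return sorted (path, label) pairs."""
    files = (p for p in Path(image_dir).rglob("*") if p.is_file())
    return select_images(files)


pairs = select_images(["b/blue-circle.JPG", "a/notes.txt", "a/red-square.png"])
print([label for _, label in pairs])  # prints ['red square', 'blue circle']
```

The paths from collect_samples("demo-images") could then be opened with Image.open and passed to model.encode in place of the generated images. Real catalogs usually need labels from metadata, not file names.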

Run the image-search script.

```
$ python build_image_search.py
model: sentence-transformers/clip-ViT-B-32
indexed images: 3
embedding dimension: 512
top matches:
- query='a red square' match=red square score=0.2718
- query='a blue circle' match=blue circle score=0.3417
- query='a green triangle' match=green triangle score=0.3531
all expected matches: True
```

The first run may download the CLIP model before printing the search output. Scores are relative to the query and model, so use the top-ranked label and score ordering rather than an absolute threshold for this small smoke test.
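Ranking by score order instead of a fixed cutoff can be sketched as a small top-k helper. The top_k function is a hypothetical name, and the score row is illustrative: 0.2718 is the red-square score from the run above, while the other two values are made up.

```python
import numpy as np


def top_k(score_row, labels, k=3):
    """Return up to k (label, score) pairs, highest score first."""
    score_row = np.asarray(score_row)
    order = np.argsort(-score_row)[:k]  # indices sorted by descending score
    return [(labels[i], float(score_row[i])) for i in order]


labels = ["red square", "blue circle", "green triangle"]
row = [0.2718, 0.2105, 0.1987]  # illustrative scores for one query

ranked = top_k(row, labels, k=2)
# Gap between first and second place: a relative signal, not an absolute one.
margin = ranked[0][1] - ranked[1][1]
print(ranked[0][0], round(margin, 4))
```

The margin between the first and second result gives a rough sense of how decisive a match is without committing to an absolute threshold.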

Author: Mohd Shakir Zakaria