Image search with Sentence Transformers works by embedding pictures and text descriptions into the same vector space. A CLIP-based model can compare a text query such as "a red square" against image vectors from a small catalog, which is enough to validate the retrieval pattern before adding a database or web interface.

The prototype uses sentence-transformers/clip-ViT-B-32 because the model accepts both Pillow images and text strings through the normal encode() interface. It creates three sample images, embeds them, embeds three text queries, and ranks the images by cosine similarity, computed as the dot product of L2-normalized embeddings.
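
The ranking step relies on the fact that the dot product of two unit-length vectors equals their cosine similarity. A minimal NumPy sketch of that idea follows; the 3-dimensional vectors are made up for illustration (they are not CLIP embeddings), and the helper name l2_normalize is hypothetical.

```python
import numpy as np


def l2_normalize(vectors):
    """Scale each row to unit length (hypothetical helper for illustration)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


# Toy vectors standing in for embeddings; they are not model output.
queries = np.array([[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]])
images = np.array([[2.0, 4.0, 4.0], [4.0, 0.0, 3.0]])

# Rows are queries, columns are images; each entry is a cosine similarity.
scores = l2_normalize(queries) @ l2_normalize(images).T

# Cross-check one entry against the textbook cosine formula.
cosine = queries[0] @ images[0] / (
    np.linalg.norm(queries[0]) * np.linalg.norm(images[0])
)

print(scores.shape)
print(bool(np.isclose(scores[0, 0], cosine)))
```

Passing normalize_embeddings=True to encode(), as the script in the steps does, plays the role of l2_normalize here, so a plain matrix product is enough to score every query against every image.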

Install the image extra into the same Python environment that contains PyTorch and torchvision. In an existing pinned PyTorch environment, install matching vision wheels from the same wheel source before loading CLIP; a fresh project environment can usually use the sentence-transformers[image] extra directly.
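
One way to check which versions share the environment before loading CLIP is to read the installed distribution metadata. The sketch below uses only the standard library; the helper name installed_versions and the package list are assumptions made for illustration, not part of Sentence Transformers.

```python
from importlib import metadata

# Distribution names as published on PyPI (assumed list for this check).
PACKAGES = ("torch", "torchvision", "sentence-transformers", "pillow")


def installed_versions(packages=PACKAGES):
    """Return {name: version or None} for each distribution (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Wheels from the PyTorch wheel index often carry a local version suffix such as +cpu or a CUDA tag, so comparing the torch and torchvision lines is a quick, though not conclusive, way to spot a mixed install.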

Steps to build image search with Sentence Transformers:

  1. Install image support in the active Python environment.
    $ python -m pip install --upgrade "sentence-transformers[image]"

    Use a project virtual environment when possible. If torch is already pinned, keep torchvision from the same CPU or CUDA wheel source.
    Related: How to install Sentence Transformers with pip

  2. Create the image-search script.
    build_image_search.py
    from pathlib import Path
     
    import numpy as np
    from PIL import Image, ImageDraw
    from sentence_transformers import SentenceTransformer
     
     
    def make_sample(path, color, shape):
        """Draw one filled shape on a white 224x224 canvas and save it to path."""
        image = Image.new("RGB", (224, 224), "white")
        draw = ImageDraw.Draw(image)
     
        if shape == "red square":
            draw.rectangle((46, 46, 178, 178), fill=color)
        elif shape == "blue circle":
            draw.ellipse((42, 42, 182, 182), fill=color)
        elif shape == "green triangle":
            draw.polygon([(112, 34), (38, 188), (186, 188)], fill=color)
     
        image.save(path)
     
     
    image_dir = Path("demo-images")
    image_dir.mkdir(exist_ok=True)
     
    samples = [
        ("red-square.png", "red square", (220, 30, 30)),
        ("blue-circle.png", "blue circle", (30, 80, 220)),
        ("green-triangle.png", "green triangle", (40, 155, 75)),
    ]
     
    for filename, label, color in samples:
        make_sample(image_dir / filename, color, label)
     
    model = SentenceTransformer("sentence-transformers/clip-ViT-B-32")
     
    # Pillow images are passed to encode() directly, as described above.
    images = [Image.open(image_dir / filename) for filename, _, _ in samples]
    image_embeddings = model.encode(
        images,
        normalize_embeddings=True,
        convert_to_numpy=True,
    )
     
    queries = [
        ("a red square", "red square"),
        ("a blue circle", "blue circle"),
        ("a green triangle", "green triangle"),
    ]
    query_embeddings = model.encode(
        [query for query, _ in queries],
        normalize_embeddings=True,
        convert_to_numpy=True,
    )
     
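    # Embeddings are L2-normalized, so each dot product is a cosine similarity.
    # Rows correspond to queries and columns to images.
    scores = np.matmul(query_embeddings, image_embeddings.T)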
    labels = [label for _, label, _ in samples]
     
    print("model: sentence-transformers/clip-ViT-B-32")
    print(f"indexed images: {len(samples)}")
    print(f"embedding dimension: {image_embeddings.shape[1]}")
    print("top matches:")
     
    all_expected = True
    for row, (query, expected) in enumerate(queries):
        best_index = int(np.argmax(scores[row]))
        matched = labels[best_index]
        score = scores[row][best_index]
        print(f"- query={query!r} match={matched} score={score:.4f}")
        all_expected = all_expected and matched == expected
     
    print(f"all expected matches: {all_expected}")

    The sample images are generated locally so the script can run without external image files. Replace the samples list with real image paths and labels when moving from the smoke test to an application index.

  3. Run the image-search script.
    $ python build_image_search.py
    model: sentence-transformers/clip-ViT-B-32
    indexed images: 3
    embedding dimension: 512
    top matches:
    - query='a red square' match=red square score=0.2718
    - query='a blue circle' match=blue circle score=0.3417
    - query='a green triangle' match=green triangle score=0.3531
    all expected matches: True

    The first run may download the CLIP model before printing the search output. Scores are relative to the query and model, and they are small in absolute terms here (about 0.27 to 0.35), so for this smoke test judge the result by the top-ranked label and the score ordering rather than by an absolute threshold.
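
Because the ordering matters more than the absolute score, a real index would usually return several ranked candidates per query instead of a single argmax. The following is a minimal sketch of that, assuming a score matrix shaped like the one the script computes; the function name top_k_matches and the score values are made up for illustration and are not model output.

```python
import numpy as np


def top_k_matches(scores, labels, k=2):
    """Return, per query row, the k best (label, score) pairs, best first."""
    k = min(k, scores.shape[1])
    # argsort is ascending, so negate the scores to rank in descending order.
    order = np.argsort(-scores, axis=1)[:, :k]
    return [
        [(labels[i], float(scores[row, i])) for i in indices]
        for row, indices in enumerate(order)
    ]


# Made-up score matrix (2 queries x 3 images), not real model output.
labels = ["red square", "blue circle", "green triangle"]
scores = np.array([
    [0.27, 0.21, 0.19],
    [0.20, 0.34, 0.22],
])

results = top_k_matches(scores, labels, k=2)
for row, matches in enumerate(results):
    print(row, matches)
```

Returning the runner-up alongside the best match also exposes the gap between the two scores, which can be a more useful signal than a fixed cutoff when queries vary in wording.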