How to search a large corpus with Sentence Transformers

Semantic search over a larger text collection needs a retrieval path that compares a query against stored document embeddings without building a full all-pairs score matrix. Sentence Transformers can run that exact-search stage with precomputed corpus embeddings and chunked scoring, which keeps the prototype close to the code used later in a vector index or retrieval service.

The semantic_search() helper accepts query embeddings, corpus embeddings, a top_k limit, and chunk sizes for query and corpus scanning. Smaller corpus_chunk_size values lower the temporary score matrix size, while larger chunks can be faster when the available CPU or GPU memory can hold them.
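The effect of corpus_chunk_size is easier to see in a stripped-down version of the same idea. The following is a minimal pure-Python sketch of chunked exact search, not the library's implementation: chunked_top_k() and dot() are illustrative names, and the two-dimensional vectors are made up. Only one chunk of scores exists at a time, and a small heap carries the best hits across chunks.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def chunked_top_k(query, corpus, top_k=3, corpus_chunk_size=2):
    """Exact top-k search that scores one corpus chunk at a time."""
    best = []  # min-heap of (score, corpus_id), never longer than top_k
    for start in range(0, len(corpus), corpus_chunk_size):
        chunk = corpus[start:start + corpus_chunk_size]
        # Only this chunk's scores are held in memory at once.
        chunk_scores = [dot(query, vector) for vector in chunk]
        for offset, score in enumerate(chunk_scores):
            item = (score, start + offset)
            if len(best) < top_k:
                heapq.heappush(best, item)
            elif item > best[0]:
                heapq.heapreplace(best, item)
    # Highest score first, using the same corpus_id/score keys as the hits
    # that the article's script reads.
    return [
        {"corpus_id": corpus_id, "score": score}
        for score, corpus_id in sorted(best, reverse=True)
    ]


# Made-up vectors purely for illustration.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6], [0.6, 0.8], [-1.0, 0.0]]
query = [1.0, 0.0]

hits = chunked_top_k(query, corpus, top_k=2, corpus_chunk_size=2)
for hit in hits:
    print(hit)
```

Changing corpus_chunk_size in this sketch changes how many scores are held at once, not which documents come back, because every document is still scored.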

The local sample repeats six support-document topics into a 2,400-document corpus, searches for a password-reset question, and checks that a password-reset document ranks first. Replace the repeated records with stable application IDs and real text before saving results. Move the embeddings into FAISS, Qdrant, or another vector store when persistence, filtering, or approximate nearest-neighbor search becomes the main job.
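Stable IDs matter because a hit's corpus_id is only a position in the embedding list, and positions shift whenever the corpus is rebuilt. The sketch below shows one way to swap positions for record IDs before saving. The kb-style IDs, the scores, and the to_saved_results() helper are hypothetical placeholders, not part of Sentence Transformers.

```python
import json

# Hypothetical records: "id" stands in for a stable application ID.
corpus = [
    {"id": "kb-1001", "text": "Reset a forgotten password from account settings."},
    {"id": "kb-1002", "text": "Export paid invoices as a CSV file."},
]

# Made-up hits with the corpus_id and score keys that the search script reads.
hits = [{"corpus_id": 0, "score": 0.71}, {"corpus_id": 1, "score": 0.12}]


def to_saved_results(hits, corpus):
    """Swap positional corpus_id values for stable record IDs."""
    return [
        {"id": corpus[hit["corpus_id"]]["id"], "score": round(float(hit["score"]), 4)}
        for hit in hits
    ]


saved = to_saved_results(hits, corpus)
print(json.dumps(saved, indent=2))
```

Saved results in this shape stay valid after the corpus is re-encoded in a different order, as long as the IDs themselves do not change.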

Steps to run large-corpus semantic search with Sentence Transformers:

Install Sentence Transformers in the active Python environment.
```
$ python -m pip install --upgrade sentence-transformers
```
The first model run may download files from Hugging Face. Use the same environment that will encode the production corpus.
Related: How to install Sentence Transformers with pip
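A quick way to confirm which environment is active is to print the interpreter path and the installed package version. This is a small sketch using only the standard library; installed_version() is an illustrative helper name, and it returns None instead of raising an error when the package is missing.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(f"interpreter: {sys.executable}")
print(f"sentence-transformers: {installed_version('sentence-transformers')}")
```

If the version prints as None, the install command most likely ran against a different interpreter than the one running this check.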

Create the large-corpus semantic search script.

large_corpus_search.py

```
from sentence_transformers import SentenceTransformer, util


model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

topics = [
    (
        "password-reset",
        "Reset a forgotten password from account settings, open the email link, "
        "and choose a new password.",
    ),
    (
        "invoice-export",
        "Export paid invoices from the billing dashboard as a CSV file for accounting.",
    ),
    (
        "api-token-rotation",
        "Rotate API tokens before sharing a new integration with a teammate.",
    ),
    (
        "notification-email",
        "Change the notification email address and confirm the new address for alerts.",
    ),
    (
        "workspace-theme",
        "Change the dashboard color theme for a workspace user profile.",
    ),
    (
        "vector-search",
        "Store dense embeddings in a vector index for semantic search retrieval.",
    ),
]

corpus = []
for shard_id in range(1, 401):
    for topic, text in topics:
        corpus.append(
            {
                "id": f"{topic}-{shard_id:03d}",
                "topic": topic,
                "text": f"{text} Region {shard_id:03d}.",
            }
        )

documents = [item["text"] for item in corpus]
query = "How does a user reset a forgotten password with an email link?"

document_embeddings = model.encode_document(
    documents,
    batch_size=128,
    normalize_embeddings=True,
    convert_to_tensor=True,
    show_progress_bar=False,
)
query_embedding = model.encode_query(
    query,
    normalize_embeddings=True,
    convert_to_tensor=True,
    show_progress_bar=False,
)

query_chunk_size = 1
corpus_chunk_size = 256
top_k = 3

hits = util.semantic_search(
    query_embedding,
    document_embeddings,
    query_chunk_size=query_chunk_size,
    corpus_chunk_size=corpus_chunk_size,
    top_k=top_k,
    score_function=util.dot_score,
)[0]

print(f"corpus documents: {len(corpus)}")
print(f"embedding dimension: {document_embeddings.shape[1]}")
print(f"query chunk size: {query_chunk_size}")
print(f"corpus chunk size: {corpus_chunk_size}")
print(f"top k: {top_k}")
print(f"query: {query}")
print("top matches:")
for rank, hit in enumerate(hits, start=1):
    record = corpus[hit["corpus_id"]]
    print(
        f"{rank}. {record['id']} topic={record['topic']} "
        f"score={hit['score']:.4f}"
    )

if corpus[hits[0]["corpus_id"]]["topic"] != "password-reset":
    raise SystemExit("unexpected top topic")

print("verification: PASS password reset documents ranked first")
```

encode_document() and encode_query() keep the code ready for embedding models that define separate document and query prompts. Normalized embeddings plus dot_score keep the scores on a cosine-similarity scale.
Related: How to encode queries and documents with Sentence Transformers
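The link between normalization and cosine similarity can be checked by hand. The sketch below uses two made-up vectors and plain Python helpers (dot(), normalize(), and cosine() are illustrative names, not library functions) to show that the dot product of unit-length vectors matches the cosine similarity of the original vectors, while the raw dot product depends on vector length.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors, both of length 5.
a = [3.0, 4.0]
b = [4.0, 3.0]

raw_dot = dot(a, b)                         # grows with vector length
unit_dot = dot(normalize(a), normalize(b))  # bounded like a cosine score

print(f"raw dot product: {raw_dot}")
print(f"cosine similarity: {cosine(a, b):.4f}")
print(f"dot product of normalized vectors: {unit_dot:.4f}")
```

This is why the script can pass a dot-product score function and still read the scores as cosine similarities, provided both the query and the documents are encoded with normalization enabled.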

Run the script and confirm the chunked search settings.

```
$ python large_corpus_search.py
corpus documents: 2400
embedding dimension: 384
query chunk size: 1
corpus chunk size: 256
top k: 3
query: How does a user reset a forgotten password with an email link?
top matches:
1. password-reset-253 topic=password-reset score=0.6977
2. password-reset-106 topic=password-reset score=0.6953
3. password-reset-112 topic=password-reset score=0.6937
verification: PASS password reset documents ranked first
```

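The reported corpus count and corpus chunk size confirm the settings the search ran with: 2,400 documents scanned in chunks of 256, which is ten corpus chunks with a shorter final one. Chunk sizes change memory use and speed, not ranking, because chunked scoring is still exact search. If the expected topic does not rank first, inspect the source text and how it is split into documents, the model choice, and the normalization setting before increasing top_k.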
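The chunk arithmetic behind those settings can be worked out directly. The sketch below assumes scores are stored as 32-bit floats (4 bytes each), which is an assumption about the tensor dtype and not something the script prints. The variable names are illustrative.

```python
import math

corpus_size = 2400
query_chunk_size = 1
corpus_chunk_size = 256

# Number of corpus chunks scanned for each query chunk.
chunk_count = math.ceil(corpus_size / corpus_chunk_size)
# The final chunk holds whatever is left over.
last_chunk = corpus_size - (chunk_count - 1) * corpus_chunk_size

# Assumption: float32 scores, 4 bytes per score.
bytes_per_score = 4
block_bytes = query_chunk_size * corpus_chunk_size * bytes_per_score
full_bytes = query_chunk_size * corpus_size * bytes_per_score

print(f"corpus chunks: {chunk_count} (last chunk has {last_chunk} documents)")
print(f"score block per chunk: {block_bytes} bytes")
print(f"score row without chunking: {full_bytes} bytes")
```

At this scale the savings are negligible. The same arithmetic with millions of documents and hundreds of queries per chunk is where the chunk sizes start to decide whether the score block fits in memory.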

Remove the temporary script after the search behavior is verified.
```
$ rm large_corpus_search.py
```

Author: Mohd Shakir Zakaria