Semantic search over a larger text collection needs a retrieval path that compares a query against stored document embeddings without building a full all-pairs score matrix. Sentence Transformers can run that exact-search stage with precomputed corpus embeddings and chunked scoring, which keeps the prototype close to the code used later in a vector index or retrieval service.
The semantic_search() helper accepts query embeddings, corpus embeddings, a top_k limit, and chunk sizes for query and corpus scanning. Smaller corpus_chunk_size values lower the temporary score matrix size, while larger chunks can be faster when the available CPU or GPU memory can hold them.
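The effect of corpus chunking can be sketched in plain Python. This is a minimal illustration of chunked exact top-k search, not the library's implementation; the function name chunked_top_k and the toy vectors are hypothetical, and a heap stands in for whatever bookkeeping semantic_search() uses internally.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def chunked_top_k(query, corpus, top_k=3, corpus_chunk_size=2):
    """Exact top-k search that scores one corpus chunk at a time.

    Illustrative only: a hypothetical helper, not the library's code.
    """
    best = []  # min-heap of (score, corpus_id), never longer than top_k
    for start in range(0, len(corpus), corpus_chunk_size):
        chunk = corpus[start:start + corpus_chunk_size]
        # Temporary score block: one row, at most corpus_chunk_size columns.
        scores = [dot(query, vector) for vector in chunk]
        for offset, score in enumerate(scores):
            item = (score, start + offset)
            if len(best) < top_k:
                heapq.heappush(best, item)
            else:
                # Add the new item, then drop the lowest-scoring one.
                heapq.heappushpop(best, item)
    ranked = sorted(best, reverse=True)
    return [{"corpus_id": cid, "score": score} for score, cid in ranked]


# Toy 2-dimensional vectors standing in for real embeddings.
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6], [-1.0, 0.0]]
toy_query = [1.0, 0.0]

for size in (1, 2, 5):
    hits = chunked_top_k(toy_query, toy_corpus, top_k=2, corpus_chunk_size=size)
    print(size, [hit["corpus_id"] for hit in hits])
```

Every chunk size returns the same two documents, because chunking only changes how many scores exist at once, not which scores are computed.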
The local sample repeats six support-document topics into a 2,400-document corpus, searches for a password-reset question, and verifies that password-reset documents rank first. Replace the repeated records with stable application IDs and real text before saving results. Move the embeddings into FAISS, Qdrant, or another vector store when persistence, filtering, or approximate nearest-neighbor search becomes the main job.
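One way to keep saved results tied to stable application IDs is to translate each hit's corpus position back into the record it came from before writing anything out. The sketch below assumes hits arrive as dictionaries with corpus_id and score keys, as in the sample script; the record IDs, text, scores, and the hits_to_rows helper are hypothetical placeholders.

```python
import json

# Hypothetical records: the IDs and text stand in for real application data.
records = [
    {"id": "kb-1042", "text": "Reset a forgotten password from account settings."},
    {"id": "kb-2210", "text": "Export paid invoices as a CSV file."},
    {"id": "kb-3307", "text": "Rotate API tokens before sharing an integration."},
]

# Made-up hits in the corpus_id/score shape used by the sample script.
sample_hits = [
    {"corpus_id": 0, "score": 0.71},
    {"corpus_id": 2, "score": 0.12},
]


def hits_to_rows(hits, records):
    """Replace list positions with stable record IDs so rows can be saved."""
    rows = []
    for rank, hit in enumerate(hits, start=1):
        record = records[hit["corpus_id"]]
        rows.append(
            {
                "rank": rank,
                "id": record["id"],
                "score": round(float(hit["score"]), 4),
            }
        )
    return rows


rows = hits_to_rows(sample_hits, records)
print(json.dumps(rows, indent=2))
```

Saving the ID instead of the corpus position means the stored results stay valid if the corpus is later re-ordered or re-encoded.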
Steps to run large-corpus semantic search with Sentence Transformers:
- Install Sentence Transformers in the active Python environment.
$ python -m pip install --upgrade sentence-transformers
The first model run may download files from Hugging Face. Use the same environment that will encode the production corpus.
Related: How to install Sentence Transformers with pip
- Create the large-corpus semantic search script.
- large_corpus_search.py
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

topics = [
    (
        "password-reset",
        "Reset a forgotten password from account settings, open the email link, "
        "and choose a new password.",
    ),
    (
        "invoice-export",
        "Export paid invoices from the billing dashboard as a CSV file for accounting.",
    ),
    (
        "api-token-rotation",
        "Rotate API tokens before sharing a new integration with a teammate.",
    ),
    (
        "notification-email",
        "Change the notification email address and confirm the new address for alerts.",
    ),
    (
        "workspace-theme",
        "Change the dashboard color theme for a workspace user profile.",
    ),
    (
        "vector-search",
        "Store dense embeddings in a vector index for semantic search retrieval.",
    ),
]

corpus = []
for shard_id in range(1, 401):
    for topic, text in topics:
        corpus.append(
            {
                "id": f"{topic}-{shard_id:03d}",
                "topic": topic,
                "text": f"{text} Region {shard_id:03d}.",
            }
        )

documents = [item["text"] for item in corpus]
query = "How does a user reset a forgotten password with an email link?"

document_embeddings = model.encode_document(
    documents,
    batch_size=128,
    normalize_embeddings=True,
    convert_to_tensor=True,
    show_progress_bar=False,
)
query_embedding = model.encode_query(
    query,
    normalize_embeddings=True,
    convert_to_tensor=True,
    show_progress_bar=False,
)

query_chunk_size = 1
corpus_chunk_size = 256
top_k = 3

hits = util.semantic_search(
    query_embedding,
    document_embeddings,
    query_chunk_size=query_chunk_size,
    corpus_chunk_size=corpus_chunk_size,
    top_k=top_k,
    score_function=util.dot_score,
)[0]

print(f"corpus documents: {len(corpus)}")
print(f"embedding dimension: {document_embeddings.shape[1]}")
print(f"query chunk size: {query_chunk_size}")
print(f"corpus chunk size: {corpus_chunk_size}")
print(f"top k: {top_k}")
print(f"query: {query}")
print("top matches:")
for rank, hit in enumerate(hits, start=1):
    record = corpus[hit["corpus_id"]]
    print(
        f"{rank}. {record['id']} topic={record['topic']} "
        f"score={hit['score']:.4f}"
    )

if corpus[hits[0]["corpus_id"]]["topic"] != "password-reset":
    raise SystemExit("unexpected top topic")

print("verification: PASS password reset documents ranked first")
encode_document() and encode_query() keep the code ready for embedding models that define separate document and query prompts. Normalized embeddings plus dot_score keep the scores on a cosine-similarity scale.
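The link between normalization and cosine similarity can be checked with a few lines of plain Python. This is a small sketch with made-up three-dimensional vectors rather than real embeddings; the helper names are hypothetical and do not come from the library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Made-up vectors with lengths 5 and 3.
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

raw_dot = dot(a, b)  # grows with vector length
unit_dot = dot(normalize(a), normalize(b))  # length no longer matters

print(f"raw dot product: {raw_dot:.4f}")
print(f"cosine similarity: {cosine(a, b):.4f}")
print(f"dot of normalized vectors: {unit_dot:.4f}")
```

The last two values agree, which is why a dot-product score over unit-length embeddings can be read as a cosine similarity.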
Related: How to encode queries and documents with Sentence Transformers
- Run the script and confirm the chunked search settings.
$ python large_corpus_search.py
corpus documents: 2400
embedding dimension: 384
query chunk size: 1
corpus chunk size: 256
top k: 3
query: How does a user reset a forgotten password with an email link?
top matches:
1. password-reset-253 topic=password-reset score=0.6977
2. password-reset-106 topic=password-reset score=0.6953
3. password-reset-112 topic=password-reset score=0.6937
verification: PASS password reset documents ranked first
The reported corpus count and corpus chunk size confirm that the script searched the full 2,400-document corpus with the chunk settings it was given, rather than a small toy list. If the expected topic does not rank first, inspect the source text, chunk boundaries, model choice, and normalization setting before increasing top_k.
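The chunk settings in the output translate into simple arithmetic. The sketch below is a rough estimate only: the chunk_plan helper is hypothetical, and the 4 bytes per score assumes float32 scores, which may not match every device or dtype.

```python
import math


def chunk_plan(corpus_size, corpus_chunk_size, query_chunk_size=1, bytes_per_score=4):
    """Estimate chunk count and score-block size for one pass over the corpus.

    bytes_per_score=4 is an assumption (float32 scores).
    """
    chunks = math.ceil(corpus_size / corpus_chunk_size)
    last_chunk = corpus_size - (chunks - 1) * corpus_chunk_size
    return {
        "chunks": chunks,
        "last_chunk_documents": last_chunk,
        # Largest temporary score block held at once with chunking.
        "chunk_block_bytes": query_chunk_size * corpus_chunk_size * bytes_per_score,
        # Score block size if the whole corpus were scored in one step.
        "full_block_bytes": query_chunk_size * corpus_size * bytes_per_score,
    }


plan = chunk_plan(corpus_size=2400, corpus_chunk_size=256)
for name, value in plan.items():
    print(f"{name}: {value}")
```

With the sample settings, the corpus is scanned in 10 chunks, the last of which holds 96 documents. The savings are small for a single query, but they scale with query_chunk_size and corpus size.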
- Remove the temporary script after the search behavior is verified.
$ rm large_corpus_search.py
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.