How to use Sentence Transformers embeddings in LlamaIndex

LlamaIndex retrieval pipelines need an embedding model that turns source text and user queries into vectors before a vector index can compare them. Using a local Sentence Transformers model through the Hugging Face embedding integration keeps that vector step inside the Python process instead of sending text to a hosted embedding API.
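As a rough illustration of what "compare" means here, the sketch below ranks two documents against a query by cosine similarity. It assumes cosine similarity as the comparison metric and uses hand-written three-dimensional toy vectors rather than real model output; the helper names cosine_similarity and rank_by_similarity are hypothetical and are not part of LlamaIndex.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vector, document_vectors):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(query_vector, vector))
        for label, vector in document_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors written by hand for illustration; a real embedding model
# would produce much longer vectors for each text.
toy_documents = {
    "password resets": [0.9, 0.1, 0.0],
    "vpn access": [0.1, 0.9, 0.2],
}
toy_query = [0.8, 0.2, 0.1]

ranking = rank_by_similarity(toy_query, toy_documents)
print(ranking[0][0])  # prints "password resets"
```

A retriever with similarity_top_k=1 behaves like taking only the first entry of such a ranking, with the vectors coming from the embedding model instead of being written by hand.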

The HuggingFaceEmbedding class loads a Hugging Face embedding model through sentence-transformers and can be assigned to Settings.embed_model or passed directly to a VectorStoreIndex. Using sentence-transformers/all-MiniLM-L6-v2 on CPU keeps the smoke test small, and the model returns 384-dimensional vectors that are easy to verify from command output.

Model files download into a local model cache the first time the script runs, so the first execution can take longer than later runs. The retrieval check calls as_retriever() directly instead of building an answer-synthesis query engine, which shows that the local embeddings are driving vector matching without requiring an LLM API key.

Steps to use Sentence Transformers embeddings in LlamaIndex:

  1. Install the Hugging Face embedding integration in the active Python environment.
    $ python3 -m pip install --upgrade llama-index-embeddings-huggingface
    Successfully installed llama-index-embeddings-huggingface

    When the system Python environment is shared, create and activate a project virtual environment before installing LlamaIndex packages.
    Related: venv-create
    Related: pip-install

  2. Create a LlamaIndex script that uses HuggingFaceEmbedding for indexing and retrieval.
    $ cat > llamaindex-hf-embed.py <<'PY'
    from llama_index.core import Document, Settings, VectorStoreIndex
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
     
    embed_model = HuggingFaceEmbedding(
        model_name="sentence-transformers/all-MiniLM-L6-v2",
        device="cpu",
    )
    Settings.embed_model = embed_model
     
    documents = [
        Document(text="Password resets are handled in the identity portal."),
        Document(text="VPN access is approved through the network team."),
    ]
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
    retriever = index.as_retriever(similarity_top_k=1)
    results = retriever.retrieve("Where are password resets handled?")
    embedding = embed_model.get_text_embedding("hello")
     
    print(f"embedding_dimensions={len(embedding)}")
    print(f"top_source={results[0].node.get_content()}")
    PY

    Settings.embed_model sets the default embedding model for LlamaIndex components created later in the same process. Passing embed_model to VectorStoreIndex.from_documents() as well is redundant once the default is set, but it makes explicit which model builds the index in this smoke test.
    Related: embedding-model-set
    Related: retriever-create

  3. Run the script and confirm the retrieved source matches the password reset query.
    $ python3 llamaindex-hf-embed.py
    embedding_dimensions=384
    top_source=Password resets are handled in the identity portal.

    The first line confirms the local model returned the expected vector size. The second line confirms LlamaIndex used those embeddings to rank the relevant document first.

  4. Remove the sample script if it was only used for the smoke test.
    $ rm llamaindex-hf-embed.py
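
The manual check in step 3 can also be automated. The following is a minimal sketch, assuming the script prints exactly the two key=value lines shown in step 3; the helper names parse_key_value_lines and check_smoke_output are hypothetical and are not part of LlamaIndex.

```python
def parse_key_value_lines(output):
    """Parse lines of the form key=value into a dict, skipping other lines."""
    values = {}
    for line in output.splitlines():
        key, separator, value = line.partition("=")
        if separator:
            values[key.strip()] = value.strip()
    return values


def check_smoke_output(output, expected_dimensions=384,
                       expected_phrase="Password resets"):
    """Return a list of problems; an empty list means the output looks right."""
    values = parse_key_value_lines(output)
    problems = []
    if values.get("embedding_dimensions") != str(expected_dimensions):
        problems.append("unexpected embedding_dimensions value")
    if expected_phrase not in values.get("top_source", ""):
        problems.append("top_source does not mention the expected phrase")
    return problems


# Sample text copied from the expected output in step 3.
sample_output = (
    "embedding_dimensions=384\n"
    "top_source=Password resets are handled in the identity portal.\n"
)
print(check_smoke_output(sample_output))  # prints []
```

In practice the output string would come from running the script, for example by capturing its standard output, and a non-empty list would indicate that the model or the ranking did not behave as expected.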