LlamaIndex retrieval pipelines need an embedding model that turns source text and user queries into vectors before a vector index can compare them. Using a local Sentence Transformers model through the Hugging Face embedding integration keeps that vector step inside the Python process instead of sending text to a hosted embedding API.
The HuggingFaceEmbedding class loads a Hugging Face embedding model through sentence-transformers and can be assigned to Settings.embed_model or passed directly to a VectorStoreIndex. Using sentence-transformers/all-MiniLM-L6-v2 on CPU keeps the smoke test small, and the model returns 384-dimensional vectors that are easy to verify from command output.
Model files download into the Hugging Face cache the first time the script runs, so the first execution can take longer than later runs. The retrieval check uses as_retriever() directly instead of an answer-synthesis query engine, which proves the local embeddings are driving vector matching without requiring an LLM API key.
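The vector matching described above can be illustrated without downloading a model. The sketch below ranks two documents against a query by cosine similarity, assuming cosine similarity is the comparison metric. The three-dimensional vectors and the helper names cosine_similarity and rank_by_cosine are made up for illustration; they are not output from all-MiniLM-L6-v2 and not part of the LlamaIndex API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query_vec, doc_vecs, top_k=1):
    """Return the names of the top_k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Toy vectors (assumed values, not real model output).
toy_docs = {
    "password_reset": [0.9, 0.1, 0.0],
    "vpn_access": [0.1, 0.9, 0.2],
}
toy_query = [0.8, 0.2, 0.1]

print(rank_by_cosine(toy_query, toy_docs, top_k=1))  # prints ['password_reset']
```

A real embedding model plays the role of the hand-written vectors here: it maps each document and the query into the same vector space so that a similar ranking step can pick the closest match.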
$ python3 -m pip install --upgrade llama-index-embeddings-huggingface
Successfully installed llama-index-embeddings-huggingface
When the system Python environment is shared, create and activate a project virtual environment before installing LlamaIndex packages.
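One way to confirm that the interpreter is running inside a virtual environment before installing is to compare sys.prefix with sys.base_prefix. This is a minimal sketch that assumes the environment was created with the standard venv module (or a tool that follows the same convention); the function name in_virtual_environment is hypothetical.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter appears to run inside a venv.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print(f"virtual_environment={in_virtual_environment()}")
```

If the check prints False on a shared machine, creating and activating an environment first keeps the package out of the system site-packages.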
$ cat > llamaindex-hf-embed.py <<'PY'
from llama_index.core import Document, Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Load the local Sentence Transformers model on CPU.
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    device="cpu",
)
# Make it the default embedding model for this process.
Settings.embed_model = embed_model

documents = [
    Document(text="Password resets are handled in the identity portal."),
    Document(text="VPN access is approved through the network team."),
]

# Build the index and retrieve the single closest document; no LLM is involved.
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
retriever = index.as_retriever(similarity_top_k=1)
results = retriever.retrieve("Where are password resets handled?")

# Embed a short string to check the vector size.
embedding = embed_model.get_text_embedding("hello")
print(f"embedding_dimensions={len(embedding)}")
print(f"top_source={results[0].node.get_content()}")
PY
Settings.embed_model sets the default embedding model for LlamaIndex components created later in the same process. Passing embed_model to VectorStoreIndex.from_documents() as well is redundant with that default, but it makes explicit which model builds the index in this smoke test.
$ python3 llamaindex-hf-embed.py
embedding_dimensions=384
top_source=Password resets are handled in the identity portal.
The first line confirms the local model returned the expected vector size. The second line confirms LlamaIndex used those embeddings to rank the relevant document first.
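The same two checks can be automated instead of read by eye. The sketch below parses the key=value lines and asserts on them; the helper name parse_smoke_output is hypothetical, and the sample text is copied from the output shown above rather than captured from a live run.

```python
def parse_smoke_output(text):
    """Parse 'key=value' lines into a dict, ignoring lines without '='."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            values[key.strip()] = value.strip()
    return values


# Sample text copied from the expected output of the smoke test.
sample_output = """embedding_dimensions=384
top_source=Password resets are handled in the identity portal."""

parsed = parse_smoke_output(sample_output)

# all-MiniLM-L6-v2 is expected to return 384-dimensional vectors.
assert int(parsed["embedding_dimensions"]) == 384
# The password-reset document should rank first for the password query.
assert "Password resets" in parsed["top_source"]
print("smoke test output looks as expected")
```

In practice the text could come from running the script with subprocess.run(..., capture_output=True, text=True) and passing its stdout to the parser.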
$ rm llamaindex-hf-embed.py