How to use llama.cpp models with Codex

Running Codex against llama.cpp suits workstations where the model is already a GGUF file served by llama-server rather than by Ollama or LM Studio. Codex reaches that server through a custom provider, as long as llama-server exposes the OpenAI-compatible Responses endpoint under a local /v1 base URL.

llama-server publishes its one loaded model through the OpenAI-compatible model-list endpoint. Setting a short model alias when the server starts keeps that entry's id stable, so the Codex profile can request a readable model name instead of the GGUF file path.

Store the provider in a user-level Codex profile file because project-local config ignores provider and auth keys. Use a current llama.cpp build and a chat template that can handle OpenAI-style tool requests. Keep the listener on the loopback interface unless another host deliberately needs access, and add real API-key handling before exposing the port beyond the local machine.

Steps to use llama.cpp models with Codex:

  1. Start llama-server with a local bind address, a stable model alias, and Jinja chat-template handling.
    $ llama-server -m ~/Models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --alias llama3.2-3b-instruct-q4_k_m --host 127.0.0.1 --port 8080 --jinja
    build: 8680
    main: HTTP server listening on http://127.0.0.1:8080

    --alias sets the model identifier returned by the OpenAI-compatible API.

  2. Check the model list exposed by the local API.
    $ curl http://127.0.0.1:8080/v1/models
    {"object":"list","data":[{"id":"llama3.2-3b-instruct-q4_k_m","object":"model","owned_by":"llamacpp"}]}

    The id value must match the model name saved in the Codex profile.

  3. Post a minimal Responses request to the same model alias.
    $ curl http://127.0.0.1:8080/v1/responses \
    -H "Content-Type: application/json" \
    -d '{"model":"llama3.2-3b-instruct-q4_k_m","input":"Reply with OK."}'
    {"id":"resp_123","object":"response","status":"completed","model":"llama3.2-3b-instruct-q4_k_m","output":[{"type":"message","role":"assistant","content":[{"type":"output_text","text":"OK"}]}]}

    This confirms that the endpoint selected by wire_api = "responses" answers correctly before Codex sends a prompt.

  4. Create a Codex profile file for the llama.cpp provider.
    ~/.codex/llamacpp.config.toml
    model_provider = "llamacpp"
    model = "llama3.2-3b-instruct-q4_k_m"
     
    [model_providers.llamacpp]
    name = "llama.cpp"
    base_url = "http://127.0.0.1:8080/v1"
    wire_api = "responses"

    Current Codex profiles are separate files selected with --profile. Do not place these keys under [profiles.llamacpp] in ~/.codex/config.toml.

  5. Run Codex with the llama.cpp profile and the repository directory that should own the task.
    $ codex exec --profile llamacpp -C ~/repo "Reply with exactly: OK"
    OpenAI Codex v0.139.0
    --------
    model: llama3.2-3b-instruct-q4_k_m
    provider: llamacpp
    --------
    codex
    OK

    -C keeps the run anchored to the target repository so the trusted-directory check applies before the local model receives the prompt.