Running Codex against llama.cpp is for workstations where the model is already a GGUF file served by llama-server, not by Ollama or LM Studio. Codex can use that server through a custom provider when llama-server exposes the OpenAI-compatible Responses endpoint at a local /v1 base URL.
llama-server publishes one loaded model through its OpenAI-compatible model-list endpoint. Setting a short model alias when the server starts keeps that entry's id stable, so the Codex profile can request a readable model name instead of the GGUF file path.
Store the provider in a user-level Codex profile file because project-local config ignores provider and auth keys. Use a current llama.cpp build and a chat template that can handle OpenAI-style tool requests, keep the listener on the loopback interface unless another host deliberately needs access, and add real API-key handling before exposing the port beyond the local machine.
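The loopback advice can be checked mechanically before a base URL goes into a profile. The sketch below is a minimal illustration, not part of Codex or llama.cpp: `is_loopback_base_url` is a hypothetical helper name, and it deliberately treats any hostname other than `localhost` as non-local rather than resolving it through DNS.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_base_url(base_url: str) -> bool:
    """Return True when the URL's host is 'localhost' or a loopback IP address."""
    host = urlsplit(base_url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; resolving names is out of scope for this sketch.
        return False
```

Under these assumptions, `http://127.0.0.1:8080/v1` passes, while a wildcard bind address such as `0.0.0.0` or a LAN address does not, which is the point at which API-key handling becomes relevant.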
Related: How to use local models with Codex
Related: How to troubleshoot local models in Codex
Steps to use llama.cpp models with Codex:
- Start llama-server with a local bind address, a stable model alias, and Jinja chat-template handling.
$ llama-server -m ~/Models/Llama-3.2-3B-Instruct-Q4_K_M.gguf --alias llama3.2-3b-instruct-q4_k_m --host 127.0.0.1 --port 8080 --jinja
build: 8680
main: HTTP server listening on http://127.0.0.1:8080
--alias sets the model identifier returned by the OpenAI-compatible API.
Related: server-start
Related: server-option-set
- Check the model list exposed by the local API.
$ curl http://127.0.0.1:8080/v1/models
{"object":"list","data":[{"id":"llama3.2-3b-instruct-q4_k_m","object":"model","owned_by":"llamacpp"}]}
The id value must match the model name saved in the Codex profile.
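The id comparison in this step can be scripted against the JSON body returned by the model-list endpoint. This is a minimal sketch that assumes the response has the shape shown in the step; `model_ids` and `alias_is_served` are hypothetical helper names, and the sample string is copied from the output above rather than fetched from a live server.

```python
import json


def model_ids(models_json: str) -> list[str]:
    """Collect every "id" from an OpenAI-style model-list response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


def alias_is_served(models_json: str, alias: str) -> bool:
    """Return True when the alias appears among the listed model ids."""
    return alias in model_ids(models_json)


# Sample body copied from the curl output in this step.
sample = (
    '{"object":"list","data":[{"id":"llama3.2-3b-instruct-q4_k_m",'
    '"object":"model","owned_by":"llamacpp"}]}'
)
print(alias_is_served(sample, "llama3.2-3b-instruct-q4_k_m"))  # prints True
```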
- Post a minimal Responses request to the same model alias.
$ curl http://127.0.0.1:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3.2-3b-instruct-q4_k_m","input":"Reply with OK."}'
{"id":"resp_123","object":"response","status":"completed","model":"llama3.2-3b-instruct-q4_k_m","output":[{"type":"message","role":"assistant","content":[{"type":"output_text","text":"OK"}]}]}
This confirms the endpoint selected by wire_api = "responses" before Codex sends a prompt.
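When the smoke test is scripted instead of read by eye, the assistant text has to be pulled out of the nested output array. The sketch below assumes the response shape shown in this step (message items that contain output_text chunks); `output_text` is a hypothetical helper name, and real responses may carry other item types that this code simply skips.

```python
import json


def output_text(response_json: str) -> str:
    """Join the text of every output_text chunk found in message items."""
    payload = json.loads(response_json)
    parts = []
    for item in payload.get("output", []):
        if item.get("type") != "message":
            continue  # ignore non-message items such as tool calls
        for chunk in item.get("content", []):
            if chunk.get("type") == "output_text":
                parts.append(chunk.get("text", ""))
    return "".join(parts)


# Sample body copied from the curl output in this step.
sample_response = (
    '{"id":"resp_123","object":"response","status":"completed",'
    '"model":"llama3.2-3b-instruct-q4_k_m","output":[{"type":"message",'
    '"role":"assistant","content":[{"type":"output_text","text":"OK"}]}]}'
)
print(output_text(sample_response))  # prints OK
```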
- Create a Codex profile file for the llama.cpp provider.
- ~/.codex/llamacpp.config.toml
model_provider = "llamacpp"
model = "llama3.2-3b-instruct-q4_k_m"

[model_providers.llamacpp]
name = "llama.cpp"
base_url = "http://127.0.0.1:8080/v1"
wire_api = "responses"
Current Codex profiles are separate files selected with --profile. Do not place these keys under [profiles.llamacpp] in ~/.codex/config.toml.
- Run Codex with the llama.cpp profile and the repository directory that should own the task.
$ codex exec --profile llamacpp -C ~/repo "Reply with exactly: OK"
OpenAI Codex v0.139.0
--------
model: llama3.2-3b-instruct-q4_k_m
provider: llamacpp
--------
codex
OK
-C keeps the run anchored to the target repository so the trusted-directory check applies before the local model receives the prompt.
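When the same run is launched from a script, the command can be assembled as an argument list so the prompt needs no shell quoting. This is a rough sketch that uses only the flags shown in the final step; `codex_exec_argv` is a hypothetical helper name. Because `subprocess` does not invoke a shell when given a list, a leading `~` in the repository path is expanded explicitly.

```python
import os
import shlex


def codex_exec_argv(profile: str, repo: str, prompt: str) -> list[str]:
    """Build the argv for `codex exec` with a profile and a working directory."""
    # No shell is involved when argv is a list, so expand "~" here.
    repo_path = os.path.expanduser(repo)
    return ["codex", "exec", "--profile", profile, "-C", repo_path, prompt]


argv = codex_exec_argv("llamacpp", "~/repo", "Reply with exactly: OK")
# shlex.join renders the list as a copy-pasteable shell command for logging.
print(shlex.join(argv))
```

The list can then be handed to `subprocess.run(argv, check=True)` on a machine where codex is installed and llama-server is already listening.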
Mohd Shakir Zakaria is a cloud architect with deep roots in software development and open-source advocacy. Certified in AWS, Red Hat, VMware, ITIL, and Linux, he specializes in designing and managing robust cloud and on-premises infrastructures.