mlx-serve/Deep dive · Claude Code

Claude Code,
fully local.

Point Claude Code at a model running on your own Mac. No API key, no per-token bills, no code leaving your machine — one environment variable, or one click in MLX Core.

Download MLX Core See the setup
$0 per request Works offline once the model is downloaded Your code stays home — 127.0.0.1 by default

One env var. That's the tutorial.

mlx-serve implements Anthropic's /v1/messages endpoint natively — the same wire protocol Claude Code speaks to the cloud. Change where it points, and everything else just works.

Terminal
# 1. Install and serve a model (Ollama-style, resumable download)
brew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve
brew install mlx-serve
mlx-serve run qwen3.6:27b   # or gemma4 — any MLX or GGUF model

# 2. Point Claude Code at your Mac
export ANTHROPIC_BASE_URL=http://localhost:11234
claude

Prefer zero terminal? The MLX Core menu-bar app has a one-click Launch Claude Code button: it picks a working folder, wires ANTHROPIC_BASE_URL, dummy API keys, and the default-model variables, and opens Claude Code already connected.

A real coding agent, no cloud in sight

MLX Core menu-bar app showing the one-click Launch Claude Code button
One click in the menu bar wires up the environment and launches Claude Code.
Claude Code running in a terminal against a local model served by mlx-serve
Claude Code editing files and running tools against a local Qwen 3.6 — no API key set.

The whole wire, not a subset

Plenty of servers claim "Anthropic-compatible" and fall over the first time Claude Code streams thinking and tools in the same turn. mlx-serve implements the protocol Claude Code actually uses.

Streaming

Valid block lifecycle

Proper message_startcontent_block_start/delta/stopmessage_delta sequencing. Every delta references an open block — no "Content block not found" protocol errors mid-stream.

Thinking

Extended thinking, cleanly split

Thinking streams as real thinking_delta blocks with signatures — never leaked into the visible answer, never swallowing it. Reasoning budgets are honored via the thinking field.

Tools

Tool use that doesn't drop calls

tool_use blocks with input_json_delta streaming, and API-layer repair for the JSON small models mangle — Claude Code receives a clean call either way.

Usage

Honest token accounting

Responses report cache_read_input_tokens and reasoning_tokens, so Claude Code sees real prompt-cache savings the same way it does against the cloud API.

Built for agent traffic

Claude Code sends huge system prompts, repeats most of the context every turn, and echoes file content back into edits. That exact shape is what mlx-serve is optimized for.

Warm turns

~0.1 s from send to first token

A Claude Code-sized system prompt (~30 KB) tokenizes in 8 ms — it used to take seconds on every request. A shared-prefix KV cache keeps up to 32 conversation roots warm and survives interleaved subagent traffic, so turn N+1 skips re-prefilling everything it already saw.

8 ms
to tokenize a 30 KB agent system prompt
~0.1 s
warm agent turn, end to end
Long prompts

40K-token prompts don't wedge anything

Big MCP-laden Claude Code prompts can take minutes of prefill on large models. mlx-serve sends keepalive pings every 5 seconds during prefill so the client never times out — and if Claude Code retries or disconnects, the abandoned request is cancelled within seconds instead of stacking ghost prefills behind each other.

  • SSE keepalives flow while the model is still prefilling.
  • Disconnect detection aborts orphaned work at the next chunk.
  • Graceful 400 — an over-long prompt gets a clear error, not an OOM crash.
Speculative decoding

~2× on the edits themselves

File-edit tool calls echo existing code back with small changes — speculative decoding's best case. With tools active (every Claude Code request), PLD and the Gemma 4 drafter decode edit loops at roughly 2× (72 → 150 tok/s measured on Gemma 4 E4B), with byte-identical output. Qwen 3.6 checkpoints with a native MTP sidecar hit up to 1.8× on agent turns automatically.

  • 72 → 150 tok/s on file-edit tool loops (Gemma 4 E4B, 4-bit).
  • Sane sampling defaults — Claude Code omits sampling params, so mlx-serve applies each model's own generation_config.json instead of raw temp-1.0.
  • Exact output — speculation is verified token-by-token; it's speed, not approximation.

What should Claude Code talk to?

16 GB Macs

Gemma 4 E4B

~4.3 GB resident at 4-bit, strong tool calling, and the optional assistant drafter for extra code-completion speed. mlx-serve run gemma4.

32 GB+ Macs

Qwen 3.6 27B + MTP

The sweet spot for agent work. Checkpoints with the trained mtp/ sidecar speculate with the model's own head — up to 1.8× on edit loops, zero setup.

96 GB+ Macs

DeepSeek V4 Flash

The 284B flagship through the embedded antirez/ds4 engine. Agent mode and tools work on it too.

Anything else

Any GGUF on HuggingFace

The embedded llama.cpp engine serves the whole GGUF universe behind the same Anthropic endpoint — pick a file and Claude Code can use it.

Claude Code + local models, answered

Can Claude Code really run against a local model?

Yes. mlx-serve implements Anthropic's /v1/messages endpoint natively — streaming, tool calling, and extended thinking included. Set ANTHROPIC_BASE_URL=http://localhost:11234 and Claude Code talks to your Mac. The MLX Core app's Launch Claude Code button does the wiring for you.

Is it actually free?

There's no API key and no per-token billing — your Mac does the inference. mlx-serve is MIT-licensed open source, and once a model is downloaded everything works fully offline.

Do tool calls and thinking work together, like against the real API?

Yes — that exact shape (streaming + thinking + tools in one turn) is hermetically tested against recorded output from every supported model family. Thinking arrives as proper thinking blocks, tool calls stream in valid order, and JSON a small model mangles is repaired at the API layer before Claude Code sees it.

Is a local model fast enough?

Agent traffic is the best case: warm turns round-trip in ~0.1 s thanks to the prefix cache and an 8 ms tokenizer, keepalives survive multi-minute prefills on 40K-token prompts, and speculative decoding roughly doubles decode on file-edit loops. A 27B-class local model won't match a frontier cloud model's intelligence — but it's yours, it's private, and it's free to run all day.

Does this work with other coding agents too?

Yes — the app has one-click launchers for Claude Code, OpenCode, and pi, and anything that speaks the Anthropic or OpenAI wire (Continue, Cursor, the SDKs) can point at the same server.

More deep dives

Your coding agent. Your hardware.

Download MLX Core, pick a model, click Launch Claude Code. Five minutes from now your prompts never leave your desk again.