Every local agent stack dies the same way: a 4B model writes a whole file into one JSON string, mangles an escape, and the entire tool call is silently dropped. mlx-serve repairs it at the API layer — invisible to the model, invisible to your client.
Ask a small model to write an HTML page through a writeFile tool and watch what it emits: literal newlines where \n should be, unescaped quotes from <meta charset="UTF-8">, invalid backslashes from Windows paths and regexes — or the call simply cut off mid-content by the token limit.
A strict JSON parser rejects the whole blob. Most stacks then drop the call: the file never lands, the content leaks into chat as raw text, and the turn is wasted. The model was right — the plumbing failed it.
// real failure shapes, captured live: {"path": "index.html", "content": "<!DOCTYPE html> <meta charset="UTF-8"> ← raw newline + inner quotes <script>if (m.match(\d+)) … ← invalid \d escape <tool_call>writeFile{…19,000 chars… ← token cap: no closing tag
Repair only ever runs after a strict parse fails, and its output must re-parse strictly or it's discarded — a big model's valid JSON passes through byte-identical.
Well-formed calls take the fast path untouched. The repair machinery is invisible for models that don't need it.
A tolerant re-serializer fixes what models actually break: raw control bytes → \n/\t, inner quotes closed only at structural delimiters, invalid backslashes doubled.
A call cut off mid-content recovers its tool name and the client is steered to retry in chunks — a half-written file is never silently committed.
Every recovery must survive a second strict parse. A repair that can't produce valid JSON is thrown away rather than guessed at.
Different families emit tool calls in different formats — and break them in different ways. All of them normalize to clean OpenAI-style tool_calls or Anthropic tool_use blocks.
| Model dialect | What breaks in the wild | Handled |
|---|---|---|
Hermes XML (<tool_call>) | escaping, truncated opener, missing close tags | ✓ |
| Gemma 4 custom args | dropped string delimiter on large content, double-brace args | ✓ |
| Raw / fenced JSON (Gemma 3, Qwen MoE) | markdown fences, dropped closing braces | ✓ |
| Parallel calls as a JSON array | only the first call executed, rest dropped | ✓ |
| Bare argument object, no tool name | call unusable without schema matching | ✓ matched to the unique fitting tool |
| DeepSeek V4 Flash XML variants | name-in-the-tag forms, JSON-free argument bodies | ✓ |
Because parsing is a single server-side chokepoint, one fix covers all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, legacy completions — streaming and non-streaming, every client.
Every repair shipped because a real model broke a real call — and each one landed with the captured bytes as a regression test. A hermetic corpus of recorded model outputs runs in CI on every commit, under two universal invariants: no format tags may leak into visible content, and arguments must always parse as valid JSON. A live seven-family matrix (Qwen, Gemma 4, Gemma 3, Qwen MoE, GGUF, DeepSeek V4 Flash) re-verifies over HTTP before releases.
The payoff: 1–4B models reliably write whole files, and the "write me a big file" requests that used to silently fail now succeed — with the model none the wiser.

No — repair only runs after strict parsing fails, so valid JSON is never modified. And every recovery is re-validated by a second strict parse; if the repair can't produce valid JSON, it's discarded rather than guessed at.
No. It's always on, server-side, on every API surface. Claude Code, pi, OpenCode, your own SDK code — anything receiving tool_calls or tool_use blocks from mlx-serve gets the repaired result automatically.
Prompt steering doesn't survive contact with a 19,000-character HTML file — small models will emit literal newlines in big content no matter what the system prompt says. Fixing it below the model is the only approach that works across every model and every client, and it costs nothing when the JSON is already valid.
The parser recovers the tool name from the unclosed opener, and the client is told the output was cut off so it can chunk the write and append — deliberately without committing the partial content. A half-written file is worse than a clean retry.
Run an agent on a 4B model and watch it finish jobs that used to die on a stray quote.