mlx-serve/Deep dive · Tool calling

Tool calls that survive small models.

Every local agent stack dies the same way: a 4B model writes a whole file into one JSON string, mangles an escape, and the entire tool call is silently dropped. mlx-serve repairs it at the API layer — invisible to the model, invisible to your client.

Always on, nothing to configure
Valid JSON is never touched
All four API surfaces, streaming included

The failure class

The model did the work. The parser threw it away.

Ask a small model to write an HTML page through a writeFile tool and watch what it emits: literal newlines where \n should be, unescaped quotes from <meta charset="UTF-8">, invalid backslashes from Windows paths and regexes — or the call simply cut off mid-content by the token limit.

A strict JSON parser rejects the whole blob. Most stacks then drop the call: the file never lands, the content leaks into chat as raw text, and the turn is wasted. The model was right — the plumbing failed it.

what a 4B model actually emits
// real failure shapes, captured live:
{"path": "index.html", "content": "<!DOCTYPE html>
<meta charset="UTF-8">          ← raw newline + inner quotes
<script>if (m.match(\d+)) …    ← invalid \d escape

<tool_call>writeFile{…19,000 chars…   ← token cap: no closing tag
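
To see why these shapes are fatal, it helps to feed them to a strict parser. The sketch below uses Python's standard json module as a stand-in for any strict JSON parser; the three payloads are hypothetical reconstructions modeled on the failure shapes above, not captured bytes.

```python
import json

# Hypothetical payloads modeled on the failure shapes above.
raw_newline = '{"path": "index.html", "content": "<!DOCTYPE html>\n<html>"}'
inner_quotes = '{"path": "index.html", "content": "<meta charset="UTF-8">"}'
bad_escape = '{"path": "app.js", "content": "m.match(/\\d+/)"}'  # JSON text holds \d


def strict_ok(blob: str) -> bool:
    """Return True only if a strict JSON parser accepts the whole blob."""
    try:
        json.loads(blob)
        return True
    except json.JSONDecodeError:
        return False


for label, blob in [
    ("raw newline", raw_newline),
    ("inner quotes", inner_quotes),
    ("invalid escape", bad_escape),
]:
    print(label, "->", "accepted" if strict_ok(blob) else "rejected")
```

All three are rejected outright: a strict parser has no notion of a partially valid document, so one bad byte costs the whole call.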

Strict first. Tolerant second. Validated always.

Repair only ever runs after a strict parse fails, and its output must re-parse strictly or it's discarded — a big model's valid JSON passes through byte-identical.

Step 1

Strict parse

Well-formed calls take the fast path untouched. The repair machinery is invisible for models that don't need it.

Step 2

Position-aware re-escape

A tolerant re-serializer fixes what models actually break: raw control bytes → \n/\t, inner quotes escaped unless they sit directly before a structural delimiter (only those close the string), invalid backslashes doubled.

Step 3

Truncation recovery

A call cut off mid-content recovers its tool name and the client is steered to retry in chunks — a half-written file is never silently committed.

Step 4

Strict re-validation

Every recovery must survive a second strict parse. A repair that can't produce valid JSON is thrown away rather than guessed at.
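
The strict-first, repair-second, re-validate flow described in the steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not mlx-serve's actual code: the function names (re_escape, parse_arguments) and the exact lookahead heuristic for inner quotes are hypothetical. Truncation recovery (step 3) is left out here.

```python
import json
from typing import Any, Optional

VALID_ESCAPES = set('"\\/bfnrtu')
CONTROL_NAMES = {"\n": "\\n", "\t": "\\t", "\r": "\\r"}
STRUCTURAL = set(",}]:")


def _closes_string(text: str, i: int) -> bool:
    """A quote closes the string only if a structural delimiter (or EOF) follows."""
    j = i + 1
    while j < len(text) and text[j] in " \t\r\n":
        j += 1
    return j == len(text) or text[j] in STRUCTURAL


def re_escape(text: str) -> str:
    """Tolerant pass: re-escape string contents based on position (assumed heuristic)."""
    out = []
    in_string = False
    i = 0
    while i < len(text):
        ch = text[i]
        if not in_string:
            in_string = ch == '"'
            out.append(ch)
        elif ch == "\\":
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt and nxt in VALID_ESCAPES:
                out.append(ch + nxt)  # keep a valid escape pair as-is
                i += 1
            else:
                out.append("\\\\")  # double an invalid backslash
        elif ch == '"':
            if _closes_string(text, i):
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')  # inner quote: escape it
        elif ord(ch) < 0x20:
            out.append(CONTROL_NAMES.get(ch, f"\\u{ord(ch):04x}"))
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def parse_arguments(blob: str) -> Optional[Any]:
    try:
        return json.loads(blob)  # step 1: strict fast path, input untouched
    except json.JSONDecodeError:
        pass
    try:
        return json.loads(re_escape(blob))  # steps 2 and 4: repair, then strict re-parse
    except json.JSONDecodeError:
        return None  # repair could not produce valid JSON: discard, never guess
```

The ordering is the design choice that matters: the tolerant pass never sees input the strict parser accepted, and nothing the tolerant pass produces is trusted until the strict parser accepts that too. A heuristic that guesses wrong about a quote ends in a discarded repair rather than a corrupted call.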

Every dialect a local model speaks

Different families emit tool calls in different formats — and break them in different ways. All of them normalize to clean OpenAI-style tool_calls or Anthropic tool_use blocks.

Model dialect | What breaks in the wild | Handled
Hermes XML (<tool_call>) | escaping, truncated opener, missing close tags | yes
Gemma 4 custom args | dropped string delimiter on large content, double-brace args | yes
Raw / fenced JSON (Gemma 3, Qwen MoE) | markdown fences, dropped closing braces | yes
Parallel calls as a JSON array | only the first call executed, rest dropped | yes
Bare argument object, no tool name | call unusable without schema matching | yes, matched to the unique fitting tool
DeepSeek V4 Flash XML variants | name-in-the-tag forms, JSON-free argument bodies | yes

Because parsing is a single server-side chokepoint, one fix covers all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, legacy completions — streaming and non-streaming, every client.
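
As a rough illustration of what normalizing looks like for three of the rows above (fenced JSON, parallel calls as an array, and a bare argument object), here is a sketch. The helper names, the call id scheme, and the schema-matching rule (required keys present, no unknown keys) are assumptions made for the example; only the output shape follows the OpenAI tool_calls convention.

```python
import json
import re
from typing import Dict, List, Optional

TICKS = "`" * 3  # markdown fence marker, built indirectly to keep this listing clean
FENCE = re.compile(
    r"^\s*" + TICKS + r"(?:json)?\s*\n(.*?)\n?\s*" + TICKS + r"\s*$", re.DOTALL
)


def strip_fence(text: str) -> str:
    """Drop a surrounding markdown fence if the model wrapped its JSON in one."""
    m = FENCE.match(text)
    return m.group(1) if m else text.strip()


def match_tool(args: dict, tools: Dict[str, dict]) -> Optional[str]:
    """Return the single tool whose schema fits the arguments, else None."""
    fits = []
    for name, schema in tools.items():
        props = set(schema.get("properties", {}))
        required = set(schema.get("required", []))
        if required <= set(args) <= props:
            fits.append(name)
    return fits[0] if len(fits) == 1 else None


def normalize(text: str, tools: Dict[str, dict]) -> List[dict]:
    payload = json.loads(strip_fence(text))
    items = payload if isinstance(payload, list) else [payload]  # keep every parallel call
    calls = []
    for n, item in enumerate(items):
        if "name" in item and "arguments" in item:
            name, args = item["name"], item["arguments"]
        else:
            name, args = match_tool(item, tools), item  # bare argument object
        if name is None:
            continue  # ambiguous or unknown: no unique fitting tool
        calls.append({
            "id": f"call_{n}",
            "type": "function",
            "function": {"name": name, "arguments": json.dumps(args)},
        })
    return calls
```

A bare object is only adopted when exactly one tool fits; when two tools could accept the same arguments, the sketch drops the call rather than picking one.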

Receipts

Pinned by real failures, not synthetic tests

Every repair shipped because a real model broke a real call, and each one landed with the captured bytes as a regression test. A hermetic corpus of recorded model outputs runs in CI on every commit, under two universal invariants: no format tags may leak into visible content, and arguments must always parse as valid JSON. A live seven-family matrix (including Qwen, Gemma 4, Gemma 3, Qwen MoE, GGUF, and DeepSeek V4 Flash) re-verifies the same behavior over HTTP before releases.
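
The two invariants are simple enough to state as code. The checker below is a hypothetical sketch: it assumes an OpenAI-style assistant message dict and a small, illustrative tag list, neither of which is taken from the mlx-serve test suite.

```python
import json
from typing import List

# Assumed tag set for illustration; a real corpus would cover every dialect's tags.
FORMAT_TAGS = ("<tool_call>", "</tool_call>")


def check_invariants(message: dict) -> List[str]:
    """Return a list of invariant violations for one assistant message."""
    problems = []
    content = message.get("content") or ""
    for tag in FORMAT_TAGS:
        if tag in content:
            problems.append(f"format tag leaked into content: {tag}")
    for call in message.get("tool_calls") or []:
        try:
            json.loads(call["function"]["arguments"])
        except (json.JSONDecodeError, KeyError, TypeError):
            problems.append("tool call arguments are not valid JSON")
    return problems
```

Because both checks are independent of the model and the dialect, the same function can run against every recorded output in a corpus.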

The payoff: 1–4B models reliably write whole files, and the "write me a big file" requests that used to silently fail now succeed — with the model none the wiser.

[Screenshot: MLX Core agent chat, a small model writing a complete file through folded tool-call cards]
A small model writing a complete web page through tool calls — every call parsed, nothing leaked into chat.

Tool-call repair, answered

Could the repair corrupt a valid tool call?

No — repair only runs after strict parsing fails, so valid JSON is never modified. And every recovery is re-validated by a second strict parse; if the repair can't produce valid JSON, it's discarded rather than guessed at.

Do I need to enable anything, or change my client?

No. It's always on, server-side, on every API surface. Claude Code, pi, OpenCode, your own SDK code — anything receiving tool_calls or tool_use blocks from mlx-serve gets the repaired result automatically.

Why not just prompt the model to escape properly?

Prompt steering doesn't survive contact with a 19,000-character HTML file — small models will emit literal newlines in big content no matter what the system prompt says. Fixing it below the model is the only approach that works across every model and every client, and it costs nothing when the JSON is already valid.

What happens when a call is truncated by the token limit?

The parser recovers the tool name from the unclosed opener, and the client is told the output was cut off so it can chunk the write and append — deliberately without committing the partial content. A half-written file is worse than a clean retry.
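
One way to picture that recovery: find the last unclosed opener, pull the tool name out of whatever follows, and return a signal instead of arguments. The sketch below assumes Hermes-style <tool_call> tags and two name placements (directly after the opener, or in a "name" field); the returned dict and its field names are hypothetical, not mlx-serve's wire format.

```python
import re
from typing import Optional

OPENER = "<tool_call>"
CLOSER = "</tool_call>"
NAME_PREFIX = re.compile(r"\s*([A-Za-z_][\w.-]*)\s*\{")  # e.g. writeFile{...
NAME_FIELD = re.compile(r'"name"\s*:\s*"([^"\\]+)"')     # e.g. {"name": "writeFile", ...


def recover_truncated(output: str) -> Optional[dict]:
    """Recover the tool name from a call that was cut off before its closing tag."""
    start = output.rfind(OPENER)
    if start == -1 or CLOSER in output[start:]:
        return None  # no call at all, or the call closed normally
    body = output[start + len(OPENER):]
    m = NAME_PREFIX.match(body) or NAME_FIELD.search(body)
    if m is None:
        return None
    return {
        "tool": m.group(1),
        "truncated": True,
        "arguments": None,  # partial content is deliberately not committed
        "hint": "Output was cut off; retry the write in smaller chunks and append.",
    }
```

Returning no arguments at all is the point: the client learns which tool was intended and that the output hit the limit, but never receives a half-written file it might commit by accident.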


Small models. Big files. No drops.

Run an agent on a 4B model and watch it finish jobs that used to die on a stray quote.