Why do tool calls fail with small local models?

A model emitting a tool call has to hand-write JSON, and code or HTML content is full of things that break it: literal newlines instead of \n, unescaped inner quotes, invalid backslash escapes, and calls cut off mid-content by the token limit. Strict parsers reject the whole blob, so most stacks silently drop the call and the file content leaks into the chat as text. The smaller the model, the more often it happens.

How does mlx-serve fix malformed tool calls?

Strict parse comes first, so valid JSON is never touched. Only when that fails does a position-aware tolerant re-serializer re-escape control bytes and inner quotes and fix invalid backslashes, and the result must re-parse strictly or it's discarded. For truncated calls, the parser recovers the tool name so the client can retry in chunks instead of losing the turn. It runs server-side on every API surface, so every client benefits with zero configuration.

Which tool-call formats does mlx-serve understand?

Hermes-style XML tool calls, Gemma's custom argument format (including its dropped-delimiter failure mode), raw JSON and markdown-fenced JSON, ChatML, parallel calls emitted as a JSON array, bare argument objects matched against the request's tool schemas, and DeepSeek V4 Flash's name-in-the-tag XML variants — all normalized to standard OpenAI-style tool_calls or Anthropic tool_use blocks.


Tool calls that survive small models.

How is this tested?

A hermetic regression corpus of real captured model outputs runs in CI on every commit, alongside universal invariants — no format tags may leak into visible content, and arguments must always parse as valid JSON — plus a live seven-model-family matrix exercised over HTTP before releases.

Every local agent stack dies the same way: a 4B model writes a whole file into one JSON string, mangles an escape, and the entire tool call is silently dropped. mlx-serve repairs it at the API layer — invisible to the model, invisible to your client.


Always on, nothing to configure · Valid JSON is never touched · All four API surfaces, streaming included

The failure class

The model did the work. The parser threw it away.

Ask a small model to write an HTML page through a writeFile tool and watch what it emits: literal newlines where \n should be, unescaped quotes from <meta charset="UTF-8">, invalid backslashes from Windows paths and regexes — or the call simply cut off mid-content by the token limit.

A strict JSON parser rejects the whole blob. Most stacks then drop the call: the file never lands, the content leaks into chat as raw text, and the turn is wasted. The model was right — the plumbing failed it.
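The rejection is easy to reproduce with any strict parser. The sketch below uses Python's standard `json` module purely as a stand-in (it is not mlx-serve's parser), and the three sample strings are hand-built approximations of the failure shapes described above, not captured model output.

```python
import json
from typing import Optional

# Hand-built approximations of three failure shapes (illustrative only).
raw_newline = '{"path": "index.html", "content": "<!DOCTYPE html>\n<html>"}'
inner_quotes = '{"content": "<meta charset="UTF-8">"}'
bad_escape = r'{"content": "m.match(/\d+/)"}'


def strict_error(text: str) -> Optional[str]:
    """Return the strict parser's complaint, or None if text is valid JSON."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as exc:
        return exc.msg
```

Calling `strict_error` on each sample returns an error message rather than `None`: one bad byte anywhere in the string is enough to lose the whole call.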

what a 4B model actually emits

// real failure shapes, captured live:
{"path": "index.html", "content": "<!DOCTYPE html>
<meta charset="UTF-8">          ← raw newline + inner quotes
<script>if (m.match(\d+)) …    ← invalid \d escape

<tool_call>writeFile{…19,000 chars…   ← token cap: no closing tag
          

The repair pipeline

Strict first. Tolerant second. Validated always.

Repair only ever runs after a strict parse fails, and its output must re-parse strictly or it's discarded — a big model's valid JSON passes through byte-identical.

Step 1

Strict parse

Well-formed calls take the fast path untouched. The repair machinery is invisible for models that don't need it.

Step 2

Position-aware re-escape

A tolerant re-serializer fixes what models actually break: raw control bytes become \n or \t, a quote closes a string only when a structural delimiter follows it (any other inner quote is escaped), and invalid backslashes are doubled.
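One way to picture a position-aware re-escape is a single left-to-right scan that tracks whether it is inside a string. The sketch below is a simplified illustration under assumptions of our own, not mlx-serve's implementation: the function names are hypothetical, and the rule "a quote closes the string only if the next non-space character is `,`, `}`, `]`, `:` or end of input" is one plausible reading of "closed only at structural delimiters".

```python
import json

VALID_ESCAPES = set('"\\/bfnrtu')   # characters allowed after a backslash
STRUCTURAL = set(',}]:')            # assumed set of structural delimiters
CONTROL = {'\n': '\\n', '\t': '\\t', '\r': '\\r'}


def closes_string(text, i):
    """True if the quote at text[i] is followed, after optional whitespace,
    by a structural delimiter or the end of the input."""
    j = i + 1
    while j < len(text) and text[j] in ' \t\r\n':
        j += 1
    return j == len(text) or text[j] in STRUCTURAL


def reescape(text):
    """Re-serialize near-JSON, fixing escapes inside string values only."""
    out, in_string, i = [], False, 0
    while i < len(text):
        ch = text[i]
        if not in_string:
            in_string = ch == '"'
            out.append(ch)
        elif ch == '\\':
            nxt = text[i + 1] if i + 1 < len(text) else ''
            if nxt and nxt in VALID_ESCAPES:
                out.append(ch + nxt)   # keep a well-formed escape pair as-is
                i += 1
            else:
                out.append('\\\\')     # double an invalid backslash
        elif ch == '"':
            if closes_string(text, i):
                in_string = False
                out.append(ch)
            else:
                out.append('\\"')      # inner quote: escape it
        elif ord(ch) < 0x20:
            out.append(CONTROL.get(ch, '\\u%04x' % ord(ch)))
        else:
            out.append(ch)
        i += 1
    return ''.join(out)
```

The heuristic can guess wrong, for example when file content itself contains a quote followed by a comma. That is why the output is only ever a candidate: it still has to pass a strict parse before anything is done with it.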

Step 3

Truncation recovery

A call cut off mid-content recovers its tool name and the client is steered to retry in chunks — a half-written file is never silently committed.
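A rough sketch of name recovery from an unclosed opener follows. Everything here is an assumption made for illustration: the two regular expressions, the `recover_truncated` name, and the shape of the returned dictionary are hypothetical, and the real parser handles more dialects than the Hermes-style tag shown.

```python
import re
from typing import Optional

OPENER = "<tool_call>"
CLOSER = "</tool_call>"
# Hypothetical shapes: the name directly after the opener, or a "name" field
# inside a JSON body.
NAME_AFTER_TAG = re.compile(r"\s*([A-Za-z_][\w.-]*)\s*\{")
NAME_FIELD = re.compile(r'"name"\s*:\s*"([^"]+)"')


def recover_truncated(output: str) -> Optional[dict]:
    """If the last tool call was cut off, return its name and a retry hint."""
    start = output.rfind(OPENER)
    if start == -1 or CLOSER in output[start:]:
        return None  # no call at all, or the last call closed properly
    body = output[start + len(OPENER):]
    match = NAME_AFTER_TAG.match(body) or NAME_FIELD.search(body)
    if match is None:
        return None
    # The partial arguments are deliberately left out of the result.
    return {
        "tool": match.group(1),
        "truncated": True,
        "hint": "output was cut off; retry the write in smaller chunks",
    }
```

Note that the partial content never appears in the return value, which mirrors the design choice above: the client learns which tool was being called and that it should retry, but nothing half-written is passed along.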

Step 4

Strict re-validation

Every recovery must survive a second strict parse. A repair that can't produce valid JSON is thrown away rather than guessed at.
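Taken together, the strict, tolerant, and re-validation steps amount to a small piece of control flow. The sketch below shows that ordering only; `parse_arguments` and `escape_newlines` are hypothetical names, Python's `json` module stands in for the strict parser, and the repair function is a toy that handles raw newlines and nothing else. Truncation recovery is a separate path and is not shown.

```python
import json
from typing import Any, Callable, Optional


def parse_arguments(raw: str, repair: Callable[[str], str]) -> Optional[Any]:
    """Strict first, tolerant second, validated always."""
    try:
        return json.loads(raw)        # fast path: valid JSON is never touched
    except json.JSONDecodeError:
        pass
    candidate = repair(raw)           # tolerant pass, only after a failure
    try:
        return json.loads(candidate)  # the repair must re-parse strictly
    except json.JSONDecodeError:
        return None                   # otherwise it is discarded


def escape_newlines(raw: str) -> str:
    """Toy stand-in for a real repair: escapes raw newlines only."""
    return raw.replace("\n", "\\n")
```

Because the repair is only reached through the `except` branch, a well-formed call cannot be altered by it, and because the return value always comes from a strict parse, a failed repair yields `None` rather than a guess.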

Format coverage

Every dialect a local model speaks

Different families emit tool calls in different formats — and break them in different ways. All of them normalize to clean OpenAI-style tool_calls or Anthropic tool_use blocks.

Model dialect	What breaks in the wild	Handled
Hermes XML (`<tool_call>`)	escaping, truncated opener, missing close tags	✓
Gemma 4 custom args	dropped string delimiter on large content, double-brace args	✓
Raw / fenced JSON (Gemma 3, Qwen MoE)	markdown fences, dropped closing braces	✓
Parallel calls as a JSON array	only the first call executed, rest dropped	✓
Bare argument object, no tool name	call unusable without schema matching	✓ matched to the unique fitting tool
DeepSeek V4 Flash XML variants	name-in-the-tag forms, JSON-free argument bodies	✓
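As an illustration of what normalization means for two of the rows above (markdown-fenced JSON and parallel calls emitted as an array), the sketch below converts such output into OpenAI-style `tool_calls` entries. The input shape (`name` plus `arguments` objects), the `normalize` function, and the id format are assumptions for the example. Only the output layout, with `arguments` carried as a JSON string, follows the OpenAI chat completions format.

```python
import json
import re
import uuid

TICKS = "`" * 3  # built indirectly so this example can sit inside a fence
FENCE = re.compile("^" + TICKS + r"(?:json)?\s*\n(.*?)\n" + TICKS + r"\s*$",
                   re.DOTALL)


def normalize(output: str) -> list:
    """Turn fenced or bare JSON (one call or an array) into tool_calls."""
    text = output.strip()
    fenced = FENCE.match(text)
    if fenced:
        text = fenced.group(1)        # drop the markdown fence
    parsed = json.loads(text)
    calls = parsed if isinstance(parsed, list) else [parsed]
    return [
        {
            "id": "call_" + uuid.uuid4().hex[:24],   # arbitrary id format
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI-style tool_calls carry arguments as a JSON string
                "arguments": json.dumps(call.get("arguments", {})),
            },
        }
        for call in calls
    ]
```

Every element of the array becomes its own entry, so a client that iterates over `tool_calls` executes all of the parallel calls rather than just the first.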

Because parsing is a single server-side chokepoint, one fix covers all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, legacy completions — streaming and non-streaming, every client.
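The "bare argument object" row in the table above depends on the request's tool schemas. A minimal sketch of unique-fit matching follows, assuming tools arrive in the OpenAI request shape (`function.name` plus a JSON Schema under `function.parameters`). The `match_tool` name and the fit rule, where all required keys must be present and no unknown keys are allowed, are hypothetical simplifications.

```python
from typing import Optional


def match_tool(args: dict, tools: list) -> Optional[str]:
    """Return the name of the only tool whose schema fits args, else None."""
    fits = []
    for tool in tools:
        fn = tool["function"]
        params = fn.get("parameters", {})
        allowed = set(params.get("properties", {}))
        required = set(params.get("required", []))
        # Fit: every required key is present and no key is unknown.
        if required <= set(args) <= allowed:
            fits.append(fn["name"])
    # An ambiguous or empty match is refused rather than guessed.
    return fits[0] if len(fits) == 1 else None
```

Returning `None` when two tools fit equally well keeps the matching conservative: a call is only attributed to a tool when exactly one candidate exists.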

Receipts

Pinned by real failures, not synthetic tests

Every repair shipped because a real model broke a real call, and each one landed with the captured bytes as a regression test. A hermetic corpus of recorded model outputs runs in CI on every commit, under two universal invariants: no format tags may leak into visible content, and arguments must always parse as valid JSON. A live seven-family matrix (including Qwen, Gemma 4, Gemma 3, Qwen MoE, GGUF, and DeepSeek V4 Flash) re-verifies over HTTP before releases.
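The two invariants are simple enough to express as a check over any parsed response. The sketch below assumes an OpenAI-style message dictionary and an illustrative, incomplete tag list. The actual corpus, tag set, and test harness are not described here, so `check_invariants` and `FORMAT_TAGS` are hypothetical.

```python
import json

# Illustrative subset; a real list would cover every supported dialect.
FORMAT_TAGS = ("<tool_call>", "</tool_call>")


def check_invariants(message: dict) -> list:
    """Return a list of invariant violations for one assistant message."""
    problems = []
    content = message.get("content") or ""
    for tag in FORMAT_TAGS:
        if tag in content:
            problems.append("format tag leaked into content: " + tag)
    for call in message.get("tool_calls") or []:
        try:
            json.loads(call["function"]["arguments"])
        except json.JSONDecodeError:
            problems.append("arguments are not valid JSON: "
                            + call["function"]["name"])
    return problems
```

A regression test would then replay each recorded model output through the parser and assert that `check_invariants` returns an empty list.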

The payoff: 1–4B models reliably write whole files, and the "write me a big file" requests that used to silently fail now succeed — with the model none the wiser.

Figure: MLX Core agent chat, with a small model writing a complete web page through folded tool-call cards. Every call is parsed and nothing leaks into chat.

FAQ

Tool-call repair, answered

Could the repair corrupt a valid tool call?

No — repair only runs after strict parsing fails, so valid JSON is never modified. And every recovery is re-validated by a second strict parse; if the repair can't produce valid JSON, it's discarded rather than guessed at.

Do I need to enable anything, or change my client?

No. It's always on, server-side, on every API surface. Claude Code, pi, OpenCode, your own SDK code — anything receiving tool_calls or tool_use blocks from mlx-serve gets the repaired result automatically.

Why not just prompt the model to escape properly?

Prompt steering doesn't survive contact with a 19,000-character HTML file — small models will emit literal newlines in big content no matter what the system prompt says. Fixing it below the model is the only approach that works across every model and every client, and it costs nothing when the JSON is already valid.

What happens when a call is truncated by the token limit?

The parser recovers the tool name from the unclosed opener, and the client is told the output was cut off so it can chunk the write and append — deliberately without committing the partial content. A half-written file is worse than a clean retry.

Small models. Big files. No drops.
