mlx-serve/Deep dive · Speculative decoding

Same output.
Just faster.

Three speculation engines draft tokens ahead and let the model verify them exactly — byte-identical greedy output, mathematically exact sampling. Up to 2.65× on the workloads local models actually run.

Download for Mac The three engines
On by default — zero setup Exact — verified token by token Self-gating — parity on novel content

Where the speed lands

Apple M-series, 4-bit MLX weights, temp 0. Reproduce with tests/bench.sh — the harness ships in the repo.

2.65×
echo workloads · Gemma 4 E4B + PLD
1.61×
agentic code edits · Gemma 4 E4B
1.8×
agent edit loops · Qwen 3.6 native MTP
1.0×
creative writing — gated, no regression
Benchmark chart showing mlx-serve PLD and drafter speedups vs LM Studio across Gemma 4 models
PLD takes the top bar on echo-heavy workloads; the drafter wins Gemma 4 code completion.

Draft cheap. Verify exact. Keep what lands.

PLD · default on

Prompt Lookup Decoding

Drafts by matching n-grams in the prompt and generated text — when the model echoes content it has seen (file edits, RAG, quoting), the guess is usually right. Model-agnostic, works on every architecture, nothing to download.

Drafter · Gemma 4

Cross-attention assistant

Google's tiny 4-layer drafters propose token blocks by cross-attending into the target model's own KV cache — no duplicated context, block sizes tuned per target from E2B to 31B. Wins code completion.

MTP · Qwen 3.6

The model drafts itself

Checkpoints shipping a trained mtp/ sidecar speculate with the model's own prediction head — auto-loaded, self-tuning depth, up to 1.8× on agent loops and 1.43× on code. Zero setup.

Exactness

Speed you don't have to distrust

Speculation never changes what the model would have said. Every drafted token is verified against the target model's own distribution: rejected drafts are resampled correctly, so greedy output is byte-identical and sampled output is exact in distribution (the Leviathan probability-ratio method).

This isn't a claim, it's a test suite: byte-equivalence checks for PLD, the drafter, and MTP run against every release — including with tools active and across streaming and non-streaming paths.

  • temp = 0 — byte-identical output, pinned by regression tests.
  • temp > 0 — mathematically exact output distribution.
  • Tools on — agent loops are speculation's best workload, not an excluded case.
Adaptive gates

It knows when to quit

Speculation costs a verify pass, so it must pay for itself. Before decoding starts, a prompt-time repetition score checks whether the request even looks draftable — novel creative prompts skip speculation entirely. Mid-request, a runtime acceptance gate watches how many drafts actually land and steps aside when the content turns novel, recovering the full pipelined decode rate — then PLD re-engages the moment output turns repetitive again.

Net effect: echo-heavy work gets the multiplier, everything else runs at parity. You never pay for speculation that won't pay back.

  • Prompt-time gate — n-gram repetition score decides before the first token.
  • Runtime gate — acceptance below break-even disables mid-request.
  • MTP depth controller — draft depth adapts to the acceptance rate per request.
  • Per-request telemetry — every speculative request logs its acceptance stats.
Coverage

Every surface, or it doesn't count

Speculation runs on all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, and legacy completions — streaming and non-streaming, tools or not. That includes the FIM/autocomplete path your editor uses: repetitive code completions decode at ~1.9×.

It's the reason Claude Code against a local model feels snappy exactly where agent traffic is heaviest: edits that echo file content back.

defaults & knobs
# PLD is on by default; opt out per launch
mlx-serve --no-pld# pair a Gemma 4 drafter
mlx-serve --drafter gemma-4-E4B-it-assistant-bf16 …

# per-request overrides on any endpoint
{"enable_pld": true, "enable_drafter": false, "enable_mtp": false}

Speculation, answered

Does it change the model's output?

No. Drafts are verified by the target model itself — greedy decoding is byte-identical over the verification window, and sampling is mathematically exact in distribution. Byte-equivalence suites for all three engines run on every release.

Which engine should I use?

Usually: don't think about it. PLD is on by default everywhere; Qwen 3.6 MTP checkpoints activate their own head automatically; and the Gemma 4 drafter is a one-toggle pairing in the app's Settings when you want the extra code-completion win.

Will it slow down creative writing or Q&A?

No — the prompt-time gate skips speculation on novel content, and the runtime gate backs off mid-request if drafts stop landing. Measured creative workloads run at parity (≈1.0×) with speculation compiled in.

What exactly is the MTP sidecar?

A small trained head (about 15 tensors) that predicts upcoming tokens alongside the trunk — shipped in mtp/weights.safetensors on checkpoints like Qwen3.6-27B-4bit-MTP-MLX-Serve. Because the model drafts itself, acceptance stays high (~73% on fully novel content) without any prompt-repetition requirement.

More deep dives

Free speed, verified exact.

It's already on. Download, load a model, and watch the tok/s counter on an edit-heavy agent turn.