Does speculative decoding change the model's output?

No — that's the defining property. Drafted tokens are verified by the target model itself: at temperature 0 output is byte-identical over the verification window, and at temperature > 0 the Leviathan probability-ratio sampler keeps the output distribution mathematically exact. Equivalence is pinned by automated byte-comparison tests on every release.

What is Prompt Lookup Decoding (PLD)?

PLD drafts future tokens by matching n-grams in the prompt and generated text — when the model is echoing content it has already seen (file edits, RAG, quoting), the draft is usually right. It's model-agnostic, needs nothing extra to download, and is on by default in mlx-serve. Agent-style edit loops decode at roughly 2×.

What is native multi-token prediction (MTP)?

Some Qwen 3.6 checkpoints ship a small trained head that predicts the next few tokens alongside the main forward pass. mlx-serve auto-loads it and speculates with the model's own predictions — up to 1.8× on agent-style edit loops and 1.43× on code, with a controller that watches acceptance per request and adapts draft depth. Zero setup.

Does speculation slow down creative writing?

No — two gates prevent that. A prompt-time repetition score disables speculation on novel content before it starts, and a runtime acceptance gate backs off mid-request if drafts stop landing. Creative and one-shot Q&A workloads run at parity with speculation off.

Speculative decoding: same output, just faster

Measured

Where the speed lands

Apple M-series, 4-bit MLX weights, temp 0. Reproduce with tests/bench.sh — the harness ships in the repo.

2.65×

echo workloads · Gemma 4 E4B + PLD

1.61×

agentic code edits · Gemma 4 E4B

1.8×

agent edit loops · Qwen 3.6 native MTP

1.0×

creative writing — gated, no regression

Benchmark chart showing mlx-serve PLD and drafter speedups vs LM Studio across Gemma 4 models — PLD takes the top bar on echo-heavy workloads; the drafter wins Gemma 4 code completion.

Three engines

Draft cheap. Verify exact. Keep what lands.

PLD · default on

Prompt Lookup Decoding

Drafts by matching n-grams in the prompt and generated text — when the model echoes content it has seen (file edits, RAG, quoting), the guess is usually right. Model-agnostic, works on every architecture, nothing to download.

Drafter · Gemma 4

Cross-attention assistant

Google's tiny 4-layer drafters propose token blocks by cross-attending into the target model's own KV cache — no duplicated context, block sizes tuned per target from E2B to 31B. Wins code completion.

MTP · Qwen 3.6

The model drafts itself

Checkpoints shipping a trained mtp/ sidecar speculate with the model's own prediction head — auto-loaded, self-tuning depth, up to 1.8× on agent loops and 1.43× on code. Zero setup.

Exactness

Speed you don't have to distrust

Speculation never changes what the model would have said. Every drafted token is verified against the target model's own distribution: rejected drafts are resampled correctly, so greedy output is byte-identical and sampled output is exact in distribution (the Leviathan probability-ratio method).

This isn't a claim, it's a test suite: byte-equivalence checks for PLD, the drafter, and MTP run against every release — including with tools active and across streaming and non-streaming paths.

temp = 0 — byte-identical output, pinned by regression tests.
temp > 0 — mathematically exact output distribution.
Tools on — agent loops are speculation's best workload, not an excluded case.

Adaptive gates

It knows when to quit

Speculation costs a verify pass, so it must pay for itself. Before decoding starts, a prompt-time repetition score checks whether the request even looks draftable — novel creative prompts skip speculation entirely. Mid-request, a runtime acceptance gate watches how many drafts actually land and steps aside when the content turns novel, recovering the full pipelined decode rate — then PLD re-engages the moment output turns repetitive again.

Net effect: echo-heavy work gets the multiplier, everything else runs at parity. You never pay for speculation that won't pay back.

Prompt-time gate — n-gram repetition score decides before the first token.
Runtime gate — acceptance below break-even disables mid-request.
MTP depth controller — draft depth adapts to the acceptance rate per request.
Per-request telemetry — every speculative request logs its acceptance stats.

Coverage

Every surface, or it doesn't count

Speculation runs on all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, and legacy completions — streaming and non-streaming, tools or not. That includes the FIM/autocomplete path your editor uses: repetitive code completions decode at ~1.9×.

It's the reason Claude Code against a local model feels snappy exactly where agent traffic is heaviest: edits that echo file content back.

defaults & knobs

# PLD is on by default; opt out per launch
mlx-serve --no-pld …

# pair a Gemma 4 drafter
mlx-serve --drafter gemma-4-E4B-it-assistant-bf16 …

# per-request overrides on any endpoint
{"enable_pld": true, "enable_drafter": false, "enable_mtp": false}
          

FAQ

Speculation, answered

Does it change the model's output?

No. Drafts are verified by the target model itself — greedy decoding is byte-identical over the verification window, and sampling is mathematically exact in distribution. Byte-equivalence suites for all three engines run on every release.

Which engine should I use?

Usually: don't think about it. PLD is on by default everywhere; Qwen 3.6 MTP checkpoints activate their own head automatically; and the Gemma 4 drafter is a one-toggle pairing in the app's Settings when you want the extra code-completion win.

Will it slow down creative writing or Q&A?

No — the prompt-time gate skips speculation on novel content, and the runtime gate backs off mid-request if drafts stop landing. Measured creative workloads run at parity (≈1.0×) with speculation compiled in.

What exactly is the MTP sidecar?

A small trained head (about 15 tensors) that predicts upcoming tokens alongside the trunk — shipped in mtp/weights.safetensors on checkpoints like Qwen3.6-27B-4bit-MTP-MLX-Serve. Because the model drafts itself, acceptance stays high (~73% on fully novel content) without any prompt-repetition requirement.

Go deeper

More deep dives

Free speed, verified exact.

It's already on. Download, load a model, and watch the tok/s counter on an edit-heavy agent turn.

Download for Mac View on GitHub

Same output.Just faster.

Where the speed lands

Draft cheap. Verify exact. Keep what lands.

Prompt Lookup Decoding

Cross-attention assistant

The model drafts itself

Speed you don't have to distrust

It knows when to quit

Every surface, or it doesn't count

Speculation, answered

More deep dives

LM Studio alternative →

Claude Code, fully local →

Self-healing tool calls →

Ollama alternative →

Agent Sandbox →

Image generation & editing →

Free speed, verified exact.

Same output.
Just faster.