Three speculation engines draft tokens ahead and let the model verify them exactly — byte-identical greedy output, mathematically exact sampling. Up to 2.65× on the workloads local models actually run.
Apple M-series, 4-bit MLX weights, temp 0. Reproduce with tests/bench.sh — the harness ships in the repo.

Drafts by matching n-grams in the prompt and generated text — when the model echoes content it has seen (file edits, RAG, quoting), the guess is usually right. Model-agnostic, works on every architecture, nothing to download.
Google's tiny 4-layer drafters propose token blocks by cross-attending into the target model's own KV cache — no duplicated context, block sizes tuned per target from E2B to 31B. Wins code completion.
Checkpoints shipping a trained mtp/ sidecar speculate with the model's own prediction head — auto-loaded, self-tuning depth, up to 1.8× on agent loops and 1.43× on code. Zero setup.
Speculation never changes what the model would have said. Every drafted token is verified against the target model's own distribution: rejected drafts are resampled correctly, so greedy output is byte-identical and sampled output is exact in distribution (the Leviathan probability-ratio method).
This isn't a claim, it's a test suite: byte-equivalence checks for PLD, the drafter, and MTP run against every release — including with tools active and across streaming and non-streaming paths.
Speculation costs a verify pass, so it must pay for itself. Before decoding starts, a prompt-time repetition score checks whether the request even looks draftable — novel creative prompts skip speculation entirely. Mid-request, a runtime acceptance gate watches how many drafts actually land and steps aside when the content turns novel, recovering the full pipelined decode rate — then PLD re-engages the moment output turns repetitive again.
Net effect: echo-heavy work gets the multiplier, everything else runs at parity. You never pay for speculation that won't pay back.
Speculation runs on all four HTTP surfaces — chat completions, Anthropic messages, OpenAI Responses, and legacy completions — streaming and non-streaming, tools or not. That includes the FIM/autocomplete path your editor uses: repetitive code completions decode at ~1.9×.
It's the reason Claude Code against a local model feels snappy exactly where agent traffic is heaviest: edits that echo file content back.
# PLD is on by default; opt out per launch mlx-serve --no-pld … # pair a Gemma 4 drafter mlx-serve --drafter gemma-4-E4B-it-assistant-bf16 … # per-request overrides on any endpoint {"enable_pld": true, "enable_drafter": false, "enable_mtp": false}
No. Drafts are verified by the target model itself — greedy decoding is byte-identical over the verification window, and sampling is mathematically exact in distribution. Byte-equivalence suites for all three engines run on every release.
Usually: don't think about it. PLD is on by default everywhere; Qwen 3.6 MTP checkpoints activate their own head automatically; and the Gemma 4 drafter is a one-toggle pairing in the app's Settings when you want the extra code-completion win.
No — the prompt-time gate skips speculation on novel content, and the runtime gate backs off mid-request if drafts stop landing. Measured creative workloads run at parity (≈1.0×) with speculation compiled in.
A small trained head (about 15 tensors) that predicts upcoming tokens alongside the trunk — shipped in mtp/weights.safetensors on checkpoints like Qwen3.6-27B-4bit-MTP-MLX-Serve. Because the model drafts itself, acceptance stays high (~73% on fully novel content) without any prompt-repetition requirement.
It's already on. Download, load a model, and watch the tok/s counter on an edit-heavy agent turn.