A native Zig inference server with a macOS menu-bar app. MLX + GGUF, faster than LM Studio on the same file. OpenAI- and Anthropic-compatible. No Python. No Electron.
Identical 4-bit MLX weights, same machine, same prompts. mlx-serve wins every cell — and speculative decoding pushes the lead further on the workloads where it counts.
Decode = free-form generation · Echo = high-repetition (where PLD shines) · Code = code completion (where the drafter shines). Tokens/sec, higher is better. Apple M4 Max (128 GB) · identical 4-bit MLX weights · ctx 4096 · temp 0 · LM Studio (MLX runtime) as baseline.
The 284-billion-parameter flagship — running on your own machine, no cloud, no API key. If you have a 96 GB+ Apple Silicon Mac, it's one click away in the Model Browser.
Generate multiple tokens per forward pass, verified exactly — so output is identical, just faster. Works on every API surface, streaming or not, tools included: agent loops that echo file contents into edits decode at ~2×. Smart gates keep it on where it pays and step aside where it doesn't.
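To make "verified exactly" concrete, here is a minimal sketch of greedy draft verification, not mlx-serve's actual Zig code. It assumes the target model has already scored the whole draft in one forward pass, so `target_argmax[i]` is the token the target would pick at draft position `i`, plus one extra entry after the last draft token. The function name and list layout are illustrative assumptions.

```python
from typing import List, Sequence


def verify_greedy(target_argmax: Sequence[int], draft: Sequence[int]) -> List[int]:
    """Return the tokens to emit after checking a draft against the target.

    target_argmax must have len(draft) + 1 entries: the target's greedy
    choice at every draft position, plus one position past the draft.
    """
    accepted: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            break  # first disagreement: discard the rest of the draft
        accepted.append(tok)
    # Always emit one token chosen by the target itself: the correction at
    # the first mismatch, or a bonus token when the whole draft was accepted.
    accepted.append(target_argmax[len(accepted)])
    return accepted


print(verify_greedy([8, 9, 4, 2], [8, 9, 1]))  # draft diverges at the third token
print(verify_greedy([8, 9, 1, 2], [8, 9, 1]))  # whole draft accepted, plus bonus
```

Because every emitted token is one the target would have chosen anyway, the output matches plain greedy decoding; the draft only changes how many tokens each forward pass yields.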
Model-agnostic n-gram drafting from the prompt + generated text. Works on every architecture — Gemma, Qwen, Llama, Mistral, Nemotron-H, LFM2.5 — with nothing extra to download.
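The idea behind n-gram drafting (prompt lookup decoding) can be sketched in a few lines: find the most recent earlier occurrence of the trailing n-gram and propose whatever followed it. This is an illustrative sketch only. The function name, `max_ngram`, and `num_draft` are assumptions rather than mlx-serve parameters.

```python
from typing import List, Sequence


def pld_draft(tokens: Sequence[int], max_ngram: int = 3, num_draft: int = 5) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context."""
    n_tokens = len(tokens)
    # Try the longest n-gram first; longer matches are more reliable.
    for n in range(min(max_ngram, n_tokens - 1), 0, -1):
        suffix = list(tokens[n_tokens - n:])
        # Scan right to left so the most recent earlier occurrence wins.
        for start in range(n_tokens - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                # Copy the tokens that followed the earlier occurrence.
                return list(tokens[start + n:start + n + num_draft])
    return []  # no match: fall back to ordinary one-token decoding


context = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
print(pld_draft(context))  # [8, 9, 1, 2, 5]
```

The draft costs no extra model weights, which is why the approach works on any architecture: the target model only has to verify the proposed tokens.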
A tiny cross-attention drafter reuses the target model's own K/V cache to propose blocks of tokens. Tuned block sizes per target (E2B → 31B).
A prompt-time repetition score disables drafting on novel content; a runtime acceptance gate backs off mid-decode when drafts stop landing. You never pay for speculation that won't pay back.
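The two gates might look roughly like the sketch below. The source does not give the score formula or thresholds, so everything here is an assumption: the score is the fraction of repeated n-grams in the prompt, and the runtime gate is an exponential moving average of the per-step acceptance rate.

```python
from typing import Sequence


def repetition_score(tokens: Sequence[int], n: int = 3) -> float:
    """Fraction of n-grams in the prompt that already appeared earlier in it."""
    seen = set()
    repeats = 0
    total = 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        total += 1
        if gram in seen:
            repeats += 1
        seen.add(gram)
    return repeats / total if total else 0.0


class AcceptanceGate:
    """Runtime gate: stop drafting when recent drafts rarely get accepted."""

    def __init__(self, alpha: float = 0.3, threshold: float = 0.2) -> None:
        self.alpha = alpha          # smoothing factor (hypothetical value)
        self.threshold = threshold  # minimum acceptable rate (hypothetical value)
        self.rate = 1.0             # start optimistic

    def update(self, accepted: int, proposed: int) -> None:
        if proposed:
            sample = accepted / proposed
            self.rate = (1 - self.alpha) * self.rate + self.alpha * sample

    def should_draft(self) -> bool:
        return self.rate >= self.threshold


print(repetition_score([1, 2, 3, 1, 2, 3]))  # 0.25
```

A server would check `repetition_score` once per prompt, then call `update` after every verification step and skip drafting whenever `should_draft()` returns false.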
Qwen 3.6 native MTP. Models with a trained MTP sidecar (like ddalcu/Qwen3.6-27B-4bit-MTP-MLX-Serve) auto-load it and speculate from the model's own head — up to 1.8× on agent-style edit loops (29 → 51.6 tok/s on Qwen3.6-27B 4-bit, M4 Max), 1.43× on code. The controller watches its own acceptance rate per request and adapts draft depth on the fly. Zero setup — drop in the model and it's on.
From server to UI, built from scratch in Zig and Swift — with the production features you actually run into.
Written in Zig with direct MLX-C bindings. Eager warmup makes the first request 3.5× faster; a rewritten BPE tokenizer chews a 30 KB agent system prompt in 8 ms (was 3.9 s); the prefix cache keeps up to 32 conversation roots warm, so agent turns round-trip in ~0.1 s.
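A prefix cache with a bounded number of warm roots can be pictured as an LRU map from token prefixes to KV state. The sketch below is an assumption about the general shape, not the server's data structure: `PrefixCache`, `kv_handle`, and the linear scan are illustrative, and only the limit of 32 roots comes from the text above.

```python
from collections import OrderedDict
from typing import Any, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    def __init__(self, max_roots: int = 32) -> None:
        self.max_roots = max_roots
        self.roots: "OrderedDict[Tuple[int, ...], Any]" = OrderedDict()

    def lookup(self, tokens: Sequence[int]) -> Tuple[int, Optional[Any]]:
        """Return (matched_len, kv_handle) for the root sharing the longest prefix."""
        best_len, best_key = 0, None
        for key in self.roots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_len, best_key = n, key
        if best_key is None:
            return 0, None
        self.roots.move_to_end(best_key)  # mark as most recently used
        return best_len, self.roots[best_key]

    def store(self, tokens: Sequence[int], kv_handle: Any) -> None:
        key = tuple(tokens)
        self.roots[key] = kv_handle
        self.roots.move_to_end(key)
        while len(self.roots) > self.max_roots:
            self.roots.popitem(last=False)  # evict the least recently used root


cache = PrefixCache()
cache.store([1, 2, 3, 4], "kv-A")
print(cache.lookup([1, 2, 3, 9]))  # (3, 'kv-A')
```

On a hit, only the tokens past `matched_len` need prefilling, which is where the savings on long, repeated system prompts come from. A real implementation would also trim the cached KV state back to the matched length before reuse.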
Drop-in chat completions, streaming, tool calling, embeddings, the Responses API and WebSockets — plus a native Anthropic Messages endpoint, so you can point Claude Code at a local model.
--max-concurrent batches multiple decode requests through one forward pass — about 1.6× throughput at 4-way parallel on dense models. Concurrent streams stay byte-identical to solo runs, and a 24-hour soak holds memory drift under 5%.
--kv-quant shrinks KV memory ~4× at 4-bit and ~2× at 8-bit — fit 16K contexts that didn't fit dense, or double your parallel-request budget. TurboQuant variants handle heavy-tailed activations.
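The ~4× and ~2× figures follow from simple arithmetic on bits per cached element. The sketch below uses hypothetical model dimensions (32 layers, 8 KV heads, head size 128) and ignores the per-group scales that quantized formats store alongside the values, which is why real savings come out slightly under the ideal ratio.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int, bits: int) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    elements = 2 * layers * kv_heads * head_dim * seq_len
    return elements * bits // 8


GIB = 2 ** 30
# Hypothetical dimensions for illustration; not taken from any specific model.
dims = dict(layers=32, kv_heads=8, head_dim=128, seq_len=16384)

for bits in (16, 8, 4):
    size = kv_cache_bytes(bits=bits, **dims)
    print(f"{bits:>2}-bit KV cache at 16K context: {size / GIB:.2f} GiB")
```

With these numbers the cache drops from 2.00 GiB at 16-bit to 1.00 GiB at 8-bit and 0.50 GiB at 4-bit, so the same memory budget holds roughly four times the context or four times as many parallel requests.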
Ten built-in tools — shell, file read/write/edit, search, browse, web search, memory — with a per-tool approval dialog. Connect MCP servers from a curated marketplace, or extend with markdown skills.
The MLX Core menu-bar app downloads models from HuggingFace with resumable transfers and hot-switches in place. --model-dir serves a whole folder. NVFP4, MXFP4, and MXFP8 quantized checkpoints load and serve out of the box.
DiffusionGemma runs natively — Google's block-diffusion model writes 256-token canvases in parallel instead of one token at a time. Up to 25 tokens per forward pass, ~30% faster than the mlx-vlm reference. Works on every API surface including streaming.
Attach a folder of mixed files (notes, PDFs, JSON exports) and ask questions in plain language. GPU-batched embedding (~5× faster indexing) via the server's /v1/embeddings API. Nothing leaves your Mac, and nothing is written to disk.
# Build from source (or grab a release)
git clone https://github.com/ddalcu/mlx-serve
cd mlx-serve
zig build -Doptimize=ReleaseFast

# Serve a model (speculative decoding via PLD is on by default)
./zig-out/bin/mlx-serve \
  --model ~/models/gemma-4-e4b-it-4bit \
  --serve --port 8080
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Point Claude Code at your Mac
export ANTHROPIC_BASE_URL=http://localhost:8080
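The same request can be made from Python with only the standard library. This is a sketch that assumes the server from the quick start is listening on port 8080 and returns the usual OpenAI chat-completions response shape. The helper names are illustrative, and streaming is left off to keep the response handling short.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # matches --port 8080 in the quick start


def build_chat_request(prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST to the chat-completions endpoint."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Building the request needs no server; call chat("Hello!") once one is running.
print(build_chat_request("Hello!").full_url)
```

Existing OpenAI client libraries should work the same way once their base URL points at the local server.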
Quantized MLX-format models plus DeepSeek V4 Flash via the embedded ds4 engine. Download directly from HuggingFace in the app.
DeepSeek V4 Flash · 284B · via antirez/ds4 engine
DiffusionGemma · Google · 26B-A4B block diffusion · ~30% faster than mlx-vlm
Gemma 4 · Google · E2B / E4B / 31B / 26B-A4B MoE · vision
Qwen 3.6 · Alibaba · 27B · 35B-A3B MoE · GatedDeltaNet
Llama 3.x · Meta · 8B · 70B
Mistral · Mistral AI · 7B · 8x7B
Nemotron-H · LFM2.5 · Qwen 3-Next · Gemma 3
The things people actually ask in HN comments, Discord, and AI search.
Yes — every cell, every model we've benchmarked. On identical 4-bit MLX weights mlx-serve wins by +39% geomean across 18 workloads (Gemma 4 E2B/E4B/31B/26B-A4B-MoE and Qwen 3.6 27B/35B-A3B-MoE). On the same .gguf file as LM Studio (gemma-4-E4B-it-Q4_K_M.gguf), mlx-serve's embedded llama.cpp wrapper still wins +12-15% on decode and +5% on prefill. Speculative decoding pushes the lead further on echo-heavy and code-completion workloads — up to 2.65× on Gemma 4 E4B echo.
For most use cases, yes. mlx-serve runs the same MLX and GGUF models, exposes an OpenAI-compatible API on the same kind of port, and ships a native menu-bar app instead of an Electron one. It also adds things LM Studio doesn't have: a real Anthropic Messages API (works with Claude Code), the OpenAI Responses API + WebSockets, MCP tool calling, agent mode with 10 built-in tools, KV-cache quantization, continuous batching, and the antirez/ds4 engine for DeepSeek V4 Flash.
On Apple Silicon, yes. Ollama is cross-platform and uses llama.cpp; mlx-serve runs llama.cpp and native MLX with the Mac-specific optimizations Ollama doesn't ship: Metal kernels through mlx-c, JIT-compiled activations, shared-prefix KV cache, and the Gemma 4 cross-attention drafter. The OpenAI-compatible wire format is identical, so you can drop in http://localhost:11234 wherever you had http://localhost:11434.
Yes. mlx-serve embeds llama.cpp's inference library (libllama) inside the same signed, notarized binary. Point --model at any .gguf file and the server auto-detects the format and routes to the right engine — no pip, no venv, no llama-server to install separately. DeepSeek V4 Flash GGUFs go through the dedicated antirez/ds4 engine instead, also embedded.
Yes — natively. mlx-serve implements Anthropic's /v1/messages endpoint including streaming, tool calling, and extended thinking. Point Claude Code at it with ANTHROPIC_BASE_URL=http://localhost:11234. The MLX Core app ships a one-click Launch Claude Code button that wires up the env vars for you.
All of them work, as does any client that speaks the OpenAI chat-completions or Anthropic Messages wire protocol. mlx-serve also implements the newer OpenAI Responses API (/v1/responses) for clients that want stateful chains via previous_response_id, plus a WebSocket transport on the same endpoint.
Yes, on 96 GB+ Apple Silicon Macs. Open the MLX Core Model Browser, pick DeepSeek-V4-Flash, hit Download — the server routes the GGUF through the embedded ds4 engine (native Metal kernels, byte-validated against the reference forward). Agent mode and MCP tools work on DSV4 too.
Native MLX dispatch for Gemma 3/4, Qwen 3 / 3.5 / 3.6 / 3-Next, Llama 3.x, Mistral, Nemotron-H, LFM2.5, and DeepSeek V4 Flash. Anything else as GGUF via embedded llama.cpp — Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, Yi, and thousands more from HuggingFace.
Yes, on both API surfaces. The server detects tool-call patterns across architectures (Hermes XML, Gemma 4 <|tool_call>, raw JSON, ChatML), repairs common Qwen 3.5/3.6 escape quirks, and emits OpenAI-style tool_calls deltas in the SSE stream. The MLX Core app ships 10 built-in tools (shell, file I/O, search, browse, web search, memory) and connects to MCP servers from a curated marketplace.
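As an illustration of the detection step, the sketch below handles one of the formats named above: Hermes-style `<tool_call>` blocks wrapping a JSON object, converted into OpenAI-style `tool_calls` entries. It is a simplified assumption about the approach, not the server's parser. The `call_{i}` id scheme is invented, and the other formats and the Qwen escape repairs are not covered.

```python
import json
import re
from typing import Any, Dict, List

# Hermes-style models wrap each call as <tool_call>{...json...}</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract tool calls from model output as OpenAI-style tool_calls entries."""
    calls: List[Dict[str, Any]] = []
    for i, match in enumerate(TOOL_CALL_RE.finditer(text)):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # malformed JSON: skip here (a real server might try to repair it)
        if "name" not in payload:
            continue
        calls.append({
            "id": f"call_{i}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI clients expect arguments as a JSON-encoded string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    return calls


output = 'Listing files.\n<tool_call>\n{"name": "shell", "arguments": {"cmd": "ls"}}\n</tool_call>'
print(parse_hermes_tool_calls(output))
```

Normalizing every model's native format into one structure is what lets the same client code drive different architectures.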
Zig with direct mlx-c FFI — no Python runtime, no Electron, no IPC bridge. The release binary is ~4.5 MB. Eager warmup at boot page-faults weights and pre-compiles decode kernels (first request 3.5× faster). Multi-turn agent loops reuse KV across turns and skip re-prefilling system prompts via a shared-prefix cache that survives interleaved subagent traffic; a Claude Code-sized prompt tokenizes in 8 ms, so a warm agent turn round-trips in ~0.1 s end to end.
For greedy decoding (temp=0), mlx-serve is byte-identical to the reference for the first ~30-80 generated tokens, with long-tail divergence inherent to INT4 float-reduction order. For temp > 0, the Leviathan probability-ratio sampler keeps speculative decoding mathematically exact in distribution. Equivalence is pinned by automated tests on every release.
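The probability-ratio rule from Leviathan et al. can be sketched for a single draft token as follows. This is a textbook rendering of the published acceptance rule over toy dictionaries of token probabilities, not mlx-serve's sampler; the function and argument names are assumptions.

```python
import random
from typing import Dict, Hashable, Tuple


def speculative_accept(
    draft_token: Hashable,
    p_target: Dict[Hashable, float],
    q_draft: Dict[Hashable, float],
    rng=random,
) -> Tuple[Hashable, bool]:
    """Accept or replace one draft token so the result is distributed as p_target."""
    p = p_target.get(draft_token, 0.0)
    q = q_draft[draft_token]  # positive, because the drafter sampled this token
    # Accept with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    r = rng.random() * sum(residual.values())
    acc = 0.0
    for token, weight in residual.items():
        acc += weight
        if r < acc:
            return token, False
    return token, False  # guard against floating-point rounding


p = {"a": 0.7, "b": 0.3}
q = {"a": 0.4, "b": 0.6}
print(speculative_accept("b", p, q, random.Random(0)))
```

Tokens the drafter over-proposes (q greater than p) are sometimes rejected, and the residual fills in exactly the probability mass the drafter under-proposed, so the emitted token follows the target distribution regardless of how good the drafter is.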
Nowhere. Everything runs locally on your Mac — no analytics, no telemetry, no cloud calls. The HTTP server binds to 127.0.0.1 by default. Open source under MIT.
The easiest way is the MLX Core app from GitHub Releases (signed and notarized DMG). Or via Homebrew: brew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve && brew install --cask mlx-core. CLI server alone: brew install mlx-serve.
Have another question? Open an issue · ★ Star the repo if mlx-serve saved you from spinning up another Electron app.