v26.6.10 DiffusionGemma · Qwen MTP · NVFP4 · Document RAG — now shipping

Run any LLM on your Mac

A native Zig inference server with a macOS menu-bar app. MLX + GGUF, faster than LM Studio on the same file. OpenAI- and Anthropic-compatible. No Python. No Electron.

macOS 26+
M1 – M4
MIT License
MLX Core app — DiffusionGemma block diffusion running live in the chat interface

Faster than LM Studio. Every model.

Identical 4-bit MLX weights, same machine, same prompts. mlx-serve wins every cell — and speculative decoding pushes the lead further on the workloads where it counts.

176 tok/s decode · Gemma 4 E2B 4-bit
3,749 tok/s prefill
284B params, DeepSeek V4 Flash (local)
0 Python dependencies

Decode = free-form generation · Echo = high-repetition (where PLD shines) · Code = code completion (where the drafter shines). Tokens/sec, higher is better. Apple M4 Max (128 GB) · identical 4-bit MLX weights · ctx 4096 · temp 0 · LM Studio (MLX runtime) as baseline.

The marquee capability

Run DeepSeek V4 Flash locally on your Mac

The 284-billion-parameter flagship — running on your own machine, no cloud, no API key. If you have a 96 GB+ Apple Silicon Mac, it's one click away in the Model Browser.

  • Built on Salvatore Sanfilippo's antirez/ds4 engine — native Metal kernels, byte-validated against the reference forward.
  • One-click download of the GGUF, served from the same picker as every MLX model.
  • Agent mode and MCP tool calling work on DSV4 too — the full toolset is inlined into the prompt.
  • A single self-contained binary — kernel sources are embedded and staged at first launch.
284B parameters · running on your desk
96 GB+ unified memory
0 cloud calls
1 binary

Two ways to draft ahead

Generate multiple tokens per forward pass, verified exactly — so output is identical, just faster. Works on every API surface, streaming or not, tools included: agent loops that echo file contents into edits decode at ~2×. Smart gates keep it on where it pays and step aside where it doesn't.

PLD

Prompt Lookup Decoding

Model-agnostic n-gram drafting from the prompt + generated text. Works on every architecture — Gemma, Qwen, Llama, Mistral, Nemotron-H, LFM2.5 — with nothing extra to download.
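To make the idea concrete, here is a minimal sketch of n-gram prompt lookup, assuming integer token IDs. The function name `pld_draft` and its parameters are illustrative, not mlx-serve's internals. It finds the most recent earlier occurrence of the trailing n-gram and proposes the tokens that followed it; the target model would then verify those drafts in a single forward pass, which is not shown here.

```python
from typing import List, Sequence


def pld_draft(tokens: Sequence[int], ngram: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Hypothetical helper for illustration only. `tokens` is prompt + generated text.
    """
    if len(tokens) <= ngram:
        return []
    tail = list(tokens[-ngram:])
    # Scan right to left so the most recent match wins; the starting index
    # excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram - 1, -1, -1):
        if list(tokens[start:start + ngram]) == tail:
            follow = start + ngram
            # Whatever followed the match last time becomes the draft.
            return list(tokens[follow:follow + max_draft])
    return []  # no repetition found: nothing to speculate on


context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = pld_draft(context)  # the trailing [1, 2, 3] also occurs at position 0
```

Because the drafts come from text already in the context, there is no second model to load, which is consistent with the "nothing extra to download" claim above.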

Biggest gains on agent tool loops, echo & RAG

DRAFTER

Gemma 4 assistant drafter

A tiny cross-attention drafter reuses the target model's own K/V cache to propose blocks of tokens. Tuned block sizes per target (E2B → 31B).

up to +30% on Gemma 4 code completion

ADAPTIVE

Gates that know when to quit

A prompt-time repetition score disables drafting on novel content; a runtime acceptance gate backs off mid-decode when drafts stop landing. You never pay for speculation that won't pay back.

exact output, zero quality cost
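The two gates can be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped controller: `repetition_score`, `AcceptanceGate`, the window size, and the threshold are all hypothetical names and values. The first function scores how repetitive a prompt is; the class tracks how many drafted tokens were accepted recently and turns drafting off when the rate falls too low.

```python
from collections import deque
from typing import Sequence


def repetition_score(tokens: Sequence[int], ngram: int = 3) -> float:
    """Fraction of n-grams that duplicate an earlier one (0.0 = fully novel)."""
    grams = [tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)


class AcceptanceGate:
    """Rolling acceptance-rate gate over the last `window` draft rounds."""

    def __init__(self, window: int = 8, min_rate: float = 0.3) -> None:
        self.history = deque(maxlen=window)  # (accepted, proposed) per round
        self.min_rate = min_rate

    def record(self, accepted: int, proposed: int) -> None:
        if proposed > 0:
            self.history.append((accepted, proposed))

    def rate(self) -> float:
        proposed = sum(p for _, p in self.history)
        if proposed == 0:
            return 1.0  # no evidence yet, so stay optimistic
        return sum(a for a, _ in self.history) / proposed

    def should_draft(self) -> bool:
        return self.rate() >= self.min_rate
```

A real gate would likely also probe again after backing off, so that drafting can resume when the content becomes repetitive; that part is left out of the sketch.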

Qwen 3.6 native MTP. Models with a trained MTP sidecar (like ddalcu/Qwen3.6-27B-4bit-MTP-MLX-Serve) auto-load it and speculate from the model's own head — up to 1.8× on agent-style edit loops (29 → 51.6 tok/s on Qwen3.6-27B 4-bit, M4 Max), 1.43× on code. The controller watches its own acceptance rate per request and adapts draft depth on the fly. Zero setup — drop in the model and it's on.

A complete local-AI stack

From server to UI, built from scratch in Zig and Swift — with the production features you actually run into.

Native, no Python

Written in Zig with direct MLX-C bindings. Eager warmup makes the first request 3.5× faster; a rewritten BPE tokenizer chews a 30 KB agent system prompt in 8 ms (was 3.9 s); the prefix cache keeps up to 32 conversation roots warm, so agent turns round-trip in ~0.1 s.

OpenAI + Anthropic APIs

Drop-in chat completions, streaming, tool calling, embeddings, the Responses API and WebSockets — plus a native Anthropic Messages endpoint, so you can point Claude Code at a local model.

Continuous batching

--max-concurrent batches multiple decode requests through one forward pass — about 1.6× throughput at 4-way parallel on dense models. Concurrent streams stay byte-identical to solo runs, and a 24-hour soak holds memory drift under 5%.

KV-cache quantization

--kv-quant shrinks KV memory ~4× at 4-bit and ~2× at 8-bit — fit 16K contexts that didn't fit dense, or double your parallel-request budget. TurboQuant variants handle heavy-tailed activations.
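The ~4× and ~2× figures follow from simple arithmetic if the unquantized cache is assumed to hold 16-bit values. The sketch below works that out for a hypothetical model shape (32 layers, 8 KV heads, head dimension 128); those numbers are not taken from any model listed on this page. It ignores the per-group scales that quantized formats store, which is one reason real savings are approximate.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bits: int) -> int:
    """Bytes needed for the K and V tensors at a given precision."""
    # Two tensors (K and V) per layer, one head_dim vector per KV head per position.
    elements = 2 * layers * kv_heads * head_dim * context
    return elements * bits // 8


# Hypothetical model shape at a 16K context.
shape = dict(layers=32, kv_heads=8, head_dim=128, context=16_384)
fp16 = kv_cache_bytes(bits=16, **shape)  # assumed 16-bit baseline
q8 = kv_cache_bytes(bits=8, **shape)
q4 = kv_cache_bytes(bits=4, **shape)

print(f"16-bit: {fp16 / 2**30:.2f} GiB")
print(f" 8-bit: {q8 / 2**30:.2f} GiB ({fp16 / q8:.0f}x smaller)")
print(f" 4-bit: {q4 / 2**30:.2f} GiB ({fp16 / q4:.0f}x smaller)")
```

The same arithmetic explains the parallel-request claim: halving the per-request cache leaves room for twice as many concurrent contexts in the same memory budget.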

Built-in agent + MCP

Ten built-in tools — shell, file read/write/edit, search, browse, web search, memory — with a per-tool approval dialog. Connect MCP servers from a curated marketplace, or extend with markdown skills.

One server, every model

The MLX Core menu-bar app downloads models from HuggingFace with resumable transfers and hot-switches in place. --model-dir serves a whole folder. NVFP4, MXFP4, and MXFP8 quantized checkpoints load and serve out of the box.

Block diffusion

DiffusionGemma runs natively — Google's block-diffusion model writes 256-token canvases in parallel instead of one token at a time. Up to 25 tokens per forward pass, ~30% faster than the mlx-vlm reference. Works on every API surface including streaming.

Document folder RAG

Attach a folder of mixed files (notes, PDFs, JSON exports) and ask questions in plain language. GPU-batched embedding (~5× faster indexing) via the server's /v1/embeddings API. Nothing leaves your Mac, and nothing is written to disk.

Up and running in seconds

Terminal
# Build from source (or grab a release)
git clone https://github.com/ddalcu/mlx-serve
cd mlx-serve
zig build -Doptimize=ReleaseFast

# Serve a model — speculative decoding (PLD) is on by default
./zig-out/bin/mlx-serve \
  --model ~/models/gemma-4-e4b-it-4bit \
  --serve --port 8080

Use the API (OpenAI or Anthropic)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Point Claude Code at your Mac
export ANTHROPIC_BASE_URL=http://localhost:8080
MLX Core
Desktop app for macOS — chat, agent & model browser

Run the latest open models

Quantized MLX-format models plus DeepSeek V4 Flash via the embedded ds4 engine. Download directly from HuggingFace in the app.

Gemma 4

Google · E2B / E4B / 31B / 26B-A4B MoE · vision

Qwen 3.6

Alibaba · 27B · 35B-A3B MoE · GatedDeltaNet

Llama 3

Meta · 8B · 70B

Mistral

Mistral AI · 7B · 8x7B

And more

Nemotron-H · LFM2.5 · Qwen 3-Next · Gemma 3

Questions, answered

The things people actually ask in HN comments, Discord, and AI search.

Is mlx-serve faster than LM Studio?

Yes — every cell, every model we've benchmarked. On identical 4-bit MLX weights mlx-serve wins by +39% geomean across 18 workloads (Gemma 4 E2B/E4B/31B/26B-A4B-MoE and Qwen 3.6 27B/35B-A3B-MoE). On the same .gguf file as LM Studio (gemma-4-E4B-it-Q4_K_M.gguf), mlx-serve's embedded llama.cpp wrapper still wins +12-15% on decode and +5% on prefill. Speculative decoding pushes the lead further on echo-heavy and code-completion workloads — up to 2.65× on Gemma 4 E4B echo.

Does mlx-serve replace LM Studio?

For most use cases, yes. mlx-serve runs the same MLX and GGUF models, exposes an OpenAI-compatible API on the same kind of port, and ships a native menu-bar app instead of an Electron one. It also adds things LM Studio doesn't have: a real Anthropic Messages API (works with Claude Code), the OpenAI Responses API + WebSockets, MCP tool calling, agent mode with 10 built-in tools, KV-cache quantization, continuous batching, and the antirez/ds4 engine for DeepSeek V4 Flash.

Does mlx-serve replace Ollama on Mac?

On Apple Silicon, yes. Ollama is cross-platform and uses llama.cpp; mlx-serve runs llama.cpp and native MLX with the Mac-specific optimizations Ollama doesn't ship — Metal kernels through mlx-c, JIT-compiled activations, shared-prefix KV cache, and the Gemma 4 cross-attention drafter. The OpenAI-compatible wire is identical, so you can drop in http://localhost:11234 wherever you had http://localhost:11434.

Can I run GGUF models on Mac without Python?

Yes. mlx-serve embeds llama.cpp's inference library (libllama) inside the same signed, notarized binary. Point --model at any .gguf file and the server auto-detects the format and routes to the right engine — no pip, no venv, no llama-server to install separately. DeepSeek V4 Flash GGUFs go through the dedicated antirez/ds4 engine instead, also embedded.

Does mlx-serve work with Claude Code?

Yes — natively. mlx-serve implements Anthropic's /v1/messages endpoint including streaming, tool calling, and extended thinking. Point Claude Code at it with ANTHROPIC_BASE_URL=http://localhost:11234. The MLX Core app ships a one-click Launch Claude Code button that wires up the env vars for you.
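For scripts rather than Claude Code, a request to the Messages endpoint can be built with the standard library alone. This is a sketch under assumptions: the `model` value and the placeholder API key are hypothetical, and whether the local server checks either is not stated here. The body fields (`model`, `max_tokens`, `messages`) and the `anthropic-version` header follow Anthropic's published wire format.

```python
import json
import urllib.request


def build_messages_request(base_url: str, prompt: str,
                           model: str = "local-model",
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST to the Anthropic-style /v1/messages endpoint."""
    body = {
        "model": model,            # hypothetical name; use whatever your server reports
        "max_tokens": max_tokens,  # required by the Messages wire format
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",
            "x-api-key": "not-needed-locally",  # placeholder, assumed to be ignored
        },
        method="POST",
    )


req = build_messages_request("http://localhost:11234", "Hello!")
# To send it once the server is running:
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```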

What about the OpenAI SDK, Continue, Cursor, Open WebUI?

All work — anything that talks the OpenAI chat-completions or Anthropic Messages wire protocol does. mlx-serve also implements the newer OpenAI Responses API (/v1/responses) for clients that want stateful chains via previous_response_id, plus a WebSocket transport on the same endpoint.
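A stateful chain on the Responses API comes down to passing the previous response's `id` back in the next request. The sketch below only builds the JSON bodies; the helper name `responses_payload`, the model name, and the example id are hypothetical, while `input` and `previous_response_id` are the field names the OpenAI Responses API defines.

```python
from typing import Optional


def responses_payload(user_input: str,
                      previous_response_id: Optional[str] = None,
                      model: str = "local-model") -> dict:
    """Body for POST /v1/responses, optionally chained to an earlier response."""
    payload = {"model": model, "input": user_input}
    if previous_response_id is not None:
        # The server looks up the earlier turn, so the client does not resend history.
        payload["previous_response_id"] = previous_response_id
    return payload


first = responses_payload("Summarize this repo.")
# Suppose the server answered with {"id": "resp_123", ...} (made-up id).
follow_up = responses_payload("Now list the open risks.", previous_response_id="resp_123")
```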

Can mlx-serve run DeepSeek V4 Flash locally?

Yes, on 96 GB+ Apple Silicon Macs. Open the MLX Core Model Browser, pick DeepSeek-V4-Flash, hit Download — the server routes the GGUF through the embedded ds4 engine (native Metal kernels, byte-validated against the reference forward). Agent mode and MCP tools work on DSV4 too.

What models are supported?

Native MLX dispatch for Gemma 3/4, Qwen 3 / 3.5 / 3.6 / 3-Next, Llama 3.x, Mistral, Nemotron-H, and LFM2.5, plus DeepSeek V4 Flash through the embedded ds4 engine. Anything else runs as GGUF via embedded llama.cpp: Qwen, Llama, Mistral, Gemma, DeepSeek, Phi, Yi, and thousands more from HuggingFace.

Does it support tools / function calling?

Yes, on both API surfaces. The server detects tool-call patterns across architectures (Hermes XML, Gemma 4 <|tool_call>, raw JSON, ChatML), repairs common Qwen 3.5/3.6 escape quirks, and emits OpenAI-style tool_calls deltas in the SSE stream. The MLX Core app ships 10 built-in tools (shell, file I/O, search, browse, web search, memory) and connects to MCP servers from a curated marketplace.
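On the client side, OpenAI-style streaming delivers each tool call in fragments: the first delta carries the id and function name, and later deltas append pieces of the JSON arguments. Here is a small sketch of reassembling them, assuming the chunks have already been parsed from the SSE stream into dicts. The helper name and the sample tool name and arguments are made up for illustration.

```python
import json
from typing import Dict, Iterable, List


def merge_tool_call_deltas(chunks: Iterable[dict]) -> List[dict]:
    """Combine streamed tool_calls deltas into complete calls, ordered by index."""
    calls: Dict[int, dict] = {}
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            for delta in choice.get("delta", {}).get("tool_calls") or []:
                slot = calls.setdefault(
                    delta["index"], {"id": None, "name": "", "arguments": ""}
                )
                if delta.get("id"):
                    slot["id"] = delta["id"]
                fn = delta.get("function") or {}
                slot["name"] += fn.get("name") or ""
                slot["arguments"] += fn.get("arguments") or ""  # JSON arrives in pieces
    return [calls[i] for i in sorted(calls)]


# Hypothetical stream: one call whose arguments are split across two chunks.
stream = [
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "id": "call_1", "type": "function",
         "function": {"name": "shell", "arguments": "{\"cmd\":"}}]}}]},
    {"choices": [{"delta": {"tool_calls": [
        {"index": 0, "function": {"arguments": " \"ls\"}"}}]}}]},
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
]
merged = merge_tool_call_deltas(stream)
args = json.loads(merged[0]["arguments"])
```

Parse the arguments only after the stream finishes; an individual fragment is usually not valid JSON on its own.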

How does it stay this small / fast?

Zig with direct mlx-c FFI — no Python runtime, no Electron, no IPC bridge. The release binary is ~4.5 MB. Eager warmup at boot page-faults weights and pre-compiles decode kernels (first request 3.5× faster). Multi-turn agent loops reuse KV across turns and skip re-prefilling system prompts via a shared-prefix cache that survives interleaved subagent traffic; a Claude Code-sized prompt tokenizes in 8 ms, so a warm agent turn round-trips in ~0.1 s end to end.

Is the inference exact, or does quantized output drift?

For greedy decoding (temp=0), mlx-serve is byte-identical to the reference for the first ~30-80 generated tokens, with long-tail divergence inherent to INT4 float-reduction order. For temp > 0, the Leviathan probability-ratio sampler keeps speculative decoding mathematically exact in distribution. Equivalence is pinned by automated tests on every release.
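The probability-ratio rule from speculative sampling can be sketched in a few lines. This shows the general technique, not mlx-serve's code; the function name and the list-based distributions are assumptions for illustration. A drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target model's distribution and q is the drafter's. On rejection, a replacement is drawn from the normalized positive part of p − q, which is what keeps the final output distributed exactly as p.

```python
import random
from typing import List, Tuple


def accept_or_resample(draft_token: int, p: List[float], q: List[float],
                       rng: random.Random) -> Tuple[int, bool]:
    """Return (token, accepted) for one drafted position.

    p: target-model probabilities, q: drafter probabilities (both sum to 1).
    """
    q_x = q[draft_token]
    ratio = p[draft_token] / q_x if q_x > 0 else 0.0
    if rng.random() < min(1.0, ratio):
        return draft_token, True
    # Rejected: sample from the residual max(0, p - q). It has positive mass
    # here because a rejection implies p[x] < q[x] for the drafted token.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return token, False


rng = random.Random(0)
same = accept_or_resample(0, [0.5, 0.5], [0.5, 0.5], rng)      # p == q: always accepted
disjoint = accept_or_resample(0, [0.0, 1.0], [1.0, 0.0], rng)  # p(x) == 0: always rejected
```

For a deterministic drafter such as n-gram lookup, q is a point mass on the drafted token, so the rule reduces to accepting with probability p(x) and otherwise sampling from p with that token removed.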

Where does my data go?

Nowhere. Everything runs locally on your Mac — no analytics, no telemetry, no cloud calls. The HTTP server binds to 127.0.0.1 by default. Open source under MIT.

How do I install it?

The easiest way is the MLX Core app from GitHub Releases (signed and notarized DMG). Or via Homebrew: brew tap ddalcu/mlx-serve https://github.com/ddalcu/mlx-serve && brew install --cask mlx-core. CLI server alone: brew install mlx-serve.

Have another question? Open an issue · ★ Star the repo if mlx-serve saved you from spinning up another Electron app.