Is mlx-serve a drop-in replacement for Ollama?

On Apple Silicon, yes. mlx-serve speaks the Ollama wire protocol natively — /api/chat, /api/generate, /api/tags, /api/show, /api/ps, /api/embed, /api/pull — including NDJSON streaming, tool calls with object arguments, thinking, JSON-schema formats, images, and name:tag model names. Point anything that expects Ollama at http://localhost:11234 and it works unchanged.

Why is mlx-serve faster than Ollama on a Mac?

Ollama only runs GGUF through llama.cpp. mlx-serve additionally runs native MLX models with Apple-specific optimizations — Metal kernels via mlx-c, speculative decoding, a shared-prefix KV cache, and continuous batching. Same weights family, measurably faster engine; and for GGUF it embeds llama.cpp too, so nothing is lost.

Does the Ollama CLI workflow work?

Yes — mlx-serve run gemma4 downloads (resumable, straight from Hugging Face), serves, and drops into a streaming chat REPL; pull and list manage the model store; serve exposes everything for on-demand loading by name. Short names, org/repo HuggingFace ids, and name:tag all resolve.

Which Ollama clients work with mlx-serve?

Raycast, Obsidian plugins, Enchanted, Open WebUI, ollama-python, ollama-js — anything that talks the Ollama API. Swap http://localhost:11434 for http://localhost:11234 and keep your workflow.

Keep your Ollama workflow. Upgrade the engine.

Migration

Change one URL. Keep everything else.

Every Ollama-connected tool — Raycast, Obsidian, Enchanted, Open WebUI, ollama-python/js — works against mlx-serve unchanged. The CLI habits carry over too.

Your clients

# wherever a tool asked for your Ollama URL…
-  http://localhost:11434
+  http://localhost:11234

# …that's it. /api/chat, /api/generate, /api/tags, /api/show,
# /api/ps, /api/embed, /api/pull all answer natively.

          Your terminal
          
mlx-serve run gemma4         # download + serve + chat REPL, one command
mlx-serve pull qwen3.6:27b   # resumable, straight from Hugging Face
mlx-serve list               # what's on disk
mlx-serve serve              # serve everything, models load on demand

mlx-serve run gemma4 — terminal chat REPL with live tokens per second — `mlx-serve run gemma4` — pull, serve, and chat with live tok/s, Ollama-style.

Fidelity

The whole wire, translated natively

Not a proxy in front of a different API — the server re-frames every Ollama request and response itself, so the details Ollama clients rely on survive.

Streaming

NDJSON, like Ollama does it

Line-delimited streaming with done_reason, eval_count, and prompt_eval_count — the fields dashboards and plugins actually read.

Tools

Tool calls, object arguments

Ollama-style tool calling with arguments as JSON objects, both directions — including models whose raw output needed API-layer repair first.

Thinking

`think` → thinking

Reasoning models stream their thought channel into Ollama's thinking field, cleanly separated from the answer.

Formats

JSON mode & schemas

format: "json" and full JSON-schema constraints translate to grammar-constrained decoding — fence-happy models can't break your parser.

Names

`qwen3.6:latest` just resolves

Tagged names, basenames, and unique substrings match against what's on disk — the naming style Ollama clients send by default.

Pull

`/api/pull`, from HuggingFace

Clients can trigger model downloads through the API; mlx-serve pulls resumably into the same store the GUI app uses.

Why switch

Ollama can't run MLX. Your Mac wants MLX.

Ollama serves GGUF through llama.cpp — solid, but it leaves Apple-specific performance on the table. mlx-serve runs native MLX models through Metal kernels via mlx-c, adds speculative decoding, a shared-prefix KV cache, KV-cache quantization, and continuous batching — and still embeds llama.cpp for everything that only exists as GGUF.

Same drop-in API. One server for both worlds.

Native MLX — the format Apple's own framework is fastest at.
Every GGUF too — embedded llama.cpp, auto-routed by file format.
Three more APIs — OpenAI, Anthropic (Claude Code), and OpenAI Responses on the same port.
A real Mac app — menu-bar chat, agent mode, model browser, media generation.

FAQ

Ollama questions, answered

Is it really drop-in? What exactly is implemented?

/api/chat, /api/generate, /api/tags, /api/show, /api/ps, /api/embed (and the legacy /api/embeddings), /api/pull, /api/version — with NDJSON streaming, tool calls, thinking, images, format schemas, and name:tag resolution. Registry-style operations that don't map to a HuggingFace world (create/copy/push/blobs) return explicit 501s rather than pretending.

Will my Modelfiles work?

Modelfiles are Ollama's packaging format and don't apply — mlx-serve pulls models straight from HuggingFace (mlx-serve pull org/repo or short names like gemma4). System prompts and parameters come from your client's request or the app's settings instead.

Is it faster than Ollama?

For MLX models — a format Ollama doesn't support at all — yes, decisively; that's the headline of the benchmark page. For GGUF, both run llama.cpp; mlx-serve embeds a current build and adds a session LRU so multi-document agent loops stay warm.

Can I run both side by side while I evaluate?

Yes — they bind different ports (11434 vs 11234), so nothing conflicts. Point one client at mlx-serve, keep the rest on Ollama, and migrate at your own pace.

Go deeper

More deep dives

Your Ollama clients won't notice.
Your Mac will.

Install with Homebrew or grab the app — then swap one URL and keep working.

Download for Mac View on GitHub

Keep the Ollama workflow.Upgrade the engine.

Change one URL. Keep everything else.

The whole wire, translated natively

NDJSON, like Ollama does it

Tool calls, object arguments

think → thinking

JSON mode & schemas

qwen3.6:latest just resolves

/api/pull, from HuggingFace

Ollama can't run MLX. Your Mac wants MLX.

Ollama questions, answered

More deep dives

LM Studio alternative →

Claude Code, fully local →

Speculative decoding →

Self-healing tool calls →

Always-on assistant →

Image generation & editing →

Your Ollama clients won't notice.Your Mac will.

Keep the Ollama workflow.
Upgrade the engine.

`think` → thinking

`qwen3.6:latest` just resolves

`/api/pull`, from HuggingFace

Your Ollama clients won't notice.
Your Mac will.