Can I clone a voice locally on a Mac?

Yes. mlx-serve runs Qwen3-TTS natively on Apple Silicon: record or pick a few seconds of any voice and the model speaks your text in that voice — zero-shot, no training run, no fine-tune. The speaker-embedding pipeline was validated bit-for-bit against the reference implementation.

Do I need a transcript of the reference audio?

No — the reference clip alone is enough. The model extracts a speaker embedding from the audio itself, so a voice memo or any short recording works as-is.

Does my voice data leave the Mac?

No. The recording, the embedding, and the synthesis all happen on-device. Voice data is biometric data — this is exactly the workload you don't want passing through someone's cloud API.

What models and hardware does local TTS need?

Qwen3-TTS ships in 0.6B (~1.5 GB) and 1.7B (~3.5 GB) sizes, both MLX-native — an 8 GB Mac is enough. Models download with one click from the Audio pane and load on demand.

Clone any voice from six seconds of audio.

Record a short clip — or drop one in — and Qwen3-TTS speaks your text in that voice. Zero-shot, no training, no transcript. And because a voice is biometric data, it all runs on your Mac, never through a cloud API.

Zero-shot: no training run
No transcript needed
Bit-for-bit validated against the reference implementation

How it works

Record. Type. Listen.

Open the Audio pane, record 6–8 seconds of any voice in-app (or pick a file — reference clips are normalized automatically), type your text, and generate. The model extracts a speaker embedding from the clip — a compact fingerprint of timbre and delivery — and conditions the synthesis on it. No fine-tune, no waiting, no dataset.

The embedding pipeline was validated bit-for-bit against the reference implementation — the clone you get locally is the clone the model's authors shipped.

[Image: MLX Core Audio pane with in-app voice recording for zero-shot cloning and text-to-speech controls]
The Audio pane: record a reference in-app, type your text, generate in that voice.

Why local matters here

Your voice is a password. Treat it like one.

Voice is biometric data: the same signal that unlocks bank phone lines can be used to impersonate you to your family. Cloud cloning services keep your reference uploads on their terms. Here, the recording, the embedding, and every synthesized sample stay on your Mac: no account, no upload, no retention policy to read.

Clone voices you have the right to use — yours, a consenting collaborator's, a character you own. The tool is local; the responsibility is too.

No upload — the reference clip never leaves the machine.
No account — nothing to subscribe to, nothing to leak.
Works offline — once the model is downloaded, airplane mode is fine.

Beyond speech

A voice you can point anywhere

TTS

Neural text-to-speech

Even without cloning, Qwen3-TTS reads any text aloud in a natural neural voice — two model sizes, 0.6B (~1.5 GB) and 1.7B (~3.5 GB), both fine on an 8 GB Mac.

Video

Give your characters the voice

Chain it into video generation: type a line, have the cloned voice speak it, and the character performs to it — lip sync and all, original audio in the mp4.

API

One endpoint for developers

POST /v1/audio/speech with input text and an optional base64 ref_audio clip — the same OpenAI-style shape your tooling already speaks.
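A request to that endpoint might look like the sketch below, using only the Python standard library. The path and the `input` and `ref_audio` fields come from the description above. Everything else is an assumption for illustration: the `localhost:8080` address, the optional `model` field (borrowed from the OpenAI speech API), the helper names, and the idea that the response body is raw audio that can be written straight to a file.

```python
import base64
import json
import urllib.request

# Assumption: mlx-serve is listening locally; host and port are placeholders.
BASE_URL = "http://localhost:8080"


def build_speech_payload(text, ref_audio_bytes=None, model=None):
    """Build the JSON body for POST /v1/audio/speech."""
    payload = {"input": text}
    if model is not None:
        # "model" mirrors the OpenAI speech API; assumed, not confirmed for mlx-serve.
        payload["model"] = model
    if ref_audio_bytes is not None:
        # The reference clip is optional and is sent base64-encoded.
        payload["ref_audio"] = base64.b64encode(ref_audio_bytes).decode("ascii")
    return payload


def synthesize(text, ref_audio_path=None, out_path="speech.wav", model=None):
    """Send the request and save the returned audio (hypothetical helper)."""
    ref = None
    if ref_audio_path is not None:
        with open(ref_audio_path, "rb") as f:
            ref = f.read()
    body = json.dumps(build_speech_payload(text, ref, model)).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + "/v1/audio/speech",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Assumption: the response body is the audio itself, as in the OpenAI API.
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as out:
        out.write(resp.read())
    return out_path
```

Calling `synthesize("Hello there", ref_audio_path="memo.wav")` would then speak the line in the voice from `memo.wav`, while leaving out `ref_audio_path` would request plain text-to-speech.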

Voice Mode

Talk to your model, hands-free

The same audio stack powers "Hey Loki" Voice Mode — on-device speech recognition in, spoken replies out, no window open.

FAQ

Voice cloning, answered

How much reference audio do I need?

A few seconds — 6–8 is plenty. Record it in-app or drop in a file; clips are normalized automatically, and no transcript is needed.
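If you are preparing clips outside the app, a quick length check can be sketched with Python's standard `wave` module. This is an illustration only: it handles uncompressed WAV files, the function names are made up for the example, and the 6-second default echoes the guidance above without being a limit the app is known to enforce.

```python
import wave


def clip_seconds(source):
    """Return the duration in seconds of an uncompressed WAV file.

    `source` may be a file path or a binary file object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(source, min_seconds=6.0):
    """True if the clip is at least `min_seconds` long (illustrative threshold)."""
    return clip_seconds(source) >= min_seconds
```

For example, `is_usable_reference("memo.wav")` returns `True` for a clip of six seconds or longer.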

Is this a fine-tune? How long does "cloning" take?

No fine-tune — it's zero-shot. The model computes a speaker embedding from your clip in a moment and conditions generation on it immediately. Cloning is as fast as generating.

How good is the clone, really?

The speaker-encoder port is bit-exact against the reference implementation, so fidelity matches what the Qwen3-TTS authors published. Quality scales with the reference clip: clean, close-mic audio clones better than a noisy room.

Whose voice am I allowed to clone?

Yours, and that of anyone who gave you permission. Running locally means no service is policing you, so the judgment is yours. Don't impersonate people.

More deep dives

Video generation →

Image generation & editing →

Always-on assistant →

Claude Code, fully local →

LM Studio alternative →

Self-healing tool calls →

Your voice. Your hardware. Your call.
