Can I generate AI video locally on a Mac?

Yes. mlx-serve runs LTX-Video 2.3 natively on Apple Silicon — the full diffusion + 3D-VAE pipeline ported to MLX and validated tensor-by-tensor against the reference. Clips render with synchronized audio, muxed straight to an mp4. You'll want a 24 GB+ Mac and a one-time ~50 GB model download.

Can it animate my own photo?

Yes — drop a picture into the Video pane's First-frame slot and the clip begins exactly from it: the image is VAE-encoded on-device and locked as the clean opening frame, and the model animates forward from there.

Can characters actually talk in the generated video?

Yes, in three ways. Put spoken lines in quotes in the prompt and LTX generates a voice timed to the picture. Attach a real speech or music clip and the video is generated against that soundtrack: performance and lip sync follow the audio, and the original clip (not a re-synthesis) lands in the mp4. Or type a line and have the local Qwen3-TTS voice it, optionally in a cloned voice.

Do style LoRAs work on video too?

Yes — attach any diffusers-format LoRA .safetensors under Advanced options to restyle LTX generations at runtime, with zero quality loss on the base weights. The same mechanism works for FLUX and Krea image generation.

Text to video. Photo to video. Soundtrack to video. On your Mac.

Three ways in

Prompt, photo, or soundtrack

Text → video

Describe the shot

Type a prompt, pick a quality preset, and get a clip with synchronized audio. The one-stage path runs guidance-free by default, which is roughly 2× faster per step than guided sampling and gives a more natural look.

Photo → video

Animate your own picture

Drop an image into the First-frame slot and the clip begins exactly from it — VAE-encoded on-device and locked as the clean opening frame, then animated forward.

Audio → video

Perform to a soundtrack

Attach speech or music, and the video is generated against that clip — timing, performance, and lip sync follow the audio, and the original recording lands in the mp4, not a re-synthesis.

Figure: The MLX Core Video pane, with prompt, quality presets, First-frame slot, and the Speech & sound section for soundtrack-driven clips.

Talking characters

Put the lines in quotes. They speak.

Write spoken words in quotes in your prompt — short phrases with acting directions between them — and LTX generates the voice, timed to the picture. Want a specific voice? Attach a real recording, or type a line and have the local Qwen3-TTS speak it — including a voice you cloned from a few seconds of reference audio.

Any WAV/MP3/M4A works, the frame count auto-fits the clip length, and audio guidance on the Quality presets steers toward clean speech — clearer voices, less stray background noise.
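As a rough illustration of what fitting the frame count to a clip could involve, the sketch below converts an audio duration into a frame count. The 24 fps rate and the "multiple of 8, plus 1" frame-count constraint are assumptions borrowed from earlier LTX-Video releases, not confirmed behavior of this app, and `fit_frame_count` is a hypothetical helper. It rounds up so the video covers the whole recording; the app may instead round down or cap the length.

```python
import math


def fit_frame_count(audio_seconds: float, fps: float = 24.0, multiple: int = 8) -> int:
    """Return the smallest frame count of the form multiple*k + 1 that covers the audio.

    Both `fps` and `multiple` are illustrative assumptions, not values
    documented for mlx-serve.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    # Frames needed to span the whole recording at the assumed frame rate.
    needed = math.ceil(audio_seconds * fps)
    # Round up to the next count of the form multiple*k + 1 (with k >= 1).
    k = max(1, math.ceil((needed - 1) / multiple))
    return multiple * k + 1
```

Under these assumptions a 5-second voice memo maps to 121 frames, just over 5 seconds of video at 24 fps.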

Figure: A photo in the First-frame slot plus a voice clip: your picture, performing your audio.

Quality presets

Fast when you're iterating, two-stage when it matters

The Fast preset runs the distilled one-stage pipeline. The Quality and Super-Quality presets run the full reference two-stage pipeline natively: a guided half-resolution pass on the dev model (CFG plus modality guidance; Super-Quality adds a second-order sampler), a learned 2× latent upscale, then a distilled refine at full resolution.

Style LoRAs apply here too: the same runtime "filters on steroids" that restyle image generations restyle your clips, with zero quality loss on the base weights.

Fast — distilled one-stage, guidance-free, ~2× faster per step.
Quality — guided dev pass → 2× latent upscale → distilled refine.
Super — same, with a second-order sampler on the guided pass.
LoRA restyle — attach a .safetensors, dial the strength.
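To make the guided versus guidance-free distinction concrete, here is a minimal sketch of classifier-free guidance (CFG) in its standard form: the model is evaluated twice per step, with and without the prompt, and the two predictions are blended. That extra evaluation is why a guidance-free step costs roughly half as much. This is a toy illustration, not mlx-serve's implementation: `CountingDenoiser`, `cfg_step`, and `guidance_free_step` are hypothetical names, the "model" is a stand-in function, and modality guidance is not modeled.

```python
from typing import Callable, List, Optional

Vector = List[float]
Model = Callable[[Vector, Optional[str]], Vector]


class CountingDenoiser:
    """Toy stand-in for a diffusion model that counts its forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, latent: Vector, prompt: Optional[str]) -> Vector:
        self.calls += 1
        # Pretend the prompt shifts the prediction by 1.0; no prompt, no shift.
        shift = 0.0 if prompt is None else 1.0
        return [x + shift for x in latent]


def cfg_step(model: Model, latent: Vector, prompt: str, scale: float) -> Vector:
    """One guided step: two forward passes, blended by the guidance scale."""
    cond = model(latent, prompt)   # prediction with the prompt
    uncond = model(latent, None)   # prediction without the prompt
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def guidance_free_step(model: Model, latent: Vector, prompt: str) -> Vector:
    """One guidance-free step: a single forward pass."""
    return model(latent, prompt)
```

With `scale=1.0` the guided result equals the conditional prediction; larger values push further in the direction the prompt suggests, at the cost of the second forward pass.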

Honest requirements

What it takes to run

Local video generation is among the heaviest workloads a consumer Mac can run. Here's the real bill, paid once.

24 GB: recommended unified memory
~50 GB: one-time download, every quality preset included, switch offline
1 click: download lives right in the Video pane, resumable
$0: per clip, forever after

The app checks free RAM before starting and offers to stop the chat server if it's competing for memory. Outputs land in ~/.mlx-serve/generations/videos/, by date. API users: POST /v1/video/generations with prompt, optional first_frame_image, audio, pipeline, and lora_path.
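For API users, a request to that endpoint might be assembled as in the sketch below. Only the path and the field names come from the text above. Everything else is an assumption to check against the server's own documentation: the JSON body, the base URL and port, passing `first_frame_image` and `audio` as local file paths, and the `"fast"` pipeline value. `build_video_request` is a hypothetical helper.

```python
import json
import urllib.request
from typing import Optional


def build_video_request(
    base_url: str,
    prompt: str,
    first_frame_image: Optional[str] = None,
    audio: Optional[str] = None,
    pipeline: Optional[str] = None,
    lora_path: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/video/generations.

    Assumes a JSON body; optional fields are omitted when not provided.
    """
    payload = {"prompt": prompt}
    optional = {
        "first_frame_image": first_frame_image,
        "audio": audio,
        "pipeline": pipeline,
        "lora_path": lora_path,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/video/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the result to `urllib.request.urlopen`, for example `urllib.request.urlopen(build_video_request("http://127.0.0.1:8080", "a fox in snow", pipeline="fast"))`, where the address and pipeline name are placeholders. The response format is not described here, so inspect what the server returns before relying on any particular field.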

FAQ

Local video generation, answered

Does it really run on-device — no cloud render farm?

Really. The full LTX-Video 2.3 pipeline — diffusion transformer, 3D VAE, audio decoder — was ported natively to MLX and validated tensor-by-tensor against the reference implementation. Your prompts, photos, and clips never leave the Mac.

How does the soundtrack feature work?

The audio clip is encoded on-device and held fixed while the video denoises around it, so motion and lip sync follow the sound. The mp4 gets your original recording at native quality — trimmed to the clip length, never re-synthesized.

Can I use a cloned voice for the character?

Yes — the Speech & sound section chains into the local TTS: type the line, pick (or clone) a voice with Qwen3-TTS, and the generated speech drives the video. All three models run through the same local server.

Why is the download so big?

The ~50 GB snapshot ships both transformer variants (one-stage distilled + two-stage dev, ~11 GB each), the upscaler, and the Gemma text encoder — so every quality preset works offline without re-downloading. The downloader pulls only the files the engine actually reads, not the repo's full ~70 GB.

Go deeper

More deep dives

A film set on your desk.

Download MLX Core, grab LTX-Video with one click, and turn a prompt, a photo, or a voice memo into a clip — audio and all.

Download for Mac View on GitHub

Image generation & editing →

Voice cloning & TTS →

Always-on assistant →

Claude Code, fully local →

LM Studio alternative →

Agent Sandbox →
