LTX-Video 2.3 runs natively on Apple Silicon — clips with synchronized audio, muxed straight to an mp4 on your desk. Start from a prompt, from your own photo, or from a voice clip your characters perform to.
Type a prompt, pick a quality preset, get a clip with synchronized audio. The one-stage path runs guidance-free by default — ~2× faster per step with a more natural look.
Drop an image into the First-frame slot and the clip begins exactly from it — VAE-encoded on-device and locked as the clean opening frame, then animated forward.
Attach speech or music, and the video is generated against that clip — timing, performance, and lip sync follow the audio, and the original recording lands in the mp4, not a re-synthesis.

Write spoken words in quotes in your prompt — short phrases with acting directions between them — and LTX generates the voice, timed to the picture. Want a specific voice? Attach a real recording, or type a line and have the local Qwen3-TTS speak it — including a voice you cloned from a few seconds of reference audio.
Any WAV/MP3/M4A works, the frame count auto-fits the clip length, and audio guidance on the Quality presets steers toward clean speech — clearer voices, less stray background noise.
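One way to picture the frame-count auto-fit is as a small rounding calculation. The sketch below is an illustration under stated assumptions, not the app's actual code: the 24 fps default, the "multiple of 8 plus 1" frame-count constraint, and the function name `fit_frame_count` are all hypothetical.

```python
import math


def fit_frame_count(audio_seconds: float, fps: float = 24.0, temporal_stride: int = 8) -> int:
    """Return the largest frame count of the form stride*k + 1 that fits the audio.

    fps and temporal_stride are illustrative assumptions, not values
    confirmed for this app.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    # Whole frames that fit inside the audio clip at the assumed frame rate.
    max_frames = math.floor(audio_seconds * fps)
    # Round down to the nearest stride*k + 1, never going below a single frame.
    k = max(0, (max_frames - 1) // temporal_stride)
    return temporal_stride * k + 1
```

Rounding down rather than up keeps the video no longer than the recording, which matches the audio being trimmed to the clip length rather than padded.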

The Fast preset runs the distilled one-stage pipeline. The Quality and Super-Quality presets run the full reference two-stage pipeline natively: a guided half-resolution pass on the dev model (CFG + modality guidance — Super adds a second-order sampler), a learned 2× latent upscale, then a distilled refine at full resolution.
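For readers unfamiliar with CFG, the sketch below shows the standard classifier-free guidance combination and why a guided step costs roughly two model evaluations where a guidance-free step costs one. It is a generic illustration on plain Python lists, not the app's sampler: `model`, `guided_step`, and `guidance_free_step` are hypothetical names, and modality guidance and the second-order sampler are left out.

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def guided_step(model, latent, prompt, scale):
    # Two forward passes per step: one with the prompt, one without.
    cond = model(latent, prompt)
    uncond = model(latent, None)
    return cfg_combine(uncond, cond, scale)


def guidance_free_step(model, latent, prompt):
    # A single forward pass per step.
    return model(latent, prompt)
```

With `scale` set to 1.0 the combination reduces to the conditional prediction, so larger values are what actually steer the result toward the prompt.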
Style LoRAs apply here too — the same runtime "filters on steroids" that restyle image generations restyle your clips, zero quality loss on the base weights.
.safetensors, dial the strength.
Local video generation is the heaviest thing a consumer Mac can do. Here's the real bill, once.
The app checks free RAM before starting and offers to stop the chat server if it's competing for memory. Outputs land in ~/.mlx-serve/generations/videos/, by date. API users: POST /v1/video/generations with prompt, optional first_frame_image, audio, pipeline, and lora_path.
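For API users, a request to that endpoint might be assembled as in the sketch below. Only the path and the field names come from the text above. The base URL and port, the JSON body encoding, and the helper name `build_video_request` are assumptions, and the format of each field's value (file path or encoded data, accepted `pipeline` names) is not specified here.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; use wherever your local server listens.
BASE_URL = "http://localhost:8080"


def build_video_request(prompt, first_frame_image=None, audio=None,
                        pipeline=None, lora_path=None, base_url=BASE_URL):
    """Build (but do not send) a POST request for /v1/video/generations.

    Assumes a JSON body; optional fields are omitted when not provided.
    """
    body = {"prompt": prompt}
    optional = {
        "first_frame_image": first_frame_image,
        "audio": audio,
        "pipeline": pipeline,
        "lora_path": lora_path,
    }
    body.update({key: value for key, value in optional.items() if value is not None})
    return urllib.request.Request(
        url=f"{base_url}/v1/video/generations",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. The response shape is not described above, so inspect what comes back before parsing it.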
Really. The full LTX-Video 2.3 pipeline — diffusion transformer, 3D VAE, audio decoder — was ported natively to MLX and validated tensor-by-tensor against the reference implementation. Your prompts, photos, and clips never leave the Mac.
The audio clip is encoded on-device and held fixed while the video denoises around it, so motion and lip sync follow the sound. The mp4 gets your original recording at native quality — trimmed to the clip length, never re-synthesized.
Yes — the Speech & sound section chains into the local TTS: type the line, pick (or clone) a voice with Qwen3-TTS, and the generated speech drives the video. All three models run through the same local server.
The ~50 GB snapshot ships both transformer variants (one-stage distilled + two-stage dev, ~11 GB each), the upscaler, and the Gemma text encoder — so every quality preset works offline without re-downloading. The downloader pulls only the files the engine actually reads, not the repo's full ~70 GB.
Download MLX Core, grab LTX-Video with one click, and turn a prompt, a photo, or a voice memo into a clip — audio and all.