Allegro - AI Video Models Tool
Overview
Allegro is an open-source text-to-video generation model from RhymesAI that converts short natural-language prompts into high-quality, 6-second videos at 15 frames per second and 720×1280 resolution. The system combines a compact VideoVAE encoder for spatio-temporal compression with a scaled Diffusion Transformer (VideoDiT) backbone to produce temporally coherent, detailed motion in a constrained short-clip format. ([huggingface.co](https://huggingface.co/blog/RhymesAI/allegro))
Designed for research and creative prototyping, Allegro ships with full weights and inference/training code under an Apache-2.0 license and is integrated with the Diffusers ecosystem. The released configuration pairs a 175M-parameter VideoVAE with a 2.8B-parameter Diffusion Transformer; the design targets efficiency (single-GPU BF16 inference with CPU offload) while enabling visually rich outputs such as close-ups, animal motion, and imaginative scenes. The project also provides multi-card inference recipes and optional 30 FPS interpolation (EMA-VFI) for smoother playback. ([huggingface.co](https://huggingface.co/blog/RhymesAI/allegro))
Key Features
- Open-source release with full weights and Apache-2.0 license.
- Generates 6-second, 720×1280 videos at 15 FPS from a single text prompt.
- VideoVAE compresses spatio-temporal data (175M VAE) to reduce generator sequence length.
- 2.8B-parameter Diffusion Transformer (VideoDiT) with 3D attention for spatial-temporal coherence (a toy sketch follows this list).
- Supports BF16/FP16/FP32 precisions and sequential CPU offload to lower GPU memory footprint.
- Includes image-to-video variant (Allegro-TI2V) for first-frame / last-frame conditioned generation.
- Multi-card inference recipes and VideoSys optimizations to reduce runtime on clusters.
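To make the 3D-attention bullet concrete, here is a toy PyTorch sketch of joint spatio-temporal attention: latent tokens from every frame and spatial position are flattened into one sequence so each token attends across both space and time. The shapes are illustrative only; this is not Allegro's actual implementation.
import torch
import torch.nn as nn
# Toy joint spatio-temporal ("3D") attention: every latent token attends to all
# other tokens across frames AND spatial positions, rather than running separate
# spatial and temporal attention passes. Dimensions are tiny and illustrative.
B, T, H, W, C = 1, 4, 6, 10, 64            # batch, latent frames, latent height/width, channels
latents = torch.randn(B, T, H, W, C)
tokens = latents.reshape(B, T * H * W, C)  # flatten (T, H, W) into one joint sequence
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)
out, _ = attn(tokens, tokens, tokens)      # full 3D attention over all 240 tokens
print(out.shape)                           # torch.Size([1, 240, 64])
Full 3D attention is costlier than factorized spatial-then-temporal designs, which is why the VideoVAE's sequence-length compression (previous bullets) matters for keeping the DiT tractable.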
Example Usage
Example (python):
import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video
# load VAE (float32 recommended) and pipeline (BF16/FP32 OK for DiT)
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16).to("cuda:0")
pipe.vae.enable_tiling()  # tile VAE decoding to keep memory manageable at 720x1280
prompt = "A seaside harbor with bright sunlight and colorful fishing boats, aerial view"
video = pipe(
    prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]
# export to mp4 at 15 FPS
export_to_video(video, "allegro_output.mp4", fps=15)
# Notes:
# - Python >=3.10, PyTorch >=2.4 and CUDA >=12.4 are recommended.
# - Use pipe.enable_sequential_cpu_offload() to cut GPU memory (≈9.3 GB in BF16) at the cost of slower runtime.
# (See the repository and Hugging Face blog for full instructions and multi-GPU options.)
Pricing
Allegro is distributed as open-source software under the Apache-2.0 license; there is no commercial pricing for the model itself. Users bear only the compute costs (GPU/infrastructure) of running inference or training.
Benchmarks
- Output resolution: 720×1280 (720p) @ 15 FPS (Source: https://huggingface.co/blog/RhymesAI/allegro)
- Video length (default): 6 seconds (88 frames at 15 FPS) (Source: https://github.com/rhymes-ai/Allegro)
- Model parameters: VideoVAE 175M; VideoDiT 2.8B (Source: https://huggingface.co/blog/RhymesAI/allegro)
- Context length: 79.2K tokens, equivalent to 88 frames (see the sketch after this list) (Source: https://github.com/rhymes-ai/Allegro)
- Single-GPU memory usage (inference): ≈9.3 GB (BF16) with sequential CPU offload (Source: https://github.com/rhymes-ai/Allegro)
- Reported inference time: ≈20 minutes on a single H100; ≈3 minutes on 8×H100 with VideoSys optimizations (Source: https://github.com/rhymes-ai/Allegro)
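As a sanity check, the 79.2K-token context length can be reproduced with simple arithmetic. A minimal sketch, assuming 4× temporal / 8×8 spatial VideoVAE compression and a 2×2 DiT patch size (both are this sketch's assumptions, chosen because they reproduce the reported figure exactly, not values confirmed by the sources above):
# Back-of-envelope reproduction of the 79.2K-token context length.
# The 4x (time) x 8x8 (space) compression and 2x2 patchification below are
# assumptions of this sketch, not figures taken from the cited sources.
frames, height, width = 88, 720, 1280
latent_t = frames // 4                        # 22 latent frames
latent_h, latent_w = height // 8, width // 8  # 90 x 160 latent grid
patch = 2
tokens = latent_t * (latent_h // patch) * (latent_w // patch)
print(tokens)       # 79200 -> the "79.2K tokens" figure
print(frames / 15)  # ~5.87 s, reported as a 6-second clip at 15 FPS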
Key Information
- Category: Video Models
- Type: AI Video Models Tool