Allegro - AI Video Models Tool

Overview

Allegro is an open-source text-to-video generation model from RhymesAI that converts short natural-language prompts into high-quality, 6-second videos at 15 frames per second and 720×1280 resolution. The system combines a compact VideoVAE encoder for spatio-temporal compression with a scaled Diffusion Transformer (VideoDiT) backbone to produce temporally coherent, detailed motion in a constrained short-clip format. ([huggingface.co](https://huggingface.co/blog/RhymesAI/allegro)) Designed for research and creative prototyping, Allegro ships with full weights and inference/training code under an Apache-2.0 license and is integrated with the Diffusers ecosystem for easier use. The released model configuration includes a 175M-parameter VideoVAE and a 2.8B-parameter Diffusion Transformer; the design targets efficiency (single-GPU BF16 inference with CPU offload) while enabling visually rich outputs like close-ups, animal motion, and imaginative scenes. The project also provides multi-card inference recipes and optional 30 FPS interpolation (EMA-VFI) to improve playback smoothness. ([huggingface.co](https://huggingface.co/blog/RhymesAI/allegro))
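
The project's recommended 30 FPS interpolation uses EMA-VFI; the snippet below is only a minimal sketch of the underlying idea (doubling the frame rate by inserting blended in-between frames), not the EMA-VFI integration itself. double_fps_by_blending is a hypothetical helper name, and video / export_to_video refer to the Diffusers example in the Example Usage section further down; the sketch assumes the pipeline's default PIL frame output.

import numpy as np

def double_fps_by_blending(frames):
    # Naive 15 -> 30 FPS upsampling: insert the average of each adjacent frame
    # pair. A crude stand-in for learned interpolation such as EMA-VFI.
    frames = [np.asarray(f).astype(np.float32) / 255.0 for f in frames]  # PIL -> float arrays in [0, 1]
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        out.append((a + b) / 2.0)  # blended in-between frame
    out.append(frames[-1])
    return out

# export_to_video(double_fps_by_blending(video), "allegro_30fps.mp4", fps=30)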

Key Features

  • Open-source release with full weights and Apache-2.0 license.
  • Generates 6-second videos from single text prompts at 15 FPS, 720×1280 resolution.
  • VideoVAE (175M parameters) compresses video spatio-temporally to reduce the generator's sequence length.
  • 2.8B-parameter Diffusion Transformer (VideoDiT) with 3D attention for spatial-temporal coherence.
  • Supports BF16/FP16/FP32 precisions and sequential CPU offload to lower GPU memory footprint.
  • Includes image-to-video variant (Allegro-TI2V) for first-frame / last-frame conditioned generation.
  • Multi-card inference recipes and VideoSys optimizations to reduce runtime on clusters.

Example Usage

Example (python):

import torch
from diffusers import AutoencoderKLAllegro, AllegroPipeline
from diffusers.utils import export_to_video

# load VAE (float32 recommended) and pipeline (BF16/FP32 OK for DiT)
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16).to("cuda:0")
pipe.vae.enable_tiling()  # tile the VAE decode to keep memory manageable at 720x1280

prompt = "A seaside harbor with bright sunlight and colorful fishing boats, aerial view"
generator = torch.Generator(device="cuda:0").manual_seed(42)
video = pipe(
    prompt,
    guidance_scale=7.5,
    max_sequence_length=512,
    num_inference_steps=100,
    generator=generator,
).frames[0]

# export to mp4 at 15 FPS
export_to_video(video, "allegro_output.mp4", fps=15)

# Notes:
# - Python >= 3.10, PyTorch >= 2.4, and CUDA >= 12.4 are recommended.
# - Use pipe.enable_sequential_cpu_offload() to reduce GPU memory at the cost of longer runtime (a sketch follows below).
# (See the repository and Hugging Face blog for full instructions and multi-GPU options.)
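
A lower-memory variant of the example above, as a minimal sketch using the standard Diffusers offload calls (enable_sequential_cpu_offload and enable_tiling); the roughly 9.3 GB BF16 figure in the Benchmarks section corresponds to this kind of configuration.

# Do not call .to("cuda") when using sequential CPU offload.
vae = AutoencoderKLAllegro.from_pretrained("rhymes-ai/Allegro", subfolder="vae", torch_dtype=torch.float32)
pipe = AllegroPipeline.from_pretrained("rhymes-ai/Allegro", vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_sequential_cpu_offload()  # stream submodules to the GPU one at a time
pipe.vae.enable_tiling()              # decode the latent video in tiles to cap VAE memory
video = pipe(prompt, guidance_scale=7.5, max_sequence_length=512, num_inference_steps=100).frames[0]
export_to_video(video, "allegro_output_lowmem.mp4", fps=15)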

Pricing

Allegro is distributed as open-source software under the Apache-2.0 license, so there is no product or licensing fee. Users only incur the compute (GPU/infrastructure) costs of running inference or training.

Benchmarks

Output resolution: 720 x 1280 @ 15 FPS (Source: https://huggingface.co/blog/RhymesAI/allegro)

Video length (default): 6 seconds (88 frames) (Source: https://github.com/rhymes-ai/Allegro)

Model parameters: VideoVAE: 175M; VideoDiT: 2.8B (Source: https://huggingface.co/blog/RhymesAI/allegro)

Context length: 79.2K tokens (equivalent to 88 frames) (Source: https://github.com/rhymes-ai/Allegro)

Single-GPU memory usage (inference): ≈9.3 GB (BF16) with sequential CPU offload (Source: https://github.com/rhymes-ai/Allegro)

Reported inference time: ≈20 minutes (single H100); ≈3 minutes (8×H100 with VideoSys optimizations) (Source: https://github.com/rhymes-ai/Allegro)
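
The 79.2K context length is consistent with the frame count and resolution above if one assumes a 4x temporal / 8x8 spatial VideoVAE compression and a 2x2 spatial patch size in the DiT; these compression factors are assumptions used for illustration, not figures quoted in the sources above.

frames, height, width = 88, 720, 1280
t_comp, s_comp, patch = 4, 8, 2   # assumed temporal compression, spatial compression, DiT patch size
tokens = (frames // t_comp) * (height // (s_comp * patch)) * (width // (s_comp * patch))
print(tokens)  # 22 * 45 * 80 = 79,200, i.e. ~79.2K tokens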

Last Refreshed: 2026-01-09

Key Information

  • Category: Video Models
  • Type: AI Video Models Tool