Mochi 1 Preview - AI Video Models Tool
Overview
Mochi 1 Preview is an open-source text-to-video generation model from Genmo, released under the Apache-2.0 license. It is presented as a state-of-the-art diffusion-based generator that pairs a reported 10-billion-parameter diffusion backbone with a novel Asymmetric Diffusion Transformer architecture to turn natural-language prompts into short, high-fidelity video clips. Hosted on Hugging Face, the model targets creators and researchers who need an accessible, research-oriented text-to-video baseline. Typical use cases include rapid concept visualization, creative content prototyping, and research into temporal consistency for generative video. Because the license is permissive, teams can inspect, fine-tune, and integrate Mochi 1 Preview into experiments or production pipelines subject to its terms. The Hugging Face model page provides model files, usage examples, and community feedback to help users get started.
Model Statistics
- Downloads: 2,677
- Likes: 1,298
- Pipeline: text-to-video
- License: apache-2.0
Model Details
Architecture and capabilities
- Core design: Mochi 1 Preview is described as a diffusion-based text-to-video model built around an Asymmetric Diffusion Transformer. Genmo positions the architecture to better handle the spatial and temporal challenges of video synthesis.
- Scale: The model is reported as a ~10-billion-parameter diffusion model (as stated in the model description).
- Input/Output: The model accepts text prompts and generates short video clips (a text-to-video pipeline is available on Hugging Face).
- Conditioning: As with other text-conditioned diffusion models, Mochi 1 relies on a transformer-based text encoder to condition the diffusion process (the model card indicates transformer-based conditioning via the Asymmetric Diffusion Transformer).
- Open-source posture: Released under Apache-2.0; users can download the model weights, examine training details on the Hugging Face page, and reuse model artifacts under the license.
Notes and caveats
- Mochi 1 Preview is provided as a preview/research release. Specific low-level architectural diagrams, training datasets, and step-by-step training hyperparameters are not fully enumerated on the public model page. For implementation details and the most recent technical notes, consult the model card and repository on Hugging Face.
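To make the conditioning idea concrete, the sketch below is a deliberately tiny, stdlib-only toy of text-conditioned iterative denoising. Nothing here reflects Mochi's actual architecture, sampler, or weights; the fake noise predictor simply stands in for the role the Asymmetric Diffusion Transformer plays, and all shapes and constants are made up for illustration.

```python
import random

random.seed(0)

# Toy stand-ins: a "text embedding" and a noisy 1-D latent.
# Real text-to-video models use learned encoders and a large transformer;
# this sketch only illustrates the conditioned iterative-denoising data flow.
text_emb = [random.gauss(0, 1) for _ in range(8)]
latent = [random.gauss(0, 1) for _ in range(16)]

cond = sum(text_emb) / len(text_emb)  # collapse the "text" into one scalar

def toy_noise_pred(x, t, c):
    # Fake noise predictor; in Mochi this role is played by the
    # Asymmetric Diffusion Transformer conditioned on the text encoding.
    return 0.1 * x + 0.01 * t + 0.001 * c

steps = 50
for t in reversed(range(steps)):
    # Simplified update; real samplers (DDPM/DDIM-style) differ substantially.
    latent = [x - toy_noise_pred(x, t, cond) for x in latent]

print(len(latent))
```

The point is only the loop structure: the same latent is refined over many steps, and the text signal enters every step through the noise predictor.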
Key Features
- Text-to-video generation using a diffusion-based model
- Asymmetric Diffusion Transformer for improved spatial-temporal modeling
- Reported ~10 billion parameters for high-capacity generation
- Open-source Apache-2.0 license for reuse and research
- Hosted on Hugging Face with model files and community examples
Example Usage
Example (python):
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Illustrative example; the exact pipeline class, call arguments, and output
# handling may differ -- check the model card on Hugging Face for the
# canonical snippet.
# Install dependencies first, and authenticate with Hugging Face if needed:
# pip install diffusers accelerate transformers

model_id = "genmo/mochi-1-preview"
pipe = DiffusionPipeline.from_pretrained(model_id)

prompt = "A sunlit street market in a busy seaside town, cinematic lighting"

# Generation parameters are examples; tune steps and other settings per the
# model card.
output = pipe(prompt, num_inference_steps=50)

# Diffusers video pipelines typically return batched frames; frames[0] is the
# frame sequence for the first (here, only) prompt. Adapt this step to the
# pipeline's actual return type.
frames = output.frames[0]
export_to_video(frames, "mochi_output.mp4", fps=30)

print("Generation finished. Check mochi_output.mp4")

Benchmarks
Hugging Face downloads: 2,677 (Source: https://huggingface.co/genmo/mochi-1-preview)
Hugging Face likes: 1,298 (Source: https://huggingface.co/genmo/mochi-1-preview)
Pipeline type: text-to-video (Source: https://huggingface.co/genmo/mochi-1-preview)
License: Apache-2.0 (Source: https://huggingface.co/genmo/mochi-1-preview)
Reported parameter count: ≈10 billion (as stated in model description) (Source: https://huggingface.co/genmo/mochi-1-preview)
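The figures above are a snapshot from the model page. For live numbers, the public Hugging Face Hub REST API can be queried directly; the stdlib-only sketch below builds the endpoint URL and reads the `downloads` and `likes` fields the public model-repo API returns. The `fetch_model_stats` helper is our own illustrative wrapper, not part of any library.

```python
import json
import urllib.request

def hub_api_url(repo_id: str) -> str:
    # The Hub exposes public repo metadata at /api/models/<repo_id>.
    return f"https://huggingface.co/api/models/{repo_id}"

def fetch_model_stats(repo_id: str) -> dict:
    # Requires network access; "downloads" and "likes" are fields of the
    # JSON document the public API returns for model repos.
    with urllib.request.urlopen(hub_api_url(repo_id)) as resp:
        info = json.load(resp)
    return {"downloads": info.get("downloads"), "likes": info.get("likes")}

print(hub_api_url("genmo/mochi-1-preview"))
```

Calling `fetch_model_stats("genmo/mochi-1-preview")` returns the current counters, which drift from the snapshot in this document over time.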
Key Information
- Category: Video Models
- Type: AI Video Models Tool