HiDream-I1 - AI Image Models Tool

Overview

HiDream-I1 is an open-source text-to-image foundation model with 17 billion parameters that targets high-quality, low-latency image generation. The model implements a sparse Diffusion Transformer (DiT) with a dual-stream, dynamic Mixture-of-Experts design; the authors describe the architecture and results in a technical report (arXiv:2505.22705) and have released code and weights on GitHub under an MIT license. HiDream-I1 ships in three variants (HiDream-I1-Full, HiDream-I1-Dev, and HiDream-I1-Fast) with recommended inference step counts of 50, 28, and 16 respectively, as documented in the project README. (Source: HiDream-I1 GitHub README and arXiv technical report.)

Beyond text-to-image generation, the model is extended to instruction-driven image editing via HiDream-E1. The project provides a HiDreamImagePipeline compatible with Hugging Face Diffusers, a runnable Gradio demo, and an official Hugging Face model card with benchmark results on DPG-Bench, GenEval, and HPSv2.1.

Community feedback has praised the model's prompt following and style versatility, though some users report occasional instability or “busy” errors on the hosted Hugging Face Space. The repository recommends FlashAttention (built against CUDA 12.4) and conditions generation on multiple text encoders (including a Llama 3.1 8B Instruct component) together with the VAE from FLUX.1. (Sources: GitHub README; arXiv:2505.22705; Hugging Face model card and discussions.)
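
The variant choice is effectively a quality/latency knob. As a minimal sketch of the mapping described above (repo IDs match those used in the pipeline example later in this entry; the helper function is illustrative, not part of the project's API):

# Recommended inference steps per variant, per the project README.
VARIANT_STEPS = {
    "HiDream-ai/HiDream-I1-Full": 50,  # highest quality
    "HiDream-ai/HiDream-I1-Dev": 28,   # quality/speed trade-off
    "HiDream-ai/HiDream-I1-Fast": 16,  # lowest latency
}

def steps_for(repo_id: str) -> int:
    """Look up the README-recommended step count for a variant."""
    return VARIANT_STEPS[repo_id]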

GitHub Statistics

  • Stars: 2,491
  • Forks: 240
  • Contributors: 6
  • License: MIT
  • Primary Language: Python
  • Last Updated: 2025-07-16T14:38:12Z

Key Features

  • 17-billion-parameter sparse Diffusion Transformer backbone for high-quality image synthesis.
  • Three variants (Full/Dev/Fast) tuned for quality, development speed, and fast inference.
  • Instruction-based image editing through HiDream-E1 with refine_strength control (see the sketch after this list).
  • Diffusers-compatible HiDreamImagePipeline and a runnable Gradio demo for experimentation.
  • Open-source MIT license with commercial-friendly usage; text encoders include Llama 3.1 and T5 components.
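
For the HiDream-E1 editing path, the editing pipeline ships with the separate HiDream-E1 repository rather than with this one. The sketch below is illustrative only: the class name HiDreamImageEditingPipeline, the module name, the repo ID, and the image argument are assumptions patterned on Diffusers-style editing pipelines; only refine_strength is documented here. Check the HiDream-E1 README for the real entry point.

import torch
from PIL import Image

# Hypothetical import: the real class ships with the HiDream-E1 repo,
# and its name and signature may differ from this sketch.
from hidream_e1 import HiDreamImageEditingPipeline  # assumed module name

pipe = HiDreamImageEditingPipeline.from_pretrained(
    "HiDream-ai/HiDream-E1-Full",  # repo ID assumed; verify on Hugging Face
    torch_dtype=torch.bfloat16,
).to("cuda")

source = Image.open("hidream_output.png").convert("RGB")
edited = pipe(
    "Put a red bandana on the dog",  # instruction-style prompt
    image=source,                    # argument name assumed
    num_inference_steps=28,
    refine_strength=0.3,             # editing control documented by the project
).images[0]
edited.save("hidream_edited.png")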

Example Usage

Example (python):

import torch
from transformers import PreTrainedTokenizerFast, LlamaForCausalLM
from diffusers import HiDreamImagePipeline

# Load the text encoder components the repo expects (you must accept model licenses on HF when required)
tokenizer_4 = PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")
text_encoder_4 = LlamaForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    output_hidden_states=True,
    output_attentions=True,
    torch_dtype=torch.bfloat16,
)

# Load HiDream pipeline (choose Full/Dev/Fast variant available on Hugging Face)
pipe = HiDreamImagePipeline.from_pretrained(
    "HiDream-ai/HiDream-I1-Full",  # or HiDream-I1-Dev | HiDream-I1-Fast
    tokenizer_4=tokenizer_4,
    text_encoder_4=text_encoder_4,
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Generate an image
image = pipe(
    "A photorealistic portrait of a golden retriever wearing a leather jacket, cinematic lighting",
    height=1024,
    width=1024,
    guidance_scale=5.0,
    num_inference_steps=50,  # use 28 for Dev, 16 for Fast
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]
image.save("hidream_output.png")

# Note: per the project README, install FlashAttention (built for CUDA 12.4)
# for best performance.
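
At bf16 precision, the 17B transformer plus the Llama 3.1 8B text encoder is a heavy load for a single GPU. If pipe.to("cuda") does not fit in memory, the standard Diffusers offloading helpers below are a reasonable fallback; these are generic Diffusers APIs, not HiDream-specific guidance from the README.

# Drop-in replacement for pipe.to("cuda"): sub-models are moved to the
# GPU only while they run, trading speed for VRAM headroom.
pipe.enable_model_cpu_offload()

# More aggressive (and slower) alternative that offloads layer by layer:
# pipe.enable_sequential_cpu_offload()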

Benchmarks

  • DPG-Bench (Overall): 85.89
  • GenEval (Overall): 0.83
  • HPS v2.1 (Averaged score): 33.82

(Source for all three: https://huggingface.co/HiDream-ai/HiDream-I1-Dev)

Last Refreshed: 2026-01-09

Key Information

  • Category: Image Models
  • Type: AI Image Models Tool