Stable Diffusion - AI Image Models Tool

Overview

Stable Diffusion is an open-source family of latent text-to-image diffusion models that convert textual prompts into high-fidelity images and support image editing (inpainting), img2img, upscaling, depth-conditional generation, and ControlNet-style conditioning. The original v1 models (developed by CompVis with Stability AI and Runway) operate in a compressed latent space to reduce compute and enable 512×512 outputs on consumer GPUs; the canonical v1 config uses an ~860M-parameter U-Net and a ~123M-parameter CLIP text encoder. (See the CompVis repository for architecture and checkpoints: https://github.com/CompVis/stable-diffusion.)

Since the initial release, Stability AI and collaborators have published multiple model families and checkpoints: Stable Diffusion v2 (OpenCLIP text encoder, 768×768 variants), SDXL (a larger architecture with multiple text encoders and an optional refinement stage for higher fidelity), and specialized models for inpainting, depth-guided synthesis, 4× upscaling, and unCLIP-style image variation ("reimagine") conditioned on CLIP image embeddings. The Hugging Face Diffusers library and the official repositories provide ready-made inference pipelines, schedulers, and example scripts that make deployment straightforward for research and prototyping (see the Hugging Face Diffusers documentation: https://huggingface.co/docs/diffusers).

Stable Diffusion’s open weights and tooling democratized image generation but also sparked debates about training data, dataset hygiene, and safety; researchers and journalists have highlighted issues in LAION training subsets and subsequent dataset remediation efforts (for instance, reporting by TechCrunch and a Stanford Internet Observatory investigation). Evaluate licensing and safety guidance before production or commercial deployment (sources: https://techcrunch.com/2024/08/30/the-org-behind-the-data-set-used-to-train-stable-diffusion-claims-it-has-removed-csam/ and the CompVis model card at https://github.com/CompVis/stable-diffusion).
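
The architecture described above maps directly onto the components a Diffusers pipeline exposes. A minimal sketch, using CompVis/stable-diffusion-v1-4 as an illustrative publicly hosted v1 checkpoint (subject to its license terms):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

# The VAE compresses images into the latent space (and decodes back),
# the conditional U-Net iteratively denoises latents, and the CLIP text
# encoder turns prompts into the conditioning the U-Net attends to.
print(type(pipe.vae).__name__)           # AutoencoderKL
print(type(pipe.unet).__name__)          # UNet2DConditionModel
print(type(pipe.text_encoder).__name__)  # CLIPTextModel
print(type(pipe.scheduler).__name__)     # e.g. PNDMScheduler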

Key Features

  • Text-to-image generation with prompt conditioning and classifier-free guidance
  • Image-to-image (img2img) transformations preserving composition and style (see the sketch after this list)
  • Inpainting/outpainting with masked-region text conditioning
  • Latent x4 upscaling model for text-guided super-resolution
  • Depth-conditional and ControlNet-style conditioning for structure-aware edits
  • Multiple checkpoints (v1, v2, v2.1, SDXL) tuned for fidelity or speed
  • Local self-hosting possible — run on consumer GPUs using open weights
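
As an example of the img2img mode referenced above, the following sketch re-renders an existing image to match a prompt; the init-image path is a placeholder for any RGB image you supply:

from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image
import torch

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("init.png")  # placeholder path: any input image

# strength controls how far generation may drift from the input:
# near 0.0 stays close to the original, near 1.0 mostly ignores it.
image = pipe(prompt="the same scene as a watercolor illustration",
             image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("watercolor.png")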

Example Usage

Example (python):

from diffusers import StableDiffusionPipeline
import torch

# Example: text-to-image with a Stable Diffusion 2.1 checkpoint (requires model access)
model_id = "stabilityai/stable-diffusion-2-1"
# fp16 weights roughly halve VRAM use; use torch.float32 on CPU-only machines.
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "A cinematic photo of a lighthouse at sunset, dramatic lighting, 35mm lens"
# guidance_scale sets classifier-free guidance strength; num_inference_steps
# trades speed for quality (50 is a common default).
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("lighthouse_sunset.png")

# Notes: install diffusers, accelerate, transformers and follow the model's licensing/access requirements.
# See Hugging Face Diffusers docs for schedulers, img2img and inpainting pipelines: https://huggingface.co/docs/diffusers
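
Inpainting follows the same pattern but adds a mask. A minimal sketch using Stability AI's dedicated v2 inpainting checkpoint (the image and mask paths are placeholders):

from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image
import torch

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("photo.png")  # placeholder: the image to edit
mask_image = load_image("mask.png")   # placeholder: white = repaint, black = keep

image = pipe(prompt="a red vintage car parked by the curb",
             image=init_image, mask_image=mask_image).images[0]
image.save("inpainted.png")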

Benchmarks

UNet parameters (Stable Diffusion v1): ≈860 million parameters (Source: https://github.com/CompVis/stable-diffusion)

Text encoder parameters (Stable Diffusion v1): ≈123 million parameters (CLIP ViT-L/14) (Source: https://github.com/CompVis/stable-diffusion)
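
A quick way to check these counts locally is to load a v1 checkpoint through Diffusers and sum the parameter tensors; CompVis/stable-diffusion-v1-4 is used here as an illustrative v1 checkpoint:

from diffusers import StableDiffusionPipeline

# Loads the full pipeline (VAE, U-Net, text encoder, tokenizer, scheduler).
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")

unet_params = sum(p.numel() for p in pipe.unet.parameters())
text_params = sum(p.numel() for p in pipe.text_encoder.parameters())
print(f"U-Net: {unet_params / 1e6:.0f}M parameters")         # ~860M
print(f"Text encoder: {text_params / 1e6:.0f}M parameters")  # ~123M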

Native trained resolution (v1 / v2): 512×512 (v1); 768×768 variants available in v2/v2.1 (Source: https://github.com/CompVis/stable-diffusion)

Reference FID (class-conditional ImageNet): FID ≈ 3.6 for the class-conditional LDM reported in the latent-diffusion repository; this benchmarks the underlying latent diffusion approach rather than a Stable Diffusion checkpoint (Source: https://github.com/CompVis/latent-diffusion)

SDXL architecture note: SDXL uses a roughly 3× larger U-Net backbone and conditions on two text encoders for improved fidelity (Source: https://arxiv.org/abs/2307.01952)
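
SDXL ships with its own pipeline class in Diffusers; a minimal sketch using the published base checkpoint (the pipeline manages both text encoders internally):

from diffusers import StableDiffusionXLPipeline
import torch

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# SDXL is trained around 1024×1024 outputs and benefits from prompts at that scale.
image = pipe("a macro photo of a dew-covered spiderweb at dawn").images[0]
image.save("sdxl_web.png")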

Last Refreshed: 2026-01-09

Key Information

  • Category: Image Models
  • Type: AI Image Models Tool