Stable Diffusion 3 Medium - AI Image Models Tool

Overview

Stable Diffusion 3 Medium (SD3 Medium) is Stability AI’s 2-billion-parameter Multimodal Diffusion Transformer (MMDiT) for text-to-image generation. The release emphasizes improved prompt comprehension, photorealism, and especially typography quality (fewer spelling and kerning errors) while remaining compact enough to run on many consumer GPUs. Stability AI positions SD3 Medium as the resource-efficient member of the SD3 family, trading model size for lower VRAM requirements and easier on-device use. ([stability.ai](https://stability.ai/news/stable-diffusion-3-medium))

SD3 Medium combines large-scale pretraining with focused fine-tuning: Stability reports pretraining on roughly 1 billion images, followed by fine-tuning on curated high-aesthetic data plus preference data to improve quality and alignment. The project ships multiple packaging options (weights-only, or bundles that include the text encoders in fp16 or fp8) so users can trade off inference speed, memory, and quality. The model is available on Hugging Face under Stability’s community / non-commercial research license terms, with enterprise licensing paths for larger commercial use. ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium))

Model Statistics

  • Downloads: 6,899
  • Likes: 4,891
  • Pipeline: text-to-image

License: other

Model Details

Architecture and encoders: SD3 Medium uses the Multimodal Diffusion Transformer (MMDiT) design described in Stability’s SD3 research (the Rectified Flow / diffusion-transformer family). The released Medium configuration has ~2 billion parameters and employs three fixed pretrained text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, and T5-XXL), so users can select different encoder combinations to trade off VRAM and prompt comprehension. ([arxiv.org](https://arxiv.org/abs/2403.03206))

VAE and image fidelity: Stability notes a 16-channel VAE in the SD3 family that improves fine detail, faces, and hands versus many previous models. Training details published by Stability indicate a large pretraining corpus (~1B images) plus fine-tuning on ~30M aesthetic images and ~3M preference-data images to shape visual quality and human preference alignment. ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium))

Packaging and runtimes: Stability provides multiple safetensors packages (weights-only, and variants that bundle the text encoders in fp16 or fp8). A diffusers-compatible pipeline (StableDiffusion3Pipeline) is available for straightforward inference in Python. Stability and partners (NVIDIA, AMD) also provide optimized runtimes (TensorRT and vendor-specific inference builds) for production deployments. ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium))
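
Because the three text encoders load independently, a common memory lever is to skip the largest one (T5-XXL) entirely. Below is a minimal sketch of that configuration using the diffusers pipeline named above; expect some loss of long-prompt comprehension without T5 (the prompt here is illustrative):

import torch
from diffusers import StableDiffusion3Pipeline

# Load SD3 Medium without the T5-XXL encoder to cut VRAM use.
# Passing None for the third encoder/tokenizer skips loading T5 entirely.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    text_encoder_3=None,
    tokenizer_3=None,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor fox in a misty forest",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]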

Key Features

  • MMDiT transformer-based text-to-image architecture with strong prompt comprehension.
  • 2B-parameter Medium variant balances quality and on-device resource use.
  • Three-text-encoder design (OpenCLIP-G, CLIP-L, T5-XXL) for flexible performance trade-offs.
  • 16-channel VAE tuned for improved hands, faces, and fine detail.
  • Multiple packaging options (weights-only, fp16, fp8) to reduce VRAM or improve speed; for a further memory lever, see the CPU-offload sketch after this list.
  • Diffusers pipeline support and vendor-optimized runtimes (TensorRT, AMD optimizations).
  • License options: community / research license with enterprise licensing path available.
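
When even the fp16 packaging is tight, diffusers’ model CPU offload keeps only the active component on the GPU. A minimal sketch (this trades some throughput for a much lower peak VRAM, and requires the accelerate package):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
# Move each component (text encoders, transformer, VAE) to the GPU only
# while it runs; do not call .to("cuda") when offloading is enabled.
pipe.enable_model_cpu_offload()

image = pipe("a neon sign that reads 'OPEN'", num_inference_steps=28).images[0]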

Example Usage

Example (Python):

import torch
from diffusers import StableDiffusion3Pipeline

# Example: basic text-to-image generation (requires a diffusers release
# with SD3 support, v0.29.0 or later)
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe(
    "A photorealistic portrait of a golden retriever wearing a bandana",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]

image.save("sd3_medium_sample.png")

# Notes: pick the text-encoder packaging that matches your memory/quality targets.
# See the model card for fp16/fp8 encoder packaging and diffusers guidance:
# https://huggingface.co/stabilityai/stable-diffusion-3-medium
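
The 28-step / 7.0-guidance settings mirror the reference example on the model card; fewer steps generate faster at some quality cost. If no height/width is passed, the pipeline defaults to 1024×1024 output.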

Benchmarks

Model size (parameters): 2 billion parameters (Medium) (Source: ([stability.ai](https://stability.ai/news/stable-diffusion-3-medium)))

Pretraining dataset size: Pretrained on ~1 billion images; fine-tuned on 30M aesthetic + 3M preference images (Source: ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium)))

Hugging Face downloads (last month): 6,899 (Source: ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium)))

Hugging Face likes: 4,891 likes on the model card (Source: ([huggingface.co](https://huggingface.co/stabilityai/stable-diffusion-3-medium)))

Vendor optimization claim: Up to ~50% faster with TensorRT-optimized runtimes (vendor reported) (Source: ([stability.ai](https://stability.ai/news/stable-diffusion-3-medium)))

Typical VRAM guidance: Minimum ~5 GB VRAM to run; 16 GB recommended for higher speed / comfort (Source: ([venturebeat.com](https://venturebeat.com/ai/stability-ai-brings-new-size-to-image-generation-with-stable-diffusion-medium)))
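
To see where a given setup lands against this guidance, PyTorch’s CUDA memory statistics give a quick peak-VRAM readout after a short run. A minimal sketch (the probe prompt and step count are arbitrary):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

# Peak allocation is dominated by weights plus activations, so a few
# denoising steps are enough to gauge whether a card fits the guidance.
torch.cuda.reset_peak_memory_stats()
pipe("vram probe", num_inference_steps=4)
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")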

Last Refreshed: 2026-01-09

Key Information

  • Category: Image Models
  • Type: AI Image Models Tool