IP-Adapter - AI Image Models Tool
Overview
IP-Adapter is a lightweight, open-source image-prompt adapter from Tencent AI Lab that adds image-prompt capability to pretrained text-to-image diffusion models without fine-tuning the base model. Its central design is a decoupled cross-attention mechanism: new cross-attention layers are added for image features alongside the existing text cross-attention layers, so the UNet attends to image and text features separately. The official implementation uses an OpenCLIP ViT-H/14 image encoder, has about 22M trainable parameters, and was trained on roughly ten million image-text pairs drawn from LAION-2B and COYO-700M. ([ar5iv.org](https://ar5iv.org/pdf/2308.06721))
Because only the added modules are trained and the original diffusion UNet stays frozen, a single IP-Adapter checkpoint can be reused across custom models derived from the same base (e.g., community SD v1.5 variants) and combined with existing controllable adapters such as ControlNet and T2I-Adapter.
The project provides multiple checkpoints (including face-specialized and experimental SDXL variants), supports safetensors, and is integrated into Hugging Face Diffusers and popular UIs (WebUI, ComfyUI, InvokeAI), so loading IP-Adapter weights and passing an image prompt alongside a text prompt at inference is straightforward. The codebase, demos, and release notes are on GitHub; the Diffusers documentation includes loading and usage examples as well as masking and scale controls for fine-grained conditioning. ([github.com](https://github.com/tencent-ailab/IP-Adapter))
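As a rough illustration of the decoupled cross-attention idea described above, the sketch below computes text attention and image attention from the same queries and adds them, with a scale applied to the image branch. It is a toy, pure-Python version on plain lists, not the repository's implementation: the function names, the `scale` parameter name, the toy inputs, and the omission of the learned projection matrices are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


def decoupled_cross_attention(queries, text_kv, image_kv, scale=1.0):
    """Add scaled image attention to text attention, sharing the same queries.

    text_kv and image_kv are (keys, values) pairs; in a real model each pair
    would come from its own learned key/value projections.
    """
    text_out = attention(queries, *text_kv)
    image_out = attention(queries, *image_kv)
    return [
        [t + scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


# Hypothetical toy inputs: two latent queries, two text tokens, one image token.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
image_kv = ([[0.5, 0.5]], [[10.0, 20.0]])

text_only = decoupled_cross_attention(queries, text_kv, image_kv, scale=0.0)
blended = decoupled_cross_attention(queries, text_kv, image_kv, scale=0.5)
```

With `scale=0.0` the result reduces to plain text cross-attention, which is the intuition behind lowering the IP-Adapter scale to weaken the image prompt.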
GitHub Statistics
- Stars: 6,391
- Forks: 412
- Contributors: 16
- License: Apache-2.0
- Primary Language: Jupyter Notebook
- Last Updated: 2024-06-28T02:47:03Z
Key Features
- Decoupled cross-attention: separate attention layers for image and text conditioning.
- Lightweight: about 22 million trainable parameters in the adapter.
- Reusability: a single adapter can be applied to multiple custom models derived from the same base.
- Compatible with ControlNet and T2I-Adapter for combined structural and image guidance.
- Face-specialized checkpoints (FaceID / FaceID-Plus / FaceID-PlusV2) and SDXL experimental variants.
- Supports safetensors and is integrated into Hugging Face Diffusers for easy loading.
- Layer-scale control and masking to target IP-Adapter influence by region or layer.
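The per-layer scale control in the list above can be sketched with the Diffusers method `set_ip_adapter_scale`, which accepts a nested dictionary of per-block scales as well as a single float. The block names and values below follow the SDXL configuration shown in the Diffusers IP-Adapter guide; the constant name `SDXL_LAYER_SCALES` and the helper `apply_layer_scales` are hypothetical, and the right block layout for a different base model is something you would need to check against that model's UNet.

```python
# Per-block IP-Adapter scales for an SDXL UNet (assumed layout, taken from the
# Diffusers guide). Each list holds one scale per attention layer in that
# block; 0.0 switches the image prompt off for that layer.
SDXL_LAYER_SCALES = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}


def apply_layer_scales(pipeline, scales=SDXL_LAYER_SCALES):
    """Apply per-block scales to a pipeline that already has an IP-Adapter loaded."""
    pipeline.set_ip_adapter_scale(scales)
    return pipeline
```

Calling `apply_layer_scales(pipeline)` after `pipeline.load_ip_adapter(...)` would restrict the image prompt to the listed layers instead of applying one global strength.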
Example Usage
Example (python):
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image
# Load a base pipeline (example: SDXL base pipeline)
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")
# Load an IP-Adapter checkpoint (SDXL weights are stored under the 'sdxl_models' subfolder)
pipeline.load_ip_adapter(
    "h94/IP-Adapter",  # example HF repo containing IP-Adapter weights
    subfolder="sdxl_models",  # or 'models' for SD1.5 checkpoints
    weight_name="ip-adapter_sdxl.bin",
)
# Optional: set how strongly the image prompt should affect generation
pipeline.set_ip_adapter_scale(0.8)
# Load the image to use as the image prompt
image_prompt = load_image("/path/to/reference.jpg")
# Generate an image using both text and image prompts
out = pipeline(
    prompt="A cinematic portrait of a person standing on a cliff, dramatic lighting",
    ip_adapter_image=image_prompt,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
out.save("generated_with_ip_adapter.png")
# For more advanced use (masking, multi-IP-Adapters, face-specific checkpoints),
# see the Diffusers IP-Adapter documentation and repository demos.
([huggingface.co](https://huggingface.co/docs/diffusers/v0.35.0/en/using-diffusers/ip_adapter))
Benchmarks
- Model parameters (adapter): ≈22M (Source: https://arxiv.org/abs/2308.06721)
- Training data (approx.): ≈10 million image-text pairs (LAION-2B + COYO-700M) (Source: https://arxiv.org/abs/2308.06721)
- CLIP-T (image-to-caption alignment) on COCO val: 0.588 (Source: https://arxiv.org/abs/2308.06721)
- CLIP-I (image-to-image embedding similarity) on COCO val: 0.828 (Source: https://arxiv.org/abs/2308.06721)
- GitHub community signals: ~6.4k stars (repo) (Source: https://github.com/tencent-ailab/IP-Adapter)
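CLIP-I, as listed above, measures how similar the CLIP image embedding of a generated image is to that of the image prompt. Embedding similarity of this kind is commonly computed as a cosine similarity, sketched below. The vectors are made-up placeholders standing in for real CLIP outputs, and the function name is hypothetical; this is not the evaluation script behind the reported numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder embeddings (assumed values, not produced by a CLIP encoder).
prompt_embedding = [0.2, 0.1, 0.7]
generated_embedding = [0.25, 0.05, 0.65]

score = cosine_similarity(prompt_embedding, generated_embedding)
```

A benchmark figure would be such a score averaged over many generated images, with embeddings taken from an actual CLIP image encoder.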
Key Information
- Category: Image Models
- Type: AI Image Models Tool