IP-Adapter - AI Image Models Tool

Overview

IP-Adapter is a lightweight, open-source image-prompt adapter from Tencent AI Lab that adds image-prompt capability to pretrained text-to-image diffusion models without fine-tuning the base model. Its central design is a decoupled cross-attention mechanism: separate cross-attention layers are added for image features, so the UNet attends to image and text features independently. The official implementation uses an OpenCLIP ViT-H/14 image encoder, has about 22M trainable parameters, and was trained on a multimodal corpus of roughly ten million image-text pairs built from LAION-2B and COYO. ([ar5iv.org](https://ar5iv.org/pdf/2308.06721))

Because only the added modules are trained (the original diffusion UNet stays frozen), a single IP-Adapter checkpoint can be reused across custom models derived from the same base (e.g., community SD v1.5 variants) and combined with existing controllable adapters such as ControlNet and T2I-Adapter.

The project provides multiple checkpoints (including face-specialized and SDXL experimental variants), supports safetensors, and is integrated into Hugging Face Diffusers and popular UIs (WebUI, ComfyUI, InvokeAI), so loading IP-Adapter weights and passing an image prompt alongside a text prompt at inference is straightforward. The codebase, demos, and release notes are on GitHub; the Diffusers documentation includes loader and usage examples as well as masking and scale controls for fine-grained conditioning. ([github.com](https://github.com/tencent-ailab/IP-Adapter))
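The decoupled cross-attention idea described above can be illustrated with a small NumPy sketch. This is not the official implementation: the function names, the `params` dictionary, and all dimensions are hypothetical toy values chosen for illustration. It only shows the general shape of the computation, where one shared query attends to text features and to image features through separate key/value projections, and the two results are summed with a scale on the image branch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Plain single-head scaled dot-product attention.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def init_params(rng, d_latent, d_cond, d_attn):
    # Hypothetical toy projection matrices. In IP-Adapter only the
    # image-branch key/value projections would be trained; the query and
    # text projections belong to the frozen base model.
    def proj(d_in):
        return rng.normal(size=(d_in, d_attn)) / np.sqrt(d_in)
    return {
        "w_q": proj(d_latent),
        "w_k_text": proj(d_cond), "w_v_text": proj(d_cond),
        "w_k_image": proj(d_cond), "w_v_image": proj(d_cond),
    }

def decoupled_cross_attention(latent, text_feats, image_feats, params, scale=1.0):
    # One shared query, two independent key/value branches.
    q = latent @ params["w_q"]
    text_out = attention(q, text_feats @ params["w_k_text"],
                         text_feats @ params["w_v_text"])
    image_out = attention(q, image_feats @ params["w_k_image"],
                          image_feats @ params["w_v_image"])
    # scale=0.0 falls back to the text-only cross-attention output.
    return text_out + scale * image_out

rng = np.random.default_rng(0)
params = init_params(rng, d_latent=32, d_cond=48, d_attn=16)
latent = rng.normal(size=(64, 32))       # toy UNet latent tokens
text_feats = rng.normal(size=(77, 48))   # toy text-encoder tokens
image_feats = rng.normal(size=(4, 48))   # toy image-prompt tokens
out = decoupled_cross_attention(latent, text_feats, image_feats, params, scale=0.8)
print(out.shape)
```

Setting `scale` to 0.0 in this sketch reproduces the text-only output, which mirrors the role the adapter scale plays at inference: it controls how much the image branch contributes relative to the text branch.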

GitHub Statistics

  • Stars: 6,391
  • Forks: 412
  • Contributors: 16
  • License: Apache-2.0
  • Primary Language: Jupyter Notebook
  • Last Updated: 2024-06-28T02:47:03Z

Key Features

  • Decoupled cross-attention: separate attention layers for image and text conditioning.
  • Lightweight: the adapter has about 22 million trainable parameters.
  • Reusability: a single adapter can be applied to multiple custom models derived from the same base.
  • ControlNet/T2I-Adapter compatible for combined structural and image guidance.
  • Face-specialized checkpoints (FaceID / FaceID-Plus / FaceID-PlusV2) and SDXL experimental variants.
  • Supports safetensors and is integrated into Hugging Face Diffusers for easy loading.
  • Layer-scale control and masking to target IP-Adapter influence by region or layer.

Example Usage

Example (python):

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image

# Load a base pipeline (example: SDXL base pipeline)
pipeline = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load an IP-Adapter checkpoint (SDXL weights are stored under the 'sdxl_models' subfolder)
pipeline.load_ip_adapter(
    "h94/IP-Adapter",           # example HF repo containing IP-Adapter weights
    subfolder="sdxl_models",    # or 'models' for SD1.5 checkpoints
    weight_name="ip-adapter_sdxl.bin",
)

# Optional: set how strongly the image prompt should affect generation
pipeline.set_ip_adapter_scale(0.8)

# Load the image to use as the image prompt
image_prompt = load_image("/path/to/reference.jpg")

# Generate an image using both text and image prompts
out = pipeline(
    prompt="A cinematic portrait of a person standing on a cliff, dramatic lighting",
    ip_adapter_image=image_prompt,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]

out.save("generated_with_ip_adapter.png")

# For more advanced use (masking, multi-IP-Adapters, face-specific checkpoints),
# see the Diffusers IP-Adapter documentation and repository demos. ([huggingface.co](https://huggingface.co/docs/diffusers/v0.35.0/en/using-diffusers/ip_adapter))

Benchmarks

Model parameters (adapter): ≈22M (Source: https://arxiv.org/abs/2308.06721)

Training data (approx.): ≈10 million image-text pairs (LAION-2B + COYO-700M) (Source: https://arxiv.org/abs/2308.06721)

CLIP-T (CLIP score between generated images and the captions of the image prompts) on COCO val: 0.588 (Source: https://arxiv.org/abs/2308.06721)

CLIP-I (similarity between CLIP image embeddings of generated images and the image prompts) on COCO val: 0.828 (Source: https://arxiv.org/abs/2308.06721)
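As a rough illustration of what an embedding-similarity score like CLIP-I measures, the sketch below averages the cosine similarity between paired embedding vectors. It assumes the embeddings have already been extracted by some image encoder; the function name and toy vectors are hypothetical, and the paper's exact evaluation protocol (encoder, sample counts, any score scaling) is not reproduced here.

```python
import numpy as np

def mean_cosine_similarity(generated, reference):
    # Rows are embedding vectors; row i of `generated` is paired with
    # row i of `reference`.
    a = np.asarray(generated, dtype=np.float64)
    b = np.asarray(reference, dtype=np.float64)
    # L2-normalize each row so the dot product equals the cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

# Toy embeddings standing in for encoder outputs (hypothetical values).
generated = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
reference = [[3.0, 0.0, 0.0], [0.0, 0.0, 5.0]]
score = mean_cosine_similarity(generated, reference)
print(score)
```

A score near 1 means the paired embeddings point in nearly the same direction, and a score near 0 means they are close to orthogonal.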

GitHub community signals: ~6.4k stars (repo) (Source: https://github.com/tencent-ailab/IP-Adapter)

Last Refreshed: 2026-01-09

Key Information

  • Category: Image Models
  • Type: AI Image Models Tool