Aria - AI Language Models Tool

Overview

Aria is an open-source, multimodal mixture-of-experts (MoE) model released by Rhymes AI that natively integrates vision, language, and code understanding. Built with an MoE decoder and a lightweight visual encoder, Aria is designed for long-form multimodal tasks (document understanding, scene text/chart reading, and video comprehension) and supports an extended multimodal context window of up to 64,000 tokens.

The project includes checkpoints (e.g., Aria-Base-8K and Aria-Base-64K), a codebase with inference and fine-tuning recipes, and permissive Apache-2.0 licensing to encourage research and downstream adaptation. (Sources: arXiv:2410.05993; rhymes-ai/Aria on GitHub; Hugging Face model card.)

Aria targets both research and applied use: the published checkpoints are suitable for continued pre-training or fine-tuning (including long-document and video QA scenarios), and the authors provide tooling for vLLM/transformers-based inference along with single-GPU fine-tuning recipes. The project has seen active community uptake (Hugging Face downloads and GitHub stars) and follow-on releases such as Aria-Chat (a chat-optimized variant) and third-party integrations (PaddleMIX support). Key strengths are its multimodal-native pretraining recipe, large effective context window, and parameter efficiency via sparse MoE activation. (Sources: Hugging Face model card; GitHub README; project news.)

Model Statistics

  • Downloads: 35,685
  • Likes: 637
  • Pipeline: image-text-to-text

License: apache-2.0

Model Details

Architecture and capacities: Aria combines a lightweight vision encoder that converts images/frames into visual tokens with an MoE-based language decoder that predicts text autoregressively. This design lets visual tokens share representation space with text tokens for unified multimodal generation. The published family is in the ~25B-parameter class (the paper and model card report ≈25.3B total parameters), while MoE sparsity means only a fraction of parameters are activated per token (the paper reports ~3.9B activated parameters per visual token and ~3.5B per text token). Aria is trained with a four-stage pipeline (language pretraining, multimodal pretraining, multimodal long-context pretraining, and multimodal post-training) to progressively build language, multimodal, long-context, and instruction-following capabilities (arXiv paper; GitHub README).

Practical details and developer tooling: Aria supports a long 64K multimodal context window (Aria-Base-64K checkpoint) and is available as base checkpoints (Aria-Base-8K and Aria-Base-64K) plus a chat-optimized Aria-Chat variant. The model card and repo include transformers/vLLM-compatible inference examples, grouped-gemm/flash-attn recommendations for performance, and recipes for fine-tuning on video and document tasks. The codebase and checkpoints are released under Apache-2.0. For hardware, the authors and community note that MoE sparsity reduces active compute (inference may run on a high-end consumer GPU for lighter workloads; full bfloat16 loading of large checkpoints typically expects 80GB-class accelerators for single-card loads). (Sources: arXiv paper; rhymes-ai/Aria GitHub; Hugging Face model card.)
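The 80GB-class note above follows from simple arithmetic; this back-of-the-envelope sketch (using the parameter count reported in the paper) estimates the weight memory alone for a bfloat16 load:

```python
# Rough memory estimate for loading Aria's weights in bfloat16.
# ~25.3B total parameters is the figure reported in the paper/model card.
TOTAL_PARAMS = 25.3e9
BYTES_PER_PARAM_BF16 = 2  # bfloat16 = 16 bits = 2 bytes per value

weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM_BF16 / 1e9
print(f"Weights alone: ~{weight_gb:.1f} GB")  # ~50.6 GB
# Activations, KV cache, and CUDA overhead come on top of this, which is
# why an 80GB-class accelerator (A100/H100) is the comfortable single-card
# target for full-precision bfloat16 loads.
```

Quantized or multi-GPU setups can lower the per-card requirement; see the repository for supported configurations.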

Key Features

  • Native multimodal MoE architecture combining vision tokens and language decoding.
  • Very long multimodal context window: up to 64,000 tokens for long-document/video reasoning.
  • Parameter-efficiency via sparse MoE activation (~3.9B/3.5B active parameters per token).
  • Designed for video, long-document QA, scene text/chart reading, and general V+L tasks.
  • Open-source weights, codebase and recipes under Apache-2.0 for research and fine-tuning.
  • Chat-optimized variant (Aria-Chat) and single-GPU fine-tuning/inference recipes available.
  • Integrations and third-party support (e.g., vLLM tooling and PaddleMIX integration).
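The sparse-MoE-activation bullet above can be illustrated with a toy top-k router. This is a simplified sketch of the general MoE routing idea, not Aria's actual implementation: per token, only the k highest-scoring experts are activated, which is how active parameters stay far below the total count.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def topk_route(router_logits, k=2):
    """Toy MoE router: pick the k highest-probability experts for one
    token and renormalize their gate weights. Illustrative only."""
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}

# One token's router scores over 8 experts; only k=2 experts fire,
# so only those experts' parameters contribute to this token's compute.
gates = topk_route([0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2], k=2)
print(gates)  # experts 1 and 4 selected, gate weights sum to 1
```

In a real MoE layer the selected experts' feed-forward outputs are combined using these gate weights; the routing scores themselves come from a learned linear layer over the token's hidden state.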

Example Usage

Example (python):

import requests
import torch
from PIL import Image
from transformers import AriaForConditionalGeneration, AriaProcessor

model_id = "rhymes-ai/Aria"
# Load model and processor; trust_remote_code is only needed when your
# installed transformers version lacks the built-in Aria classes
model = AriaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
processor = AriaProcessor.from_pretrained(model_id, trust_remote_code=True)

# Example: image + question -> text
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {"role": "user", "content": [
        {"type": "image", "text": None},
        {"type": "text", "text": "Describe the image."}
    ]}
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.8,
    )

# Decode generated ids (skip special tokens)
output_ids = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(output_ids, skip_special_tokens=True)
print(result)

# Note: see the Aria model card and repository for vLLM/flash-attn and grouped_gemm recommendations.
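For video QA, a common pattern with multimodal chat templates is one image entry per sampled frame followed by the text question. The helper below is a hypothetical sketch (not from the Aria repo) that assumes the processor accepts a matching list of PIL frames via its images= argument, as in the single-image example above:

```python
def build_video_messages(num_frames, question):
    """Build a chat `messages` structure with one image slot per sampled
    video frame, followed by the text question. Hypothetical helper:
    assumes the processor accepts a matching list of frames via `images=`."""
    content = [{"type": "image", "text": None} for _ in range(num_frames)]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

# 8 sampled frames plus one question:
messages = build_video_messages(8, "Summarize what happens in this clip.")
# Pass alongside the sampled frames (a list of PIL images):
# text = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=text, images=frames, return_tensors="pt")
```

Frame count is a trade-off between coverage and context budget; consult the repository's video fine-tuning recipes for the sampling rates the authors use.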

Benchmarks

Total parameters (reported): ≈25.3B (Source: Aria paper (arXiv:2410.05993) and Hugging Face model card)

Activated parameters per token: ≈3.9B (visual token), ≈3.5B (text token) (Source: Aria paper (arXiv:2410.05993))

Context window: 64,000 tokens (long multimodal context) (Source: Hugging Face model card (Aria / Aria-Base-64K))

MMLU (base family): 70+ (as noted for the Aria-Base-8K family in the model card) (Source: Aria-Base-8K model card on Hugging Face)

Multimodal speed example: Caption a 256-frame video in ~10 seconds (example from model card) (Source: Hugging Face model card (Aria))

Community usage / adoption: tens of thousands of Hugging Face downloads (37k+ reported in the last month) and 637 likes; ~1.1k GitHub stars (Source: Hugging Face model page and rhymes-ai/Aria GitHub README)

Comparative claim (paper): Reported to outperform Pixtral-12B and Llama3.2-11B; competitive with leading proprietary multimodal models (Source: Aria paper (arXiv:2410.05993))
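The total and activated parameter figures above imply that only a small fraction of the model is active for any given token; this quick check makes the ratio explicit:

```python
# Activated-vs-total parameter fraction, from the figures reported above.
TOTAL = 25.3e9          # total parameters
ACTIVE_VISUAL = 3.9e9   # activated per visual token
ACTIVE_TEXT = 3.5e9     # activated per text token

print(f"visual: {ACTIVE_VISUAL / TOTAL:.1%}")  # ~15.4%
print(f"text:   {ACTIVE_TEXT / TOTAL:.1%}")    # ~13.8%
```

So roughly one-seventh of the parameters participate per token, which is the source of the inference-cost advantage claimed over similarly sized dense models.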

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool