Aria - AI Language Models Tool
Overview
Aria is an open-source, multimodal-native mixture-of-experts (MoE) model released by Rhymes AI that handles vision, language, video, and code understanding in a single model, pairing a native vision encoder with an MoE decoder. It supports a long multimodal context (up to 64K tokens) and is pre-trained from scratch on a large mixed multimodal and language corpus using a four-stage pipeline, making it suitable for image-and-text QA, document and chart reading, video captioning, coding, and long-form multimodal reasoning. ([huggingface.co](https://huggingface.co/rhymes-ai/Aria))
Aria ships as a 25.3B-parameter MoE with an efficient activation pattern (only a few billion parameters are active per token), native visual tokenization across multiple resolutions, and developer-facing tooling (a Transformers processor, example notebooks, and vLLM/accelerate cookbooks). The weights, code, and paper are released under the Apache-2.0 license, permitting research and commercial adaptation subject to its terms. Community reports indicate active adoption along with some deployment caveats; for example, vLLM and transformers version mismatches require attention. ([huggingface.co](https://huggingface.co/rhymes-ai/Aria))
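For a quick look at the advertised context length and MoE layout without downloading the full set of weights, you can inspect the published config from the Hub. This is a minimal sketch, assuming only that the repo's config loads through AutoConfig; the nested field names inside the Aria config may differ from what is shown in the comment, and older transformers versions may additionally need trust_remote_code=True.
Sketch (python):
from transformers import AutoConfig
# Downloads only the small config file, not the 25.3B-parameter weights.
cfg = AutoConfig.from_pretrained("rhymes-ai/Aria")
print(cfg)  # nested text/vision sub-configs report the context length and MoE expert counts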
Model Statistics
- Downloads: 33,279
- Likes: 637
- Pipeline: image-text-to-text
- Parameters: 25.3B
- License: apache-2.0
Model Details
Architecture: Aria pairs a native vision encoder with a mixture-of-experts decoder. The MoE decoder contains many experts per layer (66 experts reported), with a routing mechanism that activates a small subset of experts per token, so only ~3.9B parameters are activated per visual token and ~3.5B per text token out of a ~25.3B total. The vision encoder produces visual tokens in three resolution modes (medium: ~128 visual tokens, high: ~256 tokens, ultra-high: tiled sub-images). ([arxiv.org](https://arxiv.org/abs/2410.05993?utm_source=openai))
Pretraining & data: Aria was pre-trained from scratch with a four-stage pipeline: (1) large-scale language pretraining (text/code), (2) multimodal pretraining (image/video-text interleaved), (3) long-context multimodal pretraining (64K sequences), and (4) multimodal post-training for instruction-following. Stage token counts reported in the model card and paper include trillions of text tokens (e.g., 6.4T language tokens in the early stages, 400B multimodal tokens in multimodal pretraining, and 33B tokens in the long-context stage), reflecting heavy long-context and multimodal exposure. ([huggingface.co](https://huggingface.co/blog/RhymesAI/aria?utm_source=openai))
Deployment & inference: The Hugging Face model card shows Aria can be loaded in bfloat16 on a single A100 (80GB) for inference and includes Transformers-friendly classes (AriaProcessor / AriaForConditionalGeneration) and vLLM cookbooks. The authors provide quantized checkpoints and guidance on grouped_gemm / flash-attn to improve throughput. Users should follow the model-card installation notes and watch for library version compatibility when deploying with vLLM or other runtimes. ([huggingface.co](https://huggingface.co/rhymes-ai/Aria))
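To make the activation numbers above concrete: a learned router scores the experts for each token and only the top few execute, so the activated parameter count stays a small fraction of the 25.3B total. The PyTorch snippet below is an illustrative sketch of generic top-k expert routing, not Aria's actual implementation; the layer sizes are arbitrary, and Aria's router additionally uses shared experts that this sketch omits.
Sketch (python):
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=66, k=6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)                 # normalize over the selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():                  # run each selected expert on its tokens only
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 64)
print(TopKMoELayer()(tokens).shape)  # torch.Size([8, 64])
Only k of the expert MLPs run for any given token, which is why Aria's per-token compute resembles a ~3.5-3.9B-parameter dense model rather than a 25.3B one.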
Key Features
- Native multimodal MoE: single model processes text, images, video, and code without modality-specific bridges.
- Long multimodal context: supports up to 64K tokens for long documents or multi-frame video inputs.
- Efficient MoE activation: ~3.9B parameters activated per visual token, lowering runtime compute compared to full-activation models.
- Multi-resolution vision encoder: medium/high/ultra-high modes (128 or 256 visual tokens per image, with ultra-high tiling the image into high-resolution sub-images).
- Open-source release: weights, code, and technical report available under Apache-2.0 license.
- Single-A100 inference: model card shows loadable on one A100 (80GB) in bfloat16 with provided tooling (see the rough memory estimate after this list).
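As a rough sanity check on the single-A100 claim, the weights alone in bfloat16 occupy about 47 GiB, leaving headroom on an 80 GB card for activations and the KV cache (which grow with context length). A minimal back-of-the-envelope sketch:
Sketch (python):
total_params = 25.3e9        # total parameter count from the model card
bytes_per_param = 2          # bfloat16 stores each parameter in 2 bytes
weight_gib = total_params * bytes_per_param / 1024**3
print(f"~{weight_gib:.1f} GiB of weights")  # ~47.1 GiB, within an 80 GB A100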
Example Usage
Example (python):
import requests
import torch
from PIL import Image
from transformers import AriaProcessor, AriaForConditionalGeneration
model_id = "rhymes-ai/Aria"
model = AriaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AriaProcessor.from_pretrained(model_id)
# load an image and form a multimodal chat message
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
messages = [{"role": "user", "content": [{"type": "image"}, {"text": "Describe the image in one sentence.", "type": "text"}]}]
# render the chat prompt, then tokenize the text together with the image
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=64, stop_strings=["<|im_end|>"], tokenizer=processor.tokenizer)
# strip prompt tokens and decode
response_ids = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(response_ids, skip_special_tokens=True))
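For higher-throughput serving, the model card also points to vLLM cookbooks. The snippet below is a hedged sketch of offline inference through vLLM's chat API; exact flags (e.g., trust_remote_code, dtype, tokenizer mode, context-length limits) depend on your vLLM version, so defer to the cookbook for tested settings.
Sketch (python):
from vllm import LLM, SamplingParams

# trust_remote_code may or may not be required depending on the vLLM version.
llm = LLM(model="rhymes-ai/Aria", dtype="bfloat16", trust_remote_code=True)
messages = [{"role": "user", "content": [
    {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
    {"type": "text", "text": "Describe the image in one sentence."},
]}]
outputs = llm.chat(messages, SamplingParams(max_tokens=64, temperature=0.0))
print(outputs[0].outputs[0].text)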
Note: whichever runtime you use, follow the model card for required package versions and optional grouped_gemm/flash-attn installs. ([huggingface.co](https://huggingface.co/rhymes-ai/Aria))
Benchmarks
- MMMU (Knowledge, multimodal): 54.9 (Source: https://huggingface.co/rhymes-ai/Aria)
- MathVista (Math, multimodal): 66.1 (Source: https://huggingface.co/rhymes-ai/Aria)
- DocQA (Document understanding): 92.6 (Source: https://huggingface.co/rhymes-ai/Aria)
- LongVideoBench (Video understanding): 65.3 (Source: https://huggingface.co/rhymes-ai/Aria)
- HumanEval (Coding): 73.2 (Source: https://huggingface.co/rhymes-ai/Aria)
Key Information
- Category: Language Models
- Type: AI Language Models Tool