Llama 4 Maverick & Scout - AI Language Models Tool
Overview
Llama 4 Maverick and Scout are Meta's next-generation Mixture-of-Experts (MoE) large language models, released April 5, 2025. Both expose a 17B "active" parameter footprint while routing over many more total parameters (Scout: ~109B total with 16 experts; Maverick: ~400B total with 128 experts). They were designed for native multimodality (early-fusion text + image), long-context workloads, and production deployment through Hugging Face integrations (transformers, Text Generation Inference). Hugging Face and Meta publish pretrained and instruction-tuned checkpoints alongside integration code and quantization recipes to make local and cloud deployment practical. ([huggingface.co](https://huggingface.co/blog/llama4-release))
The two models trade design points. Scout targets extreme context and accessibility: the instruction-tuned Scout supports up to a 10-million-token context window and is deployable on a single server GPU with on-the-fly int4/int8 quantization. Maverick targets top-tier performance at scale: the instruction-tuned Maverick supports up to 1M tokens and ships in BF16 and FP8 weight formats. The release includes architectural notes (interleaved NoPE/RoPE layers, chunked attention, attention temperature tuning, QK normalization), published evaluation numbers on reasoning, code, and image benchmarks, and a community license (the Llama 4 Community License Agreement) that governs usage. It sparked strong community interest, along with discussion about benchmark transparency and parity between experimental and public variants. ([huggingface.co](https://huggingface.co/blog/llama4-release))
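To make the "active vs. total parameters" distinction concrete, here is a toy top-1 routing sketch. It is illustrative only, not Meta's implementation (Llama 4 reportedly routes each token to one expert plus a shared expert, with MoE layers interleaved among dense layers); all sizes below are made up for the example.
Example (python):
import torch
import torch.nn.functional as F
# Toy MoE layer: a router sends each token to one expert, so only that
# expert's weights are "active" for the token even though all experts
# contribute to the total parameter count.
num_experts, d_model = 4, 8
experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]
router = torch.nn.Linear(d_model, num_experts)
x = torch.randn(5, d_model)                 # 5 tokens
gates = F.softmax(router(x), dim=-1)        # routing probabilities per token
top_idx = gates.argmax(dim=-1)              # top-1 expert index per token
y = torch.stack([experts[i](tok) for i, tok in zip(top_idx.tolist(), x)])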
Key Features
- Mixture-of-Experts design: 17B active parameters with many more total parameters.
- Native multimodality (early fusion) for text + image inputs out of the box.
- Extreme context support: Scout instruct model up to 10M tokens, Maverick up to 1M.
- Hugging Face transformers & TGI integrations for immediate deployment.
- Quantization support: on-the-fly int4/8 for Scout; BF16/FP8 formats for Maverick (see the loading sketch after this list).
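The Hugging Face blog describes on-the-fly int4 quantization as the path to single-GPU Scout deployment. One way to approximate that with standard transformers tooling is bitsandbytes 4-bit loading; this is a sketch under that assumption and may differ from the blog's exact recipe.
Example (python):
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
# Assumed recipe: bitsandbytes 4-bit weights with bf16 compute.
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto",
)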
Example Usage
Example (python):
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
# Note: model weights require accepting the Llama 4 Community License on Hugging Face.
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
attn_implementation="flex_attention",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Instruct checkpoints expect the chat template, applied via the processor.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "Explain chain-of-thought prompting in two sentences."}]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
# Slice off the prompt tokens so only the model's reply is printed.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
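Because the models are natively multimodal, the same chat-template path accepts images. A minimal sketch, continuing from the `processor` and `model` objects above; the image URL is a placeholder, not a real asset.
Example (python):
messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/cat.png"},  # placeholder URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))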
See the Hugging Face Llama 4 release blog for further multimodal examples and multi-GPU tensor-parallel usage. ([huggingface.co](https://huggingface.co/blog/llama4-release))
Benchmarks
- Instruction-tuned Maverick: MMLU-Pro 80.5%, GPQA Diamond 69.8%
- Instruction-tuned Scout: MMLU-Pro 74.3%, GPQA Diamond 57.2%
- Context window (instruction-tuned): Scout 10,000,000 tokens; Maverick 1,000,000 tokens
(Source: https://huggingface.co/blog/llama4-release)
Key Information
- Category: Language Models
- Type: AI Language Models Tool