Llama 4 Maverick & Scout - AI Language Models Tool

Overview

Llama 4 Maverick and Scout are Meta's next-generation Mixture-of-Experts (MoE) large language models, released April 5, 2025. Both expose a 17B "active" parameter footprint while routing over many more total parameters (Scout: ~109B total with 16 experts; Maverick: ~400B total with 128 experts). They were designed for native multimodality (early-fusion text + image), long-context workloads, and production deployment through Hugging Face integrations (transformers, Text Generation Inference). Hugging Face and Meta publish pretrained and instruction-tuned checkpoints alongside integration code and quantization recipes to make local and cloud deployment practical. ([huggingface.co](https://huggingface.co/blog/llama4-release))

The two models occupy different design points. Scout targets extreme context and accessibility: its instruction-tuned variant supports up to a 10-million-token context window and can be deployed on a single server GPU with on-the-fly int4/int8 quantization. Maverick targets top-tier performance and scale: its instruction-tuned variant supports up to 1M tokens and ships in BF16 and FP8 weight formats.

The release includes architectural notes (interleaved NoPE/RoPE layers, chunked attention, attention temperature tuning, QK normalization), published evaluation numbers on reasoning, code, and image benchmarks, and a community license (the Llama 4 Community License Agreement) that governs usage. It drew strong community interest, along with discussion about benchmark transparency and parity between experimental and public variants. ([huggingface.co](https://huggingface.co/blog/llama4-release))
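
To make the "17B active parameters" idea concrete, here is a minimal sketch of top-k expert routing: a small router scores each token's hidden state, and only the k highest-scoring experts run on that token, so per-token compute tracks the active parameters rather than the total. This is an illustrative toy, not Meta's implementation; the class name, dimensions, and k=1 are all assumptions.

Example (python):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k MoE layer: only k experts run per token (illustrative only)."""
    def __init__(self, d_model=128, d_ff=512, n_experts=16, k=1):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):  # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # routing probabilities
        topw, topi = probs.topk(self.k, dim=-1)     # pick k experts per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, slot, None] * expert(x[mask])
        return out

x = torch.randn(4, 128)
print(TopKMoE()(x).shape)  # torch.Size([4, 128])

Per the release notes, the production MoE layers also pair the routed experts with a shared expert, which this toy omits.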

Key Features

  • Mixture-of-Experts design: 17B active parameters with many more total parameters.
  • Native multimodality (early fusion) for text + image inputs out of the box.
  • Extreme context support: Scout instruct model up to 10M tokens, Maverick up to 1M.
  • Hugging Face transformers & TGI integrations for immediate deployment.
  • Quantization support: on-the-fly int4/8 for Scout; BF16/FP8 formats for Maverick (see the quantization sketch after this list).
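
As a sketch of the single-GPU deployment path mentioned above, the snippet below loads Scout with 4-bit weights through the transformers + bitsandbytes integration. The specific BitsAndBytesConfig values are illustrative assumptions, not Meta's published recipe; check the model card for the recommended settings.

Example (python):

import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

# Illustrative 4-bit config; exact settings are an assumption.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)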

Example Usage

Example (python):

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

# Note: model weights require accepting the Llama 4 Community License on Hugging Face.
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build inputs; for instruct checkpoints, processor.apply_chat_template is
# preferred, but a plain text prompt works for a quick check.
inputs = processor(
    text="Explain chain-of-thought prompting in two sentences.",
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)

# Decode only the newly generated tokens.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

# See the Hugging Face Llama 4 blog for multimodal examples and multi-GPU
# tensor-parallel usage: https://huggingface.co/blog/llama4-release
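
For multimodal input, the blog's pattern uses the processor's chat template with mixed image-and-text messages. Below is a minimal sketch following that pattern, reusing the model and processor loaded above; the image URL is a placeholder, and exact behavior depends on your transformers version.

Example (python):

# Chat-template message mixing an image URL and a text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=64)
# Decode only the tokens generated after the prompt.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))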

Benchmarks

  • Instruction-tuned Maverick: MMLU-Pro 80.5%, GPQA Diamond 69.8%
  • Instruction-tuned Scout: MMLU-Pro 74.3%, GPQA Diamond 57.2%
  • Context window (instruction-tuned): Scout 10,000,000 tokens; Maverick 1,000,000 tokens

Source: https://huggingface.co/blog/llama4-release

Last Refreshed: 2026-01-16

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool