Llama4 - AI Language Models Tool

Overview

Llama4 is Meta’s fourth-generation Llama family and the company’s first public Mixture-of-Experts (MoE) multimodal release. It ships in two production variants: Llama 4 Scout (17B active parameters drawn from ≈109B total, with 16 experts) and Llama 4 Maverick (17B active parameters drawn from ≈400B total, with 128 experts). Both variants use early-fusion multimodality (text + image input, text output), support long-context modes in specific builds, and are integrated into the Hugging Face Transformers ecosystem for inference and deployment. ([huggingface.co](https://huggingface.co/docs/transformers/en/model_doc/llama4))

Scout is positioned for extreme long-context workloads (Hugging Face and Meta documentation and partner coverage describe Scout builds with context modes reported up to ~10 million tokens) and is optimized to run on a single high-end server GPU via aggressive quantization and offloading techniques. Maverick targets higher single-query capability and benchmarking performance; several published comparisons place it among the top-performing models on reasoning, coding, and multimodal tasks.

Meta’s public materials also report large-scale pretraining (tens of trillions of tokens) and instruction-tuned variants released under the Llama 4 Community License. Users should note that some prominent benchmark placements and leaderboard results prompted discussion about variant transparency and reproducibility. ([huggingface.co](https://huggingface.co/docs/transformers/en/model_doc/llama4))

Key Features

  • Mixture-of-Experts architecture: only a subset of experts activate per token for efficiency.
  • Native multimodality: early fusion supports text+image inputs (text output).
  • Dual flavors: Scout (17B active/≈109B total, 16 experts) and Maverick (17B active/≈400B, 128 experts).
  • Long-context modes: Scout builds are reported to support contexts up to ~10M tokens.
  • Quantization & offloading: on-the-fly INT4/FP8 and CPU-offloading to reduce GPU memory footprint.
  • Transformers & TGI integrations: first-party support in Hugging Face Transformers and Text Generation Inference.
  • Community license & weights: model checkpoints published under Llama 4 Community License on Hugging Face.
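
The MoE idea behind the active-vs-total parameter split can be illustrated with a generic top-k routing sketch. This is a minimal, self-contained illustration of expert routing in general, not Meta's actual Llama 4 layer (which combines a routed expert with a shared expert); all names and shapes below are illustrative.

```python
import numpy as np

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts per token and softmax-normalize
    their gate weights (generic MoE routing sketch, not Meta's exact code)."""
    top = np.argsort(gate_logits, axis=-1)[:, -k:]          # (tokens, k) expert ids
    picked = np.take_along_axis(gate_logits, top, axis=-1)  # their logits
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    return top, w / w.sum(axis=-1, keepdims=True)

def moe_layer(x, expert_weights, gate, k=1):
    """Apply only the selected expert(s) to each token; the other experts
    stay idle, which is why active parameters can be far fewer than total."""
    logits = x @ gate                                       # (tokens, n_experts)
    experts, weights = top_k_route(logits, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            e = experts[t, j]
            out[t] += weights[t, j] * (x[t] @ expert_weights[e])
    return out

# Toy dimensions only; Llama 4 Scout uses 16 experts at transformer scale.
rng = np.random.default_rng(0)
tokens, dim, n_experts = 4, 8, 16
x = rng.normal(size=(tokens, dim))
gate = rng.normal(size=(dim, n_experts))
expert_weights = rng.normal(size=(n_experts, dim, dim))
y = moe_layer(x, expert_weights, gate, k=1)
print(y.shape)  # (4, 8)
```

With k=1, each token touches exactly one expert's weight matrix per layer, so compute scales with active parameters while total parameters set the memory footprint.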

Example Usage

Example (python):

from transformers import pipeline
import torch

# Example: run an instruction-tuned Llama4 Scout model via Hugging Face Transformers
# (model IDs and recommended device/dtype taken from Hugging Face Llama4 docs).
model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

pipe = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto",
    dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": "Summarize the key steps to prepare mayonnaise."}
]

output = pipe(messages, do_sample=False, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])

# For long-context, quantized, or FP8/INT4 variants, follow the Hugging Face Llama4 docs
# to enable attn_implementation, quantization_config, or offloading. See docs for details.
# Reference: https://huggingface.co/docs/transformers/en/model_doc/llama4
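
A rough back-of-envelope estimate shows why INT4/FP8 quantization is what makes single-GPU Scout deployment plausible. This is a weight-only sketch (real footprints also include KV cache, activations, and framework overhead), and the parameter count is the ≈109B total figure from the overview above.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate weight-only memory footprint in GiB
    (ignores KV cache, activations, and runtime overhead)."""
    return n_params * bits_per_param / 8 / 2**30

SCOUT_TOTAL_PARAMS = 109e9  # ≈109B total parameters (Scout)

for name, bits in [("bf16", 16), ("fp8", 8), ("int4", 4)]:
    print(f"{name}: ~{weight_memory_gib(SCOUT_TOTAL_PARAMS, bits):.0f} GiB")
```

At bf16 the weights alone exceed 200 GiB, while an INT4 build drops to roughly 50 GiB, which is within reach of a single 80 GB-class server GPU; this is consistent with the quantization-and-offloading positioning described for Scout.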

Benchmarks

Maverick — MMLU-Pro (instruction-tuned, reported): 80.5% (as reported by Meta/Hugging Face for the instruction-tuned Maverick) (Source: https://huggingface.co/blog/llama4-release)

Maverick — GPQA Diamond (instruction-tuned, reported): 69.8% (as reported by Meta/Hugging Face for the instruction-tuned Maverick) (Source: https://huggingface.co/blog/llama4-release)

Scout — Long-context capability (reported): Context modes reported up to ~10 million tokens (Scout long-context variants) (Source: https://huggingface.co/docs/transformers/en/model_doc/llama4)

LMArena Elo — Maverick (community leaderboard placement): Elo ≈ 1417 (experimental chat variant of Maverick, as reported on leaderboards) (Source: https://beebom.com/meta-releases-llama-4-ai-models-beats-gpt-4o-grok-3-lmarena/)

Training scale (reported by Meta/Hugging Face): Trained on up to ~40 trillion tokens (as stated in model documentation) (Source: https://huggingface.co/docs/transformers/en/model_doc/llama4)

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool