Jamba-v0.1 - AI Language Models Tool

Overview

Jamba-v0.1 is AI21 Labs’ first production-scale hybrid SSM–Transformer language model: it interleaves Transformer and Mamba (SSM) blocks and uses mixture-of-experts (MoE) to increase capacity while keeping the active inference footprint small. The published base checkpoint exposes 52B total parameters with ~12B active at inference, and is designed for extremely long contexts (up to 256K tokens), fitting very large effective windows on a single 80GB GPU (AI21 reports up to ~140K tokens with quantization). ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))

Released under the Apache-2.0 license and accompanied by a research paper and usage guide, Jamba is positioned as a high-throughput foundation model for fine-tuning into instruct/chat variants and for scaling long-context applications (RAG, long-document summarization, multi-document synthesis). AI21 provides guidance for optimized runtimes (mamba-ssm kernels, FlashAttention2) and multiple precision/quantization strategies for deployment. The model is available on Hugging Face and integrated across AI21’s platform and cloud partners. ([ai21.com](https://www.ai21.com/blog/announcing-jamba))

Model Statistics

  • Downloads: 624
  • Likes: 1190
  • Pipeline: text-generation
  • Parameters: 51.6B

License: apache-2.0

Model Details

Architecture and scale: Jamba implements a joint Attention-and-Mamba hybrid (often described as “SSM–Transformer”) that interleaves Transformer attention layers with Mamba SSM layers and inserts MoE layers to expand capacity. The published base configuration exposes ~52B total parameters, of which ~12B are active at inference via MoE routing, reducing memory and compute relative to a dense Transformer of the same total size.

Context support and memory: Jamba supports a 256K-token context window, and AI21 reports practical single-GPU deployments of up to ~140K tokens when using 8-bit quantization plus optimized kernels. ([arxiv.org](https://arxiv.org/abs/2403.19887))

Precision, runtime, and deployment: The checkpoint is saved in BF16. AI21 and the Hugging Face model card recommend transformers >= 4.39/4.40, the mamba-ssm and causal-conv1d kernels for best latency, and optionally FlashAttention2 in half precision. bitsandbytes 8-bit quantization is supported, with the recommendation to skip quantizing the Mamba modules to avoid quality loss.

Fine-tuning: The model is provided as a base (not instruction-aligned) checkpoint, and AI21 documents example PEFT/LoRA workflows. AI21 also publishes instruct-tuned variants (the Jamba-1.5 family) and platform integrations for managed deployment. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))
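The 8-bit path described above can be sketched as follows. This is a minimal sketch assuming the bitsandbytes integration in transformers, with `llm_int8_skip_modules` used to keep the Mamba blocks unquantized; the module name "mamba" follows the model card's example and may vary across library versions:

```python
from transformers import BitsAndBytesConfig

# 8-bit quantization config; the Mamba mixer modules are skipped so their
# state-space computations stay in higher precision, preserving quality.
quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],
)

# Loading the 52B checkpoint needs a large GPU, so shown here as a comment:
# model = AutoModelForCausalLM.from_pretrained(
#     "ai21labs/Jamba-v0.1",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     quantization_config=quant_config,
# )
```

With this config, an 80GB GPU can hold the model plus a very large KV/SSM state, which is what enables the reported ~140K-token single-GPU window.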

Key Features

  • Hybrid SSM (Mamba) + Transformer blocks for memory-efficient long-context processing.
  • Mixture-of-experts design: 52B total params with ~12B active parameters at inference.
  • 256K-token maximum context length for long-document tasks and RAG pipelines.
  • Fits very large effective context (AI21 reports ~140K tokens) on an 80GB GPU with quantization.
  • Optimized runtimes: supports FlashAttention2, mamba-ssm kernels, and bitsandbytes quantization.
  • Released under Apache-2.0 as a base model intended for fine-tuning and research.

Example Usage

Example (python):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Recommended: run on CUDA device with transformers >=4.39/4.40 and mamba-ssm kernels installed
model_id = "ai21labs/Jamba-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load BF16 checkpoint (model saved in BF16 on HF)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Summarize the following long document:\n\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

# Notes: for best long-context throughput install mamba-ssm and causal-conv1d,
# and consider 8-bit quantization with bitsandbytes while skipping Mamba modules.
# Source: Hugging Face model card for ai21labs/Jamba-v0.1 (https://huggingface.co/ai21labs/Jamba-v0.1)

Pricing

AI21 publishes usage-based pricing for hosted Jamba models on AI21 Studio: Jamba Mini (an example hosted variant) is listed at $0.20 per 1M input tokens and $0.40 per 1M output tokens, and Jamba Large at $2.00 per 1M input tokens and $8.00 per 1M output tokens. For self-hosting, the base weights are released under Apache-2.0; enterprise/private deployments and volume discounts are available via AI21 sales. ([ai21.com](https://www.ai21.com/pricing/))
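As a quick sanity check on the hosted rates above, a back-of-the-envelope cost estimate can be computed per request; the model names and function below are illustrative, not part of any AI21 SDK:

```python
# Listed AI21 Studio rates, USD per 1M tokens: (input_rate, output_rate).
RATES = {
    "jamba-mini": (0.20, 0.40),
    "jamba-large": (2.00, 8.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request at the listed rates."""
    rate_in, rate_out = RATES[model]
    return (input_tokens * rate_in + output_tokens * rate_out) / 1_000_000

# Example: summarizing a 100K-token document into 1K tokens on Jamba Large:
# 100_000 * 2.00/1e6 + 1_000 * 8.00/1e6 = 0.20 + 0.008 = 0.208 USD
cost = estimate_cost("jamba-large", 100_000, 1_000)
```

This kind of estimate is useful for comparing hosted pricing against the amortized cost of self-hosting the Apache-2.0 weights on an 80GB GPU.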

Benchmarks

  • HellaSwag: 87.1%
  • ARC-Challenge: 64.4%
  • WinoGrande: 82.5%
  • PIQA: 83.2%
  • MMLU: 67.4%
  • BBH: 45.4%
  • TruthfulQA: 46.4%
  • GSM8K (CoT): 59.9%

Source: https://huggingface.co/ai21labs/Jamba-v0.1

Last Refreshed: 2026-02-03

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool