Jamba-v0.1 - AI Language Models Tool
Overview
Jamba-v0.1 is AI21 Labs' first production-scale hybrid SSM–Transformer language model: it interleaves Transformer attention blocks with Mamba (SSM) blocks and adds mixture-of-experts (MoE) layers to increase capacity while keeping the active inference footprint small. The published base checkpoint exposes 52B total parameters with ~12B active at inference, supports contexts up to a claimed 256K tokens, and fits very large effective windows on a single 80GB GPU (AI21 reports up to ~140K tokens with quantization). ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))
Released under an Apache-2.0 license and accompanied by a research paper and usage guide, Jamba is positioned as a high-throughput foundation model for fine-tuning into instruct/chat variants and for long-context applications (RAG, long-document summarization, multi-document synthesis). AI21 provides guidance for optimized runtimes (mamba-ssm kernels, FlashAttention2) and multiple precision/quantization strategies for deployment. The model is available on Hugging Face and integrated across AI21's platform and cloud partners. ([ai21.com](https://www.ai21.com/blog/announcing-jamba))
Model Statistics
- Downloads: 624
- Likes: 1190
- Pipeline: text-generation
- Parameters: 51.6B
- License: apache-2.0
Model Details
Architecture and scale: Jamba implements a Joint Attention and Mamba hybrid (often described as "SSM–Transformer") that interleaves Transformer attention layers with Mamba SSM layers and inserts MoE layers to expand capacity. The published base configuration exposes ~52B total parameters, of which ~12B are active per token via MoE routing at inference, reducing memory and compute compared with a dense Transformer of equivalent total size.
Context support and memory: Jamba supports a 256K-token context window, and AI21 reports practical single-GPU deployments up to ~140K tokens when using 8-bit quantization plus optimized kernels. ([arxiv.org](https://arxiv.org/abs/2403.19887))
Precision, runtime, and deployment: The checkpoint is saved in BF16; AI21 and the Hugging Face model card recommend transformers>=4.39/4.40, the mamba-ssm and causal-conv1d kernels for best latency, and optionally FlashAttention2 in half precision. BitsAndBytes-based 8-bit quantization is supported, with the recommendation to skip quantizing the Mamba modules to avoid quality loss.
Fine-tuning: The model is provided as a base (not instruction-aligned); AI21 documents example PEFT/LoRA workflows, and also publishes instruct-tuned variants (the Jamba-1.5 family) plus platform integrations for managed deployment. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))
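To make the total-vs-active parameter distinction concrete, the sketch below counts parameters for a simple MoE stack: each token is routed to top-2 of 16 experts (the routing described in the Jamba paper), so only the selected experts contribute to active parameters. The per-component sizes here are illustrative placeholders chosen to land near 52B/12B, not Jamba's actual layer breakdown.

```python
# Sketch: why MoE total vs. active parameter counts diverge.
# Per-component sizes are placeholders, NOT Jamba's real configuration.

def moe_param_counts(shared_params, expert_params, n_experts, top_k, n_moe_layers):
    """Return (total, active) parameter counts for a simple MoE stack."""
    total = shared_params + n_moe_layers * n_experts * expert_params
    active = shared_params + n_moe_layers * top_k * expert_params
    return total, active

total, active = moe_param_counts(
    shared_params=7e9,    # attention/Mamba/embedding weights, always used (placeholder)
    expert_params=200e6,  # parameters per expert MLP (placeholder)
    n_experts=16,         # experts per MoE layer (from the Jamba paper)
    top_k=2,              # experts routed per token (from the Jamba paper)
    n_moe_layers=14,      # number of MoE layers (placeholder)
)
print(f"total: {total / 1e9:.1f}B, active: {active / 1e9:.1f}B")
```

With these placeholder sizes the totals land close to the published ~52B total / ~12B active split, which is the point: routing top-k of n experts keeps serving costs near those of a much smaller dense model.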
Key Features
- Hybrid SSM (Mamba) + Transformer blocks for memory-efficient long-context processing.
- Mixture-of-experts design: 52B total params with ~12B active parameters at inference.
- Claimed 256K maximum context length for long-document tasks and RAG pipelines.
- Fits very large effective context (AI21 reports ~140K tokens) on an 80GB GPU with quantization.
- Optimized runtimes: supports FlashAttention2, mamba-ssm kernels, and bitsandbytes quantization.
- Released under Apache-2.0 as a base model intended for fine-tuning and research.
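The memory advantage behind the long-context bullets above can be sketched with simple KV-cache arithmetic: attention layers must cache keys and values for every past token, while Mamba layers carry a fixed-size state, so cache size scales with the number of attention layers only. The layer counts and head sizes below are hypothetical placeholders for illustration, not Jamba's published configuration.

```python
# Sketch: KV-cache memory for a full-attention stack vs. a hybrid stack
# where only a fraction of layers use attention. Dimensions are placeholders.

def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 covers the separate K and V tensors; dtype_bytes=2 assumes BF16.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

SEQ = 256_000  # long-context sequence length
full_attn = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ)
hybrid = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=128, seq_len=SEQ)
print(f"full attention: {full_attn / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB")
```

With these placeholder dimensions, replacing 7 of every 8 attention layers with Mamba layers cuts the KV cache by the same 8x factor, which is the mechanism that lets very long windows fit on a single GPU.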
Example Usage
Example (python):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Recommended: run on CUDA device with transformers >=4.39/4.40 and mamba-ssm kernels installed
model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# load BF16 checkpoint (model saved in BF16 on HF)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
inputs = tokenizer("Summarize the following long document:\n\n", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
# Notes: for best long-context throughput install mamba-ssm and causal-conv1d,
# and consider 8-bit quantization with bitsandbytes while skipping Mamba modules.
# Source: Hugging Face model card for ai21labs/Jamba-v0.1. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))
Pricing
AI21 publishes usage pricing for hosted Jamba models on AI21 Studio: Jamba Mini is listed at $0.20 per 1M input tokens and $0.40 per 1M output tokens, and Jamba Large at $2.00 per 1M input tokens and $8.00 per 1M output tokens. For self-hosted use, the base weights are released under Apache-2.0; enterprise/private deployments and volume discounts are available via AI21 sales. ([ai21.com](https://www.ai21.com/pricing/))
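The per-million-token rates above translate to per-request cost with simple arithmetic. The request sizes in this sketch are hypothetical; the rates are the Jamba Large figures cited above.

```python
# Sketch: estimating hosted cost from per-million-token rates.
# Request sizes are hypothetical; rates are the listed Jamba Large prices.

def request_cost(input_tokens, output_tokens, in_rate_per_m, out_rate_per_m):
    """Cost in dollars for one request at the given $/1M-token rates."""
    return input_tokens / 1e6 * in_rate_per_m + output_tokens / 1e6 * out_rate_per_m

# e.g. a 100K-token long-context RAG call producing a 1K-token answer:
cost = request_cost(100_000, 1_000, in_rate_per_m=2.00, out_rate_per_m=8.00)
print(f"${cost:.3f}")
```

Note that at long-context scale the input side dominates: here the 100K prompt tokens cost 25x more than the 1K generated tokens.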
Benchmarks
- HellaSwag: 87.1%
- ARC-Challenge: 64.4%
- WinoGrande: 82.5%
- PIQA: 83.2%
- MMLU: 67.4%
- BBH: 45.4%
- TruthfulQA: 46.4%
- GSM8K (CoT): 59.9%
(Source for all scores: https://huggingface.co/ai21labs/Jamba-v0.1)
Key Information
- Category: Language Models
- Type: AI Language Models Tool