Jamba-v0.1 - AI Language Models Tool

Overview

Jamba-v0.1 is an open base language model from AI21 Labs that combines a structured state-space (Mamba/SSM) component with Transformer attention to deliver high-throughput, long-context generation. The published checkpoint is a mixture-of-experts (MoE) generative model with 12B active parameters and ~52B total parameters across experts, and it supports a 256K-token context window, enabling tasks such as book-scale summarization, long-document Q&A, and large-context retrieval-augmented generation. (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

Jamba-v0.1 is released under an open license (Apache 2.0) on Hugging Face and is intended as a base model for fine-tuning into instruct/chat variants. The checkpoint ships in BF16 by default, supports optimized Mamba kernels (mamba-ssm / causal-conv1d) for production throughput gains, and includes guidance for quantized and multi-GPU loading for long-sequence inference. The Hugging Face model card also reports benchmark scores (MMLU, HellaSwag, GSM8K, and others) that position Jamba-v0.1 competitively in its size class. (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

Model Statistics

  • Downloads: 1,025
  • Likes: 1,188
  • Pipeline: text-generation
  • Parameters: 51.6B

License: apache-2.0

Model Details

Architecture and training: Jamba is a hybrid SSM-Transformer (the name is a contraction of Joint Attention and Mamba) that interleaves Mamba SSM blocks with Transformer attention layers and uses MoE routing to give the model 12B active parameters out of ~52B total parameters across experts. This design targets sub-quadratic memory/compute scaling for long contexts and higher throughput on long-sequence tasks. (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

Capabilities and runtime: The model supports a 256K-token context length (Hugging Face and AI21 documentation report effective long-context behavior) and is distributed in BF16. The Hugging Face model card advises a recent transformers release (4.39/4.40 or later), with mamba-ssm and causal-conv1d optionally installed for optimized kernels; it describes BF16/FP16 loading, FlashAttention2 usage, and 8-bit quantization via bitsandbytes (with a recommendation to exclude Mamba blocks from quantization). Practical notes include fitting up to ~140K tokens on a single 80GB GPU with an 8-bit/quantized setup, plus guidance for multi-GPU parallelization and PEFT-style fine-tuning (the documented example requires ~120GB of GPU RAM). The model card also notes that Jamba-v0.1 is an unaligned base model and recommends adding alignment/safety layers before production chat/instruct deployments. (Source: https://huggingface.co/ai21labs/Jamba-v0.1)
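
A minimal sketch of the BF16 + FlashAttention2 loading path described above, assuming the flash-attn package is installed and the available GPUs support bfloat16 (device_map="auto" is used here to shard layers across GPUs for long-context inference):

from transformers import AutoModelForCausalLM
import torch

# Sketch: half-precision loading with FlashAttention2, following the model card's guidance.
# Assumes flash-attn is installed and the GPUs support bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",  # shard layers across available GPUs for long sequences
)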

Key Features

  • Hybrid SSM (Mamba) + Transformer architecture for long-context throughput.
  • Mixture-of-experts: ~12B active parameters, ~52B total across experts.
  • Supports 256K token context windows for book- and document-scale tasks.
  • Distributed BF16 checkpoint with explicit guidance for FlashAttention2 and 8-bit quantization.
  • Open Apache-2.0 licensed base model intended for fine-tuning into instruct/chat versions.

Example Usage

Example (python):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load tokenizer + model (BF16 checkpoint on HF)
model_id = "ai21labs/Jamba-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

prompt = "In the recent Super Bowl LVIII,"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device)["input_ids"]

# simple generation
outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

# Notes: for best throughput on long contexts install `mamba-ssm` and `causal-conv1d`,
# and consider bitsandbytes quantization (skip Mamba blocks) per the model card.
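
A hedged sketch of the 8-bit path mentioned in the note above, following the model card's recommendation to keep Mamba blocks out of quantization (actual memory savings and supported module names may vary with bitsandbytes/transformers versions):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Sketch: 8-bit loading via bitsandbytes, skipping the Mamba SSM blocks
# during quantization as the model card recommends.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],  # keep Mamba blocks in higher precision
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    device_map="auto",
)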

Pricing

AI21 publishes per-token API pricing for the Jamba family. Public (AI21 Studio) list prices: Jamba Mini — $0.20 per 1M input tokens and $0.40 per 1M output tokens; Jamba Large — $2.00 per 1M input tokens and $8.00 per 1M output tokens. Prices are published by AI21 and may change; contact AI21 for enterprise or private-cloud plans. (Source: AI21 pricing page).
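
For a rough sense of scale, a back-of-the-envelope estimate using the Jamba Large list prices quoted above (illustrative only; the token counts are assumptions):

# Illustrative cost estimate using the Jamba Large list prices quoted above.
input_tokens = 200_000   # assumed: a long document fed into the context window
output_tokens = 2_000    # assumed: generated summary length

input_cost = input_tokens / 1_000_000 * 2.00    # $2.00 per 1M input tokens
output_cost = output_tokens / 1_000_000 * 8.00  # $8.00 per 1M output tokens
print(f"Estimated cost: ${input_cost + output_cost:.2f}")  # Estimated cost: $0.42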

Benchmarks

HellaSwag: 87.1% (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

MMLU: 67.4% (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

GSM8K (CoT): 59.9% (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

PIQA: 83.2% (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

TruthfulQA: 46.4% (Source: https://huggingface.co/ai21labs/Jamba-v0.1)

Last Refreshed: 2026-01-16

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool