Jamba-v0.1 - AI Language Models Tool

Overview

Jamba-v0.1 is a research-grade base checkpoint from AI21 Labs implementing a hybrid SSM (Mamba)–Transformer mixture-of-experts (MoE) architecture aimed at high-throughput, long-context text generation. The checkpoint published on Hugging Face is the base generative configuration (12B active parameters, 52B total parameters across experts) and is released under the permissive Apache-2.0 license. Jamba interleaves Transformer and Mamba/SSM blocks to reduce memory footprint while preserving or improving quality, and the published checkpoint supports very long contexts (a claimed 256K tokens), enabling applications such as long-document summarization, retrieval-augmented generation, and extended conversation state. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))

As a base model, Jamba-v0.1 is intended primarily for fine-tuning into instruct/chat variants and for research into hybrid SSM–Transformer designs. The released checkpoint is practical to run with modern tooling (Transformers plus the mamba-ssm kernels) and supports BF16/FP16 loading as well as 8-bit quantization, with a recommendation to exclude the Mamba blocks from quantization. The model card and AI21 documentation include usage examples, fine-tuning hints, and benchmark results showing competitive performance in its size class. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))

Model Statistics

  • Downloads: 769
  • Likes: 1190
  • Pipeline: text-generation

License: apache-2.0

Model Details

Architecture and size: Jamba-v0.1 is a hybrid "Joint Attention and Mamba" model that interleaves standard Transformer attention blocks with Mamba (SSM) blocks and applies mixture-of-experts (MoE) routing in some layers. The published checkpoint exposes 12B active parameters during inference, with ~52B total parameters across the expert pool (MoE capacity).

Runtime and precision: the published checkpoint is stored in BF16; the recommended runtime uses torch bfloat16/FP16 together with the optimized Mamba kernels (mamba-ssm and causal-conv1d) for reasonable latency and memory. The Hugging Face model card states that a single 80GB GPU can hold up to ~140K tokens of context in 8-bit mode, while the model is claimed to support a 256K-token context length. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))

Tooling and deployment: the model runs via transformers (>=4.40 recommended; trust_remote_code is needed on older releases) and supports bitsandbytes quantization with module exclusion (recommended: skip the Mamba blocks during int8 quantization). AI21 also publishes API and managed offerings of Jamba-family models (Jamba Mini / Jamba Large) via AI21 Studio and partner platforms for production use. Supported languages for the Jamba family include English, Spanish, French, Portuguese, Italian, Dutch, German, Arabic, and Hebrew (per AI21 docs).

Note: Jamba-v0.1 is a base model and is not instruction- or alignment-tuned; downstream safety, moderation, and alignment should be added by the integrator. ([huggingface.co](https://huggingface.co/ai21labs/Jamba-v0.1))
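The 8-bit loading path described above can be sketched as follows. This is illustrative rather than something to run as-is (it needs a CUDA GPU, bitsandbytes, and the downloaded weights); `llm_int8_skip_modules=["mamba"]` reflects the model card's recommendation to keep the Mamba blocks out of int8 quantization:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 8-bit quantization config; the Mamba (SSM) blocks are excluded from int8
# quantization, as recommended on the model card.
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
```

In this 8-bit configuration, the model card reports that a single 80GB GPU can hold up to ~140K tokens of context.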

Key Features

  • Hybrid Transformer + Mamba (SSM) blocks for lower memory and high throughput.
  • Mixture-of-experts layout: 12B active parameters, 52B total expert parameters.
  • Very long context support—claimed to 256K tokens for long-document tasks.
  • Published BF16 checkpoint with guidance for FP16/8-bit quantization workflows.
  • Designed as a base model for fine-tuning into instruct/chat variants.
  • Optimized kernels available (mamba-ssm, causal-conv1d) for reduced latency.

Example Usage

Example (python):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Recommended: transformers>=4.40.0 and mamba-ssm kernels installed for best performance.
model_id = "ai21labs/Jamba-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)  # add trust_remote_code=True on transformers < 4.40
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,            # checkpoint published in BF16
    device_map="auto",
    attn_implementation="flash_attention_2",  # optional, requires GPU support
)

prompt = "In the recent Super Bowl LVIII,"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs["input_ids"], max_new_tokens=120)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Pricing

AI21 publishes usage pricing for managed Jamba-family endpoints on AI21 Studio: Jamba Mini is listed at approximately $0.20 per 1M input tokens and $0.40 per 1M output tokens, and Jamba Large at approximately $2 per 1M input tokens and $8 per 1M output tokens. AI21 Studio also offers a $10 free trial credit (valid for three months) and custom enterprise plans. (See the AI21 pricing page.) ([ai21.com](https://www.ai21.com/pricing/))
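The per-token arithmetic above can be sketched as a small estimator. The rate table and the `estimate_cost` helper below are hypothetical names built from the approximate published prices; confirm current rates on the AI21 pricing page before relying on them:

```python
# Approximate published AI21 Studio rates, in USD per 1M tokens.
RATES = {
    "jamba-mini":  {"input": 0.20, "output": 0.40},
    "jamba-large": {"input": 2.00, "output": 8.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a given token volume."""
    r = RATES[model]
    return (input_tokens / 1e6) * r["input"] + (output_tokens / 1e6) * r["output"]

# Example: 2M input tokens + 0.5M output tokens on Jamba Mini
print(round(estimate_cost("jamba-mini", 2_000_000, 500_000), 2))  # → 0.6
```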

Benchmarks

  • HellaSwag: 87.1%
  • MMLU: 67.4%
  • GSM8K (Chain-of-Thought): 59.9%
  • BBH: 45.4%
  • TruthfulQA: 46.4%

Source: https://huggingface.co/ai21labs/Jamba-v0.1

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool