DeepSeek-V3.2-Exp - AI Language Models Tool

Overview

DeepSeek-V3.2-Exp is an experimental open-weight large language model from DeepSeek that explores fine-grained sparse attention to cut compute and memory costs on long-context tasks while preserving V3.1 (Terminus) quality. The release includes safetensors weights (FP8 / bfloat16 / float32), example scripts, Docker images, and a conversion/inference demo so researchers and engineers can run the model locally or via inference stacks. ([huggingface.co](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp?utm_source=openai))

The architecture changes ship alongside optimized runtime kernels and libraries (FlashMLA, DeepGEMM), and popular inference runtimes such as SGLang and vLLM offer day-0 support, simplifying adoption in production and research workflows. DeepSeek reports benchmark parity with V3.1 across reasoning and agentic tool-use suites while cutting per-token computation for long prompts through DeepSeek Sparse Attention (DSA). The model and most kernels are open-source under the MIT license, enabling inspection and custom integration. ([github.com](https://github.com/deepseek-ai/FlashMLA?utm_source=openai))
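To make the long-context savings concrete, the sketch below counts attended query-key pairs for dense attention (O(n²)) versus a top-k sparse scheme (O(n·k)) like DSA. The k value here is an illustrative budget, not a figure published by DeepSeek.

```python
# Back-of-envelope comparison of attended token pairs:
# dense attention scores every query against every key (O(n^2)),
# while a top-k selection scheme caps each query at k keys (O(n*k)).

def dense_pairs(n: int) -> int:
    """Query-key pairs scored by full attention over n tokens."""
    return n * n

def sparse_pairs(n: int, k: int) -> int:
    """Pairs scored when each query attends to at most k selected keys."""
    return n * min(k, n)

if __name__ == "__main__":
    n, k = 128_000, 2_048  # 128K-token context, hypothetical top-k budget
    print(f"dense:     {dense_pairs(n):,} pairs")
    print(f"sparse:    {sparse_pairs(n, k):,} pairs")
    print(f"reduction: {dense_pairs(n) / sparse_pairs(n, k):.1f}x")
```

At a 128K-token context even a generous per-query budget prunes the overwhelming majority of pairs, which is where the per-token compute savings for long prompts come from.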

Model Statistics

  • Downloads: 41,341
  • Likes: 963
  • Pipeline: text-generation

License: MIT

Model Details

Core innovation: DeepSeek Sparse Attention (DSA). DSA uses a token-level selection/indexer mechanism to compute attention only for content-relevant token pairs, reducing quadratic attention costs for very long contexts (the V3.2 family targets extended windows). DeepSeek released specialized attention kernels (FlashMLA) and FP8-aware matrix routines (DeepGEMM) to realize these gains in practice. ([huggingface.co](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp?utm_source=openai))

Formats and runtimes: the Hugging Face model card lists safetensors checkpoints in FP8/BF16/F32 formats plus example conversion scripts and an inference demo for users who need to convert checkpoints for the provided kernels. SGLang published day-0 support and Docker images; vLLM also provides day-0 recipes for running the model in streaming / high-throughput setups. The model card identifies the base checkpoint as deepseek-ai/DeepSeek-V3.2-Exp-Base; parameter counts are not published in the model card. For high-performance inference the project recommends GPUs compatible with the FlashMLA/DeepGEMM kernels (Hopper-class and supported SM architectures) and specific CUDA/PyTorch versions. ([huggingface.co](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp?utm_source=openai))

Community & maturity notes: the repo and issues show active community discussion and bug reports (e.g., RoPE/indexer implementation questions and decoding/logprobs issues). Expect experimental rough edges and active fixes if you plan production deployment. ([github.com](https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/issues?utm_source=openai))
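The selection-then-attend pattern can be sketched in miniature: a cheap scoring pass ranks past tokens for each query, and softmax attention is then computed only over the top-k selected positions. The shapes and scoring below are simplified stand-ins for illustration, not DeepSeek's actual indexer or kernel logic.

```python
# Toy illustration of token-level selection in the spirit of DSA:
# score past tokens cheaply, keep the top-k, attend only to those.
import math

def index_scores(query: list[float], keys: list[list[float]]) -> list[float]:
    """Cheap relevance scores: dot product of the query with each past key."""
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def select_top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring past tokens (the sparse attention set)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(query: list[float],
                     keys: list[list[float]],
                     values: list[list[float]],
                     k: int) -> list[float]:
    """Softmax attention restricted to the top-k selected positions."""
    picked = select_top_k(index_scores(query, keys), k)
    logits = [sum(q * c for q, c in zip(query, keys[i])) for i in picked]
    m = max(logits)  # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum((w / z) * values[i][d] for w, i in zip(weights, picked))
            for d in range(dim)]
```

In the real model the indexer runs as a lightweight learned component and the restricted attention is executed by the FlashMLA kernels; the point here is only the two-stage select-then-attend structure.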

Key Features

  • DeepSeek Sparse Attention (DSA): token-level selection to prune attention computation.
  • Long-context efficiency: optimized for extended windows (128K token workflows targeted).
  • Multi-format checkpoints: safetensors available in FP8, bfloat16 and float32.
  • Open-source kernels: FlashMLA attention kernels and DeepGEMM FP8 GEMM library released.
  • Day‑0 runtime support: SGLang and vLLM recipes for immediate deployment.
  • Conversion & demo scripts: example convert.py and interactive generate.py in repo.
  • MIT license: permissive license allowing commercial and research usage.
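Since SGLang and vLLM both expose an OpenAI-compatible HTTP API, a deployed instance can be exercised with a plain chat-completions request. The base URL and served model name below are deployment-specific placeholders; this is a minimal stdlib sketch, not an official client.

```python
# Minimal client for an OpenAI-compatible endpoint (e.g. one served by
# SGLang or vLLM). URL and model name are placeholders for your deployment.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Assemble a /v1/chat/completions payload (pure function, easy to test)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(base_url: str, payload: dict) -> str:
    """POST the payload and return the first completion's text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    payload = build_chat_request("deepseek-ai/DeepSeek-V3.2-Exp",
                                 "What is DeepSeek Sparse Attention?")
    print(json.dumps(payload, indent=2))
    # With a live server: print(chat("http://localhost:8000", payload))
```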

Example Usage

Example (python):

# Minimal example: load the Hugging Face repo and run a single prompt using transformers.
# NOTE: DeepSeek-V3.2-Exp ships specialized kernels and may require conversion or custom runtimes
# (see the model's inference/convert scripts and SGLang/vLLM recipes on the model page).
# Model page: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def simple_generate(prompt):
    # Use an appropriate dtype matching your environment (bfloat16/float32). FP8 requires special kernels.
    tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3.2-Exp")
    model = AutoModelForCausalLM.from_pretrained(
        "deepseek-ai/DeepSeek-V3.2-Exp",
        torch_dtype=torch.bfloat16,  # or torch.float32 depending on your setup
        device_map="auto",
        trust_remote_code=True
    )

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**{k: v.to(model.device) for k, v in inputs.items()}, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(simple_generate("Write a concise summary of DeepSeek Sparse Attention (DSA)."))

Benchmarks

Source: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp

  • MMLU-Pro (Reasoning): 85.0 (V3.1) | 85.0 (V3.2-Exp)
  • LiveCodeBench (Code): 74.9 (V3.1) | 74.1 (V3.2-Exp)
  • AIME 2025 (Math): 88.4 (V3.1) | 89.3 (V3.2-Exp)
  • Codeforces (Rating proxy): 2046 (V3.1) | 2121 (V3.2-Exp)
  • BrowseComp (Agentic tool use): 38.5 (V3.1) | 40.1 (V3.2-Exp)

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool