DeepSeek-MoE - AI Language Models Tool
Overview
DeepSeek-MoE 16B is an open research Mixture-of-Experts (MoE) causal language model released by DeepSeek. The architecture combines fine-grained expert segmentation with shared-expert isolation to concentrate specialization in its sparse experts, letting the 16.4B-parameter model match or exceed several 7B dense baselines while activating only a small fraction of its parameters per token. According to the project repository, the model was trained from scratch on a large bilingual corpus (details in the README) and is published as both a base causal LM and a chat-tuned variant, each with a 4096-token context length. The authors report that the MoE design needs roughly 40% of the computation of comparable dense models to reach similar benchmark performance. (Source: GitHub repository and Hugging Face model pages.)
The release includes ready-to-run checkpoints on Hugging Face, inference examples built on Hugging Face Transformers, and fine-tuning scripts supporting DeepSpeed, LoRA, and QLoRA workflows. Checkpoints are provided in BF16 and, per the repository, can be served for inference on a single 40GB GPU without quantization. The code is distributed under an MIT license, with a separate model license that permits commercial use (see the repository for the license text). For exact usage, deployment constraints, and experiment logs, refer to the project README and the Hugging Face model cards linked in the sources below.
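As a rough sanity check on the single-40GB-GPU claim (an estimate, not a figure from the repository), the BF16 weights alone occupy about 16.4B parameters × 2 bytes ≈ 33 GB, leaving headroom for activations and the KV cache:
Sketch (python):
# Back-of-the-envelope weight memory for the BF16 checkpoint (2 bytes per parameter).
total_params = 16.4e9
bytes_per_param = 2  # bfloat16
print(f"~{total_params * bytes_per_param / 1e9:.1f} GB of weights")  # ~32.8 GB, under 40 GB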
GitHub Statistics
- Stars: 1,879
- Forks: 299
- Contributors: 1
- License: MIT
- Primary Language: Python
- Last Updated: 2024-01-16T12:17:59Z
Key Features
- Mixture-of-Experts architecture with fine-grained expert segmentation and shared-experts isolation.
- 16.4B total parameters delivered as a sparsely-activated MoE for compute efficiency.
- Both base and chat variants provided with 4096-token context length.
- Hugging Face checkpoints in BF16; inference via Transformers using device_map="auto".
- Fine-tuning recipes included: DeepSpeed ZeRO, LoRA, and QLoRA support for parameter-efficient tuning (a minimal LoRA sketch follows this list).
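The repository ships its own fine-tuning entry point; the sketch below is not that script, but a minimal illustration of what a parameter-efficient LoRA setup can look like with the Hugging Face transformers and peft libraries. The target_modules names, rank, and other hyperparameters are assumptions for illustration, not values taken from the repository.
Sketch (python):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "deepseek-ai/deepseek-moe-16b-base"

# Load the BF16 checkpoint; trust_remote_code pulls in the model's custom MoE code.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Attach low-rank adapters to the attention projections. The module names below
# are a common choice for LLaMA-style blocks and are an assumption here; check
# the checkpoint's actual module names before training.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# From here, hand `model` to your usual Trainer / DeepSpeed training loop,
# following the repository's fine-tuning instructions.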
Example Usage
Example (python):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
model_name = "deepseek-ai/deepseek-moe-16b-base"
# Load tokenizer and model (BF16 weights) — device_map="auto" will place layers on available devices
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
# Load generation config from the hub (model card provides recommended defaults)
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id
prompt = "Explain the difference between attention and self-attention in transformers."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
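The chat checkpoint follows the same loading pattern. The sketch below is an illustration rather than a copy of the model card example; it assumes the chat variant ships a chat template and uses Transformers' apply_chat_template helper to format the conversation.
Sketch (python):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

chat_model_name = "deepseek-ai/deepseek-moe-16b-chat"

tokenizer = AutoTokenizer.from_pretrained(chat_model_name)
model = AutoModelForCausalLM.from_pretrained(
    chat_model_name, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.generation_config = GenerationConfig.from_pretrained(chat_model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

# Format the conversation with the tokenizer's chat template (assumed to be
# bundled with the chat checkpoint) and generate a reply.
messages = [{"role": "user", "content": "Summarize what a Mixture-of-Experts layer does."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids.to(model.device), max_new_tokens=200)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))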
Benchmarks
- Total parameters: 16.4B (Source: https://github.com/deepseek-ai/DeepSeek-MoE)
- Context length: 4096 tokens (Source: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base)
- Reported compute vs. dense baselines: ≈40% of the computation for comparable performance (Source: https://github.com/deepseek-ai/DeepSeek-MoE)
- Downloads (last month), base checkpoint: 13,768 as shown on the model page (Source: https://huggingface.co/deepseek-ai/deepseek-moe-16b-base)
- Downloads (last month), chat checkpoint: 5,081 as shown on the model page (Source: https://huggingface.co/deepseek-ai/deepseek-moe-16b-chat)
Key Information
- Category: Language Models
- Type: AI Language Models Tool