Seed-Coder - AI Language Models Tool
Overview
Seed-Coder is an open-source family of code-focused large language models (8B parameters) from ByteDance Seed, available in Base, Instruct, and Reasoning variants. The project emphasizes a model-centric pretraining pipeline in which LLMs score and filter code from GitHub, commits, and web sources, reducing manual rule-writing while producing high-quality code pretraining data. ([github.com](https://github.com/ByteDance-Seed/Seed-Coder)) The family targets code generation, completion, infilling (Fill-in-the-Middle), instruction following, and advanced algorithmic reasoning. The Base and Instruct checkpoints use a 32,768-token context length; the Reasoning variants extend to 65,536 tokens and are further RL-trained for improved multi-step reasoning. The models were released in May 2025, with safetensors/bf16 checkpoints and quantized variants published on Hugging Face for community use; the project is MIT-licensed. ([huggingface.co](https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base))
GitHub Statistics
- Stars: 722
- Forks: 54
- Contributors: 3
- License: MIT
- Last Updated: 2025-06-06T02:10:40Z
Key Features
- Model-centric data curation: LLMs score and filter GitHub, commits, and web code to build pretraining data.
- Three specialized 8B variants: Base (32K context), Instruct (32K context), and Reasoning (64K context) for different coding needs.
- Supports Fill-in-the-Middle (FIM) code infilling for completing function bodies and missing code segments (see the FIM sketch under Example Usage below).
- Pretrained on ~6 trillion tokens of curated code and web data for broad code coverage.
- Production-ready: Hugging Face safetensors checkpoints, GPTQ/AWQ quantizations, and vLLM/transformers support (a vLLM sketch follows the transformers example under Example Usage).
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the instruction-tuned Seed-Coder model (example from Hugging Face quickstart)
model_id = "ByteDance-Seed/Seed-Coder-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
prompt = "Write a quick sort algorithm in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[-1]:], skip_special_tokens=True)
print(response)
# Source: Hugging Face model page for Seed-Coder-8B-Instruct (quickstart examples)
# https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct
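A minimal Fill-in-the-Middle (FIM) sketch with the Base checkpoint is shown below. The sentinel token strings (<[fim-prefix]>, <[fim-suffix]>, <[fim-middle]>) and their ordering are assumptions based on common FIM conventions rather than values confirmed from the model card; check the Seed-Coder tokenizer's special tokens before relying on them.
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# NOTE: the FIM sentinel tokens below are assumed, not confirmed; verify them
# against the Seed-Coder-8B-Base tokenizer / model card before use.
model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
# Code with a missing middle: the model should fill in the partition step.
prefix = "def quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n    pivot = arr[0]\n    "
suffix = "\n    return quick_sort(left) + [pivot] + quick_sort(right)\n"
fim_prompt = f"<[fim-prefix]>{prefix}<[fim-suffix]>{suffix}<[fim-middle]>"
inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(completion)  # expected: the left/right partition lines that were left out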
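For serving, the checkpoints can also be loaded with vLLM. The sketch below is a minimal example assuming a single GPU with enough memory for the 8B bf16 weights; the sampling settings are illustrative, not recommended defaults.
Example (python):
from vllm import LLM, SamplingParams
# Load the Instruct checkpoint with vLLM (32K context for Base/Instruct).
llm = LLM(model="ByteDance-Seed/Seed-Coder-8B-Instruct", dtype="bfloat16", max_model_len=32768)
sampling = SamplingParams(temperature=0.2, top_p=0.95, max_tokens=256)
prompts = ["Write a quick sort algorithm in Python."]
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
# Note: raw prompts skip the chat template; recent vLLM releases also offer
# llm.chat() for message-style inputs.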
Benchmarks
- HumanEval (Seed-Coder-8B-Base): 77.4 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
- MBPP (Seed-Coder-8B-Base): 82.0 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
- MultiPL-E (Seed-Coder-8B-Base): 67.6 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
- Training tokens (pretraining): 6 trillion (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
- Context length (Reasoning variant): 65,536 tokens (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Reasoning-bf16)
Key Information
- Category: Language Models
- Type: AI Language Models Tool