Seed-Coder - AI Language Models Tool

Overview

Seed-Coder is an open-source family of code-focused large language models (8B parameters) from ByteDance Seed, released in Base, Instruct, and Reasoning variants. The project emphasizes a model-centric pretraining pipeline in which LLMs score and filter code from GitHub repositories, commits, and code-related web data, reducing manual rule-writing while producing high-quality code pretraining data. ([github.com](https://github.com/ByteDance-Seed/Seed-Coder)) The family targets code generation, completion, infilling (Fill-in-the-Middle), instruction following, and multi-step algorithmic reasoning. The Base and Instruct checkpoints use a 32,768-token context length; the Reasoning variant extends this to 65,536 tokens and is further trained with reinforcement learning to improve multi-step reasoning. Seed-Coder was released by ByteDance Seed in May 2025; the models (safetensors/bf16 checkpoints plus quantized variants) are published on Hugging Face for community use, and the project is MIT-licensed. ([huggingface.co](https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base))
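
The three variants and their context windows map onto the following Hugging Face checkpoints. This is a small orientation sketch only; the repository IDs are taken from the model pages cited elsewhere in this document (the Reasoning entry uses the bf16 checkpoint), so verify them against the ByteDance-Seed Hugging Face organization before use.

Example (python):

# Published 8B checkpoints and their context windows (a sketch; repo IDs are
# taken from the model pages cited in this document, verify before use).
SEED_CODER_VARIANTS = {
    "base":      {"model_id": "ByteDance-Seed/Seed-Coder-8B-Base",           "context": 32_768},
    "instruct":  {"model_id": "ByteDance-Seed/Seed-Coder-8B-Instruct",       "context": 32_768},
    "reasoning": {"model_id": "ByteDance-Seed/Seed-Coder-8B-Reasoning-bf16", "context": 65_536},
}

# Example: pick the instruction-tuned checkpoint for chat-style code generation.
print(SEED_CODER_VARIANTS["instruct"]["model_id"])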

GitHub Statistics

  • Stars: 722
  • Forks: 54
  • Contributors: 3
  • License: MIT
  • Last Updated: 2025-06-06T02:10:40Z

Key Features

  • Model-centric data curation: LLMs score and filter GitHub, commits, and web code to build pretraining data.
  • Three specialized 8B variants: Base (32K), Instruct (32K), and Reasoning (64K, i.e. 65,536 tokens) for different coding needs.
  • Supports Fill‑in‑the‑Middle (FIM) code infilling for completing function bodies and missing code segments.
  • Pretrained on ~6 trillion tokens of curated code and web data for broad code coverage.
  • Production-ready: Hugging Face/safetensors checkpoints, GPTQ/AWQ quantizations, and vLLM/transformers support (see the vLLM sketch after this list).
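
Because the checkpoints are published in standard Hugging Face format, they can be served with vLLM as well as transformers. The sketch below shows minimal offline batch inference; the dtype, max_model_len, and sampling settings are assumptions to adjust for your hardware, and a recent vLLM release is assumed.

Example (python):

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "ByteDance-Seed/Seed-Coder-8B-Instruct"

# Build a chat-formatted prompt with the model's own chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a binary search function in Python."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Offline batch inference; dtype and context length here are assumptions.
llm = LLM(model=model_id, dtype="bfloat16", max_model_len=32768, trust_remote_code=True)
outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)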

Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the instruction-tuned Seed-Coder model (example from Hugging Face quickstart)
model_id = "ByteDance-Seed/Seed-Coder-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "Write a quick sort algorithm in Python."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[-1]:], skip_special_tokens=True)
print(response)

# Source: Hugging Face model page for Seed-Coder-8B-Instruct (quickstart examples)
# https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Instruct
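
For Fill-in-the-Middle infilling, the Base checkpoint completes the middle of a file given its prefix and suffix. The sketch below shows only the general prefix/suffix assembly pattern; the sentinel token strings used here are placeholders, not Seed-Coder's actual special tokens, so replace them with the FIM tokens documented on the Seed-Coder-8B-Base model card before use.

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "ByteDance-Seed/Seed-Coder-8B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Placeholder sentinel tokens: substitute the real FIM tokens from the
# Seed-Coder-8B-Base model card / tokenizer special-tokens list.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

prefix = "def average(xs):\n    "
suffix = "\n    return total / len(xs)\n"
fim_prompt = f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

inputs = tokenizer(fim_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
# The tokens generated after the prompt are the proposed middle segment.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))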

Benchmarks

  • HumanEval (Seed-Coder-8B-Base): 77.4 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
  • MBPP (Seed-Coder-8B-Base): 82.0 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
  • MultiPL-E (Seed-Coder-8B-Base): 67.6 (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
  • Training tokens (pretraining): 6 trillion (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Base)
  • Context length (Reasoning variant): 65,536 tokens (Source: https://huggingface.co/ByteDance-Seed/Seed-Coder-8B-Reasoning-bf16)

Last Refreshed: 2026-01-09

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool