Kimi-K2-Instruct - AI Language Models Tool

Overview

Kimi-K2-Instruct is Moonshot AI’s instruction-tuned variant of the Kimi K2 Mixture-of-Experts (MoE) family. It combines a 1-trillion-parameter MoE backbone with sparse activation (≈32B activated parameters per forward pass) and an extended long-context window to target chat, coding, math/reasoning, and agentic tool workflows. The release emphasizes practical deployment: block-FP8 checkpoints on Hugging Face, OpenAI/Anthropic-compatible APIs, and examples for chat completion and tool-calling.

Built and evaluated with a heavy focus on agentic use (autonomous tool invocation, multi-step program synthesis, and long-horizon reasoning), Kimi-K2-Instruct ships with a detailed model card and tech report describing architecture choices, training scale, and benchmark results. The model card also provides deployment recommendations (vLLM, SGLang, KTransformers, TensorRT-LLM) and sample code for chat and function/tool calling, making it suitable both for self-hosting and for using Moonshot’s hosted API. (Source: Hugging Face model card and Moonshot AI tech report.) ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Instruct))

Model Statistics

  • Downloads: 192,938
  • Likes: 2323
  • Pipeline: text-generation

License: other

Model Details

  • Architecture: Mixture-of-Experts (MoE) Transformer with 1T total parameters and sparse activation (~32B active per request). The published model summary lists 61 layers (1 dense), 384 experts with 8 selected per token, 64 attention heads, an attention hidden dimension of 7168, and a per-expert hidden dimension of 2048. Activation uses SwiGLU; the attention variant is listed as MLA. Vocabulary size is reported as ~160K tokens.
  • Context window: the instruct variant supports an ultra-long context; the official model card lists 128K tokens for the main K2-Instruct releases, and later K2 variants extend context further.
  • Training and optimizer: pretraining on large-scale corpora (the team reports ~15.5T training tokens) using the Muon/MuonClip optimizer family and the scale-stability techniques described in the project tech report.
  • Model artifacts: checkpoints are published in block-FP8 format on Hugging Face to enable efficient storage and faster inference on compatible runtimes.
  • Deployment and runtime compatibility: Moonshot provides an OpenAI/Anthropic-compatible API surface (including an Anthropic temperature mapping note) and recommends the inference engines vLLM, SGLang, KTransformers, and TensorRT-LLM for best performance. The model card includes chat and tool-calling examples and notes that the available tool list must be passed with each request so the model can autonomously decide when to invoke tools.

(Sources: Hugging Face model card; Moonshot tech report and docs.) ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Instruct))
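The sparse-activation figure can be sanity-checked with a back-of-envelope calculation from the published config. The sketch below counts only the routed-expert SwiGLU FFN parameters touched per token; attention (MLA), the dense layer, shared components, and embeddings make up the remainder toward the reported ~32B, and the layer/expert accounting here is an illustrative assumption, not an official breakdown.

```python
# Rough estimate of activated parameters per token from the published config.
HIDDEN = 7168            # attention hidden dimension
EXPERT_FF = 2048         # per-expert hidden (intermediate) dimension
N_LAYERS = 61            # total layers (1 dense)
MOE_LAYERS = N_LAYERS - 1  # assume all remaining layers are MoE
TOPK = 8                 # experts selected per token

# A SwiGLU FFN has three weight matrices: gate and up (HIDDEN x EXPERT_FF)
# plus down (EXPERT_FF x HIDDEN).
params_per_expert = 3 * HIDDEN * EXPERT_FF

# Routed-expert parameters touched per token across all MoE layers.
active_expert_params = MOE_LAYERS * TOPK * params_per_expert
print(f"{active_expert_params / 1e9:.1f}B routed-expert params per token")
# Attention, the dense layer, and embeddings account for the rest of the ~32B.
```

This lands at roughly 21B for the routed experts alone, consistent in scale with the ~32B activated-parameter figure once the non-expert components are added.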

Key Features

  • 1 trillion-parameter MoE backbone with sparse activation (~32B active per request).
  • Large expert pool: 384 experts, selecting 8 experts per token for efficient compute.
  • Ultra-long context support (officially 128K tokens for K2-Instruct variants; later variants extend further).
  • Instruction-tuned for chat and agentic tool use, with native function/tool-calling support.
  • Block-FP8 (block-fp8) checkpoints published on Hugging Face for efficient storage and inference.
  • OpenAI/Anthropic-compatible API surface and Anthropic temperature mapping guidance.
  • Recommended inference runtimes: vLLM, SGLang, KTransformers, and TensorRT-LLM.
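The Anthropic temperature mapping mentioned above can be sketched as a one-line transform. Per the model card's note, requests sent through the Anthropic-compatible endpoint have their temperature scaled before reaching the model; the exact factor and function name here follow the card's description and are otherwise illustrative.

```python
def anthropic_to_real_temperature(request_temperature: float) -> float:
    """Map an Anthropic-style request temperature to the effective model
    temperature (per the model card: real = request * 0.6)."""
    return request_temperature * 0.6

# An Anthropic-style temperature of 1.0 becomes an effective 0.6.
print(anthropic_to_real_temperature(1.0))
```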

Example Usage

Example (python):

from openai import OpenAI
import json

# Example: simple chat + tool-calling loop (OpenAI-compatible client)
# Adapt the client import to your OpenAI-compatible SDK if needed.

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://api.moonshot.ai/v1",  # or your self-hosted OpenAI-compatible endpoint
)
model_name = "moonshotai/Kimi-K2-Instruct"  # the hosted API may use a different model ID; check the platform docs

# Simple chat completion example
messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
    {"role": "user", "content": "Please give a brief self-introduction."}
]
resp = client.chat.completions.create(
    model=model_name,
    messages=messages,
    temperature=0.6,
    max_tokens=256
)
print(resp.choices[0].message.content)

# Tool-calling example (local tool mapping)
def get_weather(city: str) -> dict:
    return {"weather": "Sunny", "city": city}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information.",
        "parameters": {"type": "object", "required": ["city"], "properties": {"city": {"type": "string"}}}
    }
}]

messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant."},
    {"role": "user", "content": "What's the weather in Beijing? Use the tool to check."}
]
finish_reason = None
while finish_reason is None or finish_reason == "tool_calls":
    completion = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=0.6,
        tools=tools,
        tool_choice="auto"
    )
    choice = completion.choices[0]
    finish_reason = choice.finish_reason
    if finish_reason == "tool_calls":
        messages.append(choice.message)
        for tool_call in choice.message.tool_calls:
            args = json.loads(tool_call.function.arguments)
            result = get_weather(**args)
            messages.append({"role": "tool", "tool_call_id": tool_call.id, "name": tool_call.function.name, "content": json.dumps(result)})

print("--- final response ---")
print(choice.message.content)

# Example adapted from the model card and deployment guide on Hugging Face. (See model card for streaming and advanced patterns.)

Pricing

Moonshot provides a hosted, OpenAI-compatible API for Kimi models and has announced pricing updates on its platform. Moonshot’s blog notes reduced input pricing and tier/rate-limit changes, but per-token rates depend on plan and provider. Third-party aggregators and marketplaces list representative rates (e.g., several vendors list input at roughly $0.10–$0.60 per 1M tokens and output at roughly $2.50–$3.00 per 1M tokens), and some platform-specific subscription tiers appear on community pages. Because published rates vary by region and provider and change over time, consult the Moonshot platform docs or your chosen inference provider for exact, up-to-date billing details. (See the Moonshot platform announcement and representative aggregator listings.) ([platform.moonshot.ai](https://platform.moonshot.ai/blog/posts/Kimi_API_Newsletter))
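Per-request cost under per-1M-token billing is simple arithmetic; the sketch below uses placeholder rates drawn from the aggregator ranges above, not official Moonshot pricing.

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      input_rate_per_m: float = 0.60,
                      output_rate_per_m: float = 2.50) -> float:
    """Estimate request cost given per-1M-token rates (illustrative defaults)."""
    return (input_tokens / 1e6) * input_rate_per_m \
         + (output_tokens / 1e6) * output_rate_per_m

# e.g. a 120K-token prompt with a 2K-token completion at the default rates:
print(f"${estimate_cost_usd(120_000, 2_000):.4f}")
```

Long-context prompts dominate the bill at these rate shapes, which is why input pricing changes matter most for 128K-context workloads.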

Benchmarks

  • LiveCodeBench v6 (Pass@1): 53.7%
  • SWE-bench Verified (agentic coding, single-attempt Pass@1): 65.8%
  • MMLU (Exact Match): 89.5
  • AIME 2024 (Avg@64): 69.6
  • Downloads (last month, Hugging Face): 192,938

(Source: https://huggingface.co/moonshotai/Kimi-K2-Instruct)

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool