Kimi-K2-Thinking - AI Language Models Tool
Overview
Kimi-K2-Thinking is Moonshot AI’s “thinking” variant of the Kimi K2 family: a 1-trillion-parameter Mixture-of-Experts (MoE) model that activates ~32B parameters per token and is explicitly trained to interleave step-by-step reasoning with native tool calls. The model exposes a 256k-token context window and ships with native INT4 weights produced via quantization-aware training, enabling lower-memory, faster inference while retaining the model’s reported benchmark performance. ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Thinking))

Moonshot positions K2 Thinking for long-horizon, agentic workflows, claiming stable, coherent behavior across roughly 200–300 sequential tool invocations and strong results on multi-step reasoning and browsing benchmarks (e.g., Humanity’s Last Exam and BrowseComp). The release (Nov 6–8, 2025) drew active community attention (Hugging Face, Reddit threads, and industry write-ups), and Moonshot provides an OpenAI/Anthropic-compatible API as well as deployment recipes for vLLM, KTransformers, and similar inference engines. Users and early reviewers highlight the model’s long context, INT4 QAT support, and agentic stability; discussions and examples are available on the model card and community forums. ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Thinking))
Model Statistics
- Downloads: 253,995
- Likes: 1,673
- Pipeline: text-generation
- License: other
Model Details
Architecture and key technical specs: Kimi-K2-Thinking is a MoE transformer with a reported total parameter count of 1T and ~32B activated parameters per forward pass. The published model card lists 61 layers (including one dense layer), 384 experts, 8 experts selected per token, 64 attention heads, an attention hidden dimension of 7168, an MoE expert hidden dimension of 2048, a 160K-token vocabulary, and Multi-head Latent Attention (MLA) with SwiGLU activations. The model is released with a 256k-token context window. ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Thinking))

Quantization & checkpoints: K2 Thinking uses Quantization-Aware Training (QAT) to provide native INT4 weight-only inference for MoE components; Moonshot reports roughly 2x generation-speed improvements in low-latency INT4 mode while reporting benchmark scores under INT4 precision. Checkpoints are published in compressed-tensors format with guidance on unpacking to higher precisions (FP8/BF16) if needed. Recommended inference stacks include vLLM, SGLang, and KTransformers; API access is offered on platform.moonshot.ai. ([huggingface.co](https://huggingface.co/moonshotai/Kimi-K2-Thinking))

Training & tooling notes: Moonshot states K2-family models were trained at large scale (public commentary and company materials reference multi-trillion-token pretraining for the K2 series), and K2 Thinking adds targeted training for test-time scaling, self-critique rewards, and synthetic agentic-trajectory generation to harden long-horizon tool use. The model card and Moonshot docs include chat and tool-calling examples and deployment guidance. ([kimi.com](https://www.kimi.com/share/19b4fbe1-01a2-8675-8000-000000edce97?utm_source=openai))
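To make the INT4 claim concrete, here is a rough back-of-envelope estimate of weight memory at different bit widths. This is an illustration only: it assumes weight-only quantization of essentially all 1T parameters and ignores activations, KV cache, quantization scales/zero-points, and any layers kept at higher precision.

```python
# Back-of-envelope weight-memory estimate (rough; ignores activations,
# KV cache, per-tensor scales/zero-points, and non-quantized layers).
TOTAL_PARAMS = 1.0e12  # ~1T total parameters, per the model card

def weight_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB for a given per-weight bit width."""
    return params * bits_per_weight / 8 / 1e9

bf16_gb = weight_gb(TOTAL_PARAMS, 16)  # ~2000 GB
int4_gb = weight_gb(TOTAL_PARAMS, 4)   # ~500 GB

print(f"BF16 weights: ~{bf16_gb:,.0f} GB")
print(f"INT4 weights: ~{int4_gb:,.0f} GB ({bf16_gb / int4_gb:.0f}x smaller)")
```

The ~4x reduction in weight footprint is what makes single-node serving of a 1T-parameter checkpoint plausible; the separately reported ~2x generation-speed gain comes from reduced memory bandwidth, not shown here.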
Key Features
- Native INT4 inference via Quantization‑Aware Training for ~2x low‑latency generation.
- 256k token context window for long‑horizon reasoning and document workflows.
- MoE architecture: 1T total params with ~32B active parameters per token.
- Stable agentic tool orchestration for ~200–300 consecutive function calls.
- Open weights and modified‑MIT license with compressed‑tensors checkpoints.
- Recommended inference engines: vLLM, SGLang, KTransformers (deployment guides provided).
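The listed specs can be sanity-checked against the headline 1T-total / ~32B-active figures with simple arithmetic. The sketch below assumes each SwiGLU expert has three weight matrices (gate, up, down) of shape 7168 × 2048 and does not count attention/MLA projections or any shared experts, so it is an order-of-magnitude consistency check, not an exact parameter count.

```python
# Rough consistency check of the published specs against the headline
# 1T-total / ~32B-active parameter figures. Assumption: each SwiGLU
# expert = 3 weight matrices of shape HIDDEN x EXPERT_HIDDEN; attention,
# MLA projections, and shared experts are not counted.
HIDDEN = 7168          # attention hidden dimension
EXPERT_HIDDEN = 2048   # MoE expert hidden dimension
EXPERTS = 384          # experts per MoE layer
ACTIVE_EXPERTS = 8     # experts selected per token
MOE_LAYERS = 60        # 61 layers total, one of them dense
VOCAB = 160_000

params_per_expert = 3 * HIDDEN * EXPERT_HIDDEN           # ~44M per expert
total_expert = EXPERTS * MOE_LAYERS * params_per_expert  # ~1.01e12
active_expert = ACTIVE_EXPERTS * MOE_LAYERS * params_per_expert  # ~21e9
embeddings = VOCAB * HIDDEN                              # ~1.1e9

print(f"total expert params:  ~{total_expert / 1e12:.2f}T")
print(f"active expert params: ~{active_expert / 1e9:.1f}B")
print(f"embedding params:     ~{embeddings / 1e9:.1f}B")
# Attention/MLA weights and any shared-expert parameters account for the
# remaining gap up to the reported ~32B activated parameters.
```

The expert weights alone land at roughly 1.01T total and ~21B active, consistent with the reported totals once attention and other dense components are added.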
Example Usage
Example (python):

import openai

def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=1.0,
        max_tokens=4096,
    )
    print(f"k2 answer: {response.choices[0].message.content}")
    print("=====below is reasoning content======")
    # K2 Thinking exposes its reasoning trace as `reasoning_content` on the message object
    print(f"reasoning content: {response.choices[0].message.reasoning_content}")

# Example adapted from the model card; run against a local or platform.moonshot.ai endpoint.
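The chat example extends naturally to the agentic tool use K2 Thinking is trained for. Below is a hedged sketch of a standard OpenAI-style tool-calling loop: the `get_weather` tool, its stub implementation, and the `agent_loop` helper are illustrative inventions, not part of the model card; the schema format and the feed-results-back-until-done pattern follow the generic OpenAI-compatible API that Moonshot exposes.

```python
import json

# Hypothetical tool for illustration; real deployments would register
# their own functions using this same OpenAI-style schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> dict:
    # Stubbed result; a real tool would call a weather API here.
    return {"city": city, "weather": "sunny"}

def dispatch(name: str, arguments: str) -> str:
    """Execute one tool call and serialize its result for the model."""
    registry = {"get_weather": get_weather}
    return json.dumps(registry[name](**json.loads(arguments)))

def agent_loop(client, model_name: str, user_msg: str) -> str:
    """Feed tool results back until the model stops requesting tools.

    `client` is any OpenAI-compatible client pointed at a K2 endpoint.
    """
    messages = [{"role": "user", "content": user_msg}]
    while True:
        resp = client.chat.completions.create(
            model=model_name, messages=messages, tools=TOOLS, temperature=1.0
        )
        choice = resp.choices[0]
        messages.append(choice.message)
        if choice.finish_reason != "tool_calls":
            return choice.message.content
        for call in choice.message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": dispatch(call.function.name, call.function.arguments),
            })
```

Moonshot's claim of stable behavior over ~200–300 sequential tool invocations refers to exactly this kind of loop running for many iterations; production code would add a step cap and error handling around `dispatch`.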
Pricing
Moonshot announced updated Kimi API pricing alongside the new kimi-k2-thinking and kimi-k2-thinking-turbo models (Nov 8, 2025), stating that input costs were reduced by up to 75%. Exact per-token or tiered rates are not published in full on the public blog; for precise, up-to-date billing rates and rate limits, consult platform.moonshot.ai/pricing or a platform account. ([platform.moonshot.ai](https://platform.moonshot.ai/blog/posts/Kimi_API_Newsletter?utm_source=openai))
Benchmarks
- Humanity's Last Exam (HLE), text-only (no tools): 23.9 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- Humanity's Last Exam (HLE), with tools: 44.9 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- HLE, heavy setting: 51.0 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- BrowseComp (with tools): 60.2 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- AIME25 (with Python/tooling): 99.1 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- MMLU-Pro (no tools): 84.6 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
- SWE-bench Verified (coding, with tools): 71.3 (reported) (Source: https://huggingface.co/moonshotai/Kimi-K2-Thinking)
Key Information
- Category: Language Models
- Type: AI Language Models Tool