MiniMax-M2 - AI Language Models Tool
Overview
MiniMax-M2 is an open‑weight Mixture‑of‑Experts (MoE) language model designed for coding and agentic tool workflows. The model exposes ≈230B total parameters but activates only roughly 10B parameters per token, pairing a large parameter budget with low per‑request activation cost and responsive inference suited to compile/test loops, multi‑file edits, and long‑horizon planning. The Hugging Face release includes safetensors weights, multiple quantizations, and day‑0 deployment guides for Transformers, vLLM, SGLang, and MLX, enabling both local hosting and cloud deployment. (Source: Hugging Face model page and MiniMax release notes.)

MiniMax‑M2 emphasizes practical agent capabilities: built‑in tool/function calling, an “interleaved thinking” format (assistant reasoning wrapped in <think>...</think>), and long context handling (advertised 128k‑token context) to keep large codebases, terminal sessions, and multi‑step web‑browse chains in memory. The developers also publish benchmark results showing strong performance on coding and agentic suites (SWE‑bench, Terminal‑Bench, BrowseComp) and provide a hosted MiniMax Agent and an Open Platform API (free for a limited time as of the release notes). (Sources: Hugging Face, minimax.io.)
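As a rough illustration of the serving side, the sketch below queries a self-hosted MiniMax-M2 instance through an OpenAI-compatible chat endpoint of the kind vLLM or SGLang can expose; the base URL, port, API key, and exact model identifier are illustrative assumptions rather than values taken from the official guides.

Example (python):
# Minimal sketch: chatting with a locally served MiniMax-M2 instance through an
# OpenAI-compatible endpoint (e.g. one started with vLLM or SGLang).
# The base URL, port, and API key below are placeholders, not official values.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[
        {"role": "user", "content": "Summarize what a Mixture-of-Experts model is."}
    ],
    max_tokens=256,
)

# The reply may begin with an interleaved <think>...</think> block before the answer.
print(response.choices[0].message.content)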
Model Statistics
- Downloads: 132,755
- Likes: 1446
- Pipeline: text-generation
- Parameters: 228.7B
- License: other
Model Details
Architecture & parameters: MiniMax‑M2 is a Mixture‑of‑Experts (MoE) causal language model with an advertised ~230B total parameters and ~10B active parameters per request, which reduces activation memory and latency while preserving large model capacity for expert routing and specialization. The model is distributed as safetensors in multiple tensor formats (F32, BF16, and an FP8 variant, F8_E4M3), and prebuilt quantizations and merges are available in the Hugging Face repo. (Source: Hugging Face model page.)

Context, thinking & tool use: The project advertises a 128k‑token context window for deep multi‑file code work and long agent traces. MiniMax‑M2 implements an “interleaved thinking” output style: assistant internal reasoning is wrapped in <think>...</think> tags, and the authors explicitly instruct retaining prior thinking tokens in conversation history for best results. The model also provides built‑in tool/function calling parsers and guidance for automated tool selection in vLLM and other servers. (Source: Hugging Face docs and vLLM recipe.)

Deployment & integrations: Official docs and guides provide day‑0 support for Transformers (with trust_remote_code=True examples), vLLM (recipes and recommended multi‑GPU setups), SGLang, and MLX. The Transformers deployment notes include a sample load‑and‑generate snippet and system requirements (Linux, Python 3.9–3.12, ~220 GB GPU memory for the weights when split across devices on a single host). The vLLM recipes show recommended launch commands and the flags needed to enable the model's tool parsers and reasoning parser. (Sources: Hugging Face Transformers guide, vLLM recipes.)
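Because the docs ask for prior thinking to be retained, a multi-turn conversation history looks roughly like the sketch below; the task and reply text are illustrative placeholders, not taken from the official examples.

Example (python):
# Minimal sketch of retaining interleaved thinking across turns, as the model
# card recommends. The reply text shown here is illustrative only.
messages = [
    {"role": "user", "content": "Refactor utils.py to remove the duplicate parser."},
    {
        "role": "assistant",
        # Keep the full assistant turn, including the <think>...</think> block,
        # rather than stripping it before the next request.
        "content": "<think>The duplicate parser lives in parse_args and parse_cli; "
                   "merge them and update the call sites.</think>"
                   "I merged the two parsers into parse_cli and updated both call sites.",
    },
    {"role": "user", "content": "Now add a unit test for the merged parser."},
]
# Pass `messages` unmodified to the next generation call so the model can see
# its earlier reasoning.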
Key Features
- Mixture‑of‑Experts architecture with ≈230B total and ~10B active parameters.
- Designed for coding workflows: multi‑file edits, compile‑run‑fix loops, test‑validated repairs.
- Agentic tool/function calling with parsers for automated tool selection and execution (see the sketch after this list).
- Interleaved thinking format: assistant reasoning wrapped in <think>...</think> for planning.
- Long context (advertised 128k tokens) for large codebases and long agent traces.
- Day‑0 deployment guides for Transformers, vLLM, SGLang, and MLX; safetensors and quantizations provided.
- Published benchmark results showing strong coding and agent performance (SWE‑bench, Terminal‑Bench, BrowseComp).
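To make the tool-calling feature concrete, the sketch below passes a single function tool to a served MiniMax-M2 instance through an OpenAI-compatible API. The endpoint, the get_file_contents schema, and the assumption that the server's tool parser is enabled are all illustrative; the exact launch flags are described in the vLLM/SGLang recipes.

Example (python):
# Hypothetical sketch of tool/function calling against an OpenAI-compatible
# MiniMax-M2 server. The endpoint and the `get_file_contents` tool schema are
# illustrative assumptions, not part of the official release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_file_contents",
            "description": "Return the contents of a file in the repository.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2",
    messages=[{"role": "user", "content": "What does setup.py import?"}],
    tools=tools,
)

# If the server's tool parser is enabled, tool calls arrive as structured fields.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)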
Example Usage
Example (python):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Example usage based on the model's Transformers deployment guide.
MODEL_ID = "MiniMaxAI/MiniMax-M2"

def run_example():
    # note: trust_remote_code=True is required per the model's README
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # sample chat-style messages (the model uses a chat template helper)
    messages = [
        {"role": "user", "content": [{"type": "text", "text": "Write a short Python function to reverse a string."}]}
    ]
    # apply the model's chat template (as shown in the official guide) and move tensors to GPU
    model_inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
    # generate; adjust max_new_tokens and sampling params to taste
    generated_ids = model.generate(model_inputs, max_new_tokens=200)
    response = tokenizer.batch_decode(generated_ids)[0]
    print(response)

if __name__ == "__main__":
    run_example()

# See the Hugging Face Transformers deployment guide for MiniMax-M2 for details and required flags.
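The decoded text includes the chat template and any interleaved <think>...</think> reasoning. Per the model card, that full text should stay in the conversation history, but when only the final answer needs to be shown to a user, a small post-processing step like the one below can help; the strip_thinking helper is an illustrative assumption, not part of the official guide.

Example (python):
import re

# Illustrative helper (an assumption, not from the official guide): hide the
# interleaved reasoning when displaying output, while keeping the raw text
# unmodified for the conversation history.
def strip_thinking(text: str) -> str:
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Plan: use slicing.</think>def reverse_string(s):\n    return s[::-1]"
print(strip_thinking(raw))  # prints only the final answer, without the <think> block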
Benchmarks
- AA Intelligence (composite): 61 (Artificial Analysis composite score) (Source: https://huggingface.co/MiniMaxAI/MiniMax-M2)
- LiveCodeBench (LCB): 83 (Source: https://huggingface.co/MiniMaxAI/MiniMax-M2)
- SWE‑bench Verified: 69.4 (Source: https://huggingface.co/MiniMaxAI/MiniMax-M2)
- Terminal‑Bench: 46.3 (Source: https://huggingface.co/MiniMaxAI/MiniMax-M2)
- BrowseComp: 44 (Source: https://huggingface.co/MiniMaxAI/MiniMax-M2)
Key Information
- Category: Language Models
- Type: AI Language Models Tool