GLM-4.5 - AI Language Models Tool

Overview

GLM-4.5 is an open-weight Mixture-of-Experts (MoE) large language model family from Zhipu AI (Z.ai), designed for agentic applications that require long-context reasoning, tool use, and coding. The flagship GLM-4.5 uses a sparse MoE design with ~355 billion total parameters and ~32 billion active parameters per forward pass; a smaller, deployment-focused GLM-4.5-Air variant (~106B total, 12B active) is also provided. The series implements a hybrid "thinking / non-thinking" inference paradigm: thinking mode enables multi-step reasoning and tool/function calling, while non-thinking mode returns fast direct responses. Both modes support a 128k-token context window for long-horizon tasks. (Hugging Face model card; arXiv technical report.)

GLM-4.5 is released under the MIT license and ships BF16 and FP8 weights plus example integration code for Transformers, vLLM, and SGLang. The project emphasizes turnkey self-hosting (with recommended GPU counts for BF16/FP8 setups), RL-enhanced post-training for agentic behavior, and tool/function-calling interfaces compatible with OpenAI-style tool descriptors. The release has been covered by press and is integrated into several inference stacks and deployment docs for production use. (Hugging Face model card; GitHub repo; arXiv.)
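
The GPU-count recommendations follow largely from the raw weight footprint. The back-of-envelope sketch below is an illustration, not an official sizing guide: it assumes 2 bytes per parameter for BF16, 1 byte for FP8, and 80 GB GPUs, and it ignores KV cache and runtime overhead.

# Rough weight-memory estimate for GLM-4.5; illustrative only, not an official sizing guide.
# Assumes 2 bytes/param (BF16) and 1 byte/param (FP8); KV cache, activations, and
# framework overhead are ignored, so real deployments need extra headroom.
TOTAL_PARAMS = 355e9      # ~355B total parameters (all experts stay resident in memory)
GPU_MEMORY_GB = 80        # assumed 80 GB per GPU (H100/A100-class)

for precision, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    weight_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    min_gpus = -(-weight_gb // GPU_MEMORY_GB)   # ceiling division
    print(f"{precision}: ~{weight_gb:.0f} GB of weights -> at least {int(min_gpus)} GPUs, "
          f"before the 128k-context KV cache")

Published GPU-count recommendations are typically higher than this lower bound, because the 128k-token KV cache and serving buffers must also fit in GPU memory.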

Model Statistics

  • Downloads: 19,848
  • Likes: 1,389
  • Pipeline: text-generation
  • Parameters: 358.3B

License: MIT

Model Details

Architecture and core design: GLM-4.5 is a depth-heavy MoE transformer family that uses sparse expert layers to provide large model capacity while keeping per-token activated compute small (~32B active parameters). The design incorporates grouped-query attention (GQA) with partial RoPE, QK-Norm to stabilize attention logits, Multi-Token Prediction (MTP) layers for speculative decoding, and the Muon optimizer, which the report credits with faster convergence. The open technical report and project documentation describe multi-stage training on tens of trillions of tokens plus RL-based post-training to improve agentic behavior. (arXiv; lmsys.org; Hugging Face.)

Capacity and context: GLM-4.5 is reported at ~355B total parameters (32B active) with a 128k-token context window; GLM-4.5-Air is reported at ~106B total (12B active). The Hugging Face model card and deployment docs list hardware guidelines for BF16 and FP8 inference, including recommended GPU counts for full 128k-context usage. (Hugging Face; NVIDIA Megatron Bridge docs.)

Tooling and deployment: Z.ai provides ready-made examples and parsers for Transformers (trust_remote_code), vLLM, and SGLang. The vLLM and SGLang integrations include built-in reasoning and tool-call parsers as well as speculative-decoding configurations (EAGLE / MTP). The project repo and model card include Transformers quick-start snippets and vLLM / SGLang serve examples for both BF16 and FP8 variants. (Hugging Face; GitHub.)

Training and evaluation highlights: The public technical report documents multi-stage pretraining (reported ~23T tokens), mid-training on targeted code and reasoning corpora, and RL tuning. Evaluation on a 12-benchmark suite yields high marks on agentic and coding tasks, positioning GLM-4.5 among the top open models in those areas. (arXiv; Hugging Face.)
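
As a client-side illustration of the OpenAI-compatible serving path, the sketch below queries a locally launched vLLM or SGLang endpoint. The base URL, served model name, and the enable_thinking template kwarg are assumptions that depend on how the server was started; consult the model card's serve examples for the exact launch flags.

from openai import OpenAI

# Sketch: querying a local vLLM/SGLang server running GLM-4.5 in OpenAI-compatible mode.
# base_url, api_key, and the served model name below are deployment-specific assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GLM-4.5",  # must match the name the server was launched with
    messages=[{"role": "user", "content": "Summarize the key idea of speculative decoding."}],
    max_tokens=256,
    # Thinking mode is on by default; servers that forward chat-template kwargs can
    # switch it off per request (toggle name taken from the project docs, verify locally).
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)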

Key Features

  • Mixture-of-Experts (MoE): 355B total parameters with ~32B activated per forward pass.
  • Hybrid reasoning: explicit thinking mode for multi-step reasoning; non-thinking mode for fast replies.
  • 128k-token context window for long-horizon documents, codebases, and agent traces.
  • Native tool and function calling using OpenAI-style tool descriptors (see the sketch after this list).
  • BF16 and FP8 weight releases plus turnkey examples for Transformers, vLLM, and SGLang.
  • Speculative decoding & MTP layers to improve throughput during deployment.
  • Open-source MIT-compatible license enabling commercial use and self-hosting.
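
To make the tool-calling feature concrete, the sketch below renders an OpenAI-style tool descriptor into a GLM-4.5 prompt through the tokenizer's chat template. The get_weather tool is a made-up example, and passing tools= this way assumes a recent Transformers version whose chat-template handling forwards the schema; treat it as an illustration rather than the project's official recipe.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5", trust_remote_code=True)

# Hypothetical OpenAI-style tool descriptor (illustrative, not from the model card)
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

messages = [{"role": "user", "content": "What's the weather in Berlin right now?"}]

# Recent Transformers versions forward `tools=` to the chat template, which can then
# emit the tool descriptions ahead of the conversation.
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
print(prompt)

In thinking mode the model can then emit a structured tool call, which the reasoning and tool-call parsers shipped for vLLM and SGLang are meant to convert back into an OpenAI-style function-call object.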

Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Quick example adapted from the GLM-4.5 Hugging Face quickstart
# (BF16 weights; the FP8 repo "zai-org/GLM-4.5-FP8" is mainly intended for vLLM/SGLang serving)
model_id = "zai-org/GLM-4.5"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,            # adjust to your precision and hardware
    low_cpu_mem_usage=True,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()

messages = [{"role": "user", "content": "Explain Newton's second law in one paragraph."}]

# Non-thinking (fast direct reply): the chat template exposes an enable_thinking switch
# (the same flag vLLM/SGLang forward via chat_template_kwargs); verify the exact toggle
# against your tokenizer version.
inputs_nothink_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False, enable_thinking=False)
input_ids_nothink = tokenizer(inputs_nothink_text, return_tensors="pt").input_ids.to(model.device)
outputs_nothink = model.generate(input_ids_nothink, max_new_tokens=150)
print("Non-thinking response:\n", tokenizer.decode(outputs_nothink[0][len(input_ids_nothink[0]):], skip_special_tokens=True))

# Thinking (multi-step reasoning and tool use): this is the default mode, shown explicitly here
inputs_think_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False, enable_thinking=True)
input_ids_think = tokenizer(inputs_think_text, return_tensors="pt").input_ids.to(model.device)
outputs_think = model.generate(input_ids_think, max_new_tokens=300)
print("\nThinking response:\n", tokenizer.decode(outputs_think[0][len(input_ids_think[0]):], skip_special_tokens=True))

# Note: adapted from the GLM-4.5 Hugging Face model card and quickstart examples.
# See the model card for vLLM and SGLang server examples and production configs.

Benchmarks

TAU-bench (agentic tool use): 70.1% (Source: https://huggingface.co/zai-org/GLM-4.5-FP8)

AIME-24: 91.0% (Source: https://huggingface.co/zai-org/GLM-4.5-FP8)

SWE-bench Verified (coding): 64.2% (Source: https://huggingface.co/zai-org/GLM-4.5-FP8)

Aggregate score across the published 12-benchmark evaluation suite: 63.2 (ranked #3 overall among compared models) (Source: https://huggingface.co/zai-org/GLM-4.5-FP8)

Model parameters (reported): ≈355B total / 32B active (Source: https://arxiv.org/abs/2508.06471)

Last Refreshed: 2026-01-09

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool