GLM-4.5 - AI Language Models Tool
Overview
GLM-4.5 is an open-weight, Mixture-of-Experts (MoE) large language model series from Z.ai (Zhipu AI), designed for agentic applications that combine reasoning, coding, and tool-enabled workflows. The family includes GLM-4.5 (reported in the technical report as 355B total parameters with ~32B active per forward pass) and the compact GLM-4.5‑Air (≈106B total / 12B active). Both models expose a hybrid "thinking" mode (multi-step chain-of-thought and tool use) and a "non-thinking" mode (low-latency replies), and are released with open weights and model artifacts for local deployment. ([arxiv.org](https://arxiv.org/abs/2508.06471))

GLM-4.5 targets long-context agent workflows: it supports a 128k-token context window, native/OpenAI-style tool and function calling, BF16 and FP8 weight releases, and turnkey inference integrations with Transformers, vLLM, and SGLang (a production-focused inference stack). The model was trained and post-processed with multi-stage pipelines (large-token pretraining plus RL/expert iteration) and shows competitive results across agentic, reasoning, and coding benchmarks in the GLM technical evaluation. ([docs.z.ai](https://docs.z.ai/guides/llm/glm-4.5))
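The thinking/non-thinking split is typically toggled in the API request itself. Below is a minimal sketch of an OpenAI-style chat-completion payload with such a toggle; the `thinking` field name and its `"enabled"`/`"disabled"` values are assumptions modeled on common OpenAI-compatible servers, so check the Z.ai API docs for the exact parameter your endpoint expects.

```python
import json

def build_chat_request(prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat-completion payload for GLM-4.5.

    NOTE: the `thinking` field is a sketch of a hybrid-mode toggle,
    not a confirmed Z.ai parameter name; verify against the API docs.
    """
    return {
        "model": "glm-4.5",
        "messages": [{"role": "user", "content": prompt}],
        # Hypothetical toggle: "enabled" = multi-step CoT/tool use,
        # "disabled" = low-latency direct reply.
        "thinking": {"type": "enabled" if thinking else "disabled"},
        "stream": False,
    }

payload = build_chat_request("Summarize this repo's build steps.", thinking=False)
print(json.dumps(payload, indent=2))
```

The same payload shape works for both modes, which makes it easy to fall back to the low-latency path for simple queries.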
Model Statistics
- Downloads: 33,921
- Likes: 1,397
- Pipeline: text-generation
- Parameters: 358.3B
- License: mit
Model Details
Architecture and scaling: GLM-4.5 is built as a Mixture‑of‑Experts (MoE) LLM that exposes a large total parameter capacity while activating a much smaller subset per inference (published paper: 355B total / 32B active; Hugging Face displays a model size of ~358B depending on packaging). The series includes GLM-4.5-Air (106B total / 12B active) for cost-conscious deployments. The model uses architectural and deployment optimizations (MTP/speculative decoding, FP8 weight formats, and SGLang/vLLM integration) to reduce inference cost while preserving long-context reasoning. ([arxiv.org](https://arxiv.org/abs/2508.06471))

Capabilities & training: According to the authors, GLM-4.5 was trained with a multi-stage pipeline (large-scale pretraining plus targeted post-training and RLHF/expert iteration) on tens of trillions of tokens (the paper notes multi-stage training on ~23T tokens) to improve agentic, reasoning, and coding performance. The model supports tool/function calling (OpenAI-style descriptors), structured JSON outputs, streaming, context caching, and toggled "thinking" modes for hybrid reasoning workflows. ([arxiv.org](https://arxiv.org/abs/2508.06471))

Deployment & inference: Day-one or near-day-one support exists for the major inference stacks: Transformers (trust_remote_code quickstarts and CLI helpers), vLLM (model-serving flags including --tool-call-parser glm45 and --reasoning-parser glm45), and SGLang (sglang launch_server with speculative/EAGLE configs). Hardware guidance and recommended FP8/BF16 GPU counts for full 128K-context runs are documented on the project pages. ([huggingface.co](https://huggingface.co/zai-org/GLM-4.5))

Licensing note: Hugging Face model pages list the model weights under an MIT-compatible license, while related code repositories in some places use Apache-2.0 for tools/code; check the specific artifact's license before redistribution. ([huggingface.co](https://huggingface.co/zai-org/GLM-4.5))
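To make the tool-calling flow concrete, here is a minimal sketch of an OpenAI-style function descriptor plus a dispatcher for the structured tool call a model would return. The `get_weather` tool, its schema, and the canned response are hypothetical illustrations, not part of the GLM-4.5 release.

```python
import json

# OpenAI-style tool descriptor of the kind GLM-4.5's function-calling
# interface accepts (the get_weather tool itself is a made-up example).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a structured tool call of the common OpenAI shape:
    {"name": ..., "arguments": "<JSON string>"}."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "get_weather":
        # Stub: a real agent would call a weather API here.
        return f"Weather in {args['city']}: sunny, 21 C"
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulated model output in OpenAI tool-call format.
result = dispatch_tool_call({"name": "get_weather", "arguments": '{"city": "Berlin"}'})
print(result)
```

In a real agent loop, the returned string would be appended to the conversation as a tool message and the model called again to produce the final answer.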
Key Features
- Hybrid reasoning: explicit thinking (multi-step CoT/tool use) and non-thinking (low-latency) modes.
- Mixture‑of‑Experts scaling: very large total capacity with a smaller active parameter set per pass.
- 128k token context window for long documents, codebases, and agent traces.
- Native tool/function calling using OpenAI‑style tool descriptors and structured JSON outputs.
- FP8 and BF16 weight releases plus turnkey examples for Transformers, vLLM and SGLang.
- Speculative decoding and MTP layers to improve throughput at inference time.
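As a rough aid for planning around the 128k-token window, the sketch below estimates whether a document fits in context using a chars-per-token heuristic. The 4-chars-per-token ratio is an assumption, not GLM-4.5's real tokenizer behavior; use the actual tokenizer when accuracy matters.

```python
CONTEXT_WINDOW = 128_000   # GLM-4.5 context length in tokens
CHARS_PER_TOKEN = 4        # rough heuristic, NOT the real tokenizer ratio

def fits_in_context(text: str, reserved_output_tokens: int = 4_096) -> bool:
    """Estimate whether `text` plus a reserved output budget fits in 128k tokens."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserved_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("x" * 400_000))   # ~100k estimated tokens -> True
print(fits_in_context("x" * 600_000))   # ~150k estimated tokens -> False
```

For production use, replace the heuristic with `len(tokenizer(text)["input_ids"])` from the model's own tokenizer.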
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Small example using the Hugging Face weights (adjust model id to FP8/BF16 variant you downloaded)
model_id = "zai-org/GLM-4.5-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # choose appropriate dtype for your hardware
device_map="auto",
)
prompt = "Write a short plan to refactor a legacy Python codebase for maintainability."
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Quick vLLM serve example (CLI):
# vllm serve zai-org/GLM-4.5-Air --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
# (See docs for vLLM and SGLang integration and production flags.)
Pricing
Z.ai publishes per‑1M token API pricing on its developer site. As listed in the official docs (Z.ai): GLM‑4.5 — $0.60 per 1M input tokens / $2.20 per 1M output tokens (cached input tiers listed separately); GLM‑4.5‑Air — $0.20 per 1M input / $1.10 per 1M output. (See Z.ai developer pricing for up‑to‑date regional and cached‑input rates.) ([docs.z.ai](https://docs.z.ai/guides/overview/pricing))
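The listed rates translate directly into a per-request cost estimate. A minimal sketch using the per-1M-token prices quoted above (it ignores cached-input discounts; verify against the live pricing page before budgeting):

```python
# USD per 1M tokens, from the Z.ai pricing listed above (non-cached input).
PRICES = {
    "glm-4.5":     {"input": 0.60, "output": 2.20},
    "glm-4.5-air": {"input": 0.20, "output": 1.10},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request, ignoring cached-input tiers."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a 50k-token agent trace producing a 2k-token answer on GLM-4.5:
cost = estimate_cost("glm-4.5", 50_000, 2_000)
print(f"${cost:.4f}")  # -> $0.0344
```

At these rates, long agent traces are dominated by input-token cost, which is where the cached-input tiers matter.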
Benchmarks
- Overall reported average across 12 benchmarks: 63.2 (ranked 3rd among evaluated models) (Source: https://huggingface.co/zai-org/GLM-4.5)
- TAU‑Bench (agentic): 70.1% (Source: https://arxiv.org/abs/2508.06471)
- AIME‑24 (reasoning): 91.0% (Source: https://arxiv.org/abs/2508.06471)
- SWE‑bench Verified (coding): 64.2% (Source: https://arxiv.org/abs/2508.06471)
- Hugging Face downloads (latest listed): 33,921 (Hugging Face model card) (Source: https://huggingface.co/zai-org/GLM-4.5)
Key Information
- Category: Language Models
- Type: AI Language Models Tool