GLM-4.5 - AI Language Models Tool
Overview
GLM-4.5 is an open-weight, Mixture-of-Experts (MoE) large language model series from Z.ai (Zhipu AI), designed for agentic applications that combine reasoning, coding, and tool-enabled workflows. The family includes GLM-4.5 (reported in the technical report as 355B total parameters with ~32B active per forward pass) and the compact GLM-4.5‑Air (≈106B total / 12B active). Both models expose a hybrid "thinking" mode (multi-step chain-of-thought and tool use) and a "non-thinking" mode (low-latency replies), and are released with open weights and model artifacts for local deployment. ([arxiv.org](https://arxiv.org/abs/2508.06471))

GLM-4.5 targets long-context agent workflows: it supports a 128k-token context window, native/OpenAI-style tool and function calling, BF16 and FP8 weight releases, and turnkey inference integrations with Transformers, vLLM, and SGLang (a production-focused inference stack). The model was trained and post-processed with multi-stage pipelines (large-token pretraining plus RL/expert iteration) and shows competitive results across agentic, reasoning, and coding benchmarks in the GLM technical evaluation. ([docs.z.ai](https://docs.z.ai/guides/llm/glm-4.5))
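The thinking/non-thinking split is typically toggled in the API request itself. Below is a minimal sketch of an OpenAI-style chat-completion payload with such a toggle; the `thinking` field name and its `"enabled"`/`"disabled"` values are assumptions modeled on common OpenAI-compatible servers, so check the Z.ai API docs for the exact parameter your endpoint expects.

```python
import json

def build_chat_request(prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat-completion payload for GLM-4.5.

    NOTE: the `thinking` field is a sketch of a hybrid-mode toggle,
    not a confirmed Z.ai parameter name; verify against the API docs.
    """
    return {
        "model": "glm-4.5",
        "messages": [{"role": "user", "content": prompt}],
        # Hypothetical toggle: "enabled" = multi-step CoT/tool use,
        # "disabled" = low-latency direct reply.
        "thinking": {"type": "enabled" if thinking else "disabled"},
        "stream": False,
    }

payload = build_chat_request("Summarize this repo's build steps.", thinking=False)
print(json.dumps(payload, indent=2))
```

The same payload shape works for both modes, which makes it easy to fall back to the low-latency path for simple queries.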
Model Statistics
- Downloads: 33,921
- Likes: 1,397
- Pipeline: text-generation
- Parameters: 358.3B
- License: mit
Model Details
Architecture and scaling: GLM-4.5 is built as a Mixture‑of‑Experts (MoE) LLM that exposes a large total parameter capacity while activating a much smaller subset per inference (published paper: 355B total / 32B active; Hugging Face displays a model size of ~358B depending on packaging). The series includes GLM-4.5-Air (106B total / 12B active) for cost-conscious deployments. The model uses architectural and deployment optimizations (MTP/speculative decoding, FP8 weight formats, and SGLang/vLLM integration) to reduce inference cost while preserving long-context reasoning. ([arxiv.org](https://arxiv.org/abs/2508.06471))

Capabilities & training: According to the authors, GLM-4.5 was trained with a multi-stage pipeline (large-scale pretraining plus targeted post-training and RLHF/expert iteration) on tens of trillions of tokens (the paper notes multi-stage training on ~23T tokens) to improve agentic, reasoning, and coding performance. The model supports tool/function calling (OpenAI-style descriptors), structured JSON outputs, streaming, context caching, and toggled "thinking" modes for hybrid reasoning workflows. ([arxiv.org](https://arxiv.org/abs/2508.06471))

Deployment & inference: Day-one or near-day-one support exists for the major inference stacks: Transformers (trust_remote_code quickstarts and CLI helpers), vLLM (model-serving flags including --tool-call-parser glm45 and --reasoning-parser glm45), and SGLang (sglang launch_server with speculative/EAGLE configs). Hardware guidance and recommended FP8/BF16 GPU counts for full 128K-context runs are documented on the project pages. ([huggingface.co](https://huggingface.co/zai-org/GLM-4.5))

Licensing note: Hugging Face model pages list the model weights under an MIT-compatible license, while related code repositories in some places use Apache-2.0 for tools/code; check the specific artifact's license before redistribution. ([huggingface.co](https://huggingface.co/zai-org/GLM-4.5))
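To make the tool-calling flow concrete, here is a minimal sketch of an OpenAI-style function descriptor plus a dispatcher for the structured tool call a model would return. The `get_weather` tool, its schema, and the canned response are hypothetical illustrations, not part of the GLM-4.5 release.

```python
import json

# OpenAI-style tool descriptor of the kind GLM-4.5's function-calling
# interface accepts (the get_weather tool itself is a made-up example).
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def dispatch_tool_call(tool_call: dict) -> str:
    """Execute a structured tool call of the common OpenAI shape:
    {"name": ..., "arguments": "<JSON string>"}."""
    args = json.loads(tool_call["arguments"])
    if tool_call["name"] == "get_weather":
        # Stub: a real agent would call a weather API here.
        return f"Weather in {args['city']}: sunny, 21 C"
    raise ValueError(f"unknown tool: {tool_call['name']}")

# Simulated model output in OpenAI tool-call format.
result = dispatch_tool_call({"name": "get_weather", "arguments": '{"city": "Berlin"}'})
print(result)
```

In a real agent loop, the returned string would be appended to the conversation as a tool message and the model called again to produce the final answer.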
Key Features
- Hybrid reasoning: explicit thinking (multi-step CoT/tool use) and non-thinking (low-latency) modes.
- Mixture‑of‑Experts scaling: very large total capacity with a smaller active parameter set per pass.
- 128k token context window for long documents, codebases, and agent traces.
- Native tool/function calling using OpenAI‑style tool descriptors and structured JSON outputs.
- FP8 and BF16 weight releases plus turnkey examples for Transformers, vLLM and SGLang.
- Speculative decoding and MTP layers to improve throughput at inference time.
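As a rough aid for planning around the 128k-token window, the sketch below estimates whether a document fits in context using a chars-per-token heuristic. The 4-chars-per-token ratio is an assumption, not GLM-4.5's real tokenizer behavior; use the actual tokenizer when accuracy matters.

```python
CONTEXT_WINDOW = 128_000   # GLM-4.5 context length in tokens
CHARS_PER_TOKEN = 4        # rough heuristic, NOT the real tokenizer ratio

def fits_in_context(text: str, reserved_output_tokens: int = 4_096) -> bool:
    """Estimate whether `text` plus a reserved output budget fits in 128k tokens."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserved_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("x" * 400_000))   # ~100k estimated tokens -> True
print(fits_in_context("x" * 600_000))   # ~150k estimated tokens -> False
```

For production use, replace the heuristic with `len(tokenizer(text)["input_ids"])` from the model's own tokenizer.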
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Small example using the Hugging Face weights (adjust model id to FP8/BF16 variant you downloaded)
model_id = "zai-org/GLM-4.5-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
model_id,
trust_remote_code=True,
torch_dtype=torch.bfloat16, # choose appropriate dtype for your hardware
device_map="auto",
)
prompt = "Write a short plan to refactor a legacy Python codebase for maintainability."
inputs = tokenizer(prompt, return_tensors="pt")
inputs = {k: v.to(next(model.parameters()).device) for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Quick vLLM serve example (CLI):
# vllm serve zai-org/GLM-4.5-Air --tensor-parallel-size 8 --tool-call-parser glm45 --reasoning-parser glm45 --enable-auto-tool-choice
# (See docs for vLLM and SGLang integration and production flags.)
Pricing
Z.ai publishes per‑1M token API pricing on its developer site. As listed in the official docs (Z.ai): GLM‑4.5 — $0.60 per 1M input tokens / $2.20 per 1M output tokens (cached input tiers listed separately); GLM‑4.5‑Air — $0.20 per 1M input / $1.10 per 1M output. (See Z.ai developer pricing for up‑to‑date regional and cached‑input rates.) ([docs.z.ai](https://docs.z.ai/guides/overview/pricing))
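The listed rates translate directly into a per-request cost estimate. A minimal sketch using the per-1M-token prices quoted above (it ignores cached-input discounts; verify against the live pricing page before budgeting):

```python
# USD per 1M tokens, from the Z.ai pricing listed above (non-cached input).
PRICES = {
    "glm-4.5":     {"input": 0.60, "output": 2.20},
    "glm-4.5-air": {"input": 0.20, "output": 1.10},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for one request, ignoring cached-input tiers."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# e.g. a 50k-token agent trace producing a 2k-token answer on GLM-4.5:
cost = estimate_cost("glm-4.5", 50_000, 2_000)
print(f"${cost:.4f}")  # -> $0.0344
```

At these rates, long agent traces are dominated by input-token cost, which is where the cached-input tiers matter.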
Benchmarks
- Overall reported average across 12 benchmarks: 63.2 (ranked 3rd among evaluated models) (Source: https://huggingface.co/zai-org/GLM-4.5)
- TAU‑Bench (agentic): 70.1% (Source: https://arxiv.org/abs/2508.06471)
- AIME‑24 (reasoning): 91.0% (Source: https://arxiv.org/abs/2508.06471)
- SWE‑bench Verified (coding): 64.2% (Source: https://arxiv.org/abs/2508.06471)
- Hugging Face downloads (latest listed): 33,921 (Hugging Face model card) (Source: https://huggingface.co/zai-org/GLM-4.5)
Key Information
- Category: Language Models
- Type: AI Language Models Tool