OpenAI GPT OSS - AI Language Models Tool

Overview

OpenAI GPT OSS is an open-weight family of large language models released by OpenAI in 2025, consisting of two variants: gpt-oss-120b (≈117B parameters) and gpt-oss-20b (≈21B parameters). Both use a token-choice mixture-of-experts (MoE) transformer with 4-bit MXFP4 quantization on the MoE weights, lowering memory and inference cost while retaining strong reasoning and chain-of-thought capability. The models support adjustable "reasoning effort" levels, Structured Outputs and harmony-format responses, and tool use (web search, function calling, Python execution) via the Responses API and local inference stacks. (Sources: https://huggingface.co/blog/welcome-openai-gpt-oss; OpenAI model card)

The releases target broad deployment scenarios: gpt-oss-120b is designed to run on a single 80 GB-class GPU (Hopper/H100 or similar) for production-grade reasoning, while gpt-oss-20b is optimized for consumer-class GPUs with ~16 GB of VRAM for edge or laptop use. The models natively support long contexts (RoPE, up to 128k tokens) and grouped multi-query attention, and integrate with inference engines such as Hugging Face Transformers, vLLM, llama.cpp, and Ollama for high-throughput or low-latency setups. The weights are distributed under the Apache 2.0 license with an accompanying gpt-oss usage policy. (Sources: https://openai.com/index/introducing-gpt-oss; Hugging Face; GitHub repo)
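
Because these engines expose OpenAI-compatible endpoints, a locally served copy of either model can be queried with the standard openai Python client. The snippet below is a minimal sketch, assuming a server (for example, one started with vLLM) is already listening at http://localhost:8000/v1; the base URL, API key, and served model name are placeholders to adjust for your deployment.

Example (python):

# Minimal sketch: chat with a locally served gpt-oss model through an
# OpenAI-compatible endpoint. Assumption: a server (e.g. vLLM) is already
# running at http://localhost:8000/v1 and serving openai/gpt-oss-20b.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: your local endpoint
    api_key="EMPTY",                      # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize MXFP4 quantization in one line."}],
)
print(response.choices[0].message.content)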

Key Features

  • Mixture-of-experts (MoE) transformer: activates only a small subset of expert parameters per token for efficiency.
  • MXFP4 4-bit quantization on MoE weights to reduce memory footprint and accelerate inference.
  • Two sizes: gpt-oss-120b (≈117B) and gpt-oss-20b (≈21B) for cloud and consumer-grade deployment.
  • Up to 128k token context using RoPE for long-document reasoning and retrieval-augmented tasks.
  • Configurable reasoning effort (low/medium/high) to trade latency for chain-of-thought depth (see the sketch after this list).
  • Tool use and agent workflows: structured tool calling, web search, and code execution via Responses API.
  • Official integrations: Transformers, vLLM, llama.cpp, Ollama, plus cloud partner deployments.
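
The reasoning-effort setting is selected through the system prompt rather than a dedicated API parameter. The sketch below follows the convention documented in the model card (a system message of the form "Reasoning: high"); treat the exact phrasing as something to verify against the model card for your checkpoint.

Example (python):

# Sketch: select reasoning effort via the system prompt, per the model
# card's convention (verify the exact phrasing for your checkpoint).
messages = [
    {"role": "system", "content": "Reasoning: high"},  # low | medium | high
    {"role": "user", "content": "Prove that sqrt(2) is irrational."},
]
# Pass `messages` to tokenizer.apply_chat_template(...) exactly as in the
# Transformers example below; higher effort produces longer chains of
# thought at the cost of latency.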

Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map and torch_dtype='auto' let transformers pick the best placement
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [{"role": "user", "content": "Explain what a Möbius strip is in simple terms."}]

# Use the chat template utility (as shown in the official examples)
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
# decode only the generated tokens
output = tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(output)

# Notes: For MXFP4 speedups and memory reductions, follow the model card and install
# the recommended 'kernels' and Triton versions. For production inference, consider
# vLLM, Hugging Face Inference Providers, or specialized runtimes (llama.cpp / ollama).
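
The same OpenAI-compatible interface can carry the structured tool calls listed under Key Features. The sketch below is illustrative only: it assumes a local server whose stack parses gpt-oss tool calls (vLLM offers this), and the get_weather schema is a hypothetical function, not part of the model.

Example (python):

# Sketch: structured tool calling through an OpenAI-compatible server.
# Assumptions: local endpoint at http://localhost:8000/v1 with tool-call
# parsing enabled; `get_weather` is a hypothetical function for illustration.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# When the model decides to call a tool, the arguments arrive as JSON text.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))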

Benchmarks

  • Total parameters: gpt-oss-120b ≈ 117 billion; gpt-oss-20b ≈ 21 billion (Source: https://huggingface.co/blog/welcome-openai-gpt-oss)
  • Active parameters per token: gpt-oss-120b ≈ 5.1B; gpt-oss-20b ≈ 3.6B (Source: https://huggingface.co/blog/welcome-openai-gpt-oss)
  • Context length: up to 128,000 tokens (RoPE) (Source: https://openai.com/index/introducing-gpt-oss)
  • Core reasoning benchmarks: gpt-oss-120b reaches near-parity with OpenAI o4-mini; gpt-oss-20b is comparable to o3-mini (Source: https://openai.com/index/introducing-gpt-oss)
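
These parameter counts make the hardware targets easy to sanity-check. As a rough back-of-envelope estimate (assumption: ≈4.25 bits per weight once MXFP4 block scales are included, ignoring the attention/embedding weights kept at higher precision), the quantized weights land within the stated GPU budgets:

Example (python):

# Back-of-envelope MXFP4 memory estimate. Assumptions: ~4.25 bits/weight
# including block scales; higher-precision attention/embedding weights and
# the KV cache are ignored, so real footprints are somewhat larger.
def mxfp4_gib(total_params_billion, bits_per_weight=4.25):
    return total_params_billion * 1e9 * bits_per_weight / 8 / 2**30

print(f"gpt-oss-120b: ~{mxfp4_gib(117):.0f} GiB")  # ~58 GiB -> fits one 80 GB GPU
print(f"gpt-oss-20b:  ~{mxfp4_gib(21):.0f} GiB")   # ~10 GiB -> fits 16 GB VRAM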

Last Refreshed: 2026-01-09

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool