OpenAI GPT OSS - AI Language Models Tool
Overview
OpenAI GPT OSS is an open-weight family of large language models released by OpenAI in 2025, consisting of two variants: gpt-oss-120b (≈117B parameters) and gpt-oss-20b (≈21B parameters). Both use a token-choice mixture-of-experts (MoE) architecture with 4-bit MXFP4 quantization on the MoE weights to lower memory and inference cost while keeping strong reasoning and chain-of-thought capabilities. The models support adjustable "reasoning effort" levels, Structured Outputs and harmony-format responses, and tool use (web search, function calling, Python execution) via the Responses API and local inference stacks (Sources: Hugging Face blog, [huggingface.co](https://huggingface.co/blog/welcome-openai-gpt-oss); OpenAI model card).
The releases target broad deployment scenarios: gpt-oss-120b is designed to run on a single 80 GB-class GPU (Hopper/H100 or similar) for production-grade reasoning, while gpt-oss-20b is optimized for consumer-class GPUs with ~16 GB of VRAM for edge or laptop use. The models natively support long contexts (RoPE, up to 128k tokens), grouped multi-query attention, and integrations with inference engines such as Hugging Face Transformers, vLLM, llama.cpp, and Ollama for high-throughput or low-latency setups. The weights are distributed under the Apache 2.0 license with an accompanying gpt-oss usage policy (Sources: OpenAI announcement, [openai.com](https://openai.com/index/introducing-gpt-oss); Hugging Face; GitHub repo).
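To make the adjustable reasoning effort and Responses API support concrete, the sketch below queries a gpt-oss deployment through an OpenAI-compatible endpoint using the official openai Python SDK. The base_url, API key, and the assumption that the serving stack exposes a Responses-compatible endpoint for these weights are placeholders for your own deployment; treat this as an illustrative sketch, not a canonical recipe.
Example (python):
from openai import OpenAI

# Placeholder endpoint: point this at whichever provider or local server
# hosts the gpt-oss weights behind an OpenAI-compatible API (assumed URL).
client = OpenAI(base_url="https://your-provider.example/v1", api_key="YOUR_KEY")

response = client.responses.create(
    model="openai/gpt-oss-20b",
    reasoning={"effort": "high"},  # low | medium | high, traded against latency
    input="Outline the main trade-offs of mixture-of-experts architectures.",
)
print(response.output_text)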
Key Features
- Mixture-of-experts (MoE) transformer: activates fewer parameters per token for efficiency.
- MXFP4 4-bit quantization on MoE weights to reduce memory footprint and accelerate inference.
- Two sizes: gpt-oss-120b (≈117B) and gpt-oss-20b (≈21B) for cloud and consumer-grade deployment.
- Up to 128k token context using RoPE for long-document reasoning and retrieval-augmented tasks.
- Configurable reasoning effort (low/medium/high) to trade latency for chain-of-thought depth.
- Tool use and agent workflows: structured tool calling, web search, and code execution via the Responses API (see the tool-calling sketch after this list).
- Official integrations: Transformers, vLLM, llama.cpp, Ollama, plus cloud partner deployments.
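For the structured tool calling mentioned above, Transformers can turn a plain Python function into a tool definition that the chat template embeds in the prompt. The sketch below assumes the gpt-oss chat template accepts a tools argument, as described in the Transformers tool-use documentation; get_weather is a made-up example function, and parsing and executing the model's tool call is left out for brevity.
Example (python):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

# A plain Python function; its signature and docstring become the tool schema.
def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny, 22 °C"  # stub implementation for this sketch

messages = [{"role": "user", "content": "What's the weather like in Lisbon right now?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],          # advertise the tool to the model
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# The completion should contain a structured tool call for get_weather, which an
# agent loop would parse, execute, and feed back as a tool message.
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))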
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map and torch_dtype='auto' let transformers pick the best placement
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
messages = [{"role": "user", "content": "Explain what a Möbius strip is in simple terms."}]
# Use the chat template utility (as shown in the official examples)
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
# decode only the generated tokens
output = tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(output)
# Notes: For MXFP4 speedups and memory reductions, follow the model card and install
# the recommended 'kernels' and Triton versions. For production inference, consider
# vLLM, Hugging Face Inference Providers, or specialized runtimes (llama.cpp / Ollama).
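As a sketch of the production path mentioned in the notes above, vLLM can serve the checkpoint behind an OpenAI-compatible HTTP API, which the openai client can then query. The serve command, port, and request settings below are illustrative assumptions; consult the vLLM and gpt-oss documentation for the recommended flags.
Example (python):
# Assumed to be running in another terminal (illustrative; check the vLLM docs):
#   vllm serve openai/gpt-oss-20b
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Give me three uses for a 128k-token context window."}],
    max_tokens=256,
)
print(completion.choices[0].message.content)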
Benchmarks
- Total parameters: gpt-oss-120b ≈ 117 billion; gpt-oss-20b ≈ 21 billion (Source: https://huggingface.co/blog/welcome-openai-gpt-oss)
- Active parameters per token: gpt-oss-120b ≈ 5.1B; gpt-oss-20b ≈ 3.6B (Source: https://huggingface.co/blog/welcome-openai-gpt-oss)
- Context length: up to 128,000 tokens (RoPE) (Source: https://openai.com/index/introducing-gpt-oss)
- Core reasoning: gpt-oss-120b reaches near-parity with OpenAI o4-mini on core reasoning benchmarks; gpt-oss-20b is comparable to o3-mini (Source: https://openai.com/index/introducing-gpt-oss)
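As a rough illustration of why MXFP4 matters for these footprints, the back-of-the-envelope estimate below assumes about 4.25 bits per MoE weight (4-bit values plus a shared block scale), 16-bit storage for the remaining parameters, and a guessed expert/non-expert split; none of these figures are official, they only show how the parameter counts above relate to the single-GPU claims.
Example (python):
# Back-of-the-envelope weight-memory estimate (illustrative assumptions, not official numbers).
BITS_PER_MXFP4_WEIGHT = 4.25  # 4-bit values plus a shared per-block scale (assumed)

def approx_weight_gb(total_params_billion: float, moe_fraction: float = 0.95) -> float:
    """Rough checkpoint size, assuming `moe_fraction` of parameters are MXFP4 experts
    and the rest stay in 16-bit precision. Both assumptions are illustrative."""
    moe = total_params_billion * 1e9 * moe_fraction * BITS_PER_MXFP4_WEIGHT / 8
    dense = total_params_billion * 1e9 * (1 - moe_fraction) * 2  # 2 bytes per bf16 weight
    return (moe + dense) / 1e9

print(f"gpt-oss-120b ≈ {approx_weight_gb(117):.0f} GB of weights")  # order of magnitude of one 80 GB GPU
print(f"gpt-oss-20b  ≈ {approx_weight_gb(21):.0f} GB of weights")   # order of magnitude of ~16 GB VRAM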
Key Information
- Category: Language Models
- Type: AI Language Models Tool