Qwen/QwQ-32B-Preview - AI Language Models Tool
Overview
QwQ-32B-Preview is an experimental, reasoning-focused large language model released by the Qwen Team as a preview of their QwQ line. The model is a 32.5B-parameter causal transformer designed to emphasize deliberative reasoning: the team describes QwQ as a model that "reflects deeply" and uses internal reflection to improve performance on hard math and coding problems. The preview was published on November 28, 2024 and is available on Hugging Face with code, a demo, and a detailed blog post describing design choices and known limitations. ([qwenlm.github.io](https://qwenlm.github.io/blog/qwq-32b-preview/))
QwQ-32B-Preview combines a dense 32.5B-parameter architecture (31.0B non-embedding parameters), long-context support (32,768 tokens in the preview build, with documented techniques for length extrapolation), and engineering choices such as RoPE position embeddings, SwiGLU activations, RMSNorm, and attention QKV bias. The preview emphasizes strengths in technical domains: the Qwen team reports strong results on mathematics and code benchmarks, while calling out limitations such as occasional language mixing, recursive reasoning loops, and safety/consistency issues that make the model most suitable for research and experimentation rather than production deployment. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
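Since the overview points to the Hugging Face release, here is a minimal sketch for pulling the published artifacts (weights, tokenizer, config) to the local cache with the huggingface_hub library; the variable name is just for illustration.
from huggingface_hub import snapshot_download
# Download the full model repository and print where it was stored locally.
local_dir = snapshot_download(repo_id="Qwen/QwQ-32B-Preview")
print(local_dir)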
Model Statistics
- Downloads: 7,510
- Likes: 1,741
- Pipeline: text-generation
- Parameters: 32.8B
- License: apache-2.0
Model Details
Architecture and core specs: QwQ-32B-Preview is a dense decoder-only transformer built on the Qwen2.5 family. The model has approximately 32.5 billion total parameters (about 31.0B non-embedding), uses 64 transformer layers, and employs a GQA-style attention configuration with 40 query heads and 8 KV heads. Context handling in the preview is configured for a 32,768-token window; the Qwen2.5 codebase and config support YaRN (rope_scaling) for extrapolating to longer contexts when needed. The published model artifacts on Hugging Face are stored in BF16 (bfloat16). ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
Model building blocks and capabilities: QwQ uses rotary position embeddings (RoPE), SwiGLU activation functions, RMSNorm normalization, and attention QKV bias to improve optimization and emergent reasoning behavior. The Qwen team recommends a recent Hugging Face transformers release (4.37.0 or later, which adds qwen2 support) and inference runtimes such as vLLM for high-throughput, long-context serving. The model card and blog explicitly list known behavioral limitations (language mixing/code-switching, circular reasoning loops, and safety considerations) and position this release as a research preview rather than a production-ready assistant. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
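As a quick sanity check of these specs, the published configuration can be inspected without downloading the weights. A minimal sketch using the standard transformers AutoConfig API follows; the attribute names are the usual Qwen2 config fields, and the expected values in the comments come from the model card rather than from running the code.
from transformers import AutoConfig
# Load only the config.json from the Hugging Face repository.
cfg = AutoConfig.from_pretrained("Qwen/QwQ-32B-Preview")
print(cfg.num_hidden_layers)        # expected: 64 transformer layers
print(cfg.num_attention_heads)      # expected: 40 query heads
print(cfg.num_key_value_heads)      # expected: 8 KV heads (GQA)
print(cfg.max_position_embeddings)  # expected: 32768-token context window
print(cfg.torch_dtype)              # expected: bfloat16 per the model card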
Key Features
- 32.5B-parameter dense decoder transformer with ≈31.0B non-embedding parameters.
- Configured for a 32,768-token context in the preview; supports rope scaling / YaRN for extrapolation (see the sketch after this list).
- Architecture uses RoPE, SwiGLU activations, RMSNorm, and attention QKV bias for improved reasoning.
- Reported strong benchmark performance on MATH-500, GPQA, AIME, and LiveCodeBench (math/code).
- Openly available on Hugging Face with Apache-2.0 license and community quantizations/ports.
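For inputs that exceed the 32,768-token window, the Qwen2.5/QwQ documentation describes enabling YaRN through the rope_scaling entry of the model config. Below is a minimal sketch of one way to do this from Python before loading the weights; the factor of 4.0 follows the Qwen2.5 documentation and is an assumption here, and whether your workload actually needs it is as well.
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "Qwen/QwQ-32B-Preview"

# Enable YaRN rope scaling before loading the weights. Only do this for prompts longer
# than 32,768 tokens, since static rope scaling can slightly degrade short-input quality.
cfg = AutoConfig.from_pretrained(model_name)
cfg.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    config=cfg,
    torch_dtype="auto",
    device_map="auto",
)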
Example Usage
Example (python):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Qwen/QwQ-32B-Preview"
# Load model (automatic device mapping; requires recent transformers and sufficient GPU memory)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
messages = [
    {"role": "system", "content": "You are a helpful assistant. Think step-by-step."},
    {"role": "user", "content": "Solve: sum of first ten positive integers."}
]
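# Render the chat messages into the model's prompt string (tokenization happens below)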
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Generate (adjust max_new_tokens as needed)
outputs = model.generate(**inputs, max_new_tokens=256)
# trim the prompt tokens and decode
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, outputs)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Example adapted from the QwQ-32B-Preview Hugging Face quickstart. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
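The model card also recommends vLLM for high-throughput, long-context serving. The following offline-inference sketch assumes vllm is installed and that two GPUs are available for tensor parallelism; the parallelism degree and sampling values are illustrative assumptions, not settings quoted from the Qwen team.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/QwQ-32B-Preview"

# tensor_parallel_size=2 is only an example; adjust to your GPU setup and memory budget.
llm = LLM(model=model_name, tensor_parallel_size=2, max_model_len=32768)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Reuse the chat template so the prompt matches the format from the transformers example above.
messages = [{"role": "user", "content": "Solve: sum of first ten positive integers."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)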
Benchmarks
- GPQA (graduate-level, Google-proof Q&A): 65.2% (Source: [Qwen blog](https://qwenlm.github.io/blog/qwq-32b-preview/))
- AIME (math challenge subset): 50.0% (Source: [Qwen blog](https://qwenlm.github.io/blog/qwq-32b-preview/))
- MATH-500: 90.6% (Source: [Qwen blog](https://qwenlm.github.io/blog/qwq-32b-preview/))
- LiveCodeBench: 50.0% (Source: [Qwen blog](https://qwenlm.github.io/blog/qwq-32b-preview/))
- Hugging Face downloads (last month): 7,510 (Source: [Hugging Face model page](https://huggingface.co/Qwen/QwQ-32B-Preview))
Key Information
- Category: Language Models
- Type: AI Language Models Tool