Qwen/QwQ-32B-Preview - AI Language Models Tool
Overview
QwQ-32B-Preview is an experimental research preview from the Qwen team that aims to advance large-model reasoning and analytical capabilities. The model is a 32.5B-parameter causal transformer that emphasizes stepwise reflection and deliberative problem solving; the release notes and model card highlight strengths on mathematics and coding benchmarks while also noting limits in language consistency and common-sense reasoning. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))

The preview is published with open weights under the Apache-2.0 license and is available on Hugging Face for researchers and developers to download or to run via hosted providers. Early community testing (HuggingChat demos, Reddit threads, and third-party host pages) reports that QwQ-32B-Preview can be verbose and occasionally mixes languages or falls into recursive reasoning loops, but many testers praise its step-by-step math and code solutions. Deployment recommendations from the maintainers include using YaRN (rope_scaling) for very long contexts and vLLM for inference. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
Model Statistics
- Downloads: 8,416
- Likes: 1742
- Pipeline: text-generation
- Parameters: 32.8B (Hugging Face tensor count; the model card states 32.5B total)
- License: apache-2.0
Model Details
Architecture and scale: QwQ-32B-Preview is implemented as a causal transformer (based on the Qwen2.5 family) with 32.5 billion total parameters (≈31.0B non-embedding), 64 layers, and a grouped-query attention (GQA) layout with 40 query heads and 8 KV heads. The model uses RoPE (rotary positional embeddings), SwiGLU activations, RMSNorm, and attention QKV bias. The published weights are stored in BF16. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))

Context and long-sequence handling: The model card and blog state a practical context length of 32,768 tokens for the preview release; the QwQ/Qwen family also provides guidance on YaRN (rope_scaling) to extrapolate to longer contexts when needed and recommends vLLM for production inference on long sequences.

Training and tooling: Training includes pretraining followed by supervised fine-tuning and RL-based post-training steps described by the Qwen team. The model card also ships a chat template, and the tokenizer provides an apply_chat_template helper to standardize multi-turn prompts. ([huggingface.co](https://huggingface.co/Qwen/QwQ-32B-Preview))
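To make the YaRN guidance concrete, the rope_scaling entry added to a model's config multiplies the original context window by its scaling factor. The values below are illustrative assumptions in the style of the Qwen family's long-context documentation, not official settings for this preview:

```python
# Hypothetical YaRN rope_scaling entry, following the Qwen family's
# long-context guidance (values here are illustrative, not official).
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

# YaRN extrapolates the usable context to roughly factor x original length.
extended_context = int(
    rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"]
)
print(extended_context)  # 131072
```

In practice this dict is merged into the model's config.json (or passed at load time by an inference stack such as vLLM) before serving long-context requests.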
Key Features
- 32.5 billion parameters optimized for analytical and stepwise reasoning.
- Extended context support up to 32,768 tokens (YaRN/rope_scaling recommended for very long inputs).
- Transformer blocks with RoPE positional embeddings, SwiGLU activations, and RMSNorm.
- Demonstrates strong math and code benchmark performance (see MATH-500, LiveCodeBench).
- Published under Apache-2.0 license; weights available on Hugging Face for local use.
- Includes chat templates and tokenizer helpers to standardize multi-turn prompts.
- Recommended inference stacks: vLLM for long contexts and device_map auto for local runs.
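One practical payoff of the GQA layout above (8 KV heads instead of 40) is a much smaller KV cache at long context. The back-of-envelope sketch below assumes a head dimension of 128, which is consistent with the Qwen2.5-32B family but is not stated in the model card:

```python
# Rough KV-cache estimate for QwQ-32B-Preview under GQA.
# Assumption: head_dim = 128 (not stated in the model card).
layers = 64
kv_heads = 8          # GQA: 8 KV heads vs. 40 query heads
head_dim = 128        # assumed
bytes_per_value = 2   # BF16

# K and V are each cached per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
cache_at_32k = bytes_per_token * 32768

print(bytes_per_token)        # 262144 bytes (256 KiB) per token
print(cache_at_32k / 2**30)   # 8.0 (GiB at the full 32,768-token context)
```

With full 40-head multi-head attention the same cache would be five times larger (roughly 40 GiB), which is why GQA matters for long-context serving.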
Example Usage
Example (python):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B-Preview"

# Load model (auto device mapping; requires a recent transformers release
# and sufficient GPU memory for a 32B model)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "How many r's are in the word \"strawberry\"?"
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,   # enables sampling so temperature/top_p/top_k take effect
    temperature=0.6,
    top_p=0.95,
    top_k=40
)

# Remove prompt tokens from the output and decode only the continuation
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Benchmarks
- MATH-500: 90.6% (Source: https://qwenlm.github.io/blog/qwq-32b-preview/)
- LiveCodeBench: 50.0% (Source: https://qwenlm.github.io/blog/qwq-32b-preview/)
- AIME: 50.0% (Source: https://qwenlm.github.io/blog/qwq-32b-preview/)
- GPQA: 65.2% (Source: https://qwenlm.github.io/blog/qwq-32b-preview/)
Key Information
- Category: Language Models
- Type: AI Language Models Tool