Microsoft Phi-4-reasoning-plus - AI Language Models Tool
Overview
Phi-4-reasoning-plus is an open-weight, research-focused large language model from Microsoft Research designed for stepwise reasoning in math, science, and code. It is a 14B-parameter dense decoder-only Transformer fine-tuned from the Phi-4 base using supervised fine-tuning on chain-of-thought traces followed by reinforcement learning. Its outputs are structured as a reasoning (chain-of-thought) block followed by a concise Solution summary. The model was released under an MIT license on April 30, 2025 and is positioned for research and generative AI applications. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
At this size the model can target memory- or latency-constrained environments (edge/embedded scenarios) while retaining strong performance on competitive reasoning benchmarks, although its long reasoning traces make it best suited to uses where output quality matters more than minimal latency. It supports long contexts (native 32k-token context length; Microsoft reports experiments up to 64k tokens), and the model card documents recommended sampling settings (temperature=0.8, top_k=50, top_p=0.95, do_sample=True) to encourage full chain-of-thought generation. The model card, technical report, and Microsoft model catalog provide usage guidance, evaluation results, and responsible-AI considerations for researchers and developers. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
Model Statistics
- Downloads: 4,073
- Likes: 330
- Pipeline: text-generation
- Parameters: 14.7B
- License: MIT
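As a rough sizing aid, the parameter count above can be turned into a weight-memory estimate. The sketch below assumes 2 bytes per parameter (BF16 weights) and counts weights only; activations, KV cache, and framework overhead are not included, so real usage will be higher. The function name is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same width
    (2 bytes for BF16). Runtime overhead is deliberately ignored.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # 14.7B parameters, as listed in the statistics above.
    print(f"Approximate BF16 weight size: {weight_memory_gb(14.7e9):.1f} GB")
```

Passing `bytes_per_param=1` gives a comparable estimate for an 8-bit quantized copy, assuming such a copy is available.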
Model Details
Architecture and size: Phi-4-reasoning-plus is a dense, decoder-only Transformer with roughly 14 billion parameters (15B metadata size reported on the model page) and BF16 weights; it is a finetune of microsoft/phi-4. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
Training recipe: Microsoft applied supervised fine-tuning (SFT) on curated chain-of-thought traces and then additional reinforcement learning to increase final-answer accuracy. The development introduced a dedicated “thinking block” and training on teachable STEM prompts to improve multi-step reasoning. Training ran across H100 GPUs (catalog lists 32× H100-80G) on a dataset reported as ~16B tokens (~8.3B unique tokens) with a training window from January–April 2025. ([microsoft.com](https://www.microsoft.com/en-us/research/publication/phi-4-reasoning-technical-report/?utm_source=openai))
Context, I/O and outputs: The model’s documented context length is 32k tokens (with internal experiments up to 64k tokens). Inference outputs are organized into two explicit sections (a Thought / chain-of-thought block followed by a Solution summary). The model card also provides a ChatML system prompt template and recommended generation hyperparameters for best CoT behavior. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
Benchmarks & scope: Microsoft evaluated the model on a mix of reasoning benchmarks (AIME, OmniMath, GPQA-D, LiveCodeBench, HumanEvalPlus, MMLU-Pro, FlenQA, and others). The model card reports substantial improvements over Phi-4-reasoning on several math/science/code tasks, at the cost of producing about 50% more tokens on average (higher latency). The model is released under an MIT license and intended primarily for English-language research and development; high-risk deployment should follow additional safety checks. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
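Because responses arrive as a Thought block followed by a Solution, downstream code often wants the two parts separately. The following is a minimal sketch, assuming the Thought section is wrapped in `<think>...</think>` tags and the Solution is whatever follows the closing tag. `ParsedResponse` and `split_response` are hypothetical helper names, not part of the model or of any library.

```python
from typing import NamedTuple

OPEN_TAG = "<think>"    # assumed delimiter for the start of the Thought block
CLOSE_TAG = "</think>"  # assumed delimiter for the end of the Thought block


class ParsedResponse(NamedTuple):
    thought: str
    solution: str


def split_response(text: str) -> ParsedResponse:
    """Split decoded model output into (thought, solution).

    - No opening tag: the whole text is treated as the solution.
    - Opening tag but no closing tag (e.g. generation was truncated):
      everything after the opening tag is treated as thought.
    """
    start = text.find(OPEN_TAG)
    if start == -1:
        return ParsedResponse(thought="", solution=text.strip())
    body_start = start + len(OPEN_TAG)
    end = text.find(CLOSE_TAG, body_start)
    if end == -1:
        return ParsedResponse(thought=text[body_start:].strip(), solution="")
    return ParsedResponse(
        thought=text[body_start:end].strip(),
        solution=text[end + len(CLOSE_TAG):].strip(),
    )


if __name__ == "__main__":
    sample = "<think>The antiderivative is x^3/3.</think>\nThe integral equals 9."
    parsed = split_response(sample)
    print("Thought:", parsed.thought)
    print("Solution:", parsed.solution)
```

An empty `solution` with a non-empty `thought` is a useful signal that the generation budget ran out before the model finished reasoning.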
Key Features
- Structured output: explicit chain-of-thought (Thought) then concise Solution.
- Dense decoder-only 14B model optimized for reasoning and code generation.
- 32k native context length; Microsoft reports experiments up to 64k tokens.
- Finetuned with supervised CoT traces plus reinforcement learning for higher accuracy.
- Open-weight release under MIT license for research and integration.
- Recommended sampling settings documented for reliable CoT generation.
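The context-length and sampling features above interact in practice: the prompt and the generated chain of thought share one context window. The sketch below is a hedged illustration of budgeting `max_new_tokens`; it assumes "32k" means 32,768 tokens, and `max_new_tokens_budget` is a hypothetical helper. The sampling values are the ones listed in the overview.

```python
from typing import Optional

# Assumption: the "32k" native context is interpreted as 32,768 tokens.
NATIVE_CONTEXT_TOKENS = 32_768

# Sampling settings as recommended in the model card (see overview).
RECOMMENDED_SAMPLING = {
    "temperature": 0.8,
    "top_k": 50,
    "top_p": 0.95,
    "do_sample": True,
}


def max_new_tokens_budget(
    prompt_tokens: int,
    requested: Optional[int] = None,
    context_limit: int = NATIVE_CONTEXT_TOKENS,
) -> int:
    """Return how many tokens can be generated without exceeding the context.

    If `requested` is given, the result is capped at that value.
    """
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = max(context_limit - prompt_tokens, 0)
    return remaining if requested is None else min(requested, remaining)


if __name__ == "__main__":
    generation_kwargs = dict(RECOMMENDED_SAMPLING)
    generation_kwargs["max_new_tokens"] = max_new_tokens_budget(1_000, requested=4_096)
    print(generation_kwargs)
```

The resulting dictionary could be unpacked into a `generate` call alongside the tokenized inputs.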
Example Usage
Example (python):
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load the tokenizer and model; device_map="auto" (requires the accelerate package) places weights on the available GPU(s)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning-plus")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-reasoning-plus",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Chat-style prompt; this system message is an abbreviated form of the model card's recommended prompt
system = (
    "You are Phi, a language model trained by Microsoft to help users."
    " Please structure your response into two sections: <think>...</think> and {Solution}."
)
user = "Solve: What is the integral of x^2 from 0 to 3?"
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user},
]
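# Render the chat template; return_dict=True yields a mapping (input_ids, attention_mask) instead of a bare tensor
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
)
# Move every input tensor to the device that holds the model
inputs = {k: v.to(model.device) for k, v in inputs.items()}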
# Recommended sampling settings per model card
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Note: for long CoT runs, set max_new_tokens larger (the model card shows up to 32768 for full chains).
Benchmarks
AIME (2024) - Phi-4-reasoning-plus: 81.3 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
AIME (2025) - Phi-4-reasoning-plus: 78.0 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
OmniMath - Phi-4-reasoning-plus: 81.9 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
LiveCodeBench (code generation) - Phi-4-reasoning-plus: 53.1 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
HumanEvalPlus (functional code generation) - Phi-4-reasoning-plus: 92.9 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
Downloads (last month): 4,073 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
Key Information
- Category: Language Models
- Type: AI Language Models Tool