Microsoft Phi-4-reasoning-plus - AI Language Models Tool

Overview

Phi-4-reasoning-plus is a 14-billion-parameter, open-weight reasoning LLM from Microsoft Research designed for advanced multi-step reasoning in math, science, and code. It is a dense, decoder-only Transformer fine-tuned from the Phi-4 base with supervised chain-of-thought traces and additional reinforcement learning (RL) to improve correctness; outputs are structured as a detailed Thought (chain-of-thought) block followed by a concise Solution summary. The model is released under an MIT license and distributed on Hugging Face for researcher and developer use (released April 30, 2025). ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))

Phi-4-reasoning-plus targets workloads that need strong reasoning while remaining deployable in memory- or latency-constrained environments: it supports a 32k-token context length (with experiments extended to 64k), standard inference recipes for producing extended chains of thought, and an emphasis on safety evaluation and red-teaming during post-training alignment. Microsoft's technical report and blog coverage describe architectural choices (a "thinking" mechanism and compute-aware tuning) and show competitive benchmark results versus larger models on math and science evaluations. These design tradeoffs make the model a practical option when you need explicit, inspectable reasoning traces rather than black-box single-step answers. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))

Model Statistics

  • Downloads: 174,121
  • Likes: 335
  • Pipeline: text-generation

License: mit

Model Details

Architecture and size: dense, decoder-only Transformer (Phi-4 family), ~14B parameters. The model was fine-tuned from the Phi-4 base and further improved with RL to lengthen and clarify reasoning traces while increasing accuracy. Recommended inference hyperparameters from the model card: temperature=0.8, top_k=50, top_p=0.95, do_sample=True; for deep CoT allow a large max_new_tokens (examples use up to 32k, or 64k experimentally). ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))

Training and data: training occurred January–April 2025 on ~16B tokens (~8.3B unique tokens) using 32x H100-80G GPUs for ~2.5 days (model card details). Outputs are intentionally two-part: a Thought (chain-of-thought) section that documents stepwise reasoning, and a Solution summary that provides the final answer. Microsoft's technical report documents a dedicated "thinking block" style approach and the SFT+RL pipeline used to produce longer, higher-quality reasoning traces. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))

Deployment and runtime: default context length is 32k tokens, with promising 64k experiments reported; supported tooling includes Hugging Face transformers, vLLM, Ollama, llama.cpp, and common Phi-4-compatible frameworks. The model card notes optimizations for memory- and latency-constrained environments and gives guidance on ChatML-style system prompts to obtain the two-section output format. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
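The ChatML-style prompting mentioned above can be sketched as plain string assembly. The specific special tokens below (`<|im_start|>`, `<|im_sep|>`, `<|im_end|>`) are an assumption based on the Phi-4 family's template; in practice, prefer `tokenizer.apply_chat_template`, which emits the correct template for the model automatically.

```python
# Hypothetical sketch of a ChatML-style prompt for the two-section output
# format. Token names are assumptions; use tokenizer.apply_chat_template
# in real code.

SYSTEM_PROMPT = (
    "You are Phi, a language model trained by Microsoft to help users. "
    "Please structure responses with a Thought (chain-of-thought) section, "
    "then a Solution section."
)

def build_prompt(user_message: str) -> str:
    """Assemble a single prompt string in ChatML-like form,
    ending with the assistant header so the model continues from there."""
    return (
        f"<|im_start|>system<|im_sep|>{SYSTEM_PROMPT}<|im_end|>"
        f"<|im_start|>user<|im_sep|>{user_message}<|im_end|>"
        f"<|im_start|>assistant<|im_sep|>"
    )

prompt = build_prompt("Solve: What is the derivative of x**2?")
print(prompt)
```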

Key Features

  • Explicit chain‑of‑thought output followed by a concise summary (Thought + Solution).
  • Dense, decoder‑only Transformer architecture derived from Phi‑4 (≈14B parameters).
  • Default 32k token context; experimental support/experiments reported up to 64k tokens.
  • Trained via supervised CoT fine‑tuning plus reinforcement learning for higher reasoning accuracy.
  • Optimized guidance for memory/latency‑constrained deployments and multiple inference backends.
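Because the model returns an explicit reasoning trace followed by a summary, downstream code often wants to separate the two. A minimal sketch, assuming the reasoning is wrapped in `<think>...</think>` tags as in Phi-4-reasoning-family outputs (verify the exact delimiters against the model card before relying on them):

```python
import re

def split_thought_solution(text: str):
    """Split a model response into its reasoning trace and final answer.

    Assumes the reasoning is delimited by <think>...</think>; falls back
    to treating the whole text as the solution when no tags are present.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    thought = match.group(1).strip()
    solution = text[match.end():].strip()
    return thought, solution

sample = "<think>d/dx x^2 = 2x by the power rule.</think>The derivative is 2x."
thought, solution = split_thought_solution(sample)
print(solution)  # → The derivative is 2x.
```

Keeping the trace separate lets you log or audit the chain of thought while showing users only the concise answer.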

Example Usage

Example (python):

from transformers import AutoTokenizer, AutoModelForCausalLM

# Recommended by the model card
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-4-reasoning-plus")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-reasoning-plus",
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "system", "content": "You are Phi, a language model trained by Microsoft to help users. Please structure responses with a Thought (chain-of-thought) section, then a Solution section."},
    {"role": "user", "content": "Solve: What is the derivative of x**2?"},
]

inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(
    inputs.to(model.device),
    max_new_tokens=4096,
    temperature=0.8,
    top_k=50,
    top_p=0.95,
    do_sample=True,
)
print(tokenizer.decode(outputs[0]))

# Notes: the model card documents using ChatML-style prompts and larger max_new_tokens for long CoT outputs. See the Hugging Face model page for vLLM examples and deployment tips. ([huggingface.co](https://huggingface.co/microsoft/Phi-4-reasoning-plus))
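For serving rather than one-off scripting, a hedged sketch using vLLM's OpenAI-compatible server (the flags and endpoint follow vLLM's documented CLI; the context-length value mirrors the model card's 32k default, and sampling values mirror the recommended hyperparameters):

```shell
# Start an OpenAI-compatible server for the model (requires a GPU).
vllm serve microsoft/Phi-4-reasoning-plus --max-model-len 32768

# Then, from another shell, query it:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "microsoft/Phi-4-reasoning-plus",
        "messages": [{"role": "user", "content": "Solve: What is the derivative of x**2?"}],
        "max_tokens": 4096,
        "temperature": 0.8,
        "top_p": 0.95
      }'
```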

Pricing

Open‑weight release under the MIT license on Hugging Face; no commercial usage fee listed (free to download/use subject to the MIT license). See the Hugging Face model page for license details.

Benchmarks

  • AIME 2024 (Phi-4-reasoning-plus): 81.3 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
  • AIME 2025 (Phi-4-reasoning-plus): 78.0 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
  • OmniMath (Phi-4-reasoning-plus): 81.9 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
  • GPQA-Diamond (Phi-4-reasoning-plus): 68.9 (Source: https://huggingface.co/microsoft/Phi-4-reasoning-plus)
  • MATH500 (external evaluation, robustness): ≈84.7% (reported by MathGPT evaluation) (Source: https://resources.mathgpt.ai/2025/06/03/are-the-best-open-source-models-qwen-phi-nvidia-deepseek-robust-mathematical-reasoners-insights-from-large-scale-evaluations/)

Last Refreshed: 2026-02-24

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool