Microsoft Phi-4 - AI Language Models Tool

Overview

Microsoft Phi-4 is a 14-billion-parameter, dense decoder‑only language model developed by Microsoft Research, with a training recipe that emphasizes high-quality and synthetic “textbook-like” data to improve reasoning, math, and code generation. Post‑training combines supervised fine‑tuning (SFT) with Direct Preference Optimization (DPO) to improve instruction adherence and safety, targeting workloads that need strong reasoning in memory- or latency-constrained environments (e.g., research inference, on-premise deployment, and generative AI features). The Hugging Face model card and the Phi‑4 technical report on arXiv describe phi-4 as optimized to exceed its teacher models on STEM-focused QA and reasoning benchmarks.

Phi‑4 was released in December 2024. It supports long-context inference (16K tokens) and a wide range of usage scenarios, including coding assistance, competition-math problem solving, stepwise reasoning tasks, and chat-style assistants. Microsoft and community posts around the Phi family (Phi‑4, Phi‑4‑mini, Phi‑4‑multimodal, and reasoning variants) also document later offshoots optimized for lower compute and multimodal inputs, expanding deployment choices. Key references: the official Hugging Face model card and the Phi‑4 technical report (arXiv).

Model Statistics

  • Downloads: 56
  • Likes: 55

License: other

Model Details

Architecture and scale: Phi‑4 is a dense decoder‑only Transformer with roughly 14B parameters, trained with a curriculum emphasizing synthetic reasoning data and filtered high‑quality public and academic sources. Microsoft’s model card and technical report list a 16,000-token context window (expanded mid‑training from a shorter context), a training corpus of ~9.8T tokens, and training conducted on a large H100 fleet (reported as 1,920 H100‑80G GPUs over ~21 days). (Hugging Face model card; Phi‑4 technical report on arXiv.)

Tokenizer, vocabulary, and layers: Public details note use of a tiktoken-compatible tokenizer with a large vocabulary (reported as ~100,352 tokens in community documentation). Community and Microsoft posts indicate an architecture depth consistent with mid‑to‑large SLM designs (~40 layers in some technical summaries). The training stack emphasizes synthetic “textbook-like” data, acquired academic Q&A, and targeted supervised chat-format data to teach stepwise reasoning, math, and coding skills.

Alignment and safety: Phi‑4 combines supervised fine‑tuning with iterative Direct Preference Optimization (DPO) and red‑teaming before release. Microsoft reports safety evaluation against open benchmarks and collaboration with an independent AI Red Team. For the full methodology and dataset descriptions, see the Phi‑4 technical report and the Hugging Face model card.
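
The chat-format training mentioned above can be illustrated with a small formatting helper. The special tokens below follow the prompt format shown on the Hugging Face model card; `build_phi4_prompt` is an illustrative helper name, not part of any library API. In practice, prefer `tokenizer.apply_chat_template`, which applies the correct template for you.

```python
# Sketch of the Phi-4 chat prompt format (per the Hugging Face model card).
# Illustrative only: real code should use tokenizer.apply_chat_template.

def build_phi4_prompt(messages):
    """Render a list of {'role', 'content'} dicts into a Phi-4 prompt string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}<|im_sep|>{m['content']}<|im_end|>")
    # A trailing assistant header cues the model to generate its reply.
    parts.append("<|im_start|>assistant<|im_sep|>")
    return "".join(parts)

prompt = build_phi4_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(prompt)
```

The model's generation then stops at the next `<|im_end|>`, which is why chat-tuned checkpoints like this one expect role-tagged message lists rather than raw text.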

Key Features

  • 14B-parameter dense decoder‑only Transformer tuned for advanced reasoning and STEM tasks.
  • 16,000-token context window (mid‑training expansion) for long‑form reasoning and documents.
  • Training mix includes synthetic “textbook-like” data plus filtered public-domain and academic Q&A sources.
  • Post‑training alignment via supervised fine‑tuning (SFT) plus Direct Preference Optimization (DPO).
  • Strong code and math performance (high HumanEval and MATH scores on SimpleEval benchmarks).
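
The 16,000-token context window noted above means long chat histories must be trimmed client-side before inference. A minimal sketch, where `count_tokens` is a crude whitespace stand-in for the model's real tokenizer and both function names are hypothetical, not part of any API:

```python
# Minimal sketch: drop the oldest non-system turns until the history fits
# a 16K-token budget. count_tokens is a whitespace word count used only as
# a stand-in; swap in the model's tokenizer for real use.

def count_tokens(text):
    return len(text.split())

def truncate_history(messages, budget=16_000, reserve=1_000):
    """Keep the system message and the most recent turns within budget."""
    limit = budget - reserve  # leave room for the model's reply
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    kept = []
    for m in reversed(turns):  # walk newest-first, keep what fits
        cost = count_tokens(m["content"])
        if used + cost > limit:
            break
        kept.append(m)
        used += cost
    return system + list(reversed(kept))

history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "first question " * 5},
    {"role": "user", "content": "latest question"},
]
print(len(truncate_history(history)))  # prints 3
```

Keeping the system message pinned while evicting oldest turns first is a common pattern for fixed-context chat models; the `reserve` margin ensures generation itself has room under the window.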

Example Usage

Example (python):

from transformers import pipeline

# Simple chat-style generation using Hugging Face transformers
pipe = pipeline(
    "text-generation",
    model="microsoft/phi-4",
    model_kwargs={"torch_dtype": "auto"},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant that explains math clearly."},
    {"role": "user", "content": "Explain how to solve quadratic equations by completing the square."},
]

# The text-generation pipeline accepts a chat-style message list for this model; adjust max_new_tokens as needed
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])

# Note: check the model card and technical report for recommended chat formatting and tokenizer specifics.

Benchmarks

  • MMLU (SimpleEval aggregated): 84.8
  • MATH (competition math): 80.6
  • HumanEval (code generation): 82.6
  • GPQA (graduate-level science QA): 56.1
  • DROP (complex comprehension/reasoning): 75.5

All scores are for phi-4 (14B). Source: https://huggingface.co/microsoft/phi-4

Last Refreshed: 2026-03-03

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool