Phi-4-mini-instruct - AI Language Models Tool

Overview

Phi-4-mini-instruct is a 3.8-billion-parameter instruction-tuned language model from Microsoft’s Phi-4 family, optimized for instruction following, reasoning, and long-context tasks. The model is tuned with supervised fine-tuning and direct preference optimization to improve helpfulness, truthfulness, and adherence to instructions while keeping a compute footprint small enough for latency-sensitive or memory-constrained deployments.

According to the model card, Phi-4-mini-instruct supports an unusually long 128K-token context window and a large 200K-token vocabulary to improve multilingual and long-document performance. (Source: Hugging Face model card.) The model is released under an MIT license and is available on Hugging Face and via Microsoft/Azure tooling, intended for both research and commercial use.

Phi-4-mini-instruct is designed to run with mainstream inference stacks (transformers, vLLM) and benefits from flash-attention and grouped-query attention optimizations on modern GPUs. Microsoft links a technical report and blog/portal pages from the model card for deeper details and benchmark results. (Source: Hugging Face model card; Microsoft/Azure catalog.)
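
As a quick illustration of the transformers path mentioned above, the sketch below drives the checkpoint through the high-level pipeline API. This is a minimal sketch, assuming a transformers release with chat-message pipeline support (v4.49+); the prompt and generation settings are illustrative, not values from the model card.

from transformers import pipeline

# Quick-start sketch: high-level pipeline with chat-style messages
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-4-mini-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain grouped-query attention in two sentences."},
]
result = pipe(messages, max_new_tokens=128, do_sample=False)
print(result[0]["generated_text"][-1]["content"])  # last message is the assistant reply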

Model Statistics

  • Downloads: 128,000
  • Likes: 661
  • Pipeline: text-generation
  • Parameters: 3.8B

License: MIT

Model Details

Architecture and core specs: Phi-4-mini-instruct is a dense decoder-only Transformer with 3.8B parameters, a 128K-token context length, and a vocabulary of about 200,064 tokens. Major architectural/efficiency features include grouped-query attention and shared input/output embeddings; the model uses Flash Attention by default for faster attention computation on supported GPUs.

Training and data: Microsoft reports training on a mixture of filtered public data, high-quality educational/synthetic (“textbook-like”) reasoning data, and supervised chat data, totaling approximately 5 trillion training tokens.

Hardware and training logistics: the model card lists 512 A100-80G GPUs and an approximately 21-day training run; training took place in late 2024 and the model was released in early 2025.

Integration and runtimes: Phi-4-mini-instruct is supported in common inference ecosystems (the model card references transformers v4.49.0) and vLLM. Microsoft recommends flash-attn-enabled GPUs (A100, A6000, H100) for best throughput; for older GPUs the model card suggests an eager attention fallback. The model card also documents safety post-training, multilingual red-teaming, and recommended mitigations for high-risk applications. (Sources: Hugging Face model card; Microsoft/Azure model catalog.)
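
The attention backend is selected at load time via the attn_implementation argument that recent transformers releases expose on from_pretrained. A minimal sketch of both paths; which one applies depends on whether your GPU and the installed flash-attn package support FlashAttention 2 (an assumption about the environment, not something quoted from the model card):

from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-4-mini-instruct"

# Fast path: FlashAttention 2 on A100/A6000/H100-class GPUs (requires the flash-attn package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
)

# Fallback for older GPUs without FlashAttention support: eager attention
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     device_map="auto",
#     torch_dtype="auto",
#     attn_implementation="eager",
# )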

Key Features

  • 3.8 billion parameters in a dense decoder-only Transformer.
  • Extremely long 128K-token context window for long-document tasks.
  • 200K-token tokenizer vocabulary (about 200,064 entries) to improve multilingual coverage.
  • Instruction-tuned via supervised fine-tuning and direct preference optimization.
  • Grouped-query attention and shared input/output embeddings for efficiency (both verifiable from the model config, as sketched after this list).
  • Optimized for flash-attention on A100/A6000/H100 GPUs; eager fallback for older GPUs.
  • Available under an MIT license and accessible via Hugging Face and Azure.
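
The headline specs above can be checked from the published config without downloading the weights. A minimal sketch, assuming the checkpoint exposes the standard Phi-3-style config fields used by transformers (the field names are the usual transformers ones, not quoted from the model card):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/Phi-4-mini-instruct", trust_remote_code=True)

print(config.max_position_embeddings)  # context window; 128K means 131072 positions
print(config.vocab_size)               # vocabulary size; about 200,064 tokens
# Grouped-query attention shows up as fewer key/value heads than query heads
print(config.num_attention_heads, config.num_key_value_heads)
print(config.tie_word_embeddings)      # True when input/output embeddings are shared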

Example Usage

Example (python):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example: basic Transformers inference (trust_remote_code required for some Phi-4 checkpoints)
model_id = "microsoft/Phi-4-mini-instruct"

torch.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Chat-format prompt (Phi-4-mini-instruct expects its chat template; the trailing
# <|assistant|> tag marks where generation should begin)
chat_prompt = (
    "<|system|>You are a helpful assistant.<|end|>"
    "<|user|>Summarize the following passage in one sentence: 'Artificial intelligence research advances rapidly.'<|end|>"
    "<|assistant|>"
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Alternative: use vLLM for chat-style inference (pattern shown in the model card)
# from vllm import LLM, SamplingParams
# llm = LLM(model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
# messages = [
#     {"role": "system", "content": "You are a helpful AI assistant."},
#     {"role": "user", "content": "Solve 2x+3=7."},
# ]
# sampling_params = SamplingParams(max_tokens=64, temperature=0.0)
# outputs = llm.chat(messages, sampling_params=sampling_params)
# print(outputs[0].outputs[0].text)  # extract the generated text from the RequestOutput
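
Rather than hand-writing the special tokens as above, the tokenizer's chat template can assemble the prompt, which is less error-prone if the format changes between checkpoint revisions. A minimal sketch reusing the model and tokenizer loaded earlier; the messages are illustrative:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "List three uses of a 128K-token context window."},
]
# apply_chat_template inserts the role markers and the trailing generation tag
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))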

Benchmarks

  • Arena Hard (aggregated): 32.8
  • BigBench Hard (0-shot, CoT): 70.4
  • MMLU (5-shot): 67.3
  • GSM8K (8-shot, CoT): 88.6
  • Overall aggregated score (reported): 63.5

(Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)

Last Refreshed: 2026-01-17

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool