Phi-4-mini-instruct - AI Language Models Tool
Overview
Phi-4-mini-instruct is a 3.8-billion-parameter instruction-tuned language model from Microsoft’s Phi-4 family, optimized for instruction following, reasoning, and long-context tasks. The model is post-trained with supervised fine-tuning and direct preference optimization to improve helpfulness, truthfulness, and adherence to instructions while keeping a compute footprint small enough for latency-sensitive or memory-constrained deployments.

According to the model card, Phi-4-mini-instruct supports an unusually long 128K-token context window and a large 200K-token vocabulary to improve multilingual and long-document performance. (Source: Hugging Face model card.)

Phi-4-mini-instruct is released under an MIT license and is available on Hugging Face and via Microsoft/Azure tooling, intended for both research and commercial use. The model is designed to run with mainstream inference stacks (transformers, vLLM) and benefits from flash-attention and grouped-query-attention optimizations on modern GPUs. Microsoft links a technical report and blog/portal pages from the model card for deeper details and benchmark results. (Sources: Hugging Face model card; Microsoft/Azure catalog.)
Model Statistics
- Downloads: 128,000
- Likes: 661
- Pipeline: text-generation
- Parameters: 3.8B
- License: MIT
Model Details
Architecture and core specs: Phi-4-mini-instruct is a dense decoder-only Transformer with 3.8B parameters, a 128K-token context length, and a vocabulary of roughly 200,064 tokens. Major architectural and efficiency features include grouped-query attention and shared input/output embeddings; the model uses Flash Attention by default for faster attention computation on supported GPUs.

Training and data: Microsoft reports training on a mixture of filtered public data, high-quality educational and synthetic ("textbook-like") reasoning data, and supervised chat data, totaling approximately 5 trillion tokens.

Hardware and training logistics: the model card lists 512 A100-80G GPUs and a training run of roughly 21 days; training took place in late 2024 and the model was released in early 2025.

Integration and runtimes: Phi-4-mini-instruct is supported in common inference ecosystems (transformers, for which v4.49.0 integration is referenced, and vLLM). Microsoft recommends flash-attn-capable GPUs (A100, A6000, H100) for best throughput; for older GPUs the model card suggests an eager attention fallback, as the loading sketch below illustrates. The model card also documents safety post-training, multilingual red-teaming, and recommended mitigations for high-risk applications. (Sources: Hugging Face model card; Microsoft/Azure model catalog.)
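The attention backend is chosen at load time. A minimal sketch, assuming the transformers attn_implementation argument (available in recent transformers releases); flash_attention_2 additionally requires the flash-attn package and a supported GPU:

Example (python):
import torch
from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-4-mini-instruct"

# On flash-attn-capable GPUs (A100/A6000/H100), request Flash Attention 2;
# on older hardware, fall back to the portable eager implementation.
try:
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="flash_attention_2",  # needs the flash-attn package
    )
except (ImportError, ValueError):
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="eager",  # fallback the model card suggests for older GPUs
    )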
Key Features
- 3.8 billion parameters in a dense decoder-only Transformer.
- Long 128K-token context window for long-document tasks.
- Roughly 200K-token tokenizer vocabulary to improve multilingual coverage (the sketch after this list shows how to verify both values from the published config).
- Instruction-tuned via supervised fine-tuning and direct preference optimization.
- Grouped-query attention and shared embeddings for efficiency.
- Optimized for flash-attention on A100/A6000/H100 GPUs; eager fallback for older GPUs.
- Available under an MIT license and accessible via Hugging Face and Azure.
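As a quick sanity check, the advertised context length and vocabulary size can be read straight from the published config and tokenizer. A minimal sketch; the expected values in the comments are those reported on the model card:

Example (python):
from transformers import AutoConfig, AutoTokenizer

model_id = "microsoft/Phi-4-mini-instruct"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(config.max_position_embeddings)  # expected: 131072 (the 128K context window)
print(config.vocab_size)               # expected: 200064 (the ~200K vocabulary)
print(len(tokenizer))                  # may differ slightly if extra special tokens were added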
Example Usage
Example (python):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
# Example: basic Transformers inference (trust_remote_code required for some Phi-4 checkpoints)
model_id = "microsoft/Phi-4-mini-instruct"
torch.manual_seed(0)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Chat-format prompt (Phi-4-mini-instruct expects its chat template; the trailing
# <|assistant|> tag marks where generation should begin)
chat_prompt = (
    "<|system|>You are a helpful assistant.<|end|>"
    "<|user|>Summarize the following passage in one sentence: 'Artificial intelligence research advances rapidly.'<|end|>"
    "<|assistant|>"
)
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
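Building the prompt string by hand is error-prone; the tokenizer's chat template produces the same format automatically. A minimal sketch, continuing from the example above and using the standard apply_chat_template API (add_generation_prompt=True appends the <|assistant|> tag):

Example (python):
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize in one sentence: 'Artificial intelligence research advances rapidly.'"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))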
# Alternative: use vLLM for chat-style API (example shown in model card)
# from vllm import LLM, SamplingParams
# llm = LLM(model="microsoft/Phi-4-mini-instruct", trust_remote_code=True)
# messages = [{"role":"system","content":"You are a helpful AI assistant."},{"role":"user","content":"Solve 2x+3=7."}]
# sampling_params = SamplingParams(max_tokens=64, temperature=0.0)
# outputs = llm.chat(messages=messages, sampling_params=sampling_params)
# print(outputs[0].outputs[0].text)
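The pipeline helper imported above is a third option; recent transformers versions let the text-generation pipeline consume chat-style message lists directly. A minimal sketch, assuming that message-list support (present in current transformers releases):

Example (python):
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Solve 2x+3=7."},
]
result = pipe(messages, max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"][-1]["content"])  # the final turn is the assistant's reply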
Benchmarks
- Arena Hard (aggregated): 32.8 (Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)
- BigBench Hard (0-shot, CoT): 70.4 (Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)
- MMLU (5-shot): 67.3 (Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)
- GSM8K (8-shot, CoT): 88.6 (Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)
- Overall aggregated score (reported): 63.5 (Source: https://huggingface.co/microsoft/Phi-4-mini-instruct)
Key Information
- Category: Language Models
- Type: AI Language Models Tool