Phi-3-mini-4k-instruct - AI Language Models Tool

Overview

Phi-3-mini-4k-instruct is a 3.8-billion-parameter, instruction-tuned language model from Microsoft, designed for low-latency on-device and server inference while retaining strong reasoning and instruction-following ability. The model is part of the Phi-3 family and ships in two context-length variants (4K and 128K tokens); the 4K-instruct variant is optimized for general-purpose conversation, structured output (JSON/XML), and math and logical-reasoning tasks. The model card documents supervised fine-tuning (SFT) followed by Direct Preference Optimization (DPO) to improve instruction adherence and safety behavior. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct))

Microsoft publishes optimized runtime formats (ONNX), and community-distributed GGUF quantized weights ease deployment across servers, desktops, and mobile devices. ONNX variants include int4/FP16 builds and DirectML support for Windows GPUs, while GGUF packages provide popular 4-bit quantized files for local inference. The model is released under an MIT license and integrates with Transformers (the model card notes the supported version) and ONNX Runtime for cross-platform deployment. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx))

Model Statistics

  • Downloads: 706,030
  • Likes: 1391
  • Pipeline: text-generation

License: MIT

Model Details

Architecture and training: Phi-3-mini-4k-instruct is a dense, decoder-only Transformer with ~3.8B parameters, tuned with SFT and DPO for instruction following and safety. The model card lists a 4K-token context length for the Mini-4K instruct variant and points to the 128K long-context variant elsewhere in the Phi-3 family. Training used large-scale compute (512 H100-80G GPUs, as reported) over a multi-trillion-token corpus (the model card cites the training-data totals and cut-off date). Release and training dates, together with the detailed model card, are published on the official Hugging Face model page. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct))

Tokenizer and integration: The model uses the Phi-3 tokenizer, with a reported vocabulary of 32,064 tokens, and is distributed with tokenizer files ready for downstream fine-tuning. Transformers integration is documented; the model card references a compatible transformers version and recommends specific runtime packages such as flash_attn and accelerate. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct))

Deployment formats and optimizations: Microsoft supplies ONNX-optimized builds that run with ONNX Runtime across CPU, GPU, and mobile backends (FP16 and int4 ONNX builds are listed), and community/official GGUF conversions (including quantized Q4 variants) are available for fast local inference on llama.cpp-compatible runtimes (LoLLMS, LM Studio, etc.). The ONNX Runtime notes include DirectML support for Windows GPUs and guidance for int4 CPU/mobile builds. File-size examples for common quantizations (Q4_K_M ≈ 2.2 GB; FP16 ≈ 7.2 GB in some GGUF releases) appear on the repository pages describing the available artifacts. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx))

Limitations and safety: Microsoft's model card lists the standard LLM caveats (bias, hallucination, domain limitations, language coverage skewed toward English) and recommends responsible deployment practices, RAG/grounding for critical knowledge, and additional safeguards for high-risk scenarios. The card also documents iterative safety improvements and red-team testing as part of Phi-3 development. ([huggingface.co](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct))
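The quoted file sizes follow directly from the parameter count and the bits stored per weight. A rough back-of-envelope check (the ~4.5 bits/weight figure for Q4_K_M approximates llama.cpp's mixed-precision block scheme and is an assumption, not an official number):

```python
# Rough weight-file size estimate: parameters * bits-per-weight / 8.
# Q4_K_M mixes 4- and 6-bit blocks plus per-block scales, so ~4.5
# bits/weight is a common approximation, not an exact figure.
PARAMS = 3.8e9  # Phi-3-mini parameter count

def est_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight-file size in gigabytes (decimal GB)."""
    return params * bits_per_weight / 8 / 1e9

fp16 = est_size_gb(PARAMS, 16)   # ~7.6 GB, near the ~7.2 GB GGUF release
q4km = est_size_gb(PARAMS, 4.5)  # ~2.1 GB, near the ~2.2 GB Q4_K_M file
print(f"FP16 ≈ {fp16:.1f} GB, Q4_K_M ≈ {q4km:.1f} GB")
```

Small discrepancies against published artifact sizes come from embedding/output layers quantized at different widths and GGUF metadata overhead.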

Key Features

  • Instruction-tuned with SFT and Direct Preference Optimization for stronger instruction adherence.
  • 3.8B-parameter decoder-only Transformer optimized for low-latency inference and on-device use.
  • Supports 4K context by default; 128K long-context variant available in the Phi-3 family.
  • Distributed in ONNX (FP16/int4) for cross-platform acceleration and GGUF quantized files for local runtimes.
  • Improved structured-output performance (JSON/XML) after June 2024 post-training update.
  • Integrates with Transformers and ONNX Runtime; example runtime flags and flash-attention guidance provided.
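For local runtimes that take a raw prompt string rather than a chat-message list, the instruction tuning above implies a specific turn layout. A minimal sketch of the Phi-3 chat format as commonly documented (`<|system|>`/`<|user|>`/`<|assistant|>` turns delimited by `<|end|>`); in real use, prefer `tokenizer.apply_chat_template()`, which is the authoritative source and may differ in detail:

```python
# Hand-rolled sketch of the Phi-3 chat prompt layout. The role tokens and
# the trailing <|assistant|> cue follow the commonly documented template;
# tokenizer.apply_chat_template() remains the source of truth.

def build_phi3_prompt(messages: list[dict]) -> str:
    """Render chat messages into a single Phi-3-style prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}<|end|>\n")
    parts.append("<|assistant|>\n")  # cue the model to generate its turn
    return "".join(parts)

prompt = build_phi3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain how to solve 2x + 3 = 7."},
])
print(prompt)
```

The same string can be passed to llama.cpp-style frontends that accept a plain prompt instead of structured messages.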

Example Usage

Example (python):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)  # reproducibility, as in the model card example

# Load the model (GPU preferred via device_map="auto") and run a chat-style prompt
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain how to solve 2x + 3 = 7."},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

generation_args = {
    "max_new_tokens": 200,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])  # recommended parameters from the model card
# See the official model card for additional flags (e.g. attn_implementation="flash_attention_2").
# Source: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct
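Since the June 2024 update emphasizes structured JSON output, callers typically validate the generated text before consuming it. A minimal, model-agnostic sketch; the code-fence-stripping heuristic is an assumption about common output styles, not part of the model card:

```python
import json
import re

def parse_json_output(text: str):
    """Try to parse model output as JSON, tolerating markdown code fences
    and surrounding prose. Returns the parsed object, or None on failure."""
    # Strip a ```json ... ``` fence if present.
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the first {...} span embedded in prose.
    brace = re.search(r"\{.*\}", text, re.DOTALL)
    if brace:
        try:
            return json.loads(brace.group(0))
        except json.JSONDecodeError:
            return None
    return None

result = parse_json_output('```json\n{"x": 2, "steps": ["subtract 3", "divide by 2"]}\n```')
print(result)  # → {'x': 2, 'steps': ['subtract 3', 'divide by 2']}
```

Returning None rather than raising lets the caller decide whether to retry the generation or fall back to free-text handling.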

Benchmarks

All figures are for Phi-3-mini-4k after the June 2024 update, as reported on the official model card (https://huggingface.co/microsoft/Phi-3-mini-4k-instruct):

  • MMLU: 70.9
  • JSON structured output: 52.3
  • Instructions Challenge: 42.3
  • GPQA: 30.6
  • Average (reported composite score): 36.7

Last Refreshed: 2026-03-03

Key Information

  • Category: Language Models
  • Type: AI Language Models Tool