Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
ArXiv: 2404.14219
🎯 Pitch
This paper introduces the Phi-3 family of small language models—including phi-3-mini with just 3.8 billion parameters—that achieve performance rivaling models nearly 10 times larger, all while running efficiently and privately on a modern smartphone. By pioneering a data-centric training approach and targeted alignment, Phi-3 marks a leap in practical AI: it makes powerful, safe, and low-latency language capabilities available to everyone—without the need for internet connectivity or massive infrastructure.
1. Executive Summary
This report introduces the Phi-3 family of small language models (SLMs)—notably phi-3-mini (3.8B parameters)—that approach the capability of much larger models while running locally on a phone. The core advance is a data-first training strategy (“data‑optimal regime”) plus targeted post‑training that, together with several efficiency techniques, delivers competitive reasoning, coding, and safety performance at a fraction of the size and memory footprint (e.g., 4‑bit phi-3-mini runs at >12 tokens/s on an iPhone 14; Figure 2).
2. Context and Motivation
- Problem addressed
- High-performing LLMs typically require hundreds of billions of parameters and vast compute, which precludes private, low-latency, on-device use and makes training/deployment costly. The paper targets a long-standing tension: can a small model be both fast and broadly capable?
- Why it matters
- Real-world: On-device models reduce latency, preserve privacy, and enable offline use (e.g., medical, enterprise, travel). Cost-effective models widen access and make edge deployments feasible.
- Theoretical: Classic scaling laws assume fixed data quality; the paper investigates how better data curation plus synthetic data can shift the size–performance trade-off.
- Prior approaches and gaps
- Traditional “bigger is better” (compute-optimal) scaling [KMH+20, HBM+22] improves quality but scales cost and latency. Earlier small models often lag in reasoning and safety.
- phi-2 showed that small models can punch above their size using curated and synthetic “textbook-like” data. However, it did not demonstrate strong on-device runtime or breadth across multilingual, long-context, and multimodal tasks.
- Positioning
- This work expands the “Textbooks Are All You Need” recipe with a larger, more refined dataset and introduces architectural and systems-level optimizations. It also extends the series to phi-3.5 for multilingual/long-context and phi-3.5-Vision for multimodal, benchmarking them against open and commercial models.
3. Technical Approach
The paper’s advances combine a data-centric training regime, efficient architectures, and post-training alignment.
- Model family and sizes
- phi-3-mini (3.8B): Decoder-only transformer with 32 layers and 32 attention heads, hidden size 3072; context 4K by default; 128K with LongRope; trained on 3.3T tokens in bfloat16; chat-finetuned with a simple prompt template. Section “Technical Specifications.”
- phi-3-small (7B) and phi-3-medium (14B): Same family, trained on 4.8T tokens. phi-3-small incorporates several efficiency changes (below). Section “Technical Specifications.”
- phi-3.5-mini and phi-3.5-MoE (language), and phi-3.5-Vision (multimodal): Add multilingual, long-context, and vision capabilities (Sections 4 and 7).
- Data pipeline and the “data‑optimal regime”
- Two-phase pretraining (Section “Training Methodology”):
- Phase 1: Heavily filtered public web data to impart general knowledge/language.
- Phase 2: Mix of even more filtered web data (subset of Phase 1) and synthetic LLM-generated data to teach reasoning and niche skills.
- Data-optimal regime (Section “Data Optimal Regime”):
- Rather than merely scaling compute or epochs, the pipeline curates data to fit a model’s capacity—keeping material with educational/reasoning value while pruning trivia that would waste capacity. The paper illustrates the idea by removing ephemeral facts (e.g., sports results) so a small model can allocate capacity to reasoning.
- Evidence that this regime changes scaling behavior appears in Figure 3, which contrasts Phi scaling vs. Llama-2 trained on fixed data.
- Post-training to become an assistant (Section “Post-training”)
- SFT (Supervised Fine-Tuning): Curated, high-quality instruction data across math, coding, reasoning, safety, and model identity. Starts with English-only examples.
- DPO (Direct Preference Optimization): Preference-based tuning on chat, reasoning, and safety data; “rejected” responses steer the model away from undesired behaviors.
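To make the preference-tuning step concrete, here is a minimal sketch of the standard DPO objective for a single (chosen, rejected) pair. The report does not state its loss hyperparameters, so `beta=0.1` and the log-probability values are illustrative assumptions, and `dpo_loss` is a hypothetical helper name rather than anything from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed token log-probabilities of each full response under
    the trainable policy and under a frozen reference model.
    beta is an assumed value; the report does not give one.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative numbers only.
loss_at_start = dpo_loss(-12.0, -12.0, -12.0, -12.0)    # policy equals reference
loss_when_aligned = dpo_loss(-10.0, -15.0, -12.0, -12.0)  # policy prefers "chosen"
print(loss_at_start, loss_when_aligned)
```

The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, which is how "rejected" examples push the model away from undesired behavior.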
- Efficiency choices (especially for phi-3-small)
- GEGLU activation: A gated linear unit variant that tends to stabilize training.
- muP (Maximal Update Parameterization): A method to tune hyperparameters on a proxy model and transfer to a larger target for stable scaling.
- Grouped-Query Attention (GQA): Multiple query heads share one key/value (K/V) set to reduce memory bandwidth during decoding.
- Blocksparse attention (Figure 1): Each attention head attends to different blocks of the past context (local and “vertical/remote” blocks). This “divide-and-conquer” across heads reduces the amount of K/V state the model must store and fetch.
- Definition: KV cache is the stored keys/values from prior tokens used during autoregressive decoding; reducing it lowers memory and latency.
- Implementation: Custom high-efficiency kernels for training (Triton, based on FlashAttention) and inference (custom prefill kernel; extended paged attention in vLLM).
- Alternating dense and blocksparse layers balances recall and memory savings.
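A rough calculation shows why sharing K/V heads matters for the KV cache. The sketch below uses the phi-3-mini shape quoted earlier (32 layers, 32 heads, hidden size 3072) purely as a convenient example. The group size of 4 and the 16-bit cache precision are assumptions for illustration, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Shape from the phi-3-mini description: head_dim = 3072 / 32 = 96.
N_LAYERS, N_HEADS, HIDDEN = 32, 32, 3072
HEAD_DIM = HIDDEN // N_HEADS
SEQ_LEN = 4096     # default context length
GROUP_SIZE = 4     # assumed: 4 query heads share one K/V head

dense = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
grouped = kv_cache_bytes(N_LAYERS, N_HEADS // GROUP_SIZE, HEAD_DIM, SEQ_LEN)

print(f"full multi-head cache: {dense / 2**30:.2f} GiB")
print(f"grouped-query cache:   {grouped / 2**30:.2f} GiB")
```

Under these assumptions the cache shrinks by exactly the group size. Blocksparse attention reduces it further by keeping only the blocks each head actually attends to, which this calculation does not model.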
- Quantization: phi-3-mini can be quantized to 4-bit weights to fit ≈1.8GB RAM and still run >12 tokens/s on iPhone 14 A16, fully offline (Figure 2).
- Definition: Quantization stores parameters at lower precision to reduce memory and speed up inference with minimal quality loss.
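The memory figure can be sanity-checked with simple arithmetic, and the idea of quantization illustrated with a toy scheme. The report does not describe which 4-bit method is used on the phone, so the symmetric absmax quantizer below is an assumed stand-in, and all function names are hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes, ignoring scales and activations."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_int4(weights):
    """Symmetric absmax quantization of one weight group to integers in [-7, 7]."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # fall back to 1.0 for all-zero groups
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


print(weight_memory_gb(3.8e9, 16))  # 16-bit weights
print(weight_memory_gb(3.8e9, 4))   # 4-bit weights

group = [0.12, -0.70, 0.33, 0.05]   # made-up weights
codes, scale = quantize_int4(group)
print(codes, dequantize_int4(codes, scale))
```

For 3.8B parameters, 4-bit weights come to about 1.9 decimal GB (roughly 1.77 GiB), which is in line with the ≈1.8GB figure. Per-group scale factors and runtime activations add to this in practice.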
- Long context via LongRope (Section 4): A technique to extend positional encoding (rope scaling) to 128K context without retraining from scratch.
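The rope-scaling idea can be sketched as follows. LongRope searches for non-uniform, per-dimension rescale factors. The uniform factor used here is a deliberate simplification (plain position interpolation), not the paper's values, and `rope_angles` is a hypothetical helper. The head dimension of 96 follows from the phi-3-mini shape (3072 / 32), and the factor 32 is 128K / 4K.

```python
def rope_angles(position, head_dim, base=10000.0, scale=None):
    """Rotation angles for one position under rotary position embedding.

    scale holds one rescale factor per frequency pair; positions are
    effectively divided by it. None means unscaled RoPE.
    """
    n_pairs = head_dim // 2
    if scale is None:
        scale = [1.0] * n_pairs
    angles = []
    for i in range(n_pairs):
        inv_freq = base ** (-2.0 * i / head_dim)
        angles.append(position * inv_freq / scale[i])
    return angles


HEAD_DIM = 96
FACTOR = 131072 // 4096  # 128K target window over the 4K default

# Uniform scaling (an assumption; LongRope uses searched per-dimension factors).
uniform = [float(FACTOR)] * (HEAD_DIM // 2)

short = rope_angles(100, HEAD_DIM)                         # position 100, unscaled
long = rope_angles(100 * FACTOR, HEAD_DIM, scale=uniform)  # position 3200, scaled
print(max(abs(a - b) for a, b in zip(short, long)))
```

With a uniform factor, position 3200 in the extended window produces the same angles as position 100 in the original one, so the model sees rotation angles inside the range it was trained on. Non-uniform factors refine this by treating different frequencies differently.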
- Mixture-of-Experts variant (phi-3.5-MoE)
- Architecture (Section 2): 16 experts, top-2 routing per token; each expert is a GLU feed-forward network; 6.6B “active” parameters per token out of 42B total.
- Definition: MoE activates a subset of expert sub-networks per token, increasing capacity without proportionally increasing compute per token.
- Training uses SparseMixer to stabilize the sparse router.
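Top-2 routing can be illustrated with a small forward-pass sketch. It shows one common formulation (softmax over the two selected gate logits). The report does not spell out its exact gating arithmetic, SparseMixer's gradient estimator is not shown, and the toy experts and function names are hypothetical.

```python
import math


def top2_route(gate_logits):
    """Pick the two highest-scoring experts and renormalize their weights."""
    top2 = sorted(range(len(gate_logits)),
                  key=lambda i: gate_logits[i], reverse=True)[:2]
    # Softmax restricted to the two selected experts.
    peak = max(gate_logits[i] for i in top2)
    exps = [math.exp(gate_logits[i] - peak) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]


def moe_forward(x, gate_logits, experts):
    """Weighted sum of the two selected experts' outputs for one token."""
    out = [0.0] * len(x)
    for idx, weight in top2_route(gate_logits):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out


# Toy usage: 16 stand-in experts, where expert k multiplies its input by k.
experts = [lambda x, k=k: [k * v for v in x] for k in range(16)]
gate_logits = [0.0] * 16
gate_logits[3] = gate_logits[9] = 2.0
print(top2_route(gate_logits))
print(moe_forward([1.0, 2.0], gate_logits, experts))
```

Only 2 of the 16 experts run for each token, which is why the active parameter count (6.6B) is much smaller than the total (42B).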
- Multimodal model (phi-3.5-Vision, Section 7)
- Components: CLIP ViT-L/14 image encoder + phi-3.5-mini text decoder.
- Token interleaving: Visual tokens (from CLIP) are interleaved with text tokens; no special ordering required.
- Dynamic cropping: Splits high-res images into blocks and concatenates their tokens to cover diverse aspect ratios and maintain detail (up to 1344×1344 during pretraining).
- Training:
- Pretraining: ~0.5T tokens across interleaved image-text documents (e.g., OBELICS), FLD-5B pairs, OCR-synthesized data, chart/table datasets, and text-only; apply next-token prediction on text tokens (no image-token loss).
- Post-training: Multimodal SFT (~33B tokens) across natural images, charts, diagrams, presentations, videos, multi-image reasoning, plus safety; then DPO with text and smaller-scale multimodal prefs.
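The dynamic cropping step described above can be pictured as choosing a 2D grid of fixed-size blocks for each image. The sketch below is only a guess at the bookkeeping involved. The 1344-pixel cap comes from the text, but the 336-pixel block size and the downscaling rule are assumptions, and `crop_grid` is a hypothetical helper.

```python
def crop_grid(width, height, block=336, max_side=1344):
    """Return (rows, cols) of blocks covering an image after capping its size.

    block=336 is an assumed encoder input size; max_side=1344 is the cap
    mentioned for pretraining.
    """
    longest = max(width, height)
    if longest > max_side:
        # Integer downscale that keeps the aspect ratio (rounded down).
        width = max(1, width * max_side // longest)
        height = max(1, height * max_side // longest)
    cols = -(-width // block)   # ceiling division
    rows = -(-height // block)
    return rows, cols


for size in [(1344, 1344), (672, 336), (1000, 3000)]:
    rows, cols = crop_grid(*size)
    print(size, "->", rows, "x", cols, "=", rows * cols, "blocks")
```

Each block would be encoded separately and its visual tokens concatenated, so wide, tall, and square images all keep their detail at the cost of a variable number of visual tokens.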
4. Key Insights and Innovations
- Data-first scaling for small models (“data‑optimal regime”) is powerful
- What’s new: Rather than scaling model size, the team scales and curates data to match a small model’s capacity, blending filtered “educational” web data with synthetic reasoning data in two phases.
- Why it matters: Figure 3 shows Phi models deviate favorably from Llama-2 scaling trained on fixed data; phi-3-mini reaches 68.8% MMLU (Table in Section 3) and MT-Bench 8.38 while being only 3.8B parameters.
- Practical on-device LLM at useful quality
- What’s new: 4-bit phi-3-mini occupies ~1.8GB and runs >12 tokens/s fully offline on an iPhone 14 (Figure 2).
- Why it matters: This demonstrates a step-change in deployability: capabilities close to GPT‑3.5-level benchmarks without a datacenter.
- Memory-efficient attention at scale (blocksparse + kernels)
- What’s new: Per-head block patterns that collectively cover the full context, plus custom kernels for training and decoding (Figure 1).
- Why it matters: Reduces KV cache and improves speed without giving up long-context retrieval; enables practical deployment with constrained memory.
- Compact MoE with high active capacity (phi-3.5-MoE)
- What’s new: A 16×3.8B MoE (6.6B active) using top-2 routing and SparseMixer for stable training.
- Why it matters: Table 3 shows it outperforms similarly sized open models (e.g., Llama‑3.1‑8B, Mixtral series) and approaches Gemini‑1.5‑Flash and GPT‑4o‑mini across many language benchmarks.
- Multimodal phi-3.5‑Vision tuned for real-user prompting
- What’s new: High-resolution dynamic cropping, interleaved tokens, and a large SFT+DPO recipe tuned under simple, 0‑shot prompts (Section 7.2 setup).
- Why it matters: Table 5 shows competitive or better results versus open multimodal baselines on diverse tasks (e.g., 91.3 on ScienceQA, 81.9 on MMBench), while being only 4.2B parameters.
5. Experimental Analysis
- Evaluation setup (Section 3 and Section 7.2)
- Language benchmarks: MMLU, HellaSwag, ANLI, GSM‑8K (with Chain-of-Thought, CoT), MATH (CoT), MedQA, AGIEval, TriviaQA, ARC‑C/E, PIQA, Social IQA, BigBench-Hard (CoT), WinoGrande, OpenBookQA, BoolQ, CommonSenseQA, TruthfulQA, HumanEval, MBPP. Few-shot prompts; temperature 0; same pipeline for comparability.
- Long-context: RULER and RepoQA (Section 4; Tables 1–2).
- Multilingual: MMLU‑multilingual and MGSM (Figure 4; Table 3).
- Safety: Internal multi-turn RAI benchmark (Table 4) and red-teaming (Figure 5).
- Multimodal: MMMU, ScienceQA, MathVista, Inter-GPS, MMBench, POPE, AI2D, ChartQA, TextVQA (Table 5), plus multi-image/video BLINK and VideoMME (Table 6).
- Note on fairness: The report emphasizes consistent prompts and that numbers may differ from other publications due to evaluation choices; e.g., “we did no optimization to the pipeline for the phi‑3 models” and even omitted known prompt tweaks that help (footnote in Section 3).
- Main quantitative results (selected)
- Capability of small phi-3 models (Section 3 table):
- phi-3-mini (3.8B):
- MMLU 68.8 vs GPT‑3.5 71.4, Mixtral‑8×7B 70.5, Llama‑3‑8B‑Instruct 66.5.
- GSM‑8K (8‑shot CoT): 82.5, higher than GPT‑3.5 78.1 and Mixtral‑8×7B 64.7.
- MATH (0‑shot CoT): 41.3 vs GPT‑3.5 45.3 (close for a much smaller model).
- Average across tasks: 69.7 vs GPT‑3.5 72.8 and Mixtral‑8×7B 66.8.
- Scaling within family:
- phi-3-small (7B): MMLU 75.7; GSM‑8K 89.6; BigBench‑Hard (CoT) 79.1.
- phi-3-medium (14B): MMLU 78.0; GSM‑8K 91.0; MATH 53.1.
- These jumps support the “data-optimal regime” up to 7B; the paper notes smaller gains from 7B to 14B (Section “Data Optimal Regime”).
- On-device feasibility
- “phi‑3‑mini can be quantized to 4‑bits so that it only occupies ≈ 1.8GB of memory… on iPhone 14… fully offline achieving more than 12 tokens per second.” (Figure 2 caption and preceding paragraph)
- Long-context and multilingual (Section 4)
- RepoQA (Table 1, code long-context QA): phi-3.5-MoE avg 85 vs Llama‑3.1‑8B 71 and Mixtral‑8×7B 68; GPT‑4o (May‑2024) 90.6.
- RULER (Table 2): phi-3.5-MoE avg 87.1 vs Llama‑3.1‑8B 88.3. Performance drops sharply at 128K (64.2), attributed to limited high-quality long-context mid-training data.
- Multilingual MMLU (Figure 4): phi-3.5-mini improves average from 47.3 (phi‑3‑mini) to 55.4; phi-3.5-MoE reaches 69.9.
- Aggregate comparison vs strong baselines (Table 3)
- Average across representative language benchmarks: phi-3.5-MoE 69.2, phi-3.5-mini 61.1, Gemini‑1.5 Flash 68.5, GPT‑4o‑mini 74.9.
- Selected tasks:
- BigBench‑Hard (0‑shot CoT): phi-3.5-MoE 79.1 vs GPT‑4o‑mini 80.4.
- GSM‑8K (8‑shot CoT): phi-3.5-MoE 88.7 vs GPT‑4o‑mini 91.3.
- HumanEval (code): phi-3.5-MoE 70.7 vs GPT‑4o‑mini 86.6; Gemini‑1.5 Flash 74.4.
- Safety and robustness
- Red-team results (Figure 5) show large reductions in harmful responses after safety alignment.
- Internal RAI benchmark (Table 4): Lower is better. phi-3.5-MoE has ungroundedness 0.228 (better than Mistral‑7B’s 0.935 and Gemma‑7B’s 0.679), and competitive defect rates across categories.
- Multimodal (phi-3.5-Vision, Table 5; Section 7.2)
- ScienceQA: 91.3 (above both Gemini 1.0 Pro V 79.7 and GPT‑4o 88.5).
- MMBench (dev‑en): 81.9 (competitive with open baselines; below GPT‑4o 88.4).
- ChartQA: 81.8 (substantially higher than many open baselines; GPT‑4o at 64.0 under this evaluation setup).
- TextVQA: 72.0 (close to GPT‑4o 75.6 under identical pipeline).
- Multi-image/video (Table 6)
- BLINK: 57.0 (above GPT‑4o‑mini 51.9; below GPT‑4o 63.2).
- VideoMME: 50.8 (below Gemini‑1.5 Flash 62.3; the paper uses a uniform 16‑frame protocol for fairness across models).
- Robustness and qualitative checks
- Search augmentation can fix factual gaps: Figure 6 shows a failure without search and a correct, more specific answer with search enabled.
- The paper notes prompt sensitivity (e.g., adding “##” before questions improves scores) but doesn’t exploit such tweaks in reported numbers (footnote in Section 3).
- Overall assessment
- The experimental suite is broad and carefully standardized. Results substantiate the central claims: (1) small models can reach strong general and reasoning performance with data‑optimal training; (2) the approach scales to multilingual, long-context, and multimodal settings; and (3) safety alignment substantially improves red-team outcomes. Some areas (extreme long context, code on HumanEval vs GPT‑4o‑mini, certain video tasks) still lag SOTA, clarifying the boundaries of the current recipe.
6. Limitations and Trade-offs
- Capacity constraints in factual knowledge (Section 6 “Weakness”)
- Small models have limited memory for world facts; this surfaces in lower TriviaQA scores and other knowledge-heavy tasks. The report suggests search/RAG augmentation as a remedy (Figure 6).
- English-centric training at the mini scale
- phi-3-mini is primarily English. Multilingual strength appears only after mid-training for the phi-3.5 series; phi-3.5-mini improves but still trails phi-3.5-MoE (Figure 4, Table 3).
- Long-context fragility at maximum window
- RULER performance drops sharply at 128K (Table 2), attributed to insufficient high-quality long-context mid-training data. This indicates window extension via LongRope alone is not enough; high-quality long-context supervision is needed.
- Diminishing returns at larger sizes with the same data mixture
- The jump from 3.8B → 7B brings large gains; 7B → 14B brings smaller gains (Section “Data Optimal Regime”), hinting the data mixture may be suboptimal for 14B scaling.
- Systems and portability trade-offs
- Blocksparse attention depends on custom kernels and a particular sparsity pattern per head (Figure 1), which complicates portability across runtimes and hardware.
- MoE complexity
- While phi-3.5-MoE delivers strong accuracy, MoE introduces routing complexity and potential load-balancing challenges. Training stability requires mechanisms like SparseMixer; inference may require specialized infrastructure for efficiency.
- Safety and hallucinations are mitigated, not solved
- Despite improved RAI metrics and red-teaming (Figure 5, Table 4), the paper notes remaining issues: factual inaccuracies, bias reproduction, and occasional unsafe responses (Sections 5 and 7.4).
- Reproducibility of the “data-optimal” recipe
- The precise filtering and synthetic data generation pipeline is not fully public. Reproducing the exact mixture and quality may be challenging for third parties.
7. Implications and Future Directions
- Field impact
- This work shifts the default assumption from “capability requires scale” to “capability can be bought with better data and alignment,” at least up to mid-tier performance. It makes on-device assistants plausible and useful.
- Practical applications
- Private mobile assistants; offline enterprise copilots; edge analytics in healthcare/finance where data egress is restricted; field operations (e.g., industrial inspection) needing multimodal understanding without connectivity.
- Multilingual and long‑context variants broaden applicability to global use and large-document/code understanding (Tables 1–2).
- Research avenues
- Data mixture optimization: Formalizing the “data‑optimal regime” (selection, scheduling, synthetic generation criteria) and making it adaptive to model size.
- Long-context learning: Curate high-quality 128K+ training data; study retrieval and memory mechanisms that complement LongRope.
- Efficient attention: Generalize blocksparse patterns, auto-learn sparsity, and standardize kernels for broader hardware support.
- MoE for small models: Explore latency-aware expert routing and dynamic capacity control for mobile/edge inference.
- Safety and grounding: Combine DPO with tool-use (retrieval, calculators, verifiers) and multimodal red-teaming; reduce hallucinations especially in finance and safety-critical domains (Section 7.4).
- Multimodal scaling: Better video understanding through frame selection policies and temporal modeling beyond uniform 16-frame sampling (Table 6).
“Highly capable language model running locally on a cell-phone… quantized to 4-bits so that it only occupies ≈ 1.8GB… achieving more than 12 tokens per second.” (Section 2; Figure 2)
“Scaling law close to the ‘Data Optimal Regime’… phi‑1.5 → phi‑2 → phi‑3‑mini → phi‑3‑small vs Llama‑2 family trained on the same fixed data.” (Figure 3)
“phi-3.5-MoE… achieves superior performance… on par with Gemini-1.5-Flash and GPT-4o-mini” in many language tasks (Table 3), while phi-3.5-Vision is competitive across single- and multi-image benchmarks (Tables 5–6).