
Phi-4 Technical Report

ArXiv: 2412.08905

🎯 Pitch

Phi-4 is a 14-billion-parameter language model that sets a new standard for small models through a data-centric approach: extensive, curated synthetic data and a token-level preference optimization method in post-training. This design lets phi-4 rival or surpass much larger models (for example, it outperforms GPT-4o on graduate-level STEM QA and competition math) while keeping inference cost and latency low. Phi-4 shows that high-quality data and a well-designed training curriculum can unlock strong reasoning ability in compact models, which widens access to capable AI systems and sets a reference point for efficient, trustworthy small-model deployment.


1. Executive Summary

Phi‑4 is a 14‑billion‑parameter language model built around a data‑centric recipe: heavy, carefully designed synthetic data during pretraining and midtraining, plus a new token‑level preference optimization method in post‑training. The result is a small model that matches or surpasses much larger systems on reasoning‑heavy tasks—e.g., it beats GPT‑4o on graduate‑level STEM QA (GPQA) and math competition problems (MATH)—while being evaluated on fresh, decontaminated data such as the November 2024 AMC exams (Figure 1; Table 1).

2. Context and Motivation

  • Problem/gap addressed
  • Small models rarely rival large frontier models on complex reasoning without expensive inference-time tricks (e.g., long chains of thought). Even when small models are strong, benchmark contamination makes results hard to trust. This paper targets both: (1) how to train a small model to reason well at low inference cost, and (2) how to evaluate it credibly (Section 1.1).
  • Why it matters
  • Practical impact: lower latency and cost at deployment compared to very large models or long-chain-of-thought systems (Section 1.1 notes QwQ‑32B uses >4× tokens per solution and >2× parameters, making its inference “an order of magnitude” costlier).
  • Scientific value: shows how data quality and curriculum can substitute for brute‑force scaling (Introduction, Section 1).
  • Prior approaches and shortcomings
  • “Organic” web‑heavy pretraining often misaligns with inference settings (e.g., forum style vs. chat style) and can be contaminated by leaked benchmarks (Sections 1 and 2.1).
  • Earlier Phi models relied heavily on distillation from a teacher (GPT‑4) but still trailed frontier models on advanced reasoning (Introduction).
  • Recent long chain‑of‑thought models (o1/o‑series style) perform well but at high inference cost (Section 1.1).
  • Positioning
  • Phi‑4 keeps the phi‑3 architecture but overhauls the data pipeline: synthetic data is central in all stages; web data is curated and filtered as “seeds,” and post‑training introduces Pivotal Token Search (PTS), a token‑level DPO method (Sections 2, 3, 4).

3. Technical Approach

This section unpacks how phi‑4 is built and aligned.

  • Model and training overview
  • Architecture: decoder‑only transformer, 14B parameters; 4K context in pretraining extended to 16K in midtraining; tiktoken tokenizer (100,352 vocab, padded); full attention at 4K (Section 3).
  • Pretraining schedule: ~10T tokens with linear warm‑up/decay, peak LR 3e‑4, weight decay 0.1, global batch 5760; midtraining adds 250B tokens for 16K context with a 10× smaller LR and RoPE base 250K (Section 3, 3.3).
  • Instruction following is learned later; hence pretraining evaluation uses log‑likelihood and few‑shot formats rather than strict 0‑shot templates (Table 2).
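The schedule described above (linear warm‑up, linear decay, peak LR 3e‑4, then a 10× smaller peak for midtraining) can be sketched as a small function. This is only an illustration: the warm‑up length, the step counts, and decay to exactly zero are assumptions, since the summary does not state them.

```python
def linear_warmup_decay_lr(step: int, total_steps: int, warmup_steps: int,
                           peak_lr: float = 3e-4, final_lr: float = 0.0) -> float:
    """Linear warm-up to peak_lr, then linear decay to final_lr.

    warmup_steps, total_steps and final_lr are hypothetical knobs; only the
    peak value (3e-4) and the linear shape come from the summary.
    """
    if step < warmup_steps:
        # Ramp up so that the last warm-up step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


# Midtraining reuses the same shape with a 10x smaller peak.
MIDTRAIN_PEAK_LR = 3e-4 / 10
```

Calling `linear_warmup_decay_lr(step, total_steps, warmup_steps, peak_lr=MIDTRAIN_PEAK_LR)` gives the corresponding midtraining curve under the same assumptions.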

  • Data recipe (the core of the method)

  • What “synthetic data” means here: text generated by LLMs in carefully designed workflows so every next token is predicted in‑distribution, often with explicit step‑by‑step reasoning (“spoonfeeding,” Section 2.1).
  • Three pillars (Section 1): 1) Synthetic data for pre/midtraining; 2) meticulous curation/filtering of organic sources (web, books, code) as seeds and as complementary data; 3) new post‑training techniques.

  • Synthetic data generation (Section 2.2; Appendix D)

  • 50 dataset types (~400B unweighted tokens) produced by multi‑stage prompting and quality control. Key elements:

    • Seed curation
    • Web/code snippets and book/paper passages selected via two‑stage filtering for “educational potential,” reasoning depth, and factual content (Section 2.2, “Web and Code‑based Seeds”).
    • Large Q&A pools collected and filtered by plurality voting: discard too‑easy (all answers agree) and too‑ambiguous (no agreement) items; keep “challenging but approachable” ones; use plurality answers for rejection sampling during generation (Section 2.2, “Question Datasets”).
    • Extract Q&A from reasoning chains embedded in organic text by detecting deduction steps and reformulating them into questions/answers (Section 2.2, “Creating Question‑Answer pairs…”).
    • Rewrite and augment: turn informative passages into exercises, discussions, or structured reasoning tasks (Section 2.2, “Rewrite and Augment”).
    • Self‑revision: generate → critique → revise with rubrics targeting reasoning and accuracy (Section 2.2, “Self‑revision”; Appendix D.1.2).
    • Instruction reversal (notable for code): start from code; generate the corresponding instruction so the pair is in “instruction → code” order; only keep high‑fidelity pairs where regenerated code matches original (Section 2.2).
    • Validation: run code/tests; for science QA, apply extraction pipelines that ensure relevance and difficulty balance (Section 2.2).
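The plurality‑voting filter for question pools can be illustrated with a short sketch. The function name and the `min_share` threshold are hypothetical; the summary only says that unanimous items are dropped as too easy, items with no agreement are dropped as too ambiguous, and the plurality answer is kept for later rejection sampling.

```python
from collections import Counter
from typing import List, Optional, Tuple


def plurality_filter(answers: List[str],
                     min_share: float = 0.4) -> Tuple[bool, Optional[str]]:
    """Decide whether to keep a question given several sampled answers.

    Returns (keep, plurality_answer). min_share is an assumed cut-off for
    "no agreement"; the source does not give a number.
    """
    if not answers:
        return False, None
    top_answer, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    if share == 1.0:
        return False, top_answer   # every sample agrees: too easy
    if share < min_share:
        return False, None         # no usable agreement: too ambiguous
    return True, top_answer        # challenging but approachable
```

For example, `plurality_filter(["4", "4", "5", "4"])` keeps the question with plurality answer `"4"`, which could then serve as the reference when rejecting generated solutions.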
  • Organic (human) data curation and filtering (Section 2.3)

  • Targeted acquisitions: arXiv, PubMed Central, GitHub, licensed books (reasoning‑dense corpora).
  • Filtering web dumps: small non‑LLM classifiers trained on ~10^6 LLM‑generated annotations to select high‑quality pages; extra pipeline to avoid STEM over‑bias and amplify non‑STEM content; remove corrupted artifacts using n‑gram/compression heuristics.
  • Multilingual coverage: fastText‑based language ID for 176 languages; quality filtering with the same classifiers distilled to multilingual (Section 2.3).
  • Custom extraction/cleaning: robust HTML‑to‑text preserving equations, code, tables, thread structure; parsers for TeX, EPUB/XML, Word, PDFs (Section 2.3).
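As a rough illustration of the n‑gram and compression heuristics mentioned for removing corrupted artifacts, the sketch below flags text that compresses unusually well or is dominated by repeated word n‑grams. The specific signals, the whitespace tokenization, and both thresholds are assumptions chosen for the example, not values from the report.

```python
import zlib
from collections import Counter


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size; very low values suggest repetition."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Share of word n-gram occurrences whose n-gram appears more than once."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    return sum(c for c in counts.values() if c > 1) / len(grams)


def looks_corrupted(text: str, min_ratio: float = 0.2,
                    max_repeat: float = 0.5) -> bool:
    """Hypothetical rule: flag if either heuristic crosses its threshold."""
    return (compression_ratio(text) < min_ratio
            or repeated_ngram_fraction(text) > max_repeat)
```

In practice such thresholds would need tuning per source and language, and short documents compress poorly, so a length floor would likely be needed as well.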

  • How much of each data type? (Sections 3.1–3.2)

  • Ablations establish two principles:
    • More epochs on high‑quality synthetic beats adding more unique web tokens for reasoning benchmarks (Figure 2).
    • Purely synthetic models underperform on knowledge retrieval (large gap on TriviaQA, Table 3).
  • Final pretraining mixture (Table 5):

    • Synthetic 40% (≈290B unique tokens, 13.8 epochs),
    • Web rewrites 15% (≈290B, 5.2 epochs),
    • Filtered web 15% (≈1.3T, 1.2 epochs),
    • Code 20% (≈820B, 2.4 epochs),
    • Acquired sources 10% (≈580B, 1.7 epochs).
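The epoch counts in the mixture follow from simple arithmetic: epochs ≈ (mixture share × total training tokens) / unique tokens. The check below assumes a total of exactly 10T tokens (the summary says "~10T") and uses the rounded unique‑token figures listed above.

```python
TOTAL_TOKENS = 10e12  # assumed: the summary gives "~10T" pretraining tokens

# name -> (share of training tokens, approximate unique tokens)
MIXTURE = {
    "synthetic":    (0.40, 290e9),
    "web_rewrites": (0.15, 290e9),
    "filtered_web": (0.15, 1.3e12),
    "code":         (0.20, 820e9),
    "acquired":     (0.10, 580e9),
}


def implied_epochs(share: float, unique_tokens: float,
                   total_tokens: float = TOTAL_TOKENS) -> float:
    """Number of passes over a source implied by its share of the budget."""
    return share * total_tokens / unique_tokens


EPOCHS = {name: round(implied_epochs(share, unique), 1)
          for name, (share, unique) in MIXTURE.items()}
```

Under these assumptions the computed values round to the listed 13.8, 5.2, 1.2, 2.4, and 1.7 epochs, which makes the contrast concrete: synthetic data is repeated more than ten times while filtered web is seen roughly once.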
  • Midtraining for long context (Section 3.3; Table 6)

  • Strategy: prefer inherently long inputs (books, academic, code) over concatenated padding; up‑weight ≥8K/16K samples; add new synthetic long sequences; final mix = 30% new long‑context data + 70% recall tokens from pretraining.
  • Outcome: at 16K, improved many‑shot ICL and long‑doc QA/summarization (Table 6).
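The training overview mentions raising the RoPE base to 250K for the 16K stage. The sketch below shows why a larger base is a natural fit for longer context: each rotary frequency pair has a wavelength of 2π · base^(2i/d) token positions, so a larger base stretches the slow‑rotating dimensions. The head dimension of 128 and the comparison base of 10,000 are assumptions for illustration; the summary does not state them.

```python
import math
from typing import List


def rope_wavelengths(head_dim: int, base: float) -> List[float]:
    """Wavelength, in token positions, of each rotary frequency pair.

    Pair i rotates by pos * base**(-2*i/head_dim) radians, so one full turn
    takes 2*pi * base**(2*i/head_dim) positions.
    """
    return [2 * math.pi * base ** (2 * i / head_dim)
            for i in range(head_dim // 2)]


HEAD_DIM = 128  # hypothetical head dimension
short_ctx = rope_wavelengths(HEAD_DIM, 10_000)    # assumed common default base
long_ctx = rope_wavelengths(HEAD_DIM, 250_000)    # base stated for midtraining
```

Comparing `short_ctx[-1]` with `long_ctx[-1]` shows how much slower the lowest‑frequency pair rotates after the change, while the highest‑frequency pair (index 0) is unaffected.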

  • Post‑training to become an assistant (Section 4)

  • Format: chatml schema (two turns example provided).
  • SFT: ~8B tokens covering math, coding, reasoning, general chat, safety, model identity, and 40 languages (Section 4.1).
  • Two rounds of DPO (Direct Preference Optimization; Section 4.2):

    • Stage 1: Pivotal Token Search (PTS) DPO (novel) using token‑level preferences (Section 4.3; Table 7 for data mix).
    • Stage 2: judge‑guided DPO with ~850k full‑response pairs; responses from GPT‑4o, GPT‑4t, and phi‑4 judged by GPT‑4o with rubrics scoring accuracy/style/detail (Appendix A.2; Table 8).
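Both rounds optimize the DPO objective, which can be written for a single preference pair as below. This is the standard DPO loss in generic form, not code from the report; `beta = 0.1` is an assumed value, and the log‑probabilities are plain floats standing in for summed token log‑probs under the policy and a frozen reference model. For token‑level pairs, each "response" would be a single token.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_ratio - rejected_ratio)
    return _softplus(-margin)  # equals -log(sigmoid(margin))
```

The loss is log 2 when the policy matches the reference and falls as the policy shifts probability toward the accepted response relative to the rejected one.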
  • What PTS is and how it works (Section 4.3; Figures 3–5)

  • “Pivotal tokens” are individual tokens where choosing one continuation vs. another causes a sharp change in the chance that the final answer is correct.
  • Estimating “probability of success”: from each prefix t1..ti, sample many rollouts and use an oracle to score correctness (unit tests for code; exact answer for math/QA).
  • Algorithm (Figure 4): recursively subdivide the completion into segments and keep the single tokens where |Δ p(success)| ≥ threshold p_gap; build DPO pairs where the prompt is the prefix and the accepted vs. rejected completions are the two candidate tokens.
  • Why it’s needed: standard DPO spreads learning signal across whole sequences; low‑probability but harmful tokens can get incorrectly reinforced; PTS isolates the exact decision points that flip success (Figure 3 shows such flips in a math solution).
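The recursive subdivision can be sketched as follows. This is a simplified reading of the description above, not the report's implementation: `p_success` is a hypothetical callable that, in a real system, would sample rollouts from the prefix and score them with an oracle, and the midpoint split is an assumption. The sketch only locates pivotal positions in one completion; pairing each with an alternative token to form a DPO pair is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = Tuple[str, ...]


def pivotal_token_search(tokens: Sequence[str],
                         p_success: Callable[[Tokens], float],
                         p_gap: float) -> List[Tuple[Tokens, str, float]]:
    """Return (prefix, token, delta) for tokens that shift p(success) by >= p_gap."""

    def subdivide(prefix: Tokens, segment: Tokens) -> List[Tokens]:
        # Stop when the segment is one token or barely moves p(success).
        if len(segment) <= 1 or abs(p_success(prefix + segment)
                                    - p_success(prefix)) < p_gap:
            return [segment]
        mid = len(segment) // 2
        left, right = segment[:mid], segment[mid:]
        return subdivide(prefix, left) + subdivide(prefix + left, right)

    pivotal: List[Tuple[Tokens, str, float]] = []
    prefix: Tokens = ()
    for segment in subdivide((), tuple(tokens)):
        if len(segment) == 1:
            delta = p_success(prefix + segment) - p_success(prefix)
            if abs(delta) >= p_gap:
                pivotal.append((prefix, segment[0], delta))
        prefix = prefix + segment
    return pivotal
```

With a toy `p_success` that jumps from 0.5 to 0.9 after the fourth token of a seven‑token completion, the search returns only that fourth token with its three‑token prefix. Because each probe costs many rollouts in practice, skipping flat segments is what keeps the search affordable.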

  • Hallucination mitigation (Appendix A.1; Figure 6)

  • Goal: when the model is unlikely to know an obscure fact, prefer refusal over guessing.
  • Data creation: for each trivia item (from TriviaQA seeds), estimate phi‑4’s success rate; generate correct answers and refusal messages; also generate “bogus but plausible” variants and paired refusals; then use SFT and DPO on short (first 5 tokens) responses to nudge behavior (Appendix A.1.1).
  • Effect: large rise in “not attempted” on SimpleQA; fewer incorrect guesses (Figure 6).

  • Robust decontamination (Appendix B)

  • Hybrid 13‑gram and 7‑gram decontamination with thresholds; safelist of ubiquitous 13‑grams (Algorithm 1). Example shows detection against AGIEval with overlapping n‑grams (Appendix B, last page).
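A simplified version of n‑gram decontamination is sketched below. It is not the report's Algorithm 1: the whitespace tokenization, the rule for combining the 13‑gram and 7‑gram checks, and both thresholds are assumptions made for the example.

```python
from typing import FrozenSet, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Set[NGram]:
    """All contiguous n-grams of a token list, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(train_text: str, benchmark_text: str,
                    safelist: FrozenSet[NGram] = frozenset(),
                    min_13gram_hits: int = 1,
                    min_7gram_overlap: float = 0.1) -> bool:
    """Flag a training sample that overlaps a benchmark item (hypothetical rule)."""
    train = train_text.lower().split()
    bench = benchmark_text.lower().split()
    # Long exact matches are strong evidence, unless they are ubiquitous phrases.
    hits_13 = (ngrams(train, 13) & ngrams(bench, 13)) - safelist
    if len(hits_13) >= min_13gram_hits:
        return True
    # Otherwise measure how much of the benchmark item is covered by 7-grams.
    bench_7 = ngrams(bench, 7)
    if not bench_7:
        return False
    return len(ngrams(train, 7) & bench_7) / len(bench_7) >= min_7gram_overlap
```

Exact n‑gram matching of this kind cannot catch paraphrased leakage, which is one reason fresh test sets remain useful alongside it.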

4. Key Insights and Innovations

  • Pivotal Token Search (PTS) for token‑level preference optimization
  • What’s new: preference pairs operate at the single‑token level at precisely the decision points that flip correctness (Section 4.3; Figures 3–4).
  • Why it matters: concentrates gradient where it counts; reduces noise from long completions; complements standard DPO. Empirically, PTS boosts reasoning tasks—e.g., GPQA rises from 47.3 (SFT) → 53.6 (Stage 1 PTS DPO) → 56.1 (final) and MATH from 77.1 → 80.5 → 80.4 (Table 9).
  • Contrast with prior “critical token” work (e.g., [LLX+24]): PTS estimates success probabilities directly from rollouts and works for both accepted and rejected tokens (Section 4.3, “Related Work”).
  • A synthetic‑first curriculum that still preserves knowledge
  • What’s new: 40% synthetic + 30% web/web‑rewrites + 20% code + 10% acquisitions (Table 5), after ablations show that extra synthetic epochs beat adding web tokens for reasoning (Figure 2) while some web/knowledge sources are necessary for factual tasks (Table 3, TQA gap).
  • Why it matters: balances reasoning (synthetic) and knowledge (clean web/books/code) while ensuring style alignment with inference contexts (Section 2.1–2.3).
  • Instruction‑reversal and self‑revision pipelines at scale
  • What’s new: reverse engineer instructions from code to create faithful instruction→solution pairs; multi‑agent self‑critique/refinement (Section 2.2; Appendix D).
  • Why it matters: improves alignment with user prompts, raises solution fidelity in code/science, and makes pretraining contexts closer to inference distribution (Section 2.1).
  • Long‑context midtraining that favors naturally long samples
  • What’s new: rather than padding short items, curate inherently long (>8K/16K) academic/books/code data and add long synthetic sequences (Section 3.3).
  • Why it matters: improves many‑shot ICL and long‑document tasks at 16K (Table 6), showing the importance of “natural” long inputs.

5. Experimental Analysis

  • Evaluation design
  • Academic benchmarks via the reproducible simple-evals framework at temperature 0.5 with fixed prompts/extraction (Table 1).
  • Contamination‑resistant tests:
    • Fresh AMC‑10/12 November 2024 exams (Figure 1; Appendix C details 100 runs per test, t=0.5; GPT‑4o used only to extract the final boxed option from long solutions).
    • GPQA Diamond (new, web‑proof) and an internal team‑written benchmark, PhiBench (Section 1.1; Section 5).
  • Long‑context evaluation: HELMET tasks including recall, RAG, re‑ranking, in‑context learning, QA, and summarization (Section 3.3; Table 6).

  • Headline results

  • Cross‑benchmark performance (Table 1):
    • GPQA: 56.1 (phi‑4) vs 49.1 (GPT‑4o), 50.6 (Qwen‑72B), 49.0 (Llama‑70B).
    • MATH: 80.4 (phi‑4) vs 74.6 (GPT‑4o), 80.0 (Qwen‑72B), 66.3 (Llama‑70B, extraction caveat).
    • MMLU: 84.8 (phi‑4) vs 88.1 (GPT‑4o), 86.3 (Llama‑70B), 85.3 (Qwen‑72B).
    • HumanEval+: 82.8 (phi‑4) vs 82.0 (GPT‑4o‑mini), 78.4 (Qwen‑72B), 77.9 (Llama‑70B).
    • Interpretation: phi‑4 is “small” (14B) yet competitive with or better than much larger open models, especially on reasoning (GPQA, MATH) and coding (HumanEval/HumanEval+).
  • Fresh AMC exams (Figure 1): average score (max 150) of 91.8 (phi‑4) vs 89.8 (Gemini Pro 1.5), 78.7 (GPT‑4o), 77.9 (GPT‑4o‑mini), 78.2 (Qwen‑72B), 66.4 (Llama‑3.3‑70B).
    • Significance: protects against contamination (Section 1.1; Appendix C).
  • Long‑context (HELMET, Table 6): at 16K, phi‑4 improves ICL from 68.0 → 77.0, QA from 26.7 → 36.0, and summarization from 38.3 → 40.5.
    • Trade‑offs: Re‑ranking drops (65.3 → 54.4) and RAG slightly decreases (58.1 → 57.1), suggesting task‑specific effects of longer context.
  • Post‑training ablations (Table 9), SFT → PTS DPO (stage 1) → judge‑guided DPO (stage 2): GPQA 47.3 → 53.6 → 56.1; MATH 77.1 → 80.5 → 80.4; ArenaHard 56.7 → 66.5 → 75.4.
    • Interpretation: PTS is especially helpful on reasoning; judge‑guided DPO shines on human‑preference‑style judgments (ArenaHard). Combined stages are complementary.
  • Safety/RAI (Table 10): “Jailbreak (DR1)” defect rate of 0.073 (phi‑4), lower than several 7‑8B baselines; “Grounding” 4.619/5.0; harmful content continuation DR3 = 0.036.

    • Plus dedicated red‑teaming by Microsoft AIRT and safety post‑training (Section 7).
  • Pretraining improvements relative to phi‑3 (Table 2)

    After pretraining, phi‑4 gains on MMLU (+3.0), MMLU‑pro (+10.3), HumanEval (+7.8), MBPP (+6.8), MATH (+8.9).

  • Supports the claim that the new data/curriculum materially improves core capabilities before any instruction tuning.

  • Ablations that justify the mixture (Tables 3–4; Figure 2)

  • “Synthetic only” improves reasoning/coding but hurts TriviaQA by −14.8 (Table 3).
  • “More synthetic epochs” beats “more fresh web tokens” on MMLU (Figure 2).
  • Mixture search shows uniform allocation is suboptimal; synthetic‑heavy variants edge out others but the final mixture balances knowledge benchmarks and later gains from post‑training (Table 4; Section 3.2).

  • Hallucination behavior (Figure 6; Appendix A.1)

    SimpleQA “Not Attempted” rises to 81.1% with incorrect reduced to 15.8% (final), reflecting safer behavior even though F1 drops.
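The two percentages above pin down the rest of the picture by arithmetic. The sketch below derives the correct share and the accuracy among attempted questions. The F‑score line assumes the harmonic mean of overall‑correct and correct‑given‑attempted; that definition is an assumption here, since the summary only says that F1 drops.

```python
from typing import Dict


def simpleqa_summary(not_attempted: float, incorrect: float) -> Dict[str, float]:
    """Derive remaining SimpleQA-style quantities from two reported shares."""
    correct = 1.0 - not_attempted - incorrect
    attempted = correct + incorrect
    correct_given_attempted = correct / attempted if attempted > 0 else 0.0
    denom = correct + correct_given_attempted
    # Assumed definition: harmonic mean of the two accuracy notions.
    f_score = 2 * correct * correct_given_attempted / denom if denom > 0 else 0.0
    return {
        "correct": correct,
        "attempted": attempted,
        "correct_given_attempted": correct_given_attempted,
        "f_score": f_score,
    }


FINAL = simpleqa_summary(not_attempted=0.811, incorrect=0.158)
```

With these inputs the correct share is about 3.1% and roughly one in six attempted answers is right, which shows the trade‑off: the model answers far fewer questions, so wrong guesses fall sharply even though a recall‑sensitive score declines.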

  • Do the experiments support the claims?

  • Yes for reasoning: multiple independent indicators (GPQA, MATH, AMC) show strong gains with contamination control (Section 1.1; Appendix C).
  • Mixed for instruction following: IFEval is relatively weak (63.0; Table 1), consistent with the paper’s own assessment (Section 8).
  • Long‑context is improved in some categories but not uniformly (Table 6), a useful nuance.

6. Limitations and Trade-offs

  • Reliance on synthetic data requires immaculate seeds and validation; small errors in seeds can propagate and degrade synthetic generations (Section 2.3).
  • Knowledge limitations remain: without tools/search, the model still hallucinates factual biographies/names and obscure facts (Section 8). The paper mitigates this by training refusals but does not eliminate the issue.
  • Instruction following is a weakness: strict format adherence (e.g., tables, bullet structure) is inferior to peers (IFEval 63.0; Table 1; Section 8).
  • Verbosity: chain‑of‑thought‑heavy training can yield overly long answers even for simple queries (Section 8).
  • Single‑turn optimization: the model is tuned primarily for single‑turn tasks; multi‑turn reliability may lag (Section 8).
  • Long‑context trade‑offs: 16K improves ICL/QA but can reduce re‑ranking performance (Table 6).
  • Compute/data cost: although inference is small‑model‑class, training uses ~10T pretraining tokens + 250B midtraining (Section 3), so the training budget is substantial.
  • Residual contamination risk: even with hybrid 7‑gram/13‑gram decontamination (Appendix B), paraphrase‑level leakage can’t be completely ruled out (Section 5).
  • LLM‑as‑judge dependence: judge‑guided DPO and some evaluations (ArenaHard, parts of HELMET scoring) can reflect judge biases (Section 5).

7. Implications and Future Directions

  • Field impact
  • Demonstrates that a data‑centric recipe—diverse synthetic generation, seed curation, and token‑level preference optimization—can push small models into “reasoning parity” with much larger systems (Table 1; Figure 1). This shifts attention from pure scale to curriculum and post‑training design.
  • PTS opens a new granularity for preference learning—token‑level at pivotal decisions—which could benefit many domains (code repair, mathematical proof search, tool‑use planners).
  • Follow‑up research enabled
  • End‑to‑end optimization of pretraining mixtures that account for post‑training effects (Section 3.2 notes this as promising).
  • Extending PTS: beyond binary accepted/rejected tokens to structured action spaces, multi‑token decisions, or reinforcement learning settings (Section 4.3).
  • Better contamination‑proof and process‑based benchmarks; more “fresh” evaluations like AMC (Section 1.1; Section 5).
  • Strengthening instruction following with targeted synthetic datasets (Section 6 notes this as a likely fix).
  • Tool‑augmented phi‑4 (search, calculators, code runners) to reduce factual hallucinations (Section 8 suggests this).
  • Practical applications
  • Cost‑sensitive reasoning: education (competition math tutors), scientific QA, coding assistants that run locally or at low latency (Table 1; Figure 1).
  • Long‑document tasks within 16K window: technical document QA and legal summarization (Table 6).
  • Safer deployment defaults: refusal‑to‑hallucinate behavior for knowledge‑sparse queries (Appendix A.1; Figure 6).

Overall, phi‑4 shows that carefully engineered data pipelines plus targeted token‑level preference learning can move the performance needle more efficiently than parameter count alone—especially for reasoning‑centric tasks—while flagging clear next steps for instruction following, long‑context trade‑offs, and tool integration.