Textbooks Are All You Need

ArXiv: 2306.11644

🎯 Pitch

This paper introduces phi-1, a compact 1.3B-parameter language model for code that achieves competitive or state-of-the-art performance on Python coding benchmarks using a fraction of the data and compute required by its much larger peers. By training on filtered, 'textbook-style' code and synthetic data instead of noisy web-scale corpora, the authors show that data curation, not just raw scale, can unlock strong reasoning abilities and efficiency in code generation models. The work points to a different paradigm: with the right educational data, smaller models can rival far larger ones while reducing both resource costs and environmental impact.


1. Executive Summary

This paper presents phi-1, a 1.3B-parameter code-focused language model that achieves competitive or state-of-the-art accuracy on standard Python benchmarks while using orders-of-magnitude less compute and data than prior systems. The key idea is to replace massive, noisy web corpora with “textbook-quality” data—filtered real code plus synthetic textbooks and exercises—which reshapes the usual scaling behavior and yields strong capabilities after a brief fine-tuning stage.

2. Context and Motivation

  • Problem addressed
  • Code LLMs typically scale performance by increasing model size and training tokens (“scaling laws”). But standard code corpora (e.g., open-source repositories) are noisy and poorly structured for learning algorithmic reasoning (Section 2: bullet list of drawbacks). This paper asks: can high-quality, curriculum-like data substitute for sheer scale in code generation?
  • Why this matters
  • Practical: high accuracy with drastically fewer parameters and training tokens lowers cost and environmental impact (Introduction).
  • Scientific: demonstrates that improving data quality can “change the shape of the scaling laws” (Introduction), showing an alternative axis—data curation—along which capability emerges.
  • Prior approaches and their shortcomings
  • Large models trained on huge code corpora: StarCoder (15.5B parameters, ~1T tokens), CodeGen (up to 16.1B, 577B tokens), PaLM-Coder, etc. (Table 1). These rely on scale and broad web data.
  • Web datasets (e.g., The Stack, StackOverflow) contain many non-instructive snippets: not self-contained, dominated by boilerplate, algorithmic logic buried in context, and topic imbalance (Section 2).
  • This paper’s positioning
  • Instead of scaling compute, it scales data quality. It constructs a “textbook” training set and a small “exercise” fine-tuning set, then shows competitive results with a 1.3B model trained in ~4 days on 8 A100 GPUs (Abstract; Section 2.3). It also proposes decontamination/pruning procedures to address training-test overlap concerns (Section 5).

3. Technical Approach

Step-by-step pipeline (Section 2).

  • Overall training plan
    • Pretrain phi-1-base on CodeTextbook: a mixture of (i) filtered Python code from The Stack + StackOverflow (≈6B tokens) and (ii) synthetic “textbooks” generated by GPT‑3.5 (<1B tokens).
    • Fine-tune on CodeExercises: a small, synthetic set of Python docstring-style exercises with solutions (~180M tokens).
    • The final model phi-1 is the fine-tuned phi-1-base. A smaller control model phi-1-small (350M parameters) is trained with the same pipeline for comparisons.

  • Data curation: how “textbook quality” is achieved

    1) Filtering web code (Section 2.1)
    • Annotate ≈100k code files from The Stack/StackOverflow for “educational value” using GPT‑4 (to avoid human labeling effort).
    • Train a random forest classifier to predict educational quality from pretrained CodeGen embeddings of each file.
    • Use this classifier to select a high-quality subset (≈6B tokens). Examples of “high” vs “low” educational value show concise, self-contained functions vs complex, context-dependent boilerplate (Figure: “Educational values deemed by the filter”).
    • Impact: On a 350M model, training on unfiltered data saturates at 12.19% HumanEval after ~200B tokens; filtered data yields 17.68% after 36k steps. Adding synthetic textbooks increases to 20.12% (Section 2.1).

    2) Synthetic “textbooks” (Section 2.2, “The synthetic textbook dataset”)
    • Generate <1B tokens of Python “textbook” prose interleaved with code using GPT‑3.5.
    • Enforce diversity by randomly constraining topics and target audiences in prompts (inspired by TinyStories diversity trick; Section 2.2).

    3) Synthetic “exercises” (Section 2.2, “The CodeExercises dataset”)
    • Create ~880k docstring-style problems and solutions (~180M tokens). Encourage diversity by constraining function names.
    • This dataset intentionally targets short, self-contained function completion as in HumanEval/MBPP; explicit decontamination checks are later applied (Sections 4–5).
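
The diversity constraints described for the synthetic textbooks and exercises can be sketched as randomized prompt construction. This is a minimal illustration only: the paper does not publish its prompt templates, so the topic, audience, verb, and noun pools and the helper names `textbook_prompt`, `exercise_prompt`, and `build_prompts` are all hypothetical.

```python
import random

# Hypothetical pools; the paper's actual topic and audience lists are not given.
TOPICS = ["string manipulation", "recursion", "sorting", "matrix operations", "file parsing"]
AUDIENCES = ["a first-year student", "a data analyst new to Python", "an experienced C programmer"]
VERBS = ["count", "merge", "find", "validate", "rotate"]
NOUNS = ["pairs", "intervals", "duplicates", "tokens", "rows"]


def textbook_prompt(rng):
    """Build a textbook-style prompt with a randomly constrained topic and audience."""
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    return (
        f"Write a short Python textbook section on {topic} for {audience}. "
        "Interleave explanatory prose with small, self-contained code examples."
    )


def exercise_prompt(rng):
    """Build an exercise prompt whose function name is randomly constrained."""
    name = f"{rng.choice(VERBS)}_{rng.choice(NOUNS)}"
    prompt = (
        f"Write a Python function named `{name}` with a docstring that states "
        "the task, followed by a correct solution."
    )
    return name, prompt


def build_prompts(n, seed=0):
    """Return n textbook prompts and n (function_name, exercise_prompt) pairs."""
    rng = random.Random(seed)  # seeded so the prompt set is reproducible
    textbooks = [textbook_prompt(rng) for _ in range(n)]
    exercises = [exercise_prompt(rng) for _ in range(n)]
    return textbooks, exercises


if __name__ == "__main__":
    textbooks, exercises = build_prompts(3)
    for prompt in textbooks:
        print(prompt)
    for name, prompt in exercises:
        print(name, "->", prompt)
```

Each prompt would then be sent to the generator model (GPT‑3.5 in the paper); that call is omitted here because its exact form is not described in this summary.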
  • Model architecture and training (Section 2.3)

  • Architecture (1.3B phi-1): decoder-only Transformer, 24 layers, hidden size 2048, MLP size 8192, 32 attention heads of size 64, rotary position embeddings, FlashAttention; tokenizer as in codegen-350M-mono. No fill-in-the-middle (FIM) training or multi-query attention (MQA).
  • Training setup: sequence length 2048; next-token prediction; AdamW; dropout 0.1; linear warmup/decay; fp16; 8×A100; deepspeed.
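  • Pretraining hyperparameters: effective batch 1024; peak LR 1e‑3; warmup 750 steps; weight decay 0.1; 36k steps in total. phi-1-base is the 24k-step checkpoint, which corresponds to ≈8 epochs over the ~7B-token CodeTextbook, or a little over 50B tokens seen (24k steps × 1024 sequences × 2048 tokens ≈ 50.3B).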
  • Fine-tuning hyperparameters: effective batch 256; peak LR 1e‑4; warmup 50 steps; weight decay 0.01; 6k steps; pick best checkpoint.
  • Compute: <4 days pretraining; ~7 hours fine-tuning on same hardware (Section 2.3).
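
The token and parameter figures above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, assuming every training sequence is fully packed to 2048 tokens and counting only the attention and MLP weight matrices (no biases, layer norms, or embeddings); the helper names are hypothetical.

```python
def tokens_seen(steps, batch_size, seq_len):
    """Upper bound on tokens processed, assuming fully packed sequences."""
    return steps * batch_size * seq_len


def block_params(n_layers, d_model, d_mlp):
    """Weights in the attention (Q, K, V, output) and MLP (up, down) matrices only."""
    attn = 4 * d_model * d_model
    mlp = 2 * d_model * d_mlp
    return n_layers * (attn + mlp)


SEQ_LEN = 2048
BATCH = 1024

base_tokens = tokens_seen(24_000, BATCH, SEQ_LEN)  # phi-1-base checkpoint
full_tokens = tokens_seen(36_000, BATCH, SEQ_LEN)  # end of the pretraining run
core_params = block_params(n_layers=24, d_model=2048, d_mlp=8192)

# 32 heads of size 64 should tile the hidden dimension exactly.
assert 32 * 64 == 2048

if __name__ == "__main__":
    print(f"tokens at 24k steps: {base_tokens / 1e9:.1f}B")
    print(f"tokens at 36k steps: {full_tokens / 1e9:.1f}B")
    print(f"attention + MLP weights: {core_params / 1e9:.2f}B")
```

Under these assumptions the 24k-step checkpoint has seen about 50.3B tokens and the full 36k-step run about 75.5B. The attention and MLP matrices account for about 1.21B weights, so the remainder of the reported 1.3B presumably comes from embeddings and smaller terms.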

  • Evaluation-specific definitions (used throughout the paper)

  • HumanEval: a Python function-completion benchmark of 164 hand-written problems, each checked by unit tests (commonly used for code LLMs).
  • MBPP (“Mostly Basic Python Problems”): another Python function benchmark with unit tests.
  • pass@1: percentage of tasks for which the model’s first (single) generated sample passes the unit tests.
  • Decontamination measures (Section 5):
    • n-gram overlap: text overlap using sequences of n tokens; here 13-gram used for a strict check.
    • embedding distance: L2 distance between CodeGen-Mono 350M embeddings of two code snippets.
    • AST match rate: edit-distance-based similarity between code abstract syntax trees (syntax-structure comparison).
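
To make the pass@1 definition concrete, the sketch below uses the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021), which reduces to the plain success fraction at k=1. Whether phi-1 was scored with one sample per task or with this estimator over several samples is not stated in this summary, so treat the sampling setup as an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples drawn, c of them pass the tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(first_sample_passed):
    """Benchmark-level pass@1 in percent, given one pass/fail flag per task."""
    return 100.0 * sum(first_sample_passed) / len(first_sample_passed)


if __name__ == "__main__":
    # With k=1 the estimator is just the fraction of passing samples.
    print(pass_at_k(n=10, c=3, k=1))
    # Four tasks, three solved on the first attempt.
    print(benchmark_pass_at_1([True, False, True, True]))
```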

4. Key Insights and Innovations

  • High-quality, curriculum-like data can outperform raw scale
  • Innovation: A deliberate “textbook + exercises” curriculum rather than indiscriminate web scraping (Sections 2.1–2.2).
  • Evidence: phi-1 (1.3B params, ~7B tokens) achieves 50.6% pass@1 on HumanEval and 55.5% on MBPP (Abstract; Table 1)—surpassing many larger models trained on hundreds of billions to trillions of tokens. This challenges the compute- or parameter-centric view of scaling laws.
  • Minimal, targeted fine-tuning reorganizes pretraining knowledge
  • Innovation: A small exercise dataset (~180M tokens) unlocks a large capability jump on tasks beyond the fine-tuning distribution (Section 3).
  • Evidence: phi-1 shows improved instruction following and better library usage (PyGame, Tkinter, PyTorch, matplotlib) despite those libraries not appearing in the fine-tuning set (examples in Section 3 and Appendix A; import distribution in Figure 3.1).
  • A strong-form decontamination protocol based on pruning, not only overlap metrics
  • Innovation: Instead of just reporting overlaps, they retrain after removing any CodeExercises items similar to HumanEval under strict embedding/AST thresholds (Section 5).
  • Significance: After pruning >40% of exercises (τ down to 0.8), retrained phi-1 still outperforms StarCoder-Prompted on HumanEval (Table 3), supporting that gains are not from memorization.
  • LLM-graded evaluation on novel “unconventional” tasks
  • Innovation: Construct 50 new problems designed to be outside training distribution and grade using GPT‑4 with rubric-like prompts (Section 4).
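  • Result: The model ranking matches the one on HumanEval. phi-1 scores 52% vs StarCoder’s 51% on the new tasks, a much narrower gap than on HumanEval (≈51% vs 34%; Table 2). Since these tasks were written independently of the training data, the agreement suggests that phi-1’s HumanEval score is not just an artifact of benchmark-like training items.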

5. Experimental Analysis

  • Evaluation setup
  • Benchmarks and metrics: HumanEval and MBPP measured by pass@1 (Table 1). Additional 50 “unconventional” problems graded 0–10 by GPT‑4 (“Understanding” score; Table 2).
  • Baselines: CodeGen, Replit, StarCoder, PaLM 2-S, WizardCoder, GPT‑3.5/4 (Table 1). Internal ablations across data settings and parameter counts (Figure 2.1).
  • Main quantitative results
  • Core leaderboard (Table 1):
    • phi-1 (1.3B, 7B tokens): HumanEval 50.6%, MBPP 55.5%.
    • StarCoder-Prompted (15.5B, 1T tokens): HumanEval 40.8%, MBPP 49.5%.
    • WizardCoder (16B, 1T tokens): HumanEval 57.3%, MBPP 51.8% (better on HumanEval, worse on MBPP).
    • GPT‑4: HumanEval 67% (no MBPP reported in Table 1).
  • Training data quality ablations (Figure 2.1 and Section 2.1):
    • 350M model, unfiltered Stack+SO: saturates at 12.19% (≈200B tokens).
    • 350M model, classifier-filtered subset: 17.68% (36k steps).
    • 350M model, filtered + synthetic textbooks: 20.12%.
    • 1.3B model with CodeTextbook: phi-1-base achieves 29% without any exercises fine-tuning (Figure 2.1).
    • Fine-tuning on CodeExercises boosts to 51% (Figure 2.1 caption).
  • LLM-graded unconventional tasks (Table 2):
    • phi-1: 52% Understanding score; StarCoder: 51%; phi-1-base: 37%; phi-1-small: 45%. The HumanEval column in the same table recapitulates their HumanEval standings (e.g., phi-1 ≈51%, StarCoder 34%).
  • Evidence quality and robustness
  • Decontamination:
    • n-gram overlap: only 4 HumanEval items show 13-gram matches with any exercise item, and manual review marks them as false positives (Section 5.1).
    • Pruning and retraining: Using embedding distance plus an AST match-rate threshold τ∈{0.95, 0.9, 0.85, 0.8}, they remove 42.5k–354k items from ~880k exercises (Section 5.2). Even with τ=0.9–0.8, the retrained model’s total HumanEval accuracy remains 45.1–46.3%, above StarCoder-Prompted’s 41.5% (Table 3; Table 1 lists 40.8% for the same model).
    • Breakdown: Across “similar” vs “non-similar” subsets of HumanEval, performance is lower on non-similar items for all models (Table 3), a reasonable pattern consistent with distribution shift.
  • Capability “spikes” after fine-tuning (Section 3):
    • Instruction following improves (Section 3.1 example).
    • External library usage improves (PyGame/Tkinter examples in Section 3.2), even though import counts in CodeExercises are dominated by basic libraries (Figure 3.1).
    • Chat-style helpfulness improves (Section 3 “Chat mode example”), though API usage is not always correct.
  • Do the experiments support the claims?
  • The head-to-head with large baselines (Table 1) and the ablation trajectory (Figure 2.1) jointly support the central claim: curated “textbook” data + small exercise fine-tuning can rival or surpass larger systems on HumanEval/MBPP.
  • The pruning-based decontamination and LLM-graded new tasks add credibility that the improvements are not merely from memorizing benchmark-like items (Sections 4–5).
  • Caveat: LLM grading introduces subjectivity; however, the authors mitigate leakage by having a separate team author the 50 tasks and by using the same grader across models (Section 4).
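
The decontamination measures used in this analysis can be sketched with the standard library. This is an illustration under several assumptions, not the paper's implementation: whitespace tokenization stands in for the real tokenizer, `difflib.SequenceMatcher` over AST node-type sequences stands in for the paper's edit-distance-based AST match rate, `embed` is a caller-supplied stand-in for CodeGen-Mono 350M embeddings, and both the `max_dist` value and the choice to require both criteria at once are hypothetical.

```python
import ast
import difflib
import math


def ngrams(tokens, n):
    """Set of all contiguous n-token windows (empty if the text is too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(a, b, n=13):
    """True if two texts share at least one n-gram (whitespace tokens assumed)."""
    return bool(ngrams(a.split(), n) & ngrams(b.split(), n))


def ast_node_types(source):
    """Sequence of AST node type names; variable and function names are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_match_rate(a, b):
    """Similarity in [0, 1] between node-type sequences (stand-in for AST edit distance)."""
    return difflib.SequenceMatcher(None, ast_node_types(a), ast_node_types(b)).ratio()


def is_similar(a, b, embed, tau, max_dist):
    """Assumed rule: close in embedding space AND AST match rate at least tau."""
    close = math.dist(embed(a), embed(b)) <= max_dist
    return close and ast_match_rate(a, b) >= tau


def prune(exercises, benchmark, embed, tau=0.95, max_dist=0.2):
    """Keep only exercises that are not similar to any benchmark solution."""
    return [
        ex for ex in exercises
        if not any(is_similar(ex, ref, embed, tau, max_dist) for ref in benchmark)
    ]


if __name__ == "__main__":
    reference = "def f(x):\n    return x + 1\n"
    renamed = "def g(y):\n    return y + 1\n"
    other = "for i in range(3):\n    print(i)\n"

    def toy_embed(src):
        # Toy 2-d embedding for the demo only.
        return [len(src), src.count("return")]

    print(ast_match_rate(reference, renamed))
    print(prune([renamed, other], [reference], toy_embed))
```

In this toy run the renamed copy of the reference is pruned because its AST node types are identical, while the unrelated loop is kept. Lowering `tau` removes progressively looser matches, which mirrors how the pruned set grows as τ moves from 0.95 to 0.8.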

6. Limitations and Trade-offs

  • Scope and specialization
  • Python-only training and evaluation; limited knowledge of other languages and of domain-specific APIs (Conclusion; Appendix B).
  • Prompt and language robustness
  • Sensitivity to longer or stylistically varied prompts; performance can drop with grammatical errors or slight wording changes (Appendix B: “Sensitivity to prompt variations” and “Sensitivity to natural language inputs”). Examples include a failure when a prompt that asks for a 3-layer PyTorch network is changed to ask for 4 layers, and a misinterpretation of the word “unchanged.”
  • Reasoning limitations
  • Weaknesses in counting and spatial reasoning (Appendix B, Tkinter layout example).
  • Data and evaluation assumptions
  • Synthetic data is generated by GPT‑3.5, which has a “high error rate” (Conclusion). The model learns well despite these errors, but the correctness of the synthetic content remains an unverified assumption and a potential bottleneck.
  • LLM-graded evaluation (Section 4) trades objective unit tests for richer rubric-based assessment; this can introduce grader bias, despite controls.
  • Compute and scalability
  • Although compute is much lower than typical SOTA models, the approach still requires nontrivial infrastructure (8×A100 for days). Scaling to broader domains may require larger models or more diverse synthetic curricula (Conclusion).

7. Implications and Future Directions

  • Field-level impact
  • Demonstrates that carefully engineered training data can rival brute-force scaling. This encourages a shift toward dataset design—clarity, self-containment, balance, and diversity—as a first-class lever in LLM training (Introduction; Conclusion).
  • Research directions
  • Improved synthetic data generation: using stronger teachers (e.g., GPT‑4) to reduce error rate (Conclusion).
  • Curriculum design and diversity metrics: methods to quantify and optimize coverage and non-redundancy in synthetic corpora (Conclusion: calls out lack of good methodology to measure diversity/redundancy).
  • Broader skills with small models: extend the “textbook + exercises” recipe to domains beyond Python (multi-language code; math; structured reasoning). Investigate how fine-tuning reorganizes pretrained knowledge (Section 3).
  • Robustness and reasoning: target counting/spatial reasoning and prompt robustness with specialized exercises and adversarial prompt augmentation (Appendix B limitations).
  • Data ethics and recursive training: understand societal implications as LLMs curate data for future LLMs, including bias and accountability (Conclusion).
  • Practical applications
  • Cost-effective coding assistants for education and lightweight IDE integrations where GPU/CPU budgets are constrained.
  • On-device or edge-deployable code helpers given the small parameter count.
  • Educational tools: the pipeline itself (textbook + exercises) mirrors how humans learn; it can produce pedagogical datasets in other domains.

Core quantitative takeaway (Table 1): “phi-1 (1.3B params, ~7B tokens) attains HumanEval 50.6% and MBPP 55.5%, outperforming many larger open models trained on roughly 100× more tokens (e.g., 577B–1T vs ~7B).”

Mechanistic takeaway (Figure 2.1; Sections 2–3): “Filtering for educational value + synthetic textbooks shape pretraining; a small CodeExercises fine-tune reorganizes this knowledge, producing a large capability jump—including on tasks not present in the fine-tuning set.”