Textbooks Are All You Need
ArXiv: 2306.11644
🎯 Pitch
This paper introduces phi-1, a compact 1.3B-parameter language model for code that reaches accuracy competitive with much larger models on Python coding benchmarks while using a fraction of their data and compute. By training exclusively on high-quality, 'textbook-style' and synthetic data instead of noisy web-scale corpora, the authors show that data curation, not just raw scale, can unlock strong reasoning abilities and efficiency in code generation models. The work demonstrates a compelling alternative paradigm: with the right educational data, smaller models can rival far larger ones, substantially reducing both resource costs and environmental impact.
1. Executive Summary
This paper presents phi-1, a 1.3B-parameter code-focused language model that achieves competitive or state-of-the-art accuracy on standard Python benchmarks while using orders-of-magnitude less compute and data than prior systems. The key idea is to replace massive, noisy web corpora with “textbook-quality” data—filtered real code plus synthetic textbooks and exercises—which reshapes the usual scaling behavior and yields strong capabilities after a brief fine-tuning stage.
2. Context and Motivation
- Problem addressed
- Code LLMs typically improve by increasing model size and training tokens (“scaling laws”). But standard code corpora (e.g., open-source repositories) are noisy and poorly structured for learning algorithmic reasoning (Section 2: bullet list of drawbacks). This paper asks: can high-quality, curriculum-like data substitute for sheer scale in code generation?
- Why this matters
- Practical: high accuracy with drastically fewer parameters and training tokens lowers cost and environmental impact (Introduction).
- Scientific: demonstrates that improving data quality can “change the shape of the scaling laws” (Introduction), showing an alternative axis—data curation—along which capability emerges.
- Prior approaches and their shortcomings
- Large models trained on huge code corpora: StarCoder (15.5B parameters, ~1T tokens), CodeGen (up to 16.1B, 577B tokens), PaLM-Coder, etc. (Table 1). These rely on scale and broad web data.
- Web datasets (e.g., The Stack, StackOverflow) contain many non-instructive snippets: not self-contained, dominated by boilerplate, algorithmic logic buried in context, and topic imbalance (Section 2).
- This paper’s positioning
- Instead of scaling compute, it scales data quality. It constructs a “textbook” training set and a small “exercise” fine-tuning set, then shows competitive results with a 1.3B model trained in ~4 days on 8 A100 GPUs (Abstract; Section 2.3). It also proposes decontamination/pruning procedures to address training-test overlap concerns (Section 5).
3. Technical Approach
Step-by-step pipeline (Section 2).
- Overall training plan
- Pretrain phi-1-base on CodeTextbook: a mixture of (i) filtered Python code from The Stack + StackOverflow (≈6B tokens) and (ii) synthetic “textbooks” generated by GPT‑3.5 (<1B tokens).
- Fine-tune on CodeExercises: a small, synthetic set of Python docstring-style exercises with solutions (~180M tokens).
- The final model phi-1 is the fine-tuned phi-1-base. A smaller control model phi-1-small (350M parameters) is trained with the same pipeline for comparisons.
- Data curation: how “textbook quality” is achieved
1) Filtering web code (Section 2.1)
- Annotate ≈100k code files from The Stack/StackOverflow for “educational value” using GPT‑4 (to avoid human labeling effort).
- Train a random forest classifier to predict educational quality from pretrained CodeGen embeddings of each file.
- Use this classifier to select a high-quality subset (≈6B tokens). Examples of “high” vs “low” educational value show concise, self-contained functions vs complex, context-dependent boilerplate (Figure: “Educational values deemed by the filter”).
- Impact: On a 350M model, training on unfiltered data saturates at 12.19% HumanEval after ~200B tokens; filtered data yields 17.68% after 36k steps. Adding synthetic textbooks increases this to 20.12% (Section 2.1).
2) Synthetic “textbooks” (Section 2.2, “The synthetic textbook dataset”)
- Generate <1B tokens of Python “textbook” prose interleaved with code using GPT‑3.5.
- Enforce diversity by randomly constraining topics and target audiences in prompts (inspired by the TinyStories diversity trick; Section 2.2).
3) Synthetic “exercises” (Section 2.2, “The CodeExercises dataset”)
- Create ~880k docstring-style problems and solutions (~180M tokens). Encourage diversity by constraining function names.
- This dataset intentionally targets short, self-contained function completion as in HumanEval/MBPP; explicit decontamination checks are later applied (Sections 4–5).
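The diversity trick described above (randomly constraining topic and audience in each generation prompt) can be sketched as follows. This is only an illustration under stated assumptions: the topic and audience pools, the prompt wording, and the function name `build_textbook_prompts` are all hypothetical, since the summary does not reproduce the authors' actual prompts.

```python
import random

# Hypothetical pools; the authors' actual topic and audience lists are not given here.
TOPICS = ["recursion", "string parsing", "sorting", "dictionaries", "matrix operations"]
AUDIENCES = ["a first-year student", "a data analyst", "an engineer new to Python"]


def build_textbook_prompts(n, seed=0):
    """Return n prompts, each constrained to a randomly drawn (topic, audience) pair."""
    rng = random.Random(seed)  # seeded so the prompt set is reproducible
    prompts = []
    for _ in range(n):
        topic = rng.choice(TOPICS)
        audience = rng.choice(AUDIENCES)
        prompts.append(
            f"Write a short Python textbook section on {topic} for {audience}. "
            "Interleave explanatory prose with small, self-contained code examples."
        )
    return prompts


prompts = build_textbook_prompts(1000)
print(len(set(prompts)), "distinct prompts out of", len(prompts))
```

With pools this small the number of distinct prompts is capped at the number of (topic, audience) pairs, which is why larger and more varied constraint pools matter for diversity.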
- Model architecture and training (Section 2.3)
- Architecture (1.3B phi-1): decoder-only Transformer, 24 layers, hidden size 2048, MLP size 8192, 32 attention heads of size 64, rotary position embeddings, FlashAttention; same tokenizer as codegen-350M-mono. No FIM (Fill-In-the-Middle) or MQA (Multi-Query Attention).
- Training setup: sequence length 2048; next-token prediction; AdamW; dropout 0.1; linear warmup/decay; fp16; 8×A100; deepspeed.
- Pretraining hyperparameters: effective batch 1024; peak LR 1e‑3; warmup 750 steps; weight decay 0.1; total 36k steps. phi-1-base is the 24k-step checkpoint, which corresponds to ≈8 epochs over CodeTextbook and a little over 50B tokens seen (24,000 × 1024 × 2048 ≈ 50.3B).
- Fine-tuning hyperparameters: effective batch 256; peak LR 1e‑4; warmup 50 steps; weight decay 0.01; 6k steps; pick best checkpoint.
- Compute: <4 days pretraining; ~7 hours fine-tuning on same hardware (Section 2.3).
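The step counts, batch sizes, and sequence length above imply a token budget that can be checked with simple arithmetic. The sketch below assumes every batch element is a fully packed 2048-token sequence; the dataset sizes are the approximate figures quoted in this summary, and the constant and function names are illustrative.

```python
SEQ_LEN = 2048                 # tokens per training sequence
PRETRAIN_BATCH = 1024          # effective batch size during pretraining
FINETUNE_BATCH = 256           # effective batch size during fine-tuning
CODETEXTBOOK_TOKENS = 7e9      # approximate: ~6B filtered + <1B synthetic
CODEEXERCISES_TOKENS = 180e6   # approximate size of the exercise set


def tokens_seen(steps, batch, seq_len=SEQ_LEN):
    """Total tokens processed, assuming fully packed sequences."""
    return steps * batch * seq_len


base_ckpt = tokens_seen(24_000, PRETRAIN_BATCH)   # phi-1-base checkpoint
full_run = tokens_seen(36_000, PRETRAIN_BATCH)    # end of pretraining
finetune = tokens_seen(6_000, FINETUNE_BATCH)     # full fine-tuning run

print(f"24k-step checkpoint: {base_ckpt / 1e9:.1f}B tokens")
print(f"36k-step run:        {full_run / 1e9:.1f}B tokens")
print(f"fine-tuning:         {finetune / 1e9:.1f}B tokens")
print(f"approx. passes over CodeTextbook at 24k steps: {base_ckpt / CODETEXTBOOK_TOKENS:.1f}")
print(f"approx. passes over CodeExercises in 6k steps: {finetune / CODEEXERCISES_TOKENS:.1f}")
```

Under these assumptions the 24k-step checkpoint has seen about 50.3B tokens and the full 36k-step run about 75.5B. The number of passes over each dataset depends on the exact dataset size, so the last two figures are rough estimates only.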
- Evaluation-specific definitions (used throughout the paper)
- HumanEval: a Python function-completion benchmark with hidden unit tests (commonly used for code LLMs).
- MBPP (“Mostly Basic Python Problems”): another Python function benchmark with unit tests.
- pass@1: percentage of tasks solved by the model’s first generated attempt (a single sample per task).
- Decontamination measures (Section 5):
- n-gram overlap: text overlap using sequences of n tokens; here 13-grams are used for a strict check.
- embedding distance: L2 distance between CodeGen-Mono 350M embeddings of two code snippets.
- AST match rate: edit-distance-based similarity between code abstract syntax trees (syntax-structure comparison).
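To make the pass@1 definition concrete, here is a minimal sketch of the unbiased pass@k estimator that was introduced alongside HumanEval; with one sample per task and k=1 it reduces to the plain fraction of tasks solved. How many samples the phi-1 authors drew per task is not stated in this summary, so the inputs below are made-up illustrations and the function names are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples were drawn and c of them passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """results: list of (n, c) pairs, one per task. Returns the mean as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Illustrative data: four tasks, one sample each, two of them solved.
single_sample = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(single_sample, k=1))
```

Drawing more samples per task (n > 1) leaves the expected value of pass@1 unchanged but reduces its variance, which is why the estimator takes both n and c.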
4. Key Insights and Innovations
- High-quality, curriculum-like data can outperform raw scale
- Innovation: A deliberate “textbook + exercises” curriculum rather than indiscriminate web scraping (Sections 2.1–2.2).
- Evidence: phi-1 (1.3B params, ~7B tokens) achieves 50.6% pass@1 on HumanEval and 55.5% on MBPP (Abstract; Table 1), surpassing many larger models trained on hundreds of billions to trillions of tokens. This challenges the compute- or parameter-centric view of scaling laws.
- Minimal, targeted fine-tuning reorganizes pretraining knowledge
- Innovation: A small exercise dataset (~180M tokens) unlocks a large capability jump on tasks beyond the fine-tuning distribution (Section 3).
- Evidence: phi-1 shows improved instruction following and better library usage (PyGame, Tkinter, PyTorch, matplotlib) despite those libraries not appearing in the fine-tuning set (examples in Section 3 and Appendix A; import distribution in Figure 3.1).
- A strong-form decontamination protocol based on pruning, not only overlap metrics
- Innovation: Instead of just reporting overlaps, they retrain after removing any CodeExercises items similar to HumanEval under strict embedding/AST thresholds (Section 5).
- Significance: After pruning >40% of exercises (τ down to 0.8), retrained phi-1 still outperforms StarCoder-Prompted on HumanEval (Table 3), supporting that gains are not from memorization.
- LLM-graded evaluation on novel “unconventional” tasks
- Innovation: Construct 50 new problems designed to be outside the training distribution and grade them using GPT‑4 with rubric-like prompts (Section 4).
- Result: Ranking matches HumanEval; phi-1 scores 52% vs StarCoder 51%, while phi-1 far exceeds StarCoder on standard HumanEval (Table 2). This gives a second, independent signal that the model’s gains reflect real capability rather than benchmark overlap.
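The pruning idea in the third insight above can be illustrated with a small stand-in. The sketch below is not the authors' implementation: it scores structural similarity with `difflib.SequenceMatcher` over AST node-type sequences rather than the paper's edit-distance-based AST match rate, and it omits the embedding-distance condition entirely. The function names and the example snippets are hypothetical.

```python
import ast
import difflib


def ast_signature(source):
    """Sequence of AST node type names; identifiers and literal values are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_match_rate(a, b):
    """Similarity in [0, 1] between the AST signatures of two code snippets."""
    return difflib.SequenceMatcher(None, ast_signature(a), ast_signature(b)).ratio()


def prune(exercises, benchmark, tau):
    """Keep only exercises whose match rate against every benchmark item is below tau."""
    return [
        ex for ex in exercises
        if all(ast_match_rate(ex, ref) < tau for ref in benchmark)
    ]


benchmark_item = "def add(x, y):\n    return x + y\n"
renamed_copy = "def plus(a, b):\n    return a + b\n"
different = (
    "def f(xs):\n    total = 0\n    for x in xs:\n"
    "        if x > 0:\n            total += x\n    return total\n"
)
kept = prune([renamed_copy, different], [benchmark_item], tau=0.95)
print(len(kept), "exercise(s) kept after pruning")
```

Because the signature ignores names, a renamed copy of a benchmark solution is flagged as similar and removed, while structurally different code survives. Lowering `tau` prunes more aggressively, mirroring the τ sweep reported in the paper.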
5. Experimental Analysis
- Evaluation setup
- Benchmarks and metrics: HumanEval and MBPP measured by pass@1 (Table 1). Additional 50 “unconventional” problems graded 0–10 by GPT‑4 (“Understanding” score; Table 2).
- Baselines: CodeGen, Replit, StarCoder, PaLM 2-S, WizardCoder, GPT‑3.5/4 (Table 1). Internal ablations across data settings and parameter counts (Figure 2.1).
- Main quantitative results
- Core leaderboard (Table 1):
- phi-1 (1.3B, 7B tokens): HumanEval 50.6%, MBPP 55.5%.
- StarCoder-Prompted (15.5B, 1T tokens): HumanEval 40.8%, MBPP 49.5%.
- WizardCoder (16B, 1T tokens): HumanEval 57.3%, MBPP 51.8% (better on HumanEval, worse on MBPP).
- GPT‑4: HumanEval 67% (no MBPP reported in Table 1).
- Training data quality ablations (Figure 2.1 and Section 2.1):
- 350M model, unfiltered Stack+SO: saturates at 12.19% (≈200B tokens).
- 350M model, classifier-filtered subset: 17.68% (36k steps).
- 350M model, filtered + synthetic textbooks: 20.12%.
- 1.3B model with CodeTextbook: phi-1-base achieves 29% without any exercises fine-tuning (Figure 2.1).
- Fine-tuning on CodeExercises boosts this to 51% (Figure 2.1 caption).
- LLM-graded unconventional tasks (Table 2):
- phi-1: 52% Understanding score; StarCoder: 51%; phi-1-base: 37%; phi-1-small: 45%. The HumanEval column in the same table recapitulates their HumanEval standings (e.g., phi-1 ≈51%, StarCoder 34%).
- Evidence quality and robustness
- Decontamination:
- n-gram overlap: only 4 HumanEval items show 13-gram matches with any exercise item, and manual review marks them as false positives (Section 5.1).
- Pruning and retraining: Using embedding distance + AST match rate τ∈{0.95, 0.9, 0.85, 0.8}, they remove 42.5k–354k items from ~880k exercises (Section 5.2). Even with τ=0.9–0.8, the retrained model’s total HumanEval accuracy remains 45.1–46.3%, above StarCoder-Prompted at 41.5% (Table 3).
- Breakdown: Across “similar” vs “non-similar” subsets of HumanEval, performance is lower on non-similar items for all models (Table 3), a reasonable pattern consistent with distribution shift.
- Capability “spikes” after fine-tuning (Section 3):
- Instruction following improves (Section 3.1 example).
- External library usage improves (PyGame/Tkinter examples in Section 3.2), even though import counts in CodeExercises are dominated by basic libraries (Figure 3.1).
- Chat-style helpfulness improves (Section 3 “Chat mode example”), though API correctness is not perfect.
- Do the experiments support the claims?
- The head-to-head with large baselines (Table 1) and the ablation trajectory (Figure 2.1) jointly support the central claim: curated “textbook” data + small exercise fine-tuning can rival or surpass larger systems on HumanEval/MBPP.
- The pruning-based decontamination and LLM-graded new tasks add credibility that the improvements are not merely from memorizing benchmark-like items (Sections 4–5).
- Caveat: LLM grading introduces subjectivity; however, the authors mitigate leakage by having a separate team author the 50 tasks and by using the same grader across models (Section 4).
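The 13-gram overlap check mentioned under decontamination can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization (the summary does not say which tokenizer the authors used for this check); the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(text_a, text_b, n=13):
    """True if the two texts share at least one run of n whitespace-separated tokens."""
    return bool(ngrams(text_a.split(), n) & ngrams(text_b.split(), n))


def flag_contaminated(benchmark_items, training_items, n=13):
    """Indices of benchmark items that share an n-gram with any training item."""
    return [
        i for i, item in enumerate(benchmark_items)
        if any(has_ngram_overlap(item, train, n) for train in training_items)
    ]
```

A check like this is strict but shallow: it catches verbatim copies, yet misses paraphrased or renamed solutions and can flag shared boilerplate, which is consistent with the paper pairing it with manual review and with the embedding/AST pruning experiment.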
6. Limitations and Trade-offs
- Scope and specialization
- Python-only training and evaluation; limited knowledge of other languages and of domain-specific APIs (Conclusion; Appendix B).
- Prompt and language robustness
- Sensitivity to longer or stylistically varied prompts; performance can drop with grammatical errors or slight wording changes (Appendix B: “Sensitivity to prompt variations” and “Sensitivity to natural language inputs”). Examples show failures when extending a 3-layer to 4-layer PyTorch network or misinterpreting “unchanged.”
- Reasoning limitations
- Weaknesses in counting and spatial reasoning (Appendix B, Tkinter layout example).
- Data and evaluation assumptions
- Synthetic data is generated by GPT‑3.5, which has a “high error rate” (Conclusion). The model learns effectively despite these errors; however, the correctness of synthetic content is assumed rather than verified and remains a potential bottleneck.
- LLM-graded evaluation (Section 4) trades objective unit tests for richer rubric-based assessment; this can introduce grader bias, despite controls.
- Compute and scalability
- Although compute is much lower than typical SOTA models, the approach still requires nontrivial infrastructure (8×A100 for days). Scaling to broader domains may require larger models or more diverse synthetic curricula (Conclusion).
7. Implications and Future Directions
- Field-level impact
- Demonstrates that carefully engineered training data can rival brute-force scaling. This encourages a shift toward dataset design—clarity, self-containment, balance, and diversity—as a first-class lever in LLM training (Introduction; Conclusion).
- Research directions
- Improved synthetic data generation: using stronger teachers (e.g., GPT‑4) to reduce error rate (Conclusion).
- Curriculum design and diversity metrics: methods to quantify and optimize coverage and non-redundancy in synthetic corpora (Conclusion: calls out lack of good methodology to measure diversity/redundancy).
- Broader skills with small models: extend the “textbook + exercises” recipe to domains beyond Python (multi-language code; math; structured reasoning). Investigate how fine-tuning reorganizes pretrained knowledge (Section 3).
- Robustness and reasoning: target counting/spatial reasoning and prompt robustness with specialized exercises and adversarial prompt augmentation (Appendix B limitations).
- Data ethics and recursive training: understand societal implications as LLMs curate data for future LLMs, including bias and accountability (Conclusion).
- Practical applications
- Cost-effective coding assistants for education and lightweight IDE integrations where GPU/CPU budgets are constrained.
- On-device or edge-deployable code helpers given the small parameter count.
- Educational tools: the pipeline itself (textbook + exercises) mirrors how humans learn; it can produce pedagogical datasets in other domains.
Core quantitative takeaway (Table 1): “phi-1 (1.3B params, ~7B tokens) attains HumanEval 50.6% and MBPP 55.5%, outperforming many larger open models trained on roughly 100× more tokens (e.g., 577B to 1T vs ~7B).”
Mechanistic takeaway (Figure 2.1; Sections 2–3): “Filtering for educational value + synthetic textbooks shape pretraining; a small CodeExercises fine-tune reorganizes this knowledge, producing a large capability jump, including on tasks not present in the fine-tuning set.”