Textbooks Are All You Need

🎯 Pitch

This paper introduces phi-1, a compact 1.3B-parameter language model for code that reaches near state-of-the-art accuracy on Python coding benchmarks using a fraction of the data and compute required by its much larger peers. By training on filtered, 'textbook-style' code and synthetic data instead of noisy web-scale corpora, the authors show that data curation, not just raw scale, can unlock strong reasoning abilities and efficiency in code generation models. The work points to a compelling alternative paradigm: with the right educational data, smaller models can rival far larger ones, reducing both resource costs and environmental impact.

1. Executive Summary

This paper presents phi-1, a 1.3B-parameter code-focused language model that achieves competitive or state-of-the-art accuracy on standard Python benchmarks while using orders-of-magnitude less compute and data than prior systems. The key idea is to replace massive, noisy web corpora with “textbook-quality” data—filtered real code plus synthetic textbooks and exercises—which reshapes the usual scaling behavior and yields strong capabilities after a brief fine-tuning stage.

2. Context and Motivation

Problem addressed
Code LLMs typically improve performance by increasing model size and training tokens (“scaling laws”). But standard code corpora (e.g., open-source repositories) are noisy and poorly structured for learning algorithmic reasoning (Section 2: bullet list of drawbacks). This paper asks: can high-quality, curriculum-like data substitute for sheer scale in code generation?
Why this matters
Practical: high accuracy with drastically fewer parameters and training tokens lowers cost and environmental impact (Introduction).
Scientific: demonstrates that improving data quality can “change the shape of the scaling laws” (Introduction), showing an alternative axis—data curation—along which capability emerges.
Prior approaches and their shortcomings
Large models trained on huge code corpora: StarCoder (15.5B parameters, ~1T tokens), CodeGen (up to 16.1B, 577B tokens), PaLM-Coder, etc. (Table 1). These rely on scale and broad web data.
Web datasets (e.g., The Stack, StackOverflow) contain many non-instructive snippets: not self-contained, dominated by boilerplate, algorithmic logic buried in context, and topic imbalance (Section 2).
This paper’s positioning
Instead of scaling compute, it scales data quality. It constructs a “textbook” training set and a small “exercise” fine-tuning set, then shows competitive results with a 1.3B model trained in ~4 days on 8 A100 GPUs (Abstract; Section 2.3). It also proposes decontamination/pruning procedures to address training-test overlap concerns (Section 5).

3. Technical Approach

Step-by-step pipeline (Section 2).
Overall training plan
- Pretrain phi-1-base on CodeTextbook: a mixture of (i) filtered Python code from The Stack + StackOverflow (≈6B tokens) and (ii) synthetic “textbooks” generated by GPT‑3.5 (<1B tokens).
- Fine-tune on CodeExercises: a small, synthetic set of Python docstring-style exercises with solutions (~180M tokens).
- The final model phi-1 is the fine-tuned phi-1-base. A smaller control model phi-1-small (350M parameters) is trained with the same pipeline for comparisons.

Data curation: how “textbook quality” is achieved
1) Filtering web code (Section 2.1)
- Annotate ≈100k code files from The Stack/StackOverflow for “educational value” using GPT‑4 (to avoid human labeling effort).
- Train a random forest classifier to predict educational quality from pretrained CodeGen embeddings of each file.
- Use this classifier to select a high-quality subset (≈6B tokens). Examples of “high” vs “low” educational value show concise, self-contained functions vs complex, context-dependent boilerplate (Figure: “Educational values deemed by the filter”).
- Impact: On a 350M model, training on unfiltered data saturates at 12.19% HumanEval after ~200B tokens; filtered data yields 17.68% after 36k steps. Adding synthetic textbooks increases to 20.12% (Section 2.1).
2) Synthetic “textbooks” (Section 2.2, “The synthetic textbook dataset”)
- Generate <1B tokens of Python “textbook” prose interleaved with code using GPT‑3.5.
- Enforce diversity by randomly constraining topics and target audiences in prompts (inspired by TinyStories diversity trick; Section 2.2).
3) Synthetic “exercises” (Section 2.2, “The CodeExercises dataset”)
- Create ~880k docstring-style problems and solutions (~180M tokens). Encourage diversity by constraining function names.
- This dataset intentionally targets short, self-contained function completion as in HumanEval/MBPP; explicit decontamination checks are later applied (Sections 4–5).
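The diversity trick used for the synthetic textbooks can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' prompt: the topic and audience pools, the prompt wording, and the function names `build_prompt` and `build_prompts` are all hypothetical, and the call to the generating model is left out.

```python
import random

# Hypothetical constraint pools; the paper does not list its actual topics or audiences.
TOPICS = ["recursion", "string manipulation", "sorting", "dictionaries"]
AUDIENCES = ["a first-year student", "a self-taught hobbyist", "an engineer new to Python"]


def build_prompt(rng: random.Random) -> str:
    """Draw one random (topic, audience) pair and turn it into a generation prompt."""
    topic = rng.choice(TOPICS)
    audience = rng.choice(AUDIENCES)
    return (
        f"Write a short Python textbook section on {topic} for {audience}. "
        "Interleave the explanation with small, self-contained code examples."
    )


def build_prompts(n: int, seed: int = 0) -> list[str]:
    """Build n prompts from a seeded generator so a run can be reproduced."""
    rng = random.Random(seed)
    return [build_prompt(rng) for _ in range(n)]


if __name__ == "__main__":
    for prompt in build_prompts(3):
        print(prompt)
```

Sampling the constraints, rather than reusing one fixed prompt, is what pushes the generator away from producing near-duplicate passages.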
Model architecture and training (Section 2.3)
Architecture (1.3B phi-1): decoder-only Transformer, 24 layers, hidden size 2048, MLP size 8192, 32 attention heads of size 64, rotary position embeddings, FlashAttention; tokenizer as in codegen-350M-mono. No FIM or MQA.
Training setup: sequence length 2048; next-token prediction; AdamW; dropout 0.1; linear warmup/decay; fp16; 8×A100; deepspeed.
Pretraining hyperparameters: effective batch 1024; peak LR 1e‑3; warmup 750 steps; weight decay 0.1; total 36k steps. phi-1-base is the 24k-step checkpoint, which corresponds to ≈8 epochs over CodeTextbook (~7B tokens) and a little over 50B tokens seen.
Fine-tuning hyperparameters: effective batch 256; peak LR 1e‑4; warmup 50 steps; weight decay 0.01; 6k steps; pick best checkpoint.
Compute: <4 days pretraining; ~7 hours fine-tuning on same hardware (Section 2.3).
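The figures in this block can be sanity-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not the authors' code: the function names are hypothetical, the parameter count covers only the attention and MLP weight matrices (embeddings, biases, and layer norms are left out), the token count assumes every sequence is filled to the full 2048 tokens, and the schedule assumes the linear decay ends at zero on the final step.

```python
def block_params(n_layers: int, d_model: int, d_mlp: int) -> int:
    """Weights in the Transformer blocks only (no embeddings, biases, or norms)."""
    attention = 4 * d_model * d_model  # query, key, value, and output projections
    mlp = 2 * d_model * d_mlp          # up- and down-projection
    return n_layers * (attention + mlp)


def tokens_seen(steps: int, batch_size: int, seq_len: int) -> int:
    """Tokens processed, assuming every sequence is packed to full length."""
    return steps * batch_size * seq_len


def lr_at(step: int, peak: float, warmup: int, total: int) -> float:
    """Linear warmup to `peak`, then linear decay (assumed to reach 0 at `total`)."""
    if step < warmup:
        return peak * step / warmup
    return peak * (total - step) / (total - warmup)


if __name__ == "__main__":
    print(f"block weights: {block_params(24, 2048, 8192) / 1e9:.2f}B")
    print(f"tokens at 24k steps: {tokens_seen(24_000, 1024, 2048) / 1e9:.1f}B")
    print(f"tokens at 36k steps: {tokens_seen(36_000, 1024, 2048) / 1e9:.1f}B")
    print(f"LR at step 375: {lr_at(375, 1e-3, 750, 36_000):.1e}")
```

Under these assumptions the blocks hold about 1.21B weights, which fits the stated 1.3B once the omitted embeddings are added, and the 24k-step checkpoint has seen about 50.3B tokens, matching the "a little over 50B" figure.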
Evaluation-specific definitions (used throughout the paper)
HumanEval: a Python function-completion benchmark in which each generated solution is checked against unit tests (commonly used for code LLMs).
MBPP (“Mostly Basic Python Problems”): another Python function benchmark with unit tests.
pass@1: percentage of tasks solved by the model’s first generated attempt (one sample per task).
Decontamination measures (Section 5):
- n-gram overlap: text overlap using sequences of n tokens; here 13-gram used for a strict check.
- embedding distance: L2 distance between CodeGen-Mono 350M embeddings of two code snippets.
- AST match rate: edit-distance-based similarity between code abstract syntax trees (syntax-structure comparison).
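The first two decontamination measures are simple enough to sketch directly. The code below is an illustration under stated assumptions: it tokenizes on whitespace, whereas the source does not say which tokenizer the 13-gram check uses, and it takes embedding vectors as plain lists of floats rather than computing them with CodeGen-Mono 350M. The function names are hypothetical.

```python
import math


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def has_ngram_overlap(text_a: str, text_b: str, n: int = 13) -> bool:
    """True if the two texts share at least one n-gram (whitespace tokens assumed)."""
    return not ngrams(text_a.split(), n).isdisjoint(ngrams(text_b.split(), n))


def l2_distance(u: list[float], v: list[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.dist(u, v)
```

An exact n-gram match is a strict test: renaming a single variable inside the window breaks it, which is why the embedding and AST measures are used alongside it.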

4. Key Insights and Innovations

High-quality, curriculum-like data can outperform raw scale
Innovation: A deliberate “textbook + exercises” curriculum rather than indiscriminate web scraping (Sections 2.1–2.2).
Evidence: phi-1 (1.3B params, ~7B tokens) achieves 50.6% pass@1 on HumanEval and 55.5% on MBPP (Abstract; Table 1)—surpassing many larger models trained on hundreds of billions to trillions of tokens. This challenges the compute- or parameter-centric view of scaling laws.
Minimal, targeted fine-tuning reorganizes pretraining knowledge
Innovation: A small exercise dataset (~180M tokens) unlocks a large capability jump on tasks beyond the fine-tuning distribution (Section 3).
Evidence: phi-1 shows improved instruction following and better library usage (PyGame, Tkinter, PyTorch, matplotlib) despite those libraries not appearing in the fine-tuning set (examples in Section 3 and Appendix A; import distribution in Figure 3.1).
A strong-form decontamination protocol based on pruning, not only overlap metrics
Innovation: Instead of just reporting overlaps, they retrain after removing any CodeExercises items similar to HumanEval under strict embedding/AST thresholds (Section 5).
Significance: After pruning >40% of exercises (τ down to 0.8), retrained phi-1 still outperforms StarCoder-Prompted on HumanEval (Table 3), supporting that gains are not from memorization.
LLM-graded evaluation on novel “unconventional” tasks
Innovation: Construct 50 new problems designed to be outside training distribution and grade using GPT‑4 with rubric-like prompts (Section 4).
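Result: The model ranking matches the HumanEval ranking. phi-1 scores 52% vs StarCoder 51%, while the gap is much wider on standard HumanEval (≈51% vs 34%; Table 2). Agreement between the two evaluations supports the view that the gains reflect capability rather than benchmark memorization.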

5. Experimental Analysis

Evaluation setup
Benchmarks and metrics: HumanEval and MBPP measured by pass@1 (Table 1). Additional 50 “unconventional” problems graded 0–10 by GPT‑4 (“Understanding” score; Table 2).
Baselines: CodeGen, Replit, StarCoder, PaLM 2-S, WizardCoder, GPT‑3.5/4 (Table 1). Internal ablations across data settings and parameter counts (Figure 2.1).
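For reference, pass@k scores of this kind are commonly computed with the unbiased estimator introduced alongside HumanEval: draw n samples per task, count the c that pass the unit tests, and estimate the chance that at least one of k randomly chosen samples passes. The sketch below assumes that estimator; the source does not state how many samples per task were drawn for Table 1, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average the per-task estimate over (n, c) pairs, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n per task, so with a single sample per task it is simply the fraction of tasks solved on the first attempt.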
Main quantitative results
Core leaderboard (Table 1):
- phi-1 (1.3B, 7B tokens): HumanEval 50.6%, MBPP 55.5%.
- StarCoder-Prompted (15.5B, 1T tokens): HumanEval 40.8%, MBPP 49.5%.
- WizardCoder (16B, 1T tokens): HumanEval 57.3%, MBPP 51.8% (better on HumanEval, worse on MBPP).
- GPT‑4: HumanEval 67% (no MBPP reported in Table 1).
Training data quality ablations (Figure 2.1 and Section 2.1):
- 350M model, unfiltered Stack+SO: saturates at 12.19% (≈200B tokens).
- 350M model, classifier-filtered subset: 17.68% (36k steps).
- 350M model, filtered + synthetic textbooks: 20.12%.
- 1.3B model with CodeTextbook: phi-1-base achieves 29% without any exercises fine-tuning (Figure 2.1).
- Fine-tuning on CodeExercises boosts to 51% (Figure 2.1 caption).
LLM-graded unconventional tasks (Table 2):
- phi-1: 52% Understanding score; StarCoder: 51%; phi-1-base: 37%; phi-1-small: 45%. The HumanEval column in the same table recapitulates their HumanEval standings (e.g., phi-1 ≈51%, StarCoder 34%).
Evidence quality and robustness
Decontamination:
- n-gram overlap: only 4 HumanEval items show 13-gram matches with any exercise item, and manual review marks them as false positives (Section 5.1).
- Pruning and retraining: Using embedding distance + AST match rate τ∈{0.95, 0.9, 0.85, 0.8}, they remove 42.5k–354k items from ~880k exercises (Section 5.2). Even with τ=0.9–0.8, the retrained model’s total HumanEval accuracy remains 45.1–46.3%, above StarCoder-Prompted 41.5% (Table 3).
- Breakdown: Across “similar” vs “non-similar” subsets of HumanEval, performance is lower on non-similar items for all models (Table 3), a reasonable pattern consistent with distribution shift.
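The pruning step can be sketched with the standard library. This is a simplified stand-in, not the paper's procedure: it compares sequences of AST node types with `difflib.SequenceMatcher` instead of computing an edit distance between trees, and it omits the embedding-distance criterion that the paper combines with the AST match rate. The function names are hypothetical.

```python
import ast
from difflib import SequenceMatcher


def node_types(source: str) -> list[str]:
    """Flatten a snippet's AST into node-type names, ignoring identifiers and constants."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_match_rate(code_a: str, code_b: str) -> float:
    """Similarity in [0, 1] between two snippets' node-type sequences."""
    matcher = SequenceMatcher(None, node_types(code_a), node_types(code_b), autojunk=False)
    return matcher.ratio()


def prune(exercises: list[str], benchmark: list[str], tau: float) -> list[str]:
    """Keep only exercises whose match rate with every benchmark item is below tau."""
    return [
        ex for ex in exercises
        if all(ast_match_rate(ex, item) < tau for item in benchmark)
    ]
```

Because only node types are compared, two functions that differ just in variable names and constants get a match rate of 1.0 and are pruned at any threshold. Lowering τ removes progressively looser structural matches, which is why the pruned count grows as τ drops from 0.95 to 0.8.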
Capability “spikes” after fine-tuning (Section 3):
- Instruction following improves (Section 3.1 example).
- External library usage improves (PyGame/Tkinter examples in Section 3.2), even though import counts in CodeExercises are dominated by basic libraries (Figure 3.1).
- Chat-style helpfulness improves (Section 3 “Chat mode example”), though API usage is not always correct.
Do the experiments support the claims?
The head-to-head with large baselines (Table 1) and the ablation trajectory (Figure 2.1) jointly support the central claim: curated “textbook” data + small exercise fine-tuning can rival or surpass larger systems on HumanEval/MBPP.
The pruning-based decontamination and LLM-graded new tasks add credibility that the improvements are not merely from memorizing benchmark-like items (Sections 4–5).
Caveat: LLM grading introduces subjectivity; however, the authors mitigate leakage by having a separate team author the 50 tasks and by using the same grader across models (Section 4).

6. Limitations and Trade-offs

Scope and specialization
Python-only training and evaluation; limited knowledge of other languages and of domain-specific APIs (Conclusion; Appendix B).
Prompt and language robustness
Sensitivity to longer or stylistically varied prompts; performance can drop with grammatical errors or slight wording changes (Appendix B: “Sensitivity to prompt variations” and “Sensitivity to natural language inputs”). Examples show failures when extending a 3-layer PyTorch network to 4 layers and when interpreting the word “unchanged.”
Reasoning limitations
Weaknesses in counting and spatial reasoning (Appendix B, Tkinter layout example).
Data and evaluation assumptions
Synthetic data is generated by GPT‑3.5, which has a “high error rate” (Conclusion). The model nevertheless learns despite errors; however, correctness of synthetic content is an assumption and a potential bottleneck.
LLM-graded evaluation (Section 4) trades objective unit tests for richer rubric-based assessment; this can introduce grader bias, despite controls.
Compute and scalability
Although compute is much lower than typical SOTA models, the approach still requires nontrivial infrastructure (8×A100 for days). Scaling to broader domains may require larger models or more diverse synthetic curricula (Conclusion).

7. Implications and Future Directions

Field-level impact
Demonstrates that carefully engineered training data can rival brute-force scaling. This encourages a shift toward dataset design—clarity, self-containment, balance, and diversity—as a first-class lever in LLM training (Introduction; Conclusion).
Research directions
Improved synthetic data generation: using stronger teachers (e.g., GPT‑4) to reduce error rate (Conclusion).
Curriculum design and diversity metrics: methods to quantify and optimize coverage and non-redundancy in synthetic corpora (Conclusion: calls out lack of good methodology to measure diversity/redundancy).
Broader skills with small models: extend the “textbook + exercises” recipe to domains beyond Python (multi-language code; math; structured reasoning). Investigate how fine-tuning reorganizes pretrained knowledge (Section 3).
Robustness and reasoning: target counting/spatial reasoning and prompt robustness with specialized exercises and adversarial prompt augmentation (Appendix B limitations).
Data ethics and recursive training: understand societal implications as LLMs curate data for future LLMs, including bias and accountability (Conclusion).
Practical applications
Cost-effective coding assistants for education and lightweight IDE integrations where GPU/CPU budgets are constrained.
On-device or edge-deployable code helpers given the small parameter count.
Educational tools: the pipeline itself (textbook + exercises) mirrors how humans learn; it can produce pedagogical datasets in other domains.

Core quantitative takeaway (Table 1): “phi-1 (1.3B params, ~7B tokens) attains HumanEval 50.6% and MBPP 55.5%, outperforming many larger open models trained on datasets up to roughly 140× larger (e.g., StarCoder at ~1T tokens).”

Mechanistic takeaway (Figure 2.1; Sections 2–3): “Filtering for educational value + synthetic textbooks shape pretraining; a small CodeExercises fine-tune reorganizes this knowledge, producing a large capability jump—including on tasks not present in the fine-tuning set.”