Scaling Laws for Neural Language Models
ArXiv: 2001.08361
🎯 Pitch
This paper uncovers precise empirical laws that govern how the performance of Transformer language models scales as you grow model size, dataset size, and training compute. By revealing simple power-law relationships across up to eight orders of magnitude, it provides actionable formulas for predicting model performance and, crucially, shows how to optimally allocate compute—favoring much larger models trained on relatively modest datasets, and stopping training early. These insights not only save resources but also reshape how researchers and practitioners design and scale up language models, offering a universal framework that advances both the science and practice of deep learning.
1. Executive Summary
This paper empirically maps how the test loss of Transformer language models scales with three knobs you can turn during training: model size (N parameters), data size (D tokens), and training compute (C FLOPs). Across 6–8 orders of magnitude, it finds simple power-law equations that predict performance and prescribe how to allocate a fixed compute budget—train very large models on comparatively modest data, stop early, and use batch sizes near a “critical” value.
2. Context and Motivation
- Problem addressed
- Practitioners must decide how big a model to train, how much data to use, how long to train, and how to size batches. Before this work, there was no robust, quantitative recipe that predicted how test loss improves when any of these are scaled, or how to best spend a fixed compute budget.
- Why it matters
- Practical: Training trillion-parameter models costs millions of dollars and months of time. A predictive “scaling law” lets one forecast returns before training and allocate compute efficiently.
- Scientific: Power laws that persist across architectures and scales hint at universal behavior—useful for theory-building (akin to thermodynamic laws). The paper frames such regularities for language modeling.
- Prior approaches and gaps
- Prior empirical hints about scaling with data or model size existed in narrower settings (e.g., [Hestness et al. 2017]), but they did not:
- Jointly model performance vs. model size, data size, training steps, and compute.
- Provide a compute-optimal allocation strategy.
- Establish universality across six-to-eight orders of magnitude with Transformers.
- Positioning relative to existing work
- Extends and unifies earlier observations into a single framework with explicit equations:
- Three single-variable laws for L(N), L(D), and compute-optimal L(Cmin) (Figure 1; Equations (1.1)–(1.3)).
- A joint law for loss vs. model size and data size, L(N, D) (Equation (1.5); Figures 4 and 9).
- A training-time law, L(N, Smin) (Equation (1.6); Figure 4 right; Table 3).
- A batch size law, Bcrit(L) (Equation (1.4); Figure 10).
3. Technical Approach
This is a large, carefully controlled empirical study backed by simple, interpretable formulas. The key moving parts:
- Core setup (Section 2)
- Models: Decoder-only Transformers with standard components; sizes from ~10^3 to ~10^9 non-embedding parameters; variants across depth/width/heads/context (Figures 5–6).
- Non-embedding parameters: The parameter count excludes token and positional embedding matrices. This choice yields clean scaling trends (Figure 6 right) whereas including embeddings confounds depth effects (Figure 6 left).
- Loss: Cross-entropy loss in nats per token over 1024-token contexts.
- Compute proxy: C ≈ 6NBS, where N is the number of non-embedding parameters, B is the number of tokens per batch, and S is the number of parameter update steps. The factor 6 accounts for forward/backward passes (Section 2.1).
- Data: WebText2 (2.29×10^10 tokens total; 6.6×10^8 tokens held out for test). Additional out-of-distribution test sets include Books, Internet Books, Wikipedia, Common Crawl (Section 2.3; Figure 8).
- Training: Mostly Adam with warmup+cosine decay, 2.5×10^5 steps, batch size 512×1024-token sequences; Adafactor for the very largest models (Section 2.2).
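The compute proxy from the setup can be sketched in a few lines. This is an illustrative helper rather than code from the paper: the function names and the 10^9-parameter example model are assumptions, and one PF-day is taken to be 10^15 FLOP/s sustained for 24 hours.

```python
# One petaflop/s-day expressed in FLOPs (1e15 FLOP/s for 24 hours).
PF_DAY_FLOPS = 1e15 * 24 * 3600

def train_compute_flops(n_params: float, batch_tokens: float, steps: float) -> float:
    """Training compute under the proxy C ~ 6 * N * B * S."""
    return 6.0 * n_params * batch_tokens * steps

def train_compute_pf_days(n_params: float, batch_tokens: float, steps: float) -> float:
    """Same estimate, converted to PF-days."""
    return train_compute_flops(n_params, batch_tokens, steps) / PF_DAY_FLOPS

# Batch and step count from the training setup; the model size is hypothetical.
batch_tokens = 512 * 1024   # 512 sequences of 1024 tokens
steps = 2.5e5
n_params = 1e9              # assumed non-embedding parameter count

compute = train_compute_pf_days(n_params, batch_tokens, steps)
print(f"{compute:.2f} PF-days")  # roughly 9.1 PF-days
```

Only non-embedding parameters enter N here, matching the counting convention described in the setup.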
- The scaling-law program (Sections 3–6)
1) Measure single-factor laws where other factors are not limiting:
- L(N) at effectively infinite data and long training (Equation (1.1), Figure 1 right).
- L(D) with early stopping to avoid overfitting (Equation (1.2), Figure 1 middle).
- L(C) at fixed batch size; then correct to compute-optimal L(Cmin) using a batch-size adjustment (Equations (5.5), (1.3); Figure 13).
2) Joint laws and training dynamics:
- Derive and validate a two-argument loss surface L(N, D) that reduces to the single-variable limits and captures overfitting (Equation (1.5); Figure 4 left, Figure 9).
- Characterize learning curves as a function of the batch-adjusted minimum step count Smin, i.e., the steps that would be needed in the large-batch limit: L(N, Smin) (Equation (1.6); Figure 4 right; Table 3).
3) Batch size and optimization efficiency:
- Measure the “critical batch size” Bcrit(L), the point up to which increasing batch size yields near-linear speedups and beyond which returns diminish (Section 5.1).
- Empirically, Bcrit(L) follows a power law in loss (Equation (1.4), Figure 10) and aligns with the “gradient noise scale” as predicted by a prior model of large-batch training [McCandlish et al. 2018]. This enables two conversions:
- Convert a run at batch B and steps S to the minimal steps Smin it would have needed at very large batch: Smin(S) = S / (1 + Bcrit(L)/B) (Equation (5.4)).
- Convert to the minimal compute Cmin it would have needed at very small batch: Cmin(C) = C / (1 + B/Bcrit(L)) (Equation (5.5)).
4) Compute-efficient frontier:
- Using L(N, Smin) together with Bcrit(L), derive how to allocate a fixed compute budget across model size N, batch size B, and steps S to minimize loss (Equations (1.7)–(1.8); Figure 14).
- Key conclusion: spend most compute on model size; increase batch size substantially; increase the number of serial steps very slowly.
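The two batch-size conversions in step 3 are simple enough to sketch directly. The function names below are hypothetical, and the critical batch size is passed in as a plain number rather than computed from the loss, so this is a minimal illustration of Equations (5.4) and (5.5) only.

```python
def min_steps(steps: float, batch_tokens: float, b_crit: float) -> float:
    """Smin = S / (1 + Bcrit/B): steps needed in the very-large-batch limit."""
    return steps / (1.0 + b_crit / batch_tokens)

def min_compute(compute: float, batch_tokens: float, b_crit: float) -> float:
    """Cmin = C / (1 + B/Bcrit): compute needed in the very-small-batch limit."""
    return compute / (1.0 + batch_tokens / b_crit)

# Assumed example values: a run of 100k steps at a batch equal to Bcrit.
b_crit = 1.0e6          # tokens, hypothetical critical batch size
steps, compute = 1.0e5, 40.0

print(min_steps(steps, b_crit, b_crit))      # half the steps actually taken
print(min_compute(compute, b_crit, b_crit))  # half the compute actually spent
```

Training exactly at B = Bcrit therefore costs twice the minimum number of steps and twice the minimum compute, which is why it is a natural compromise between serial time and total compute.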
- Why these designs
- Excluding embedding parameters isolates the capacity that directly scales compute and learning dynamics (Figure 6).
- Early stopping when varying D prevents conflating optimization underfitting with data overfitting (Section 4.2; Figure 16).
- The L(N, D) form is chosen to:
- Reduce to the single-variable laws in the N → ∞ or D → ∞ limits.
- Admit a series expansion in 1/D (overfitting treated as variance-like, proportional to 1/D) (Section 4.1).
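The first design requirement on L(N, D) can be checked numerically. The sketch below assumes the joint form [(Nc/N)^(αN/αD) + Dc/D]^αD with the joint-fit constants this summary quotes from Table 2; the function name and the sample N and D are illustrative choices.

```python
import math

# Joint-fit constants as quoted in this summary (Table 2).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13

def loss_nd(n: float, d: float) -> float:
    """Joint law L(N, D) = [(Nc/N)^(aN/aD) + Dc/D]^aD."""
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

n, d = 1e8, 1e10  # hypothetical model size and dataset size

# Unlimited data: the Dc/D term vanishes, leaving (Nc/N)^aN.
data_unlimited = loss_nd(n, math.inf)
# Unlimited model size: the N term vanishes, leaving (Dc/D)^aD.
model_unlimited = loss_nd(math.inf, d)

print(data_unlimited, (N_C / n) ** ALPHA_N)
print(model_unlimited, (D_C / d) ** ALPHA_D)
print(loss_nd(n, d))  # finite N and D give a loss above both limits
```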
4. Key Insights and Innovations
- Universal power laws with actionable exponents
- What’s new: Precise power-law relations between test loss and each of model size, dataset size, and compute—each holding over many orders of magnitude (Figure 1).
- Why it matters: Each exponent quantifies the “return on investment”:
- Doubling N improves loss by a factor 2^{-αN}; αN ≈ 0.076 (Equation (1.1)).
- Doubling D improves loss by 2^{-αD}; αD ≈ 0.095 (Equation (1.2)).
- Increasing compute on the optimal frontier improves loss as Cmin^{-αCmin}; αCmin ≈ 0.050–0.054 (Equations (1.3), (6.3)–(6.4); Figure 13).
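To make the "return on investment" reading concrete, the exponents can be turned into percentage loss reductions. This is a small sketch under the assumption that each law is a pure power law L ∝ X^(-α); the helper name is hypothetical and the compute exponent uses the empirical value 0.050.

```python
def loss_ratio(scale_factor: float, alpha: float) -> float:
    """Multiplicative change in loss when a resource grows by scale_factor,
    assuming a pure power law L proportional to X^(-alpha)."""
    return scale_factor ** (-alpha)

exponents = {"N": 0.076, "D": 0.095, "Cmin": 0.050}

for name, alpha in exponents.items():
    drop_2x = 1.0 - loss_ratio(2.0, alpha)
    drop_10x = 1.0 - loss_ratio(10.0, alpha)
    print(f"{name}: 2x -> {drop_2x:.1%} lower loss, 10x -> {drop_10x:.1%} lower loss")
```

Doubling N, for instance, lowers the loss by about 5%, so large loss reductions require growth over orders of magnitude rather than factors of two.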
- A joint law for loss vs. model and data size that quantifies overfitting
- Innovation: The two-argument formula L(N, D) = [(Nc/N)^{αN/αD} + (Dc/D)]^{αD} (Equation (1.5)). It accurately predicts early-stopped test loss surfaces (Figure 4 left; Figure 9 left).
- Significance: It yields a simple overfitting control rule:
- Overfitting depends primarily on N^{αN/αD}/D (Figure 9 right). To keep it constant, scale D ∝ N^{αN/αD} ≈ N^{0.74} (Section 4.2; Equation (4.4)).
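The overfitting rule follows from dividing L(N, D) by its infinite-data limit, which leaves [1 + (N/Nc)^(αN/αD) · Dc/D]^αD. The sketch below evaluates that excess and checks that growing D as N^(αN/αD) holds it fixed. It uses the joint-fit constants quoted from Table 2; the function names and example sizes are assumptions.

```python
import math

# Joint-fit constants as quoted in this summary (Table 2).
ALPHA_N, ALPHA_D = 0.076, 0.103
N_C, D_C = 6.4e13, 1.8e13
RATIO = ALPHA_N / ALPHA_D  # about 0.74

def overfit_fraction(n: float, d: float) -> float:
    """Relative excess loss L(N, D) / L(N, inf) - 1 implied by the joint law."""
    return (1.0 + (n / N_C) ** RATIO * D_C / d) ** ALPHA_D - 1.0

def data_for_same_overfit(d: float, model_growth: float) -> float:
    """Dataset size that keeps the overfitting fraction unchanged
    when the model grows by model_growth."""
    return d * model_growth ** RATIO

n, d = 1e8, 1e9  # hypothetical starting point
before = overfit_fraction(n, d)
after_scaled = overfit_fraction(8 * n, data_for_same_overfit(d, 8))
after_unscaled = overfit_fraction(8 * n, d)

print(f"baseline excess loss:         {before:.3%}")
print(f"8x model, data scaled N^0.74: {after_scaled:.3%}")
print(f"8x model, same data:          {after_unscaled:.3%}")
```

Because the exponent is below one, the dataset needs to grow more slowly than the model: an 8× larger model calls for under 5× more data to hold the overfitting penalty constant.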
- Universality of training dynamics and the critical batch size
- Finding: Learning curves at large batch collapse to a two-term power law: L(N, Smin) = (Nc/N)^{αN} + (Sc/Smin)^{αS} with αS ≈ 0.76, Sc ≈ 2.1×10^3 (Equation (1.6); Table 3; Figure 4 right).
- Batch size law: Bcrit(L) = B* / L^{1/αB} with B* ≈ 2×10^8 tokens and αB ≈ 0.21 (Equation (1.4), Figure 10), so the critical batch size grows as the loss falls.
- Impact: Enables batch/step/compute conversions (Equations (5.4)–(5.5)) and a principled choice of batch size near Bcrit.
- Compute-optimal training stops early and favors larger models
- Empirical and derived allocation (Equations (1.7)–(1.8); Figure 14; Section 6): N ∝ Cmin^0.73, B ∝ Cmin^0.24, S ∝ Cmin^0.03, D = B·S.
- This is a fundamental shift: most additional compute should increase parameters; the number of serial optimization steps grows extremely slowly. Training “to convergence” is compute-inefficient (Appendix B.3).
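A short sketch shows what these allocation exponents imply for a concrete budget increase. It assumes the three empirical exponents quoted above apply exactly; the dictionary and function names are hypothetical.

```python
# Empirical allocation exponents quoted above (Equations (1.7)-(1.8)).
EXPONENTS = {"N": 0.73, "B": 0.24, "S": 0.03}

def scale_allocation(compute_multiplier: float) -> dict:
    """Growth factor for model size, batch size, and serial steps
    when the compute budget grows by compute_multiplier."""
    return {name: compute_multiplier ** exp for name, exp in EXPONENTS.items()}

growth = scale_allocation(10.0)
for name, factor in growth.items():
    print(f"{name} grows by {factor:.2f}x")

# The exponents sum to 1, so the three factors multiply back to the
# compute multiplier, consistent with the proxy C ~ 6 * N * B * S.
print(growth["N"] * growth["B"] * growth["S"])
```

With 10× more compute, the model grows by more than 5× while the number of serial steps grows by well under 10%, which is the sense in which training "stops early".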
- Weak dependence on architectural shape
- At fixed non-embedding parameter count, depth/width/heads/FFN ratios matter little: loss varies by only a few percent over wide ranges (Figure 5). This de-emphasizes architecture search relative to scaling.
5. Experimental Analysis
- Evaluation methodology
- Data: WebText2 (2.29×10^10 tokens) with 6.6×10^8 test tokens; out-of-distribution tests on Books, Internet Books, Wikipedia, Common Crawl (Section 2.3; Figure 8).
- Metric: Cross-entropy loss (“nats per token”; lower is better).
- Setups: Systematic sweeps over model size (10^3 to 10^9 parameters), data subsets (2.1×10^7 to 2.2×10^10 tokens), and compute budgets; early stopping when comparing D (Figure 9), long training when comparing N.
- Batch scans to measure Bcrit(L) and validate the noisy-quadratic “critical batch size” relationship (the fits in Figure 18 feed the Bcrit(L) measurements in Figure 10).
- Main quantitative results
- Single-factor power laws (Figure 1; Equations (1.1)–(1.3)):
- L(N) = (Nc/N)^{αN} with αN ≈ 0.076, Nc ≈ 8.8×10^13.
- L(D) = (Dc/D)^{αD} with αD ≈ 0.095, Dc ≈ 5.4×10^13.
- Compute-adjusted: L(Cmin) ∝ Cmin^{-0.050} empirically (Figure 13, dashed fit), with a theoretical prediction αCmin ≈ 0.054 (Equation (6.4)).
- Joint surfaces: L(N, D) fits well over two orders of magnitude in D; fitted parameters αN = 0.076, αD = 0.103, Nc = 6.4×10^13, Dc = 1.8×10^13 (Table 2; Figure 9).
- Learning curves: L(N, Smin) fits with αS = 0.76, Sc = 2.1×10^3 (Table 3; Figure 4 right).
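The learning-curve law can be inverted to estimate how many large-batch steps a model needs to reach a target loss. The sketch below is an illustration under stated assumptions: it uses αS and Sc as quoted, but borrows αN and Nc from the single-variable L(N) fit, whereas the Table 3 fit has its own values for those two constants. Function names and the example model size are hypothetical.

```python
# Step-term constants as quoted (Table 3); N-term constants are borrowed
# from the single-variable L(N) fit, which is an assumption of this sketch.
ALPHA_S, S_C = 0.76, 2.1e3
ALPHA_N, N_C = 0.076, 8.8e13

def loss_at_steps(n: float, s_min: float) -> float:
    """L(N, Smin) = (Nc/N)^aN + (Sc/Smin)^aS."""
    return (N_C / n) ** ALPHA_N + (S_C / s_min) ** ALPHA_S

def steps_to_reach(n: float, target_loss: float) -> float:
    """Invert the learning curve for Smin at a given target loss."""
    floor = (N_C / n) ** ALPHA_N  # loss this model size approaches with unlimited steps
    if target_loss <= floor:
        raise ValueError("target loss is below what this model size can reach")
    return S_C / (target_loss - floor) ** (1.0 / ALPHA_S)

n = 1e8          # hypothetical non-embedding parameter count
target = 3.0     # nats per token
s_needed = steps_to_reach(n, target)
print(f"about {s_needed:,.0f} steps in the large-batch limit")
print(loss_at_steps(n, s_needed))  # recovers the target loss
```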
- Critical batch size:
- Empirical Bcrit(L) doubles for every ~13% decrease in loss; independent of model size at fixed loss (Figure 10).
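The doubling rule can be checked against the batch size law. This sketch assumes the form Bcrit(L) = B* / L^(1/αB), which is the form under which the critical batch size grows as the loss falls, with the quoted B* ≈ 2×10^8 tokens and αB ≈ 0.21; the function name and the example loss are illustrative.

```python
B_STAR = 2e8     # tokens
ALPHA_B = 0.21

def critical_batch_size(loss: float) -> float:
    """Bcrit(L) = B* / L^(1/aB), in tokens."""
    return B_STAR / loss ** (1.0 / ALPHA_B)

loss = 3.0  # hypothetical loss in nats per token
ratio = critical_batch_size(0.87 * loss) / critical_batch_size(loss)
print(f"Bcrit at L={loss}: {critical_batch_size(loss):.3g} tokens")
print(f"growth after a 13% loss decrease: {ratio:.2f}x")  # close to 2x
```

The ratio depends only on the relative loss decrease, not on the starting loss or on model size, which matches the observation that Bcrit is a function of the loss alone.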
- Compute-efficient allocations (Figure 14; Section 6.1):
- N(Cmin) ∝ Cmin^0.73 (left panel).
- Smin(Cmin) ∝ Cmin^0.03 (right panel).
- Deviation from optimal size costs little: models 0.6× to 2.2× optimal need only ~20% extra compute to hit the same loss (Figure 12 left).
- Generalization:
- Loss on other datasets improves in near-lockstep with training-distribution loss—a roughly constant offset across scales (Figure 8 left).
- Generalization tracks in-distribution validation loss during training; proximity to convergence and network depth do not matter (Figure 8 right; Appendix D.8).
- Architecture and baselines:
- LSTMs match Transformers on early-context tokens but plateau by ~100 tokens; Transformers keep improving across the entire 1024-token context and asymptotically win (Figure 7).
- Shape invariance at fixed N (Figure 5); embedding parameters confound scaling (Figure 6).
- Support strength and robustness
- Trends span 6–8 orders of magnitude with consistent slopes, survive changes in depth/width/heads/context (Figures 1, 5, 6, 8).
- Batch-size–adjusted compute and steps align with independent “gradient noise scale” theory (Figure 10; Section 5.1).
- Caveat: Very small datasets (~2×10^7 tokens) behave differently (poor fits; early overfitting) (Table 2 note; Figure 16).
- Notable ablations or diagnostics
- Early stopping vs overfitting gap: A lower bound on early-stopping step relates the excess loss from finite data to steps needed (Equation (5.7); Figure 16 left).
- Per-token loss patterns show power-law dependence on context position and faster short-range learning early in training (Figures 20–21).
6. Limitations and Trade-offs
- Assumptions
- Stationarity of scaling exponents over large ranges. While supported up to the study’s scale, the exponents may drift at larger scales or with different tokenization/data distributions (Section 8; Appendix C).
- Early stopping and fixed 10% dropout as the primary anti-overfitting mechanism (Section 4.2). Different regularization/augmentation might alter L(N, D).
- Compute proxy C ≈ 6NBS ignores context-length-dependent terms, which grow with nctx and could matter when contexts become very long (Appendix C).
- Scope limits and edge cases
- Small-data regime: Behavior differs when an epoch is only ~40 steps; the L(N, D) fit degrades (Table 2; Figure 16).
- Extrapolation uncertainties:
- Bcrit(L) extrapolated outside the observed loss range could shift step/compute trade-offs (Appendix C).
- The constants Nc, Dc depend on tokenization and vocabulary size (Section 1.2), limiting cross-setup comparability of absolute numbers (though exponents tend to be robust).
- Potential contradiction at extreme scale
- If one continually follows the compute-optimal policy, data used per training run grows only as D(Cmin) ≈ (4×10^10)·(Cmin/PF-day)^0.26 tokens (Equation (6.7)), much slower than the data needed to keep overfitting constant, D ∝ N^0.74 ∝ Cmin^0.54 (Equation (6.6)). This implies the L(Cmin) curve must eventually be capped by L(D), producing an intersection around C* ~ 10^4 PF-days, N* ~ 10^12, D* ~ 10^12 tokens, L* ~ 1.7 nats/token (Figure 15; Equation (6.8)).
- The paper treats this as a likely breakdown point for the simple power laws; the exact location is sensitive to exponents (Figure 15 caption).
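The mismatch between the two data-growth rates is easy to quantify. The sketch below evaluates the quoted relation D(Cmin) ≈ 4×10^10 · Cmin^0.26 (Cmin in PF-days) and compares growth per decade of compute against the Cmin^0.54 rate needed to hold overfitting constant. Function and variable names are hypothetical, and only the exponents and prefactor quoted above are used.

```python
def data_used_compute_efficient(c_min_pf_days: float) -> float:
    """Tokens processed by a compute-efficient run,
    D ~ 4e10 * Cmin^0.26 with Cmin in PF-days (Equation (6.7))."""
    return 4e10 * c_min_pf_days ** 0.26

# Growth per 10x increase in compute.
growth_used = 10 ** 0.26     # data actually consumed on the efficient frontier
growth_needed = 10 ** 0.54   # data needed to keep overfitting constant

print(f"data used grows   {growth_used:.2f}x per 10x compute")
print(f"data needed grows {growth_needed:.2f}x per 10x compute")

# Near the quoted intersection scale of about 1e4 PF-days.
d_at_cstar = data_used_compute_efficient(1e4)
print(f"tokens used at 1e4 PF-days: {d_at_cstar:.2e}")
```

Since the required data grows almost twice as fast per decade as the data consumed, the gap widens steadily with budget, and at 10^4 PF-days the run processes a few times 10^11 tokens, the same rough scale as the quoted D* ~ 10^12.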
- Practical costs
- Compute-optimal training favors massive models; even with short training, wall-clock time, memory, and parallelism requirements are substantial. Hardware and engineering constraints can force suboptimal allocations (Figure 12; Section 6.1).
7. Implications and Future Directions
- How it changes the field’s practice
- Planning: The power laws give a quantitative playbook for budget allocation. For a fixed compute budget, prioritize parameters, scale batch near Bcrit, and stop far short of convergence (Equations (1.7)–(1.8); Figure 14).
- Expectations: Larger models are more sample-efficient, needing fewer updates and fewer unique tokens to hit the same loss (Figure 2; Figure 19). This reframes “data vs. model size”: big models can do more with less data when trained compute-efficiently.
- Architecture search de-emphasized: At fixed N, shape choices have minor effect (Figure 5).
- Research avenues
- Theoretical foundations: The persistence of power laws suggests an underlying “statistical mechanics” of optimization and generalization. The learning-curve exponent αS likely reflects the Hessian spectrum; making this precise is an open problem (Section 5.2).
- Cross-domain tests: Do the same laws hold for images, audio, video, or multimodal models? The paper conjectures portability to other maximum-likelihood generative modeling tasks (Section 8).
- Beyond the intersection point: Investigate how L(Cmin) saturates, e.g., by modeling intrinsic “noise floors” or non-text entropy, and whether different training data, curricula, or architectures shift the critical point (Section 6.3).
- Training systems: Since S grows slowly with budget (≈ Cmin^0.03), throughput improvements should target parallelism and model/distributed systems (pipeline/tensor parallel, sparsity/mixture-of-experts) to enable much larger N (Discussion; [Huang et al. 2018], [Shazeer et al. 2018]).
- Applications and downstream use
- Forecasting for large model programs: Organizations can use the equations to estimate the loss (and thereby downstream task performance proxies) achievable for a given budget, then choose N, B, S, and D accordingly.
- Dataset strategy: To avoid overfitting penalties when scaling N, expand or diversify data roughly as D ∝ N^0.74 (Equation (4.4)), though compute-efficient training suggests using far less and stopping early when compute-limited.
- Evaluation: Since out-of-distribution losses track in-distribution with roughly constant offsets (Figure 8), monitoring one can predict the other during training.
Quoted highlights tied to figures/equations
- “Performance depends strongly on scale, weakly on model shape” (Section 3; Figure 5).
- “Smooth power laws” (Figure 1; Equations (1.1)–(1.3)): αN ≈ 0.076, αD ≈ 0.095, αCmin ≈ 0.050.
- “Universality of overfitting” captured by L(N, D) and the ratio N^{0.74}/D (Equation (1.5); Figure 9 right).
- “Universality of training”: L(N, Smin) with αS ≈ 0.76 (Equation (1.6); Table 3).
- “Compute-efficient training stops far short of convergence” with N ∝ Cmin^0.73, B ∝ Cmin^0.24, S ∝ Cmin^0.03 (Equations (1.7)–(1.8); Figure 14; Figure 3).
- “Optimal batch size” grows with lower loss: Bcrit(L) = B* / L^{1/αB}; B* ~ 2×10^8, αB ~ 0.21 (Equation (1.4); Figure 10).