daVinci-LLM: Towards the Science of Pretraining¶
ArXiv: 2603.27164
Pitch¶
daVinci-LLM addresses a critical gap in LLM research by combining industrial-scale computational resources with full academic transparency, enabling systematic exploration of pretraining science typically inaccessible to researchers. Through 200+ controlled ablations on a 3B-parameter model trained across 8T tokens, the work demonstrates that principled data processing depth can substitute for massive volume scaling, achieving performance comparable to 7B-scale models with less than half the parameters.
1. Executive Summary¶
daVinci-LLM presents a systematic investigation of LLM pretraining through the lens of data processing depth, establishing that hierarchical data quality enhancement—from filtering (L3) to generative refinement (L4) to cognitive synthesis (L5)—can substitute for multi-fold increases in raw data volume. The paper trains a 3B-parameter model on 8T tokens that achieves 51.72 overall score, matching the 7B-scale OLMo-3 despite having less than half the parameters. Through 200+ controlled ablations, the work documents both successful strategies and failed experiments, releasing complete training trajectories, intermediate checkpoints, and processed datasets under a fully-open paradigm.
2. Context and Motivation¶
The Structural Paradox in Pretraining Research¶
The paper identifies a fundamental gap in how LLM pretraining is studied and documented. Organizations with computational resources for large-scale pretraining—commercial labs like OpenAI, Anthropic, and Google—operate under competitive pressures that favor rapid deployment over systematic exploration and inhibit transparent disclosure of training processes. Conversely, academic institutions possess research freedom but lack pretraining-scale infrastructure. The authors explicitly note that even well-funded efforts like OLMo face "severe scale limitations" that make large-scale systematic exploration "structurally infeasible."
This matters because pretraining determines a model's capability ceiling. The paper cites research showing that pretraining advantages are "amplified rather than compensated for" in post-training phases—meaning that post-training techniques like fine-tuning and RLHF cannot fundamentally overcome capability foundations established during pretraining. Yet precisely when this foundational phase is most critical, the community has limited ability to systematically investigate the principles governing how models acquire and organize knowledge.
The Opacity of Current Releases¶
The paper provides a detailed transparency comparison (Table 1) across current LLM releases:
- Closed-source models (GPT, Claude, Gemini): Accessible only through APIs, providing zero visibility into training processes.
- Open-weight models (LLaMA, Qwen, DeepSeek): Release model checkpoints but withhold critical pretraining details—data compositions, mixture ratios, and training dynamics remain largely undisclosed.
- Fully-open efforts (OLMo, YuLan): Achieve transparency but face computational constraints that limit systematic exploration.
The authors position daVinci-LLM at the intersection that remains largely unexplored: combining industrial-scale computational resources with full research freedom to conduct and publish systematic ablation studies.
Limitations of Prior Work¶
The paper identifies several specific gaps in existing pretraining research:
Lack of systematic methodology for data processing decisions. While data quality's importance is increasingly recognized, the field lacks a principled framework for comparing processing operations across heterogeneous data sources. Practitioners cannot answer questions like: "Does investing in LLM-based refinement justify the cost versus simply collecting more data?"
No documentation of negative results. Existing releases present finalized configurations as settled conventions without documenting the exploration process. This leaves practitioners without guidance on what approaches were tried and failed, potentially causing redundant experimentation.
Evaluation protocol inconsistencies. The paper demonstrates that perplexity-based and generative evaluations can produce opposite model rankings, yet prior work rarely acknowledges this discrepancy or clarifies which protocol aligns with intended deployment scenarios.
Positioning: Openness as Scientific Methodology¶
daVinci-LLM explicitly treats openness itself as scientific methodology. The paper frames its contributions around three pillars:
- Data: Complete processing pipelines documented through the Data Darwinism L0-L9 taxonomy, with all source datasets and processing decisions explicitly annotated.
- Training Recipe: A 3B model trained from random initialization with all intermediate checkpoints released at 5k-step intervals.
- Exploration: 200+ controlled ablations documenting both successes and failures across four thematic areas.
The goal is transforming pretraining from an "intuition-led craft" to an "evidence-based discipline."
3. Technical Approach¶
3.1 Reader Orientation¶
daVinci-LLM is a pretraining research paper that builds a 3B-parameter language model through a principled, two-stage curriculum while systematically investigating how data processing depth, training dynamics, and mixture design shape capability development. The system combines the Data Darwinism framework (a ten-level taxonomy for organizing data processing operations from basic filtering to knowledge synthesis) with a multi-stage adaptive training curriculum that shifts from broad knowledge acquisition to reasoning-intensive enhancement.
3.2 Big-Picture Architecture¶
The system has five major components:
- Data Pool (~7.58T tokens total): Four domains, each annotated with its Darwin Level processing depth: General (Common Crawl), Code (GitHub + synthetic), Science (papers/books), and QA (structured question-answer pairs).
- Data Darwinism Framework (L0-L9): A hierarchical taxonomy organizing processing operations from basic acquisition (L0-L2) through model-based filtering (L3) to generative refinement (L4) and cognitive synthesis (L5-L9).
- Stage 1 Training (6T tokens): Foundation-building phase split into Stage 1-1 (4T tokens, progressive batch scaling) and Stage 1-2 (2T tokens, domain proportion adjustment).
- Stage 2 Training (2T tokens): Reasoning enhancement phase split into Stage 2-1 (balanced 30% QA introduction) and Stage 2-2 (aggressive 70% QA intensification).
- Evaluation Protocol: 19 benchmarks across General, Code, and Science domains, with explicit distinction between perplexity-based and generative evaluation methods.
Information flows as follows: Raw data sources → Darwin-level processing → Domain-specific token pools → Two-stage curriculum training → Intermediate checkpoints at 5k steps → Final model evaluation.
3.3 Roadmap for the Deep Dive¶
The technical breakdown proceeds in this order:
- Model Architecture: The base configuration and design choices for the 3B model.
- Data Darwinism Framework: Detailed explanation of the L0-L9 taxonomy and how it organizes processing decisions.
- Data Pool Composition: Specific data sources, their Darwin levels, and token allocations across training stages.
- Training Methodology: The two-stage curriculum with hyperparameters and stage transitions.
- Exploration Findings: The four thematic ablation areas that inform the training recipe.
3.4 Detailed Technical Breakdown¶
This is primarily an empirical investigation paper whose core contribution is establishing systematic relationships between data processing depth, training dynamics, and model capabilities through controlled experimentation.
Model Architecture¶
daVinci-LLM adopts a Qwen2-based transformer architecture with 3.09B parameters. Table 4 provides the complete specification:
- Layers: 36 transformer layers
- Hidden size: 2048
- MLP intermediate size: 11008 (expansion ratio ~5.4×)
- Attention heads: 16 query heads
- KV heads: 2 (Grouped-Query Attention for memory efficiency)
- Head dimension: 128
- Activation function: SwiGLU
- Normalization: RMSNorm with \(\epsilon = 10^{-6}\)
- Positional encoding: Rotary Position Embeddings (RoPE) with base \(\theta = 10000\)
- Max sequence length: 4096 tokens
- Vocabulary: 151936 tokens (Qwen2 tokenizer)
- Precision: bfloat16
The architecture follows Qwen2's design philosophy of prioritizing depth (36 layers) with moderate hidden dimensions, which the authors note has proven effective for balancing parameter count, training throughput, and downstream performance in the 3B scale regime.
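As a sanity check on the specification above, the parameter count can be estimated directly from the listed dimensions. The sketch below is illustrative only: tied input/output embeddings and bias terms on the Q/K/V projections are assumptions (they follow common Qwen2 practice and are not stated in this summary), and the names `ModelConfig`, `count_parameters`, and `kv_cache_bytes_per_token` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values copied from the specification list above.
    layers: int = 36
    hidden: int = 2048
    mlp_hidden: int = 11008
    q_heads: int = 16
    kv_heads: int = 2
    head_dim: int = 128
    vocab: int = 151936


def count_parameters(cfg: ModelConfig, tied_embeddings: bool = True,
                     qkv_bias: bool = True) -> int:
    """Estimate the parameter count; tying and QKV biases are assumptions."""
    q_dim = cfg.q_heads * cfg.head_dim
    kv_dim = cfg.kv_heads * cfg.head_dim
    attn = cfg.hidden * q_dim              # query projection
    attn += 2 * cfg.hidden * kv_dim        # key and value projections (GQA)
    attn += q_dim * cfg.hidden             # output projection
    if qkv_bias:
        attn += q_dim + 2 * kv_dim
    mlp = 3 * cfg.hidden * cfg.mlp_hidden  # SwiGLU: gate, up, and down matrices
    norms = 2 * cfg.hidden                 # two RMSNorm weight vectors per layer
    embed = cfg.vocab * cfg.hidden
    total = cfg.layers * (attn + mlp + norms) + embed + cfg.hidden  # + final norm
    if not tied_embeddings:
        total += embed                     # separate output head
    return total


def kv_cache_bytes_per_token(cfg: ModelConfig, bytes_per_value: int = 2) -> int:
    """Keys plus values across all layers, at 2 bytes per bfloat16 value."""
    return 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * bytes_per_value


cfg = ModelConfig()
print(f"{count_parameters(cfg) / 1e9:.2f}B parameters")
print(f"{kv_cache_bytes_per_token(cfg)} KV-cache bytes per token")
```

Under these assumptions the estimate lands at roughly 3.09B parameters, consistent with the stated size. With 2 KV heads instead of 16, the per-token KV cache is 8× smaller than it would be under full multi-head attention.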
Data Darwinism Framework¶
The paper introduces the Data Darwinism framework as a principled taxonomy for organizing data processing operations. The framework spans ten levels (L0-L9), with an underlying evolutionary logic: processing begins with selection and preservation of existing content, progressively moves toward active rewriting and enrichment, and ultimately reaches the capacity to synthesize entirely new content.
L0: Data Acquisition. Raw data gathering from web crawls, PDF repositories, code platforms, and curated databases. Data exists in highly variable formats (HTML, PDF, binary) with significant noise and duplication.
L1: Format Normalization. Converting heterogeneous raw data into unified, training-ready text representations. Operations include OCR processing of scanned PDFs and HTML parsing. No content filtering occurs; the goal is uniform processability.
L2: Rule-based Filtering. Quality control through deterministic pattern-based rules: near-duplicates detected via MinHash LSH, excessively short or malformed text, non-target languages, and garbled text from encoding errors. Requires no learned models and runs efficiently on CPU.
L3: Lightweight Model Filtering. Semantic-level quality assessment using pretrained lightweight classifiers. Tasks include educational value scoring, domain identification, and document type classification. Documents are retained or discarded based on predicted quality, but content is never modified.
L4: Generative Refinement. A qualitative shift from selection to active transformation. Medium-to-large generative models purify content by removing structural noise (navigation elements, reference lists, OCR artifacts, formatting defects) and repairing fragmented text. A critical constraint: no external knowledge may be introduced, and output must remain semantically equivalent to input.
L5: Cognitive Completion. Frontier LLMs enrich data by making implicit reasoning explicit. Research and technical documents are typically written for expert audiences with compressed logical steps and assumed background knowledge. L5 bridges this "learnability gap" through reasoning reconstruction, terminological explication, and pedagogical bridging.
L6-L9: Higher-Order Synthesis. Contextual Completion (L6) expands documents with external references. Environment Synthesis (L7) constructs executable environments for validation. Ecosystem Synthesis (L8) builds multi-agent interaction systems. World Synthesis (L9) represents the theoretical apex—comprehensive simulated worlds as unlimited synthetic data sources.
The paper emphasizes that levels are not mutually exclusive one-time passes: operations can be applied multiple times with different models or parameters, and ordering need not follow the hierarchy strictly.
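One way to make the taxonomy concrete is to encode the levels as an ordered enumeration and derive the two boundaries the descriptions emphasize: where processing starts rewriting content (L4) and where it may start adding information (L5). This is a minimal sketch, not the paper's tooling; the class and function names are hypothetical, and the integer ordering is only a labeling convenience, since the levels need not be applied in strict order.

```python
from enum import IntEnum


class DarwinLevel(IntEnum):
    """The ten processing levels, named after the descriptions above."""
    ACQUISITION = 0
    FORMAT_NORMALIZATION = 1
    RULE_BASED_FILTERING = 2
    LIGHTWEIGHT_MODEL_FILTERING = 3
    GENERATIVE_REFINEMENT = 4
    COGNITIVE_COMPLETION = 5
    CONTEXTUAL_COMPLETION = 6
    ENVIRONMENT_SYNTHESIS = 7
    ECOSYSTEM_SYNTHESIS = 8
    WORLD_SYNTHESIS = 9


def rewrites_content(level: DarwinLevel) -> bool:
    """L4 and above transform text; L0-L3 only acquire, convert, or select it."""
    return level >= DarwinLevel.GENERATIVE_REFINEMENT


def may_add_information(level: DarwinLevel) -> bool:
    """L4 must stay semantically equivalent to its input; enrichment starts at L5."""
    return level >= DarwinLevel.COGNITIVE_COMPLETION


# A few sources with the levels quoted in the Data Pool Composition section.
SOURCE_LEVELS = {
    "TxT360-Stack-Exchange": DarwinLevel.RULE_BASED_FILTERING,
    "Nemotron-CC-v1": DarwinLevel.LIGHTWEIGHT_MODEL_FILTERING,
    "MegaMath-Web-Pro": DarwinLevel.GENERATIVE_REFINEMENT,
    "Nemotron-Pretraining-Code-v1-synthetic-code": DarwinLevel.COGNITIVE_COMPLETION,
}

rewritten = sorted(n for n, lvl in SOURCE_LEVELS.items() if rewrites_content(lvl))
print(rewritten)
```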
Data Pool Composition¶
The pretraining corpus totals approximately 7.58T tokens across four categories. Table 2 provides detailed composition by training stage.
General Domain (4.28T pool):
- Nemotron-CC-v1 (L3): Built from 99 Common Crawl snapshots, undergoing text extraction, English filtering, global deduplication, and ensemble classifier scoring for educational value and informativeness. Documents grouped into five quality tiers. Contributes ~4.28T tokens.
Code Domain (598B pool):
- Self-Crawled GitHub (L3): Repositories crawled directly with minimum 10-star threshold, filtered through OpenCoder's pipeline. ~187B tokens.
- Nemotron-Pretraining-Code-v1-non-synthetic (L3): Permissively licensed GitHub code with exact and fuzzy deduplication. ~220B tokens.
- Nemotron-Pretraining-Code-v1-synthetic-code (L5): LLM-generated question-answer pairs grounded in code snippets across 11 languages. ~171B tokens.
- TxT360-Stack-Exchange (L2): Technical community discourse from 364 Stack Exchange sub-communities. ~20B tokens.
Science Domain (1.94T pool):
- MegaMath-Web (L3): Mathematical content from Common Crawl with fastText-based filtering. ~231B tokens.
- MegaMath-Web-Pro (L4): High-quality subset refined by Llama-3.3-70B-Instruct. ~13B tokens.
- MegaMath Refined (L4): MegaMath-Web refined by Qwen3-235B-A22B-Instruct. ~176B tokens.
- Nemotron-CC-Math-v1 variants (L4): Math corpus from 98 Common Crawl snapshots with Phi-4-based cleanup and FineMath classifier scoring. Multiple variants totaling ~322B tokens.
- Darwin-Science-Book (L4): Scientific books processed through L4 refinement using GPT-OSS-120B. ~251B tokens.
- Darwin-Science-Paper variants (L4-L5): Academic papers with L4 refinement and L5 cognitive completion using both GPT-OSS-120B and Qwen3-235B. Combined ~945B tokens.
QA Domain (734B pool): All QA sources reach Darwin Level L5, including synthetic QA from Nemotron-CC-v1 (~492B tokens), various Nemotron-Pretraining-SFT subsets, rejection-sampled post-training datasets, and domain-specific QA extracted from Darwin-Science-Book.
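The per-source figures above can be cross-checked against the quoted pool sizes with simple arithmetic. The sketch below uses only the rounded token counts given in this section (in billions); the dictionary names are hypothetical.

```python
# Token counts in billions, copied from the source lists above.
CODE_SOURCES_B = {
    "Self-Crawled GitHub": 187,
    "Nemotron-Pretraining-Code-v1-non-synthetic": 220,
    "Nemotron-Pretraining-Code-v1-synthetic-code": 171,
    "TxT360-Stack-Exchange": 20,
}
SCIENCE_SOURCES_B = {
    "MegaMath-Web": 231,
    "MegaMath-Web-Pro": 13,
    "MegaMath Refined": 176,
    "Nemotron-CC-Math-v1 variants": 322,
    "Darwin-Science-Book": 251,
    "Darwin-Science-Paper variants": 945,
}
# Pool sizes as quoted per domain (General 4.28T, Science 1.94T).
DOMAIN_POOLS_B = {"General": 4280, "Code": 598, "Science": 1940, "QA": 734}

code_total = sum(CODE_SOURCES_B.values())
science_total = sum(SCIENCE_SOURCES_B.values())
pool_total = sum(DOMAIN_POOLS_B.values())

print(f"Code sources:    {code_total}B")
print(f"Science sources: {science_total}B")
print(f"All pools:       {pool_total / 1000:.2f}T")
```

The code sources sum to the quoted 598B and the science sources to about 1.94T. The four rounded pool sizes add up to roughly 7.55T, close to the ~7.58T total quoted above; the small gap is plausibly a rounding effect in the per-domain figures.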
Training Methodology¶
The training process uses a two-stage curriculum totaling 8T tokens with adaptive data composition.
Stage 1: General Foundation Pretraining (6T tokens)
Stage 1-1 (4T tokens):
- Global batch size: Progressive scaling from 1024 → 2048 → 4096
- Learning rate: Constant at \(3 \times 10^{-4}\) after 2000-step warmup
- Data mixture: CC 68.2%, Code 9.53%, Science 22.27%, QA 0%
- Purpose: Establish broad linguistic fluency with stability focus
Stage 1-2 (2T tokens):
- Global batch size: Fixed at 4096
- Learning rate: Cosine decay from \(3 \times 10^{-4}\) to \(3 \times 10^{-5}\)
- Data mixture: CC 55.42%, Code 11.66%, Science 32.92%, QA 0%
- Purpose: Strengthen reasoning by increasing code and science proportions
The transition from Stage 1-1 to Stage 1-2 is motivated by differential saturation patterns: general knowledge benchmarks plateau rapidly within the first 1T tokens, while code and science benchmarks sustain growth through 4T tokens.
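The Stage 1 learning-rate schedule described above can be sketched as two small functions. Several details here are assumptions rather than facts from the paper: the warmup is taken to be linear, the cosine decay is taken to span the whole of Stage 1-2, and the global batch size is assumed to count full 4096-token sequences when converting tokens to optimizer steps. All function names are hypothetical.

```python
import math

PEAK_LR = 3e-4       # Stage 1-1 constant learning rate
FINAL_LR = 3e-5      # Stage 1-2 end point
WARMUP_STEPS = 2000


def stage_1_1_lr(step: int) -> float:
    """Linear warmup (assumed shape) to the peak, then constant."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    return PEAK_LR


def stage_1_2_lr(step: int, total_steps: int) -> float:
    """Cosine decay from PEAK_LR to FINAL_LR over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


def steps_for_tokens(tokens: float, batch_size: int, seq_len: int = 4096) -> int:
    """Optimizer steps, assuming the batch size counts full-length sequences."""
    return math.ceil(tokens / (batch_size * seq_len))


stage_1_2_steps = steps_for_tokens(2e12, batch_size=4096)
for step in (0, stage_1_2_steps // 2, stage_1_2_steps):
    print(step, f"{stage_1_2_lr(step, stage_1_2_steps):.2e}")
```

If the batch-size assumption holds, each step at batch size 4096 consumes about 16.8M tokens, so the 2T-token Stage 1-2 would correspond to roughly 119k optimizer steps.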
Stage 2: Reasoning Capability Enhancement (2T tokens)
Stage 2-1 (1T tokens):
- Global batch size: 4096
- Learning rate: Constant at \(3 \times 10^{-5}\) after 2000-step warmup
- Data mixture: CC 10%, Code 30%, Science 30%, QA 30%
- Purpose: Balanced introduction of structured QA without domain collapse
Stage 2-2 (1T tokens):
- Global batch size: 4096
- Learning rate: Constant at \(3 \times 10^{-5}\)
- Data mixture: CC 18.84%, Code 2.61%, Science 8.55%, QA 70%
- Purpose: Aggressive QA intensification leveraging established foundation
The paper notes that Stage 2-2's 70% QA concentration would have caused collapse if applied directly, but succeeds because Stage 2-1 established a stable representational base.
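The four-stage curriculum can be written down as a small configuration and checked for internal consistency. The sketch below uses only the token budgets and mixture percentages listed above; the `Stage` class and helper names are hypothetical and do not reflect the paper's actual training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_t: float   # training tokens, in trillions
    mixture: dict     # domain -> percent of the stage's tokens


CURRICULUM = [
    Stage("1-1", 4.0, {"CC": 68.2, "Code": 9.53, "Science": 22.27, "QA": 0.0}),
    Stage("1-2", 2.0, {"CC": 55.42, "Code": 11.66, "Science": 32.92, "QA": 0.0}),
    Stage("2-1", 1.0, {"CC": 10.0, "Code": 30.0, "Science": 30.0, "QA": 30.0}),
    Stage("2-2", 1.0, {"CC": 18.84, "Code": 2.61, "Science": 8.55, "QA": 70.0}),
]


def validate(curriculum: list) -> None:
    """Every stage's mixture must sum to 100%."""
    for stage in curriculum:
        total = sum(stage.mixture.values())
        if abs(total - 100.0) > 1e-6:
            raise ValueError(f"Stage {stage.name} mixture sums to {total}")


def tokens_per_domain(curriculum: list) -> dict:
    """Total tokens (in trillions) drawn from each domain across all stages."""
    totals: dict = {}
    for stage in curriculum:
        for domain, pct in stage.mixture.items():
            totals[domain] = totals.get(domain, 0.0) + stage.tokens_t * pct / 100.0
    return totals


validate(CURRICULUM)
for domain, tokens in tokens_per_domain(CURRICULUM).items():
    print(f"{domain}: {tokens:.3f}T")
```

The per-domain totals sum to the 8T training budget. Comparing them with the pool sizes gives a rough sense of how often each pool would need to be revisited during training.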
Training Stability: Figure 5 shows smooth loss convergence and stable gradient norms throughout training, with no significant spikes or divergences requiring manual intervention.
Evaluation Protocol¶
The paper evaluates on 19 benchmarks across three capability domains using the lm-eval-harness framework:
General (9 benchmarks): MMLU, MMLU-Pro, AGIEval, HellaSwag, TriviaQA, RACE, WinoGrande, OpenBookQA, PIQA
Code (3 benchmarks): HumanEval, EvalPlus, MBPP
Science (7 benchmarks): GSM8K, GSM-Plus, MATH, GPQA-Main, SuperGPQA, MMLU-STEM, MMLU-Pro-STEM
Evaluation Methods:
- Perplexity-based (PPL): Applied to multiple-choice tasks (PIQA, MMLU, OpenBookQA, GPQA-Main, MMLU-STEM) where the model scores each candidate answer directly.
- Generative-based: Applied to reasoning-intensive tasks requiring chain-of-thought outputs (MATH, HumanEval, etc.).
The paper explicitly distinguishes these protocols because they probe different aspects of capability: PPL measures latent knowledge access (assigning higher probability to correct options), while generative evaluation requires surfacing and organizing knowledge into explicit answers.
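The difference between the two protocols can be illustrated with a toy multiple-choice scorer. This is a simplified sketch, not the lm-eval-harness implementation: `logprob` and `generate` are hypothetical callables standing in for a real model, the prompt format is invented, and the answer parser simply takes the last standalone option letter in the output.

```python
import re
from typing import Callable, Optional, Sequence


def ppl_choice(question: str, options: Sequence[str],
               logprob: Callable[[str, str], float]) -> int:
    """Pick the option whose continuation receives the highest log-probability."""
    scores = [logprob(question, option) for option in options]
    return max(range(len(options)), key=scores.__getitem__)


def generative_choice(question: str, options: Sequence[str],
                      generate: Callable[[str], str]) -> Optional[int]:
    """Generate free-form text and parse the last standalone option letter."""
    letters = "ABCDEFGHIJ"[: len(options)]
    listing = "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
    text = generate(f"{question}\n{listing}\nAnswer:")
    found = re.findall(rf"\b([{letters}])\b", text)
    return letters.index(found[-1]) if found else None


# Toy stand-ins: the "model" ranks the right option highest under PPL scoring
# but never produces a parseable answer when asked to generate.
def toy_logprob(question: str, option: str) -> float:
    return 0.0 if option == "Paris" else -5.0


def toy_generate(prompt: str) -> str:
    return "It is a large European city."


question = "What is the capital of France?"
options = ["Berlin", "Paris", "Rome", "Madrid"]
print(ppl_choice(question, options, toy_logprob))
print(generative_choice(question, options, toy_generate))
```

In this toy setup the same "model" is correct under PPL scoring and receives no credit under generative scoring, which mirrors the distinction between latent knowledge access and the ability to surface an explicit answer.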
Baselines (7 models):
- Scale-matched (3-4B): LLaMA-3.2-3B, Qwen-2.5-3B, Qwen-3-4B, Qwen-3.5-4B, Yulan-Mini-2.4B
- Larger baselines (7B): OLMo-2-7B, OLMo-3-7B
Exploration Findings¶
The paper presents 200+ controlled ablations organized into four thematic areas:
1. Data Processing Depth (Section 4.1)
L3 Model-Based Filtering (Table 7):
- Setup: 500B-token training run with code data, comparing L3 model-based filtering against L2 rule-based filtering
- Result: Modest overall improvement (+0.27 overall), with a notable gain on MBPP (+3.40)
- Conclusion: Model-based filtering provides incremental gains; more intensive processing is needed for breakthrough improvements
L4 Generative Refinement (Table 8):
- Setup: 500B-token training run with math data, comparing L4-refined math data against the baseline
- Result: MATH +7.00, GSM8K +1.37, Overall +0.56
- Conclusion: Structural purification disproportionately benefits complex multi-step reasoning; L4 processing makes it easier for models to extract fundamental patterns
L5 Synthetic QA (Figure 7):
- Setup: CodeQA and CC-QA evaluated at 500B tokens
- Result: Strong domain-specific steering, with CodeQA benefiting code benchmarks and CC-QA benefiting science reasoning
- Conclusion: L5 synthesis acts as a high-precision capability-steering tool rather than a universal performance booster
2. Training Dynamics (Section 4.2)
Domain Proportion Adjustment (Figure 8):
- Stage 1 trajectory showing general knowledge plateaus at ~1T tokens
- Code and science sustain growth through 4T tokens
- Stage 1-2 adjustment (increasing code/science proportions) successfully amplifies reasoning gains
- Stage 1-3 (further adjustment) yields marginal improvements: proportion adjustment encounters saturation boundaries
QA Introduction vs. Domain Adjustment (Figure 9):
- Controlled comparison from Stage 1-2 checkpoint
- Stage 1-3 (continued proportion adjustment) vs. Stage 2 (30% QA introduction)
- Result: QA introduction substantially outperforms proportion adjustment alone
- Code +20.34 points, Science +15.75 points with QA vs. marginal gains from adjustment
3. Data Mixture Design (Section 4.3)
Code-Science Balance (Figure 11a):
- Fixed 30% QA, varying code-science proportions
- C-30-S-30 (balanced) outperforms C-40-S-20 (code-heavy)
- Conclusion: Internal balance prevents over-specialization while maintaining aggressive reasoning concentration
QA Concentration Progression (Figure 11b, Table 9):
- Stage 2-1: Conservative 30% QA needed; higher ratios cause code degradation (imbalanced QA pool with ~80% science)
- Stage 2-2: 70% QA succeeds without collapse because Stage 2-1 established stable foundation
- Result: QA-70% achieves best performance (49.84 overall at 419B tokens)
4. Evaluation Validity (Section 4.4, Figure 12)
MMLU protocol comparison:
- PPL evaluation: OLMo-2-7B slightly outperforms Qwen-2.5-3B
- Generative evaluation: Ranking reverses in Qwen's favor (3.10% swing)
- Explanation: Generative evaluation is more sensitive to whether models have learned to operationalize knowledge through answer production; QA-heavy pretraining amplifies generative performance
Summary of Design Choices and Justifications¶
- Progressive batch size scaling in Stage 1-1: Ensures training stability when starting from random initialization.
- Domain proportion adjustment in Stage 1-2: Motivated by differential saturation (general knowledge plateaus early, reasoning sustains growth).
- Conservative QA introduction in Stage 2-1: Prevents domain collapse when transitioning to structured reasoning formats.
- Aggressive QA intensification in Stage 2-2: Leverages established foundation from Stage 2-1.
- Balanced code-science allocation: Prevents over-specialization while maintaining reasoning concentration.
- Cosine LR decay in late stages: Consolidates and stabilizes capabilities with gradual reduction.
- No QA question masking: Question quality and diversity matter more than masking strategy for pretraining.
4. Key Insights and Innovations¶
Innovation 1: Data Processing Depth as a Systematic Optimization Dimension¶
The paper's most fundamental contribution is establishing that hierarchical data processing—progressing from filtering (L3) to generative refinement (L4) to cognitive synthesis (L5)—provides systematic capability enhancement that can substitute for raw data scaling. This is empirically demonstrated through controlled ablations:
- L4 refinement on math data yields +7.00 on MATH while holding data volume constant
- L5 synthesis enables targeted capability steering through domain-aligned QA generation
- Processing depth provides a "principled alternative to naive scaling"
What makes this genuinely novel is the systematic framework for reasoning about processing operations. Prior work recognized data quality's importance but lacked a principled taxonomy for comparing operations across heterogeneous sources. The Darwin framework provides common vocabulary and ordering, enabling practitioners to answer questions like "when does investing in L4 refinement justify the cost versus collecting more L3 data?"
Innovation 2: Adaptive Training Dynamics Based on Capability Saturation Patterns¶
The paper demonstrates that different capability dimensions exhibit vastly different saturation timescales: general knowledge plateaus at ~1T tokens, while code and science sustain growth through 4T+ tokens. This differential pattern motivates adaptive interventions:
- Stage 1-2's domain proportion adjustment (reducing CC, increasing code/science) amplifies reasoning gains
- Stage 2's introduction of structured QA outperforms continued proportion optimization
Critically, the paper documents when domain adjustment fails: once standard pretraining corpora collectively approach saturation (Stage 1-3), reallocating proportions yields marginal gains. Sustaining growth requires data format shifts from raw text to structured reasoning scaffolds.
This establishes that "no single data mixture or format suffices across extended training: sustained capability development demands monitoring convergence patterns and adapting both domain proportions and data formats accordingly."
Innovation 3: Progressive Intensification Strategy for QA Data¶
The paper reveals a stage-dependent tolerance for reasoning data concentration:
- Stage 2-1: Conservative 30% QA prevents collapse; higher ratios cause code degradation due to imbalanced QA pool composition (~80% science QA, scarce code QA)
- Stage 2-2: Aggressive 70% QA succeeds without collapse because Stage 2-1 established a stable representational base
This finding is practically significant because it contradicts naive approaches that apply high reasoning data concentrations from the start. The "foundation-then-intensify" strategy enables aggressive capability enhancement without triggering catastrophic forgetting or domain collapse.
Innovation 4: Evaluation Protocol Awareness and Ranking Reversals¶
The paper provides concrete evidence that PPL-based and generative evaluations can produce opposite model rankings. On MMLU:
- PPL evaluation: OLMo-2-7B slightly outperforms Qwen-2.5-3B
- Generative evaluation: Ranking reverses with 3.10% swing
This matters because models with heavy QA exposure during pretraining gain disproportionately under generative evaluation—not because they know more, but because they've learned to operationalize knowledge through answer production. The paper explicitly warns: "when comparing base models, substantial discrepancies across protocols should not be dismissed as noise: they often indicate meaningful differences in pretraining data composition."
This contribution addresses a methodological gap in how pretraining progress is measured and compared.
Innovation 5: Fully-Open Paradigm with Negative Results Documentation¶
While not a technical innovation, the paper's commitment to releasing complete training trajectories, intermediate checkpoints at 5k-step intervals, and documentation of failed experiments represents a significant methodological contribution. The transparency comparison (Table 1) shows daVinci-LLM exceeds all prior efforts in releasing: model weights, training code, training logs, intermediate checkpoints, data composition, processing pipeline, full training data, processing methodology (L0-9), pretraining ablations, mixture rationale, decision transparency, and negative results.
This enables the community to build upon documented boundary conditions rather than rediscovering failed approaches.
5. Experimental Analysis¶
Evaluation Methodology¶
Dataset Composition: The paper uses 19 benchmarks organized into three capability domains:
- General (9 benchmarks): MMLU, MMLU-Pro, AGIEval, HellaSwag, TriviaQA, RACE, WinoGrande, OpenBookQA, PIQA
- Code (3 benchmarks): HumanEval, EvalPlus, MBPP
- Science (7 benchmarks): GSM8K, GSM-Plus, MATH, GPQA-Main, SuperGPQA, MMLU-STEM, MMLU-Pro-STEM
Evaluation Protocols:
- Perplexity-based: Multiple-choice tasks where model scores candidate answers directly
- Generative-based: Free-form response generation requiring reasoning chains
Baselines:
- Scale-matched (3-4B): LLaMA-3.2-3B, Qwen-2.5-3B, Qwen-3-4B, Qwen-3.5-4B, Yulan-Mini-2.4B
- Larger (7B): OLMo-2-7B, OLMo-3-7B
Inference Configuration: All evaluations use greedy decoding for consistency across experiments, since the evaluated models are base checkpoints.
Main Quantitative Results¶
Final Performance (Table 6):
daVinci-LLM-3B achieves 51.72 overall average, matching OLMo-3-7B's 51.65 despite having less than half the parameters.
Domain-specific performance:
| Domain | daVinci-3B | OLMo-3-7B | Best 3-4B Baseline |
|---|---|---|---|
| General | 52.96 | 55.13 | 59.13 (Qwen-3.5-4B) |
| Code | 55.99 | 54.42 | 64.39 (Qwen-3-4B) |
| Science | 48.30 | 45.98 | 56.74 (Qwen-3-4B) |
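The overall scores can be reproduced from the domain averages in the table, assuming the overall figure is an unweighted mean over all 19 benchmarks (9 General, 3 Code, 7 Science). The aggregation rule is an assumption: it is not stated in this summary, but it matches the reported numbers. The function and variable names are hypothetical.

```python
# Number of benchmarks per domain, from the evaluation protocol.
BENCHMARKS_PER_DOMAIN = {"General": 9, "Code": 3, "Science": 7}


def overall_score(domain_averages: dict) -> float:
    """Mean over all benchmarks, reconstructed from per-domain averages."""
    n_total = sum(BENCHMARKS_PER_DOMAIN.values())
    weighted = sum(domain_averages[d] * n for d, n in BENCHMARKS_PER_DOMAIN.items())
    return round(weighted / n_total, 2)


davinci_3b = {"General": 52.96, "Code": 55.99, "Science": 48.30}
olmo_3_7b = {"General": 55.13, "Code": 54.42, "Science": 45.98}

print(overall_score(davinci_3b))  # 51.72
print(overall_score(olmo_3_7b))   # 51.65
```

Because General contributes 9 of the 19 benchmarks, OLMo-3-7B's lead there nearly offsets daVinci's advantages on Code and Science, which is why the two overall scores end up so close.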
Notable individual benchmark results:
- MATH: daVinci achieves 62.80, exceeding OLMo-3-7B (39.60) by over 23 points
- HumanEval: daVinci achieves 61.64, exceeding OLMo-3-7B (59.05)
- GSM8K: daVinci achieves 72.86, competitive with OLMo-3-7B (76.80)
- MMLU: daVinci achieves 62.53, below OLMo-3-7B (66.53) but above parameter-matched baselines
The paper emphasizes that reasoning capability enhancement is achieved while maintaining general knowledge competence—no catastrophic forgetting during specialized training.
Training Trajectory Results¶
Stage 1 Trajectory (Figure 4):
- Stage 1-1 (4T tokens): Overall score reaches ~39.58
- Stage 1-2 (2T tokens): +4.90 improvement to ~44.48
- General knowledge plateaus early; code and science sustain growth
- The data proportion adjustment in Stage 1-2 successfully amplifies reasoning gains
Stage 2 Trajectory (Figure 6):
- Stage 2-1 (1T tokens): +8.97 improvement from 39.58 to 48.55
- Stage 2-2 (1T tokens): +2.43 improvement to 51.72
- Stage 2 efficiency stems from increased processing depth (L4/L5 data)
Key trajectory insight: MATH performance improves drastically from 22.0 (Stage 1 endpoint) to 62.8 (final), demonstrating the power of structured QA introduction.
Ablation Study Results¶
L3 vs. L2 Filtering (Table 7):
| Metric | L2 Rule-Based | L3 Model-Based | Delta |
|---|---|---|---|
| HumanEval | 54.43 | 54.70 | +0.27 |
| EvalPlus | 48.58 | 48.39 | -0.19 |
| MBPP | 42.40 | 45.80 | +3.40 |
| Overall | 46.66 | 46.93 | +0.27 |
Conclusion: L3 provides modest gains; filtering alone insufficient for breakthrough improvements.
L4 Generative Refinement (Table 8):
| Metric | Baseline | L4 Refined | Delta |
|---|---|---|---|
| GSM8K | 64.06 | 65.43 | +1.37 |
| GSM-Plus | 40.58 | 42.38 | +1.80 |
| MATH | 38.00 | 45.00 | +7.00 |
| Overall | 47.27 | 47.83 | +0.56 |
Conclusion: Structural purification benefits complex multi-step reasoning disproportionately.
QA Ratio Effects (Table 9, Stage 2-2):
| Config | Training Tokens | General | Code | Science | Overall |
|---|---|---|---|---|---|
| QA-30% | 419B | 51.80 | 47.16 | 40.59 | 46.94 |
| QA-50% | 419B | 51.83 | 49.89 | 43.42 | 48.43 |
| QA-70% | 419B | 52.16 | 52.40 | 45.77 | 49.84 |
Conclusion: Progressive QA intensification succeeds in Stage 2-2; 70% achieves best performance.
Code-Science Balance (Figure 11a): C-30-S-30 (balanced 30% each) outperforms C-40-S-20 (unbalanced) in overall performance, demonstrating that internal balance prevents over-specialization.
Assessment: Do the Experiments Support the Claims?¶
Claim 1: Processing depth systematically enhances capabilities. Strongly supported. L4 refinement yields +7.00 on MATH (Table 8), and L5 synthesis enables targeted steering (Figure 7). The hierarchical progression from L3→L4→L5 shows increasing capability gains with diminishing generality.
Claim 2: Different domains exhibit distinct saturation dynamics. Strongly supported. Figure 8 clearly shows general knowledge plateauing at ~1T tokens while code and science sustain growth through 4T+. This differential pattern motivates adaptive interventions.
Claim 3: QA introduction outperforms continued domain adjustment. Supported. Figure 9 shows QA introduction (+20.34 code, +15.75 science) substantially outperforms Stage 1-3's continued proportion adjustment (marginal gains).
Claim 4: Progressive intensification prevents collapse. Supported. Figure 11b shows code degradation at high QA ratios in Stage 2-1, but Table 9 shows successful 70% QA in Stage 2-2 without collapse.
Claim 5: Evaluation protocol choice affects conclusions. Strongly supported. Figure 12 demonstrates ranking reversal between PPL and generative evaluation on MMLU, with models heavy in QA pretraining showing amplified generative performance.
Potential limitations:
- Single model scale: All experiments use 3B parameter models; whether findings generalize to larger scales is not tested.
- QA pool imbalance: The Stage 2-1 QA pool is ~80% science, so the code degradation observed at high QA ratios may stem from pool composition rather than from an inherent limit on QA concentration.
- No comparison with other data selection methods: The paper doesn't compare Darwin framework against alternative data quality taxonomies or selection approaches.
- Limited L6-L9 exploration: Higher-order synthesis levels remain theoretical; empirical validation focuses on L3-L5.
Failure Cases and Negative Results¶
The paper explicitly documents failed approaches:
Stage 1-3 proportion adjustment: Further increasing code/science proportions after Stage 1-2 yielded only marginal improvements, demonstrating that domain adjustment encounters saturation boundaries when standard corpora collectively approach saturation.
High QA ratios in Stage 2-1: Applying 70% QA directly would have caused collapse; the conservative 30% approach was necessary to establish stability before intensification.
ReST\(^{EM}\) revision model (mentioned but not detailed in main text): The paper notes this approach was attempted but found ineffective for their setting.
The documentation of negative results enables practitioners to avoid redundant failed experiments and understand boundary conditions.
6. Limitations and Trade-offs¶
Assumption: Findings at 3B Scale Generalize to Larger Models¶
The paper conducts all experiments with a single model scale (3B parameters). While the results convincingly demonstrate that principled pretraining enables a 3B model to match a 7B baseline (OLMo-3), the question of whether these findings transfer to larger scales remains unexplored. The authors acknowledge that their goal was to "demonstrate that principled multi-stage pretraining can enable a 3B model to approach or exceed the performance of larger baselines" (Section 3.2), but several key relationships documented in the paper may not hold at scale:
- Saturation dynamics: General knowledge plateaus at ~1T tokens for a 3B model, but larger models have higher capacity and may sustain learning across more tokens.
- Optimal QA ratios: The 30% → 70% progressive intensification strategy was derived empirically for 3B models; whether these ratios apply to 70B+ models is unknown.
- Processing depth effectiveness: L4 refinement yields +7.00 on MATH for 3B models, but the cost-benefit calculation shifts when training larger models where per-token compute costs are substantially higher.
The paper does not provide scaling law analysis that would enable practitioners to extrapolate findings to different model sizes. This limits the direct applicability of the training recipe for organizations working at different scales.
Imbalanced QA Pool Composition as Confounding Factor¶
Section 4.3.2 reports that the Stage 2-1 QA pool is heavily weighted toward science (~80%) while code-related QA is relatively scarce (~26B tokens). This imbalance creates a confounding factor in the QA concentration experiments:
"At higher total QA concentrations, the insufficient diversity of code samples may trigger premature over-fitting or representation collapse, whereas the abundant science data sustain growth."
This means the code degradation observed at high QA ratios (Figure 11b) may not reflect a fundamental limitation of QA concentration, but rather an artifact of the specific QA pool composition. A differently balanced QA corpus might tolerate higher concentrations without collapse. The paper acknowledges this limitation but does not conduct controlled experiments with balanced QA pools to isolate the effect. Practitioners with different QA data compositions may observe different optimal ratios.
Computational Cost of Higher Processing Depths¶
The paper documents impressive gains from L4 generative refinement and L5 cognitive completion, but provides limited discussion of the computational cost of these processing stages. Processing 176B tokens of MegaMath through Qwen3-235B-A22B-Instruct (L4 refinement) or applying L5 cognitive completion to scientific papers requires substantial compute that may not be feasible for all organizations.
The paper frames processing depth as a "principled alternative to naive scaling" (Section 4.1), but the cost comparison is not systematically analyzed. Key questions remain unanswered:
- What is the FLOPs comparison between L4 refinement + training versus simply training on more L3 data?
- Does the cost-benefit ratio favor processing depth for all domains, or only reasoning-intensive domains like math and science?
- How do costs scale as processing depth increases (L6-L9)?
Without this analysis, practitioners cannot make informed decisions about when processing depth investments are justified versus volume scaling.
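The first question can at least be framed with a back-of-envelope estimate. The sketch below is not from the paper: it assumes the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per active parameter per inference token, reads the "A22B" suffix of the refiner as roughly 22B active parameters, and assumes the refiner emits about as many tokens as it reads. All function names are hypothetical.

```python
# Back-of-envelope FLOPs comparison; every constant below is an assumption
# unless marked as coming from the paper.

TRAIN_FLOPS_PER_PARAM_TOKEN = 6  # assumed: forward + backward pass approximation
INFER_FLOPS_PER_PARAM_TOKEN = 2  # assumed: forward pass only, per active parameter


def training_flops(params: float, tokens: float) -> float:
    """Approximate FLOPs to train a dense model of `params` on `tokens`."""
    return TRAIN_FLOPS_PER_PARAM_TOKEN * params * tokens


def refinement_flops(active_params: float, input_tokens: float,
                     output_ratio: float = 1.0) -> float:
    """Approximate FLOPs for a refiner to read `input_tokens` and
    generate `output_ratio * input_tokens` tokens."""
    processed = input_tokens * (1.0 + output_ratio)
    return INFER_FLOPS_PER_PARAM_TOKEN * active_params * processed


def equivalent_training_tokens(flops: float, params: float) -> float:
    """How many training tokens the same FLOPs would buy for `params`."""
    return flops / (TRAIN_FLOPS_PER_PARAM_TOKEN * params)


student_params = 3e9    # 3B model (from the paper)
corpus_tokens = 176e9   # MegaMath tokens sent through L4 refinement (from the paper)
refiner_active = 22e9   # assumed active parameter count of the refiner

cost = refinement_flops(refiner_active, corpus_tokens)
extra = equivalent_training_tokens(cost, student_params)
print(f"refinement: ~{cost:.2e} FLOPs")
print(f"same FLOPs would train the 3B model on ~{extra / 1e9:.0f}B more tokens")
```

Under these assumptions a single refinement pass is far from free relative to a 3B training run, which is why a measured cost comparison would be needed before concluding that depth beats volume. Changing `output_ratio` or the active parameter count shifts the estimate proportionally.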
Theoretical Levels L6-L9 Remain Unexplored¶
The Darwin framework extends to L9 (World Synthesis), but empirical validation in the paper focuses exclusively on L3-L5 operations. The higher-order synthesis levels (L6: Contextual Completion, L7: Environment Synthesis, L8: Ecosystem Synthesis, L9: World Synthesis) are described conceptually but never implemented or evaluated.
While this is understandable given the paper's scope, it means the framework's upper tiers remain speculative. The paper states these levels represent "increasingly ambitious forms of data generation" but provides no evidence that L6+ operations yield further capability gains or that the hierarchical logic continues to hold. The framework's completeness as a taxonomy cannot be verified without empirical exploration of upper levels.
Limited Benchmark Coverage for Certain Capabilities¶
The 19-benchmark evaluation suite focuses heavily on reasoning-intensive tasks (math, code, science) with relatively limited coverage of other capability dimensions:
- Long-context understanding: Not evaluated; the model's context window is 4096 tokens.
- Multilingual capabilities: Not evaluated; the paper notes English-only filtering in data processing.
- Knowledge-intensive QA beyond MMLU: TriviaQA is included, but broader factual knowledge evaluation is limited.
- Multimodal understanding: Not applicable to this text-only model, but relevant for understanding pretraining's role in multimodal systems.
The paper's claim that "reasoning capability enhancement is achieved while maintaining general knowledge competence" (Section 5.2) is supported by MMLU and related benchmarks, but a more comprehensive evaluation would strengthen this claim.
Evaluation Protocol Discrepancies Lack Prescriptive Guidance¶
Section 4.4 documents that PPL-based and generative evaluations can produce opposite model rankings, demonstrating that "evaluation protocol choice affects conclusions." However, the paper stops short of providing prescriptive guidance:
- When should practitioners prefer one protocol over the other?
- How should evaluation results be interpreted when protocols disagree?
- What evaluation protocols best predict downstream fine-tuning performance?
The paper acknowledges that "framing one protocol as inherently superior would therefore be misleading" and that "the appropriate choice depends on the deployment scenario" (Section 4.4), but practitioners evaluating their own models need clearer decision frameworks for choosing evaluation protocols aligned with their use cases.
Reproducibility Constraints for External Practitioners¶
While the paper provides exceptional transparency compared to prior work, several practical constraints limit reproducibility:
- Model access: GPT-OSS-120B and Qwen3-235B-A22B-Instruct are used for L4/L5 processing but may not be accessible to all practitioners.
- Data sources: Self-crawled GitHub repositories and custom PDF collections (Darwin-Science) require reproduction of data collection pipelines.
- Compute requirements: Training 8T tokens requires substantial compute not available to most academic groups.
The paper releases "complete training specifications" (Table 1), but the resource requirements may limit reproduction to organizations with similar computational capacity.
7. Implications and Future Directions¶
Establishing Data Processing Depth as a First-Class Optimization Dimension¶
This paper's most significant conceptual contribution is establishing data processing depth as a systematic optimization dimension alongside data volume and model scale. Prior to this work, data quality was recognized as important but lacked a principled framework for reasoning about processing operations. The Darwin taxonomy provides:
- Common vocabulary: L0-L9 levels enable precise communication about processing depth across heterogeneous data sources.
- Ordering principle: The evolutionary logic from selection → transformation → synthesis provides guidance for where additional processing investment is most valuable.
- Empirical validation: Controlled ablations demonstrate that processing depth yields systematic capability gains, with L4 refinement providing +7.00 on MATH.
This reframes how practitioners should think about compute allocation: rather than simply collecting more data, investing in deeper processing of existing data can substitute for multi-fold volume increases. For organizations with fixed compute budgets, this provides an alternative scaling pathway.
The framework also enables systematic comparison across data sources—practitioners can now ask "is this L3 dataset worth collecting, or should I invest in L4 processing of existing data?" The paper's evidence that L4 refinement disproportionately benefits complex reasoning (MATH +7.00 vs. GSM8K +1.37) suggests the answer depends on target capabilities.
Adaptive Training Based on Capability Saturation Monitoring¶
The paper's documentation of differential saturation dynamics—general knowledge plateauing at ~1T tokens while code and science sustain growth—establishes a methodological template for adaptive training. Rather than applying fixed mixtures across predetermined token budgets, practitioners should:
- Track capability-specific convergence: Monitor benchmark performance at regular intervals to identify when specific capabilities approach saturation.
- Intervene adaptively: When general knowledge plateaus while reasoning sustains growth, reallocate compute from saturated domains to those that are still improving.
- Transition data formats: When proportion adjustment encounters diminishing returns (Stage 1-3), shift to new data formats (structured QA) rather than continuing with exhausted formats.
This adaptive approach requires additional evaluation compute (benchmarking at 5k-step intervals), but the paper's results suggest efficiency gains justify the cost. The Stage 1-2 adjustment successfully amplified reasoning gains, while Stage 2's QA introduction outperformed continued proportion optimization by large margins (+20.34 code, +15.75 science).
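The monitoring loop described above can be sketched as follows. This is a minimal illustration, not the paper's procedure: the plateau rule (gain over the last few evaluations below a threshold), the `window`, `min_gain`, and `shift` parameters, and all scores in the example are hypothetical.

```python
from typing import Dict, List


def is_saturated(scores: List[float], window: int = 3,
                 min_gain: float = 0.5) -> bool:
    """True when the gain over the last `window` evaluations is below `min_gain`."""
    if len(scores) <= window:
        return False  # not enough history to judge
    return scores[-1] - scores[-1 - window] < min_gain


def reallocate(mixture: Dict[str, float], history: Dict[str, List[float]],
               shift: float = 0.5) -> Dict[str, float]:
    """Move a fraction `shift` of each saturated domain's weight to the
    domains that are still improving, in proportion to their current weight."""
    saturated = [d for d in mixture if is_saturated(history[d])]
    active = [d for d in mixture if d not in saturated]
    if not saturated or not active:
        return dict(mixture)  # nothing to move, or nowhere to move it
    freed = sum(mixture[d] * shift for d in saturated)
    active_total = sum(mixture[d] for d in active)
    updated = {}
    for d in mixture:
        if d in saturated:
            updated[d] = mixture[d] * (1.0 - shift)
        else:
            updated[d] = mixture[d] + freed * mixture[d] / active_total
    return updated


# Hypothetical benchmark scores recorded at successive evaluation checkpoints.
history = {
    "general": [50.0, 50.2, 50.3, 50.3, 50.4],
    "code": [20.0, 22.5, 25.0, 27.0, 29.5],
    "science": [30.0, 32.0, 34.5, 36.0, 38.0],
}
mixture = {"general": 0.6, "code": 0.2, "science": 0.2}
new_mixture = reallocate(mixture, history)
print(new_mixture)
```

A real setup would likely smooth noisy benchmark curves before applying a threshold, since a single flat evaluation interval is weak evidence of saturation.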
Foundation-Then-Intensify Strategy for Reasoning Data¶
The progressive intensification strategy—conservative QA introduction (30%) in Stage 2-1 followed by aggressive intensification (70%) in Stage 2-2—provides a practical recipe for reasoning enhancement without capability collapse. This finding directly informs pretraining design:
- Do not apply high reasoning data concentrations from the start: The model lacks the representational foundation to absorb intensive supervision without degradation.
- Build foundation first, then intensify: Once a balanced capability base is established, the model can tolerate much higher reasoning data ratios.
- Monitor for domain-specific degradation: Code performance degraded at high QA ratios in Stage 2-1, plausibly because of the imbalanced QA pool composition; ensure diverse reasoning coverage before intensification.
This strategy contradicts naive approaches that maximize reasoning data from initialization. The paper's empirical evidence provides a safer, more stable pathway to reasoning enhancement.
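A staged QA schedule of this kind could be expressed as a small lookup. The sketch below is illustrative only: the 30% and 70% ratios come from the text above, while the step boundaries, the 0% ratio for Stage 1, and the names `Stage`, `qa_ratio_at`, and `batch_split` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    end_step: int    # exclusive upper bound; hypothetical values
    qa_ratio: float  # fraction of each batch drawn from the QA pool


SCHEDULE: List[Stage] = [
    Stage("stage-1", 1000, 0.00),    # assumed: no structured QA before Stage 2
    Stage("stage-2-1", 1500, 0.30),  # conservative introduction
    Stage("stage-2-2", 2000, 0.70),  # intensification
]


def qa_ratio_at(step: int, schedule: List[Stage] = SCHEDULE) -> float:
    """Return the QA ratio of the stage that contains `step`."""
    for stage in schedule:
        if step < stage.end_step:
            return stage.qa_ratio
    return schedule[-1].qa_ratio  # past the last boundary: keep the final ratio


def batch_split(step: int, batch_size: int) -> Tuple[int, int]:
    """Number of (QA, non-QA) sequences in a batch at `step`."""
    qa = round(batch_size * qa_ratio_at(step))
    return qa, batch_size - qa


for step in (500, 1200, 1800):
    print(step, qa_ratio_at(step), batch_split(step, 100))
```

Keeping the schedule as data rather than hard-coding ratios in the training loop makes it straightforward to ablate other boundaries or ratios.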
Evaluation Protocol Selection Implications¶
The ranking reversal between PPL and generative evaluation on MMLU (Figure 12) has immediate implications for how pretraining progress is measured:
- Match evaluation to deployment: Applications requiring direct answer generation (chatbots, QA systems) should prioritize generative evaluation; applications using models as scoring/ranking functions should prioritize PPL evaluation.
- Report both protocols: Given that protocols can produce opposite rankings, complete evaluation should include both to avoid protocol-induced artifacts.
- QA pretraining amplifies generative performance: Models with heavy QA exposure gain disproportionately under generative evaluation; this should inform model selection for generative deployment scenarios.
This finding also raises questions about benchmark validity. If MMLU rankings depend on evaluation protocol, conclusions about relative model quality require explicit protocol specification. The paper's documentation of this discrepancy is itself a contribution to evaluation methodology.
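When both protocols are reported, disagreements can be surfaced mechanically. The following sketch lists model pairs that the two protocols order in opposite directions; it assumes both score sets are accuracies where higher is better, and the model names and numbers are hypothetical, not results from the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def ranking_reversals(ppl_scores: Dict[str, float],
                      gen_scores: Dict[str, float]) -> List[Tuple[str, str]]:
    """Return model pairs ranked in opposite order by the two protocols."""
    shared = sorted(set(ppl_scores) & set(gen_scores))
    reversed_pairs = []
    for a, b in combinations(shared, 2):
        ppl_diff = ppl_scores[a] - ppl_scores[b]
        gen_diff = gen_scores[a] - gen_scores[b]
        # Opposite signs mean the protocols disagree on which model is better.
        if ppl_diff * gen_diff < 0:
            reversed_pairs.append((a, b))
    return reversed_pairs


# Hypothetical accuracies for three checkpoints under each protocol.
ppl_eval = {"model_a": 60.0, "model_b": 58.0, "model_c": 55.0}
gen_eval = {"model_a": 52.0, "model_b": 57.0, "model_c": 50.0}

reversals = ranking_reversals(ppl_eval, gen_eval)
print(reversals)
```

Ties are ignored by the strict inequality, so only genuine ordering conflicts are reported.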
Practical Applications and Downstream Use Cases¶
Base model for reasoning-focused fine-tuning: daVinci-LLM's reasoning performance (MATH 62.80, exceeding 7B OLMo-3 by 23 points) makes it a natural base model for downstream reasoning tasks. Organizations fine-tuning for mathematical or scientific applications may benefit from this foundation.
Data processing pipeline template: The Darwin framework and associated prompts (Appendix C) provide reusable infrastructure for data curation. Organizations can apply L4 refinement prompts to their own scientific corpora or use the domain-specific QA extraction prompts for QA generation.
Training recipe for resource-constrained settings: The finding that principled pretraining enables 3B models to match 7B performance has implications for deployment in resource-constrained settings. Smaller models with optimized pretraining may substitute for larger models in compute-limited deployments.
Follow-Up Research Directions¶
Scaling the Darwin framework: The paper operates at 3B scale; extending to larger models would test whether processing depth relationships hold. Key questions: Do optimal QA ratios scale with model size? Does processing depth effectiveness increase or decrease for larger models?
Balanced QA pool experiments: Controlled experiments with balanced QA compositions would isolate the effect of QA concentration from QA composition. This would clarify whether the Stage 2-1 code degradation is a fundamental limitation or an artifact.
L6-L9 synthesis exploration: Empirical validation of higher-order synthesis levels would complete the framework. Environment Synthesis (L7) for code validation or Ecosystem Synthesis (L8) for multi-agent reasoning could yield novel data sources.
Cross-domain transfer analysis: The paper shows L5 synthesis exhibits "strong source-target alignment" (CodeQA → code, CC-QA → science). Understanding cross-domain transfer—when does code QA benefit science reasoning?—would enable more strategic QA generation.
Evaluation protocol prediction: Systematic analysis of when PPL vs. generative evaluations predict downstream performance would provide prescriptive guidance for practitioners.
When to Prefer This Approach¶
Prefer the daVinci approach (processing depth + adaptive training) when:
- Training models for reasoning-intensive applications (math, code, science)
- Compute budget favors smaller models with optimized training over larger models
- Target deployment requires generative capabilities (answer production)
- Long training runs allow stage-based curriculum adaptation
Prefer simpler approaches when:
- Training for general knowledge tasks where PPL evaluation suffices
- Compute constraints limit processing depth investment
- Single-stage training without adaptation monitoring is required
- Target capabilities are not reasoning-intensive (where L3 filtering may suffice)
The paper's contribution is not that processing depth always dominates volume scaling, but that it provides a systematic optimization dimension that should be considered alongside model size and data quantity. The framework enables informed decisions rather than defaulting to naive scaling.