THINKING AUGMENTED PRE-TRAINING
ArXiv: 2509.20186
🎯 Pitch
This paper proposes Thinking-augmented Pre-Training (TPT), a simple method that augments language model training data by appending automatically generated step-by-step reasoning ('thinking trajectories') to each text sample. By making the implicit reasoning behind hard-to-predict tokens explicit, TPT improves data efficiency and reasoning performance: models reach a given benchmark score with roughly 3× fewer training tokens, which matters as high-quality pre-training data becomes increasingly scarce.
1. Executive Summary
- The paper introduces Thinking-augmented Pre-Training (TPT), a simple data-engineering method that turns ordinary text into “text-then-explanation” training samples by appending automatically generated reasoning (“thinking trajectories”) to each document, then trains with standard next-token prediction on the concatenated sequence.
- Across pre-training from scratch and continual “mid-training,” TPT yields large gains on reasoning-heavy benchmarks and sharply improves data efficiency (≈3× fewer training tokens to reach a given score; see Figure 1a and Figure 2). With only 100B training tokens, an 8B model approaches the performance of models trained on 15T tokens (Table 1).
2. Context and Motivation
- Problem addressed
- LLMs need immense amounts of high-quality text, but the web’s “good” data is finite and increasingly exhausted. Merely scaling compute and tokens is not enough when many valuable tokens are hard to learn because they compress the result of multi-step reasoning into a single target token (Section 1; Figure 1b).
- Why this matters
- Practically: future model scaling is constrained by data scarcity. Making each document more “learnable” increases the value extracted per token.
- Conceptually: next-token prediction struggles when the next token implicitly encodes long chains of reasoning (e.g., answering “890” without the intermediate math; Figure 1b).
- Prior approaches and gaps
- Data selection/prioritization keeps only “learnable” and “worth learning” tokens (e.g., Lin et al., 2024; Mindermann et al., 2022), but still asks the model to learn a difficult token in one step.
- Test-time prompting (e.g., chain-of-thought) or RL-style “reasoning” training (OpenAI o1; DeepSeek-R1) increases inference compute, not training learnability.
- Reinforcement Pre-Training (RPT) improves pre-training but is compute-heavy due to online rollouts and token-level control (Section 1).
- Reasoning CPT and BoLT generate “latent thoughts,” but were evaluated at much smaller scales (≤8B tokens) or in narrow domains (Section 1).
- Positioning of this work
- TPT is a training-time, data-centric method that:
- Requires no human labels and no online RL.
- Works at document level for any text domain.
- Scales to 100B training tokens and multiple model families (Sections 2–3).
- Naturally allocates more training compute to harder content via longer generated thinking (Section 4).
3. Technical Approach
- Core idea in plain terms
- For each document d in the pre-training corpus, automatically generate an expert-like “thinking trajectory” t that explains or reasons about d. Concatenate them into one sequence x = [d; t]. Train the model to predict every next token in x (Section 2).
- What is a “thinking trajectory”?
- A model-generated, step-by-step analysis of the document that focuses on “complex and informative aspects” (prompt in Section 2). It is similar to chain-of-thought but is produced during data preparation, not at inference. Example shown in Figure 1b and Appendix Table 8.
- Generation prompt and controls
- Prompt (Section 2): “Simulate an expert’s in-depth thought process… Use Feynman technique…”; for ablations, a “random focus point” variant is in Appendix A.3.
- Practical settings (Appendix A.1): truncate each document to ≤2k tokens; generate up to 8k thinking tokens with temperature 0.6 and top-p 0.9; stop at </think> to avoid redundant summaries.
- Thinking models: Qwen3-8B for from-scratch pre-training; DeepSeek-R1-Distill-Qwen-7B for mid-training studies (Appendix A.1). Surprisingly, a 1.5B thinking generator sometimes works even better (Table 4).
- Training objective
- Standard next-token prediction over the concatenated sequence:
- Equation (1): minimize $-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})$ over all $N$ tokens of $x = [d; t]$.
- No special losses, no RL, no extra supervision; this makes it easy to scale (Section 2).
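Because x = [d; t], the next-token loss can be written as a document part plus a thinking part. The split below is only a rewriting of the same objective, not an extra loss from the paper; the symbols |d| and |t| (token counts of the document and the trajectory) are notation introduced here for illustration.

```latex
\mathcal{L}(x)
  = -\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})
  = -\frac{1}{N}\Big[\,\sum_{i=1}^{|d|}\log p(d_i \mid d_{<i})
      \;+\; \sum_{j=1}^{|t|}\log p(t_j \mid d,\, t_{<j})\Big],
  \qquad N = |d| + |t|.
```

Read this way, every thinking token is predicted with the full document in context, and a longer trajectory t gives that document more terms in the sum.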
- Why this design?
- Document-level augmentation is scalable and agnostic to data format.
- By breaking a “hard next token” into many explanatory steps, the model receives a path to generalize beyond memorizing answers (Figure 1b and Section 2 “Dynamic Allocation of Training Compute”).
- Thinking trajectories are naturally longer for challenging domains, effectively up-sampling them with more training tokens (Section 4; Figure 4).
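The document-level augmentation described in this section can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer, the newline separator between d and t, the `AugmentedSample` type, and `fake_generator` (a stand-in for a call to a thinking model) are all assumptions made here. Only the 2k/8k caps and the `</think>` stop marker come from the settings reported in Appendix A.1.

```python
from dataclasses import dataclass
from typing import Callable

# Caps and stop marker as reported for Appendix A.1 of the paper.
MAX_DOC_TOKENS = 2_000
MAX_THINKING_TOKENS = 8_000
STOP_MARKER = "</think>"


@dataclass
class AugmentedSample:
    document: str  # d, truncated
    thinking: str  # t, cut at the stop marker and capped
    text: str      # x = [d; t]


def truncate(text: str, max_tokens: int) -> str:
    """Keep the first max_tokens whitespace-separated tokens.

    Toy tokenizer (assumption): a real pipeline would use the model's
    tokenizer. Note that this also collapses runs of whitespace.
    """
    return " ".join(text.split()[:max_tokens])


def build_sample(document: str, generate: Callable[[str], str]) -> AugmentedSample:
    """Build x = [d; t] from a raw document and a thinking generator."""
    d = truncate(document, MAX_DOC_TOKENS)
    raw = generate(d)
    # Drop the stop marker and anything the generator wrote after it.
    t = raw.split(STOP_MARKER, 1)[0].strip()
    t = truncate(t, MAX_THINKING_TOKENS)
    # Newline separator is an assumption; the paper only specifies concatenation.
    return AugmentedSample(document=d, thinking=t, text=d + "\n" + t)


def fake_generator(doc: str) -> str:
    """Hypothetical stand-in for sampling from a thinking model."""
    return "Step 1: restate the claim. Step 2: check it. </think> Summary: ..."


sample = build_sample("The remainder of the division is 890.", fake_generator)
print(sample.text)
```

The resulting `text` field is what a standard next-token-prediction trainer would consume; nothing in the training loop needs to know which part is document and which part is thinking.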
- Training setups used to test generality (Section 3; Appendix A.1, A.2):
- From-scratch pre-training (8B models; 100B tokens; Figure 2; Table 1).
- Constrained-data pre-training (limit raw documents to 10B tokens; 40B training budget; Figure 3; Table 7).
- Mid-training (continual pre-training) of existing checkpoints: Qwen2.5-1.5B/7B and LLaMA‑3.2‑3B on 100B augmented tokens, then supervised fine-tuning (SFT) on the public 350k-sample “Mixture-of-Thoughts” dataset distilled from DeepSeek-R1 (Section 3.3; Table 3; Appendix A.1).
- Implementation: MI300 GPUs; 8B pre-training to 100B tokens takes about a week; thinking-data generation takes ≈20k A100 GPU hours (Appendix A.1).
Analogy: If the original data gives only the final answer “890,” the model must “jump” to it in one step. TPT teaches the model the staircase (polynomial division, the Remainder Theorem, divisibility) so that predicting the final token becomes a sequence of small, learnable steps (Figure 1b; Appendix Table 8).
4. Key Insights and Innovations
- Thinking-augmented data as a universal, scalable pre-training format
- Novelty: Append automatic, expert-like reasoning to any document, then train with the usual language-model objective (Section 2). No new loss, no task structure, no labels.
- Why it matters: It turns hard-to-learn targets into decomposed sequences, improving generalization beyond memorization (Figure 1b).
- Dynamic training compute allocation emerges automatically
- Observation: “High-value” documents (math, physics; advanced reasoning) produce longer thinking trajectories, thus receiving more training tokens and compute (Section 4; Figure 4 shows ≈50% longer thoughts for “Advanced Reasoning” vs “No Reasoning”).
- Significance: Mimics “test-time scaling” benefits (longer reasoning improves accuracy) but moves them into training, improving learnability with fixed inference cost if desired (Section 2).
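To make the up-sampling effect concrete, the sketch below computes each domain's share of training tokens before and after augmentation. The document counts and absolute token lengths are illustrative values chosen here, not figures from the paper; only the "≈50% longer thoughts" ratio mirrors what is reported for Figure 4. The function name `token_share` is hypothetical.

```python
from typing import Dict, Tuple


def token_share(
    domains: Dict[str, Tuple[int, int, int]],
) -> Dict[str, Tuple[float, float]]:
    """Map domain -> (token share before, token share after augmentation).

    Each input value is (num_docs, doc_tokens_per_doc, thinking_tokens_per_doc).
    """
    before = {k: n * d for k, (n, d, t) in domains.items()}
    after = {k: n * (d + t) for k, (n, d, t) in domains.items()}
    total_before = sum(before.values())
    total_after = sum(after.values())
    return {
        k: (before[k] / total_before, after[k] / total_after) for k in domains
    }


# Illustrative numbers only (assumption): equal raw size per domain,
# with thinking 50% longer for the advanced-reasoning domain.
corpus = {
    "no_reasoning": (100, 1000, 2000),
    "advanced_reasoning": (100, 1000, 3000),
}

shares = token_share(corpus)
for name, (pre, post) in shares.items():
    print(f"{name}: {pre:.1%} of tokens before, {post:.1%} after")
```

With these toy inputs the two domains start at an even split, and the advanced-reasoning domain ends up with the larger share after augmentation without any explicit re-weighting step.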
- Data efficiency at scale
- Claim grounded by results: With 100B tokens, TPT-8B reaches an average score of 43.9 vs 26.2 for a vanilla 8B model trained identically (Table 1), and trends toward models trained on orders of magnitude more data (e.g., LLaMA‑3.1‑8B, trained on 15T tokens, scores 46.8; Table 1; Figure 2).
- Figure 1a visualizes the ≈3× data efficiency.
- Simplicity over RL or EM pipelines, yet strong gains across stages
- In contrast to RL-based RPT or EM-style bootstrapping in BoLT, TPT requires only generation + standard training, yet improves from-scratch pre-training and mid-training, and continues to help after SFT (Sections 3.1–3.3; Tables 1–3, Figures 2–3).
5. Experimental Analysis
- Evaluation methodology
- Base-model (no SFT) evaluation (Appendix A.2): average across five datasets (GSM8K, MATH, BoolQ, MMLU, MMLUPro) with specified shot settings and strict answer extraction (Section 3.1; Figure 2; Table 1).
- Post-SFT evaluation: ten challenging benchmarks spanning math (MATH-500, AIME24, AIME25, GSM8K, HMMT), code (HumanEval, LiveCodeBench v4/v5), and general reasoning (GPQA-Diamond, MMLUPro, JEEBench) using Pass@1; multiple samples per question for stability (Section 3.3; Appendix A.2; Table 3).
- Main results, with specific numbers
- From-scratch pre-training (abundant data; 100B tokens):
- Training loss: TPT has far lower loss, signaling more learnable/less noisy data (Figure 2, left; authors caution distributions differ).
- Downstream performance curve: TPT overtakes vanilla after ≈20B tokens and widens the gap up to 100B (Figure 2, right).
- Final scores (Table 1): Vanilla-8B (100B) averages 26.2 vs TPT‑8B (100B) at 43.9. On math: GSM8K 19.2→50.1, MATH 9.1→21.8.
- LLaMA‑3.1‑8B (15T) achieves 46.8, only modestly above TPT‑8B trained on 150× fewer tokens.
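The from-scratch comparison reduces to a few lines of arithmetic. The sketch below uses only the Table 1 averages and token counts quoted in this summary; the dictionary keys and the `summarize` helper are names chosen here for illustration.

```python
# Average scores and token budgets as quoted above from Table 1.
TABLE_1_AVG = {
    "vanilla_8b_100b": 26.2,
    "tpt_8b_100b": 43.9,
    "llama_3_1_8b_15t": 46.8,
}
TPT_TOKENS = 100e9     # 100B training tokens
LLAMA_TOKENS = 15e12   # 15T training tokens


def summarize() -> dict:
    """Derive the token ratio and score gaps from the quoted numbers."""
    vanilla = TABLE_1_AVG["vanilla_8b_100b"]
    tpt = TABLE_1_AVG["tpt_8b_100b"]
    llama = TABLE_1_AVG["llama_3_1_8b_15t"]
    return {
        "token_ratio": LLAMA_TOKENS / TPT_TOKENS,
        "gain_over_vanilla": tpt - vanilla,
        "gap_to_llama": llama - tpt,
        "fraction_of_gap_closed": (tpt - vanilla) / (llama - vanilla),
    }


for key, value in summarize().items():
    print(f"{key}: {value:.2f}")
```

On these averages, TPT closes roughly 86% of the gap between the vanilla 8B baseline and LLaMA‑3.1‑8B while training on 150× fewer tokens. This is a reading of the reported numbers, not an additional result.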
- Constrained-data setting (only 10B raw-document tokens available; 40B training budget):
- TPT keeps improving while vanilla plateaus (Figure 3).
- Final (Table 7): TPT‑8B (40B) averages 32.6 vs Vanilla‑8B (40B) at 16.6. Notably, GSM8K is 30.5 vs 6.7 and MATH is 12.9 vs 4.8.
- Mid-training + SFT (Table 3):
- Consistent gains across model sizes and families. Examples:
- TPT‑LLaMA‑3B vs OpenR1‑LLaMA‑3B on AIME24: 18.6 vs 5.8; on MMLUPro: 55.5 vs 45.8; on GPQA‑D: 41.7 vs 32.8.
- TPT‑Qwen2.5‑7B vs OpenR1‑7B on AIME24: 57.5 vs 50.5; on GPQA‑D: 54.7 vs 52.1; on JEEBench: 73.6 vs 69.1.
- With SFT, a TPT-pretrained 8B model surpasses LLaMA‑3.1‑8B‑Instruct on all five reported benchmarks (Table 2): AIME24: 35.2 vs 5.4; MATH‑500: 82.4 vs 49.4; LCB: 23.4 vs 9.4; GPQA: 45.2 vs 31.4; MMLUPro: 59.8 vs 43.6.
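The post-SFT scores above are Pass@1 values computed from multiple samples per question. A common way to estimate this, assumed here rather than taken from the paper's evaluation code, is to average the per-question fraction of correct samples. The function name and input layout are hypothetical.

```python
from statistics import mean
from typing import Sequence


def pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Estimate Pass@1 in percent.

    results[q][s] is True when sample s for question q is correct.
    The estimate is the mean over questions of the fraction of correct
    samples, so a question with few samples counts the same as one
    with many.
    """
    if not results or any(len(r) == 0 for r in results):
        raise ValueError("every question needs at least one sample")
    return 100.0 * mean(sum(r) / len(r) for r in results)


# Two questions, four samples each: 3/4 correct and 0/4 correct.
score = pass_at_1([
    [True, False, True, True],
    [False, False, False, False],
])
print(f"Pass@1 = {score:.1f}")
```

Averaging over several samples reduces variance on small benchmarks such as AIME24, where a single question moves the score by several points.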
- Ablations and analyses
- Thinking generator variations (Table 4):
- “Back-thinking” (a fine-tuned generator that writes thoughts inside <think>...</think>) and the “random focus” prompt perform similarly to the default, with only small deltas.
- A smaller thinking generator (DeepSeek‑R1‑Distill‑Qwen‑1.5B) sometimes outperforms the 7B one (e.g., AIME24 17.7 vs 11.7), suggesting “simpler thoughts” may be easier to learn.
- Token budget scaling for mid-training (Figure 5):
> Increasing from 0→100B thinking-augmented tokens steadily raises scores on AIME24, MATH-500, GPQA‑D, and LiveCodeBench for both 1.5B and 3B models.
- SFT epochs (Figure 6):
> Without mid-training, a 3B base barely solves AIME24; with TPT mid-training, performance starts higher and stays higher across 0.5–5 SFT epochs. SFT alone is insufficient for strong reasoning.
- Vanilla mid-training control (Table 6):
> Continual training on plain text (40B) does not help and even hurts code (e.g., LCB 5.7) compared to direct SFT (13.9). This isolates thinking augmentation as the driver of gains.
- Thinking-length distribution (Section 4; Figure 4):
> Math and physics documents trigger the longest thoughts; “Advanced reasoning” has ≈50% longer thoughts than “No reasoning,” effectively up-sampling hard content.
- Do the experiments support the claims?
- The breadth (from-scratch, constrained data, mid-training, SFT) and consistent margins, especially in math/code/general-reasoning suites, provide strong evidence that TPT improves learnability and data efficiency.
- Caveat: training-loss curves are not directly comparable due to distribution differences (Figure 2 caption); nonetheless, the downstream metrics and multiple baselines (Tables 1–3) are compelling.
6. Limitations and Trade-offs
- Dependence on thought quality
- Generated thinking can be verbose, partially incorrect, or stylistically biased. The method assumes that “explanation is easier to learn than answer,” even if explanations are imperfect (Section 2; examples in Appendix A.5). No explicit mechanism filters incorrect thoughts.
- Compute and memory overhead
- Thinking trajectories can add up to 8k tokens per sample (Appendix A.1), increasing training tokens and sequence lengths. Although this is the lever that boosts learnability, it raises training cost and may require long-context models (they train with 8k context for pre/mid-training; 32k for SFT/inference; Table 5).
- Data generation itself is non-trivial (≈20k A100 GPU hours; Appendix A.1).
- Potential distribution shift
- Longer thoughts disproportionately up-sample math/advanced content (Figure 4). While beneficial for reasoning, it could distort domain balance if not managed (they do apply sample weights when mixing datasets; Appendix A.1).
- Scale beyond 100B unknown
- Results scale cleanly up to 100B mid-training tokens (Figure 5), but behavior at trillion-token scale remains to be validated.
- Inference-time behavior
- Many evaluations allow long thinking at inference (up to 32k tokens; Section 3.3), which can increase latency/cost. It remains an open question how well TPT transfers when constrained to short outputs.
- Generalization to non-reasoning tasks
- Benchmarks emphasize math/code/general reasoning. Effects on style, summarization, or safety-alignment tasks are not reported.
7. Implications and Future Directions
- How this changes the landscape
- TPT reframes “data scaling” as “explanation scaling”: instead of scraping more web text, create more learnable content per document. This offers a path to stronger reasoning without trillion-token corpora (Figure 1a; Table 1).
- It bridges a gap between test-time chain-of-thought and training: the model practices decomposed reasoning during pre-training/mid-training, not only when prompted at inference.
- Practical applications
- Upgrading existing open-source models via mid-training to achieve large reasoning gains before SFT (Table 3) is attractive for organizations without the budget for massive pre-training runs.
- Domains that benefit most—STEM education tools, math/code assistants, scientific reading and verification—align with where thinking trajectories naturally lengthen (Figure 4).
- Follow-up research
- Thought quality control: automatic correctness checks, verifiers, or self-consistency filters to prune harmful thoughts.
- Budgeted thinking: adaptively choose thought length per document to trade off compute and benefit; e.g., learn a policy to allocate thinking tokens.
- Co-evolution of thought generator and learner: iterate generation with the current model’s weaknesses (EM- or RL-style) while keeping the pipeline simple.
- Prompt/program synthesis: diversify thinking styles (proofs, counterexamples, sketches) beyond a single prompt (Appendix A.3 hints at random focus).
- Transfer under constrained inference: train with thoughts but distill into short-answer variants that keep the gains with minimal inference cost.
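As one possible shape for the thought-quality-control direction listed above, the sketch below keeps only sampled thoughts whose final answer agrees with the majority. This is a hypothetical illustration, not something the paper implements: the function names, the 0.6 agreement threshold, and the "last number in the text" answer extractor are all assumptions, and the approach only applies to documents with a checkable final answer.

```python
import re
from collections import Counter
from typing import Callable, List, Optional


def last_number(thought: str) -> Optional[str]:
    """Toy answer extractor (assumption): the last number in the text."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", thought)
    return matches[-1] if matches else None


def self_consistent(
    thoughts: List[str],
    extract_answer: Callable[[str], Optional[str]] = last_number,
    min_agreement: float = 0.6,
) -> List[str]:
    """Keep thoughts that agree with the majority answer.

    Returns an empty list when no answer can be extracted or when the
    majority answer covers less than min_agreement of all samples.
    """
    answers = [extract_answer(t) for t in thoughts]
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return []
    majority, votes = counts.most_common(1)[0]
    if votes / len(thoughts) < min_agreement:
        return []
    return [t for t, a in zip(thoughts, answers) if a == majority]


samples = [
    "Apply the Remainder Theorem, so the answer is 890",
    "Dividing the polynomial gives 890",
    "A quick guess says the result is 12",
]
kept = self_consistent(samples)
print(len(kept), "of", len(samples), "thoughts kept")
```

A filter like this costs several generations per document, so it trades extra data-preparation compute for cleaner trajectories; whether that trade pays off is exactly the open question the bullet raises.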
Overall, the paper demonstrates a clear, scalable mechanism—attach generated explanations to training data—that substantially improves reasoning and data efficiency. The method’s simplicity, strong empirical results (Figures 1–3; Tables 1–3), and analyses (Figure 4; Table 4; Figures 5–6) make it a practical addition to large-scale LLM training pipelines, while leaving rich avenues for refining thought generation quality, adaptivity, and efficiency.