Instruction Pre-Training: Language Models are Supervised Multitask Learners

🎯 Pitch

This paper introduces Instruction Pre-Training, an approach that augments large raw corpora with synthetic instruction–response pairs generated by an efficient, open-source instruction synthesizer, enabling supervised multitask learning during language model pre-training. Adding explicit task supervision at scale improves generalization, data efficiency, and domain adaptation, to the point that an 8B model approaches or surpasses a 70B model in specialized domains, all without relying on closed or proprietary generators. The work offers an alternative to purely unsupervised pre-training and narrows the gap between pre-training and instruction tuning.

1. Executive Summary

This paper introduces Instruction Pre-Training (Instruct PT), a pre-training paradigm that augments raw text corpora with synthetic instruction–response pairs and then trains language models on this enriched data. It shows that supervised signals can be injected at pre-training scale—improving generalization, data efficiency, and domain adaptation—without relying on closed-source generators.

2. Context and Motivation

Problem addressed
Standard large language models are pre-trained “unsupervised” on raw text using next-token prediction, here called Vanilla Pre-Training. While scalable, this lacks explicit task supervision.
Supervised multitask learning via instruction tuning (fine-tuning on instruction–response tasks) improves generalization afterward but is applied post hoc, not during pre-training.
Why it matters
Injecting explicit task structure earlier could make models more sample- and compute-efficient and better aligned with later instruction tuning. It could also enable small models to close the gap with much larger ones in specialized domains.
Prior approaches and their gaps
Synthetic instruction generation for post-training (e.g., Self-Instruct-like pipelines) often depends on large, closed models to synthesize data or uses rule-based heuristics with limited diversity.
Existing pre-training methods rely on raw corpora or specialized data curation and mixture optimization but not on scaled, supervised task signals woven into the pre-training text stream.
Positioning
This work proposes supervised multitask learning at pre-training time by turning raw texts into instruction-augmented training instances using an efficient, open-source instruction synthesizer. It scales to 200M instruction–response pairs spanning 40+ task categories and demonstrates benefits in both general from-scratch pre-training and domain-adaptive continual pre-training.

3. Technical Approach

The pipeline has two stages: build a data generator (instruction synthesizer), then pre-train on the augmented corpora.

1) Instruction synthesizer (Figures 2–3)
- What it is
  - A 7B-parameter model (Mistral-7B-v0.1) fine-tuned to read a raw text (context) and produce multiple instruction–response pairs grounded in that text. An instruction is a natural-language task prompt (e.g., “What club does Helen like?”), and the response is the answer derived from the context.
- How training data is constructed (Section 2.1, Appendix A)
  - The paper reformats many context-based QA/RC datasets so that each example contains:
    - The original passage as raw text.
    - One or more downstream tasks over that passage as instruction–response pairs.
  - This yields diverse supervision across domains (encyclopedia, fiction, news, social media, expert materials) and task types (free-form, multiple-choice, with/without chain-of-thought).
- How it is fine-tuned (Figure 3; Appendix B, Table 8)
  - “One-shot example”: one raw text followed by several instruction–response pairs derived from that text.
  - “Few-shot sequence”: concatenate multiple one-shot examples from the same dataset into a single training sequence. This encourages the synthesizer to learn “patterns” (task format/category) and generalize that pattern to new texts.
  - Loss is computed only on the tokens of the generated instruction–response pairs, not the raw text, focusing learning on task synthesis rather than copying context.
- How it generates data (inference; Figure 3)
  - Multi-round inference: for a new raw text, the system first generates a set of pairs; in round 2, it prepends the text plus round-1 pairs and generates more; and so on. This creates few-shot demonstrations anchored to the same context and increases diversity/coverage.
  - On average, ~5 pairs per raw text are generated, ~52 tokens per pair (Section 3.1).
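The multi-round inference loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `render`, `multi_round_synthesis`, and `toy_synthesizer` are hypothetical names, the `Q:`/`A:` layout stands in for the real templates, and the toy generator replaces the fine-tuned 7B synthesizer. The sketch takes one text per round and assumes texts contain no blank lines.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def render(text: str, pairs: List[Pair]) -> str:
    """Join a raw text with its pairs (hypothetical plain Q:/A: layout)."""
    lines = [text]
    for instruction, response in pairs:
        lines.append(f"Q: {instruction}")
        lines.append(f"A: {response}")
    return "\n".join(lines)


def multi_round_synthesis(
    texts: List[str],
    synthesize: Callable[[str], List[Pair]],
) -> List[Tuple[str, List[Pair]]]:
    """Run one synthesis round per text, prepending all earlier rounds."""
    history: List[Tuple[str, List[Pair]]] = []
    for text in texts:
        # Earlier (text, pairs) blocks act as few-shot demonstrations.
        demos = [render(t, p) for t, p in history]
        prompt = "\n\n".join(demos + [text])
        history.append((text, synthesize(prompt)))
    return history


def toy_synthesizer(prompt: str) -> List[Pair]:
    """Stand-in generator: asks one fixed question about the last block."""
    current_text = prompt.split("\n\n")[-1]
    return [("What is the first word of the text?", current_text.split()[0])]


rounds = multi_round_synthesis(
    ["Helen likes the chess club.", "Tom plays the violin."],
    toy_synthesizer,
)
# Concatenating the rounds gives one M-shot example (here M = 2).
m_shot_example = "\n\n".join(render(t, p) for t, p in rounds)
print(m_shot_example)
```

In a real pipeline the `synthesize` callable would wrap batched generation from the fine-tuned model; the loop structure is the only part this sketch is meant to convey.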

2) Constructing instruction-augmented corpora (Section 2.2)
- Formatting
  - Templates diversify instruction styles (from Longpre et al., 2023) and concatenate raw text with its synthetic pairs (Appendix B, Table 7). Concatenating outputs across M rounds yields an M-shot training example (Figure 3).
- Two pre-training regimes
  - General pre-training from scratch:
    - Start with a 100B-token subset of RefinedWeb (Section 3.2).
    - Convert 1/5 of raw texts (40M) into instruction-augmented texts in two rounds, yielding 200M pairs (~10B tokens).
    - Mix in the synthesizer’s fine-tuning data (repeated 4 times due to small size) to increase task diversity.
  - Domain-adaptive continual pre-training:
    - For Biomed (PubMed abstracts) and Finance (financial news), convert all domain corpora in 3 rounds and mix with general instruction data (ratios per Cheng et al., 2023; Section 3.3).
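As a quick consistency check on the from-scratch data scale, the figures quoted in this section (40M converted texts, about 5 pairs per text, about 52 tokens per pair, a 100B-token corpus) can be multiplied out. The variable names are illustrative, and the per-text and per-pair values are the approximate averages reported above rather than exact counts.

```python
# Approximate figures as quoted in this summary.
converted_texts = 40_000_000        # 1/5 of the raw texts
pairs_per_text = 5                  # approximate average
tokens_per_pair = 52                # approximate average
corpus_tokens = 100_000_000_000     # RefinedWeb subset size

total_pairs = converted_texts * pairs_per_text
pair_tokens = total_pairs * tokens_per_pair
pair_share = pair_tokens / corpus_tokens

print(f"pairs: {total_pairs:,}")              # pairs: 200,000,000
print(f"pair tokens: {pair_tokens:,}")        # pair tokens: 10,400,000,000
print(f"relative to corpus: {pair_share:.1%}")  # relative to corpus: 10.4%
```

The product lands at roughly 10.4B tokens, in line with the "~10B tokens" figure, so the synthetic pairs amount to about a tenth of the raw corpus size.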

3) Model training and evaluation
- Training
  - Use the same next-token prediction objective as Vanilla PT; compute loss on all tokens (Section 2.2).
  - From-scratch models: Mistral-style architectures at 500M and 1.3B parameters; 100B tokens total (Appendix, Table 14).
  - Continual pre-training: Llama3-8B, 4k steps, sequence length 4096 (Table 14).
- Evaluation setup (Appendix C)
  - General benchmarks: ARC-e/c, BoolQ, SIQA, WinoGrande, PIQA, OBQA, HellaSwag, MMLU (0-shot or 5-shot depending on dataset).
  - Instruction tuning: further fine-tune the 500M model on FLAN-style data and track MMLU (Figure 4).
  - Domain tasks: PubMedQA, ChemProt, RCT, MQP, USMLE (Biomed); ConvFinQA, Headline, FiQA SA, FPB, NER (Finance), with zero/few-shot prompting.

Why these design choices?
- Open-source 7B synthesizer: drastically reduces cost relative to using very large or closed models, enabling scale (Section 3.1).
- Multi-round synthesis: creates few-shot-style contextual demonstrations that improve prompting and task coverage (Figure 3; ablation in Table 4 shows 1-shot underperforms).
- Loss only on pairs during synthesizer tuning: focuses the generator on producing high-quality instructions/answers rather than modeling the raw text (Figure 3).
- Mix instruction-augmented and raw text: balances supervised signals with broad coverage of natural language.
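The two loss regimes mentioned above (loss on pair tokens only when tuning the synthesizer, loss on all tokens during pre-training) differ only in the mask applied to per-token losses. The sketch below uses plain Python lists to make that difference concrete; `build_loss_mask`, `masked_mean_loss`, the segment labels, and the token IDs and loss values are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Segment = Tuple[str, List[int]]  # (kind, token_ids); kind is "text" or "pair"


def build_loss_mask(
    segments: List[Segment], mask_raw_text: bool
) -> Tuple[List[int], List[int]]:
    """Flatten segments and return (token_ids, mask); 1 means the token is scored."""
    token_ids: List[int] = []
    mask: List[int] = []
    for kind, ids in segments:
        keep = 0 if (mask_raw_text and kind == "text") else 1
        token_ids.extend(ids)
        mask.extend([keep] * len(ids))
    return token_ids, mask


def masked_mean_loss(per_token_losses: List[float], mask: List[int]) -> float:
    """Average the per-token losses over the positions the mask keeps."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)


# A toy few-shot sequence: two raw texts, each followed by its pair tokens.
sequence = [("text", [11, 12, 13]), ("pair", [21, 22]),
            ("text", [14]), ("pair", [23])]
losses = [4.0, 4.0, 4.0, 1.0, 3.0, 4.0, 2.0]  # made-up per-token losses

_, tuning_mask = build_loss_mask(sequence, mask_raw_text=True)
_, pretrain_mask = build_loss_mask(sequence, mask_raw_text=False)

print(tuning_mask)                               # [0, 0, 0, 1, 1, 0, 1]
print(masked_mean_loss(losses, tuning_mask))     # 2.0
print(pretrain_mask)                             # [1, 1, 1, 1, 1, 1, 1]
```

In a typical training framework the same effect is usually achieved by setting the labels of masked positions to an ignore value; the list-based version here only shows which tokens contribute to the loss in each regime.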

4. Key Insights and Innovations

1) Supervised multitask signals during pre-training, not just post-training
- What’s new: Turn raw web text into instruction-driven tasks at scale and train on them as part of pre-training (Figure 1).
- Why it matters: Improves data efficiency and generalization of base models and makes later instruction tuning smoother (Figure 4).

2) A scalable, open-source instruction synthesizer
- Built on a 7B model and fine-tuned on diverse context-task datasets to generate grounded tasks for arbitrary raw text (Figures 2–3; Appendix A/B).
- Significance: Enables generation of 200M instruction–response pairs across 40+ task categories with good quality and coverage (Table 6; Figure 6) at feasible cost.

3) Multi-round (few-shot) synthesis aligned with training objectives
- Different from one-shot or rule-based augmentation: the synthesizer leverages earlier rounds as demonstrations to produce richer, few-shot-like contexts (Figure 3).
- Evidence: Ablations show 1-shot synthesis reduces performance compared to multi-round (Table 4).

4) Demonstrated data efficiency and strong domain adaptation
- From-scratch: the 500M model trained on 100B tokens averages 46.6, close to Pythia-1B trained on 300B tokens (47.1), and the 1.3B model (49.7) is close to BLOOM-3B trained on 341B tokens (50.1) (Table 2).
- Continual pre-training: Llama3-8B with Instruct PT narrows the average gap to Llama3-70B in Biomed (61.3 vs 63.9) and surpasses it in Finance (74.7 vs 71.9) (Table 3).

These are more than incremental tweaks: the work reframes pre-training to include scalable supervised signals and shows concrete benefits in both generalization and domain transfer.

5. Experimental Analysis

Evaluation methodology
- General pre-training from scratch
  - Data: 100B tokens from RefinedWeb; 40M texts augmented with two rounds of synthesis, yielding 200M pairs (~10B tokens) (Section 3.2).
  - Models: 500M and 1.3B Mistral-style (Table 14).
  - Metrics: Accuracy (acc-norm where applicable) on ARC-e/c, BoolQ, SIQA, WinoGrande, PIQA, OBQA, HellaSwag, MMLU.
- Instruction tuning follow-up
  - Fine-tune the 500M model on FLAN collection; evaluate MMLU zero/few-shot over steps (Figure 4).
- Domain-adaptive continual pre-training
  - Base: Llama3-8B; domains: Biomed (PubMed abstracts), Finance (financial news).
  - Setup: 3-round synthesis of all domain corpora; mix with general instruction data; evaluate on domain-specific datasets (Section 3.3; Appendix C).

Main quantitative results (selected)
- From-scratch pre-trained base models (Table 1):
  - 500M parameters:
    - Vanilla PT: ARC-e 50.3, ARC-c 26.4, BoolQ 57.5, SIQA 44.6, WinoGrande 53.8, PIQA 71.1, OBQA 29.8, HellaSwag 47.2, MMLU 25.4
    - Instruct PT: ARC-e 54.8, ARC-c 27.4, BoolQ 62.0, SIQA 47.2, WinoGrande 54.8, PIQA 69.9, OBQA 30.8, HellaSwag 47.3, MMLU 25.3
    - Observations: Broad gains on ARC-e/c, BoolQ, SIQA, WinoGrande; PIQA slightly down; HellaSwag similar; MMLU roughly unchanged at base stage.
  - 1.3B parameters:
    - Vanilla PT: ARC-e 58.5, ARC-c 28.8, BoolQ 60.3, SIQA 47.9, WinoGrande 54.9, PIQA 73.0, OBQA 33.6, HellaSwag 54.9, MMLU 25.7
    - Instruct PT: ARC-e 60.5, ARC-c 30.9, BoolQ 62.2, SIQA 49.2, WinoGrande 55.9, PIQA 73.6, OBQA 33.4, HellaSwag 54.3, MMLU 27.3
    - Observations: Consistent improvements across most benchmarks with a notable +1.6 on MMLU.
- Data efficiency vs other open models (Table 2/15):
  - With only 100B tokens, the 500M Instruct PT model achieves an average of 46.6 (Table 2), comparable to Pythia-1B trained on 300B tokens (47.1), and the 1.3B Instruct PT (49.7) rivals BLOOM-3B trained on 341B tokens (50.1).
- Instruction tuning efficiency (Figure 4):
  - During instruction tuning, the Instruct PT 500M model rapidly outperforms the Vanilla PT counterpart on MMLU and maintains a stable upward trend across training steps (zero- and few-shot panels), indicating smoother transfer from pre-training to instruction tuning.
- Domain-adaptive continual pre-training (Table 3):
  - Biomed (average over 5 tasks):
    - Llama3-8B without continued PT: 53.6
    - Vanilla PT-8B: 58.4
    - Instruct PT-8B: 61.3
    - Llama3-70B: 63.9
    - Instruct PT-8B substantially narrows the gap to 70B (−2.6 average).
  - Finance (average over 5 tasks):
    - Llama3-8B without continued PT: 70.1
    - Vanilla PT-8B: 72.0
    - Instruct PT-8B: 74.7
    - Llama3-70B: 71.9
    - Instruct PT-8B surpasses 70B on average (+2.8), with a large ConvFinQA jump (62.9 → 74.6).
- Ablations (Table 4):
  - Removing domain corpora (“w/o Corpora”) lowers performance (e.g., Biomed avg 61.3 → 58.6).
  - Replacing synthesis with rule-based augmentation reduces performance (Biomed 61.3 → 58.8).
  - Limiting synthesis to a single round (1-shot) also hurts (Biomed 61.3 → 58.5).
  - Conclusion: Domain-grounded, multi-round synthetic instructions matter.
- Synthesizer quality (Table 5; Figure 5):
  - Response accuracy vs base Mistral-7B improves dramatically on both seen and unseen datasets (e.g., seen zero-shot 70.0 vs 30.6; unseen few-shot 49.9 vs 21.8).
  - Including synthesized pairs in prompts boosts an LM’s task performance compared to “w/o pairs” or “random” baselines (Figure 5).
- Data quality, diversity, and contamination (Table 6; Figure 6; Table 9; Table 10; Table 11; Table 12):
  - Instruction–response pairs cover 49 categories with high context relevance (≥85%) and 70–86% response accuracy depending on domain (Table 6; human eval in Table 10).
  - Minimal contamination is introduced by synthesized pairs (e.g., only 2 MMLU examples; Table 9).
  - Synthesized pairs mirror the domain distribution of the raw corpora (Table 12) and have high domain coverage/overlap with the source texts (Table 11).
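To make the Table 1 comparison easier to scan, the per-benchmark scores listed above can be loaded into a small script that computes averages and Instruct PT minus Vanilla PT deltas. The numbers are copied from this summary; `SCORES`, `average`, and `deltas` are illustrative names. The computed Instruct PT averages round to 46.6 (500M) and 49.7 (1.3B), which agrees with the Table 2 averages quoted above.

```python
BENCHMARKS = ["ARC-e", "ARC-c", "BoolQ", "SIQA", "WinoGrande",
              "PIQA", "OBQA", "HellaSwag", "MMLU"]

# Scores as listed in this summary (Table 1), in BENCHMARKS order.
SCORES = {
    ("500M", "Vanilla PT"):  [50.3, 26.4, 57.5, 44.6, 53.8, 71.1, 29.8, 47.2, 25.4],
    ("500M", "Instruct PT"): [54.8, 27.4, 62.0, 47.2, 54.8, 69.9, 30.8, 47.3, 25.3],
    ("1.3B", "Vanilla PT"):  [58.5, 28.8, 60.3, 47.9, 54.9, 73.0, 33.6, 54.9, 25.7],
    ("1.3B", "Instruct PT"): [60.5, 30.9, 62.2, 49.2, 55.9, 73.6, 33.4, 54.3, 27.3],
}


def average(scores):
    """Mean score over the nine benchmarks, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def deltas(size):
    """Per-benchmark Instruct PT minus Vanilla PT, rounded to one decimal."""
    vanilla = SCORES[(size, "Vanilla PT")]
    instruct = SCORES[(size, "Instruct PT")]
    return {name: round(i - v, 1)
            for name, v, i in zip(BENCHMARKS, vanilla, instruct)}


for size in ("500M", "1.3B"):
    print(size,
          "Vanilla avg:", average(SCORES[(size, "Vanilla PT")]),
          "Instruct avg:", average(SCORES[(size, "Instruct PT")]))
    print("  deltas:", deltas(size))
```

The deltas make the mixed cases explicit: PIQA drops at 500M and HellaSwag and OBQA dip slightly at 1.3B, while most other benchmarks improve.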

Do the experiments support the claims?
- Yes, on three fronts:
  - Generalization: from-scratch Instruct PT improves most benchmarks over Vanilla PT at the same token budget and delivers better MMLU after subsequent instruction tuning (Table 1; Figure 4).
  - Data efficiency: competitive with larger baselines trained on far more tokens (Table 2/15).
  - Domain transfer: consistent, often large gains in Biomed and Finance, with the 8B model approaching 70B performance in Biomed and surpassing it in Finance (Table 3).
- Mixed/conditional results:
  - Some base-stage tasks (e.g., PIQA at 500M, HellaSwag at 1.3B) show small regressions or stasis (Table 1). Gains are not universal without instruction tuning.
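The domain-transfer claim can be checked against the Table 3 averages quoted earlier with a few subtractions. The values below are copied from this summary; `TABLE3_AVERAGES` and `gap` are illustrative names.

```python
# Domain averages as quoted in this summary (Table 3).
TABLE3_AVERAGES = {
    "Biomed":  {"Llama3-8B": 53.6, "Vanilla PT-8B": 58.4,
                "Instruct PT-8B": 61.3, "Llama3-70B": 63.9},
    "Finance": {"Llama3-8B": 70.1, "Vanilla PT-8B": 72.0,
                "Instruct PT-8B": 74.7, "Llama3-70B": 71.9},
}


def gap(domain, model, reference):
    """Average score of `model` minus `reference`, rounded to one decimal."""
    scores = TABLE3_AVERAGES[domain]
    return round(scores[model] - scores[reference], 1)


for domain in TABLE3_AVERAGES:
    print(domain,
          "vs 70B:", gap(domain, "Instruct PT-8B", "Llama3-70B"),
          "vs Vanilla PT:", gap(domain, "Instruct PT-8B", "Vanilla PT-8B"),
          "vs base 8B:", gap(domain, "Instruct PT-8B", "Llama3-8B"))
```

The 8B model trails the 70B model by 2.6 points in Biomed and leads it by 2.8 in Finance, while the gain over Vanilla PT is similar in both domains (2.9 and 2.7).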

6. Limitations and Trade-offs

Accuracy of synthetic data
The paper’s analyses estimate around 70–86% response accuracy and 85–99% context relevance (Table 6; Table 10). This implies non-trivial noise; training on noisy supervision risks entrenching hallucinations.
Scale and compute
From-scratch demonstrations use 100B tokens and small models (500M/1.3B); it is unclear how effects scale to trillion-token, 10B–70B+ pre-training regimes. The paper notes scaling remains an open question (Limitations section).
Data generation cost and engineering
While far cheaper than using closed models, synthesizing 200M pairs is still computationally and operationally non-trivial (Appendix B mentions ~1 day per 1B raw tokens on a single A100 for inference—implying sizable cluster requirements at web scale).
Benchmarks and mixing
Mixture ratios (general instructions vs domain corpora) and template choices may influence results; optimal mixture design is not fully explored (a common open issue in data-centric LMs).
Contamination check scope
Contamination is measured via substring matches (Table 9). This is standard but may miss paraphrastic leakage.
Domain coverage vs depth
Synthesized tasks track domain distributions (Table 12), but depth/complexity of expert reasoning is not fully characterized beyond benchmark scores.
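The substring-based contamination check noted above, and its blind spot for paraphrases, can be illustrated with a small sketch. This is not the paper's procedure: `normalize` and `find_contaminated` are hypothetical helpers, the normalization (lowercasing and whitespace collapsing) is an assumption, and the example strings are invented.

```python
import re
from typing import Iterable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def find_contaminated(
    benchmark_examples: Iterable[str],
    synthesized_texts: Iterable[str],
) -> List[str]:
    """Return benchmark examples that occur verbatim inside any synthesized text."""
    corpus = [normalize(t) for t in synthesized_texts]
    flagged = []
    for example in benchmark_examples:
        key = normalize(example)
        if key and any(key in doc for doc in corpus):
            flagged.append(example)
    return flagged


synthesized = [
    "Q: Which organ pumps blood through the body?  A: The heart.",
    "Q: What does GDP measure? A: Economic output.",
]
benchmark = [
    "Which organ pumps blood through the body?",       # verbatim overlap
    "Which organ circulates blood around the body?",   # paraphrase
]

flagged = find_contaminated(benchmark, synthesized)
print(flagged)  # ['Which organ pumps blood through the body?']
```

The paraphrased question is not flagged even though it tests the same fact, which is the kind of leakage an exact-match check cannot detect.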

7. Implications and Future Directions

How this changes the landscape
Demonstrates that supervised multitask signals can be integrated at pre-training time using scalable, open-source synthesis, improving both general capabilities and domain adaptation without closed-model distillation.
Suggests a path to smaller-but-stronger models through data-centric pre-training, potentially lowering the barrier for specialized or resource-constrained deployments.
Follow-up research
Post-verification and filtering of synthetic data to reduce noise (the paper points to iterative filtering and verifier-based pipelines; Limitations).
Mixture optimization: principled methods (e.g., DoReMi-like) to balance raw text, synthesized tasks, and domain instructions during pre-training.
Scaling studies: characterize how instruction pre-training benefits evolve with model size and trillions of tokens.
Richer synthesizers: incorporate retrieval, tool use, or cross-document grounding to generate harder, more compositional tasks.
Safety and bias: evaluate how synthetic supervision affects harmful bias propagation and robustness.
Practical applications
Domain LMs (finance, biomedicine, law) via continual instruction pre-training that rivals much larger models (Table 3).
Pre-training pipelines for organizations that cannot rely on closed APIs for data synthesis.
Faster and more stable downstream instruction tuning of base models (Figure 4), reducing fine-tuning compute and time.

Overall, the work provides a concrete, reproducible recipe (build an open instruction synthesizer, generate diverse grounded tasks from raw corpora in multiple rounds, and pre-train on these instruction-augmented texts) that improves generalization and domain transfer on most reported benchmarks while remaining computationally pragmatic.