
Instruction Following Without Instruction Tuning

ArXiv: 2409.14254

Pitch

This paper reveals the surprising finding that language models can be made to follow instructions without any instruction training data—either by training solely on responses or even by finetuning on narrow, unrelated tasks like poetry. These findings suggest that the instruction-response mapping is largely learned during pretraining, and instruction tuning simply teaches the model which responses are desirable.


1. Executive Summary

This paper demonstrates that instruction-following behavior in language models can emerge from adaptation methods that were never designed to produce it, a phenomenon the authors call "implicit instruction tuning." Through experiments on Llama-2-7B and OLMo-7B-Feb2024, they show that (1) training solely on responses without any instructions ("response tuning") achieves a 43.3% win rate against instruction-tuned models on AlpacaEval 2, and (2) single-task finetuning on narrow domains like poetry or math word problems yields broad instruction-following capabilities (e.g., a 22.9% win rate for poetry-tuned and 23.7% for GSM-tuned Llama-2-7B vs. instruction-tuned). The paper further shows that a hand-written 3-rule adapter (penalizing repetition, upweighting end-of-sequence, and uniformly modifying 15 tokens' probabilities) combined with a base model via product-of-experts achieves a 24.4% win rate, suggesting that the transition from pretrained to instruction-following distributions may be remarkably simple.

2. Context and Motivation

The Core Problem: Understanding What Instruction Tuning Actually Teaches

Instruction tuning—finetuning a language model on instruction-response pairs—has become the standard approach for creating models that can follow user commands. Yet a fundamental question remains poorly understood: what does instruction tuning actually teach the model? Does it learn the mapping from instructions to responses? Does it learn what constitutes a "good" response? Or does it simply surface capabilities already present in the pretrained model?

This question matters for both theoretical and practical reasons. Theoretically, understanding the mechanism of instruction tuning would illuminate how language models generalize and what knowledge is acquired during pretraining versus fine-tuning. Practically, if instruction tuning is teaching something simpler than previously thought, it changes how we should approach model adaptation, safety fine-tuning, and deployment.

Prior Work and the Gap It Left

The paper situates itself in a line of work progressively showing that instruction tuning requires less supervision than initially assumed:

  • Taori et al. (2023) showed that finetuning Llama on 52,000 instruction-response pairs caused surprisingly strong instruction-following behavior.
  • Zhou et al. (2023) demonstrated that with careful example selection, as few as 1,000 pairs could suffice (the LIMA dataset).
  • Lin et al. (2024) pushed further, showing that with careful prompting and in-context examples, a handful of examples could yield instruction following without any parameter updates.
  • Kung & Peng (2023) ran ablations on NaturalInstructions-format data, removing either task descriptions or in-context examples and finding smaller-than-expected degradation.

The authors note that Shi et al. (2024) found that training on joint likelihood of instruction and response (rather than just response conditioned on instruction) improved performance—but this was in the direction of more supervision, not less.

The gap this paper addresses: no prior work has systematically tested whether instruction-response pairs are necessary at all, or whether narrow-domain fine-tuning produces broad instruction-following behavior. The paper asks how minimal the supervision can be while still yielding instruction following.

A Critical Concern: Contamination in Pretraining

A major challenge in interpreting these results is that modern language models may have been exposed to instruction-tuning data during pretraining itself. The authors address this by using two models:

  1. Llama-2-7B: Stronger performance, but no guarantee that instruction-tuning data was not intentionally included in pretraining.
  2. OLMo-7B-Feb2024: Weaker, but explicitly confirmed (via public communication) to have no intentional instruction-tuning data in pretraining.

The fact that both models show similar patterns of implicit instruction tuning strengthens the conclusions, reducing the likelihood that results are explained by pretraining contamination.

How This Paper Positions Itself

The paper frames itself as investigating "implicit instruction tuning"—adaptation methods that were not designed to yield instruction following but do so anyway. It proceeds through three increasingly surprising findings:

  1. Response tuning suggests that instruction-response mappings need not be learned during fine-tuning; they appear to exist already from pretraining.
  2. Single-task finetuning shows that even learning the "correct" distribution of responses isn't necessary—narrow-domain training yields broad generalization.
  3. Rule-based adaptation provides mechanistic insight: the change from pretrained to instruction-following distributions may involve only simple adjustments to token probabilities.

The paper's stance is that instruction following is easier to "stumble upon" than previously recognized, with implications for how practitioners should approach model deployment and safety testing.

3. Technical Approach

3.1 Reader Orientation

This is an empirical analysis paper that discovers and characterizes a phenomenon (implicit instruction tuning) rather than proposing a new method. The core mechanism being studied is how various deficient forms of adaptation nonetheless yield instruction-following behavior, revealing that pretrained language models encode instruction-response mappings that can be surfaced through simple distributional changes.

3.2 Big-Picture Architecture (Diagram in Words)

The paper studies four pathways to instruction-following behavior:

  1. Instruction tuning (baseline): Given instruction-response pairs \(\{(instruction_i, response_i)\}\), optimize \(\min_\theta -\log p_\theta(response | instruction)\) using standard supervised fine-tuning. This is the standard approach.

  2. Response tuning: Train on responses alone, replacing instructions with empty strings. Optimize \(\min_\theta -\log p_\theta(response | \text{empty string})\). The model learns only the marginal distribution of desirable responses—no instruction conditioning.

  3. Single-task finetuning: Train on a narrow domain (e.g., poetry, Python code, math problems). Optimize the same objective as instruction tuning, but on a specialized dataset. Test whether the model generalizes to unrelated instructions.

  4. Rule-based adapter: Construct a product-of-experts between a pretrained model \(p_{base}\) and a hand-written rule-based model \(p_{rules}\): \(p_a(w|x) \propto p_{base}(w|x) \cdot p_{rules}(w|x)\). Three rules: upweight EOS, uniformly modify 15 tokens' scores, penalize repetition.

3.3 Roadmap for the Deep Dive

The technical explanation proceeds as follows:

  1. Formal problem setup — defining instruction tuning mathematically and the evaluation methodology
  2. Response tuning — the method, experiments on Llama-2-7B and OLMo-7B, and the "response ranking capability" hypothesis
  3. Single-task finetuning — five narrow domains, experimental results, and analysis of when models adhere vs. deviate from finetuning behavior
  4. Rule-based adapter — the product-of-experts formulation, the three rules, and ablations
  5. Ablation studies — testing whether formatting tag semantics or instruction rephrasing explain response tuning's success

3.4 Detailed, Sentence-Based Technical Breakdown

Formal Problem Setup

Language model notation. A neural language model \(p_\theta(x)\) defines a distribution over strings \(x \in \mathcal{V}^*\), where \(\mathcal{V}\) is a finite vocabulary and \(\theta\) are learnable parameters. During pretraining, the model minimizes cross-entropy loss on a large text corpus.

Instruction tuning objective. Given a dataset \(\mathcal{D}_{ins} = \{(instruction_i, response_i)\}_{i=1}^k\) of instruction-response pairs, instruction tuning optimizes:

\[\min_\theta \frac{1}{k} \sum_{i=1}^k -\log p_\theta(response_i | instruction_i)\]

This is standard conditional likelihood estimation—maximizing the probability of the response given the instruction.
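As a minimal sketch of the objective above (not the authors' training code), the loss can be computed from per-token log-probabilities of the response tokens only. The function names and the toy probabilities below are illustrative assumptions.

```python
import math

def response_nll(token_logprobs):
    """Negative log-likelihood of one response: minus the sum of its per-token log-probs."""
    return -sum(token_logprobs)

def instruction_tuning_loss(batch):
    """Average response NLL over the k examples in `batch`.

    Each element of `batch` is a list of log p(token | instruction, earlier response tokens)
    for the response tokens only; instruction tokens are conditioned on but not scored.
    """
    return sum(response_nll(logprobs) for logprobs in batch) / len(batch)

# Hypothetical per-token probabilities for two short responses.
batch = [
    [math.log(0.5), math.log(0.5)],  # two-token response
    [math.log(0.25)],                # one-token response
]
loss = instruction_tuning_loss(batch)
print(loss)
```

Both toy responses have total probability 0.25, so the averaged loss equals -log(0.25).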

Instruction formatting. The paper uses Tulu formatting to distinguish instructions from responses:

BOS<|user|>
{instruction}
<|assistant|>
{response}EOS

The authors explicitly keep this formatting because it reflects current practice, though they test alternative formats in Appendix B. BOS and EOS are model-specific special tokens for beginning and end of sequence.

Evaluation methodology. The paper uses AlpacaEval 2, an LLM-as-a-judge framework where a model (e.g., GPT-4) compares responses from two models on the same instruction. The metric is length-controlled win rate against a comparable instruction-tuned model. A 50% win rate indicates equal performance; higher is better.

The authors also use greedy decoding to observe whether instruction-following responses are the model's locally most likely outputs (not just high-probability alternatives found via sampling).

Response Tuning (Section 4)

Core finding. Training on responses without their corresponding instructions yields substantial instruction-following behavior.

Method. Response tuning modifies the instruction tuning objective by replacing the instruction string with the empty string:

\[\min_\theta \frac{1}{k} \sum_{i=1}^k -\log p_\theta(response_i | \text{[empty string]})\]

The model sees only responses in the desired format—it receives no information about which instruction led to which response. At test time, the model is given a novel instruction and must generate a response.
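The difference between the two training setups can be sketched at the data-formatting level. This is an illustrative sketch only: `format_example` and `to_response_tuning` are hypothetical helpers, and the `<s>` / `</s>` strings are placeholder stand-ins for the model-specific BOS and EOS tokens.

```python
BOS, EOS = "<s>", "</s>"  # placeholder special tokens (assumption; these are model-specific)

def format_example(instruction, response):
    """Return (prompt, target) in the Tulu-style layout described earlier.

    The loss is taken on `target` only; `prompt` is conditioned on.
    """
    prompt = f"{BOS}<|user|>\n{instruction}\n<|assistant|>\n"
    target = f"{response}{EOS}"
    return prompt, target

def to_response_tuning(example):
    """Drop the instruction, keeping only the response (the response-tuning variant)."""
    _instruction, response = example
    return ("", response)

example = ("Name a color.", "Blue.")
it_prompt, it_target = format_example(*example)
rt_prompt, rt_target = format_example(*to_response_tuning(example))
print(repr(it_prompt), repr(rt_prompt))
```

The targets are identical in both variants; only the prompt loses the instruction text, while the formatting tags remain.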

Experimental setup.

  • Dataset: LIMA (Zhou et al., 2023), containing 1,030 training examples
  • Models: Llama-2-7B and OLMo-7B-Feb2024 (both 7B parameter models)
  • Training: Full parameter finetuning, sweeping over 5, 7, 10, 15, or 20 epochs
  • Hyperparameters: Learning rate \(10^{-5}\) for Llama-2-7B, \(3 \times 10^{-6}\) for OLMo-7B-Feb2024; Adam optimizer with cosine annealing to zero; 10% warmup; batch size 64
  • Selection: AlpacaEval win rate against GPT-3.5-turbo on a held-out validation set (56 instructions, partly hand-written, partly GPT-4-generated)
  • Reporting: Average and standard deviation over 5 training runs

Results (Table 1).

  • Response-tuned Llama-2-7B achieves 43.3% ± 1.1% win rate against instruction-tuned Llama-2-7B
  • Base Llama-2-7B achieves only 2.4% ± 0.14% win rate
  • Response-tuned OLMo-7B-Feb2024 achieves 43.7% ± 1.7% win rate
  • Base OLMo-7B-Feb2024 achieves only 4.7% ± 0.57% win rate

The response-tuned models perform much closer to instruction-tuned models than to base models. Instruction tuning still outperforms response tuning, but not massively—the gap is informative but not decisive.

The response ranking capability hypothesis. Why does response tuning work? The authors propose that pretrained models already encode the instruction-response mapping, but desirable responses have too-low probability to be generated.

They define the response ranking capability: a model has this capability if it assigns higher likelihood to the correct response for an instruction than to a desirable response for a random other instruction:

\[p_\theta(response | instruction) > p_\theta(response' | instruction)\]

where \((instruction, response)\) and \((instruction', response')\) are independently sampled instruction-response pairs.

This capability can hold even when neither response has high absolute probability—what matters is relative ranking. Response tuning may simply "unlock" this capability by increasing the absolute probability of desirable response formats.
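A rough sketch of how such a ranking rate could be measured is given below. It assumes raw (unnormalized) log-likelihood comparison over all ordered pairs of examples, which may differ from the paper's exact protocol; `toy_log_prob` is a toy word-overlap stand-in for a real model's log p(response | instruction).

```python
import itertools

def ranking_rate(pairs, log_prob):
    """Fraction of ordered pairs (i, j), i != j, where the model scores
    response_i above response_j when conditioned on instruction_i."""
    hits = total = 0
    for (ins_i, res_i), (_ins_j, res_j) in itertools.permutations(pairs, 2):
        total += 1
        if log_prob(res_i, ins_i) > log_prob(res_j, ins_i):
            hits += 1
    return hits / total

def toy_log_prob(response, instruction):
    """Toy scorer (assumption): number of words shared by response and instruction."""
    shared = set(response.lower().split()) & set(instruction.lower().split())
    return float(len(shared))

pairs = [
    ("capital of France", "Paris is the capital of France"),
    ("what color is the sky", "The sky is blue"),
]
rate = ranking_rate(pairs, toy_log_prob)
print(rate)
```

With a real language model, `log_prob` would sum token log-probabilities of the response given the formatted instruction.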

Testing the hypothesis (Table 2). The authors compute how often the response ranking capability holds for pretrained vs. instruction-tuned models:

  • Llama-2-7B base: 80.4% of instruction pairs satisfy the property
  • Llama-2-7B instruction-tuned: 77.4%
  • OLMo-7B-Feb2024 base: 74.5%
  • OLMo-7B-Feb2024 instruction-tuned: 74.3%

Pretrained models exhibit the response ranking capability at least as often as instruction-tuned models. This supports the hypothesis: pretrained models can distinguish correct from incorrect responses for a given instruction, but need adaptation to generate them.

Addressing alternative explanations.

One concern: responses often start by rephrasing the instruction, which might provide implicit supervision. The authors estimate ~10% of LIMA responses begin with rephrasing (Appendix C). They:

  1. Use GPT-4 to identify 99 responses that begin with rephrasing
  2. Use GPT-4 to rewrite these responses without initial rephrasing
  3. Re-run response tuning on the transformed dataset

Result: response tuning on the no-rephrasing dataset achieves 43.3% ± 5.3% win rate, nearly identical to the original 43.7%. Instruction rephrasing is not the cause of response tuning's success.

Another concern: the formatting tags <|user|> and <|assistant|> have semantic content that might cue instruction-following behavior. The authors test using non-semantic tags <|A|> and <|B|> (Appendix B, Table 5):

  • Response tuning with A/B tags achieves 41.8% ± 0.84% win rate
  • Compared to 43.3% with original user/assistant tags

Tag semantics provide slight benefit but don't explain the improvement from base to response-tuned.

Single-Task Finetuning (Section 5)

Core finding. Finetuning on narrow, single-task data (poetry, code, math) yields broad instruction-following behavior on unrelated instructions.

Datasets (Figure 3). Five narrow-domain datasets, each targeting a specific behavior:

  1. MBPP: 374 English-to-Python pairs (input: Python function request; output: code)
  2. GSM: 1,000 grade school math problems (input: word problem; output: English-and-math derivation with final answer formatted as #### N)
  3. Recipes: 1,000 recipe strings (input: "Recipe for X"; output: hyphenated ingredient list followed by instructions)
  4. Poetry: 571 poems (input: "Write a poem called X"; output: the poem)
  5. Chess: 1,000 chess games in PGN notation (input: ELO ratings; output: game moves)

Experimental setup. Same as response tuning: full parameter finetuning on Llama-2-7B and OLMo-7B-Feb2024, sweeping over epoch counts, selecting on validation set, reporting AlpacaEval win rate against instruction-tuned models.

Results (Table 3). Win rates against instruction-tuned models:

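| Tuning           | Llama-2-7B | OLMo-7B-Feb2024 |
|------------------|------------|-----------------|
| Base (no tuning) | 2.4%       | 4.7%            |
| MBPP             | 16.9%      | 10.4%           |
| GSM              | 23.7%      | 30.3%           |
| Poetry           | 22.9%      | 21.9%           |
| Recipes          | 14.6%      | 21.5%           |
| Chess            | 2.1%       | 6.3%            |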

Four of five single-task finetuning settings yield substantial instruction following—models trained only on poetry or math generate reasonable recipes and biographies. Chess is the exception, which the authors attribute to the very low entropy of chess opening moves (most games start identically).

Qualitative behavior (Figures 4 and 5). The authors analyze when models adhere to vs. deviate from finetuning behavior:

  • For instructions similar to the finetuning domain, models produce domain-appropriate outputs (e.g., GSM-tuned models produce math-style derivations for math-like questions)
  • For instructions dissimilar from the finetuning domain, models generate general instruction-following responses, not domain-specific ones

Figure 5 plots the relationship explicitly for GSM-tuned models: instruction similarity to GSM (x-axis, measured by Nomic embedding cosine similarity) vs. response similarity to GSM minus response similarity to LIMA (y-axis). For low-similarity instructions, responses are closer to general LIMA responses than to GSM format. Only for very high-similarity instructions do responses strongly adhere to GSM format.
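The two axes of this plot can be sketched as follows. This is only an illustration: the summary does not say how similarity to a whole dataset is aggregated, so averaging cosine similarities over the dataset's embeddings is an assumption here, and the 2-D vectors are toy stand-ins for real Nomic embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_similarity(vec, dataset_vecs):
    """Average cosine similarity of `vec` to a set of embeddings (aggregation is an assumption)."""
    return sum(cosine(vec, d) for d in dataset_vecs) / len(dataset_vecs)

def similarity_point(instr_vec, resp_vec, gsm_instr_vecs, gsm_resp_vecs, lima_resp_vecs):
    """x: instruction similarity to GSM; y: response similarity to GSM minus to LIMA."""
    x = mean_similarity(instr_vec, gsm_instr_vecs)
    y = mean_similarity(resp_vec, gsm_resp_vecs) - mean_similarity(resp_vec, lima_resp_vecs)
    return x, y

# Toy embeddings: the instruction looks like GSM, the response looks like LIMA.
x, y = similarity_point(
    instr_vec=[1.0, 0.0],
    resp_vec=[0.0, 1.0],
    gsm_instr_vecs=[[1.0, 0.0]],
    gsm_resp_vecs=[[1.0, 0.0]],
    lima_resp_vecs=[[0.0, 1.0]],
)
print(x, y)
```

A negative y value corresponds to a response that sits closer to LIMA-style responses than to GSM-style ones.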

Key observation from Figure 4: A Recipe-tuned model, when asked about US state names, starts with a hyphen (matching recipe format) but then proceeds with a reasonable answer about state name origins. The finetuning format subtly leaks into outputs, but doesn't dominate.

Practical implication. Models finetuned for specific tasks may exhibit general instruction-following behavior on out-of-distribution inputs, even when the practitioner expected task-specific behavior. This has safety implications: a model finetuned on benign code generation may still behave as a general assistant on unrelated queries.

Rule-Based Adapter (Section 6)

Core finding. A hand-written 3-rule adapter, combined with a base model via product-of-experts, yields instruction-following behavior comparable to single-task finetuning.

Motivation. If response tuning and single-task finetuning both yield instruction following, perhaps the underlying change in the model's distribution is simple. The authors test this by constructing a simple rule-based adapter.

Product-of-experts formulation. For word \(w \in \mathcal{V}\) and prefix \(x \in \mathcal{V}^*\), the adapted distribution is:

\[p_a(w | x) = \frac{p_{base}(w | x) \cdot p_{rules}(w | x)}{Z(x)}\]

where \(Z(x) = \sum_{w \in \mathcal{V}} p_{base}(w | x) \cdot p_{rules}(w | x)\) is the normalization constant.

The product-of-experts computes a soft AND: tokens likely under both distributions are upweighted. The rules distribution modifies base probabilities by multiplicative factors.

Rule formulation. Let \(r(w, x)\) be the sum of all rules' scores for word \(w\). The rules distribution is:

\[p_{rules}(\cdot | x) = \text{softmax}\left([r(w^{(1)}, x), \ldots, r(w^{|\mathcal{V}|}, x)]\right)\]
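Because the rules distribution is a softmax over rule scores, the product-of-experts above is equivalent to adding the rule scores to the base model's logits before the softmax. The sketch below checks this numerically on a toy three-token vocabulary; the logits and scores are made-up illustrative values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def product_of_experts(p_base, p_rules):
    """Elementwise product of two distributions, renormalized by Z(x)."""
    unnorm = [b * r for b, r in zip(p_base, p_rules)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

base_logits = [2.0, 0.5, -1.0]  # hypothetical base-model logits for one prefix x
rule_scores = [0.0, 1.0, -4.0]  # hypothetical summed rule scores r(w, x)

p_a = product_of_experts(softmax(base_logits), softmax(rule_scores))
p_shortcut = softmax([z + r for z, r in zip(base_logits, rule_scores)])
print(p_a, p_shortcut)
```

This is why the rules can be applied directly to the final logits: a score of -4 multiplies a token's unnormalized probability by exp(-4), and the normalization constants of the two softmaxes cancel in \(Z(x)\).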

The three rules (Table 8, Listing 1):

  1. Slowly upweight EOS. Increase the score of the end-of-sequence token linearly with response length. This favors shorter responses and prevents runaway generation. The score for EOS is: (length of response) * 15 for responses between 0 and 250 tokens, and a large constant (100) for responses exceeding 1024 tokens.

  2. Uniform token changes. Modify scores for 15 specific vocabulary items at every token decision:

      • Strongly penalize formatting tokens: <, _<, | (score -4)
      • Penalize first-person pronouns: _I, I (score -5), We (score -3)
      • Penalize question words: What (score -3), _should (score -6)
      • Boost markdown/punctuation: _*, _-, ___, _#, _##, newline \n, ! (score +1)

     The authors found that base models erroneously refuse to respond, using phrases like "I should" or "We should", and emit stray formatting tokens; these rules suppress such behavior.

  3. Encourage word diversity. Compute the set of all tokens generated so far and add a penalty (score -1.5) to generating any of them again. This reduces repetition.

Implementation. Listing 1 provides Python code implementing the forward pass. The rules operate on the final logits tensor, modifying scores for specific vocabulary indices before softmax normalization.
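For intuition, the three rules could be sketched as a score function over a toy string vocabulary. This is not the paper's Listing 1: it operates on token strings rather than a logits tensor, and `eos_slope` is a hypothetical knob whose default is only illustrative, since the exact shape of the EOS ramp is not fully specified here.

```python
# Per-token score changes for Rule 2, as listed in this summary.
UNIFORM_SCORES = {
    "<": -4.0, "_<": -4.0, "|": -4.0,                 # formatting tokens
    "_I": -5.0, "I": -5.0, "We": -3.0,                # first-person pronouns
    "What": -3.0, "_should": -6.0,                    # question words
    "_*": 1.0, "_-": 1.0, "___": 1.0, "_#": 1.0,      # markdown / punctuation
    "_##": 1.0, "\n": 1.0, "!": 1.0,
}

def rule_scores(vocab, generated, eos_token="EOS", eos_slope=0.06, max_len=1024):
    """Return r(w, x) for every token w in `vocab`, given the tokens generated so far."""
    scores = {tok: 0.0 for tok in vocab}
    n = len(generated)
    # Rule 1: EOS score grows linearly with response length (slope is an assumption),
    # then jumps to a large constant once the response is too long.
    scores[eos_token] += 100.0 if n > max_len else eos_slope * n
    # Rule 2: fixed score changes applied at every decoding step.
    for tok, delta in UNIFORM_SCORES.items():
        if tok in scores:
            scores[tok] += delta
    # Rule 3: penalize any token that has already been generated.
    for tok in set(generated):
        if tok in scores:
            scores[tok] -= 1.5
    return scores

vocab = ["EOS", "I", "the", "_*"]
scores = rule_scores(vocab, generated=["the", "the", "I"])
print(scores)
```

The resulting scores would be added to the base model's logits at each step, so a repeated pronoun such as "I" accumulates both the Rule 2 and Rule 3 penalties.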

Results (Table 4).

  • Full 3-rule adapter with Llama-2-7B: 24.4% ± 0.40% win rate against instruction-tuned Llama-2-7B
  • Base Llama-2-7B: 2.4% ± 0.14%

Ablations:

  • Remove Rule 1 (EOS): 10.4% ± 0.30%
  • Remove Rule 3 (diversity): 14.3% ± 0.58%
  • Remove Rule 2 (uniform token changes): 16.3% ± 0.25%

All three rules contribute substantially; removing any one cuts the win rate by roughly a third to more than half.

Qualitative outputs (Figure 6). The rule-based model produces coherent, somewhat helpful responses. The authors note: "we did not cherry-pick, and many other responses are more reasonable." The kickball explanation is genuinely helpful; other responses are coherent but incomplete.

Significance. The rule-based adapter achieves performance comparable to single-task finetuning without any training—just three hand-written rules. This suggests that the change from pretrained to instruction-following distributions may involve only simple adjustments to token probabilities.

4. Key Insights and Innovations

Innovation 1: Response Tuning Reveals Latent Instruction-Response Mappings

The discovery that training on responses alone—without any instruction conditioning—yields instruction following is the paper's most striking contribution. Prior work showed instruction tuning requires few examples, but assumed the instruction-response pair was essential. This paper shows the instruction is unnecessary: optimizing \(p(response | empty)\) yields a model that responds appropriately to novel instructions.

The significance is theoretical. It implies that pretrained language models encode instruction-response mappings during pretraining, but desirable responses have vanishingly low probability. The model already "knows" what a good response looks like for an instruction; it just needs to learn to generate responses in that format. The response ranking capability experiments (Table 2) support this: base models correctly rank appropriate responses higher than inappropriate ones at ~75-80% rate, matching instruction-tuned models.

This reframes instruction tuning from "teaching the model to follow instructions" to "unlocking a latent capability by teaching the model what format responses should take."

Innovation 2: Narrow-Domain Finetuning Generalizes Broadly

Single-task finetuning results challenge the intuition that training on a narrow domain (poetry, Python code) produces task-specific behavior. Instead, models trained on poetry generate recipes; models trained on math word problems answer biography questions.

The key nuance is when models adhere to finetuning behavior. The analysis in Section 5.2 and Figure 5 shows a smooth transition: for instructions similar to the finetuning domain, outputs match domain format; for dissimilar instructions, models revert to general instruction-following behavior. This is neither simple generalization (it's not just applying the finetuned behavior everywhere) nor simple preservation (the model does change from base behavior).

This has practical implications for safety: practitioners should not assume that task-specific finetuning constrains model behavior to that task. A model finetuned for code generation may still function as a general assistant on unrelated inputs.

Innovation 3: A Minimal Mechanistic Account via Rule-Based Adaptation

The rule-based adapter provides a simple mechanistic account of how small distributional changes can yield instruction following. The three rules (ending sequences, reducing repetition, and suppressing specific tokens) are not obviously tied to "following instructions." Yet they achieve a 24.4% win rate against instruction-tuned models.

This supports the hypothesis that the change from pretrained to instruction-following distributions involves simple adjustments. The rules don't encode task knowledge or instruction-response mappings; they primarily address formatting and generation quality (preventing refusals, encouraging termination, reducing repetition). The underlying "intelligence" comes entirely from the pretrained model; the rules just shape its output distribution.

The ablations show all three rules contribute meaningfully, with the EOS rule (Rule 1) the most impactful: removing it drops the win rate to 10.4%, versus 14.3% without the diversity rule (Rule 3) and 16.3% without the uniform token changes (Rule 2). The tokens suppressed by Rule 2 (pronouns, question words, formatting artifacts) are precisely what base models use when they fail to follow instructions (e.g., "I should not..." refusals).

Innovation 4: Two-Model Strategy Addresses Pretraining Contamination

Using both Llama-2-7B and OLMo-7B-Feb2024—with the latter confirmed to have no intentional instruction-tuning data in pretraining—strengthens the conclusions. If results held only for Llama-2-7B, one might suspect hidden instruction-tuning in pretraining. The fact that OLMo-7B-Feb2024 shows similar patterns (43.7% for response tuning, 30.3% for GSM finetuning) suggests the results reflect genuine generalization, not memorization.

This is a methodological contribution: future work on instruction tuning should similarly control for pretraining data composition, especially as instruction tuning becomes standard in pretraining pipelines.

5. Experimental Analysis

Evaluation Methodology

Primary benchmark. AlpacaEval 2, an LLM-as-a-judge evaluation framework. An annotator model (GPT-4) compares responses from two models on the same instruction and declares a winner. The paper uses length-controlled win rate, which accounts for length bias in the judge.

Comparison protocol. Models are compared head-to-head against a "comparable instruction-tuned model"—specifically, a LIMA-tuned version of the same base model. A 50% win rate indicates equal quality; higher is better.

Models tested.

  • Llama-2-7B (meta-llama/Llama-2-7b-hf)
  • OLMo-7B-Feb2024 (allenai/OLMo-7B-Feb2024)

Both are ~7B parameter models. Llama-2-7B is stronger; OLMo-7B-Feb2024 has transparent pretraining data.

Training datasets.

  • Instruction tuning and response tuning: LIMA (1,030 examples)
  • Single-task finetuning: MBPP (374), GSM (1,000), Recipes (1,000), Poetry (571), Chess (1,000)

Hyperparameters.

  • Learning rates: \(10^{-5}\) for Llama-2-7B, \(3 \times 10^{-6}\) for OLMo-7B-Feb2024
  • Optimizer: Adam with cosine annealing to zero, 10% warmup
  • Batch size: 64
  • Epochs: swept over {5, 7, 10, 15, 20}, selecting best on validation set
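To make the schedule concrete, here is a minimal sketch of a learning-rate curve with 10% warmup followed by cosine annealing to zero. The linear shape of the warmup, the per-step (rather than per-epoch) update, and keeping the final partial batch are assumptions; `learning_rate` is a hypothetical helper, not a library function.

```python
import math

def learning_rate(step, total_steps, peak_lr, warmup_frac=0.10):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay to zero."""
    warmup_steps = round(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup (assumption)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# LIMA with batch size 64, assuming the last partial batch is kept.
steps_per_epoch = math.ceil(1030 / 64)
total_steps = steps_per_epoch * 10  # 10 epochs, one of the swept settings
peak_lr = 1e-5                      # the Llama-2-7B setting

schedule = [learning_rate(s, total_steps, peak_lr) for s in range(total_steps + 1)]
print(steps_per_epoch, total_steps, max(schedule), schedule[-1])
```

With so few optimizer steps per epoch, the number of epochs materially changes how quickly the rate decays, which is one reason sweeping epochs is a meaningful hyperparameter here.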

Validation set. 56 instructions, hand-written and GPT-4-generated, testing various capabilities (knowledge, translation, reasoning, formatting). Used only for hyperparameter selection; final evaluation on AlpacaEval test set.

Reporting. Mean and standard deviation over 5 independent training runs with different random seeds.

Main Quantitative Results

Response tuning (Table 1).

  • Llama-2-7B response-tuned wins 43.3% ± 1.1% vs. instruction-tuned
  • Llama-2-7B base wins 2.4% ± 0.14%
  • OLMo-7B-Feb2024 response-tuned wins 43.7% ± 1.7%
  • OLMo-7B-Feb2024 base wins 4.7% ± 0.57%

Response tuning closes most of the gap from base to instruction-tuned, without seeing any instruction during training.

Single-task finetuning (Table 3). Win rates against instruction-tuned models:

| Tuning        | Llama-2-7B    | OLMo-7B-Feb2024 |
|---------------|---------------|-----------------|
| Base          | 2.4%          | 4.7%            |
| MBPP (Python) | 16.9% ± 0.70% | 10.4% ± 1.0%    |
| GSM (math)    | 23.7% ± 0.74% | 30.3% ± 0.6%    |
| Poetry        | 22.9% ± 0.97% | 21.9% ± 0.48%   |
| Recipes       | 14.6% ± 0.81% | 21.5% ± 0.86%   |
| Chess         | 2.1% ± 0.36%  | 6.3% ± 1.1%     |

GSM finetuning yields strongest generalization. Chess yields minimal improvement, attributed to low-entropy chess openings.

Rule-based adapter (Table 4).

  • Full 3-rule adapter: 24.4% ± 0.40%
  • Base: 2.4% ± 0.14%
  • Ablation without EOS rule: 10.4% ± 0.30%
  • Ablation without diversity rule: 14.3% ± 0.58%
  • Ablation without uniform token changes: 16.3% ± 0.25%

Response ranking capability (Table 2).

  • Llama-2-7B base: 80.4% of pairs satisfy the property
  • Llama-2-7B instruction-tuned: 77.4%
  • OLMo-7B-Feb2024 base: 74.5%
  • OLMo-7B-Feb2024 instruction-tuned: 74.3%

Pretrained models already distinguish correct from incorrect responses at near-identical rates to instruction-tuned models.

Ablation Studies and Robustness Checks

Formatting tag semantics (Appendix B, Table 5). Using non-semantic tags <|A|> and <|B|> instead of <|user|> and <|assistant|>:

  • Response tuning with A/B tags: 41.8% ± 0.84%
  • Response tuning with original tags: 43.3%

Tag semantics provide ~1.5 percentage points of improvement—detectable but not the primary cause.

Instruction rephrasing removal (Appendix C). After rewriting the 99 of 1,030 responses that begin by rephrasing the instruction so that they no longer do so:

  • Response tuning win rate on transformed data: 43.3% ± 5.3%
  • Original response tuning win rate: 43.7%

Instruction rephrasing is not necessary for response tuning's success.

Rule ablations (Table 4). Each rule contributes meaningfully:

  • Removing EOS rule drops performance from 24.4% to 10.4% (largest drop)
  • Removing diversity rule drops to 14.3%
  • Removing uniform token changes drops to 16.3%

The EOS rule is most critical—without it, models generate overly long, unfocused outputs.
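The relative size of each drop can be worked out directly from the reported win rates. The short calculation below uses only the Table 4 numbers quoted above; the dictionary keys are just labels.

```python
full = 24.4  # win rate (%) of the full 3-rule adapter
ablations = {
    "EOS (Rule 1)": 10.4,
    "diversity (Rule 3)": 14.3,
    "uniform token changes (Rule 2)": 16.3,
}

# Fraction of the full adapter's win rate lost when each rule is removed.
relative_drop = {name: (full - wr) / full for name, wr in ablations.items()}
most_critical = max(relative_drop, key=relative_drop.get)

for name, drop in relative_drop.items():
    print(f"without {name}: {drop:.0%} relative drop")
print("largest drop:", most_critical)
```

Removing the EOS rule loses about 57% of the full adapter's win rate, compared with about 41% for the diversity rule and about 33% for the uniform token changes.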

Assessment: Do the Experiments Support the Claims?

Claim 1: Response tuning yields instruction following. Strongly supported. Both Llama-2-7B and OLMo-7B-Feb2024 response-tuned models achieve ~43% win rates against instruction-tuned models, compared to 2-5% for base models. The gap is substantial and replicable across models and seeds.

Claim 2: Single-task finetuning yields broad instruction following. Supported for four of five datasets. GSM, Poetry, Recipes, and MBPP all yield substantial win rates (10-30%) compared to base models. Chess is an informative failure case that supports the low-entropy hypothesis. The qualitative analysis (Figure 5) shows the predicted dependence on instruction similarity.

Claim 3: Simple rules yield instruction following. Supported. The 3-rule adapter achieves 24.4% win rate—comparable to single-task finetuning—without training. Ablations confirm each rule contributes.

Claim 4: Pretrained models have response ranking capability. Supported by Table 2. Base models rank correct responses higher than incorrect ones at 75-80% rate, matching instruction-tuned models. This is a clean explanation for why response tuning works.

Potential limitations:

  1. AlpacaEval as sole benchmark. The paper relies entirely on AlpacaEval 2 for evaluation. While this is a standard benchmark, it would strengthen the paper to show results on additional benchmarks (e.g., IFEval, instruction-following evaluations from other suites) or human evaluation.

  2. Win rate interpretation. A 43% win rate against instruction-tuned models indicates response tuning is competitive but still clearly worse. The gap from 43% to 50% (equal performance) is real and consistent.

  3. Chess failure case. The explanation for Chess (low entropy of opening moves) is plausible but not rigorously tested. An ablation varying chess data entropy could strengthen this claim.

  4. Rule hyperparameters. The rules were "heuristically tuned" on the validation set, but no systematic hyperparameter search is reported. The sensitivity of results to rule weights is unknown.

  5. Limited model scale. Both models are ~7B parameters. It's unclear whether results generalize to larger models or different architectures.

The experiments are convincing for the paper's claims, which are carefully scoped: the paper shows implicit instruction tuning occurs, not that it matches explicit instruction tuning. The 43% vs. 50% gap is informative—response tuning is not a replacement for instruction tuning, but reveals what instruction tuning teaches.

6. Limitations and Trade-offs

Dependence on the LIMA Dataset Distribution

Response tuning experiments use only the LIMA dataset (1,030 examples), which is explicitly curated to be "aligned" in format and style. The authors acknowledge that LIMA responses are high-quality, helpful answers—but this raises a question: would response tuning work if trained on a different distribution of responses?

The paper does not test response tuning on datasets with different stylistic properties (e.g., more casual responses, non-English responses, or responses with different formatting conventions). If response tuning primarily teaches the model "respond in a helpful, formatted way," then training on a different response distribution might yield different instruction-following behaviors. The finding that response tuning works is specific to LIMA-style responses, and the paper does not establish how broadly this generalizes.

A related concern: LIMA is a small dataset (1,030 examples) that was itself carefully selected. The authors note that Zhou et al. (2023) chose examples to "teach nothing fundamentally new." This is ideal for isolating the effect of response tuning, but it means we don't know whether response tuning would work with noisier, larger, or more diverse response collections.

Single Benchmark Evaluation

All quantitative results rely on AlpacaEval 2, an LLM-as-a-judge evaluation. While this is a standard benchmark, it has known limitations:

  • Judge bias: GPT-4 (the annotator) may have systematic preferences for certain response styles
  • Length bias: Though the paper uses "length-controlled" win rates, this correction may not fully eliminate length preferences
  • Distribution mismatch: AlpacaEval instructions may not reflect real user queries

The paper would be stronger with additional evaluations—for example, human evaluation of response quality, or benchmarks like IFEval that test specific instruction-following capabilities. The authors note that they manually inspected outputs (Figure 2, Figure 4, Figure 6), but this is not systematic.

The Gap Between a 43% Win Rate and Parity Is Real and Unexplained

While response tuning achieves a striking 43% win rate against instruction tuning, this is still meaningfully below 50% (equal performance). The paper frames this as showing that "specifying instructions during adaptation... is not crucial in yielding a baseline level of instruction-following behavior" (Section 4.1), but does not deeply analyze what accounts for the remaining gap.

What does instruction tuning provide that response tuning cannot? The response ranking capability experiments (Table 2) show pretrained models already rank responses correctly, but this doesn't explain why instruction tuning still outperforms. Possible explanations include:

  • Instruction tuning teaches the model to condition on instruction content, not just format
  • Instruction tuning teaches domain-specific response styles for different instruction types
  • Response tuning overfits to the LIMA response distribution

The paper does not investigate these possibilities, leaving a significant aspect of the phenomenon unexplored.

Chess as an Unexplained Failure Mode

The Chess dataset yields near-zero improvement over base models (2.1% win rate for Llama-2-7B, 6.3% for OLMo-7B-Feb2024), which the authors attribute to "the very low entropy of the beginning sequences of chess games" (Section 5.1). This explanation is plausible—most chess games open with the same small set of moves—but not rigorously tested.

Alternative explanations could include:

  • Chess PGN notation is structurally very different from natural language, so the format doesn't transfer
  • The task (generating entire games from Elo ratings) is fundamentally different from instruction following
  • The dataset size (1,000 examples) is insufficient

The paper does not run ablations to distinguish these hypotheses, making the Chess failure mode an open question.
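The low-entropy explanation could be checked directly by measuring how varied the opening tokens of each finetuning dataset are. The following is a minimal sketch of that measurement, not something the paper reports: the function name `prefix_entropy` and both toy datasets are hypothetical, and a real check would tokenize the actual Chess and poetry training targets.

```python
import math
from collections import Counter


def prefix_entropy(sequences, prefix_len):
    """Shannon entropy (in bits) of the empirical distribution over
    the first `prefix_len` tokens of each sequence."""
    prefixes = [tuple(seq[:prefix_len]) for seq in sequences]
    counts = Counter(prefixes)
    total = len(prefixes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy data: chess-like games share openings,
# prose-like texts all start differently.
chess_like = [
    ["e4", "e5", "Nf3"],
    ["e4", "e5", "Nc3"],
    ["e4", "c5", "Nf3"],
    ["d4", "d5", "c4"],
]
prose_like = [
    ["Roses", "are", "red"],
    ["The", "sea", "is"],
    ["When", "I", "was"],
    ["O", "wild", "wind"],
]

print("chess-like first-token entropy:", prefix_entropy(chess_like, 1))
print("prose-like first-token entropy:", prefix_entropy(prose_like, 1))
```

If the authors' explanation holds, datasets that fail to induce instruction following should show markedly lower prefix entropy than those that succeed; comparing the Chess data against the other single-task datasets at several prefix lengths would be one way to probe this.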

Rule-Based Adapter Is Heuristically Designed

The three rules in the rule-based adapter were "heuristically tuned" on the validation set. The authors provide no systematic hyperparameter search, no analysis of sensitivity to rule weights, and no exploration of alternative rules.

Several questions remain unanswered:

  • Would additional rules improve performance further?
  • How sensitive are results to the specific weights (e.g., -5 for I vs. -3 for We)?
  • Are these rules optimal, or just the first thing that worked?

The code in Listing 1 shows the rules are simple but somewhat arbitrary—for example, the EOS rule has a bug where "there's no weight for indices 251-1023" but "in practice all the responses ended before 250." This kind of ad-hoc design limits the mechanistic interpretation: we know simple rules work, but not whether these are the right simple rules.
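To make the sensitivity question concrete, here is a minimal sketch of how three hand-written rules can be combined with a base model through a product of experts, which amounts to adding log-scores before renormalizing. This is not the paper's Listing 1. The -5 and -3 biases for "I" and "We" come from the text above; `REPEAT_PENALTY`, `EOS_SLOPE`, the linear EOS schedule, the cap at position 250, and the four-token vocabulary are all illustrative assumptions.

```python
import math

# Biases quoted in the text; the paper adjusts 15 tokens, only two are shown.
TOKEN_BIAS = {"I": -5.0, "We": -3.0}
REPEAT_PENALTY = -1.0    # hypothetical value
EOS_SLOPE = 0.05         # hypothetical value, assumed linear in position
MAX_EOS_POSITION = 250   # assumption: hold the EOS bonus constant past 250


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def rule_scores(vocab, generated, position, eos_token="<eos>"):
    """Return one additive log-score per vocabulary item."""
    scores = [0.0] * len(vocab)
    for i, tok in enumerate(vocab):
        if tok in generated:                    # rule 1: penalize repetition
            scores[i] += REPEAT_PENALTY
        if tok == eos_token:                    # rule 2: upweight EOS with length
            scores[i] += EOS_SLOPE * min(position, MAX_EOS_POSITION)
        scores[i] += TOKEN_BIAS.get(tok, 0.0)   # rule 3: fixed per-token bias
    return scores


def product_of_experts(base_logits, scores):
    """Multiply the two distributions by summing log-scores, then renormalize."""
    return softmax([b + s for b, s in zip(base_logits, scores)])


vocab = ["I", "We", "the", "<eos>"]
base_logits = [0.0, 0.0, 0.0, 0.0]  # hypothetical base model: uniform
probs = product_of_experts(
    base_logits, rule_scores(vocab, generated=["the"], position=10)
)
for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.3f}")
```

Sweeping `TOKEN_BIAS`, `REPEAT_PENALTY`, and `EOS_SLOPE` over a grid in a harness like this is the kind of sensitivity analysis the paper leaves out.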

Limited Model Scale and Architecture Diversity

All experiments use ~7B parameter models (Llama-2-7B and OLMo-7B-Feb2024). The paper does not address whether findings generalize to:

  • Larger models (70B, 100B+ scales)
  • Different architectures (e.g., decoder-only vs. encoder-decoder)
  • Models trained with different objectives (e.g., with explicit instruction-tuning data mixed into pretraining)

The authors note that larger models like Llama-2-70B are "harder to experiment with," but this leaves a significant gap: if implicit instruction tuning is a property of pretrained language models generally, it should hold across scales. If it's specific to 7B models in a certain training regime, that would change the interpretation.

The Response Ranking Capability Metric Has Limitations

The response ranking capability experiments (Table 2) show that pretrained models rank responses correctly roughly 75-80% of the time. But this metric compares an instruction's response to a random other instruction's response, not to the many possible incorrect responses that base models actually generate.

A base model might correctly rank the true response above a random alternative response, yet still generate a completely different incorrect response because the true response's absolute probability remains too low. The metric establishes that models have some signal about correct responses, but doesn't fully explain why response tuning unlocks this signal.
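The distinction between relative ranking and absolute probability can be illustrated with a small sketch. It assumes the metric asks whether a model scores the true response above a randomly drawn response to another instruction, as described above; the paper's exact formulation may differ. The names `ranking_rate` and `toy_log_prob`, the word-overlap scorer, and the three example pairs are hypothetical stand-ins for a real model's log-likelihoods.

```python
import random


def ranking_rate(pairs, log_prob, seed=0):
    """Fraction of instructions whose true response scores higher than a
    randomly chosen response belonging to a different instruction."""
    rng = random.Random(seed)
    wins = 0
    for idx, (instruction, response) in enumerate(pairs):
        other_idx = rng.choice([j for j in range(len(pairs)) if j != idx])
        distractor = pairs[other_idx][1]
        if log_prob(instruction, response) > log_prob(instruction, distractor):
            wins += 1
    return wins / len(pairs)


def toy_log_prob(instruction, response):
    """Hypothetical scorer: a low baseline plus a bonus per shared word."""
    shared = set(instruction.split()) & set(response.split())
    return -50.0 + 5.0 * len(shared)


toy_pairs = [
    ("name a prime number", "seven is a prime number"),
    ("translate cat to french", "cat in french is chat"),
    ("what color is the sky", "the sky is blue"),
]

print("ranking rate:", ranking_rate(toy_pairs, toy_log_prob))
print("best absolute score:", max(toy_log_prob(i, r) for i, r in toy_pairs))
```

In this toy setup every true response beats its distractor, yet every absolute score stays far below zero. A generation procedure that samples from the full distribution could therefore still produce something else entirely, which is the gap the metric leaves open.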

No Direct Test of the "Latent Mapping" Hypothesis

The paper hypothesizes that pretrained models encode instruction-response mappings that are "revealed" by response tuning. But this is inferred rather than directly tested. The response ranking capability provides indirect support, but does not constitute direct evidence.

A stronger test might involve:

  • Probing the pretrained model's hidden states for instruction-response alignment
  • Analyzing whether response tuning changes specific circuits or attention patterns
  • Testing whether response tuning works when the base model has been trained on data specifically designed to break instruction-response mappings

The paper does not attempt such mechanistic investigations, leaving the hypothesis at the level of inference from behavioral results.

Practical Deployment Concerns Not Addressed

The paper concludes with a practical warning: "if a practitioner deploys a language model adapted to some specific task, they should not assume that the model will exhibit that task's behavior on inputs dissimilar to those trained on" (Section 7). But it does not provide guidance on:

  • How to predict which finetuning domains will yield implicit instruction tuning
  • Whether there are techniques to prevent implicit instruction tuning when task-specific behavior is desired
  • Whether implicit instruction tuning introduces specific safety risks

These are practical questions that the paper raises but does not answer.

7. Implications and Future Directions

How This Work Changes the Landscape

This paper reframes instruction tuning as less a teaching process than an unlocking process. Prior work viewed instruction tuning as teaching models to map instructions to responses, a process that might require substantial data and careful curation. The results here suggest that pretrained models already encode much of this mapping, and that what they learn from fine-tuning is primarily response format and distribution. As noted in Section 6, this is inferred from behavioral evidence rather than tested directly.

This reframing has immediate consequences for how researchers think about alignment and instruction tuning:

The pretraining-alignment boundary is blurrier than assumed. If instruction-response mappings are acquired during pretraining, then "alignment" begins before fine-tuning. The instruction tuning phase is less about teaching new capabilities and more about shaping output distributions. This aligns with recent work showing strong in-context instruction following (Lin et al., 2024), but extends it: even training can be minimal.

Generalization from narrow finetuning is the default, not the exception. The single-task finetuning results show that models don't easily "forget" general capabilities when trained on narrow tasks. This contradicts the intuition that fine-tuning specializes models—it may instead primarily change output style while preserving general reasoning. This has implications for domain adaptation, safety fine-tuning, and catastrophic forgetting.

Output distribution shaping may be more important than task-specific training. The rule-based adapter results suggest that simple distributional changes (preventing repetition, encouraging termination, suppressing refusals) account for much of the behavioral difference between base and instruction-tuned models. This points toward a research direction focused on what distribution to target, rather than how to train toward it.

Follow-Up Research This Work Enables or Suggests

Mechanistic interpretability of implicit instruction tuning. The most pressing follow-up is understanding why these phenomena occur at the mechanistic level. Specific questions:

  • What circuits or attention heads in pretrained models encode instruction-response alignment?
  • Does response tuning modify specific layers or components, or change the model uniformly?
  • Can we identify the minimal set of parameters that need modification?

The rule-based adapter provides a behavioral-level account, but circuit-level analysis (using methods like those in Olah et al. or Elhage et al.) could reveal whether specific mechanisms implement the instruction-response mapping.

Systematic variation of response distributions. The paper tests one response distribution (LIMA). Future work should vary:

  • Response style (formal vs. casual, verbose vs. concise)
  • Response domain (code, creative writing, factual QA)
  • Response quality (helpful responses vs. harmful or misleading ones)

This would establish whether response tuning teaches a specific format or a general "instruction-following style" that transfers.

Negative control: can we prevent implicit instruction tuning? The practical warning about unintended instruction following suggests a research direction: can we finetune on narrow tasks without inducing general instruction following? Approaches might include:

  • Constraining updates to specific layers or parameters
  • Using adversarial data that explicitly breaks instruction following outside the target domain
  • Modifying the pretraining objective to separate instruction-following from general language modeling

Testing at scale. All experiments use 7B models. Testing implicit instruction tuning at 70B, 100B+, and with different architectures (e.g., MoE models, encoder-decoder models) would establish whether this is a fundamental property of language models or specific to certain regimes.

Better evaluation beyond AlpacaEval. Future work should evaluate on multiple benchmarks:

  • IFEval for instruction-following precision
  • Domain-specific evaluations (code benchmarks, math benchmarks) to test whether single-task finetuning improves or degrades performance in the target domain
  • Human evaluation for nuanced aspects of response quality that LLM judges may miss

Connections to safety and alignment. The paper raises but does not explore safety implications. Specific questions:

  • Does implicit instruction tuning transfer safety properties? If a model is safety-tuned for one domain, does it remain safe on others?
  • Can adversarial training on narrow domains "break" implicit instruction tuning?
  • Are there specific harmful behaviors that implicit instruction tuning might enable (e.g., models finetuned on benign tasks still responding to harmful instructions)?

Practical Applications and Downstream Use Cases

Efficient instruction tuning with less data. Response tuning suggests an alternative approach: if the goal is to shape output distribution rather than teach instruction-response mappings, practitioners might be able to curate response-only datasets rather than full instruction-response pairs. This could reduce data collection costs.

However, the 43% win rate indicates this is not a replacement—response tuning is a technique for understanding instruction tuning, not a practical alternative. The paper does not recommend response tuning for deployment.

Testing task-specific models for unintended behavior. The single-task finetuning results provide a concrete recommendation: any model finetuned for a specific task should be tested on out-of-distribution instructions before deployment. The model may function as a general assistant on unrelated queries, potentially violating deployment assumptions.

The authors explicitly state: "they should put it through testing and safety trials as if they were releasing a general-purpose chatbot, since the finetuning may implicitly instruction-tune the model" (Section 7). This is actionable guidance.

Rule-based output shaping as a lightweight alternative. The rule-based adapter achieves a 24.4% win rate without any training. For applications where full instruction tuning is impractical (e.g., on-device deployment with limited compute), a pretrained model combined with simple rules might provide acceptable instruction-following behavior at minimal cost.

This is most relevant for constrained deployment scenarios:

  • Edge devices where fine-tuning is infeasible
  • Rapid prototyping where training time is prohibitive
  • Applications where the full instruction-tuning pipeline is unavailable

Debugging instruction tuning datasets. Response tuning provides a diagnostic: if response tuning achieves high performance, the instruction tuning dataset may be primarily teaching response format rather than task-specific behavior. Conversely, if response tuning performs poorly, the instruction tuning dataset may contain substantial task-specific information worth studying.

When to Prefer This Approach Over Alternatives

For understanding instruction tuning mechanisms, this paper's approach—testing deficient forms of adaptation—is a powerful methodology. Future work investigating other aspects of instruction tuning (e.g., what happens when we train on mismatched instruction-response pairs) could use similar experimental paradigms.

For practical model development, the paper does not recommend response tuning or single-task finetuning as replacements for instruction tuning. The results show that these methods yield some instruction following, but still underperform explicit instruction tuning.

For safety-conscious deployment, the key takeaway is a testing requirement: any finetuned model should be evaluated on diverse, out-of-distribution instructions to assess whether implicit instruction tuning has occurred. This testing should happen before deployment, regardless of the intended use case.

For research on alignment and generalization, this paper provides evidence that generalization from narrow training is a default behavior of language models, not something that requires special conditions. Research on preventing unintended generalization—or understanding its mechanism—is a clear direction suggested by these findings.