In-context Learning and Induction Heads
ArXiv: 2209.11895
🎯 Pitch
This paper identifies and characterizes “induction heads,” a specific attention-head circuit that appears during an early training “phase change” and provides strong evidence that these circuits are a primary mechanism for transformers’ in‑context learning. By linking circuit-level structure, training dynamics, architectural interventions, and causal ablations, the work offers a plausible, mechanistic path for explaining, predicting, and potentially controlling emergent prompt-driven capabilities—an advance with both scientific and safety implications.
1. Executive Summary
This work argues that a specific attention-head “circuit” called an induction head is a major (and possibly dominant) mechanism behind transformers’ in-context learning, operationalized as lower loss on later tokens in a context compared to earlier tokens. Across many transformer sizes and architectures, the paper identifies a sharp early-training “phase change” where induction heads appear at the same time that in-context learning rapidly increases, and it supplements this with architectural interventions and (for small models) causal ablations. The significance is that if in-context learning is largely implemented by a relatively crisp, discoverable circuit, then mechanistic interpretability can plausibly explain—and potentially help control—capability emergence and safety-relevant behaviors.
2. Context and Motivation
- Problem / gap addressed
- Large language models show strong `in-context learning`: their predictions improve as they see more tokens in the prompt, enabling behaviors like task specification via prompting.
- Mechanistic interpretability wants to explain how internal computations produce such capabilities, but large models (many layers + `MLP` blocks) make direct weight-level reverse engineering difficult.
- The paper seeks an empirically grounded bridge: identify a specific internal mechanism that plausibly accounts for a large fraction of in-context learning, even in large models where full mechanistic proofs are not yet feasible.
- Why it matters
- Scientific: In-context learning is one of the defining phenomena of scaled transformers; understanding its mechanism is a step toward understanding “what transformers are doing.”
- Safety: The paper highlights that abrupt capability changes (“phase changes”) are safety-relevant because unexpected behaviors (including undesirable ones) can emerge suddenly during training, and in-context learning makes behavior more context-dependent and harder to anticipate.
- Prior approaches and shortcomings (as positioned here)
- Prior mechanistic interpretability progress (in the authors’ prior transformer-circuits framework) could give a near-complete account of small, attention-only transformers, including identifying `induction heads`.
- For large models with MLPs, the same style of mathematical decomposition is not yet sufficient to precisely “pin down circuitry,” so the paper uses indirect evidence: training dynamics, correlations, perturbations, and targeted ablations where feasible.
- How this paper positions itself
- It advances the hypothesis: induction-head-like circuits may implement most of the generic “loss decreases with token index” in-context learning signal.
- It explicitly distinguishes evidence strength:
- Small attention-only: strongest, causal + mechanistic.
- Small with MLPs: some causal evidence, but more complicated interactions.
- Large models: primarily correlational + plausibility arguments, plus continuity across scales.
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is an empirical + mechanistic interpretability study that tracks how specific attention heads emerge during training and how they relate to in-context learning measured from per-token losses.
- It solves the problem by combining (i) a concrete operational definition of `induction heads`, (ii) training-time measurements across many model snapshots, (iii) a targeted architectural change that changes when induction can form, and (iv) head ablations (for small models).
3.2 Big-picture architecture (diagram in words)
- Train multiple transformers (different depths, with/without MLPs; plus a “smeared key” variant) and save checkpoints (“snapshots”) throughout training.
- For each snapshot:
- Compute per-token losses on a fixed evaluation set (10,000 examples of length 512 tokens).
- Compute an in-context learning score from losses at two context positions (50 and 500).
- Score each attention head with head activation evaluators for `prefix matching` and `copying` on synthetic sequences, identifying `induction heads`.
- For small models, run many attention head ablations (pattern-preserving) and quantify how removing a head changes in-context learning and phase-change-related behavior.
- Run PCA on “per-token loss vectors” to visualize training trajectories in function space.
- Compare how these signals evolve over training and across architectural interventions.
3.3 Roadmap for the deep dive
- I first define `in-context learning` as measured here and the `in-context learning score`, because it is the macroscopic quantity being explained.
- Next I define `induction heads` operationally (prefix-matching + copying) and show the concrete algorithmic behavior they implement.
- Then I explain the training-time measurement pipeline: per-token losses, PCA, head evaluators, and ablations.
- Finally I explain the key intervention (`smeared key`) and how it tests a minimal-architectural-requirement prediction.
3.4 Detailed, sentence-based technical breakdown
This is an empirical-mechanistic interpretability paper whose core idea is: track the emergence of a specific attention mechanism (“induction heads”) during training and test whether it explains the sharp acquisition of in-context learning.
3.4.1 What “in-context learning” means here (the macro metric)
- The paper adopts the Kaplan et al.-style framing: in-context learning is decreasing next-token prediction loss at larger token indices within the same context.
- It summarizes this with a single heuristic scalar:
- Let \(L(i)\) be the average loss at context position \(i\) (averaged over dataset examples).
- The in-context learning score is: \[ \text{ICLScore} = L(500) - L(50). \]
- A more negative value means the model predicts token 500 better than token 50 by a larger margin (i.e., more in-context learning by this definition).
- The paper emphasizes the choice of 50 and 500 is somewhat arbitrary but reports that changing indices does not alter the core qualitative conclusions (discussed again under “Seemingly Constant In-Context Learning Score”).
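To make the metric concrete, the sketch below computes the score from a matrix of per-token losses. The function name `icl_score`, the 0-based treatment of positions, and the synthetic loss curve are assumptions for illustration only; the paper's own evaluation code is not described in this summary.

```python
import numpy as np

def icl_score(per_token_loss: np.ndarray, early: int = 50, late: int = 500) -> float:
    """Return L(late) - L(early), where L(i) is the mean loss at position i.

    per_token_loss has shape (n_examples, context_length), in nats.
    Positions are used as 0-based array indices (an assumption of this sketch).
    """
    mean_by_position = per_token_loss.mean(axis=0)
    return float(mean_by_position[late] - mean_by_position[early])

# Synthetic stand-in for real evaluation losses (not data from the paper):
# the loss decays with the log of the position, plus a little noise.
rng = np.random.default_rng(0)
positions = np.arange(1, 513)
base = 4.0 - 0.2 * np.log(positions)
losses = base + 0.01 * rng.standard_normal((1000, 512))

score = icl_score(losses)
print(f"ICLScore = {score:.3f} nats")
```

A negative result means later tokens are predicted better than earlier ones, which is the sign convention used above.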
3.4.2 What an “induction head” is (operational definition + algorithm)
- An `induction head` is defined empirically via behavior on a repeated random sequence, capturing the pattern:
- Given a context like \([A][B]\dots[A]\), predict \([B]\).
- Formally, the paper uses two required empirical properties on repeated random token sequences:
- Prefix matching: the head attends back to earlier tokens that were preceded by the current (or recent) token(s), i.e., it attends to the token that “induction would suggest comes next.”
- Copying: the head’s output increases the logit for the attended-to token (so attending to an earlier \([B]\) makes \([B]\) more likely now).
- Intuition in plain terms:
- The head implements a small “pattern completion” algorithm: find where the current token appeared before; then copy what came after it last time.
Micro-example walk-through (single input → output)
- Suppose the sequence contains: “… `cat` `sat` … `cat`”.
- When predicting the token after the second `cat`, an induction head tries to:
- Find the earlier `cat` (prefix match).
- Attend to the token right after it (the earlier `sat`).
- Push up the probability of `sat` now (copying via logits).
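The walk-through above can be written as a hard-lookup version of the same rule. This is only an idealized sketch: a real induction head uses soft attention and logit contributions rather than an exact search, and `induction_predict` is a hypothetical helper name.

```python
from typing import Optional, Sequence

def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Copy whatever followed the most recent earlier occurrence of the
    current (last) token; return None if the token has not appeared before."""
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest (prefix match).
    for j in range(len(tokens) - 2, -1, -1):
        if tokens[j] == current:
            return tokens[j + 1]  # the token that came next last time (copy)
    return None

context = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
print(induction_predict(context))
```

For this context the lookup finds the earlier `cat` and returns `sat`.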
3.4.3 Mechanistic picture from small attention-only models (how induction is implemented)
- The paper’s mechanistic account (summarized here; full reverse engineering is referenced as coming from the authors’ earlier transformer-circuits work) describes induction as emerging from a two-head, two-layer circuit:
- A previous-token head in an earlier layer writes information from token \(i-1\) into token \(i\) (a “key shifting” enabling step).
- A later-layer induction head uses that written information to compute attention scores so that, at the current position, it can attend to positions whose preceding token matches the current token.
- The decomposition language used:
- `OV circuit` (Output–Value path) explains copying-like behavior: the head’s output is aligned with token embeddings so it can raise logits of specific tokens.
- `QK circuit` (Query–Key path) explains attention patterns: induction relies on `K-composition` (and sometimes `Q-composition`) so that the key at an attended position depends on the token before it, not only the attended token itself.
- A key mechanistic claim in this summary is:
- Simple induction heads can have a dominant QK term corresponding to pure K-composition with a previous-token head, yielding attention that compares the current token to earlier positions’ previous tokens.
- The paper also notes a different “pointer-arithmetic” mechanism observed in GPT-2 (using positional embeddings + Q-composition), but states that this mechanism is not available to the specific models studied here because they do not add positional information into the residual stream.
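A toy numerical illustration of the K-composition idea described above, assuming one-hot token vectors and hand-built (not learned) weights. The names `induction_attention` and `sharpness` are hypothetical; the sketch only shows why shifting previous-token information into the keys makes the query "current token" match the position after its earlier occurrence.

```python
import numpy as np

def induction_attention(token_ids, vocab_size, sharpness=10.0):
    """Return a (seq, seq) causal attention pattern for an idealized induction head."""
    x = np.eye(vocab_size)[token_ids]        # (seq, vocab) one-hot token vectors
    prev = np.zeros_like(x)
    prev[1:] = x[:-1]                        # previous-token head: position j carries token j-1
    scores = sharpness * (x @ prev.T)        # query = current token, key = shifted token
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf) # causal mask: only positions j <= i
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = [3, 1, 4, 1, 5, 9, 3]               # token 3 appears at positions 0 and 6
attn = induction_attention(tokens, vocab_size=10)
target = int(attn[-1].argmax())
print(target, tokens[target])                # position after the earlier 3, and its token
```

The last position attends almost entirely to position 1, the token that followed the earlier occurrence of token 3.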
3.4.4 The “phase change” measurement: what happens when (training-time story)
- Across many models with more than one layer, the paper observes a narrow early-training window where several things change abruptly and together:
- In-context learning (ICLScore) jumps sharply.
- Induction heads appear abruptly (by the prefix-matching evaluator).
- The training loss curve shows a visible “bump” (a short interval of steeper improvement; described as the only place where the loss is not convex).
- PCA trajectories of per-token losses “pivot” during the same window.
- Reported magnitudes/timing (from the provided text):
- For many models, in-context learning develops abruptly around \(2.5\times 10^9\) to \(5\times 10^9\) training tokens early in training.
- Before the window: < 0.15 nats of in-context learning (by the chosen metric).
- After the window: about 0.4 nats, then roughly constant thereafter (and unexpectedly similar across model sizes except 1-layer models).
- For small models, the phase change is reported around \(1\times 10^9\) to \(3\times 10^9\) tokens.
- The paper adds qualitative support by looking at which tokens’ log-likelihoods improve during the phase change:
- Tokens in repeated sequences become better predicted on repetition.
- Tokens that “break” a previously seen local continuation can become worse predicted (consistent with a mechanism that expects repetition).
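One simple way to locate such a window in a series of snapshot scores is to look for the steepest drop between adjacent snapshots. This is a hypothetical heuristic, not the paper's procedure, and the curve below is synthetic (its plateau values merely echo the rough magnitudes quoted above).

```python
import numpy as np

def steepest_drop_window(tokens_seen, icl_scores):
    """Return (start, end) token counts of the adjacent-snapshot interval
    with the fastest decrease in ICL score per training token."""
    tokens_seen = np.asarray(tokens_seen, dtype=float)
    icl_scores = np.asarray(icl_scores, dtype=float)
    slopes = np.diff(icl_scores) / np.diff(tokens_seen)
    k = int(np.argmin(slopes))               # most negative slope = fastest improvement
    return float(tokens_seen[k]), float(tokens_seen[k + 1])

# Synthetic score curve (not paper data): a smooth step centered at 3.2e9 tokens.
tokens = np.linspace(0.5e9, 10e9, 20)
scores = -0.1 - 0.3 / (1.0 + np.exp(-(tokens - 3.2e9) / 0.3e9))

window = steepest_drop_window(tokens, scores)
print(f"steepest change between {window[0]:.2e} and {window[1]:.2e} tokens")
```

With coarse snapshot spacing the window can only be localized to one snapshot interval, which is one reason dense early checkpoints matter.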
3.4.5 Data pipeline diagram in words (what happens first, second, third…)
- Train model families and save snapshots
- The study analyzes 34 decoder-only Transformer language models with multiple saved checkpoints over training.
- There are several “series”:
- `Small attention-only models`: 1–6 layers, no MLPs.
- `Small models with MLPs`: 1–6 layers, with attention + MLP.
- `Full-scale models`: 4–40 layers, with MLPs, ranging 13M → 13B non-embedding parameters.
- `Smeared key models`: 1–2 layers, modified to make induction easier to express.
- Compute per-token losses (unaltered models) on a fixed eval set
- For each snapshot, run on 10,000 randomly selected examples from the dataset, each example length 512 tokens, and record per-token losses.
- Two derived views are used:
- “Random token per example”: pick one consistent random position per example to produce a length-10,000 loss vector per snapshot.
- “Context index average”: average losses by position to get a length-512 vector per snapshot.
- From these, compute:
- Loss curves over training.
- 2D heatmaps of loss vs (training time, token index).
- A derivative view: partial derivative of loss with respect to \(\log(\text{context index})\), interpreted as “in-context learning per \(\epsilon\%\) more context.”
- Compute PCA on “per-token loss vectors”
- Concatenate many snapshots’ loss vectors and apply PCA.
- Plot training trajectories of different models in the first two PCs to visualize macro training dynamics and detect pivots around the phase change.
- Score attention heads with empirical “head activation evaluators”
- For each attention head, compute three heuristic scores:
- `Copying`: whether the head’s direct residual-stream contribution raises logits of the attended-to token.
- `Prefix matching`: whether, on repeated random sequences, the head attends to earlier positions where the prefix matches.
- `Previous token attention`: whether the head attends to token \(i-1\) from position \(i\).
- Induction heads are identified as heads that score as both `prefix matching` and `copying` under the empirical definition.
Details of the evaluators (as provided):
- Copying evaluator: generate a sequence of 25 random tokens (excluding most/least common tokens). Compute the head’s contribution to the residual stream, map it through the unembedding to logits on the “direct path,” normalize the logits by subtracting their mean, apply a ReLU to focus on raised logits, and compute the fraction of raised-logit mass assigned to the attended token (scaled into \([-1,1]\)).
- Prefix matching evaluator: generate 25 random tokens, repeat them 4 times, prepend a start-of-sequence token, compute the attention pattern, and score the attention mass from each token to positions whose preceding token matches the current token in earlier repeats (that is, to the tokens that followed its earlier occurrences).
- Previous-token evaluator: on a real example from the training distribution, average the attention mass on the previous-token off-diagonal (from \(i\) to \(i-1\)).
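A minimal sketch of a prefix-matching score along the lines described above, assuming the attention pattern is already available as a `(seq, seq)` array. The helper names, the exclusion of positions with no valid target, and the simple averaging are assumptions of this sketch; the paper's exact normalization is not given in this summary.

```python
import numpy as np

def induction_targets(tokens):
    """target[i, j] is True when j < i and tokens[j - 1] == tokens[i]."""
    tokens = np.asarray(tokens)
    n = len(tokens)
    target = np.zeros((n, n), dtype=bool)
    target[:, 1:] = tokens[:, None] == tokens[None, :-1]
    return target & np.tril(np.ones((n, n), dtype=bool), k=-1)

def prefix_matching_score(attn, tokens):
    """Mean attention mass on induction targets, over positions that have one."""
    target = induction_targets(tokens)
    has_target = target.any(axis=1)
    mass = (attn * target).sum(axis=1)
    return float(mass[has_target].mean())

# Repeated random sequence: 25 distinct tokens, repeated 4 times, start token 0.
rng = np.random.default_rng(0)
block = rng.choice(np.arange(1, 1000), size=25, replace=False)
tokens = np.concatenate([[0], np.tile(block, 4)])
n = len(tokens)

# Two reference patterns: an idealized induction head and uniform causal attention.
target = induction_targets(tokens)
rows = target.any(axis=1)
ideal = np.zeros((n, n))
ideal[:, 0] = 1.0                                  # default: attend to the start token
ideal[rows] = target[rows] / target[rows].sum(axis=1, keepdims=True)
uniform = np.tril(np.ones((n, n)))
uniform /= uniform.sum(axis=1, keepdims=True)

ideal_score = prefix_matching_score(ideal, tokens)
uniform_score = prefix_matching_score(uniform, tokens)
print(ideal_score, uniform_score)
```

The idealized pattern scores 1.0 by construction, while uniform attention scores close to zero, which gives a rough sense of the scale such a score lives on.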
- (Small models only) run attention head ablations
- The paper performs pattern-preserving ablations:
- Run 1: record all attention patterns.
- Run 2: zero out one head’s result vector (its contribution to the residual stream) while forcing all attention patterns to match Run 1.
- This isolates the effect of removing the head’s value/output contribution while keeping the “routing” (attention patterns) fixed.
- The paper reports more than 50,000 attention head ablations across the analyzed models (as described in the introduction).
- Compare signals across time, architectures, and interventions
- The central comparisons are:
- Co-occurrence of induction head formation and in-context learning improvement (Argument 1).
- Co-perturbation when architecture is changed to make induction easier (Argument 2).
- Direct causal impact via ablation (Argument 3).
- Case studies showing induction heads doing more abstract “fuzzy” matching behaviors (Argument 4).
- Mechanistic plausibility that the known induction mechanism generalizes (Argument 5).
- Continuity across scales as an argument for similar mechanisms in large models (Argument 6).
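The PCA step of the pipeline above can be sketched with a plain SVD. The function name `pca_project` and the synthetic "pivoting" trajectory are assumptions for illustration; the snapshot vectors here are random stand-ins, not per-token losses from any model.

```python
import numpy as np

def pca_project(loss_vectors, n_components=2):
    """Project snapshot loss vectors (n_snapshots, n_features) onto the top PCs."""
    x = np.asarray(loss_vectors, dtype=float)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Synthetic trajectory (not paper data): losses move along one direction early
# in training, then pivot to a second direction after 30% of the snapshots.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
dir_a, dir_b = rng.standard_normal((2, 512))
snapshots = (5.0
             - np.outer(np.minimum(t, 0.3), dir_a)
             - np.outer(np.maximum(t - 0.3, 0.0), dir_b))

trajectory = pca_project(snapshots)       # shape (50, 2): one 2D point per snapshot
print(trajectory[:3])
```

Plotting the two columns of `trajectory` against each other would show a path with a corner at the pivot, which is the kind of signature the paper looks for around the phase change.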
3.4.6 Architectural intervention: the “smeared key” modification
- Motivation: a standard induction head needs two layers because the attention score at an attended position must depend on the token before it; composing attention heads across layers enables that.
- The paper defines a 1-parameter-per-head modification that mixes the key of the current token with the key of the previous token, making induction-like behavior expressible even in one-layer models.
- The modification is: \[ k_j = \sigma(\alpha)\,k_j + (1-\sigma(\alpha))\,k_{j-1}, \] where \(\alpha\) is a trainable scalar per head and \(\sigma(\cdot)\) maps to \([0,1]\).
- (The paper notes no interpolation happens for the first token.)
- Reported effect: with this change,
- One-layer models (which otherwise do not show the phase change) now develop in-context learning.
- Two-layer and larger models develop in-context learning earlier.
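A minimal sketch of the key interpolation for a single head, assuming keys are stored as a `(seq, d_head)` array and that \(\sigma\) is the logistic sigmoid. The function name `smear_keys` is hypothetical, and training of \(\alpha\) is not shown.

```python
import numpy as np

def smear_keys(keys, alpha):
    """Mix each key with the previous position's key.

    keys: (seq, d_head) array; alpha: scalar parameter for this head.
    The first position has no predecessor and is left unchanged.
    """
    gate = 1.0 / (1.0 + np.exp(-alpha))      # sigma(alpha), assumed to be a sigmoid
    smeared = keys.copy()
    smeared[1:] = gate * keys[1:] + (1.0 - gate) * keys[:-1]
    return smeared

keys = np.arange(12, dtype=float).reshape(4, 3)
half = smear_keys(keys, alpha=0.0)           # gate = 0.5: equal mix of current and previous
mostly_current = smear_keys(keys, alpha=20.0)
print(half)
```

When the gate is small, the key at position \(j\) mostly describes token \(j-1\), so a single head can match the current token against earlier positions' previous tokens without a separate previous-token head.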
3.4.7 Core configurations / hyperparameters (as provided; missing items noted)

Small models (attention-only and with MLPs)
- Architecture:
  - Depth: 1–6 layers.
  - Context window: 8192 tokens.
  - Vocabulary size: \(2^{16}\) tokens.
  - Model dimension: \(d_\text{model}=768\).
  - Attention heads per layer: 12 (regardless of depth).
  - Positional embeddings: “a variant on standard positional embeddings (similar to Press et al.).”
- Training:
  - Trained for 10,000 steps \(\approx\) 10 billion tokens.
  - Saved 200 snapshots, one every 50 steps.
  - Learning rate warm-up occurs over the first \(1.5\times 10^9\) tokens.
  - Weight decay is reduced at 4750 steps \(\approx\) 5 billion tokens, and this is stated to be after the phase change (so not the cause).
- Missing from provided excerpt (so cannot be specified): optimizer name/settings, base learning rate values, batch size, exact LR schedule beyond warm-up, hardware/compute.

Full-scale models
- Architecture:
  - Depths / non-embedding parameter counts:
    - 4L: 13M
    - 6L: 42M
    - 10L: 200M
    - 16L: 810M
    - 24L: 2.7B
    - 40L: 13B
  - Context window: 8192 tokens.
  - Vocabulary size: \(2^{16}\) tokens.
  - Model dimension rule: \(d_\text{model} = 128 \cdot n_\text{layer}\).
  - Attention heads per layer / head dimension:
    - 4L: 8 heads, \(d_\text{head}=64\)
    - 6L: 12 heads, \(d_\text{head}=64\)
    - 10L: 20 heads, \(d_\text{head}=64\)
    - 16L: 32 heads, \(d_\text{head}=64\)
    - 24L: 48 heads, \(d_\text{head}=64\)
    - 40L: 40 heads, \(d_\text{head}=128\)
  - Contains both dense and local attention heads (local heads attend only within a fixed window).
- Training snapshots:
  - Saved at exponentially increasing steps: \(2^5\) through \(2^{17}\), plus one or two final saves: 15 snapshots (40L has 14).
  - Token-per-step changes after \(2^{11}\) steps affect later-token counts for 24L and 40L.
- Missing from provided excerpt: optimizer/LR/batch size/compute/hardware.

Evaluation for analyses (common)
- Analyses use 10,000 evaluation sequences of length 512 tokens for per-token loss measurements.
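The figures listed above can be cross-checked with a little arithmetic. The sketch below only recombines numbers from this section; the variable names are made up, and the tokens-per-step value is approximate because the source gives the token total as roughly 10 billion.

```python
# Small models: steps, tokens, and snapshot cadence as listed above.
SMALL_TOTAL_STEPS = 10_000
SMALL_TOTAL_TOKENS = 10e9            # "approximately 10 billion" in the source
SNAPSHOT_INTERVAL = 50

tokens_per_step = SMALL_TOTAL_TOKENS / SMALL_TOTAL_STEPS      # about 1e6 tokens per step
n_small_snapshots = SMALL_TOTAL_STEPS // SNAPSHOT_INTERVAL    # matches the 200 snapshots
weight_decay_change_tokens = 4750 * tokens_per_step           # about 4.75e9, i.e. ~5B

# Full-scale models: depth -> (heads per layer, head dimension).
FULL_SCALE = {4: (8, 64), 6: (12, 64), 10: (20, 64),
              16: (32, 64), 24: (48, 64), 40: (40, 128)}
d_model_by_depth = {n_layer: 128 * n_layer for n_layer in FULL_SCALE}
heads_times_head_dim = {n: heads * d_head for n, (heads, d_head) in FULL_SCALE.items()}

# Power-of-two snapshot steps 2^5 .. 2^17, before the one or two final saves.
power_of_two_steps = [2 ** k for k in range(5, 18)]

print(tokens_per_step, n_small_snapshots, len(power_of_two_steps))
print(d_model_by_depth == heads_times_head_dim)
```

With these numbers, heads times head dimension equals \(128 \cdot n_\text{layer}\) for every listed depth, and the 13 power-of-two steps plus one or two final saves account for the 14 to 15 snapshots.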
4. Key Insights and Innovations
- (1) A training-time “phase change” tightly links macro behavior to a micro circuit
- Novelty: Instead of only analyzing a trained model, the paper tracks formation during training and finds a specific early window where in-context learning capacity appears abruptly alongside induction heads, a loss-curve bump, and a PCA trajectory pivot (Argument 1).
- Significance: This suggests a single, large-scale internal reorganization event that is visible even at the training-loss scale, unlike many “microscopic” emergent behaviors.
- (2) An architectural minimality test via the “smeared key” intervention
- Novelty: The paper changes the model so that a one-layer transformer can express induction-like behavior by mixing previous-token information into keys (Argument 2).
- Significance: The matching shift (one-layer models now exhibit the phase change and in-context learning) supports the claim that induction-like mechanisms are sufficient (and perhaps close to minimal) for large in-context learning gains.
- (3) Strong causal evidence via head ablations in small models
- Novelty: The study performs large-scale, pattern-preserving head ablations and shows that knocking out induction heads sharply reduces in-context learning in small attention-only models (Argument 3).
- Significance: This moves beyond correlational interpretability: in small attention-only settings, induction heads are not just “present,” they are functionally responsible for most of the measured in-context learning.
- (4) Induction heads can support “fuzzy” / abstract in-context behaviors
- Novelty: Heads that meet the strict “copy random sequences via prefix matching” definition also appear to participate in more abstract behaviors, including translation-like and pattern-matching tasks (Argument 4).
- Significance: This makes it plausible that induction is not only literal copying but can become a nearest-neighbor-in-context mechanism over abstract representations.
- (5) Continuity across scale as an interpretability strategy
- Novelty: The paper explicitly frames an inference method: prove mechanisms in small models, then use continuity in observable signatures (phase change timing, induction-head scores, etc.) to argue similar mechanisms may operate in large models (Argument 6).
- Significance: This is a pragmatic approach to interpretability when full mechanistic proofs do not scale yet.
5. Experimental Analysis
Evaluation methodology (what is measured and how)
- Models analyzed
- 34 decoder-only transformers, across:
- small attention-only (1–6 layers),
- small with MLPs (1–6 layers),
- full-scale with MLPs (4–40 layers; 13M–13B non-embedding parameters),
- smeared-key variants (1–2 layers).
- Datasets
- Small + smeared-key models: an earlier version of the dataset described in Askell et al., consisting of filtered Common Crawl + internet books + other sources, including ~10% Python code.
- Full-scale: an improved version of roughly the same distribution.
- A control set of small models is trained on an alternate dataset of internet books only to test dataset sensitivity.
- The paper states models do not see the same training data twice.
- Primary metrics
- `ICLScore = L(500) - L(50)` (loss difference in nats).
- Head-level metrics: `prefix matching` score, `copying` score, and `previous token` score from head activation evaluators.
- Training dynamics views: loss curves, per-token loss heatmaps, derivative of loss w.r.t. \(\log(\text{context index})\), PCA trajectories.
- Ablation impact: change in ICLScore; and “before-and-after vector” attribution measuring similarity between ablation-induced behavior change and the phase-change behavior shift.
- Baselines / comparisons
- Architectural: 1-layer vs ≥2-layer; vanilla vs smeared-key.
- Across sizes: small vs full-scale, attention-only vs MLP.
- Across training time: before vs after phase change.
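Among the training-dynamics views listed under the primary metrics, the derivative of loss with respect to \(\log(\text{context index})\) is easy to sketch numerically. The function name and the 1-based position convention are assumptions, and the loss curve is synthetic rather than taken from the paper.

```python
import numpy as np

def dloss_dlog_index(mean_loss_by_position):
    """Finite-difference derivative of loss with respect to log(context index).

    mean_loss_by_position[i] is the average loss at position i + 1
    (positions are treated as 1-based so that log is defined everywhere).
    """
    loss = np.asarray(mean_loss_by_position, dtype=float)
    positions = np.arange(1, len(loss) + 1)
    # np.gradient accepts the (non-uniform) coordinates as its second argument.
    return np.gradient(loss, np.log(positions))

# Synthetic curve: loss falls by 0.2 nats per unit of log position.
positions = np.arange(1, 513)
synthetic_loss = 4.0 - 0.2 * np.log(positions)

slope = dloss_dlog_index(synthetic_loss)
print(slope[:5])
```

More negative values indicate that additional context is still reducing loss at that position; a value near zero means extra context has stopped helping.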
Main quantitative results (numbers explicitly present in provided text)
- In-context learning jumps sharply in a narrow training window (Argument 1)
- Before the phase change: < 0.15 nats of in-context learning (by their metric).
- After: about 0.4 nats, remaining roughly constant for the rest of training and largely similar across many model sizes (except 1-layer models).
- Timing: around \(2.5\times 10^9\)–\(5\times 10^9\) training tokens for “models of every size” (and \(1\times 10^9\)–\(3\times 10^9\) tokens for the small-model setup as described in Model Details).
- Induction heads form in the same window (Argument 1)
- Induction-head scores (prefix matching) rise abruptly during the same phase-change window where ICLScore improves.
- One-layer models are described as not forming induction heads and not developing substantial in-context learning.
- Smeared-key intervention shifts the phenomenon (Argument 2)
- One-layer smeared-key models exhibit the phase change / in-context learning improvement, whereas one-layer vanilla models do not.
- Two-layer (and larger) smeared-key models show the in-context learning increase earlier than their vanilla counterparts.
- Ablations show strong causal contribution in small models (Argument 3)
- Pattern-preserving head ablations in small models show that removing induction heads greatly decreases in-context learning; the paper summarizes that “almost all the in-context learning in small attention-only models appears to come from these induction heads.”
- Examples of abstract behavior (Argument 4)
- On a synthetic pattern-matching task, one showcased head allocates about 65% of its attention from “:” tokens to the correct earlier line positions (for the correct category).
- For three showcased heads (in a 40-layer, 13B-parameter model), the table reports:
- Literal copying head (layer 21/40): copying 0.89, prefix matching 0.75.
- Translation head (layer 7/40): copying 0.20, prefix matching 0.85.
- Pattern-matching head (layer 8/40): copying 0.69, prefix matching 0.94.
- The translation example notes attention is “off-diagonal” and meanders due to different word order/token lengths; logit attribution is not perfectly sharp, suggesting downstream layers may be needed.
- Replication snippets (external comments included in the document)
- One replication reports that replacing a previous-token head’s attention scores with ideal previous-token attention recovers 99% of the loss difference, and a simple induction-head attention approximation recovers about 65%, with an additional 10% when including a longer pattern \([A][B][C]\dots[A][B]\to[C]\).
Do the experiments support the claims?
- Strongly supported (within small attention-only models)
- The combination of (i) a concrete operational definition, (ii) training-time co-occurrence, and (iii) ablation-caused drops in ICLScore supports the claim that induction heads are a primary driver of the measured in-context learning in this restricted regime.
- Moderately supported (small models with MLPs)
- The paper reports that causal ablation evidence exists, but also explains why interpretation is harder: MLP–attention interactions mean head ablation effects are not a clean linear decomposition of behavior.
- Suggestive but not conclusive (large models with MLPs)
- Evidence is mainly correlational (co-occurrence during phase change; continuity across scale) plus plausibility via examples of abstract heads.
- The paper itself flags potential confounds, including low time resolution in large-model checkpoints (14–15 snapshots).
Ablations / robustness checks / failure cases
- Ablations
- Pattern-preserving ablations are central, but only for small models due to computational cost scaling (the paper notes superlinear scaling with size and the evaluation set size per checkpoint).
- Robustness / checks
- The paper checks that the phase change window does not coincide with scheduled LR/WD changes (e.g., weight decay reduction at ~5B tokens is after the small-model phase change).
- It trains some small models on a different dataset (books-only) and reports the phase change still appears similarly.
- Curiosities / anomalies explicitly noted
- Constant ICLScore across sizes post-phase-change is described as surprising.
- A loss-derivative “order inversion” across model sizes around the phase change is noted.
- Several model-specific anomalies are listed (e.g., a non-induction head in a 6-layer attention-only model whose ablation mimics reversing the phase change; “loss spikes”; and “anti-copying prefix-search” heads in larger models).
6. Limitations and Trade-offs
- Correlation vs causation for large models
- For large models with MLPs, the paper does not provide direct ablations, and explicitly frames much of the evidence as correlational and potentially confounded.
- Time-resolution limitations
- Full-scale models have only 14–15 saved snapshots, which weakens confidence about the precise timing/shape of co-occurrence compared to the small-model setup with 200 snapshots.
- Metric choice and interpretational ambiguity
- The core in-context learning metric is a difference between two token indices (50 and 500, in a 512-token analysis), which could miss changes in mechanisms that preserve that particular difference.
- The paper acknowledges that a constant ICLScore does not guarantee the underlying mechanisms remain constant over later training.
- Ablation interpretability in MLP models
- Even where ablations are possible, the paper notes that attention heads can interact with MLP layers in ways that make marginal ablation effects harder to interpret as “true importance.”
- Redundancy and LayerNorm rescaling can mask individual head importance.
- Operational definition may miss non-standard induction
- Induction heads are defined narrowly via behavior on repeated random sequences; mechanisms that support in-context learning but do not score highly on these tests could be undercounted.
- Compute constraints
- The study notes ablations are expensive (evaluating 10,000 examples per ablation per checkpoint), limiting feasibility at scale.
7. Implications and Future Directions
- How this changes the landscape
- It offers a concrete candidate mechanism (`induction heads`) for a broad, important transformer capability (in-context learning) and ties it to a visible training-time phase transition.
- It suggests a methodological template for interpretability at scale: prove circuits in small models, then use training dynamics + continuity + targeted interventions to build a case in larger models.
- Follow-up research directions suggested by the provided content
- Causal tests in larger models: develop scalable ablation or intervention techniques to move from correlational to causal claims in MLP-containing, large-scale transformers.
- Mechanistic understanding with MLPs: extend the transformer-circuits decomposition framework to include MLP layers well enough to reverse engineer “abstract induction” heads.
- Explain the constant ICLScore phenomenon: why does the measured in-context learning improvement saturate at ~0.4 nats across sizes, and what does that imply about difficulty/availability of context-derived information?
- Characterize “anti-copying prefix-search” heads and other anomalies to refine the taxonomy of attention-head roles.
- Learning-dynamics bridge: use the phase change as a “Rosetta stone” connecting mechanistic circuits to macroscopic training dynamics and scaling behaviors.
- Practical applications / downstream use cases (as implied here)
- If induction heads are a main in-context learning mechanism, then:
- Interpretability tooling can focus on detecting, tracking, and intervening on induction circuits to understand prompt sensitivity and model generalization.
- Safety work might monitor for abrupt emergence of induction-like circuits as an early warning signal of capability changes during training.
- Repro/Integration Guidance
- When trying to reproduce or extend this line of work (based on the described methods):
- Prefer collecting many snapshots early in training (high time resolution) because the key event is a narrow early window.
- Use a fixed evaluation set and compute per-token losses by position to see in-context learning directly as a function of context index.
- Implement the paper’s empirical head evaluators (`prefix matching` on repeated random sequences + `copying` via direct logit effect) to identify induction heads without requiring full mechanistic reverse engineering.
- For causal attribution in small models, use pattern-preserving head ablations (freeze attention patterns, zero out head outputs) to better isolate the head’s value/output contribution.
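The last item can be illustrated on a toy model. The sketch below assumes a two-layer, attention-only network with random weights, no LayerNorm, and no MLPs; the function names and dictionary layout are hypothetical. It contrasts a pattern-preserving ablation (attention patterns replayed from a clean run) with a naive ablation (patterns recomputed after the head is removed).

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax restricted to positions j <= i."""
    n = scores.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def forward(x, layers, ablate=None, frozen_patterns=None):
    """Run the toy model. `ablate` is a (layer, head) pair whose output is zeroed;
    `frozen_patterns` replays attention patterns recorded from an earlier run."""
    patterns = []
    for li, heads in enumerate(layers):
        update = np.zeros_like(x)
        layer_patterns = []
        for hi, head in enumerate(heads):
            if frozen_patterns is None:
                q, k = x @ head["W_q"], x @ head["W_k"]
                pattern = causal_softmax(q @ k.T / np.sqrt(q.shape[-1]))
            else:
                pattern = frozen_patterns[li][hi]
            layer_patterns.append(pattern)
            if ablate != (li, hi):               # skip the ablated head's output
                update = update + pattern @ (x @ head["W_v"]) @ head["W_o"]
        x = x + update                           # residual stream update
        patterns.append(layer_patterns)
    return x, patterns

def random_model(rng, n_layers=2, n_heads=2, d_model=16, d_head=4):
    def head():
        return {
            "W_q": rng.standard_normal((d_model, d_head)) / np.sqrt(d_model),
            "W_k": rng.standard_normal((d_model, d_head)) / np.sqrt(d_model),
            "W_v": rng.standard_normal((d_model, d_head)) / np.sqrt(d_model),
            "W_o": rng.standard_normal((d_head, d_model)) / np.sqrt(d_head),
        }
    return [[head() for _ in range(n_heads)] for _ in range(n_layers)]

rng = np.random.default_rng(0)
layers = random_model(rng)
x = rng.standard_normal((8, 16))                 # 8 positions, d_model = 16

# Run 1: clean forward pass, recording every attention pattern.
clean_out, clean_patterns = forward(x, layers)
# Run 2: zero head (0, 0) while forcing all patterns to match Run 1.
ablated_out, ablated_patterns = forward(x, layers, ablate=(0, 0),
                                        frozen_patterns=clean_patterns)
# For contrast: naive ablation lets later-layer patterns shift.
naive_out, naive_patterns = forward(x, layers, ablate=(0, 0))
```

In the naive run, the second layer's attention patterns change because its inputs changed; freezing the patterns removes that indirect effect, so the difference between `ablated_out` and `clean_out` reflects only the missing head output flowing through fixed routing.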