End-to-End Test-Time Training for Long Context
ArXiv: 2512.23675
🎯 Pitch
The paper introduces TTT-E2E, an end-to-end test-time training method that compresses long context into a Transformer’s weights by running next-token prediction updates at inference and meta-learning the model’s initialization so it performs well after those updates. This yields full-attention-level scaling with context length while keeping constant per-token inference cost—enabling practical, much faster processing of very long contexts (e.g., 2.7× faster prefill at 128K) without bespoke architectures.
1. Executive Summary (2-3 sentences)
This work introduces TTT-E2E, a long-context language modeling method that treats long-context use as a continual learning problem: the model continues learning at inference time by running next-token prediction updates on the provided context, thereby compressing context into its weights. The method keeps a standard Transformer backbone with sliding-window attention (SWA), but adds an end-to-end meta-learning objective so the model’s initialization is explicitly optimized to perform well after test-time training. Experiments emphasize scaling: for 3B models trained with 164B tokens, TTT-E2E scales with context length similarly to full attention while keeping constant per-token inference latency, reaching 2.7× faster prefill than full attention at 128K context (Figure 1).
2. Context and Motivation
- Problem / gap.
- Full self-attention over an entire context (“full attention”) is effective at using long context, but prefill cost grows quadratically in context length (O(T^2)) and decode cost grows linearly (O(T)), making very long contexts expensive (Introduction; Section 2).
- Linear-time sequence models (described here as “RNNs such as Mamba 2 and Gated DeltaNet”) have constant per-token cost, but their language modeling quality degrades as context length grows (Figure 1).
- Why it matters.
- The paper frames a core tension: long-context models often aim for “nearly lossless recall” (full attention), which is compute-heavy, while humans benefit from long experience via compression rather than exact recall (Introduction).
- The practical target is “constant cost per token” inference while still benefiting from longer context (Introduction; Section 2.1).
- Prior approaches and shortcomings (as set up in this paper).
- Sliding-window attention (SWA): reduces cost by limiting attention to a window, but becomes less effective than full attention at leveraging longer context (Figure 1).
- Hybrid architectures (e.g., SWA layers mixed with occasional full-attention layers) partially bridge the gap but still do not match full attention’s long-context effectiveness (Figure 1).
- RNN/SSM-style models (Mamba 2, Gated DeltaNet): constant per-token cost, but worse loss scaling with context length (Figure 1).
- Earlier test-time training (TTT) variants for long context (notably TTT-KVB) compress context into weights, but use non-end-to-end test-time objectives (layer-wise reconstruction / key-value binding) rather than next-token prediction (Section 2.4; Table 1).
- How this paper positions itself.
- The key repositioning is conceptual: long-context modeling is treated as continual learning at inference time rather than as an attention-approximation or architecture-design problem (Abstract; Introduction).
- The method aims to be “end-to-end” in two senses:
- Test-time E2E: the inner-loop update objective is standard next-token prediction loss (Section 2.1–2.4).
- Training-time E2E: the outer-loop training objective directly optimizes performance after test-time updates via meta-learning (Section 2.2).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a Transformer language model with sliding-window attention that updates part of its weights while reading the prompt/context.
- It solves long-context modeling by using test-time training (TTT), i.e., gradient steps on next-token prediction over the context, to compress information into weights, with an outer-loop meta-learning procedure that trains the model to adapt well at test time.
3.2 Big-picture architecture (diagram in words)
- Base model: a standard Transformer stack, but attention is SWA(k) (window size k) instead of full attention.
- During inference (prefill):
- Run forward passes to compute next-token predictions over context tokens.
- Compute next-token cross-entropy losses on those context tokens.
- Backpropagate those losses only into selected MLP layers (not into embeddings/attention/norms).
- Apply gradient updates (in mini-batches of tokens) to those MLP weights.
- During training:
- Inner loop: simulate the same test-time updates on training sequences.
- Outer loop: update the initialization parameters so that after inner-loop adaptation, the model’s loss is low (meta-learning via gradients-through-gradients).
3.3 Roadmap for the deep dive
- Explain the core inner-loop mechanism: test-time updates using next-token prediction (Eq. (1)–(2)).
- Show why naive training mismatches test-time behavior and how meta-learning fixes it (Eq. (3)–(4)).
- Add mini-batch updates and SWA to make the method efficient/stable at scale (Eq. (5)–(6); Section 2.3).
- Detail the practical implementation choices (what is updated/frozen; how many layers; dual-MLP blocks) and their rationale (Section 2.3.1; Figure 4).
- Connect to prior TTT-KVB through the alternative derivation (Eq. (7)–(9); Table 1) to clarify what is “E2E” here.
3.4 Detailed, sentence-based technical breakdown
This is an algorithmic + empirical systems paper whose core idea is to make long-context modeling work with constant per-token inference cost by turning inference into continual learning: the model trains on the context itself, and training-time meta-learning prepares the model to adapt effectively.
3.4.1 The inference-time task: prefill then decode
- The paper uses standard next-token prediction with two inference phases (Section 2):
- Prefill: condition on provided tokens x_0, x_1, …, x_T (with x_0 = <BOS>).
- Decode: produce a distribution for the next token \hat{p}_{T+1} and incur test loss CE(\hat{p}_{T+1}, x_{T+1}).
- Complexity baseline (Section 2):
- Full attention: prefill O(T^2), decode O(T).
- The proposed TTT framing targets: prefill O(T), decode O(1) (for the single-token decode setting; Section 2.1).
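To make the asymptotics above concrete, here is a rough sketch that counts abstract attention interactions (one unit per query-key pair) rather than real FLOPs. The function names and the unit-cost model are illustrative assumptions, not the paper's accounting. The sketch ignores MLP cost and the backward pass that TTT adds, so it shows the growth trend only and is not meant to reproduce the reported 2.7× figure.

```python
def full_attention_prefill_cost(T: int) -> int:
    """Causal full attention: token t attends to t positions, so the total is T(T+1)/2."""
    return T * (T + 1) // 2


def windowed_prefill_cost(T: int, k: int) -> int:
    """Sliding-window attention: token t attends to min(t, k) positions."""
    if T <= k:
        return T * (T + 1) // 2
    # The first k tokens form a growing triangle; every later token sees exactly k.
    return k * (k + 1) // 2 + (T - k) * k


if __name__ == "__main__":
    k = 8 * 1024  # assumed reading of "8K" as 8192 tokens
    for T in (8 * 1024, 32 * 1024, 128 * 1024):
        ratio = full_attention_prefill_cost(T) / windowed_prefill_cost(T, k)
        print(f"T={T:>6}: full / windowed interaction ratio = {ratio:.1f}")
```

Because the windowed count grows by a fixed k per extra token, the cost per token stays bounded as T grows, which is the property the TTT framing relies on.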
3.4.2 Inner-loop: “TTT via next-token prediction” (what happens at test time)
- The model defines a next-token loss at each time step (Eq. (1)):
ℓ_t(W) = CE(f(x_{t-1}; W), x_t),
where f is the network and W are the weights being updated by TTT.
- Online TTT (conceptual starting point):
- As the model reads tokens sequentially, it updates its weights after each token (Eq. (2)):
W_t = W_{t-1} − η ∇ℓ_t(W_{t-1}),
with learning rate η and initial test-time weights W_0.
- Intuition via the toy example (Figure 2):
- They remove all self-attention from a Transformer, leaving only MLP layers; without memory, it behaves like a bigram model.
- TTT adds a gradient step while reading context (e.g., train on predicting x_2 from x_1), and the updated MLP weights store information about earlier tokens (Figure 2, left), improving subsequent predictions (Figure 2, right).
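The online update in Eq. (2) can be illustrated with a deliberately tiny stand-in for the attention-free toy model. The sketch below assumes a table of logits W[prev][next] in place of the MLP, so f(x_{t-1}; W) is a softmax over one row; this substitution, the zero initialization, and the names are assumptions for illustration and not the paper's setup.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def online_ttt(tokens, vocab_size, lr=1.0):
    """Apply W_t = W_{t-1} - lr * grad(l_t) after every token, as in Eq. (2).

    Returns the per-step losses (each measured before its update) and the final W.
    """
    W = [[0.0] * vocab_size for _ in range(vocab_size)]
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        probs = softmax(W[prev])
        losses.append(-math.log(probs[nxt]))  # cross-entropy l_t(W_{t-1})
        for j in range(vocab_size):
            # Gradient of cross-entropy w.r.t. logit j is p_j - 1[j == target].
            grad = probs[j] - (1.0 if j == nxt else 0.0)
            W[prev][j] -= lr * grad
    return losses, W


if __name__ == "__main__":
    context = [0, 1, 2] * 10  # a repeating pattern the weights can absorb
    losses, _ = online_ttt(context, vocab_size=3)
    print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The loss on later tokens drops because the gradient steps have written the context's regularities into W, which is the sense in which context is "compressed into weights".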
3.4.3 Outer-loop: “learning to (learn at test time)” via meta-learning
- The key training objective matches test-time behavior by optimizing the loss after the sequence of test-time updates (Eq. (3)):
L(W_0; X) = (1/T) Σ_{t=1..T} CE(f(x_{t-1}; W_{t-1}), x_t),
where W_{t-1} is the result of applying the inner-loop update rule up to step t−1.
- This requires differentiating through the update rule (because W_t depends on ∇ℓ_t), i.e., “gradients of gradients” (Section 2.2).
- Why meta-learning is needed (the “naive” mismatch).
- A naive alternative trains W_0 as if no test-time updates happen (Eq. (4)):
L_naive(W_0; X) = (1/T) Σ_{t=1..T} ℓ_t(W_0).
- This creates a mismatch between training-time and test-time behavior and empirically underperforms the E2E objective in the toy setup (Figure 2, right: TTT-naive vs TTT-E2E).
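The difference between Eq. (3) and Eq. (4) shows up already in a scalar toy problem. The sketch below assumes a quadratic stand-in loss ℓ_t(w) = 0.5·(w − x_t)^2 instead of cross-entropy, so the derivative of each inner update with respect to the previous weight is simply (1 − η). In a real network that factor involves second derivatives of the loss, which is where "gradients of gradients" enter. All names here are hypothetical.

```python
def e2e_loss_and_grad(w0, xs, lr):
    """Eq. (3)-style objective and its exact gradient w.r.t. the initialization w0."""
    w = w0
    sensitivity = 1.0  # d w_{t-1} / d w0, carried forward through the inner loop
    loss = grad = 0.0
    for x in xs:
        err = w - x                # derivative of 0.5 * (w - x)^2 w.r.t. w
        loss += 0.5 * err * err
        grad += err * sensitivity  # chain rule back to w0
        w -= lr * err              # inner-loop update, Eq. (2)
        sensitivity *= 1.0 - lr    # d w_t / d w_{t-1} = 1 - lr * (second derivative = 1)
    return loss / len(xs), grad / len(xs)


def naive_loss_and_grad(w0, xs):
    """Eq. (4)-style objective: pretends the weights never change at test time."""
    loss = sum(0.5 * (w0 - x) ** 2 for x in xs) / len(xs)
    grad = sum(w0 - x for x in xs) / len(xs)
    return loss, grad


if __name__ == "__main__":
    xs, lr, w0 = [1.0, 2.0, 0.5, 3.0], 0.3, 0.2
    print("E2E  :", e2e_loss_and_grad(w0, xs, lr))
    print("naive:", naive_loss_and_grad(w0, xs))
```

The two gradients point the initialization in different directions because only the E2E version accounts for how w0 propagates through later updates.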
3.4.4 Scaling the inner-loop: mini-batch TTT + sliding-window attention
Two issues arise for large models / long contexts (Section 2.2–2.3):
- Efficiency: per-token (online) updates are sequential and hard to parallelize.
- Stability: single-token gradients can be noisy and can “easily lead to gradient explosion by chance” (Section 2.2).
To address both, the method uses mini-batch TTT:
- Partition the context tokens into batches of size b and update once per batch (Eq. (5)):
W_i = W_{i-1} − η (1/b) Σ_{t=(i−1)b+1..ib} ∇ℓ_t(W_{i-1}).
- Update the outer-loop objective accordingly (Eq. (6)):
L(W_0; X) = (1/T) Σ_{i=1..T/b} Σ_{t=(i−1)b+1..ib} ℓ_t(W_{i-1}).
However, batching introduces a new problem:
- Within a batch, predictions are made with the same weights W_{i-1} (not updated token-by-token), so the model cannot incorporate earlier tokens inside the batch unless the architecture provides short-term memory (Section 2.3; Figure 2 shows degradation at b = 16 for the attention-free toy model).
The fix is architectural but minimal:
- Use a standard Transformer with sliding-window attention of size k so the model can attend locally within a window, and set k ≥ b so the model can “remember the context within each mini-batch before TTT has a chance to update its weights” (Section 2.3).
- In main long-context results, they use k = 8K and b = 1K for T = 128K (Section 2.3).
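The batching in Eq. (5) and the k ≥ b constraint can be sketched as follows. This again uses the scalar stand-in loss ℓ_t(w) = 0.5·(w − x_t)^2 as an assumption, and the helper names are hypothetical. The point is only to show that every token in batch i is scored with the same weights W_{i-1} and that one averaged update follows.

```python
def minibatch_ranges(T, b):
    """1-based inclusive token ranges ((i-1)b+1 .. ib) for each TTT mini-batch."""
    if T % b != 0:
        raise ValueError("this sketch assumes T is a multiple of b")
    return [((i - 1) * b + 1, i * b) for i in range(1, T // b + 1)]


def check_window_covers_batch(k, b):
    """The window must span a whole mini-batch so within-batch context stays visible."""
    if k < b:
        raise ValueError(f"need k >= b, got k={k}, b={b}")


def minibatch_ttt(xs, b, w0, lr):
    """Scalar illustration of Eq. (5) with loss 0.5 * (w - x)^2."""
    w, losses = w0, []
    for lo, hi in minibatch_ranges(len(xs), b):
        batch = xs[lo - 1:hi]
        losses.extend(0.5 * (w - x) ** 2 for x in batch)  # all scored with W_{i-1}
        mean_grad = sum(w - x for x in batch) / b
        w -= lr * mean_grad                               # one update per batch
    return w, losses


if __name__ == "__main__":
    K = 1024  # assumed reading of "1K"
    check_window_covers_batch(k=8 * K, b=1 * K)
    print(len(minibatch_ranges(128 * K, 1 * K)), "inner-loop updates over the context")
```

With the reported settings (T = 128K, b = 1K) this gives 128 sequential updates instead of one per token, which is what makes the inner loop parallelizable within each batch.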
3.4.5 Practical implementation details (what gets updated, and why)
Section 2.3.1 describes three implementation choices that the authors deem necessary for reported performance:
- Update only MLP layers during TTT.
- They freeze embeddings, normalization layers, and attention layers in the inner loop because updating them “causes instability in the outer loop” (Section 2.3.1).
- Update only the last 1/4 of Transformer blocks.
- Updating more layers increases “storage” (more parameters to hold compressed context) but costs more backprop compute per inner step (Section 2.3.1).
- Their default is the last quarter of blocks, chosen via ablations (Figure 4 right panel; Section 3.2.1).
- Use two MLP layers per updated block: one kept static in the inner loop and one updated by TTT.
- Motivation: mitigate “forgetting the knowledge learned during pre-training” by keeping a static MLP as a “safe” store (Section 2.3.1).
- To keep parameter count constant versus baselines, they reduce MLP hidden dimension throughout the network (Section 2.3.1).
Figure 3 (left) illustrates the computation graph: a forward pass computes next-token loss at the end; the backward pass updates only MLPs in the last quarter of blocks (and gradient is stopped before earlier blocks).
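One way to picture the three choices above is as a mask over parameter groups that says which ones the inner loop may touch. The sketch below is a minimal illustration under assumed, hypothetical parameter names (embed, attn, norm, mlp, mlp_static, mlp_ttt); the paper's codebase may organize parameters quite differently.

```python
def build_inner_loop_mask(num_blocks, fraction=0.25):
    """Map each (hypothetical) parameter group to True if TTT updates it.

    Only the TTT MLP in the last `fraction` of blocks is trainable in the inner
    loop; embeddings, attention, norms, and static MLPs stay frozen there.
    """
    num_updated = max(1, int(num_blocks * fraction))
    first_updated = num_blocks - num_updated
    mask = {"embed": False, "final_norm": False}
    for i in range(num_blocks):
        prefix = f"blocks.{i}"
        mask[f"{prefix}.attn"] = False
        mask[f"{prefix}.norm"] = False
        if i < first_updated:
            mask[f"{prefix}.mlp"] = False         # ordinary block: single MLP
        else:
            mask[f"{prefix}.mlp_static"] = False  # keeps pre-trained knowledge
            mask[f"{prefix}.mlp_ttt"] = True      # holds the compressed context
    return mask


if __name__ == "__main__":
    mask = build_inner_loop_mask(num_blocks=24)
    print([name for name, updated in mask.items() if updated])
```

For a 24-block model this selects blocks 18 through 23, i.e., six TTT MLPs. The outer loop still trains every parameter; the mask only restricts the inner-loop updates.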
3.4.6 Decoding multiple tokens (beyond one-step decode)
- The method extends naturally to generating long continuations (Section 2.3.2):
- During decoding, it accumulates generated tokens until it fills a TTT mini-batch of size b.
- Then it performs one TTT update on that batch of generated tokens before continuing generation with updated weights.
This makes decode behave like:
- Standard SWA decode within a batch, plus
- Periodic “prefill-like” TTT steps between batches (Section 3.7, decode latency discussion).
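The accumulate-then-update pattern described above can be sketched as a small control loop. The callbacks sample_next and ttt_update are hypothetical placeholders for the model's SWA decode step and its inner-loop gradient step; only the scheduling logic is shown.

```python
def decode_with_ttt(sample_next, ttt_update, num_tokens, b):
    """Generate num_tokens tokens, running one TTT update per b generated tokens.

    sample_next(generated) -> next token id (stand-in for an SWA decode step).
    ttt_update(batch)      -> None (stand-in for one inner-loop gradient step).
    """
    generated, pending = [], []
    num_updates = 0
    for _ in range(num_tokens):
        token = sample_next(generated)
        generated.append(token)
        pending.append(token)
        if len(pending) == b:    # a full mini-batch of generated tokens is ready
            ttt_update(pending)  # "prefill-like" step between batches
            pending = []
            num_updates += 1
    return generated, num_updates


if __name__ == "__main__":
    seen_batches = []
    tokens, updates = decode_with_ttt(
        sample_next=lambda ctx: len(ctx) % 5,  # dummy deterministic "model"
        ttt_update=seen_batches.append,
        num_tokens=10,
        b=4,
    )
    print(tokens, updates, seen_batches)
```

Tokens left over in a partially filled batch are simply decoded with the most recent weights; they would trigger an update only once the batch fills.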
3.4.7 Alternative derivation: from TTT-KVB to TTT-E2E
Section 2.4 reframes prior long-context TTT (TTT-KVB) and shows the paper’s key conceptual shift:
- TTT-KVB uses a layer-wise reconstruction objective that tries to predict “value” projections from “key” projections inside each layer (Eq. (7)), updating an internal model g per layer; output embeddings come from another call to g (Eq. (8)).
- Step 1: simplify output rule by reusing the prediction directly (Eq. (9)); Table 1 shows negligible change:
- TTT-KVB (Zhang et al.): loss 2.818 (diff −0.009 vs SWA baseline 2.827)
- TTT-KVB simplified: loss 2.819 (diff +0.001 relative to the prior row; within their stated 0.001 significance threshold)
- Key step: replace layer-wise KVB loss with next-token prediction loss.
- This yields TTT-E2E all layers MH (MH = multi-head), improving to loss 2.806 (diff −0.013 relative to the prior row, which works out to −0.021 vs the SWA baseline of 2.827 in Table 1).
- Final step: update fewer blocks but with larger MLP state (regular MLPs, not multi-head LoRA).
- The paper argues it is more cost-effective to update fewer blocks with larger “state,” because the backward pass must traverse heavy upstream computation anyway (Section 2.4.4).
- They quantify for a 760M model:
- state size 88M parameters (final TTT-E2E) vs 18M (intermediate),
- and prefill latency 0.0086 vs 0.017 seconds per 1K tokens on an H100 (Section 2.4.4).
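As a quick arithmetic check on the 760M comparison, the ratios implied by the quoted numbers can be computed directly. This assumes the two pairs are listed in the same order (final TTT-E2E first, intermediate variant second), which is how the sentence reads but is not independently confirmed here.

```python
# Values quoted for the 760M model (Section 2.4.4 of the paper, as summarized above).
state_final, state_intermediate = 88e6, 18e6          # parameters updated by TTT
latency_final, latency_intermediate = 0.0086, 0.017   # seconds per 1K tokens, H100

state_ratio = state_final / state_intermediate        # how much larger the state is
speedup = latency_intermediate / latency_final        # how much faster prefill is

print(f"state is {state_ratio:.1f}x larger, prefill is {speedup:.2f}x faster")
```

Under that reading, the final design holds roughly five times more state while roughly halving prefill latency, which is the cost-effectiveness argument in numeric form.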
4. Key Insights and Innovations
- (1) Long-context as continual learning via next-token self-training on the context.
- Novelty: instead of designing a new long-context attention approximation, the model keeps learning at inference time to compress context into weights (Abstract; Introduction; Section 2.1).
- Significance: enables constant per-token latency while still benefiting from longer context (Figure 1, right; Section 2).
- (2) End-to-end meta-learning objective that matches “loss after TTT.”
- Novelty: the outer loop trains the initialization W_0 to minimize the loss sequence that is actually experienced during test-time adaptation (Eq. (3), Eq. (6)), unlike L_naive (Eq. (4)).
- Significance: in the toy example, E2E training turns an attention-free “bigram-like” model into one that nearly matches full attention, while naive training helps little (Figure 2, right).
- (3) Practical, scalable TTT: mini-batch updates + SWA interleaving + selective layer updates.
- Novelty (as packaged here): mini-batch TTT (b) to improve parallelism/stability (Eq. (5)) combined with SWA (k ≥ b) to avoid losing within-batch context (Section 2.3).
- Significance: supports experiments up to 128K context with constant inference latency (Figure 1, right).
- (4) A concrete design point for “state size vs compute”: update only the last quarter of blocks and only MLPs, with a second static MLP to reduce forgetting.
- Novelty: the paper turns a vague “fast weights” idea into a specific Transformer recipe (Section 2.3.1) and validates the “how many layers” knob as central for context scaling (Figure 4, right).
- Significance: provides an empirical handle for trading off long-context quality vs compute.
- (5) Clarifying connection to TTT-KVB and RNN framing via an alternative derivation.
- Novelty: isolates the key improvement as swapping the inner-loop loss to next-token prediction (“E2E at test time”) and then resizing the hidden state effectively (Section 2.4; Table 1).
- Significance: explains why earlier “TTT layers as attention replacements” are not necessary for this method’s gains.
5. Experimental Analysis
5.1 Evaluation methodology: datasets, stages, metrics
- Training stages (Section 3.1):
- Pre-training at 8K context length.
- Extension fine-tuning at the final evaluated context length (up to 128K).
- Datasets (Section 3.1):
- Pre-training: DCLM-Baseline (filtered Common Crawl subset).
- They discard all documents shorter than 8K (to avoid resetting updated MLPs across packed document boundaries, which slows their infra).
- They sample training sets of varying total token counts from remaining documents.
- Fine-tuning + evaluation: Books dataset for long-context extension because very long DCLM sequences (>128K) are described as “low quality.”
- Language modeling evaluation uses a held-out partition of Books (Section 3.1).
- Metric:
- Primary: language modeling loss reported as “log perplexity” (Figures 1, 4–7, 9).
- In scaling plots, they use Loss ∆ = (loss of method) − (loss of full attention) so full attention is a flat line at 0 (Figures 1, 4, 5).
- Baselines (Section 3.1):
- Transformer with full attention.
- Transformer with SWA.
- Hybrid SWA + full attention (pattern “5:1”, in the style of Gemma).
- Mamba 2 (hybrid with SWA layers).
- Gated DeltaNet (hybrid with SWA layers).
- TTT-KVB (hybrid of TTT-MLP KVB layers with SWA).
All SWA-based methods use window k = 8K except in ablations (Section 3.1).
5.2 Core configurations / hyperparameters (as reported)
- Model sizes and recipes (Table 3; Appendix B):
- Sizes: 125M, 350M, 760M, 1.3B, 2.7B parameters (the main text often refers to 3B for the largest setting).
- Example configuration fields in Table 3: number of blocks, embedding dimension, attention heads, batch size (tokens per outer-loop batch), peak LR, and total tokens.
- Training uses one epoch: steps = total_tokens / batch_size (Table 3).
- Optimization schedule (Appendix B):
- Learning rate warmup: linear warmup from 0 to peak over the first 10% of training.
- Then cosine decay to 1e-5.
- Tokenizer and positional encoding (Appendix B):
- Tokenizer: Llama 3 tokenizer.
- Positional encoding: RoPE with θ = 500K for pre-training at 8K.
- For extension fine-tuning, they change θ for full attention and set (following their cited practice): θ = 1M for 16K, 2M for 32K, 5M for 64K, 10M for 128K.
- TTT hyperparameters for main long-context results (Section 2.3):
- k = 8K (SWA window),
- b = 1K (TTT mini-batch size),
- update only the last 1/4 of blocks,
- update only MLP layers in those blocks, with an added second static MLP per updated block (Section 2.3.1).
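The optimization schedule listed above (linear warmup over the first 10%, then cosine decay to 1e-5) could be written roughly as follows. This sketch assumes the 10% is measured in optimizer steps and that 1e-5 is the final learning rate itself rather than a ratio; the peak value used in the demo is a placeholder, since the per-model peaks from Table 3 are not reproduced in these notes.

```python
import math


def learning_rate(step, total_steps, peak_lr, final_lr=1e-5, warmup_frac=0.10):
    """Linear warmup from 0 to peak_lr, then cosine decay down to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 1000  # in the paper's setup: total_tokens / batch_size (one epoch)
    peak = 3e-4         # placeholder value, not taken from Table 3
    for step in (0, 50, 100, 550, 1000):
        print(step, f"{learning_rate(step, total_steps, peak):.2e}")
```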
5.3 Main quantitative results
(A) Scaling with context length: quality and latency
- Quality scaling (Figure 1 left; Figure 9):
- In the 3B-scale setting trained with 164B tokens, TTT-E2E is shown as maintaining an advantage over full attention as context increases up to 128K, while methods like SWA, Mamba 2, and Gated DeltaNet degrade in Loss ∆ at long context (Figure 1 left).
- Figure 9 (same experiments as Figure 1 left but plotting absolute loss) shows:
- longer context improves loss for full attention and the hybrid baseline across all tested lengths,
- but after 32K, longer context hurts loss for SWA, Mamba 2, Gated DeltaNet, and TTT-KVB,
- while TTT-E2E continues improving through 128K.
The paper attributes the “hurts after 32K” trend to higher gradient variance during fine-tuning (fewer sequences per outer-loop batch) combined with inability to leverage longer context effectively (Figure 9 caption).
- Latency scaling (Figure 1 right):
- TTT-E2E has constant per-token prefill latency w.r.t. context length (like SWA and the RNN baselines).
- It is reported as 2.7× faster than full attention for 128K context on an H100 (Figure 1 right caption).
(B) Ablations: what matters for TTT-E2E
- Sliding window size k (Figure 4 left; Section 3.2):
- Larger k improves performance for SWA, Gated DeltaNet, and TTT-E2E.
- They choose k = 8K because smaller k “does not significantly improve runtime.”
- TTT mini-batch size b (Figure 4 middle; Section 3.2):
- Larger b (from 1K up to 8K) significantly hurts performance for both TTT-E2E and TTT-KVB.
- They choose b = 1K as a default because smaller b harms hardware utilization and stability.
- Number of layers updated (Figure 4 right; Section 3.2.1):
- Updating too few layers (e.g., 1 or 3 layers in a 24-layer 760M model) fails to scale with context like full attention.
- Updating 6 or 12 layers scales similarly, with 12 not clearly better than 6.
- They therefore update the last 1/4 of layers across model sizes.
(C) Scaling with training compute (Figure 5)
- Axes explored (Section 3.3):
- Model size scaling: 125M → 3B range (Figure 5a,b).
- Token scaling: for 760M, vary tokens 16B to 80B (Figure 5c,d), keeping fine-tuning tokens at 5% of pre-training tokens.
- Evaluations (Section 3.3):
- After pre-training: evaluate on DCLM at 8K context (Figure 5a,c).
- After fine-tuning: evaluate on Books at 32K context (Figure 5b,d).
- Observed trend (Section 3.3):
- Under small compute budgets, the advantage of TTT-E2E over full attention decreases.
- Under medium budgets, TTT-E2E follows a similar scaling trend to full attention (blue line relatively flat in Loss ∆).
- The paper marks regime boundaries around model size 760M and token count 48B (dotted lines in Figure 5).
(D) Where the loss gains come from (Figure 6)
- Figure 6 breaks down token-wise loss vs token index for 32K and 128K contexts (Section 3.4.1).
- Two key observations:
- TTT-E2E is the only method plotted that stays below full attention throughout the whole context length.
- Most of TTT-E2E’s aggregate advantage comes from earlier tokens, with differences smaller near the end of the context window.
The paper highlights that TTT-E2E outperforms full attention even before the first inner-loop update happens (before t = 1K given b = 1K), implying the meta-learned initialization itself changes behavior (Section 3.4.1).
(E) Recall-focused evaluation: Needle in a Haystack (Table 2)
- Task: RULER’s three “S-NIAH” tasks at varying context lengths (Section 3.5).
- Result: full attention “dramatically outperforms” all other methods, including TTT-E2E, especially at long context (Table 2; Section 3.5).
- Example: on S-NIAH-1 (pass-key retrieval) at 128K, full attention is 0.99 while TTT-E2E is 0.06 (Table 2).
- Interpretation: supports the paper’s framing that full attention’s strength is near-lossless recall, while TTT-E2E relies on compression that can drop “seemingly irrelevant details” like the needle string (Section 3.5).
(F) Long decoding behavior (Figure 7)
- Setup (Section 3.6; Appendix D):
- Use Qwen-3-8B-Base as an evaluator.
- Prefill 8K tokens from Books, then decode 8K continuation tokens.
- Sampling uses temperature 1, top-p 0.95, plus repetition penalty 1.1 (Appendix D).
- Result: TTT-E2E achieves lower Qwen-evaluated loss than full attention in this test, with both showing a loss spike at the prefill→decode boundary then decreasing (Figure 7; Section 3.6).
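For readers unfamiliar with the sampling settings in the setup above, the following is a minimal sketch of temperature scaling, a repetition penalty, and top-p (nucleus) filtering applied to one step's logits. The penalty formulation (divide positive logits, multiply negative ones) is a commonly used convention and an assumption here; the notes do not say which variant Appendix D uses, and the function names are hypothetical.

```python
import math
import random


def apply_repetition_penalty(logits, history, penalty):
    """Make already-generated tokens less likely (assumed convention, see lead-in)."""
    adjusted = list(logits)
    for tok in set(history):
        adjusted[tok] = adjusted[tok] / penalty if adjusted[tok] > 0 else adjusted[tok] * penalty
    return adjusted


def sample_token(logits, history, temperature=1.0, top_p=0.95,
                 repetition_penalty=1.1, rng=random):
    """Sample one token id from the smallest set of tokens whose mass reaches top_p."""
    scaled = [z / temperature
              for z in apply_repetition_penalty(logits, history, repetition_penalty)]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    threshold = rng.random() * mass  # sample within the renormalized nucleus
    running = 0.0
    for i in nucleus:
        running += probs[i]
        if threshold < running:
            return i
    return nucleus[-1]


if __name__ == "__main__":
    print(sample_token([2.0, 1.0, 0.5, -1.0], history=[0]))
```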
5.4 Do the experiments support the claims?
- The paper’s central empirical claims are strongly aligned with its plots:
- Constant-latency long-context prefill vs full attention: directly shown (Figure 1 right).
- Context-length scaling in loss for 3B / 164B tokens: shown as stable advantage in Loss ∆ (Figure 1 left) and improving absolute loss through 128K (Figure 9).
- Importance of state capacity (layers updated) for scaling: supported by the dedicated ablation (Figure 4 right).
- The paper is also explicit about trade-offs:
- It shows TTT-E2E is not competitive on recall-heavy NIAH tasks (Table 2), which constrains when its loss improvements translate into “better long-context reasoning” vs “better compression.”
6. Limitations and Trade-offs
- Weakness on near-lossless recall tasks (explicit).
- TTT-E2E performs far worse than full attention on Needle-in-a-Haystack retrieval, especially at long contexts (Table 2; Section 3.5).
- This is consistent with the method’s compression mechanism: it may discard details that matter for exact string retrieval.
- Training efficiency is a significant limitation (explicit).
- Training requires gradients of gradients (meta-learning), which is “much less optimized” than standard Transformer training (Section 3.7).
- On H200, training latency is reported as:
- 1.2× faster than full attention at 128K,
- but 3.4× slower at 8K (Figure 8 left; Section 3.7).
- Even though FLOPs/token remains constant for TTT-E2E in their accounting (Figure 8 right), latency still grows between 8K and 32K because they increase gradient checkpointing-through-time by log(T) to manage memory (Section 3.7 footnote 9).
- Method depends on careful hyperparameter coupling and design constraints.
- The method relies on setting k ≥ b so within-batch context is available (Section 2.3).
- Larger b hurts performance (Figure 4 middle), but smaller b harms utilization/stability (Section 3.2).
- Updating too few layers breaks context scaling (Figure 4 right), while updating too many layers increases compute.
- Inner-loop updates are restricted for stability.
- They freeze embeddings, norms, and attention layers during TTT due to instability in the outer loop (Section 2.3.1), which may limit how much the model can adapt at test time.
- Long decoding evaluation is limited in realism (explicit caveat).
- The paper notes that realistic long decoding often happens after instruction tuning or RL, which they do not cover; their evaluation uses base models and an external evaluator model (Section 3.6).
7. Implications and Future Directions
- Conceptual shift: long-context modeling as continual learning.
- The paper suggests progress on long-context need not come only from new attention mechanisms; instead, inference-time adaptation plus meta-learned initialization can provide a different route to long-context effectiveness (Abstract; Introduction; Section 4.2).
- A practical regime where TTT-E2E looks attractive (based on provided results).
- If the task is language modeling / predictive use of long context where exact recall is not central, TTT-E2E offers:
- constant per-token prefill latency as context length grows (Figure 1 right),
- improving loss up to 128K (Figure 9),
- and favorable scaling vs several architecture baselines (Figure 1 left).
- If the task requires exact retrieval of a small detail buried far back, full attention remains far stronger (Table 2).
- Systems directions for making training practical (explicitly proposed).
- The paper identifies two routes (Section 3.7):
- Build a custom attention kernel that supports gradients-of-gradients (they cannot use cuDNN FlashAttention during training because it does not support this in their setup).
- Initialize TTT-E2E training from a pre-trained Transformer without TTT, so meta-learning is a smaller fraction of total compute.
- Algorithmic extensions hinted by the paper.
- They propose exploring TTT on self-generated tokens (e.g., filtered/rephrased context or “review” tokens) as a way to guard against spurious inputs, drawing an analogy to gating mechanisms in RNNs (Section 3.3).
- Repro/Integration Guidance (from their setup choices).
- A minimal integration, consistent with the paper’s working recipe, is:
- Start from a Transformer with SWA(k).
- Add inner-loop updates on next-token prediction loss with mini-batches of size b and ensure k ≥ b (Section 2.3).
- Update only MLPs in the last quarter of blocks; freeze attention/embeddings/norms for stability (Section 2.3.1).
- Train the initialization with the outer-loop objective that differentiates through the inner loop (Eq. (6)), acknowledging the training-time overhead (Section 3.7).
- Based on their ablations, the most sensitive knobs for long-context scaling are b and the number of updated layers (Figure 4 middle/right), while k behaves similarly to other SWA-based baselines (Figure 4 left).