End-to-End Test-Time Training for Long Context¶

🎯 Pitch¶

The paper introduces TTT-E2E, an end-to-end test-time training method that compresses long context into a Transformer’s weights by running next-token prediction updates at inference and meta-learning the model’s initialization so it performs well after those updates. This yields full-attention-level scaling with context length at constant per-token inference cost, which makes very long contexts much cheaper to process (e.g., 2.7× faster prefill than full attention at 128K) without bespoke architectures.

1. Executive Summary¶

This work introduces TTT-E2E, a long-context language modeling method that treats long-context use as a continual learning problem: the model continues learning at inference time by running next-token prediction updates on the provided context, thereby compressing context into its weights. The method keeps a standard Transformer backbone with sliding-window attention (SWA), but adds an end-to-end meta-learning objective so the model’s initialization is explicitly optimized to perform well after test-time training. Experiments emphasize scaling: for 3B models trained with 164B tokens, TTT-E2E scales with context length similarly to full attention while keeping constant per-token inference latency, reaching 2.7× faster prefill than full attention at 128K context (Figure 1).

2. Context and Motivation¶

Problem / gap.
Full self-attention over an entire context (“full attention”) is effective at using long context, but prefill cost grows quadratically in context length (O(T^2)), and decode cost grows linearly (O(T)), making very long contexts expensive (Introduction; Section 2).
Linear-time sequence models (described here as “RNNs such as Mamba 2 and Gated DeltaNet”) have constant per-token cost, but their language modeling quality degrades as context length grows (Figure 1).
Why it matters.
The paper frames a core tension: long-context models often aim for “nearly lossless recall” (full attention) which is compute-heavy, while humans benefit from long experience via compression rather than exact recall (Introduction).
The practical target is “constant cost per token” inference while still benefiting from longer context (Introduction; Section 2.1).
Prior approaches and shortcomings (as set up in this paper).
Sliding-window attention (SWA): reduces cost by limiting attention to a window, but becomes less effective than full attention at leveraging longer context (Figure 1).
Hybrid architectures (e.g., SWA layers mixed with occasional full-attention layers) partially bridge the gap but still do not match full attention’s long-context effectiveness (Figure 1).
RNN/SSM-style models (Mamba 2, Gated DeltaNet): constant per-token cost, but worse loss scaling with context length (Figure 1).
Earlier test-time training (TTT) variants for long context (notably TTT-KVB) compress context into weights, but use non-end-to-end test-time objectives (layer-wise reconstruction / key-value binding) rather than next-token prediction (Section 2.4; Table 1).
How this paper positions itself.
The key repositioning is conceptual: long-context modeling is treated as continual learning at inference time rather than as an attention-approximation or architecture-design problem (Abstract; Introduction).
The method aims to be “end-to-end” in two senses:
- Test-time E2E: the inner-loop update objective is standard next-token prediction loss (Section 2.1–2.4).
- Training-time E2E: the outer-loop training objective directly optimizes performance after test-time updates via meta-learning (Section 2.2).

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

The system is a Transformer language model with sliding-window attention that updates part of its weights while reading the prompt/context.
It addresses long-context modeling with test-time training (TTT): gradient steps on next-token prediction over the context compress information into the weights, and an outer-loop meta-learning procedure trains the model to adapt well at test time.

3.2 Big-picture architecture (diagram in words)¶

Base model: a standard Transformer stack, but attention is SWA(k) (window size k) instead of full attention.
During inference (prefill):
Run forward passes to compute next-token predictions over context tokens.
Compute next-token cross-entropy losses on those context tokens.
Backpropagate those losses only into selected MLP layers (not into embeddings/attention/norms).
Apply gradient updates (in mini-batches of tokens) to those MLP weights.
During training:
Inner loop: simulate the same test-time updates on training sequences.
Outer loop: update the initialization parameters so that after inner-loop adaptation, the model’s loss is low (meta-learning via gradients-through-gradients).

3.3 Roadmap for the deep dive¶

Explain the core inner-loop mechanism: test-time updates using next-token prediction (Eq. (1)–(2)).
Show why naive training mismatches test-time behavior and how meta-learning fixes it (Eq. (3)–(4)).
Add mini-batch updates and SWA to make the method efficient/stable at scale (Eq. (5)–(6); Section 2.3).
Detail the practical implementation choices (what is updated/frozen; how many layers; dual-MLP blocks) and their rationale (Section 2.3.1; Figure 4).
Connect to prior TTT-KVB through the alternative derivation (Eq. (7)–(9); Table 1) to clarify what is “E2E” here.

3.4 Detailed, sentence-based technical breakdown¶

This is an algorithmic and empirical systems paper. Its core idea is to make long-context modeling work at constant per-token inference cost by turning inference into continual learning: the model trains on the context itself, and training-time meta-learning prepares it to adapt effectively.

3.4.1 The inference-time task: prefill then decode¶

The paper uses standard next-token prediction with two inference phases (Section 2):
Prefill: condition on provided tokens x0, x1, …, xT (with x0 = <BOS>).
Decode: produce a distribution for the next token \hat{p}_{T+1} and incur test loss CE(\hat{p}_{T+1}, x_{T+1}).
Complexity baseline (Section 2):
Full attention: prefill O(T^2), decode O(T).
The proposed TTT framing targets: prefill O(T), decode O(1) (for the single-token decode setting; Section 2.1).
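The complexity gap above can be made concrete by counting query-key pairs. The following is a minimal sketch, not the paper's cost model: it counts only attention scores under causal masking, ignores the (also constant per token) cost of the TTT backward pass, and the function names are illustrative assumptions.

```python
def full_attention_pairs(T: int) -> int:
    """Query-key pairs scored by causal full attention over T tokens."""
    # Token t (0-indexed) attends to itself and the t tokens before it.
    return T * (T + 1) // 2


def sliding_window_pairs(T: int, k: int) -> int:
    """Query-key pairs when each token sees at most the last k tokens, itself included."""
    return sum(min(t + 1, k) for t in range(T))


# Per-token cost: grows linearly with T for full attention, saturates near k for SWA.
for T in (8_192, 32_768, 131_072):
    full = full_attention_pairs(T) / T
    swa = sliding_window_pairs(T, k=8_192) / T
    print(f"T={T:>7}  full attention pairs/token={full:>9.1f}  SWA pairs/token={swa:>8.1f}")
```

Total pairs for full attention grow as O(T^2), so the per-token figure grows as O(T); with a fixed window the per-token figure is bounded by k regardless of T.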

3.4.2 Inner-loop: “TTT via next-token prediction” (what happens at test time)¶

The model defines a next-token loss at each time step (Eq. (1)):
ℓ_t(W) = CE(f(x_{t-1}; W), x_t),
where f is the network and W are the weights being updated by TTT.
Online TTT (conceptual starting point):
As the model reads tokens sequentially, it updates its weights after each token (Eq. (2)):
- W_t = W_{t-1} − η ∇ℓ_t(W_{t-1}),
- with learning rate η, and initial test-time weights W_0.
Intuition via the toy example (Figure 2):
They remove all self-attention from a Transformer, leaving only MLP layers; without memory, it behaves like a bigram model.
TTT adds a gradient step while reading context (e.g., train on predicting x2 from x1), and the updated MLP weights store information about earlier tokens (Figure 2, left), improving subsequent predictions (Figure 2, right).
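The online update in Eq. (2) can be sketched on a deliberately tiny stand-in for f. The assumptions here are loud ones: the "model" is a bigram logit table rather than a Transformer MLP, the learning rate is arbitrary, and `online_ttt` is a hypothetical name. The sketch only illustrates the mechanics of scoring each token with W_{t-1} and then taking one gradient step.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def online_ttt(tokens, vocab_size, lr=1.0):
    """Run Eq. (2)-style per-token updates; return per-step losses and final weights.

    W[prev][nxt] is the logit for predicting nxt after prev (toy stand-in for f).
    """
    W = [[0.0] * vocab_size for _ in range(vocab_size)]  # W_0: uniform predictions
    losses = []
    for prev, target in zip(tokens[:-1], tokens[1:]):
        probs = softmax(W[prev])
        # l_t(W_{t-1}): the loss is measured before the update for this token.
        losses.append(-math.log(probs[target]))
        # Gradient of cross-entropy w.r.t. the logits row is probs - one_hot(target).
        for j in range(vocab_size):
            grad = probs[j] - (1.0 if j == target else 0.0)
            W[prev][j] -= lr * grad
    return losses, W


losses, _ = online_ttt([0, 1, 2, 3] * 8, vocab_size=4)
print(f"first loss={losses[0]:.3f}  last loss={losses[-1]:.3f}")
```

On this repeating sequence the loss starts at log(4) and falls as the table absorbs the pattern, which is the sense in which the context is "stored" in the weights.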

3.4.3 Outer-loop: “learning to (learn at test time)” via meta-learning¶

The key training objective matches test-time behavior by optimizing the loss after the sequence of test-time updates (Eq. (3)):
L(W_0; X) = (1/T) Σ_{t=1..T} CE(f(x_{t-1}; W_{t-1}), x_t),
where W_{t-1} is the result of applying the inner-loop update rule up to step t−1.
This requires differentiating through the update rule (because W_{t} depends on ∇ℓ_t), i.e., “gradients of gradients” (Section 2.2).
Why meta-learning is needed (the “naive” mismatch).
A naive alternative trains W_0 as if no test-time updates happen (Eq. (4)):
- L_naive(W_0; X) = (1/T) Σ_{t=1..T} ℓ_t(W_0).
This creates a mismatch between training-time and test-time behavior and empirically underperforms versus the E2E objective in the toy setup (Figure 2, right: TTT-naive vs TTT-E2E).
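The difference between Eq. (3) and Eq. (4) is easiest to see in a scalar toy where the gradient through the inner loop can be written by hand. The sketch below swaps cross-entropy for a squared error ℓ_t(w) = ½(w − x_t)², which is an assumption made purely for tractability; all function names are hypothetical. With this loss, each inner step contributes a factor (1 − η·ℓ″) = (1 − η) to dw_t/dw_0, and that second-derivative factor is exactly the "gradient of gradient" term the naive objective ignores.

```python
def inner_rollout(w0, xs, lr):
    """Return [w_0, w_1, ..., w_T] under w_t = w_{t-1} - lr * (w_{t-1} - x_t)."""
    ws = [w0]
    for x in xs:
        w = ws[-1]
        ws.append(w - lr * (w - x))  # gradient of 0.5 * (w - x)^2 is (w - x)
    return ws


def e2e_loss(w0, xs, lr):
    """Eq. (3)-style objective: token t is scored with the adapted weights w_{t-1}."""
    ws = inner_rollout(w0, xs, lr)
    return sum(0.5 * (w - x) ** 2 for w, x in zip(ws, xs)) / len(xs)


def e2e_grad(w0, xs, lr):
    """Exact d(e2e_loss)/d(w0), obtained by differentiating through every update."""
    ws = inner_rollout(w0, xs, lr)
    total, dw_dw0 = 0.0, 1.0  # dw_0/dw_0 = 1
    for w, x in zip(ws, xs):
        total += (w - x) * dw_dw0
        dw_dw0 *= 1.0 - lr  # chain rule through one inner step (second derivative is 1)
    return total / len(xs)


def naive_grad(w0, xs):
    """Gradient of the Eq. (4)-style objective, which pretends no updates happen."""
    return sum(w0 - x for x in xs) / len(xs)


xs, w0, lr = [1.0, -2.0, 0.5, 3.0], 0.3, 0.1
print("E2E gradient:  ", e2e_grad(w0, xs, lr))
print("naive gradient:", naive_grad(w0, xs))
```

The two gradients differ, so the two objectives pull W_0 toward different initializations; only the first accounts for how the weights will move at test time.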

3.4.4 Scaling the inner-loop: mini-batch TTT + sliding-window attention¶

Two issues arise for large models / long contexts (Section 2.2–2.3):

Efficiency: per-token (online) updates are sequential and hard to parallelize.
Stability: single-token gradients can be noisy and can “easily lead to gradient explosion by chance” (Section 2.2).

To address both, the method uses mini-batch TTT:

Partition the context tokens into batches of size b and update once per batch (Eq. (5)):
W_i = W_{i-1} − η (1/b) Σ_{t=(i−1)b+1..ib} ∇ℓ_t(W_{i-1}).
Update the outer-loop objective accordingly (Eq. (6)):
L(W_0; X) = (1/T) Σ_{i=1..T/b} Σ_{t=(i−1)b+1..ib} ℓ_t(W_{i-1}).
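Eqs. (5) and (6) can be sketched with a scalar toy as well. As an assumption for illustration, the loss is the squared error ℓ_t(w) = ½(w − x_t)² rather than cross-entropy, T is taken to be a multiple of b, and `minibatch_ttt` is a hypothetical name.

```python
def minibatch_ttt(w0, xs, b, lr):
    """Apply Eq. (5)-style updates; return (weights after each batch, Eq. (6)-style loss)."""
    assert len(xs) % b == 0, "sketch assumes T is a multiple of b"
    w, ws, total = w0, [w0], 0.0
    for start in range(0, len(xs), b):
        batch = xs[start:start + b]
        # Every token in the batch is scored with the same weights W_{i-1}.
        total += sum(0.5 * (w - x) ** 2 for x in batch)
        grad = sum(w - x for x in batch) / b  # gradient averaged over the batch
        w = w - lr * grad
        ws.append(w)
    return ws, total / len(xs)


ws, loss = minibatch_ttt(0.0, [1.0, 2.0, 3.0, 4.0], b=2, lr=0.5)
print(ws, loss)
```

Setting b = 1 recovers the online rule of Eq. (2); larger b gives fewer, less noisy updates that can be computed in parallel within a batch.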

However, batching introduces a new problem:

Within a batch, predictions are made with the same weights W_{i-1} (not updated token-by-token), so the model cannot incorporate earlier tokens inside the batch unless the architecture provides short-term memory (Section 2.3; Figure 2 shows degradation at b = 16 for the attention-free toy model).

The fix is architectural but minimal:

Use a standard Transformer with sliding-window attention of size k so the model can attend locally within a window, and set k ≥ b so the model can “remember the context within each mini-batch before TTT has a chance to update its weights” (Section 2.3).
In main long-context results, they use k = 8K and b = 1K for T = 128K (Section 2.3).
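The coupling between k and b can be captured as a small configuration check. This is a sketch under stated assumptions: `TTTConfig` and its field names are hypothetical, and "K" is taken to mean 1024 tokens (the ratio below is the same if it means 1000).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTTConfig:
    context_len: int  # T, tokens in the context
    window: int       # k, sliding-window attention size
    minibatch: int    # b, tokens per TTT update

    def __post_init__(self):
        if self.window < self.minibatch:
            raise ValueError("need k >= b so attention covers each TTT mini-batch")
        if self.context_len % self.minibatch != 0:
            raise ValueError("this sketch assumes T is a multiple of b")

    @property
    def num_minibatches(self) -> int:
        """Number of TTT mini-batches the context is split into (T / b)."""
        return self.context_len // self.minibatch


K = 1024  # assumption: "K" denotes 1024 tokens
main = TTTConfig(context_len=128 * K, window=8 * K, minibatch=1 * K)
print(main.num_minibatches)  # 128
```

With the reported settings, a 128K context is consumed as 128 mini-batches, each well inside the 8K attention window.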

3.4.5 Practical implementation details (what gets updated, and why)¶

Section 2.3.1 describes three implementation choices that the authors deem necessary for reported performance:

Update only MLP layers during TTT.
They freeze embeddings, normalization layers, and attention layers in the inner loop because updating them “causes instability in the outer loop” (Section 2.3.1).
Update only the last 1/4 of Transformer blocks.
Updating more layers increases “storage” (more parameters to hold compressed context) but costs more backprop compute per inner step (Section 2.3.1).
Their default is last quarter of blocks, chosen via ablations (Figure 4 right panel; Section 3.2.1).
Use two MLP layers per updated block: one static (frozen during the inner loop) and one updated by TTT.
Motivation: mitigate “forgetting the knowledge learned during pre-training” by keeping a static MLP as a “safe” store (Section 2.3.1).
To keep parameter count constant versus baselines, they reduce MLP hidden dimension throughout the network (Section 2.3.1).
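The three choices above amount to a rule for which parameters the inner loop may touch. The sketch below expresses that rule as a name filter; the parameter naming scheme (`blocks.<i>.mlp_ttt.*`, `blocks.<i>.mlp_static.*`, and so on) is entirely hypothetical and not taken from the paper's code.

```python
def ttt_updated_blocks(num_blocks: int) -> range:
    """Indices of the last quarter of blocks (at least one block)."""
    n = max(1, num_blocks // 4)
    return range(num_blocks - n, num_blocks)


def is_ttt_param(name: str, num_blocks: int) -> bool:
    """True only for the TTT-updated MLP inside the last quarter of blocks.

    Assumes hypothetical names such as 'blocks.20.mlp_ttt.w_in'.
    """
    parts = name.split(".")
    if len(parts) < 3 or parts[0] != "blocks" or not parts[1].isdigit():
        return False  # embeddings, final norm, output head, etc. stay frozen
    in_last_quarter = int(parts[1]) in ttt_updated_blocks(num_blocks)
    return in_last_quarter and parts[2] == "mlp_ttt"


names = [
    "embed.weight",
    "blocks.3.mlp.w_in",
    "blocks.20.attn.wq",
    "blocks.20.norm.scale",
    "blocks.20.mlp_static.w_in",
    "blocks.20.mlp_ttt.w_in",
]
print([n for n in names if is_ttt_param(n, num_blocks=24)])
```

For a 24-block model this selects blocks 18 through 23, and within them only the TTT-designated MLP; attention, norms, embeddings, and the static MLP are left untouched by the inner loop.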

Figure 3 (left) illustrates the computation graph: a forward pass computes next-token loss at the end; the backward pass updates only MLPs in the last quarter of blocks (and gradient is stopped before earlier blocks).

3.4.6 Decoding multiple tokens (beyond one-step decode)¶

The method extends naturally to generating long continuations (Section 2.3.2):
During decoding, it accumulates generated tokens until it fills a TTT mini-batch of size b.
Then it performs one TTT update on that batch of generated tokens before continuing generation with updated weights.

This makes decode behave like:
- Standard SWA decode within a batch, plus
- Periodic “prefill-like” TTT steps between batches (Section 3.7, decode latency discussion).
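The buffering logic described in this subsection can be sketched as a loop over two callbacks. Both callbacks are hypothetical placeholders: `next_token` stands in for one SWA decode step with the current weights, and `ttt_update` stands in for one inner-loop gradient step on a batch of generated tokens.

```python
from typing import Callable, List


def decode_with_ttt(
    next_token: Callable[[List[int]], int],
    ttt_update: Callable[[List[int]], None],
    prompt: List[int],
    num_new_tokens: int,
    b: int,
) -> List[int]:
    """Generate tokens, running one TTT update each time b new tokens accumulate."""
    tokens = list(prompt)
    pending: List[int] = []
    for _ in range(num_new_tokens):
        tok = next_token(tokens)  # ordinary decode step with the current weights
        tokens.append(tok)
        pending.append(tok)
        if len(pending) == b:
            ttt_update(pending)   # "prefill-like" step on the generated batch
            pending = []          # start buffering the next mini-batch
    return tokens[len(prompt):]


# Usage with stub callbacks that only record what happened.
updates: List[List[int]] = []
out = decode_with_ttt(lambda toks: len(toks) % 7, updates.append, [1, 2, 3], 10, b=4)
print(len(out), [len(u) for u in updates])
```

Tokens left in the buffer when generation stops never trigger an update, which matches the description of updating only once a full mini-batch is available.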

3.4.7 Alternative derivation: from `TTT-KVB` to `TTT-E2E`¶

Section 2.4 reframes prior long-context TTT (TTT-KVB) and shows the paper’s key conceptual shift:

TTT-KVB uses a layer-wise reconstruction objective that tries to predict “value” projections from “key” projections inside each layer (Eq. (7)), updating an internal model g per layer; output embeddings come from another call to g (Eq. (8)).
Step 1: simplify output rule by reusing the prediction directly (Eq. (9)); Table 1 shows negligible change:
TTT-KVB (Zhang et al.): loss 2.818 (diff −0.009 vs SWA baseline 2.827)
TTT-KVB simplified: loss 2.819 (diff +0.001 relative to the prior row; no larger than their stated 0.001 significance threshold)
Key step: replace layer-wise KVB loss with next-token prediction loss.
This yields TTT-E2E all layers MH (MH = multi-head), improving to loss 2.806 (diff −0.013 relative to the prior row, i.e., −0.021 vs the SWA baseline of 2.827 in Table 1).
Final step: update fewer blocks but with larger MLP state (regular MLPs, not multi-head LoRA).
The paper argues it is more cost-effective to update fewer blocks with larger “state,” because the backward pass must traverse heavy upstream computation anyway (Section 2.4.4).
They quantify this for a 760M model, comparing the final TTT-E2E with the intermediate variant:
- state size: 88M parameters (final) vs 18M (intermediate),
- prefill latency: 0.0086 (final) vs 0.017 (intermediate) seconds per 1K tokens on an H100 (Section 2.4.4).
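The numbers quoted in this subsection can be cross-checked with a few lines of arithmetic. The loss values and the 760M figures are the ones reported above; the helper name `row_diffs` is an assumption, and the latency pairing (0.0086 for the final variant, 0.017 for the intermediate one) follows the order in which the text lists them.

```python
TABLE1_LOSSES = [
    ("SWA baseline", 2.827),
    ("TTT-KVB", 2.818),
    ("TTT-KVB simplified", 2.819),
    ("TTT-E2E all layers MH", 2.806),
]


def row_diffs(rows):
    """For each row after the first: (name, diff vs prior row, diff vs first row)."""
    base = rows[0][1]
    out = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        out.append((name, round(cur - prev, 3), round(cur - base, 3)))
    return out


for name, vs_prior, vs_swa in row_diffs(TABLE1_LOSSES):
    print(f"{name:<24} vs prior row {vs_prior:+.3f}   vs SWA baseline {vs_swa:+.3f}")

state_ratio = 88 / 18            # final vs intermediate state size (parameters)
latency_ratio = 0.017 / 0.0086   # intermediate vs final prefill latency
print(f"state is {state_ratio:.1f}x larger, prefill is {latency_ratio:.2f}x faster")
```

Read this way, the switch to next-token prediction accounts for the largest single step in the table, and the final design holds roughly five times more state at about half the prefill latency.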

4. Key Insights and Innovations¶

(1) Long-context as continual learning via next-token self-training on the context.
Novelty: instead of designing a new long-context attention approximation, the model keeps learning at inference time to compress context into weights (Abstract; Introduction; Section 2.1).
Significance: enables constant per-token latency while still benefiting from longer context (Figure 1, right; Section 2).
(2) End-to-end meta-learning objective that matches “loss after TTT.”
Novelty: the outer loop trains the initialization W_0 to minimize the loss sequence that is actually experienced during test-time adaptation (Eq. (3), Eq. (6)), unlike L_naive (Eq. (4)).
Significance: in the toy example, E2E training turns an attention-free “bigram-like” model into one that nearly matches full attention, while naive training helps little (Figure 2, right).
(3) Practical, scalable TTT: mini-batch updates + SWA interleaving + selective layer updates.
Novelty (as packaged here): mini-batch TTT (b) to improve parallelism/stability (Eq. (5)) combined with SWA (k ≥ b) to avoid losing within-batch context (Section 2.3).
Significance: supports experiments up to 128K context with constant inference latency (Figure 1, right).
(4) A concrete design point for “state size vs compute”: update only the last quarter of blocks and only MLPs, with a second static MLP to reduce forgetting.
Novelty: the paper turns a vague “fast weights” idea into a specific Transformer recipe (Section 2.3.1) and validates the “how many layers” knob as central for context scaling (Figure 4, right).
Significance: provides an empirical handle for trading off long-context quality vs compute.
(5) Clarifying connection to TTT-KVB and RNN framing via an alternative derivation.
Novelty: isolates the key improvement as swapping the inner-loop loss to next-token prediction (“E2E at test time”) and then resizing the hidden state effectively (Section 2.4; Table 1).
Significance: explains why earlier “TTT layers as attention replacements” are not necessary for this method’s gains.

5. Experimental Analysis¶

5.1 Evaluation methodology: datasets, stages, metrics¶

Training stages (Section 3.1):
Pre-training at 8K context length.
Extension fine-tuning at the final evaluated context length (up to 128K).
Datasets (Section 3.1):
Pre-training: DCLM-Baseline (filtered Common Crawl subset).
- They discard all documents shorter than 8K tokens (to avoid resetting the updated MLPs at packed document boundaries, which slows their infrastructure).
- They sample training sets of varying total token counts from remaining documents.
Fine-tuning + evaluation: Books dataset for long-context extension because very long DCLM sequences (> 128K) are described as “low quality.”
Language modeling evaluation uses a held-out partition of Books (Section 3.1).
Metric:
Primary: language modeling loss reported as “log perplexity” (Figures 1, 4–7, 9).
In scaling plots, they use Loss ∆ = (loss of method) − (loss of full attention) so full attention is a flat line at 0 (Figures 1, 4, 5).
Baselines (Section 3.1):
Transformer with full attention.
Transformer with SWA.
Hybrid SWA + full attention (pattern “5:1”, in the style of Gemma).
Mamba 2 (hybrid with SWA layers).
Gated DeltaNet (hybrid with SWA layers).
TTT-KVB (hybrid of TTT-MLP KVB layers with SWA).

All SWA-based methods use window k = 8K except in ablations (Section 3.1).

5.2 Core configurations / hyperparameters (as reported)¶

Model sizes and recipes (Table 3; Appendix B):
Sizes: 125M, 350M, 760M, 1.3B, 2.7B parameters (the main text often refers to 3B for the largest setting).
Example configuration fields in Table 3: number of blocks, embedding dimension, attention heads, batch size (tokens per outer-loop batch), peak LR, and total tokens.
Training uses one epoch: steps = total_tokens / batch_size (Table 3).
Optimization schedule (Appendix B):
Learning rate warmup: linear from 0 to the peak learning rate over the first 10% of training.
Then cosine decay to 1e-5.
Tokenizer and positional encoding (Appendix B):
Tokenizer: Llama 3 tokenizer.
Positional encoding: RoPE with θ = 500K for pre-training at 8K.
For extension fine-tuning, they change θ for full attention, setting it by target context length (following their cited practice):
- θ = 1M for 16K, 2M for 32K, 5M for 64K, 10M for 128K.
TTT hyperparameters for main long-context results (Section 2.3):
k = 8K (SWA window),
b = 1K (TTT mini-batch size),
update only the last 1/4 of blocks,
update only MLP layers in those blocks, with an added second static MLP per updated block (Section 2.3.1).
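The optimization schedule and RoPE settings listed in this subsection can be written down compactly. This is a sketch, not the authors' training code: it assumes warmup is measured in optimizer steps, that the cosine phase ends exactly at the last step, and the function and constant names are hypothetical. No peak learning rate from Table 3 is filled in; the value in the usage lines is a placeholder.

```python
import math


def learning_rate(step, total_steps, peak_lr, final_lr=1e-5, warmup_frac=0.10):
    """Linear warmup from 0 to peak_lr, then cosine decay down to final_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# RoPE base θ keyed by context length in K tokens, as listed in the text
# (8K is the pre-training value; the others apply to extension fine-tuning).
ROPE_THETA = {8: 500_000, 16: 1_000_000, 32: 2_000_000, 64: 5_000_000, 128: 10_000_000}

PLACEHOLDER_PEAK_LR = 3e-4  # illustrative only, not a value from the paper
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, total_steps=1000, peak_lr=PLACEHOLDER_PEAK_LR))
```

The schedule reaches the peak at the end of warmup and lands on 1e-5 at the final step; θ grows with the target context length so that rotary phases stay distinguishable over longer ranges.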

5.3 Main quantitative results¶

(A) Scaling with context length: quality and latency¶

Quality scaling (Figure 1 left; Figure 9):
In the 3B-scale setting trained with 164B tokens, TTT-E2E is shown as maintaining an advantage over full attention as context increases up to 128K, while methods like SWA, Mamba 2, and Gated DeltaNet degrade in Loss ∆ at long context (Figure 1 left).
Figure 9 (same experiments as Figure 1 left but plotting absolute loss) shows:
- longer context improves loss for full attention and the hybrid baseline across all tested lengths,
- but after 32K, longer context hurts loss for SWA, Mamba 2, Gated DeltaNet, and TTT-KVB,
- while TTT-E2E continues improving through 128K.

The paper attributes the “hurts after 32K” trend to higher gradient variance during fine-tuning (fewer sequences per outer-loop batch) combined with inability to leverage longer context effectively (Figure 9 caption).

Latency scaling (Figure 1 right):
TTT-E2E has constant prefill latency w.r.t. context length (like SWA and the RNN baselines).
It is reported as 2.7× faster than full attention for 128K context on an H100 (Figure 1 right caption).

(B) Ablations: what matters for `TTT-E2E`¶

Sliding window size k (Figure 4 left; Section 3.2):
Larger k improves performance for SWA, Gated DeltaNet, and TTT-E2E.
They choose k = 8K because smaller k “does not significantly improve runtime.”
TTT mini-batch size b (Figure 4 middle; Section 3.2):
Larger b (from 1K up to 8K) significantly hurts performance for both TTT-E2E and TTT-KVB.
They choose b = 1K as a default because smaller b harms hardware utilization and stability.
Number of layers updated (Figure 4 right; Section 3.2.1):
Updating too few layers (e.g., 1 or 3 layers in a 24-layer 760M model) fails to scale with context like full attention.
Updating 6 or 12 layers scales similarly, with 12 not clearly better than 6.
They therefore update the last 1/4 of layers across model sizes.

(C) Scaling with training compute (Figure 5)¶

Axes explored (Section 3.3):
Model size scaling: 125M → 3B range (Figure 5a,b).
Token scaling: for 760M, vary tokens 16B to 80B (Figure 5c,d), keeping fine-tuning tokens at 5% of pre-training tokens.
Evaluations (Section 3.3):
After pre-training: evaluate on DCLM at 8K context (Figure 5a,c).
After fine-tuning: evaluate on Books at 32K context (Figure 5b,d).
Observed trend (Section 3.3):
Under small compute budgets, the advantage of TTT-E2E over full attention decreases.
Under medium budgets, TTT-E2E follows a similar scaling trend to full attention (blue line relatively flat in Loss ∆).
The paper marks regime boundaries around model size 760M and token count 48B (dotted lines in Figure 5).

(D) Where the loss gains come from (Figure 6)¶

Figure 6 breaks down token-wise loss vs token index for 32K and 128K contexts (Section 3.4.1).
Two key observations:
TTT-E2E is the only method plotted that stays below full attention throughout the whole context length.
Most of TTT-E2E’s aggregate advantage comes from earlier tokens, with differences smaller near the end of the context window.

The paper highlights that TTT-E2E outperforms full attention even before the first inner-loop update happens (before t = 1K given b = 1K), implying the meta-learned initialization itself changes behavior (Section 3.4.1).

(E) Recall-focused evaluation: Needle in a Haystack (Table 2)¶

Task: RULER’s three “S-NIAH” tasks at varying context lengths (Section 3.5).
Result: full attention “dramatically outperforms” all other methods, including TTT-E2E, especially at long context (Table 2; Section 3.5).
Example: on S-NIAH-1 (pass-key retrieval) at 128K, full attention is 0.99 while TTT-E2E is 0.06 (Table 2).
Interpretation: supports the paper’s framing that full attention’s strength is near-lossless recall, while TTT-E2E relies on compression that can drop “seemingly irrelevant details” like the needle string (Section 3.5).

(F) Long decoding behavior (Figure 7)¶

Setup (Section 3.6; Appendix D):
Use Qwen-3-8B-Base as an evaluator.
Prefill 8K tokens from Books, then decode 8K continuation tokens.
Sampling uses temperature 1, top-p 0.95, plus repetition penalty 1.1 (Appendix D).
Result: TTT-E2E achieves lower Qwen-evaluated loss than full attention in this test, with both showing a loss spike at the prefill→decode boundary then decreasing (Figure 7; Section 3.6).
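The sampling settings in the setup (temperature 1, top-p 0.95, repetition penalty 1.1) can be illustrated with a small sampler. The paper states only the hyperparameters; the repetition-penalty rule below (divide positive logits and multiply negative logits of already-generated tokens by the penalty) is one common convention and is an assumption here, as are all function names.

```python
import math
import random


def apply_repetition_penalty(logits, generated, penalty=1.1):
    """Down-weight tokens that were already generated (assumed convention)."""
    out = list(logits)
    for tok in set(generated):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_p_filter(probs, top_p=0.95):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_next(logits, generated, rng, temperature=1.0, top_p=0.95, penalty=1.1):
    """Sample one token id using repetition penalty, temperature, and nucleus filtering."""
    logits = apply_repetition_penalty(logits, generated, penalty)
    m = max(logits)
    exps = [math.exp((v - m) / temperature) for v in logits]
    z = sum(exps)
    nucleus = top_p_filter([e / z for e in exps], top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


rng = random.Random(0)
print(sample_next([2.0, 1.0, 0.5, -3.0], generated=[0], rng=rng))
```

Because generation is stochastic, the evaluator's loss on sampled continuations measures the quality of typical samples rather than of a single greedy path.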

5.4 Do the experiments support the claims?¶

The paper’s central empirical claims are strongly aligned with its plots:
Constant-latency long-context prefill vs full attention: directly shown (Figure 1 right).
Context-length scaling in loss for 3B / 164B tokens: shown as stable advantage in Loss ∆ (Figure 1 left) and improving absolute loss through 128K (Figure 9).
Importance of state capacity (layers updated) for scaling: supported by the dedicated ablation (Figure 4 right).
The paper is also explicit about trade-offs:
It shows TTT-E2E is not competitive on recall-heavy NIAH tasks (Table 2), which constrains when its loss improvements translate into “better long-context reasoning” vs “better compression.”

6. Limitations and Trade-offs¶

Weakness on near-lossless recall tasks (explicit).
TTT-E2E performs far worse than full attention on Needle-in-a-Haystack retrieval, especially at long contexts (Table 2; Section 3.5).
This is consistent with the method’s compression mechanism: it may discard details that matter for exact string retrieval.
Training efficiency is a significant limitation (explicit).
Training requires gradients of gradients (meta-learning), which is “much less optimized” than standard Transformer training (Section 3.7).
On H200, training latency is reported as:
- 1.2× faster than full attention at 128K,
- but 3.4× slower at 8K (Figure 8 left; Section 3.7).
Even though FLOPs/token remains constant for TTT-E2E in their accounting (Figure 8 right), latency still grows between 8K and 32K because they increase gradient checkpointing-through-time by log(T) to manage memory (Section 3.7 footnote 9).
Method depends on careful hyperparameter coupling and design constraints.
The method relies on setting k ≥ b so within-batch context is available (Section 2.3).
Larger b hurts performance (Figure 4 middle), but smaller b harms utilization/stability (Section 3.2).
Updating too few layers breaks context scaling (Figure 4 right), while updating too many layers increases compute.
Inner-loop updates are restricted for stability.
They freeze embeddings, norms, and attention layers during TTT due to instability in the outer loop (Section 2.3.1), which may limit how much the model can adapt at test time.
Long decoding evaluation is limited in realism (explicit caveat).
The paper notes that realistic long decoding often happens after instruction tuning or RL, which they do not cover; their evaluation uses base models and an external evaluator model (Section 3.6).

7. Implications and Future Directions¶

Conceptual shift: long-context modeling as continual learning.
The paper suggests progress on long-context need not come only from new attention mechanisms; instead, inference-time adaptation plus meta-learned initialization can provide a different route to long-context effectiveness (Abstract; Introduction; Section 4.2).
A practical regime where TTT-E2E looks attractive (based on provided results).
If the task is language modeling / predictive use of long context where exact recall is not central, TTT-E2E offers:
- constant prefill latency with context length (Figure 1 right),
- improving loss up to 128K (Figure 9),
- and favorable scaling vs several architecture baselines (Figure 1 left).
If the task requires exact retrieval of a small detail buried far back, full attention remains far stronger (Table 2).
Systems directions for making training practical (explicitly proposed).
The paper identifies two routes (Section 3.7):
- Build a custom attention kernel that supports gradients-of-gradients (they cannot use cuDNN FlashAttention during training because it does not support this in their setup).
- Initialize TTT-E2E training from a pre-trained Transformer without TTT, so meta-learning is a smaller fraction of total compute.
Algorithmic extensions hinted by the paper.
They propose exploring TTT on self-generated tokens (e.g., filtered/rephrased context or “review” tokens) as a way to guard against spurious inputs, drawing an analogy to gating mechanisms in RNNs (Section 3.3).
Repro/Integration Guidance (from their setup choices).
A minimal integration, consistent with the paper’s working recipe, is:
- Start from a Transformer with SWA(k).
- Add inner-loop updates on next-token prediction loss with mini-batches b and ensure k ≥ b (Section 2.3).
- Update only MLPs in the last quarter of blocks; freeze attention/embeddings/norms for stability (Section 2.3.1).
- Train the initialization with the outer-loop objective that differentiates through the inner loop (Eq. (6)), acknowledging the training-time overhead (Section 3.7).
Based on their ablations, the most sensitive knobs for long-context scaling are b and the number of updated layers (Figure 4 middle/right), while k behaves similarly to other SWA-based baselines (Figure 4 left).