Learning to (Learn at Test Time): RNNs with Expressive Hidden States

🎯 Pitch

The paper introduces Test-Time Training (TTT) layers, whose hidden state is itself a small model with weights W_t that are updated by gradient steps as the sequence is processed, giving sequence-modeling layers with linear complexity in sequence length. By compressing context through learned inner-loop updates (TTT-Linear and TTT-MLP), the approach keeps RNN-style efficiency while making better use of long context: perplexity keeps improving as context grows, where modern RNNs such as Mamba plateau. The paper presents this as a practical path toward scalable long-context models.

1. Executive Summary

This paper introduces Test-Time Training (TTT) layers, a family of sequence-modeling layers with linear complexity in sequence length whose hidden state is not a fixed vector but the weights W_t of a small model, updated by gradient descent while the sequence is processed, even at test time (Eq. (2), Figure 1). Its main contribution is a practical framework that aims to combine the hardware and asymptotic efficiency of RNN-style recurrence with better long-context utilization, addressing the plateau that the paper's evaluation observes for modern RNNs (e.g., Mamba) beyond ~16k tokens (Figure 2 right).

2. Context and Motivation

Problem / gap
Self-attention is effective in long-context modeling but has quadratic cost in context length because each token attends over all previous tokens (Figure 3, “Self-attention” row; discussion in §1 and §2).
RNN layers have linear cost but must compress all past context into a fixed-size hidden state, which can limit long-context performance (§1, §2).
Why important
The paper argues that the practical advantage of linear-time models only becomes meaningful at long context (they reference “after 8k” in their setting; §1, Figure 12 discussion).
In long contexts, one wants perplexity to keep improving as more conditioning information is available; their evaluation shows this happens for Transformers but not for Mamba after ~16k (Figure 2 right; §1).
Prior approaches and shortcomings (as positioned by the paper)
Classic RNNs (e.g., LSTMs) were previously observed not to scale like Transformers (cited as Kaplan et al.; §1).
Modern RNNs (Mamba) improve scaling at moderate context, but still show a long-context plateau in their token-index perplexity diagnostic (Figure 2; §1).
Linear-attention / “fast weight” style methods exist; the paper later shows a formal equivalence between a particular TTT instantiation and linear attention (Theorem 1, §2.6).
How this paper positions itself
It reframes sequence modeling layers via a unifying “hidden state + update rule + output rule” lens (Figure 3).
It proposes a framework where the hidden state can be an arbitrary learnable model (not just a vector or matrix), trained online via a self-supervised objective whose form is itself learned in an outer loop (§2.1–§2.3).
It contributes systems techniques (mini-batch token updates and a “dual form”) to make this tractable on accelerators (§2.4–§2.5).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a new kind of recurrent sequence layer whose internal memory is a small model (e.g., a linear map or a 2-layer MLP) that is trained online as the sequence is processed.
It targets long-context modeling by compressing past tokens into the model’s weights using self-supervised learning updates, while keeping a fixed-size recurrent state rather than attention’s growing KV cache (Figure 1, Figure 3; Eq. (2)).

3.2 Big-picture architecture (diagram in words)

Input token representation x_t enters a TTT layer.
The layer maintains a hidden state = learner state, mainly the current weights W_t (and potentially optimizer state; §2.6).
For each token (or token mini-batch):
Compute a self-supervised loss (multi-view reconstruction) using learnable projections (θ_K, θ_V) (Eq. (4), §2.3).
Apply a (mini-batch) gradient-descent step to W (Eq. (6), §2.4).
Produce the layer output using a test-view projection θ_Q and the updated model f(·; W_t) (Eq. (5), §2.3).
The TTT layer is used inside a standard LM training setup (next-token prediction outer objective), optionally with a Mamba-style backbone (temporal convolutions + gating) rather than a Transformer block (§2.7, Figure 13).

3.3 Roadmap for the deep dive

First, define the TTT hidden state as model weights and the basic update/output rules (Eq. (1)–(3), Figure 1).
Next, explain how the self-supervised task is made learnable via multi-view projections optimized in the outer loop (Eq. (4)–(5), Figure 5).
Then cover the two efficiency techniques:
Mini-batch TTT (parallelize gradients within token blocks) (§2.4, Figure 6–7).
The dual form (rewrite computations to use matmuls, avoid materializing per-token gradients and intermediate weights) (§2.5, Eq. (7)–(8)).
Finally, connect the framework to known mechanisms:
Equivalence to linear attention (Theorem 1, Table 1).
Equivalence to self-attention via a nonparametric learner (Theorem 2).
Close with concrete instantiations: TTT-Linear and TTT-MLP and their implementation choices (§2.7).

3.4 Detailed, sentence-based technical breakdown

Framing sentence (type of paper + core idea).
This is an algorithm + systems + empirical scaling paper whose core idea is to turn the recurrent hidden state into a trainable model updated by self-supervised gradient steps during the forward pass, so that the layer “learns at test time” by construction (Eq. (2), Figure 1, §2.1).

3.4.1 Unifying view: sequence layer = hidden state + update rule + output rule

The paper treats any autoregressive sequence layer as maintaining a hidden state s_t that evolves via an update rule and produces outputs via an output rule (Figure 3 top).
In this lens:
A “naive RNN” keeps a fixed-size vector state and updates it with a parametric recurrence (Figure 3 bottom).
Self-attention keeps an ever-growing list state (KV cache), making per-token cost grow with time index t (Figure 3 bottom).
A “naive TTT” keeps a fixed-size state too—but that state is the parameter vector/matrix of a model f (Figure 3 bottom).
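The three rows of this lens can be sketched side by side to make the state-size contrast concrete. This is a minimal NumPy sketch, not the paper's code: the class names, the tanh recurrence, the identity key/value maps in the attention class, and the uncorrupted reconstruction loss in `NaiveTTT` are all illustrative assumptions.

```python
import numpy as np

class NaiveRNN:
    """Fixed-size vector state updated by a parametric recurrence."""
    def __init__(self, d, rng):
        # A and B are hypothetical recurrence weights (outer-loop parameters).
        self.A = rng.normal(scale=0.1, size=(d, d))
        self.B = rng.normal(scale=0.1, size=(d, d))
        self.s = np.zeros(d)

    def step(self, x):
        self.s = np.tanh(self.A @ self.s + self.B @ x)  # update rule
        return self.s                                   # output rule

    def state_size(self):
        return self.s.size


class SelfAttention:
    """State is the KV cache, a list that grows by one entry per token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(x)       # update rule: append (identity key/value maps)
        self.values.append(x)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ x
        w = np.exp(scores - scores.max())
        return (w / w.sum()) @ V  # output rule: attend over the whole cache

    def state_size(self):
        return sum(k.size + v.size for k, v in zip(self.keys, self.values))


class NaiveTTT:
    """State is the weight matrix W of a linear model f(x; W) = W x."""
    def __init__(self, d, eta=0.01):
        self.W, self.eta = np.zeros((d, d)), eta

    def step(self, x):
        grad = 2.0 * np.outer(self.W @ x - x, x)  # gradient of ||W x - x||^2
        self.W = self.W - self.eta * grad         # update rule: one GD step
        return self.W @ x                         # output rule

    def state_size(self):
        return self.W.size


rng = np.random.default_rng(0)
d, T = 4, 8
layers = {"rnn": NaiveRNN(d, rng), "attn": SelfAttention(), "ttt": NaiveTTT(d)}
sizes = {name: [] for name in layers}
for _ in range(T):
    x = rng.normal(size=d)
    for name, layer in layers.items():
        layer.step(x)
        sizes[name].append(layer.state_size())
```

After the loop, `sizes["rnn"]` and `sizes["ttt"]` stay constant across steps, while `sizes["attn"]` grows linearly with the token index, which is the contrast the lens is meant to expose.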

3.4.2 Core TTT mechanism: hidden state as weights `W_t`, update rule as learning

Hidden state definition. The hidden state at time t is W_t, the parameters of a “learner model” f (Figure 1; §2.1).
Output rule. The layer output at token t is a prediction produced by the current learner:
Base form: z_t = f(x_t; W_t) (Eq. (1)).
After introducing views (below): z_t = f(θ_Q x_t; W_t) (Eq. (5)).
Update rule. The recurrence update is a gradient step on a self-supervised loss:
W_t = W_{t-1} - η ∇ℓ(W_{t-1}; x_t) (Eq. (2)).
Interpretation. Because this update happens while processing a sequence during inference too, the layer performs training at test time on the test sequence (§2.1).

3.4.3 Self-supervised objective: from naive reconstruction to learned multi-view reconstruction

Naive reconstruction. A straightforward self-supervised task is to reconstruct x_t from a corrupted version x̃_t:
ℓ(W; x_t) = || f(x̃_t; W) - x_t ||^2 (Eq. (3)).
Learned task via multi-view projections.
The paper introduces views—learnable low-rank linear projections that define what the learner sees and what it must predict (§2.3).
A training view: θ_K x_t.
A label view: θ_V x_t.
A test view: θ_Q x_t.
The inner-loop loss becomes:
ℓ(W; x_t) = || f(θ_K x_t; W) - θ_V x_t ||^2 (Eq. (4)).
The output becomes:
z_t = f(θ_Q x_t; W_t) (Eq. (5)).
Outer vs inner parameters (critical distinction).
W_t is not an outer-loop parameter; it is a sequence-specific hidden state that starts from W_0 for each sequence and is updated as that sequence's tokens are processed.
θ_K, θ_V, θ_Q (and later θ_init, θ_lr) are outer-loop parameters trained with the standard LM objective (next-token prediction) (§2.2–§2.3, Table 2).
Code-level picture.
Figure 5 shows a conceptual implementation where Task contains θ_K, θ_V, θ_Q, while Learner contains the inner-loop model and optimizer, and TTT_Layer.forward() iterates through tokens calling train() then predict().
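The Task / Learner / layer split described here can be sketched roughly as follows. This is a loose NumPy analogue under stated assumptions, not the paper's Figure 5 code: the learner is linear, the gradient is written by hand, the inner learning rate `eta` is a fixed scalar, the projection size `d_low` and the random initialization are hypothetical, and no output projection back to dimension `d` is included.

```python
import numpy as np

class Task:
    """Holds the outer-loop view parameters theta_K, theta_V, theta_Q."""
    def __init__(self, d, d_low, rng):
        scale = d ** -0.5  # hypothetical init scale
        self.theta_K = rng.normal(scale=scale, size=(d_low, d))
        self.theta_V = rng.normal(scale=scale, size=(d_low, d))
        self.theta_Q = rng.normal(scale=scale, size=(d_low, d))

    def loss_and_grad(self, W, x):
        # Eq. (4) for a linear learner: ||W (theta_K x) - theta_V x||^2
        k, v = self.theta_K @ x, self.theta_V @ x
        residual = W @ k - v
        return float(residual @ residual), 2.0 * np.outer(residual, k)


class Learner:
    """Inner-loop model (weights W) plus a plain gradient-descent optimizer."""
    def __init__(self, task, d_low, eta):
        self.task, self.eta = task, eta
        self.W = np.zeros((d_low, d_low))  # W_0 (fixed at zero in this sketch)

    def train(self, x):
        loss, grad = self.task.loss_and_grad(self.W, x)
        self.W = self.W - self.eta * grad  # Eq. (2)
        return loss

    def predict(self, x):
        return self.W @ (self.task.theta_Q @ x)  # Eq. (5)


class TTTLayer:
    def __init__(self, d, d_low, eta=0.1, seed=0):
        self.task = Task(d, d_low, np.random.default_rng(seed))
        self.d_low, self.eta = d_low, eta

    def forward(self, in_seq):
        learner = Learner(self.task, self.d_low, self.eta)  # fresh W per sequence
        out = []
        for x in in_seq:
            learner.train(x)                 # update the hidden state on x_t
            out.append(learner.predict(x))   # then emit z_t with the updated W_t
        return np.stack(out)


seq = np.random.default_rng(1).normal(size=(6, 8))  # T=6 tokens, d=8
layer = TTTLayer(d=8, d_low=4)
out = layer.forward(seq)
```

Because a new `Learner` is created inside `forward`, W never leaks between sequences, while the `Task` parameters persist; that mirrors the inner/outer split described above.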

3.4.4 Why learning updates are plausible as “compression”

The paper motivates the update rule as a compression heuristic: tokens that generate large gradients cause larger updates, so the hidden model’s weights preferentially encode information that is “learn-worthy” under the self-supervised objective (§2.1).
Empirically, they show the TTT loss improves over time within test sequences:
In Figure 4 (and expanded in Figure 14), ℓ(W_t; x_t) is lower than ℓ(W_{t-1}; x_t) after one gradient step, and performance relative to the initialization W_0 improves further along the sequence (Figure 4 caption; §2.1).
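For the plain linear learner (without the LN and residual wrapper introduced later), the one-step improvement can be checked in closed form. The derivation below is a sketch under that assumption; the shorthand k, v, and r is introduced here and is not the paper's notation.

```latex
\begin{aligned}
k &= \theta_K x_t, \quad v = \theta_V x_t, \quad r = W_{t-1} k - v, \\
\nabla_W \ell(W_{t-1}; x_t) &= 2\, r\, k^\top, \\
W_t k - v &= \left(W_{t-1} - 2\eta\, r\, k^\top\right) k - v
           = \left(1 - 2\eta \lVert k \rVert^2\right) r, \\
\ell(W_t; x_t) &= \left(1 - 2\eta \lVert k \rVert^2\right)^2 \ell(W_{t-1}; x_t).
\end{aligned}
```

In this simplified setting the loss on x_t strictly decreases whenever r ≠ 0 and 0 < η < 1/‖k‖², which matches the direction of the empirical observation; it does not by itself explain the improvement relative to W_0 along the sequence.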

3.4.5 Making TTT trainable end-to-end: backprop through inner-loop updates

Training a model containing TTT layers uses standard outer-loop next-token prediction, but gradients must flow through the inner-loop update computations.
The paper notes that although the forward pass uses a gradient operator ∇ internally, this still corresponds to differentiable computations; backprop “through gradients” is described as taking “gradients of gradients,” connecting to meta-learning ideas (§2.2).
The paper’s terminology:
Inner loop: updates to W inside the TTT layer via ∇ℓ (§2.2, Table 2).
Outer loop: standard LM training updating θ_rest and also the TTT task parameters θ_K, θ_Q, θ_V (and others) (§2.2–§2.3, Table 2).
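To see that the inner gradient step is just another differentiable computation, a scalar toy case is enough. The sketch below assumes a one-dimensional token, scalar views, and a single inner step; the function names and the numeric values are hypothetical. It compares a hand-derived outer gradient with respect to θ_V against a finite-difference estimate.

```python
def ttt_output(theta_K, theta_V, theta_Q, x, W0=0.5, eta=0.1):
    """One inner GD step on (W k - v)^2 starting at W0, then predict on q."""
    k, v, q = theta_K * x, theta_V * x, theta_Q * x
    inner_grad = 2.0 * (W0 * k - v) * k   # d/dW of the inner loss at W0
    W1 = W0 - eta * inner_grad            # inner-loop update
    return W1 * q                         # layer output z


def d_output_d_theta_V(theta_K, theta_Q, x, eta=0.1):
    """Outer gradient dz/dtheta_V, differentiating through the inner step.

    z = (W0 - 2*eta*(W0*k - v)*k) * q and only v = theta_V * x depends on
    theta_V, so dz/dtheta_V = 2 * eta * k * x * q.
    """
    k, q = theta_K * x, theta_Q * x
    return 2.0 * eta * k * x * q


theta_K, theta_V, theta_Q, x = 0.7, -0.3, 1.2, 2.0
analytic = d_output_d_theta_V(theta_K, theta_Q, x)

h = 1e-4  # central finite difference in theta_V
numeric = (ttt_output(theta_K, theta_V + h, theta_Q, x)
           - ttt_output(theta_K, theta_V - h, theta_Q, x)) / (2.0 * h)
```

In a real implementation an autodiff framework would produce this outer gradient automatically; the hand derivation is only there to show that nothing beyond ordinary differentiation is involved.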

3.4.6 Systems/efficiency technique #1: mini-batch TTT (token blocks)

Problem: In online GD, each gradient G_t = ∇ℓ(W_{t-1}; x_t) is evaluated at the previous weights W_{t-1}, so the updates are inherently sequential and hard to parallelize (§2.4).
Key observation: Gradient descent variants can be written as W_t = W_{t-1} - η G_t = W_0 - η Σ_{s=1}^t G_s (Eq. (6)), so once the gradients G_1, …, G_t are available, every W_t can be obtained with a prefix sum (“cumsum”).
Mini-batch gradient descent across tokens.
Divide the sequence into mini-batches of tokens of size b.
For tokens within a mini-batch, compute gradients with respect to the same reference weights (the weights at the start of the mini-batch), enabling parallel gradient computation (§2.4, Figure 6).
This yields a speed–quality trade-off:
- Smaller b is closer to online GD (more steps, better perplexity).
- Larger b is closer to batch GD (more parallelism, worse perplexity).
They choose b = 16 for all experiments (Figure 7; §2.4).
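The block structure and the cumsum trick can be sketched as below. This is a minimal NumPy sketch assuming a linear learner, a fixed scalar η, W_0 = 0, and pre-projected views passed in as arrays `K`, `V`, `Q` (hypothetical names for θ_K x_t, θ_V x_t, θ_Q x_t stacked over time); it is not the paper's implementation.

```python
import numpy as np

def online_ttt(K, V, Q, eta):
    """Reference: one strictly sequential GD step per token."""
    T, d = K.shape
    W, Z = np.zeros((d, d)), np.zeros((T, d))
    for t in range(T):
        W = W - eta * 2.0 * np.outer(W @ K[t] - V[t], K[t])
        Z[t] = W @ Q[t]
    return Z


def minibatch_ttt(K, V, Q, eta, b):
    """Mini-batch TTT: gradients in a block share the block's starting weights."""
    T, d = K.shape
    W_base, Z = np.zeros((d, d)), np.zeros((T, d))
    for start in range(0, T, b):
        sl = slice(start, min(start + b, T))
        # Residuals for the whole block at the same reference weights W_base,
        # so these gradients have no dependence on each other.
        R = K[sl] @ W_base.T - V[sl]
        G = 2.0 * np.einsum("ti,tj->tij", R, K[sl])      # G[t] = 2 r_t k_t^T
        W_block = W_base - eta * np.cumsum(G, axis=0)    # prefix sum gives each W_t
        Z[sl] = np.einsum("tij,tj->ti", W_block, Q[sl])  # z_t = W_t q_t
        W_base = W_block[-1]                             # carry to the next block
    return Z


rng = np.random.default_rng(0)
T, d, eta = 32, 4, 0.05
K, V, Q = (rng.normal(size=(T, d)) for _ in range(3))
Z_online = online_ttt(K, V, Q, eta)
Z_b1 = minibatch_ttt(K, V, Q, eta, b=1)
Z_b16 = minibatch_ttt(K, V, Q, eta, b=16)
```

With b=1 the mini-batch version reduces to online GD, and with b=T it becomes batch GD; intermediate b trades the two off. Note that this sketch still materializes every G_t and W_t inside a block, which is the cost the dual form is designed to avoid.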

3.4.7 Systems/efficiency technique #2: the dual form (matmul-friendly rewrite)

Problem: Even with mini-batches, the “primal” computation involves many per-token outer products and materialization of G_t and W_t, which is inefficient on accelerators and heavy in memory I/O (§2.5).
Dual-form idea: Do not materialize intermediate gradients G_1…G_b or intermediate weights W_1…W_{b-1} inside a mini-batch; instead compute:
The end-of-batch weights W_b, and
The batch of outputs z_1…z_b, using a small number of large matrix multiplications (§2.5, Figure 6).
Concrete derivation in the simplified linear case (§2.5).
With f(x;W)=Wx and a simple squared reconstruction loss, they show:
W_b = W_0 - 2η (W_0 X - X) X^T where X=[x_1,…,x_b] (derivation around Eq. (7)).
For outputs, they derive a masked formulation:
Z = [z_1,…,z_b] = W_0 X - 2η (W_0 X - X) mask(X^T X) (Eq. (8)), where mask(·) keeps entries (s, t) with s ≤ t and zeroes the rest, analogous to causal masking in attention but with zeros instead of -∞.
Complexity trade-off:
- Dual form uses O(b d^2) for end-of-batch weights and adds O(b^2 d) to compute all outputs in the batch (§2.5).
- The paper argues this is acceptable because they pick small b (16) and typical d is a few hundred, and it improves wall-clock time substantially on TPU: “more than 5× faster” in JAX (§2.5).
Extension to MLP learners.
Appendix A generalizes the dual form to multi-layer MLPs using standard forward/backprop quantities like ∇_{Z^k} ℓ and masked matmuls.
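For the simplified linear case, the primal and dual computations within one mini-batch can be compared numerically. The sketch below assumes f(x; W) = W x, the loss ‖W x − x‖², a fixed scalar η, and columns of `X` as tokens; the function names are hypothetical and this is not the paper's kernel.

```python
import numpy as np

def primal_block(W0, X, eta):
    """Materialize each G_t and W_t in turn (all gradients taken at W0)."""
    d, b = X.shape
    Z = np.zeros((d, b))
    grad_sum = np.zeros_like(W0)
    W_t = W0
    for t in range(b):
        x = X[:, t]
        grad_sum = grad_sum + 2.0 * np.outer(W0 @ x - x, x)  # G_t at W0
        W_t = W0 - eta * grad_sum
        Z[:, t] = W_t @ x
    return W_t, Z


def dual_block(W0, X, eta):
    """Same result from a few large matmuls, with no per-token W_t or G_t."""
    R = W0 @ X - X                       # d x b residuals, one column per token
    W_b = W0 - 2.0 * eta * R @ X.T       # end-of-batch weights: O(b d^2)
    M = np.triu(X.T @ X)                 # keep x_s^T x_t only for s <= t
    Z = W0 @ X - 2.0 * eta * R @ M       # all outputs: extra O(b^2 d)
    return W_b, Z


rng = np.random.default_rng(0)
d, b, eta = 8, 16, 0.01
W0 = rng.normal(scale=0.1, size=(d, d))
X = rng.normal(size=(d, b))
W_primal, Z_primal = primal_block(W0, X, eta)
W_dual, Z_dual = dual_block(W0, X, eta)
```

The two functions agree up to floating-point error; the dual version simply reorders the arithmetic so that the work lands in matrix multiplications.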

3.4.8 Theoretical equivalences: how TTT unifies RNN/attention constructions

Equivalence to linear attention (parametric learner).
Theorem 1 (§2.6) states: with a linear inner model f(x)=Wx, batch GD, η=1/2, and W_0=0, the TTT output matches linear attention.
The proof rewrites the batch gradient update to show W_t becomes a running sum of outer products of projected values and keys, and the output becomes the standard linear-attention form (§2.6).
Table 1 empirically verifies the equivalence (“Linear attn. improved” and “TTT equivalence” both have perplexity 15.23, diff 0).
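The conditions of Theorem 1 can also be checked numerically on random data. The sketch below assumes unnormalized causal linear attention and pre-projected views `K`, `V`, `Q` (hypothetical array names for θ_K x_t, θ_V x_t, θ_Q x_t); it illustrates the statement and is not a substitute for the proof.

```python
import numpy as np

def linear_attention(K, V, Q):
    """Unnormalized causal linear attention: z_t = (sum_{s<=t} v_s k_s^T) q_t."""
    T, d = K.shape
    S, Z = np.zeros((d, d)), np.zeros((T, d))
    for t in range(T):
        S = S + np.outer(V[t], K[t])
        Z[t] = S @ Q[t]
    return Z


def ttt_batch_gd(K, V, Q, eta=0.5):
    """TTT with f(x) = W x, W_0 = 0, and batch GD (every gradient taken at W_0)."""
    T, d = K.shape
    W0 = np.zeros((d, d))
    grad_sum, Z = np.zeros((d, d)), np.zeros((T, d))
    for t in range(T):
        # With W0 = 0 this gradient is -2 v_t k_t^T.
        grad_sum = grad_sum + 2.0 * np.outer(W0 @ K[t] - V[t], K[t])
        W_t = W0 - eta * grad_sum   # with eta = 1/2: sum of v_s k_s^T
        Z[t] = W_t @ Q[t]
    return Z


rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(10, 4)) for _ in range(3))
Z_la = linear_attention(K, V, Q)
Z_ttt = ttt_batch_gd(K, V, Q)
```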
Equivalence to self-attention (nonparametric learner).
Theorem 2 (§2.6) shows that if the learner is a Nadaraya–Watson estimator with an exponential kernel κ(x,x') ∝ exp((θ_K x)^T θ_Q x'), then the induced TTT layer corresponds to self-attention.
This reframes attention as a particular “learner” that stores all past data points and predicts via kernel-weighted averaging (§2.6; Appendix B elaborates Nadaraya–Watson).
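The nonparametric view can be illustrated the same way. The sketch below assumes pre-projected keys, values, and queries in arrays `K`, `V`, `Q` (hypothetical names), omits the usual 1/sqrt(d) scaling, and uses hypothetical function names; it shows that a Nadaraya–Watson estimate with an exponential kernel over the past tokens coincides with causal softmax attention.

```python
import numpy as np

def nadaraya_watson(K_past, V_past, q):
    """Kernel-weighted average of stored labels with weights exp(k_s . q)."""
    logits = K_past @ q
    w = np.exp(logits - logits.max())  # the shift cancels in the ratio
    return (w[:, None] * V_past).sum(axis=0) / w.sum()


def causal_softmax_attention(K, V, Q):
    T = K.shape[0]
    scores = Q @ K.T                                   # scores[t, s] = q_t . k_s
    causal = np.tril(np.ones((T, T), dtype=bool))      # token t sees s <= t
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=1, keepdims=True)
    return P @ V


rng = np.random.default_rng(0)
K, V, Q = (rng.normal(size=(10, 4)) for _ in range(3))
# The "learner" here stores all past (k_s, v_s) pairs instead of weights.
Z_nw = np.stack([nadaraya_watson(K[: t + 1], V[: t + 1], Q[t]) for t in range(10)])
Z_attn = causal_softmax_attention(K, V, Q)
```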

3.4.9 Final instantiations: TTT-Linear and TTT-MLP, plus stabilizers

Two proposed variants (§2.7).
TTT-Linear: the learner f_lin(x)=W x with square W.
TTT-MLP: the learner is a 2-layer MLP with hidden dimension 4× input dimension and GELU activation (§2.7).
Stability structure inside f.
They wrap learners with residual + LayerNorm:
f(x) = x + LN(f_res(x)) where f_res is linear or MLP (§2.7).
Table 1 shows adding “LN and residual in f” yields a large perplexity improvement (15.27 → 14.05, improvement −1.22) on their 125M ablation path.
Learnable initialization W_0.
They learn W_0 as an outer-loop parameter θ_init to improve training stability (§2.7).
Table 1 indicates it slightly hurts perplexity in isolation (15.23 → 15.27) but is needed for stable training of later improvements.
Learnable inner learning rate η.
They learn a token-dependent gate:
η(x) = η_base σ(θ_lr · x) with η_base=1 (TTT-Linear) and 0.1 (TTT-MLP) (§2.7).
Backbone choice.
Instead of directly swapping attention in a Transformer block, their strongest versions use a Mamba-style backbone with temporal convolutions and gating (§2.7, Figure 13).
They also report ablations using a Transformer backbone (Figures 10–11 show both “(M)” and “(T)” variants).
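The learner wrappers and the learning-rate gate described above can be sketched as plain functions. This is an illustrative NumPy sketch with several assumptions that are not from the paper: LayerNorm is applied without a learnable gain or bias, GELU uses the tanh approximation, and the function names are hypothetical.

```python
import numpy as np

def layer_norm(u, eps=1e-6):
    # Assumption: no learnable gain/bias, normalization over the feature axis.
    return (u - u.mean()) / np.sqrt(u.var() + eps)


def gelu(h):
    # Assumption: tanh approximation of GELU.
    return 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)))


def f_linear(x, W):
    """TTT-Linear learner: f(x) = x + LN(W x), with square W."""
    return x + layer_norm(W @ x)


def f_mlp(x, W1, W2):
    """TTT-MLP learner: f(x) = x + LN(W2 gelu(W1 x)), hidden size 4x input."""
    return x + layer_norm(W2 @ gelu(W1 @ x))


def gated_lr(x, theta_lr, eta_base):
    """Token-dependent inner learning rate eta(x) = eta_base * sigmoid(theta_lr . x)."""
    return eta_base / (1.0 + np.exp(-(theta_lr @ x)))


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W = rng.normal(scale=0.1, size=(d, d))
W1 = rng.normal(scale=0.1, size=(4 * d, d))
W2 = rng.normal(scale=0.1, size=(d, 4 * d))
theta_lr = rng.normal(scale=0.1, size=d)

y_lin = f_linear(x, W)
y_mlp = f_mlp(x, W1, W2)
eta_lin = gated_lr(x, theta_lr, eta_base=1.0)   # TTT-Linear base rate
eta_mlp = gated_lr(x, theta_lr, eta_base=0.1)   # TTT-MLP base rate
```

One consequence of the residual form is visible directly: with W = 0 the wrapped linear learner returns its input unchanged, so the learner starts near the identity rather than near zero.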

3.4.10 Worked micro-example (single-step intuition without full LM scale)

To make the mechanism concrete, consider one time step of a simplified TTT layer.

Let the token embedding be x_t ∈ R^d.
Choose projections θ_K, θ_V, θ_Q so that:
training input: x̂_t = θ_K x_t
training label: y_t = θ_V x_t
test input: x̄_t = θ_Q x_t
Let the learner be linear: f(u;W)=W u.
The self-supervised loss at time t is:
ℓ(W; x_t) = || W x̂_t - y_t ||^2 (Eq. (4) specialized).
A single inner-loop update (online GD form) is:
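W_t = W_{t-1} - η ∇_W || W x̂_t - y_t ||^2, with the gradient evaluated at W = W_{t-1}, which gives W_t = W_{t-1} - 2η (W_{t-1} x̂_t - y_t) x̂_t^T (Eq. (2)).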
The output for the sequence model at this time is:
z_t = W_t x̄_t (Eq. (5) specialized).
Intuition: if the current token has structure that helps predict y_t from x̂_t, the update will adjust W so that future tokens—processed with the updated W—can reuse that learned structure, acting like an adaptive memory.
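The single step above can be run with small concrete numbers. The values below are made up for illustration (d = 2, W_{t-1} = 0, η = 0.25) and the views are supplied directly as `x_hat`, `y`, and `x_bar` rather than computed from projections.

```python
import numpy as np

x_hat = np.array([1.0, 0.0])   # training input  (theta_K x_t)
y = np.array([0.0, 2.0])       # training label  (theta_V x_t)
x_bar = np.array([2.0, 1.0])   # test input      (theta_Q x_t)
W_prev = np.zeros((2, 2))      # W_{t-1}
eta = 0.25

residual = W_prev @ x_hat - y                        # [0, -2]
grad = 2.0 * np.outer(residual, x_hat)               # [[0, 0], [-4, 0]]
W_t = W_prev - eta * grad                            # [[0, 0], [1, 0]]

loss_before = float(residual @ residual)             # 4.0
loss_after = float(np.sum((W_t @ x_hat - y) ** 2))   # 1.0
z_t = W_t @ x_bar                                    # [0, 2]
```

The loss on the current token drops from 4 to 1 after one step, and the output z_t is produced with the already-updated weights W_t, matching the train-then-predict order described earlier.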

4. Key Insights and Innovations

(1) Hidden state as an explicit learner trained online (TTT layers).
Novelty: Instead of a vector/matrix state updated by a fixed recurrence, the hidden state is the weights of a model updated by gradient-based learning on self-supervised loss (Eq. (2), Figure 1).
Significance: This increases the expressive capacity of what an RNN-style layer can store, aiming to better compress long context (§1, §2.1).
(2) Learnable self-supervised task via multi-view reconstruction.
Novelty: The reconstruction task is not handcrafted; it is parameterized by θ_K, θ_V, θ_Q and optimized in the outer loop for next-token prediction (§2.3, Eq. (4)–(5), Table 2).
Significance: This aligns the inner-loop adaptation with the final LM objective, rather than relying on human-designed corruptions/tasks (§2.3).
(3) Mini-batch token updates to trade off parallelism and adaptation quality.
Novelty: The paper introduces mini-batch gradient computations across tokens inside the sequence to expose parallel computation while retaining multi-step adaptation across mini-batches (§2.4, Figure 6–7).
Significance: This is a practical bridge between purely online updates (sequential and slow) and full-batch updates (parallel, but every gradient is taken at W_0), with an empirically chosen compromise of b=16 (Figure 7).
(4) Dual form rewrite to make TTT matmul-friendly on accelerators.
Novelty: Within a token mini-batch, they derive an equivalent computation that avoids materializing per-token gradients and intermediate weights, using masked matmuls (Eq. (7)–(8), §2.5; Appendix A for MLPs).
Significance: The paper reports >5× speedup in JAX on TPU for training with the dual form compared to primal (§2.5).
(5) Unifying theoretical lens: TTT spans linear attention and (via learners) self-attention.
Novelty: Theorem 1 shows linear attention is a special case of TTT (and Table 1 verifies equal perplexity under an improved implementation). Theorem 2 shows self-attention corresponds to a nonparametric learner choice (§2.6).
Significance: This reframes common sequence layers as points in a broader design space, motivating “expressive hidden states” as “richer learners.”

5. Experimental Analysis

Evaluation methodology

Primary metric: perplexity (Ppl), evaluated following protocols aligned with the Mamba paper (§3; the Figure 2 caption references Kaplan-style evaluation).
Datasets:
The Pile for standard 2k and 8k context evaluations (§3, §3.1).
Books3 subset of the Pile for long-context evaluations from 1k to 32k in ×2 steps (§3, §3.2; Figures 11, 15, 16).
Baselines compared:
A strong Transformer baseline (“Transformer++” style, based on Llama architecture) (§3, Appendix C).
Mamba as a modern RNN baseline (§3).
Model scales: Four sizes are used: 125M, 350M, 760M, 1.3B parameters (§3 protocols).
Mamba sizes are slightly different: 130M, 370M, 790M, 1.4B (noted in §3 protocols).
Training setup (“Chinchilla recipe” as used in Mamba):
Optimizer: AdamW with β=(0.9, 0.95) (Appendix C).
LR schedule: cosine decay to 1e-5, with linear warmup for 10% of steps (Appendix C).
Weight decay 0.1, grad clip 1.0, no dropout, mixed precision (Appendix C).
Training tokens per model size (Table 3): 2.5B, 7B, 15B, 26B.
Peak LRs (Table 3): 3e-3, 1.5e-3, 1.25e-3, 1e-3.
Block counts / embed dims / heads (Table 3): e.g., 1.3B uses 24 blocks, d=2048, 32 heads.
Context/batch policy: They keep total tokens per training batch fixed at 0.5M tokens regardless of context length (§3.2 footnote 12; Appendix C reiterates).
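The reported optimizer settings can be collected into a small config together with a learning-rate schedule. This is a sketch under assumptions: the dictionary keys and function name are hypothetical, the exact warmup shape (linear from near zero) and cosine formula are one common reading of "cosine decay to 1e-5 with 10% linear warmup", and the step count is an illustrative round number (roughly 2.5B tokens divided by 0.5M tokens per batch), not a value taken from Table 3.

```python
import math

# Settings for the 125M model as summarized above (Appendix C, Table 3).
config_125m = {
    "optimizer": "AdamW",
    "betas": (0.9, 0.95),
    "weight_decay": 0.1,
    "grad_clip": 1.0,
    "dropout": 0.0,
    "peak_lr": 3e-3,
    "end_lr": 1e-5,
    "warmup_frac": 0.10,
    "tokens_per_batch": 500_000,
    "total_steps": 5_000,   # illustrative assumption, see lead-in
}


def lr_at(step, total_steps, peak_lr, end_lr=1e-5, warmup_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay down to end_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


total = config_125m["total_steps"]
peak = config_125m["peak_lr"]
lr_start = lr_at(0, total, peak)
lr_peak = lr_at(500, total, peak)    # first step after warmup
lr_end = lr_at(total, total, peak)   # end of training
```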

Main quantitative results (numbers grounded where available)

Long-context utilization diagnostic (token-index perplexity):
Figure 2 (right) shows perplexity vs token index up to 32k.
Reported qualitative outcome: TTT-Linear and TTT-MLP continue reducing perplexity as token index increases (similar to Transformer), while Mamba plateaus after 16k (Figure 2 right; §1).
Scaling trends on the Pile (2k and 8k contexts):
Figure 10 summarizes Ppl vs FLOPs for 2k and 8k.
Reported conclusions (§3.1):
- At 2k, TTT-Linear (M), Mamba, and Transformer are comparable (lines overlap).
- At 8k, both TTT-Linear (M) and TTT-MLP (M) outperform Mamba, and the gap widens with longer context (§3.1).
Long-context on Books (up to 32k):
Figure 11 (2k and 32k points) and Figure 15 (full set) show that at 32k, TTT-Linear (M) and TTT-MLP (M) outperform Mamba (§3.2).
Figure 16 shows an alternate view: for models trained from scratch, perplexity can worsen when context gets “too large,” and the best context length increases with model size; this trend is less present with Transformer finetuning (“TF finetune”) (§3.2, Figure 16 caption).
Ablation path from linear attention to TTT-Linear (concrete perplexities).
Table 1 reports, for 125M models trained on Pile with their recipe:
- Linear attn. improved: 15.23.
- TTT equivalence: 15.23 (diff 0).
- + learnable W_0: 15.27 (change +0.04).
- + LN and residual in f: 14.05 (improvement −1.22).
- + mini-batch TTT: 12.35 (improvement −1.70).
- + learnable η: 11.99 (improvement −0.36).
- + Mamba backbone: 11.09 (improvement −0.90), which they identify as the final TTT-Linear result used in Figure 10 (Table 1 caption).

Wall-clock / latency evaluation

Figure 12 reports inference latency on NVIDIA A100 80GB PCIe:
Prefill (forward) latency at batch size 16: per-token latency grows with context length for the Transformer, consistent with attention's quadratic total cost, while TTT-Linear, TTT-MLP, and Mamba stay roughly constant per token as context grows (Figure 12; §3.3).
Decode (generation) uses the primal form because it is sequential (§3.3); the figure shows roughly constant per-token behavior for non-Transformer methods across tested context lengths, with Transformer growing (Figure 12).
They also note TPU training iteration time on a v5e-256 pod at 2k context:
Transformer baseline: 0.30s/iter, TTT-Linear: 0.27s/iter (about 10% faster) without extra systems optimization (§3.3).
However, they do not provide full comparable TPU timing vs Mamba because Mamba implementation is GPU-focused (§3.3).

Do experiments support the claims?

Supportive evidence:
The long-context token-index perplexity diagnostic in Figure 2 right directly targets the claim that some RNNs fail to keep benefiting from additional context; TTT variants behave more like Transformer there.
Table 1 provides a grounded ablation chain showing which components matter most (mini-batch TTT and LN/residual inside f).
Multiple model sizes (125M to 1.3B) and two datasets (Pile, Books) provide breadth (§3).
Caveats evident in the paper’s own reporting:
The paper explicitly notes they do not see a clean scaling-law linear fit in their FLOPs–perplexity plots (Figures 10–11; §3.1), limiting extrapolation claims.
TTT-MLP has wall-clock challenges due to memory I/O despite favorable FLOPs comparisons (Figure 12; Abstract; §4.2).

6. Limitations and Trade-offs

Wall-clock efficiency, especially for more expressive learners (TTT-MLP).
The paper repeatedly emphasizes that while TTT-MLP may be “effective in terms of FLOPs,” its structure increases wall-clock time much more than FLOPs would suggest, attributing this to memory I/O (Abstract; §4.2; Figure 12).
Speed–quality trade-off controlled by token mini-batch size b.
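Mini-batching is necessary for parallelism, but within a mini-batch every gradient is taken at the weights from the start of that mini-batch, so W_t no longer influences the gradients of later tokens in the same mini-batch (§2.4).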
Figure 7 shows perplexity degrades as b increases, and they fix b=16 as a compromise (§2.4).
Training stability and reliance on particular stabilizers.
Learning W_0 (θ_init) is reported as crucial for stability even if it slightly hurts perplexity in isolation (Table 1 caption; §2.7).
The learner architecture requires LN + residual wrapper for stability and performance (Table 1; §2.7).
Backbone dependence / architectural confounding.
Their strongest TTT results use a Mamba-style backbone with temporal convolutions and gating (§2.7, Figure 13).
This means improvements are not purely attributable to the TTT mechanism alone; Table 1 explicitly shows a gain from “+ Mamba backbone” (11.99 → 11.09).
Long-context training-from-scratch difficulties.
On Books, Figure 16 indicates that for all methods trained from scratch, perplexity can worsen once context becomes too large, and the optimal context depends on model size (Figure 16 caption).
This suggests that simply increasing context length in training is not uniformly beneficial under the fixed recipe.
Evaluation scope constraints acknowledged by the paper.
No hybrid architectures (mixing attention and TTT) are explored, to keep baselines clean (§3 protocols).
They did not train at extremely long contexts (millions/billions of tokens of context), citing academic resource constraints (§5).
Systems implementation gaps.
Training is run on TPUs in JAX, but their GPU kernel work is limited to inference; they explicitly did not build a full training kernel for GPUs (§3.3).

7. Implications and Future Directions

How this work changes the landscape (as implied by the paper’s framework + results)
It proposes that the design space of sequence layers can be expanded from “choose a recurrence vs attention” to “choose a learner (model + optimizer + task) as memory,” with attention and linear attention appearing as special cases (Figure 8–9; Theorems 1–2).
Empirically, it suggests that long-context weaknesses observed in modern RNNs (their Mamba plateau result) may be mitigated by making the hidden state more expressive and adaptively learned during inference (Figure 2 right).
Follow-up research directions explicitly suggested
Richer outer-loop task parameterizations: explore other families of self-supervised tasks beyond linear multi-view projections (θ_K, θ_V, θ_Q) (§5).
Systems optimization: better kernels, pipeline parallelism through time, and multi-device processing for very long sequences (§5).
Longer contexts + larger models: explore regimes beyond 32k and potentially very long contexts, where the paper expects TTT advantages to grow (§5).
More ambitious learners f: larger inner models (possibly convolutional nets for video/agents) when context becomes extremely long (§5).
Multi-level nested learning: if f is itself attention, it can be seen as nesting additional inner loops (discussion in §5, connected to §2.6 Theorem 2).
Practical applications / downstream use cases (grounded in the paper’s framing)
Any setting needing efficient long-context conditioning where quadratic attention is costly, but where a fixed-size RNN state is too restrictive, could benefit if the wall-clock issues can be resolved (Abstract; §1; §4.2).
The paper hints at very long sequential domains like video streams and embodied agents as future targets, where online adaptation is natural (§5; related work in §4.1 references video-stream TTT motivation).
Repro/Integration Guidance (based on the paper’s reported choices)
If you want a linear-time layer whose wall-clock efficiency is already competitive, TTT-Linear is the more practical instantiation in this paper (Abstract; Figure 12 shows TTT-MLP heavier).
If your priority is long-context utilization and you can tolerate more systems complexity, TTT-MLP is positioned as having “larger potential in long context” but currently bottlenecked by memory I/O (Abstract; §3.3; §4.2).
For matching the paper’s training setup, use:
- AdamW(β=(0.9,0.95)), cosine decay to 1e-5, 10% warmup, weight decay 0.1, grad clip 1.0, no dropout, mixed precision (Appendix C),
- model-scale hyperparameters as in Table 3 (blocks, embed dim, heads, steps, peak LR, tokens),
- TTT token mini-batch size b=16 (Figure 7; §2.4),
- learner stabilizers: LN + residual inside f, learnable W_0, and learnable gated η(x) (§2.7; Table 1).
When comparing to Transformers in long context, the paper highlights that finetuning a short-context Transformer is a stronger real-world baseline than training from scratch at long context, and they include TF finetune following a described recipe (20% more tokens) for ≥4k contexts (§3.2; Appendix C).