
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

ArXiv: 2604.04921

Pitch

TriAttention solves the KV cache memory bottleneck in long-reasoning LLMs by discovering that pre-RoPE query and key vectors cluster around fixed centers—a property that enables stable importance estimation via trigonometric series rather than unreliable sliding attention windows. On AIME25 with 32K-token generation, it matches full attention accuracy while achieving 2.5× higher throughput or 10.7× memory reduction, enabling deployment of reasoning models on consumer GPUs that would otherwise run out of memory.


1. Executive Summary

This paper introduces TriAttention, a KV cache compression method that exploits a newly discovered phenomenon called Q/K concentration—where pre-RoPE query and key vectors cluster tightly around fixed non-zero centers—to predict token importance via trigonometric series rather than unreliable post-RoPE observation windows. On AIME25 with 32K-token generation using Qwen3-8B, TriAttention matches Full Attention accuracy (40.8%) while achieving 2.5× higher throughput or 10.7× KV memory reduction, whereas leading baselines like R-KV achieve only about half the accuracy at the same efficiency. The method enables deployment of long-context reasoning models on consumer GPUs that would otherwise run out of memory.

2. Context and Motivation

The Core Problem: KV Cache Memory Bottleneck in Long Reasoning

Large language models performing extended chain-of-thought reasoning generate sequences spanning tens of thousands of tokens. During autoregressive generation, the KV cache—storing key and value vectors for all previously generated tokens—grows proportionally with sequence length. For a 32K-token generation with typical model dimensions, this cache can consume tens of gigabytes of GPU memory, creating severe memory bottlenecks that limit batch size, throughput, and deployment feasibility on resource-constrained hardware.

The problem has intensified with reasoning models like DeepSeek-R1 and OpenAI's o1, which produce verbose reasoning traces. The paper explicitly targets mathematical reasoning benchmarks (AIME24, AIME25, MATH 500) where models generate long chain-of-thought sequences, making memory pressure acute.

Why Existing Approaches Fall Short

Prior KV cache compression methods estimate token importance by analyzing attention patterns from recent queries. The paper identifies a fundamental flaw: these methods operate in post-RoPE space, where queries have been rotated by positional encodings.

The rotation problem. Rotary Position Embedding (RoPE) applies position-dependent rotations to query and key vectors. A query at position 1000 has a different orientation than a query at position 1001. This means only the most recent queries have "up-to-date" orientations aligned with the current decoding position. Prior work (Zhang et al., 2025) found that performance peaks at approximately 25 recent queries—a tiny observation window for contexts spanning thousands of tokens.

Why this window matters for reasoning. In chain-of-thought reasoning, certain tokens may remain dormant for hundreds or thousands of steps before becoming critical. If a token receives low attention during the short observation window, it gets evicted permanently—even if it becomes essential later. The paper cites retrieval heads (Wu et al., 2025; Xiao et al., 2025) as particularly vulnerable: relevant tokens can remain unattended for long periods before being retrieved.

Post-RoPE norm-based methods have complementary limitations. Methods like VATP (Guo et al., 2024) incorporate value vector norms to complement attention scores, but still operate post-RoPE where directional information is entangled with positional rotation—making it difficult to exploit Q/K angular relationships that determine attention.

How This Paper Positions Itself

The paper makes a decisive architectural choice: move importance estimation to pre-RoPE space, before positional rotations are applied. The key observation is that pre-RoPE Q and K vectors exhibit remarkable concentration around fixed centers that remain stable across positions and contexts. This stability enables predictable distance preferences through trigonometric series—allowing importance estimation without relying on short observation windows.

The paper positions itself as addressing the root cause of instability (positional rotation) rather than working around it. Figure 2(A-B) visually contrasts pre-RoPE concentration with post-RoPE dispersion, making the case that post-RoPE methods are fighting against an inherent structural problem.

3. Technical Approach

3.1 Reader Orientation

TriAttention is a KV cache compression system that estimates which cached tokens will receive high attention from future queries, retaining only the top-scoring tokens under a fixed memory budget. The solution exploits a newly discovered structural property of transformer attention heads—Q/K concentration in pre-RoPE space—to predict attention patterns via trigonometric series, eliminating dependence on short observation windows.

3.2 Big-Picture Architecture (Diagram in Words)

The system has four major components:

  1. Offline Calibration Module — collects pre-RoPE Q/K statistics from a calibration dataset, computing per-head concentration metrics (Mean Resultant Length R), Q centers E[qf], and expected norms E[∥qf∥].

  2. Trigonometric Series Score (Strig) — uses Q centers and the RoPE frequency structure to compute a distance-dependent attention prediction for each key, based on its position relative to potential future queries.

  3. Norm-Based Score (Snorm) — weights keys by their vector norms and expected query norms, with adaptive weighting inversely proportional to concentration strength.

  4. KV Cache Pruning Engine — combines scores from both components, aggregates across multiple query heads (handling GQA), and triggers batched eviction every 128 tokens to minimize overhead.

Information flows: calibration data → pre-RoPE Q/K statistics → combined scoring function → periodic pruning during generation → compressed KV cache.

3.3 Roadmap for the Deep Dive

  • First, the mathematical foundation of RoPE and why it creates the rotation problem.
  • Second, the Q/K concentration phenomenon—what it is, how to measure it, and why it matters.
  • Third, the derivation of the trigonometric series from Q/K concentration.
  • Fourth, the complete scoring function combining trigonometric and norm-based components.
  • Fifth, the adaptive weighting mechanism using Mean Resultant Length.
  • Sixth, practical considerations: window-based pruning, GQA handling, and calibration.

3.4 Detailed, Sentence-Based Technical Breakdown

This is primarily a methods paper whose core technical contribution is discovering Q/K concentration and leveraging it to design a fundamentally different importance scoring mechanism.


Rotary Position Embedding (RoPE) and the Rotation Problem

RoPE encodes positional information as rotations in vector space. For a d-dimensional vector, RoPE divides it into d/2 two-dimensional subspaces (frequency bands), each indexed by f ∈ {0, ..., d/2-1}. Frequency band f rotates at rate ωf = θ^(-2f/d) where θ = 10000 typically.

For a vector in frequency band f at position p, RoPE applies a 2D rotation matrix. In complex notation, this simplifies to:

\[\tilde{q}_f(p) = q_f \cdot e^{i\omega_f p}\]

where qf ∈ C is the pre-RoPE component and the exponential term represents rotation by angle ωf·p. This rotation is applied independently to each frequency band.

The attention logit between a query at position pq and a key at position pk is:

\[\langle q, k \rangle_\Delta = \sum_f \|q_f\| \|k_f\| \cos(\omega_f \Delta + \phi_f)\]

where Δ = pq - pk is the Q-K distance and φf = arg(qf) - arg(kf) is the phase difference between Q and K in band f.
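The identity above can be checked numerically. The following is a minimal sketch, assuming each frequency band is stored as one complex number; the helper names (`rope_frequencies`, `apply_rope`, `logit_direct`, `logit_from_distance`) are illustrative and do not come from the paper.

```python
import numpy as np

def rope_frequencies(d, theta=10000.0):
    """Rotation rate of each of the d/2 frequency bands: theta^(-2f/d)."""
    f = np.arange(d // 2)
    return theta ** (-2.0 * f / d)

def apply_rope(x, pos, omega):
    """Rotate every complex band of x by the angle omega_f * pos."""
    return x * np.exp(1j * omega * pos)

def logit_direct(q, k, p_q, p_k, omega):
    """Real inner product of the rotated (post-RoPE) query and key."""
    q_rot = apply_rope(q, p_q, omega)
    k_rot = apply_rope(k, p_k, omega)
    return float(np.sum((q_rot * np.conj(k_rot)).real))

def logit_from_distance(q, k, delta, omega):
    """Same logit written as a sum of cosines in the Q-K distance delta."""
    phi = np.angle(q) - np.angle(k)
    return float(np.sum(np.abs(q) * np.abs(k) * np.cos(omega * delta + phi)))

rng = np.random.default_rng(0)
d = 8
omega = rope_frequencies(d)
q = rng.normal(size=d // 2) + 1j * rng.normal(size=d // 2)
k = rng.normal(size=d // 2) + 1j * rng.normal(size=d // 2)

a = logit_direct(q, k, 1000, 990, omega)    # distance 10
b = logit_from_distance(q, k, 10, omega)    # cosine form, distance 10
c = logit_direct(q, k, 5010, 5000, omega)   # distance 10 at other positions
```

The three values agree up to floating-point error, which illustrates that the logit depends on the two positions only through Δ.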

The rotation problem becomes clear: after RoPE, a vector's direction encodes its position. A query at position 1000 points in a different direction than position 1001, so attention patterns from past queries don't predict patterns from future queries.


The Q/K Concentration Phenomenon

The paper's key discovery: before RoPE rotation, Q and K vectors across most attention heads cluster tightly around fixed non-zero centers. This is illustrated in Figure 2(A), where pre-RoPE vectors from three different input sequences overlay on nearly identical arc patterns.

To quantify concentration, the paper uses Mean Resultant Length (R) from directional statistics:

\[R = \frac{\|E[q]\|}{E[\|q\|]}\]

where E[q] is the expected (mean) vector and E[∥q∥] is the expected norm. When R approaches 1, vectors cluster tightly around their mean direction; when R approaches 0, vectors disperse uniformly.
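As a rough illustration of the metric, the sketch below computes R for two synthetic point clouds. The data and the function name `mean_resultant_length` are assumptions made for the example, not the paper's calibration code.

```python
import numpy as np

def mean_resultant_length(vectors):
    """R = ||E[q]|| / E[||q||] for an array of shape (num_vectors, dim)."""
    vectors = np.asarray(vectors, dtype=float)
    mean_vec = vectors.mean(axis=0)
    mean_norm = np.linalg.norm(vectors, axis=1).mean()
    return float(np.linalg.norm(mean_vec) / mean_norm)

rng = np.random.default_rng(0)

# Vectors clustered tightly around a fixed non-zero center.
center = np.array([3.0, 4.0])
tight = center + 0.1 * rng.normal(size=(10000, 2))

# Vectors scattered around the origin with no preferred direction.
spread = rng.normal(size=(10000, 2))

r_tight = mean_resultant_length(tight)
r_spread = mean_resultant_length(spread)
```

The clustered cloud gives R close to 1 and the scattered cloud gives R close to 0, matching the two limiting cases described above.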

Figure 2(C) shows that across all 1152 heads in Qwen3-8B, the vast majority exhibit R values approaching 1.0. The paper reports that on Qwen3-8B, ~90% of heads exhibit R > 0.95 across Math, Coding, and Chat domains (values 0.977–0.980).

Why this matters: concentration is position-independent. Unlike post-RoPE vectors that rotate with position, pre-RoPE vectors have stable orientations determined by content, not position. This stability enables prediction.


From Concentration to Trigonometric Series

When Q/K vectors are concentrated, we can approximate them by their centers: qf ≈ E[qf] and kf ≈ E[kf]. Substituting these constants into the attention formula:

\[\text{logit}(\Delta) \approx \sum_f \|E[q_f]\| \|E[k_f]\| \cos(\omega_f \Delta + \bar{\phi}_f)\]

Using angle addition formulas, this becomes:

\[\text{logit}(\Delta) = \sum_f [a_f \cos(\omega_f \Delta) + b_f \sin(\omega_f \Delta)]\]

where coefficients af and bf are constants determined by the Q/K centers. This is a trigonometric series in Q-K distance Δ.
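For completeness, the angle-addition step can be written out. This is a sketch that assumes the center phase difference is defined as \(\bar{\phi}_f = \arg(E[q_f]) - \arg(E[k_f])\), by analogy with the per-vector definition of φf given earlier:

\[\cos(\omega_f \Delta + \bar{\phi}_f) = \cos(\bar{\phi}_f)\cos(\omega_f \Delta) - \sin(\bar{\phi}_f)\sin(\omega_f \Delta)\]

Matching terms with the series gives

\[a_f = \|E[q_f]\| \|E[k_f]\| \cos(\bar{\phi}_f), \qquad b_f = -\|E[q_f]\| \|E[k_f]\| \sin(\bar{\phi}_f)\]

so both coefficients depend only on the Q/K centers and not on Δ.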

The crucial insight: different centers produce different distance preference curves. Some peak at small distances (local attention), others at large distances (attention sinks). The centers encode this preference without needing to observe actual attention.

Experimental validation. The paper defines Reconstruction Correlation as the mean Pearson correlation between predicted logits (from trigonometric series) and actual attention logits. For each query, correlation is computed at logarithmically-spaced distances {1, 2, 4, 8, ...} to ensure balanced coverage.
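One way such a metric could be computed is sketched below on synthetic data. The array layout (one pre-RoPE key per query per sampled distance), the noise level, and all function names are assumptions made for illustration; the paper's exact evaluation procedure may differ.

```python
import numpy as np

def predicted_logits(q_center, k_center, omega, distances):
    """Trigonometric-series prediction from the Q and K centers."""
    amp = np.abs(q_center) * np.abs(k_center)
    phi = np.angle(q_center) - np.angle(k_center)
    return np.sum(amp * np.cos(np.outer(distances, omega) + phi), axis=1)

def actual_logits(q, keys, omega, distances):
    """True logits; keys[i] is the pre-RoPE key distances[i] tokens back."""
    rot = np.exp(1j * np.outer(distances, omega))
    return np.sum((q * np.conj(keys) * rot).real, axis=1)

def reconstruction_correlation(queries, keys, q_center, k_center, omega, distances):
    """Mean Pearson r between predicted and actual logits over all queries."""
    pred = predicted_logits(q_center, k_center, omega, distances)
    rs = [np.corrcoef(pred, actual_logits(q, ks, omega, distances))[0, 1]
          for q, ks in zip(queries, keys)]
    return float(np.mean(rs))

rng = np.random.default_rng(0)
bands = 32
omega = 10000.0 ** (-np.arange(bands) / bands)   # theta^(-2f/d) with d = 2 * bands
distances = 2 ** np.arange(11)                   # 1, 2, 4, ..., 1024

def complex_noise(shape):
    return rng.normal(size=shape) + 1j * rng.normal(size=shape)

# Synthetic, highly concentrated head: every vector is its center plus small noise.
q_center, k_center = complex_noise(bands), complex_noise(bands)
queries = q_center + 0.01 * complex_noise((20, bands))
keys = k_center + 0.01 * complex_noise((20, len(distances), bands))

r_bar = reconstruction_correlation(queries, keys, q_center, k_center, omega, distances)
```

With near-perfect concentration the correlation is close to 1. Real heads have more variance, which is consistent with the lower values the paper reports.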

Figure 2(D) shows an example head (layer 1, head 1, "chosen to avoid cherry-picking") achieving r̄ = 0.72. Figure 3 shows distributions across Qwen3, Qwen2.5, and Llama3, with means above 0.5 and peaks in the 0.6–0.9 range.


TriAttention Scoring Function

The core technical contribution is a scoring function that estimates key importance without requiring future queries.

Trigonometric Series Score (Strig). For a key k at position pk and a future query at position pq, the score is:

\[S_{\text{trig}}(k, \Delta) = \sum_f \|E[q_f]\| \cdot \|k_f\| \cdot \cos(\omega_f \Delta + \phi_f)\]

where Δ = pq - pk and φf = arg(E[qf]) - arg(kf). The Q center E[qf] comes from calibration; the key's representation kf is known from the cached KV.

Key insight: this substitutes the unknown future query with the known Q center—valid because of Q concentration.
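A minimal vectorized sketch of this score follows, assuming keys are held in pre-RoPE complex band form. The function name `s_trig` and the array shapes are illustrative choices, not the paper's implementation.

```python
import numpy as np

def s_trig(q_center, keys, delta, omega):
    """Trigonometric score for n cached keys.

    q_center: (bands,) complex Q center from calibration
    keys:     (n, bands) complex pre-RoPE keys
    delta:    (n,) distance from each key to the assumed query position
    omega:    (bands,) RoPE rotation rates
    """
    phi = np.angle(q_center) - np.angle(keys)     # (n, bands)
    phase = np.outer(delta, omega) + phi          # (n, bands)
    return np.sum(np.abs(q_center) * np.abs(keys) * np.cos(phase), axis=1)

rng = np.random.default_rng(0)
bands = 4
omega = 10000.0 ** (-np.arange(bands) / bands)
q_center = rng.normal(size=bands) + 1j * rng.normal(size=bands)
keys = rng.normal(size=(5, bands)) + 1j * rng.normal(size=(5, bands))
key_pos = np.arange(5)

# Score every key against a hypothetical query at position 10.
scores = s_trig(q_center, keys, 10 - key_pos, omega)
```

Each score equals the logit that a query sitting exactly at the Q center would produce at that distance, so no real future query is needed.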

Norm-Based Score (Snorm). The trigonometric series assumes perfect concentration; real distributions have variance. To account for this:

\[S_{\text{norm}}(k) = \sum_f (1 - R_f) \cdot E[\|q_f\|] \cdot \|k_f\|\]

where Rf is the concentration in frequency band f. When Rf is high (concentrated), the (1-Rf) term is small, so Snorm contributes little. When Rf is low, the full norm contribution is preserved.

Because Rf = ∥E[qf]∥ / E[∥qf∥], the weight (1 - Rf) · E[∥qf∥] equals E[∥qf∥] - ∥E[qf]∥, so this can be rewritten as:

\[S_{\text{norm}}(k) = \sum_f (E[\|q_f\|] - \|E[q_f]\|) \cdot \|k_f\|\]
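The equivalence of the two forms can be confirmed numerically. The sketch below uses synthetic per-band statistics; `band_statistics`, `s_norm_weighted`, and `s_norm_gap` are hypothetical helper names introduced for this example.

```python
import numpy as np

def band_statistics(queries):
    """Per-band Q center, expected norm, and concentration R_f.

    queries: (num_samples, bands) complex pre-RoPE queries.
    """
    center = queries.mean(axis=0)
    expected_norm = np.abs(queries).mean(axis=0)
    r = np.abs(center) / expected_norm
    return center, expected_norm, r

def s_norm_weighted(keys, expected_norm, r):
    """First form: sum_f (1 - R_f) * E[||q_f||] * ||k_f||."""
    return np.sum((1.0 - r) * expected_norm * np.abs(keys), axis=1)

def s_norm_gap(keys, expected_norm, center):
    """Second form: sum_f (E[||q_f||] - ||E[q_f]||) * ||k_f||."""
    return np.sum((expected_norm - np.abs(center)) * np.abs(keys), axis=1)

rng = np.random.default_rng(0)
bands = 4
queries = (2.0 + 1.0j) + 0.5 * (rng.normal(size=(1000, bands))
                                + 1j * rng.normal(size=(1000, bands)))
keys = rng.normal(size=(6, bands)) + 1j * rng.normal(size=(6, bands))

center, expected_norm, r = band_statistics(queries)
form_a = s_norm_weighted(keys, expected_norm, r)
form_b = s_norm_gap(keys, expected_norm, center)
```

Both forms give the same non-negative scores, since ∥E[qf]∥ can never exceed E[∥qf∥].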

Combined Score. The final score is:

\[S(k, \Delta) = S_{\text{trig}}(k, \Delta) + S_{\text{norm}}(k)\]

Since a key may be queried from any future position, importance is averaged over multiple offsets:

\[\tilde{S}(k) = \frac{1}{|D|} \sum_{\delta \in D} S(k, \Delta + \delta)\]

where D = {1, 2, 4, ..., 2^16} (logarithmically spaced future offsets). Ablation studies (Table E) show geometric spacing dramatically outperforms linear (45.8% vs. 28.7%), and increasing max distance from 128 to 4096 improves accuracy from 41.7% to 48.8%.
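Putting the pieces together, the offset-averaged score might look like the sketch below. The offset grid is passed in as a parameter and kept deliberately small in the demo; the function name `averaged_score` and the tensor layout are assumptions, not the paper's code.

```python
import numpy as np

def averaged_score(q_center, expected_norm, keys, key_pos, cur_pos, omega, offsets):
    """Combined score S = S_trig + S_norm, averaged over future offsets.

    q_center, expected_norm: (bands,) calibration statistics
    keys: (n, bands) complex pre-RoPE keys at integer positions key_pos (n,)
    cur_pos: current decoding position; offsets: (m,) future offsets
    """
    amp = np.abs(q_center) * np.abs(keys)                       # (n, bands)
    phi = np.angle(q_center) - np.angle(keys)                   # (n, bands)
    dist = (cur_pos - key_pos)[:, None] + offsets[None, :]      # (n, m)
    phase = dist[:, :, None] * omega[None, None, :] + phi[:, None, :]
    trig = np.sum(amp[:, None, :] * np.cos(phase), axis=2)      # (n, m)
    r = np.abs(q_center) / expected_norm                        # per-band R_f
    norm = np.sum((1.0 - r) * expected_norm * np.abs(keys), axis=1)
    # S_norm does not depend on the offset, so it is added after averaging.
    return trig.mean(axis=1) + norm

rng = np.random.default_rng(0)
bands = 4
omega = 10000.0 ** (-np.arange(bands) / bands)
queries = (2.0 + 1.0j) + 0.3 * (rng.normal(size=(500, bands))
                                + 1j * rng.normal(size=(500, bands)))
q_center = queries.mean(axis=0)
expected_norm = np.abs(queries).mean(axis=0)
keys = rng.normal(size=(6, bands)) + 1j * rng.normal(size=(6, bands))
key_pos = np.arange(6)
offsets = 2 ** np.arange(5)    # small geometric grid for the demo: 1, 2, 4, 8, 16

scores = averaged_score(q_center, expected_norm, keys, key_pos, 6, omega, offsets)
```

A geometric grid keeps the number of evaluated offsets logarithmic in the maximum look-ahead distance, so extending the range adds only a few extra cosine evaluations per key.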


Handling Grouped-Query Attention (GQA)

In GQA, each KV head is shared by G query heads. Each query head has different statistics, so a key receives G different scores operating at different scales.

The paper uses normalize-then-aggregate:

\[\hat{S}^{(g)}(k) = \frac{\tilde{S}^{(g)}(k) - \mu_g}{\sigma_g}\]

where μg and σg are the mean and standard deviation of scores computed from query head g. Scores are then aggregated via maximum:

\[S_{\text{final}}(k) = \max_{g \in \{0, ..., G-1\}} \hat{S}^{(g)}(k)\]

This means a key is retained if any query head deems it important.
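The normalize-then-aggregate step can be sketched in a few lines. The small `eps` guard against a zero standard deviation is an addition made for this example and is not described in the paper.

```python
import numpy as np

def aggregate_gqa(scores, eps=1e-8):
    """Z-score each query head's scores, then take the max over heads.

    scores: (G, n) array, one row of key scores per query head in the group.
    eps:    illustrative guard against division by zero (an assumption).
    """
    mu = scores.mean(axis=1, keepdims=True)
    sigma = scores.std(axis=1, keepdims=True)
    z = (scores - mu) / (sigma + eps)
    return z.max(axis=0)

# Two query heads scoring the same three keys on very different scales.
scores = np.array([[1.0, 2.0, 3.0],
                   [100.0, 300.0, 200.0]])
final = aggregate_gqa(scores)
```

After normalization both heads contribute on the same scale: the second key is kept because the second head favors it, and the third because the first head does.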


Window-Based Pruning

Scoring all keys at every decoding step is expensive. TriAttention triggers pruning every β = 128 generated tokens: when the 128th token of each interval is generated, if the cache exceeds budget B, all keys are scored and the lowest-scoring keys are evicted.

This matches R-KV's protocol, enabling fair comparison.
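The trigger logic can be simulated with a toy loop. In this sketch the cache is just an array of token positions, scores are random placeholders, and pruning trims the cache back to exactly the budget; all three are simplifying assumptions, and `maybe_prune` is a hypothetical name.

```python
import numpy as np

def maybe_prune(cache_positions, score_fn, step, budget, interval=128):
    """Evict the lowest-scoring entries at interval boundaries if over budget."""
    if step % interval != 0 or len(cache_positions) <= budget:
        return cache_positions
    scores = score_fn(cache_positions)
    keep = np.argsort(scores)[-budget:]      # indices of the top-`budget` scores
    return cache_positions[np.sort(keep)]    # preserve positional order

rng = np.random.default_rng(0)
budget, interval = 256, 128
cache = np.empty(0, dtype=np.int64)
max_size = 0

for step in range(1, 1025):
    cache = np.append(cache, step - 1)       # newly generated token joins the cache
    cache = maybe_prune(cache, lambda pos: rng.random(len(pos)), step, budget, interval)
    max_size = max(max_size, len(cache))
```

Under these assumptions the cache never holds more than budget + interval - 1 entries between pruning steps, which is the memory headroom that batched eviction trades for lower scoring overhead.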


Calibration and Implementation

Calibration requires computing per-head statistics from a representative dataset. The paper tests robustness (Table F) and finds:

  • Data quantity: Performance stable from 50k to 960k tokens (45.4–45.8%).
  • Data quality: Google homepage HTML achieves 46.2%, comparable to ShareGPT chat (46.7%).
  • Cross-domain generalization: Calibration on coding data tested on reasoning achieves comparable accuracy (Table 3C: 44.2% vs. 42.1%).

This confirms Q/K statistics are model-intrinsic properties, not task-specific.

Default implementation uses:

  • NVIDIA A100 80GB GPUs with bfloat16 precision
  • FlashAttention-2 (Dao, 2024) for standard models; FlashAttention-3 for Hopper-architecture models
  • Maximum generation length: 32,768 tokens
  • Temperature 0.6, top-p 0.95
  • Default KV budget: 2048 tokens (512 for shorter-response benchmarks)


Summary of Design Choices and Justifications

  • Pre-RoPE over post-RoPE: Avoids position-dependent rotation, enabling stable importance estimates.
  • Trigonometric series over observation windows: Predicts attention without needing queries from the target position.
  • Concentration-weighted norm score: Adaptively reduces norm contribution when concentration makes the trigonometric series accurate.
  • Geometric spacing of future offsets: Denser sampling at near distances where attention patterns vary rapidly.
  • Max aggregation for GQA: Conservative retention—if any head needs a token, keep it.
  • Batched pruning every 128 tokens: Balances scoring overhead with cache freshness.

4. Key Insights and Innovations

Innovation 1: Discovery of Q/K Concentration as a Structural Property

The most significant contribution is discovering that pre-RoPE Q and K vectors cluster around fixed non-zero centers—a phenomenon the paper terms Q/K concentration. This is not a minor observation but a fundamental structural property of transformer attention heads.

What makes this genuinely novel: prior work treated Q/K vectors as varying with content and position. The paper demonstrates that at the pre-RoPE level, vectors cluster tightly (R > 0.95 for ~90% of heads), and this clustering is stable across positions, contexts, and even domains. Figure 2(C) shows this visually; Table F and the cross-domain experiments (Table 3C) confirm it quantitatively.

The significance is twofold: (1) it explains why post-RoPE methods struggle—rotation disperses concentrated clusters into arc patterns (Figure 2(B))—and (2) it provides a stable statistical basis for prediction. The trigonometric series is only accurate because concentration exists.

Innovation 2: Trigonometric Series as a Predictive Mechanism

The derivation connecting Q/K concentration to trigonometric series is a fundamental insight. When Q and K are concentrated, attention reduces to a function of position alone:

\[\text{logit}(\Delta) = \sum_f [a_f \cos(\omega_f \Delta) + b_f \sin(\omega_f \Delta)]\]

This transforms the importance estimation problem from "observe what queries attend to" (which requires queries at the right positions) to "compute distance preferences from known statistics" (which requires only Q centers and key positions).

The experimental validation is compelling: Figure 2(D) shows tight alignment between predicted and actual attention; Figure 3 shows reconstruction correlations peaking at 0.6–0.9 across architectures. The correlation is not perfect—which justifies the norm-based complement—but it captures the dominant distance-dependent structure.

Innovation 3: Adaptive Weighting via Mean Resultant Length

The adaptive weighting mechanism—using concentration (Rf) to balance trigonometric series and norm-based scores—is a principled solution to a subtle problem. The trigonometric series assumes perfect concentration; real heads have variance.

The weighting formula Snorm(k) = Σf (1 - Rf) · E[∥qf∥] · ∥kf∥ elegantly encodes this: when Rf → 1, the norm term vanishes; when Rf → 0, it contributes fully. This is a principled fusion rather than an ad-hoc combination.

Table 3B validates this: removing the R-weighting ("w/o R") drops AIME25 accuracy from 32.9% to 28.7%, confirming that adaptive weighting matters.

Innovation 4: Cross-Architecture and Cross-Domain Generalization

The paper demonstrates that Q/K concentration generalizes beyond specific architectures. Table G shows Multi-head Latent Attention (MLA) in GLM-4.7 exhibits even stronger concentration (96.6% of heads with R > 0.95 vs. 84.7% for GQA). This suggests the phenomenon is architectural rather than model-specific.

Similarly, calibration on coding data transfers to reasoning (Table 3C), and calibration data quality shows minimal impact (Table F). This transforms TriAttention from a method requiring careful tuning to one that works "out of the box" across domains.

5. Experimental Analysis

Evaluation Methodology

Models. Four reasoning-capable LLMs: Qwen3-8B (Qwen Team, 2025), DeepSeek-R1-Distill-Llama-8B, DeepSeek-R1-Distill-Qwen-7B (DeepSeek-AI, 2025), and GPT-OSS-20B (OpenAI, 2025). These span different architectures and scales.

Datasets.

  • AIME 2024: 30 problems, competition-level mathematics
  • AIME 2025: 30 problems, competition-level mathematics
  • MATH 500: 500 problems, diverse mathematical reasoning
  • LongBench: 16 subtasks (QA, summarization, retrieval, code)
  • RULER: retrieval tasks with 4K context
  • Recursive State Query benchmark: custom DFS simulation task

Baselines.

  • Full Attention: no pruning (upper bound)
  • SnapKV (Li et al., 2024b): attention-based selection from local window
  • R-KV (Cai et al., 2025): recent method for reasoning models
  • Additional baselines in appendix: H2O, TOVA, RaaS, LazyEviction, StreamingLLM, PyramidKV, KnormPress, Ada-KV+SnapKV

Metrics. Pass rate for mathematical benchmarks (8 samples per problem for AIME, 1 sample for MATH 500). Throughput measured as tokens/second averaged over 16K decoding length at maximum batch size on A100 80GB.

KV Budget. Default 2048 tokens; 512 for shorter-response benchmarks (DS-Llama, MATH 500).


Main Quantitative Results

AIME and MATH 500 Results (Tables 1-2)

On AIME25 with Qwen3-8B:

  • Full Attention: 40.8%
  • TriAttention: 32.9% (at 2048 budget)
  • R-KV: 17.5%
  • SnapKV: 20.0%

TriAttention nearly doubles R-KV's accuracy (32.9% vs. 17.5%), a 15.4 percentage point gap.

On AIME24 with Qwen3-8B:

  • Full Attention: 57.1%
  • TriAttention: 42.1%
  • R-KV: 25.4%
  • SnapKV: 34.6%

On MATH 500 with Qwen3-8B (512 budget):

  • Full Attention: 69.6%
  • TriAttention: 56.0%
  • R-KV: 46.4%
  • SnapKV: 49.2%

At budget 1024 (Figure 5), TriAttention matches Full Attention on MATH 500 (68.4% vs. 69.6%).


Efficiency Results (Tables 4-5, Figure 1)

At equivalent accuracy to Full Attention on AIME25 (40.8%):

  • TriAttention achieves 2.5× higher throughput (563.5 vs. 222.8 tokens/s)
  • TriAttention achieves 10.7× KV memory reduction

On MATH 500 at comparable accuracy (68.4% vs. 69.6%), TriAttention achieves 6.3× throughput (1,405.2 vs. 222.8 tokens/s).

At comparable accuracy (Table 5), TriAttention requires half the KV budget of R-KV (1024 vs. 2048) while achieving 85% higher throughput (1,405 vs. 760 tokens/s).


Memory Retention: Recursive State Query Benchmark (Figure 5D)

This custom benchmark tests whether models retain intermediate states during depth-first search simulation. With DFS depth 6-20 on Qwen3-8B (KV budget 2048):

  • Full Attention maintains ~100% accuracy up to depth 16
  • TriAttention performs comparably to Full Attention up to depth 16, slightly outperforming at depths 8 and 12
  • R-KV shows catastrophic degradation at depth 16: drops from ~61% (depth 14) to ~31% (depth 16)

This validates that TriAttention preserves essential information for backtracking, while R-KV's short observation window causes it to discard critical intermediate states.


Ablation Studies (Tables 3, E, F)

Scoring components (Table 3A):

  • Removing Strig (norm-only): 18.8% on AIME24 vs. 42.1% full method
  • Removing Snorm (trig-only): 40.4% on AIME24 vs. 45.8% full method (−5.4%)

Both components contribute, but trigonometric series is essential.

Adaptive weighting (Table 3B): removing R-weighting drops AIME25 accuracy from 32.9% to 28.7%.

Future offset design (Table E):

  • Geometric vs. linear spacing: 45.8% vs. 28.7% (−17.1%)
  • Max distance 128 vs. 4096: 41.7% vs. 48.8% (+7.1%)

Geometric spacing with long range is critical.

Calibration sensitivity (Table F):

  • Data quantity (50k-960k tokens): 45.4-45.8% (stable)
  • Data quality (HTML vs. Chat): 46.2% vs. 46.7% (comparable)


Cross-Architecture Validation (Table G)

Comparing Qwen3-8B (GQA) vs. GLM-4.7-Flash (MLA):

Reconstruction correlation:

  • r > 0.70: GQA 13.0%, MLA 23.1%
  • r > 0.50: GQA 53.5%, MLA 51.6%

Concentration (MRL):

  • R > 0.95: GQA 84.7%, MLA 96.6%
  • R > 0.90: GQA 90.8%, MLA 99.8%

MLA shows stronger concentration, confirming the phenomenon generalizes across architectures.


Real-World Deployment (Figure C, Appendix J)

OpenClaw multi-turn agent with Qwen3-32B (INT4) on single RTX 4090 (24GB):

  • Full Attention: Out of memory before task completion
  • TriAttention: Successfully completes task within memory budget

This demonstrates practical deployment benefit: enabling long-context reasoning on consumer GPUs.


Assessment: Do the Experiments Support the Claims?

Claim 1: TriAttention matches Full Attention accuracy with significant compression. Strongly supported. On MATH 500, TriAttention at budget 1024 achieves 68.4% vs. Full Attention's 69.6%. On AIME25, TriAttention matches Full Attention at budget 4096 (43.3% vs. 40.8%).

Claim 2: TriAttention substantially outperforms baselines. Strongly supported. TriAttention nearly doubles R-KV's accuracy on AIME benchmarks and consistently outperforms across all budgets (Figure 5).

Claim 3: Q/K concentration is prevalent and enables reconstruction. Supported. Figure 2(C) shows high R across heads; Figure 3 shows reconstruction correlations peaking at 0.6-0.9. Cross-domain and cross-architecture experiments (Tables 3C, G) confirm generalization.

Potential weaknesses:

  • Single-model-family concentration analysis: While the paper tests multiple architectures, the detailed concentration analysis (Figure 2) focuses on Qwen3-8B. MLA validation (Table G) helps but is less detailed.

  • Reconstruction correlation ceiling: Mean r ~0.5-0.7 means the trigonometric series captures dominant but not complete attention structure. The norm-based complement helps, but residual unmodeled variance remains.

  • Throughput measurement protocol: Measured at maximum batch size on A100 80GB. Real-world deployment on smaller GPUs may yield different speedup ratios.

  • Reasoning task focus: Mathematical reasoning benchmarks dominate evaluation. LongBench and RULER results (Tables B-D) show gains but less dramatic than AIME. The method may be particularly suited to reasoning patterns.


Comparison with Additional Baselines (Table A)

On AIME24 with DeepSeek-R1-Distill-Qwen-7B at varying budgets:

  • TriAttention at 10% budget: 40.0% (matches LazyEviction at 20%)
  • TriAttention at 30% budget: 46.7% (matches Full Attention)
  • LazyEviction at 30%: 43.3%

TriAttention outperforms H2O (33.3%), TOVA (36.7%), RaaS (36.7%), and R-KV (43.3%) at equivalent budgets.

6. Limitations and Trade-offs

Assumption: Q/K Concentration Holds Sufficiently for Prediction

The entire method rests on the premise that pre-RoPE Q/K vectors are concentrated enough for the trigonometric series approximation to be accurate. While the paper demonstrates that ~90% of heads in Qwen3-8B exhibit R > 0.95, this still means ~10% of heads have lower concentration where the trigonometric series becomes less reliable.

The adaptive weighting mechanism (using 1 - Rf to scale the norm-based component) mitigates this but doesn't eliminate it. For heads with low concentration, the method falls back to norm-based scoring—which the ablation studies show is substantially weaker (Table 3A: norm-only achieves 18.8% vs. 42.1% with both components). The paper doesn't provide a detailed breakdown of how accuracy correlates with per-head concentration, leaving open whether certain critical heads (e.g., retrieval heads, reasoning-specific heads) systematically have lower concentration.

Reconstruction Correlation Is Meaningful but Not Perfect

The reconstruction correlation experiments (Figure 3) show mean Pearson r values above 0.5, with peaks at 0.6–0.9. This validates the trigonometric series as capturing dominant attention structure, but a correlation of 0.5–0.7 still leaves substantial unexplained variance. The paper doesn't analyze what attention patterns the trigonometric series fails to capture—whether certain distance ranges, specific token relationships, or context-dependent patterns that require actual query-key interaction.

This matters because KV cache pruning is a one-way operation: once a token is evicted, it cannot be recovered. If the trigonometric series systematically misestimates importance for certain token types (e.g., intermediate reasoning steps, bridge concepts in multi-hop reasoning), errors could accumulate over long generations.

Calibration Domain Mismatch Risk

The cross-domain calibration experiment (Table 3C) shows that calibration on coding data transfers to reasoning (44.2% vs. 42.1%), suggesting Q/K statistics are model-intrinsic. However, the paper tests only two domains (coding and reasoning) on mathematical benchmarks. For deployment in other domains—scientific reasoning, legal analysis, creative writing—the paper provides no guarantees.

More concerning: the calibration data quality experiment (Table F) shows that even "low quality" data (Google homepage HTML) works comparably to ShareGPT chat (46.2% vs. 46.7%). This suggests the statistics are robust to domain, but it also raises a question: if calibration data doesn't matter, what does determine the specific Q/K center values? The paper characterizes concentration but doesn't explain why specific centers emerge or whether they could shift under distribution shift.

Geometric Spacing of Future Offsets Is Not Theoretically Justified

The scoring function averages over future offsets D = {1, 2, 4, ..., 2^16} using geometric spacing. The ablation (Table E) shows geometric spacing dramatically outperforms linear (45.8% vs. 28.7%), but the paper doesn't provide theoretical justification for this choice.

Intuitively, geometric spacing provides denser sampling at near distances where attention patterns vary rapidly, which makes sense given the RoPE frequency structure. But the specific choice of starting at 1 and extending to 2^16 appears arbitrary. The ablation shows that max distance matters (128 vs. 4096: 41.7% vs. 48.8%), but doesn't sweep the optimal range or spacing ratio. Practitioners would benefit from guidance on how to set these for different model configurations.

Window-Based Pruning Introduces Latency Spikes

Pruning every 128 tokens is a practical optimization, but it introduces latency spikes at pruning boundaries. The paper measures throughput averaged over 16K generation, which smooths over these spikes. For latency-sensitive applications (interactive chat, real-time agents), these periodic pauses could be problematic.

The paper doesn't analyze the computational cost of the scoring operation itself. While batched pruning reduces frequency, scoring requires computing trigonometric series across all cached keys for all query heads—a non-trivial operation. Appendix A acknowledges this as future work:

"A primary avenue for future improvement involves the development of a dedicated, high-performance inference kernel designed to further accelerate the computation of trigonometric series and the subsequent cache pruning process."

Limited Evaluation on Non-Reasoning Tasks

While the paper includes LongBench and RULER results (Tables B-D), the evaluation heavily emphasizes mathematical reasoning (AIME24, AIME25, MATH 500). On LongBench, TriAttention wins 11 of 16 subtasks with average 48.1 vs. SnapKV's 45.2—a meaningful but modest improvement. On RULER, TriAttention achieves 66.1 vs. SnapKV's 55.6.

The question remains: does Q/K concentration exhibit different properties in non-reasoning contexts? The paper shows concentration is stable across Math, Coding, and Chat domains (Section 3.3), but doesn't analyze whether the downstream task performance follows similar patterns. Reasoning tasks involve specific attention patterns (step-by-step dependencies, backtracking) that may particularly benefit from TriAttention's distance-aware scoring.

No Analysis of Head-Specific Importance

The paper treats all heads uniformly, using the same scoring function and budget allocation. However, prior work (Wu et al., 2025; Xiao et al., 2025) shows that different attention heads serve different functions—retrieval heads, induction heads, sink heads. These may require different compression strategies.

The concentration analysis (Figure 2C) shows a distribution of R values—some heads have near-perfect concentration, others less so. The paper aggregates across all heads in its evaluation but doesn't analyze whether certain head types benefit more or less from the trigonometric series approach. For example, if retrieval heads (which attend to distant tokens after long dormancy) have systematically different concentration patterns, this could explain the strong performance on reasoning tasks but more modest gains on LongBench.

Hard Limit on Memory Budget

TriAttention requires setting a fixed KV budget before generation begins. The paper experiments with budgets from 512 to 4096 tokens, but provides limited guidance on how to set this hyperparameter for new tasks or models.

The adaptive weighting based on concentration helps the method degrade gracefully, but there's no mechanism to dynamically expand the budget when the model "needs more memory" for a particular problem. The Recursive State Query benchmark (Figure 5D) shows TriAttention maintains accuracy up to depth 16 but degrades beyond that—suggesting the method correctly identifies important tokens within budget, but cannot exceed the budget when genuinely needed.

Comparison with H2O Has Methodological Issues

Table D compares TriAttention with H2O but notes that H2O "requires materializing the full O(n²) attention matrix and cannot use FlashAttention." This means H2O is compared on only 12 LongBench subtasks where it fits in 48GB memory—a substantial handicap.

While this reflects a real limitation of H2O, it makes the comparison somewhat unfair. H2O's inability to use FlashAttention is an architectural constraint, but TriAttention's compatibility with FlashAttention is a meaningful advantage that deserves clearer framing.

7. Implications and Future Directions

How This Work Changes the Landscape

This paper fundamentally reframes KV cache compression from a reactive observation problem to a predictive modeling problem. Prior methods asked: "What did recent queries attend to?" TriAttention asks: "What will future queries attend to, based on structural properties of the model?"

This shift has several implications:

Theoretical clarity. The trigonometric series derivation provides a principled mathematical framework connecting RoPE structure to attention patterns. Prior methods relied on empirical signals (attention scores, norm patterns) without explaining why those signals predict importance. The Q/K concentration phenomenon and its connection to distance preference offer a causal explanation: tokens are important because the model's Q/K centers create distance-dependent attention biases.

Evaluation methodology. The paper introduces the Reconstruction Correlation metric (r̄) that directly tests whether a proposed importance estimator matches actual attention. This provides a diagnostic tool for future work: rather than only testing downstream task accuracy, researchers can measure how well their importance signal reconstructs attention patterns.
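As a rough illustration of this kind of diagnostic, the sketch below computes a per-head Pearson correlation between an estimator's predicted scores and observed attention weights, then averages across heads. The choice of Pearson correlation and of a plain mean over heads is an assumption made here; the paper's exact definition of r̄ may differ. The function names and toy numbers are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: correlation is undefined, report 0
    return cov / math.sqrt(vx * vy)

def reconstruction_correlation(predicted_by_head, actual_by_head):
    """Mean over heads of the correlation between predicted and observed scores."""
    rs = [pearson(p, a) for p, a in zip(predicted_by_head, actual_by_head)]
    return sum(rs) / len(rs)

# Toy data (made-up numbers): two heads, four cached tokens each.
predicted = [[0.9, 0.1, 0.4, 0.2], [0.3, 0.3, 0.8, 0.1]]
actual = [[0.7, 0.05, 0.2, 0.05], [0.1, 0.5, 0.3, 0.1]]
r_bar = reconstruction_correlation(predicted, actual)
```

A value near 1 would indicate that the estimator ranks cached tokens much as real attention does; a value near 0 would indicate that it carries little usable signal.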

Cross-architecture portability. The validation on MLA (Table G) suggests Q/K concentration is not an artifact of GQA or of a specific attention implementation, but a more general property of RoPE-based transformer attention. Whether the framework extends to other positional encodings (ALiBi, relative position embeddings) remains open, since the derivation itself depends on RoPE's rotation structure and would need to be reworked for each scheme.

Follow-Up Research This Work Enables

Head-specific compression budgets. The concentration analysis shows heterogeneity across heads—some have R > 0.99, others around 0.7. Future work could allocate higher budgets to heads with lower concentration (where prediction is harder) and lower budgets to highly concentrated heads (where a few tokens dominate). The paper's uniform budget is likely suboptimal.
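One way such an allocation could look is sketched below: every head receives a minimum budget, and the remainder is split in proportion to (1 - R). This weighting rule, the function name, and the default floor of 64 tokens are hypothetical choices for illustration; the paper uses a uniform budget and does not propose this policy.

```python
def allocate_head_budgets(concentration, total_budget, min_per_head=64):
    """Split a total KV budget across heads, favoring less concentrated heads.

    concentration: per-head R values in [0, 1].
    Weighting by (1 - R) is a hypothetical policy, not taken from the paper.
    """
    n = len(concentration)
    floor_total = min_per_head * n
    if total_budget < floor_total:
        raise ValueError("total_budget is too small for the per-head minimum")
    weights = [max(1.0 - r, 0.0) for r in concentration]
    weight_sum = sum(weights)
    spare = total_budget - floor_total
    if weight_sum == 0:
        extra = [spare // n] * n  # all heads fully concentrated: split evenly
    else:
        extra = [int(spare * w / weight_sum) for w in weights]
    return [min_per_head + e for e in extra]

# Three heads with the kind of spread described above (R from 0.99 down to 0.7).
budgets = allocate_head_budgets([0.99, 0.90, 0.70], total_budget=6144)
```

Rounding down means the allocated total can fall slightly below `total_budget`; a production version would need to decide how to distribute the leftover tokens.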

Dynamic budget allocation. The current method uses a fixed budget for each generation. A dynamic policy could monitor model uncertainty or generation quality signals to decide when to expand the KV budget. The concentration metric R could serve as a confidence signal: when R drops, the model might request more memory.

Integration with other compression techniques. TriAttention focuses on importance scoring but could combine with other compression approaches: quantization (INT8/INT4 KV cache), low-rank approximation, or token merging. The trigonometric series could guide which tokens to merge (tokens with similar distance preferences) or which dimensions to preserve during quantization.

Extension to other positional encodings. The trigonometric series derivation relies on RoPE's rotation structure. For ALiBi or relative position embeddings, the distance preference function would take a different form. Characterizing Q/K concentration under alternative positional encodings would determine whether the predictive approach generalizes.

Fine-grained analysis of prediction errors. The reconstruction correlation r̄ ~ 0.5-0.7 leaves substantial variance unexplained. Analyzing where and when the trigonometric series fails—specific token types, context conditions, or generation phases—would identify remaining bottlenecks and potential improvements.

Training models for compressibility. If Q/K concentration is a learned property rather than an architectural necessity, models could be explicitly trained to have more concentrated Q/K distributions, making them more amenable to KV compression. This could be a regularization objective during pretraining or fine-tuning.

Practical Applications and Downstream Use Cases

Consumer GPU deployment. The OpenClaw demonstration (Figure C) shows TriAttention enabling Qwen3-32B (INT4) on a single RTX 4090—a consumer GPU with 24GB memory. This has immediate practical value: individuals and small organizations can deploy long-context reasoning models without datacenter hardware. The 10.7× memory reduction means a model that would require 80GB+ KV cache can run in 8GB.
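To make the memory arithmetic concrete, the sketch below applies the standard KV cache size formula (keys plus values, per layer, per KV head, per token). The model dimensions are illustrative assumptions and are not taken from the paper; only the 10.7× reduction factor comes from the reported results.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens,
                   bytes_per_value=2, batch_size=1):
    """Size of a KV cache in bytes; the leading 2 covers keys and values."""
    return (2 * num_layers * num_kv_heads * head_dim
            * num_tokens * bytes_per_value * batch_size)

GIB = 1024 ** 3

# Hypothetical configuration: 64 layers, 8 KV heads, head_dim 128, fp16 values.
full = kv_cache_bytes(num_layers=64, num_kv_heads=8, head_dim=128, num_tokens=32768)
full_gib = full / GIB              # uncompressed cache for one 32K-token sequence
compressed_gib = full_gib / 10.7   # same cache after a 10.7x reduction
```

Under these assumed dimensions a single 32K-token sequence needs 8 GiB of KV cache uncompressed and well under 1 GiB after compression, which is where the headroom for larger batches on a 24GB card comes from.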

Batch inference optimization. The 2.5-6.3× throughput improvements (Table 4) translate directly to cost savings in production settings. For API providers running thousands of concurrent generations, TriAttention could dramatically increase batch sizes or reduce GPU requirements. The batched pruning approach (every 128 tokens) aligns well with production serving systems that already batch requests.
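The batched pruning schedule can be sketched as a simple loop: tokens accumulate freely, and every `interval` steps the cache is cut back to the budget by keeping the highest-scoring entries. The cache is modeled here as a list of token positions, and `score_fn` is a placeholder for the importance estimator; both are simplifications assumed for illustration rather than the paper's implementation.

```python
def generate_with_batched_pruning(num_steps, score_fn, budget=2048, interval=128):
    """Simulate decoding, evicting low-score tokens every `interval` steps.

    score_fn(position, current_step) -> float stands in for the importance
    estimator. Returns the token positions still cached at the end.
    """
    cache = []
    for step in range(num_steps):
        cache.append(step)  # the new token's KV entry
        if (step + 1) % interval == 0 and len(cache) > budget:
            ranked = sorted(cache, key=lambda pos: score_fn(pos, step), reverse=True)
            cache = sorted(ranked[:budget])  # keep the top entries, in position order
    return cache

def toy_score(pos, step):
    """Hypothetical estimator: keep the first four tokens, otherwise prefer recency."""
    return 1e9 if pos < 4 else -(step - pos)

kept = generate_with_batched_pruning(num_steps=1000, score_fn=toy_score,
                                     budget=256, interval=128)
```

Between pruning events the cache temporarily exceeds the budget by up to `interval` tokens, so the peak memory to plan for is `budget + interval` entries rather than `budget`.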

Edge device inference. Beyond consumer GPUs, the memory reduction enables deployment on edge devices (laptops, mobile phones) where memory is the primary constraint. The calibration overhead is one-time and can be pre-computed; the inference-time scoring adds minimal compute compared to attention itself.

Long-context agents. Multi-turn agents like OpenClaw accumulate context across many interactions. TriAttention's stable importance estimation (not dependent on observation windows) is particularly suited to this setting where relevant information may be mentioned many turns earlier.

When to Prefer This Method Over Alternatives

Prefer TriAttention when:

- Generating long sequences (>4K tokens) where memory pressure is significant
- The task involves chain-of-thought reasoning with step dependencies (mathematical reasoning, planning)
- Deployment is on memory-constrained hardware (consumer GPUs, edge devices)
- You can perform one-time calibration on a representative dataset (even low-quality data works)

Prefer post-RoPE observation methods (SnapKV, R-KV) when:

- Generation lengths are short (<2K tokens) where observation windows are sufficient
- You need zero-overhead deployment (no calibration step)
- The model uses non-RoPE positional encodings where the trigonometric series derivation doesn't apply

Prefer streaming methods (StreamingLLM) when:

- Memory budget is extremely tight (<512 tokens)
- You need to process infinite-length streams without any eviction lag
- Task has strong recency bias where old tokens are rarely needed

Reproducibility and Integration Notes

The paper provides code at https://github.com/WeianMao/triattention. Key integration steps:

  1. Calibration: Run calibration on ~50K+ tokens of representative data (Table F shows data quality doesn't matter). Compute per-head E[q_f], E[‖q_f‖], and R_f.

  2. Implementation: The scoring function requires pre-RoPE Q/K vectors, which standard transformers don't expose by default. Implementations need to intercept before RoPE is applied (or apply inverse RoPE to post-RoPE vectors).

  3. Hyperparameters: Default settings (geometric spacing to 2^16, pruning every 128 tokens, budget 2048) work well for 8B-scale models on reasoning tasks. Larger models or different tasks may require tuning; use reconstruction correlation as a diagnostic.

  4. GQA handling: The normalize-then-aggregate approach (Equations 12-13) is critical for grouped-query attention. Skipping this causes head-specific score scales to corrupt importance ranking.

  5. FlashAttention compatibility: TriAttention is compatible with FlashAttention since scoring happens outside the attention kernel. The actual attention computation remains unchanged.
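Steps 1, 2, and 4 above can be sketched in a few small functions. This is a minimal illustration under stated assumptions, not the repository's code: each head vector is represented as a list of complex numbers (one per 2-D RoPE band), and how real dimensions pair up into bands depends on the model implementation. The concentration measure is taken to be R_f = |E[q_f]| / E[|q_f|], and the group aggregation uses z-score normalization followed by a mean; both are assumed forms, and Equations 12-13 in the paper may differ. All function names are hypothetical.

```python
import cmath

def rope_freqs(head_dim, base=10000.0):
    """Standard RoPE frequencies, one per 2-D band."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]

def apply_rope(bands, position, freqs, inverse=False):
    """Rotate each complex band by +position*freq (or -position*freq to undo)."""
    sign = -1.0 if inverse else 1.0
    return [z * cmath.exp(1j * sign * position * w) for z, w in zip(bands, freqs)]

def calibrate(pre_rope_queries):
    """Per band: center E[q_f], mean norm E[|q_f|], and assumed R_f."""
    n = len(pre_rope_queries)
    stats = []
    for f in range(len(pre_rope_queries[0])):
        center = sum(q[f] for q in pre_rope_queries) / n
        mean_norm = sum(abs(q[f]) for q in pre_rope_queries) / n
        r = abs(center) / mean_norm if mean_norm > 0 else 0.0
        stats.append((center, mean_norm, r))
    return stats

def zscore(xs):
    """Standardize scores so heads with different scales become comparable."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / std if std > 0 else 0.0 for x in xs]

def aggregate_group_scores(per_query_head_scores):
    """Normalize each query head's scores, then average across the GQA group."""
    normalized = [zscore(s) for s in per_query_head_scores]
    return [sum(col) / len(col) for col in zip(*normalized)]

# Step 2: recover a pre-RoPE vector from its rotated form at position 17.
freqs = rope_freqs(head_dim=8)
q = [1 + 2j, 0.5 - 1j, -0.3 + 0.2j, 2 + 0j]
recovered = apply_rope(apply_rope(q, 17, freqs), 17, freqs, inverse=True)

# Step 1: a tight cluster gives R close to 1; opposite vectors give R = 0.
tight = calibrate([[1 + 1j], [1.1 + 0.9j], [0.9 + 1.1j]])
spread = calibrate([[1 + 0j], [-1 + 0j]])

# Step 4: two query heads in one group with very different score scales.
group_scores = aggregate_group_scores([[1.0, 2.0, 3.0], [100.0, 300.0, 200.0]])
```

In the last example, summing raw scores would let the second head's larger scale decide the ranking; after normalization, both heads contribute equally to the aggregated score of each token.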