Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
ArXiv: 2510.19338
🎯 Pitch
This paper introduces Ring-linear-2.0, a family of large language models that use linear attention in most layers and softmax attention only periodically, which cuts memory and compute costs for long-context reasoning. Combined with fused FP8 kernels and a training–inference alignment procedure for reinforcement learning, the models reach up to 10x faster inference than the baselines and train stably over very long contexts, while keeping reasoning quality competitive with state-of-the-art models. This makes long-context LLMs more practical and economical to deploy in applications such as agents, code assistants, and knowledge retrieval.
1. Executive Summary
This paper proposes a long-context large‑language‑model family, Ring-linear-2.0, that mixes two kinds of attention: fast, constant‑state “linear attention” for most layers and standard softmax attention sparingly. It couples this architecture with FP8 kernel fusions and a training–inference alignment procedure for reinforcement learning (RL), yielding large speedups and stable long‑horizon training while maintaining competitive reasoning quality (Figures 6–8; Tables 2–3).
2. Context and Motivation
- Problem addressed
- Long‑context reasoning (agents, code generation, retrieval, test‑time scaling) demands processing sequences well beyond 32K tokens. Standard softmax attention has quadratic compute with sequence length and a Key–Value (KV) cache that grows linearly with output length, inflating both compute and memory I/O during decoding (Section 2.2; Figure 4).
- Why it matters
- Practical deployments increasingly rely on “test‑time scaling”—letting models “think longer” by generating more tokens—and on long input contexts. Without better attention mechanisms and efficient kernels, cost and latency make these use cases uneconomical (Introduction; Section 2.2.4).
- Prior approaches and gaps
- Pure linear-time sequence models (RetNet, Lightning Attention, Mamba, Gated Linear Attention, DeltaNet—Section 1) reduce theoretical complexity and keep a constant‑size state. However:
- Pure linear models often underperform at large scale and on retrieval (Section 1; Section 2.2.2).
- Their advantages tend to appear only beyond ~8K tokens, while much pretraining still sits in the 4–8K range; MoE computation can dominate and blunt efficiency gains during pretraining (Section 1).
- Community kernels for linear attention were fragmented and lacked support for advanced inference features like speculative decoding with tree masks (Section 3.4).
- Hybrid designs mixing linear and softmax attention exist (e.g., Minimax M1, GPT‑OSS, Qwen3‑Next; Section 1) but lack a systematic exploration of the optimal ratio, full‑stack kernel fusions in FP8, and training–inference alignment for long‑horizon RL.
- Positioning
- Ring-linear-2.0 targets this gap with: (a) a tuned hybrid ratio found by scaling‑law analysis (Section 2.2.2; Figure 3), (b) infrastructure‑level FP8 kernel fusion (“linghe”) for training and inference (Section 3; Figure 6), and (c) a module‑by‑module alignment method that stabilizes RL for long Chain‑of‑Thought outputs (Section 5.2; Figures 10–12).
3. Technical Approach
At a glance, the system has three pillars: a hybrid linear–softmax architecture, compute kernels and FP8 training/inference optimizations, and a training pipeline (continued pretrain → SFT → RL) with training–inference alignment.
- Model architecture (Section 2; Figure 2; Table 1)
- The model is organized into layer groups. Each group has M linear‑attention blocks followed by one softmax attention block (“Grouped Query Attention”, GQA). GQA reduces the KV cache by sharing keys/values across multiple query heads.
- The feedforward is a high‑sparsity MoE (Mixture‑of‑Experts) with a 1/32 activation ratio (top_k=8 of n_experts=256, so only ~3% of expert parameters are active per token), optimized with choices like shared experts, sigmoid routing without auxiliary load‑balancing loss, and Multi‑Token Prediction (MTP) heads (Section 2.1; Figure 2).
- Two open models:
- Ring-mini-linear-2.0: 16.4B total parameters, 1.6B active (957M non‑embedding), d_model=2048, n_layers=20, n_experts=256, top_k=8, n_heads=16, n_kv_heads=4, hybrid ratio 1:4 (one softmax layer after every 4 linear layers), 128K context (Table 1).
- Ring-flash-linear-2.0: 104.2B total, 7.4B active (6.1B non‑embedding), d_model=4096, n_layers=32, same MoE settings, n_heads=32, hybrid ratio 1:7, 128K context (Table 1).
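The layer-group layout described above can be sketched in a few lines. This is a minimal illustration, assuming that n_layers counts both linear and softmax blocks and is an exact multiple of M + 1; the function name `hybrid_layer_pattern` is hypothetical and not taken from the paper.

```python
def hybrid_layer_pattern(n_layers: int, group_size_m: int) -> list[str]:
    """Return the attention type of each layer in a hybrid stack.

    Assumption (not confirmed by the source): n_layers counts both linear
    and softmax blocks and is an exact multiple of (group_size_m + 1).
    """
    group_len = group_size_m + 1
    if n_layers % group_len != 0:
        raise ValueError("n_layers must be a multiple of M + 1")
    # One group = M linear-attention blocks followed by one softmax block.
    group = ["linear"] * group_size_m + ["softmax"]
    return group * (n_layers // group_len)


mini = hybrid_layer_pattern(n_layers=20, group_size_m=4)
flash = hybrid_layer_pattern(n_layers=32, group_size_m=7)
print(mini.count("linear"), mini.count("softmax"))    # 16 4
print(flash.count("linear"), flash.count("softmax"))  # 28 4
```

Under that assumption, both published configurations end up with four softmax layers, and the remaining layers carry only a constant-size state.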
- Linear attention core (Section 2.2.1; Equations 1–4)
- Intuition: replace the pairwise attention computation with a recurrent accumulation of “key–value outer products,” so decoding only needs a constant‑size state per head.
- Formalization:
- Linear attention output (no softmax, so the product can be reassociated): O = Q (K^T V) (Eq. 1).
- With fixed decay λ (Lightning Attention form), the t‑th output is o_t = q_t ∑_{s≤t} λ^{t-s} k_s^T v_s (Eq. 2),
- maintained via a recurrent state kv_t = λ kv_{t-1} + k_t^T v_t, o_t = q_t (kv_t) (Eq. 3).
- The state kv_t ∈ R^{d×d} is the constant‑size KV cache for linear attention (Eq. 4). This makes memory independent of sequence length in decoding.
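The equivalence between the summed form (Eq. 2) and the recurrent form (Eq. 3) can be checked numerically. The sketch below is a single-head toy in plain Python, assuming keys and values share the same dimension d; the function names, sizes, and decay value are illustrative assumptions, not the paper's kernels.

```python
import random


def outer(k, v):
    """Outer product k^T v of two length-d vectors, as a d x d nested list."""
    return [[ki * vj for vj in v] for ki in k]


def row_times_matrix(q, m):
    """Row vector q (length d) times matrix m (d x d)."""
    return [sum(q[i] * m[i][j] for i in range(len(q))) for j in range(len(m[0]))]


def linear_attention_sum(qs, ks, vs, lam):
    """Eq. 2 form: o_t = q_t * sum_{s<=t} lam^(t-s) k_s^T v_s (work grows with t)."""
    d = len(qs[0])
    outputs = []
    for t, q in enumerate(qs):
        acc = [[0.0] * d for _ in range(d)]
        for s in range(t + 1):
            weight = lam ** (t - s)
            kv_s = outer(ks[s], vs[s])
            for i in range(d):
                for j in range(d):
                    acc[i][j] += weight * kv_s[i][j]
        outputs.append(row_times_matrix(q, acc))
    return outputs


def linear_attention_recurrent(qs, ks, vs, lam):
    """Eq. 3 form: kv_t = lam * kv_{t-1} + k_t^T v_t, o_t = q_t kv_t."""
    d = len(qs[0])
    kv = [[0.0] * d for _ in range(d)]  # constant-size state, independent of t
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        update = outer(k, v)
        kv = [[lam * kv[i][j] + update[i][j] for j in range(d)] for i in range(d)]
        outputs.append(row_times_matrix(q, kv))
    return outputs


random.seed(0)
T, d, lam = 6, 4, 0.9  # toy sizes and decay, chosen only for illustration
qs, ks, vs = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
              for _ in range(3))
out_sum = linear_attention_sum(qs, ks, vs, lam)
out_rec = linear_attention_recurrent(qs, ks, vs, lam)
max_diff = max(abs(a - b) for ra, rb in zip(out_sum, out_rec) for a, b in zip(ra, rb))
print(max_diff < 1e-9)  # True: both forms agree up to rounding
```

The recurrent version touches only the d×d state at each step, which is why the decoding state does not grow with the number of generated tokens.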
- Why a hybrid architecture (Section 2.2.2–2.2.3)
- Pure linear attention underperforms on retrieval/extrapolation; a small number of softmax layers repairs this (Section 2.2.2).
- The paper fits compute–loss scaling laws (Chinchilla‑style) for different ratios (Figure 3).
- “the hybrid linear architecture consistently outperforms the pure softmax architecture… a large layer group size (e.g., M=7) performs well under high FLOP budgets” (Figure 3).
- Final choices: M=7 for the 104B model, M=4 for the 16B model (Section 2.2.2; Table 1).
- Design choices inside linear attention (Section 2.2.3)
- Grouped RMSNorm: normalize locally per tensor‑parallel rank to avoid all‑reduce (cuts communication).
- Partial RoPE after QK normalization: RoPE on half the dimensions lowered LM loss by ~0.004 (Section 2.2.3; Figure 2).
- Head-wise decay: a power‑law schedule across heads improved LM loss by ~0.04 over linear decay and helped downstream tasks (Section 2.2.3).
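To make "RoPE on half the dimensions" concrete, here is a minimal sketch of partial rotary embedding for one vector. The choice of which half is rotated (the first), the adjacent-pair convention, and base=10000 are assumptions for illustration; the source does not specify them, and `partial_rope` is a hypothetical name.

```python
import math


def partial_rope(x, pos, rotary_fraction=0.5, base=10000.0):
    """Rotate the first `rotary_fraction` of x by position-dependent angles.

    Assumptions: adjacent elements (2i, 2i+1) form a rotation pair, the
    rotated slice is the leading one, and the frequency base is 10000.
    """
    rot_dim = int(len(x) * rotary_fraction)
    rot_dim -= rot_dim % 2  # rotations act on pairs, so keep an even count
    out = list(x)           # dimensions beyond rot_dim pass through unchanged
    for i in range(0, rot_dim, 2):
        theta = pos * base ** (-i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out


x = [1.0, 0.0, 0.5, -0.5, 2.0, 3.0, -1.0, 4.0]
rotated = partial_rope(x, pos=5)
print(rotated[4:] == x[4:])  # True: the second half carries no positional rotation
```

Because each pair is rotated rather than scaled, the vector norm is preserved, and position 0 leaves the input unchanged.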
- Decoding cost model (Section 2.2.4; Figure 4)
- Decoding speed is bound by memory bandwidth to the KV cache/state. Linear attention’s constant‑size state plus GQA yields substantially smaller memory access growth than softmax, GQA, or MLA alone as sequence length increases (Figure 4).
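A back-of-the-envelope version of this argument compares the bytes of cache/state held per sequence in an all-GQA stack and in a hybrid stack. This is a simplified sketch, not the paper's cost model: it assumes head_dim = d_model / n_heads, a BF16 KV cache, an FP32 linear state, and four softmax layers in the hybrid model; the function names are hypothetical.

```python
def softmax_kv_bytes(seq_len, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache of one softmax (GQA) layer: keys + values for every cached token."""
    return 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem


def linear_state_bytes(n_heads, head_dim, bytes_per_elem=4):
    """State of one linear-attention layer: one head_dim x head_dim matrix per head."""
    return n_heads * head_dim * head_dim * bytes_per_elem


def decode_state_bytes(seq_len, n_layers, n_softmax_layers, n_heads, n_kv_heads, head_dim):
    """Total cache/state bytes for a stack with the given number of softmax layers."""
    n_linear_layers = n_layers - n_softmax_layers
    return (n_softmax_layers * softmax_kv_bytes(seq_len, n_kv_heads, head_dim)
            + n_linear_layers * linear_state_bytes(n_heads, head_dim))


# Ring-mini-linear-2.0 settings from Table 1; head_dim = d_model / n_heads is an assumption.
cfg = dict(n_layers=20, n_heads=16, n_kv_heads=4, head_dim=2048 // 16)
for seq_len in (4 * 1024, 32 * 1024, 128 * 1024):
    all_softmax = decode_state_bytes(seq_len, n_softmax_layers=20, **cfg)
    hybrid = decode_state_bytes(seq_len, n_softmax_layers=4, **cfg)
    print(f"{seq_len:>7} tokens: all-GQA {all_softmax / 2**20:8.0f} MiB, "
          f"hybrid {hybrid / 2**20:8.0f} MiB, ratio {all_softmax / hybrid:.2f}")
```

In this toy model the linear layers contribute a fixed term, so the gap between the two stacks widens as the sequence grows and approaches the ratio of total layers to softmax layers.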
- Kernel and FP8 training/inference optimizations (Section 3)
- Fused kernels to minimize memory traffic and activation size (Figure 5), including:
- Linear attention gate fusion (transpose + grouped norm + sigmoid gate).
- permute/unpermute fused with padding by modifying the routing map.
- QK normalization fused with split, partial RoPE, and transpose.
- Router casting done in‑kernel (BF16→FP32) to reduce I/O.
- A single Triton prefill kernel for linear attention by re‑partitioning Q/K and V (instead of 2–4 kernels; Section 3.1).
- FP8 training (Section 3.2):
- Quantization fusion: e.g., have SiLU emit quantized tensors directly, halving I/O from ~8MN to 4MN for input shape [M, N].
- State‑aware recomputation: during backward, only produce the quantized x^T needed for dw = x^T y; during forward (non‑recompute), only produce the quantized x. This reduces redundant compute and quantize ops.
- Inference system integration (Section 3.4; Figures 7–8)
- Integrated fused linear attention kernels into SGLang and vLLM, expanding mode coverage and throughput.
- Implemented the first linear attention kernel supporting tree masks, enabling speculative decoding for a hybrid linear model; available in the offline framework Flood and being ported to SGLang (Section 3.4).
- Training pipeline and schedulers (Section 4; Figure 9)
- Start from softmax‑based Ling-base-2.0-20T checkpoints; convert each linear layer’s QKV into MHA weights along the head dimension and randomly initialize new gate/RMSNorm parameters (Section 4).
- Two‑stage continued pretraining:
1) “Continued training” at 4K context to recover base capabilities: 600B tokens (mini), 1T tokens (flash).
2) “Mid‑training”: extend contexts 4K → 32K → 128K while increasing high‑quality reasoning data.
- Use a WSM (Warmup‑Stable‑Merge) LR schedule (merge mid‑training checkpoints instead of explicit decay), recovering ≥98% of the base models’ performance across categories (Figure 9).
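The "merge" step can be pictured as averaging several checkpoints taken during the stable learning-rate phase. The source only says that checkpoints are merged instead of decaying the learning rate; the (optionally weighted) parameter average below is an assumption about what that merge looks like, and `merge_checkpoints` is a hypothetical helper operating on plain lists rather than real tensors.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dicts (name -> list of floats).

    Assumption: merging means averaging parameters element-wise; the actual
    WSM merge rule and weights are not given in the source.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0] * len(checkpoints)
    if len(weights) != len(checkpoints):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    merged = {}
    for name, reference in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(reference))
        ]
    return merged


ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
print(merge_checkpoints([ckpt_a, ckpt_b]))  # {'w': [2.0, 3.0], 'b': [0.5]}
```

Averaging checkpoints from a constant-learning-rate run plays a role similar to a decay phase, without committing to a decay schedule in advance.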
- Post‑training: SFT and RL with alignment (Section 5)
- SFT data mixes hard reasoning (math, code, science, logic) with general tasks and re‑synthesized function‑calling to match broader calling patterns (Section 5.1).
- RL training on carefully filtered, appropriately difficult data at long contexts (often 64K) to avoid truncation and reach a higher ceiling (Section 5.2).
- Training–inference alignment (Section 5.2.1; Figures 10–12):
- Systematically feed the same inputs through training and inference engines and match activations layer‑by‑layer, fixing modules that drift.
- Key fixes:
- Use FP32 state for linear‑attention KV accumulation at inference; BF16 causes error growth (Figure 11).
- FP32 lm_head math via a custom GEMM that takes BF16 inputs but accumulates in registers to control cost.
- Ensure identical RMSNorm eps, FP32 residuals, and unfused residual‑norm in both stacks.
- Match RoPE numerics between PyTorch training and inference operators.
- Use the same attention backend (e.g., FlashAttention) and align prefill–decode numerics.
- Make MoE deterministic: stable topk, fixed permutation/summation orders, same operators.
- RL objective: with alignment, PPO can safely use rollout‑engine probabilities for clipping (Eq. 5), instead of bias‑inducing recomputed training probabilities (Eq. 6). This improves rewards and stabilizes the probability gap (Figure 12).
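The difference between the two RL objectives comes down to which log-probabilities serve as the "old policy" in the PPO ratio. The sketch below is the standard PPO clipped surrogate, not a reproduction of the paper's Eq. 5 or Eq. 6; the function name, the argument names, and clip_eps=0.2 are assumptions for illustration.

```python
import math


def ppo_clip_loss(logp_new, logp_behavior, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss, averaged over tokens.

    logp_behavior holds the log-probs of the policy that generated the data.
    Passing rollout-engine log-probs corresponds to the Eq. 5 choice described
    above; passing log-probs recomputed by the training engine corresponds to
    the Eq. 6 choice.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the raw and clipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)


# Same token, same update, but the behavior log-prob comes from two sources.
logp_new = [math.log(0.30)]
rollout_logp = [math.log(0.25)]    # what the inference engine actually sampled with
recomputed_logp = [math.log(0.29)]  # hypothetical training-engine recomputation
adv = [1.0]
print(ppo_clip_loss(logp_new, rollout_logp, adv))
print(ppo_clip_loss(logp_new, recomputed_logp, adv))
```

When the two engines disagree, the ratio (and therefore whether clipping triggers) depends on which source is used, which is why the alignment work is a precondition for trusting rollout probabilities.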
4. Key Insights and Innovations
- Hybrid ratio chosen by scaling laws (Figure 3)
- Novelty: explicitly fit compute–loss curves to select the number of linear layers per softmax layer, rather than ad‑hoc design. The optimal M depends on FLOP budget; the authors fix M=7 (flash) and M=4 (mini) to balance efficiency and quality (Section 2.2.2).
- Significance: keeps most of the model in linear attention without losing retrieval/extrapolation, outperforming pure softmax in the scaling regime explored.
- Constant‑state linear attention with targeted architectural tweaks (Section 2.2.1–2.2.3)
- Novelty: a practical Lightning‑style linear attention implementation with Grouped RMSNorm, QK norm, partial RoPE, and head‑wise power‑law decay; partial RoPE and head‑wise decay are justified by measured LM‑loss improvements, and Grouped RMSNorm by reduced communication.
- Significance: delivers linear scaling in sequence length for compute and constant state for decoding, addressing the I/O bottleneck (Figures 4, 7–8).
- FP8 end‑to‑end kernel fusion and “state‑aware recompute” (Section 3; Figure 5)
- Novelty: quantization fused into surrounding ops (e.g., SiLU), single‑kernel prefill for linear attention, and recomputation that emits different quantized tensors depending on whether the pass needs x (forward) or x^T (recompute during backward).
- Significance: large training throughput gains, up to +77% on the 16B model and +57% on the 104B model (Figure 6), and higher inference throughput (Figures 7–8).
- Training–inference alignment for long‑horizon RL in hybrid linear MoE models (Section 5.2; Figures 10–12)
- Novelty: a systematic, module‑level alignment process (KV cache precision, lm_head precision, norms, RoPE, attention backend, MoE determinism) tailored to hybrid linear+MoE, coupled with using rollout probabilities in PPO after alignment (Eq. 5).
- Significance: prevents RL collapse, yielding steadily increasing reward and test scores (Figure 13).
- First tree‑mask–compatible linear‑attention kernel for speculative decoding (Section 3.4)
- Novelty: enabling tree‑based speculative decoding in a linear‑attention setting, previously blocked by mask support.
- Significance: reduces small‑batch latency for long-context decoding while keeping linear‑attention efficiency.
5. Experimental Analysis
- Evaluation setup (Section 6.1; Tables 2–3; Figure 1)
- 17 benchmarks covering:
- Mathematical reasoning: AIME’24/’25, OlympiadBench, CNMO’24, LiveMathBench, TheoremQA.
- Coding/agents: HumanEval+, MBPP+, LiveCodeBench (LCB), Codeforces Elo, Spider, BFCL‑Live.
- General reasoning/knowledge: GPQA‑Diamond, SciBench, DROP, MuSR, Multi‑LogiEval.
- Comparisons:
- For Ring-mini-linear-2.0: vs Ring-mini-2.0, Qwen3‑8B‑Thinking, GPT‑OSS‑20B‑Medium (Table 2).
- For Ring-flash-linear-2.0: vs Ring-flash-2.0, Qwen3‑32B‑Thinking, Gemini‑2.5‑Flash, GPT‑OSS‑120B‑Medium, Seed‑OSS‑36B‑Instruct, Qwen3‑Next‑80B‑A3B‑Thinking (Table 3; Figure 1).
- Main quantitative results (Tables 2–3)
- Ring-mini-linear-2.0 (Table 2):
- Math: matches or slightly trails Ring-mini-2.0 by small margins (e.g., AIME’25 73.65 vs 74.06; TheoremQA 69.69 vs 70.09) while staying competitive with Qwen3‑8B‑Thinking.
- Coding: wins on Codeforces Elo (83.84 vs Qwen’s 73.31) and is competitive on HumanEval+/MBPP+/LCB.
- General reasoning: mixed. For example, GPQA‑Diamond 65.69 is near GPT‑OSS‑20B (65.53), and DROP 83.20 is strong but behind Ring-mini-2.0 (88.55).
- Takeaway: despite using mostly linear attention and only 1.6B active params, performance remains comparable to strong 8–20B baselines across tasks.
- Ring-flash-linear-2.0 (Table 3; Figure 1):
- Math: very strong. AIME’25 86.51 (close to Ring‑Flash‑2.0 86.98; near the top vs Qwen3‑Next‑80B‑A3B 87.80), OlympiadBench 87.36, CNMO’24 84.98.
- Coding: LCB 70.37 (roughly on par with Ring‑Flash‑2.0 at 70.76, and better than Qwen3‑32B 62.33 and Gemini‑2.5‑Flash 61.40), Codeforces Elo 90.24 (high, similar to Ring‑Flash‑2.0 90.23).
- General reasoning: GPQA‑Diamond 74.49 (above Qwen3‑32B 68.40 but below Gemini‑2.5‑Flash 82.80 and Qwen3‑Next‑80B‑A3B 77.20); DROP 89.66 (competitive).
- Overall: broad competitiveness with larger models, with standout math/coding strengths (Figure 1 highlights AIME’25, LCB, Codeforces, GPQA).
- Efficiency results
- Training throughput (Figure 6):
- Ring-mini-linear-2.0: +21% with fused kernels vs baseline FP8; +77% when fused kernels allow TP=1 (from TP=2) at the same micro‑batch size.
- Ring-flash-linear-2.0: +25% with fused kernels; +57% after also doubling micro‑batch (from 1→2) and adjusting pipeline parallelism.
- Inference throughput (Figures 7–8):
- Prefill: linear models overtake softmax models past 8K context and accelerate rapidly; beyond 128K they reach “>2.5×” vs Ring‑2.0 and “>8×” vs baseline dense models (Figure 7a; Figure 8a, text in Section 3.4).
- Decode: past 4K generated tokens, linear models exceed Ring‑2.0; at 64K they achieve “>2× vs Ring‑2.0” and “>10× vs baseline” (Figure 7b; Figure 8b, Section 3.4).
- KV/state memory access scaling (Figure 4): hybrid linear grows much slower with sequence length than GQA or MLA, explaining decode throughput gains.
- Ablations and robustness
- Hybrid ratio/scaling laws (Figure 3): larger M (more linear per softmax) performs well at higher FLOP budgets; hybrid curves dominate pure softmax.
- Linear‑module design tweaks (Section 2.2.3): partial RoPE (~0.004 LM loss improvement) and power‑law head‑wise decay (~0.04) show measured benefits.
- RL stability from alignment (Figures 10–13):
- Each added fix (KV cache, lm_head, RMSNorm, attention backend, RoPE) “contributes to improved training efficiency and stability” (Figure 10).
- After alignment, PPO with rollout probabilities yields higher rewards and keeps the training‑inference probability gap small (Figure 12).
- Training reward and test metrics (AIME’25, LCB) rise steadily over RL steps (Figure 13).
- Do the experiments support the claims?
- Yes for efficiency: clear, repeated speedups in training throughput (Figure 6) and long‑context prefill/decode (Figures 7–8), with a mechanistic explanation (Figure 4).
- Yes for capability retention: continued pretraining restores ≥98% of base performance (Figure 9).
- Yes for RL stability: alignment ablations and PPO probability choices directly correlate with reward stability (Figures 10–12).
- Quality vs top competitors: strong but not always SOTA (e.g., GPQA vs Gemini‑2.5, Table 3); strengths are most pronounced in math/coding and long‑context efficiency.
6. Limitations and Trade-offs
- Softmax layers remain bottlenecks
- Even though few, softmax layers still incur quadratic cost and KV cache growth, limiting the full potential of the hybrid approach (Conclusion).
- Memory overhead from attention heads
- To maintain effectiveness, the linear module keeps the same head count for Q, K, and V, which “brings heavy memory overhead” (Conclusion).
- Benefits are context‑length dependent
- Linear attention’s advantages emerge strongly past ~8K context (Section 1; Figures 7–8). In shorter contexts, dense/MoE compute may dominate, muting gains.
- Knowledge retention trade‑offs
- Continued pretraining shows slight deficits in reasoning/professional knowledge vs the original base (Figure 9), likely due to knowledge forgetting.
- Engineering complexity and reproducibility
- The gains rely on extensive kernel fusions, FP8 quantization strategies, and a careful alignment pipeline (Sections 3 and 5). Portability to all stacks and hardware may require significant effort.
- Evaluation breadth
- While 17 benchmarks are covered, robustness to adversarial retrieval, multimodal long‑context tasks, or non‑English domains is not detailed.
7. Implications and Future Directions
- Field impact
- This work shows that long‑context reasoning can be made economical by combining linear attention (for constant‑state decoding) with just enough softmax layers to preserve retrieval/extrapolation—chosen via scaling laws (Figure 3). The full‑stack engineering (FP8 kernels, inference integration, RL alignment) demonstrates that architectural ideas must be matched with systems work for impact.
- Follow‑up research enabled
- Attention research:
- Reduce memory by decoupling head counts for Q, K, V in linear modules (Conclusion).
- Explore adaptive or learned head‑wise decay schedules and partial‑RoPE strategies.
- Push toward even sparser softmax layers (or smarter placement) while preserving retrieval.
- Systems:
- Generalize tree‑mask linear attention kernels and speculative decoding across toolchains (SGLang/vLLM).
- Extend FP8 fusion patterns to more activations and attention variants; co‑design with compiler‑level fusion.
- RL and alignment:
- Standardize module‑wise alignment tests and metrics; make rollout‑probability PPO the default once aligned.
- Study alignment for multi‑GPU/MoE routing nondeterminism at larger scales and for multimodal models.
- Practical applications
- Long‑document assistants, multi‑step code agents (LCB/Codeforces results), retrieval‑augmented generation with very long contexts (128K+), and test‑time scaling settings that require thousands of “thinking” tokens—all with substantially lower serving cost and higher throughput (Figures 7–8; Section 3.4).
Bottom line: Ring-linear-2.0 combines a scaling‑law‑tuned hybrid architecture with deep systems optimizations and alignment‑aware RL to deliver long‑context efficiency without sacrificing reasoning quality, showing a credible path to economical, capable long‑context LLMs.