
Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation

ArXiv: 2510.22115

🎯 Pitch

This technical report introduces Ling 2.0, a reasoning‑oriented LLM family that uses a high‑sparsity Mixture‑of‑Experts design (256 experts, 8+1 active ≈3.5% per token) to scale from 16B to 1T parameters while cutting active compute by ~7× versus dense models. By combining reasoning‑focused data and mid‑training CoT pre‑activation, a checkpoint‑merging WSM schedule, and novel post‑training RL methods (Evo‑CoT, sentence‑level LPO, GAR), Ling‑1T pushes a new accuracy‑versus‑efficiency Pareto frontier for large‑scale reasoning models — enabling far more compute‑efficient, scalable reasoning capability for open foundation models.


1. Executive Summary (2-3 sentences)

Ling 2.0 is a reasoning-oriented, open language-model family that scales from 16B to 1T total parameters using a unified high-sparsity Mixture-of-Experts (MoE) design where only ~3.5% of parameters are active per token (8 routed experts + 1 shared expert), aiming to convert parameter scale into reasoning gains efficiently (Introduction; Section 2.1; Table 1). It couples this architecture with a reasoning-heavy data strategy (math/code emphasis plus mid-training chain-of-thought “pre-activation”), a decay-free training schedule via checkpoint merging (WSM), and post-training reinforcement learning methods (Evo-CoT, sentence-level RL updates via LPO, and preference alignment via GAR) (Section 3; Section 4). At the trillion scale, the report shows Ling-1T reaching strong reasoning/coding results while keeping active compute far below dense counterparts, framing this as a new accuracy–efficiency trade-off point (Table 8; Figure 13).

2. Context and Motivation

  • Problem / gap addressed.
  • Scaling LLMs to hundreds of billions or trillions of parameters becomes increasingly constrained by compute cost and by the need for reliable multi-step reasoning rather than just surface-level pattern completion (Introduction).
  • The report frames two intertwined difficulties:

    • Efficient scaling: dense models are prohibitively expensive at trillion scale, and very large sparse models require stability/predictability and tight algorithm–system co-design (Introduction).
    • Sustained reasoning enhancement: building and transferring reasoning behaviors across pre-training → mid-training → post-training is difficult, and reasoning-centric corpora are expensive to curate (Introduction; Section 3).
  • Why it matters.

  • The report motivates reasoning as foundational for general-purpose agents that “understand, decide, and act autonomously” (Introduction).
  • Practically, the goal is to increase “reasoning accuracy per unit compute,” i.e., push the Pareto frontier of capability vs. cost (Introduction; Figure 13).

  • Prior approaches and shortcomings (as positioned in the report).

  • Dense scaling: straightforward but cost-prohibitive at the trillion level (Introduction).
  • Sparse MoE scaling: promising for efficiency, but requires careful routing/load balancing, scaling-law-guided design, and infrastructure to realize actual throughput gains (Introduction; Section 2.2; Section 5).
  • Post-training reasoning: the report suggests that simply adding reasoning in fine-tuning can be unstable unless earlier stages “pre-activate” the behavior via data (Section 3.2.2.2; Section 3.3.2; Figure 12).

  • How Ling 2.0 positions itself.

  • A “coordinated innovations” stack across:
    • architecture (high-sparsity MoE + MTP),
    • pre-training (reasoning-heavy mixture + WSM + mid-training CoT),
    • post-training (DFT + Evo-CoT RL + LPO + GAR),
    • infrastructure (full-scale FP8 + heterogeneous pipeline parallelism + distributed/eval tooling)
      (Introduction; Sections 2–5).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • Ling 2.0 is a family of large language models built to achieve strong reasoning and coding with sparse activation, meaning only a small fraction of the model is used for each token.
  • The “shape” of the solution is: use a high-sparsity MoE transformer plus scaling-law-driven training, then deliberately “turn on” reasoning behaviors via data staging and reinforcement learning—while engineering the training system (FP8 + pipeline/parallelism) so sparsity translates into real efficiency (Sections 2–5).

3.2 Big-picture architecture (diagram in words)

  • Input text → Tokenizer (BBPE, 156K vocab) (Section 2.1) →
    Transformer backbone with:
  • early dense layers (1 / 1 / 4 depending on model) (Section 2.1; Table 1),
  • repeated MoE layers where each token routes to 8 of 256 experts + 1 shared expert (Section 2.1; Table 1),
  • GQA attention + stability components (RMSNorm, SwiGLU, QKNorm, Partial RoPE) (Section 2.1),
  • auxiliary MTP head (depth 1) for multi-token prediction (Section 2.2).
  • Training uses:
  • Ling Scaling Laws to set batch size / learning rate and to predict MoE efficiency leverage (Section 2.3; Figures 1–3; Eq. (1)),
  • multi-stage data schedule + mid-training context extension + CoT data (Section 3.2; Figure 5),
  • WSM schedule: warmup → stable LR → end via checkpoint merging (Section 3.2.3; Eqs. (2)–(3)).
  • Post-training:
  • DFT supervised fine-tuning with dual system prompts (Section 4.1; Table 5),
  • Evo-CoT RL with rewards that discourage explicit <think> while allowing deeper reasoning when needed (Section 4.2),
  • RLHF alignment with GAR + RubriX (Section 4.3),
  • large-scale reward infrastructure for code execution / visual evaluation / sandboxing (Section 4.4).

3.3 Roadmap for the deep dive

  • First, I explain the core MoE transformer design and what “3.5% activation” means mechanically (Section 2.1).
  • Second, I cover the routing/load-balancing + MTP additions because they are key to stability and reasoning/coding gains (Section 2.2).
  • Third, I unpack the Ling Scaling Laws + Wind Tunnel methodology because it governs how they choose hyperparameters and extrapolate to 1T (Section 2.3; Figures 1–3; Eq. (1)).
  • Fourth, I detail the data + multi-stage training recipe, including context extension and WSM checkpoint merging (Section 3.1–3.2; Figure 5; Eqs. (2)–(3)).
  • Fifth, I explain the post-training RL stack (DFT, Evo-CoT, LPO, GAR) and what each reward component does (Section 4; Table 5).
  • Finally, I summarize the infrastructure co-design (FP8, pipeline parallelism, distributed kernels) because the report claims sparsity has no end-to-end advantage without it (Section 5).

3.4 Detailed, sentence-based technical breakdown

This is primarily an empirical systems + training recipe paper: it combines an MoE transformer architecture, scaling-law-guided training, staged data/optimization, and distributed-systems engineering to make trillion-scale sparse training stable and efficient (Sections 2–5).

3.4.1 Core model family and configurations (what is trained)

  • The report releases three “reflex-grade non-thinking (instruct)” models (Introduction):
  • Ling-mini-2.0: 16B total params, 1.4B activated (Table 1; Introduction).
  • Ling-flash-2.0: 103B total params, 6.1B activated (Table 1; Introduction).
  • Ling-1T: 1T total params, 51B activated (Table 1; Introduction).
  • All three share a single MoE paradigm:
  • 256 routed experts per MoE layer, activate 8 experts per token, plus 1 shared expert, for ~3.5% activation (Section 2.1; Table 1).
  • The first layers are dense (“First-K-Dense”): 1 / 1 / 4 dense layers for mini/flash/1T (Section 2.1; Table 1).
  • Transformer block details (Section 2.1):
  • Attention uses Grouped-Query Attention (GQA) with a different number of key-value heads per model (the text mentions 8/16/32 KV heads for mini/flash/1T, while Table 1 reports 16/32/64 total attention heads).
  • SwiGLU activation and RMSNorm with pre-normalization are used for efficiency/stability.
  • QKNorm is added to improve robustness, especially under low precision (Section 2.1; reiterated in Section 5.1).
  • Partial RoPE: rotary position embeddings are applied only to the first 64 dimensions of attention heads (Section 2.1).
  • Tokenizer/vocabulary:
  • Byte-level BPE (BBPE) with a 156K vocabulary, extending Ling 1.5, to improve multilingual performance (Section 2.1).
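
The two ratios quoted above measure different things, and a few lines of arithmetic make the distinction explicit. This is a minimal sketch using only the figures listed in this section. Computing the expert-level ratio as (8 + 1) / 256 is an assumption about how the report arrives at "~3.5%".

```python
# Expert-level activation: experts that run per token vs. routed experts per layer.
# Assumption: "~3.5%" is (8 routed + 1 shared) / 256.
ROUTED_EXPERTS = 256
ACTIVE_ROUTED = 8
SHARED_EXPERTS = 1
expert_activation = (ACTIVE_ROUTED + SHARED_EXPERTS) / ROUTED_EXPERTS

# Parameter-level activation: activated params / total params (in billions, from Table 1).
models = {
    "Ling-mini-2.0": (1.4, 16.0),
    "Ling-flash-2.0": (6.1, 103.0),
    "Ling-1T": (51.0, 1000.0),
}
param_share = {name: active / total for name, (active, total) in models.items()}

print(f"expert-level activation: {expert_activation:.2%}")
for name, share in param_share.items():
    print(f"{name}: {share:.2%} of parameters active per token")
```

The parameter-level share is higher than the expert-level ratio for every model. That is expected, since attention, embeddings, and the dense first layers run for every token regardless of routing.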

3.4.2 What sparse MoE activation does (mechanism)

  • In a dense transformer, every token runs through the same full feed-forward module, so compute scales roughly with model size.
  • In an MoE layer here, a router selects a subset of expert feed-forward networks per token:
  • Each token activates 8 routed experts out of 256, and also goes through 1 shared expert (Section 2.1).
  • Because only a small fraction of experts run per token, the model can have huge total capacity (1T parameters) while keeping active computation closer to a much smaller dense model (Introduction; Table 1).
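
The selection step described above can be sketched in plain Python. This is only an illustration of top-k routing with an always-on shared expert. The random scores stand in for learned router affinities, and `route_token` is a hypothetical helper, not code from the report.

```python
import random

NUM_EXPERTS = 256   # routed experts per MoE layer (Section 2.1)
TOP_K = 8           # routed experts activated per token

def route_token(scores, k=TOP_K):
    """Return the indices of the k highest-scoring routed experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

random.seed(0)
# Stand-in for the router's affinity scores for a single token (hypothetical values).
scores = [random.random() for _ in range(NUM_EXPERTS)]

chosen = route_token(scores)
# The shared expert runs for every token, so 8 routed + 1 shared FFNs execute.
experts_run = len(chosen) + 1
print(sorted(chosen), experts_run)
```

Only the chosen experts' feed-forward networks execute for this token. The other 248 routed experts contribute capacity to the model but no compute for this token.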

Key engineering implication emphasized by the report:

  • High sparsity does not automatically yield wall-clock speedups; routing, communication, and heterogeneous blocks can erase gains unless the training system is optimized (Section 5 opening).

3.4.3 Load balancing without auxiliary loss (routing stability)

  • The report adopts an aux-loss-free load balancing strategy similar to DeepSeek-V3, intended to encourage both expert specialization and balanced usage without adding an auxiliary loss term (Section 2.2).
  • They apply router gate scaling with factor 2.5 to stabilize the RMS of gate outputs (Section 2.2).
  • They modify the bias update rule to keep bias centered near zero (Section 2.2). The update is given as:
  • b_i = b_i + u * ( sign(e_i) - mean(sign(e)) )
    where u is an update rate, b_i the i-th expert’s bias, and e_i the expert’s “violation error” (Section 2.2).
  • They additionally use:
  • dropless routing to preserve performance,
  • group routing to improve training efficiency without degrading performance (Section 2.2).

(The paper does not define “violation error” beyond naming it; the above is the full detail provided in the excerpt.)
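
The centered bias update above can be written out in a few lines. This is a sketch under an assumption: since "violation error" is not defined here, the toy errors below treat a positive e_i as "expert i is underloaded", following the usual aux-loss-free balancing convention. The function names are illustrative.

```python
def sign(x):
    """Return -1, 0, or +1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def update_biases(biases, errors, u=0.001):
    """Apply b_i <- b_i + u * (sign(e_i) - mean(sign(e))) to every expert bias."""
    signs = [sign(e) for e in errors]
    mean_sign = sum(signs) / len(signs)
    return [b + u * (s - mean_sign) for b, s in zip(biases, signs)]

# Toy example with 4 experts: expert 0 is underloaded, the rest are overloaded.
# Assumption: e_i > 0 means "underloaded"; the report's exact definition is not given.
biases = [0.0, 0.0, 0.0, 0.0]
errors = [3.0, -1.0, -1.0, -1.0]
new_biases = update_biases(biases, errors)
print(new_biases)
```

Subtracting the mean sign makes the per-step updates sum to zero, so the average bias does not drift. This matches the stated goal of keeping the bias centered near zero.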

3.4.4 Multi-Token Prediction (MTP) as an auxiliary objective

  • Multi-Token Prediction (MTP) is integrated as an auxiliary training objective to improve performance and inference efficiency, and is reported to help on code and math across scales (Section 2.2).
  • Configuration details:
  • They use one MTP layer per model scale (“depth 1”) (Section 2.2; Section 3.2.1).
  • MTP loss weight = 0.1 (Section 2.2; Section 3.2.1).
  • Systems detail:
  • MTP adds overhead, so they implement fine-grained pipeline parallel partitioning of the MTP module in Megatron to mitigate throughput loss (Section 2.2; Section 5.2).
  • Later, they describe splitting MTP into separate transformer-layer and loss layers for scheduling flexibility (Figure 17; Section 5.2).

3.4.5 Ling Scaling Laws + “Wind Tunnel” extrapolation

Goal: Choose hyperparameters and architecture at trillion scale based on small and mid-scale experiments, reducing trial-and-error cost.

  • The report claims “over a thousand experiments” underpin their scaling laws (Introduction “Ling Scaling Laws”; Section 2.3.1).
  • Two scaling-law components are described:

1) Optimal hyperparameters as a function of compute.

  • They run hyperparameter searches up to 3e20 FLOPs, initially fixing the MoE architecture to 64 experts (4 active) + 1 shared expert to simplify analysis (Section 2.3.1).
  • They fit power-law relationships between compute C and both the optimal batch size B_opt and the optimal learning rate η_opt (Figure 1a; Section 2.3.1).
  • They report an MoE vs. dense difference: at larger compute, MoEs prefer a larger batch size and a lower learning rate, attributed to sparse gradients, since any given expert sees fewer contributing tokens per batch (Section 2.3.1).
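
A power-law relationship such as B_opt = a · C^b is a straight line in log-log space, so it can be fitted with ordinary least squares. The sketch below shows that generic procedure only. The data points are synthetic (generated from a = 2.0, b = 0.3) and are not values from the report, whose fitted coefficients are not given in this summary.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly)) / sum(
        (u - mean_x) ** 2 for u in lx
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope

# Synthetic (hypothetical) data: compute budgets in FLOPs and "optimal" batch sizes.
compute = [1e18, 1e19, 1e20, 3e20]
batch = [2.0 * c ** 0.3 for c in compute]

a, b = fit_power_law(compute, batch)
print(f"B_opt ~= {a:.3f} * C^{b:.3f}")
```

Once fitted on small runs, the same line can be extrapolated to a larger compute budget. That is the role these laws play in picking the batch size and learning rate at scale.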

2) MoE architectural “efficiency leverage” (EL).

  • They define efficiency leverage (EL) as the ratio of the compute a dense model needs to the compute an MoE model needs to reach equivalent performance (e.g., the same validation loss) (Section 2.3.2).
  • Empirical findings from “over 300 models up to 28B parameters” include (Section 2.3.2; Figure 2):
    • Activation ratio dominates EL (higher sparsity ⇒ higher EL), holding even at very low activation like 1/128 (Figure 2a left).
    • Expert granularity modulates EL; the optimal activated-expert range is reported as 8–12 (Figure 2a right).
    • EL increases with compute budget (amplification effect).
    • Other factors (shared-expert arrangement, MoE layer placement) are secondary.
  • They provide a unified EL scaling law (Eq. (1), Section 2.3.2):
    EL(A, G, C) = Â^(α + γ(log G)^2 + β log G)
    where Â is a saturating transform of the activation ratio A, G is expert granularity, and α = a + d log C captures compute dependence (Eq. (1); Section 2.3.2).
  • Using this, they choose 256 experts with 8 activated + 1 shared expert (~3.5% activation), predicted to yield > 7× EL at scale (Section 2.3.2; Figure 2b; reiterated in Introduction).

Wind Tunnel experiments (Section 2.3.3; Figure 3).

  • They standardize five experiments from 500M to 8B parameters (distributed by a power law) (Figure 3a; Section 2.3.3).
  • For each scale, they fix:
    • 1) architecture/size from the MoE efficiency scaling law,
    • 2) training tokens from the optimal model–data allocation (Figure 1b),
    • 3) learning rate and batch size from the compute-based hyperparameter scaling law (Figure 1a).
  • Claimed benefits:
    • Final training loss is predicted within 0.01 error when following these rules (Section 2.3.3; Figure 3b).
    • Cost is “35% of traditional” single-slice ablation (Section 2.3.3; Figure 3a discussion).

3.4.6 Data pipeline and composition (pre-training)

The pre-training data work has two layers: (i) broad general knowledge + multilingual + long context, and (ii) deliberately emphasized reasoning (math/code) corpora (Section 3.1).

  • General knowledge data cleaning and quality filtering (Section 3.1.1):
  • Web extraction via trafilatura parser and targeted cleaning to remove ads/URLs/symbol-heavy text; continuous parser improvements for HTML/PDF.
  • Automated pipeline to detect new low-quality patterns: multi-channel recall (classifiers, lightweight LLM scoring, PPL), issue analysis by LLMs, rule generation + generalization, human review.
  • High-quality filtering inspired by FineWeb-Edu, training feature models by data type to score quality/knowledge density/domain; plus a “recall-rewrite” scheme to rewrite dense STEM content into more structured forms (Wikipedia-style, QA conversion, summaries). They report ablation gains on MMLU/CMMLU/CEval but do not give the exact ablation numbers in the excerpt (Section 3.1.1).

  • Reasoning data: Ling Code + Ling Math (Section 3.1.2; Figure 4; Appendix B.1):

  • Ling Code Corpus:
    • GitHub code with language-specific cleaning + lint validation; covers 660 languages; 2.7T tokens after dedup; a 600B “top-quality” subset; plus 300B augmented code via paraphrasing (Appendix B.1.1).
    • Additional sources: Common Crawl code-related pages (~700B tokens, with 140B+ refined HQ), GHArchive commit reconstruction (73B tokens), programming contest data (Appendix B.1.1).
    • Evaluation uses from-scratch 1B models to validate corpus quality as a proxy for scale; Figure 4a shows benchmark curves, and the text states strong performance with 2T tokens + 300B annealing (Section 3.1.2.1; Figure 4a; Appendix B.1.1).
  • Ling Math Corpus:

    • Assembled from web, textbooks, papers, code repos, problem banks, and synthetic sources; processed via parsing/recall/filter/rewrite/synthesis (Section 3.1.2.2).
    • Uses fastText recall classifiers; “LLM-Filter” and “LLM-Refiner” (noted as small LMs; Appendix mentions 4B parameter filter/refiner) to filter/refine step-by-step math content; synthetic Q&A generation aided by a mathematical concept graph (Section 3.1.2.2; Appendix B.1.2).
    • Evaluation: continue-train a Ling-coder-1B on math-only for 1.8T tokens with last 300B as annealing; Figure 4b reports average-benchmark superiority vs Qwen baselines; Figure 4c compares web-math subsets vs open datasets (Section 3.1.2.2; Figure 4b–c).
  • Multilingual data (Section 3.1.3; Appendix B.2):

  • ~2TB multilingual data; about 30 languages; 4% of total pre-training tokens.
  • Language-family mix reported: Romance ~50%, Germanic ~10%, Slavic ~3%, rest others (Appendix B.2).
  • They note some language families have less negative impact on Chinese/English, while some “other” languages require careful balancing (Section 3.1.3; Appendix B.2).

  • Long-context data (Section 3.1.4):

  • “retrieve–synthesize–validate” pipeline across heterogeneous sources.
  • Quality controls: hygiene checks, semantic consistency checks, and “PPL gap between long- and short-window evaluations.”
  • Output: ~1.2T high-quality long-text tokens (Section 3.1.4).

  • Data infrastructure (Section 3.1.5; Appendix B.3):

  • “Data-as-Code” with version control and CI/CD; 50+ operators; iteration cycle reduced “months to days.”
  • Unified “lakehouse” wide-table design with >20 TB/hour I/O throughput; claims processing 30B trainable data points in two days (Section 3.1.5; Appendix B.3).

3.4.7 Pre-training recipe (multi-stage + WSM)

Model hyperparameters (Section 3.2.1; Table 1).

  • Architecture is the high-sparsity MoE described above.
  • Attention head dimension is fixed at 128 across model sizes (Section 3.2.1).
  • MTP depth is 1; parameters are initialized with standard deviation 0.006 (Section 3.2.1).

Training hyperparameters (Section 3.2.1; Table 1).

  • Optimizer: AdamW with β1=0.9, β2=0.95, weight decay 0.1, gradient-norm clipping 1.0 (Section 3.2.1).
  • Context schedule:
    • 4K context for the first 20T tokens,
    • then 32K context during the first 150B tokens of mid-training,
    • then extended to 128K via YaRN (Section 3.2.1; Section 3.2.2.2; Figure 6).
  • Load-balancing bias update rate (the rate u in the bias update rule of Section 3.4.3, written γ here):
    • γ=0.001 during pre-training,
    • 0.0001 after context extension, for the rest of training (Section 3.2.1).
  • Batch-size ramp:
    • ramp for the first ≈500B tokens (example: 3,024 to peak), then constant (Section 3.2.1).
  • Per-model learning rates and batch sizes (Table 1):
    • Ling-mini-2.0: LR 3.36×10^-4, batch size 4,400.
    • Ling-flash-2.0: LR 2.61×10^-4, batch size 8,352.
    • Ling-1T: LR 1.86×10^-4, batch size 18,144.

Data mixture schedule (Section 3.2.2.1; Figure 5).

  • Pre-training has two 10T-token substages at 4K context (Figure 5; Section 3.2.2.1):
    • Substage 1: 68% general, 32% reasoning.
    • Substage 2: 54% general, 46% reasoning.

Mid-training: long context + CoT “pre-activation” (Section 3.2.2.2; Figure 5).

  • Long context extension:
    • 150B tokens, sampling 20% 32K long-text sequences; they state short-context performance remains stable while long-context performance improves (Section 3.2.2.2).
    • Extension to 128K with YaRN; Figure 6 shows Needle-in-a-Haystack performance for Ling-mini-2.0 after SFT up to 128K (Section 3.2.2.2; Figure 6).
  • Reasoning pre-activation:
    • The next 600B tokens keep a high reasoning proportion and add high-quality CoT corpora; they train at a higher LR and merge checkpoints for stability (Section 3.2.2.2).
    • Claim: early CoT data raises the ceiling for later reasoning and stabilizes SFT/RL; this is later supported by Figure 12 (Section 3.2.2.2; Section 4.5; Figure 12).

WSM scheduler (Section 3.2.3; Figure 7; Eqs. (2)–(3)).

  • Instead of learning rate decay, WSM does (Section 3.2.1; Section 3.2.3):
    • linear warmup to peak LR for the first 2,000 steps,
    • constant LR until training ends,
    • “annealing” done by checkpoint merging.
  • Theory: merging checkpoints is equivalent to reweighting accumulated gradients (Eq. (2)), and a target LR decay schedule maps uniquely to merge weights (Eq. (3)) (Section 3.2.3).
  • Empirical comparison: Figure 7 shows WSM outperforming WSD and allowing flexible continuation without deciding the decay duration in advance; they also average the top N=32 checkpoints by validation for the final model (Figure 7; Section 3.2.3).

Worked micro-example (to make WSM concrete).

  • Suppose you have three checkpoints θ0, θ1, θ2 and you want the merged model to emphasize later checkpoints, like a decay/anneal effect.
  • WSM forms θ̂2 = c0 θ0 + c1 θ1 + c2 θ2 with nonnegative weights summing to 1 (Section 3.2.3).
  • If you choose an equivalent “gradient weight” sequence w1 ≥ w2 ≥ 0, Eq. (3) gives:
    • c2 = w2, c1 = w1 - w2, c0 = 1 - w1.
  • Example: if w1=0.9 and w2=0.6, then:
    • c2=0.6, c1=0.3, c0=0.1.
  • Interpretation: the merged model is mostly the later checkpoint(s), similar to how late-stage LR decay reduces the effective step size and “settles” into a basin, but done after the fact by averaging weights rather than changing LR mid-run (Section 3.2.3; Eqs. (2)–(3)).
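
The micro-example can be checked numerically. The sketch below maps gradient weights to merge coefficients using the relation stated above. It then confirms on a scalar toy "model" that merging checkpoints equals reweighting the accumulated updates. The scalar parameter and the update sizes g1, g2 are made-up illustration values, and `merge_coefficients` is a hypothetical helper name.

```python
def merge_coefficients(w):
    """Map gradient weights [w1, ..., wn] (non-increasing, in [0, 1]) to
    checkpoint weights [c0, ..., cn] via c_k = w_k - w_{k+1}, with w_0 = 1, w_{n+1} = 0."""
    padded = [1.0] + list(w) + [0.0]
    return [padded[k] - padded[k + 1] for k in range(len(padded) - 1)]

c = merge_coefficients([0.9, 0.6])  # c0 = 1 - w1, c1 = w1 - w2, c2 = w2

# Toy scalar model: theta_k = theta_{k-1} - g_k (hypothetical update sizes).
theta0, g1, g2 = 1.0, 0.5, 0.2
theta1 = theta0 - g1
theta2 = theta1 - g2

merged = c[0] * theta0 + c[1] * theta1 + c[2] * theta2
reweighted = theta0 - 0.9 * g1 - 0.6 * g2  # same updates, scaled by w1 and w2

print(c, merged, reweighted)
```

Both routes land on the same parameter value. That is the sense in which a decay-like weighting of late updates can be applied after training by choosing merge weights.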

3.4.8 Post-training: DFT → Evo-CoT RL → GAR alignment

Stage 1: Decoupled Fine-Tuning (DFT) (Section 4.1; Table 5).

  • DFT is supervised fine-tuning that uses two system prompts to separate behaviors (Table 5):
    • Instant Response: “detailed think off”
    • In-Depth Reasoning: “detailed think on” with output wrapped in <think>...</think><answer>...</answer>
  • The SFT dataset is balanced across (Section 4.1):
    • reasoning (math, STEM, logic, code, operations research),
    • general (creative writing, empathetic dialogue),
    • industrial (finance, medical/health, planning/supply chain/transport).

Stage 2: Evolutionary Chain-of-Thought RL (Evo-CoT) (Section 4.2).

  • Objective framing:
    • Start from the DFT-initialized policy π in instant-response mode, and evolve reasoning depth while staying efficient (Section 4.2).
    • The optimization step is described as maximizing expected reward while penalizing KL divergence to a reference policy (Section 4.2).
  • Reward components (Section 4.2):
    • R_correctness: +1 if the final answer matches the ground truth, else 0.
    • R_length: dynamic length control that penalizes overly long solutions more strongly on easy tasks and less strongly on hard tasks.
    • R_format: explicit reasoning markers like <think> get -0.5 reward (discouraging visible CoT in this “non-thinking” model line).
    • Optional task-specific rewards, e.g., a visual reward for front-end generation.

Dynamic length reward details (Section 4.2.1).

  • They define a preference p(l) based on the relative length of sampled responses for the same input, then scale it by a task difficulty coefficient α.
  • Correct answers get the length preference; incorrect answers suppress positive length reward and penalize long outputs (Section 4.2.1).
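
The two fully specified reward terms, correctness and format, can be sketched as a small scoring function. Three things here are assumptions rather than details from the report: the terms are summed, correctness is checked by exact string match, and the length term is passed in as a precomputed number because its formula is not given in this summary. The function name is illustrative.

```python
def evo_cot_reward(answer, ground_truth, response_text, length_reward=0.0):
    """Combine the reward terms for one sampled response.

    Assumptions: terms are added together; correctness is exact match;
    length_reward is computed elsewhere (its formula is not specified here).
    """
    r_correctness = 1.0 if answer == ground_truth else 0.0
    # Visible reasoning markers are penalized in the non-thinking model line.
    r_format = -0.5 if "<think>" in response_text else 0.0
    return r_correctness + r_format + length_reward

print(evo_cot_reward("42", "42", "The answer is 42."))                    # correct, no marker
print(evo_cot_reward("42", "42", "<think>6 * 7</think> The answer is 42."))  # correct, marker
print(evo_cot_reward("41", "42", "<think>guess</think> The answer is 41."))  # wrong, marker
```

Under these assumptions a correct answer with visible <think> markers still scores above an incorrect one, but below a correct answer given directly.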

Sentence-level policy optimization: LPO (Section 4.2.2; Figure 10).

  • LPO treats a sentence as the RL “action unit” (sentences are segmented by punctuation after detokenization).
  • It computes an importance ratio per sentence: r_{i,k}(θ) averages the token-level log-prob ratio across the tokens in sentence s_{i,k} (Section 4.2.2).
  • It applies PPO-style clipping at the sentence level; the clipping parameter is ε = 0.03 (Section 4.2.2).
  • Figure 10 reports smoother reward curves and better stability than GRPO / GSPO baselines, including improved generalization on AIME 2025 test curves (Figure 10; Section 4.2.2).

(The excerpt does not provide the exact numeric AIME uplift from Figure 10; it is described qualitatively.)
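
A sentence-level clipped objective of this kind can be sketched as follows. One detail is an assumption: the sentence ratio is taken as the exponential of the mean token log-prob difference, which is one reading of "averages token-level log-prob ratio". The log-probabilities and advantages are toy values, and the function names are illustrative.

```python
import math

EPS = 0.03  # sentence-level clipping parameter (Section 4.2.2)

def sentence_ratio(new_logps, old_logps):
    """Importance ratio for one sentence: exp(mean(new_logp - old_logp)).
    Assumption: the average is taken in log space (a geometric mean of token ratios)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_objective(ratio, advantage, eps=EPS):
    """PPO-style clipped surrogate, applied to a sentence instead of a token."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# Toy sentence of two tokens whose log-probs each rose by 0.2 under the new policy.
ratio = sentence_ratio([-1.0, -1.0], [-1.2, -1.2])
print(ratio)                          # about exp(0.2)
print(clipped_objective(ratio, 1.0))  # positive advantage: gain capped at 1 + eps
print(clipped_objective(ratio, -1.0)) # negative advantage: unclipped, more pessimistic term kept
```

With ε = 0.03 the trust region is narrow: once a sentence's ratio leaves [0.97, 1.03] in the direction favored by the advantage, the clipped term stops rewarding further movement.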

Stage 3: Human preference alignment via GAR + RubriX (Section 4.3; Figure 11).

  • Group Arena Reward (GAR):
    • For open-ended tasks, instead of absolute scores, it uses intra-group tournament-style pairwise comparisons among multiple responses from the same policy; the cumulative ranking becomes the reward (Section 4.3.1; Figure 11).
    • Intended effect: reduce reward noise / variance in subjective tasks.
  • RubriX:
    • A multi-dimensional rubric framework (clarity, coherence, creativity, emotional resonance, instruction adherence, etc.) to guide reward modeling across diverse domains (Section 4.3.2).
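
The intra-group tournament idea can be illustrated with a round-robin over a response group. This is a sketch, not the report's implementation. The `prefer` judge is a hypothetical stand-in for the pairwise preference model (here a toy rule that favors longer responses), and dividing the win count by n - 1 is an assumed normalization.

```python
from itertools import combinations

def group_arena_rewards(responses, prefer):
    """Round-robin pairwise comparison inside one response group.

    prefer(a, b) returns True if response a beats response b (hypothetical judge).
    Returns one reward per response: its share of pairwise wins, in [0, 1].
    """
    wins = [0] * len(responses)
    for i, j in combinations(range(len(responses)), 2):
        if prefer(responses[i], responses[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    return [w / (len(responses) - 1) for w in wins]

# Toy judge for illustration only: prefer the longer response.
group = ["a", "bbb", "cc"]
rewards = group_arena_rewards(group, prefer=lambda a, b: len(a) > len(b))
print(rewards)
```

Because each reward depends only on how a response ranks against its siblings from the same prompt, a judge's absolute scale drops out. That is the mechanism by which relative comparison can reduce reward noise on subjective tasks.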

Reward infrastructure (Section 4.4).

  • A unified reward computation system supports:
    • rule-based and model-based rewards,
    • multi-language code sandboxes,
    • visual reward evaluators and complex environment sandboxes.
  • Reported systems metrics (Section 4.4):
    • 40K+ concurrent reward requests, >99.9% sustained success rate,
    • bounded queues improve throughput by 39%,
    • asynchronous reward computation reduces training time by up to 30%.

Checkpoint selection: ApexEval (Section 4.5; Figure 12).

  • Motivation: standard pass@k or format-sensitive evaluation can misjudge RL potential.
  • ApexEval changes (Section 4.5):
    • uses “highest score of pass@k” to estimate the probability of at least one correct response in multiple attempts,
    • uses LLM judges (e.g., MathVerify, XVerify) for answer-based tasks and test execution for code.
  • Figure 12 links pretraining-with-CoT to higher DFT ceilings and better outcomes after ERL (presumably the Evo-CoT RL stage; the abbreviation is not expanded in the excerpt):
    • AIME24: 19.7 → 26.5 (+6.8 points) after ERL when trained with CoT vs. without (Figure 12 right).
    • AIME25: 14.2 → 20.2 (+6.0 points) similarly (Figure 12 right).
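
For reference, the "probability of at least one correct response in k attempts" is commonly estimated from n samples with c correct using the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). The sketch below implements that widely used estimator. It is not claimed to be ApexEval's exact computation, which this summary does not specify.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given n total samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(4, 1, 1))  # 0.25
print(pass_at_k(4, 2, 2))  # 1 - 1/6
print(pass_at_k(4, 1, 4))  # 1.0
```

The gap between pass@1 and pass@k is the quantity of interest for checkpoint selection. A checkpoint that rarely answers correctly on the first try but often does within k tries still has behavior that RL can reinforce.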

3.4.9 Infrastructure co-design (why sparsity becomes real efficiency)

Baseline training configuration for Ling-1T systems results (Section 5 intro).

  • Framework: “modified Megatron 0.11” with an FP8 strategy adapted for Ling 2.0 with MTP support (Section 5 intro).
  • Hardware: 2016 × Hopper GPUs (Section 5 intro).
  • Parallelism settings listed (Section 5 intro):
    • TP1, EP8, PP21, VPP2,
    • sequence length 4K,
    • full recomputation.
  • Note: the excerpt does not provide total training time, total FLOPs for Ling-1T pretraining, or PF-days.

FP8 training (Section 5.1; Figure 15).

  • Quantization scheme (Section 5.1):
    • activations/gradients in blocks of [1,128],
    • weights in blocks of [128,128].
  • They quantize BF16 tensors to FP8 E4M3 with FP32 scaling factors; outputs return to BF16 after GEMM (Section 5.1).
  • Stability result on Ling-1T:
    • over 900B tokens, relative loss difference ≤ 0.25% (avg ~0.1%) vs. the BF16 baseline (Figure 15; Section 5.1).
  • Efficiency result:
    • FP8 training delivers roughly +15% MFU gain for Ling-1T (Section 5.1).

Precision tracking / safeguard (Section 5.1.1; Figure 16).

  • They define two metrics and monitor them across layers (Section 5.1.1; Figure 16):
    • FP8 quantization underflow (fraction of elements that become zero),
    • FP8 quantization distortion (cosine similarity between the original and reconstructed matrix).
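
Both monitoring metrics are simple to compute once an original tensor and its dequantized reconstruction are available. The sketch below works on flat Python lists and makes one assumption: underflow is counted only over elements that were nonzero before quantization. The reconstructed values are hand-written toy data, not the output of a real FP8 kernel.

```python
import math

def underflow_fraction(original, reconstructed):
    """Fraction of originally nonzero elements that became exactly zero.
    Assumption: elements that were already zero are excluded from the count."""
    pairs = [(o, r) for o, r in zip(original, reconstructed) if o != 0.0]
    if not pairs:
        return 0.0
    return sum(1 for _, r in pairs if r == 0.0) / len(pairs)

def cosine_similarity(a, b):
    """Cosine similarity between two flattened tensors (1.0 means no distortion)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy data: the tiny third element is flushed to zero by quantization.
original = [1.0, 0.5, 1e-6, -2.0]
reconstructed = [1.0, 0.5, 0.0, -2.0]

underflow = underflow_fraction(original, reconstructed)
distortion = cosine_similarity(original, reconstructed)
print(underflow, distortion)
```

The toy case shows why the two metrics are tracked together. A quarter of the elements underflow, yet cosine similarity stays near 1 because the lost values carry little magnitude.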

Heterogeneous fine-grained pipeline parallelism (Section 5.2; Figures 17–18).

  • Problem: pipeline scheduling becomes heterogeneous due to (Section 5.2):
    • First-K-Dense layers,
    • the MTP block,
    • differing compute/memory characteristics across blocks.
  • Solutions (Figure 17; Section 5.2):
    • configurable layer allocation per virtual pipeline stage (allowing empty stages),
    • scheduling MTP standalone (not bundled with the loss layer),
    • partial recomputation for MTP (recompute the transformer part only),
    • splitting MTP into transformer layers + loss computation for scheduling.
  • Reported throughput gain:
    • An MTP layer costs ~1.7× a standard MoE layer; refined partitioning yields ~40% end-to-end improvement; increasing VPP2 → VPP4 adds ~5% when routing is balanced (Section 5.2; Figure 18).

Distributed framework optimizations (Section 5.3).

  • Intra-node DeepEP: +2% end-to-end speedup from reduced communication redundancy, plus 13% from operator fusion (Section 5.3.1).
  • Fused kernels: RoPE fusion, router fusion, GroupGemm upgrades, etc., to reduce memory bottlenecks and CPU overhead (Section 5.3.2).
  • Fast expert full recomputation:
    • They describe a modified backward flow that can reduce recomputation latency (Figure 19; Section 5.3.3).
    • Gains: up to 10% for smaller models, ~7% for larger; less benefit at scale due to pipeline bottlenecks (Section 5.3.3).
  • Checkpointing overhead reduction:
    • Checkpoint save time is reduced from 269s to 30s on Ling-1T by caching metadata, which reduces the training-time share from 2.43% to 0.82% (Section 5.3.5).

“Bitter lesson” on compute–communication overlap (Section 5.6).

  • They tried overlap strategies (DualPipe / interleaved 1F1B with A2A overlap) but saw limited end-to-end gains, attributing the issues to (Section 5.6):
    • the need for large EP configurations (sensitivity to the slowest rank),
    • imbalanced routing in shallow layers, causing OOM sensitivity and partitioning penalties.

4. Key Insights and Innovations

  • (1) High-sparsity, fine-grained MoE standardized across scales (16B → 1T).
  • Novelty: the same 256 experts, 8 active + 1 shared pattern is used across all scales, with explicit activation ratio ~3.5% (Section 2.1; Table 1).
  • Significance: enables huge total capacity (up to 1T) while keeping active parameters at 51B for Ling-1T (Table 1), which the report frames as an efficiency path versus dense scaling.

  • (2) A unified empirical scaling-law framework + “Wind Tunnel” experimentation loop.

  • Novelty: not just fitting loss curves, but a standardized multi-scale experimental design to choose architecture and hyperparameters and to extrapolate to 100× compute (Section 2.3.3; Figure 3).
  • Significance: the report claims loss prediction within 0.01 and reduced validation cost (“under 1% of full run” in the Introduction; plus the 35% vs traditional ablation claim in Section 2.3.3).

  • (3) WSM schedule: replacing LR decay with checkpoint merging.

  • Novelty: formal equivalence between checkpoint merging and gradient reweighting (Eqs. (2)–(3)) and practical training advantages (Section 3.2.3).
  • Significance: Figure 7 shows WSM outperforming a WSD baseline and enabling flexible continuation without pre-committing to a decay window; the report claims +1–2 points average leaderboard gains in that comparison (Figure 7; Section 3.2.3).

  • (4) Reasoning “pre-activation” in mid-training via CoT + long-context extension.

  • Novelty: introducing high-quality CoT data before post-training, framed as raising the ceiling and stabilizing later RL/SFT (Section 3.2.2.2; Figure 12).
  • Significance: Figure 12 shows better AIME24/AIME25 outcomes after ERL when the model was pretrained with CoT-style data.

  • (5) Sentence-level RL updates (LPO) and intra-group preference ranking (GAR).

  • Novelty: policy optimization at the sentence granularity with PPO-style clipping (ε=0.03) rather than token-level or whole-sequence updates (Section 4.2.2).
  • Novelty: GAR uses round-robin pairwise comparisons inside a response group to reduce reward noise for subjective tasks (Section 4.3.1; Figure 11).
  • Significance: Figure 10 claims improved RL stability/generalization; Tables 7–8 show strong alignment/arena metrics post-training, consistent with the goal (Table 7; Table 8).

5. Experimental Analysis

5.1 Evaluation methodology (what was measured)

  • The report evaluates both base (pre-trained) models and post-trained instruct models:
  • Base model evaluation: 33 benchmarks across math, coding, general reasoning, knowledge, multilingual (Section 3.3.2). Metrics include EM/Acc, Pass@1, and some Pass@k usage for stability (Section 3.3.1).
  • Post-trained evaluation: 36 benchmarks across coding, math, reasoning, knowledge, alignment, agent/instruction following (Section 4.6). Mostly 0-shot with decontamination; repeated sampling for some benchmarks (e.g., GPQA-Diamond repeated 16 times; ARC-AGI/ZebraLogic/HLE repeated 4 times) (Section 4.6).
  • They also explicitly compare model variants with vs. without CoT data integrated during mid-training (Tables 2–4; Section 3.3.2) and show downstream RL effects (Figure 12).

5.2 Main quantitative results (selected, with numbers)

A) Base models: effect of CoT pre-activation (Tables 2–4)

  • Ling-mini-2.0-base shows very large gains on some reasoning-heavy benchmarks when trained with CoT data in mid-training (Table 2):
  • AIME25 (Pass@1): 2.08 (w/o CoT) → 43.75 (w/ CoT).
    > Table 2, AIME25 row.
  • MATH (Acc.): 61.96 → 82.52.
    > Table 2, MATH row.
  • LiveCodeBench (Pass@1): 13.71 → 34.47.
    > Table 2, LiveCodeBench row.
  • Not all categories improve; e.g., MMLU (EM) remains ~flat (74.21 → 74.26) while C-Eval decreases (83.57 → 80.41) (Table 2).

  • Ling-flash-2.0-base shows mixed changes with CoT in Table 3:

  • MATH: 66.26 → 79.54 (improves).
  • CruxEval: 69.50 → 77.38 (improves).
  • Some math metrics decrease (e.g., MathBench: 80.18 → 77.69; TheoremQA: 46.25 → 43.50) (Table 3).

  • Ling-1T-base similarly benefits on some reasoning tasks (Table 4):

  • MATH (Acc.): 67.42 → 82.78.
  • MinervaMath (Acc.): 50.00 → 62.87.
  • But some scores decrease (e.g., OmniMath: 35.70 → 33.60) (Table 4).

Interpretation grounded in the report:

  • The paper argues CoT in mid-training “pre-activates” reasoning and that this advantage persists into post-training (Section 3.3.2; Figure 12). The tables support that the effect is strong on certain benchmarks but not uniformly positive across all tasks.
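
The with/without-CoT comparisons quoted in this subsection reduce to simple score differences. The short script below tabulates them from the values cited for Tables 2–4 and lists the benchmarks that regress. The dictionary layout and function names are illustrative choices, and the script only covers the rows quoted here, not the full tables.

```python
# (model, benchmark) -> (score without CoT, score with CoT), as quoted above.
SCORES = {
    ("Ling-mini-2.0-base", "AIME25"): (2.08, 43.75),
    ("Ling-mini-2.0-base", "MATH"): (61.96, 82.52),
    ("Ling-mini-2.0-base", "LiveCodeBench"): (13.71, 34.47),
    ("Ling-mini-2.0-base", "MMLU"): (74.21, 74.26),
    ("Ling-mini-2.0-base", "C-Eval"): (83.57, 80.41),
    ("Ling-flash-2.0-base", "MATH"): (66.26, 79.54),
    ("Ling-flash-2.0-base", "CruxEval"): (69.50, 77.38),
    ("Ling-flash-2.0-base", "MathBench"): (80.18, 77.69),
    ("Ling-flash-2.0-base", "TheoremQA"): (46.25, 43.50),
    ("Ling-1T-base", "MATH"): (67.42, 82.78),
    ("Ling-1T-base", "MinervaMath"): (50.00, 62.87),
    ("Ling-1T-base", "OmniMath"): (35.70, 33.60),
}


def deltas(scores):
    """Absolute change in points when CoT data is added in mid-training."""
    return {key: round(after - before, 2) for key, (before, after) in scores.items()}


def regressions(scores):
    """Benchmarks whose score drops with CoT data."""
    return sorted(key for key, d in deltas(scores).items() if d < 0)


if __name__ == "__main__":
    for (model, bench), d in deltas(SCORES).items():
        print(f"{model:22s} {bench:14s} {d:+.2f}")
    print("regressions:", regressions(SCORES))
```

Listing gains and regressions side by side makes the "strong but not uniform" reading easy to check against the quoted rows.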

B) Post-trained instruct models: broad comparisons (Tables 6–8)

  • Ling-mini-2.0 (instruct) vs several models (Table 6) shows strong math/coding for its class:
  • AIME25 (Pass@1): 46.72 (Ling-mini) vs 24.01 (Qwen3-8B) and 38.59 (gpt-oss-20B) (Table 6).
  • CNMO 2024 (Pass@1): 72.66 vs 34.38 (Qwen3-8B) (Table 6).
  • LiveCodeBench (Pass@1): 41.69 vs 26.10 (Qwen3-8B) (Table 6).
  • CodeForces (Rating): 1410 (Ling-mini) vs 624 (Qwen3-8B) (Table 6).

  • Ling-flash-2.0 (instruct) is very strong on code/math in Table 7:

  • AIME25 (Pass@1): 55.83 vs 22.5 (Qwen3-32B) and 49.64 (GPT-4.1 mini low think) (Table 7).
  • LiveCodeBench (Pass@1): 51.38 vs 31.50 (Qwen3-32B) and 45.54 (GPT-4.1 mini) (Table 7).
  • CodeForces (Rating): 1600 vs 696 (Qwen3-32B) and 1309 (GPT-4.1 mini) (Table 7).
  • Alignment/arena:

    • Arena Hard v2.0 (Win-Rate): 61.33 (Ling-flash) vs 42.77 (GPT-4.1 mini) (Table 7).
  • Ling-1T (instruct) shows top or near-top results among the models compared in Table 8 on many reasoning/coding metrics:

  • Math:
    • AIME25 (Pass@1): 70.42 (Ling-1T) vs 70.10 (Gemini 2.5 Pro lowthink), 59.43 (GPT-5-main), 55.21 (DeepSeek-V3.1-Terminus), 50.16 (Kimi-K2) (Table 8).
    • OptMATH (Pass@1): 57.68 vs 42.77 (Gemini 2.5 Pro) and 39.16 (GPT-5-main) (Table 8).
  • Coding:
    • LiveCodeBench (Pass@1): 61.68 vs 48.57 (GPT-5-main) and 45.43 (Gemini 2.5 Pro) (Table 8).
    • CodeForces (Rating): 1901 vs 1675 (Gemini 2.5 Pro) and 1120 (GPT-5-main) (Table 8).
  • Reasoning:
    • ARC-AGI-1 (Pass@1): 43.81 vs 22.19 (Kimi-K2) and 18.94 (Gemini 2.5 Pro) (Table 8).
    • ZebraLogic (Pass@1): 90.80 vs 85.50 (Kimi-K2) and 70.20 (Gemini 2.5 Pro) (Table 8).
  • Alignment:
    • Arena Hard v2.0 (Win-Rate): 75.83 vs 74.46 (Gemini 2.5 Pro) and 65.06 (GPT-5-main) (Table 8).

C) Efficiency framing: accuracy vs “average tokens” (Figure 13)

  • Figure 13 plots AIME-25 accuracy vs “Average Tokens” and places Ling-1T at 70.42% with ~4200 average tokens (approximate location as drawn), contrasted with other models at higher token counts for similar accuracy (Figure 13).
  • Caution: the figure conveys a trade-off narrative, but the excerpt does not specify the exact evaluation protocol for “average tokens” beyond the axis label, so interpretation should be limited to what is visualized.

5.3 Do the experiments support the claims?

  • Supportive evidence:
  • The cross-scale results show consistent improvements from mini → flash → 1T on many benchmarks (Tables 6–8).
  • The CoT pre-activation hypothesis is supported by:
    • base-model with/without CoT comparisons (Tables 2–4),
    • downstream RL improvement differences (Figure 12).
  • Infrastructure claims are backed by concrete metrics:

    • FP8 loss gap vs BF16 (≤0.25% after 900B tokens) (Figure 15),
    • pipeline improvements (~40% end-to-end in a described refinement) (Section 5.2),
    • checkpointing overhead reduction (Section 5.3.5).
  • Where evidence is weaker or incomplete in the provided excerpt:

  • Some headline claims in the abstract/intro (e.g., “5–8% average gain on reasoning benchmarks” from data composition) are not accompanied by a table/figure in the excerpt showing that exact aggregate computation (Introduction “Reasoning-oriented Data Composition”).
  • The report includes additional tables (Tables 6–8) and figures, but some evaluation specifics (exact prompts, decontamination method details, variance intervals) are described only at a high level rather than fully specified in the excerpt (Section 3.3.2; Section 4.6).

5.4 Ablations, robustness checks, and failure cases

  • Ablation-like comparisons present:
  • With vs. without CoT data during mid-training (Tables 2–4; Figure 12).
  • WSM vs WSD scheduler (Figure 7).
  • FP8 vs BF16 loss difference tracking (Figure 15) and FP8 underflow/distortion monitoring (Figure 16).
  • Systems optimization stacking is summarized via MFU changes in Figure 14.

  • Failure cases / caveats explicitly noted:

  • Pipeline gains may not hold when routing imbalance causes stage blocking; increasing VPP can be sensitive in such cases (Section 5.2).
  • Compute–communication overlap methods yielded limited end-to-end gains under their constraints, due to EP scaling sensitivity and shallow-layer routing imbalance (Section 5.6).
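
The FP8 vs BF16 loss tracking listed among the ablation-like comparisons can be expressed as a simple tolerance check. The sketch below assumes the 0.25% figure is a relative gap measured against the BF16 loss at matching training steps. That reading, the function names, and the loss values in the usage note are illustrative assumptions, not details from the report.

```python
def relative_loss_gap(fp8_losses, bf16_losses):
    """Per-step relative gap |fp8 - bf16| / bf16 between two aligned loss curves."""
    if len(fp8_losses) != len(bf16_losses):
        raise ValueError("loss curves must be aligned step by step")
    return [abs(f - b) / b for f, b in zip(fp8_losses, bf16_losses)]


def within_tolerance(fp8_losses, bf16_losses, tol=0.0025):
    """True when every step stays within the tolerance (0.25% by default)."""
    return all(gap <= tol for gap in relative_loss_gap(fp8_losses, bf16_losses))
```

For instance, `within_tolerance([2.004], [2.0])` passes under this definition, while a gap such as 2.02 against 2.0 (1%) would not.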

6. Limitations and Trade-offs

  • Long-context efficiency constraint from attention choice.
  • The conclusion states GQA constrains efficiency for long-context scenarios, and they are exploring linear/sparse attention alternatives (Section 6).

  • Reasoning depth/length not fully solved.

  • Even with CoT pre-activation and Evo-CoT RL, the report states effective reasoning length and depth “still have room for enhancement” (Section 6).
  • Additionally, the “non-thinking” line explicitly discourages visible <think> (Evo-CoT includes R_format = -0.5 for <think> markers) (Section 4.2), which may trade off transparency/debuggability for efficiency and policy control.

  • Agentic behavior remains under development.

  • The conclusion notes complex instruction following and agent behaviors are still being developed (Section 6), despite including agent-related benchmarks like BFCL-V3 in evaluations (Tables 6–8).

  • MoE systems sensitivity to routing imbalance.

  • Several infrastructure sections highlight that routing imbalance can create bottlenecks:

    • VPP scaling can become sensitive (Section 5.2).
    • EP overlap strategies can be limited by the slowest rank and shallow-layer imbalance (Section 5.6).
  • Incomplete reporting for some resources and training budgets (in the excerpt).

  • The excerpt provides token counts and many hyperparameters (Section 3.2.1; Figure 5), and hardware/parallelism configs (Section 5 intro), but does not provide a single consolidated compute budget (e.g., PF-days) for Ling-1T pretraining.

7. Implications and Future Directions

  • How this changes the landscape (based on the report’s evidence).
  • The work demonstrates a coherent recipe for scaling an open 1T-parameter model where only 51B parameters are active per token (Table 1), backed by a strong engineering stack that makes FP8 and heterogeneous pipeline scheduling viable at scale (Section 5; Figures 15, 17–18).
  • The report’s evaluation tables suggest that this combination can yield strong performance on difficult reasoning/coding benchmarks at the trillion scale (Table 8), reinforcing the claim that sparse activation can be competitive when aligned with reasoning objectives (Introduction; Section 6).

  • Follow-up research directions explicitly suggested.

  • Replace or augment GQA with more efficient long-context attention mechanisms (linear/sparse attention) (Section 6).
  • Improve effective reasoning length/depth (Section 6).
  • Extend toward stronger agentic and interactive behaviors (Section 6).

  • Practical applications / downstream use cases implied by evaluation scope.

  • Strong code generation and competition coding: improvements on LiveCodeBench and CodeForces (Tables 6–8).
  • Advanced mathematical reasoning: AIME24/25, CNMO 2024, optimization modeling tasks (OptiBench, OptMATH) (Tables 6–8).
  • Instruction-following and alignment-sensitive deployment: Arena Hard v2.0, Writing Bench, Multi-Challenge (Tables 7–8).

  • Repro/Integration Guidance (when to prefer this approach, based on the report).

  • Prefer a high-sparsity MoE like Ling 2.0 when:
    • you want trillion-scale capacity but need to constrain active compute per token (Table 1; Section 2.3.2),
    • you can invest in routing/load-balancing and distributed training infrastructure (Section 5), since the report explicitly warns that without engineering, sparsity yields no real advantage (Section 5 intro).
  • Prefer WSM-style checkpoint merging when:
    • you want the benefits of annealing/decay behavior but need flexibility to continue training without pre-committing to a decay window (Section 3.2.3; Figure 7).
  • Prefer CoT pre-activation in mid-training when:
    • your goal is downstream reasoning RL performance; Figure 12 shows higher AIME ceilings and faster ERL improvement when CoT-style data is included earlier (Section 4.5; Figure 12).
  • Prefer LPO-like sentence-level RL updates when:
    • you observe instability/plateaus in RL for reasoning tasks; Figure 10 claims smoother reward curves and better AIME-2025 generalization than token/sequence baselines (Section 4.2.2; Figure 10).
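
To make the checkpoint-merging recommendation concrete, the sketch below averages the parameters of several checkpoints saved during constant-learning-rate training. It assumes uniform merge weights by default and represents each checkpoint as a plain dictionary of float lists. The weighting WSM actually uses may differ, and `merge_checkpoints` is a hypothetical helper, not an API from the report.

```python
def merge_checkpoints(checkpoints, weights=None):
    """Weighted average of parameter dictionaries (name -> list of floats).

    Assumption: uniform weights unless specified; weights must sum to 1.
    """
    n = len(checkpoints)
    if n == 0:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must match checkpoints and sum to 1")

    merged = {}
    for name in checkpoints[0]:
        # Walk the same parameter position across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged
```

Because merging happens after the fact on saved checkpoints, training can continue at a constant learning rate without fixing a decay window in advance, which is the flexibility argument made above.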
