QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

ArXiv: 2512.12967

🎯 Pitch

QwenLong-L1.5 presents a complete post-training recipe that turns a 30B long-context base model into a robust long-range reasoner by unifying a scalable long-context data synthesis pipeline, stabilized reinforcement-learning techniques (including AEPO and task-aware advantage estimation), and a memory-agent architecture that extends reasoning to inputs up to millions of tokens. This matters because it achieves large empirical gains (≈+9.9 points avg., +31.7 on hard needle-in-a-haystack benchmarks) while enabling practical ultra-long-document and agentic workflows that standard pre-/mid-training or retrieval-only approaches cannot handle.


1. Executive Summary

QwenLong-L1.5 is a post-training “recipe” for turning an existing long-context-capable LLM (Qwen3-30B-A3B-Thinking-2507) into a model that is substantially better at reasoning over long documents, not just retrieving a single fact. It combines (i) a scalable long-context reasoning data synthesis pipeline, (ii) stabilization techniques for long-context reinforcement learning (RL), and (iii) a memory-agent framework that extends reasoning to contexts far beyond the model’s physical window (reported up to 4M tokens). Across six long-context benchmarks, it improves the baseline by +9.90 points on average (Table 7), with very large gains on a hard “needle-in-a-haystack” setting (MRCR, +31.72).


2. Context and Motivation

  • Problem / gap.
  • Long-context reasoning requires a model to (a) locate relevant evidence scattered across a very long input and (b) compose multi-hop reasoning chains grounded in that evidence.
  • The paper identifies a gap specifically in post-training: while prior work often focuses on pre-/mid-training context extension or new attention architectures, there is “no mature, end-to-end system” that provides: 1) scalable long-context reasoning data, 2) RL methods that remain stable at long sequence lengths and across heterogeneous tasks, and 3) an agent approach for tasks longer than the context window.

  • Why it matters.

  • The paper frames long-context reasoning as central both for:
    • Single-pass reasoning over long documents, and
    • Agentic systems operating over long histories / large information streams (multi-turn settings).
  • In real deployments, many tasks exceed even very large windows; thus memory management becomes necessary.

  • Prior approaches and shortcomings (as positioned in the paper).

  • “Needle-in-a-haystack” retrieval and single-hop RAG-style tasks are described as insufficient: they do not force global, multi-hop grounding and can be solved by shallow retrieval patterns.
  • Standard RL methods (e.g., PPO with a value network) are described as computationally problematic for long contexts due to attention costs (Section 2.2), and long-context multi-task RL is described as unstable due to distributional imbalance and reward/advantage estimation issues (Section 4.2).

  • How this work positions itself.

  • It presents a full post-training system (Figure 6) that unifies:
    • long-context data synthesis (Section 3, Figure 4),
    • long-context RL stabilization + optimization (Section 4, including AEPO), and
    • memory-agent training for ultra-long tasks (Section 2.3 and Section 5.4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a post-training pipeline that takes a base reasoning model and improves its ability to solve tasks where the evidence is spread across extremely long documents.
  • It solves this by (i) generating training tasks that require multi-hop grounded reasoning, (ii) using a stabilized RL method that can train on long sequences, and (iii) adding a memory-agent mode for inputs longer than the model’s context window.

3.2 Big-picture architecture (diagram in words)

  • (A) Corpus + synthesis: collect long documents → synthesize long-context QA tasks (multi-hop, numerical, general) → verify and filter into an RL dataset (Figure 4).
  • (B) Full-context RL training: run progressive multi-stage RL with increasing max input/output lengths → improve single-pass long-context reasoning (Figure 6, Section 4.1).
  • (C) Memory-agent training: train a chunk-wise memory updating policy for ultra-long inputs → merge a memory-specialist with the full-context model → final full-context RL stage to retain/restore single-pass strength (Figure 6, Table 10).

3.3 Roadmap for the deep dive

  • I will explain, in order:
    1) The RL formulation and why GRPO is used for long contexts (Section 2.1–2.2).
    2) The memory agent mechanism for >window tasks (Section 2.3, Figure 2).
    3) The data synthesis pipeline and its validation filters (Section 3, Figure 4–5).
    4) The post-training schedule (progressive length extension + expert merge) (Section 4.1, Figure 6).
    5) The stability methods: task-balanced sampling, task-specific advantages, negative-gradient handling, and AEPO (Section 4.2–4.4).
    6) The evaluation protocol and what results support which claims (Section 5, Tables 7–10).

3.4 Detailed, sentence-based technical breakdown

This is an empirical post-training + algorithmic RL systems paper whose core idea is: generate training tasks that truly require long-range multi-hop grounding, then use stabilized on-policy RL (plus a memory-agent mode) to reliably train at long sequence lengths and beyond-window contexts.

3.4.1 RL objective and why GRPO is chosen

  • The paper formulates long-context reasoning as RL where the model is a policy πθ that generates an answer y given concatenated documents c and question q (Section 2.1).
  • The nominal objective is a KL-regularized expected reward (Eq. (1)): maximize E[rϕ(c,q,y)] − β D_KL(πθ || πref).
  • For long-context inputs, the paper argues PPO-style methods that rely on a learned value function are too expensive, and instead uses GRPO (Group Relative Policy Optimization) (Section 2.2).
  • GRPO samples a group of G candidate responses per prompt from an “old” policy and computes a group-relative advantage by z-score normalizing rewards within the group (Eq. (3)).
  • This avoids training a separate value network.
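
As a minimal sketch of the group-relative advantage (not the paper's code), the z-score normalization can be written in plain Python. The function name, the `eps` stabilizer, and the use of the population standard deviation are assumptions made for illustration; the summary above only states that rewards are z-score normalized within each group of G responses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Z-score each reward against the other responses to the same prompt.

    `eps` and the population std are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, G = 8 sampled responses with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.3f}")
```

Because the baseline is the group mean, no value network is needed. A group in which every response receives the same reward yields all-zero advantages and therefore contributes no learning signal for that prompt.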

  • On-policy simplification used in this work.

  • The paper sets β = 0 (no KL regularization) and uses a strictly on-policy setting with one gradient update per batch, so πθ = πθold and the importance ratio ρ = 1 (Section 2.2).
  • With that, the clipped PPO-like term becomes inactive and the objective simplifies (Eq. (4)).

  • Token-level normalization (from DAPO).

  • They incorporate a technique from DAPO to normalize each token’s contribution by the total number of tokens in the group, aiming to avoid long responses diluting learning signals and to penalize undesirable long outputs effectively (Section 2.2).
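
To make the effect of token-level normalization concrete, here is a small sketch that compares it with per-sequence averaging under the on-policy setting (ρ = 1). The function names and the toy log-probabilities are hypothetical; the exact loss in the paper (Eq. (4)) may differ in detail.

```python
def token_level_pg_loss(token_logprobs, advantages):
    """Policy-gradient surrogate normalized by the total token count of the group.

    token_logprobs[i] holds the per-token log-probs of response i,
    advantages[i] is that response's sequence-level advantage.
    """
    total_tokens = sum(len(lp) for lp in token_logprobs)
    weighted = sum(adv * sum(lp) for lp, adv in zip(token_logprobs, advantages))
    return -weighted / total_tokens


def sequence_level_pg_loss(token_logprobs, advantages):
    """Contrast case: average within each response first, then across responses."""
    per_seq = [adv * sum(lp) / len(lp) for lp, adv in zip(token_logprobs, advantages)]
    return -sum(per_seq) / len(per_seq)


# A short correct response (2 tokens) and a long incorrect one (8 tokens).
token_logprobs = [[-1.0] * 2, [-1.0] * 8]
advantages = [1.0, -1.0]

token_loss = token_level_pg_loss(token_logprobs, advantages)
seq_loss = sequence_level_pg_loss(token_logprobs, advantages)
print(token_loss, seq_loss)
```

In this toy case the per-sequence average lets the two responses cancel exactly, while the token-level version gives the long negative response 8 of the 10 token weights. That is the sense in which long undesirable outputs are penalized more strongly.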

3.4.2 Memory Agent: how ultra-long contexts are handled

  • The memory agent reframes reading as a sequential decision process rather than single-pass full-attention over the entire context (Section 2.3, Figure 2).
  • Step-by-step dataflow (what happens first, second, third).
  1) The user query is decomposed into:
    • qcore: the core question content, and
    • qinst: formatting / output constraints (e.g., schema requirements).
    The motivation is to avoid format constraints interfering with flexible iterative reasoning during memory updates (Section 2.3).
  2) The long document is split into chunks {x1, …, xK}.
  3) At each step t, the policy consumes (mt−1, pt−1, xt, qcore) and outputs:
    • an updated memory mt, and
    • a navigational plan pt for the next chunk (Eq. (5)).
  4) After chunk xK, the policy generates the final answer y conditioned on (mK, qcore, qinst) (Eq. (6)).
  • RL training for the memory agent.

  • For each problem, the policy samples G trajectories {τi}; each trajectory contains the sequence of memory/plan updates and the final answer.
  • A trajectory-level reward R(τi) is computed from final answer correctness, then turned into an advantage (via the same z-score idea) and broadcast to all actions in the trajectory (Section 2.3).

  • Capacity parameters for memory mode (reported).

  • The paper reports (Figure 6) a memory-RL configuration with:
    • Max input 128K,
    • Chunk size 32K,
    • Memory size 15K.
  • This is the mechanism that enables tasks reported at 1M–4M tokens via iterative chunk processing (Section 5.4, Table 9).
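
The chunk-wise dataflow above can be sketched as a simple loop. This is an illustrative skeleton under stated assumptions: `update_fn` and `answer_fn` are hypothetical stand-ins for calls to the policy model, and the toy versions below only do string matching so that the example runs without a model.

```python
from typing import Callable, List, Tuple


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split the document into consecutive chunks x_1 .. x_K."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def run_memory_agent(
    tokens: List[str],
    q_core: str,
    q_inst: str,
    update_fn: Callable[[str, str, List[str], str], Tuple[str, str]],
    answer_fn: Callable[[str, str, str], str],
    chunk_size: int,
) -> str:
    memory, plan = "", ""  # m_0 and p_0 start empty (an assumption)
    for chunk in split_into_chunks(tokens, chunk_size):
        # Step t: (m_{t-1}, p_{t-1}, x_t, q_core) -> (m_t, p_t)
        memory, plan = update_fn(memory, plan, chunk, q_core)
    # Final step: the answer is conditioned on (m_K, q_core, q_inst)
    return answer_fn(memory, q_core, q_inst)


# Toy stand-ins for the policy model (hypothetical, string matching only).
def toy_update(memory, plan, chunk, q_core):
    hits = [t for t in chunk if t.startswith(q_core + "=")]
    if hits:
        memory = (memory + " " + " ".join(hits)).strip()
    plan = "verify and finish" if memory else "keep scanning"
    return memory, plan


def toy_answer(memory, q_core, q_inst):
    value = memory.split("=", 1)[1] if "=" in memory else "unknown"
    return q_inst.format(answer=value)


tokens = ["filler"] * 10 + ["code=7421"] + ["filler"] * 5
answer = run_memory_agent(tokens, "code", "ANSWER: {answer}", toy_update, toy_answer, chunk_size=4)
print(answer)
```

Note that the formatting instruction is only used in the final call, which mirrors the stated motivation for separating qcore from qinst.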

3.4.3 Long-context data synthesis pipeline

The paper’s central data thesis is: long-context reasoning needs training tasks where evidence is distributed globally and requires multi-hop reasoning, not just retrieval.

  • Scale and filtering.
  • The pipeline starts from 42.7k synthesized examples and filters down to 14.1k final RL training samples via “multi-stage difficulty filtering, deduplication, and test set decontamination” (Section 3).
  • Compared to QwenLong-L1, this is much larger (1.6k → 14.1k) and longer on average (11,441 → 34,231 tokens), with max input length increasing to 119,932 tokens (Table 1).
  • Underlying raw document corpus after filtering: 82,175 documents totaling ~9.2B tokens (Section 3.1).

  • End-to-end pipeline stages (Figure 4).
  1) Corpus collection: obtain long documents from multiple categories (code repos, academic literature, professional docs, general knowledge/literature, dialogue data), then apply rule-based + LLM-as-a-judge filtering for quality (Section 3.1).
  2) QA synthesis: create question-answer pairs designed to be hard in a long context, including explicitly dispersing evidence and inserting irrelevant documents to force sparse evidence retrieval + reasoning (Section 3; Figure 4 text description).
  3) Data verification:
    • Knowledge grounding check: remove the source document and test whether a model can still answer; if yes, the example is filtered out to reduce “internal knowledge” solvability and force context grounding.
    • Contextual robustness check: expand context with irrelevant documents; if answer accuracy (pass@k) drops to zero, discard the sample to avoid brittle questions (Section 3, pipeline description).
  • Three synthesis methods mapped to problem types (Section 3.2).
  1) In-depth multi-hop reasoning QA via knowledge graphs.
    • Extract triples from documents → build an initial KG → expand to cross-document KG via aggregation → refine with entity/relation clustering.
    • Sample multi-hop reasoning paths (random walk / BFS) and deliberately distribute path nodes across multiple documents to enforce cross-document reasoning.
    • Increase difficulty via “information perturbation” / entity obfuscation (e.g., temporal or institutional descriptions).
    • Generate questions spanning multi-fact reasoning, temporal reasoning, causal analysis, hypothetical scenarios.
    • Apply “Blind Knowledge Screening” and “Scarce Knowledge Validation” (named but not fully specified in the excerpt) to control quality/complexity.
  2) Corpus-level numerical reasoning QA via a structural tabular data engine.
    • Parse unstructured documents to extract statistical tables.
    • Extract schema → aggregate into a cross-document “corpus table.”
    • Generate NL queries from templates → translate to SQL → execute to get ground-truth numeric answers.
    • Concatenate relevant source docs to form the long context (Section 3.2).
  3) General long-context reasoning via multi-agent self-evolve (MASE).
    • A proposer generates questions/answers; a solver answers using documents; a verifier checks semantic equivalence between solver output and proposer reference (Figure 5, Section 3.2).
    • A history buffer of validated QA pairs is appended as exemplars to push future proposals toward harder, non-redundant questions (Section 3.2).
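
The corpus-level numerical method (template query → SQL → executed answer) lends itself to a small illustration. The sketch below uses Python's built-in `sqlite3`; the table name, columns, rows, and question template are all invented for the example and do not come from the paper.

```python
import sqlite3

# Hypothetical "corpus table" aggregated from tables found in several documents.
rows = [
    ("report_a.txt", "North", 2021, 120.0),
    ("report_a.txt", "North", 2022, 150.0),
    ("report_b.txt", "South", 2022, 90.5),
    ("report_c.txt", "East", 2022, 60.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE corpus_table (source_doc TEXT, region TEXT, year INTEGER, revenue REAL)"
)
conn.executemany("INSERT INTO corpus_table VALUES (?, ?, ?, ?)", rows)

# A templated natural-language question and its SQL translation.
year = 2022
question = f"What was the total revenue across all regions in {year}?"
sql = "SELECT SUM(revenue) FROM corpus_table WHERE year = ?"

# Executing the query gives a verifiable ground-truth answer.
(ground_truth,) = conn.execute(sql, (year,)).fetchone()

# The documents that hold the evidence are concatenated to form the long context.
supporting_docs = sorted(
    {doc for (doc,) in conn.execute(
        "SELECT source_doc FROM corpus_table WHERE year = ?", (year,)
    )}
)

print(question, ground_truth, supporting_docs)
```

The point of the design is that the answer is computed by execution rather than written by a model, so the reward signal for these tasks can be checked mechanically.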

3.4.4 Progressive post-training schedule + expert merging

  • The training uses progressive length extension to avoid instability from abruptly shifting to long-context multi-hop grounding (Section 4.1).
  • The paper reports synchronized growth of max input and max rollout length. It states stage settings of:
  • (20K input, 12K output), (60K input, 20K output), (120K input, 50K output) (Section 4.1).
  • Figure 6 provides a closely related staged pipeline:
  • Full-context RL Stage-1: max input 32K, max output 12K
  • Stage-2: 60K / 20K
  • Stage-3: 120K / 50K
  • Memory-RL (specialist): max input 128K, chunk 32K, memory 15K
  • Model merging with SCE (details not provided in the excerpt beyond being a merging algorithm)
  • Full-context RL Stage-4: 120K / 50K
  • Assumption / note: The excerpt contains a minor inconsistency (20K vs 32K for Stage-1); I follow Figure 6 for the stage-by-stage pipeline because it is the explicit pipeline diagram.

  • The paper states that mixing memory-management RL data with full-context RL data harms infrastructure efficiency and stability, so it trains specialized experts and merges them (Section 4.1).

3.4.5 Stabilizing long-context multi-task RL

The paper identifies multiple instability sources and proposes fixes in Sections 4.2–4.4.

(A) Task-balanced sampling (Section 4.2).
  • Motivation: long-context RL data is multi-clustered and heterogeneous (UMAP visualization in Figure 7), so naive random batching yields distribution imbalance and unstable entropy dynamics (Figure 8).
  • Implementation has two pieces:
    1) Balancing before RL training starts: run pre-inference per data source, bin examples by pass@k, then sample equally across bins.
    2) During training: per batch, sample equal numbers from five task types:
      • multiple-choice,
      • doc multi-hop reasoning,
      • general reading comprehension,
      • dialogue memory,
      • corpus-level numerical calculation.
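
A minimal sketch of the in-training half of task-balanced sampling follows. The task labels, pool sizes, and batch size are hypothetical, and how the paper handles a batch size that is not divisible by five (such as the reported 128) is not specified in the excerpt, so this sketch simply uses integer division.

```python
import random
from collections import Counter, defaultdict

TASK_TYPES = [
    "multiple_choice",
    "doc_multi_hop",
    "reading_comprehension",
    "dialogue_memory",
    "numerical_calculation",
]


def sample_balanced_batch(pool, batch_size, rng):
    """Draw the same number of examples from each task type."""
    by_task = defaultdict(list)
    for example in pool:
        by_task[example["task"]].append(example)
    per_task = batch_size // len(TASK_TYPES)  # remainder handling is an assumption
    batch = []
    for task in TASK_TYPES:
        batch.extend(rng.sample(by_task[task], per_task))
    rng.shuffle(batch)
    return batch


# An imbalanced pool: one task dominates, as naive random batching would reflect.
pool = [{"task": "multiple_choice", "id": i} for i in range(100)]
for task in TASK_TYPES[1:]:
    pool += [{"task": task, "id": i} for i in range(20)]

batch = sample_balanced_batch(pool, batch_size=20, rng=random.Random(0))
counts = Counter(example["task"] for example in batch)
print(counts)
```

Each task type contributes four examples regardless of how skewed the underlying pool is.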

(B) Task-specific advantage estimation (Section 4.2).
  • Motivation: GRPO’s group-normalized advantage can be biased; batch-level normalization can add noise when a batch mixes tasks with very different reward distributions.
  • Fix: compute the normalization standard deviation over rewards from the same task within the batch (Eq. (7)), instead of purely group-wise or whole-batch.
  • Result: On Qwen3-4B-Thinking-2507, adding task-balanced sampling + task-specific advantage yields an average score of 58.62 vs 56.07 for GRPO alone (Table 2), i.e., +2.55 points over the GRPO baseline.
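
One way to read the task-specific normalization is sketched below. The summary only specifies that the standard deviation is computed over same-task rewards in the batch; centering each reward on its own group mean is my assumption, as are the field names, the `eps` term, and the toy rewards. Eq. (7) in the paper is the authoritative form.

```python
from collections import defaultdict
from statistics import mean, pstdev


def task_specific_advantages(samples, eps=1e-6):
    """samples: list of dicts with keys 'prompt_id', 'task', 'reward'.

    Center on the group (same prompt) mean, scale by the std of all
    rewards in the batch that share the sample's task type.
    """
    by_group, by_task = defaultdict(list), defaultdict(list)
    for s in samples:
        by_group[s["prompt_id"]].append(s["reward"])
        by_task[s["task"]].append(s["reward"])
    group_mean = {k: mean(v) for k, v in by_group.items()}
    task_std = {k: pstdev(v) for k, v in by_task.items()}
    return [
        (s["reward"] - group_mean[s["prompt_id"]]) / (task_std[s["task"]] + eps)
        for s in samples
    ]


samples = [
    {"prompt_id": "p0", "task": "multi_hop", "reward": 1.0},
    {"prompt_id": "p0", "task": "multi_hop", "reward": 0.0},
    {"prompt_id": "p1", "task": "multi_hop", "reward": 1.0},
    {"prompt_id": "p1", "task": "multi_hop", "reward": 1.0},
    {"prompt_id": "p2", "task": "numerical", "reward": 0.2},
    {"prompt_id": "p2", "task": "numerical", "reward": 0.4},
]
advs = task_specific_advantages(samples)
print([round(a, 3) for a in advs])
```

Under this reading, a group whose responses all succeed (p1) still gets zero advantage, but its rewards help estimate the task-level spread used to scale other groups of the same task.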

(C) Negative gradient clipping (Section 4.3).
  • Motivation: in long-context tasks, correct and incorrect responses can share many overlapping “correct-looking” steps because they are grounded in the same context; this worsens credit assignment.
  • Evidence: ROUGE-L overlap between correct and incorrect responses is higher for DocMath (45.37) than for short-context AIME24/25 (27.71) (Table 3).
  • Observation: high-entropy tokens correlate strongly with gradient norm in negative rollouts (Spearman’s ρ = 0.96, p < 0.0001) (Figure 9).
  • Method: mask (clip away) contributions from selected negative gradients based on entropy thresholds, applied when the sequence advantage Ai < 0 (Eqs. (8)–(10)), either:
    • token-level (H(t|i) > τtoken) or
    • sequence-level (H̄(i) > τsequence).
  • Ablation: For Qwen3-4B-Thinking-2507, “clip high entropy seqs” reaches 57.36 average vs 56.07 GRPO baseline (Table 4), with noted stability trade-offs if too many negatives are removed.
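
The masking rule can be illustrated with a small helper that returns a per-token 0/1 mask. This is a sketch only: the function name and the threshold values are hypothetical, and the excerpt does not give the actual τ settings.

```python
def negative_gradient_mask(token_entropies, advantage, tau_token=None, tau_sequence=None):
    """Return a 0/1 mask per token; 0 means the token's gradient is dropped.

    Masking only applies to negative-advantage responses.
    """
    n = len(token_entropies)
    if advantage >= 0:
        return [1] * n  # positive samples are never masked
    if tau_sequence is not None:
        mean_entropy = sum(token_entropies) / n
        if mean_entropy > tau_sequence:
            return [0] * n  # sequence-level: drop the whole negative response
    if tau_token is not None:
        # token-level: drop only the high-entropy tokens
        return [0 if h > tau_token else 1 for h in token_entropies]
    return [1] * n


entropies = [0.1, 2.0, 0.3]

positive_mask = negative_gradient_mask(entropies, advantage=0.8, tau_token=1.0)
token_mask = negative_gradient_mask(entropies, advantage=-0.8, tau_token=1.0)
sequence_mask = negative_gradient_mask(entropies, advantage=-0.8, tau_sequence=0.5)
print(positive_mask, token_mask, sequence_mask)
```

The mask would multiply the per-token loss terms before they are summed, so masked tokens of a wrong answer are neither pushed down nor pushed up.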

(D) Adaptive Entropy-Controlled Policy Optimization (AEPO) (Section 4.4).
  • The paper’s key RL algorithmic contribution is AEPO, motivated by the claim that “negative advantages coupled with high entropy” are a primary instability source.
  • Define batch entropy H(πθ, B) as the average token entropy across the batch (Eq. (11)).
  • Maintain a target entropy band [Hlow, Hhigh]:
    • If entropy exceeds Hhigh: mask all negative-advantage samples and update only on positive samples (described as advantage-weighted online rejection sampling fine-tuning) to reduce entropy.
    • If entropy drops below Hlow: reintroduce negative gradients to avoid entropy collapse.
  • Empirical effect:
    • On Qwen3-4B-Thinking-2507, AEPO reaches 59.36 average vs 56.07 GRPO baseline (+3.29) (Table 5).
    • On the primary model (Qwen3-30B-A3B-Thinking), Figure 11 shows entropy oscillating between regimes “with negative gradient” and “without negative gradient,” intended to balance exploration vs exploitation.
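
The entropy-band switching in AEPO can be sketched as a small controller. The class name, the threshold values, and the choice to keep the current mode while entropy sits inside the band are assumptions on my part (the excerpt gives no numeric Hlow/Hhigh and does not spell out the in-band behavior).

```python
class AEPOController:
    """Toggle negative-advantage samples on and off based on batch entropy."""

    def __init__(self, h_low, h_high):
        if h_low >= h_high:
            raise ValueError("h_low must be smaller than h_high")
        self.h_low, self.h_high = h_low, h_high
        self.use_negatives = True  # start with full GRPO-style updates

    def update(self, batch_entropy):
        if batch_entropy > self.h_high:
            self.use_negatives = False  # too exploratory: positives only
        elif batch_entropy < self.h_low:
            self.use_negatives = True   # collapsing: bring negatives back
        # inside the band the previous mode is kept (assumed hysteresis)
        return self.use_negatives

    def filter_advantages(self, advantages):
        """Zero out negative advantages while negatives are masked."""
        if self.use_negatives:
            return list(advantages)
        return [a if a > 0 else 0.0 for a in advantages]


def batch_entropy(token_entropies_per_sample):
    """Average token entropy over every token in the batch."""
    flat = [h for sample in token_entropies_per_sample for h in sample]
    return sum(flat) / len(flat)


controller = AEPOController(h_low=0.3, h_high=0.6)
trace = [0.5, 0.7, 0.5, 0.2, 0.4]  # hypothetical batch entropies over five steps
modes = [controller.update(h) for h in trace]
print(modes)
```

With these toy numbers the controller drops negatives at the second step, keeps them off while entropy is back inside the band, and restores them once entropy falls below the lower bound, which is the kind of oscillation between regimes that Figure 11 is described as showing.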

3.4.6 Training configuration details (reported)

From Section 5.1 (Training Details):

  • Base model: Qwen3-30B-A3B-Thinking-2507.
  • RL framework implementation: VeRL (as named; details not included in the excerpt).
  • Sampling at generation time:
  • temperature 0.7,
  • top-p 0.95.
  • GRPO group size: G = 8.
  • Training mode: purely on-policy.
  • Batch size: 128.
  • Learning rate: constant 2 × 10^-6.
  • Reward mechanism:
  • hybrid of rule-based verification (check ground truth contained in model output) and
  • LLM-as-a-judge when rule-based fails, using gpt-oss-120b (as named in Section 5.1).
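
The hybrid reward can be sketched as a rule check with a judge fallback. The normalization step and the `judge_fn` callable are hypothetical: in the paper the fallback is an LLM judge (gpt-oss-120b), which is replaced here by a toy function so the example runs offline.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an illustrative normalization)."""
    return re.sub(r"\s+", " ", text.strip().lower())


def hybrid_reward(ground_truth, model_output, judge_fn):
    # Rule-based check: is the ground truth contained in the output?
    if normalize(ground_truth) in normalize(model_output):
        return 1.0
    # Fallback: ask the judge only when the rule fails.
    return 1.0 if judge_fn(ground_truth, model_output) else 0.0


judge_calls = []


def toy_judge(ground_truth, model_output):
    """Stand-in for an LLM judge; records each call for inspection."""
    judge_calls.append((ground_truth, model_output))
    return ground_truth == "0.5" and "one half" in model_output.lower()


r1 = hybrid_reward("Paris", "The capital is  paris.", toy_judge)  # rule hit
r2 = hybrid_reward("0.5", "The ratio is one half.", toy_judge)    # judge accepts
r3 = hybrid_reward("0.5", "The ratio is 2.", toy_judge)           # judge rejects
print(r1, r2, r3, len(judge_calls))
```

Checking the cheap rule first means the judge is only invoked for outputs the rule could not confirm, which keeps judge traffic down during rollouts.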

Important missing details (not provided in the excerpt, so cannot be filled in):

  • Optimizer type (e.g., AdamW) and optimizer hyperparameters.
  • Model architecture specifics (layers, hidden size, heads, tokenizer details, context window mechanics beyond the mention of 256K “physical window” in the introduction).
  • Total RL steps, total tokens processed during post-training, hardware, and compute budget.


4. Key Insights and Innovations

1) Long-context reasoning data synthesis that targets multi-hop grounding (not just retrieval).
  • Novelty: The synthesis pipeline explicitly constructs multi-hop reasoning paths over cross-document KGs and uses SQL-executed corpus tables to create verifiable numerical reasoning tasks (Section 3.2).
  • Why it matters: The largest gains are reported on benchmarks requiring dispersed evidence integration (e.g., CorpusQA and MRCR), aligning with this data design (Table 7, Section 5.2 discussion).

2) Stabilized multi-task long-context RL via task-balanced sampling + task-specific advantage normalization.
  • Novelty: Instead of only group-wise or batch-wise normalization, the paper uses task-conditioned reward variance within the batch for advantage normalization (Eq. (7)).
  • Significance: It improves average benchmark performance and stabilizes entropy/length dynamics in training traces (Figure 8, Table 2).

3) Entropy-aware handling of negative gradients, culminating in AEPO.
  • Novelty: AEPO adaptively switches between including and excluding negative-advantage samples based on batch entropy (Section 4.4).
  • Significance: It yields the strongest ablation gain on the 4B model (+3.29 over GRPO; Table 5) and is presented as enabling longer, non-degrading training runs (Figure 11 narrative).

4) Fusion of full-context reasoning with a memory-agent for ultra-long tasks via specialist training + model merging.
  • Novelty: The pipeline trains a memory-management “expert,” merges it with the full-context model (via SCE), then continues full-context RL to recover single-pass performance (Figure 6, Table 10).
  • Significance: Table 10 shows memory-RL boosts the memory-agent metric but hurts full-context scores, and merging largely resolves the trade-off.


5. Experimental Analysis

5.1 Evaluation methodology

  • Benchmarks (Section 5.1).
  • Multiple-choice: LongBench-V2 (503 questions).
  • Needle-in-a-haystack: MRCR (SequenceMatcher ratio is reported).
  • Multi-hop QA: Frames (824 questions), DocMath (test-mini subset of 800 queries), CorpusQA, plus LongBench-V1-QA subsets (2WikiMultihopQA, HotpotQA, Musique, NarrativeQA, Qasper).
  • Evaluation configuration (Section 5.1).
  • Max input length: 128K tokens.
  • Max generation length: 50K tokens.
  • For prompts longer than window: “middle truncation” to preserve front and tail.
  • temperature 0.7, top-p 0.95.
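
Middle truncation keeps the start and end of an over-long prompt and drops the middle. The sketch below assumes an even head/tail split, which the excerpt does not specify; the function name is also hypothetical.

```python
def middle_truncate(tokens, max_len):
    """Keep the front and tail of `tokens`, dropping the middle if too long."""
    if len(tokens) <= max_len:
        return list(tokens)
    head = (max_len + 1) // 2  # even split is an assumption
    tail = max_len - head
    # Guard against tail == 0, since tokens[-0:] would return the whole list.
    return list(tokens[:head]) + (list(tokens[-tail:]) if tail else [])


tokens = list(range(10))
even = middle_truncate(tokens, 4)
odd = middle_truncate(tokens, 5)
short = middle_truncate(tokens, 20)
print(even, odd, short)
```

Keeping both ends is a common choice when the question or instructions sit at the start or end of the prompt and the documents sit in between.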

  • Scoring.

  • Multiple-choice: accuracy.
  • MRCR: SequenceMatcher ratio.
  • Multi-hop QA: max of CEM and LLM-as-a-judge semantic equivalence; the judge uses DeepSeek-V3 with the prompt in Table 6.
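
The two non-trivial scoring rules can be sketched with the standard library. `difflib.SequenceMatcher` is the real Python class behind the ratio; everything else is an assumption: I read CEM as cover exact match (the excerpt does not expand the acronym), the normalization is illustrative, and `judge_fn` is a hypothetical stand-in for the DeepSeek-V3 judge. Any benchmark-specific preprocessing for MRCR is omitted.

```python
from difflib import SequenceMatcher


def mrcr_score(prediction: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between prediction and reference."""
    return SequenceMatcher(None, prediction, reference).ratio()


def cover_exact_match(prediction: str, reference: str) -> float:
    """1.0 if the reference string appears inside the prediction (assumed CEM)."""
    return 1.0 if reference.strip().lower() in prediction.strip().lower() else 0.0


def multi_hop_score(prediction, reference, judge_fn) -> float:
    """Max of the string-containment score and the judge's verdict."""
    judged = 1.0 if judge_fn(prediction, reference) else 0.0
    return max(cover_exact_match(prediction, reference), judged)


identical = mrcr_score("abc", "abc")
partial = mrcr_score("abcd", "abcf")  # 3 matching chars out of 8 total -> 2*3/8

# Toy judge: treats "NYC" and "New York City" as equivalent (hypothetical).
toy_judge = lambda pred, ref: {pred, ref} == {"NYC", "New York City"}
by_cem = multi_hop_score("It is New York City.", "New York City", lambda p, r: False)
by_judge = multi_hop_score("NYC", "New York City", toy_judge)
missed = multi_hop_score("Boston", "New York City", toy_judge)
print(identical, partial, by_cem, by_judge, missed)
```

Taking the maximum means a correct answer is credited if either the string check or the judge recognizes it, which reduces false negatives from paraphrased answers.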

5.2 Main quantitative results (with numbers)

Overall long-context benchmarks (Table 7; Figure 1).

  • QwenLong-L1.5-30B-A3B average: 71.82.
  • Baseline Qwen3-30B-A3B-Thinking-2507 average: 61.92.
  • Gain: +9.90 average.

Per-benchmark gains vs baseline (Table 7):

  • DocMath: 66.26 vs 62.26 (∆ +4.00)
  • LongBench-V2: 55.27 vs 49.11 (∆ +6.16)
  • Frames: 74.76 vs 70.27 (∆ +4.49)
  • MRCR: 82.99 vs 51.27 (∆ +31.72)
  • CorpusQA: 81.25 vs 71.56 (∆ +9.69)
  • LongBench-V1-QA: 70.40 vs 67.10 (∆ +3.30)
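
As a quick consistency check, the per-benchmark numbers quoted above can be averaged and compared with the reported overall scores. The values are copied from this summary; the dictionary layout is only for the example.

```python
from statistics import mean

# (baseline, QwenLong-L1.5) scores as quoted from Table 7 in this summary.
table7 = {
    "DocMath": (62.26, 66.26),
    "LongBench-V2": (49.11, 55.27),
    "Frames": (70.27, 74.76),
    "MRCR": (51.27, 82.99),
    "CorpusQA": (71.56, 81.25),
    "LongBench-V1-QA": (67.10, 70.40),
}

baseline_avg = mean(b for b, _ in table7.values())
model_avg = mean(m for _, m in table7.values())
deltas = {name: round(m - b, 2) for name, (b, m) in table7.items()}

print(f"baseline avg = {baseline_avg:.3f}")
print(f"model avg    = {model_avg:.3f}")
print(f"avg gain     = {model_avg - baseline_avg:.3f}")
print(deltas)
```

The recomputed averages agree with the reported 61.92 and 71.82 to within 0.01. The small residual on the baseline side is presumably due to rounding of the individual table entries.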

Comparisons reported in Table 7:

  • Gemini-2.5-Pro: 72.40 average.
  • GPT-5: 74.74 average.
  • DeepSeek-R1-0528: 68.67 average.
  • The paper highlights MRCR = 82.99 as top performance in that table.

Fine-grained LongBench breakdown (Appendix A tables included in excerpt).

  • LongBench-V2 improves across difficulty/length buckets; e.g., “Medium” subset gain +10.23 (Table 11).
  • LongBench-V1-QA shows gains on Musique (+7.00) and NarrativeQA (+9.00), but a drop on Qasper (-3.00) (Table 12).

5.3 Generalization beyond long-context benchmarks

Table 8 compares baseline vs QwenLong-L1.5:

  • General benchmarks:
  • MMLU-PRO: 81.03 → 81.33 (∆ +0.30)
  • AIME24: 90.31 → 90.00 (∆ -0.31)
  • AIME25: 82.81 → 86.46 (∆ +3.65)
  • GPQA-Diamond: 75.88 → 76.78 (∆ +0.90)
  • Agentic memory (BFCL-V4 memory subset):
  • Memory-KV: 10.97 → 16.77 (∆ +5.80) is the largest gain among those listed.
  • One negative change: Memory-Rec_Sum: 41.94 → 40.00 (∆ -1.94).
  • Dialogue memory:
  • LongMemEval: 60.80 → 76.40 (∆ +15.60), a large reported improvement.

5.4 Ultra-long context with memory-agent framework (1M–4M tokens)

Table 9 reports results for contexts beyond 128K:

  • MRCR subsets:
  • 128K–512K: 16.55 → 34.87 (∆ +18.32)
  • 512K–1M: 4.24 → 22.53 (∆ +18.29)
  • CorpusQA:
  • 1M: 15.32 → 20.72 (∆ +5.40)
  • 4M: 9.52 → 14.29 (∆ +4.77)

These are specifically in the memory-agent setting (not full-context inference), and show the proposed memory training improves performance relative to the baseline memory-agent configuration.

5.5 Do experiments support the claims?

  • Claim: post-training recipe yields large long-context gains.
  • Strongly supported by Table 7’s consistent improvements and especially the very large MRCR jump.
  • Claim: progressive training + memory specialist merging matters.
  • Supported by Table 10, which shows:
    • big gains early (Stage-1: 61.92 → 69.59),
    • continued improvements through Stage-3 (71.59),
    • a full-context drop after Memory-RL (68.53) followed by recovery after merging (71.18) and final Stage-4 (71.82),
    • memory-agent metric improvements that rise to 22.53 on MRCR 512K–1M by the end.
  • Claim: RL stability methods improve training.
  • Supported by ablation improvements on the 4B model (Tables 2, 4, 5) and the entropy dynamics plots (Figures 8, 10, 11), though the plots are qualitative in the excerpt (no exact thresholds Hlow/Hhigh are given).

6. Limitations and Trade-offs

  • Missing or under-specified implementation details (from what is shown).
  • The excerpt does not provide optimizer choice, full model hyperparameters, compute budget, hardware, or total training steps/tokens, making reproducibility assessment incomplete.

  • Trade-off between full-context performance and memory-agent specialization.

  • Table 10 shows Memory-RL improves memory-agent performance (MRCR 512K–1M goes up to 20.34) but reduces full-context average score (71.59 → 68.53), requiring a merging step to reconcile.

  • Credit assignment remains a core issue.

  • The paper explicitly frames AEPO and gradient clipping as stabilization rather than a fundamental solution; advantage is still assigned at sequence/trajectory level (Section 7.2 and the discussion in Section 4.3).

  • Data synthesis constraints and external dependencies.

  • The paper notes that synthesis scalability is bottlenecked by proprietary model API quotas and the cost of serving large open-source models for generating long-context data (Section 7.1).

  • Coverage gaps for real-world long-output tasks and multimodality.

  • The paper states the current synthesis pipeline is not optimized for long-input/long-output tasks like report generation or chapter-level revision, and is currently text-only (Section 7.1).

7. Implications and Future Directions

  • How this changes the landscape (based on the provided paper).
  • It suggests long-context reasoning improvements can be achieved substantially via post-training, not only via bigger windows or new attention architectures, by aligning:
    • training tasks with true multi-hop long-range grounding, and
    • RL procedures with stability controls tailored to long sequences.
  • The reported generalization gains (Table 8), especially in dialogue memory (LongMemEval +15.60), imply that improving long-context grounding may improve broader “stay-on-track” reasoning behaviors.

  • Follow-up research directions named in the paper.

  • Data flywheel / closed-loop synthesis: use the improved long-context model to generate new QA pairs and “thinking trajectories,” reducing dependence on external models (Section 7.1).
  • Token-level credit assignment: move beyond uniform advantage broadcast to a finer-grained mechanism for reasoning trajectories (Section 7.2).
  • More sophisticated reward models: beyond rule checks + judge models, toward rubric-based reward modeling for open-ended tasks (Section 7.2).
  • Multimodal long-context reasoning: extend synthesis and training to multimodal settings (Section 7.1).
  • Long-input, long-output task synthesis: explicitly target tasks requiring lengthy outputs, not just long inputs (Section 7.1).

  • Repro/Integration Guidance (based on the described system).

  • When to prefer full-context single-pass: tasks within the model’s evaluated ≤128K window (or within the stated 256K physical window mentioned in the introduction) where you can afford full-context inference and want strongest direct reasoning performance.
  • When to prefer the memory-agent framework: tasks exceeding the context window (paper evaluates 128K–4M token regimes in Table 9), where chunking + iterative memory updates are necessary.
  • If implementing the training recipe, the key integration points are:
    • build or obtain a synthesis pipeline that enforces distributed evidence + verifiable answers (Section 3),
    • use an on-policy GRPO-style RL loop with group sampling (G=8) and stability mechanisms (task-balanced sampling, task-specific advantage, and AEPO) (Sections 4.2–4.4, Section 5.1),
    • train a memory specialist separately and merge to avoid destabilizing mixed-mode training (Section 4.1, Figure 6, Table 10).