WHEN REASONING MEETS ITS LAWS
ArXiv: 2512.17901
🎯 Pitch
The paper introduces LORE (Laws of Reasoning), a theoretical framework and benchmark (LORE-BENCH) that formalize how Large Reasoning Models should allocate test-time reasoning compute (compute law) and how accuracy should decay with problem complexity (accuracy law), operationalized via monotonicity and compositionality. It shows existing LRMs are broadly monotonic but fail at compositionality, and presents SFT-Compo, a simple supervised fine-tuning method that enforces compute compositionality and yields consistent gains in compositional behavior and downstream reasoning accuracy—improving both efficiency and robustness of model reasoning.
1. Executive Summary (2-3 sentences)
This paper introduces LORE (Laws of Reasoning), a theoretical framework intended to formalize how Large Reasoning Models (LRMs) should allocate test-time reasoning effort and how their accuracy should change as problems get more complex. Because the paper’s formal notion of “complexity” is not directly measurable, it operationalizes the laws via two testable properties—monotonicity and compositionality—and builds LORE-BENCH to measure them, finding that many LRMs are monotonic but not compositional. It then proposes SFT-Compo, a supervised fine-tuning method that explicitly enforces compute compositionality and empirically improves both compositionality metrics and downstream reasoning accuracy (e.g., Table 3; Figure 6; Figure 7).
2. Context and Motivation
- Problem / gap addressed
  - LRMs can solve hard reasoning tasks, but their reasoning behavior can be “counterintuitive” in ways that waste compute or reduce accuracy.
  - The motivating failure mode is that models do not reliably allocate more reasoning effort to more complex tasks, and they behave oddly on composed tasks (Figure 1 shows an example where a composite “sum then square” gets fewer reasoning tokens than a subproblem and suffers an accuracy drop).
- Why this matters
  - If reasoning compute is misallocated, models may:
    - Overthink: spend too many tokens on easy tasks.
    - Underthink: spend too few tokens on hard tasks.
  - The paper frames this as harming both efficiency and performance, and argues it stems partly from high-variance Chain-of-Thought (CoT) training data that is not constrained by explicit “thinking budget” rules (Introduction).
- Prior approaches and their shortcomings (as positioned here)
  - Post-training and test-time methods try to control or scale reasoning length (the paper lists several lines: variable-length CoT SFT; test-time scaling/control).
  - The critique is that these methods are often ad hoc and still yield undesirable behaviors in the authors’ studies (Introduction; Related Work §6).
- How this paper positions itself
  - It aims to move from heuristics to principles:
    - Propose theoretical “laws” connecting complexity ↔ compute and complexity ↔ accuracy (Section 2; Figure 2).
    - Build targeted evaluations for these principles (`LORE-BENCH`, Section 3).
    - Show that enforcing one principle (compute compositionality) can improve general reasoning (Sections 4–5; Table 3; Figure 7).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is a theory + benchmark + fine-tuning recipe for diagnosing and improving how LRMs allocate reasoning tokens.
- It targets test-time reasoning behavior (reasoning length / token budget and accuracy) and proposes measurable proxies plus a training method to make models behave more “law-abiding.”
3.2 Big-picture architecture (diagram in words)
- LORE theory (Section 2): Define `complexity` and hypothesize two laws:
  - `Compute Law`: reasoning tokens scale linearly with complexity.
  - `Accuracy Law`: accuracy decays exponentially with complexity.
- Proxies (Section 2.2–2.3): Replace unmeasurable complexity with two checkable properties:
  - `Monotonicity` and `Compositionality` for compute and accuracy.
- LORE-BENCH (Section 3):
  - `LORE-MONO`: synthetic, step-controlled variants for monotonicity tests (Figure 3).
  - `LORE-COMPO`: composed question pairs for compositionality tests (Section 3.2).
- Intervention (Section 4):
  - `SFT-Compo`: build a supervision set that selects reasoning traces whose lengths best satisfy compute additivity, then supervised fine-tune.
- Evaluation (Section 5):
  - Re-measure compositionality on `LORE-COMPO` and evaluate general reasoning on six benchmarks (Table 3).
3.3 Roadmap for the deep dive
- Explain the paper’s formal objects and definitions (questions, outputs, compute, accuracy).
- Explain the two hypothesized laws (compute and accuracy) and why complexity is hard to measure directly.
- Explain the proxy properties (monotonicity and compositionality) and what they imply (Propositions 1–2; Appendix D).
- Explain how `LORE-MONO` and `LORE-COMPO` operationalize these properties and how metrics are computed.
- Explain `SFT-Compo`: how training examples are constructed (Equation (1)) and why it targets compositionality.
- Walk through the empirical findings tying “law compliance” to improved benchmark accuracy.
3.4 Detailed, sentence-based technical breakdown
This is an empirical + conceptual framework paper: it introduces formal hypotheses (“laws”), constructs targeted benchmarks to test proxy properties of those hypotheses, and proposes a simple supervised fine-tuning method to improve compliance and downstream reasoning.
3.4.1 Formal setup: questions, reasoning, and outputs (Section 2.1)
- A question is a token sequence `x ∈ X ⊆ V*`, where `V*` is the set of all finite token sequences over vocabulary `V`.
- A model is an autoregressive LRM `M_θ` that follows a “thinking-then-answering” pattern:
  - It first samples a reasoning chain `r ∈ R` from `p_θ(r | x)`.
  - Then it samples an answer `y ∈ Y` from `p_θ(y | x, r)`.
  - The combined model output is `o = (r, y)`.
- The paper defines question composition as concatenation with a connector prompt `c`: `x1 ⊕ x2 = concat(x1, c, x2)` (Section 2.1).
  - Example connector (footnote): “Answer the following questions in order: Q1… Q2…”
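The composition operator can be sketched in a few lines, assuming questions are plain strings. The connector wording and the `Q1:`/`Q2:` layout below are hypothetical stand-ins (the footnote's template is elided in this summary), and `compose` is an illustrative name rather than the paper's code.

```python
# Hypothetical connector template; the paper's exact wording is not reproduced here.
CONNECTOR = "Answer the following questions in order:"


def compose(x1: str, x2: str, connector: str = CONNECTOR) -> str:
    """Build a composite question x1 ⊕ x2 from two standalone questions.

    The formal definition writes concat(x1, c, x2); this sketch uses a
    template layout in the spirit of the footnote example instead.
    """
    return f"{connector}\nQ1: {x1.strip()}\nQ2: {x2.strip()}"


x12 = compose("What is 3 + 4?", "How many sides does a hexagon have?")
print(x12)
```

Keeping the connector fixed across all pairs matters for the later compositionality tests, since any overhead it adds is then shared by every composite question.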
3.4.2 What “complexity” means here (Definition 1)
- The paper defines complexity `κ(x)` using a verifier-and-Turing-machine abstraction:
  - A “primitive step” is one valid transition of a fixed deterministic Turing machine.
  - A candidate solution is a sequence of primitive steps `τ` with length `ℓ(τ)`.
  - A verifier `v(x, τ)` returns 1 iff `τ` is a valid solution for `x`.
  - The complexity is the minimum valid solution length: `κ(x) = min{ ℓ(τ) : v(x, τ) = 1 }`, or `∞` if unsolvable.
- The key practical point is that computing `κ(x)` is intractable because it requires searching over many candidate solutions to find the shortest one (Section 2.1).
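To make the "shortest valid solution" idea concrete, here is a toy analogue, not the paper's Turing-machine construction. It assumes a made-up step set (`inc`, `dbl`) acting on an integer that starts at 1, and treats "reach the target" as the verifier. The names `STEPS`, `verifier`, and `toy_complexity` are all hypothetical.

```python
from itertools import product

# Toy primitive steps (an assumption for illustration, not a Turing machine).
STEPS = {"inc": lambda v: v + 1, "dbl": lambda v: v * 2}


def verifier(target, tau, start=1):
    """Return True iff applying the step sequence tau to start yields target."""
    value = start
    for name in tau:
        value = STEPS[name](value)
    return value == target


def toy_complexity(target, max_len=12):
    """Brute-force the shortest valid step sequence, up to max_len steps."""
    for length in range(max_len + 1):
        # len(STEPS) ** length candidates at each length: exponential growth.
        for tau in product(STEPS, repeat=length):
            if verifier(target, tau):
                return length, tau
    return float("inf"), None


length, tau = toy_complexity(10)
print(length, tau)
```

Even in this two-step toy world the candidate space doubles with every extra step, which mirrors why the exact `κ(x)` is treated as unmeasurable for real questions.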
3.4.3 What “reasoning compute” and “accuracy” mean here (Definitions 2–3)
- Reasoning compute is defined as the expected number of reasoning tokens:
  - `C_θ(x) = E_{r ~ p_θ(·|x)}[ℓ(r)]`
  - Here, `ℓ(r)` is the token length of the chain-of-thought / reasoning segment.
- Reasoning accuracy is the probability the final extracted answer matches the ground truth:
  - `A_θ(x) = E_{(r,y) ~ p_θ(·|x)}[ 1{ ans(y) = a*(x) } ]`
  - `ans(y)` extracts the final answer, and `a*(x)` is the correct answer.
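In practice both expectations are estimated from a handful of sampled outputs per question. The following sketch shows the obvious Monte Carlo estimators, assuming reasoning lengths and extracted answers have already been collected. The function names and the sample values are illustrative, not taken from the paper.

```python
from statistics import mean


def estimate_compute(reasoning_lengths):
    """Monte Carlo estimate of C_θ(x): mean reasoning-token count over samples."""
    return mean(reasoning_lengths)


def estimate_accuracy(extracted_answers, gold):
    """Monte Carlo estimate of A_θ(x): fraction of samples matching the gold answer."""
    return mean(1.0 if answer == gold else 0.0 for answer in extracted_answers)


# Made-up values for one question sampled 4 times.
lengths = [812, 640, 1024, 700]
answers = ["7", "7", "8", "7"]
print(estimate_compute(lengths), estimate_accuracy(answers, gold="7"))
```

With only a few samples per question the accuracy estimate can be exactly 0, in which case `log A_θ(x)` is undefined. How such cases are handled is not stated in this summary.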
3.4.4 Compute Law: linear scaling hypothesis (Hypothesis 1; Section 2.2)
- The compute law hypothesizes that an “optimal” reasoning model uses compute proportional to complexity:
  - `C_θ(x) = α_θ κ(x) + o(κ(x))`
  - `α_θ > 0` depends on the model and decoding strategy.
  - `o(κ(x))` is sublinear overhead intended to capture “introductory/transition tokens.”
Why this is hard to test directly
- Since `κ(x)` is not measurable for real questions, validating linear scaling is not straightforward.
3.4.5 Proxy properties for compute: monotonicity and compositionality (Properties 1–2)
To avoid needing exact `κ(x)`, the paper tests two properties:
- Compute-Complexity Monotonicity (Property 1):
  - If `κ(x1) ≤ κ(x2)` then `C_θ(x1) ≤ C_θ(x2)`.
  - Intuition: harder problems should not require less reasoning compute.
- Independence notion (Definition 4 + operationalization)
  - Formally, `x1` and `x2` are independent if: `κ(x1 ⊕ x2) = κ(x1) + κ(x2)`.
  - Practically, the paper uses a proxy: each question has a concept set `S(x)`, and independence is approximated by disjointness: `S(x1) ∩ S(x2) = ∅` (Section 2.2).
  - In `LORE-COMPO`, it operationalizes independence by sampling questions from different subjects in MATH500 (Section 3.2).
- Compute-Complexity Compositionality (Property 2):
  - For independent questions, compute is additive up to overhead: `C_θ(x1 ⊕ x2) ≈ C_θ(x1) + C_θ(x2)`
- Theoretical link (Proposition 1; Appendix D)
  - Under assumptions (monotonicity, exact additivity for independent compositions, and existence of arbitrarily many jointly independent same-complexity questions), the paper proves: `C_θ(x) = α_θ κ(x)` (exact), and provides an asymptotic version with sublinear overhead (Corollary D.1).
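One way to see why these assumptions force linearity is sketched below. This is an informal argument under the stated assumptions, not a reproduction of the proof in Appendix D. The function `g` and the positive integer complexities `k`, `k_1`, `k_2` are notation introduced here for illustration.

```latex
% Step 1: monotonicity applied in both directions.
% Equal complexity implies equal compute, so compute is a function g of complexity.
\kappa(x) = \kappa(x') \;\Rightarrow\;
C_\theta(x) \le C_\theta(x') \ \text{and}\ C_\theta(x') \le C_\theta(x)
\;\Rightarrow\; C_\theta(x) = g(\kappa(x)).

% Step 2: compose m jointly independent questions, each of complexity k.
% Independence gives total complexity mk; additivity gives total compute m g(k).
g(mk) = m\, g(k).

% Step 3: reach complexity k_1 k_2 in two ways
% (k_2 copies of complexity k_1, or k_1 copies of complexity k_2).
k_2\, g(k_1) = g(k_1 k_2) = k_1\, g(k_2)
\;\Rightarrow\; \frac{g(k_1)}{k_1} = \frac{g(k_2)}{k_2} =: \alpha_\theta
\;\Rightarrow\; C_\theta(x) = \alpha_\theta\, \kappa(x).
```

The sketch also shows why the "arbitrarily many jointly independent same-complexity questions" assumption is needed: Steps 2 and 3 both rely on being able to compose as many such questions as required.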
3.4.6 Accuracy Law: exponential decay hypothesis (Hypothesis 2; Section 2.3)
- The paper motivates an exponential decay based on “many-step reasoning” where all steps must succeed:
  - `A_θ(x) = exp(-λ_θ κ(x))` for `0 < A_θ(x) ≤ 1`
  - Equivalently: `log A_θ(x) ∝ -κ(x)`.
- Accuracy proxies (Properties 3–4)
  - Accuracy-Complexity Monotonicity (Property 3):
    - If `κ(x1) ≤ κ(x2)` then `A_θ(x1) ≥ A_θ(x2)`.
  - Accuracy-Complexity Compositionality (Property 4):
    - For independent questions: `A_θ(x1 ⊕ x2) = A_θ(x1) · A_θ(x2)`
    - In log-space, this becomes additivity: `log A_θ(x1 ⊕ x2) = log A_θ(x1) + log A_θ(x2)` (up to deviations).
- Theoretical link (Proposition 2; Appendix D)
  - With analogous assumptions, the paper proves: `A_θ(x) = exp(-λ_θ κ(x))` and gives an asymptotic deviation version (Corollary D.2).
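The “all steps must succeed” intuition can be made explicit with a short worked equation. It assumes, purely for illustration, that every primitive step succeeds independently with the same probability `p`. That simplification is introduced here and is not claimed to be the paper's exact model.

```latex
% Illustrative assumption: each of the \kappa(x) steps succeeds independently
% with probability p, where 0 < p < 1.
A_\theta(x) = p^{\kappa(x)}
            = \exp\!\big(\kappa(x)\,\ln p\big)
            = \exp\!\big(-\lambda_\theta\,\kappa(x)\big),
\qquad \lambda_\theta = -\ln p > 0.

% Numeric example: p = 0.99 and \kappa(x) = 100.
A_\theta(x) = 0.99^{100} \approx 0.366.

% Independent composition adds complexities, so accuracies multiply (Property 4).
A_\theta(x_1 \oplus x_2) = p^{\kappa(x_1) + \kappa(x_2)}
                         = A_\theta(x_1)\, A_\theta(x_2).
```

Under this toy model even a 99% per-step success rate leaves roughly a one-in-three chance of solving a 100-step problem, which is why accuracy compositionality is naturally checked in log-space.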
3.4.7 LORE-BENCH: turning the properties into measurable tests (Section 3)
(A) LORE-MONO for monotonicity (Section 3.1; Figure 3)
- The goal is to create question families where relative complexity ordering is known.
- Construction:
  - Choose 4 domains: math, science, language, code.
  - Curate 10 “seed questions” per domain → 40 seed questions total.
  - For each seed, generate 30 variants that require more iterative steps (variant `N` applies the update rule `N` times), so complexity increases with the variant index (Figure 3).
  - Variants are generated via Python scripts, then manually checked to avoid shortcuts like periodic patterns (Appendix E.1.2).
- Metric:
  - Use Spearman correlation `ρ ∈ [-1, 1]` between:
    - variant index vs reasoning compute `C_θ(x)` (tests compute monotonicity),
    - variant index vs `log A_θ(x)` (tests accuracy monotonicity; expected negative correlation).
- Setup detail: for LORE-MONO, at each variant index they average compute and log accuracy across the 40 questions, then compute Spearman (Section 3.3).
- They also mention a max–min normalization of compute per question to avoid domination by any item (footnote 4).
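The monotonicity metric described above can be sketched in dependency-free Python. The exact normalization in footnote 4 is not reproduced in this summary, so the per-question `min_max` rescaling below is an assumption. All function names and the toy token counts are illustrative.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def min_max(curve):
    """Rescale one question's curve to [0, 1]; a flat curve maps to all zeros."""
    lo, hi = min(curve), max(curve)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in curve]


def monotonicity_score(curves):
    """curves[q][n] is the compute of seed question q at variant index n + 1."""
    normalized = [min_max(curve) for curve in curves]
    mean_curve = [sum(col) / len(col) for col in zip(*normalized)]
    return spearman(list(range(1, len(mean_curve) + 1)), mean_curve)


toy_curves = [[120, 180, 260, 410], [90, 85, 140, 200]]  # made-up token counts
print(monotonicity_score(toy_curves))
```

Averaging before ranking means a single non-monotone question (the second toy curve dips at variant 2) can be masked by the aggregate, which is worth keeping in mind when reading the per-domain numbers.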
(B) LORE-COMPO for compositionality (Section 3.2)
- Build from MATH500, using subject labels (e.g., Algebra, Geometry).
- Construction:
  - Randomly sample `(x1, x2)` from distinct subjects to approximate independence.
  - Create composite `x12` by concatenation with a fixed connector prompt.
  - Each original question is used at most once → 250 triplets total: `D_LoRe-Compo = {(x1^(i), x2^(i), x12^(i))}_{i=1}^{250}`
- Metric:
  - For `f_θ(·)` being either compute `C_θ(·)` or `log A_θ(·)`, compositionality predicts: `f_θ(x12) ≈ f_θ(x1) + f_θ(x2)`
  - They measure deviation using Mean Absolute Deviation (MAD) and normalized MAD:
    - `MAD_f = Σ | f_θ(x12) - (f_θ(x1) + f_θ(x2)) |`
    - `nMAD_f = MAD_f / S_f`, where `S_f = Σ |f_θ(x1) + f_θ(x2)|`
  - Interpretation: smaller `nMAD` means better compositionality.
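The normalized deviation above translates directly into code. This sketch assumes the per-question values of `f_θ` (compute or log accuracy) have already been estimated. The function name `nmad` and the example numbers are illustrative.

```python
def nmad(triplets):
    """Normalized MAD over (f_x1, f_x2, f_x12) triplets for one model.

    Numerator: sum of |f(x12) - (f(x1) + f(x2))| over triplets.
    Denominator: sum of |f(x1) + f(x2)| over the same triplets.
    """
    triplets = list(triplets)
    deviation = sum(abs(f12 - (f1 + f2)) for f1, f2, f12 in triplets)
    scale = sum(abs(f1 + f2) for f1, f2, _ in triplets)
    return deviation / scale


# Made-up reasoning-token counts: the first triplet is perfectly additive,
# the second "underthinks" the composite by 300 tokens.
example = [(400, 600, 1000), (300, 500, 500)]
print(nmad(example))
```

Because numerator and denominator are summed over the same triplets, the ratio is unchanged whether `MAD_f` is read as a sum or as a mean.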
3.4.8 Enforcing compute compositionality: SFT-Compo (Section 4; Equation (1))
- The intervention targets compute compositionality (not accuracy compositionality) because compute provides an actionable selection criterion (Section 4, footnote 5).
- Data construction process (“what happens first, second, third”):
  - Start from a training dataset `D_train` containing questions with categories.
  - Sample pairs `(x1, x2)` from distinct categories and form a composite question `x12 = x1 ⊕ x2` (mirrors LORE-COMPO construction).
  - For each of `x1`, `x2`, and `x12`, sample `K` model outputs from an LRM (either the current model or a stronger teacher), where each output is `(r, y)`.
  - Consider the `K^3` combinations of one output for each of `{x1, x2, x12}`, but filter to combinations where all three answers are correct.
  - From the remaining correct combinations, pick the triple of reasoning traces that best satisfies compute additivity by minimizing the length mismatch: `(r1*, r2*, r12*) = argmin |ℓ(r1) + ℓ(r2) - ℓ(r12)|`, subject to each trace yielding a correct final answer (Equation (1)).
  - Add three supervised examples to a training set: `(x1, o1*)`, `(x2, o2*)`, `(x12, o12*)`
  - Aggregate across triplets into `D_comp` and do supervised fine-tuning on this set.
- Intuition:
  - The selection rule tries to make the model see supervision where “solving the composite” uses about the same amount of reasoning as “solving both parts and adding them,” thereby nudging the model toward additive compute behavior.
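The selection rule in Equation (1) can be sketched as a brute-force search over correct samples. This is a minimal illustration under the description above, not the authors' implementation. The `Sample` record, the `select_triple` name, and the token counts are hypothetical.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    reasoning_len: int  # ℓ(r): token count of the reasoning trace
    correct: bool       # whether the extracted final answer is correct


def select_triple(
    s1: Sequence[Sample], s2: Sequence[Sample], s12: Sequence[Sample]
) -> Optional[Tuple[Sample, Sample, Sample]]:
    """Pick the all-correct triple minimizing |ℓ(r1) + ℓ(r2) - ℓ(r12)|."""
    c1 = [s for s in s1 if s.correct]
    c2 = [s for s in s2 if s.correct]
    c12 = [s for s in s12 if s.correct]
    if not (c1 and c2 and c12):
        return None  # no usable triple for this question pair
    return min(
        product(c1, c2, c12),
        key=lambda t: abs(t[0].reasoning_len + t[1].reasoning_len - t[2].reasoning_len),
    )


# Made-up samples; the incorrect 300-token composite would match perfectly
# but is filtered out before the search.
x1_samples = [Sample(100, True), Sample(300, True)]
x2_samples = [Sample(200, True), Sample(50, False)]
x12_samples = [Sample(290, True), Sample(520, True), Sample(300, False)]
best = select_triple(x1_samples, x2_samples, x12_samples)
print(best)
```

With `K = 8` samples per question the search covers at most 512 combinations per triplet, so exhaustive enumeration is cheap. Filtering for correctness first also shows where the method can fail to produce any training example.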
3.4.9 Configurations / hyperparameters (only what is stated in the provided text)
Decoding / sampling used in evaluation (Section 3.3):
- Outputs per question: 8 samples per model.
- Decoding temperature:
  - 0.6 for the DeepSeek family.
  - 0.8 for the Phi-4 family.
- Max generation length: 20480 tokens (Section 3.3).
SFT-Compo training setup (Section 5.1):
- Models fine-tuned: DeepSeek-R1-Distill (Qwen-1.5B, Qwen-7B, Llama-8B) and Phi-4-mini-reasoning.
- Teacher model used to sample outputs: DeepSeek-R1-Distill-Qwen-14B.
- Samples per question for constructing supervision: K = 8.
- Size of resulting supervised dataset: 3.9K question-output pairs.
- Fine-tuning: 5 epochs, batch size 16.
Control baseline SFT hyperparameter grid (Appendix F.1):
- Learning rate grid: {1e-6, 5e-6, 5e-5}
- Batch size: 8
- Gradient accumulation: 2
- Warmup ratio: 0
Evaluation max generation length in Section 5.1:
- Max generation length: 10240 tokens.
Not specified in the provided excerpt
- Optimizer type (e.g., AdamW), optimizer parameters, learning-rate schedule for SFT-Compo (beyond the baseline grid), weight decay, dropout, etc.
- Hardware (GPUs/TPUs), training time, compute budget.
- Model architecture specifics (layers, hidden size, heads), tokenizer, context window (beyond max generation length).
4. Key Insights and Innovations
- (1) A “laws” framing for reasoning behavior (Compute + Accuracy)
  - Novelty: Instead of only measuring task accuracy, the paper treats reasoning behavior as something that should obey structural regularities: linear compute scaling and exponential accuracy decay with complexity (Hypotheses 1–2; Figure 2).
  - Significance: This yields a conceptual target for “desirable” reasoning behavior that can be tested and potentially optimized.
- (2) Replacing unmeasurable complexity with testable proxy properties
  - Novelty: The paper explicitly acknowledges that `κ(x)` is intractable and proposes `monotonicity` and `compositionality` as empirically testable substitutes (Properties 1–4).
  - Significance: These proxies are linked back to the laws via formal propositions (Appendix D), providing a principled justification for the benchmark metrics.
- (3) `LORE-BENCH`: targeted evaluation for monotonicity and compositionality
  - Novelty: `LORE-MONO` is designed to enforce a known complexity ordering via iterative-step variants (Figure 3), enabling Spearman correlation tests that typical benchmarks cannot support.
  - Novelty: `LORE-COMPO` specifically probes additivity/multiplicativity under composition using independent question pairs and `nMAD` metrics.
- (4) `SFT-Compo`: fine-tuning by selecting “length-consistent” correct reasoning traces
  - Novelty: The training set is not just “correct traces,” but correct traces chosen to minimize `|ℓ(r1)+ℓ(r2)-ℓ(r12)|` (Equation (1)).
  - Significance: This directly targets compute compositionality and empirically improves both compositionality and downstream benchmark accuracy (Figure 6; Table 3).
- (5) Reported “synergistic effects”
  - Observation: Enforcing compute compositionality also improves compute monotonicity (Figure 7a) and improves accuracy compositionality in log-space (Figure 7b), even though accuracy was not directly enforced.
5. Experimental Analysis
Evaluation methodology
- Models evaluated on LORE-BENCH (Section 3.3)
  - Ten LRMs including:
    - DeepSeek-R1-Distill variants (Qwen-1.5B, Qwen-7B, Llama-8B, Qwen-14B),
    - `Phi-4-mini-reasoning`, `OpenReasoning-Nemotron-14B`, `Sky-T1-32B-Preview`, `Qwen3-Next-80B-A3B-Thinking`,
    - and two “length control” models: `Thinkless-1.5B-RL-DeepScaleR` and `AdaptThink-7B-delta0.05`.
- Sampling / decoding
  - 8 samples per question; temperature 0.6 (DeepSeek family) / 0.8 (Phi-4 family); max length 20480 tokens (Section 3.3).
- Metrics
  - `LORE-MONO`: Spearman correlation between variant index and (i) compute, (ii) log accuracy.
  - `LORE-COMPO`: `nMAD` for compute `C_θ` and for `log A_θ`.
Main results on LORE-MONO (Table 1; Figure 4)
- Compute monotonicity is generally strong
  - Many models have overall compute Spearman correlations close to 1 (Table 1, “All” column for reasoning compute).
  - Exception highlighted: `DeepSeek-R1-1.5B` has weaker overall compute correlation 0.875, and even negative/near-zero correlations in specific domains:
    - Language: -0.346
    - Code: 0.151
  - Figure 4 visualizes this behavior for DeepSeek-R1-1.5B.
- Accuracy monotonicity (in log space) is generally present
  - Table 1 reports negative correlations for log accuracy across domains for most models (expected under Property 3).
  - The paper notes the trend is weaker for the weakest model (DeepSeek-R1-1.5B).
Main results on LORE-COMPO (Table 2; Figure 5)
- Compositionality failures are widespread
  - Compute compositionality `nMAD_{C_θ}` is fairly large across models; examples from Table 2:
    - DeepSeek-R1-1.5B: 0.528
    - Thinkless-1.5B: 0.339
    - Phi-4-mini: 0.322
    - DeepSeek-R1-8B: 0.423
  - Accuracy compositionality in log-space `nMAD_{log A_θ}` is also large for many models; examples:
    - DeepSeek-R1-1.5B: 2.368
    - DeepSeek-R1-7B: 1.170
    - Sky-T1-32B: 1.900
  - Figure 5 shows scatter plots of `C_θ(x12)` vs `C_θ(x1)+C_θ(x2)`; points deviate substantially from the `y=x` line, illustrating lack of additivity.
- Length-control methods do not automatically enforce compositionality
  - Even Thinkless and AdaptThink have “considerable deviations” (as summarized near Table 2 / Figure 5).
SFT-Compo enforcement results (Figure 6; Section 5.2)
- Compute compositionality improves materially
  - Reported `nMAD_{C_θ}` reductions include:
    - 1.5B: 0.528 → 0.314 (40.5% reduction)
    - 8B: 0.423 → 0.328 (22.5% reduction)
  - Figure 6b shows the post-SFT scatter aligns closer to `y=x` for the 1.5B case.
General reasoning benchmarks (Table 3)
- Benchmarks evaluated (Section 5.1)
  - `GSM8K`, `MATH500`, `AIME 2024`, `AIME 2025`, `AMC 2023`, `OlympiadBench`.
  - Metric: Pass@1 accuracy (%) computed over 8 sampled outputs; “Pass@1” column in Table 3 is the average across six benchmarks.
- SFT-Compo improves Pass@1 across all listed models in Table 3
  - DeepSeek-R1-1.5B:
    - Pass@1: 47.6 (Base) → 52.4 (SFT-Compo) (+4.8)
    - AIME24: 18.8 → 26.2 (+7.4)
    - GSM8K: 71.6 → 77.6 (+6.0)
  - DeepSeek-R1-7B:
    - Pass@1: 61.5 → 64.7 (+3.2)
    - AIME24: 36.3 → 43.3 (+7.0)
  - DeepSeek-R1-8B:
    - Pass@1: 54.5 → 59.5 (+5.0)
    - AIME25: 22.9 → 29.2 (+6.3)
  - Phi-4-mini:
    - Pass@1: 59.5 → 63.9 (+4.4)
    - AIME24: 32.5 → 43.7 (+11.2)
- Control baseline to isolate “teacher distillation” effects
  - The paper compares to a baseline `SFT` that uniformly samples one correct reasoning path per question triplet (instead of minimizing the length mismatch).
  - Reported outcome: `SFT-Compo` outperforms `SFT` in all cases in Table 3, supporting the claim that improvements are tied to compositionality-oriented selection, not only access to teacher-generated correct traces.
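The difference between the control baseline and the compositional selection can be made concrete with a small sketch. It assumes each question's correct traces are represented only by their lengths. The function names and numbers are illustrative, and the baseline's exact sampling procedure in the paper may differ in detail.

```python
import random
from itertools import product


def uniform_baseline_triple(correct_x1, correct_x2, correct_x12, rng):
    """Control: pick one correct trace per question uniformly, ignoring lengths."""
    return rng.choice(correct_x1), rng.choice(correct_x2), rng.choice(correct_x12)


def length_mismatch(triple):
    """|ℓ(r1) + ℓ(r2) - ℓ(r12)| for a triple of reasoning lengths."""
    l1, l2, l12 = triple
    return abs(l1 + l2 - l12)


rng = random.Random(0)  # seeded only to make the sketch repeatable
# Made-up reasoning lengths of the correct traces for x1, x2, and x12.
correct_x1, correct_x2, correct_x12 = [100, 300], [200, 240], [290, 520]

baseline = uniform_baseline_triple(correct_x1, correct_x2, correct_x12, rng)
compo = min(product(correct_x1, correct_x2, correct_x12), key=length_mismatch)
print(length_mismatch(baseline), ">=", length_mismatch(compo))
```

Both procedures draw from the same pool of teacher-generated correct traces, so any gap between them isolates the effect of the length-matching criterion rather than of distillation itself.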
Synergistic effects (Figure 7)
- Compute monotonicity improves after enforcing compute compositionality
  - For DeepSeek-R1-1.5B, overall compute Spearman correlation improves:
    - 0.875 → 0.977
  - In the code domain specifically:
    - 0.151 → 0.914 (Figure 7a).
- Accuracy compositionality (log-space) improves even though it wasn’t directly enforced
  - For 1.5B: `nMAD_{log A_θ}` drops:
    - 2.368 → 0.685 (71.1% reduction)
  - For 7B: 1.170 → 0.756 (35.4% reduction) (Figure 7b).
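The percentage reductions quoted for the `nMAD` metrics follow from the before/after values by simple arithmetic, which the snippet below reproduces. The dictionary keys are informal labels chosen here, not names from the paper.

```python
def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Before/after nMAD values as quoted from Figure 6 and Figure 7b.
reported = {
    "nMAD_C, 1.5B": (0.528, 0.314),
    "nMAD_C, 8B": (0.423, 0.328),
    "nMAD_logA, 1.5B": (2.368, 0.685),
    "nMAD_logA, 7B": (1.170, 0.756),
}
for name, (before, after) in reported.items():
    print(f"{name}: {relative_reduction(before, after):.1f}% reduction")
```

The computed values round to 40.5%, 22.5%, 71.1%, and 35.4%, matching the figures quoted in the text.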
Are the experiments convincing?
- Strengths in support of the claims:
  - The benchmark metrics are directly aligned with the proxy properties (Spearman for monotonic trends; `nMAD` for additivity in composition).
  - The intervention targets the same metric family it is evaluated on (compute compositionality on LORE-COMPO), and improvements are shown quantitatively (Figure 6).
  - Downstream gains are consistent across multiple model sizes and benchmarks (Table 3), and a control baseline (`SFT`) is used to argue the effect is not merely “teacher distillation.”
- What remains under-specified in the provided text:
  - Training details (optimizer, schedule) and dataset construction specifics beyond size and source subset are not fully provided here (Appendix F is referenced, but only partial details are included in the excerpt).
  - “Independence” is approximated by subject/category mismatch, which may not guarantee true independence of solution steps; this affects how strictly compositionality should be expected.
6. Limitations and Trade-offs
- Complexity is formally defined but not directly measurable
  - The laws are stated in terms of `κ(x)` (minimal primitive-step solutions), but the empirical program relies on proxies (monotonicity and compositionality) because `κ(x)` is intractable (Section 2.2).
- Independence is operationalized with a proxy
  - The paper explicitly notes independence is approximated by disjoint concept sets `S(x)` (Section 2.2) and, in practice, by selecting questions from different subjects (Section 3.2).
  - The limitations section (Appendix C) acknowledges this proxy is “not rigorous.”
- LORE-MONO coverage is limited
  - Appendix C states LORE-MONO has 40 seed questions total and calls expanding topic diversity and coverage an important future direction.
- Potential shortcut/artefact risks in synthetic monotonicity construction
  - The benchmark must ensure that higher variant index truly requires more computation.
  - The paper recognizes periodic-answer shortcuts and says it manually reviewed and excluded such variants (Appendix E.1.2), but synthetic generation can still introduce unintended regularities.
- SFT-Compo depends on having multiple correct samples
  - The procedure filters to combinations where all three reasoning paths yield correct answers (Section 4).
  - For harder tasks or weaker models, obtaining enough correct samples among `K=8` outputs could be limiting; the paper mitigates this by using a stronger teacher for sampling in experiments (Section 5.1).
- Compute-focused enforcement may change behavior in ways not fully characterized
  - The method enforces length additivity, which may encourage longer (or differently structured) reasoning on composites; the paper reports improved accuracy and improved monotonicity/accuracy-compositionality (Figure 7), but broader behavioral side effects are not fully explored in the provided excerpt.
- Closed-source coverage
  - Appendix C notes budget constraints lead them to focus on strong open-source LRMs; closed-source evaluation is not included.
7. Implications and Future Directions
- How this changes the landscape
  - It reframes reasoning evaluation from “solve problems well” to also “solve problems with structurally sensible allocation of reasoning compute,” using principled proxy tests (Figure 2; Sections 2–3).
  - It suggests that reasoning control should be evaluated not only by average accuracy but also by whether behavior respects monotonicity and compositionality, especially for composed tasks where current models underperform (Table 2; Figure 5).
- Follow-up research enabled
  - Better independence notions: Since compositionality hinges on independence, refining `S(x)`-style concept disjointness or building tasks with provable additive complexity would strengthen the empirical link to the formal laws (Appendix C).
  - Scaling `LORE-MONO`: Increase domains, seed diversity, and variant types beyond repeated updates while still preserving known complexity orderings (Appendix C).
  - Direct accuracy-law enforcement: The paper states accuracy compositionality is hard to enforce because it doesn’t specify which reasoning path to supervise (footnote 5). A future line is devising supervision or selection rules that explicitly target multiplicative accuracy (Property 4) or log-additivity.
- Practical applications / use cases
  - Model selection and diagnostics: `LORE-BENCH` can be used to detect abnormal behaviors like “underthinking on composite questions” even if raw accuracy looks competitive (motivated by Figure 1 and compositionality failures in Table 2).
  - Training recipe: `SFT-Compo` provides a concrete way to improve both compositionality and downstream math/science reasoning (Figure 6; Table 3), especially when composite-task underthinking is observed.
- Repro/Integration Guidance (based on what is provided)
  - When to prefer this method:
    - If a model shows strong monotonicity but poor compositionality (the common pattern reported in Section 3.3), and especially if composed multi-part prompts lead to reduced reasoning length and accuracy.
  - What you need to implement:
    - A source of categorized questions to form `(x1, x2, x12)` triplets (Section 4; Appendix F.1 mentions annotating categories for DeepScaler questions).
    - A way to sample `K` outputs per question (the paper uses `K=8`) and to filter for correctness to apply the length-matching selection rule (Equation (1)).
  - What is not fully specified in the excerpt:
    - The exact optimizer/scheduler and full fine-tuning hyperparameters for `SFT-Compo` are referenced as being in Appendix F, but only partial details are shown here (epochs and batch size are given; optimizer is not).