Language Modeling by Language Models
ArXiv: 2506.20249
Pitch
The paper presents Genesys, a multi-agent LLM system that autonomously discovers novel autoregressive LM block architectures by simulating research stages (proposal, literature-backed design, unit-by-unit code generation, and laddered verification) inside a purpose-built environment (LMADE). By factorizing blocks into reusable unit trees and using genetic-programming plus a Viterbi-style generation pipeline, Genesys sharply improves valid design synthesis and verifies 1,062 new architectures, many of them competitive with human baselines, demonstrating a scalable, verifiable path to accelerate and democratize architecture discovery.
1. Executive Summary (2-3 sentences)
This paper introduces Genesys, a multi-agent LLM system that autonomously discovers new autoregressive language model block architectures by simulating key stages of ML research: proposal (ideation + literature search), implementation (code generation), and verification (pre-training + evaluation), all inside a purpose-built environment called LMADE (Figure 1; Sections 3–5). Its core technical contribution is a factorized representation of model blocks into a tree of reusable "units," enabling genetic programming search plus a "unit-by-unit" (Viterbi-style) code generation pipeline that dramatically increases the success rate of generating valid executable designs (Table 4; Appendix A.1). Empirically, the system discovers and verifies over a thousand architectures and reports that top discovered designs are competitive with strong human baselines at 125M and 350M parameter scales on common benchmarks (Tables 5 and 14; Section 5.3).
2. Context and Motivation
- Problem/gap addressed
  - Many recent LLM-driven "automated scientific discovery" systems target open-ended research where goals and verification are unclear (Section 1; Related Work discussion).
  - Architecture discovery in language modeling is a compelling alternative: it has a clear artifact (an executable block program) and clear evaluation (pre-training + downstream benchmark performance), but it also demands:
    - deep literature understanding,
    - careful compute budgeting for pre-training,
    - writing correct code in a large design space (Section 1; Section 3).
- Why it matters
  - Language model architecture research is expensive and iterative; an autonomous pipeline could accelerate exploration and reduce human bottlenecks (Section 1).
  - The paper frames architecture discovery as a concrete, verifiable ASD task that can stress-test end-to-end agent systems with measurable success criteria (Section 1; Section 3.1).
- Prior approaches and shortcomings (as positioned here)
  - Classical neural architecture search (NAS) often explores fixed operation spaces, while this work targets a broader space of LM block code (Section 2, NAS paragraph).
  - LLM-driven discovery approaches that rely on direct prompting can fail frequently when asked to generate complex correct code; the paper treats valid code generation as a primary bottleneck (Figure 9; Section 4.2; Table 4).
  - Verification is expensive; naïvely training every candidate at every scale is infeasible (Section 4.3; Figure 11).
- How this paper positions itself
  - It proposes a full-stack discovery setup: an environment (LMADE) that provides knowledge + verification tooling, and an agentic genetic search system (Genesys) that proposes/implements/verifies designs (Figure 1; Sections 3–4).
  - It emphasizes factorization + genetic programming as the backbone that makes the search space "efficient and factorizable" (Abstract; Section 4.1).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system is an autonomous architecture discovery pipeline that outputs executable PyTorch code for new autoregressive LM blocks and evaluates them by training and benchmarking.
- It solves architecture discovery as a program search + resource-managed verification problem: generate candidate block programs, check correctness cheaply, then selectively pre-train/evaluate them at increasing scales under a fixed compute budget (Equation (1); Table 1; Figure 11).
3.2 Big-picture architecture (diagram in words)
- LMADE (environment, Figure 1 left; Section 3.2)
  - Knowledge Engine (KE): retrieves relevant literature/code from a curated library and external sources (Figure 2; Section 3.2; Appendix B.1.3).
  - Verification Engine (VE): (i) checks candidate code validity via symbolic + runtime checks, and (ii) pre-trains + evaluates models on a fixed corpus and benchmark suite (Table 1; Section 3.2; Appendix B.1.2–B.1.4).
- Genesys (discovery system, Figure 1 right; Section 4)
  - Evolution tree: stores designs, code artifacts, tree-structured representations, and performance metadata (Figure 4; Section 4.1).
  - Designer agents: propose and implement new designs using LLM sub-agents (proposer/reviewer + planner/coder/observer) (Figures 6–8; Section 4.2).
  - Verifier agents: select which designs to train at which scale using a confidence/fitness policy and a "Ladder-of-Scales" budget schedule (Figure 10; Figure 11; Section 4.3).
3.3 Roadmap for the deep dive
- First, define what counts as a design and what objective the system optimizes (Equation (1); Section 3.1).
- Next, explain the common representation language (GAB/GAU trees) that makes code checkable and evolvable (Figure 3; Section 4.1; Appendix B.3).
- Then, walk through the designer pipeline: proposal → adversarial review → unit-by-unit implementation with checkers (Figures 6–9; Section 4.2).
- After that, cover verifier-side scaling/budget management and selection policies (Figures 10–11; Section 4.3).
- Finally, connect these mechanisms to the empirical evaluations and ablations (Tables 2–5; Section 5).
3.4 Detailed, sentence-based technical breakdown
This is an empirical systems + algorithmic framework paper: it builds a working autonomous discovery system and supports key design choices with formal arguments (Appendix A) and ablation-driven experiments (Section 5).
3.4.1 Problem formulation: architecture discovery as program search
- The object being discovered is a block program B_LM, i.e., code that defines how one layer/block transforms hidden states in an autoregressive LM (Figure 3; Section 3).
- The paper formalizes discovery as searching for the block program that maximizes an empirical fitness function:
  - \[ \hat{B}_{LM} = \arg\max_{B_{LM} \in \mathcal{B}_{LM}} F(B_{LM}), \quad F(B_{LM}) = \frac{1}{M \cdot K}\sum_{i=1}^{M}\sum_{j=1}^{K}\text{Perf}(B_{LM}, D_i, S_j) \] (Equation (1), Section 3.1).
  - Here, Perf is downstream performance on task D_i at model scale S_j, and the paper uses this as the basis for selection in evolution.
- A candidate block is considered valid only if it satisfies syntactic and semantic constraints, including:
  - it is a differentiable PyTorch module with the right tensor interface,
  - it is causal (future tokens don't influence past outputs),
  - it runs forward/backward stably and efficiently (Table 1; Section 3.1; Appendix B.1.2).
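Equation (1) is an unweighted mean over every task/scale pair. The following is a minimal sketch of that arithmetic only, assuming a hypothetical `perf` table keyed by `(task, scale)`; the task names and accuracy values are made up for illustration and do not come from the paper.

```python
from statistics import mean


def fitness(perf: dict, tasks: list, scales: list) -> float:
    """Unweighted mean of Perf(B, D_i, S_j) over all M tasks and K scales."""
    return mean(perf[(task, scale)] for task in tasks for scale in scales)


# Hypothetical accuracies for a single candidate block (illustrative numbers).
perf = {
    ("task_a", "14M"): 0.50, ("task_a", "31M"): 0.54,
    ("task_b", "14M"): 0.30, ("task_b", "31M"): 0.34,
}
score = fitness(perf, tasks=["task_a", "task_b"], scales=["14M", "31M"])
print(round(score, 4))
```

The sketch assumes every task/scale cell is filled in; how partially verified designs are scored is not specified in this summary.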
3.4.2 The representation layer: GAB and GAU trees (why factorization matters)
- The paper introduces a standardized code interface called a Generalized Autoregressive Block (GAB), implemented as GABBase, which enforces:
  - input X is shape (B, L, D),
  - output Y has the same shape,
  - an auxiliary dictionary Z carries intermediate variables across units and blocks (Figure 17; Appendix B.3).
- A Generalized Autoregressive Unit (GAU) is a submodule (GAUBase) with the same (X, Z) → (X, Z) signature and additional constraints for integration and partial checking (Appendix B.3).
- Each full block program B_LM is factorized into a GAU tree, where nodes correspond to units (Figure 3, Figure 4; Section 4.1). This enables:
  - Mutation: replace/modify a subtree or unit (Figure 5A; Section 4.1).
  - Crossover: merge units from multiple parent designs (Figure 5B; Section 4.1).
  - More targeted validity checking and incremental code generation (Section 4.2).
- The paper provides a formal justification that programs with the relevant type structure (abstracted as Σ → Σ) can be decomposed into such unit trees (Appendix A.2). Practically, this supports the claim that factorization does not "oversimplify" the design space for this setting (Section 4.1; Appendix A.2).
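To make the tree operations concrete, here is a minimal sketch of a unit tree with subtree replacement, which covers both mutation (swap in a new unit) and crossover (swap in a unit taken from another parent). The `GAUNode` class and helper names are hypothetical stand-ins, not the paper's `GAUBase` API, and unit names stand in for real unit code.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class GAUNode:
    """One unit in a hypothetical GAU tree; `name` stands in for the unit's code."""
    name: str
    children: tuple[GAUNode, ...] = ()


def unit_names(node: GAUNode) -> list[str]:
    """Pre-order list of unit names in the tree."""
    names = [node.name]
    for child in node.children:
        names.extend(unit_names(child))
    return names


def find_unit(node: GAUNode, name: str) -> GAUNode | None:
    """Return the first unit called `name` (pre-order), or None."""
    if node.name == name:
        return node
    for child in node.children:
        found = find_unit(child, name)
        if found is not None:
            return found
    return None


def replace_unit(node: GAUNode, target: str, new_subtree: GAUNode) -> GAUNode:
    """Return a new tree in which every unit called `target` is replaced by `new_subtree`."""
    if node.name == target:
        return new_subtree
    return GAUNode(node.name, tuple(replace_unit(c, target, new_subtree) for c in node.children))


parent_a = GAUNode("block", (GAUNode("attention"), GAUNode("mlp")))
parent_b = GAUNode("block", (GAUNode("ssm", (GAUNode("conv"),)), GAUNode("gated_mlp")))

# Mutation: replace one unit of a single parent with a newly written unit.
mutant = replace_unit(parent_a, "mlp", GAUNode("gated_mlp"))
# Crossover: graft a subtree from parent_b into parent_a.
child = replace_unit(parent_a, "attention", find_unit(parent_b, "ssm"))
print(unit_names(mutant), unit_names(child))
```

Because the nodes are immutable, parents stay intact in the evolution tree while children share unchanged subtrees with them.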
3.4.3 The environment: LMADE = Knowledge Engine + Verification Engine
Knowledge Engine (KE) (Section 3.2; Appendix B.1.3)
- KE includes:
- a manually curated reference library of 297 LM papers stored in a searchable vector store and augmented with extracted code snippets (Figure 2; Section 3.2; Appendix B.1.1),
- external querying over ArXiv, Semantic Scholar, web search, etc. (Section 3.2; Appendix B.1.3).
- The paper describes chunking PDFs into semantic chunks and storing embeddings in a vector DB; retrieval uses reranking (Appendix B.1.3; Table 10).
Verification Engine (VE) (Section 3.2; Appendix B.1.2–B.1.4)
- The VE provides two levels of feedback:
  1. Symbolic checker that performs static and runtime checks:
     - Static: parsing/formatting compliance with the GAB/GAU protocol.
     - Runtime: module init, forward/backward, differentiability, causality tests, and a small training sanity check (Table 1; Appendix B.1.2).
     - Causality is tested by checking whether changing future tokens changes earlier outputs (Appendix B.1.2).
  2. Full verification: automated pre-training + evaluation on a filtered corpus and LM-Eval benchmarks (Section 3.2).
     - An Auto-Tuner adjusts block depth/width to match target parameter scales and tunes gradient accumulation to avoid OOM (Appendix B.1.4; Table 7).
     - A runtime monitor terminates runs on issues like exploding gradients/loss or excessive step time (Appendix B.1.4; Table 8).
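The causality test described above can be illustrated with a toy perturbation check. This is a sketch under simplifying assumptions: the real checker runs on PyTorch modules with (B, L, D) tensors, whereas here a "block" is any function from a list of floats to a list of floats, and the function names are hypothetical.

```python
import random


def is_causal(block, seq, trials=8, seed=0, tol=1e-9):
    """Perturb tokens after position t; outputs at positions <= t must not change."""
    rng = random.Random(seed)
    base = block(seq)
    for _ in range(trials):
        t = rng.randrange(len(seq) - 1)  # leaves at least one future token to perturb
        perturbed = seq[: t + 1] + [x + rng.uniform(1.0, 2.0) for x in seq[t + 1:]]
        out = block(perturbed)
        if any(abs(a - b) > tol for a, b in zip(base[: t + 1], out[: t + 1])):
            return False
    return True


def prefix_mean(seq):
    """Causal toy block: position i only sees tokens 0..i."""
    out, total = [], 0.0
    for i, x in enumerate(seq, start=1):
        total += x
        out.append(total / i)
    return out


def global_mean(seq):
    """Non-causal toy block: every position sees the whole sequence."""
    m = sum(seq) / len(seq)
    return [m] * len(seq)


print(is_causal(prefix_mean, [1.0, 2.0, 3.0, 4.0]))
print(is_causal(global_mean, [1.0, 2.0, 3.0, 4.0]))
```

A passing check is evidence rather than proof of causality, since it only probes a handful of perturbations.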
3.4.4 The discovery system: Genesys as distributed genetic programming with LLM agents
Core data structure: the evolution tree (Figure 4; Section 4.1)
- Each node stores:
- executable code,
- the GAU tree representation,
- design traces (proposal/review artifacts),
- empirical performance metrics across verified scales (Figure 4; Section 4.1).
Designer agents: proposal stage (Figures 6–7; Section 4.2)
- The proposal stage selects one or more parent designs from the evolution tree and retrieves relevant references from KE (Section 4.2).
- A proposer LLM drafts a structured research proposal describing a modification:
- a unit mutation,
- crossover (mixing units),
- or from-scratch design (treated as mutating the root) (Section 4.2).
- A separate reviewer LLM adversarially critiques and scores the proposal; proposals iterate until the reviewer score exceeds a threshold (Section 4.2; Algorithm 1 in Appendix B.2 references this loop).
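The propose/review iteration can be sketched as a simple loop. The LLM calls are replaced by deterministic stubs, and the `max_rounds` cap, the strict `>` comparison, and all function names are assumptions made for illustration; the summary only states that proposals iterate until the reviewer score exceeds a threshold.

```python
def refine_proposal(propose, review, threshold, max_rounds=5):
    """Iterate propose -> review until the reviewer score exceeds `threshold`.

    Returns (accepted_proposal, rounds_used), or (None, max_rounds) on give-up.
    """
    feedback = None
    for round_idx in range(1, max_rounds + 1):
        proposal = propose(feedback)
        score, feedback = review(proposal)
        if score > threshold:
            return proposal, round_idx
    return None, max_rounds


def stub_propose(feedback):
    """Stand-in for the proposer LLM: each round of feedback adds one revision marker."""
    return "draft" if feedback is None else feedback + "+rev"


def stub_review(proposal):
    """Stand-in for the reviewer LLM: the score grows with the number of revisions."""
    score = 2 + proposal.count("+rev")
    return score, proposal  # the stub's feedback simply echoes the proposal


accepted, rounds = refine_proposal(stub_propose, stub_review, threshold=3)
print(accepted, rounds)
```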
Designer agents: implementation stage (unit-by-unit code generation) (Figure 8; Section 4.2)
- Implementation translates an accepted proposal into executable GAU code and then a GAB block.
- The key mechanism is incremental, recursive construction of the GAU tree:
- Maintain an Unimplemented list of units.
- Repeatedly pick one unit, plan its implementation (planner agent), write code (coder agent), then validate:
- symbolic checker validates protocol compliance and runtime behavior,
- observer LLM rates code quality, adherence, novelty; threshold is 3/5 (Section 4.2; Figure 8).
- If checks fail, the system rolls back state and retries; accepted units are "frozen," reducing future failure impact (Figure 8; Section 4.2).
- The paper contrasts this with "direct prompting" (generate the entire artifact in one shot), which has expected attempts 1/p if success probability is p (Section 4.2; Figure 9; Appendix A.1).
- It argues that unit-by-unit generation behaves like a Viterbi-style checkpointed search:
- If an artifact requires many subparts to be correct simultaneously, direct success probability can become multiplicative (roughly ∏ p_k), while the checkpointed approach needs roughly Σ (1/p_k) attempts (Appendix A.1, Lemma 2 and Proposition 1; also explained in Section 4.2).
- Appendix A.1 extends the argument to token-cost models (Appendix A.1.2) and suggests an additional advantage: the approach allows more "design tokens" (iterative refinement) than single-shot generation, which is hypothesized to improve quality (Appendix A.1.3).
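The gap between the two regimes is easy to see numerically. The sketch below assumes independent per-unit success probabilities p_k and that a failed unit is retried on its own while earlier units stay frozen; the six-unit example with p_k = 0.5 is an illustrative choice, not a figure from the paper.

```python
from math import prod


def expected_attempts_direct(ps):
    """One-shot generation succeeds only if every unit is right: expect 1 / prod(p_k) tries."""
    return 1.0 / prod(ps)


def expected_attempts_checkpointed(ps):
    """Unit-by-unit generation retries each unit independently: expect sum(1 / p_k) tries."""
    return sum(1.0 / p for p in ps)


ps = [0.5] * 6  # six units, each generated correctly half of the time
direct = expected_attempts_direct(ps)
checkpointed = expected_attempts_checkpointed(ps)
print(direct, checkpointed)  # 64.0 12.0
```

The direct cost grows exponentially in the number of units while the checkpointed cost grows linearly, which is the intuition behind the Viterbi-style framing.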
Verifier agents and efficient evolution (Section 4.3)
- Genesys runs designers and verifiers in parallel (Section 4.3).
- The evolution tree is seeded with five architectures: GPT/Transformer, Mamba2, RetNet, RWKV6, and TTT (Section 4.3).
- Design selection policy uses two signals:
- fitness = aggregated downstream performance (F),
- confidence = number of scales verified (Figure 10; Section 4.3).
- Designs are bucketed into four quadrants (good/poor × confident/unconfident), and designers/verifiers choose quadrants differently to balance exploration vs exploitation (Figure 10; Section 4.3).
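A minimal sketch of the quadrant bucketing follows. The cut-off values, design names, and scores are hypothetical; which quadrants designers and verifiers actually favor is not detailed in this summary, so the sketch stops at assigning buckets.

```python
def quadrant(fitness, confidence, fitness_cut, confidence_cut):
    """Label a design by fitness (good/poor) and by number of verified scales."""
    quality = "good" if fitness >= fitness_cut else "poor"
    certainty = "confident" if confidence >= confidence_cut else "unconfident"
    return f"{quality}/{certainty}"


# Hypothetical designs: name -> (aggregated fitness F, number of scales verified).
designs = {"A": (0.60, 3), "B": (0.61, 1), "C": (0.50, 3), "D": (0.49, 1)}

buckets = {
    name: quadrant(f, c, fitness_cut=0.55, confidence_cut=2)
    for name, (f, c) in designs.items()
}
print(buckets)
```

In this toy data, a design like B (good but verified at only one scale) is the kind of candidate that further verification would resolve.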
Compute budgeting: Ladder of Scales (Figure 11; Section 4.3)
- Full verification across all scales for all designs is too costly, so the system verifies many candidates at small scale and progressively fewer at larger scales (Figure 11; Section 4.3).
- The figure gives a concrete example budget pyramid:
  - ~1000 models at 14M params trained on ~0.7B tokens,
  - down to ~5 models at 350M params trained on ~50B tokens (Figure 11; Section 4.3).
- The verifier selects the lowest unverified scale with available budget, with budgets "released gradually" based on a target selection ratio across scales (Section 4.3; Algorithm 5 in Appendix B.2).
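The scale-selection rule can be sketched as a walk up the ladder. Only the two end points of the budget (about 1000 runs at 14M and about 5 at 350M) come from the Figure 11 example; the intermediate budgets are made-up placeholders, the rule that a design never skips a rung is an assumption, and the gradual budget release of Algorithm 5 is left out.

```python
from __future__ import annotations

SCALES = ["14M", "31M", "70M", "125M", "350M"]

# End points follow the Figure 11 example; middle values are hypothetical placeholders.
budget = {"14M": 1000, "31M": 200, "70M": 50, "125M": 20, "350M": 5}


def next_scale(verified: set[str], remaining: dict[str, int]) -> str | None:
    """Return the lowest scale not yet verified for a design, if it still has budget.

    Assumption: a design climbs the ladder one rung at a time and never skips a scale.
    """
    for scale in SCALES:
        if scale not in verified:
            return scale if remaining.get(scale, 0) > 0 else None
    return None  # already verified at every scale


print(next_scale(set(), budget))
print(next_scale({"14M", "31M"}, budget))
```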
3.4.5 Concrete configurations: training, data, hardware (as provided)
Training corpus used for verification (Appendix C.1.1; Table 9)
- The paper trains on a filtered dataset called SmolLM-1/8-Corpus, derived from SmolLM with subset filtering (Appendix C.1.1).
- Table 9 reports total size and mixture:
- Total: 34.78B tokens (Train 33.94B; Test 0.42B; Eval 0.42B).
- Major component: FineWeb-Edu 70% with 24.25B tokens.
- It describes removing verbatim overlaps from training relative to test/eval sets (Appendix C.1.1).
Verification training hyperparameters (Table 12)
- Context length: 2048 tokens (Table 12).
- Optimizer: AdamW (Table 12).
- LR schedule: Cosine with minimum LR rate 0.1, warmup ratio 0.02 (Table 12).
- Tokenizer: Llama-2-7b-hf (Table 12).
- Batch size: 0.5M tokens (Table 12).
- Learning rates by scale (Table 12):
- 14M: 1e-3
- 31M: 1e-3
- 70M: 1e-3
- 125M: 6e-4
- (The excerpted Table 12 stops at 125M; 350M settings are not listed there, so I do not infer them.)
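The schedule settings above can be turned into a small worked example. This sketch makes several assumptions that Table 12 as summarized here does not settle: warmup is linear, the "minimum LR rate 0.1" is read as a floor of 0.1 × the peak learning rate, and a 0.5M-token batch means exactly 500,000 tokens per step.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_ratio=0.02, min_lr_ratio=0.1):
    """Linear warmup followed by cosine decay down to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# 14M scale from the budget example: ~0.7B tokens at 0.5M tokens per step.
total_steps = int(0.7e9 / 0.5e6)          # 1400 optimizer steps
warmup_steps = round(total_steps * 0.02)  # 28 warmup steps
peak = lr_at(warmup_steps, total_steps, peak_lr=1e-3)
floor = lr_at(total_steps, total_steps, peak_lr=1e-3)
print(total_steps, warmup_steps, peak, floor)
```

Under these assumptions the 14M run would warm up for 28 steps, peak at 1e-3, and decay toward 1e-4.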
Hardware environment (Appendix C.1.2)
- Main cluster: 10 machines:
  - 8 "V-Nodes" for verification (mixture of 8×A6000 48GB and 8×L40S 48GB machines),
  - 2 "D-Nodes" for design threads (3×A6000 48GB machines) (Appendix C.1.2).
- The system aims to maintain a design:verify thread ratio around 2:1 based on throughput analysis (Appendix C.1.2; Appendix E.4.2).
4. Key Insights and Innovations
- A dedicated, tool-rich discovery environment (LMADE) for architecture discovery
  - LMADE combines a literature/code retrieval engine (Figure 2; Appendix B.1.3) with a verification stack (Table 1; Appendix B.1.4), making architecture discovery an end-to-end, measurable ASD task (Section 3.2; Figure 1).
  - Significance: this makes "autonomous discovery" verifiable and repeatable via standardized checks and training/eval automation.
- Factorizing LM blocks into a GAU tree enables genetic programming over code
  - The system treats designs as compositional trees of units (Figures 3–5; Section 4.1), enabling mutation/crossover at the unit level rather than rewriting entire blocks.
  - The paper supports factorization formally via type-based decomposition arguments (Appendix A.2).
  - Significance: factorization is the core enabler for both scalable search (GP operators) and tractable code generation (unit-by-unit implementation).
- Unit-by-unit "Viterbi-style" code generation with checkpointing improves validity and efficiency
  - The implementation pipeline explicitly freezes correct partial units and retries locally on failures (Figure 8; Section 4.2).
  - Formal analysis in Appendix A.1 argues exponential reductions in expected model calls vs direct generation when correctness requires multiple constraints/subcomponents (Appendix A.1, Proposition 1; Figure 9).
  - Empirical support: validity rate is dramatically higher than direct prompting (Table 4).
- Budget-aware verification via Ladder of Scales
  - The system operationalizes scaling-law intuition: small-scale performance provides signal, allowing heavy filtering before expensive large-scale training (Figure 11; Section 4.3).
  - Significance: without this, verifying 1000+ candidates by pre-training at large scale would be computationally infeasible.
- System-level selection strategy balancing fitness and confidence
  - The quadrant-based selector manages exploitation/exploration based on empirical performance and how many scales have been verified (Figure 10; Section 4.3).
  - Significance: it provides a concrete mechanism for coordinating distributed designers/verifiers under limited budgets.
5. Experimental Analysis
Evaluation methodology (datasets, metrics, baselines, setup)
- Evolutionary/system experiments (RQ1, RQ2) focus on:
  - Fitness progress over generations (Figure 12; Section 5.1),
  - Stability metrics including Sharpe Ratio and Maximum Drawdown (Table 2; Section 5.1),
  - Verification-stage error rates (Table 3),
  - Code generation validity and cost metrics (Table 4; Section 5.2).
- Downstream evaluation (RQ3):
  - Compare top 5 discovered designs vs the 5 human seed designs (GPT2, TTT, Mamba2, RWKV, RetNet) (Section 5.3).
  - Evaluated at 125M (trained on 25B tokens) and 350M (trained on 50B tokens) (Section 5.3; Tables 5 and 14).
  - Metrics are accuracy (%) over selected benchmarks (Tables 5 and 14).
  - The benchmark selection is described as based on informativeness at small scales, with further details referenced to Appendix E.3.1 and Table 16 (Section 5.3).
Main quantitative results (with numbers)
RQ1: Does GP-style evolution help, and which components matter? (Section 5.1)
- Table 2 reports evolution metrics for the first 300 designs:
  - Full: Δ = 4.10%, Δ_max = 4.16%, SR = 69.0, volatility = 1.9, MDD = −0.38.
  - w/o Exp (no experimental verification feedback): Δ = 2.20%, SR = 26.3, MDD = −1.10.
  - w/o Lit (no literature): Δ = 3.37%, SR = 56.7, MDD = −0.62.
  - Base (no accumulation beyond the five seeds): Δ ≈ 0.01%, SR ≈ 0.2.
- For extended runs:
  - First 500 designs: Full Δ = 5.87% vs w/o Exp Δ = 1.69% (Table 2).
  - First 1000 designs: Full Δ = 7.13%, Δ_max = 8.13%, SR = 26.5 (Table 2).
- Figure 12 visualizes increasing mean fitness over generations and shows ablations lagging behind.
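For readers unfamiliar with the stability metrics, the sketch below computes a Sharpe-style ratio and a maximum drawdown on a toy fitness series. The paper's exact definitions are not reproduced in this summary, so the formulas here (mean step-to-step change over its population standard deviation, and largest absolute drop from a running peak) are assumptions borrowed from common finance usage, and the series is made up.

```python
from statistics import mean, pstdev


def sharpe_ratio(series):
    """Mean step-to-step change divided by its standard deviation (no risk-free rate)."""
    changes = [b - a for a, b in zip(series, series[1:])]
    return mean(changes) / pstdev(changes)


def max_drawdown(series):
    """Largest drop below the running peak, reported as a non-positive number."""
    peak, worst = series[0], 0.0
    for x in series:
        peak = max(peak, x)
        worst = min(worst, x - peak)
    return worst


fitness_series = [50.0, 51.0, 50.5, 52.0, 53.0]  # illustrative population fitness
sr = sharpe_ratio(fitness_series)
mdd = max_drawdown(fitness_series)
print(sr, mdd)
```

Read this way, a higher SR means steadier improvement per unit of fluctuation, and an MDD closer to zero means the search rarely loses ground once it has gained it.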
Verification error rates (Table 3; Section 5.1)
- Error rate during design verification/evaluation:
- Full: 8.61%
- w/o Exp: 27.31%
- w/o Lit: 7.67%
- Base: 21.09%
- w/ Mem: 23.70%
- Interpretation grounded in the paper's framing: removing experimental feedback increases downstream errors substantially, suggesting selection signals help avoid brittle designs (Table 3; Section 5.1).
RQ2: Does unit-based generation improve code validity vs direct prompting? (Section 5.2; Table 4)
- Table 4 evaluates implementation success on 100 proposals:
  - Full (planner + coder + observer + symbolic checker + unit generation):
    - Valid: 92%
    - Attempts: 2.6 ± 1.1 (max 5)
    - Token cost: 15.0 ± 18.5
    - LFC (lines of function-body code): 181 ± 44
  - Direct prompting:
    - Valid: 6%
    - Attempts: 1.1 ± 0.2
    - Token cost: 0.3 ± 0.3
    - LFC: 49 ± 15
- The table supports the paper's claim that structured, checkpointed generation is critical for producing complex valid code (Table 4; Figure 9; Section 4.2).
- Ablations isolate which components drive validity:
  - Removing symbolic checker (No SC) drops validity to 30% (Table 4).
  - Removing unit generation (No UG) drops validity to 73% (Table 4).
  - Removing planner or observer has smaller quantitative effects (valid 91% and 89%), though the paper argues qualitative benefits (Table 4; Section 5.2).
RQ3: Are discovered models competitive? (Section 5.3; Tables 5 and 14)
- At 350M params, 50B tokens (Table 5):
  - Best average score among listed models is Geogate with 61.81, narrowly above GPT at 61.78 and Mamba2 at 61.45 (Table 5).
  - The paper also highlights "outperforming GPT2, Mamba2, etc., on 6/9 common benchmarks" for 350M scale (Abstract; echoed near Table 5 discussion).
- At 125M params, 25B tokens (Table 14):
  - HMamba average: 58.39, slightly above Mamba2 average 58.37 and above GPT average 57.04 (Table 14).
- The paper claims discovered designs achieve best results in 7/9 benchmarks at 125M and 6/9 at 350M (Section 5.3 discussion; Tables 14 and 5).
Do experiments support the claims?
- Strong support for the code-generation bottleneck claim:
  - The gap between Full (92% valid) and Direct (6% valid) is extremely large and directly targets the stated bottleneck (Table 4; Section 5.2; Figure 9).
- Moderate-to-strong support for "system components improve evolutionary progress":
  - Evolution ablations show clear drops when removing experiment feedback or literature access (Table 2; Figure 12).
  - However, some metrics (e.g., SR scaling across different run lengths) are reported but not fully contextualized in the excerpt beyond definitions (Section 5.1). The directionality is consistent, but interpretation depends on trusting those metrics for this domain.
- Competitive model performance is supported at small scales:
  - Tables 5 and 14 show that discovered models can match or slightly exceed seed baselines on averages and on subsets of tasks, with no single model dominating all tasks (Section 5.3).
  - The evidence is limited to:
    - two parameter scales (125M, 350M),
    - a selected set of tasks (9 tasks in Tables 5/14),
    - and zero-shot evaluation as described (Section 5.3).
  - From the excerpt alone, this substantiates "competitive with known architectures" in the tested regime, but does not establish broad superiority beyond those settings.
Ablations, failure cases, robustness checks
- System ablations: w/o Exp, w/o Lit, Base, w/ Mem are compared in evolution metrics (Table 2; Figure 12) and error rates (Table 3).
- Designer pipeline ablations: No UG, No SC, No Pl, No Ob, Direct (Table 4).
- Sensitivity analysis: the metrics' sensitivity to population size and step size is analyzed (Figure 19; Appendix D.2).
- Failure analysis:
  - Implementation/verification failure distributions and error type breakdowns are given in Appendix E.1.1 and E.2.2 (Figures 20–24, 44–45), indicating frequent failure modes like forward-pass errors and tensor shape issues (Appendix E.2.2).
6. Limitations and Trade-offs
- Scaling limitations
  - The system's verified discovery is constrained to 14M–350M parameters (Abstract; Section 6), and the paper explicitly flags limited compute as a barrier to billion-parameter-level discovery (Section 6).
- Hardware-specific efficiency innovations are hard to integrate
  - The discussion notes difficulty incorporating "efficiency-focused innovations" such as FlashAttention due to complex, hardware-specific evaluation needs (Section 6). This implies the discovered architectures may not be optimized for real-world GPU kernels or peak throughput.
- Reliance on correlation across scales (assumption behind Ladder-of-Scales)
  - Ladder-of-Scales is motivated by scaling-law intuition that performance correlates across scales (Section 4.3; Figure 11). The excerpt does not provide a direct quantified correlation analysis; it primarily uses the methodology as an operational assumption.
- Evaluation scope
  - The main showcased benchmark comparisons use 9 tasks (Tables 5 and 14). The system also mentions evaluating on "29 selected LM-Eval benchmarks" in the verification engine (Section 3.2), but the headline results focus on the selected subset for small-scale informativeness (Section 5.3; Appendix E.3.1).
  - This creates a trade-off: faster iteration vs broader generalization guarantees.
- Agent self-evaluation signals are imperfect
  - Proposal ratings correlate weakly with final fitness (Appendix E.1.1, Figure 36 shows correlation coefficient 0.04), meaning LLM reviewer scores are not reliable predictors among proposals that pass thresholds.
- Non-trivial waste due to failures
  - Even with the structured pipeline, Appendix E.1.1 reports a non-negligible invalid design fraction (Figure 20 discussion: ~84.0% "Implemented & Verified," leaving ~16.0% invalid), which is costly at scale.
7. Implications and Future Directions
- How this changes the landscape
  - The work suggests that "autonomous discovery" becomes substantially more viable when the search space is:
    - factorized into compositional units (GAU trees),
    - paired with strong correctness tooling (symbolic + runtime checkers),
    - and verified with a compute-aware scaling schedule (Figures 3–5, Table 1, Figure 11).
  - It reframes architecture discovery as a realistic proving ground for end-to-end agent systems because success is measurable (executable code + benchmark performance) (Section 3.1; Equation (1)).
- Follow-up research enabled
  - Scaling up discovery beyond 350M parameters is a natural next step, but would require substantially more compute and/or improved filtering signals (Section 6).
  - Learning from feedback: the paper suggests reinforcing agents using feedback (possibly RL) to improve proposal/implementation quality (Section 6).
  - More adaptive selection: improving design/scale selection beyond the current quadrant heuristic is explicitly identified (Section 6; Figure 10).
- Practical applications / downstream use cases
  - Rapid prototyping of novel LM blocks for small-to-mid-scale models where training budgets are limited, since the system's verification regime is built around small-model methodology (Figure 11; Section 4.3).
  - Discovery of architectural variants tailored to specific task families, motivated by the unit–performance predictability analysis (Appendix E.3.3; Tables 17–18), though this is presented as analysis rather than an integrated optimization target.
- Repro/Integration Guidance (based on provided paper details)
  - Prefer this approach over "direct prompting to generate full model code" when:
    - code validity is a major bottleneck,
    - the artifact can be factorized into compositional units with local checks (Figure 9; Table 4; Appendix A.1).
  - Prefer the Ladder-of-Scales verification strategy when:
    - verification costs rise steeply with scale,
    - you can accept the assumption that small-scale results help triage candidates (Figure 11; Section 4.3).
  - If integrating into another domain, the paper's formalism suggests looking for problems that can be typed/lifted into Σ → Σ-style compositional programs to reuse the unit-tree + checkpointed generation framework (Appendix A.2).
Assumption note: This analysis is grounded only in the content you provided (including the included sections, figures, tables, and appendices excerpted here). Some referenced details (e.g., full prompt texts and all appendix experiments) are mentioned but not fully reproduced in the excerpt, so I did not infer missing numbers or settings beyond what is explicitly shown (e.g., Table 12 does not list 350M learning rate).