Language Modeling by Language Models

ArXiv: 2506.20249

🎯 Pitch

The paper presents Genesys, a multi-agent LLM system that autonomously discovers novel autoregressive LM block architectures by simulating research stages (proposal, literature-backed design, unit-by-unit code generation, and laddered verification) inside a purpose-built environment (LMADE). By factorizing blocks into reusable unit trees and combining genetic programming with a Viterbi-style generation pipeline, Genesys raises the rate of valid generated designs from 6% with direct prompting to 92% (Table 4) and verifies 1,062 new architectures, many competitive with human baselines, demonstrating a scalable, verifiable path to accelerate and democratize architecture discovery.

1. Executive Summary

This paper introduces Genesys, a multi-agent LLM system that autonomously discovers new autoregressive language model block architectures by simulating key stages of ML research—proposal (ideation + literature search), implementation (code generation), and verification (pre-training + evaluation)—inside a purpose-built environment called LMADE (Figure 1; Sections 3–5). Its core technical contribution is a factorized representation of model blocks into a tree of reusable “units,” enabling genetic programming search plus a “unit-by-unit” (Viterbi-style) code generation pipeline that dramatically increases the success rate of generating valid executable designs (Table 4; Appendix A.1). Empirically, the system discovers and verifies over a thousand architectures and reports that top discovered designs are competitive with strong human baselines at 125M and 350M parameter scales on common benchmarks (Tables 5 and 14; Section 5.3).

2. Context and Motivation

Problem/gap addressed
Many recent LLM-driven “automated scientific discovery” systems target open-ended research where goals and verification are unclear (Section 1; Related Work discussion).
Architecture discovery in language modeling is a compelling alternative: it has a clear artifact (an executable block program) and clear evaluation (pre-training + downstream benchmark performance), but it also demands:
- deep literature understanding,
- careful compute budgeting for pre-training,
- writing correct code in a large design space (Section 1; Section 3).
Why it matters
Language model architecture research is expensive and iterative; an autonomous pipeline could accelerate exploration and reduce human bottlenecks (Section 1).
The paper frames architecture discovery as a concrete, verifiable automated scientific discovery (ASD) task that can stress-test end-to-end agent systems with measurable success criteria (Section 1; Section 3.1).
Prior approaches and shortcomings (as positioned here)
Classical neural architecture search (NAS) often explores fixed operation spaces, while this work targets a broader space of LM block code (Section 2, NAS paragraph).
LLM-driven discovery approaches that rely on direct prompting can fail frequently when asked to generate complex correct code; the paper treats valid code generation as a primary bottleneck (Figure 9; Section 4.2; Table 4).
Verification is expensive; naïvely training every candidate at every scale is infeasible (Section 4.3; Figure 11).
How this paper positions itself
It proposes a full-stack discovery setup: an environment (LMADE) that provides knowledge + verification tooling, and an agentic genetic search system (Genesys) that proposes/implements/verifies designs (Figure 1; Sections 3–4).
It emphasizes factorization + genetic programming as the backbone that makes the search space “efficient and factorizable” (Abstract; Section 4.1).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is an autonomous architecture discovery pipeline that outputs executable PyTorch code for new autoregressive LM blocks and evaluates them by training and benchmarking.
It solves architecture discovery as a program search + resource-managed verification problem: generate candidate block programs, check correctness cheaply, then selectively pre-train/evaluate them at increasing scales under a fixed compute budget (Equation (1); Table 1; Figure 11).

3.2 Big-picture architecture (diagram in words)

LMADE (environment, Figure 1 left; Section 3.2)
Knowledge Engine (KE): retrieves relevant literature/code from a curated library and external sources (Figure 2; Section 3.2; Appendix B.1.3).
Verification Engine (VE): (i) checks candidate code validity via symbolic + runtime checks, and (ii) pre-trains + evaluates models on a fixed corpus and benchmark suite (Table 1; Section 3.2; Appendix B.1.2–B.1.4).
Genesys (discovery system, Figure 1 right; Section 4)
Evolution tree: stores designs, code artifacts, tree-structured representations, and performance metadata (Figure 4; Section 4.1).
Designer agents: propose and implement new designs using LLM sub-agents (proposer/reviewer + planner/coder/observer) (Figures 6–8; Section 4.2).
Verifier agents: select which designs to train at which scale using a confidence/fitness policy and a “Ladder-of-Scales” budget schedule (Figure 10; Figure 11; Section 4.3).

3.3 Roadmap for the deep dive

First, define what counts as a design and what objective the system optimizes (Equation (1); Section 3.1).
Next, explain the common representation language (GAB/GAU trees) that makes code checkable and evolvable (Figure 3; Section 4.1; Appendix B.3).
Then, walk through the designer pipeline: proposal → adversarial review → unit-by-unit implementation with checkers (Figures 6–9; Section 4.2).
After that, cover verifier-side scaling/budget management and selection policies (Figures 10–11; Section 4.3).
Finally, connect these mechanisms to the empirical evaluations and ablations (Tables 2–5; Section 5).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + algorithmic framework paper: it builds a working autonomous discovery system and supports key design choices with formal arguments (Appendix A) and ablation-driven experiments (Section 5).

3.4.1 Problem formulation: architecture discovery as program search

The object being discovered is a block program B_LM, i.e., code that defines how one layer/block transforms hidden states in an autoregressive LM (Figure 3; Section 3).
The paper formalizes discovery as searching for the block program that maximizes an empirical fitness function:
\[ \hat{B}_{LM} = \arg\max_{B_{LM} \in \mathcal{B}_{LM}} F(B_{LM}), \quad F(B_{LM}) = \frac{1}{M \cdot K}\sum_{i=1}^{M}\sum_{j=1}^{K}\text{Perf}(B_{LM}, D_i, S_j) \] (Equation (1), Section 3.1).
Here, Perf(B_LM, D_i, S_j) is the downstream performance of the block on task D_i at model scale S_j, averaged over M tasks and K scales; the paper uses the resulting fitness F as the basis for selection in evolution.
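As a minimal sketch of Equation (1), the fitness can be computed as a plain average over a task-by-scale table of scores. The `perf[task][scale]` layout, the task names, and the numbers below are hypothetical placeholders for illustration, not values or data structures from the paper.

```python
from statistics import mean


def fitness(perf: dict[str, dict[str, float]]) -> float:
    """Average Perf(B, D_i, S_j) over all M tasks and K scales (Equation (1)).

    Assumes every task has a score at every scale, so the flat mean equals
    the 1/(M*K) double sum. The nested-dict layout is an assumption.
    """
    scores = [s for by_scale in perf.values() for s in by_scale.values()]
    return mean(scores)


# Made-up accuracies for two tasks (M = 2) at two scales (K = 2).
example = {
    "task_a": {"14M": 0.50, "31M": 0.54},
    "task_b": {"14M": 0.30, "31M": 0.34},
}
example_fitness = fitness(example)
print(round(example_fitness, 4))
```

Because the score is a uniform average, a design verified at only some scales would need a separate convention (for example, averaging over the verified scales only); the sketch does not choose one.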
A candidate block is considered valid only if it satisfies syntactic and semantic constraints, including:
it is a differentiable PyTorch module with the right tensor interface,
it is causal (future tokens don’t influence past outputs),
it runs forward/backward stably and efficiently (Table 1; Section 3.1; Appendix B.1.2).

3.4.2 The representation layer: `GAB` and `GAU` trees (why factorization matters)

The paper introduces a standardized code interface called a Generalized Autoregressive Block (GAB), implemented as GABBase, which enforces:
input X is shape (B, L, D),
output Y has the same shape,
an auxiliary dictionary Z carries intermediate variables across units and blocks (Figure 17; Appendix B.3).
A Generalized Autoregressive Unit (GAU) is a submodule (GAUBase) with the same (X, Z) → (X, Z) signature and additional constraints for integration and partial checking (Appendix B.3).
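The (X, Z) → (X, Z) contract can be illustrated with a toy, dependency-free sketch. This is not the paper's `GABBase`/`GAUBase` API (those are PyTorch modules); here X is a nested Python list standing in for a (B, L, D) tensor, and the unit names `scale_unit` and `record_norm_unit` are hypothetical.

```python
from typing import Callable

# Toy stand-ins: X is a nested list of shape (B, L, D); Z is a dict of side values.
Tensor = list[list[list[float]]]
Unit = Callable[[Tensor, dict], tuple[Tensor, dict]]


def shape(x: Tensor) -> tuple[int, int, int]:
    return len(x), len(x[0]), len(x[0][0])


def scale_unit(x: Tensor, z: dict) -> tuple[Tensor, dict]:
    """Hypothetical unit: scales X by a factor read from Z (default 1.0)."""
    s = z.get("scale", 1.0)
    return [[[v * s for v in tok] for tok in seq] for seq in x], z


def record_norm_unit(x: Tensor, z: dict) -> tuple[Tensor, dict]:
    """Hypothetical unit: leaves X unchanged and writes a statistic into Z."""
    total = sum(v * v for seq in x for tok in seq for v in tok)
    return x, {**z, "sq_norm": total}


def run_block(units: list[Unit], x: Tensor, z: dict) -> tuple[Tensor, dict]:
    """Chain units and enforce the shape-preserving contract after each one."""
    expected = shape(x)
    for unit in units:
        x, z = unit(x, z)
        assert shape(x) == expected, "a unit changed the (B, L, D) shape"
    return x, z


x0 = [[[1.0, 2.0], [3.0, 4.0]]]  # shape (1, 2, 2)
y, z_out = run_block([scale_unit, record_norm_unit], x0, {"scale": 2.0})
```

Keeping every unit on the same signature is what lets units be swapped or checked in isolation, which the later mutation and unit-by-unit generation steps rely on.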
Each full block program B_LM is factorized into a GAU tree, where nodes correspond to units (Figure 3, Figure 4; Section 4.1). This enables:
Mutation: replace/modify a subtree or unit (Figure 5A; Section 4.1).
Crossover: merge units from multiple parent designs (Figure 5B; Section 4.1).
More targeted validity checking and incremental code generation (Section 4.2).
The paper provides a formal justification that programs with the relevant type structure (abstracted as Σ → Σ) can be decomposed into such unit trees (Appendix A.2). Practically, this supports the claim that factorization does not “oversimplify” the design space for this setting (Section 4.1; Appendix A.2).
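A small sketch of what tree-level mutation and crossover might look like. The `UnitNode` class, the subtree-swap form of crossover, and all unit names are assumptions made for illustration; the paper's operators are carried out by LLM agents over real code rather than by this kind of mechanical swap.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UnitNode:
    """One node of a hypothetical unit tree: a named unit plus child units."""
    name: str
    children: list["UnitNode"] = field(default_factory=list)

    def find(self, name: str) -> Optional["UnitNode"]:
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

    def names(self) -> list[str]:
        """Pre-order listing of unit names."""
        return [self.name] + [n for c in self.children for n in c.names()]


def replace_subtree(tree: UnitNode, target: str, new: UnitNode) -> UnitNode:
    """Return a copy of `tree` with the subtree rooted at `target` replaced."""
    if tree.name == target:  # replacing the root amounts to a from-scratch design
        return copy.deepcopy(new)
    tree = copy.deepcopy(tree)
    stack = [tree]
    while stack:
        node = stack.pop()
        for i, child in enumerate(node.children):
            if child.name == target:
                node.children[i] = copy.deepcopy(new)
                return tree
            stack.append(child)
    raise KeyError(target)


def mutate(parent: UnitNode, target: str, new_unit: UnitNode) -> UnitNode:
    return replace_subtree(parent, target, new_unit)


def crossover(a: UnitNode, b: UnitNode, target_in_a: str, donor_in_b: str) -> UnitNode:
    donor = b.find(donor_in_b)
    if donor is None:
        raise KeyError(donor_in_b)
    return replace_subtree(a, target_in_a, donor)


gpt = UnitNode("Block", [UnitNode("Attention"), UnitNode("MLP")])
ssm = UnitNode("Block", [UnitNode("SSM", [UnitNode("Conv")]), UnitNode("GatedMLP")])
child = crossover(gpt, ssm, "Attention", "SSM")
mutated = mutate(gpt, "MLP", UnitNode("GatedMLP"))
```

Both operators return new trees and leave the parents untouched, so parents can stay in the evolution tree as they were.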

3.4.3 The environment: `LMADE` = Knowledge Engine + Verification Engine

Knowledge Engine (KE) (Section 3.2; Appendix B.1.3)
- KE includes:
  - a manually curated reference library of 297 LM papers stored in a searchable vector store and augmented with extracted code snippets (Figure 2; Section 3.2; Appendix B.1.1),
  - external querying over ArXiv, Semantic Scholar, web search, etc. (Section 3.2; Appendix B.1.3).
- The paper describes chunking PDFs into semantic chunks and storing embeddings in a vector DB; retrieval uses reranking (Appendix B.1.3; Table 10).

Verification Engine (VE) (Section 3.2; Appendix B.1.2–B.1.4)
- The VE provides two levels of feedback:
  1. Symbolic checker that performs static and runtime checks:
     - Static: parsing/formatting compliance with the GAB/GAU protocol.
     - Runtime: module init, forward/backward, differentiability, causality tests, and a small training sanity check (Table 1; Appendix B.1.2).
     - Causality is tested by checking whether changing future tokens changes earlier outputs (Appendix B.1.2).
  2. Full verification: automated pre-training + evaluation on a filtered corpus and LM-Eval benchmarks (Section 3.2).
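The causality test described above (perturb future tokens, confirm earlier outputs do not move) can be sketched without any deep-learning dependency. The two toy sequence maps, the tolerance, and the trial count are assumptions for illustration; the actual checker runs on PyTorch modules and its exact procedure is not reproduced here.

```python
import random


def causal_running_mean(tokens: list[float]) -> list[float]:
    """Toy causal map: output t depends only on tokens[0..t]."""
    out, total = [], 0.0
    for t, v in enumerate(tokens, start=1):
        total += v
        out.append(total / t)
    return out


def leaky_global_mean(tokens: list[float]) -> list[float]:
    """Toy non-causal map: every output sees the whole sequence."""
    m = sum(tokens) / len(tokens)
    return [v - m for v in tokens]


def is_causal(fn, seq_len: int = 16, trials: int = 20,
              tol: float = 1e-9, seed: int = 0) -> bool:
    """Perturb positions t..end and check that outputs before t are unchanged."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-1, 1) for _ in range(seq_len)]
        t = rng.randrange(1, seq_len)
        x_alt = x[:t] + [rng.uniform(-1, 1) for _ in range(seq_len - t)]
        y, y_alt = fn(x), fn(x_alt)
        if any(abs(a - b) > tol for a, b in zip(y[:t], y_alt[:t])):
            return False
    return True


running_mean_ok = is_causal(causal_running_mean)
global_mean_ok = is_causal(leaky_global_mean)
```

A randomized test like this can only reject non-causal blocks; passing it is evidence of causality, not a proof.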

An Auto-Tuner adjusts block depth/width to match target parameter scales and tunes gradient accumulation to avoid OOM (Appendix B.1.4; Table 7).
A runtime monitor terminates runs on issues like exploding gradients/loss or excessive step time (Appendix B.1.4; Table 8).
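A runtime monitor of this kind reduces to a few threshold checks per training step. The sketch below is only illustrative: the function name, the threshold values, and the reason strings are hypothetical, and the paper's actual limits (Table 8) are not reproduced here.

```python
import math
from typing import Optional


def should_terminate(loss: float, grad_norm: float, step_seconds: float, *,
                     max_loss: float = 20.0,
                     max_grad_norm: float = 1e3,
                     max_step_seconds: float = 30.0) -> Optional[str]:
    """Return a reason string if the run should stop, else None.

    All thresholds are placeholder values chosen for this example.
    """
    if not math.isfinite(loss) or loss > max_loss:
        return "exploding loss"
    if not math.isfinite(grad_norm) or grad_norm > max_grad_norm:
        return "exploding gradients"
    if step_seconds > max_step_seconds:
        return "step too slow"
    return None


healthy = should_terminate(loss=2.5, grad_norm=1.2, step_seconds=0.8)
diverged = should_terminate(loss=float("nan"), grad_norm=1.2, step_seconds=0.8)
```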

3.4.4 The discovery system: `Genesys` as distributed genetic programming with LLM agents

Core data structure: the evolution tree (Figure 4; Section 4.1)
- Each node stores:
  - executable code,
  - the GAU tree representation,
  - design traces (proposal/review artifacts),
  - empirical performance metrics across verified scales (Figure 4; Section 4.1).

Designer agents: proposal stage (Figures 6–7; Section 4.2)
- The proposal stage selects one or more parent designs from the evolution tree and retrieves relevant references from KE (Section 4.2).
- A proposer LLM drafts a structured research proposal describing a modification:
  - a unit mutation,
  - crossover (mixing units),
  - or from-scratch design (treated as mutating the root) (Section 4.2).
- A separate reviewer LLM adversarially critiques and scores the proposal; proposals iterate until the reviewer score exceeds a threshold (Section 4.2; Algorithm 1 in Appendix B.2 references this loop).
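The propose-and-review loop can be sketched with stub callables standing in for the two LLMs. The function name, the round cap, the feedback-passing convention, and the scores are assumptions for illustration; only the "iterate until the reviewer score exceeds a threshold" rule comes from the description above.

```python
from typing import Callable, Optional


def propose_until_accepted(
    propose: Callable[[str], str],
    review: Callable[[str], tuple[float, str]],
    threshold: float,
    max_rounds: int = 5,  # hypothetical cap so the loop always terminates
) -> tuple[Optional[str], int]:
    """Redraft a proposal until the reviewer's score exceeds `threshold`."""
    feedback = ""
    for round_idx in range(1, max_rounds + 1):
        proposal = propose(feedback)
        score, feedback = review(proposal)
        if score > threshold:
            return proposal, round_idx
    return None, max_rounds


# Deterministic stubs in place of the proposer and reviewer LLMs.
drafts = iter(["v1", "v2", "v3"])
stub_scores = {"v1": 2.0, "v2": 3.5, "v3": 4.5}


def stub_propose(feedback: str) -> str:
    return next(drafts)


def stub_review(proposal: str) -> tuple[float, str]:
    return stub_scores[proposal], f"improve on {proposal}"


accepted = propose_until_accepted(stub_propose, stub_review, threshold=4.0)
```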

Designer agents: implementation stage (unit-by-unit code generation) (Figure 8; Section 4.2)
- Implementation translates an accepted proposal into executable GAU code and then a GAB block.
- The key mechanism is incremental, recursive construction of the GAU tree:
  - Maintain an Unimplemented list of units.
  - Repeatedly pick one unit, plan its implementation (planner agent), write code (coder agent), then validate:
    - symbolic checker validates protocol compliance and runtime behavior,
    - observer LLM rates code quality, adherence, novelty; threshold is 3/5 (Section 4.2; Figure 8).
  - If checks fail, the system rolls back state and retries; accepted units are “frozen,” reducing future failure impact (Figure 8; Section 4.2).
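The control flow of this stage can be sketched as a retry loop that freezes each unit once it passes its checks. This sketch collapses the planner, coder, symbolic checker, and observer into two stub callables (`generate`, `check`); those names, the per-unit attempt cap, and the unit names are assumptions, not the paper's interfaces.

```python
from typing import Callable


def implement_units(
    unit_names: list[str],
    generate: Callable[[str, dict], str],
    check: Callable[[str, str], bool],
    max_attempts: int = 5,  # hypothetical per-unit retry cap
) -> tuple[dict, int]:
    """Implement units one at a time; accepted units are frozen and never redone."""
    frozen: dict[str, str] = {}
    unimplemented = list(unit_names)
    attempts = 0
    while unimplemented:
        name = unimplemented[0]
        for _ in range(max_attempts):
            attempts += 1
            code = generate(name, frozen)  # may read already-frozen units
            if check(name, code):
                frozen[name] = code
                unimplemented.pop(0)
                break
            # On failure the candidate is discarded (rollback); `frozen` is untouched.
        else:
            raise RuntimeError(f"could not implement unit {name!r}")
    return frozen, attempts


# Deterministic stubs: pretend the second draft of every unit passes all checks.
tries: dict[str, int] = {}


def stub_generate(name: str, frozen: dict) -> str:
    tries[name] = tries.get(name, 0) + 1
    return f"{name}-draft{tries[name]}"


def stub_check(name: str, code: str) -> bool:
    return code.endswith("draft2")


frozen_units, total_attempts = implement_units(
    ["RootUnit", "ChildA", "ChildB"], stub_generate, stub_check
)
```

A failure on the third unit costs one more attempt at that unit only. The first two units are not regenerated.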

The paper contrasts this with “direct prompting” (generate the entire artifact in one shot), which has expected attempts 1/p if success probability is p (Section 4.2; Figure 9; Appendix A.1).
It argues that unit-by-unit generation behaves like a Viterbi-style checkpointed search:
If an artifact requires many subparts to be correct simultaneously, the direct success probability becomes multiplicative (roughly ∏ p_k), so direct prompting needs roughly 1/∏ p_k expected attempts, while the checkpointed approach needs roughly Σ (1/p_k) attempts (Appendix A.1, Lemma 2 and Proposition 1; also explained in Section 4.2).
Appendix A.1 extends the argument to token-cost models (Appendix A.1.2) and suggests an additional advantage: the approach allows more “design tokens” (iterative refinement) than single-shot generation, which is hypothesized to improve quality (Appendix A.1.3).
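A short numeric illustration of the gap between the two expected-attempt counts. It assumes independent per-unit success probabilities and retry-until-success on each unit, which is a simplification of the appendix's setting; the probabilities are made up.

```python
import math


def expected_attempts_direct(p: list[float]) -> float:
    """All units must succeed in the same shot: 1 / prod(p_k)."""
    return 1.0 / math.prod(p)


def expected_attempts_checkpointed(p: list[float]) -> float:
    """Each unit is retried on its own until it passes: sum(1 / p_k)."""
    return sum(1.0 / pk for pk in p)


probs = [0.5] * 5  # five units, each valid half the time (made-up numbers)
direct = expected_attempts_direct(probs)
checkpointed = expected_attempts_checkpointed(probs)
print(direct, checkpointed)  # 32.0 10.0
```

With n units at a common success rate p, the direct cost grows like (1/p)^n while the checkpointed cost grows like n/p, which is the exponential-versus-linear contrast the appendix formalizes.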

Verifier agents and efficient evolution (Section 4.3)
- Genesys runs designers and verifiers in parallel (Section 4.3).
- The evolution tree is seeded with five architectures: GPT/Transformer, Mamba2, RetNet, RWKV6, and TTT (Section 4.3).
- Design selection policy uses two signals:
  - fitness = aggregated downstream performance (F),
  - confidence = number of scales verified (Figure 10; Section 4.3).
- Designs are bucketed into four quadrants (good/poor × confident/unconfident), and designers/verifiers choose quadrants differently to balance exploration vs exploitation (Figure 10; Section 4.3).
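The quadrant bucketing can be sketched as two threshold tests per design. The `Design` record, the cut-off values, and the example numbers are hypothetical; the summary above does not state how the paper sets the fitness and confidence boundaries.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    fitness: float        # aggregated downstream performance F
    scales_verified: int  # confidence signal


def quadrant(d: Design, fitness_cut: float, confidence_cut: int) -> str:
    quality = "good" if d.fitness >= fitness_cut else "poor"
    conf = "confident" if d.scales_verified >= confidence_cut else "unconfident"
    return f"{quality}/{conf}"


def bucket(designs: list[Design], fitness_cut: float,
           confidence_cut: int) -> dict[str, list[str]]:
    """Group design names by quadrant; cut-offs are caller-supplied assumptions."""
    out: dict[str, list[str]] = {}
    for d in designs:
        out.setdefault(quadrant(d, fitness_cut, confidence_cut), []).append(d.name)
    return out


population = [
    Design("a", 0.62, 4),
    Design("b", 0.61, 1),
    Design("c", 0.50, 3),
    Design("d", 0.48, 1),
]
quadrants = bucket(population, fitness_cut=0.55, confidence_cut=2)
```

The sketch stops at bucketing. Which quadrant a designer or verifier samples from, and with what probability, is left out because the text above only says the two roles choose differently.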

Compute budgeting: Ladder of Scales (Figure 11; Section 4.3)
- Full verification across all scales for all designs is too costly, so the system verifies many candidates at small scale and progressively fewer at larger scales (Figure 11; Section 4.3).
- The figure gives a concrete example budget pyramid:
  - ~1000 models at 14M params trained on ~0.7B tokens,
  - down to ~5 models at 350M params trained on ~50B tokens (Figure 11; Section 4.3).
- The verifier selects the lowest unverified scale with available budget, with budgets “released gradually” based on a target selection ratio across scales (Section 4.3; Algorithm 5 in Appendix B.2).
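A sketch of a budget pyramid and of the "lowest unverified scale with available budget" rule. The geometric interpolation between the two endpoints quoted above is an assumption (only the ~1000 and ~5 endpoints are given), the gradual release of budget is omitted, and the function names are hypothetical.

```python
from typing import Optional

SCALES = ["14M", "31M", "70M", "125M", "350M"]  # smallest to largest


def ladder_budgets(n_smallest: int, n_largest: int, n_scales: int) -> list[int]:
    """Per-scale model counts shrinking by a constant ratio (an assumed shape)."""
    ratio = (n_largest / n_smallest) ** (1 / (n_scales - 1))
    return [round(n_smallest * ratio ** i) for i in range(n_scales)]


def next_scale(verified: set[str], used: dict[str, int],
               budgets: dict[str, int]) -> Optional[str]:
    """Lowest scale this design has not been verified at that still has budget."""
    for scale in SCALES:
        if scale not in verified and used.get(scale, 0) < budgets[scale]:
            return scale
    return None


counts = ladder_budgets(1000, 5, len(SCALES))
budgets = dict(zip(SCALES, counts))
print(budgets)
```

Under this reading, a design already verified at 14M moves to 31M next, and a scale whose budget is exhausted is skipped in favor of the next one up; the paper's Algorithm 5 may handle that second case differently.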

3.4.5 Concrete configurations: training, data, hardware (as provided)

Training corpus used for verification (Appendix C.1.1; Table 9)
- The paper trains on a filtered dataset called SmolLM-1/8-Corpus, derived from SmolLM with subset filtering (Appendix C.1.1).
- Table 9 reports total size and mixture:
  - Total: 34.78B tokens (Train 33.94B; Test 0.42B; Eval 0.42B).
  - Major component: FineWeb-Edu 70% with 24.25B tokens.
- It describes removing verbatim overlaps from training relative to test/eval sets (Appendix C.1.1).

Verification training hyperparameters (Table 12)
- Context length: 2045 tokens (Table 12).
- Optimizer: AdamW (Table 12).
- LR schedule: Cosine with minimum LR rate 0.1, warmup ratio 0.02 (Table 12).
- Tokenizer: Llama-2-7b-hf (Table 12).
- Batch size: 0.5M tokens (Table 12).
- Learning rates by scale (Table 12):
  - 14M: 1e-3
  - 31M: 1e-3
  - 70M: 1e-3
  - 125M: 6e-4
  - (The excerpted Table 12 stops at 125M; 350M settings are not listed there, so I do not infer them.)
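One common reading of "cosine with minimum LR rate 0.1, warmup ratio 0.02" is linear warmup over the first 2% of steps followed by cosine decay to 10% of the peak rate. The sketch below implements that reading; the linear warmup shape, the interpretation of 0.1 as a fraction of peak, and the step count are assumptions, since the exact scheduler is not given here.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_ratio: float = 0.02, min_lr_ratio: float = 0.1) -> float:
    """Linear warmup, then cosine decay from peak_lr to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Example with the 125M peak rate from Table 12 and a made-up step count.
TOTAL = 1000
start_lr = lr_at(0, TOTAL, 6e-4)
peak_reached = lr_at(20, TOTAL, 6e-4)
final_lr = lr_at(TOTAL, TOTAL, 6e-4)
```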

Hardware environment (Appendix C.1.2)
- Main cluster: 10 machines:
  - 8 “V-Nodes” for verification (mixture of 8×A6000 48GB and 8×L40S 48GB machines),
  - 2 “D-Nodes” for design threads (3×A6000 48GB machines) (Appendix C.1.2).
- The system aims to maintain a design:verify thread ratio around 2:1 based on throughput analysis (Appendix C.1.2; Appendix E.4.2).

4. Key Insights and Innovations

A dedicated, tool-rich discovery environment (LMADE) for architecture discovery
LMADE combines a literature/code retrieval engine (Figure 2; Appendix B.1.3) with a verification stack (Table 1; Appendix B.1.4), making architecture discovery an end-to-end, measurable ASD task (Section 3.2; Figure 1).
Significance: this makes “autonomous discovery” verifiable and repeatable via standardized checks and training/eval automation.
Factorizing LM blocks into a GAU tree enables genetic programming over code
The system treats designs as compositional trees of units (Figures 3–5; Section 4.1), enabling mutation/crossover at the unit level rather than rewriting entire blocks.
The paper supports factorization formally via type-based decomposition arguments (Appendix A.2).
Significance: factorization is the core enabler for both scalable search (GP operators) and tractable code generation (unit-by-unit implementation).
Unit-by-unit “Viterbi-style” code generation with checkpointing improves validity and efficiency
The implementation pipeline explicitly freezes correct partial units and retries locally on failures (Figure 8; Section 4.2).
Formal analysis in Appendix A.1 argues that checkpointing yields exponential reductions in expected model calls relative to direct generation when correctness requires multiple constraints/subcomponents (Appendix A.1, Proposition 1; Figure 9).
Empirical support: the full pipeline produces valid code for 92% of proposals versus 6% for direct prompting (Table 4).
Budget-aware verification via Ladder of Scales
The system operationalizes scaling-law intuition: small-scale performance provides signal, allowing heavy filtering before expensive large-scale training (Figure 11; Section 4.3).
Significance: without this, verifying 1000+ candidates by pre-training at large scale would be computationally infeasible.
System-level selection strategy balancing fitness and confidence
The quadrant-based selector manages exploitation/exploration based on empirical performance and how many scales have been verified (Figure 10; Section 4.3).
Significance: it provides a concrete mechanism for coordinating distributed designers/verifiers under limited budgets.

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

Evolutionary/system experiments (RQ1, RQ2) focus on:
Fitness progress over generations (Figure 12; Section 5.1),
Stability metrics including Sharpe Ratio and Maximum Drawdown (Table 2; Section 5.1),
Verification-stage error rates (Table 3),
Code generation validity and cost metrics (Table 4; Section 5.2).
Downstream evaluation (RQ3):
Compare top 5 discovered designs vs the 5 human seed designs (GPT2, TTT, Mamba2, RWKV, RetNet) (Section 5.3).
Evaluated at 125M (trained on 25B tokens) and 350M (trained on 50B tokens) (Section 5.3; Tables 5 and 14).
Metrics are accuracy (%) over selected benchmarks (Tables 5 and 14).
The benchmark selection is described as based on informativeness at small scales, with further details referenced to Appendix E.3.1 and Table 16 (Section 5.3).

Main quantitative results (with numbers)

RQ1: Does GP-style evolution help, and which components matter? (Section 5.1)

Table 2 reports evolution metrics for the first 300 designs:
Full: Δ = 4.10%, Δ_max = 4.16%, SR = 69.0, volatility ν = 1.9, MDD = −0.38.
w/o Exp (no experimental verification feedback): Δ = 2.20%, SR = 26.3, MDD = −1.10.
w/o Lit (no literature): Δ = 3.37%, SR = 56.7, MDD = −0.62.
Base (no accumulation beyond the five seeds): Δ ≈ 0.01%, SR ≈ 0.2.
For extended runs:
First 500 designs: Full Δ = 5.87% vs w/o Exp Δ = 1.69% (Table 2).
First 1000 designs: Full Δ = 7.13%, Δ_max = 8.13%, SR = 26.5 (Table 2).
Figure 12 visualizes increasing mean fitness over generations and shows ablations lagging behind.
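For readers unfamiliar with the stability metrics, the sketch below computes a Sharpe-style ratio and a maximum drawdown on a fitness series using their textbook finance definitions. The paper's exact definitions (windowing, scaling, whether drawdown is absolute or relative) are not given in this summary, so these formulas and the example series are assumptions and will not reproduce the Table 2 values.

```python
from statistics import mean, pstdev


def sharpe_ratio(series: list[float]) -> float:
    """Mean step-to-step improvement divided by its standard deviation."""
    deltas = [b - a for a, b in zip(series, series[1:])]
    sd = pstdev(deltas)
    return mean(deltas) / sd if sd > 0 else float("inf")


def max_drawdown(series: list[float]) -> float:
    """Largest drop below the running maximum, reported as a non-positive number."""
    peak, worst = series[0], 0.0
    for v in series:
        peak = max(peak, v)
        worst = min(worst, v - peak)
    return worst


# Made-up mean-fitness trajectory over six generations.
trajectory = [50.0, 51.0, 50.5, 52.0, 51.0, 53.0]
sr = sharpe_ratio(trajectory)
mdd = max_drawdown(trajectory)
```

Under these definitions, a higher ratio means steadier improvement, and a drawdown closer to zero means the population's fitness rarely regresses.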

Verification error rates (Table 3; Section 5.1)
- Error rate during design verification/evaluation:
  - Full: 8.61%
  - w/o Exp: 27.31%
  - w/o Lit: 7.67%
  - Base: 21.09%
  - w/ Mem: 23.70%
- Interpretation grounded in the paper’s framing: removing experimental feedback increases downstream errors substantially, suggesting selection signals help avoid brittle designs (Table 3; Section 5.1).

RQ2: Does unit-based generation improve code validity vs direct prompting? (Section 5.2; Table 4)

Table 4 evaluates implementation success on 100 proposals:
Full (planner + coder + observer + symbolic checker + unit generation):
- Valid: 92%
- Attempts: 2.6 ± 1.1 (max 5)
- Token cost: 15.0 ± 18.5
- LFC (lines of function-body code): 181 ± 44
Direct prompting:
- Valid: 6%
- Attempts: 1.1 ± 0.2
- Token cost: 0.3 ± 0.3
- LFC: 49 ± 15
The table supports the paper’s claim that structured, checkpointed generation is critical for producing complex valid code (Table 4; Figure 9; Section 4.2).
Ablations isolate which components drive validity:
Removing symbolic checker (No SC) drops validity to 30% (Table 4).
Removing unit generation (No UG) drops validity to 73% (Table 4).
Removing planner or observer has smaller quantitative effects (valid 91% and 89%), though the paper argues qualitative benefits (Table 4; Section 5.2).

RQ3: Are discovered models competitive? (Section 5.3; Tables 5 and 14)

At 350M params, 50B tokens (Table 5):
Best average score among listed models is Geogate with 61.81, narrowly above GPT at 61.78 and Mamba2 at 61.45 (Table 5).
The paper also highlights “outperforming GPT2, Mamba2, etc., on 6/9 common benchmarks” for 350M scale (Abstract; echoed near Table 5 discussion).
At 125M params, 25B tokens (Table 14):
HMamba average: 58.39, slightly above Mamba2 average 58.37 and above GPT average 57.04 (Table 14).
The paper claims discovered designs achieve best results in 7/9 benchmarks at 125M and 6/9 at 350M (Section 5.3 discussion; Tables 14 and 5).

Do experiments support the claims?

Strong support for the code-generation bottleneck claim:
The gap between Full (92% valid) and Direct (6% valid) is extremely large and directly targets the stated bottleneck (Table 4; Section 5.2; Figure 9).
Moderate-to-strong support for “system components improve evolutionary progress”:
Evolution ablations show clear drops when removing experiment feedback or literature access (Table 2; Figure 12).
However, some metrics (e.g., SR scaling across different run lengths) are reported but not fully contextualized in the excerpt beyond definitions (Section 5.1). The directionality is consistent, but interpretation depends on trusting those metrics for this domain.
Competitive model performance is supported at small scales:
Tables 5 and 14 show that discovered models can match or slightly exceed seed baselines on averages and on subsets of tasks, with no single model dominating all tasks (Section 5.3).
The evidence is limited to:
- two parameter scales (125M, 350M),
- a selected set of tasks (9 tasks in Tables 5/14),
- and zero-shot evaluation as described (Section 5.3).
From the excerpt alone, this substantiates “competitive with known architectures” in the tested regime, but does not establish broad superiority beyond those settings.

Ablations, failure cases, robustness checks

System ablations: w/o Exp, w/o Lit, Base, w/ Mem are compared in evolution metrics (Table 2; Figure 12) and error rates (Table 3).
Designer pipeline ablations: No UG, No SC, No Pl, No Ob, Direct (Table 4).
Sensitivity analysis: the metrics’ sensitivity to population size and step size is analyzed (Figure 19; Appendix D.2).
Failure analysis:
Implementation/verification failure distributions and error type breakdowns are given in Appendix E.1.1 and E.2.2 (Figures 20–24, 44–45), indicating frequent failure modes like forward-pass errors and tensor shape issues (Appendix E.2.2).

6. Limitations and Trade-offs

Scaling limitations
The system’s verified discovery is constrained to 14M–350M parameters (Abstract; Section 6), and the paper explicitly flags limited compute as a barrier to billion-parameter-level discovery (Section 6).
Hardware-specific efficiency innovations are hard to integrate
The discussion notes difficulty incorporating “efficiency-focused innovations” such as FlashAttention due to complex, hardware-specific evaluation needs (Section 6). This implies the discovered architectures may not be optimized for real-world GPU kernels or peak throughput.
Reliance on correlation across scales (assumption behind Ladder-of-Scales)
Ladder-of-Scales is motivated by scaling-law intuition that performance correlates across scales (Section 4.3; Figure 11). The excerpt does not provide a direct quantified correlation analysis; it primarily uses the methodology as an operational assumption.
Evaluation scope
The main showcased benchmark comparisons use 9 tasks (Tables 5 and 14). The system also mentions evaluating on “29 selected LM-Eval benchmarks” in the verification engine (Section 3.2), but the headline results focus on the selected subset for small-scale informativeness (Section 5.3; Appendix E.3.1).
This creates a trade-off: faster iteration vs broader generalization guarantees.
Agent self-evaluation signals are imperfect
Proposal ratings correlate weakly with final fitness (Appendix E.1.1, Figure 36 shows correlation coefficient 0.04), meaning LLM reviewer scores are not reliable predictors among proposals that pass thresholds.
Non-trivial waste due to failures
Even with the structured pipeline, Appendix E.1.1 reports a non-negligible invalid design fraction (Figure 20 discussion: ~84.0% “Implemented & Verified,” leaving ~16.0% invalid), which is costly at scale.

7. Implications and Future Directions

How this changes the landscape
The work suggests that “autonomous discovery” becomes substantially more viable when the search space is:
- factorized into compositional units (GAU trees),
- paired with strong correctness tooling (symbolic + runtime checkers),
- and verified with a compute-aware scaling schedule (Figures 3–5, Table 1, Figure 11).
It reframes architecture discovery as a realistic proving ground for end-to-end agent systems because success is measurable (executable code + benchmark performance) (Section 3.1; Equation (1)).
Follow-up research enabled
Scaling up discovery beyond 350M parameters is a natural next step, but would require substantially more compute and/or improved filtering signals (Section 6).
Learning from feedback: the paper suggests reinforcing agents using feedback (possibly RL) to improve proposal/implementation quality (Section 6).
More adaptive selection: improving design/scale selection beyond the current quadrant heuristic is explicitly identified (Section 6; Figure 10).
Practical applications / downstream use cases
Rapid prototyping of novel LM blocks for small-to-mid-scale models where training budgets are limited, since the system’s verification regime is built around small-model methodology (Figure 11; Section 4.3).
Discovery of architectural variants tailored to specific task families, motivated by the unit–performance predictability analysis (Appendix E.3.3; Tables 17–18), though this is presented as analysis rather than an integrated optimization target.
Repro/Integration Guidance (based on provided paper details)
Prefer this approach over “direct prompting to generate full model code” when:
- code validity is a major bottleneck,
- the artifact can be factorized into compositional units with local checks (Figure 9; Table 4; Appendix A.1).
Prefer the Ladder-of-Scales verification strategy when:
- verification costs rise steeply with scale,
- you can accept the assumption that small-scale results help triage candidates (Figure 11; Section 4.3).
If integrating into another domain, the paper’s formalism suggests looking for problems that can be typed/lifted into Σ → Σ-style compositional programs to reuse the unit-tree + checkpointed generation framework (Appendix A.2).

Assumption note: This analysis is grounded only in the content you provided (including the included sections, figures, tables, and appendices excerpted here). Some referenced details (e.g., full prompt texts and all appendix experiments) are mentioned but not fully reproduced in the excerpt, so I did not infer missing numbers or settings beyond what is explicitly shown (e.g., Table 12 does not list 350M learning rate).