Towards a Science of Scaling Agent Systems

ArXiv: 2512.08296

🎯 Pitch

This paper derives quantitative scaling principles for LM-based agent systems, identifying how agent count, coordination topology, model capability, and task properties interact to predict when multi-agent coordination helps or hurts. By running 180 controlled, budget-matched configurations across four agentic benchmarks and five canonical architectures, it introduces measurable coordination metrics (efficiency, overhead, redundancy, error amplification) and a predictive model (cross-validated R^2 = 0.524) that gives principled guidance for selecting architectures. It shows that “more agents” is beneficial only for specific, decomposable tasks and can severely degrade performance on sequential, tool-heavy problems.


1. Executive Summary

This paper develops quantitative “scaling principles” for LM-based agent systems: rules that predict when moving from a Single-Agent System (SAS) to a Multi-Agent System (MAS) helps or hurts on genuinely agentic tasks that require multi-step interaction with tools/environments. Through a controlled study of 180 matched-budget configurations across 4 agentic benchmarks and 5 canonical architectures, it identifies measurable coordination dynamics (e.g., overhead, efficiency, redundancy, error amplification) that explain performance variation, and it fits a predictive model with cross-validated R^2 = 0.524 (Eq. (1), Table 3). The practical significance is that “more agents” is not a general win: benefits depend on task decomposability, tool intensity, and baseline single-agent strength, with some domains seeing large gains (e.g., +80.8% on Finance-Agent) and others large degradations (e.g., −39% to −70% on PlanCraft) (Figure 2, Section 4.2).


2. Context and Motivation

  • Problem / gap.
  • Agent systems (LLM-driven loops of reasoning → acting → feedback) are widely used, but design choices—especially whether/how to coordinate multiple agents—remain driven by heuristics rather than quantitative guidance (Introduction).
  • Prior work often reports MAS improvements, but the paper argues many evaluations are conducted on non-agentic tasks (static, single-shot benchmarks) where ensembles/voting can help without the coordination burdens that appear in interactive settings (Introduction, Section 2).

  • Why it matters.

  • Real deployments (e.g., browsing, finance analysis, workflow execution, planning) require sustained environment interaction, partial observability, and adaptive strategy updates. In these settings, coordination can introduce “coordination tax” (message passing + synchronization + lossy compression of state), and errors can cascade through action chains (Introduction; Section 3.2).

  • Prior approaches and shortcomings (as framed by this paper).

  • Many MAS claims implicitly assume monotonic improvement with team size (“more agents is all you need”), but the paper highlights counter-evidence: as base models get stronger, MAS gains can diminish; and interactive tasks create dynamics (overhead, fragmentation, error propagation) that static accuracy-only reporting misses (Introduction; Section 2).
  • Existing comparisons often confound architecture with prompts/tools/budgets, making it hard to attribute effects to topology itself (Introduction).

  • How this paper positions itself.

  • It introduces a more rigorous notion of agentic evaluation (three necessary properties) (Section 3.2).
  • It runs a controlled, matched-budget evaluation across architectures, tools, and prompt structures to isolate topology effects (Section 4.1; Appendix E).
  • It aims to move from categorical “architecture labels” to measurable coordination metrics that support prediction on unseen tasks/domains (Section 4.3; Table 3).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system studied is a family of LLM-based agent architectures that solve tasks by repeatedly calling tools and updating state from tool/environment feedback.
  • The paper’s “solution shape” is an empirical scaling framework: run many controlled SAS/MAS variants on interactive benchmarks, measure coordination behaviors from traces, and fit a predictive regression model that maps task + model + coordination properties to expected performance.

3.2 Big-picture architecture (diagram in words)

  • Inputs: a benchmark task instance + a fixed tool API set + an LLM (chosen from three model families) + an agent topology (SAS, Independent MAS, Centralized MAS, Decentralized MAS, Hybrid MAS).
  • Process (per instance):
  • Agent(s) read the task and decide a tool/action.
  • The environment returns an observation (tool output, webpage content, inventory state, etc.).
  • Agents update their history/memory and possibly send messages (MAS only).
  • An aggregation/orchestration mechanism (if present) synthesizes/coordinates outputs and decides termination.
  • Outputs: a final answer/action sequence and a success/accuracy outcome, plus trace-derived coordination metrics (overhead, efficiency, redundancy, etc.) (Section 4.1; Table 5).

3.3 Roadmap for the deep dive

  • First define what counts as an agent system and what distinguishes SAS vs MAS (Section 3.1).
  • Then define the task/benchmark notion of “agentic” and why interactive feedback changes scaling behavior (Section 3.2).
  • Next specify the five architectures/topologies and what mechanism each isolates (Section 3.1; Table 2).
  • Then describe the controlled experimental design (benchmarks, models, budget matching, metrics) (Section 4.1; Appendix E).
  • Finally explain the predictive scaling model and how its coefficients encode the paper’s scaling “principles” (Eq. (1), Table 4; Section 4.3).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + modeling paper whose core idea is to treat multi-agent coordination as a measurable process with quantifiable costs/benefits, and then learn a predictive relationship between those measurements and task performance.

3.4.1 Formal agent system definition (what is being evaluated)

  • The paper defines an agent system as S = (A, E, C, Ω) where A is a set of agents, E is a shared environment, C is a communication topology, and Ω is an orchestration policy (Section 3.1).
  • Each agent a_i is formalized as S_i = (Φ_i, A_i, M_i, π_i):
  • Φ_i is the reasoning policy (typically an LLM).
  • A_i is an action space consisting of tool calls ToolCall(t, θ).
  • M_i is internal memory.
  • π_i maps observation histories to actions (Section 3.1).
  • Interaction proceeds in timesteps: an agent selects action α_{i,t} = π_i(h_{i,t}), the environment returns observation o_{i,t} = E(α_{i,t}), and the history updates by appending (α_{i,t}, o_{i,t}), subject to context-window truncation at MAX_TOKENS (Section 3.1).
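
The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Agent` class, `toy_env`, `toy_policy`, and the `max_items` limit (a stand-in for truncation at MAX_TOKENS, counted in history entries rather than tokens) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Action = str
Observation = str
History = List[Tuple[Action, Observation]]


@dataclass
class Agent:
    """One agent: a policy pi_i applied to its observation history h_{i,t}."""
    policy: Callable[[History], Action]
    history: History = field(default_factory=list)
    max_items: int = 4  # hypothetical stand-in for MAX_TOKENS truncation

    def step(self, env: Callable[[Action], Observation]) -> Observation:
        action = self.policy(self.history)          # alpha_{i,t} = pi_i(h_{i,t})
        observation = env(action)                   # o_{i,t} = E(alpha_{i,t})
        self.history.append((action, observation))  # append (alpha, o) to history
        # Keep only the most recent entries, mimicking context-window truncation.
        self.history = self.history[-self.max_items:]
        return observation


def toy_env(action: Action) -> Observation:
    """Hypothetical environment: echoes the tool call it received."""
    return f"result of {action}"


def toy_policy(history: History) -> Action:
    """Hypothetical policy: labels each tool call with the visible history length."""
    return f"ToolCall(search, step={len(history)})"


agent = Agent(policy=toy_policy)
observations = [agent.step(toy_env) for _ in range(6)]
print(len(agent.history), observations[0])
```

Because the policy only sees the truncated history, anything that falls out of the window is lost to later steps, which is one reason long interactive tasks stress a single context stream.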

3.4.2 SAS vs MAS (mechanistic differences)

  • A Single-Agent System (SAS) has one “reasoning locus,” meaning all perception/planning/action is within one loop; it has zero inter-agent communication overhead and uses a single coherent memory stream (Section 3.1; Section 2 discussion on context integration).
  • A Multi-Agent System (MAS) has multiple LLM-backed agents that exchange messages, potentially improving exploration/diversity but imposing a coordination tax because global context must be compressed into messages (Introduction; Section 2; Section 3.1).

3.4.3 The five evaluated architectures (what each topology does)

The paper evaluates five canonical architectures (Section 4.1; Table 2):

  • SAS: one agent runs up to k reasoning iterations (O(k) LLM calls; no coordination overhead).
  • Independent MAS: multiple agents run in parallel but do not communicate with each other; outputs are combined by a synthesis-only aggregator.
  • The paper emphasizes the aggregator “concatenates sub-agent outputs without cross-validation or majority voting,” so any benefit would be from parallel exploration rather than explicit error correction (Section 3.1).
  • Centralized MAS: one orchestrator coordinates n sub-agents across r rounds (O(r n k)), creating a validation bottleneck at the orchestrator (Section 3.1; Table 2).
  • Decentralized MAS: n peers communicate in debate/consensus rounds d (O(d n k)), with all-to-all message potential and higher memory demands (Section 3.1; Table 2).
  • Hybrid MAS: combines an orchestrator plus some peer-to-peer communication, increasing protocol complexity and overhead (O(r n k + p n) as described) (Section 3.1; Table 2).

These topologies are framed as a “structural ablation” over two coordination dimensions: orchestrator presence (hierarchy) and peer communication (Section 4.1).
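
To get a feel for how the stated complexity terms grow, the sketch below evaluates only the leading-order expressions quoted above, using the Appendix E.2 settings reported later in this summary (SAS: 10 iterations; Centralized: 5 rounds, 3 sub-agents, 3 iterations; Decentralized: 3 rounds, 3 agents, 3 iterations). The function names are hypothetical, the Hybrid values for r and p are assumptions because this summary does not report them, and the results are upper-bound call counts rather than measured turns.

```python
def sas_calls(k: int) -> int:
    """O(k): one agent, up to k reasoning iterations."""
    return k


def centralized_calls(r: int, n: int, k: int) -> int:
    """O(r n k): r orchestration rounds, n sub-agents, k iterations each."""
    return r * n * k


def decentralized_calls(d: int, n: int, k: int) -> int:
    """O(d n k): d debate rounds, n peers, k iterations each."""
    return d * n * k


def hybrid_calls(r: int, n: int, k: int, p: int) -> int:
    """O(r n k + p n): centralized term plus p peer phases over n agents."""
    return r * n * k + p * n


growth = {
    "SAS": sas_calls(k=10),
    "Centralized": centralized_calls(r=5, n=3, k=3),
    "Decentralized": decentralized_calls(d=3, n=3, k=3),
    # r=5 and p=1 are assumed for illustration; the summary gives no Hybrid values.
    "Hybrid": hybrid_calls(r=5, n=3, k=3, p=1),
}
for name, calls in growth.items():
    print(f"{name}: up to ~{calls} LLM calls (leading-order term)")
```

These counts are not directly comparable to the budget-matched runs, where the paper equalizes total reasoning tokens rather than call counts.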

3.4.4 What counts as an “agentic” benchmark (and why the paper insists on it)

The paper defines a task as agentic if optimal interactive policy performance exceeds the best single-pass function by more than a threshold δ (Section 3.2). Operationally, it requires three necessary benchmark properties (Section 3.2):

  • Sequential Interdependence: later actions depend on earlier observations.
  • Partial Observability: important state must be acquired via interaction/tool use.
  • Adaptive Strategy Formation: the policy must update beliefs/strategy based on feedback.

This matters because static tasks can reward ensembling/voting, while interactive tasks expose overhead and cascading-error dynamics (Introduction; Section 2).

3.4.5 Experimental design (controlled evaluation to isolate topology)

  • Benchmarks (Table 1; Section 4.1; Appendix D):
  • BrowseComp-Plus (web browsing / multi-website info synthesis).
  • Finance-Agent (finance analysis tasks).
  • PlanCraft (Minecraft-style sequential planning/crafting).
  • Workbench (workplace workflows via function/tool calling).
  • Scale: 180 experiments/configurations (Section 4.1), spanning:
  • 5 architectures (SAS + 4 MAS),
  • 3 LLM families (OpenAI GPT-5 series; Google Gemini 2.x/2.5; Anthropic Claude Sonnet 3.7/4.0/4.5) (Section 4.1),
  • and multiple capability levels captured by an external “Intelligence Index” axis (Section 4.1; Appendix A).
  • Fairness control / budget matching:
  • The paper states all configurations are matched for total reasoning-token budget (mean ≈ 4,800 tokens per trial) to isolate architectural effects (Section 4.4; Table 5).
  • MAS gets parallel agent processing but “smaller per-agent iterations,” while SAS gets proportionally more rounds, aiming to equalize total compute (Section 4.1).
  • Implementation (Appendix E):
  • Unified model access via LiteLLM and orchestration/tool integration via LangChain.
  • Tools include web search, code execution (Python REPL), mathematical operations, and completion markers, with dataset-specific tool ecosystems (Appendix E.1).
  • Concrete agent-configuration parameters (Appendix E.2):
  • SAS: max 10 iterations per instance.
  • Independent MAS: 3 agents, synthesis-only coordination.
  • Centralized: 3 sub-agents + 1 orchestrator, max 5 orchestration rounds, 3 iterations per agent per round.
  • Decentralized: 3 agents, 3 debate rounds, 3 iterations per round.
  • Hybrid: centralized orchestration plus limited peer communication phases (Appendix E.2).
  • Evaluation sample sizes (Appendix E.4):
  • Finance-Agent: 50 instances.
  • BrowseComp-Plus: 100 instances.
  • Workbench: 100 instances.
  • PlanCraft: 100 instances.

3.4.6 Coordination/process metrics (what is measured beyond accuracy)

The paper argues final success alone is insufficient, so it instruments MAS/SAS traces with measurable coordination properties (Section 4.1):

  • Coordination overhead O% = (T_MAS − T_SAS)/T_SAS × 100%, where T is “turns” (reasoning–response exchanges) (Section 4.1; Table 5).
  • Message density c: inter-agent messages per reasoning turn (Section 4.1; Table 5).
  • Redundancy rate R: cosine similarity of agent output embeddings (agreement/overlap) (Section 4.1; Table 5).
  • Coordination efficiency E_c = S / (T/T_SAS): success normalized by relative turn count (Section 4.1).
  • Error amplification A_e = E_MAS / E_SAS: relative failure probability (Section 4.1; Table 5).
  • Token overlap structure labels tokens in rationales as unique/shared/contradictory, with contradiction detected via a BERTScore-based dissimilarity threshold (Section 4.1; Section 4.4).
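
As a sanity check on the overhead and efficiency definitions above, the following sketch recomputes O% and E_c from the mean success rates and turn counts that this summary quotes from Table 5. The function names are hypothetical. The recomputed values agree with the reported ones only up to rounding, since the inputs are themselves rounded means. The A_e helper is included for completeness, but its inputs (per-architecture error rates) are not listed in this summary, so it is not evaluated here.

```python
def overhead_pct(turns_mas: float, turns_sas: float) -> float:
    """O% = (T_MAS - T_SAS) / T_SAS * 100."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def coordination_efficiency(success: float, turns: float, turns_sas: float) -> float:
    """E_c = S / (T / T_SAS): success normalized by relative turn count."""
    return success / (turns / turns_sas)


def error_amplification(error_mas: float, error_sas: float) -> float:
    """A_e = E_MAS / E_SAS: relative failure probability."""
    return error_mas / error_sas


# (mean success S, mean turns T) per architecture, as quoted from Table 5.
TABLE_5 = {
    "SAS": (0.466, 7.2),
    "Independent": (0.370, 11.4),
    "Decentralized": (0.477, 26.1),
    "Centralized": (0.463, 27.7),
    "Hybrid": (0.452, 44.3),
}
T_SAS = TABLE_5["SAS"][1]

derived = {
    name: (overhead_pct(t, T_SAS), coordination_efficiency(s, t, T_SAS))
    for name, (s, t) in TABLE_5.items()
}
for name, (o_pct, e_c) in derived.items():
    print(f"{name}: O% = {o_pct:.1f}, E_c = {e_c:.3f}")
```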

It also defines an information gain measure ΔI (Appendix E.5; Eq. (2)–(3)) as posterior variance reduction for the success variable, estimated with Monte Carlo sampling:

  • K = 10 traces are sampled at temperature τ = 0.7, and variance is estimated via p(1−p) for binary outcomes.
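
The variance estimate behind this measure can be illustrated with a short sketch. Eq. (2)–(3) are not reproduced in this summary, so treating the reduction as "variance before minus variance after" is an assumed reading, and the two outcome lists are made-up examples of K = 10 binary trace outcomes, not data from the paper.

```python
from typing import Sequence


def bernoulli_variance(outcomes: Sequence[int]) -> float:
    """Estimate p(1 - p) from binary success outcomes of sampled traces."""
    p = sum(outcomes) / len(outcomes)
    return p * (1.0 - p)


def variance_reduction(before: Sequence[int], after: Sequence[int]) -> float:
    """Assumed form of the reduction: variance before minus variance after."""
    return bernoulli_variance(before) - bernoulli_variance(after)


# Hypothetical outcomes for K = 10 sampled traces (1 = success, 0 = failure).
before = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # p = 0.5, maximally uncertain
after = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]   # p = 0.9, far less uncertain

print(f"variance reduction: {variance_reduction(before, after):.3f}")
```

A positive value means the additional information made the success outcome more predictable; a value near zero means it changed little.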

3.4.7 The predictive scaling model (what it predicts and how)

  • The paper fits a regression-style “scaling principle” with main effects and mechanism-motivated interactions (Section 4.3).
  • Predictors include (Section 4.3; Eq. (1)):
  • Model capability I (Intelligence Index, mean-centered at Ī = 56.9 per Table 4),
  • Tool count T (log-transformed),
  • Agent count n_a (log-transformed),
  • Single-agent baseline performance P_SA,
  • Coordination metrics: O%, c, R, E_c, A_e (some log-transformed),
  • Selected interactions such as E_c × T, O% × T, and P_SA × log(1+n_a).
  • Model form: Eq. (1) in Section 4.3, with standardized predictors (μ=0, σ=1) and log transforms for skewed variables (O% spans 0–515%, T: 4–16, n_a: 1–4, A_e: 1.0–17.2) (Section 4.3).
  • Validation:
  • 5-fold cross-validation with experiment-level holdout yields R^2_CV = 0.524 ± 0.033, MAE = 0.089 ± 0.011, RMSE = 0.112 ± 0.014 (Section 4.3; Table 3).
  • Adding coordination metrics improves predictive power over using only architecture labels (R^2_CV 0.524 vs 0.430) (Table 3).
  • Worked micro-example (how the model is used in the paper’s decision logic).
  • The paper derives a coordination-saturation / baseline threshold from the interaction P_SA × log(1+n_a) (Table 4), stating that coordination yields diminishing or negative returns once the single-agent baseline exceeds about 0.45 (Section 4.3; also highlighted in Abstract/Conclusion).
  • Mechanistically, this is the “baseline paradox”: if SAS is already strong, the remaining headroom is small, and coordination overhead dominates (Section 4.3; Table 4 shows β̂ = −0.404, p < 0.001 for this interaction).
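
The preprocessing described for the scaling model (centering, log transforms, an interaction term, then standardization) can be sketched as below. This is an illustration under stated assumptions, not the paper's pipeline: the three rows are hypothetical, the field names are invented here, log(1+T) is an assumed form for the tool-count transform, and building the interaction before standardizing is also an assumption. No coefficients are fitted.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List

I_BAR = 56.9  # centering value for the Intelligence Index, as quoted from Table 4


def transform(row: Dict[str, float]) -> Dict[str, float]:
    """Build a few of the predictors named in the text from one raw row."""
    log_agents = math.log1p(row["n_agents"])  # log(1 + n_a)
    return {
        "I_centered": row["intelligence"] - I_BAR,
        "log_tools": math.log1p(row["n_tools"]),  # assumed log(1 + T) form
        "log_agents": log_agents,
        "P_SA": row["p_sa"],
        "P_SA_x_log_agents": row["p_sa"] * log_agents,  # interaction term
    }


def standardize(values: List[float]) -> List[float]:
    """Rescale a column to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical configurations, kept within the ranges quoted in the text.
rows = [
    {"intelligence": 48.0, "n_tools": 4, "n_agents": 1, "p_sa": 0.35},
    {"intelligence": 57.0, "n_tools": 8, "n_agents": 3, "p_sa": 0.57},
    {"intelligence": 66.0, "n_tools": 16, "n_agents": 4, "p_sa": 0.63},
]
features = [transform(r) for r in rows]
columns = {name: standardize([f[name] for f in features]) for name in features[0]}
for name, col in columns.items():
    print(name, [round(v, 3) for v in col])
```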

4. Key Insights and Innovations

  • (1) A rigorous operationalization of “agentic evaluation.”
  • Innovation: the paper does not treat every benchmark as valid for evaluating agents/MAS; it requires sequential interdependence, partial observability, and adaptive strategy formation (Section 3.2).
  • Significance: this directly targets a methodological failure mode—drawing MAS conclusions from static tasks where coordination costs are absent or irrelevant (Introduction; Section 2).

  • (2) A controlled, topology-focused evaluation across many configurations.

  • Innovation: it standardizes tools, prompt structures, and token budgets and varies mainly topology + model capability, across N=180 configurations (Abstract; Section 4.1).
  • Significance: this is meant to enable causal attribution to coordination structure rather than implementation confounds (Introduction; Section 4.1).

  • (3) Quantitative coordination metrics + predictive modeling outperform architecture labels.

  • Innovation: the main explanatory variables are process metrics (E_c, O%, c, R, A_e) measured from traces, not just “centralized vs decentralized” labels (Section 4.1; Table 5).
  • Significance: the best model reaches R^2_CV = 0.524 and beats an architecture-label model (0.430) (Table 3), suggesting the measurable properties capture transferable principles.

  • (4) Three dominant scaling effects (as summarized by the paper).

  • Tool–coordination trade-off: tool-heavy tasks disproportionately suffer from multi-agent overhead under fixed budgets; modeled via E_c × T with β̂ = −0.267, p < 0.001 (Section 4.3; Table 4).
  • Capability saturation / baseline paradox: coordination returns diminish or turn negative beyond a SAS baseline of about 45%, tied to P_SA × log(1+n_a) with β̂ = −0.404, p < 0.001 (Abstract; Section 4.3; Table 4).
  • Topology-dependent error amplification: Independent MAS amplifies errors dramatically (reported as 17.2× vs 4.4× centralized) (Abstract; Table 5; Section 4.4).

  • (5) Task-contingent topology preferences (not one MAS that wins everywhere).

  • Centralized is best on Finance-Agent with large gains (+80.8%) (Section 4.2; Figure 2).
  • Decentralized is best on BrowseComp-Plus with modest gains (+9.2%) while centralized is near-flat (+0.2%) (Section 4.2; Figure 2).
  • All MAS variants degrade on PlanCraft (−39% to −70%) (Section 4.2; Figure 2).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, setup)

  • Datasets/benchmarks: BrowseComp-Plus, Finance-Agent, PlanCraft, Workbench (Table 1; Appendix D).
  • Primary metric: success/accuracy depending on domain (Section 4.1).
  • Secondary/process metrics: overhead, efficiency, error amplification, redundancy, message density, token-overlap/contradiction, and information gain ΔI (Section 4.1; Appendix E.5).
  • Configurations: 5 architectures × multiple models across 3 families, totaling 180 experiments (Section 4.1).
  • Budget control: mean ≈ 4,800 tokens per trial matched across conditions (Section 4.4; Table 5).
  • Reliability: the paper reports inter-rater reliability for factual error rate validators with Cohen’s κ values (e.g., Finance Agent 0.91, Workbench 0.89, PlanCraft 0.87, BrowseComp-Plus 0.88) (Section 4.1).

Main quantitative results (with specific numbers)

Per-benchmark MAS vs SAS (Section 4.2; Figure 2):

  • Finance-Agent: strong MAS gains across topologies.
  • Centralized: 0.631 vs SAS 0.349 → +80.8% (Section 4.2).
  • Decentralized: 0.609 → +74.5%.
  • Hybrid: 0.604 → +73.1%.
  • BrowseComp-Plus: mixed; Independent is harmful, structured coordination yields small gains.
  • Decentralized: 0.347 vs SAS 0.318 → +9.2%.
  • Centralized: +0.2% (near-flat).
  • Independent: −35% (Figure 2 caption; Section 4.2 describes the modest gains from structured coordination and highlights Independent's underperformance).
  • Workbench: small effects.
  • Decentralized: 0.664 vs SAS 0.629 → +5.7%.
  • Centralized and Hybrid: around −1.2% (Section 4.2).
  • PlanCraft: consistent MAS degradation.
  • Independent: 0.170 vs SAS 0.568 → −70.1%.
  • Centralized: 0.282 → −50.3%.
  • Decentralized: 0.332 → −41.5%.
  • Hybrid: 0.346 → −39.1% (Section 4.2; Figure 2).
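
The percentages above are relative changes against the SAS score, (MAS − SAS) / SAS. The sketch below recomputes them from the success rates quoted in this list; the function and variable names are hypothetical. Small differences from the reported figures (at most about 0.15 points) are expected because the quoted success rates are rounded to three decimals.

```python
def relative_change_pct(mas: float, sas: float) -> float:
    """Relative change of a MAS score against the SAS baseline, in percent."""
    return (mas - sas) / sas * 100.0


# SAS baselines and MAS scores as quoted above (Section 4.2; Figure 2).
SAS = {
    "Finance-Agent": 0.349,
    "BrowseComp-Plus": 0.318,
    "Workbench": 0.629,
    "PlanCraft": 0.568,
}
# (benchmark, architecture, MAS score, reported relative change in percent)
RESULTS = [
    ("Finance-Agent", "Centralized", 0.631, 80.8),
    ("Finance-Agent", "Decentralized", 0.609, 74.5),
    ("Finance-Agent", "Hybrid", 0.604, 73.1),
    ("BrowseComp-Plus", "Decentralized", 0.347, 9.2),
    ("Workbench", "Decentralized", 0.664, 5.7),
    ("PlanCraft", "Independent", 0.170, -70.1),
    ("PlanCraft", "Centralized", 0.282, -50.3),
    ("PlanCraft", "Decentralized", 0.332, -41.5),
    ("PlanCraft", "Hybrid", 0.346, -39.1),
]

recomputed = [
    (bench, arch, relative_change_pct(score, SAS[bench]), reported)
    for bench, arch, score, reported in RESULTS
]
for bench, arch, value, reported in recomputed:
    print(f"{bench} / {arch}: {value:+.1f}% (reported {reported:+.1f}%)")
```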

Across-architecture coordination costs and dynamics (Table 5; Section 4.4):

  • Mean success rate S:
  • SAS 0.466, Independent 0.370, Decentralized 0.477, Centralized 0.463, Hybrid 0.452 (Table 5).
  • Mean turns T:
  • SAS 7.2 ± 2.1, Independent 11.4 ± 3.2, Decentralized 26.1 ± 7.5, Centralized 27.7 ± 8.1, Hybrid 44.3 ± 12.4 (Table 5).
  • Overhead O% relative to SAS:
  • Independent 58%, Decentralized 263%, Centralized 285%, Hybrid 515% (Table 5).
  • Efficiency E_c:
  • SAS 0.466, Independent 0.234, Decentralized 0.132, Centralized 0.120, Hybrid 0.074 (Table 5).
  • Error amplification A_e:
  • SAS 1.0, Independent 17.2, Decentralized 7.8, Centralized 4.4, Hybrid 5.1 (Table 5).

Predictive modeling results (Section 4.3; Table 3):

  • Full model with coordination metrics: R^2_train = 0.613, R^2_CV = 0.524, AIC = −201.2, 20 parameters (Table 3).
  • Baselines:
  • Intelligence + tools + agents only: R^2_CV = 0.283 (Table 3).
  • Architecture labels: R^2_CV = 0.430 (Table 3).

Key coefficient-based claims (Table 4):

  • E_c × T has β̂ = −0.267, p < 0.001 (tool–coordination trade-off).
  • P_SA × log(1+n_a) has β̂ = −0.404, p < 0.001 (baseline paradox/capability saturation).
  • O% × T has β̂ = −0.162, p < 0.001 (overhead compounds with tool complexity).
  • Intelligence main effect: β̂ = 0.171, p = 0.001; quadratic term not significant (Table 4).

Do the experiments support the claims?

  • Task-contingent coordination: The per-benchmark deltas (Figure 2) strongly support the claim that MAS benefits are not universal and can be negative, including large negative effects on PlanCraft.
  • Cost/overhead as a central driver: Table 5 shows drastic turn and overhead increases for MAS, especially Hybrid. This aligns with the paper’s narrative that under fixed budgets, coordination consumes reasoning capacity (Section 4.4).
  • Predictability via measurable coordination metrics: The improvement from R^2_CV 0.430 (architecture labels) to 0.524 (coordination metrics) (Table 3) supports the claim that process metrics carry additional explanatory power beyond topology categories.
  • Caveat: The paper reports dramatic architecture differences in error amplification (A_e), but in the regression, the main effect of log(1+A_e) and A_e × T are not significant (Table 4; Section 4.3). That internally complicates a simple story of “error amplification drives performance,” and the paper itself interprets error differences as being subsumed by other metrics (efficiency/overhead) in the multivariate model (Section 4.3).

Ablations / robustness checks / out-of-sample validation

  • Model comparison ablation: progressively adding predictors (Table 3) functions as an ablation showing what improves prediction.
  • Cross-validation: 5-fold CV with stable coefficient estimates and diagnostics (Section 4.3).
  • Out-of-sample validation: on GPT-5.2 (Intelligence Index 75), the model achieves MAE = 0.071 (Appendix B; Table 7) and validates 4/5 qualitative findings (Table 8), with a noted partial validation where decentralized vs centralized advantages converge at high capability (Table 8).
  • Important caveat in Appendix B: the paper notes SAS performance is systematically over-predicted out of training range (Table 9), suggesting calibration issues when extrapolating beyond the Intelligence Index range used to fit the model.

6. Limitations and Trade-offs

  • Limited architecture coverage.
  • Only five canonical topologies are tested (SAS + four MAS). Other orthogonal design dimensions (specialization, memory design, aggregation rules beyond “synthesis-only,” etc.) are explicitly out of scope (Section 3.1 footnote; Limitations Section 5).

  • Agent-count scaling is only partially explored.

  • The main controlled suite uses small teams (Appendix E.2 uses 3-agent MAS configurations), with additional exploration up to n_a ∈ {1,3,5,7,9} shown for some cases (Figure 5).
  • The paper finds turn count scales super-linearly with agent count (T = 2.72 × (n+0.5)^{1.724}, R^2=0.974) (Section 4.4), implying practical ceilings beyond ~3–4 agents under fixed budgets, but this is not fully mapped across all domains/models.

  • Prompt optimality is not pursued.

  • Prompts are held identical across conditions for experimental validity, but architecture- or model-specific prompt tuning could change scaling behavior (Limitations (iv)).

  • Benchmark coverage is diverse but still narrow.

  • Four benchmarks span finance, browsing, planning, and workplace tools, but may not represent embodied/multimodal or long-horizon temporal feedback settings (Limitations (v)).

  • Predictive model is only moderately explanatory and may not extrapolate cleanly.

  • R^2_CV = 0.524 means substantial variance remains unexplained (Section 4.3).
  • Out-of-range extrapolation shows calibration issues (Appendix B; Table 9), including over-prediction for SAS on GPT-5.2.

  • Economic/latency cost is a first-class trade-off.

  • MAS token efficiency is much worse than SAS in Table 5 (e.g., success per 1K tokens: SAS 67.7 vs Hybrid 13.6), and overhead reaches 515% for Hybrid (Table 5), which directly constrains practical deployment (Section 4.4; Limitations (vi)).
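
To make the super-linear turn-count fit quoted above from Section 4.4 concrete, the sketch below evaluates T = 2.72 × (n + 0.5)^1.724 at the agent counts explored in Figure 5. The function name is hypothetical, and the outputs are values of the fitted curve, not measured turn counts.

```python
def predicted_turns(n_agents: float) -> float:
    """Fitted turn count as quoted: T = 2.72 * (n + 0.5) ** 1.724."""
    return 2.72 * (n_agents + 0.5) ** 1.724


curve = {n: predicted_turns(n) for n in (1, 3, 5, 7, 9)}
for n, turns in curve.items():
    print(f"n = {n}: about {turns:.1f} turns")

# Super-linear growth: doubling the team more than doubles the predicted turns.
ratio = predicted_turns(6) / predicted_turns(3)
print(f"turns(6) / turns(3) = {ratio:.2f}")
```

Under a fixed token budget, this growth means each added agent leaves fewer tokens per turn, which is consistent with the practical ceiling of roughly 3 to 4 agents noted above.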

7. Implications and Future Directions

  • How this changes the field’s framing (within the paper’s scope).
  • It argues that “agent scaling” should be studied like an empirical science: not by counting agents, but by quantifying coordination costs and task properties that determine when coordination is net-positive (Abstract; Sections 4.2–4.3).
  • It pushes evaluation toward agentic benchmarks and away from static tasks as a proxy for agent performance (Section 3.2).

  • Practical applications / downstream use cases.

  • Architecture selection by task type (supported by Figure 2 and Section 4.3 examples):
    • Prefer SAS for sequential constraint-satisfaction tasks like PlanCraft, where all MAS variants degrade (−39% to −70%) (Figure 2; Section 4.2).
    • Prefer Centralized MAS for parallelizable analysis tasks like Finance-Agent, where centralized gains are largest (+80.8%) (Figure 2; Section 4.2).
    • Prefer Decentralized MAS for dynamic browsing/navigation where peer exploration and fusion helps modestly (+9.2%) and centralized is near-neutral (+0.2%) (Figure 2; Section 4.2).
  • Budget-aware design: Table 5 implies that if token/latency budgets are tight, Hybrid-style coordination can be dominated by overhead.

  • Repro/Integration Guidance (when to prefer what, per the paper).

  • Use MAS only when the task is decomposable into parallel subtasks and the single-agent baseline is below ~0.45, because the paper reports a capability/baseline saturation threshold beyond which coordination returns diminish (Abstract; Section 4.3).
  • Avoid MAS for tasks where decomposition is artificial (PlanCraft example in Section 4.2), because coordination messages consume budget without adding useful information.
  • Track measurable metrics like O%, E_c, and message density c from traces (Section 4.1; Table 5) and treat them as design targets, not just post-hoc analytics.

  • Follow-up research directions explicitly suggested by the paper (Limitations Section 5).

  • Explore whether larger collectives can overcome super-linear overhead via emergent specialization or self-organization, versus being fundamentally bottlenecked by communication (Limitations (i)).
  • Study heterogeneity beyond same-family scale differences, including mixing different model architectures or specialized fine-tunes (Limitations (ii); also Figure 4 analyzes heterogeneous-role effects within families).
  • Develop coordination protocols for tool-intensive tasks (tool scheduling, routing, hierarchical tool delegation), since tool–coordination interactions are a primary failure mode (Limitations (iii); Section 4.3).
  • Extend to more environments (embodied, multimodal, long-horizon feedback) to test generality of the scaling principles (Limitations (v)).
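
The Repro/Integration guidance above can be condensed into a coarse rule of thumb. The sketch below is only that: a two-condition heuristic based on decomposability and the roughly 0.45 single-agent baseline threshold. It is not the paper's regression model, the function name and the `decomposable` flag are hypothetical, and judging decomposability is left to the caller.

```python
def suggest_architecture(
    decomposable: bool,
    sas_baseline: float,
    threshold: float = 0.45,  # approximate saturation threshold quoted from the paper
) -> str:
    """Return "MAS" only when both conditions from the guidance hold, else "SAS"."""
    if not decomposable:
        # Artificial decomposition spends budget on messages without new information.
        return "SAS"
    if sas_baseline >= threshold:
        # A strong single agent leaves little headroom for coordination gains.
        return "SAS"
    return "MAS"


# SAS baselines quoted in Section 5; the decomposability flags are this sketch's reading.
print(suggest_architecture(decomposable=False, sas_baseline=0.568))  # PlanCraft-like
print(suggest_architecture(decomposable=True, sas_baseline=0.349))   # Finance-Agent-like
```

Which MAS topology to choose once the rule returns "MAS" still depends on the task, as the per-benchmark results show, so the heuristic is a first filter rather than a full selection procedure.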