Towards a Science of Scaling Agent Systems

🎯 Pitch

This paper derives quantitative scaling principles for LM-based agent systems, identifying how agent count, coordination topology, model capability, and task properties interact to predict when multi-agent coordination helps or hurts. By running 180 controlled, budget-matched configurations across four agentic benchmarks and five canonical architectures, it introduces measurable coordination metrics (efficiency, overhead, redundancy, error amplification) and a predictive model (cross-validated R^2 = 0.524) that gives principled guidance for selecting architectures. The headline finding is that “more agents” is beneficial only for specific, decomposable tasks and can severely degrade performance for sequential, tool-heavy problems.

1. Executive Summary

This paper develops quantitative “scaling principles” for LM-based agent systems: rules that predict when moving from a Single-Agent System (SAS) to a Multi-Agent System (MAS) helps or hurts on genuinely agentic tasks that require multi-step interaction with tools/environments. Through a controlled study of 180 matched-budget configurations across 4 agentic benchmarks and 5 canonical architectures, it identifies measurable coordination dynamics (e.g., overhead, efficiency, redundancy, error amplification) that explain performance variation, and it fits a predictive model with cross-validated R^2 = 0.524 (Eq. (1), Table 3). The practical significance is that “more agents” is not a general win: benefits depend on task decomposability, tool intensity, and baseline single-agent strength, with some domains seeing large gains (e.g., +80.8% on Finance-Agent) and others large degradations (e.g., −39% to −70% on PlanCraft) (Figure 2, Section 4.2).

2. Context and Motivation

Problem / gap.
Agent systems (LLM-driven loops of reasoning → acting → feedback) are widely used, but design choices—especially whether/how to coordinate multiple agents—remain driven by heuristics rather than quantitative guidance (Introduction).
Prior work often reports MAS improvements, but the paper argues many evaluations are conducted on non-agentic tasks (static, single-shot benchmarks) where ensembles/voting can help without the coordination burdens that appear in interactive settings (Introduction, Section 2).
Why it matters.
Real deployments (e.g., browsing, finance analysis, workflow execution, planning) require sustained environment interaction, partial observability, and adaptive strategy updates. In these settings, coordination can introduce a “coordination tax” (message passing + synchronization + lossy compression of state), and errors can cascade through action chains (Introduction; Section 3.2).
Prior approaches and shortcomings (as framed by this paper).
Many MAS claims implicitly assume monotonic improvement with team size (“more agents is all you need”), but the paper highlights counter-evidence: as base models get stronger, MAS gains can diminish; and interactive tasks create dynamics (overhead, fragmentation, error propagation) that static accuracy-only reporting misses (Introduction; Section 2).
Existing comparisons often confound architecture with prompts/tools/budgets, making it hard to attribute effects to topology itself (Introduction).
How this paper positions itself.
It introduces a more rigorous notion of agentic evaluation (three necessary properties) (Section 3.2).
It runs a controlled, matched-budget evaluation across architectures, tools, and prompt structures to isolate topology effects (Section 4.1; Appendix E).
It aims to move from categorical “architecture labels” to measurable coordination metrics that support prediction on unseen tasks/domains (Section 4.3; Table 3).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system studied is a family of LLM-based agent architectures that solve tasks by repeatedly calling tools and updating state from tool/environment feedback.
The paper’s “solution shape” is an empirical scaling framework: run many controlled SAS/MAS variants on interactive benchmarks, measure coordination behaviors from traces, and fit a predictive regression model that maps task + model + coordination properties to expected performance.

3.2 Big-picture architecture (diagram in words)

Inputs: a benchmark task instance + a fixed tool API set + an LLM (chosen from three model families) + an agent topology (SAS, Independent MAS, Centralized MAS, Decentralized MAS, Hybrid MAS).
Process (per instance):
Agent(s) read the task and decide a tool/action.
The environment returns an observation (tool output, webpage content, inventory state, etc.).
Agents update their history/memory and possibly send messages (MAS only).
An aggregation/orchestration mechanism (if present) synthesizes/coordinates outputs and decides termination.
Outputs: a final answer/action sequence and a success/accuracy outcome, plus trace-derived coordination metrics (overhead, efficiency, redundancy, etc.) (Section 4.1; Table 5).

3.3 Roadmap for the deep dive

First define what counts as an agent system and what distinguishes SAS vs MAS (Section 3.1).
Then define the task/benchmark notion of “agentic” and why interactive feedback changes scaling behavior (Section 3.2).
Next specify the five architectures/topologies and what mechanism each isolates (Section 3.1; Table 2).
Then describe the controlled experimental design (benchmarks, models, budget matching, metrics) (Section 4.1; Appendix E).
Finally explain the predictive scaling model and how its coefficients encode the paper’s scaling “principles” (Eq. (1), Table 4; Section 4.3).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + modeling paper whose core idea is to treat multi-agent coordination as a measurable process with quantifiable costs/benefits, and then learn a predictive relationship between those measurements and task performance.

3.4.1 Formal agent system definition (what is being evaluated)

The paper defines an agent system as S = (A, E, C, Ω) where A is a set of agents, E is a shared environment, C is a communication topology, and Ω is an orchestration policy (Section 3.1).
Each agent a_i is formalized as S_i = (Φ_i, A_i, M_i, π_i):
Φ_i is the reasoning policy (typically an LLM).
A_i is an action space consisting of tool calls ToolCall(t, θ).
M_i is internal memory.
π_i maps observation histories to actions (Section 3.1).
Interaction proceeds in timesteps: an agent selects action α_{i,t} = π_i(h_{i,t}), the environment returns observation o_{i,t} = E(α_{i,t}), and the history updates by appending (α_{i,t}, o_{i,t}), subject to context-window truncation at MAX_TOKENS (Section 3.1).
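The timestep loop above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Agent` class, `toy_policy`, `toy_env`, the word-based token count, and the drop-oldest truncation rule are all assumptions introduced here, and `MAX_TOKENS = 20` is an arbitrary small value chosen so the truncation is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # one (action, observation) pair

MAX_TOKENS = 20  # hypothetical budget; "tokens" are whitespace-separated words here


def history_size(history: List[Step]) -> int:
    """Crude token count: words across every action and observation."""
    return sum(len(a.split()) + len(o.split()) for a, o in history)


@dataclass
class Agent:
    policy: Callable[[List[Step]], str]  # pi_i: observation history -> action
    history: List[Step] = field(default_factory=list)

    def step(self, env: Callable[[str], str]) -> Step:
        action = self.policy(self.history)   # alpha_t = pi(h_t)
        observation = env(action)            # o_t = E(alpha_t)
        self.history.append((action, observation))
        # Assumed truncation rule: drop the oldest steps until the budget fits.
        while len(self.history) > 1 and history_size(self.history) > MAX_TOKENS:
            self.history.pop(0)
        return action, observation


def toy_policy(history: List[Step]) -> str:
    """Stand-in for an LLM policy; emits a fake tool call."""
    return f"lookup item {len(history)}"


def toy_env(action: str) -> str:
    """Stand-in for a tool/environment; echoes the action back."""
    return f"result for {action}"


agent = Agent(policy=toy_policy)
for _ in range(5):
    agent.step(toy_env)
```

Each toy step costs 8 words, so only the two most recent steps survive under the 20-word budget. A real system would count model tokens and might summarize rather than drop old context.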

3.4.2 SAS vs MAS (mechanistic differences)

A Single-Agent System (SAS) has one “reasoning locus,” meaning all perception/planning/action is within one loop; it has zero inter-agent communication overhead and uses a single coherent memory stream (Section 3.1; Section 2 discussion on context integration).
A Multi-Agent System (MAS) has multiple LLM-backed agents that exchange messages, potentially improving exploration/diversity but imposing a coordination tax because global context must be compressed into messages (Introduction; Section 2; Section 3.1).

3.4.3 The five evaluated architectures (what each topology does)

The paper evaluates five canonical architectures (Section 4.1; Table 2):

SAS: one agent runs up to k reasoning iterations (O(k) LLM calls; no coordination overhead).
Independent MAS: multiple agents run in parallel but do not communicate with each other; outputs are combined by a synthesis-only aggregator.
The paper emphasizes the aggregator “concatenates sub-agent outputs without cross-validation or majority voting,” so any benefit would be from parallel exploration rather than explicit error correction (Section 3.1).
Centralized MAS: one orchestrator coordinates n sub-agents across r rounds (O(r n k)), creating a validation bottleneck at the orchestrator (Section 3.1; Table 2).
Decentralized MAS: n peers communicate over d debate/consensus rounds (O(d n k)), with potentially all-to-all messaging and higher memory demands (Section 3.1; Table 2).
Hybrid MAS: combines an orchestrator plus some peer-to-peer communication, increasing protocol complexity and overhead (O(r n k + p n) as described) (Section 3.1; Table 2).

These topologies are framed as a “structural ablation” over two coordination dimensions: orchestrator presence (hierarchy) and peer communication (Section 4.1).

3.4.4 What counts as an “agentic” benchmark (and why the paper insists on it)

The paper defines a task as agentic if optimal interactive policy performance exceeds the best single-pass function by more than a threshold δ (Section 3.2). Operationally, it requires three necessary benchmark properties (Section 3.2):

Sequential Interdependence: later actions depend on earlier observations.
Partial Observability: important state must be acquired via interaction/tool use.
Adaptive Strategy Formation: the policy must update beliefs/strategy based on feedback.

This matters because static tasks can reward ensembling/voting, while interactive tasks expose overhead and cascading-error dynamics (Introduction; Section 2).

3.4.5 Experimental design (controlled evaluation to isolate topology)

Benchmarks (Table 1; Section 4.1; Appendix D):
BrowseComp-Plus (web browsing / multi-website info synthesis).
Finance-Agent (finance analysis tasks).
PlanCraft (Minecraft-style sequential planning/crafting).
Workbench (workplace workflows via function/tool calling).
Scale: 180 experiments/configurations (Section 4.1), spanning:
5 architectures (SAS + 4 MAS),
3 LLM families (OpenAI GPT-5 series; Google Gemini 2.x/2.5; Anthropic Claude Sonnet 3.7/4.0/4.5) (Section 4.1),
and multiple capability levels captured by an external “Intelligence Index” axis (Section 4.1; Appendix A).
Fairness control / budget matching:
The paper states all configurations are matched for total reasoning-token budget (mean ≈ 4,800 tokens per trial) to isolate architectural effects (Section 4.4; Table 5).
MAS gets parallel agent processing but “smaller per-agent iterations,” while SAS gets proportionally more rounds, aiming to equalize total compute (Section 4.1).
Implementation (Appendix E):
Unified model access via LiteLLM and orchestration/tool integration via LangChain.
Tools include web search, code execution (Python REPL), mathematical operations, and completion markers, with dataset-specific tool ecosystems (Appendix E.1).
Concrete agent-configuration parameters (Appendix E.2):
SAS: max 10 iterations per instance.
Independent MAS: 3 agents, synthesis-only coordination.
Centralized: 3 sub-agents + 1 orchestrator, max 5 orchestration rounds, 3 iterations per agent per round.
Decentralized: 3 agents, 3 debate rounds, 3 iterations per round.
Hybrid: centralized orchestration plus limited peer communication phases (Appendix E.2).
Evaluation sample sizes (Appendix E.4):
Finance-Agent: 50 instances.
BrowseComp-Plus: 100 instances.
Workbench: 100 instances.
PlanCraft: 100 instances.
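To make the agent-configuration parameters concrete, the sketch below plugs the Appendix E.2 values into the leading-order call-count expressions from Section 3.4.3. It is a rough illustration only: the function names are made up here, the results are upper bounds implied by the "max" settings, and orchestrator/aggregator calls are ignored. Independent and Hybrid are left out because this summary does not give their per-agent iteration counts or define the peer term.

```python
def sas_calls(k: int) -> int:
    """O(k): one agent, up to k reasoning iterations."""
    return k


def centralized_calls(r: int, n: int, k: int) -> int:
    """O(r n k): r orchestration rounds, n sub-agents, k iterations each."""
    return r * n * k


def decentralized_calls(d: int, n: int, k: int) -> int:
    """O(d n k): d debate rounds, n peers, k iterations each."""
    return d * n * k


# Values quoted from Appendix E.2 in the list above.
max_calls = {
    "SAS": sas_calls(k=10),
    "Centralized": centralized_calls(r=5, n=3, k=3),
    "Decentralized": decentralized_calls(d=3, n=3, k=3),
}

for name, calls in max_calls.items():
    print(f"{name}: up to {calls} sub-agent LLM calls")
```

These counts are not the matched quantity in the study; the paper matches the total reasoning-token budget, so more calls implies fewer tokens per call.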

3.4.6 Coordination/process metrics (what is measured beyond accuracy)

The paper argues final success alone is insufficient, so it instruments MAS/SAS traces with measurable coordination properties (Section 4.1):

Coordination overhead O% = (T_MAS − T_SAS)/T_SAS × 100%, where T is “turns” (reasoning–response exchanges) (Section 4.1; Table 5).
Message density c: inter-agent messages per reasoning turn (Section 4.1; Table 5).
Redundancy rate R: cosine similarity of agent output embeddings (agreement/overlap) (Section 4.1; Table 5).
Coordination efficiency E_c = S / (T/T_SAS): success normalized by relative turn count (Section 4.1).
Error amplification A_e = E_MAS / E_SAS: relative failure probability (Section 4.1; Table 5).
Token overlap structure labels tokens in rationales as unique/shared/contradictory, with contradiction detected via a BERTScore-based dissimilarity threshold (Section 4.1; Section 4.4).
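The ratio-style metrics above are simple enough to compute directly from per-architecture aggregates. The following sketch uses hypothetical function names, and it assumes that redundancy is the mean pairwise cosine similarity across agents (the summary only says "cosine similarity of agent output embeddings"). The turn counts and success rate in the example are the Independent MAS and SAS means reported in Table 5.

```python
import math
from itertools import combinations
from typing import List, Sequence


def coordination_overhead(turns_mas: float, turns_sas: float) -> float:
    """O% = (T_MAS - T_SAS) / T_SAS * 100."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def coordination_efficiency(success: float, turns: float, turns_sas: float) -> float:
    """E_c = S / (T / T_SAS): success normalized by relative turn count."""
    return success / (turns / turns_sas)


def error_amplification(error_mas: float, error_sas: float) -> float:
    """A_e = E_MAS / E_SAS: relative failure probability."""
    return error_mas / error_sas


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def redundancy_rate(embeddings: List[Sequence[float]]) -> float:
    """Assumed aggregation: mean cosine similarity over all agent pairs."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Independent MAS vs SAS, using the mean values reported in Table 5.
overhead = coordination_overhead(turns_mas=11.4, turns_sas=7.2)
efficiency = coordination_efficiency(success=0.370, turns=11.4, turns_sas=7.2)
print(f"O% = {overhead:.0f}%, E_c = {efficiency:.3f}")
```

With those inputs the overhead and efficiency come out at roughly 58% and 0.234, matching the Independent MAS row of Table 5.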

It also defines an information gain measure ΔI (Appendix E.5; Eq. (2)–(3)) as posterior variance reduction for the success variable, estimated by Monte Carlo sampling: K = 10 traces are sampled at temperature τ = 0.7, and the variance of the binary outcome is estimated as p(1−p).
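As a rough illustration of the variance-reduction idea, the sketch below estimates p(1−p) from K = 10 binary outcomes and takes the difference between a "before" and an "after" sample. The exact form of Eq. (2)–(3) is not reproduced in this summary, so the subtraction, the function names, and the outcome lists are all assumptions made for illustration rather than the paper's definition.

```python
from typing import Sequence


def bernoulli_variance(outcomes: Sequence[int]) -> float:
    """Variance of a binary success variable, estimated as p(1 - p)."""
    p = sum(outcomes) / len(outcomes)
    return p * (1.0 - p)


def information_gain(before: Sequence[int], after: Sequence[int]) -> float:
    """Assumed form: variance before minus variance after."""
    return bernoulli_variance(before) - bernoulli_variance(after)


# Made-up outcomes for K = 10 sampled traces (1 = success, 0 = failure).
before = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]  # p = 0.5, maximally uncertain
after = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]   # p = 0.9, much less uncertain

gain = information_gain(before, after)
print(f"delta_I = {gain:.2f}")
```

A positive value means the outcome became more predictable; a value near zero means the extra information barely changed the uncertainty about success.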

3.4.7 The predictive scaling model (what it predicts and how)

The paper fits a regression-style “scaling principle” with main effects and mechanism-motivated interactions (Section 4.3).
Predictors include (Section 4.3; Eq. (1)):
Model capability I (Intelligence Index, mean-centered at Ī = 56.9 per Table 4),
Tool count T (log-transformed),
Agent count n_a (log-transformed),
Single-agent baseline performance P_SA,
Coordination metrics: O%, c, R, E_c, A_e (some log-transformed),
Selected interactions such as E_c × T, O% × T, and P_SA × log(1+n_a).
Model form: Eq. (1) in Section 4.3, with standardized predictors (μ=0, σ=1) and log transforms for skewed variables (O% spans 0–515%, T: 4–16, n_a: 1–4, A_e: 1.0–17.2) (Section 4.3).
Validation:
5-fold cross-validation with experiment-level holdout yields R^2_CV = 0.524 ± 0.033, MAE = 0.089 ± 0.011, RMSE = 0.112 ± 0.014 (Section 4.3; Table 3).
Adding coordination metrics improves predictive power over using only architecture labels (R^2_CV 0.524 vs 0.430) (Table 3).
Worked micro-example (how the model is used in the paper’s decision logic).
The paper derives a coordination-saturation / baseline threshold from the interaction P_SA × log(1+n_a) (Table 4), stating that coordination yields diminishing or negative returns once the single-agent baseline exceeds approximately 0.45 (Section 4.3; also highlighted in Abstract/Conclusion).
Mechanistically, this is the “baseline paradox”: if SAS is already strong, the remaining headroom is small, and coordination overhead dominates (Section 4.3; Table 4 shows β̂ = −0.404, p < 0.001 for this interaction).
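The preprocessing and the sign of the baseline interaction can be sketched as follows. This is a simplified illustration under stated assumptions: `standardize`, `interaction_feature`, and `interaction_contribution` are hypothetical helpers, the interaction is evaluated on raw rather than standardized values, and every other term of Eq. (1) is ignored. The magnitudes are therefore not the paper's predictions; only the direction of the effect is meaningful.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence

BETA_INTERACTION = -0.404  # coefficient for P_SA x log(1 + n_a) quoted from Table 4


def standardize(values: Sequence[float]) -> List[float]:
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def interaction_feature(p_sa: float, n_agents: int) -> float:
    """Raw-scale interaction term P_SA * log(1 + n_a)."""
    return p_sa * math.log(1 + n_agents)


def interaction_contribution(p_sa: float, n_agents: int) -> float:
    """Contribution of this one term alone, all other terms ignored."""
    return BETA_INTERACTION * interaction_feature(p_sa, n_agents)


# Hypothetical weak and strong single-agent baselines, 3-agent team.
weak = interaction_contribution(p_sa=0.30, n_agents=3)
strong = interaction_contribution(p_sa=0.60, n_agents=3)
print(f"weak baseline: {weak:.3f}, strong baseline: {strong:.3f}")
```

Because the coefficient is negative, the penalty grows with both the single-agent baseline and the agent count, which is the direction the "baseline paradox" describes.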

4. Key Insights and Innovations

(1) A rigorous operationalization of “agentic evaluation.”
Innovation: the paper does not accept just any benchmark as valid for evaluating agents/MAS; it requires sequential interdependence, partial observability, and adaptive strategy formation (Section 3.2).
Significance: this directly targets a methodological failure mode—drawing MAS conclusions from static tasks where coordination costs are absent or irrelevant (Introduction; Section 2).
(2) A controlled, topology-focused evaluation across many configurations.
Innovation: it standardizes tools, prompt structures, and token budgets and varies mainly topology + model capability, across N=180 configurations (Abstract; Section 4.1).
Significance: this is meant to enable causal attribution to coordination structure rather than implementation confounds (Introduction; Section 4.1).
(3) Quantitative coordination metrics + predictive modeling outperform architecture labels.
Innovation: the main explanatory variables are process metrics (E_c, O%, c, R, A_e) measured from traces, not just “centralized vs decentralized” labels (Section 4.1; Table 5).
Significance: the best model reaches R^2_CV = 0.524 and beats an architecture-label model (0.430) (Table 3), suggesting the measurable properties capture transferable principles.
(4) Three dominant scaling effects (as summarized by the paper).
Tool–coordination trade-off: tool-heavy tasks disproportionately suffer from multi-agent overhead under fixed budgets; modeled via E_c × T with β̂ = −0.267, p < 0.001 (Section 4.3; Table 4).
Capability saturation / baseline paradox: coordination returns diminish or turn negative beyond a SAS baseline of roughly 45%, tied to P_SA × log(1+n_a) with β̂ = −0.404, p < 0.001 (Abstract; Section 4.3; Table 4).
Topology-dependent error amplification: Independent MAS amplifies errors dramatically (reported as 17.2× vs 4.4× centralized) (Abstract; Table 5; Section 4.4).
(5) Task-contingent topology preferences (not one MAS that wins everywhere).
Centralized is best on Finance-Agent with large gains (+80.8%) (Section 4.2; Figure 2).
Decentralized is best on BrowseComp-Plus with modest gains (+9.2%) while centralized is near-flat (+0.2%) (Section 4.2; Figure 2).
All MAS variants degrade on PlanCraft (−39% to −70%) (Section 4.2; Figure 2).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, setup)

Datasets/benchmarks: BrowseComp-Plus, Finance-Agent, PlanCraft, Workbench (Table 1; Appendix D).
Primary metric: success/accuracy depending on domain (Section 4.1).
Secondary/process metrics: overhead, efficiency, error amplification, redundancy, message density, token-overlap/contradiction, and information gain ΔI (Section 4.1; Appendix E.5).
Configurations: 5 architectures × multiple models across 3 families, totaling 180 experiments (Section 4.1).
Budget control: mean ≈ 4,800 tokens per trial matched across conditions (Section 4.4; Table 5).
Reliability: the paper reports inter-rater reliability for factual error rate validators with Cohen’s κ values (e.g., Finance Agent 0.91, Workbench 0.89, PlanCraft 0.87, BrowseComp-Plus 0.88) (Section 4.1).

Main quantitative results (with specific numbers)

Per-benchmark MAS vs SAS (Section 4.2; Figure 2):

Finance-Agent: strong MAS gains across topologies.
Centralized: 0.631 vs SAS 0.349 → +80.8% (Section 4.2).
Decentralized: 0.609 → +74.5%.
Hybrid: 0.604 → +73.1%.
BrowseComp-Plus: mixed; Independent is harmful, structured coordination yields small gains.
Decentralized: 0.347 vs SAS 0.318 → +9.2%.
Centralized: +0.2% (near-flat).
Independent: −35% (Figure 2 caption; Section 4.2 also highlights Independent's underperformance alongside the modest gains from structured coordination).
Workbench: small effects.
Decentralized: 0.664 vs SAS 0.629 → +5.7%.
Centralized and Hybrid: around −1.2% (Section 4.2).
PlanCraft: consistent MAS degradation.
Independent: 0.170 vs SAS 0.568 → −70.1%.
Centralized: 0.282 → −50.3%.
Decentralized: 0.332 → −41.5%.
Hybrid: 0.346 → −39.1% (Section 4.2; Figure 2).
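The percentage deltas above are relative changes against the SAS score, which can be checked with a short script. The `relative_change` helper is a name introduced here; the scores are the ones listed above for Finance-Agent and PlanCraft.

```python
def relative_change(mas_score: float, sas_score: float) -> float:
    """Percent change of a MAS score relative to the SAS baseline."""
    return (mas_score - sas_score) / sas_score * 100.0


# (SAS baseline, {architecture: MAS score}) as listed in this section.
results = {
    "Finance-Agent": (0.349, {"Centralized": 0.631, "Decentralized": 0.609, "Hybrid": 0.604}),
    "PlanCraft": (0.568, {"Independent": 0.170, "Centralized": 0.282,
                          "Decentralized": 0.332, "Hybrid": 0.346}),
}

for benchmark, (sas, mas_scores) in results.items():
    for arch, score in mas_scores.items():
        print(f"{benchmark} / {arch}: {relative_change(score, sas):+.1f}%")
```

Recomputing from the rounded three-decimal scores can differ from the reported percentages by about 0.1 points (for example, PlanCraft Centralized comes out near −50.4% rather than −50.3%), presumably because the paper computes deltas from unrounded values.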

Across-architecture coordination costs and dynamics (Table 5; Section 4.4):

Mean success rate S:
SAS 0.466, Independent 0.370, Decentralized 0.477, Centralized 0.463, Hybrid 0.452 (Table 5).
Mean turns T:
SAS 7.2 ± 2.1, Independent 11.4 ± 3.2, Decentralized 26.1 ± 7.5, Centralized 27.7 ± 8.1, Hybrid 44.3 ± 12.4 (Table 5).
Overhead O% relative to SAS:
Independent 58%, Decentralized 263%, Centralized 285%, Hybrid 515% (Table 5).
Efficiency E_c:
SAS 0.466, Independent 0.234, Decentralized 0.132, Centralized 0.120, Hybrid 0.074 (Table 5).
Error amplification A_e:
SAS 1.0, Independent 17.2, Decentralized 7.8, Centralized 4.4, Hybrid 5.1 (Table 5).

Predictive modeling results (Section 4.3; Table 3):

Full model with coordination metrics: R^2_train = 0.613, R^2_CV = 0.524, AIC = −201.2, 20 parameters (Table 3).
Baselines:
Intelligence + tools + agents only: R^2_CV = 0.283 (Table 3).
+ architecture labels: R^2_CV = 0.430 (Table 3).

Key coefficient-based claims (Table 4):

E_c × T has β̂ = −0.267, p < 0.001 (tool–coordination trade-off).
P_SA × log(1+n_a) has β̂ = −0.404, p < 0.001 (baseline paradox/capability saturation).
O% × T has β̂ = −0.162, p < 0.001 (overhead compounds with tool complexity).
Intelligence main effect: β̂ = 0.171, p = 0.001; quadratic term not significant (Table 4).

Do the experiments support the claims?

Task-contingent coordination: The per-benchmark deltas (Figure 2) strongly support the claim that MAS benefits are not universal and can be negative, including large negative effects on PlanCraft.
Cost/overhead as a central driver: Table 5 shows drastic turn and overhead increases for MAS, especially Hybrid. This aligns with the paper’s narrative that under fixed budgets, coordination consumes reasoning capacity (Section 4.4).
Predictability via measurable coordination metrics: The improvement from R^2_CV 0.430 (architecture labels) to 0.524 (coordination metrics) (Table 3) supports the claim that process metrics carry additional explanatory power beyond topology categories.
Caveat: The paper reports dramatic architecture differences in error amplification (A_e), but in the regression, neither the main effect of log(1+A_e) nor the A_e × T interaction is significant (Table 4; Section 4.3). This complicates a simple story of “error amplification drives performance,” and the paper itself interprets error differences as being subsumed by other metrics (efficiency/overhead) in the multivariate model (Section 4.3).

Ablations / robustness checks / out-of-sample validation

Model comparison ablation: progressively adding predictors (Table 3) functions as an ablation showing what improves prediction.
Cross-validation: 5-fold CV with stable coefficient estimates and diagnostics (Section 4.3).
Out-of-sample validation: on GPT-5.2 (Intelligence Index 75), the model attains MAE = 0.071 (Appendix B; Table 7) and validates 4/5 qualitative findings (Table 8), with a noted partial validation where decentralized vs centralized advantages converge at high capability (Table 8).
Important caveat in Appendix B: the paper notes SAS performance is systematically over-predicted outside the training range (Table 9), suggesting calibration issues when extrapolating beyond the Intelligence Index range used to fit the model.

6. Limitations and Trade-offs

Limited architecture coverage.
Only five canonical topologies are tested (SAS + four MAS). Other orthogonal design dimensions (specialization, memory design, aggregation rules beyond “synthesis-only,” etc.) are explicitly out of scope (Section 3.1 footnote; Limitations Section 5).
Agent-count scaling is only partially explored.
The main controlled suite uses small teams (Appendix E.2 uses 3-agent MAS configurations), with additional exploration over n_a ∈ {1,3,5,7,9} shown for some cases (Figure 5).
The paper finds turn count scales super-linearly with agent count (T = 2.72 × (n+0.5)^{1.724}, R^2=0.974) (Section 4.4), implying a practical ceiling of roughly 3–4 agents under fixed budgets, but this is not fully mapped across all domains/models.
Prompt optimality is not pursued.
Prompts are held identical across conditions for experimental validity, but architecture- or model-specific prompt tuning could change scaling behavior (Limitations (iv)).
Benchmark coverage is diverse but still narrow.
Four benchmarks span finance, browsing, planning, and workplace tools, but may not represent embodied/multimodal or long-horizon temporal feedback settings (Limitations (v)).
Predictive model is only moderately explanatory and may not extrapolate cleanly.
R^2_CV = 0.524 means substantial variance remains unexplained (Section 4.3).
Out-of-range extrapolation shows calibration issues (Appendix B; Table 9), including over-prediction for SAS on GPT-5.2.
Economic/latency cost is a first-class trade-off.
MAS token efficiency is much worse than SAS in Table 5 (e.g., success per 1K tokens: SAS 67.7 vs Hybrid 13.6), and overhead reaches 515% for Hybrid (Table 5), which directly constrains practical deployment (Section 4.4; Limitations (vi)).
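The super-linear turn-count fit quoted in the agent-count item above, T = 2.72 × (n+0.5)^{1.724}, can be evaluated directly to see how quickly turns grow with team size. The function name is an assumption; the constants are the ones quoted from Section 4.4, and the agent counts are the ones listed for Figure 5.

```python
def predicted_turns(n_agents: float) -> float:
    """Fitted power law quoted from Section 4.4: T = 2.72 * (n + 0.5) ** 1.724."""
    return 2.72 * (n_agents + 0.5) ** 1.724


for n in (1, 3, 5, 7, 9):
    print(f"n = {n}: about {predicted_turns(n):.1f} turns")

# Tripling the team from 3 to 9 agents multiplies turns by more than 3.
growth = predicted_turns(9) / predicted_turns(3)
```

Because the exponent is above 1, each added agent costs more turns than the previous one, which is why a fixed token budget is exhausted quickly as teams grow.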

7. Implications and Future Directions

How this changes the field’s framing (within the paper’s scope).
It argues that “agent scaling” should be studied like an empirical science: not by counting agents, but by quantifying coordination costs and task properties that determine when coordination is net-positive (Abstract; Sections 4.2–4.3).
It pushes evaluation toward agentic benchmarks and away from static tasks as a proxy for agent performance (Section 3.2).
Practical applications / downstream use cases.
Architecture selection by task type (supported by Figure 2 and Section 4.3 examples):
Prefer SAS for sequential constraint-satisfaction tasks like PlanCraft, where all MAS variants degrade (−39% to −70%) (Figure 2; Section 4.2).
Prefer Centralized MAS for parallelizable analysis tasks like Finance-Agent, where centralized gains are largest (+80.8%) (Figure 2; Section 4.2).
Prefer Decentralized MAS for dynamic browsing/navigation where peer exploration and fusion helps modestly (+9.2%) and centralized is near-neutral (+0.2%) (Figure 2; Section 4.2).
Budget-aware design: Table 5 implies that if token/latency budgets are tight, Hybrid-style coordination can be dominated by overhead.
Repro/Integration Guidance (when to prefer what, per the paper).
Use MAS only when the task is decomposable into parallel subtasks and the single-agent baseline is below ~0.45, because the paper reports a capability/baseline saturation threshold beyond which coordination returns diminish (Abstract; Section 4.3).
Avoid MAS for tasks where decomposition is artificial (PlanCraft example in Section 4.2), because coordination messages consume budget without adding useful information.
Track measurable metrics like O%, E_c, and message density c from traces (Section 4.1; Table 5) and treat them as design targets, not just post-hoc analytics.
Follow-up research directions explicitly suggested by the paper (Limitations Section 5).
Explore whether larger collectives can overcome super-linear overhead via emergent specialization or self-organization, versus being fundamentally bottlenecked by communication (Limitations (i)).
Study heterogeneity beyond same-family scale differences, including mixing different model architectures or specialized fine-tunes (Limitations (ii); also Figure 4 analyzes heterogeneous-role effects within families).
Develop coordination protocols for tool-intensive tasks (tool scheduling, routing, hierarchical tool delegation), since tool–coordination interactions are a primary failure mode (Limitations (iii); Section 4.3).
Extend to more environments (embodied, multimodal, long-horizon feedback) to test generality of the scaling principles (Limitations (v)).
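The architecture-selection guidance in this section can be condensed into a small rule-of-thumb function. This is only one reading of the summary's advice, not the paper's predictive model: the function, the `task_kind` labels, and the boolean `decomposable` flag are hypothetical, and the 0.45 threshold is the approximate saturation point quoted above.

```python
SATURATION_BASELINE = 0.45  # approximate SAS baseline threshold quoted in this summary


def suggest_architecture(decomposable: bool, sas_baseline: float, task_kind: str) -> str:
    """Rule-of-thumb topology choice; task_kind labels are illustrative only."""
    # Strong single-agent baseline or artificial decomposition: stay single-agent.
    if not decomposable or sas_baseline >= SATURATION_BASELINE:
        return "SAS"
    if task_kind == "parallel_analysis":   # e.g., Finance-Agent-style tasks
        return "Centralized MAS"
    if task_kind == "dynamic_browsing":    # e.g., BrowseComp-Plus-style tasks
        return "Decentralized MAS"
    return "SAS"  # conservative default when the task type is unknown


# SAS baselines taken from the per-benchmark results in Section 5.
print(suggest_architecture(True, 0.349, "parallel_analysis"))
print(suggest_architecture(False, 0.568, "sequential_planning"))
print(suggest_architecture(True, 0.318, "dynamic_browsing"))
```

A heuristic like this is a starting point for a pilot run; measuring O%, E_c, and message density on real traces remains the more reliable basis for the final choice.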