Towards a Science of Scaling Agent Systems
ArXiv: 2512.08296
🎯 Pitch
This paper derives quantitative scaling principles for LM-based agent systems, identifying how agent count, coordination topology, model capability, and task properties interact to predict when multi-agent coordination helps or hurts. By running 180 controlled, budget-matched configurations across four agentic benchmarks and five canonical architectures, it introduces measurable coordination metrics (efficiency, overhead, redundancy, error amplification) and a predictive model (R^2=0.524) that gives principled guidance for selecting architectures, showing that "more agents" is beneficial only for specific, decomposable tasks and can severely degrade performance for sequential, tool-heavy problems.
1. Executive Summary (2-3 sentences)
This paper develops quantitative "scaling principles" for LM-based agent systems: rules that predict when moving from a Single-Agent System (SAS) to a Multi-Agent System (MAS) helps or hurts on genuinely agentic tasks that require multi-step interaction with tools/environments. Through a controlled study of 180 matched-budget configurations across 4 agentic benchmarks and 5 canonical architectures, it identifies measurable coordination dynamics (e.g., overhead, efficiency, redundancy, error amplification) that explain performance variation and fits a predictive model with cross-validated R^2 = 0.524 (Eq. (1), Table 3). The practical significance is that "more agents" is not a general win: benefits depend on task decomposability, tool intensity, and baseline single-agent strength, with some domains seeing large gains (e.g., +80.8% on Finance-Agent) and others large degradations (e.g., −39% to −70% on PlanCraft) (Figure 2, Section 4.2).
2. Context and Motivation
- Problem / gap.
  - Agent systems (LLM-driven loops of reasoning → acting → feedback) are widely used, but design choices, especially whether and how to coordinate multiple agents, remain driven by heuristics rather than quantitative guidance (Introduction).
  - Prior work often reports MAS improvements, but the paper argues many evaluations are conducted on non-agentic tasks (static, single-shot benchmarks) where ensembles/voting can help without the coordination burdens that appear in interactive settings (Introduction, Section 2).
- Why it matters.
  - Real deployments (e.g., browsing, finance analysis, workflow execution, planning) require sustained environment interaction, partial observability, and adaptive strategy updates. In these settings, coordination can introduce a "coordination tax" (message passing + synchronization + lossy compression of state), and errors can cascade through action chains (Introduction; Section 3.2).
- Prior approaches and shortcomings (as framed by this paper).
  - Many MAS claims implicitly assume monotonic improvement with team size ("more agents is all you need"), but the paper highlights counter-evidence: as base models get stronger, MAS gains can diminish; and interactive tasks create dynamics (overhead, fragmentation, error propagation) that static accuracy-only reporting misses (Introduction; Section 2).
  - Existing comparisons often confound architecture with prompts/tools/budgets, making it hard to attribute effects to topology itself (Introduction).
- How this paper positions itself.
  - It introduces a more rigorous notion of agentic evaluation (three necessary properties) (Section 3.2).
  - It runs a controlled, matched-budget evaluation across architectures, tools, and prompt structures to isolate topology effects (Section 4.1; Appendix E).
  - It aims to move from categorical "architecture labels" to measurable coordination metrics that support prediction on unseen tasks/domains (Section 4.3; Table 3).
3. Technical Approach
3.1 Reader orientation (approachable technical breakdown)
- The system studied is a family of LLM-based agent architectures that solve tasks by repeatedly calling tools and updating state from tool/environment feedback.
- The paper's "solution shape" is an empirical scaling framework: run many controlled SAS/MAS variants on interactive benchmarks, measure coordination behaviors from traces, and fit a predictive regression model that maps task + model + coordination properties to expected performance.
3.2 Big-picture architecture (diagram in words)
- Inputs: a benchmark task instance + a fixed tool API set + an LLM (chosen from three model families) + an agent topology (SAS, Independent MAS, Centralized MAS, Decentralized MAS, Hybrid MAS).
- Process (per instance):
  - Agent(s) read the task and decide a tool/action.
  - The environment returns an observation (tool output, webpage content, inventory state, etc.).
  - Agents update their history/memory and possibly send messages (MAS only).
  - An aggregation/orchestration mechanism (if present) synthesizes/coordinates outputs and decides termination.
- Outputs: a final answer/action sequence and a success/accuracy outcome, plus trace-derived coordination metrics (overhead, efficiency, redundancy, etc.) (Section 4.1; Table 5).
3.3 Roadmap for the deep dive
- First define what counts as an agent system and what distinguishes SAS vs MAS (Section 3.1).
- Then define the task/benchmark notion of "agentic" and why interactive feedback changes scaling behavior (Section 3.2).
- Next specify the five architectures/topologies and what mechanism each isolates (Section 3.1; Table 2).
- Then describe the controlled experimental design (benchmarks, models, budget matching, metrics) (Section 4.1; Appendix E).
- Finally explain the predictive scaling model and how its coefficients encode the paper's scaling "principles" (Eq. (1), Table 4; Section 4.3).
3.4 Detailed, sentence-based technical breakdown
This is an empirical systems + modeling paper whose core idea is to treat multi-agent coordination as a measurable process with quantifiable costs/benefits, and then learn a predictive relationship between those measurements and task performance.
3.4.1 Formal agent system definition (what is being evaluated)
- The paper defines an agent system as `S = (A, E, C, Ω)` where `A` is a set of agents, `E` is a shared environment, `C` is a communication topology, and `Ω` is an orchestration policy (Section 3.1).
- Each agent `a_i` is formalized as `S_i = (Φ_i, A_i, M_i, π_i)`:
  - `Φ_i` is the reasoning policy (typically an LLM).
  - `A_i` is an action space consisting of tool calls `ToolCall(t, θ)`.
  - `M_i` is internal memory.
  - `π_i` maps observation histories to actions (Section 3.1).
- Interaction proceeds in timesteps: an agent selects action `α_{i,t} = π_i(h_{i,t})`, the environment returns observation `o_{i,t} = E(α_{i,t})`, and the history updates by appending `(α_{i,t}, o_{i,t})`, subject to context-window truncation at `MAX_TOKENS` (Section 3.1).
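
The interaction loop described above can be sketched in a few lines of Python. This is an illustrative toy, not the paper's implementation: the `Agent` class, `step`, `history_size`, the word-count proxy for tokens, and the value chosen for `MAX_TOKENS` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical context budget; tokens are approximated by whitespace-separated words.
MAX_TOKENS = 50

History = List[Tuple[str, str]]  # sequence of (action, observation) pairs


@dataclass
class Agent:
    """Toy stand-in for S_i: a policy pi_i plus a memory/history h_{i,t}."""
    policy: Callable[[History], str]
    history: History = field(default_factory=list)


def history_size(history: History) -> int:
    """Approximate token count of a history by counting words."""
    return sum(len(a.split()) + len(o.split()) for a, o in history)


def step(agent: Agent, environment: Callable[[str], str]) -> Tuple[str, str]:
    """One timestep: select an action, observe, append, then truncate."""
    action = agent.policy(agent.history)          # alpha_{i,t} = pi_i(h_{i,t})
    observation = environment(action)             # o_{i,t} = E(alpha_{i,t})
    agent.history.append((action, observation))
    # Drop the oldest entries until the history fits the context budget.
    while len(agent.history) > 1 and history_size(agent.history) > MAX_TOKENS:
        agent.history.pop(0)
    return action, observation


def toy_policy(history: History) -> str:
    """Hypothetical policy: always issue a search call tagged with history length."""
    return f"ToolCall(search, query={len(history)})"


def toy_environment(action: str) -> str:
    """Hypothetical environment: echo the action back as an observation."""
    return f"result of {action}"
```

Dropping the oldest entries first is only one possible truncation rule; the summary states that truncation happens at `MAX_TOKENS` but not how.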
3.4.2 SAS vs MAS (mechanistic differences)
- A `Single-Agent System (SAS)` has one "reasoning locus," meaning all perception/planning/action is within one loop; it has zero inter-agent communication overhead and uses a single coherent memory stream (Section 3.1; Section 2 discussion on context integration).
- A `Multi-Agent System (MAS)` has multiple LLM-backed agents that exchange messages, potentially improving exploration/diversity but imposing a coordination tax because global context must be compressed into messages (Introduction; Section 2; Section 3.1).
3.4.3 The five evaluated architectures (what each topology does)
The paper evaluates five canonical architectures (Section 4.1; Table 2):
- SAS: one agent runs up to `k` reasoning iterations (`O(k)` LLM calls; no coordination overhead).
- Independent MAS: multiple agents run in parallel but do not communicate with each other; outputs are combined by a synthesis-only aggregator.
  - The paper emphasizes the aggregator "concatenates sub-agent outputs without cross-validation or majority voting," so any benefit would be from parallel exploration rather than explicit error correction (Section 3.1).
- Centralized MAS: one orchestrator coordinates `n` sub-agents across `r` rounds (`O(r n k)`), creating a validation bottleneck at the orchestrator (Section 3.1; Table 2).
- Decentralized MAS: `n` peers communicate in `d` debate/consensus rounds (`O(d n k)`), with all-to-all message potential and higher memory demands (Section 3.1; Table 2).
- Hybrid MAS: combines an orchestrator plus some peer-to-peer communication, increasing protocol complexity and overhead (`O(r n k + p n)` as described) (Section 3.1; Table 2).

These topologies are framed as a "structural ablation" over two coordination dimensions: orchestrator presence (hierarchy) and peer communication (Section 4.1).
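
To make the call-count expressions in the list above concrete, here is a minimal sketch that evaluates their leading-order terms. The function name `llm_calls` is hypothetical, the `n * k` count for Independent MAS is an assumption (the summary gives no expression for it), and `p` is read as the number of peer-communication phases, which is also an assumption.

```python
def llm_calls(topology: str, k: int, n: int = 1, r: int = 1, d: int = 1, p: int = 0) -> int:
    """Leading-order LLM call count per task instance for each topology.

    k: reasoning iterations per agent (per round, where rounds apply)
    n: number of (sub-)agents
    r: orchestration rounds, d: debate rounds
    p: peer-communication phases (assumed meaning of the p in O(r n k + p n))
    """
    if topology == "sas":
        return k
    if topology == "independent":
        return n * k  # assumption: n agents each run k iterations, no messaging
    if topology == "centralized":
        return r * n * k
    if topology == "decentralized":
        return d * n * k
    if topology == "hybrid":
        return r * n * k + p * n
    raise ValueError(f"unknown topology: {topology}")


# Settings quoted from Appendix E.2 in Section 3.4.5 of this summary.
print("SAS:", llm_calls("sas", k=10))
print("Centralized:", llm_calls("centralized", k=3, n=3, r=5))
print("Decentralized:", llm_calls("decentralized", k=3, n=3, d=3))
```

These are upper bounds that ignore the orchestrator's or aggregator's own calls, so they indicate relative scale rather than exact budgets.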
3.4.4 What counts as an "agentic" benchmark (and why the paper insists on it)
The paper defines a task as agentic if optimal interactive policy performance exceeds the best single-pass function by more than a threshold δ (Section 3.2). Operationally, it requires three necessary benchmark properties (Section 3.2):
- `Sequential Interdependence`: later actions depend on earlier observations.
- `Partial Observability`: important state must be acquired via interaction/tool use.
- `Adaptive Strategy Formation`: the policy must update beliefs/strategy based on feedback.

This matters because static tasks can reward ensembling/voting, while interactive tasks expose overhead and cascading-error dynamics (Introduction; Section 2).
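
One way to write the agentic criterion above in symbols is sketched below. The notation is assumed for illustration and may differ from the paper's: $\Pi$ is the set of interactive policies, $\mathcal{F}$ the set of single-pass functions, and $V(\cdot)$ the expected task performance.

```latex
\max_{\pi \in \Pi} V(\pi) \;-\; \max_{f \in \mathcal{F}} V(f) \;>\; \delta
```

Read this way, a static benchmark fails the test because the best single-pass function already performs about as well as any interactive policy.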
3.4.5 Experimental design (controlled evaluation to isolate topology)
- Benchmarks (Table 1; Section 4.1; Appendix D):
  - `BrowseComp-Plus` (web browsing / multi-website info synthesis).
  - `Finance-Agent` (finance analysis tasks).
  - `PlanCraft` (Minecraft-style sequential planning/crafting).
  - `Workbench` (workplace workflows via function/tool calling).
- Scale: `180` experiments/configurations (Section 4.1), spanning:
  - `5` architectures (SAS + 4 MAS),
  - `3` LLM families (OpenAI GPT-5 series; Google Gemini 2.x/2.5; Anthropic Claude Sonnet 3.7/4.0/4.5) (Section 4.1),
  - and multiple capability levels captured by an external "Intelligence Index" axis (Section 4.1; Appendix A).
- Fairness control / budget matching:
  - The paper states all configurations are matched for total reasoning-token budget (mean `≈ 4,800` tokens per trial) to isolate architectural effects (Section 4.4; Table 5).
  - MAS gets parallel agent processing but "smaller per-agent iterations," while SAS gets proportionally more rounds, aiming to equalize total compute (Section 4.1).
- Implementation (Appendix E):
  - Unified model access via `LiteLLM` and orchestration/tool integration via `LangChain`.
  - Tools include web search, code execution (Python REPL), mathematical operations, and completion markers, with dataset-specific tool ecosystems (Appendix E.1).
- Concrete agent-configuration parameters (Appendix E.2):
  - SAS: max `10` iterations per instance.
  - Independent MAS: `3` agents, synthesis-only coordination.
  - Centralized: `3` sub-agents + `1` orchestrator, max `5` orchestration rounds, `3` iterations per agent per round.
  - Decentralized: `3` agents, `3` debate rounds, `3` iterations per round.
  - Hybrid: centralized orchestration plus limited peer communication phases (Appendix E.2).
- Evaluation sample sizes (Appendix E.4):
  - Finance-Agent: `50` instances.
  - BrowseComp-Plus: `100` instances.
  - Workbench: `100` instances.
  - PlanCraft: `100` instances.
3.4.6 Coordination/process metrics (what is measured beyond accuracy)
The paper argues final success alone is insufficient, so it instruments MAS/SAS traces with measurable coordination properties (Section 4.1):
- Coordination overhead `O% = (T_MAS − T_SAS)/T_SAS × 100%`, where `T` is "turns" (reasoning–response exchanges) (Section 4.1; Table 5).
- Message density `c`: inter-agent messages per reasoning turn (Section 4.1; Table 5).
- Redundancy rate `R`: cosine similarity of agent output embeddings (agreement/overlap) (Section 4.1; Table 5).
- Coordination efficiency `E_c = S / (T/T_SAS)`: success normalized by relative turn count (Section 4.1).
- Error amplification `A_e = E_MAS / E_SAS`: relative failure probability (Section 4.1; Table 5).
- Token overlap structure labels tokens in rationales as unique/shared/contradictory, with contradiction detected via a BERTScore-based dissimilarity threshold (Section 4.1; Section 4.4).
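
The metric definitions above translate directly into code. The sketch below is an illustrative reading of those formulas, not the paper's instrumentation: the function names are hypothetical, and averaging cosine similarity over all agent pairs is an assumption about how `R` is aggregated.

```python
import math
from typing import List, Sequence


def coordination_overhead(turns_mas: float, turns_sas: float) -> float:
    """O% = (T_MAS - T_SAS) / T_SAS * 100."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def coordination_efficiency(success: float, turns: float, turns_sas: float) -> float:
    """E_c = S / (T / T_SAS): success normalized by relative turn count."""
    return success / (turns / turns_sas)


def error_amplification(error_mas: float, error_sas: float) -> float:
    """A_e = E_MAS / E_SAS: failure probability relative to the single agent."""
    return error_mas / error_sas


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def redundancy_rate(embeddings: List[Sequence[float]]) -> float:
    """Mean pairwise cosine similarity of agent output embeddings.

    Averaging over all pairs is an assumption made for this sketch.
    """
    pairs = [(i, j) for i in range(len(embeddings)) for j in range(i + 1, len(embeddings))]
    return sum(cosine_similarity(embeddings[i], embeddings[j]) for i, j in pairs) / len(pairs)
```

Plugging in the mean values this summary quotes from Table 5 for Centralized MAS (27.7 turns, success 0.463) against SAS (7.2 turns) gives an overhead of about 285% and an efficiency of about 0.120, which agrees with the table's rounded entries.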
The paper also defines an information gain measure ΔI (Appendix E.5; Eq. (2)–(3)) as posterior variance reduction for the success variable using Monte Carlo sampling:
- K = 10 traces sampled at temperature τ = 0.7 to estimate variance via p(1-p) for binary outcomes.
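
A minimal sketch of the variance estimate behind ΔI follows. Only the `p(1-p)` variance for binary outcomes and the sample count come from the text; the function names and the reading of "information gain" as a plain difference of two such variances are assumptions, since Eq. (2)–(3) are not reproduced in this summary.

```python
from typing import Sequence


def bernoulli_variance(outcomes: Sequence[int]) -> float:
    """Estimate Var[success] as p(1-p), with p the empirical success rate."""
    p = sum(outcomes) / len(outcomes)
    return p * (1.0 - p)


def information_gain(before: Sequence[int], after: Sequence[int]) -> float:
    """Variance reduction between two sets of sampled traces (assumed form of Delta-I)."""
    return bernoulli_variance(before) - bernoulli_variance(after)


# Hypothetical data: K = 10 sampled traces before and after an intervention.
before = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]   # 5 of 10 succeed, variance 0.25
after = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]    # 9 of 10 succeed, variance 0.09
gain = information_gain(before, after)
```

With only ten binary samples the estimate is coarse, so it is best read as a rough signal rather than a precise quantity.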
3.4.7 The predictive scaling model (what it predicts and how)
- The paper fits a regression-style "scaling principle" with main effects and mechanism-motivated interactions (Section 4.3).
- Predictors include (Section 4.3; Eq. (1)):
  - Model capability `I` (Intelligence Index, mean-centered at `Ī = 56.9` per Table 4),
  - Tool count `T` (log-transformed),
  - Agent count `n_a` (log-transformed),
  - Single-agent baseline performance `P_SA`,
  - Coordination metrics: `O%`, `c`, `R`, `E_c`, `A_e` (some log-transformed),
  - Selected interactions such as `E_c × T`, `O% × T`, and `P_SA × log(1+n_a)`.
- Model form: Eq. (1) in Section 4.3, with standardized predictors (`μ=0, σ=1`) and log transforms for skewed variables (`O%` spans `0–515%`, `T: 4–16`, `n_a: 1–4`, `A_e: 1.0–17.2`) (Section 4.3).
- Validation:
  - 5-fold cross-validation with experiment-level holdout yields `R^2_CV = 0.524 ± 0.033`, `MAE = 0.089 ± 0.011`, `RMSE = 0.112 ± 0.014` (Section 4.3; Table 3).
  - Adding coordination metrics improves predictive power over using only architecture labels (`R^2_CV 0.524` vs `0.430`) (Table 3).
- Worked micro-example (how the model is used in the paper's decision logic).
  - The paper derives a coordination-saturation / baseline threshold from the interaction `P_SA × log(1+n_a)` (Table 4), stating coordination yields diminishing/negative returns once the single-agent baseline exceeds about `0.45` (Section 4.3; also highlighted in Abstract/Conclusion).
  - Mechanistically, this is the "baseline paradox": if SAS is already strong, the remaining headroom is small, and coordination overhead dominates (Section 4.3; Table 4 shows `β̂ = −0.404, p < 0.001` for this interaction).
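
To see why a negative interaction coefficient implies a baseline threshold, it helps to isolate the two terms of a linear model that involve agent count. This is a sketch of the general mechanism, not the paper's derivation: $\beta_n$ is a placeholder for the agent-count main effect, whose value is not quoted in this summary, and because predictors are standardized, recovering the raw threshold of about 0.45 would require the paper's fitted values and scaling constants.

```latex
\frac{\partial \hat{P}}{\partial \log(1+n_a)}
  = \beta_n + \beta_{\mathrm{int}}\, P_{SA},
  \qquad \beta_{\mathrm{int}} = -0.404 < 0
\quad\Longrightarrow\quad
\frac{\partial \hat{P}}{\partial \log(1+n_a)} < 0
  \iff P_{SA} > P_{SA}^{*} = -\frac{\beta_n}{\beta_{\mathrm{int}}}
```

The inequality flips when dividing by the negative interaction coefficient, so adding agents starts to hurt once the single-agent baseline rises above $P_{SA}^{*}$, provided $\beta_n > 0$.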
4. Key Insights and Innovations
- (1) A rigorous operationalization of "agentic evaluation."
  - Innovation: the paper does not treat every benchmark as valid for evaluating agents/MAS; it requires sequential interdependence, partial observability, and adaptive strategy formation (Section 3.2).
  - Significance: this directly targets a methodological failure mode, namely drawing MAS conclusions from static tasks where coordination costs are absent or irrelevant (Introduction; Section 2).
- (2) A controlled, topology-focused evaluation across many configurations.
  - Innovation: it standardizes tools, prompt structures, and token budgets and varies mainly topology + model capability, across `N=180` configurations (Abstract; Section 4.1).
  - Significance: this is meant to enable causal attribution to coordination structure rather than implementation confounds (Introduction; Section 4.1).
- (3) Quantitative coordination metrics + predictive modeling outperform architecture labels.
  - Innovation: the main explanatory variables are process metrics (`E_c`, `O%`, `c`, `R`, `A_e`) measured from traces, not just "centralized vs decentralized" labels (Section 4.1; Table 5).
  - Significance: the best model reaches `R^2_CV = 0.524` and beats an architecture-label model (`0.430`) (Table 3), suggesting the measurable properties capture transferable principles.
- (4) Three dominant scaling effects (as summarized by the paper).
  - Tool–coordination trade-off: tool-heavy tasks disproportionately suffer from multi-agent overhead under fixed budgets; modeled via `E_c × T` with `β̂ = −0.267, p < 0.001` (Section 4.3; Table 4).
  - Capability saturation / baseline paradox: coordination returns diminish or turn negative beyond a SAS baseline of about `45%`, tied to `P_SA × log(1+n_a)` with `β̂ = −0.404, p < 0.001` (Abstract; Section 4.3; Table 4).
  - Topology-dependent error amplification: Independent MAS amplifies errors dramatically (reported as `17.2×` vs `4.4×` for Centralized) (Abstract; Table 5; Section 4.4).
- (5) Task-contingent topology preferences (not one MAS that wins everywhere).
  - Centralized is best on Finance-Agent with large gains (`+80.8%`) (Section 4.2; Figure 2).
  - Decentralized is best on BrowseComp-Plus with modest gains (`+9.2%`) while centralized is near-flat (`+0.2%`) (Section 4.2; Figure 2).
  - All MAS variants degrade on PlanCraft (−39% to −70%) (Section 4.2; Figure 2).
5. Experimental Analysis
Evaluation methodology (datasets, metrics, setup)
- Datasets/benchmarks: BrowseComp-Plus, Finance-Agent, PlanCraft, Workbench (Table 1; Appendix D).
- Primary metric: success/accuracy depending on domain (Section 4.1).
- Secondary/process metrics: overhead, efficiency, error amplification, redundancy, message density, token-overlap/contradiction, and information gain `ΔI` (Section 4.1; Appendix E.5).
- Configurations: 5 architectures × multiple models across 3 families, totaling `180` experiments (Section 4.1).
- Budget control: mean `≈ 4,800` tokens per trial matched across conditions (Section 4.4; Table 5).
- Reliability: the paper reports inter-rater reliability for factual error rate validators with Cohen's κ values (e.g., Finance-Agent `0.91`, Workbench `0.89`, PlanCraft `0.87`, BrowseComp-Plus `0.88`) (Section 4.1).
Main quantitative results (with specific numbers)
Per-benchmark MAS vs SAS (Section 4.2; Figure 2):
- Finance-Agent: strong MAS gains across topologies.
  - Centralized: `0.631` vs SAS `0.349` → `+80.8%` (Section 4.2).
  - Decentralized: `0.609` → `+74.5%`.
  - Hybrid: `0.604` → `+73.1%`.
- BrowseComp-Plus: mixed; Independent is harmful, structured coordination yields small gains.
  - Decentralized: `0.347` vs SAS `0.318` → `+9.2%`.
  - Centralized: `+0.2%` (near-flat).
  - Independent: `−35%` (Figure 2 caption; Section 4.2 provides the "modest gains" story and highlights independent underperformance).
- Workbench: small effects.
  - Decentralized: `0.664` vs SAS `0.629` → `+5.7%`.
  - Centralized and Hybrid: around `−1.2%` (Section 4.2).
- PlanCraft: consistent MAS degradation.
  - Independent: `0.170` vs SAS `0.568` → `−70.1%`.
  - Centralized: `0.282` → `−50.3%`.
  - Decentralized: `0.332` → `−41.5%`.
  - Hybrid: `0.346` → `−39.1%` (Section 4.2; Figure 2).
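
The percentage deltas in the list above are relative changes against the SAS score on the same benchmark. The short sketch below recomputes them from the quoted scores; the function name `relative_change_pct` and the `REPORTED` table layout are hypothetical, while the numbers are copied from the list.

```python
def relative_change_pct(mas_score: float, sas_score: float) -> float:
    """Relative change of a MAS score against the SAS baseline, in percent."""
    return (mas_score - sas_score) / sas_score * 100.0


# (benchmark, topology): (MAS score, SAS score, delta reported in the text)
REPORTED = {
    ("Finance-Agent", "Centralized"): (0.631, 0.349, 80.8),
    ("Finance-Agent", "Decentralized"): (0.609, 0.349, 74.5),
    ("Finance-Agent", "Hybrid"): (0.604, 0.349, 73.1),
    ("BrowseComp-Plus", "Decentralized"): (0.347, 0.318, 9.2),
    ("Workbench", "Decentralized"): (0.664, 0.629, 5.7),
    ("PlanCraft", "Independent"): (0.170, 0.568, -70.1),
    ("PlanCraft", "Centralized"): (0.282, 0.568, -50.3),
    ("PlanCraft", "Decentralized"): (0.332, 0.568, -41.5),
    ("PlanCraft", "Hybrid"): (0.346, 0.568, -39.1),
}

for (benchmark, topology), (mas, sas, reported) in REPORTED.items():
    computed = relative_change_pct(mas, sas)
    print(f"{benchmark:16s} {topology:14s} computed {computed:+6.1f}%  reported {reported:+.1f}%")
```

The recomputed values agree with the reported ones to within about 0.2 percentage points; the small gaps are plausibly due to the scores being rounded to three decimals.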
Across-architecture coordination costs and dynamics (Table 5; Section 4.4):
- Mean success rate `S`:
  - SAS `0.466`, Independent `0.370`, Decentralized `0.477`, Centralized `0.463`, Hybrid `0.452` (Table 5).
- Mean turns `T`:
  - SAS `7.2 ± 2.1`, Independent `11.4 ± 3.2`, Decentralized `26.1 ± 7.5`, Centralized `27.7 ± 8.1`, Hybrid `44.3 ± 12.4` (Table 5).
- Overhead `O%` relative to SAS:
  - Independent `58%`, Decentralized `263%`, Centralized `285%`, Hybrid `515%` (Table 5).
- Efficiency `E_c`:
  - SAS `0.466`, Independent `0.234`, Decentralized `0.132`, Centralized `0.120`, Hybrid `0.074` (Table 5).
- Error amplification `A_e`:
  - SAS `1.0`, Independent `17.2`, Decentralized `7.8`, Centralized `4.4`, Hybrid `5.1` (Table 5).
Predictive modeling results (Section 4.3; Table 3):
- Full model with coordination metrics: `R^2_train = 0.613`, `R^2_CV = 0.524`, `AIC = −201.2`, `20` parameters (Table 3).
- Baselines:
  - Intelligence + tools + agents only: `R^2_CV = 0.283` (Table 3).
  - Adding architecture labels: `R^2_CV = 0.430` (Table 3).

Key coefficient-based claims (Table 4):
- `E_c × T` has `β̂ = −0.267, p < 0.001` (tool–coordination trade-off).
- `P_SA × log(1+n_a)` has `β̂ = −0.404, p < 0.001` (baseline paradox/capability saturation).
- `O% × T` has `β̂ = −0.162, p < 0.001` (overhead compounds with tool complexity).
- Intelligence main effect: `β̂ = 0.171, p = 0.001`; quadratic term not significant (Table 4).
Do the experiments support the claims?
- Task-contingent coordination: The per-benchmark deltas (Figure 2) strongly support the claim that MAS benefits are not universal and can be negative, including large negative effects on PlanCraft.
- Cost/overhead as a central driver: Table 5 shows drastic turn and overhead increases for MAS, especially Hybrid. This aligns with the paper's narrative that under fixed budgets, coordination consumes reasoning capacity (Section 4.4).
- Predictability via measurable coordination metrics: The improvement from `R^2_CV 0.430` (architecture labels) to `0.524` (coordination metrics) (Table 3) supports the claim that process metrics carry additional explanatory power beyond topology categories.
- Caveat: The paper reports dramatic architecture differences in error amplification (`A_e`), but in the regression, neither the `log(1+A_e)` main effect nor the `A_e × T` interaction is significant (Table 4; Section 4.3). That complicates a simple story of "error amplification drives performance," and the paper itself interprets error differences as being subsumed by other metrics (efficiency/overhead) in the multivariate model (Section 4.3).
Ablations / robustness checks / out-of-sample validation
- Model comparison ablation: progressively adding predictors (Table 3) functions as an ablation showing what improves prediction.
- Cross-validation: 5-fold CV with stable coefficient estimates and diagnostics (Section 4.3).
- Out-of-sample validation: on `GPT-5.2` (Intelligence Index `75`), the model reports `MAE = 0.071` (Appendix B; Table 7) and validates `4/5` qualitative findings (Table 8), with a noted partial validation where decentralized vs centralized advantages converge at high capability (Table 8).
- Important caveat in Appendix B: the paper notes SAS performance is systematically over-predicted out of training range (Table 9), suggesting calibration issues when extrapolating beyond the Intelligence Index range used to fit the model.
6. Limitations and Trade-offs
- Limited architecture coverage.
  - Only five canonical topologies are tested (SAS + four MAS). Other orthogonal design dimensions (specialization, memory design, aggregation rules beyond "synthesis-only," etc.) are explicitly out of scope (Section 3.1 footnote; Limitations Section 5).
- Agent-count scaling is only partially explored.
  - The main controlled suite uses small teams (Appendix E.2 uses 3-agent MAS configurations), with additional exploration over `n_a ∈ {1,3,5,7,9}` shown for some cases (Figure 5).
  - The paper finds turn count scales super-linearly with agent count (`T = 2.72 × (n+0.5)^{1.724}`, `R^2=0.974`) (Section 4.4), implying practical ceilings beyond ~3–4 agents under fixed budgets, but this is not fully mapped across all domains/models.
- Prompt optimality is not pursued.
  - Prompts are held identical across conditions for experimental validity, but architecture- or model-specific prompt tuning could change scaling behavior (Limitations (iv)).
- Benchmark coverage is diverse but still narrow.
  - Four benchmarks span finance, browsing, planning, and workplace tools, but may not represent embodied/multimodal or long-horizon temporal feedback settings (Limitations (v)).
- Predictive model is only moderately explanatory and may not extrapolate cleanly.
  - `R^2_CV = 0.524` means substantial variance remains unexplained (Section 4.3).
  - Out-of-range extrapolation shows calibration issues (Appendix B; Table 9), including over-prediction for SAS on GPT-5.2.
- Economic/latency cost is a first-class trade-off.
  - MAS token efficiency is much worse than SAS in Table 5 (e.g., success per 1K tokens: SAS `67.7` vs Hybrid `13.6`), and overhead reaches `515%` for Hybrid (Table 5), which directly constrains practical deployment (Section 4.4; Limitations (vi)).
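
The power-law fit for turn count quoted in the agent-count limitation above can be evaluated directly to see how quickly turns grow. The sketch below only plugs agent counts into that fitted formula; the function name `predicted_turns` is hypothetical, and the outputs are values of the fit, not measured turn counts.

```python
def predicted_turns(n_agents: int) -> float:
    """Evaluate the reported fit T = 2.72 * (n + 0.5) ** 1.724."""
    return 2.72 * (n_agents + 0.5) ** 1.724


# Agent counts matching the exploration range mentioned for Figure 5.
for n in (1, 3, 5, 7, 9):
    print(f"n = {n}: about {predicted_turns(n):.1f} turns")

# Tripling the team from 3 to 9 agents multiplies predicted turns by more than 3,
# which is what "super-linear" means here.
growth_3_to_9 = predicted_turns(9) / predicted_turns(3)
```

Because the exponent exceeds 1, each added agent costs more turns than the previous one, which is consistent with the practical ceiling of roughly 3–4 agents under a fixed budget noted above.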
7. Implications and Future Directions
- How this changes the field's framing (within the paper's scope).
  - It argues that "agent scaling" should be studied like an empirical science: not by counting agents, but by quantifying coordination costs and task properties that determine when coordination is net-positive (Abstract; Sections 4.2–4.3).
  - It pushes evaluation toward agentic benchmarks and away from static tasks as a proxy for agent performance (Section 3.2).
- Practical applications / downstream use cases.
  - Architecture selection by task type (supported by Figure 2 and Section 4.3 examples):
    - Prefer SAS for sequential constraint-satisfaction tasks like PlanCraft, where all MAS variants degrade (`−39%` to `−70%`) (Figure 2; Section 4.2).
    - Prefer Centralized MAS for parallelizable analysis tasks like Finance-Agent, where centralized gains are largest (`+80.8%`) (Figure 2; Section 4.2).
    - Prefer Decentralized MAS for dynamic browsing/navigation where peer exploration and fusion help modestly (`+9.2%`) and centralized is near-neutral (`+0.2%`) (Figure 2; Section 4.2).
  - Budget-aware design: Table 5 implies that if token/latency budgets are tight, Hybrid-style coordination can be dominated by overhead.
- Repro/Integration Guidance (when to prefer what, per the paper).
  - Use MAS only when the task is decomposable into parallel subtasks and the single-agent baseline is below ~0.45, because the paper reports a capability/baseline saturation threshold beyond which coordination returns diminish (Abstract; Section 4.3).
  - Avoid MAS for tasks where decomposition is artificial (PlanCraft example in Section 4.2), because coordination messages consume budget without adding useful information.
  - Track measurable metrics like `O%`, `E_c`, and message density `c` from traces (Section 4.1; Table 5) and treat them as design targets, not just post-hoc analytics.
- Follow-up research directions explicitly suggested by the paper (Limitations Section 5).
  - Explore whether larger collectives can overcome super-linear overhead via emergent specialization or self-organization, versus being fundamentally bottlenecked by communication (Limitations (i)).
  - Study heterogeneity beyond same-family scale differences, including mixing different model architectures or specialized fine-tunes (Limitations (ii); also Figure 4 analyzes heterogeneous-role effects within families).
  - Develop coordination protocols for tool-intensive tasks (tool scheduling, routing, hierarchical tool delegation), since tool–coordination interactions are a primary failure mode (Limitations (iii); Section 4.3).
  - Extend to more environments (embodied, multimodal, long-horizon feedback) to test generality of the scaling principles (Limitations (v)).
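
The integration guidance in this section can be condensed into a rule of thumb. The sketch below is a simplification for illustration, not a procedure from the paper: the function name `suggest_architecture`, the `task_kind` labels, and the hard cutoff at 0.45 are assumptions layered on the reported findings.

```python
def suggest_architecture(
    decomposable: bool,
    sas_baseline: float,
    task_kind: str = "analysis",
    threshold: float = 0.45,
) -> str:
    """Rule-of-thumb topology choice based on the guidance summarized above.

    decomposable: whether the task splits into genuinely parallel subtasks
    sas_baseline: single-agent success rate on the task (0 to 1)
    task_kind:    hypothetical label, "analysis" or "browsing"
    threshold:    baseline level beyond which coordination returns diminish
    """
    # Sequential tasks or a strong single-agent baseline: stay with one agent.
    if not decomposable or sas_baseline > threshold:
        return "SAS"
    # Dynamic browsing saw modest gains from peer exploration.
    if task_kind == "browsing":
        return "Decentralized MAS"
    # Parallelizable analysis saw the largest gains from an orchestrator.
    return "Centralized MAS"


# SAS baselines quoted in Section 5 of this summary.
print(suggest_architecture(True, 0.349, "analysis"))   # Finance-Agent-like task
print(suggest_architecture(True, 0.318, "browsing"))   # BrowseComp-Plus-like task
print(suggest_architecture(False, 0.568))              # PlanCraft-like task
```

A hard threshold is a blunt instrument; given the model's moderate fit (`R^2_CV = 0.524`), measuring overhead and efficiency on a pilot run is a safer basis for the final choice.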