CodeClash: Benchmarking Goal-Oriented Software Engineering

🎯 Pitch

CodeClash is a tournament-style benchmark in which LMs act as persistent software-engineering agents that iteratively edit codebases and compete in diverse “code arenas” toward open-ended objectives. Across 1,680 tournaments spanning six arenas and eight models, the benchmark shows that modern LMs can generate creative solutions but systematically fail at strategic reasoning, log-driven adaptation, and long-term code maintenance, exposing crucial gaps on the path to reliable, autonomous software developers.

1. Executive Summary

CodeClash introduces a benchmark for goal-oriented software engineering where language-model “software engineering agents” iteratively edit a persistent codebase across multiple rounds, and that codebase competes head-to-head in a “code arena” to win an open-ended objective (Figure 1, Section 2). Across 1,680 tournaments (25,200 rounds total) over 6 arenas and 8 LMs, the benchmark exposes consistent weaknesses in models’ strategic reasoning (log interpretation, validation, adaptation) and in long-horizon codebase maintenance, despite eliciting diverse and creative code evolution (Abstract; Sections 3–5; Figures 4–8). A striking result is that the top model on the paper’s leaderboard, Claude Sonnet 4.5, wins 0 of 150 rounds against a strong static human-written bot on RobotRumble (Section 4.1).

2. Context and Motivation

Problem / gap addressed
Most existing coding benchmarks evaluate models on well-specified, isolated tasks (e.g., implement an algorithm, fix a bug, write a test), where success is defined by correctness against explicit instructions and/or unit tests (Introduction).
Real-world software development is often driven by high-level objectives (e.g., “improve retention,” “reduce cost”), requiring developers to:
- Decompose goals into actionable steps,
- Prioritize and strategize,
- Iterate based on noisy feedback,
- Maintain and refactor code over time (Introduction).
The paper targets the missing evaluation: whether LMs can autonomously and iteratively develop code to better accomplish open-ended objectives without explicit guidance (Abstract; Introduction).
Why it matters
If LMs are to function as long-running SWE agents, capability must include:
- Strategic reasoning under feedback (interpreting logs/metrics, deciding what to change),
- Long-horizon robustness (not regressing after changes),
- Codebase hygiene and maintainability (avoiding repository clutter and redundancy) (Sections 2.3 and 5).
Prior approaches and shortfalls (as positioned here)
Repository-level benchmarks like SWE-bench measure issue resolution against unit tests, but objectives are still explicit and feedback is often binary (pass/fail), which can saturate once tests pass (Related Work; “Self improving agents” discussion).
Code optimization benchmarks exist where the model chooses how to improve, but they typically do not involve adversarial opponents and have narrow objectives (Related Work; “Performance optimization”).
Game benchmarks evaluate gameplay directly, but CodeClash’s novelty is that the model does not play directly; it evolves a codebase that plays as a proxy (Introduction; Related Work; Figure 1).
Positioning
CodeClash is positioned as an evaluation and potential training ground for “self-evolving” SWE agents: a multi-round, adversarial, open-ended objective setting with persistent codebases and log-based feedback (Abstract; Sections 2–3; Related Work).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a tournament loop that repeatedly lets an LM-agent edit a code repository, then evaluates that repository by having it compete in a simulator/game-like environment against other repositories (Figure 1, Section 2.1).
It solves the evaluation problem of “can an LM improve software toward a high-level goal over time?” by using multi-round, head-to-head competition outcomes as the objective signal, rather than unit tests or single-shot tasks (Sections 2.3, 3).

3.2 Big-picture architecture (diagram in words)

Player / agent container: a Docker container where an LM uses a bash-only interface to inspect/edit its codebase (Appendix A; Figure 9).
Agent scaffold (mini-SWE-agent): enforces the interaction protocol (ReAct-style “thought + bash action”), step limits, and cost limits (Sections 2.2, 3; Appendix C.1).
Code arena container: a Docker container that runs the arena engine to execute matches between players’ codebases and produce outcomes/logs (Appendix A; Figure 9).
Logging + feedback injection: after competition, logs/results are copied into each player’s repository as the only new cross-round information (Section 2.1; Appendix A; Figure 9).
Tournament manager: orchestrates rounds, validates submissions, runs repeated simulations per round, and computes win rates / Elo (Sections 3–4; Appendix C.2–C.3).

3.3 Roadmap for the deep dive

I will explain, in order:
The benchmark formalization (what a “round,” “tournament,” “arena,” and “player” are), because this defines the evaluation task (Section 2.1).
The agent interaction model (mini-SWE-agent + bash actions), because it constrains what LMs can do and how they gather evidence (Section 2.2; Appendix C.1).
The infrastructure loop (Docker containers, copying code/logs), because it determines reproducibility and what feedback is available (Appendix A; Figure 9).
Arena interface and properties (open-ended objectives, adversarial adaptation, self-crafted memory), because these are the benchmark’s core stressors (Section 2.3; Appendix B).
Evaluation design and metrics (tournament structure, repeated sims, Elo fitting), because they define how performance is quantified (Section 3; Appendix C.3).

3.4 Detailed, sentence-based technical breakdown

This is an empirical benchmark + systems evaluation paper whose core idea is to benchmark LMs as autonomous software developers by making them iteratively modify a persistent codebase that competes in a simulator, with only log-based feedback across rounds (Sections 2–3; Figure 1).

Formulation: tournaments, rounds, and the “codebase-as-memory” constraint (Section 2.1).
- A player is defined as an LM paired with an agent computer interface (ACI) that can interact with a codebase (Section 2.1).
- A code arena is any platform that accepts multiple codebases, executes them against each other, and produces objective-driven outcomes such as “score maximization,” “resource acquisition,” or “survival” (Section 2.1; Figure 1).
- Each tournament contains multiple rounds; in each round:
  1. Edit phase: each player independently modifies its codebase within a fixed turn budget (Section 2.1; Section 3).
  2. Competition phase: codebases are compiled/executed in the arena head-to-head; the arena declares winners based on the objective (Section 2.1; Figure 1).
- The benchmark makes three explicit design choices (Section 2.1):
  - Codebase-as-memory: agents have no persistent external memory; only what they write into the repository persists across rounds.
  - Log-based feedback: after competition, results/logs are copied into each player’s repo as the only new information for subsequent rounds.
  - Strategic opacity: players cannot see opponents’ codebases (with an ablation later that partially lifts this; Section 4.1).
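The round structure above can be sketched in a few lines. This is a minimal in-memory illustration, not CodeClash's implementation: the `Repo` dictionary, the `edit` and `compete` callables, and the `logs/round_<n>.log` naming are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a codebase: file path -> file contents.
Repo = Dict[str, str]


def run_tournament(
    repos: Dict[str, Repo],
    edit: Callable[[str, Repo], None],
    compete: Callable[[Dict[str, Repo]], Dict[str, object]],
    num_rounds: int,
) -> List[str]:
    """Run edit/compete rounds and return the winner's name for each round."""
    winners: List[str] = []
    for round_idx in range(1, num_rounds + 1):
        # Edit phase: each player sees and mutates only its own repo
        # (strategic opacity).
        for name, repo in repos.items():
            edit(name, repo)
        # Competition phase: the arena sees every codebase and picks a winner.
        result = compete(repos)
        winners.append(str(result["winner"]))
        # Log-based feedback: the round log is the only new information, and it
        # lands inside the repo, which is also the only persistent memory.
        for repo in repos.values():
            repo[f"logs/round_{round_idx}.log"] = str(result["log"])
    return winners


def toy_edit(name: str, repo: Repo) -> None:
    # Toy "agent": records how many logs it can see, then bumps a strength value.
    seen = sum(1 for path in repo if path.startswith("logs/"))
    repo["notes.md"] = f"{name} has seen {seen} logs"
    repo["bot.txt"] = str(int(repo.get("bot.txt", "0")) + (2 if name == "a" else 1))


def toy_compete(repos: Dict[str, Repo]) -> Dict[str, object]:
    scores = {name: int(repo["bot.txt"]) for name, repo in repos.items()}
    return {"winner": max(scores, key=scores.get), "log": f"scores={scores}"}


repos: Dict[str, Repo] = {"a": {}, "b": {}}
winners = run_tournament(repos, toy_edit, toy_compete, num_rounds=3)
print(winners)
print(repos["a"]["notes.md"])
```

Anything the toy agent wants to remember in the next round has to live in `repo`, which is the point of the codebase-as-memory constraint.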

Agent interaction model: mini-SWE-agent with bash-only actions (Sections 2.2, 3; Appendix C.1).
- The paper implements players using mini-SWE-agent, an ACI that lets the LM act by issuing bash commands to a terminal (Sections 2.2, 3).
- Each interaction step is described as ReAct-style: the model outputs a natural-language “thought” plus a bash action, then receives stdout/stderr (Section 2.2).
- The per-round edit budget is:
  - A maximum of 30 turns per edit phase (Section 3; Appendix C.1).
  - A $1 per-round cost limit, intended to prevent context flooding; the paper states this cost limit is “almost never reached,” while turn limits are “exhausted frequently for specific models” (Appendix C.1).
- The system prompt and interaction rules require the model to produce exactly one bash code block with one command (or chained with && / ||), and warn that directory changes are not persistent across turns because each action runs in a new subshell (Appendix C.1).
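To make the "exactly one bash code block" rule concrete, here is a small sketch of how a scaffold might extract the action from a model response. The regular expression and the function name `extract_action` are assumptions for illustration; mini-SWE-agent's actual parsing may differ.

```python
import re
from typing import Optional

# Build the fence marker programmatically so this example can itself sit in a fence.
FENCE = "`" * 3
# Hypothetical pattern: a fenced block tagged "bash".
BASH_BLOCK = re.compile(FENCE + r"bash\s*\n(.*?)\n" + FENCE, re.DOTALL)

MAX_TURNS = 30  # per-edit-phase turn limit quoted above


def extract_action(response: str) -> Optional[str]:
    """Return the bash command if the response has exactly one bash block, else None."""
    blocks = BASH_BLOCK.findall(response)
    if len(blocks) != 1:
        return None
    command = blocks[0].strip()
    return command or None


good = "I should read the docs first.\n" + FENCE + "bash\nls docs && cat docs/README.md\n" + FENCE
bad = good + "\n" + FENCE + "bash\npwd\n" + FENCE

print(extract_action(good))  # the chained command
print(extract_action(bad))   # None: two blocks violate the protocol
```

Because each action runs in a fresh subshell, a command such as `cd docs` on its own has no lasting effect; chaining (`cd docs && ls`) keeps the directory change and the follow-up in the same action.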

Infrastructure and execution loop: containerized editing, arena runs, and log injection (Appendix A; Figure 9).
- CodeClash is containerized using Docker to improve portability/reproducibility (Appendix A).
- A single round is implemented as a concrete 4-step flow (Figure 9):
  1. In agent containers, LMs edit their own codebases using mini-SWE-agent.
  2. Codebases are copied into the arena container.
  3. The arena runs competitions, generating outcomes and logs.
  4. Logs are copied back into each player’s repository (into a logs/ folder) before the next round.
- The starter codebase for each arena is designed so that models can discover mechanics without prior knowledge (Appendix A):
  - docs/ contains copied documentation.
  - Arena executables/assets are available but the precise run commands are not directly disclosed (models must infer them).
  - A “working submission” baseline is included so competition can start immediately.
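Step 4 of the flow, copying logs back into each player's repository, can be illustrated on the local filesystem. This sketch leaves Docker out entirely; the function name `inject_logs` and the `logs/round_<n>/` layout are assumptions (the summary only says logs go into a `logs/` folder).

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def inject_logs(arena_log_dir: Path, player_repos: List[Path], round_idx: int) -> None:
    """Copy one round's arena logs into logs/round_<n>/ inside each player repo."""
    for repo in player_repos:
        dest = repo / "logs" / f"round_{round_idx}"
        # dirs_exist_ok lets a rerun of the same round overwrite in place.
        shutil.copytree(arena_log_dir, dest, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Stand-in for the arena container's output directory.
    arena_logs = root / "arena" / "out"
    arena_logs.mkdir(parents=True)
    (arena_logs / "match.log").write_text("player_a 3 - 1 player_b\n")

    # Stand-ins for the two agent containers' repositories.
    player_repos = [root / "player_a", root / "player_b"]
    for repo in player_repos:
        repo.mkdir()

    inject_logs(arena_logs, player_repos, round_idx=1)
    copied = sorted(p.relative_to(root).as_posix() for p in root.glob("player_*/logs/*/*"))

print(copied)
```

Both players receive the same log files, so any asymmetry in how well they use them comes from the agents, not from the feedback channel.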

Arena interface and repeated simulations per round (Sections 2.2–2.3; Appendix A, B).
- The paper defines a “lightweight, flexible interface” for arenas: an arena implementation must define commands to run the competition and determine a winner (Section 2.2).
- Many arenas are stochastic; the paper standardizes outcome determination by running the competition 1000 times per round and declaring the winner by majority (Appendix A).
- This choice is explicitly framed as avoiding arbitrary win thresholds while aligning with common competitive practice (Appendix A).
- If a submission becomes invalid (e.g., doesn’t compile or violates arena-specific structure), CodeClash applies a decision tree (Appendix A):
  - If all submissions are invalid: the round is a tie.
  - If only one submission is valid: the valid one wins.
  - If two or more are valid: the competition runs among the valid ones, and invalid submissions are excluded.
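The repeated-simulation rule and the invalid-submission decision tree can be combined into one small function. This is a sketch under assumptions: `decide_round` and `simulate` are hypothetical names, each simulation is assumed to return a single winner, and an exact split of simulation wins is treated as a tie.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Optional


def decide_round(
    valid: Dict[str, bool],
    simulate: Callable[[List[str]], str],
    num_sims: int = 1000,
) -> Optional[str]:
    """Return the round winner's name, or None for a tie."""
    contenders = [name for name, ok in valid.items() if ok]
    if not contenders:
        return None  # all submissions invalid: tie
    if len(contenders) == 1:
        return contenders[0]  # the only valid submission wins
    # Invalid submissions are excluded; the rest play num_sims simulations.
    tally = Counter(simulate(contenders) for _ in range(num_sims))
    (top, top_wins), *rest = tally.most_common()
    if rest and rest[0][1] == top_wins:
        return None  # assumption: an exact split of simulation wins is a tie
    return top


rng = random.Random(0)


def biased(players: List[str]) -> str:
    # Hypothetical stochastic arena: the first contender wins 60% of simulations.
    return players[0] if rng.random() < 0.6 else players[1]


both_valid = decide_round({"a": True, "b": True}, biased)
one_valid = decide_round({"a": False, "b": True}, biased)
none_valid = decide_round({"a": False, "b": False}, biased)
print(both_valid, one_valid, none_valid)
```

With a 60/40 edge and 1000 simulations, the stronger player loses the round only with negligible probability, which is why repetition turns a noisy arena into a stable per-round signal.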

“Open-ended objectives” and benchmark properties (Section 2.3).
- CodeClash is explicitly designed to push beyond unit-test correctness by using objectives that vary by arena, requiring the model to invent intermediate metrics and strategies (Section 2.3).
- Key properties emphasized:
  - Open-ended objectives: optimize for winning outcomes, not correctness (Section 2.3).
  - Diverse arenas: different languages, interfaces, and log formats (Section 2.3; Table 2 in Appendix B lists arenas, languages, and player counts).
  - Adversarial adaptation: because opponents evolve too, “early round wins do not ensure continued dominance” (Section 2.3).
  - Self-crafted memory: no persistent memory besides what is written to the codebase (Section 2.3).
  - Self-directed improvement: minimal guidance in prompts; models must choose whether to read docs, parse logs, write tests, etc. (Section 2.3; Appendix C.1).

Arenas included in the initial release (Section 2.3; Appendix B; Table 2).
- The paper’s arena suite includes 6 code arenas (Section 2.3; Table 2):
  - Battlesnake (Python): grid survival/territory control (Appendix B.2).
  - Core War (Redcode): assembly-like programs in shared memory (Appendix B.3).
  - Halite I (C/C++/OCaml/Rust): grid resource/territory game (Appendix B.4).
  - Poker via Husky Hold’em Bench (Python): no-limit Texas Hold’em (Appendix B.5).
  - RoboCode (Java): tank combat (Appendix B.6).
  - RobotRumble (JavaScript in main setup; later human ablation uses Python): turn-based spawning robot battle (Appendix B.7; Section 4.1 / Appendix D.3).

Evaluation design and metrics (Section 3; Appendix C.3).
- Models evaluated: 8 LMs are listed in Section 3: Claude Sonnet 4.5, Claude Sonnet 4, GPT-5, GPT-5 Mini, o3, Gemini 2.5 Pro, Qwen3-Coder, Grok Code Fast 1.
- Main evaluation scale (Section 3):
  - One-on-one tournaments.
  - 10 tournaments per model pair per arena, 15 rounds per tournament.
  - Total rounds computed as $\binom{8}{2} \times 6 \times 10 \times 15 = 25{,}200$ (Section 3).
  - Runtime: tournament runtime “varies by arena,” 75 minutes on average, which works out to about 2,100 hours in total (1,680 tournaments × 75 minutes), “mostly due to model latency,” parallelized across tournaments (Section 3).
- Important ambiguity / inconsistency: Appendix C.2 later states “main results table reflects values of M = 9 … giving … 32,400 total rounds,” which conflicts with Section 3’s M = 8 and 25,200 rounds. Based on the provided paper text, I treat Section 3’s 8 models / 25,200 rounds as the primary configuration and flag Appendix C.2 as likely a typo or leftover configuration.
- Win definitions (Section 3; Appendix C.3.1):
  - A round win occurs if the model scores higher than the opponent (or the opponent has an invalid submission).
  - A tournament win goes to the player with the majority of round wins; ties are broken in favor of whichever player “scores the last win” (Section 3; Appendix C.3.1).
- Elo methodology (Section 3; Appendix C.3.1):
  - Uses Elo with base rating R0 = 1200 and slope β = 400 (Section 3; Appendix C.3.1).
  - Elo is fit via maximum likelihood under a Bradley–Terry logistic model rather than sequential Elo updates (Section 3; Appendix C.3.1, Eq. (1), Eq. (2)).
  - Rank stability is checked by parametric and non-parametric bootstrap with > 98% pairwise order agreement (Section 3; Appendix C.3.3; Table 4).
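A batch maximum-likelihood Elo fit differs from sequential Elo updates in that all results are used at once and the order of games does not matter. The sketch below assumes the standard Elo logistic form, P(i beats j) = 1 / (1 + 10^((R_j − R_i)/β)), and anchors the mean rating at R0; the paper's exact Eq. (1) and Eq. (2) are not reproduced in this summary, so the optimizer, the anchoring, and the name `fit_elo` are illustrative assumptions.

```python
import math
from typing import Dict, Tuple

R0, BETA = 1200.0, 400.0  # base rating and slope quoted above


def fit_elo(wins: Dict[Tuple[str, str], int], iters: int = 2000) -> Dict[str, float]:
    """Fit Bradley-Terry log-strengths by gradient ascent and map them to Elo.

    wins[(i, j)] is the number of times player i beat player j.
    """
    players = sorted({p for pair in wins for p in pair})
    theta = {p: 0.0 for p in players}  # natural-log strengths
    step = 1.0 / sum(wins.values())    # conservative step size for stability
    for _ in range(iters):
        grad = {p: 0.0 for p in players}
        for (winner, loser), n in wins.items():
            p_win = 1.0 / (1.0 + math.exp(theta[loser] - theta[winner]))
            # d/d(theta_winner) of n * log(p_win) is n * (1 - p_win).
            grad[winner] += n * (1.0 - p_win)
            grad[loser] -= n * (1.0 - p_win)
        for p in players:
            theta[p] += step * grad[p]
        # Only rating differences are identified, so re-center at zero mean.
        mean = sum(theta.values()) / len(players)
        for p in players:
            theta[p] -= mean
    scale = BETA / math.log(10)  # converts natural-log strength to Elo points
    return {p: R0 + scale * theta[p] for p in players}


ratings = fit_elo({("a", "b"): 3, ("b", "a"): 1})
print({name: round(r, 1) for name, r in ratings.items()})
```

For a 3–1 record between two players, the fitted gap should be about 400·log10(3) ≈ 191 points, split evenly around 1200. If one player never loses, the likelihood has no finite maximum and the ratings drift apart as iterations increase, so a real fit needs enough cross-wins or some regularization.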

4. Key Insights and Innovations

(1) Benchmarking “goal-oriented software engineering” via competitive arenas rather than unit tests
Unlike test-based benchmarks, where success means correctness, CodeClash defines success as winning a competitive objective (Section 2.3).
This is significant because it forces models to do the missing parts of SWE agency: select strategies, define intermediate metrics, and iterate without being told what to do.
(2) Persistent codebase as the only long-term memory
The deliberate codebase-as-memory constraint means any cross-round learning must be encoded into files (notes, scripts, tests) (Section 2.1; Section 2.3).
This makes codebase organization and tool-building part of the measured capability, not an external convenience.
(3) Log-based feedback as the sole external signal
The benchmark enforces a realistic “telemetry-driven” loop: after competition, logs/results are injected into the repository and become the only new information for the next round (Section 2.1; Appendix A; Figure 9).
This elevates the importance of interpreting feedback and validating changes, which the analysis later shows models struggle with (Section 5.2; Figure 8).
(4) Multi-round, adversarial adaptation with evolving opponents
Because both sides modify code across rounds, the benchmark tests whether models can adapt to opponent behavior rather than converge to a single canonical solution (Section 2.3; Section 5.1; Figure 5).
The paper provides evidence of increasing code diversity over rounds (Figure 5; Appendix D.4).
(5) Large-scale empirical evaluation + behavioral instrumentation of agents
Beyond win rates and Elo, the paper instruments edit behaviors (files edited, lines changed, steps used, thought length) and evaluates groundedness/hallucination/validation using an LM judge (Section 5; Appendix D.1, D.3; Figures 8, 29–38, 45–46).
This is an incremental-but-useful innovation: it turns the benchmark into a diagnostic tool rather than only a leaderboard.

5. Experimental Analysis

Evaluation methodology (Section 3; Appendix C; Appendix B)

Models: 8 LMs (Section 3).
Arenas: 6 arenas (Section 2.3; Table 2).
Tournament structure: one-on-one, 15 rounds per tournament, 10 tournaments per model pair per arena (Section 3).
Edit budget: 30 steps per round, $1 cost cap per round (Section 3; Appendix C.1).
Competition repetitions: the arena runs 1000 simulations per round and decides the winner by majority (Appendix A; the Appendix C.2 example config shows sims per round: 1000).
Metrics:
Tournament win rate aggregated over tournaments (Section 3).
Elo per arena and overall, fit via maximum-likelihood Bradley–Terry model (Section 3; Appendix C.3.1, Eq. (1), Eq. (2)).
Bootstrapped rank stability and Elo uncertainties (Appendix C.3.3; Table 3; Table 4; Figures 27–28).
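The tournament and round counts follow directly from the structure listed above. A quick check of the arithmetic, which also reproduces the conflicting M = 9 figure from Appendix C.2 (variable names are illustrative):

```python
from math import comb

models, arenas = 8, 6
tournaments_per_pair, rounds_per_tournament = 10, 15

pairs = comb(models, 2)  # unordered model pairs
tournaments = pairs * arenas * tournaments_per_pair
rounds = tournaments * rounds_per_tournament
print(pairs, tournaments, rounds)  # 28 1680 25200

# Same formula with the M = 9 mentioned in Appendix C.2.
rounds_m9 = comb(9, 2) * arenas * tournaments_per_pair * rounds_per_tournament
print(rounds_m9)  # 32400
```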

Main quantitative results (Section 4; Table 1; Figure 2; Figure 3; Table 3)

Overall Elo leaderboard (Table 1):
Claude Sonnet 4.5: 1389 overall Elo (per-arena: BattleSnake 1470; CoreWar 1641; Halite 1408; Poker 1248; RoboCode 1361; RobotRumble 1423).
GPT-5: 1360 overall (notably strong in Poker 1599 and Halite 1522, weaker in CoreWar 1199).
o3: 1343 overall (Halite 1576).
Next tier includes Claude Sonnet 4 (1223) and GPT-5 Mini (1200) (Table 1).
Lower: Gemini 2.5 Pro (1125), Grok Code Fast (1004), Qwen3 Coder (952) (Table 1).
No single model dominates all arenas: e.g., top overall Claude Sonnet 4.5 is only 4th in Poker by Elo (Section 4; Table 1).
Pairwise win rates: Figure 2 shows a matrix of tournament win rates (“row beats column”) aggregated across all arenas; it states Claude Sonnet 4.5 has the highest average win rate at 69.9% (Figure 2 caption).
Across-round dynamics: Figure 3 shows win rates by round; it highlights that some models gain momentum (e.g., Claude Sonnet 4.5) while others lose it over time (e.g., GPT-5) (Figure 3 caption; Section 4).

Ablations / additional evaluations (Section 4.1; Appendix D.2)

Human expert bot vs top model (RobotRumble)
Setup: Claude Sonnet 4.5 vs gigachad (top open-source submission) over 10 tournaments × 15 rounds = 150 rounds (Section 4.1).
Each RobotRumble round uses 250 simulations and majority vote; across 37,500 simulations, Claude Sonnet 4.5 wins 0 (Section 4.1).
This is used to argue a large human–model gap in this setting.
Transparent opponent code (Core War)
Modification: before the edit phase of round n, each player gets a read-only copy of its opponent’s code from round n−1 (Section 4.1).
For matchups among Claude Sonnet 4.5, GPT-5, and Gemini 2.5 Pro, round win rates shift as follows:
- GPT-5: 74.6% of rounds (+7.8%),
- Claude Sonnet 4.5: 53.2% (−1.8%),
- Gemini 2.5 Pro: 22.7% (−5.5%) (Section 4.1).
Code access behavior differs: GPT-5 inspects opponent code in only 12.8% of rounds, versus 99.3% for Claude Sonnet 4.5 and 52.9% for Gemini 2.5 Pro. GPT-5 nonetheless gains the most, suggesting that frequent inspection does not by itself yield an advantage (Section 4.1; Appendix D.2 Figure 41 provides additional breakdown).
Multi-player (6-player) Core War
Uses TrueSkill rather than Elo (Section 4.1; Appendix D.2 Figure 42).
Competitive volatility increases: lead changes occur 48.4% in 6-player vs 18.2% in 2-player; winners’ share of points drops to 28.6% vs 78.0% (Section 4.1; Figures 39–40).

Do experiments support the claims?

The paper’s claims about diverse development styles are supported by quantified editing behavior differences (Section 5.1; Appendix D.1 Figures 29–36).
Claims about strategic reasoning limitations are supported by the LM-judge annotations:
High rates of ungrounded edits outside the Claude models (Figure 8(a)).
Non-trivial rates of hallucinated claims about what caused a loss (Figure 8(b)).
Low validation rates for several models (Figure 8(c)).
The “messy codebases over time” claim is supported by:
Nearly linear growth in files created with rounds (Figure 6),
High filename redundancy for some models (Figure 6, Figure 51),
Many “throwaway” files not reused later (Figure 7),
Additional measures like root-level clutter and file reuse ratio (Appendix D.4 Figure 50).
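One way to make "throwaway files" measurable is to track which files an agent touches in each round and count the files that never reappear after the round in which they were created. The definition below is an assumption for illustration, not the paper's metric, and `throwaway_share` is a hypothetical name.

```python
from typing import Dict, List, Set


def throwaway_share(touched_by_round: List[Set[str]]) -> float:
    """Share of files that are never touched again after the round that created them."""
    first_seen: Dict[str, int] = {}
    reused: Set[str] = set()
    for round_idx, touched in enumerate(touched_by_round):
        for path in touched:
            if path in first_seen:
                reused.add(path)  # touched in a later round than its creation
            else:
                first_seen[path] = round_idx
    if not first_seen:
        return 0.0
    return 1.0 - len(reused) / len(first_seen)


# Toy history: one maintained bot file plus a new analysis script every round.
history = [
    {"bot.py", "analyze_v1.py"},
    {"bot.py", "analyze_v2.py"},
    {"bot.py", "analyze_final.py"},
]
print(throwaway_share(history))  # 0.75
```

In the toy history, the file count grows by one every round and only `bot.py` is ever reused, which mirrors the pattern of linear growth with redundant filenames described above.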
The “top models lose to experts” claim is strongly supported for RobotRumble via the 0/150 result (Section 4.1), but the paper explicitly notes lack of comparable human leaderboards/open-source baselines for other arenas and calls the sample limited (Section 4.1).

6. Limitations and Trade-offs

Arena scale vs real-world software
The paper acknowledges arenas are “smaller and more self-contained than most real-world software systems” (Section 7). This limits direct transfer of measured performance to large production codebases.
Scaffold choice constraints
The authors intentionally use mini-SWE-agent (bash-only) rather than more tool-heavy scaffolds (Section 3), aiming to evaluate models rather than tooling.
This is a trade-off: it reduces tool bias, but it may under-represent what a fully engineered agent system could do (Section 7).
Feedback modality is text-only
Competition logs are text-based; the paper does not explore multimodal feedback or VLMs (Section 7).
Evaluation and configuration ambiguities
As noted earlier, there is a discrepancy between Section 3’s 8 models / 25,200 rounds and Appendix C.2’s M=9 / 32,400 rounds statement. The paper does not resolve this internally in the provided text.
Human comparison is limited
Strong human-vs-model evidence is shown only for RobotRumble because other arenas lack accessible leaderboards or open-source ranked bots (Section 4.1).
Strategic opacity assumptions
Default setting prohibits seeing opponent codebases (Section 2.1), which may differ from some real competitive programming contexts (open-source strategies, shared baselines). The paper does explore a partial “transparent code” ablation, but only in Core War and only for a subset of models (Section 4.1).
Potential judge-model bias in behavioral annotation
Strategic reasoning analyses rely on “GPT-5 with high reasoning as a judge” for groundedness/hallucination/validation (Section 5.2; Appendix D.3). The paper provides structured schemas and removes agent thoughts for one study to avoid sycophancy (Appendix D.3), but the approach still inherits the usual risks of LM-based evaluation (e.g., misclassification), and the text does not provide human-rater validation in the excerpt.

7. Implications and Future Directions

How this changes the field (within the paper’s framing)
CodeClash shifts coding evaluation from “can you solve a specified task?” to “can you run a self-improving development loop under competitive pressure?” (Introduction; Section 2.3).
The findings suggest that current frontier models’ bottlenecks in such loops are less about command-line operation (low bash error rates; Section 5.2; Appendix D.1 Figures 37–38) and more about:
- Interpreting feedback (logs),
- Avoiding hallucinated causal explanations,
- Validating changes to prevent regressions,
- Maintaining codebase hygiene over long horizons (Section 5; Figures 6–8).
Research directions suggested by the results
Better feedback-to-edit pipelines: Since many edits are “ungrounded” and many models rarely validate changes (Figure 8), future agents might need stronger internal routines for:
- Parsing logs beyond “first lines,”
- Running systematic experiments (A/B tests between versions),
- Creating persistent analysis tooling rather than throwaway scripts (Figures 7, 50, 53).
Opponent modeling improvements: The transparent-code ablation shows that simply reading opponent code frequently does not guarantee advantage (Section 4.1), suggesting research into how to extract exploitative strategy from opponent code/logs.
Long-horizon repository management: The linear file creation trend and redundancy metrics (Figures 6–7, 51) motivate interventions (rewarding reuse/refactoring, penalizing clutter) if the benchmark is used as a training environment (Section 7; Appendix D.4).
Multi-agent strategic phenomena: The 6-player Core War experiments show increased volatility (Figures 39–40), suggesting opportunities to study coalition dynamics, positional play, and risk (Section 4.1).
Practical applications / downstream use cases (as implied)
As an open-source toolkit with logs and leaderboard (Abstract; Appendix), CodeClash can be used to:
- Compare SWE-agent designs under controlled scaffolding,
- Stress-test models on iterative improvement,
- Collect trajectories for potential training (Section 7 mentions pretraining on traces or post-training via self-play/RL).
Repro/Integration Guidance (based on provided paper)
Prefer CodeClash when you want to evaluate autonomous iteration rather than correctness on a fixed spec:
- If your goal is “does the agent build tools, read docs, interpret logs, and improve over many rounds,” CodeClash’s codebase-as-memory + log-based feedback makes those behaviors observable (Sections 2.1–2.3; Figure 1).
If you instead need to measure correctness on explicit tasks with clear pass/fail, the paper’s own positioning suggests test-based benchmarks (e.g., SWE-bench-style) are a better fit (Related Work).
To extend CodeClash to a new environment, the paper’s arena abstraction requires only commands to run competitions and determine winners (Section 2.2), with containerization and per-round simulation repetition handled by the framework (Appendix A).
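The two responsibilities of an arena, running a competition and determining a winner, map naturally onto a small abstract interface. The class and method names below are hypothetical and are not CodeClash's actual API; the `./engine` executable and `--players` flag are placeholders.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class Arena(ABC):
    """Hypothetical arena interface; names are illustrative, not CodeClash's API."""

    @abstractmethod
    def run_command(self, codebases: List[str]) -> List[str]:
        """Return the argv list that runs one competition between codebases."""

    @abstractmethod
    def determine_winner(self, output: str) -> Optional[str]:
        """Parse competition output and return the winner, or None for a draw."""


class HighScoreArena(Arena):
    """Toy arena: the engine prints one '<player> <score>' line per player."""

    def run_command(self, codebases: List[str]) -> List[str]:
        # Placeholder executable and flag, not a real CLI.
        return ["./engine", "--players", *codebases]

    def determine_winner(self, output: str) -> Optional[str]:
        scores: Dict[str, float] = {}
        for line in output.strip().splitlines():
            name, score = line.split()
            scores[name] = float(score)
        best = max(scores.values())
        leaders = [name for name, score in scores.items() if score == best]
        return leaders[0] if len(leaders) == 1 else None


arena = HighScoreArena()
print(arena.run_command(["repos/alice", "repos/bob"]))
print(arena.determine_winner("alice 3\nbob 5\n"))  # bob
```

Keeping the interface this narrow is what lets the framework own containerization and repeated simulations while each arena supplies only its engine-specific details.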