Confucius Code Agent: Scalable Agent Scaffolding for Real-World Codebases

ArXiv: 2512.10398

🎯 Pitch

This paper introduces the Confucius Code Agent (CCA) and the Confucius SDK, an agent scaffold organized around Agent Experience (AX), User Experience (UX), and Developer Experience (DX). The SDK combines a lightweight orchestrator, hierarchical working memory with adaptive context compression, persistent note‑taking, modular extensions, and a meta‑agent build–test–improve loop to support reliable, long‑horizon code reasoning over large repositories. By shifting the focus from bigger models to principled agent scaffolding, CCA reports Resolve@1 of 54.3% on SWE‑Bench‑Pro while offering the extensibility, interpretability, and cross‑session learning needed for production software engineering.


1. Executive Summary

Confucius Code Agent (CCA) is an agent scaffold for real-world software engineering that targets two scaling bottlenecks in coding agents: long-context, long-horizon execution over large repos and persistent cross-session learning. It introduces the Confucius SDK, organized around Agent Experience (AX), User Experience (UX), and Developer Experience (DX), combining an orchestrator loop, hierarchical working memory with adaptive context compression, a persistent note-taking subsystem, and a modular extension system for tool use. On SWE-Bench-Pro (731 tasks), CCA reports Resolve@1 = 54.3% (Table 1), attributing gains to scaffolding rather than changing the backbone model or tool environment.

2. Context and Motivation

  • What specific problem or gap is addressed?
  • The work targets the gap between:
    • Research-grade coding agents that are transparent/inspectable but struggle to scale to “heavier, production-level workloads,” and
    • Production-grade systems that may perform well in practice but offer limited “extensibility, interpretability, and controllability” (abstract + Introduction).
  • It identifies two core challenges (Introduction):

    • C1: Long-context reasoning: localizing relevant code and performing multi-hop reasoning across large repos, long tool traces, and deep execution histories.
    • C2: Long-term memory: accumulating persistent knowledge across tasks/sessions (patterns, invariants, failure modes) instead of re-discovering them.
  • Why is the problem important?

  • Real software engineering issue resolution involves:
    • Multi-file changes, debugging loops, command execution, and tests.
    • Long-running sessions where raw chat histories and tool logs can exceed context limits (Section 2.3.1).
  • The paper argues that simply increasing model context windows is insufficient; the structure of the agent’s “cognitive and operational environment” matters (Introduction).

  • What prior approaches existed, and where do they fall short (as framed here)?

  • Tool-augmented agent frameworks like SWE-Agent and OpenHands scaffold LLMs with search/edit/execute tools (Introduction; Section 4.2).
  • Prompt-only pipelines (e.g., Agentless) can do well under fixed staged workflows but are not the focus here (Introduction; Section 4.2).
  • The paper’s critique is that many coding agents rely on:

    • Flat interaction histories,
    • Heuristic prompt engineering,
    • Tightly coupled tool pipelines,
    • Naive truncation/ad-hoc retrieval for context management,
    • Limited or coarse “memory” (e.g., embedding entire turns), which fails to capture structure like design decisions and failure modes (Section 2.3.2).
  • How does this paper position itself relative to existing work?

  • It positions agent scaffolding (orchestration, memory structures, tool abstractions) as a “fundamental research dimension” that can cause large performance differences even with the same backbone model (Introduction).
  • It proposes a system-level design decomposition into AX/UX/DX (Figure 3) and implements it as a reusable SDK (Figure 2), then instantiates a concrete agent (CCA) and evaluates it against established scaffolds like SWE-Agent and Live-SWE-Agent (Section 3).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a software engineering LLM agent scaffold (an orchestration + memory + tool-use framework) designed to run iterative “think → act → observe” loops over real repositories.
  • It solves long-horizon, tool-heavy coding tasks by structuring what the model sees (AX), what the user sees (UX), and what developers can measure/modify (DX) via an orchestrator, hierarchical memory + compression, persistent notes, and modular tool extensions.

3.2 Big-picture architecture (diagram in words)

  • User provides a task/issue → Orchestrator runs an iterative loop (Algorithm 1) that:
  • Builds a prompt from system prompt + memory,
  • Calls the LLM,
  • Parses the output into structured actions,
  • Routes actions to Extensions (tools like bash/file edit/search),
  • Executes tools against the Environment (filesystem/console/etc.),
  • Writes summarized observations back into Hierarchical Working Memory and optionally into Long-term Notes.
  • Separately, a Note-taking agent turns trajectories into persistent Markdown notes (Section 2.3.2).
  • A Meta-agent runs a build–test–improve loop to synthesize/refine agent configurations and tool-use conventions (Figure 5; Section 2.3.4).

3.3 Roadmap for the deep dive

  • I will explain the approach in the operational order you would encounter it when running the agent:
  • First, the orchestrator loop and output parsing (Algorithm 1; Section 2.2), since it defines execution semantics.
  • Second, context management (hierarchical working memory + compression) because it determines what the LLM sees at each iteration (Section 2.3.1; Figure 4).
  • Third, note-taking long-term memory because it creates cross-session learning artifacts (Section 2.3.2; Appendix B).
  • Fourth, the extension system that encapsulates tool behaviors and prompt shaping (Section 2.3.3).
  • Fifth, the meta-agent that automates scaffold/config refinement (Section 2.3.4; Figure 5).
  • Finally, I will connect these mechanisms to the reported experiments (Section 3; Tables 1–6).

3.4 Detailed, sentence-based technical breakdown

  • Framing sentence (paper type + core idea). This is primarily an empirical systems/scaffolding paper: it defines an agent runtime (Confucius SDK) and shows that improving orchestration, memory, and tool modularization can materially improve coding-agent benchmark performance without changing the backbone model (Sections 2–3; Table 1).

  • System/data pipeline diagram in words (explicit “first, second, third”).

  • Initialization happens first. The orchestrator initializes the session context, the memory structures, and the registered extensions (Algorithm 1, line 1).
  • LLM invocation happens next. On each iteration, the orchestrator calls the LLM using system prompt + memory (Algorithm 1, line 3), where “memory” is not just a flat chat history but a curated representation produced by the context management layer (Section 2.3.1).
  • Output parsing happens next. The orchestrator converts the LLM response into structured actions (Algorithm 1, line 4) using one of two interfaces (Section 2.2):
    • If the model supports native tool-use APIs, the model emits structured JSON tool calls.
    • Otherwise, the model emits XML-style tags (e.g., <bash>...</bash>), which are parsed into the same structured action format.
  • Tool routing and execution happen next. For each action, the orchestrator routes it to the correct extension (Algorithm 1, lines 5–7), and the extension executes the operation against the environment (filesystem, console, database, etc.; Figure 2).
  • Memory updates happen after execution. Extensions update memory with results, errors, or other observations (Algorithm 1, line 7 and lines 8–10), and may request continuation so the orchestrator immediately re-invokes the LLM with updated memory (Section 2.2).
  • Termination happens when actions stop or completion triggers. The loop stops if the agent emits no further actions (implicit completion) or if a completion check triggers (Algorithm 1, line 13), subject to a maximum iteration bound (Section 2.2).
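The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SDK's implementation: `Action`, `parse_actions`, `run_orchestrator`, and the scripted stand-in model are hypothetical names, memory is a flat list rather than the hierarchical structure the paper describes, and only the XML-style tag interface is shown.

```python
import re
from dataclasses import dataclass

@dataclass
class Action:
    tool: str      # tag name, e.g. "bash"
    payload: str   # text between the opening and closing tag

# Hypothetical parser for the XML-style fallback interface (<bash>...</bash>).
TAG_RE = re.compile(r"<(?P<tool>\w+)>(?P<body>.*?)</(?P=tool)>", re.DOTALL)

def parse_actions(llm_output: str) -> list[Action]:
    """Convert raw model text into structured actions."""
    return [Action(m.group("tool"), m.group("body").strip())
            for m in TAG_RE.finditer(llm_output)]

def run_orchestrator(llm, extensions, system_prompt, task, max_iters=10):
    """Think -> act -> observe loop with a maximum iteration bound."""
    memory = [f"task: {task}"]                    # initialization
    for iteration in range(1, max_iters + 1):
        output = llm(system_prompt, memory)       # LLM invocation
        actions = parse_actions(output)           # output parsing
        if not actions:                           # implicit completion
            return memory, iteration
        for action in actions:                    # routing and execution
            handler = extensions.get(action.tool)
            if handler is None:
                memory.append(f"error: no extension for <{action.tool}>")
            else:
                memory.append(handler(action.payload))  # memory update
    return memory, max_iters

def scripted_llm(system_prompt, memory):
    """Stand-in for a real model: act once, then stop."""
    if len(memory) == 1:
        return "I will list files. <bash>ls</bash>"
    return "Done."

# The "bash" extension here is a fake that returns a canned observation.
extensions = {"bash": lambda cmd: f"result of `{cmd}`: a.py b.py"}
memory, iterations = run_orchestrator(
    scripted_llm, extensions, "You are a coding agent.", "fix bug")
print(iterations, memory)
```

In this sketch the run ends on the second iteration because the stand-in model emits no further tags, which mirrors the implicit-completion rule; a real completion check would be an additional exit condition.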

  • AX/UX/DX separation (what the agent sees vs what the user sees).

  • The SDK treats AX and UX as intentionally different channels (Section 2.1; Figure 3).
  • A concrete example is provided in Section 2.1: users may see verbose streaming updates and diffs, while the agent sees a compressed representation (e.g., a succinct <file_edit ...> plus a short <result>), which reduces distraction and context bloat for the model.
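To make the two-channel idea concrete, here is a small sketch that renders one edit event twice: verbosely for the user and compactly for the model. The `FileEditEvent` type, both render functions, and the `path` attribute are assumptions for illustration; the paper only indicates that the agent-side form is a succinct `<file_edit ...>` plus a short `<result>`.

```python
from dataclasses import dataclass

@dataclass
class FileEditEvent:
    path: str
    diff: str
    ok: bool

def render_for_user(event: FileEditEvent) -> str:
    """UX channel: full status line plus the complete diff."""
    status = "applied" if event.ok else "failed"
    return f"Edit {status}: {event.path}\n{event.diff}"

def render_for_agent(event: FileEditEvent) -> str:
    """AX channel: compact record with no diff body (format is assumed)."""
    result = "ok" if event.ok else "error"
    return f'<file_edit path="{event.path}"/> <result>{result}</result>'

event = FileEditEvent(
    path="src/app.py",
    diff=("--- a/src/app.py\n+++ b/src/app.py\n@@ -1,2 +1,2 @@\n"
          " def add(a, b):\n-    return a\n+    return a + b\n"),
    ok=True,
)
user_view = render_for_user(event)
agent_view = render_for_agent(event)
print(agent_view)
```

The design point is that both views derive from the same event, so the user loses no detail while the model's context only grows by one short line per edit.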

  • F1: Context management = hierarchical working memory + adaptive compression (C1; AX).

  • The SDK defines a hierarchical working memory with configurable visibility scopes (Section 2.3.1).
    • The paper gives an example hierarchy for an instance (Section 2.3.1), showing an instance root, a memory node, and task-specific directories/files like analysis.md, implementation_summary.md, and todo.md.
  • It adds an adaptive context compression mechanism driven by a planner agent called the Architect (Figure 4; Section 2.3.1).
    • When the effective prompt length approaches configurable thresholds, the orchestrator invokes the Architect in a separate LLM call.
    • The Architect produces a structured plan-like summary that explicitly preserves categories such as goals, decisions, open TODOs, and critical error traces (Figure 4; Section 2.3.1).
    • The system replaces older raw history spans with the summary while retaining a rolling window of recent messages verbatim (Figure 4).
  • The claimed benefits are (Section 2.3.1):

    • Avoiding brittle fixed-window truncation or naive retrieval that might silently drop key state.
    • Keeping important intermediate artifacts accessible via hierarchy even when raw history is compressed.
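The compression step can be sketched as follows, assuming a whitespace word count as a crude token estimate and a stub in place of the Architect's separate LLM call. The function names, the threshold value, and the summary format are all hypothetical; only the overall shape (summarize older history, keep a rolling window of recent messages verbatim) comes from the description above.

```python
def estimate_tokens(messages: list[str]) -> int:
    """Crude proxy for prompt length: whitespace-separated words."""
    return sum(len(m.split()) for m in messages)

def compress_if_needed(messages, summarize, threshold_tokens, keep_recent):
    """Replace older history with one summary once the threshold is exceeded."""
    if estimate_tokens(messages) <= threshold_tokens:
        return messages
    split = len(messages) - keep_recent
    if split <= 0:                      # nothing old enough to compress
        return messages
    older, recent = messages[:split], messages[split:]
    return [summarize(older)] + recent  # recent window stays verbatim

def architect_stub(older: list[str]) -> str:
    """Stand-in for the Architect call; a real one would fill each category."""
    return (f"[summary of {len(older)} older messages] "
            "goals / decisions / open TODOs / critical errors")

history = [f"step {i}: ran tests and inspected the failing module"
           for i in range(6)]
compressed = compress_if_needed(history, architect_stub,
                                threshold_tokens=20, keep_recent=2)
print(len(history), "->", len(compressed))
```

A fixed-window truncation would simply drop `older`; the difference here is that the dropped span is replaced by a structured summary, so goals and open TODOs remain visible to the model.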
  • F2: Note-taking long-term memory = asynchronous, persistent Markdown notes (C2; AX+UX).

  • Every session interaction is logged as a structured trajectory including user messages, tool invocations, LLM outputs, and system events (Section 2.3.2).
  • A dedicated note-taking agent distills trajectories into compact notes asynchronously so it does not affect online latency of the main agent (Section 2.3.2).
  • Notes are stored as a file-tree of Markdown documents with lightweight tags and typed memory nodes (Section 2.3.2).
    • The SDK exposes tools to search/read/write/edit/delete/import these nodes, enabling programmatic reuse across sessions (Section 2.3.2).
  • A distinctive emphasis is on hindsight notes capturing failures (errors, exceptions, unproductive strategies) alongside resolutions or reasons for abandonment (Section 2.3.2).

    • Appendix B provides concrete examples of such notes (e.g., handling wildcard escaping and a prefix-removal edge case), illustrating the intended “reusable knowledge” structure.
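A file-tree of tagged Markdown notes is simple enough to sketch with the standard library. The layout, the `tags:` line, and the helper names below are assumptions, and the note text is invented for illustration rather than quoted from Appendix B; the SDK's actual note tools and schema are not specified in this summary.

```python
import tempfile
from pathlib import Path

def write_note(root: Path, rel_path: str, title: str,
               tags: list[str], body: str) -> Path:
    """Persist one note as a Markdown file with a lightweight tag line."""
    path = root / rel_path
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(f"# {title}\ntags: {', '.join(tags)}\n\n{body}\n",
                    encoding="utf-8")
    return path

def search_notes(root: Path, query: str) -> list[Path]:
    """Case-insensitive substring search over every note in the tree."""
    needle = query.lower()
    return sorted(p for p in root.rglob("*.md")
                  if needle in p.read_text(encoding="utf-8").lower())

notes_root = Path(tempfile.mkdtemp())
# Hindsight note: records a failure together with its resolution.
write_note(notes_root, "failures/wildcard_escaping.md", "Wildcard escaping",
           ["hindsight", "search"],
           "Symptom: pattern matched too many files.\n"
           "Resolution: escape the wildcard before searching.")
write_note(notes_root, "conventions/test_layout.md", "Test layout",
           ["convention"], "Unit tests live next to the module they cover.")

hits = search_notes(notes_root, "hindsight")
print([p.name for p in hits])
```

Because notes are plain files, a later session can look up a failure by its error text with the same search call, and a human can read or edit the same notes directly.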
  • F3: Extensions = modular tool-use + prompt shaping via typed callbacks (C1; AX+DX).

  • The SDK factors behaviors (parsing, tool invocation, side-effect management, prompt shaping) into extensions rather than ad-hoc code (Section 2.3.3).
  • An extension is a typed configuration object registering callbacks such as on_input_messages, on_plain_text, on_tag, and on_llm_output (Section 2.3.3).
  • During each orchestrator loop iteration, callbacks are invoked in a fixed order and operate over a shared run context that exposes I/O, storage, hierarchical memory, and artifacts (Section 2.3.3).
  • The paper categorizes extensions into (Section 2.3.3):
    • Perception extensions that parse raw model outputs into structured actions (e.g., file-edit and CLI parsing/validation).
    • Reasoning extensions that rewrite/annotate messages pre-LLM (e.g., planning modules, formatting instructions).
    • Action extensions that execute tools (shell commands, edits, search) and persist summarized results into memory.
  • Examples of behaviors mentioned include rewriting naive grep into scalable queries, per-command validation for CLI execution, and provider-specific prompt caching metadata to reduce cost/latency (Section 2.3.3).
  • The paper stresses that CCA is an orchestrator instantiated with a bundle of extensions, and that ablations vary enabled/configured extensions while holding the orchestrator fixed (Section 2.3.3).
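The callback-based design can be illustrated with a small sketch. The callback names `on_input_messages` and `on_tag` come from the paper, but their signatures, the `RunContext` fields, and the two example extensions are assumptions made here; the guarded shell extension only records what it would do and never executes a command.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class RunContext:
    """Shared state passed to every callback (fields are assumed)."""
    memory: list[str] = field(default_factory=list)

@dataclass
class Extension:
    name: str
    on_input_messages: Optional[Callable[[list[str], RunContext], list[str]]] = None
    on_tag: Optional[Callable[[str, str, RunContext], None]] = None

def dispatch_input(extensions, messages, ctx):
    """Pre-LLM hook: each extension may rewrite the outgoing messages."""
    for ext in extensions:              # fixed registration order
        if ext.on_input_messages:
            messages = ext.on_input_messages(messages, ctx)
    return messages

def dispatch_tag(extensions, tag, body, ctx):
    """Post-LLM hook: each extension sees every parsed tag."""
    for ext in extensions:
        if ext.on_tag:
            ext.on_tag(tag, body, ctx)

def add_format_hint(messages, ctx):
    # Reasoning-style extension: annotate the prompt before the LLM call.
    return messages + ["Reply with <bash>...</bash> tags for shell commands."]

def guarded_bash(tag, body, ctx):
    # Action-style extension with per-command validation (toy rule).
    if tag != "bash":
        return
    verdict = "rejected" if body.strip().startswith("rm ") else "would run"
    ctx.memory.append(f"{verdict}: {body}")

extensions = [Extension("format_hint", on_input_messages=add_format_hint),
              Extension("guarded_bash", on_tag=guarded_bash)]
ctx = RunContext()
prompt = dispatch_input(extensions, ["fix the failing test"], ctx)
dispatch_tag(extensions, "bash", "pytest -q", ctx)
dispatch_tag(extensions, "bash", "rm -rf build", ctx)
print(prompt, ctx.memory)
```

Dropping an entry from `extensions` disables that behavior without touching the dispatch code, which is the property that makes extension-level ablations possible.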

  • F4: Meta-agent = automated configuration synthesis + build–test–improve loop (DX).

  • The Meta-agent is an agent (built on the same orchestrator) that constructs and refines other agents given high-level specs and constraints (Section 2.3.4; Figure 5).
  • The meta-agent workflow is (Section 2.3.4):
    • Gather requirements via a structured configuration form (repo scope, constraints, which extensions, evaluation tasks/test suites).
    • Synthesize prompts/config and wire extensions + memory policies.
    • Spin up the candidate agent locally and run regression tasks.
    • Observe failures (e.g., brittle tool choice, incorrect file-edit patterns, weak recovery) and propose concrete modifications to prompts/config/tool wrappers.
    • Repeat until target metrics stabilize (Figure 5; Section 2.3.4).
  • The paper claims the “production CCA” is itself an outcome of this loop (Section 2.3.4), and later attributes large performance differences to “Meta-agent–derived” tool-use conventions (Section 3.3).
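The workflow above amounts to an evaluate, diagnose, and patch loop, sketched below. Everything here is a toy under stated assumptions: a configuration is just a set of enabled conventions, the regression tasks and convention names are invented, and the stopping rule is a fixed target score rather than the "until metrics stabilize" criterion the paper describes.

```python
def build_test_improve(config, evaluate, propose_fix, target, max_rounds=5):
    """Run regression tasks, patch the config from failures, repeat."""
    history = []
    for _ in range(max_rounds):
        score, failures = evaluate(config)       # test
        history.append(score)
        if score >= target or not failures:
            break
        config = propose_fix(config, failures)   # improve
    return config, history

# Toy regression suite: each task passes only if a convention is enabled.
REQUIRED = {"task_a": "validate_cli",
            "task_b": "scalable_search",
            "task_c": "validate_cli"}

def evaluate(config):
    failures = [task for task, need in REQUIRED.items() if need not in config]
    return 1 - len(failures) / len(REQUIRED), failures

def propose_fix(config, failures):
    # Stand-in for the meta-agent's diagnosis: enable what the first failure needs.
    return config | {REQUIRED[failures[0]]}

final_config, history = build_test_improve(
    frozenset(), evaluate, propose_fix, target=1.0)
print(sorted(final_config), history)
```

In the real system the `propose_fix` step would be an LLM-driven edit to prompts, configuration, or tool wrappers, and `evaluate` would run the candidate agent on actual regression tasks.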

  • Core configurations / hyperparameters (what is specified vs missing).

  • Specified in the provided text:
    • Benchmarks: SWE-Bench-Pro public split (731 tasks) and SWE-Bench-Verified (500 tasks) (Section 3.1).
    • Metric: Resolve Rate = percent of tasks whose patch passes all repo tests without human intervention; they report mean Resolve@1 across three runs (Section 3.1).
    • Backbone models: Claude 4 Sonnet, Claude 4.5 Sonnet, Claude 4.5 Opus (Section 3.1; Table 1).
    • Baseline scaffolds: SWE-Agent and Live-SWE-Agent (Section 3.1; Table 1).
    • A separate “thinkingBudget” parameter sweep for Claude 4 Sonnet on a SWE-Bench-Verified subset: 8k/16k/32k tokens (Appendix A; Table 6).
  • Not specified (so cannot be filled in without guessing):
    • Training hyperparameters (optimizer, LR schedule, batch size, etc.), because this work evaluates scaffolding at inference time and does not describe training a new model.
    • Systems performance numbers like latency/throughput/memory footprint for the SDK runtime; while “prompt-caching” and developer tools are described (Section 2.3.3; Section 2.4), concrete latency/throughput metrics are not provided in the excerpt.
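The metric defined above is straightforward to compute. The sketch below uses made-up pass/fail outcomes purely to show the arithmetic: a per-run resolve rate, then the mean across three runs.

```python
from statistics import mean

def resolve_rate(outcomes: list[bool]) -> float:
    """Percent of tasks whose patch passed all repository tests."""
    return 100.0 * sum(outcomes) / len(outcomes)

def mean_resolve_at_1(runs: list[list[bool]]) -> float:
    """Mean of the per-run resolve rates."""
    return mean([resolve_rate(run) for run in runs])

# Illustrative data only: three runs over the same four tasks.
runs = [
    [True, True, False, False],   # 50.0
    [True, False, False, False],  # 25.0
    [True, True, True, False],    # 75.0
]
score = mean_resolve_at_1(runs)
print(score)
```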

4. Key Insights and Innovations

  • (1) AX/UX/DX as first-class, explicitly separated design axes (Figure 3; Section 2.1).
  • Novelty: Instead of conflating “what users see” with “what the model sees,” the framework treats them as distinct representations.
  • Significance: The paper argues this reduces prompt noise/context overflow for the model (better AX), while preserving transparency for humans (UX) and observability/modularity for developers (DX).

  • (2) Hierarchical working memory + adaptive, planner-driven context compression (F1; Figure 4; Section 2.3.1).

  • Novelty: Compression is triggered adaptively near prompt-length thresholds and produces structured summaries that preserve specific categories (goals/decisions/TODOs/errors) rather than generic summarization.
  • Significance: Reported to improve Resolve@1 in ablations (Table 2) and to support longer planning depth (Section 3.4.1, which reports higher mean planning iterations with context management).

  • (3) Persistent, hierarchical note-taking with “hindsight notes” for failure modes (F2; Section 2.3.2; Appendix B).

  • Novelty: Long-term memory is implemented as human-readable, structured Markdown notes in a file-tree, with explicit focus on indexing failures by error messages/stack traces/affected components.
  • Significance: The paper reports reduced average turns and token cost and a small Resolve@1 increase on repeated runs when notes are reused (Table 4).

  • (4) Extension architecture as the unit of composability and ablation (F3; Section 2.3.3).

  • Novelty: Typed callbacks that can shape prompts, parse outputs, validate tool calls, and write summarized results to memory, all as modular components attached to a minimal orchestrator.
  • Significance: This makes “tool-use sophistication” and other behaviors independently configurable and measurable (Table 2 is framed as toggling tool-use sophistication and context management).

  • (5) Meta-agent for automatic scaffold/config refinement via build–test–improve (F4; Figure 5; Section 2.3.4; Section 3.3).

  • Novelty: Automates agent configuration and tool-use convention refinement, not just task-solving.
  • Significance: The paper attributes a large share of performance gains to meta-agent–learned tool-use behavior (Section 3.3; Table 2).

5. Experimental Analysis

  • Evaluation methodology (datasets, metrics, setup).
  • SWE-Bench-Pro public split:
    • 731 tasks (Section 3.1).
    • Environment configuration and infrastructure kept identical to the SWE-Agent baseline (Section 3.1).
    • Metric: official Resolve Rate / Resolve@1 (tests pass) (Section 3.1).
    • They repeat each trial with different random seeds and report mean Resolve@1 across three runs (Section 3.1).
  • SWE-Bench-Verified:
    • 500 tasks (Section 3.1; Section 3.6).
  • Additional analyses:

    • Ablations on a 100-example subset of SWE-Bench-Pro (Table 2).
    • Edited-file bucket robustness analysis on SWE-Bench-Pro grouped by number of modified files (Table 3).
    • Long-term memory evaluation via two consecutive passes on 151 tasks for which notes were produced (Table 4).
    • Thinking budget scaling on a SWE-Bench-Verified subset (Appendix A; Table 6).
    • A qualitative/controlled comparison to Claude Code on a custom “PyTorch-Bench” of 8 issues, but explicitly noted as not directly comparable to SWE-Bench-Pro due to environment/tooling differences (Appendix C).
  • Main quantitative results (with specific numbers).

  • SWE-Bench-Pro (Table 1; also Figure 1 visualizes comparisons):
    • Claude 4 Sonnet + SWE-Agent: 42.7% vs Claude 4 Sonnet + CCA: 45.5%.

    • Claude 4.5 Sonnet + SWE-Agent: 43.6%, + Live-SWE-Agent: 45.8%, + CCA: 52.7%.

    • Claude 4.5 Opus + CCA: 54.3%.

    • Table 1 also includes a 52.0% number attributed to “Anthropic System Card” for Claude 4.5 Opus, with a footnote that it is from a proprietary scaffold and reported externally.
  • Ablation on SWE-Bench-Pro subset (Table 2):
    • With Claude 4 Sonnet: 42.0% without advanced context management vs 48.6% with it, on the 100-example subset.
    • With Claude 4.5 Sonnet, the tool-use and context-management variants are:
      • No context management + simple tool use = 44.0%,
      • No context management + advanced tool use = 51.0%,
      • Context management + advanced tool use = 51.6%.
    • Section 3.3 interprets this as “learned tool-use” being a major driver, because moving from simple→advanced tool use produces a large gain even without context management (44.0 → 51.0).
  • Multi-file robustness (Table 3, SWE-Bench-Pro grouped by edited files):
    • 1–2 files: 57.8% (n=294)
    • 3–4 files: 49.2% (n=203)
    • 5–6 files: 44.1% (n=86)
    • 7–10 files: 52.6% (n=38)
    • 10+ files: 44.4% (n=18)
  • Long-term memory via notes (Table 4; Claude 4.5 Sonnet; 151 tasks rerun):
    • Run 1: Avg. turns = 64, Avg. token cost = 104k, Resolve@1 = 53.0%
    • Run 2 with notes: Avg. turns = 61, Avg. token cost = 93k, Resolve@1 = 54.4%
  • SWE-Bench-Verified comparison (Table 5):
    • Claude 4 Sonnet + SWE-Agent: 66.6%
    • Claude 4 Sonnet + OpenHands: 72.8%
    • Claude 4 Sonnet + CCA: 74.6%
    • Claude 4.5 Sonnet + mini-SWE-Agent: 70.6%
  • Thinking budget scaling (Appendix A; Table 6; SWE-Bench-Verified subset; Claude 4 Sonnet):

    • 8k: 67.3%, 16k: 68.4%, 32k: 68.7%, with diminishing returns beyond 16k (Appendix A).
  • Do experiments support the claims? (critical assessment grounded in provided text)

  • The main claim—scaffolding changes alone can yield large improvements under identical model backend and environment—is supported by Table 1’s within-model comparisons (e.g., Claude 4.5 Sonnet: 43.6% SWE-Agent vs 52.7% CCA).
  • The claim that context management matters is supported by Table 2 (notably +6.6 points for Claude 4 Sonnet on the subset) and by Section 3.4.1’s qualitative/aggregate observations (planner reduces prompt length by “over 40%” and increases mean planning iterations from 1.4 to 2.7). However, the “over 40%” figure is described as a manual inspection result rather than a benchmarked aggregate with distribution details.
  • The claim that meta-agent–learned tool-use conventions drive performance is supported by Table 2’s “simple vs advanced tool use” gap on Claude 4.5 Sonnet (44.0 → 51.0) while context management is held off (Section 3.3), but the paper does not fully specify what constitutes “advanced” vs “simple” beyond being meta-agent-derived conventions.
  • The long-term memory benefit is supported by Table 4, but the magnitude is modest (+1.4 Resolve@1 on 151 rerun tasks) and the selection criterion “notes produced for 151 instances” could bias toward cases where notes are more helpful (the text says it “skips cases where no meaningful insight can be distilled”).
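The point differences cited in this assessment can be rechecked directly from the figures quoted above. The snippet below only restates numbers already listed from Tables 1, 2, and 4; the dictionary keys are labels chosen here for readability.

```python
def delta(after: float, before: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(after - before, 1)

gains = {
    # Table 1: same backbone, CCA vs SWE-Agent
    "claude_4_sonnet_cca_vs_swe_agent": delta(45.5, 42.7),
    "claude_4_5_sonnet_cca_vs_swe_agent": delta(52.7, 43.6),
    # Table 2: 100-example subset ablations
    "context_management_claude_4_sonnet": delta(48.6, 42.0),
    "advanced_tool_use_no_context": delta(51.0, 44.0),
    "context_on_top_of_advanced_tools": delta(51.6, 51.0),
    # Table 4: second pass with notes
    "notes_resolve_at_1": delta(54.4, 53.0),
}

# Table 4: relative reduction in average token cost (104k -> 93k).
token_cost_reduction_pct = round(100 * (104 - 93) / 104, 1)

for name, value in gains.items():
    print(f"{name}: +{value} points")
print(f"token cost reduction: {token_cost_reduction_pct}%")
```

Seen side by side, the tool-use ablation (+7.0) and the Claude 4 Sonnet context-management ablation (+6.6) are of similar size, while the note-reuse gain in Resolve@1 (+1.4) is much smaller than its roughly 10.6% token-cost saving.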

  • Ablations, failure cases, robustness checks mentioned here

  • Ablations: context management and tool-use sophistication (Table 2).
  • Robustness: edited-file bucket analysis (Table 3).
  • Memory: two-pass rerun with notes (Table 4).
  • Thinking budget sensitivity (Appendix A; Table 6).
  • The excerpt mentions “analyses categorize remaining failure cases” in the contribution list, but detailed failure taxonomy is not included in the provided text (so I cannot summarize categories beyond what’s said in Section 3.4.2 about degradation sources).

6. Limitations and Trade-offs

  • Underspecified “advanced tool-use” mechanism.
  • The paper attributes substantial gains to “Meta-agent learned tool-use” (Section 3.3; Table 2), but the excerpt does not fully enumerate the exact policies, prompts, validators, or wrappers that differentiate “simple” from “advanced,” which limits reproducibility from this text alone.

  • Subset-based ablation constraints.

  • Table 2 is computed on a 100-example subset because without context control “many trajectories exceed model token limits and fail to complete,” so ablation comparisons are conditioned on solvable trajectories (Section 3.4.1). This is a reasonable engineering constraint, but it means the subset results may not be representative of the full benchmark.

  • Long-term memory evaluation is indirect and potentially selection-biased.

  • The note-taking evaluation reruns only 151 tasks where the note taker produced meaningful notes (Section 3.5; Table 4). Tasks with no distilled notes are excluded, which may inflate observed gains relative to a fully unconditional “all tasks” rerun.

  • Operational cost and complexity trade-offs are discussed qualitatively, not quantified.

  • The SDK introduces additional components (Architect planner calls, note-taking agent calls, meta-agent loops, extension layers). The excerpt does not provide end-to-end latency/throughput overhead, tool-call counts at scale, or resource consumption beyond token cost in Table 4 and some token counts in Appendix C’s case-study narrative.

  • Comparisons to commercial tools are limited by tooling constraints and environment mismatch.

  • The paper states that it cannot compare to Claude Code on SWE-Bench-Pro because there is no programmatic interface compatible with containerized evaluation (Section 3.2). The alternative PyTorch-Bench comparison is acknowledged as not directly comparable because of different execution environments (Appendix C).

  • Generalization beyond the evaluated benchmarks is not guaranteed.

  • The design is motivated by large codebases, but the excerpt’s strongest quantitative evidence centers on SWE-Bench-Pro/Verified and a small curated PyTorch-Bench (Section 3; Appendix C). Performance in other languages, tool stacks, or private monorepos is not quantified here.

7. Implications and Future Directions

  • How this changes the landscape (within scope of the provided text).
  • The results argue that “agentic scaffolding” can be a primary determinant of coding-agent performance, sometimes enough for a weaker backbone with a better scaffold to outperform a stronger backbone with a weaker scaffold. Section 3.2 highlights Claude 4.5 Sonnet + CCA (52.7%) vs Claude 4.5 Opus + the Anthropic system card scaffold (52.0%), noting that the latter is externally reported.
  • The AX/UX/DX decomposition provides a vocabulary and design target for building agents that are simultaneously:

    • Stable for the LLM (AX),
    • Transparent/controllable for users (UX),
    • Modular/observable for developers (DX) (Section 2.1; Figure 3; Section 2.4).
  • Follow-up research the paper suggests (explicitly mentioned).

  • The Future Work section proposes integrating reinforcement learning (RL) more deeply by treating agent execution as an MDP and leveraging the SDK’s trajectory interfaces and meta-agent feedback signals as potential reward sources (Section 5).
  • It suggests rewards tied to note-taking quality, tool robustness, recovery behaviors, and exploration efficiency, and proposes curricula via progressively richer toolsets/environments (Section 5).

  • Practical applications / downstream use cases mentioned or implied here.

  • Building specialized internal agents such as:
    • CI failure triage agents,
    • Refactoring agents with constrained access (Section 2.3.4).
  • Operating on large-scale repositories with long debugging sessions and multi-file edits, with developer tooling such as Trace UI, Playground, Eval UI, and centralized management (Section 2.4; Appendix D).

  • Repro/Integration Guidance (when to prefer this method, based on the text).

  • Prefer a Confucius SDK-style scaffold when tasks are long-horizon and tool-heavy (multi-step debugging, multi-file refactors, long command traces), because the design explicitly targets context overflow (F1) and cross-session continuity (F2) (Introduction; Sections 2.3.1–2.3.2).
  • Prefer extension-based modularity when you need auditable, ablatable tool behaviors (e.g., guarded shell execution, file-edit parsing/validation, prompt caching), because the framework makes these behaviors first-class, swappable components (Section 2.3.3; Table 2’s tool-use ablation framing).
  • Use the Meta-agent loop when adapting to new environments or tool stacks, since the paper’s development story and ablations indicate that tool-use conventions and prompt/config tuning materially affect results (Section 2.3.4; Section 3.3).