ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

ArXiv: 2603.18815

Pitch

Existing RL frameworks tightly couple I/O-intensive agent rollouts with GPU-intensive training, causing system interference and creating maintenance nightmares. ProRL Agent introduces a "rollout-as-a-service" architecture that decouples these workloads via an HTTP API, enabling independent scaling and seamless backend portability. This design yields substantial performance improvements on SWE-Bench while supporting standardized, extensible sandbox environments for diverse agentic tasks.


1. Executive Summary

This paper introduces ProRL Agent, a scalable infrastructure that decouples multi-turn LLM agent rollout from RL training through a "rollout-as-a-service" HTTP API design. The core contribution addresses a fundamental architectural limitation in existing frameworks: tightly coupling I/O-intensive rollout orchestration with GPU-intensive policy training creates system interference, impedes maintainability, and blocks independent evolution of either component. Validated across software engineering, STEM, math, and coding tasks, ProRL Agent achieves strong performance gains on SWE-Bench Verified (e.g., 18.0% for the 8B model vs. 9.4% for SkyRL-Agent-8B-v0) while supporting rootless HPC deployment through Singularity-based sandboxing.

2. Context and Motivation

The Core Problem: Rollout-Training Coupling in Agent RL

Reinforcement learning for multi-turn LLM agents requires generating large numbers of rollout trajectories—sequences of agent actions and environment observations produced by executing the agent's policy in sandboxed environments. Unlike single-turn RL tasks, multi-turn agentic tasks involve interacting with external environments (code repositories, web browsers, operating systems) over dozens of turns and tens of thousands of tokens. Each rollout requires: (1) sandbox environment initialization, (2) multi-turn agent execution with tool calls, and (3) outcome evaluation and reward computation.

The paper identifies a critical architectural flaw in existing frameworks: rollout orchestration is tightly coupled with the RL training stack. Existing systems like SkyRL-Agent, VeRL-Tool, Agent Lightning, rLLM, and GEM all embed the full agent lifecycle—including environment management, tool execution, and trajectory collection—directly inside the training loop. Figure 1a illustrates this coupled design.

Why This Coupling Matters: Two Fundamental Limitations

The paper articulates two specific problems arising from this design choice (Section 1):

1. Conflicting system requirements. Rollout and policy training have fundamentally different resource and operational characteristics:

  • Rollout is I/O-intensive, involving sandbox creation, long-lived tool sessions, and asynchronous coordination across hundreds of concurrent instances. Environment interactions incur highly variable latency: shell commands can take milliseconds or minutes depending on the task.
  • Training is GPU-intensive, centered on forward/backward passes and gradient synchronization across distributed workers.

Coupling these workloads causes interference: I/O-bound rollout workers compete for resources with compute-bound training workers, reducing overall throughput and making resource allocation inefficient.

2. Difficult to migrate and maintain. When rollout logic is embedded in the RL trainer:

  • Migrating to a different training backend requires re-implementing the entire agent execution pipeline.
  • Improving rollout infrastructure (new runtime environments, new tasks) propagates changes into the training codebase.
  • Independent experimentation and optimization on either side becomes difficult.

The paper argues that this tight coupling "slows progress on both fronts" and will become "a serious obstacle to scalability and long-term maintainability" as systems grow more complex.

Prior Approaches and Their Limitations

Section 2 provides a systematic comparison (Table 1) of existing frameworks across three dimensions:

Framework Training-Rollout Decoupled? Rootless Sandbox? Scaffold-Independent?
SkyRL-Agent
VeRL-Tool
Agent Lightning
rLLM
GEM
ProRL Agent

Training-rollout coupling. All prior frameworks implement rollout orchestration as an in-process library within the training loop. Appendix A (Figures 7–11) provides detailed architectural diagrams:

  • SkyRL-Agent (Figure 7): The training driver runs concurrent trajectory-generation coroutines on a single CPU process, controlling the multi-turn agent loop directly.
  • Agent Lightning (Figure 8): Places the training loop, LightningStoreServer, and all rollout workers within a single process tree; "rollout does not have an independent service lifecycle."
  • VeRL-Tool (Figure 9): Extends the veRL trainer with multi-turn agent rollouts, so that "rollout control remains inside the trainer."
  • rLLM (Figure 10): Built on a modified veRL fork in which the "agent loop, environment management, and trajectory orchestration all reside within a single driver process."
  • GEM (Figure 11): Keeps environments as in-memory Python objects that are stepped through direct env.step() calls; "the environment and rollout lifecycle remain fully embedded in the training stack."

Rootless sandboxing gap. Existing agentic sandbox environments (referenced in Section 2) rely on Docker, which "assumes daemon access and root-equivalent privileges, which are often unavailable on shared Slurm-managed HPC clusters." Practitioners face a tradeoff: maintain separate infrastructure for evaluation and deployment, or incur operational complexity of privileged container runtimes on restricted systems.

How This Paper Positions Itself

The paper draws inspiration from the inference-as-a-service philosophy adopted by LLM inference engines such as vLLM and SGLang. The key insight: just as inference engines decouple serving from application logic, agentic RL frameworks should decouple rollout from training.

The paper introduces rollout-as-a-service as the core design principle: the agentic rollout lifecycle becomes an independent HTTP service, while the trainer submits task instances and receives completed trajectories with rewards. This achieves three goals:

  1. Resource isolation: Rollout and training run on different machines optimized for their respective workloads.
  2. Portability: Adding new tasks or training backends requires no changes to the other side.
  3. Extensibility: Rollout infrastructure evolves independently from training algorithms.

The paper explicitly positions ProRL Agent as infrastructure work: its contribution is validated through end-to-end RL training across multiple domains rather than claimed as algorithmic novelty.

3. Technical Approach

3.1 Reader Orientation

ProRL Agent is a system infrastructure for RL training of multi-turn LLM agents that decouples the rollout generation process from the training loop by exposing the full rollout lifecycle—environment setup, agent execution, and reward computation—as an HTTP service accessible to any RL trainer. The solution "shape" is an asynchronous three-stage pipeline server that manages sandboxed environments, coordinates LLM inference backends, and returns token-level trajectories to trainers through a REST API.

3.2 Big-Picture Architecture (Diagram in Words)

The system consists of three major components (Figure 2):

  1. Sandbox Environment: isolated containers for agent execution, each managed by a pluggable AgentHandler that exposes three lifecycle methods: init() (environment setup), run() (multi-turn agent loop), and eval() (reward scoring). Built on Singularity for rootless HPC deployment.

  2. ProRL Agent Server: the core HTTP service orchestrating hundreds of concurrent rollouts. It contains:

     • Three independent worker pools (INIT, RUN, EVAL) with separate job queues
     • An LLM backend pool managed via a min-heap for load balancing
     • REST API endpoints for job submission, cancellation, and backend management

  3. RL Trainer: any training framework (veRL, NeMo RL, etc.) that interacts with the server solely via HTTP, submitting jobs via POST /process and receiving completed trajectories with rewards.

Information flow: Trainer submits task instance → Server queues job → INIT worker provisions sandbox → RUN worker executes agent loop (calling LLM backends) → EVAL worker computes reward → Server returns trajectory + reward to trainer.

3.3 Roadmap for the Deep Dive

  • First, the extensible sandbox environment layer—the foundation that enables isolated, portable agent execution on HPC clusters.
  • Second, the pluggable task abstraction (AgentHandler) that encapsulates diverse agentic tasks behind a uniform interface.
  • Third, the three-stage asynchronous pipeline architecture that decouples rollout phases with independent worker pools.
  • Fourth, the LLM backend management system including dynamic registration, checkpoint swapping, and min-heap load balancing.
  • Fifth, token-in/token-out communication that eliminates re-tokenization drift during training.
  • Sixth, the connection to RL trainers including efficient DAPO implementation.

3.4 Detailed, Sentence-Based Technical Breakdown

This is primarily a systems infrastructure paper whose core idea is that separating rollout orchestration from training through a service interface enables better resource utilization, maintainability, and extensibility for multi-turn agent RL.


The Sandbox Environment Layer

Multi-turn agent RL requires sandboxed environments that provide isolation, reproducibility, and security. Each agent rollout executes inside a container that isolates its filesystem, network, and process space. The paper addresses a specific constraint: existing platforms (Docker-based) require root privileges unavailable on shared HPC clusters.

Singularity Runtime. The paper implements SingularityRuntime, a container system that "requires no persistent daemon and runs entirely as an unprivileged user process." Key mechanisms (Section 3.2.2):

  • Container lifecycle: Each container is launched as a child process in its own session. Shutdown proceeds gracefully via SIGTERM before escalating to SIGKILL if necessary—this ensures no orphaned containers remain on nodes.
  • Port isolation: Each container instance is assigned a unique loopback IP address within the 127.x.x.x range via a thread-safe allocator. This supports hundreds of concurrent containers on the same node without port conflicts.
  • HPC compatibility flags: --fakeroot grants simulated root access for package installation without actual host privileges; --network none optionally disables external network access for isolation.
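The port-isolation idea above can be sketched as a small allocator. This is a minimal illustration, not the paper's implementation: the class name LoopbackAllocator, its methods, and the default address range are assumptions made for this example.

```python
import ipaddress
import threading


class LoopbackAllocator:
    """Hands out unique 127.x.x.x addresses; safe to call from many threads.

    Hypothetical helper: the name and range defaults are assumptions, not
    taken from the paper.
    """

    def __init__(self, first="127.0.0.2", last="127.255.255.254"):
        self._next = int(ipaddress.IPv4Address(first))
        self._last = int(ipaddress.IPv4Address(last))
        self._free = []                    # addresses returned by finished containers
        self._lock = threading.Lock()

    def acquire(self):
        """Return an address that no other live container holds."""
        with self._lock:
            if self._free:
                return str(ipaddress.IPv4Address(self._free.pop()))
            if self._next > self._last:
                raise RuntimeError("loopback range exhausted")
            addr = self._next
            self._next += 1
            return str(ipaddress.IPv4Address(addr))

    def release(self, addr):
        """Make an address available again once its container has exited."""
        with self._lock:
            self._free.append(int(ipaddress.IPv4Address(addr)))
```

Because every container gets its own address, each one can bind the same port number without colliding with its neighbors on the node.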

Image build pipeline. Container images are packaged as Singularity Image Files (.sif), which "encapsulate the full execution environment in a single portable file." The SingularityRuntimeBuilder constructs images from Jinja2 templates with three caching modes:

  • Scratch: Always performs a full rebuild
  • Versioned: Reuses the cached image when the base image and framework version are unchanged
  • Lock: Reuses the image when the dependency lockfile is identical

The template-driven design enables heterogeneous runtime specialization—for example, "QEMU-based virtual machines used in GUI-centric tasks can provide custom definition files... without requiring any modifications to the core build logic."
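One way to picture the three caching modes is as a choice of what goes into a cache key. The sketch below is an assumption-laden illustration: cache_key, needs_rebuild, the hashing scheme, and the file naming are all hypothetical and do not describe SingularityRuntimeBuilder's actual logic.

```python
import hashlib
from pathlib import Path
from typing import Optional


def cache_key(mode: str, base_image: str, framework_version: str,
              lockfile_text: str) -> Optional[str]:
    """Return a cache key for the image, or None when the mode forces a rebuild."""
    if mode == "scratch":
        return None                                   # always rebuild
    if mode == "versioned":
        material = f"{base_image}\n{framework_version}"
    elif mode == "lock":
        material = lockfile_text                      # only the lockfile matters
    else:
        raise ValueError(f"unknown cache mode: {mode}")
    return hashlib.sha256(material.encode()).hexdigest()[:16]


def needs_rebuild(key: Optional[str], cache_dir: Path) -> bool:
    """Rebuild when there is no key or no cached .sif file stored under that key."""
    return key is None or not (cache_dir / f"{key}.sif").exists()
```

Under this reading, Versioned ignores lockfile changes and Lock ignores base-image changes, which is why the two modes can disagree about whether a cached image is still valid.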


The AgentHandler Pluggable Task Abstraction

Different agentic tasks (software engineering, math reasoning, computer use) require different environment setups, agent behaviors, and reward computations. Rather than hardcoding these differences in the server, the paper encapsulates all task-specific logic in an abstract interface called AgentHandler (Section 3.2.1, Listing 1).

Three lifecycle methods. The interface defines three core methods corresponding to the three pipeline stages:

from abc import ABC, abstractmethod

class AgentHandler(ABC):
    # Runtime, Metadata, and Config are the server's own types; they are quoted
    # here as forward references so the listing stands alone.
    @abstractmethod
    async def init(self, job_details) -> tuple["Runtime", "Metadata", "Config"]:
        """Provision environment; return (runtime, metadata, config)."""

    @abstractmethod
    async def run(self, job_details) -> dict:
        """Execute agent loop; return trajectory and artifacts."""

    @abstractmethod
    async def eval(self, job_details) -> dict:
        """Score output; return reward signal."""
  • init: Initialize the sandbox environment for the task, configure the agent with the corresponding toolset. Returns a runtime handle, metadata, and configuration.
  • run: Drive the multi-turn agent loop within the prepared sandbox, collecting the action-observation trajectory and any task artifacts.
  • eval: Score the agent's output against ground truth and return a scalar reward signal for RL training.

Error handling. Each handler exposes per-stage error callbacks (init_exception, run_exception, eval_exception) and a final_result method for response serialization. This ensures the server "always emits a well-formed output even when a rollout fails partway through."

Registry-based dispatch. When the server receives a job, it reads the task instance, looks up the corresponding handler in a registry, and dispatches to its lifecycle methods in order. Adding a new task requires only implementing a handler plugin and registering it—no changes to the training code or server core.
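A minimal sketch of registry-based dispatch follows. The names HANDLERS, register, AddHandler, and dispatch are hypothetical, and the toy "add" task stands in for a real agentic task; the paper's server additionally routes failures through per-stage exception callbacks, which this sketch reduces to recording the error.

```python
import asyncio

HANDLERS = {}   # hypothetical registry: task name -> handler class


def register(task_name):
    """Class decorator that records a handler under its task name."""
    def wrap(cls):
        HANDLERS[task_name] = cls
        return cls
    return wrap


@register("add")
class AddHandler:
    """Toy handler: the 'agent' adds two numbers and is scored on the sum."""

    async def init(self, job):
        return {"tools": ["calculator"]}

    async def run(self, job):
        return {"answer": job["a"] + job["b"]}

    async def eval(self, job):
        return {"reward": 1.0 if job["run"]["answer"] == job["expected"] else 0.0}


async def dispatch(job):
    """Look up the handler for job['task'] and call init, run, eval in order."""
    handler = HANDLERS[job["task"]]()
    for stage in ("init", "run", "eval"):
        try:
            job[stage] = await getattr(handler, stage)(job)
        except Exception as exc:
            job[stage] = {"error": repr(exc)}   # keep the output well-formed
            break                               # later stages are skipped
    return job
```

Adding a task in this sketch means writing one decorated class; dispatch itself never changes.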


Three-Stage Asynchronous Pipeline

The paper frames rollout as an assembly line problem (Section 3.3.1): a naive implementation assigns one worker per job that executes all three phases sequentially. The problem is that each phase has fundamentally different resource demands:

  • Container initialization: I/O-bound, slow due to disk and network latency
  • Agent execution: LLM-inference-bound, bottlenecked by GPU throughput
  • Outcome evaluation: Variable latency—milliseconds for math answer checks to several minutes for full test suites

A single worker would "spend most of its time idle, waiting for whichever phase happens to be slow."

Pipeline architecture. The solution decouples the phases into three independent worker pools, each with its own queue (Section 3.3.1, Listing 2):

STAGES = [INIT, RUN, EVAL]
queues = {s: Queue() for s in STAGES}  # thread-safe FIFO per stage
pools = {s: ThreadPool(N[s]) for s in STAGES}  # independent pools

Workers continuously pull from their respective queues and execute their phase. After completing a phase, the job is handed off to the next stage's queue:

def worker_loop(stage):
    while running:
        job = queues[stage].get()
        if job.id in discarded: continue
        with job.timer.phase(stage):
            try:
                result = handler[stage](job)
            except Exception as e:
                result = handler[stage+'_exception'](job, e)
        job.store(stage, result)
        if stage == RUN:
            cleanup(job.runtime)  # free container before eval
        if stage != EVAL:
            queues[next_stage[stage]].put(job)
        else:
            job.done.set()  # unblock HTTP handler

Concurrent phase execution. At any moment, all three pools are busy on different jobs: "while one job is being evaluated, a second is mid-rollout, and a third is having its container started."

Independent pool sizing. Because pools are independent, "they can also be sized separately to match their respective workloads, with more init workers to absorb the slow I/O startup, or more eval workers when test suites are particularly long."

Container cleanup. Note the line if stage == RUN: cleanup(job.runtime)—this frees container resources immediately after agent execution completes, before evaluation begins. This is crucial for throughput: evaluation can take minutes, and holding containers during that time would limit concurrency.
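The listings above are pseudocode. The following runnable sketch shows the same hand-off pattern with Python's standard library, under simplifying assumptions: jobs are plain dicts, the pool sizes in POOL_SIZE are arbitrary, the "container" is a placeholder string, and make_pipeline and submit are hypothetical names.

```python
import queue
import threading

STAGES = ["init", "run", "eval"]
NEXT = {"init": "run", "run": "eval"}
POOL_SIZE = {"init": 2, "run": 2, "eval": 1}    # assumed sizes, tune per workload
STOP = object()                                  # sentinel that shuts a worker down


def make_pipeline(handlers):
    """Start one thread pool per stage; return the per-stage queues and threads."""
    queues = {s: queue.Queue() for s in STAGES}

    def worker(stage):
        while True:
            job = queues[stage].get()
            if job is STOP:
                return
            try:
                job[stage] = handlers[stage](job)
            except Exception as exc:
                job[stage] = {"error": repr(exc)}
            if stage == "run":
                job["container"] = None          # release the sandbox before eval
            if stage in NEXT:
                queues[NEXT[stage]].put(job)     # hand off to the next stage
            else:
                job["done"].set()                # wake whoever waits on this job

    threads = [threading.Thread(target=worker, args=(s,), daemon=True)
               for s in STAGES for _ in range(POOL_SIZE[s])]
    for t in threads:
        t.start()
    return queues, threads


def submit(queues, payload):
    """Enqueue a job at the first stage and return it so the caller can wait."""
    job = {"payload": payload, "container": "sandbox", "done": threading.Event()}
    queues["init"].put(job)
    return job
```

A caller would block on job["done"].wait(), which plays the role of the HTTP handler waiting for its trajectory. Putting one STOP sentinel per worker on each queue shuts the pools down.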


LLM Backend Management

Every step of the agent loop requires an LLM completion. "When hundreds of rollouts run in parallel, these calls arrive at the inference layer simultaneously and at high frequency. A single LLM server quickly becomes a bottleneck."

Dynamic registration and checkpoint swapping. The server manages LLM backends directly through a management API (Section 3.3.2, Listing 3):

POST /add_llm_server {"address": "http://host:port/v1"}  # register backend
POST /clear_llm_server  # flush all backends
POST /process {"instance": {...}, "sampling_params": {...}}  # submit job
POST /cancel {"job_id": "..."}  # abort running job
POST /start | POST /stop  # server lifecycle
GET /status  # queue depths

When the RL trainer updates the policy checkpoint after a gradient step, the old LLM weights are invalid. Rather than restarting the rollout server, the trainer calls POST /clear_llm_server to flush registered backends, then re-registers the reloaded LLM server endpoints. The paper emphasizes: "From that point on, all subsequent rollouts automatically use the updated model, with no interruption to jobs already in the pipeline."
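From the trainer's side, the checkpoint swap is two kinds of request issued in order. The sketch below only builds the requests and does not send them; the endpoint paths and the "address" field come from the listing above, while the helper names, the server URL, and the empty JSON body for /clear_llm_server are assumptions.

```python
import json
import urllib.request


def build_request(server, path, payload=None):
    """Build (but do not send) a JSON POST request for one of the listed endpoints."""
    body = json.dumps(payload or {}).encode()
    return urllib.request.Request(
        url=f"{server}{path}", data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )


def checkpoint_swap_requests(server, backends):
    """Flush all registered backends first, then register each reloaded one."""
    reqs = [build_request(server, "/clear_llm_server")]
    reqs += [build_request(server, "/add_llm_server", {"address": a})
             for a in backends]
    return reqs
```

Each request could then be sent with urllib.request.urlopen(req) once the inference servers have finished loading the new weights.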

Min-heap load balancing. Each LLM backend is stored alongside an assignment counter in a min-heap. The selection formula is:

\[s^* = \arg\min_s w_s\]
\[w_{s^*} \leftarrow w_{s^*} + 1\]

where \(w_s\) is the assignment counter of server \(s\): the number of tasks routed to it since registration.

Task-level routing. The counter increments once per task (not per call), ensuring "all subsequent calls within the same task are consistently routed to the same backend to maximize prefix cache reuse." This is a crucial optimization for multi-turn agents: the conversation history serves as a prefix that can be cached; routing all calls for a task to the same backend preserves this cache.

Round-robin balance. "Because selection is proportional to assignment count, servers that receive heavier traffic fall back in priority, achieving a round-robin-like balance across the pool without requiring any global synchronization."
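The selection rule and task-level routing can be sketched with Python's heapq module. BackendPool and backend_for are hypothetical names, and ties between equally loaded backends are broken by address order purely as an artifact of tuple comparison in this sketch.

```python
import heapq


class BackendPool:
    """Min-heap of (assigned_task_count, address) entries."""

    def __init__(self, addresses):
        self._heap = [(0, addr) for addr in addresses]
        heapq.heapify(self._heap)
        self._sticky = {}                    # task_id -> backend address

    def backend_for(self, task_id):
        """Pick the least-loaded backend on a task's first call; reuse it afterwards."""
        if task_id in self._sticky:
            return self._sticky[task_id]     # same backend keeps the prefix cache warm
        count, addr = heapq.heappop(self._heap)          # s* = argmin_s w_s
        heapq.heappush(self._heap, (count + 1, addr))    # w_{s*} <- w_{s*} + 1
        self._sticky[task_id] = addr
        return addr
```

With three backends, the first three tasks land on three different servers and the fourth wraps around to the first, which is the round-robin-like behavior described above.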


Token-in/Token-out Communication

A subtle but critical issue: if trajectories are transmitted as plain text, re-tokenization can be lossy. The resulting token sequence may differ from the one originally generated during rollout, "leading to unintended off-policy discrepancies."

Problem explained. LLMs generate text by sampling from a vocabulary of tokens. The generated token IDs are converted to text for storage/transmission. If the training pipeline re-tokenizes this text, the resulting token IDs may differ from the originals—different tokenizers may split text differently, or edge cases in tokenization may produce different results.

Solution. ProRL Agent uses token IDs as the canonical representation throughout the training pipeline:

  • Rollout workers send prompt_ids directly to the LLM backend
  • They receive response_ids with per-token log-probabilities
  • Each message carries input_ids, output_ids, and logprobs fields populated at generation time
  • During multi-turn rollouts, prior assistant turns retain their original token IDs and are concatenated directly into the input buffer
  • Only new messages (environment observations) are tokenized and appended

This ensures "every token ID returned to the trainer is identical to the one produced during rollout."
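A toy example makes the drift concrete. The four-entry vocabulary and the greedy longest-match tokenizer below are deliberately artificial assumptions; real tokenizers are far more complex, but the failure mode is the same: a sampled token sequence need not be the one the tokenizer would produce for the decoded text.

```python
VOCAB = ["a", "b", "ab", "c"]        # toy vocabulary; list index = token ID


def decode(ids):
    """Turn token IDs back into text."""
    return "".join(VOCAB[i] for i in ids)


def encode(text):
    """Greedy longest-match tokenizer over the toy vocabulary."""
    ids, pos = [], 0
    by_length = sorted(range(len(VOCAB)), key=lambda i: -len(VOCAB[i]))
    while pos < len(text):
        for i in by_length:
            if text.startswith(VOCAB[i], pos):
                ids.append(i)
                pos += len(VOCAB[i])
                break
        else:
            raise ValueError(f"cannot tokenize at position {pos}")
    return ids


def extend_context(context_ids, response_ids, observation_text):
    """Append the sampled IDs unchanged; tokenize only the new observation."""
    return context_ids + response_ids + encode(observation_text)


sampled = [0, 1]                              # the policy sampled "a" then "b"
round_trip = encode(decode(sampled))          # text path merges them into "ab"
token_path = extend_context([3], sampled, "c")
```

Here round_trip is [2] while the policy actually produced [0, 1], so log-probabilities computed on the re-tokenized sequence would describe tokens the policy never sampled. The token path keeps [0, 1] intact inside the context.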


Efficient Tool Backends

Section 3.2.3 addresses a practical bottleneck: tool execution latency. "Each tool call is a synchronous blocking operation from the agent's perspective... per-tool latency compounds directly into total rollout time."

Efficient Bash. Shell execution is the most frequent action in code-centric tasks. Conventional implementations route bash commands through a tmux session, incurring terminal multiplexing overhead. ProRL Agent replaces this with a ptyprocess-based direct pseudo-terminal, "granting the agent a raw shell without the tmux intermediary."

IPython. Persistent kernels allow agents to maintain state across steps—imports and variable definitions persist. The conventional approach uses Jupyter kernel gateway, adding network round-trips. ProRL Agent connects directly via the in-process API, "removing this overhead entirely."

Unix Domain Sockets (UDS). Agent actions are sent to an execution server inside the container. Conventional implementations use TCP loopback, which "forces co-located processes to share the same IP to be distinguished only by port numbers, complicating non-conflicting port assignment." UDS passes messages through the OS kernel directly without networking overhead. "Since this channel is exercised on every agent action, shaving latency here accumulates meaningfully across a full rollout."
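For readers unfamiliar with Unix domain sockets, the sketch below shows one action and one observation exchanged over a socket file instead of a TCP port. It assumes a Unix-like OS; serve_once, send_action, and the message format are hypothetical and unrelated to ProRL Agent's actual wire protocol.

```python
import os
import socket
import tempfile
import threading


def serve_once(path, ready):
    """Accept one connection on a Unix domain socket and reply to one action."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(path)                 # the address is a filesystem path, not a port
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            action = conn.recv(4096)
            conn.sendall(b"observation for: " + action)


def send_action(path, action):
    """Client side: connect to the socket file, send an action, read the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
        cli.connect(path)
        cli.sendall(action)
        return cli.recv(4096)
```

Since each container can own a distinct socket path, co-located execution servers never compete for port numbers.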


Job Lifecycle and Cancellation

Phase-aware timeouts. Each job has a PausableTimer that accumulates elapsed time only during active pipeline stages, excluding time spent waiting in inter-stage queues. This ensures "the timeout budget reflects actual execution time rather than transient server-side delays."
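A phase-aware timer of this kind can be sketched as a context manager that only accumulates time inside its with-block. The class below shares the PausableTimer name for readability, but its constructor, the injectable clock argument, and the expired property are assumptions for illustration.

```python
import time
from contextlib import contextmanager


class PausableTimer:
    """Accumulates elapsed time only while a pipeline phase is active."""

    def __init__(self, budget_s, clock=time.monotonic):
        self.budget_s = budget_s
        self.elapsed_s = 0.0
        self._clock = clock            # injectable so tests can use a fake clock

    @contextmanager
    def phase(self, name):
        """Time one stage; `name` is only a label for the caller's benefit."""
        start = self._clock()
        try:
            yield
        finally:
            self.elapsed_s += self._clock() - start   # queue waits are never counted

    @property
    def expired(self):
        return self.elapsed_s >= self.budget_s
```

If a job spends 3 seconds in INIT, waits 7 seconds in a queue, and then spends 2 seconds in RUN, the timer reports 5 seconds against its budget rather than 12.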

Cancellation mechanism. The training framework can abort any in-flight job via POST /cancel. The server then:

  1. Marks the job as discarded so workers skip it
  2. Cancels the currently executing async task
  3. Closes the associated container runtime to release resources
  4. Signals the job's completion event so the waiting HTTP handler returns

This enables the RL trainer to "discard incomplete rollouts once a sufficient number of valid samples has been collected."
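The cancellation steps map naturally onto asyncio primitives. In the sketch below, JobTable and its fields are hypothetical, and closing the container runtime is reduced to recording the job ID in a list.

```python
import asyncio


class JobTable:
    """Hypothetical bookkeeping for in-flight jobs."""

    def __init__(self):
        self.discarded = set()
        self.tasks = {}      # job_id -> asyncio.Task currently executing
        self.done = {}       # job_id -> asyncio.Event awaited by the HTTP handler
        self.closed = []     # job_ids whose (placeholder) runtime was closed

    async def cancel(self, job_id):
        self.discarded.add(job_id)                 # 1. workers skip the job
        task = self.tasks.get(job_id)
        if task is not None:
            task.cancel()                          # 2. cancel the running task
            await asyncio.gather(task, return_exceptions=True)
        self.closed.append(job_id)                 # 3. release the container
        self.done[job_id].set()                    # 4. unblock the waiting handler


async def demo():
    """Start a long-running stand-in rollout, then cancel it."""
    table = JobTable()
    table.done["j1"] = asyncio.Event()
    table.tasks["j1"] = asyncio.create_task(asyncio.sleep(3600))
    await asyncio.sleep(0)                         # let the task start running
    await table.cancel("j1")
    return table
```

Awaiting the cancelled task before releasing resources ensures the task has actually stopped touching the container when the runtime is closed.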

Graceful shutdown. On POST /stop, the server cancels all in-flight jobs, terminates Singularity processes via process-group scanning, drains worker pools, and exits cleanly—"leaving no orphaned containers on the node."


Efficient DAPO Implementation

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization) uses dynamic sampling to filter out "Zero-Variance Prompts": those whose rollouts yield uniform rewards (all correct or all incorrect). With group-normalized advantages, such prompts have zero advantage for every rollout and therefore provide no gradient signal. However, applying DAPO to agent RL is challenging because "agent rollouts are typically long-running, asynchronous, and computationally expensive."

Naive implementation problem. A batch-by-batch approach requests \(n\) prompts, filters non-informative ones, and triggers new batches until \(n\) Informative Prompts are collected. This is inefficient: workers idle waiting for batches to complete, and incomplete rollouts at batch end are discarded.

Asynchronous replenishment. ProRL Agent implements three optimizations (Section 3.4, Figure 3):

  1. Continuous Throughput: Replenish the job queue as soon as it empties
  2. Early Termination: Terminate remaining active jobs once target Informative Prompts is reached
  3. Cross-Iteration Persistence: Unfinished jobs carry over to subsequent iterations

Figure 3 illustrates the improvement: the efficient implementation significantly reduces the "wasted worker time" (waiting periods) between rollout generations compared to the baseline batch-by-batch approach.
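The filter and the replenishment policy can be simulated in a few lines. This is an event-driven toy under stated assumptions: each prompt is a (prompt_id, duration, rewards) tuple with made-up values, slots is the number of concurrent prompt groups, and the function names are hypothetical. Jobs still in flight when the target is reached are returned so that a caller could either cancel them or carry them into the next iteration.

```python
import heapq


def is_informative(rewards):
    """A prompt is informative when its rollouts do not all share one reward."""
    return len(set(rewards)) > 1


def collect_informative(pending, target, slots):
    """Simulate asynchronous replenishment.

    pending: list of (prompt_id, duration, rewards) waiting to start.
    Returns (kept_ids, in_flight_ids) at the moment the target is reached.
    """
    pending = list(pending)
    in_flight, now, kept = [], 0.0, []

    def refill():
        # Continuous throughput: start new work the moment a slot frees up.
        while pending and len(in_flight) < slots:
            pid, duration, rewards = pending.pop(0)
            heapq.heappush(in_flight, (now + duration, pid, rewards))

    refill()
    while in_flight and len(kept) < target:
        now, pid, rewards = heapq.heappop(in_flight)   # next job to finish
        if is_informative(rewards):
            kept.append(pid)
        if len(kept) < target:                         # stop launching once done
            refill()
    return kept, [pid for _, pid, _ in sorted(in_flight)]
```

In a batch-by-batch scheme the freed slot would sit idle until the whole batch finished; here it is reused immediately, which is the gap that Figure 3 visualizes.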


RL Trainer Integration

Hierarchical load balancing. The client-side implementation uses two-phase load balancing:

  1. Phase 1: LLM servers are preferentially assigned to ProRL Agent servers on the same physical node (identified through IP address matching) to reduce network latency
  2. Phase 2: Remaining servers are distributed round-robin across all available LLM servers
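The two phases can be sketched as follows, under one reading of the description: LLM servers whose host matches an agent server are pinned to it, and the unmatched LLM servers are then spread round-robin over the agent servers. The function name, the URL format, and this interpretation of Phase 2 are assumptions.

```python
from itertools import cycle
from urllib.parse import urlsplit


def assign_backends(agent_servers, llm_servers):
    """Map each agent-server URL to the list of LLM-server URLs it should register."""
    plan = {a: [] for a in agent_servers}
    by_host = {urlsplit(a).hostname: a for a in agent_servers}
    remaining = []
    for llm in llm_servers:                        # phase 1: same-node matches
        agent = by_host.get(urlsplit(llm).hostname)
        if agent is not None:
            plan[agent].append(llm)
        else:
            remaining.append(llm)
    for agent, llm in zip(cycle(agent_servers), remaining):   # phase 2: round-robin
        plan[agent].append(llm)
    return plan
```

Matching on hostname keeps inference traffic on the local node where possible, so only the leftover backends pay a cross-node network hop.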

Supported trainers. ProRL Agent supports both veRL and NeMo RL, which the paper presents as evidence that "rollout-level decoupling allows the agent server to interface with a wide range of RL trainers."


Summary of Design Choices and Justifications

  • HTTP service interface over in-process library: enables independent deployment, scaling, and evolution of rollout and training components.
  • Three-stage pipeline over single-worker sequential: allows phase-level parallelism and independent resource sizing.
  • Singularity over Docker: enables rootless deployment on HPC clusters where the Docker daemon and root-equivalent privileges are often unavailable.
  • Min-heap load balancing over simple round-robin: achieves load balance while preserving task-level routing for prefix cache reuse.
  • Token-in/token-out over text transmission: eliminates re-tokenization drift that causes off-policy discrepancies.
  • Async DAPO over batch-by-batch: maximizes throughput and minimizes wasted computation during RL training.

4. Key Insights and Innovations

Innovation 1: Rollout-as-a-Service Architecture

The paper's most fundamental contribution is the architectural decoupling of rollout from training through a service interface. This is more than a refactoring: it changes how agent RL systems are structured. Prior frameworks all embedded rollout within the training process (as documented in Appendix A), making infrastructure changes expensive and blocking independent evolution.

The key insight is that rollout and training have fundamentally different resource profiles: rollout is I/O-intensive with highly variable latency (environment initialization, tool execution, test suite evaluation), while training is GPU-intensive with predictable compute patterns. By separating these into independent services, ProRL Agent allows:

  • Rollout nodes and training nodes to be provisioned and scaled separately
  • Different machine types optimized for each workload (CPU-heavy vs. GPU-heavy)
  • Independent failure domains, so that training crashes don't orphan containers

This architectural pattern mirrors the success of inference-as-a-service (vLLM, SGLang) but applied to the rollout problem. The paper explicitly positions this as the "core design principle" (Section 1).

Innovation 2: Rootless HPC-Compatible Sandboxing

The existing agentic sandbox environments surveyed in the paper rely on Docker, which assumes daemon access and root-equivalent privileges that are often unavailable on shared HPC clusters. This is a practical deployment gap: many research organizations run Slurm-managed clusters on which privileged container runtimes are not permitted.

ProRL Agent's use of Singularity addresses this gap directly. The design choices (running entirely as unprivileged user processes, using the single-file .sif image format, and supporting --fakeroot for simulated root access) enable large-scale agent training in environments where Docker is unavailable. This is particularly valuable for organizations that cannot maintain separate privileged infrastructure for agent evaluation.

The paper provides a systematic comparison (Table 1) showing that none of the compared frameworks supports rootless sandboxing, which makes ProRL Agent the only one of them suited to HPC-native agent RL training.

Innovation 3: Three-Stage Asynchronous Pipeline with Independent Worker Pools

The three-stage pipeline architecture (INIT → RUN → EVAL) with independent worker pools is a systems-level innovation that addresses the phase heterogeneity problem. The paper's "assembly line" analogy is apt: different stages have different bottlenecks, and sequential execution would limit throughput to the slowest phase.

The key insight is that phase-level parallelism is more important than job-level parallelism. With 100 concurrent rollouts, having 100 workers each handle complete jobs sequentially would be inefficient because workers would spend time waiting on I/O (during init), inference (during run), or test execution (during eval). Instead, having separate pools for each phase allows all phases to operate concurrently:

  • INIT workers continuously provision new containers
  • RUN workers drive agent loops on already-provisioned containers
  • EVAL workers score completed trajectories

The container cleanup after RUN stage (before EVAL) is a crucial detail: evaluation can take minutes for full test suites, and holding containers during evaluation would severely limit concurrency.

Innovation 4: Token-in/Token-out Trajectory Communication

The re-tokenization drift problem is subtle but significant for RL training fidelity. If trajectories are transmitted as text and re-tokenized on the training side, the resulting token sequences may differ from the original generation, creating off-policy discrepancies that affect gradient computation.

ProRL Agent's solution, using token IDs as the canonical representation throughout, is conceptually simple but requires careful implementation. The paper cites The Agent Lightning Team (2025) for identifying this issue. The implementation ensures that:

  • Original token IDs from generation are preserved
  • Multi-turn rollouts concatenate prior turns by token ID, not text
  • Only new environment observations require tokenization

This is particularly important for agent RL where trajectories span tens of thousands of tokens and small tokenization differences compound into significant distribution shifts.

Innovation 5: Efficient DAPO with Asynchronous Replenishment

DAPO filters out uninformative prompts (zero-variance rewards), but naive batch-by-batch implementation causes worker idle time and wasted computation. The paper's asynchronous replenishment mechanism—continuous throughput, early termination, and cross-iteration persistence—is a practical systems optimization that significantly improves hardware utilization.

Figure 3 provides clear visual evidence: the efficient implementation shows minimal "wasted worker time" compared to the baseline batch approach. This is important because agent rollouts are expensive (minutes per trajectory), and wasting completed rollouts due to batch boundaries is costly.

5. Experimental Analysis

Evaluation Methodology

Tasks and datasets. The paper evaluates across four domains:

  • Software Engineering: SWE-Gym (293-instance subset used in SkyRL-v0), evaluated on SWE-Bench Verified. Models: Qwen3-4B-Instruct-2507, Qwen3-8B, Qwen3-14B.
  • STEM: SCP-116K dataset with web search (Tavily), Bash, and IPython tools. Agent retrieves external knowledge and executes code for numerical/symbolic computation.
  • Math: DeepScaleR data, evaluated on AMC. Agent uses IPython with NumPy, SciPy, SymPy plus a think tool for planning.
  • Code: Eurus-2-RL-Data, evaluated on Codeforces testing split. Agent uses str_replace_editor for file editing, Bash for test execution, and IPython for prototyping.

RL algorithm. DAPO (Yu et al., 2025) is the default algorithm, which filters instances with 100% or 0% resolved ratio.

Hyperparameters. All RL training uses:

  • Batch size: 32
  • Mini-batch size: 8
  • Rollouts per instance: 8
  • KL coefficient: \(1 \times 10^{-4}\)
  • Learning rate: \(1 \times 10^{-6}\)
  • Hardware: 32 NVIDIA H100 GPUs

Baselines. For software engineering, the paper compares against SkyRL-Agent (Cao et al., 2025b) at 8B and 14B scales. For other domains, baseline is the pretrained model before RL.


Main Quantitative Results

Software Engineering (Table 2)

Size Model Reproduced Reported
4B Qwen3-4B-Instruct-2507 14.8
4B ProRL Agent-4B (RL) 21.2
8B Qwen3-8B 9.6
8B SkyRL-Agent-8B-v0 9.4
8B ProRL Agent-8B (RL) 18.0
14B Qwen3-14B 15.4
14B SkyRL-Agent-14B-v0 21.6
14B ProRL Agent-14B (RL) 23.6

Key observations:

  • ProRL Agent improves over base models by +6.4 percentage points (4B), +8.4 (8B), and +8.2 (14B).
  • Against SkyRL-Agent, ProRL Agent-8B achieves 18.0% vs. SkyRL-Agent-8B-v0's 9.4%, a nearly 2× improvement.
  • ProRL Agent-14B (23.6%) outperforms SkyRL-Agent-14B-v0 (21.6%).
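The gains quoted above follow directly from the Table 2 entries. The short check below recomputes them; the dictionary layout and variable names are just a convenience for this example.

```python
# SWE-Bench Verified resolve rates (%) copied from Table 2.
scores = {
    "4B": {"base": 14.8, "prorl": 21.2},
    "8B": {"base": 9.6, "prorl": 18.0, "skyrl": 9.4},
    "14B": {"base": 15.4, "prorl": 23.6, "skyrl": 21.6},
}

# Absolute gain of the RL-trained model over its base model, in percentage points.
gains = {size: round(row["prorl"] - row["base"], 1) for size, row in scores.items()}

# Relative improvement of ProRL Agent-8B over SkyRL-Agent-8B-v0.
ratio_8b = scores["8B"]["prorl"] / scores["8B"]["skyrl"]

# Margin over SkyRL-Agent at 14B, in percentage points.
margin_14b = round(scores["14B"]["prorl"] - scores["14B"]["skyrl"], 1)
```

The 8B ratio comes out at about 1.91, which is what "nearly 2×" refers to, while the 14B margin over SkyRL-Agent is a more modest 2.0 points.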

The paper attributes gains to "a more effective and stable foundation for RL training on software engineering agents."

STEM Agent (Figure 4a)

Mean reward increases from approximately 0.2 to 0.65 after 60 training steps. The smoothed curve shows "a clear upward trend without signs of saturation."

Math Agent (Figure 4b)

Pass@1 on AMC improves from 0.4 to approximately 0.9. The paper notes the low initial performance reflects the base model not being "proficient at solving mathematical problems through simple tool use"—RL training teaches effective tool leverage.

Code Agent (Figure 4c)

Pass@1 on Codeforces improves from 0.23 to approximately 0.42. Similar to math, the base model initially struggles with the str_replace_editor tool and test-based verification; RL improves these capabilities.


Scalability Analysis

Figure 5 shows rollout throughput (instances/sec) on software engineering tasks as compute nodes increase:

"throughput increases nearly linearly with the number of nodes, indicating that ProRL Agent can effectively leverage additional compute resources with minimal scaling overhead."

This is crucial for RL training where "efficient rollout generation is often the main system bottleneck and directly affects overall training efficiency."


Ablation Studies

Table 3 presents component ablations measuring Action Time, GPU Utilization, and Throughput during DAPO training on Qwen3-14B-Instruct-2507 using 8 H100 GPUs:

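| Load Balancing | Efficient Bash | Stale Job Cleanup | Action Time (s) | GPU Util (%) | Throughput (instances/sec) |
|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 0.42 | 78 | 0.37 |
| ✓ | ✓ | ✗ | 0.42 | 42 | 0.25 |
| ✓ | ✗ | ✓ | 0.78 | 68 | 0.29 |
| ✗ | ✓ | ✓ | 0.42 | 65 | 0.30 |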

Interpretation:

  • Full system: 0.37 instances/sec with 78% GPU utilization
  • Without Stale Job Cleanup: GPU utilization drops from 78% to 42%, throughput drops to 0.25. The paper notes cleanup improves throughput "by increasing GPU utilization."
  • Without Efficient Bash: Action Time increases from 0.42s to 0.78s, throughput drops to 0.29. Efficient Bash improves throughput "by reducing action execution time."
  • Without Load Balancing: GPU utilization drops from 78% to 65%, throughput drops to 0.30.

Each component contributes meaningfully: Load Balancing and Stale Job Cleanup improve GPU utilization; Efficient Bash reduces action latency.
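To make the relative contributions concrete, the following minimal sketch recomputes the throughput gain attributable to each component, using only the Table 3 figures quoted above. The dictionary layout and the `relative_gain` helper are illustrative names, not anything defined in the paper. Each ablation removes a single component, so the gains should not be read as additive.

```python
# Throughput values (instances/sec) copied from the Table 3 summary above.
FULL_SYSTEM_THROUGHPUT = 0.37
THROUGHPUT_WITHOUT = {
    "Load Balancing": 0.30,
    "Efficient Bash": 0.29,
    "Stale Job Cleanup": 0.25,
}


def relative_gain(full, ablated):
    """Percent throughput gain of the full system over an ablated variant."""
    return (full / ablated - 1.0) * 100.0


# Gain from enabling each component, rounded to one decimal place.
gains = {
    name: round(relative_gain(FULL_SYSTEM_THROUGHPUT, ablated), 1)
    for name, ablated in THROUGHPUT_WITHOUT.items()
}

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: +{gain}% throughput when enabled")
```

On these figures, Stale Job Cleanup accounts for the largest single-component gain, followed by Efficient Bash and then Load Balancing.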


Assessment: Do the Experiments Support the Claims?

Claim 1: ProRL Agent enables effective RL training across multiple domains. Strongly supported. Software engineering shows consistent improvements across model scales (Table 2). STEM, Math, and Code agents all show substantial gains during training (Figure 4). The smoothed training curves trend upward throughout, which suggests stable learning, although no variance across runs is reported.

Claim 2: The infrastructure scales efficiently. Supported by Figure 5's near-linear scaling. However, the paper does not provide absolute numbers for throughput at different node counts—only the shape of the curve.

Claim 3: Each proposed component contributes to efficiency. Strongly supported by Table 3 ablations. The quantitative breakdown shows specific mechanisms: Stale Job Cleanup improves GPU utilization from 42% to 78%; Efficient Bash cuts action time from 0.78s to 0.42s, a reduction of about 46%.

Limitations:

  • Single benchmark for each domain: SWE-Bench Verified, AMC, Codeforces. No comparison across multiple benchmarks per domain.
  • No controlled comparison to other frameworks: The SkyRL-Agent baselines may differ in base model, training data, and training stack. A head-to-head comparison controlling for these variables would be stronger.
  • Hyperparameters fixed: No sensitivity analysis for batch size, learning rate, KL coefficient.
  • No wall-clock training time comparison: While throughput is measured, total training time comparison with baselines is not reported.

Missing details:

  • Specific numbers for Figure 5's throughput at each node count
  • Confidence intervals or variance measures for any results
  • Comparison of resource efficiency (e.g., total GPU-hours to achieve a given performance level)

Overall, the experiments convincingly demonstrate that ProRL Agent works across domains and that individual components contribute as claimed. The software engineering results are particularly strong, with the 8B model nearly doubling SkyRL-Agent's performance. However, the comparison would be stronger with more controlled baselines and resource efficiency metrics.

6. Limitations and Trade-offs

Assumption: HTTP Overhead Is Acceptable for RL Training Workloads

The rollout-as-a-service design introduces HTTP as the communication layer between the trainer and rollout server. The paper argues this overhead is acceptable because rollout operations are relatively coarse-grained—each request spawns a complete multi-turn trajectory, not individual LLM calls. However, the paper does not quantify this overhead. For high-frequency RL algorithms that require rapid iteration between rollout and policy updates, even modest HTTP latency could accumulate. The three-stage pipeline's asynchronous design mitigates this by allowing many rollouts to proceed in parallel, but the paper lacks a direct comparison of wall-clock training time between ProRL Agent and tightly coupled baselines. Practitioners considering adoption should evaluate whether their training regime involves sufficient rollout parallelism to amortize the service boundary overhead.
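The coarse-granularity argument can be illustrated with a back-of-the-envelope calculation. The sketch below is not from the paper: the 5 ms per-request overhead, the 0.5 s single-call duration, and the 300 s trajectory duration are all assumed values chosen only to show how the overhead share shrinks as the unit of work grows.

```python
def overhead_fraction(work_seconds, http_overhead_seconds):
    """Share of end-to-end request time spent crossing the service boundary."""
    return http_overhead_seconds / (work_seconds + http_overhead_seconds)


# Assumed values for illustration only; the paper does not report them.
ASSUMED_HTTP_OVERHEAD_S = 0.005
ASSUMED_WORK_DURATIONS_S = {
    "one request per LLM call": 0.5,
    "one request per full trajectory": 300.0,
}

for label, duration in ASSUMED_WORK_DURATIONS_S.items():
    share = overhead_fraction(duration, ASSUMED_HTTP_OVERHEAD_S)
    print(f"{label}: {share:.4%} of request time is HTTP overhead")
```

Under these assumptions the boundary cost is negligible per trajectory, but the conclusion depends entirely on the real overhead and rollout durations, which would need to be measured.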

Assumption: LLM Backends Are Homogeneous in Capability

The min-heap load balancing strategy assumes all registered LLM backends serve the same model with identical capabilities. The server tracks only an assignment counter \(w_s\), routing based purely on load:

\[s^* = \arg\min_s w_s\]

This works when backends are identical replicas, but breaks if backends differ in model version, checkpoint state, or hardware capability. The paper mentions checkpoint swapping—calling /clear_llm_server to flush backends after policy updates—but does not address scenarios where heterogeneous backends might coexist. For example, mixing GPU generations (H100 vs. A100) would create latency differences not captured by the simple counter. A more sophisticated load balancer might track backend response times or model versions, but this is not implemented.
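A minimal sketch of counter-based min-heap routing, written to match the description above rather than the actual ProRL Agent code. The class and method names are hypothetical, and the sketch only increments \(w_s\) on assignment; whether the real server decrements it on completion is not stated in this summary.

```python
import heapq


class MinLoadBalancer:
    """Route each request to the backend with the smallest counter w_s."""

    def __init__(self, backends):
        # Heap entries are (w_s, registration_order, backend).
        # registration_order breaks ties so equal counters resolve deterministically.
        self._heap = [(0, order, backend) for order, backend in enumerate(backends)]
        heapq.heapify(self._heap)

    def assign(self):
        """Pop s* = argmin_s w_s, increment its counter, and push it back."""
        count, order, backend = heapq.heappop(self._heap)
        heapq.heappush(self._heap, (count + 1, order, backend))
        return backend

    def clear(self):
        """Forget all backends, e.g. before registering servers for a new checkpoint."""
        self._heap = []


balancer = MinLoadBalancer(["backend-a", "backend-b", "backend-c"])
picks = [balancer.assign() for _ in range(6)]
```

Because the only state is the counter, a slow backend and a fast backend receive the same number of assignments, which is the limitation noted above for mixed hardware.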

Rootless Sandboxing Trades Isolation Strength for Deployability

The choice of Singularity over Docker is pragmatic for HPC environments but involves trade-offs the paper does not fully explore. Singularity's --fakeroot flag provides "simulated root access for package installation without requiring actual host privileges," but this is not equivalent to true container isolation. The paper does not discuss:

  • Security boundaries: Whether Singularity's isolation is sufficient for untrusted agent code execution
  • Performance characteristics: Any performance differences between Singularity and Docker for typical agent workloads
  • Feature parity: Whether all Docker features needed by agents (e.g., GPU passthrough, specific networking modes) are available

For practitioners deciding between ProRL Agent and Docker-based alternatives, these trade-offs matter. The paper positions rootless deployment as a key feature, but does not provide a security or performance comparison to help users evaluate whether the trade-off is acceptable for their use case.

Multi-Turn Rollout Complexity Not Fully Addressed

Agent rollouts in the evaluated tasks span "dozens of turns and tens of thousands of tokens" (Section 1), but the paper does not analyze failure modes specific to long-horizon execution. Several questions remain unaddressed:

  • Timeout handling: The PausableTimer accumulates time only during active phases, excluding queue wait time. But what happens when a rollout exceeds its timeout mid-execution? The paper mentions cancellation but does not detail how partial trajectories are handled or whether they can be used for training.
  • Memory accumulation: Long multi-turn conversations accumulate context that must be passed to the LLM on each call. The paper does not discuss memory management or context window handling for trajectories approaching model limits.
  • Error recovery: The AgentHandler interface provides per-stage error callbacks, but the paper does not characterize common failure modes or their frequency in practice. How often do rollouts fail, and for what reasons?

These gaps matter for practitioners planning large-scale training runs. Understanding failure rates and recovery mechanisms is essential for estimating compute requirements and designing robust training pipelines.
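The timeout behavior described in the first bullet can be sketched as follows. Only the name `PausableTimer` and the idea of counting active time come from the paper; the method names, the injectable clock, and the walk-through values are assumptions made for illustration.

```python
import time


class PausableTimer:
    """Count only active execution time toward a rollout's timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock        # injectable so the behavior can be tested
        self._accumulated = 0.0    # active seconds from completed phases
        self._started_at = None    # None while paused (e.g. waiting in a queue)

    def resume(self):
        """Start counting; a no-op if the timer is already running."""
        if self._started_at is None:
            self._started_at = self._clock()

    def pause(self):
        """Stop counting and bank the time spent in the phase that just ended."""
        if self._started_at is not None:
            self._accumulated += self._clock() - self._started_at
            self._started_at = None

    def elapsed(self):
        """Active seconds so far, including any phase still in progress."""
        running = 0.0 if self._started_at is None else self._clock() - self._started_at
        return self._accumulated + running

    def expired(self):
        return self.elapsed() >= self.timeout_s


# Deterministic walk-through using a fake clock.
now = [0.0]
timer = PausableTimer(timeout_s=10.0, clock=lambda: now[0])
timer.resume()
now[0] = 4.0
timer.pause()                      # 4 s of active work banked
now[0] = 100.0                     # 96 s waiting in a queue, not counted
active_after_wait = timer.elapsed()
timer.resume()
now[0] = 107.0                     # 7 more active seconds
timed_out = timer.expired()
```

The sketch says nothing about what happens after expiry; how a cancelled rollout's partial trajectory is handled remains the open question raised above.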

Limited Comparison to Alternative Decoupling Strategies

The paper presents rollout-as-a-service as the solution to training-rollout coupling, but does not compare to other possible decoupling approaches. For example:

  • Process-level separation: Running rollout in a separate process with IPC (rather than HTTP)
  • Async library patterns: Using Python asyncio with shared memory for zero-copy trajectory transfer
  • Ray-based actor model: Distributed execution with object store for trajectory sharing (used by some baselines the paper compares against)

Each approach has different trade-offs in latency, throughput, and complexity. The HTTP service design maximizes decoupling and enables "any RL trainer" to connect, but may not be optimal for single-node training where lower-latency communication is possible. The paper does not provide guidance on when the service overhead is worth the decoupling benefits versus when tighter integration might be preferable.

Baseline Comparisons Lack Controlled Variables

The software engineering experiments compare ProRL Agent against SkyRL-Agent, but the comparison is not fully controlled. Table 2 shows:

  • ProRL Agent-8B achieves 18.0% vs. SkyRL-Agent-8B-v0's 9.4%
  • ProRL Agent-14B achieves 23.6% vs. SkyRL-Agent-14B-v0's 21.6%

However, the paper does not confirm that both systems use identical base models, training data, hyperparameters, and compute. The ProRL training framework (Liu et al., 2025a) is mentioned as the RL backend, but SkyRL-Agent uses its own training stack. The performance difference could reflect infrastructure quality, training algorithm differences, hyperparameter tuning, or data curation. The paper claims ProRL Agent "provides a more effective and stable foundation," but without controlled experiments, this attribution cannot be verified.

Scalability Results Lack Absolute Numbers

Figure 5 shows near-linear scaling of throughput with compute nodes, which the paper cites as evidence of efficient scaling. However, the figure lacks specific numbers:

"throughput increases nearly linearly with the number of nodes, indicating that ProRL Agent can effectively leverage additional compute resources with minimal scaling overhead."

Without absolute throughput numbers (instances/second at 1 node, 2 nodes, 4 nodes, etc.), practitioners cannot estimate resource requirements for their own training runs. The shape of the curve shows good scaling, but the slope is unknown.
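Practitioners who measure their own deployment can quantify "near-linear" directly. The sketch below computes per-node scaling efficiency relative to the smallest measured node count. The throughput values are hypothetical placeholders, not numbers from Figure 5, and `scaling_efficiency` is an illustrative helper name.

```python
def scaling_efficiency(throughput_by_nodes):
    """Return {nodes: efficiency}, where 1.0 means perfectly linear scaling.

    Efficiency is per-node throughput divided by per-node throughput
    at the smallest node count measured.
    """
    base_nodes = min(throughput_by_nodes)
    base_per_node = throughput_by_nodes[base_nodes] / base_nodes
    return {
        nodes: (throughput / nodes) / base_per_node
        for nodes, throughput in sorted(throughput_by_nodes.items())
    }


# Hypothetical measurements (instances/sec); replace with real data.
hypothetical_measurements = {1: 0.10, 2: 0.19, 4: 0.36, 8: 0.68}
efficiency = scaling_efficiency(hypothetical_measurements)

for nodes, value in efficiency.items():
    print(f"{nodes} node(s): {value:.0%} of linear scaling")
```

Reporting the absolute throughput at each node count alongside this ratio would supply the slope that the figure alone does not.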

Token-in/Token-out Requires Backend Cooperation

The token-in/token-out communication design eliminates re-tokenization drift, but it requires that LLM backends support passing token IDs directly. The paper notes that rollout workers "send prompt_ids directly to the LLM backend" and "receive response_ids with per-token log-probabilities." This implies backends must implement this protocol—not all inference servers may support it. The paper does not discuss:

  • Which inference backends are compatible
  • How to adapt backends that only accept text
  • Whether standard OpenAI-compatible APIs support this mode

Practitioners with existing inference infrastructure may need to modify or replace backends to use ProRL Agent's token-level communication.
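Why re-tokenization drift matters can be shown with a toy example. The three-entry vocabulary and greedy longest-match tokenizer below are invented for illustration and do not correspond to any real model's tokenizer; the point is only that decoding sampled token IDs to text and encoding that text again need not return the same IDs.

```python
# Toy vocabulary: "ab" exists as a single token alongside "a" and "b".
VOCAB = {"a": 0, "b": 1, "ab": 2}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def decode(token_ids):
    """Concatenate the text of each token ID."""
    return "".join(ID_TO_TEXT[token_id] for token_id in token_ids)


def encode(text):
    """Greedy longest-match tokenization over the toy vocabulary."""
    pieces = sorted(VOCAB, key=len, reverse=True)
    token_ids, position = [], 0
    while position < len(text):
        piece = next(p for p in pieces if text.startswith(p, position))
        token_ids.append(VOCAB[piece])
        position += len(piece)
    return token_ids


sampled_ids = [0, 1]                        # the policy sampled "a", then "b"
round_trip_ids = encode(decode(sampled_ids))
drifted = round_trip_ids != sampled_ids     # same text, different token sequence
```

If a trainer received only text and re-tokenized it, the per-token log-probabilities recorded for the sampled sequence would no longer line up with the tokens being trained on. Passing token IDs end to end avoids that mismatch.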

7. Implications and Future Directions

How This Work Changes the Landscape

ProRL Agent establishes rollout-as-a-service as a viable architectural pattern for multi-turn agent RL, analogous to how inference-as-a-service (vLLM, SGLang) became standard for LLM serving. According to the paper, existing frameworks embed rollout within the training process. The paper provides both the conceptual argument and a working implementation demonstrating that decoupling improves:

  • Resource efficiency: I/O-intensive rollout and GPU-intensive training can be provisioned separately
  • Maintainability: Rollout infrastructure can evolve without touching training code
  • Portability: New training frameworks can connect via a simple HTTP interface

This is likely to influence future framework designs. As agent RL scales to more complex environments and longer horizons, the engineering burden of rollout infrastructure will grow. ProRL Agent provides a template for managing this complexity through service boundaries. The integration with NVIDIA NeMo Gym signals institutional adoption, which may drive standardization around this pattern.

Follow-Up Research This Work Enables

1. Multi-environment rollout coordination. The current design handles rollouts within a single task type at a time. Future work could extend the service to coordinate rollouts across heterogeneous environments simultaneously—e.g., mixing software engineering, web browsing, and code execution tasks within a single training run. The AgentHandler abstraction provides the extension point, but the scheduling and resource allocation challenges of mixed workloads remain unexplored.

2. Adaptive rollout budget allocation. The paper notes that DAPO filters uninformative prompts (zero-variance rewards), but the allocation of rollout budget across prompts is not optimized. A natural extension is adaptive allocation that spends more compute on difficult or high-uncertainty prompts. The service architecture enables this: the trainer could dynamically adjust sampling_params per prompt based on estimated difficulty or learning potential.

3. Checkpoint-aware scheduling. The current design flushes all LLM backends when checkpoints update. A more sophisticated scheduler could support gradual rollout—allowing some backends to continue with stale checkpoints while new ones come online—trading off policy staleness for throughput. This is particularly relevant for large-scale training where checkpoint synchronization is expensive.

4. Rollout caching and reuse. Multi-turn trajectories contain valuable information beyond the immediate gradient computation. Future work could implement trajectory caching within the service, enabling:

  • Curriculum learning that prioritizes difficult trajectories
  • Experience replay for stability
  • Distillation datasets for smaller models

The service boundary makes this natural: cached trajectories can be stored server-side without trainer modification.

5. Formal analysis of decoupling overhead. The paper claims HTTP overhead is acceptable but does not quantify it. A rigorous analysis would compare:

  • Latency: HTTP service vs. in-process call overhead
  • Throughput: Maximum rollouts/second under different architectures
  • Training efficiency: Wall-clock time to convergence with different coupling levels

Such an analysis would help practitioners make informed architectural decisions and identify optimization targets.
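Returning to item 2 above, the zero-variance filter it mentions can be sketched in a few lines. This is an illustration of the general idea, not the paper's or DAPO's code: the function name and the `{prompt_id: [rewards]}` layout are assumptions. A prompt whose rollouts all receive the same reward yields zero group-relative advantage for every sample, so it contributes no learning signal.

```python
def filter_informative(reward_groups):
    """Keep prompts whose rollout rewards are not all identical.

    reward_groups maps a prompt ID to the list of rewards from its rollouts.
    All-equal rewards mean zero variance, hence zero group-relative advantage.
    """
    return {
        prompt_id: rewards
        for prompt_id, rewards in reward_groups.items()
        if len(set(rewards)) > 1
    }


# Hypothetical rewards for three prompts with four rollouts each.
example_groups = {
    "prompt-1": [0, 0, 0, 0],   # always fails: filtered out
    "prompt-2": [1, 0, 1, 1],   # mixed outcomes: kept
    "prompt-3": [1, 1, 1, 1],   # always succeeds: filtered out
}
kept = filter_informative(example_groups)
```

An adaptive allocator of the kind proposed in item 2 could use the same per-prompt reward spread as a signal for where to spend additional rollouts.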

Practical Applications and Downstream Use Cases

HPC-native agent training. The primary use case is organizations with Slurm-managed clusters who want to train agents without maintaining separate privileged Docker infrastructure. ProRL Agent's rootless Singularity deployment directly addresses this. Research labs, universities, and companies with shared HPC resources can now run multi-turn agent RL within existing security policies.

Multi-team collaboration on agent infrastructure. The service boundary enables separation of concerns between infrastructure and training teams. One team can own the rollout service—optimizing sandbox performance, adding new environments, improving tool backends—while another team focuses on RL algorithms and policy training. This is valuable for larger organizations where different teams own different parts of the stack.

Rapid prototyping of agent tasks. The AgentHandler interface makes it "simple to host heterogeneous agentic tasks within a unified rollout service" (Section 1). Researchers prototyping new agent environments can implement a handler and register it without modifying training infrastructure. This lowers the barrier to experimenting with new tasks.

Production agent evaluation. Beyond training, the rollout service can serve as an evaluation infrastructure. The same sandboxed environments and tool backends used for rollouts can evaluate deployed agents. The service architecture naturally supports this dual use: evaluation requests are structurally identical to rollout requests, just without policy updates.

When to Prefer This Method Over Alternatives

Prefer ProRL Agent when:

  • Training on HPC clusters where Docker is unavailable or restricted
  • Running large-scale training where rollout and training have different resource requirements (different node types, different scaling needs)
  • Multiple training frameworks or research groups need to share rollout infrastructure
  • Experimenting with diverse agent tasks that require different sandbox environments

Prefer tightly coupled alternatives when:

  • Training on single machines or cloud instances where Docker is available
  • Minimizing latency is critical and rollout parallelism is low
  • Simplicity is prioritized over scalability, since the service architecture adds deployment complexity
  • Existing infrastructure already embeds rollout logic and migration cost is high

For replication or integration, the key requirements are:

  • Singularity installed on execution nodes (for rootless sandboxing)
  • LLM inference backend supporting the token-in/token-out protocol (vLLM and NeMo are candidates, though the paper does not list compatible backends)
  • RL trainer capable of HTTP communication with the rollout service
  • Task-specific AgentHandler implementation for the target domain

The open-source release at the ProRL Agent GitHub repository and integration with NVIDIA NeMo Gym provide starting points for adoption.