Scaling Agent Learning via Experience Synthesis

ArXiv: 2511.03773

🎯 Pitch

This paper introduces DreamGym, a unified framework that trains LLM-based agents with scalable, reasoning-driven synthetic experiences—using a chain-of-thought experience model, a replay buffer grounded in offline trajectories, and curriculum task generation—to produce consistent state transitions and informative rewards for online RL. By replacing costly real-environment rollouts with controllable synthetic interactions, DreamGym makes RL feasible in non-RL-ready domains, matches costly real-rollout baselines in RL-ready settings, and provides a sample-efficient sim-to-real warm start that substantially reduces real-world interaction needs.


1. Executive Summary

DreamGym is a unified framework for training LLM-based autonomous agents with online reinforcement learning (RL) using synthetic interaction data, replacing expensive and brittle real-environment rollouts with a reasoning-driven “experience model” that generates state transitions and reward signals in an abstract textual state space (Figure 1, Figure 2, §4.1). It combines (i) a chain-of-thought (CoT)-guided simulator, (ii) a replay buffer grounded in offline real trajectories, and (iii) curriculum task generation driven by a reward-entropy heuristic to keep training informative (§4.1–§4.2). Empirically, it enables RL on non-RL-ready domains like WebArena and matches or improves real-environment RL baselines in RL-ready-but-costly settings, including sim-to-real warm-start gains (Table 1, Figure 3, §5.2).

2. Context and Motivation

  • Problem / gap addressed.
  • RL for LLM agents is attractive because interaction can enable self-improvement beyond static pretraining, especially for multi-step tool/web/embodied tasks (§1, §2.1).
  • In practice, collecting scalable online interaction experience is blocked by four recurring bottlenecks (§1):

    1. Costly rollouts (long horizons, expensive steps).
    2. Limited task diversity (fixed instruction sets; hard to scale feasible tasks).
    3. Unreliable reward signals (dynamic GUIs/web; noisy or sparse feedback; false positives).
    4. Infrastructure complexity (heterogeneous environments; heavy backends; poor parallelism; resets/safety issues).
  • Why this is important.

  • Many target agent settings (realistic web, OS/GUI interaction) are precisely where the above bottlenecks are worst, making “standard” online RL infeasible even if algorithms like PPO/GRPO exist (§1, §2.1, Appendix A.3).

  • Prior approaches and shortcomings (as positioned here).

  • Offline imitation / preference methods like SFT and DPO can distill static trajectories but are limited by trajectory diversity/adaptivity (§2.2; Table 1 shows these are much weaker than RL-style training on the evaluated agent tasks).
  • Synthetic data / task synthesis efforts expand tasks or demonstrations, but many still require real-environment collection, inheriting scalability bottlenecks (§2.2).
  • World-model / simulator approaches attempt to emulate environments, but the paper argues that agent training need not require perfectly realistic raw-state simulation; it needs diverse, informative, causally grounded transitions and feedback for learning (§4.1).

  • DreamGym’s positioning.

  • It reframes the environment as a learning-oriented experience generator operating in an abstract textual state space, explicitly designed to be scalable for RL training and to support curriculum and replay grounding (Figure 1, Figure 2, §4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • DreamGym is a training framework that replaces a real interactive environment with a reasoning-based model that synthesizes the agent’s observations and rewards step-by-step.
  • It solves scalable experience collection for online RL by generating abstract “next states,” rewards, and new tasks cheaply—so RL algorithms like PPO or GRPO can be run without (or with far fewer) costly real rollouts (§3.2, §4, §5).

3.2 Big-picture architecture (diagram in words)

  • Inputs: seed tasks (instructions), some offline trajectories from the target domain, and an agent policy to train.
  • Components (Figure 2):
    • Reasoning experience model (Mexp): given current abstract state + action + history + retrieved similar experiences + task instruction, it produces next abstract state and reward via explicit reasoning (§4.1.1, Eq. (4)).
    • Experience replay buffer: stores offline and newly generated trajectories; retrieval provides grounding demonstrations to reduce hallucination and improve consistency (§4.1.1; Figure 2; ablated in Table 2).
    • Curriculum task generator (Mtask): proposes new task variations from high-value seed tasks selected using a reward-entropy / variance heuristic (§4.2, Eq. (6)–(7)).
    • RL trainer (PPO/GRPO): updates the agent policy using synthetic rollouts the same way it would use real rollouts (§3.2, §4.3).

3.3 Roadmap for the deep dive

  • I first define the RL framing and what DreamGym must supply to make RL work for agent tasks (§3).
  • Then I explain the experience model: abstract states, inputs/outputs, CoT reasoning, and how it is trained (§4.1, Eq. (4)–(5)).
  • Next I explain replay-buffer retrieval grounding and why it matters for stability and factuality (§4.1.1; Table 2, Figure 4).
  • Then I explain curriculum task generation via reward-entropy and the sampling bound λ (§4.2, Eq. (6)–(7)).
  • Finally I connect the pipeline to training modes: fully synthetic RL vs sim-to-real warm-start, plus the provided policy improvement bound (Theorem 1, Eq. (9)) (§4.3, Appendix B.1).

3.4 Detailed, sentence-based technical breakdown

Framing: This is an algorithmic + systems-style paper centered on a synthetic experience generation framework for scalable online RL of LLM agents, with empirical evaluation across multiple environments and model backbones (§4–§6).

3.4.1 RL formulation and what data RL needs

  • The agent learning problem is formalized as an MDP M = (S, A, T, R, γ, ρ0) where τ0 (the natural-language instruction) is part of the initial state distribution, states encode what the agent can observe, and actions are discrete UI/tool/text operations (§3.1).
  • RL optimizes the policy πθ(a|s) to maximize expected cumulative reward using policy gradients (Eq. (1)).
  • The paper focuses on PPO and GRPO as representative online RL optimizers (§3.2):
    • PPO uses a learned value function V(s) and generalized advantage estimation (GAE) (Eq. (2)).
    • GRPO discards the value function and normalizes rewards within a group of rollouts for the same task (Eq. (3)), aiming for scalability.
  • Key dependency: both PPO and GRPO require a stream of (state, action, reward, next state) transitions. DreamGym’s purpose is to synthesize these transitions cheaply and consistently at scale (§4).
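To make the two advantage estimators concrete, here is a minimal sketch of textbook GAE and of group-normalized advantages. The exact form of Eq. (2) and Eq. (3) is not reproduced in this summary, so the mean/standard-deviation normalization, the function names, and the default values of `gamma`, `lam`, and `eps` are assumptions for illustration. Note that `lam` here is the GAE decay parameter, not the curriculum bound λ discussed later.

```python
from statistics import mean, pstdev


def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for a single trajectory.

    rewards[t] is the reward received after step t. values must contain one
    extra entry: the value of the state reached after the final step
    (use 0.0 when that state is terminal).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at step t.
        td_error = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = td_error + gamma * lam * running
        advantages[t] = running
    return advantages


def grpo_advantages(group_rewards, eps=1e-8):
    """Group-normalized advantages: one scalar per rollout of the same task.

    No value function is needed; each rollout is scored relative to the
    other rollouts in its group.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Four rollouts of one task with sparse terminal rewards.
balanced = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# A group where every rollout fails carries no learning signal.
all_fail = grpo_advantages([0.0, 0.0, 0.0, 0.0])
```

With sparse 0/1 rewards, a group in which every rollout succeeds or every rollout fails yields all-zero advantages, which is one way to see why task selection matters for keeping training informative.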

3.4.2 The core idea: reason in an abstract textual state space (instead of raw environments)

  • DreamGym builds an LLM-based experience model Mexp that operates in an abstract, meta-representational textual state space rather than raw HTML/pixels (§4.1).
  • Concretely, for web tasks, the “state” can be a cleaned list of actionable elements (e.g., link/menuitem/table entries) rather than raw HTML; this is argued to be more token-efficient and to discard irrelevant structure (§4.1; Figure 6 shows examples of such states as numbered UI elements).
  • The stated insight is that training does not require perfect raw-environment realism; it requires diverse, informative, causally grounded interaction data that supports knowledge acquisition for the target domain (§4.1).

3.4.3 Experience rollout inference: what Mexp consumes and produces

DreamGym generates rollouts by alternating agent actions and experience-model transitions (Figure 2, §4.3). At step t:

  1. The agent observes an abstract state st (plus task instruction and recent history in the prompt templates shown in Appendix C) and chooses an action at.
  2. The experience model uses explicit chain-of-thought reasoning to predict:
     • the next abstract state st+1, and
     • the reward rt+1.

The paper identifies three additional contexts beyond (st, at) that improve state quality and stability (§4.1.1):

  • Interaction history {(si, ai)}_{i=0..t}, included in the context window to keep multi-turn transitions consistent (§4.1.1).
  • Task instruction τ, so actions are interpreted relative to the goal (§4.1.1).
  • Retrieved past experiences {dj}_{j=1..k} from a replay buffer, where retrieval uses semantic similarity of state-action embeddings:
    • Topk(cos(ϕ(st, at), ϕ(si, ai))) (retrieval expression in §4.1.1).
    • ϕ(·) is “an arbitrary semantic encoder” (not further specified in the provided content).

Given these, the transition is modeled as (Eq. (4)):

(st+1, rt+1) = Mexp(Rt | {(si, ai)}_{i=0..t}, {dj}_{j=1..k}, τ)

where Rt is an explicit reasoning trace that guides prediction.
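The top-k cosine retrieval that supplies the demonstrations {dj} can be sketched as follows. The encoder ϕ is unspecified in the source, so this example assumes embeddings are already computed and stored as plain lists of floats; `cosine`, `retrieve_top_k`, and the toy buffer contents are hypothetical names and data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def retrieve_top_k(query_embedding, buffer, k=2):
    """Return the k stored experiences most similar to the query.

    buffer is a list of (embedding, experience) pairs, where each embedding
    stands in for phi(s_i, a_i) of a stored state-action pair.
    """
    ranked = sorted(
        buffer,
        key=lambda item: cosine(query_embedding, item[0]),
        reverse=True,
    )
    return [experience for _, experience in ranked[:k]]


# Toy buffer with 2-dimensional embeddings (illustrative values only).
toy_buffer = [
    ([1.0, 0.0], "click search box"),
    ([0.0, 1.0], "open cart"),
    ([0.9, 0.1], "type query"),
]
demos = retrieve_top_k([1.0, 0.0], toy_buffer, k=2)
```

A full sort is fine for a toy buffer; a large replay buffer would more plausibly use an approximate nearest-neighbor index, which the source does not discuss.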

Reward design used in experiments.

  • The paper uses an outcome-based reward: r = 1 only at the final step if the task succeeds, and r = 0 otherwise (§4.1.1).
  • This is important because it implies DreamGym’s gains are not from dense hand-shaped rewards; they come from (i) better transition quality and (ii) curriculum/task diversity while still operating under sparse terminal rewards.

3.4.4 Training the experience model to “reason” (SFT objective with reasoning traces)

  • Mexp is trained from an offline trajectory dataset D = {(st, at, st+1, rt+1)} (§4.1.2).
  • Each transition is augmented with a teacher-generated reasoning trace R*t (prompt examples in Appendix C.1–C.3).
  • The experience model is then supervised fine-tuned (SFT) with a joint objective that:
    • learns to generate the reasoning trace, and
    • predicts the next state conditioned on the reasoning trace (Eq. (5)):

LSFT = E[...] [ −log Pθ(R*t | st, at, Ht, Dk) − log Pθ(st+1 | st, at, R*t, Ht, Dk) ]

where Ht is interaction history and Dk the retrieved top-k demonstrations (§4.1.2).
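As a rough illustration of the two-term objective, the per-transition loss can be computed from token-level log-probabilities. This is a sketch under the assumption that the model exposes per-token log-probabilities for the reasoning trace and for the next-state text; `sft_loss` and its argument names are hypothetical.

```python
def sft_loss(reasoning_logprobs, next_state_logprobs):
    """Joint negative log-likelihood for one annotated transition.

    reasoning_logprobs: per-token log-probabilities of the teacher reasoning
        trace, conditioned on the state, action, history, and demonstrations.
    next_state_logprobs: per-token log-probabilities of the next state,
        conditioned on the same context plus the reasoning trace.
    """
    reasoning_nll = -sum(reasoning_logprobs)
    next_state_nll = -sum(next_state_logprobs)
    return reasoning_nll + next_state_nll


# Two reasoning tokens and one next-state token (illustrative values only).
example_loss = sft_loss([-0.5, -0.5], [-1.0])
```

By the chain rule of probability, the sum of the two terms equals the negative log-likelihood of the concatenated target "reasoning trace followed by next state", so a standard sequence cross-entropy over that concatenation would optimize the same quantity.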

Mechanistic interpretation (why the two terms matter).

  • The first term trains the model to produce a causal explanation of why an action leads to an outcome in context; the second term uses that explanation as a scaffold to generate a coherent, informative next observation (§4.1.2).
  • This design is later supported by ablations showing that removing reasoning degrades informativeness and hallucination behavior (Figure 4; Table 2).

Data efficiency claims grounded in provided experiments.

  • For WebShop and ALFWorld, the offline datasets used to train Mexp include a mix of demonstrations and additional trajectories from oracle/random exploration (Appendix A.1, A.2).
  • For WebArena, offline trajectories are mined from successful public leaderboard agents plus additional high-performing and random trajectories, totaling 4,800 offline trajectories (Appendix A.3).
  • The paper explicitly investigates varying offline data sizes (2k/10k/20k/40k transition steps) and shows performance scaling trends of the experience model (Figure 5), though exact numeric values are only partially specified in text (see §5 below for careful reporting).

3.4.5 Experience replay buffer: grounding and co-evolution

  • DreamGym uses an experience replay buffer seeded with offline “real-world” trajectories and continually enriched with fresh synthetic interactions (Figure 2; §1; §4.1.1).
  • Retrieval of similar experiences is used at inference time to reduce hallucination and improve factuality/consistency in generated states (§4.1.1).
  • The buffer co-evolves with the policy: as the agent changes, new rollouts are added, and retrieval provides policy-aligned context, which is argued to stabilize training (§1; §4).

Empirical role: Removing replay reduces success rates in both WebShop and WebArena (Table 2; see §5.3 below).

3.4.6 Curriculum task generation via reward-entropy (variance) heuristic

DreamGym addresses task scarcity by generating new tasks online (§4.2).

  • Task variations are generated from a set of m seed tasks using:
    • τt = Mtask({τ^{i}_{t-1}}_{i=1..m}) (Eq. (6)),
    • where Mtask shares parameters with Mexp (§4.2).

How does it choose which tasks to expand?

  • It defines a per-task “value” based on group-based reward entropy, operationalized as reward variance across n rollouts (Eq. (7)):

Vτ = (1/n) Σ_i (ri − r̄)^2, with r̄ = (1/n) Σ_i ri

  • Interpretation given in §4.2:
    • If variance is non-zero, the agent sometimes succeeds and sometimes fails: the task is feasible but challenging.
    • Maximum “entropy” is when successes and failures are balanced, providing high information gain for credit assignment.
  • This selection rule is compatible with both GRPO (groups naturally exist as multiple rollouts per task) and PPO (groups can be formed by clustering tasks using a semantic embedder) (§4.2).

Stability control: A hyperparameter λ bounds the proportion of synthetic tasks sampled per iteration to preserve coverage of the original task distribution while focusing on current weaknesses (§4.2).

  • The provided excerpt does not specify the numeric value(s) of λ; it only defines its role.
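The variance-based task value and the λ-bounded mixing can be sketched as below. For binary rewards with success rate p, the variance is p(1 − p), which peaks at 0.25 when p = 0.5 and is 0 when a task is always solved or never solved. The helper names, the tie-breaking rule, the choice to fill the synthetic quota whenever enough tasks exist, and the value `lam=0.25` are all assumptions for illustration; the source does not give a value for λ.

```python
import random


def reward_variance(rewards):
    """Population variance of rewards across n rollouts of one task."""
    n = len(rewards)
    mean_r = sum(rewards) / n
    return sum((r - mean_r) ** 2 for r in rewards) / n


def select_seed_tasks(group_rewards, m):
    """Pick up to m tasks with the highest non-zero reward variance."""
    scored = [(reward_variance(r), task) for task, r in group_rewards.items()]
    informative = [(v, task) for v, task in scored if v > 0]
    # Highest variance first; ties broken alphabetically for determinism.
    informative.sort(key=lambda pair: (-pair[0], pair[1]))
    return [task for _, task in informative[:m]]


def mix_tasks(original_tasks, synthetic_tasks, batch_size, lam, rng):
    """Sample a batch in which at most a fraction lam is synthetic."""
    n_synthetic = min(int(lam * batch_size), len(synthetic_tasks))
    batch = rng.sample(synthetic_tasks, n_synthetic)
    batch += rng.choices(original_tasks, k=batch_size - n_synthetic)
    return batch


group_rewards = {
    "always solved": [1, 1, 1, 1],   # variance 0.0: nothing left to learn
    "never solved": [0, 0, 0, 0],    # variance 0.0: no signal yet
    "balanced": [1, 0, 1, 0],        # variance 0.25: most informative
    "mostly solved": [1, 1, 1, 0],   # variance 0.1875
}
seeds = select_seed_tasks(group_rewards, m=2)
batch = mix_tasks(
    ["orig-1", "orig-2"],
    ["synth-1", "synth-2", "synth-3"],
    batch_size=8,
    lam=0.25,
    rng=random.Random(0),
)
```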

3.4.7 End-to-end training loop: what happens first, second, third (explicit flow)

Putting the pieces together (Figure 2; §4.3):

  1. Initialize with a seed task set and an offline-seeded replay buffer (§4.3; Figure 2).
  2. Roll out trajectories:
     • For each task, the agent selects actions from current abstract states.
     • After each action, Mexp generates the next abstract state and reward using history + task + retrieved demonstrations + reasoning (Eq. (4)).
  3. Update the agent with a standard RL algorithm (PPO or GRPO) using the collected synthetic transitions (§3.2, §4.3).
  4. Augment the replay buffer with newly generated trajectories (Figure 2).
  5. Expand the task set:
     • Identify high-entropy (high-variance) tasks.
     • Generate task variations with Mtask to create an adaptive curriculum (Eq. (6)–(7)).
  6. Repeat until convergence or a training budget is reached (§4.3).
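The loop above can be sketched as a small skeleton in which the policy, experience model, task generator, and RL update are passed in as plain callables. Everything here is a hypothetical stand-in: the real components are LLM calls and a PPO/GRPO optimizer, retrieval is replaced by taking the most recent buffer entries, and the toy functions exist only so the control flow can be run end to end.

```python
def reward_variance(rewards):
    mean_r = sum(rewards) / len(rewards)
    return sum((r - mean_r) ** 2 for r in rewards) / len(rewards)


def rollout(task, policy, experience_model, buffer, max_steps=5):
    """One synthetic episode: the policy acts, the experience model responds."""
    state, history, reward = f"initial state for: {task}", [], 0.0
    for _ in range(max_steps):
        action = policy(task, state, history)
        demos = buffer[-3:]  # placeholder for similarity-based retrieval
        next_state, reward, done = experience_model(task, state, action, history, demos)
        history.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return history, reward


def train(tasks, policy, experience_model, task_generator, update_policy,
          buffer, iterations=2, group_size=4, m=1):
    tasks = list(tasks)
    for _ in range(iterations):
        batch, group_rewards = [], {}
        for task in tasks:  # roll out a group of episodes per task
            results = [rollout(task, policy, experience_model, buffer)
                       for _ in range(group_size)]
            batch.extend(trajectory for trajectory, _ in results)
            group_rewards[task] = [r for _, r in results]
        update_policy(batch)   # RL update on synthetic transitions
        buffer.extend(batch)   # replay buffer grows with new rollouts
        # Expand the task set from the highest-variance tasks.
        ranked = sorted(group_rewards,
                        key=lambda t: reward_variance(group_rewards[t]),
                        reverse=True)
        seeds = [t for t in ranked[:m] if reward_variance(group_rewards[t]) > 0]
        tasks.extend(t for t in task_generator(seeds) if t not in tasks)
    return tasks, buffer


# Toy stand-ins so the skeleton runs; none of these reflect real components.
counter = {"calls": 0, "updates": 0}

def toy_policy(task, state, history):
    return "noop"

def toy_experience_model(task, state, action, history, demos):
    counter["calls"] += 1
    success = counter["calls"] % 2 == 0  # alternate failure and success
    return "terminal", (1.0 if success else 0.0), True

def toy_task_generator(seeds):
    return [seed + " (variation)" for seed in seeds]

def toy_update(batch):
    counter["updates"] += 1

final_tasks, final_buffer = train(
    ["task A", "task B"], toy_policy, toy_experience_model,
    toy_task_generator, toy_update, buffer=[],
)
```

Passing the components as callables keeps the skeleton independent of the RL algorithm, which mirrors the summary's point that the experience synthesis is orthogonal to the choice of PPO or GRPO.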

3.4.8 Sim-to-real (S2R) extension and state-space consistency

  • DreamGym-S2R trains a policy first in DreamGym’s synthetic environment, then continues RL in the real environment with far fewer real interactions (§4.3, Table 1).
  • To transfer, it enforces consistent state space between synthetic and real environments using the same rule-based mapping or a lightweight fine-tuned model (§4.3).
  • In WebArena, an AX-tree refinement/mapping prompt is provided (Appendix C.3, “AX-tree state mapping prompt”), illustrating how raw accessibility-tree observations are reduced to a focused abstract observation.
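A rule-based mapping of this kind might look like the sketch below, which filters a flat list of accessibility-tree nodes down to numbered actionable elements. The node dictionary layout, the role whitelist, and the output line format are invented for illustration and are not taken from the paper's prompts or code.

```python
# Hypothetical whitelist of roles treated as actionable.
ACTIONABLE_ROLES = {"link", "button", "menuitem", "textbox"}


def to_abstract_state(nodes):
    """Reduce raw nodes to a compact, numbered list of actionable elements.

    Each node is assumed to be a dict with "id", "role", and "name" keys.
    Nodes with a non-actionable role or an empty name are dropped.
    """
    lines = []
    for node in nodes:
        if node["role"] in ACTIONABLE_ROLES and node.get("name"):
            lines.append(f"[{node['id']}] {node['role']} '{node['name']}'")
    return "\n".join(lines)


raw_nodes = [
    {"id": 12, "role": "banner", "name": ""},
    {"id": 463, "role": "link", "name": "Total Commits"},
    {"id": 470, "role": "button", "name": "Open Change Log"},
    {"id": 471, "role": "generic", "name": "spacer"},
]
abstract_state = to_abstract_state(raw_nodes)
```

Applying the same function to both real observations and the targets used to train the experience model is one simple way to keep the two state spaces aligned for transfer.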

3.4.9 Theoretical policy improvement bound (why synthetic training can help real performance)

Appendix B.1 provides a trust-region-style bound (Theorem 1, Eq. (9)):

  • Define one-step experience-model errors:
    • εR = sup_{s,a} |R(s,a) - R̂(s,a)| (reward error),
    • εP = sup_{s,a} TV(P(·|s,a), P̂(·|s,a)) (transition distribution error), where TV is total variation distance (Eq. (8)).
  • Under a KL-bounded policy update sup_s KL(π'(·|s) || π(·|s)) ≤ δ (linked to PPO/GRPO-style constraints), the improvement in the real environment is lower bounded by:
    • a synthetic surrogate improvement term (advantage under synthetic MDP),
    • minus a trust-region penalty (scales with δ),
    • minus an experience-model error penalty (scales with εR, εP) (Eq. (9)).

Key implication grounded in the text: the bound depends on reward accuracy and domain consistency, not strict raw-state reconstruction error (Appendix B.1, Lemma 1 / Eq. (14)–(26)), aligning with the paper’s “learning-centric” simulator design (§4.1).
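The reward error and transition error defined above can be computed exactly for a small tabular model, which makes their meaning concrete. This sketch assumes finite state and action sets so the supremum becomes a maximum; the dictionaries and their values are toy data, and the function names are hypothetical.

```python
def reward_error(R, R_hat):
    """Largest absolute reward gap over all (state, action) pairs."""
    return max(abs(R[sa] - R_hat[sa]) for sa in R)


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def transition_error(P, P_hat):
    """Largest total variation gap over all (state, action) pairs."""
    return max(total_variation(P[sa], P_hat[sa]) for sa in P)


# Real environment versus synthetic experience model on a toy problem.
R = {("s0", "a"): 1.0, ("s0", "b"): 0.0}
R_hat = {("s0", "a"): 0.75, ("s0", "b"): 0.0}
P = {("s0", "a"): {"s1": 1.0}, ("s0", "b"): {"s0": 1.0}}
P_hat = {("s0", "a"): {"s1": 0.75, "s2": 0.25}, ("s0", "b"): {"s0": 1.0}}

eps_r = reward_error(R, R_hat)
eps_p = transition_error(P, P_hat)
```

Both quantities compare distributions over abstract states and rewards; neither measures how closely a generated state matches raw HTML or pixels, which is the distinction the bound relies on.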

3.4.10 Worked micro-example (single input → output walk-through)

Using the WebArena case study (Figure 6):

  • Task (instruction): “What is the message Jack provided in the first commit of Apr 2023?” (Figure 6).
  • State 0: abstract element list includes a “Total Commits” stat badge and an “Open Change Log” button among other UI elements (Figure 6).
  • Agent action: Click(463) where [463] corresponds to “Total Commits” (Figure 6).
  • Experience-model reasoning (<think>):
  • It interprets the click as intended navigation to commit history and predicts the page should display a list of commits including April 2023 entries (Figure 6).
  • Predicted next state (State 1):
  • A list grouping commit history by date appears, with list items for Apr 2, Apr 3, Apr 7, Apr 8, etc. (Figure 6).
  • Next agent action: Click(1144) selecting the first April commit entry (Apr 2, 2023) (Figure 6).
  • Predicted next state (State 2):
  • A commit detail view appears including message "Add API migration notes" (Figure 6).
  • Outcome: This trajectory illustrates the intended causal linkage “click → history list → click commit → details,” and shows the abstract state representation (numbered actionable elements + descriptions) that DreamGym uses instead of raw webpages (Figure 6; §4.1).

4. Key Insights and Innovations

  • (1) Reasoning-based experience synthesis as an RL-ready “environment.”
  • Novelty: Instead of simulating raw environments, DreamGym synthesizes transitions in an abstract textual state space with explicit reasoning traces (Eq. (4)–(5), §4.1).
  • Significance: This targets the practical RL bottleneck—experience collection—by making rollouts cheap, controllable, and scalable (Figure 1; §1; §5.2).

  • (2) Retrieval-grounded multi-turn experience model with co-evolving replay.

  • Novelty: At inference time, Mexp conditions not just on current state/action but also on history, task instruction, and top-k retrieved demonstrations (retrieval rule in §4.1.1).
  • Significance: Ablations show replay removal reduces success rates (Table 2), and evaluation suggests history/reasoning reduce inconsistencies and hallucinations (Figure 4).

  • (3) Reward-entropy (variance) driven curriculum task generation.

  • Novelty: Task selection uses a group-based reward variance heuristic (Eq. (7)) to identify tasks that are feasible yet challenging, then generates variations (Eq. (6), §4.2).
  • Significance: Removing task generation causes notable performance drops and earlier plateaus (Table 2; §6.2; Figure 3 Right discussion in §6.2).

  • (4) Sim-to-real warm-start strategy for sample efficiency.

  • Novelty: DreamGym-S2R trains purely synthetically first, then performs a small real-environment RL phase, with explicit attention to consistent state representations (§4.3).
  • Significance: It yields large gains with far fewer real interactions in reported experiments (Table 1; §5.2).

  • (5) Learning-centric theoretical guarantee (trust-region bound).

  • Novelty: Theorem 1 (Appendix B.1) emphasizes that reward accuracy and transition distribution consistency control real-environment improvement, not pixel/HTML reconstruction fidelity.
  • Significance: This formalizes the paper’s core design stance in §4.1 that “perfect realism” is unnecessary if learning-relevant signals are preserved.

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, baselines, setup)

  • Environments (§5.1):
  • WebShop: e-commerce web interaction tasks (§5.1; Appendix A.1).
  • ALFWorld: embodied/text interaction benchmark (§5.1; Appendix A.2).
  • WebArena-Lite: realistic web interaction; explicitly described as non-RL-ready due to reset/scaling/infrastructure issues (§5.1; Appendix A.3).
  • Agent backbones (§5.1):
  • Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct.
  • Baselines (Table 1, §5.1):
  • Offline: SFT (20K real transitions), DPO (40K real transitions).
  • Online real-environment RL: GRPO and PPO with 80K real transitions.
  • DreamGym variants:
    • DreamGym: RL (GRPO/PPO) trained with 0 real transitions (purely synthetic).
    • DreamGym-S2R: synthetic pretraining then 5K real transitions in the real environment.
  • Metric: Success rate (%) across tasks (Table 1; Figure 3).
  • Experience model backbone: main results use an experience model trained from Llama-3.1-8B-Instruct (§5.1).
  • Compute/resources: Experiments are run on “8 nodes with A100 GPUs and 4 nodes with H100 GPUs” (Appendix A.1–A.3).
  • The provided excerpt does not include optimizer settings, learning rates, batch sizes, or context window lengths; it references “standard setup and hyperparameter settings from Verl-Agent (Feng et al., 2025)” and that “detailed parameter settings are provided in Appendix A” (§5.1, Appendix A), but those numeric hyperparameters are not included in the provided content.

5.2 Main quantitative results (with specific numbers)

Table 1 (success rate %, best bolded in table):

  • WebArena (non-RL-ready; strongest relative gains):
  • GRPO Traditional (80K real): 7.3 / 6.1 / 6.1 (L3.2-3B / L3.1-8B / Q2.5-7B).
  • PPO Traditional (80K real): 6.7 / 4.8 / 7.3.
  • DreamGym GRPO (0 real): 13.3 / 9.1 / 12.7.
  • DreamGym PPO (0 real): 14.5 / 10.9 / 10.0.
  • DreamGym-S2R (5K real) results are mixed vs pure DreamGym on WebArena:
    • GRPO S2R: 13.9 / 9.7 / 11.2
    • PPO S2R: 13.3 / 10.9 / 13.9
  • Interpretation grounded in §5.2: DreamGym makes RL feasible where real-environment RL is bottlenecked by infrastructure and reward instability.

  • WebShop (RL-ready but costly):

  • GRPO Traditional (80K real): 62.1 / 65.0 / 66.1.
  • DreamGym GRPO (0 real): 59.3 / 63.9 / 68.3 (comparable, sometimes higher).
  • PPO Traditional (80K real): 59.9 / 64.2 / 68.1.
  • DreamGym PPO (0 real): 60.5 / 58.1 / 65.0.
  • DreamGym-S2R (5K real) improves substantially:
    • GRPO S2R: 70.5 / 75.0 / 72.1
    • PPO S2R: 66.0 / 63.9 / 73.7
  • This supports the warm-start claim in §5.2: a synthetic mid-training stage improves sample efficiency and attainable performance.

  • ALFWorld (RL-ready):

  • GRPO Traditional (80K real): 65.3 / 70.9 / 79.8.
  • DreamGym GRPO (0 real): 62.1 / 66.3 / 71.0.
  • DreamGym-S2R GRPO (5K real): 65.0 / 75.9 / 82.4.
  • PPO Traditional (80K real): 47.0 / 72.9 / 81.1.
  • DreamGym PPO (0 real): 40.5 / 70.8 / 72.7.
  • DreamGym-S2R PPO (5K real): 49.1 / 73.3 / 79.9.

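Offline methods (Table 1) are reported as much lower across tasks. On WebArena (SFT 6.1/5.5/7.3; DPO 5.5/4.8/4.8) they sit in the same range as traditional real-environment RL (GRPO 7.3/6.1/6.1; PPO 6.7/4.8/7.3) and well below DreamGym, highlighting the gap DreamGym targets: getting RL-style improvement without needing scalable real-environment rollouts.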
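The relative WebArena gains and the reduction in real transitions follow directly from the Table 1 numbers quoted above. The short calculation below only restates that arithmetic; the variable names are illustrative.

```python
# WebArena success rates (%) from Table 1, ordered as
# Llama-3.2-3B / Llama-3.1-8B / Qwen-2.5-7B.
traditional = {"GRPO": [7.3, 6.1, 6.1], "PPO": [6.7, 4.8, 7.3]}
dreamgym = {"GRPO": [13.3, 9.1, 12.7], "PPO": [14.5, 10.9, 10.0]}

# Ratio of DreamGym (0 real transitions) to traditional RL (80K real).
ratios = {
    algo: [round(d / t, 2) for d, t in zip(dreamgym[algo], traditional[algo])]
    for algo in traditional
}

# DreamGym-S2R uses 5K real transitions instead of 80K.
real_transition_reduction = 80_000 / 5_000
```

The ratios range from about 1.37 to 2.27, so the synthetic-only gains on WebArena are closer to "1.4 to 2.3 times" than a uniform doubling, and the S2R setting uses 16 times fewer real transitions than the traditional baselines.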

5.3 Ablations and supporting analyses

  • Component ablation (Table 2, success rate %):
  • Full DreamGym: WebShop 63.9, WebArena 13.3.
  • w/o Exp. Replay: WebShop 59.2, WebArena 9.7.
  • w/o Exp. Reasoning: WebShop 55.8, WebArena 7.3.
  • w/o Task Generation: WebShop 57.3, WebArena 7.3.
  • Takeaway: reasoning and task generation are particularly important for WebArena (both drop to 7.3), aligning with claims about reward/task sparsity in non-RL-ready domains (§6.2–§6.3).

  • Experience model quality metrics (Figure 4):

  • The paper evaluates consistency, diversity, informativeness, and hallucination using GPT-4o as a judge with discrete scores in {0,1,2} (Figure 4; Appendix C.4).
  • Reported qualitative findings (§6.3):

    • Removing history reduces consistency.
    • Removing reasoning hurts informativeness and increases hallucination.
    • Full DreamGym is best or near-best across criteria.
  • Offline data size and backbone sensitivity (Figure 5; §6.4):

  • The paper claims high data efficiency: “Even with 2k–10k offline samples, it already reaches competitive performance” (§6.4).
  • It gives a few concrete numeric anchors in text:
    • On WebShop, Llama-3.1-8B exceeds 50% success with 10k samples (§6.4).
    • Llama-3.2-3B reaches “about 55%” success on WebShop with 20k samples (§6.4).
    • On WebArena, WebDreamer achieves “around 13%” success in the low-data regime (§6.4).
  • Beyond those, the figure shows trends but the excerpt does not provide exact datapoints for every curve; I avoid inventing additional numbers.

5.4 Do the experiments support the claims?

  • Claim: RL becomes feasible on non-RL-ready WebArena. Supported by Table 1 where DreamGym (0 real) raises WebArena success rates by roughly 1.4× to 2.3× (about double in four of the six backbone/algorithm settings) compared to “Traditional” RL with 80K real transitions (Table 1; §5.2). The infrastructure bottleneck is explicitly described in Appendix A.3 (only 4 parallel sessions; manual resets; noisy evaluation), strengthening motivation for a synthetic alternative.
  • Claim: Synthetic RL can match real RL on RL-ready tasks. Supported by WebShop/ALFWorld comparisons where DreamGym (0 real) is close to traditional PPO/GRPO with 80K real transitions, and DreamGym-S2R improves further (Table 1; §5.2).
  • Claim: Curriculum and reasoning matter. Supported by Table 2 and Figure 4, plus training-curve discussion that removing task generation leads to earlier plateau on WebShop (Figure 3 Right referenced in §6.2).

6. Limitations and Trade-offs

  • Dependence on offline trajectories and teacher-generated reasoning traces.
  • Mexp training requires offline transition data plus reasoning annotations R*t generated by a “strong teacher LLM” (Appendix A.1–A.3; §4.1.2). This introduces:

    • cost/complexity in dataset construction,
    • potential annotation bias (not analyzed quantitatively in the provided excerpt).
  • Reward scheme remains sparse in experiments.

  • The paper uses outcome-only terminal rewards (r=1 on success else 0) (§4.1.1). While this keeps the reward simple, it means improvements must come from better experience/task synthesis rather than richer reward shaping. A trade-off is that some domains might need intermediate feedback to learn efficiently; the framework’s ability to incorporate richer reward signals is asserted (e.g., “rich feedback signals” in §4) but not numerically explored here.

  • State abstraction / mapping risk.

  • DreamGym relies on an abstract textual state space and mapping functions (e.g., AX-tree refinement in WebArena; Appendix C.3; §4.3). If the abstraction drops crucial information, policies might learn shortcuts that do not transfer.
  • The paper itself notes transfer limits when domain gaps are large: policies transfer between web domains (WebShop ↔ WebArena) better than from web to embodied ALFWorld (Figure 3 Middle; §5.2).

  • Simulator factuality and hallucination are managed but not eliminated.

  • Retrieval grounding and reasoning are explicitly motivated as hallucination reducers (§4.1.1, §6.3), and Figure 4 evaluates hallucination behavior via an LLM judge. However:

    • The judge itself is another model (GPT-4o) (Figure 4; Appendix C.4), and the excerpt does not report human evaluation or error analysis of judge reliability.
  • Incomplete reporting of low-level training hyperparameters in the provided content.

  • The excerpt references “standard” settings and Appendix A for detailed parameter settings (§5.1), but does not include optimizer, learning rate schedule, batch size, context window, etc. This limits reproducibility assessment based solely on the provided content.

  • Single-environment focus (explicit limitation).

  • The “Limitations and Future Work” section notes the work “primarily investigates single-environment learning setups” and suggests extending to a universal world model that unifies multiple environments.

7. Implications and Future Directions

  • Field-level implication: environments as learning-centric experience generators.
  • DreamGym pushes a perspective shift: for training general-purpose agents, the bottleneck is the quality/structure of interaction data, not necessarily perfect environment fidelity (§7; Appendix B.1). This can influence how future benchmarks and RL infrastructure are designed—toward scalable, abstract, controllable training interfaces.

  • Practical applications / use cases suggested by results.

  • Non-RL-ready domains (e.g., realistic web interaction): DreamGym can enable RL training where real rollouts are too costly or operationally brittle (WebArena; Table 1; Appendix A.3).
  • Costly RL-ready domains: DreamGym can reduce reliance on real interactions while maintaining performance (WebShop, ALFWorld; Table 1).
  • Warm-start for real-environment RL: DreamGym-S2R provides a recipe: synthetic pretraining for exploration/skills, then limited real RL for grounding, yielding higher final success with far fewer real transitions (Table 1; §4.3; §5.2).

  • Follow-up research directions directly suggested in the paper.

  • Universal / multi-environment experience models: Extend from per-environment Mexp to a unified model enabling cross-environment knowledge transfer (Limitations and Future Work section).
  • Better abstractions and transfer: Since web↔embodied transfer degrades when the domain gap is large (Figure 3 Middle; §5.2), a concrete direction is improving the meta-representation/state abstraction so it supports broader transfer.
  • More robust reward and evaluation modeling: The theoretical bound emphasizes reward accuracy (εR) and transition consistency (εP) (Appendix B.1). Future work could focus on measuring/controlling these quantities in practice.

  • Repro/Integration Guidance (when to prefer DreamGym, based on provided evidence).

  • Prefer DreamGym (pure synthetic RL) when:
    • the environment is not RL-ready (reset/safety/throughput issues) as described for WebArena (Appendix A.3), and you still want RL-style policy improvement (Table 1).
  • Prefer DreamGym-S2R when:
    • you can afford a small amount of real interaction and want higher final performance and sample efficiency (Table 1; §5.2).
  • DreamGym is described as orthogonal to the RL algorithm (PPO vs GRPO) because it focuses on scaling experience synthesis (end of §3.2). Empirically, both PPO and GRPO benefit in the reported setups (Table 1).