Web World Models

ArXiv: 2512.23676

🎯 Pitch

This paper introduces the Web World Model (WWM), a hybrid architecture that uses deterministic web code to implement world state and physics while leveraging large language models to generate contextual descriptions, narratives, and high-level content on top. By combining typed interfaces, deterministic procedural generation, and graceful degradation, WWMs preserve controllability, debuggability, and object permanence while enabling effectively unlimited, on-demand expansion of persistent environments for language agents—making scalable, reliable long-running agent worlds practical.


1. Executive Summary

This paper introduces the Web World Model (WWM), a hybrid way to build persistent, open-ended environments for language agents where deterministic web code defines world state and rules (“physics”), and an LLM generates rich descriptions, narratives, and high-level content (“imagination”) on top (Figure 1, Figure 3). The core significance is a practical “middle ground” between database-bounded web apps and fully generative world models: WWMs aim to keep controllability, debuggability, and logical consistency while still enabling effectively unlimited, on-demand world expansion via procedural/deterministic generation.

2. Context and Motivation

  • Problem / gap addressed
  • Language agents increasingly need persistent worlds where they can “act, remember, and grow” (Introduction).
  • Existing approaches cluster at two extremes (Figure 1):

    • Traditional web frameworks: reliable engineering and security boundaries, but context is bounded by database-backed schemas/content.
    • Fully generative world models: potentially unlimited context, but the world is “constructed primarily through generation,” making it harder to keep a fixed, deterministic global framework, reducing controllability, and making debugging/scaling expensive (Abstract, Introduction, Figure 1).
  • Why it matters

  • Long-running agent applications typically require:
    • State consistency (e.g., you can’t walk through locked doors; you can’t spend money you don’t have).
    • Persistence / object permanence (a place should remain the same when revisited).
    • Observability and tooling (debugging, versioning, deployment), which are strengths of web stacks (Introduction, Section 2.5).
  • Pure generation can undermine these guarantees via inconsistency/hallucination; pure hand-coded worlds can be too rigid or finite (Section 2, especially 2.1).

  • Prior approaches and shortcomings (as positioned in the paper)

  • Fixed-context web apps:
    • Strong at deterministic behavior and operational reliability.
    • Weak at scaling “world size” without large stored content and pre-defined schemas (Introduction, Figure 1).
  • Fully generative worlds:

    • Strong at open-ended content creation.
    • Weak at controllability, debugging, and maintaining global determinism (Abstract, Introduction, Figure 1).
  • How this paper positions WWMs

  • WWMs treat web code as the substrate of a world model’s rules/state and treat the LLM as a bounded imagination engine that must interface through explicit schemas (Abstract; Introduction; Section 2; Figure 3).
  • The paper argues web stacks can be a scalable substrate for world models (Abstract; Section 2.5; Conclusion).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a web application framework for building interactive “worlds” where users/agents take actions and the system responds with consistent state updates plus generated narrative content.
  • It solves persistent-environment construction by combining a deterministic code-defined state machine (rules + state transitions) with LLM-driven generation that fills in descriptions, missions, reactions, articles, or story text in a structured way (Figure 3; Section 2).

3.2 Big-picture architecture (diagram in words)

  • User/agent action (a_t) → Physics layer (S^ϕ) → Imagination layer (S^ψ) → Rendered world → next action
  • Physics (S^ϕ): deterministic code maintains invariant state, enforces rules, computes next state (Section 2.1; Figure 3).
  • Imagination (S^ψ): LLM generates descriptions/dialogue/narrative conditioned on the updated physics state (Section 2.1; Figure 3).
  • Typed interface / schema contract: LLM outputs structured JSON that must conform to TypeScript/JSON schemas (Section 2.2).
  • Deterministic hashing / seeding: procedural expansion uses coordinate/ID hashing to fix seeds and preserve object permanence without storing everything (Section 2.3; Figure 4).
  • Graceful degradation: caching/templates ensure the system remains usable if LLM calls fail or are slow (Section 2.4).

3.3 Roadmap for the deep dive

  • First, define the state decomposition (S_t = (S_t^ϕ, S_t^ψ)) and the update order, because it is the core correctness mechanism (Section 2.1; Figure 3).
  • Second, explain typed interfaces as the main boundary that makes LLM outputs executable and debuggable (Section 2.2).
  • Third, explain deterministic generation via hashing to get “infinite worlds” plus persistence without a database (Section 2.3; Figure 4).
  • Fourth, explain graceful degradation because it is critical for real web deployment under latency/cost constraints (Section 2.4).
  • Finally, walk through how these principles instantiate in the paper’s example systems (Section 3; Figures 5–14).

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems/architecture paper: it proposes an architectural abstraction (WWM) and demonstrates it through multiple implemented web-based environments rather than proposing a new trained model or a new learning algorithm (Abstract; Introduction; Section 3).

Core formalism: splitting “physics” and “imagination”

  • The world state at time t is explicitly decomposed as:
  • S_t = (S_t^ϕ, S_t^ψ) (Section 2.1).
  • S_t^ϕ (“Physics Layer”) is deterministic, code-defined state (inventories, coordinates, logic constraints).
  • S_t^ψ (“Imagination Layer”) is stochastic, model-defined high-dimensional content (descriptions, dialogue, “vibe”) (Section 2.1; Figure 3).

  • State transitions occur in a strict order (Section 2.1; Figure 3):

  • The user/agent takes an action a_t.
  • Deterministic code computes the next physics state:
    • S_{t+1}^ϕ = f_code(S_t^ϕ, a_t) (Section 2.1).
  • The LLM generates the imagination layer conditioned on the updated physics:
    • S_{t+1}^ψ ~ π_θ(· | S_{t+1}^ϕ) (Section 2.1).
  • The system renders a combined “world view” to the user, and the loop continues (Figure 3).

Why this ordering matters: by updating S^ϕ first with deterministic code, the system ensures that core invariants (e.g., locked doors, resource constraints) cannot be violated by generative text (Section 2.1).
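The ordering can be illustrated with a minimal TypeScript sketch. Everything named here (PhysicsState, Action, stepPhysics, worldStep, and the stub narrator standing in for the LLM) is a hypothetical illustration rather than code from the paper; the point is only that narration reads the already-updated physics state and has no way to write to it.

```typescript
// Hypothetical sketch: deterministic physics update first, narration second.
// All names (PhysicsState, Action, stepPhysics, worldStep) are illustrative.

interface PhysicsState {
  room: string;
  hasKey: boolean;
  doorLocked: boolean;
}

type Action =
  | { kind: "pickUpKey" }
  | { kind: "unlockDoor" }
  | { kind: "walkThroughDoor" };

// f_code: a pure function of (state, action). Invalid moves leave state unchanged.
function stepPhysics(s: PhysicsState, a: Action): PhysicsState {
  switch (a.kind) {
    case "pickUpKey":
      return { ...s, hasKey: true };
    case "unlockDoor":
      return s.hasKey ? { ...s, doorLocked: false } : s;
    case "walkThroughDoor":
      return s.doorLocked ? s : { ...s, room: "vault" };
  }
}

// Stand-in for the LLM call: it only reads the new physics state.
type Narrator = (s: PhysicsState) => string;

const stubNarrator: Narrator = (s) =>
  `You are in the ${s.room}. The door is ${s.doorLocked ? "locked" : "open"}.`;

function worldStep(
  s: PhysicsState,
  a: Action,
  narrate: Narrator
): { physics: PhysicsState; imagination: string } {
  const physics = stepPhysics(s, a); // S^phi_{t+1} = f_code(S^phi_t, a_t)
  const imagination = narrate(physics); // S^psi_{t+1} conditioned on S^phi_{t+1}
  return { physics, imagination };
}

const start: PhysicsState = { room: "hall", hasKey: false, doorLocked: true };
// Walking into a locked door changes nothing, whatever the narrator says.
const blocked = worldStep(start, { kind: "walkThroughDoor" }, stubNarrator);
```

Because the narrator receives a value and returns a string, a generated sentence such as "you slip through the door" cannot change `room`; only `stepPhysics` can.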

Typed interfaces: making LLM output executable and debuggable

  • WWMs replace “opaque embeddings” as latent state with explicit typed web interfaces (Section 2.2).
  • The paper’s mechanism is to define strict schemas such as:
  • Example given: interface Planet { biome: string; hazard: string; } (Section 2.2).
  • The LLM is required to output valid JSON objects conforming to these type definitions (Section 2.2).
  • This “schema as contract” has two practical effects in the paper’s framing:
  • It prevents structural hallucinations (outputs missing required fields or having wrong structure) by acting as a syntactic filter (Section 2.2).
  • It keeps imagined content compatible with deterministic rule code (e.g., if an item is created, it must include fields like weight and cost that the physics layer needs) (Section 2.2).
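As a rough illustration of the "schema as contract" idea, the sketch below validates raw model output against the Planet interface quoted above. The helper names (isPlanet, parsePlanet) are assumptions made for this example; the paper does not specify how validation is implemented. TypeScript types are erased at runtime, so the check has to be written explicitly.

```typescript
// Hypothetical sketch of runtime validation for the paper's Planet interface.
interface Planet {
  biome: string;
  hazard: string;
}

// Runtime type guard: accepts only objects with both required string fields.
function isPlanet(value: unknown): value is Planet {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v["biome"] === "string" && typeof v["hazard"] === "string";
}

// Parse raw model output. Returns null on malformed JSON or a missing field,
// so the caller can fall back instead of crashing.
function parsePlanet(raw: string): Planet | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    // Copy only the contracted fields so extra keys never reach the engine.
    return isPlanet(parsed) ? { biome: parsed.biome, hazard: parsed.hazard } : null;
  } catch {
    return null;
  }
}

const valid = parsePlanet('{"biome":"tundra","hazard":"ice storms"}');
const missingField = parsePlanet('{"biome":"tundra"}');
const notJson = parsePlanet("The planet is cold and windy.");
```

Here `valid` is a usable Planet, while `missingField` and `notJson` are both null. This is a purely structural filter: it says nothing about whether the biome is plausible, only that downstream code can rely on the fields being present.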

Deterministic generation via hashing: “infinite worlds” with object permanence

  • The paper’s scalability argument is that you cannot store an infinite world in a database, so the system generates content Just-In-Time (JIT) (Section 2.3).
  • The key device is deterministic hashing (Section 2.3; Figure 4):
  • When a user arrives at location x (or coordinate (x, y) in Figure 4), the system computes a seed:
    • seed = h(x) (or h(x, y)).
  • The seed fixes the LLM’s sampling randomness, so revisiting the same coordinate yields the same generated content (Section 2.3; Figure 4).
  • The paper states “Object Permanence with no storage cost” via an invariance condition (Section 2.3):
  • S_t^ψ ≡ S_{t+k}^ψ if location(t) = location(t+k) (Eq. 2.1).
  • In practical system terms (as used across examples in Section 3), the seed is the stable identifier that ties together:
  • deterministic structural generation (layout, metadata, node IDs), and
  • deterministic/stable LLM generation (same prompt inputs + fixed seed).
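The seed derivation can be sketched as follows. This is a minimal illustration under stated assumptions: the hash (32-bit FNV-1a), the generator (the Park-Miller/MINSTD Lehmer generator), the BIOMES list, and the Site shape are all choices made for this example, not details given in the paper. The sketch covers only the structural side; passing the same seed to an LLM sampling API is outside its scope.

```typescript
// Hypothetical sketch: coordinate -> seed -> reproducible structural attributes.

// 32-bit FNV-1a over UTF-16 code units (matches byte-wise FNV-1a for ASCII keys).
function hashToSeed(key: string): number {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    // Multiply by the FNV prime 16777619 via shifts, reduced modulo 2^32.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return h;
}

function coordSeed(x: number, y: number): number {
  return hashToSeed(`${x},${y}`);
}

// Lehmer (MINSTD) generator: small enough that products stay exact in doubles.
function makeRng(seed: number): () => number {
  let state = (seed % 2147483646) + 1; // in [1, 2147483646]
  return () => {
    state = (state * 48271) % 2147483647;
    return (state - 1) / 2147483646; // in [0, 1)
  };
}

const BIOMES = ["desert", "ocean", "forest", "tundra"]; // illustrative values

interface Site {
  seed: number;
  biome: string;
  hazardLevel: number; // 1..5
}

// Nothing is stored: the same (x, y) always rebuilds the same Site.
function generateSite(x: number, y: number): Site {
  const seed = coordSeed(x, y);
  const rng = makeRng(seed);
  const biome = BIOMES[Math.floor(rng() * BIOMES.length)];
  const hazardLevel = 1 + Math.floor(rng() * 5);
  return { seed, biome, hazardLevel };
}

const firstVisit = generateSite(3, 7);
const revisit = generateSite(3, 7);
```

`firstVisit` and `revisit` are field-for-field identical without any lookup, which is the structural half of the invariance condition. The LLM half additionally depends on the provider honoring a fixed seed and on the prompt and model configuration staying unchanged.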

Graceful degradation: operating under latency/failure constraints

  • The paper explicitly assumes calling an LLM “for every frame” is too expensive (Section 2.4).
  • It proposes a Fidelity Slider (Section 2.4):
  • High fidelity: generate bespoke content in real time via LLM calls.
  • Medium fidelity: retrieve cached content.
  • Base fidelity: fall back to deterministic templates (Section 2.4).
  • Because S^ϕ is code-governed, the environment remains functional if S^ψ generation is slow/unavailable; richness drops but logical continuity is preserved (Section 2.4).
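One way to picture the three fidelity levels is as a fallback chain. The sketch below is an assumption-laden illustration: the function name describeSite, the Described shape, and the plain-object cache are invented for this example, and the generator is synchronous only to keep the sketch small (a real LLM call would be asynchronous and would more likely check the cache before calling the model).

```typescript
// Hypothetical sketch of a high -> medium -> base fallback chain.
type Fidelity = "high" | "medium" | "base";

interface Described {
  text: string;
  fidelity: Fidelity;
}

// `generate` stands in for the LLM call and may throw (timeout, missing key,
// schema violation). `template` is the deterministic last resort.
function describeSite(
  key: string,
  generate: (key: string) => string,
  cache: Record<string, string>,
  template: (key: string) => string
): Described {
  try {
    const text = generate(key);
    cache[key] = text; // remember bespoke content for later outages
    return { text, fidelity: "high" };
  } catch {
    if (Object.prototype.hasOwnProperty.call(cache, key)) {
      return { text: cache[key], fidelity: "medium" };
    }
    return { text: template(key), fidelity: "base" };
  }
}

const descriptionCache: Record<string, string> = {};
const fallbackTemplate = (k: string) => `Sector ${k}: survey data unavailable.`;
const failingModel = (_k: string): string => {
  throw new Error("LLM unavailable");
};
const workingModel = (k: string) => `A vivid account of sector ${k}.`;

const coldOutage = describeSite("A1", failingModel, descriptionCache, fallbackTemplate);
const liveCall = describeSite("A1", workingModel, descriptionCache, fallbackTemplate);
const warmOutage = describeSite("A1", failingModel, descriptionCache, fallbackTemplate);
```

The three calls land on base, high, and medium fidelity respectively. In every case the caller gets a usable string, so the physics loop never blocks on the model.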

Technical stack assumptions (what is specified vs. not)

  • The paper emphasizes using a modern web stack (Section 2.5; multiple examples in Section 3):
  • Type safety: TypeScript is highlighted as key for typed interfaces (Section 2.5).
  • Delivery: HTTP streaming is mentioned for real-time text delivery (Section 2.5).
  • Deployment: serverless architecture is mentioned as enabling “infinite scaling” without persistent infra management (Section 2.5).
  • Several example implementations specify concrete frontend technologies:
  • TypeScript + React 19 + Tailwind CSS (AI Spire; AI Alchemy) (Sections 3.3–3.4).
  • WebGL rendering (Cosmic Voyager) (Section 3.5).
  • Notably absent (and should not be invented): training hyperparameters such as optimizer, learning rate, batch size, token counts, compute budget, or training hardware. The paper describes calling existing LLM APIs rather than training a model (“Gemini Flash” and “Gemini 2.5 flash” appear in figures/captions), and the provided content gives no training configurations (e.g., Sections 3.3–3.5, Figure 7 caption, Figure 35 caption).

End-to-end “pipeline diagram in words” (explicit sequence)

A typical WWM step, abstracted across the demos, is:

  1. The user performs an interaction (clicking a beacon on a globe, selecting a planet node, resolving a collision in a sandbox, turning a page, etc.) (e.g., Figures 5–6, 9, 14).
  2. The system computes/updates deterministic physics state S^ϕ:
     • It may compute structural metadata (IDs, coordinates, biome type sets, inventory bounds) via code.
     • It may compute a deterministic seed via hashing h(·) (Section 2.3; Figure 4).
  3. The system constructs a typed JSON “request” for the LLM, including:
     • seed/metadata/context, and
     • a JSON schema/type contract specifying allowed fields and values (Section 2.2; examples described in Sections 3.2–3.4).
  4. The LLM produces JSON that must validate against the schema:
     • If valid, it becomes (part of) S^ψ and is rendered.
     • If invalid or unavailable, the system falls back to cached outputs or templates (Section 2.4; also explicitly described in AI Spire and Cosmic Voyager examples).
  5. The rendered UI exposes the combined result, and the loop continues (Figure 3).

Worked micro-example (single input → output walk-through)

Consider the Galaxy Travel Atlas selection loop (Section 3.2; Figure 6) as a concrete instance:

  1. Input action (a_t): The user clicks a planetary node in the galaxy map (Figure 6).
  2. Physics update (S^ϕ):
     • Deterministic code (universe.ts is named) procedurally generates the galaxy layout, star lanes, and a planet’s stable identifier and symbolic attributes (sector label, physical type, risk profile) (Section 3.2).
     • Hashing ensures revisiting the coordinate yields the same structural planet identity and attributes without DB lookups (Section 3.2 referencing Figure 4’s idea).
  3. Imagination generation (S^ψ):
     • The LLM is queried to generate a mission brief, hazards, lore, and “field logs,” but only as JSON conforming to strict TypeScript interfaces such as Planet (Section 3.2; Figure 6).
  4. Render:
     • The UI displays the mission brief/profile cards and a narrative hook in a stable layout (Figures 27–34 in the appendix gallery illustrate the consistent UI with varying content).
  5. Persistence/degradation:
     • If the LLM fails, the system falls back to template-based descriptions, and caching reduces inference costs (Section 3.2).

This illustrates the paper’s core claim: the LLM can vary narrative texture while code preserves navigational continuity and invariants.

How the principles instantiate in the example systems (Section 3)

  • Infinite Travel Atlas (Section 3.1; Figure 5):
  • Physics: user-selected geographic coordinates are hashed to seeds; code grounds “semantic context” and computes a valid subset of themes before the LLM selects and writes within that constraint (Section 3.1).
  • Two-stage procedural generation strategy is named:
    • worldPromptService.ts initializes experience with query templates.
    • proceduralBeaconService.ts generates stable beacons with identifiers/metadata on interaction (Section 3.1).
  • Imagination: LLM generates a structured destination guide and itinerary themed to the location (Figure 5; Section 3.1).
  • Demonstrations are qualitative: Nairobi/Honolulu/Rio examples are cited with thematic consistency and figure references (Section 3.1; Figures 21–23).

  • Galaxy Travel Atlas (Section 3.2; Figure 6):

  • Physics: procedural noise functions in universe.ts dictate galaxy layout and star lanes; stable planet IDs and attributes are code-derived; reseeding makes reachable galaxies unbounded (Section 3.2).
  • Imagination: LLM generates mission briefs and narrative content under typed JSON constraints; fallback to templates; caching keyed by procedural seed (Section 3.2).

  • AI Spire (card roguelike) (Section 3.3; Figures 7–8):

  • Physics: deterministic rules engine maintains HP/energy/deck/status/enemy intent; effect codes are translated into deterministic execution (Figure 7 caption; Section 3.3).
  • Imagination: LLM (“Gemini Flash” in Figure 7 caption) generates cards/relics as schema-structured JSON (name, description, effect codes).
  • Control surface: “The Wish” allows free-form user prompt that is translated into mechanics but still restricted to controlled vocabulary/effect codes (Section 3.3).
  • Robustness: schema validation via response schema; fallback to stored samples if API missing/fails (Section 3.3).
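The effect-code idea for AI Spire can be sketched as below. The vocabulary (DAMAGE, BLOCK, HEAL), the 0..99 amount bound, and every type and function name are assumptions made for illustration; the paper states only that cards carry a name, a description, and effect codes that deterministic code executes.

```typescript
// Hypothetical sketch: generated cards are data; only known effect codes execute.
type EffectCode = "DAMAGE" | "BLOCK" | "HEAL";
const EFFECT_CODES: EffectCode[] = ["DAMAGE", "BLOCK", "HEAL"];

interface CardEffect { code: EffectCode; amount: number; }
interface CardSpec { name: string; description: string; effects: CardEffect[]; }
interface Combat { playerHp: number; playerMaxHp: number; playerBlock: number; enemyHp: number; }

function isEffect(value: unknown): value is CardEffect {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const code = v["code"];
  const amount = v["amount"];
  return (
    typeof code === "string" &&
    EFFECT_CODES.indexOf(code as EffectCode) !== -1 &&
    typeof amount === "number" &&
    Math.floor(amount) === amount && // integer check; also rejects NaN
    amount >= 0 &&
    amount <= 99
  );
}

// Returns null for malformed JSON, missing fields, or unknown effect codes.
function parseCard(raw: string): CardSpec | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const v = parsed as Record<string, unknown>;
  const name = v["name"];
  const description = v["description"];
  const effects = v["effects"];
  if (typeof name !== "string" || typeof description !== "string") return null;
  if (!Array.isArray(effects) || !effects.every(isEffect)) return null;
  return { name, description, effects };
}

// Deterministic execution: the switch is the only place effects touch state.
function applyCard(c: Combat, card: CardSpec): Combat {
  const next: Combat = { ...c };
  for (const e of card.effects) {
    switch (e.code) {
      case "DAMAGE":
        next.enemyHp = Math.max(0, next.enemyHp - e.amount);
        break;
      case "BLOCK":
        next.playerBlock += e.amount;
        break;
      case "HEAL":
        next.playerHp = Math.min(next.playerMaxHp, next.playerHp + e.amount);
        break;
    }
  }
  return next;
}

const generated =
  '{"name":"Ember Guard","description":"Strike, then brace.",' +
  '"effects":[{"code":"DAMAGE","amount":6},{"code":"BLOCK","amount":5}]}';
const card = parseCard(generated);
const before: Combat = { playerHp: 30, playerMaxHp: 40, playerBlock: 0, enemyHp: 20 };
const after = card ? applyCard(before, card) : before;
```

A card whose effect list contains a code outside the vocabulary is rejected as a whole, so the model can be inventive with names and flavor text but cannot introduce a new mechanic.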

  • AI Alchemy (falling-sand cellular automata) (Section 3.4; Figures 9–10):

  • Physics: React + Canvas grid simulates gravity/flow/diffusion; symbolic automata updates depend on physical categories like POWDER, LIQUID, GAS (Section 3.4).
  • Imagination: when two elements collide and no rule exists, the system queries the LLM for a schema-constrained reaction outcome; result is cached and integrated into the simulation loop (Figure 9; Section 3.4).
  • Optional “AI Supervisor” monitors global canvas and perturbs (e.g., rainfall) to shape emergent behavior (Section 3.4; Figure 10 overlay mention).
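The "query once, then cache" behavior for unknown collisions might look like the following sketch. The ReactionTable class, the Reaction shape, and the pair-key scheme are hypothetical, and the category list contains only the three examples named above, so the real set may be larger. The `propose` callback stands in for the LLM query.

```typescript
// Hypothetical sketch of cached, schema-checked reactions for a falling-sand sandbox.
type Category = "POWDER" | "LIQUID" | "GAS";
const CATEGORIES: Category[] = ["POWDER", "LIQUID", "GAS"];

interface Reaction {
  product: string;
  category: Category; // tells the simulation which movement rules apply
}

// Order-independent key so (a, b) and (b, a) share one rule.
// Assumes element names do not contain "|".
function pairKey(a: string, b: string): string {
  return a < b ? a + "|" + b : b + "|" + a;
}

function isReaction(value: unknown): value is Reaction {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const category = v["category"];
  return (
    typeof v["product"] === "string" &&
    typeof category === "string" &&
    CATEGORIES.indexOf(category as Category) !== -1
  );
}

class ReactionTable {
  private rules: Record<string, Reaction> = {};
  llmCalls = 0; // exposed only so the sketch can show the cache working

  constructor(private propose: (a: string, b: string) => unknown) {}

  react(a: string, b: string): Reaction | null {
    const key = pairKey(a, b);
    if (Object.prototype.hasOwnProperty.call(this.rules, key)) {
      return this.rules[key]; // known rule: no model call
    }
    this.llmCalls += 1;
    const candidate = this.propose(a, b);
    if (!isReaction(candidate)) return null; // invalid output: no rule is learned
    this.rules[key] = candidate;
    return candidate;
  }
}

const table = new ReactionTable(() => ({ product: "steam", category: "GAS" }));
const firstCollision = table.react("water", "fire");
const repeatCollision = table.react("fire", "water");
```

After both collisions `table.llmCalls` is 1: the second lookup hits the cached rule even though the argument order is reversed. Once a rule is cached it behaves like any hand-written rule in the simulation loop.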

  • Cosmic Voyager (3D solar system) (Section 3.5; Figures 11–12, 35–37):

  • Physics: WebGL engine manages scene/state, scripted motion (preset orbital speeds, compressed distances) and navigation modes (orbit, pilot, surface walk) (Section 3.5; Figure 12).
  • Imagination: LLM generates (i) general sidebar descriptions and (ii) view-dependent subtitle narration refreshed every 30 seconds; fallback to bundled descriptions when API unavailable (Section 3.5; Figure 11).
  • Note on terrain: the main text says “generated terrain” and the appendix caption claims “terrain is generated by an LLM” (Figure 36 caption). The provided content does not specify the exact mechanism or schema for terrain generation beyond these statements.

  • WWMPedia (web as knowledge world) (Section 3.6; Figure 13):

  • Physics: code-defined retrieval primitives (search/open/extract evidence), sanitization, and deterministic HTML renderer enforcing layout (title, TOC, sections, references) (Section 3.6).
  • Imagination: LLM composes a Wikipedia-like article with citations linked to retrieved sources; UI shows provenance (“LLM generated” + timestamp in Figure 13c) (Section 3.6).

  • Bookshelf (infinite reader / long-form fiction) (Section 3.7; Figure 14):

  • Physics: code defines session state, page-turn semantics, page length, streaming boundaries, and which fields persist across turns (Section 3.7).
  • Imagination: LLM generates book cards and page continuations conditioned on typed tag constraints (interface-style tags for deterministic UI theming via CSS; literary tags for genre/tone/pacing) (Figure 14; Section 3.7).
  • The paper emphasizes that long-horizon generation becomes a state management problem; they keep carried state “typed and small” (Section 3.7).
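A loose illustration of "typed and small" carried state follows. All field names, both size limits, and the recap strategy (keeping only the trailing characters of the text so far) are assumptions invented for this sketch; the paper does not describe Bookshelf's state at this level of detail. The `write` callback stands in for the LLM.

```typescript
// Hypothetical sketch: the state carried between page turns is typed and bounded.
interface ReaderState {
  bookId: string;
  pagesRead: number; // number of pages generated so far
  genre: string; // literary tag, fixed for the session
  tone: string; // literary tag, fixed for the session
  recap: string; // bounded context carried forward, at most MAX_RECAP_CHARS
}

const MAX_RECAP_CHARS = 200;
const MAX_PAGE_CHARS = 600;

type PageWriter = (state: ReaderState) => string; // stand-in for the LLM call

function turnPage(
  state: ReaderState,
  write: PageWriter
): { state: ReaderState; pageText: string } {
  // Code, not the model, enforces page length.
  const pageText = write(state).slice(0, MAX_PAGE_CHARS);
  // Crude stand-in for summarization: keep only the most recent characters.
  const recap = (state.recap + " " + pageText).slice(-MAX_RECAP_CHARS);
  return {
    state: { ...state, pagesRead: state.pagesRead + 1, recap },
    pageText,
  };
}

const verboseWriter: PageWriter = (s) =>
  `Page ${s.pagesRead + 1} of a ${s.tone} ${s.genre} tale. ` + "word ".repeat(300);

let session: ReaderState = {
  bookId: "book-001",
  pagesRead: 0,
  genre: "mystery",
  tone: "wry",
  recap: "",
};
let lastPage = "";
for (let i = 0; i < 5; i++) {
  const turned = turnPage(session, verboseWriter);
  session = turned.state;
  lastPage = turned.pageText;
}
```

However long the session runs, the prompt context derived from `session` stays the same size, which is what turns long-horizon continuity into a state management problem rather than a context-length problem.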

4. Key Insights and Innovations

  • (1) A “middle-ground” world model abstraction built on web stacks
  • Novelty relative to the paper’s framing: instead of “DB-bounded web apps” or “fully generative world models,” WWMs explicitly separate deterministic web code from generative content, aiming to combine scalability with controllability (Abstract; Introduction; Figure 1).
  • Significance: it reuses mature web engineering practices (testing/versioning/deployment) while enabling open-ended expansion (Introduction; Section 2.5).

  • (2) Separation of Concerns as a first-class architectural rule: Physics vs. Imagination

  • Difference: many generative environments conflate state transitions and narration inside the model; WWMs enforce deterministic state updates first, then narration conditioned on state (Section 2.1; Figure 3).
  • Significance: this is the paper’s primary mechanism for logical consistency and debuggability in persistent settings.

  • (3) Typed Interfaces (schemas) as the “common language” between code and LLMs

  • Difference: rather than treating the LLM as producing free-form text (or opaque latent vectors), the paper insists on JSON outputs that satisfy strict interface contracts (Section 2.2).
  • Significance: enables validation, reduces runtime errors, and makes content structurally compatible with a deterministic engine (Section 2.2; AI Spire’s contract-and-validation emphasis in Section 3.3 and Figure 7).

  • (4) Deterministic hashing to achieve “infinite” persistent worlds without storing them

  • Difference: instead of storing every generated object/page in a database, the world can be reconstructed deterministically from identifiers/coordinates by hashing into a seed that fixes generation randomness (Section 2.3; Figure 4; Eq. 2.1).
  • Significance: supports object permanence and revisitation at scale, aligning with the “unlimited context” goal while avoiding unbounded storage.

  • (5) Graceful Degradation via fidelity levels

  • Difference: the paper treats LLM availability/latency as a first-order systems constraint and designs fallbacks (cache/templates) so the environment remains functional (Section 2.4; also concretely in AI Spire and Cosmic Voyager).
  • Significance: this moves the concept closer to deployable web applications rather than research-only demos.

5. Experimental Analysis

  • Evaluation methodology (as provided)
  • The paper primarily uses implemented demonstrations across multiple domains as evidence (Section 3; Figures 5–14 and appendix Figures 15–37).
  • Evidence is qualitative/observational:

    • Travel Atlas: “empirical observation” of thematic consistency across Nairobi/Honolulu/Rio and referenced figures (Section 3.1; Figures 21–23).
    • Galaxy Atlas: visual traversal snapshots and examples of diverse nodes with stable UI structure (Section 3.2; Figures 25–34).
    • AI Spire / AI Alchemy: described gameplay loops, schema validation, caching, and fallback behavior (Sections 3.3–3.4; Figures 7–10).
    • Cosmic Voyager: described interaction modes and narration refresh; fallback behavior (Section 3.5; Figures 11–12, 35–37).
    • WWMPedia / Bookshelf: described retrieval-render pipeline and tag-conditioned long-form generation (Sections 3.6–3.7; Figures 13–14).
  • Metrics, datasets, baselines

  • The provided content does not include standard quantitative benchmarks, numerical metrics, or head-to-head comparisons against baselines (e.g., no success rates, latency distributions, cost, consistency error rates, user studies, or agent task performance).
  • As a result, claims about scalability/robustness are supported mainly by architecture arguments and the existence of working demos, not by measured performance.

  • Main results (what is concretely shown)

  • “Results” are best interpreted as capability demonstrations:

    • Stable structured UI with variable generated content across many nodes (Galaxy Travel Atlas, Figures 27–34).
    • Schema validation + fallback to stored samples to keep gameplay smooth when LLM calls fail (AI Spire, Section 3.3).
    • Reaction caching for new element collisions to expand a sandbox over time (AI Alchemy, Figure 9; Section 3.4).
    • Provenance surfaced in UI for generated encyclopedic pages with citations (WWMPedia, Figure 13).
  • Do the experiments convincingly support the claims?

  • They convincingly support the narrow claim that it is feasible to build multiple WWMs on a real web stack using the stated principles (Section 3).
  • They only weakly support stronger claims like “scalable” or “hallucination-free” in a measurable sense, because the provided content does not report systematic evaluations of:

    • state inconsistency rates,
    • schema violation rates,
    • determinism under seed control,
    • caching efficacy,
    • latency/cost at scale,
    • user study outcomes, or
    • agent task performance improvements.
  • Ablations / robustness checks / failure cases

  • The closest to robustness checks are the described fallback pathways:
    • “Fidelity Slider” concept (Section 2.4).
    • AI Spire’s schema validation + stored samples fallback (Section 3.3).
    • Galaxy Travel Atlas template fallback and file-backed caching keyed by seed (Section 3.2).
    • Cosmic Voyager bundled descriptions fallback without API key (Section 3.5).
  • The paper does not provide formal ablation studies (e.g., removing typed interfaces or hashing and measuring effect).

6. Limitations and Trade-offs

  • Limited quantitative validation
  • The architecture is motivated and illustrated well, but the provided content lacks quantitative evidence about performance, reliability, or “hallucination-free” behavior under stress (Section 3 is demo-heavy; no benchmark section is present in the provided excerpt).

  • Typed interfaces constrain creativity and increase engineering work

  • While schemas prevent structural errors (Section 2.2), they also require:

    • upfront design of types and controlled vocabularies (explicit in AI Spire’s effect code system, Section 3.3),
    • validators and translators from generated specs to executable effects (Figure 7; Section 3.3),
    • ongoing maintenance as the world’s capabilities expand.
  • Deterministic generation depends on careful control of inputs

  • The determinism/object permanence argument assumes the generation process is fully determined by coordinate → seed and fixed generation setup (Section 2.3; Figure 4).
  • In practice (within the paper’s own framing), you must ensure prompts, schemas, and any model configuration that affects sampling remain stable; otherwise revisits might drift. The paper does not detail all the controls needed beyond “seed fixes sampling randomness.”

  • Graceful degradation implies variable user experience

  • The fidelity slider design explicitly accepts that narrative richness may degrade when LLM calls are slow/unavailable (Section 2.4).
  • This is a trade-off: logical continuity is preserved, but semantic richness becomes resource-dependent.

  • Scope of “physics” is application-defined

  • WWMs rely on the developer to decide what belongs in S^ϕ (hard invariants) vs. S^ψ (generated layer) (Section 2.1).
  • If too much is left to S^ψ, you risk inconsistency; if too much is forced into S^ϕ, you lose open-endedness and increase coding burden. The paper gives principles but not a formal method for this partition.

  • Security/safety boundaries are implied but not deeply analyzed

  • The paper emphasizes “clear security boundaries” as a strength of web frameworks (Introduction), and treats the LLM as a “microservice” constrained by schemas (Section 3.2).
  • However, the provided content does not include a detailed threat model, prompt-injection analysis (especially relevant for WWMPedia’s web retrieval setting), or formal sandboxing guarantees beyond schema validation.

7. Implications and Future Directions

  • How this changes the landscape (within the paper’s framing)
  • WWMs suggest a reframing: instead of building “world models” purely inside learned latent spaces, developers can treat web infrastructure + typed APIs + deterministic procedural generation as a scalable substrate for persistent agent environments (Abstract; Section 2.5; Conclusion).
  • This could make agent environments more like deployable products: testable, versioned, observable services rather than monolithic generative simulations.

  • Research directions suggested by the paper’s systems

  • Better interfaces between deterministic engines and generative modules:
    • richer schema languages,
    • more expressive yet safe controlled vocabularies (e.g., extending effect-code DSLs like AI Spire’s) (Section 3.3).
  • State management for long-horizon generation
    • Bookshelf explicitly highlights that continuity is largely a state management problem, suggesting future work on compact typed state representations and plot/thread tracking (Section 3.7).
  • Procedural determinism beyond text

    • Cosmic Voyager hints at extending the approach to 3D contexts with view-conditioned narration and possibly generated terrain/assets (Section 3.5; Figure 36 caption), motivating future work on typed interfaces for multimodal generation outputs.
  • Practical applications / downstream use cases (as evidenced by examples)

  • Travel/education explorers grounded in real coordinates (Infinite Travel Atlas, Section 3.1).
  • Procedural mission-based sci-fi exploration (Galaxy Travel Atlas, Section 3.2).
  • Games where content expands but mechanics remain deterministic and safe (AI Spire; AI Alchemy, Sections 3.3–3.4).
  • Knowledge interfaces that synthesize structured pages from web retrieval with explicit provenance (WWMPedia, Section 3.6).
  • Infinite reading / interactive fiction with UI and style constraints (Bookshelf, Section 3.7).

  • Repro/Integration Guidance (when to prefer WWMs, based on the provided paper)

  • Prefer a WWM when you need:
    • persistent state invariants that must not be violated (use code for S^ϕ) (Section 2.1),
    • structured, inspectable outputs suitable for rendering/execution (use typed JSON interfaces) (Section 2.2),
    • unbounded exploration without storing everything (use hashing/seeded procedural generation) (Section 2.3),
    • deployable reliability under LLM latency/outage (use caching/templates; fidelity levels) (Section 2.4).
  • Prefer a more conventional fixed web app when:
    • the world is naturally bounded and you want strict curated content backed by a database (Figure 1’s left framing).
  • Prefer a fully generative world model when:
    • the primary goal is unconstrained generative richness and you can tolerate reduced global controllability (Figure 1’s right framing), noting the paper positions this as harder to debug/control.