MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
ArXiv: 2604.08516
Pitch
MolmoWeb introduces a family of fully open multimodal web agents that navigate websites using only visual screenshots—no HTML or accessibility trees required—democratizing a capability previously locked behind proprietary systems. Alongside the model, the authors release MolmoWebMix, a large-scale training mixture of over 100K synthetic and 30K human demonstrations, enabling the 8B model to outperform GPT-4o on key benchmarks and achieve 94.7% success on WebVoyager with test-time scaling.
1. Executive Summary
This paper introduces MolmoWeb, a family of fully open multimodal web agents (4B and 8B parameters) that navigate and execute tasks on the web using only visual screenshots—requiring no access to HTML, accessibility trees, or specialized APIs. The authors also release MolmoWebMix, a large-scale training dataset combining over 100K synthetic task trajectories, 30K+ human demonstrations, atomic skill trajectories, and 10.5M GUI perception examples. MolmoWeb-8B achieves state-of-the-art results among open-weight models, outperforming larger proprietary systems like GPT-4o with set-of-marks prompting on benchmarks including WebVoyager (78.2% vs. 65.1%), and demonstrating substantial gains from test-time scaling via parallel rollouts (94.7% pass@4 on WebVoyager).
2. Context and Motivation
The Problem: Closed Proprietary Web Agents Limit Scientific Progress
Web agents—autonomous systems that navigate and execute tasks on websites—have the potential to transform how people interact with the digital world, particularly for users facing barriers due to disability, inaccessible design, or limited digital literacy. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes. Frontier labs like OpenAI and Google offer computer-use and browser-use capabilities as hosted services, but provide limited disclosure of training data, model architectures, and full training procedures.
This opacity creates several concrete problems the authors identify (Section 1):
- Hindered scientific understanding: Without access to training data and recipes, researchers cannot understand what actually drives performance, making it impossible to build on prior work systematically.
- Limited reproducibility: Proprietary systems cannot be replicated or audited, violating core scientific norms.
- Accountability concerns: Autonomous agents operating on the open web require transparency for trustworthy deployment, but closed systems are unauditable by design.
- Community-driven progress is blocked: The research community cannot collectively improve systems they cannot inspect or replicate.
Prior Approaches and Their Limitations
The paper situates itself relative to two broad categories of prior work (Section 5):
LLM-driven web agents using structured representations. Many prior systems use large language models operating on language representations of web pages derived from the Document Object Model (DOM), such as accessibility trees. Agents of this kind (e.g., those evaluated on Mind2Web and WebArena) provide the LLM with serialized accessibility tree content, where each interactive element has a unique numeric identifier (bid). The LLM predicts actions by referencing these identifiers.
Limitations: The authors argue that DOM-based representations are brittle because they "vary significantly across websites, frameworks, and even minor page updates" (Section 1). They can be incomplete or misleading for dynamically rendered content, and they consume substantial tokens—accessibility tree inputs can require "tens of thousands of tokens per page" (Section 1).
Multimodal web agents. An increasing number of systems process screenshots to produce actions. Proprietary systems like Gemini computer-use and OpenAI computer-use operate this way but are closed. A handful of open-weight models exist (Fara-7B, UI-TARS family, Holo1-7B, OpenCUA), but the authors note these typically do not provide fully open training data or transparent training and evaluation pipelines.
Limitations: The open-weight models often do not release training data, making it impossible to understand what supervision signals drive performance. Some prior work like Fara uses distillation from proprietary vision-based web agents, which introduces dependencies on closed systems.
Set-of-marks (SoM) prompting. This approach annotates screenshots with numeric labels corresponding to elements in the accessibility tree, providing visual grounding for the model.
Limitations: SoM agents still require access to the accessibility tree to generate annotations, so they don't operate purely visually. The authors show that MolmoWeb outperforms the SoM agent built on GPT-4o and comes close to the one built on o3 on WebVoyager (78.2% vs. 65.1% and 79.3% respectively, Table 4), suggesting that data quality and targeted training can partly compensate for raw scale.
How This Paper Positions Itself
The authors explicitly state: "We believe agents for the open web should be built in the open" (Section 1). Their contribution is a complete, transparent, and reproducible foundation comprising:
- MolmoWebMix: A training dataset with diverse task demonstrations and GUI perception data, fully open-sourced.
- MolmoWeb: Model weights for 4B and 8B multimodal web agents, trained via supervised fine-tuning.
- Code and evaluation harness: A unified evaluation framework for reproducibility.
The deliberate design choice is vision-only operation: MolmoWeb agents "operate on any website using only the visual interface that human users see" (Section 1). This mirrors human perception, avoids DOM brittleness, and reduces token consumption. The authors argue that a single screenshot provides a "compact, information-rich representation" compared to structured alternatives.
3. Technical Approach
3.1 Reader Orientation
MolmoWeb is a vision-language model fine-tuned to serve as a web agent: given a task instruction and a screenshot of the current webpage, it predicts the next browser action (click, type, scroll, etc.) along with a natural language thought explaining the reasoning.
3.2 Big-Picture Architecture
The system has four major components:
- Base VLM (Molmo2): A pretrained vision-language model (4B or 8B parameters) that processes interleaved images and text, serving as the foundation.
- Observation encoder: At each step, captures the current webpage screenshot and compiles the task instruction plus the history of the last 10 actions as text context.
- Action predictor: The model outputs a JSON object containing a natural language "thought" (rationale) followed by a structured action specifying the browser operation.
- Training data mixture (MolmoWebMix): A curated combination of task trajectories, atomic skill trajectories, and GUI perception data used for supervised fine-tuning.
Information flows as follows: The agent receives (screenshot, instruction, action history) → the VLM encodes this input → predicts (thought, action) → the action executes in the browser → a new screenshot is captured → the loop repeats until task completion or maximum steps.
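The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released harness: `run_episode`, `capture_screenshot`, `predict`, and `execute` are hypothetical names, and the action dictionary layout is an assumption. Only the 10-action history window, the step limit, and the `send_msg_to_user` completion signal come from the text.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

HISTORY_LEN = 10  # the text states that the last 10 actions are kept as context


def run_episode(
    instruction: str,
    capture_screenshot: Callable[[], bytes],
    predict: Callable[[bytes, str, List[Dict]], Tuple[str, Dict]],
    execute: Callable[[Dict], None],
    max_steps: int = 30,
) -> List[Dict]:
    """Run one observe -> predict -> act episode and return all predicted actions."""
    history: Deque[Dict] = deque(maxlen=HISTORY_LEN)  # oldest actions fall off
    actions: List[Dict] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        # The thought is returned alongside the action; this sketch ignores it.
        _thought, action = predict(screenshot, instruction, list(history))
        actions.append(action)
        history.append(action)
        if action["name"] == "send_msg_to_user":  # task completion signal
            break
        execute(action)
    return actions
```

Using a bounded `deque` keeps the prompt size constant no matter how long the episode runs.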
3.3 Roadmap for the Deep Dive
The technical explanation proceeds in the following order:
- Observation and action space: What the agent sees and what actions it can take—this defines the interface between agent and browser.
- MolmoWebMix dataset composition: The data categories and how each is generated. This is the core contribution enabling training.
- Trajectory generation pipelines: The detailed mechanisms for synthetic and human trajectory collection.
- Training procedure: How supervised fine-tuning combines all data sources with specific mixture ratios.
- Test-time scaling: How parallel rollouts with best-of-N selection improve performance at inference time.
3.4 Detailed, Sentence-Based Technical Breakdown
This is an empirical systems paper whose core contribution is a complete open-source training pipeline for web agents, combining diverse data sources into a unified supervised fine-tuning procedure.
Observation Space
At each timestep \(t\), the agent receives three inputs (Section 3.2):
- Current webpage screenshot: A visual capture of the browser viewport.
- Task instruction: A natural language description of the goal (e.g., "Find the cheapest nonstop flights from Seattle to Tokyo").
- Action history: The sequence of actions taken in the 10 prior steps, along with the URL and title of the current page.
The action history provides temporal context, allowing the agent to understand what it has already done. This design choice mirrors human working memory during task execution.
Action Space
The model predicts actions as structured JSON objects containing both a "thought" (natural language rationale) and an action specification (Section 3.2, Table 3). The complete action space includes:
- Navigation actions: goto(url), go_back(), new_tab(), tab_focus(index)
- Mouse actions: mouse_click(x, y, ...), mouse_drag_and_drop(...), hover_at(x, y)
- Scroll actions: scroll(delta_x, delta_y), scroll_at(x, y, dx, dy)
- Keyboard actions: keyboard_type(text), keyboard_press(key)
- Control actions: noop(wait_ms) (wait for page load), send_msg_to_user(msg) (task completion signal)
Spatial coordinates are normalized to [0, 100] with 2 decimal points during training, then denormalized to viewport pixel coordinates at execution time (Section 3.2). This normalization makes the action space resolution-independent, allowing the model to generalize across different screen sizes.
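The coordinate convention can be illustrated with a small sketch. The function names and the choice to clamp to the last valid pixel index are assumptions for illustration. The [0, 100] range and the two-decimal rounding come from the text.

```python
from typing import Tuple


def normalize(x_px: float, y_px: float, width: int, height: int) -> Tuple[float, float]:
    """Map viewport pixels to the [0, 100] range with 2 decimal places."""
    return round(x_px / width * 100.0, 2), round(y_px / height * 100.0, 2)


def denormalize(x_norm: float, y_norm: float, width: int, height: int) -> Tuple[int, int]:
    """Map [0, 100] coordinates back to integer viewport pixels."""
    x = round(x_norm / 100.0 * width)
    y = round(y_norm / 100.0 * height)
    # Clamp so that a prediction of exactly 100 still lands inside the viewport.
    return min(max(x, 0), width - 1), min(max(y, 0), height - 1)
```

For a 1280x720 viewport, the normalized point (50.0, 50.0) maps to pixel (640, 360), and the same normalized point maps to (960, 540) at 1920x1080, which is what makes the action space resolution-independent.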
The thought component serves multiple purposes: it provides interpretability (users can see the agent's reasoning), and the authors note it can sometimes function as working memory for tracking information across long trajectories (Section 6).
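As an illustration of the thought-plus-action output, the sketch below parses and validates one model response. The exact JSON schema is not given in this summary, so the `thought`/`action`/`name` key layout and the `parse_step` helper are assumptions. The action names are the ones listed above.

```python
import json
from typing import Dict, Tuple

# Action names as listed in the action space above.
ACTION_NAMES = {
    "goto", "go_back", "new_tab", "tab_focus",
    "mouse_click", "mouse_drag_and_drop", "hover_at",
    "scroll", "scroll_at",
    "keyboard_type", "keyboard_press",
    "noop", "send_msg_to_user",
}


def parse_step(raw: str) -> Tuple[str, Dict]:
    """Parse one model output into (thought, action); assumed key layout."""
    obj = json.loads(raw)
    thought = obj["thought"]
    action = obj["action"]
    if action.get("name") not in ACTION_NAMES:
        raise ValueError(f"unknown action: {action.get('name')!r}")
    return thought, action


example = (
    '{"thought": "The search box is at the top of the page.", '
    '"action": {"name": "mouse_click", "x": 42.5, "y": 7.25}}'
)
thought, action = parse_step(example)
```

Rejecting unknown action names before execution keeps a malformed prediction from reaching the browser.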
MolmoWebMix Dataset Composition
The training dataset combines three broad categories (Section 2, Table 2): task trajectories, atomic skill trajectories, and GUI perception data. The breakdown below groups atomic skill trajectories with the task trajectories.
Task trajectories (MolmoWebMix-Traj): 278.5K trajectories totaling 2.2M steps across 2.6K domains, averaging 13.2 steps per trajectory. This comprises 80% of the training mixture. Sources include:
- Synthetic AxTree single-agent: 70K trajectories from an LLM operating on accessibility trees (35% mixture ratio)
- Synthetic AxTree multi-agent: 35K trajectories from a planner-operator-verifier system (18% mixture ratio)
- AxTree atomic skills: 5.5K trajectories for targeted skills (2% mixture ratio)
- Node traversal: 16K deterministic navigation trajectories (2% mixture ratio)
- Human demonstrations: 36K trajectories (18% mixture ratio)
- Human skill segments: 116K atomic skill segments extracted from human trajectories (5% mixture ratio)
GUI perception data (MolmoWebMix-Perception): 10.5M examples totaling 20% of the training mixture. This includes:
- Grounding data: 8.3M examples combining synthetic grounding (7M+ pairs) and PixmoPoints repurposed from Molmo (1.1M examples)
- Screenshot QA: 2.2M question-answer pairs covering OCR (54%), affordance queries (26%), and summarization (20%)
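One simple way to realize such mixture ratios is to draw the source of each training example with probability proportional to its ratio. The sketch below assumes that reading. The dictionary keys and `sample_sources` are hypothetical names, the perception portion is kept as a single 20% bucket because its internal split is not stated here, and the authors' actual data loader may work differently.

```python
import random
from typing import List

# Mixture ratios (percent) as reported above; they sum to 100.
MIXTURE = {
    "synthetic_axtree_single_agent": 35,
    "synthetic_axtree_multi_agent": 18,
    "axtree_atomic_skills": 2,
    "node_traversal": 2,
    "human_demonstrations": 18,
    "human_skill_segments": 5,
    "gui_perception": 20,
}


def sample_sources(n: int, seed: int = 0) -> List[str]:
    """Draw the data source for each of n training examples by mixture weight."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Sampling by ratio rather than by raw dataset size is what lets a small source (for example 36K human demonstrations at 18%) be seen far more often than its share of the examples would suggest.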
Trajectory Generation Pipelines
AxTree single-agent synthetic trajectories (Section 2.1, Figure 3): An LLM agent operates on the accessibility tree representation of the webpage. At each step, the agent receives the serialized accessibility tree along with the instruction and past actions, then predicts the next action by referencing element browser IDs (bid). The authors use Gemini-3-Flash-Preview as the agent backbone. Although the agent observes only the accessibility tree, screenshots are captured at each step so all trajectories share a unified observation format. Actions predicted in bid-space are post-processed into pixel-space coordinates. Trajectories are filtered for success using the WebVoyager LLM judge (a VLM that evaluates task completion), retaining only successful trajectories.
AxTree multi-agent trajectories (Section 2.1, Figure 4): A more sophisticated system with three specialized roles:
- Planner: Generates the immediate next subgoal based on the high-level task goal and task progress
- Operator: Executes one low-level browser action per step to accomplish the current subgoal
- Verifier: Checks whether the current subgoal has been completed by analyzing the most recent 5 screenshots
The system uses Gemini-2.5-Flash for the Planner, GPT-4o for the Verifier, and the Gemini AxTree agent for the Operator. This multi-agent setup achieves higher task completion rates than the single-agent approach: 78.5% vs. 74.4% on WebVoyager (Section 2.1).
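The control flow between the three roles might look roughly like the following. This is a sketch under stated assumptions: `run_multi_agent` and its callable parameters are hypothetical, the planner is assumed to return `None` when the task is finished, and the step cap is illustrative. Only the role split and the 5-screenshot verifier window come from the text.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_multi_agent(
    goal: str,
    plan: Callable[[str, List[Tuple[str, Dict]]], Optional[str]],
    operate: Callable[[str, bytes], Dict],
    verify: Callable[[str, List[bytes]], bool],
    observe: Callable[[], bytes],
    max_steps: int = 30,
) -> List[Tuple[str, Dict]]:
    """Planner picks a subgoal, operator acts once per step, verifier checks progress."""
    screenshots: List[bytes] = [observe()]
    trajectory: List[Tuple[str, Dict]] = []
    subgoal = plan(goal, trajectory)
    while subgoal is not None and len(trajectory) < max_steps:
        # Operator: choose and execute one low-level action for the subgoal.
        action = operate(subgoal, screenshots[-1])
        trajectory.append((subgoal, action))
        screenshots.append(observe())
        # Verifier: inspect only the 5 most recent screenshots.
        if verify(subgoal, screenshots[-5:]):
            subgoal = plan(goal, trajectory)  # None means the task is complete
    return trajectory
```

Keeping the planner out of the per-step loop means the more expensive planning call runs once per subgoal rather than once per action.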
Human trajectories (Section 2.1): Human demonstrations were collected via a custom Chrome extension that captures browser interaction events and corresponding screenshots. Crowdworkers (managed through partnership with Snorkel AI) received tasks with subtask decompositions and checked off each subtask upon completion. Each trajectory underwent human review to verify task completion and correct capture of screenshots and actions. The annotation tool required workers to wait for page loads and manually take screenshots when automatic capture failed (Appendix B).
Node traversal trajectories (Section 2.2, Figure 5): A deterministic approach that generates navigation trajectories from precomputed website graphs. The pipeline:
1. Build a directed graph over 500 popular websites via breadth-first exploration to depth 4
2. Extract accessibility trees and prompt an LLM to select diverse navigational links at each page
3. Generate root-to-leaf URL sequences as ground-truth navigation plans
4. Use a deterministic, LLM-free procedure to replay each path via scrolling and clicking
5. Generate a plausible instruction at the terminal page using an LLM
Because no language model is used at execution time, trajectories are cheap to produce at scale and success verification is straightforward through URL matching.
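The graph-building and path-enumeration steps can be sketched as follows. This is an illustrative reading, not the authors' code: `bfs_tree` and `root_to_leaf_paths` are hypothetical names, the link map stands in for the LLM-selected navigational links, and keeping only the first-discovery parent of each page (so each page appears on one path) is an assumption.

```python
from collections import deque
from typing import Dict, List, Optional


def bfs_tree(root: str, links: Dict[str, List[str]], max_depth: int = 4) -> Dict[str, Optional[str]]:
    """Breadth-first exploration up to max_depth; returns a child -> parent map."""
    parent: Dict[str, Optional[str]] = {root: None}
    depth = {root: 0}
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # do not expand pages at the depth limit
        for nxt in links.get(url, []):
            if nxt not in parent:  # keep only the first discovery of each page
                parent[nxt] = url
                depth[nxt] = depth[url] + 1
                queue.append(nxt)
    return parent


def root_to_leaf_paths(parent: Dict[str, Optional[str]]) -> List[List[str]]:
    """Enumerate URL sequences from the root to every leaf of the BFS tree."""
    has_child = {p for p in parent.values() if p is not None}
    paths = []
    for node in parent:
        if node in has_child:
            continue
        path, cur = [], node
        while cur is not None:
            path.append(cur)
            cur = parent[cur]
        paths.append(path[::-1])
    return paths
```

Each returned path is a ground-truth navigation plan, and a replayed trajectory can be checked by comparing the final browser URL against the last element of the path.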
Atomic Skill Trajectories
Task trajectories require composing multiple web skills; atomic skill trajectories isolate individual skills (Section 2.2). The authors define a taxonomy of 12 core skills (Table 1):
| Skill | Description |
|---|---|
| go_to | Navigate directly to a specified URL |
| search | Enter a query into a search box and submit |
| find | Locate an element or information on the page |
| find_and_open | Locate an element or subpage and open it |
| find_and_click | Locate an element and click it |
| fill_form | Fill form fields with specified values |
| fill_form_and_submit | Fill form fields and submit the form |
| apply_filters | Set filter or sort controls |
| apply_filters_and_search | Set filters and trigger a search |
| add_to_cart | Add the current item to cart |
| navigate | Free-form navigation |
Human trajectories were annotated with subtask decompositions, so each subtask segment naturally constitutes an atomic skill trajectory. These segments are automatically extracted from full human trajectories. Additionally, the AxTree agent was prompted to execute targeted skill instructions for fill_form and find_and_open.
GUI Perception Data
Grounding data (Section 2.3): Consists of (screenshot, element description) → click point pairs. The pipeline extracts clickable elements from AxTree agent trajectories, generates natural language descriptions using accessible name and role (via template or GPT-5), and samples click coordinates from a clipped Gaussian distribution centered on the element's center—encouraging spatially robust clicking. Total: 7M+ grounding pairs (3.4M templated, 3.8M GPT-5-generated) plus 1.1M examples from PixmoPoints.
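The click-point sampling can be illustrated with a short sketch. The standard deviation (a fraction of the element's width and height) and the use of clamping to implement the clipping are assumptions, since the summary does not give those details, and `sample_click` is a hypothetical name.

```python
import random
from typing import Tuple


def sample_click(
    bbox: Tuple[float, float, float, float],
    rng: random.Random,
    sigma_frac: float = 0.25,  # assumed spread, as a fraction of element size
) -> Tuple[float, float]:
    """Sample a click point from a Gaussian centered on the element, clipped to its box."""
    left, top, right, bottom = bbox
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    x = rng.gauss(cx, sigma_frac * (right - left))
    y = rng.gauss(cy, sigma_frac * (bottom - top))
    # Clip so the target always lies inside the clickable element.
    return min(max(x, left), right), min(max(y, top), bottom)
```

Training on points spread around the center, rather than always the exact center, teaches the model that any location inside the element is a valid click.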
Screenshot QA data (Section 2.3): Consists of (screenshot, question) → answer pairs. An LLM receives the accessibility tree and generates question-answer pairs covering:
- OCR queries (54%): text and values on the page (prices, counts, text content)
- Affordance queries (26%): actions available on the page
- Summarization queries (20%): content or purpose of page elements
Samples referencing AxTree-specific information (like element IDs) are removed to ensure questions rely solely on visual content. Total: 2,237,252 QA pairs across 395 websites.
Training Procedure
Base model: Molmo2 with Qwen3 language model and SigLIP2 vision encoder (Section 3.3).
Training setup:
- Hardware: 64 H100 GPUs
- Global batch size: 128
- Training duration: Up to 50K steps (approximately 3.2 epochs on average)
- Fine-tuning scope: Language model, vision encoder, and adapter are all tuned
- Initialization: Single-image checkpoint pretrained on image captioning and fine-tuned on single-image QA
All data types (task trajectories, atomic skills, GUI perception) are mixed in a single training stage. The mixing ratios (Table 2) are treated as hyperparameters and ablated to balance performance across benchmarks.
Sampling strategy: The default inference strategy is nucleus sampling (top-p) with \(p = 0.8\) and temperature \(= 0.7\), based on Qwen3's recommended parameters. Table 7 shows greedy decoding performs significantly worse (61.4%) compared to top-p (68.5%) on WebVoyager because greedy can get stuck in repetitive action loops.
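For readers unfamiliar with nucleus sampling, the sketch below shows the technique on a single logit vector in pure Python. It is a generic illustration, not the inference code used for MolmoWeb. The default values mirror the reported \(p = 0.8\) and temperature \(= 0.7\), and `top_p_sample` is a hypothetical name.

```python
import math
import random
from typing import List, Optional


def top_p_sample(
    logits: List[float],
    p: float = 0.8,
    temperature: float = 0.7,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token index from the smallest set whose probability mass reaches p."""
    if temperature <= 0 or not 0 < p <= 1:
        raise ValueError("temperature must be > 0 and p must be in (0, 1]")
    rng = rng or random.Random()
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most probable tokens until their cumulative mass reaches p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    # random.choices renormalizes the kept weights implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Because the result is stochastic, a repeated observation does not always produce the same action, which is one plausible reason sampling escapes the repetitive loops seen with greedy decoding.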
Test-Time Scaling via Parallel Rollouts
The authors explore two approaches to using more compute at inference time (Section 4.3):
Parallel rollouts with best-of-N selection: Run \(k\) independent agents on the same task, evaluate each with a VLM judge, and select the best outcome. To estimate pass@k with low variance, they use the formula:
\[ \text{pass@}k = 1 - \frac{\binom{m-c}{k}}{\binom{m}{k}} \]
where \(m\) is the total number of rollouts collected (\(m = 5\)) and \(c\) is the number of successful rollouts. The VLM judge (GPT-4o for WebVoyager/DeepShop, o4-mini for Online-Mind2Web) evaluates task completion success.
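A low-variance pass@k estimate of this kind is commonly computed with the unbiased combinatorial estimator \(1 - \binom{m-c}{k} / \binom{m}{k}\), which is the probability that a random size-\(k\) subset of the \(m\) rollouts contains at least one success. Assuming that is the estimator meant here, a minimal sketch follows. The function name `pass_at_k` is an illustrative choice.

```python
from math import comb


def pass_at_k(m: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from m rollouts of which c succeeded."""
    if not 0 <= c <= m or not 1 <= k <= m:
        raise ValueError("require 0 <= c <= m and 1 <= k <= m")
    # comb(m - c, k) counts the size-k subsets that contain no success.
    return 1.0 - comb(m - c, k) / comb(m, k)
```

With \(m = 5\) rollouts and a single success (\(c = 1\)), the estimate is 0.2 for \(k = 1\) and 0.8 for \(k = 4\). The benchmark-level figure is the average of this per-task quantity over all tasks.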
Increasing inference steps: Raising the maximum number of steps per trajectory from 30 to 100 provides consistent gains.
Key finding: Parallel rollouts are far more effective than increasing steps. For example, the 8B model achieves 86.2% accuracy via 3 parallel runs with 30 steps each (90 total steps) compared to 78.2% with a single run of 100 steps. This suggests that the diversity from independent attempts matters more than giving one attempt more time.
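Best-of-N selection itself is simple to express. In the sketch below, `best_of_n`, `run_agent`, and `judge` are hypothetical names. The rollouts run sequentially for clarity, whereas a real deployment would launch them in parallel, and the judge is assumed to return a numeric score where higher is better.

```python
from typing import Any, Callable, Tuple


def best_of_n(
    task: str,
    run_agent: Callable[[str, int], Any],
    judge: Callable[[str, Any], float],
    n: int = 4,
) -> Tuple[Any, float]:
    """Run n independent rollouts and return the one the judge scores highest."""
    rollouts = [run_agent(task, seed) for seed in range(n)]  # independent attempts
    scores = [judge(task, rollout) for rollout in rollouts]
    best = max(range(n), key=lambda i: scores[i])  # first rollout wins ties
    return rollouts[best], scores[best]
```

The approach only helps when the attempts differ from one another, which is why it pairs naturally with stochastic decoding.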
Design Choices and Rationale
Vision-only over DOM/accessibility tree: The authors argue this mirrors human perception, avoids brittleness from DOM variations, and reduces token consumption (Section 1). A single screenshot is more compact than tens of thousands of tokens from an accessibility tree.
Thought-action output format: Including natural language thoughts provides interpretability and can serve as working memory, though the authors note this behavior is not robust and sometimes thoughts don't correlate well with actions (Section 6).
No distillation from proprietary visual web agents: Unlike Fara, MolmoWeb's synthetic data comes from AxTree agents that don't see screenshots, avoiding dependencies on closed vision-based agents. The visual agent learns to translate accessibility-tree-guided behavior to pure visual input.
Diverse data sources: Combining synthetic AxTree trajectories, human demonstrations, deterministic node traversal, and GUI perception data provides complementary supervision. The authors argue that synthetic trajectories provide reliable learning signals while human data captures real behavior patterns.
4. Key Insights and Innovations
Innovation 1: Complete Open-Source Training Pipeline for Visual Web Agents
The paper's most fundamental contribution is providing a fully transparent, reproducible foundation for web agent research: model weights, training data, code, and evaluation harness are all released. Prior open-weight models like Fara, UI-TARS, and Holo1 do not provide complete training data, making it impossible to understand what drives performance or build on the work systematically. This openness enables:
- Scientific understanding of what data sources drive agent performance
- Community-driven improvements and extensions
- Auditable, accountable systems for trustworthy deployment
The significance is not just the artifacts themselves but the completeness: the training recipe is fully specified, enabling replication and extension.
Innovation 2: Synthetic Trajectories Outperform Human Demonstrations for Learning
A surprising empirical finding (Table 6): models trained on AxTree-sourced synthetic trajectories consistently outperform models trained on human trajectories for the same tasks (53.0% vs. 35.4% on WebVoyager). The authors hypothesize two factors:
- Lower signal-to-noise ratio in human data: Humans exhibit more exploratory behavior, with detours that may hinder imitation learning, while LLM agents operating on accessibility trees produce more direct, consistent trajectories.
- Structural information in accessibility trees: The accessibility tree encodes rich structural and semantic information not immediately apparent from visual cues alone, enabling the LLM to produce more reliable action sequences.
This finding is counterintuitive—human data is typically considered higher quality—and suggests that synthetic data from appropriate representations can be more effective for training.
Innovation 3: Vision-Only Agents Match or Exceed Larger Proprietary Systems with Richer Inputs
MolmoWeb-8B outperforms GPT-4o with set-of-marks prompting on WebVoyager (78.2% vs. 65.1%) despite:
- Being much smaller (8B parameters; GPT-4o's size is undisclosed but presumed to be far larger)
- Using only screenshots, while SoM agents receive both screenshots and accessibility-tree-derived annotations
This demonstrates that data quality and targeted training can compensate for raw model scale and richer input representations. The result challenges the assumption that larger models with more input modalities are necessarily better for web agent tasks.
Innovation 4: Multi-Agent Trajectory Generation Improves Over Single-Agent
Trajectory generation with the multi-agent pipeline (Planner-Operator-Verifier) reaches a 78.5% task completion rate on WebVoyager, compared to 74.4% for single-agent generation (Section 2.1). This suggests that decomposing trajectory generation into specialized roles (planning subgoals, executing low-level actions, and verifying progress) yields more successful trajectories to train on. The design loosely resembles actor-critic architectures in reinforcement learning, applied here to data generation.
Innovation 5: Test-Time Scaling via Parallel Rollouts Is Highly Effective
The authors demonstrate that parallel rollouts with best-of-N selection yield substantial gains: MolmoWeb-8B improves from 78.2% pass@1 to 94.7% pass@4 on WebVoyager (Section 4.3, Figure 6). This 16.5 percentage point gain far exceeds the gain from raising the step budget from 30 to 100. The practical implication is that running 4 independent attempts and selecting the best outcome is more effective than giving one attempt more time, which suggests that recovering from errors within a trajectory is harder than avoiding them through diverse attempts.
5. Experimental Analysis
Evaluation Methodology
Benchmarks (Section 4.1): Four popular web browsing benchmarks using live websites:
- WebVoyager: End-to-end web agent tasks on real websites
- Online-Mind2Web: Extension of Mind2Web with live evaluation
- DeepShop: Benchmark for shopping agents
- WebTailBench: Benchmark for web tasks (judge unclear, so authors use WebVoyager judge)
Evaluation setup:
- Steps: Models run up to 100 steps per task (30 steps for some comparisons)
- Retries: Up to 10 retries per trajectory for environment errors
- Runs: 3-5 evaluations per benchmark, reporting average
- Judge: GPT-4o for WebVoyager and DeepShop, o4-mini for Online-Mind2Web
- Environment: Browserbase (hosted browser with captcha-solving capability)
Baselines compared (Table 4):
- API-only (no vision): AxTree agents using GPT-5 and Gemini-3-flash
- With vision (proprietary): SoM agents using GPT-4o, o3, GPT-5; OpenAI computer-use-preview; Gemini computer-use-preview
- Open-weight: Holo1-7B, UI-TARS-1.5-7B, GLM-4.1V-9B-Thinking, Fara-7B
Metrics: Task success rate (%) as determined by VLM judge.
Main Quantitative Results
Comparison to Open-Weight Models (Table 4)
On WebVoyager (100 steps):
- MolmoWeb-8B: 78.2%
- MolmoWeb-4B: 75.2%
- Fara-7B: 73.5%
- GLM-4.1V-9B-Thinking: 66.8%
- UI-TARS-1.5-7B: 66.4%
- Holo1-7B: 55.4% (30 steps)
On Online-Mind2Web (100 steps):
- MolmoWeb-8B: 35.3%
- Fara-7B: 34.1%
- GLM-4.1V-9B-Thinking: 33.9%
- MolmoWeb-4B: 31.3%
- UI-TARS-1.5-7B: 31.3%
On DeepShop (100 steps):
- MolmoWeb-8B: 42.3%
- MolmoWeb-4B: 35.6%
- GLM-4.1V-9B-Thinking: 32.0%
- Fara-7B: 26.2%
- UI-TARS-1.5-7B: 11.6%
On WebTailBench (100 steps):
- MolmoWeb-8B: 49.5%
- MolmoWeb-4B: 43.8%
- Fara-7B: 38.4%
- GLM-4.1V-9B-Thinking: 22.4%
- UI-TARS-1.5-7B: 19.5%
MolmoWeb-8B achieves best-in-class results among open-weight models across all benchmarks, with particularly large margins on DeepShop (16.1 percentage points over Fara-7B) and WebTailBench (11.1 points over Fara-7B).
Comparison to Proprietary Systems (Table 4)
On WebVoyager (100 steps):
- MolmoWeb-8B: 78.2%
- SoM Agent (GPT-4o): 65.1%
- OpenAI computer-use-preview: 70.9%
- SoM Agent (o3): 79.3%
- Gemini computer-use-preview: 88.6%
- SoM Agent (GPT-5): 90.6%
MolmoWeb-8B outperforms the GPT-4o SoM agent by 13.1 percentage points and OpenAI computer-use-preview by 7.3 points. It comes within 1.1 points of the o3 SoM agent (79.3%) despite being much smaller and using only visual input.
On DeepShop (100 steps):
- MolmoWeb-8B: 42.3%
- SoM Agent (GPT-4o): 16.0%
- OpenAI computer-use-preview: 24.7%
- SoM Agent (o3): 49.7%
- Gemini computer-use-preview: 62.0%
MolmoWeb-8B substantially outperforms GPT-4o (by 26.3 points) and OpenAI computer-use-preview (by 17.6 points).
Comparison to AxTree Teacher Agent (Table 4)
MolmoWeb-8B trails the Gemini-3-flash AxTree agent (all at 100 steps):
- WebVoyager: 78.2% vs. 85.6% (7.4 point gap)
- Online-Mind2Web: 35.3% vs. 44.8% (9.5 point gap)
- DeepShop: 42.3% vs. 55.3% (13 point gap)
- WebTailBench: 49.5% vs. 63.5% (14 point gap)
The authors attribute this gap to: (1) model size differences (Gemini is presumably larger), (2) AxTree agents use element IDs rather than pixel coordinates for clicking, and (3) visual agents must perform implicit OCR for reading comprehension tasks.
Test-Time Scaling Results (Figure 6, Section 4.3)
Pass@k performance on WebVoyager (30 steps):
- MolmoWeb-4B: pass@1 = 65%, pass@4 ≈ 87%
- MolmoWeb-8B: pass@1 = 78.2%, pass@4 = 94.7%
Pass@k performance on Online-Mind2Web (30 steps):
- MolmoWeb-4B: pass@1 ≈ 28%, pass@4 ≈ 50%
- MolmoWeb-8B: pass@1 = 35.3%, pass@4 = 60.5%
Parallel rollouts provide 16.5 and 25.2 percentage point gains for MolmoWeb-8B on WebVoyager and Online-Mind2Web respectively.
Steps vs. parallel runs comparison: The authors note that 3 parallel runs with 30 steps each (90 total steps) achieve 86.2% accuracy, compared to 78.2% with a single run of 100 steps. This demonstrates that diversity across independent attempts is more valuable than giving one attempt more time.
Training Data Ablations (Table 5, Section 4.4)
Effect of data scale (Table 5a, using earlier version of dataset, 30 steps):
- 1% of data: WebVoyager 44.5%, Online-Mind2Web 11.7%
- 10% of data: WebVoyager 63.2%, Online-Mind2Web 20.4%
- 100% of data: WebVoyager 68.5%, Online-Mind2Web 21.9%
With just 10% of the dataset, the model reaches roughly 92-93% of its full-data performance (63.2/68.5 on WebVoyager and 20.4/21.9 on Online-Mind2Web).
Human vs. synthetic data (Table 5b):
- Human only (28K trajectories): WebVoyager 27.8%, Online-Mind2Web 13.2%
- Synthetic only (106K trajectories): WebVoyager 67.8%, Online-Mind2Web 22.0%
- Synthetic + human (134K trajectories): WebVoyager 68.5%, Online-Mind2Web 21.4%
Human data provides limited and inconsistent gains. The authors hypothesize that synthetic and human data represent distinctly different web task completion policies that the model struggles to reconcile.
Matching trajectories comparison (Table 6, Section 4.4): For the same 2.7K tasks with matched instructions:
- Human trajectories: DeepShop 19.8%, WebVoyager 35.4%, Online-Mind2Web 9.0%
- Synthetic trajectories: DeepShop 24.4%, WebVoyager 53.0%, Online-Mind2Web 16.8%
Synthetic trajectories provide substantially better learning signals (17.6 point gap on WebVoyager).
Sampling Strategy Ablation (Table 7, Section 4.5)
On WebVoyager with 30 steps:
- Greedy (temperature=0.0): 61.4%
- Top-k (temperature=0.7, k=20): 67.4%
- Top-p (temperature=0.7, p=0.8): 68.5%
Greedy decoding performs 7.1 points worse than nucleus sampling because it can get stuck in repetitive action loops (clicking same location, scrolling indefinitely).
Grounding Evaluation (Table 8, Section 4.6)
On ScreenSpot:
- MolmoWeb-Ground-8B (specialist): 88.7%
- Holo1-7B: 87.4%
- MolmoWeb-4B (agent): 87.2%
- Fara-7B: 86.7%
- UGround-7B: 86.7%
- Qwen2.5VL-7B: 85.5%
On ScreenSpot v2:
- Gemini-3-Pro: 93.7%
- MolmoWeb-Ground-8B: 91.8%
- Holo1-7B: 89.9%
- MolmoWeb-4B: 89.5%
- Fara-7B: 89.3%
- Qwen2.5VL-7B: 89.3%
- OpenAI CUA: 87.9%
- Claude 3.7: 87.6%
The grounding specialist outperforms all listed open-weight baselines on both benchmarks and most proprietary baselines on ScreenSpot v2; only Gemini-3-Pro (93.7%) scores higher. MolmoWeb-4B is competitive despite also being trained for web task completion.
Assessment: Do the Experiments Support the Claims?
Claim 1: MolmoWeb achieves SOTA among open-weight models. Strongly supported. MolmoWeb-8B outperforms all open-weight baselines across all four benchmarks (Table 4). Against Fara-7B, the margins range from 1.2 points (Online-Mind2Web) to 16.1 points (DeepShop).
Claim 2: MolmoWeb outperforms larger proprietary models with richer inputs. Supported with qualifications. MolmoWeb-8B outperforms GPT-4o SoM agent by 13.1 points on WebVoyager and 26.3 points on DeepShop. However, it trails GPT-5 SoM agent (90.6% vs. 78.2% on WebVoyager) and Gemini computer-use (88.6% vs. 78.2%). The claim is strongest when comparing to GPT-4o class models.
Claim 3: Synthetic trajectories are more effective than human demonstrations. Strongly supported. The ablation with matched tasks (Table 6) shows a 17.6 point gap on WebVoyager, and the human-only model achieves only 27.8% vs. synthetic's 67.8%.
Claim 4: Parallel rollouts provide substantial gains. Strongly supported. Pass@4 achieves 94.7% vs. 78.2% pass@1 on WebVoyager (Figure 6), a 16.5 point gain.
Claim 5: Vision-only operation is viable for web agents. Supported, but with a caveat. MolmoWeb is competitive with AxTree agents using accessibility trees, but still trails by 7-14 points. The vision-only approach works but has a measurable performance cost.
Potential weaknesses:
- Single model family: All experiments use Molmo2 as the base VLM. Results may not generalize to other architectures.
- Limited human data effectiveness: Human demonstrations contribute little (Table 5b), which raises questions about the value of human annotation versus synthetic generation.
- Gap to AxTree teacher: The 7-14 point gap to AxTree agents suggests there is a fundamental cost to vision-only operation that training alone cannot fully overcome.
- Evaluation on live websites: While more realistic, live evaluation introduces variability. The authors use 3-5 runs per benchmark to mitigate this, but some variance remains.
- Task distribution in training: The authors note that human data gains are limited partly due to differences in task distribution compared to benchmarks (Section 4.4), suggesting potential train-test mismatch.
6. Limitations and Trade-offs
Vision-Only Operation Incurs a Measurable Performance Cost
The most significant limitation is the performance gap between MolmoWeb and its AxTree-based teacher agent. Despite training on trajectories generated by the AxTree agent, MolmoWeb-8B trails by 7.4 points on WebVoyager, 9.5 points on Online-Mind2Web, 13 points on DeepShop, and 14 points on WebTailBench (Table 4). The authors attribute this gap to three factors:
- Model size: Gemini-3-flash is "presumably a much larger model" (Section 4.2)
- Action precision: AxTree agents reference elements by unique IDs (bid) that programmatically map to click locations, while visual agents must predict pixel coordinates
- Implicit OCR burden: Visual agents must perform OCR to parse text from screenshots, while AxTree agents receive pre-parsed text content
The authors are transparent about this trade-off, but it raises a fundamental question: is the vision-only constraint worth 7-14 points of performance? For applications where accessibility trees are unavailable (e.g., dynamically rendered content, canvas-based UIs), the answer is clearly yes. But for standard websites with well-formed accessibility trees, the DOM-based approach may be superior. The paper does not attempt to combine both modalities, leaving this as an open question.
Human Demonstrations Provide Limited Value¶
A striking and unexpected finding is that human demonstrations contribute little to model performance. Table 5b shows that training on human-only data yields 27.8% on WebVoyager, while synthetic-only data achieves 67.8%. Even when combined, human data adds only 0.7 points over synthetic data alone (68.5% vs. 67.8%).
The authors hypothesize several reasons (Section 4.4):
"This is likely due to lower volume of human data, differences in task distribution compared to benchmarks, difficulty in learning new actions present only in human data (scroll_at and mouse_drag_and_drop), difference in style and quality of post-hoc generated thoughts, and human annotation noise (e.g. variable scroll amounts in human annotation as opposed to always scrolling by 100% of viewport height in synthetic data)."
This raises practical concerns about the return on investment for human annotation. Collecting the human data required a partnership with Snorkel AI, custom annotation tooling, and human review of trajectories (Appendix B), yet the resulting data contributes only marginally. The finding suggests that synthetic data from well-designed pipelines may be more cost-effective than human demonstration, though this may not generalize to tasks where LLMs cannot reliably verify success.
Infrequent Actions Are Poorly Learned¶
The authors explicitly note that MolmoWeb "struggles with predicting more infrequent actions like scroll_at, mouse_drag_and_drop, or hover" (Section 6). This is likely a consequence of the action distribution in training data (Figure 8 in Appendix), where mouse_click and keyboard_type dominate while scroll_at, mouse_drag_and_drop, and hover_at are rare.
The human-synthetic data conflict exacerbates this: human data contains scroll_at and scroll actions with variable scroll amounts, while synthetic data uses standardized scroll actions. When combined, the model "almost always produces scroll which tries to scroll the page instead of the element" (Section 4.4), suggesting it fails to learn the nuanced action repertoire present in human trajectories.
Thought-Action Correlation Is Inconsistent¶
The natural language "thought" output is intended to provide interpretability and serve as working memory. However, the authors note a failure mode:
"We also see failure modes where thoughts sometimes do not correlate well with actions. For instance, the thought might say the agent needs to type some text and press Enter but the action is directly to press Enter" (Section 6).
This inconsistency undermines the reliability of thoughts for both interpretation and memory. The issue likely stems from the post-hoc generation of thoughts for synthetic trajectories and human demonstrations, rather than emergent reasoning during action generation. The paper does not analyze whether improving thought-action consistency would improve task performance.
Error Recovery and State Stuckness¶
A practical limitation the authors acknowledge:
"While MolmoWeb shows some error correction behavior... the agent may sometimes get stuck in states where it incorrectly keeps predicting the same action (eg. repeatedly scrolling or clicking at the same location) without being able to recover or course-correct" (Section 6).
Greedy decoding is particularly susceptible to this behavior, performing 7.1 points worse than nucleus sampling on WebVoyager (Table 7). While sampling strategies help, the underlying issue is that the model lacks explicit mechanisms for detecting stuck states and triggering recovery behaviors. The parallel rollout strategy mitigates this through diversity but does not address the root cause.
Training Data Does Not Cover All Benchmarks Equally¶
The authors acknowledge that human data gains are limited "likely due to... differences in task distribution compared to benchmarks" (Section 4.4). To address this, they generated "benchmark-like tasks" using in-context examples from WebVoyager and Online-Mind2Web to close the distribution gap (Appendices B.2.4 and C.1.2).
However, this creates a subtle issue: the training data is partially tailored to the evaluation benchmarks, which could overestimate generalization. The authors do not evaluate on held-out benchmark distributions or report results on tasks explicitly excluded from training data generation.
Evaluation on Live Websites Introduces Variability¶
All evaluations are conducted on live websites (Section 4.1), which is more realistic than sandboxed environments but introduces variability:
- Websites change over time, affecting task feasibility
- Dynamic content (ads, personalization) may differ across runs
- Network latency and page load times vary
The authors mitigate this by running 3-5 evaluations per benchmark and allowing up to 10 retries for environment errors. However, the variance is not quantified with confidence intervals, making it difficult to assess whether small differences (e.g., 35.3% vs. 34.1% on Online-Mind2Web) are statistically significant.
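To make the significance concern concrete, the following is a minimal sketch of a two-proportion z-test and a normal-approximation interval for a single evaluation run. The benchmark size of 300 tasks is a hypothetical value chosen for illustration, not a figure from the paper, and the test treats tasks as independent, which live-website evaluation may violate.

```python
from __future__ import annotations

import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent success rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def normal_interval(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) 95% interval for a success rate."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (max(0.0, p - half_width), min(1.0, p + half_width))


# ASSUMPTION: 300 tasks per run is a hypothetical benchmark size.
N_TASKS = 300
z_stat = two_proportion_z(0.353, 0.341, N_TASKS, N_TASKS)
low, high = normal_interval(0.353, N_TASKS)
print(f"z = {z_stat:.2f}, 95% interval for 35.3%: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the z statistic is roughly 0.3, far below the 1.96 threshold for significance at the 5% level, and the interval around 35.3% comfortably contains 34.1%. Averaging over several runs would narrow the interval, which is why reporting per-run variance would help readers interpret such gaps.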
Action Space Could Be Compressed¶
The current action space requires separate actions for sequences that could be combined. For example, searching with a search box requires three actions: click to select the input box, type the query, and press Enter or click the search button. The authors note:
"This sequence could be combined into a single type_at(text, x, y, press_enter=True) action" (Section 6).
The current design mirrors human browser interactions closely, but the granular action space may increase the number of steps required and the cumulative error probability. A more abstract action space could reduce trajectory length but may sacrifice the generality of operating on any website.
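The trade-off can be illustrated with a short sketch that expands a composite type_at call into the three primitive steps described above and shows how per-step accuracy compounds over a trajectory. The `Action` container, the `keyboard_press` action name, and the assumption that step errors are independent are all hypothetical simplifications, not details taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one browser action."""
    name: str
    args: dict


def expand_type_at(text: str, x: float, y: float,
                   press_enter: bool = True) -> list[Action]:
    """Expand a composite type_at into the primitive steps it would replace."""
    steps = [
        Action("mouse_click", {"x": x, "y": y}),   # focus the input box
        Action("keyboard_type", {"text": text}),   # enter the query
    ]
    if press_enter:
        # ASSUMPTION: "keyboard_press" is an illustrative name for the key press.
        steps.append(Action("keyboard_press", {"key": "Enter"}))
    return steps


def trajectory_success(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that every step succeeds, assuming independent step errors."""
    return per_step_accuracy ** num_steps


primitive_steps = expand_type_at("running shoes", 50.0, 10.0)
print(len(primitive_steps), "primitive steps replaced by one composite action")
print(trajectory_success(0.98, 30), trajectory_success(0.98, 20))
```

With a fixed per-step accuracy, shortening a trajectory from 30 to 20 steps raises the chance of an error-free run, which is the intuition behind composite actions. Real step errors are neither independent nor always fatal, so this should be read as an upper-level intuition rather than a prediction.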
No Handling of Authentication or Personal Information¶
The task generation pipelines explicitly exclude tasks requiring authentication or personal information (Appendices B.2.2, B.2.3):
"Do not generate tasks requiring logins, or user's personal information like name, address, credit card details etc."
This is a deliberate safety choice, but it means MolmoWeb cannot handle a significant class of economically valuable tasks: booking flights with stored payment methods, accessing email, managing accounts, or personalized shopping. The limitation is acknowledged but not addressed, leaving a gap for future work.
Scalability of Data Generation Pipeline¶
The synthetic trajectory generation requires running AxTree agents (powered by Gemini-3-flash or GPT-4o) to generate successful trajectories, then filtering with VLM judges. This is compute-intensive and depends on proprietary APIs. The multi-agent pipeline specifically uses three different models (Gemini-2.5-Flash, GPT-4o, Gemini AxTree agent), introducing latency and cost.
The node traversal approach (Figure 5) mitigates this by using deterministic, LLM-free execution, but it only covers navigation—not form filling, searching, or other complex actions. The paper does not quantify the computational cost of data generation, making it difficult to assess the feasibility of scaling to even larger datasets.
7. Implications and Future Directions¶
How This Work Changes the Landscape¶
The most significant shift is methodological: this paper demonstrates that fully open, reproducible web agent training pipelines are achievable. Prior to this work, the field lacked a complete artifact suite—model weights, training data, code, and evaluation harness—for multimodal web agents. Proprietary systems (OpenAI computer-use, Gemini computer-use) offered capabilities but no transparency; open-weight models (Fara, UI-TARS, Holo1) released weights but not training data.
This completeness matters for several reasons:
- Scientific understanding becomes possible: Researchers can now investigate why certain data sources or training choices matter, rather than treating model performance as a black box.
- Community-driven improvement: The open artifacts enable iterative contributions from the research community, rather than progress being gated by proprietary labs.
- Auditable deployment: Organizations can evaluate model behavior, biases, and failure modes—critical for trustworthy deployment in high-stakes settings.
The finding that synthetic trajectories outperform human demonstrations challenges conventional wisdom about data quality. If LLM-generated synthetic data can provide more effective supervision than human demonstrations for certain tasks, this has implications beyond web agents: it suggests that the bottleneck for agent development may be the quality and consistency of training signals, not human annotation volume. This reframes the data collection problem as one of designing better synthetic pipelines rather than scaling human labeling.
The demonstration that vision-only agents can match or exceed much larger proprietary models with richer inputs (GPT-4o with SoM prompting) has practical implications. It suggests that specialized training on domain-appropriate data can compensate for model scale and input modality advantages—a finding relevant to deployment decisions where compute constraints favor smaller models.
Follow-Up Research This Work Enables¶
Combining visual and structured representations. The performance gap between MolmoWeb and AxTree agents suggests that visual agents lose information available in accessibility trees. A natural extension is to train agents that use both modalities—visual perception for robustness to DOM variations, and structured representations when available. The paper does not explore this hybrid approach, leaving open whether the modalities are complementary or redundant.
Improving human data effectiveness. The finding that human demonstrations provide limited gains raises questions about how to make human data more useful. Potential directions include:
- Ensuring task distribution alignment between human data and target benchmarks
- Improving annotation tools to reduce noise (e.g., standardizing scroll amounts)
- Training separate models on human vs. synthetic data and ensembling
- Using human data for validation rather than training
Explicit error recovery mechanisms. The "stuck state" problem—repeatedly predicting the same action—could be addressed through explicit recovery behaviors. One approach is to detect when recent actions have not changed the visual state and trigger fallback strategies (go back, scroll to top, try alternative approach). Another is to train the model with counterfactual data that demonstrates recovery from stuck states.
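The first approach could be prototyped outside the model as a thin wrapper around the agent loop. The sketch below flags a stuck state when the same action has produced the same screenshot for several consecutive steps. The class name, the window size of 3, and the use of exact SHA-256 hashing are assumptions made for illustration; none of this is part of MolmoWeb.

```python
from __future__ import annotations

import hashlib
from collections import deque


class StuckDetector:
    """Flags when the last `window` steps repeat the same action and screenshot."""

    def __init__(self, window: int = 3) -> None:
        self.window = window
        self._recent: deque[tuple[str, str]] = deque(maxlen=window)

    def observe(self, action: str, screenshot_bytes: bytes) -> bool:
        """Record one step; return True if the agent appears stuck."""
        digest = hashlib.sha256(screenshot_bytes).hexdigest()
        self._recent.append((action, digest))
        window_full = len(self._recent) == self.window
        return window_full and len(set(self._recent)) == 1


detector = StuckDetector(window=3)
for _ in range(3):
    stuck = detector.observe("scroll(100)", b"same-pixels")
print("stuck:", stuck)
```

Exact hashing is brittle on live pages, where ads or animations change pixels without changing state, so a perceptual hash or a downsampled image difference would likely be more tolerant. Once a stuck state is flagged, the caller would choose a fallback such as going back or scrolling to the top.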
Thought-action alignment. The inconsistency between predicted thoughts and actions suggests a training objective issue. Future work could:
- Add auxiliary losses penalizing thought-action mismatches
- Generate thoughts on-policy during synthetic trajectory generation rather than post-hoc
- Evaluate whether improving thought quality improves task performance
Reinforcement learning from rollouts. The parallel rollout results (94.7% pass@4 on WebVoyager) suggest a natural RL setup: use successful rollouts from best-of-N sampling as positive examples for policy improvement. The authors briefly mention this:
"These massive gains from parallel rollouts suggest (or, perhaps, explain to the unsurprised why) self-distillation from best-of-N rollouts and RL might be effective strategies for further improving single rollout performance" (Section 4.3).
This is a direct extension enabled by the open pipeline.
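A minimal sketch of the quantities involved is given below: an empirical pass@k over recorded rollout outcomes, best-of-N selection with a judge, and a filter that keeps judged-successful rollouts as candidate self-distillation data. The function names, the judge signature, and the threshold are hypothetical, and the paper's exact pass@k computation may differ from this simple "any of the first k succeeded" definition.

```python
from __future__ import annotations

from typing import Callable, Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeded."""
    solved = sum(any(task_results[:k]) for task_results in results)
    return solved / len(results)


def best_of_n(rollouts: Sequence[str], judge: Callable[[str], float]) -> str:
    """Return the rollout the judge scores highest."""
    return max(rollouts, key=judge)


def keep_successes(rollouts: Sequence[str], judge: Callable[[str], float],
                   threshold: float) -> list[str]:
    """Keep rollouts scored at or above the threshold as distillation candidates."""
    return [r for r in rollouts if judge(r) >= threshold]


# Hypothetical outcomes: 3 tasks, 4 independent rollouts each.
outcomes = [
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
]
print(pass_at_k(outcomes, 1), pass_at_k(outcomes, 4))
```

The gap between pass@1 and pass@4 on the same policy is what makes best-of-N rollouts attractive as training signal: the policy already produces successful trajectories, and selection only has to find them.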
Extending the action space. The current action space lacks composite actions that could reduce trajectory length and cumulative errors. Future work could:
- Add compound actions like type_at(text, x, y, press_enter=True)
- Add web_search(query) that directly passes query to search engines via URL parameters
- Evaluate whether abstract actions improve performance or reduce generality
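As an illustration of the web_search(query) idea, a query can be turned into a search URL with the standard library alone. The base URL and parameter name below are illustrative defaults, and how the resulting URL would be handed to the browser is left open, since the source does not specify it.

```python
from urllib.parse import urlencode


def web_search_url(query: str,
                   base_url: str = "https://www.google.com/search",
                   param: str = "q") -> str:
    """Build a search URL that carries the query as a URL parameter."""
    # urlencode escapes spaces and reserved characters in the query.
    return f"{base_url}?{urlencode({param: query})}"


print(web_search_url("cheap flights to NYC"))
```

A single navigation to this URL would replace the click, type, and submit sequence, at the cost of tying the agent to a specific search engine's URL scheme.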
Addressing authentication and personalization. The deliberate exclusion of tasks requiring authentication or personal information leaves a gap for deployment scenarios. Future work could:
- Integrate secure credential management for authenticated sessions
- Explore privacy-preserving personalization techniques
- Develop evaluation frameworks for personalized tasks
Practical Applications and Downstream Use Cases¶
Web automation for accessibility. The vision-only design makes MolmoWeb applicable to websites with poor accessibility tree support (dynamically rendered content, canvas-based UIs, single-page applications). This could benefit users who face barriers due to inaccessible design, as the agent perceives the same visual interface they would.
Data extraction and monitoring. MolmoWeb could navigate to target pages and extract information (prices, availability, news headlines) without requiring API access. The implicit OCR and reading comprehension capabilities demonstrated in the Screenshot QA data enable this use case. However, the authors note limitations for complex text:
"We do see instances of failures due to OCR for smaller texts or for answering complex questions requiring understanding of large passages of text" (Section 6).
Workflow automation. Multi-step tasks like flight booking, comparison shopping, or form filling are within MolmoWeb's capabilities, though performance degrades for tasks with "many constraints or search filters" (Section 6). The atomic skill taxonomy (Table 1) provides a framework for breaking down complex workflows.
Research platform for web agents. The open artifacts enable researchers to:
- Investigate the effects of different data mixtures or training objectives
- Develop improved synthetic data generation pipelines
- Evaluate new action spaces or observation encodings
- Study failure modes and develop robustness improvements
When to Prefer This Method Over Alternatives¶
Prefer MolmoWeb when:
- Transparency and auditability are required (e.g., research, regulated deployments)
- Compute constraints favor smaller models (edge deployment, on-device)
- Accessibility trees are unavailable or unreliable (dynamically rendered content, canvas UIs)
- Customization via fine-tuning is needed (domain-specific tasks)

Prefer AxTree-based agents when:
- Accessibility trees are reliable and complete
- Maximum task completion accuracy is critical
- The compute budget supports larger proprietary models
- The task requires precise element selection (form fields, buttons)

Prefer proprietary computer-use APIs when:
- Development resources are limited (no training pipeline needed)
- The task distribution aligns with the API's capabilities
- Latency is a primary concern (hosted APIs may be faster than running local models)
Reproducibility and Integration Guidance¶
For reproduction, the authors commit to releasing:
- Model checkpoints (4B and 8B)
- Training data (MolmoWebMix)
- Code for training and evaluation
- Unified evaluation harness
For integration, the key practical choices are:
- Observation format: Screenshots plus instruction plus action history (last 10 steps). The action history should include thoughts, actions, URLs, and page titles.
- Sampling strategy: Use nucleus sampling with \(p = 0.8\) and temperature \(= 0.7\). Greedy decoding performs about 7 points worse on WebVoyager (Table 7) due to stuck-state issues.
- Step budget: Allow up to 100 steps for complex tasks, though 30 steps suffice for simpler tasks. Parallel rollouts (3-4 independent attempts) provide larger gains than increasing steps.
- Judge for parallel selection: Use GPT-4o as the VLM judge for best-of-N selection (WebVoyager setup) or o4-mini for Online-Mind2Web-style evaluation.
- Training mixture: If fine-tuning, the default mixture ratios are 80% trajectories, 15% grounding data, 5% Screenshot QA (Table 2). The grounding data from PixmoPoints is repurposed from the Molmo dataset, so ensure compatibility with the target domain.
- Action space: Table 3 defines the full action space. Spatial coordinates are normalized to [0, 100] with 2 decimal places; denormalize to viewport pixel coordinates at execution time.
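Several of these choices can be captured in a few lines of glue code. The sketch below collects the decoding settings in a config object, denormalizes a predicted coordinate to viewport pixels, and truncates the action history to the last 10 steps. The class and function names, the clamping behavior, and the 1280x720 viewport in the usage line are assumptions for illustration and are not taken from the released code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DecodingConfig:
    """Settings quoted in the guidance above; the field names are hypothetical."""
    top_p: float = 0.8
    temperature: float = 0.7
    max_steps: int = 100
    history_len: int = 10


def denormalize(x_norm: float, y_norm: float,
                viewport_width: int, viewport_height: int) -> tuple[int, int]:
    """Map coordinates in [0, 100] to integer pixel positions in the viewport."""
    x = round(x_norm / 100 * viewport_width)
    y = round(y_norm / 100 * viewport_height)
    # ASSUMPTION: clamp so that 100.00 lands on the last valid pixel.
    x = min(max(x, 0), viewport_width - 1)
    y = min(max(y, 0), viewport_height - 1)
    return (x, y)


def recent_history(steps: list[dict], limit: int = 10) -> list[dict]:
    """Keep only the most recent `limit` steps for the next prompt."""
    return steps[-limit:]


config = DecodingConfig()
print(denormalize(50.0, 25.0, 1280, 720))
```

Each history entry would carry the thought, action, URL, and page title for a step; the dictionary layout is left unspecified here because the source does not define it.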