MMGR: Multi-Modal Generative Reasoning

🎯 Pitch

MMGR is a principled evaluation framework and benchmark suite that measures whether image and video generative models satisfy core reasoning constraints—physical, logical, 3D/2D spatial, and temporal—across 10 tasks in Abstract Reasoning, Embodied Navigation, and Physical Commonsense. By requiring strict, holistic correctness (not just perceptual fidelity) and combining automated VLM scoring with human validation, MMGR exposes critical failure modes—especially in symbolic and long-horizon reasoning—and provides a diagnostic roadmap for building generation models that can reliably serve as world simulators for robotics, planning, and scientific applications.

1. Executive Summary

MMGR is an evaluation framework and benchmark suite that tests whether modern image and video generators obey reasoning constraints (physical, logical, spatial, temporal) rather than merely producing perceptually realistic outputs. It operationalizes “generative reasoning” via 10 tasks across three domains—Abstract Reasoning, Embodied Navigation, and Physical Commonsense—and scores models with strict holistic metrics that require all sub-criteria to be satisfied simultaneously. The benchmark reveals large modality-dependent gaps: current models show moderate performance on physical commonsense but fail badly on abstract/symbolic reasoning and long-horizon navigation, and automated VLM judging can significantly misestimate true success depending on the task.

2. Context and Motivation

Problem/gap addressed
Video foundation models can generate visually compelling, temporally coherent clips, but “world simulator” usefulness depends on whether generated content respects physical laws, logical rules, spatial consistency, and causal/temporal structure (Introduction; Section 1).
Standard generative metrics like FVD (Fréchet Video Distance), IS (Inception Score), and CLIP-based similarity primarily measure perceptual fidelity and caption alignment, and can miss “reasoning failures” such as:
- objects passing through each other,
- agents teleporting through walls,
- global inconsistencies across frames (Introduction).
Why it matters
If generative models are used for simulation-like settings (embodied navigation, robotics, planning, scientific visualization), perceptual realism without constraint satisfaction can be actively misleading (Introduction; Section 3 overview).
Prior approaches and shortcomings (as framed in the provided text)
Existing benchmarks largely focus on:
- perceptual/semantic alignment,
- temporal smoothness,
- discriminative understanding tasks (video QA / recognition),
but not on generation-time enforcement of rules or holistic correctness over long horizons (Related Work; Section 2).
How MMGR positions itself
MMGR shifts evaluation from “does the model generate something realistic?” to “does the model generate something that is correct under explicit reasoning criteria?” (Figure 1; Section 3).
It proposes a five-ability reasoning framework (Physical, Logical, 3D Spatial, 2D Spatial, Temporal) and maps each task to these abilities (Section 1; Table 1).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is a benchmark + evaluation pipeline that turns reasoning-heavy tasks into image-conditioned generation prompts for both video and image generators.
It solves the evaluation problem by (i) generating controlled task instances (mazes, Sudoku, ARC-AGI puzzles, navigation scenes, physics prompts), (ii) collecting model outputs (images or videos), and (iii) scoring them with fine-grained rubrics and a strict “all conditions must pass” primary metric (Figure 1; Sections 3–4).

3.2 Big-picture architecture (diagram in words)

(A) Benchmark instance creation → (B) Model generation → (C) Automated evaluation (+ (D) Human evaluation subset) → (E) Metric aggregation & analysis
(A) creates an input (often an image) and a prompt with controlled difficulty.
(B) runs either a video model (multi-frame trajectory) or image model (single-frame solution).
(C) uses a VLM judge (Gemini-2.5-Pro) to produce structured metric labels.
(D) humans annotate a curated subset to calibrate/validate AutoEval.
(E) reports fine-grained metrics and a strict primary metric (Figure 1; Section 4.3; Figure 3).
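The (A) to (E) flow above can be sketched as a small harness. This is a minimal illustration, not MMGR's code: `Instance`, `run_pipeline`, `aggregate`, and the `Generator`/`Judge` callables are hypothetical names, and the judge is assumed to return one pass/fail flag per check. The default of 5 samples per prompt follows Section 4.2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Instance:
    """Stage (A): one benchmark item, a task name plus the model prompt."""
    task: str
    prompt: str


# Hypothetical callables; a real setup would wrap a model API and a VLM judge.
Generator = Callable[[Instance], str]               # returns an output handle
Judge = Callable[[Instance, str], Dict[str, bool]]  # check name -> passed?


def run_pipeline(instances: List[Instance], generate: Generator, judge: Judge,
                 samples_per_prompt: int = 5) -> List[dict]:
    """Stages (B) and (C): sample several outputs per prompt and judge each."""
    records = []
    for inst in instances:
        for k in range(samples_per_prompt):
            output = generate(inst)
            checks = judge(inst, output)
            records.append({
                "task": inst.task,
                "sample": k,
                "checks": checks,
                # Strict primary metric: every check must pass.
                "overall": all(checks.values()),
            })
    return records


def aggregate(records: List[dict]) -> Dict[str, float]:
    """Stage (E): strict success rate per task."""
    totals: Dict[str, List[int]] = {}
    for r in records:
        passed, seen = totals.setdefault(r["task"], [0, 0])
        totals[r["task"]] = [passed + int(r["overall"]), seen + 1]
    return {task: p / n for task, (p, n) in totals.items()}
```

Stage (D), human annotation, would feed the same record structure from a different source, which makes the two evaluators directly comparable per task.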

3.3 Roadmap for the deep dive

I will explain MMGR in the order that matches how you would implement or use it:
The five reasoning abilities and how tasks map to them (why these tasks exist).
The three domains and 10 tasks (what is evaluated).
The data generation + hard-level control (how difficulty is controlled).
The evaluation protocol (AutoEval with Gemini-2.5-Pro + HumanEval) and the meaning of “holistic” scoring.
The model suite and generation settings (how comparisons are run).

3.4 Detailed, sentence-based technical breakdown

Framing: This is an empirical benchmarking / evaluation framework paper whose core idea is to evaluate generative models with reasoning-grounded, domain-specific rubrics and holistic pass/fail success metrics, rather than perceptual realism metrics (Section 1; Section 4.3; Figure 1).

3.4.1 The five core reasoning abilities (what MMGR claims must be tested)

MMGR defines five abilities as pillars for “world simulation” evaluation (Section 1):

Physical reasoning: intuitive physics (gravity, collisions, object permanence, material behavior).
Logical reasoning: rule following, abstract operations (“if A then B”), constraint satisfaction.
3D spatial reasoning: 3D relations, navigation topology, viewpoint/occlusion consistency.
2D spatial reasoning: planar layout relations in the image plane (grids, adjacency, relative positions).
Temporal reasoning: event order/causality and long-range consistency across frames.

A key design choice is to separate 2D vs 3D spatial reasoning because MMGR treats map/grid logic (Sudoku/ARC) as fundamentally different from embodied multi-view geometry (Section 1, explanation after Table 1).

3.4.2 Benchmark scope: three domains, 10 tasks, 1,853 samples

MMGR spans 3 domains and 10 tasks, totaling 1,853 evaluation samples (Table 1; Table 2):

Domain 1: Abstract Reasoning (logical + 2D spatial + sometimes temporal)
Maze: 240 samples (Section 5; Table 1/2).
Sudoku: 300 samples (Section 6; Table 1/2).
ARC-AGI: 456 tasks (v1: 381, v2: 75) (Section 7; Table 8; Table 1/2).
Math: 327 problems across GSM8K, MATH500, AIME 2024/2025, Omni-MATH (Section 8; Table 18).
Domain 2: Embodied Navigation (physical + 2D/3D spatial + temporal)
Last-Mile Navigation (Panoramic view): 120 samples (Section 10; Table 1/2).
Top-down View Navigation: 120 samples (Section 11; Table 1/2).
3D Real-World Navigation: 120 samples (Section 12; Table 1/2).
SLAG (Simultaneous Localization and Generation): 120 samples (Section 13; Table 1/2).
Domain 3: Physical Commonsense (physical + temporal + spatial)
Physical Concept: 25 videos (sampled from a larger VideoPhy-based pool) (Section 3.3; Table 1/2).
Sports: 25 videos (sampled from a larger sports prompt pool) (Section 3.3; Table 1/2).
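As a quick consistency check, the per-task counts listed above can be summed to confirm the stated totals. The dictionary below simply transcribes the numbers from this section; the variable names are illustrative.

```python
# Per-task sample counts as listed above (Table 1; Table 2).
SAMPLES = {
    "Abstract Reasoning": {
        "Maze": 240,
        "Sudoku": 300,
        "ARC-AGI": 381 + 75,  # v1 + v2
        "Math": 327,
    },
    "Embodied Navigation": {
        "Last-Mile Navigation": 120,
        "Top-down View Navigation": 120,
        "3D Real-World Navigation": 120,
        "SLAG": 120,
    },
    "Physical Commonsense": {
        "Physical Concept": 25,
        "Sports": 25,
    },
}

domain_totals = {domain: sum(tasks.values()) for domain, tasks in SAMPLES.items()}
total = sum(domain_totals.values())
num_tasks = sum(len(tasks) for tasks in SAMPLES.values())

print(domain_totals)
print(f"{num_tasks} tasks, {total} samples")
```

The counts add up to 10 tasks and 1,853 samples, with Abstract Reasoning contributing 1,323 of them, so aggregate numbers are dominated by that domain unless results are reported per task.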

3.4.3 Hard-level control: how MMGR creates graded difficulty

MMGR does not just collect prompts; it explicitly controls difficulty knobs per task (Section 3; detailed in Sections 5–14). Examples:

Maze
Generated using two algorithms, depth-first search (DFS) and Wilson’s algorithm, producing 240 mazes total (Section 5.2).
Difficulty tiers by grid size:
- Easy: 3×3–5×5
- Medium: 6×6–9×9
- Hard: 10×10–13×13
Also controls start-goal placement (4 schemes) and enforces minimum start-goal distance (Section 5.2).
Sudoku
300 puzzles across grid sizes 4×4 and 9×9 and difficulties Easy/Medium/Hard (Section 6.1).
Difficulty is modulated by number/sparsity of initial clues (Section 6.1).
ARC-AGI
Combines ARC-AGI v1 (381) + v2 (75) for 456 tasks (Section 7.1; Table 8).
Two-level classification:
- Shape consistency: Match (same grid size) vs Mismatch (size changes) (Section 7.1.1).
- Quantitative difficulty from grid features (grid size bins, color count, object count, occupancy ratio, and ∆IO for Match) with explicit thresholds (Section 7.1.1).
Math
Uses multiple datasets with explicit sample counts: GSM8K (50), MATH500 (50), AIME24 (30), AIME25 (30), Omni-MATH (167) for 327 total (Table 18).
Omni-MATH further categorized by difficulty T0–T4 and by subject category (Table 19; Tables 21–22).
Embodied Navigation
Each of the 4 tasks has 120 samples, arranged into 24 configurations crossing:
- environmental complexity (1 floor vs 2+ floors),
- view fidelity (quality03–quality05),
- trajectory distance (short vs long),
- destination specification (color mark vs language description) (Section 9.2; Table 23).
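Three of the difficulty controls above are simple enough to sketch directly. This is an illustration under stated assumptions, not the benchmark's generator: mazes are assumed square (side length `n`), ARC grids are assumed to be non-empty lists of rows, "quality03–quality05" is assumed to mean three levels (03, 04, 05), and samples are assumed to be spread evenly across navigation configurations. All function and variable names are hypothetical.

```python
from itertools import product
from typing import List


def maze_tier(n: int) -> str:
    """Map a square maze's side length to the tiers listed above."""
    if 3 <= n <= 5:
        return "Easy"
    if 6 <= n <= 9:
        return "Medium"
    if 10 <= n <= 13:
        return "Hard"
    raise ValueError(f"grid size {n} is outside the 3..13 range")


def arc_shape_consistency(inp: List[List[int]], out: List[List[int]]) -> str:
    """'Match' if input and output grids have the same height and width."""
    same = len(inp) == len(out) and len(inp[0]) == len(out[0])
    return "Match" if same else "Mismatch"


# Navigation axes as described above; the middle fidelity level is an assumption.
NAV_AXES = {
    "complexity": ["1 floor", "2+ floors"],
    "view_fidelity": ["quality03", "quality04", "quality05"],
    "distance": ["short", "long"],
    "destination": ["color mark", "language description"],
}

# Full cross product of the four axes: 2 * 3 * 2 * 2 configurations.
NAV_CONFIGS = [dict(zip(NAV_AXES, combo)) for combo in product(*NAV_AXES.values())]
assert len(NAV_CONFIGS) == 24

# If the 120 samples per task are spread evenly (an assumption), each cell has 5.
samples_per_config = 120 // len(NAV_CONFIGS)
print(maze_tier(7), len(NAV_CONFIGS), samples_per_config)
```

With only a handful of samples per configuration, per-cell success rates move in coarse steps, which is worth keeping in mind when reading stratified navigation tables.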

3.4.4 The evaluation pipeline: what happens first, second, third (explicit flow)

MMGR’s evaluation is explicitly multi-stage (Figure 1; Section 4):

Prepare task input + reference
For many tasks, the input includes an image (e.g., a maze, Sudoku, ARC examples+test input, navigation scene with markers).
A ground-truth solution is available for tasks like Maze/Sudoku/ARC/Math (Sections 5–8).
Generate outputs from models
For each prompt, MMGR generates 5 samples per model to account for stochasticity (Section 4.2).
It uses default API parameters for closed models and “recommended configurations” for open models, and does zero-shot evaluation (Section 4.2).
Important missing detail: The provided text does not include low-level generation parameters (e.g., diffusion steps, sampler, guidance scale, video length in seconds/frames) for any model; MMGR states only that defaults/recommended settings are used (Section 4.2).
Auto-evaluate with a VLM judge
MMGR uses Gemini-2.5-Pro as a unified judge that consumes:
- the generated video/image,
- task-specific context (often including the ground-truth solution),
- a structured rubric prompt.
It outputs structured metric labels (Section 4.3; Figure 1).
The judge produces fine-grained metrics and MMGR composes a primary metric that requires all sub-metrics to be satisfied (Section 4.3).
Human evaluation on curated subsets
MMGR builds a web interface with frame-by-frame controls and task-specific forms (Figure 3; Section 4.5).
Human protocol: 6 annotators, 4-hour instruction, 50-video practice, calibration meetings (Section 4.5).
Aggregate metrics
MMGR emphasizes that reporting partial metrics instead of the holistic “all constraints satisfied” score can inflate apparent performance by 1.2×–4× (Section 4.3).

3.4.5 Metric design: “holistic correctness” vs partial credit

A recurring MMGR design choice is to separate:

Fine-grained metrics diagnosing specific failure modes, from
a primary metric that is a strict conjunction (“pass only if everything passes”).

Concrete examples from the provided text:

Maze metrics (Section 5.3; Table 4; Table 5)
Fine-grained: Maze Changed, Cross Wall, Action Reflection, Target Achievement.
Primary: Overall.
Notable inconsistency in the paper text: In Section 5.3, Cross Wall is described as a failure mode where 1 indicates crossing, and Table 4 marks Cross Wall ↓ (lower is better). However, the bullet for Overall Score in the provided excerpt reads: “Overall Score: 1 only if Maze Changed=0 AND Cross Wall=1 AND Task Completion=1; 0 otherwise.” This conjunction appears sign-inverted for Cross Wall. Based on the metric name (“Cross Wall” as failure) and the downward arrow in Table 4, the simplest consistent interpretation is that Overall intends to require no wall crossing (i.e., Cross Wall=0). This cannot be resolved from the provided text and is flagged here as an internal inconsistency.
Sudoku metrics (Section 6.2; Table 6; Table 7)
Fine-grained: Clues Changed, Constraints Violation, Completion Accuracy, Action Reflection (video only).
Primary: Overall.
Another potential naming/sign ambiguity: Constraints Violation is described as “fraction of constraints correctly satisfied” (where 1 means full compliance), but it is labeled as a “Failure Mode” and shown with ↓ in Table 6. The tables show large percentages (e.g., 42.00%), which could be either “violation rate” or “compliance rate” depending on definition. MMGR’s narrative interprets higher values as worse (video models have “high constraint-violation rates”), suggesting the metric is actually a violation rate despite the “fraction satisfied” wording. The provided text is inconsistent; the qualitative conclusions rely on the “higher is worse” reading.
Embodied navigation metrics (Section 9.3; Figure 16; Tables 24–33)
Task completeness: Success Score and Oracle Success Score in 2D/3D (depending on task), plus Trajectory Alignment for SLAG.
Physical understanding: Object Semantic (no collisions), Agent Consistency (single continuous agent), Spatial Alignment (heading/motion coherence).
Instruction following: Destination Integrity (goal marker unchanged; no hallucinated goal), Scene Consistency (static environment).
Primary metric: Overall Success as a conjunction of all applicable binary checks (Sections 9.3.4, 10.2.2, 11.2.2, 12.2.2, 13.2.2).
Physical commonsense metrics (Section 14.3; Table 35)
Four binary dimensions: Physics Accuracy, Motion Quality, Visual Realism, Prompt Adherence.
Primary: Overall = conjunction of all four.
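The strict conjunction described in this section can be sketched as follows. This is a minimal illustration with hypothetical helper names (`overall`, `rates`); it assumes binary labels and treats failure-mode metrics as "must be 0", which for Maze follows the Cross Wall=0 interpretation flagged above rather than the literal wording of the excerpt. The four-sample dataset is synthetic and only shows how component rates and the holistic rate can diverge.

```python
from typing import Dict, Iterable, List


def overall(labels: Dict[str, int], failure_modes: Iterable[str] = ()) -> int:
    """Return 1 only if every success metric is 1 and every failure metric is 0."""
    failure = set(failure_modes)
    return int(all((value == 0) if name in failure else (value == 1)
                   for name, value in labels.items()))


def rates(samples: List[Dict[str, int]],
          failure_modes: Iterable[str] = ()) -> Dict[str, float]:
    """Per-metric mean plus the strict Overall rate across samples."""
    n = len(samples)
    failure = set(failure_modes)
    result = {name: sum(s[name] for s in samples) / n for name in samples[0]}
    result["Overall"] = sum(overall(s, failure) for s in samples) / n
    return result


# Maze-style labels: two failure modes and one success metric (assumed reading).
maze_failures = {"Maze Changed", "Cross Wall"}
clean = {"Maze Changed": 0, "Cross Wall": 0, "Task Completion": 1}
crossed = {"Maze Changed": 0, "Cross Wall": 1, "Task Completion": 1}
print(overall(clean, maze_failures), overall(crossed, maze_failures))

# Synthetic physical-commonsense labels: each sample fails at most one dimension.
synthetic = [
    {"Physics Accuracy": 1, "Motion Quality": 1, "Visual Realism": 1, "Prompt Adherence": 1},
    {"Physics Accuracy": 1, "Motion Quality": 1, "Visual Realism": 1, "Prompt Adherence": 0},
    {"Physics Accuracy": 0, "Motion Quality": 1, "Visual Realism": 1, "Prompt Adherence": 1},
    {"Physics Accuracy": 1, "Motion Quality": 0, "Visual Realism": 1, "Prompt Adherence": 1},
]
print(rates(synthetic))
```

In the synthetic set every individual dimension scores at least 0.75 while Overall is 0.25, which is the kind of gap between component scores and holistic success that the benchmark is designed to expose.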

3.4.6 Models evaluated and experimental settings

Video models (Table 3): Veo-3, Sora-2, Wan-2.2.
Image models (Table 3): Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image.
Generation: 5 samples per prompt; zero-shot; default/recommended settings (Sections 4.2–4.4).
Evaluator: Gemini-2.5-Pro as AutoEval judge (Section 4.3; Figure 1).

4. Key Insights and Innovations

(1) A unified, five-ability framework tied to generative evaluation (Section 1; Table 1)
Novelty: Instead of evaluating “video quality,” MMGR defines explicit reasoning abilities and maps tasks to them, enabling capability diagnosis (e.g., separating 2D grid logic from 3D navigation).
Significance: This supports targeted failure analysis (e.g., “temporal consistency barrier” in video ARC-AGI; Section 7.4.3).
(2) Holistic primary metrics that penalize “partial success” (Section 4.3; Tables throughout)
Novelty: MMGR systematically reports fine-grained sub-metrics but prioritizes strict conjunction-based success.
Significance: The benchmark argues partial metrics can overstate performance by 1.2×–4× (Section 4.3), and the reported tables repeatedly show large gaps between component scores and holistic success (e.g., Embodied Navigation; Table 24).
(3) Multi-modal parity: evaluating both video and image generators on reasoning tasks (Figure 1; Section 4.4)
Novelty: Many reasoning benchmarks are for discriminative models; MMGR adapts reasoning problems into generative tasks and compares across modalities.
Significance: This reveals consistent modality asymmetries (e.g., image models dominating Sudoku/Math; Sections 6 and 8).
(4) Explicit evaluation reliability analysis: AutoEval vs HumanEval (Sections 5.5.2, 6.4.2, 7.4.2, 10.3.2, 11.3.2, 12.3.2, 13.3.2, 14.5.2)
Novelty: The benchmark does not assume VLM judging is correct; it quantifies where it fails.
Significance: It shows evaluation is itself a bottleneck—e.g., Maze wall-crossing is massively underdetected by AutoEval (Table 5), while in Physical Commonsense humans rate Veo-3 higher than AutoEval (Table 38), indicating task-dependent bias directions.
(5) Identification of “temporal tax” and “hallucination of competence” patterns (Conclusion; Section 8.3; Section 5.5.2)
Novelty: MMGR characterizes a recurring failure mode where temporally coherent generation competes with logical/global state consistency.
Significance: This pattern appears across Maze (fast motion hides wall crossings), Sudoku (temporal drift changes clues), ARC (examples drift), and Math (correct final answers with invalid intermediate steps).

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, setup)

Dataset size: 1,853 total samples across 10 tasks (Table 1; Table 2).
Per-prompt sampling: 5 generations per prompt per model (Section 4.2).
Auto evaluation: Gemini-2.5-Pro with task-specific rubrics (Section 4.3).
Human evaluation: subset evaluations, 6 annotators with training protocol (Section 4.5; Figure 3).
Metrics: task-specific fine-grained metrics + strict primary metric (Section 4.3; Figure 16; Tables 4, 6, 9–15, 20–22, 24–33, 35–40).

5.2 Main quantitative results (with specific numbers)

I focus on the headline results that the provided text explicitly supports.

5.2.1 Abstract Reasoning

Maze (Table 4; Table 5)
AutoEval “Overall” for Veo-3 ranges roughly 38.69%–51.50% across DFS Medium/Hard and 45.63%–47.50% across Wilson Medium/Hard (Table 4).
HumanEval for Veo-3 shows Overall collapses to 0.00%–20.00% depending on setting (Table 5).
The largest discrepancy is Cross Wall: AutoEval reports ~15.63%–25.63%, while humans report 70.00%–100.00% wall-crossing detection for Veo-3 (Table 5).
Interpretation supported by the paper: AutoEval misses transient violations (frame dropping / temporal resolution limits) (Section 5.5.3).
Sudoku (Table 6; Table 7)
On 4×4 Easy, image models achieve much higher Overall than video models:
- Nano-banana: 66.25%
- GPT-4o-image: 61.22%
- Nano-banana Pro: 56.12%
- Best video (Veo-3): 11.38% (Table 6)
On 9×9, video models are near-zero on Overall (e.g., Veo-3: 2.57%–3.18%; Sora-2 peaks at 7.14% on 9×9 Hard) (Table 6).
HumanEval reports 0.00% Overall for Veo-3 at all grid sizes and difficulties in the evaluated subset (Table 7).
ARC-AGI (Tables 9–17)
On ARC-AGI v1 (381 tasks):
- Best overall among evaluated models is Nano-banana Pro: 30.54%
- Best video is Sora-2: 20.18%
- Veo-3: 5.16%
- Wan-2.2: 0.17%
- GPT-4o-image: 0.00% (Table 9)
On ARC-AGI v2 (75 tasks):
- Nano-banana Pro: 30.36%
- Veo-3: 4.00%
- Sora-2: 1.33% (Table 10)
HumanEval subset for Veo-3 reports 0.00% Valid Solution across 98 evaluated cases (60 v1 + 38 v2), despite AutoEval reporting non-zero valid solutions (Tables 16–17; Section 7.4.2).
Math (Tables 18–22; results in Table 20)
MMGR separates Process Success Rate (intermediate correctness) from Outcome Success Rate (final answer correctness) and defines Overall as requiring both (Section 8.2).
Example of “reasoning-outcome disconnect” (Table 20):
- On GSM8K, Veo-3 has 74.00% outcome but only 12.00% process and 12.00% overall.
- Nano-banana Pro shows 97.83% process, 97.83% outcome, 97.83% overall on GSM8K.
On AIME25, Nano-banana Pro achieves 66.67% overall, while video models are near 0–5.56% (Table 20).
On Omni-MATH, Nano-banana Pro achieves 63.06% overall; best video (Veo-3) is 3.89% overall (Table 20).
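Because Overall requires both process and outcome to be correct, the two component rates constrain what Overall can be. The sketch below applies the standard bounds for the probability of a conjunction; `overall_bounds` is an illustrative name, and the check assumes all three rates are computed over the same set of samples.

```python
from typing import Tuple


def overall_bounds(process: float, outcome: float) -> Tuple[float, float]:
    """Bounds on the rate of (process AND outcome) given the two marginal rates."""
    lower = max(0.0, process + outcome - 1.0)  # the two events must overlap this much
    upper = min(process, outcome)              # the conjunction cannot beat either part
    return lower, upper


# Rates reported above for GSM8K (Table 20), as fractions.
for model, process, outcome, reported in [
    ("Veo-3", 0.12, 0.74, 0.12),
    ("Nano-banana Pro", 0.9783, 0.9783, 0.9783),
]:
    lower, upper = overall_bounds(process, outcome)
    ok = lower - 1e-9 <= reported <= upper + 1e-9
    print(f"{model}: overall in [{lower:.4f}, {upper:.4f}], reported {reported:.4f}, consistent={ok}")
```

For Veo-3 the reported Overall sits exactly at the upper bound, which under the same-samples assumption means every generation with a valid process also reached the right answer, while most correct answers were reached without a valid process.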

5.2.2 Embodied Navigation

Aggregate per-task performance (Table 24)
Panoramic View Last-Mile Navigation:
- Nano-banana: 74.2 holistic overall
- Veo-3: 60.0
- Sora-2: 0.0
- Wan-2.2: 0.0
- GPT-4o-image: 0.0
Top-down View Navigation:
- Veo-3: 19.5 (highest among those listed)
- Nano-banana: 11.1
3D Real-World Navigation:
- Nano-banana: 79.2
- Wan-2.2: 24.2
- Veo-3: 22.5
- Sora-2: 0.0
SLAG:
- Nano-banana: 28.8
- GPT-4o-image: 16.1
- Sora-2: 12.9
- Veo-3: 11.2
- Wan-2.2: 0.8
Human evaluation reveals sharper “holistic collapse” (Table 25)
For Veo-3, human fine-grained scores can be moderate/high (e.g., Panoramic Spatial Alignment 96.67%), but Overall success is much lower:
- Panoramic Overall: 26.67%
- Top-down Overall: 5.93%
- 3D navigation Overall: 1.67%
- SLAG Overall: 0.00% (Table 25)
AutoEval vs HumanEval discrepancies are large, with AutoEval higher in every listed case
Panoramic: Auto Overall Success 73.33% vs Human 25.00% on floor01 (Table 27).
Top-down: Auto Overall Success 37.14% vs Human 10.34% on floor01 (Table 29).
3D navigation: Auto Overall Success 25.00% vs Human 3.33% on floor01 (Table 31).
SLAG: Auto Overall Success 11.86% vs Human 0.00% on floor01 (Table 33).
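The size of these gaps is easier to compare as an absolute difference and an overestimation ratio. The snippet below only re-expresses the four floor01 pairs quoted above; `gap_and_ratio` is an illustrative helper, and a human score of zero is reported as an infinite ratio.

```python
from typing import Tuple

# (AutoEval Overall %, HumanEval Overall %) on floor01, from Tables 27, 29, 31, 33.
PAIRS = {
    "Panoramic": (73.33, 25.00),
    "Top-down": (37.14, 10.34),
    "3D navigation": (25.00, 3.33),
    "SLAG": (11.86, 0.00),
}


def gap_and_ratio(auto: float, human: float) -> Tuple[float, float]:
    """Absolute gap in percentage points and the auto/human ratio."""
    gap = auto - human
    ratio = auto / human if human > 0 else float("inf")
    return gap, ratio


for task, (auto, human) in PAIRS.items():
    gap, ratio = gap_and_ratio(auto, human)
    print(f"{task:>14}: gap {gap:5.2f} pts, ratio {ratio:.2f}x")
```

The ratios come out at roughly 2.9×, 3.6×, and 7.5× for the first three tasks, and undefined for SLAG because humans recorded no successes, so the relative overestimation grows as the human-rated success rate shrinks.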

5.2.3 Physical Commonsense (video-only in MMGR)

Physical commonsense overall (Table 35)
Sora-2 average overall: 70.00%
Veo-3 average overall: 51.02%
Wan-2.2 average overall: 24.00%
Notably, Wan-2.2 can score high on Visual Realism (e.g., Sports: 96.00%) but low on Prompt Adherence (Sports: 21.33%) and low overall (Table 35), supporting the “visual–physical disconnect” claim (Section 14.5.1).
AutoEval vs HumanEval flips direction here (Table 38)
For Veo-3, humans rate overall success higher than AutoEval:
- Average AutoEval Overall: 51.02%
- Average HumanEval Overall: 80.00% (Table 38)
This contrasts with Maze/Navigation where humans are typically harsher than AutoEval, indicating evaluator bias depends on task properties (temporal density, strictness, perceptual ambiguity).

5.3 Do the experiments support the claims?

Supported strongly by provided evidence:

Perceptual metrics are insufficient: Across tasks, models can be visually plausible but fail constraint checks (e.g., maze wall crossing; navigation scene drift; physics violations) (Sections 5, 9–13, 14).
Modality asymmetry: Image models outperform video models on symbolic/abstract tasks like Sudoku and Math (Tables 6, 20), while video models show some strengths in certain navigation settings (Table 24; Section 9.4.1).
AutoEval reliability is not guaranteed: Quantified gaps are reported repeatedly, sometimes large (Tables 5, 7, 16–17, 27, 29, 31, 33, 38).

Less fully supported / needs caution:

Causal claims about training data imbalance and architectural weaknesses are plausible interpretations (Conclusion; earlier bullet list in Section 1) but are not validated by controlled interventions in the provided text (no ablations on training recipes or architectures are included here).

6. Limitations and Trade-offs

(1) Reliance on VLM-as-judge with known failure modes
AutoEval misses transient temporal violations in dense videos (Maze wall crossing; Table 5; Section 5.5.3).
AutoEval can also be overly strict or overly lenient depending on the metric/task (e.g., Veo-3 physical commonsense scores are higher under HumanEval than AutoEval; Table 38).
(2) Human evaluation coverage is limited to subsets
Human evaluation is performed on curated subsets (e.g., Veo-3 subset for Maze; subset for ARC-AGI; n=45 for physical commonsense; Section 14.5.2).
This is necessary for cost, but it means calibrated “true” performance is not available for every model/task pair.
(3) Internal metric-definition ambiguities in the provided text
As noted:
- Maze Overall definition appears inconsistent with the meaning of Cross Wall (Section 5.3 vs Table 4 conventions).
- Sudoku Constraints Violation description conflicts with its naming/arrow direction and the narrative (Section 6.2 vs Table 6 interpretation).
These ambiguities complicate re-implementation unless the repository clarifies them.
(4) Generation settings are underspecified in the provided excerpt
MMGR states it uses default API parameters and recommended configurations (Section 4.2), but does not list:
- video length / FPS / resolution,
- sampling steps,
- guidance scales,
- seeds,
- decoding parameters.
This limits reproducibility analysis from the provided text alone.
(5) Benchmark composition and sampling choices may affect conclusions
Physical commonsense uses 25 sampled prompts per subtask (Physical Concepts / Sports) from larger pools (Section 3.3; Section 14.1), which may introduce variance.
The ARC-AGI v2 difficulty distribution is skewed (e.g., only 1 Easy Match case in v2; Table 15 note), making some stratified results sensitive to small counts.
(6) Scope limitation: evaluation, not training
MMGR diagnoses failures but does not provide a verified method that fixes them (Conclusion suggests directions like external memory/world-state representations, auxiliary objectives, RL from structured feedback, etc., but these are not empirically tested here).

7. Implications and Future Directions

How this changes the landscape
MMGR reframes progress in video/image generation from “better looking” to “more correct under constraints,” making it harder for models to “pass” by exploiting perceptual plausibility (Section 4.3; Conclusion).
It also highlights that evaluation tooling (VLM judges) can be a major bottleneck, sometimes producing 2–5× overestimation or, in other tasks, underestimation (Tables 5 and 38).
Research directions suggested by MMGR’s failure patterns (from the provided text)
Reduce the “temporal tax”: develop architectures that separate reasoning state from visual rendering, since maintaining frame coherence can compete with global constraint satisfaction (Conclusion; Sections 6–8).
Improve global state consistency: failures like ARC example drift (Figure 11), navigation scene drift (low scene consistency in some tasks; Tables 25, 30–33), and Sudoku clue changes (Table 6) suggest a need for persistent representations (the paper mentions ideas like external memory/world-state representations; Section 1 bullet list; Conclusion).
Train with reasoning-correctness objectives: optimize for rule adherence and causal correctness rather than only reconstruction/adversarial/perceptual fidelity (Section 1 bullet list; Conclusion).
Evaluator-aware benchmarks: since “evaluability” matters (Section 5.5.3), develop evaluation protocols that are robust to high-speed motion and small frame-local violations.
Practical applications / downstream use
Model selection: MMGR suggests current models may be usable for:
- some physical commonsense generation (e.g., Sora-2 overall 70% on physical commonsense; Table 35),
- some last-mile navigation-like generation when the goal is visible (Panoramic; Table 24).
They do not yet appear usable for:
- reliable abstract puzzle solving (ARC, Sudoku),
- long-horizon embodied planning with strict physical and cross-view alignment (SLAG; Table 25; Table 33).
Benchmarking world models: MMGR can serve as a diagnostic suite for “world model” claims, especially when paired with human calibration on high-risk slices (Maze Hard; SLAG).
Repro/Integration Guidance (based on the provided paper content)
Prefer MMGR’s holistic primary metrics when comparing models; relying on partial metrics (e.g., just reaching a goal) can overstate capability (Section 4.3).
Use HumanEval alongside AutoEval for tasks where transient errors are decisive (Maze; Section 5.5.3; Table 5) or where cross-view alignment is subtle (SLAG; Table 33).
If integrating MMGR-like evaluation internally, treat VLM judging as a noisy sensor whose bias varies by task: it can miss fast violations (Maze, navigation), but can be stricter than humans in plausibility judgments (physical commonsense; Table 38).