
Molmo2

Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

ArXiv: 2601.10611

🎯 Pitch

Molmo2 is a fully open family of vision–language models (weights, data, and code) that advances video understanding by adding robust spatio-temporal grounding—pointing and continuous object tracking—alongside high-quality captioning and QA. By releasing nine new video- and multi-image datasets collected without distilling from proprietary VLMs and a scalable training recipe, Molmo2 enables reproducible progress on real-world tasks that require precise when-and-where localization (robotics, assistive tech, video search, analytics).


1. Executive Summary

Molmo2 is a fully open (weights + data + code) family of vision-language models (VLMs) designed to close a practical gap in video understanding: producing not only text answers/captions, but also grounded outputs like spatio-temporal points and object tracks. Its main significance is that it demonstrates strong video QA/captioning performance while substantially advancing video grounding (pointing and tracking) using training data collected without distilling from proprietary VLMs, backed by a detailed and efficient training recipe.

2. Context and Motivation

  • Problem/gap addressed
  • Strong video-language models are largely proprietary, and many “open” alternatives either:
    • rely on synthetic data generated by closed VLMs (effectively distillation from proprietary systems), or
    • do not release training data/recipes, limiting reproducibility and iteration.
  • Even when video understanding is good at a high level, many downstream applications need grounding:
    • pointing: “click the moment and location where X happens,” and
    • tracking: “track object Y across time.”
  • The paper positions grounding as a missing capability in many systems (including proprietary ones), especially for video, where outputs must be aligned to time as well as pixels.

  • Why this matters

  • Real-world tasks like video search, assistive tech, robotics, sports analytics, and monitoring often require precise localization (where/when) rather than only descriptive text (Introduction).
  • Grounding enables more controllable and verifiable outputs (e.g., counting by emitting one point per instance).

  • Prior approaches and shortcomings (as framed here)

  • Open models: often lack open data/recipe or rely on proprietary VLM-generated synthetic labels.
  • Existing video grounding datasets: described as narrow in scope/vocabulary, not sufficient for arbitrary user prompts (Introduction).

  • How this work positions itself

  • A “fully open” stack: open weights + open datasets + open training code.
  • Data created without closed VLMs for annotation generation; however, the pipeline still uses closed text-only LLMs in places (explicitly acknowledged in Limitations §H).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a multimodal model that takes single images, sets of images, or videos and generates either text (answers/captions) or grounded outputs (points and tracks over time).
  • The solution combines a vision transformer encoder with a language model, plus training data and encoding strategies specialized for video-scale token budgets and multi-annotation supervision.

3.2 Big-picture architecture (diagram in words)

  • (1) Video/Image input → preprocessing: crop/resize images; sample video frames at fixed FPS with a max frame cap.
  • (2) Vision encoder (ViT): convert each crop/frame into patch features.
  • (3) Connector: pool and project vision features into a set of “visual tokens” usable by the LLM.
  • (4) LLM backbone: consume interleaved visual tokens + text (timestamps, image indices, subtitles), produce text and/or structured point/track strings.
  • (5) Training infrastructure: “message trees” + custom attention masks + sequence packing + token-weighting to handle diverse tasks efficiently (Figure 3, Section 3.2).

3.3 Roadmap for the deep dive

  • Explain the model I/O and grounding formats (points/tracks) because they define what the model can emit.
  • Explain vision preprocessing and tokenization because video token budgets dominate compute.
  • Explain the three-stage training pipeline and data mixture because performance depends heavily on dataset composition.
  • Explain efficiency mechanisms (packing + message trees + attention masks) because they enable scaling to many annotations and long sequences.
  • Explain evaluation protocols (especially for captioning/grounding) to interpret the reported numbers.

3.4 Detailed, sentence-based technical breakdown

This is an empirical system + dataset + training recipe paper whose core idea is: build a large, fully open video-centric multimodal corpus (including new grounding datasets) and train a standard ViT→LLM VLM with specialized data encodings and efficiency tricks so it can do video understanding + point/track grounding effectively.

3.4.1 System/data pipeline diagram in words (what happens first → second → third)

  1. Data collection / construction happens first (Section 2, Figure 1), producing:
    • dense video captions (human + model-assisted),
    • long-form QA (human–LLM collaboration),
    • synthetic video QA (captioner + LLM),
    • video pointing (human clicks after LLM-generated queries),
    • video tracking (repurposed tracks + human-written referring queries),
    • multi-image QA and multi-image pointing datasets.
  2. Examples are formatted into “message trees” (Section 3.2), where:
    • the first message contains the visual input tokens (video frames / images),
    • each annotation (caption, QA turn, pointing request, tracking request) becomes a branch.
  3. Multiple message trees are packed into a single long training sequence (Section 3.2; Figure 3) using:
    • a custom attention mask to prevent branches (or different examples) from attending to each other improperly,
    • dynamic packing to minimize padding waste.
  4. The model is trained in three stages (Section 3.2):
    • image-only pre-training (captioning + transcript prediction + pointing + text-only),
    • joint multimodal SFT on a mixture of images/videos/multi-image tasks,
    • short long-context SFT to extend context length and max frames.
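
The branch isolation used by message trees can be illustrated with a small mask builder. This is a minimal sketch under simplifying assumptions, not the paper's implementation: it covers a single example (no packing), uses plain causal attention everywhere (the bi-directional attention among visual tokens is omitted), and the `branch_ids` encoding is a hypothetical convention chosen for illustration.

```python
from typing import List, Optional

def tree_attention_mask(branch_ids: List[Optional[int]]) -> List[List[bool]]:
    """Return mask[i][j] = True if token i may attend to token j.

    branch_ids[k] is None for tokens in the shared first message (visual input)
    and an integer label for tokens that belong to one annotation branch.
    (Hypothetical encoding, used only for this sketch.)
    """
    n = len(branch_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only the current and earlier positions
            same_branch = branch_ids[i] == branch_ids[j]
            from_shared_prefix = branch_ids[j] is None
            mask[i][j] = same_branch or from_shared_prefix
    return mask

# Three shared visual tokens followed by two annotation branches of two tokens each.
ids = [None, None, None, 0, 0, 1, 1]
mask = tree_attention_mask(ids)
```

In this toy example, tokens of branch 1 can see the three shared visual tokens and earlier tokens of their own branch, but not the tokens of branch 0, which is the property that lets several annotations share one visual input without leaking answers.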

3.4.2 Model architecture and interfaces

  • Overall structure
  • The model follows the common “vision encoder + connector + LLM” pattern (Section 3.1, Figure 2).

  • Vision encoder

  • All variants use SigLIP 2 So400m/14, 384px ViT (Appendix A; Table 12).
  • ViT hyperparameters given (Table 12):

    • dim = 1152, layers = 27, heads = 16, patch size = 14, image size = 384×384, dropout 0.0.
  • Cropping / frame sampling

  • Images: use a low-res full-image crop plus up to K overlapping tiled crops (Section 3.1).
    • Training: K = 8; inference: K = 24.
    • Crops are resized to 378 without black padding (Appendix A), following the ViT’s training procedure described there.
  • Videos:

    • sample frames at S = 2 fps as single crops (Section 3.1),
    • cap at F = 128 frames normally, F = 384 during long-context training (Section 3.1 / 3.2),
    • if video is longer than F/S, uniformly sample F frames; always include the last frame (Section 3.1; Appendix A explains timestamp-based extraction for variable-FPS).
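
The frame-sampling rule above can be sketched as follows. This is an illustrative approximation, assuming the "last frame" timestamp equals the video duration and that `max_frames` is at least 2; the function name and signature are hypothetical, and the timestamp-based extraction for variable-FPS video mentioned in Appendix A is not modeled.

```python
from typing import List

def sample_frame_times(duration_s: float, fps: float = 2.0,
                       max_frames: int = 128) -> List[float]:
    """Timestamps (in seconds) of the frames to extract from one video."""
    # Fixed-rate sampling at `fps`, starting from t = 0.
    times = [k / fps for k in range(int(duration_s * fps) + 1)]
    if times[-1] < duration_s:
        times.append(duration_s)  # always include the last frame
    if len(times) <= max_frames:
        return times
    # Video too long for the frame cap: spread max_frames uniformly,
    # with the final sample landing on the last frame.
    return [duration_s * k / (max_frames - 1) for k in range(max_frames)]

short = sample_frame_times(10.0)    # fixed 2 fps sampling
long = sample_frame_times(600.0)    # capped at 128 uniformly spaced frames
```

With the defaults, a 10-second clip yields 21 timestamps at 0.5-second spacing, while a 10-minute video is reduced to 128 timestamps.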
  • Connector (vision → language)

  • Uses features from the third-to-last and ninth-from-last ViT layers (Section 3.1).
  • Pools local patch windows:
    • images: 2×2 pooling windows,
    • video frames: 3×3 pooling windows (reduces token count).
  • Pooling uses multi-head attention where the mean of patches is the query, then a shared projection MLP (Section 3.1).
  • Connector hyperparameters vary by model size (Table 12), with pool heads = 16, pool dim = 1152, dropout = 0.0.
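
A rough token-budget calculation shows why the larger pooling window matters for video. The numbers below are derived here rather than quoted from the paper: they assume a 378-pixel crop with patch size 14 (27 patches per side), and that partial pooling windows at the border are kept (rounding up), which is an assumption for the 2×2 case. Timestamps, special tokens, and text are not counted.

```python
import math

def tokens_per_crop(patches_per_side: int, pool: int) -> int:
    """Visual tokens left after pooling pool x pool patch windows."""
    side = math.ceil(patches_per_side / pool)  # partial windows kept (assumption)
    return side * side

patches = 378 // 14                              # 27 patches per side
image_tokens = tokens_per_crop(patches, 2)       # 14 * 14 = 196 per image crop
frame_tokens = tokens_per_crop(patches, 3)       # 9 * 9 = 81 per video frame
video_tokens_default = 128 * frame_tokens        # 10368 for F = 128
video_tokens_long = 384 * frame_tokens           # 31104 for F = 384
```

Under these assumptions, 128 frames fit within the 16,384-token SFT sequence length and 384 frames fit within the 36,864-token long-context length, with room left for text.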

  • LLM backbones and attention behavior

  • The LLM consumes visual tokens interleaved with:
    • timestamps (videos) and/or image indices (multi-image),
    • optional subtitles appended after visual tokens (Section 3.1; Appendix A formatting).
  • They enable bi-directional attention among visual tokens (“allow image tokens … to forward-attend to one another”) and report gains from this (Section 3.1; ablation in Table 8b).

  • Model scales

  • Released variants (Abstract, Table 2):
    • Molmo2-4B, Molmo2-8B (based on Qwen3 LLMs),
    • Molmo2-O-7B (based on OLMo 3).
  • LLM hyperparameters (Table 12) include:
    • 4B: dim 2560, layers 36, heads 32, KV heads 8, dropout 0.1.
    • 8B: dim 4096, layers 36, heads 32, KV heads 8, dropout 0.1.
    • O-7B: dim 4096, layers 32, KV heads 32, dropout 0.1.
  • RoPE theta values are listed (Table 12): 1M (4B/8B) and 0.5M (O-7B).

3.4.3 Grounding output formats (points and tracks)

  • Pointing / tracking representation
  • Outputs are encoded in an HTML-like tag format:
    • <points coords="...">text span</points>
    • <tracks coords="...">text span</tracks> (Appendix A).
  • Coordinates:
    • x, y are normalized to [0, 1000],
    • each point includes an object index (ID) used for:
    • counting (the final object index corresponds to total count),
    • tracking (same ID reused across frames) (Section 3.2; Appendix A).
    • video points include a timestamp (seconds with one decimal), image points include an image index (starting at 1).
  • Points are sorted by time/image index, then by x, y (Section 3.2; Appendix A).
  • They explicitly choose this compact format over JSON because it uses fewer tokens (Appendix A).
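
To make the point representation concrete, here is a small sketch of normalizing, sorting, and serializing video points. The tag name and the rules (coordinates in [0, 1000], one-decimal timestamps, sort by time then x, y, object IDs for counting) follow the description above, but the exact layout of the `coords` string is not given in this summary: the `"t id x y"` groups separated by `;` are a hypothetical layout invented for illustration, as are all function names.

```python
from typing import List, Tuple

# (timestamp_s, object_id, x_px, y_px)
RawPoint = Tuple[float, int, float, float]

def normalize(points: List[RawPoint], width: int, height: int):
    """Scale pixel coordinates to [0, 1000] and sort by time, then x, then y."""
    scaled = [
        (round(t, 1), oid, round(1000 * x / width), round(1000 * y / height))
        for t, oid, x, y in points
    ]
    return sorted(scaled, key=lambda p: (p[0], p[2], p[3]))

def serialize(points, label: str) -> str:
    # HYPOTHETICAL coords layout: "t id x y" groups joined by ";".
    coords = ";".join(f"{t:.1f} {oid} {x} {y}" for t, oid, x, y in points)
    return f'<points coords="{coords}">{label}</points>'

def count_objects(points) -> int:
    """Counting via IDs: the number of distinct object indices."""
    return len({oid for _, oid, _, _ in points})

raw = [(1.0, 2, 480, 120), (0.5, 1, 320, 240), (1.0, 1, 330, 240)]
pts = normalize(raw, width=640, height=480)
text = serialize(pts, "dog")
n_dogs = count_objects(pts)
```

Here object 1 appears at two timestamps (a short track) and object 2 at one, so the count is 2 even though three points are emitted.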

  • Training-time grounding specifics

  • For video pointing, examples can have up to 60 points annotated (Section 3.2).
  • They also create multi-turn conversations with multiple pointing/counting queries per video (Section 3.2).
  • For tracking, they add auxiliary tasks like predicting first/last appearance frames, and tracking from an input query + point (Section 3.2).

3.4.4 Data: what is trained on, and how it is collected

  • Nine new datasets highlighted (Table 1; Figure 1)
  • The paper groups datasets into:
    • captions/long QA,
    • image QA,
    • video QA,
    • image pointing,
    • video pointing,
    • video tracking,
    • NLP text-only SFT.
  • Table 1 reports example counts after formatting into message trees (e.g., total examples per group), and per-group sampling rates.

  • Key new datasets and sizes (Section 2)

  • Molmo2-Cap (human):
    • 104k video-level + 431k clip-level dense captions.
    • Captions are very long: 924 words per video on average; comparisons to prior datasets are given in Section 2.
    • Pipeline: spoken descriptions → transcription via Whisper-1 → rewrite by text-only LLM → merge with Molmo-generated frame-level captions via another LLM step (Section 2).
  • Molmo2-AskModelAnything (human):
    • 140k human-authored video QA pairs.
    • Uses a human–LLM refinement loop with an LLM (specified as Claude Sonnet 4.5) seeded by an early Molmo2 captioner (Section 2).
    • Filters out counting questions so counting can be handled by pointing (Section 2).
  • Molmo2-CapQA and Molmo2-SubtitleQA (synthetic):
    • 1M QA pairs for CapQA (200k videos × 5 QA/video),
    • 300k QA pairs for SubtitleQA (100k videos × 3 QA/video),
    • captions produced by a captioner trained on Molmo2-Cap; subtitles transcribed with Whisper-1 for SubtitleQA (Section 2).
  • Molmo2-VideoPoint (human):
    • 650k+ pointing queries on 280k videos, avg 6 points/video.
    • Queries derived from LLM over early Molmo2 captions; annotators select the frame and click location; frames at 2 fps (Section 2).
  • Molmo2-VideoTrack (human):
    • 3.6k clips and 15k natural-language queries, avg 2.28 objects/query (Section 2).
    • Built by re-labeling existing tracking annotations: annotators write complex queries that apply to subsets of tracked objects; validation round included (Section 2).
  • AcademicVideoPoint and AcademicVideoTrack (curated conversions):
    • Convert existing tracking/segmentation datasets into pointing/counting and tracking supervision.
    • Uses SAM-2 to convert bbox tracks into segmentation masks and point tasks (Section 2).
  • Multi-image datasets:

    • Molmo2-MultiImageQA (human): 45k image sets from 96k images, 72k QA pairs.
    • Molmo2-MultiImagePoint: 470k pointing/counting examples derived from PixMo-Points with canonicalized labels, but training samples pre-canonicalized variants to preserve lexical diversity (Section 2).
  • SFT mixture and sampling

  • They manually assign category sampling rates (Table 1), and within a category sample datasets proportional to \(\sqrt{\text{dataset size}}\) with manual adjustments (Section 3.2).
  • Table 13 provides a full per-dataset breakdown and rates; Figure 4 visualizes the mixture.
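
The square-root rule within a category can be sketched as below. The dataset names and sizes are placeholders, not values from Table 13, and the manual adjustments the authors apply on top are not modeled.

```python
import math
from typing import Dict

def sqrt_sampling_rates(sizes: Dict[str, int], category_rate: float) -> Dict[str, float]:
    """Split a category's sampling rate across datasets in proportion to sqrt(size)."""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: category_rate * r / total for name, r in roots.items()}

# Placeholder datasets: one is 100x larger than the other.
rates = sqrt_sampling_rates({"big_set": 1_000_000, "small_set": 10_000},
                            category_rate=0.2)
```

A dataset that is 100 times larger is sampled only 10 times as often, so small datasets are seen for more epochs than large ones instead of being drowned out.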

3.4.5 Training pipeline and hyperparameters

  • Optimizer and schedule
  • Uses AdamW with:
    • betas = (0.9, 0.95), eps = 1e-6,
    • cosine LR decay to 10% of peak,
    • no weight decay (Appendix B; Table 12).
  • Uses separate learning rates for ViT / connector / LLM in pre-training (Section 3.2).
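
A per-component schedule consistent with "warmup, then cosine decay to 10% of peak" might look like the sketch below. The linear warmup shape and the function signature are assumptions; the example values (peak 1e-5, 200 warmup steps, 30k steps) are the Stage 2 LLM settings listed later in this section.

```python
import math

def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_steps: int, final_frac: float = 0.1) -> float:
    """Learning rate at `step`: linear warmup (assumed), then cosine decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    # Decay from peak_lr down to final_frac * peak_lr.
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)

# Stage 2 LLM settings: peak 1e-5, 200 warmup steps, 30k total steps.
lr_start = lr_at(0, 30_000, 1e-5, 200)
lr_peak = lr_at(200, 30_000, 1e-5, 200)
lr_end = lr_at(30_000, 30_000, 1e-5, 200)
```

Calling the same function with each component's own peak and warmup values gives the separate ViT, connector, and LLM schedules.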

  • Stage 1: Image-only pre-training

  • Data mixture: 60% captioning, 30% image pointing, 10% natural language (Section 3.2).
  • Steps and batch size: 32k steps, batch size 128 (Section 3.2).
  • Sequence length: 2560 (Table 12).
  • LRs (Table 12):
    • ViT 6e-6, connector 2e-4, LLM 2e-4.
  • Warmups (Table 12):

    • ViT 2000, connector 200, LLM 2000.
  • Stage 2: Joint supervised fine-tuning (SFT)

  • Trains on mixed images + videos + multi-image + NLP (Section 3.2; Table 1).
  • Steps and batch size: 30k steps, batch size 128 (Section 3.2; Table 12).
  • Max sequence length: 16,384 tokens (Section 3.2; Table 12).
  • LRs (Table 12):
    • ViT 5e-6, connector 5e-6, LLM 1e-5.
  • Warmup: 200 steps for each component (Table 12).

  • Stage 3: Long-context SFT

  • Sequence length increased to 36,864 tokens; F = 384 frames (Section 3.2).
  • Train 2k steps (Section 3.2).
  • Uses context parallelism: each example processed by a group of 8 GPUs, with Ulysses attention for LLM context parallelism due to compatibility with custom attention masks (Section 3.2).
  • They also distribute vision encoder frame processing across context-parallel groups to reduce memory footprint (Section 3.2).

  • Token weighting (loss balancing)

  • Addresses the issue that tasks have wildly different output lengths (e.g., multiple-choice vs 4,000+ token captions).
  • Fixed loss weights:
    • video captions weight 0.1,
    • pointing weight 0.2 (Section 3.2).
  • For other tasks, weight by \(4/\sqrt{n}\) where \(n\) is number of answer tokens (Section 3.2).
  • Purpose: prevent long-output tasks from dominating the loss even if sampled infrequently.
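
The weighting rule can be written in a few lines. The sketch assumes the weight is applied to every answer token of an example (so an example's total contribution grows like the square root of its answer length, not linearly); the task labels are hypothetical names for this illustration.

```python
import math

# Fixed weights reported for the two long-output task types.
FIXED_WEIGHTS = {"video_caption": 0.1, "pointing": 0.2}

def token_weight(task: str, n_answer_tokens: int) -> float:
    """Per-token loss weight for one example (assumed per-token application)."""
    if task in FIXED_WEIGHTS:
        return FIXED_WEIGHTS[task]
    return 4.0 / math.sqrt(n_answer_tokens)

w_short = token_weight("multiple_choice", 1)    # single-token answer
w_medium = token_weight("open_qa", 16)
w_long = token_weight("open_qa", 400)
```

Under this reading, a 1-token answer contributes 4 units of weighted loss mass and a 400-token answer contributes 80, a 20× gap instead of the 400× gap that uniform weighting would give.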

  • Packing and message trees

  • Packing merges multiple short examples into one long sequence to reduce padding (Section 3.2).
  • On-the-fly packing algorithm (Appendix B):
    • maintains a pool of M = 48 preprocessed examples,
    • chooses subset maximizing \(T + I \cdot w_i\) subject to \(T \le 16384\) tokens and \(I \le 128\) crops, with \(w_i = 30\),
    • token counts are quantized to multiples of 32 for the solver.
  • Message trees (Section 3.2):

    • visual input is the first message; each annotation is a branch,
    • custom attention mask blocks cross-attention between branches (Figure 3),
    • average of 4 annotations per example,
    • packing fits 3.8 examples into a 16,348-token sequence on average, yielding 15× training efficiency (Section 3.2).
    • (Note: the text says 16,348 tokens here; the configured max length elsewhere is 16,384.)
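
The packing objective can be illustrated with a greedy stand-in. This is not the solver described in Appendix B (which works on quantized token counts); it is a simplified sketch that ranks pool examples by the same score, tokens plus 30 per crop, and adds each one that still fits both budgets. The example pool values are made up.

```python
from typing import List, Tuple

Example = Tuple[int, int]  # (tokens, crops)

def greedy_pack(pool: List[Example], max_tokens: int = 16384,
                max_crops: int = 128, crop_weight: int = 30) -> List[int]:
    """Return indices of pool examples chosen for one packed sequence."""
    order = sorted(
        range(len(pool)),
        key=lambda i: pool[i][0] + crop_weight * pool[i][1],
        reverse=True,
    )
    chosen, tokens, crops = [], 0, 0
    for i in order:
        t, c = pool[i]
        if tokens + t <= max_tokens and crops + c <= max_crops:
            chosen.append(i)
            tokens += t
            crops += c
    return chosen

# Made-up pool of four preprocessed examples.
pool = [(10000, 80), (7000, 60), (6000, 40), (300, 1)]
picked = greedy_pack(pool)
used_tokens = sum(pool[i][0] for i in picked)
```

In this toy pool the second-largest example does not fit next to the largest one, so the packer skips it and fills the remaining space with the two smaller examples, leaving it for a later sequence.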
  • Implementation choices

  • Uses PyTorch FSDP2, torch.compile, AMP bfloat16 (Appendix A).
  • Uses PyTorch SDPA instead of FlashAttention because SDPA supports custom attention masks (Appendix A).
  • Gradient normalization detail: divide per-device loss by the average number of loss tokens across all devices to avoid bias toward short answers (Appendix A).
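
The effect of the gradient normalization detail can be checked with a small numeric example. The sketch assumes data-parallel training where per-device results are averaged across devices; the loss sums and token counts are invented for illustration.

```python
from typing import List

def normalized_losses(loss_sums: List[float], token_counts: List[int]) -> List[float]:
    """Divide each device's summed loss by the average loss-token count across devices."""
    avg_tokens = sum(token_counts) / len(token_counts)
    return [s / avg_tokens for s in loss_sums]

# Two devices: one holds short answers (10 loss tokens), one long answers (90).
loss_sums = [20.0, 90.0]
token_counts = [10, 90]

per_device = normalized_losses(loss_sums, token_counts)
averaged = sum(per_device) / len(per_device)         # what the device average yields
global_mean = sum(loss_sums) / sum(token_counts)     # true per-token mean

# Naive alternative: each device divides by its own token count.
naive = sum(s / n for s, n in zip(loss_sums, token_counts)) / len(loss_sums)
```

With this normalization the cross-device average equals the global per-token mean (1.1 here), whereas dividing by each device's own count gives 1.5 and over-weights the device that happens to hold short answers.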

  • Compute / training time

  • Training uses NVIDIA H100 GPUs (Table 14).
  • GPU-hours (Table 14):
    • 4B: pre-train 490, SFT 7.5k, long-context 3.2k.
    • 8B: pre-train 780, SFT 8.1k, long-context 3.3k.
    • O-7B: pre-train 720, SFT 7.6k, long-context 3.3k.
  • Total training tokens / PF-days are not provided in the included text, so they cannot be reported precisely here.

4. Key Insights and Innovations

  1. Fully open, video-centric dataset suite with grounding (Figure 1; Table 1; Section 2)
    • The most fundamental contribution is nine new datasets (7 video + 2 multi-image, per Abstract) targeting:
      • dense long video captions,
      • long-form QA,
      • long-video QA,
      • open-vocabulary video pointing,
      • open-vocabulary video tracking.
    • Novelty here is not just scale but capability coverage (spatio-temporal grounding), and a strong emphasis on avoiding closed VLM distillation.

  2. Extending 2D pointing to spatio-temporal video grounding with a compact token format (Section 3.2; Appendix A)
    • They unify counting, localization, and tracking under a point/ID representation:
      • points carry timestamps (video) and IDs (object identity),
      • IDs enable both “count” (number of unique IDs) and “track” (same ID over time).
    • The compact serialization is explicitly chosen to reduce token cost versus JSON.

  3. Training throughput improvements via packing + message-tree encoding + custom attention masks (Section 3.2; Figure 3; Appendix B)
    • The “message tree” idea allows one visual input to supervise multiple tasks/annotations without letting them leak information across branches.
    • Packing makes heterogeneous example lengths feasible at scale, and they report ~15× training efficiency from packing + multi-annotation structure (Section 3.2).

  4. Loss token balancing (“token weighting”) for multi-task multimodal SFT (Section 3.2)
    • Long captions and dense point outputs can dominate training signal; explicit weighting prevents degradation on short-answer tasks.
    • This is a pragmatic multi-task stabilization technique tied to the VLM setting where output lengths vary drastically.

  5. Ablated modeling choices specific to video: timestamps and bi-directional attention among vision tokens (Table 8b)
    • They report that:
      • removing time tokens hurts captioning notably (Table 8b),
      • disabling “bidir” attention reduces QA and caption metrics (Table 8b).
    • This frames temporal markers and vision-token interactions as key for video detail understanding.

5. Experimental Analysis

Evaluation methodology: datasets, metrics, setup

  • Video understanding benchmarks (Table 2)
  • Includes a wide suite: NextQA, PerceptionTest, MVBench, Tomato, MotionBench, TempCompass, Video-MME, LongVideoBench, MLVU, LVBench, VideoEvalPro, EgoSchema, plus their own:
    • Molmo2 Caption test F1,
    • Molmo2 Count accuracy.
  • Inference settings (Section 4.1):

    • greedy decoding,
    • up to 384 frames (notably larger than the default F=128 training cap, except long-context stage),
    • for captioning + human eval: top_p=0.95, temperature=0.7, frequency_penalty=0.1.
  • Captioning evaluation (Section 4.1; Appendix C)

  • Uses Molmo2-CapTest: 693 Creative Commons videos, each with at least 4 human captions.
  • Metric: statement-level precision/recall/F1 using an LLM-as-judge pipeline (similar to Molmo’s image caption metric).
  • Appendix C details:

    • GPT-4.1 enumerates “atomic statements” and judges matches for precision and recall.
  • Counting evaluation (Section 4.1; Appendix C)

  • Molmo2-VideoCount: 533 examples collected via the video pointing pipeline, up to 60 points.
  • Also evaluates on BURST-VideoCount (Table 3 text): 2.2k examples derived from BURST tracks, scored with “close accuracy”:

    • correct if \(|\text{pred}-\text{gt}| \le \Delta\),
    • \(\Delta = 1 + \lfloor 0.05 \times \text{gt} \rfloor\) (Section 4.2).
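
The close-accuracy rule translates directly into code. The function names are illustrative; integer division by 20 is used in place of `floor(0.05 * gt)` to avoid floating-point edge cases, which is equivalent for non-negative integer counts.

```python
from typing import List

def is_close(pred: int, gt: int) -> bool:
    """True if |pred - gt| <= 1 + floor(0.05 * gt)."""
    delta = 1 + gt // 20   # same as 1 + floor(0.05 * gt) for gt >= 0
    return abs(pred - gt) <= delta

def close_accuracy(preds: List[int], gts: List[int]) -> float:
    """Fraction of examples whose predicted count is within tolerance."""
    hits = sum(is_close(p, g) for p, g in zip(preds, gts))
    return hits / len(gts)

acc = close_accuracy([11, 12, 43, 44], [10, 10, 40, 40])
```

For a ground-truth count of 10 the tolerance is 1, and for 40 it is 3, so the tolerance grows slowly with the count.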
  • Video pointing evaluation (Section 4.2; Table 3)

  • Molmo2-VP: 181 manually filtered examples.
  • Ground-truth masks obtained by running SAM 2 in a 3-second window around annotated points.
  • Metric: precision/recall/F1 based on whether predicted points fall within the mask.
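
A point-in-mask metric of this kind can be sketched as follows. The matching rule is an assumption, since the summary only says points are scored by whether they fall within the mask: here each ground-truth mask can be matched by at most one predicted point, precision is matches over predictions, and recall is matches over masks. Masks are plain 2D boolean lists indexed as `mask[y][x]`.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[bool]]  # mask[y][x]

def point_prf(points: List[Tuple[int, int]], masks: List[Mask]):
    """Precision, recall, F1 for predicted (x, y) points against object masks."""
    matched = set()
    tp = 0
    for x, y in points:
        for k, m in enumerate(masks):
            inside = 0 <= y < len(m) and 0 <= x < len(m[0]) and m[y][x]
            if k not in matched and inside:
                matched.add(k)   # each mask is credited at most once (assumption)
                tp += 1
                break
    precision = tp / len(points) if points else 0.0
    recall = tp / len(masks) if masks else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two toy 4x4 masks: one covering (0,0) and (1,0), one covering (3,3).
mask_a = [[x < 2 and y == 0 for x in range(4)] for y in range(4)]
mask_b = [[x == 3 and y == 3 for x in range(4)] for y in range(4)]
p, r, f1 = point_prf([(0, 0), (1, 0), (2, 2)], [mask_a, mask_b])
```

In the toy case, the second point lands on an already matched object and the third misses entirely, so only one of three predictions and one of two objects count as correct.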

  • Tracking evaluation (Tables 4–5; Appendix C)

  • Benchmarks include MeViS, Ref-YT-VOS, Ref-Davis, ReasonVOS, plus their new Molmo2-Track.
  • They evaluate point predictions (correct if inside mask) and convert points to segmentation via SAM 2 for segmentation metrics (Appendix C).
  • Metrics:

    • J&F (Jaccard IoU + boundary F-score) for segmentation quality,
    • point F1 at 1 fps,
    • HOTA for identity-aware tracking (Appendix C), adapted to point-within-mask matching.
  • Human preference evaluation (Section 4.1; Appendix C; Table 15; Figures 5–6)

  • 450 QA questions + 51 captioning queries; 105k pairwise ratings.
  • Elo computed using Bradley–Terry; bootstrapped for confidence intervals.
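
For readers unfamiliar with Bradley–Terry ratings, the sketch below fits strengths from pairwise wins with the standard minorization-maximization update and maps them to an Elo-like scale. It is a generic illustration, not the paper's evaluation code: ties are ignored, the 1000-point anchor and 400-point scale are conventional choices assumed here, and the bootstrap step (refitting on resampled ratings to get confidence intervals) is not shown.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

def bradley_terry_elo(results: List[Tuple[str, str]], iters: int = 200,
                      base: float = 1000.0) -> Dict[str, float]:
    """results holds (winner, loser) pairs; returns Elo-scale ratings."""
    wins = defaultdict(float)
    games = defaultdict(float)          # games[(a, b)] with a < b
    players = set()
    for winner, loser in results:
        wins[winner] += 1
        games[tuple(sorted((winner, loser)))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            denom = sum(n / (strength[a] + strength[b])
                        for (a, b), n in games.items() if p in (a, b))
            new[p] = max(wins[p] / denom, 1e-12)  # floor avoids log(0) for winless players
        # Fix the scale so the geometric mean of strengths is 1.
        log_mean = sum(math.log(v) for v in new.values()) / len(new)
        strength = {p: v / math.exp(log_mean) for p, v in new.items()}
    return {p: base + 400.0 * math.log10(s) for p, s in strength.items()}

# Toy data: model A preferred over model B in 7 of 10 comparisons.
ratings = bradley_terry_elo([("A", "B")] * 7 + [("B", "A")] * 3)
gap = ratings["A"] - ratings["B"]
```

With a 70% win rate the fitted gap is 400·log10(7/3), roughly 147 points, which matches the usual Elo interpretation of rating differences as win probabilities.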

Main quantitative results (with specific numbers)

Video benchmarks (Table 2)

  • Molmo2-8B:
  • Molmo2 Caption F1: 43.2
  • Molmo2 Count accuracy: 35.5
  • Short QA avg.: 69.9
  • Long QA avg.: 64.1
  • Average: 63.1
  • Human preference Elo: 1057 (rank 5 in that table)
  • Among open-weight baselines in Table 2:
  • Qwen3-VL-8B shows Molmo2 Count of 29.6, vs Molmo2-8B 35.5 (also echoed in the Abstract).
  • For overall average, Qwen3-VL-8B is 59.5 vs Molmo2-8B 63.1.

Video counting + pointing (Table 3)

  • Molmo2-8B:
  • BURST video counting accuracy: 60.8
  • Molmo2-VC accuracy: 35.5
  • Molmo2-VP pointing: 38.4 F1, 39.3 recall, 38.7 precision
  • Comparison explicitly visible in Table 3:
  • Qwen3-VL-8B has Molmo2-VP F1 1.5.
  • Gemini 3 Pro has pointing F1 20.0, while Molmo2-8B has 38.4 (also summarized in the Abstract).

Tracking on academic benchmarks (Table 4)

  • Molmo2-8B reports strong numbers across:
  • MeViS valid: J&F 62.3, point F1 75.9, HOTA 72.6
  • Ref-YT-VOS valid: J&F 70.2, F1 78.7, HOTA 77.3
  • Ref-Davis test: J&F 81.3, F1 78.7, HOTA 72.7
  • ReasonVOS test: J&F 65.8, F1 70.8, HOTA 68.6
  • The paper highlights (in the prose beneath Tables 4–5) that Molmo2 outperforms both general VLM baselines and several specialized segmentation models across these benchmarks.

Tracking on Molmo2’s domain benchmark (Table 5)

  • Overall (Molmo2-8B): J&F 56.2, F1 57.1, HOTA 57.5, with breakdowns by domain (Animals/Person/Sports/Dancers/Misc).

Image and multi-image benchmarks (Table 6)

  • Molmo2-8B:
  • Img QA avg.: 81.7
  • MultiImg QA avg.: 56.4
  • Average: 76.3
  • The text notes Molmo2 is behind some open-weight models on OCR-heavy benchmarks (DocVQA/InfoQA), but strong on general QA and counting (Section 4.3).

Image pointing (Table 7)

  • On Point-Bench, Molmo2-8B average is 72.7 (Table 7) and the paper claims it surpasses leaderboard baselines as of a stated date in the text below Table 7.

Do the experiments support the claims?

  • Grounding claim support is strong within the provided evidence
  • Video pointing and tracking results show large gaps versus several baselines (Tables 3–5), including very low pointing F1 for some open-weight models in their tested setup.
  • They also discuss baseline prompting difficulties for pointing (Appendix C), which is relevant: grounding requires precise output formatting, and weak adherence can crater metrics.

  • Video understanding claims are supported with broad benchmark coverage

  • Table 2 spans many datasets and includes both short- and long-video benchmarks, plus their own captioning/counting evaluations.

  • Ablations add credibility about what matters

  • Video modeling ablations (Table 8b) isolate bidirectional attention, token weighting, time tokens, and pooling size effects.
  • Counting/pointing ablations (Table 9) test strategy (“point then count”), data source complementarity, and sampling of high-count examples.
  • Tracking ablations (Table 10) show pointing helps tracking, and temporal grounding helps on academic VOS.

Ablations, failure cases, robustness checks

  • Ablations
  • Video ablations (Table 8):
    • “No bidir” reduces QA avg from 64.8 → 64.4 and caption F1 39.5 → 38.5 in their video-only setting (Table 8b).
    • “No time tokens” reduces caption F1 39.5 → 37.4 (Table 8b).
    • Increasing video pool from 3×3 to 4×4 reduces caption F1 39.5 → 37.0 (Table 8b).
  • Counting strategy (Table 9a):
    • direct “Count” vs “Point then count”: on Molmo2-VC (Molmo2-VideoCount), accuracy rises from 28.1 to 34.5, showing the importance of pointing for counting.
  • Long-context SFT ablation (Table 11):

    • improves long video QA avg 64.4 → 67.4 but reduces caption F1 42.3 → 39.9.
  • Failure cases / limitations observed in outputs

  • Appendix H and Figure 36 describe degenerate grounding outputs (repeated points, lines of points) and caption repetition for very long generations.

6. Limitations and Trade-offs

  • Not fully open in every component
  • Uses a closed-data vision encoder (SigLIP 2) (Limitations §H).
  • Uses closed text-only LLMs in data generation and refinement loops (Limitations §H), reducing full transparency/reproducibility of the dataset construction—even though they avoid closed VLMs.

  • Video grounding is still far from “solved”

  • They note that video grounding metrics are substantially lower than typical image grounding scores (Limitations §H), reflecting:

    • larger visual volume (many frames),
    • re-identification difficulty across time,
    • reduced effective resolution for long videos,
    • vision encoders not necessarily pretrained on video.
  • Long-video grounding support is limited

  • Grounding training is limited to videos up to ~3 minutes, and extending longer is complicated because annotations are aligned to 2 fps while longer videos would require lower sampling rates (Limitations §H).

  • Degenerate grounding outputs

  • Models sometimes emit repeated points (same point every frame, or many points in one frame), especially for high-frequency objects or long videos (Limitations §H), suggesting remaining robustness issues and/or task interference in joint training.

  • Captioning degeneration for extremely long outputs

  • Repetition at the end of very long captions is observed, particularly under greedy decoding (Limitations §H; Appendix C notes repetition issues).

  • Trade-off: long-context post-training helps long-video QA but can hurt captioning

  • Table 11 quantifies this: long-context SFT improves long-video QA but drops caption F1.

  • Compute and data constraints for ultra-long videos

  • The paper attributes lagging behind the best open-weight models partly to limited open long-video training data and compute limitations for extensive ultra-long context training (Section 4.1).

7. Implications and Future Directions

  • Impact on the field
  • Provides an end-to-end open baseline (data + recipe + code) for video grounding, which is positioned as underdeveloped in open research compared to proprietary systems.
  • The dataset suite and training tricks (message trees, packing, token weighting) give concrete scaffolding for others to replicate and extend.

  • Research directions suggested by the paper’s findings/limitations

  • Open vision encoders: replace closed-data SigLIP 2 with competitive open-data alternatives (Limitations §H).
  • Better long-video grounding alignment: develop sampling/annotation alignment methods when fps must drop for long videos (Limitations §H).
  • Reduce degenerate grounding outputs: mitigate repeated-point failure modes via more targeted data, better decoding constraints, or reduced task interference (Limitations §H).
  • Improve long-caption generation stability: address repetition/nonsense in very long captions (Limitations §H).

  • Practical applications / downstream use

  • Any application needing “answer + where/when evidence” can benefit:
    • video search with clickable moments,
    • robotics/monitoring tasks requiring localized events (e.g., grasps, falls),
    • sports analytics requiring tracking/counting.
  • The model’s explicit point/track interface is designed to be consumable by downstream tools (e.g., converting points to masks using SAM 2 in evaluation).

  • Repro/Integration Guidance

  • When to prefer Molmo2 (based on this paper’s evidence):
    • If you need open weights + open training data + open code and care about video grounding (pointing/tracking), Molmo2 is explicitly engineered for this and shows large gains on their pointing/tracking metrics (Tables 3–5).
    • If your primary workload is OCR-heavy document QA, the paper notes Molmo2 can lag behind some open-weight models (Section 4.3; Table 6), so you may need additional OCR-focused training data or a different model.
  • Training recipe components that appear most important (per ablations):
    • include timestamps/time tokens for video (Table 8b),
    • keep video pooling at 3×3 rather than larger pooling for caption detail (Table 8b),
    • use point-then-count for counting supervision (Table 9a),
    • expect a captioning vs long-video-QA trade-off when adding long-context post-training (Table 11).