Molmo2¶
Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
ArXiv: 2601.10611
🎯 Pitch¶
Molmo2 is a fully open family of vision–language models (weights, data, and code) that advances video understanding by adding robust spatio-temporal grounding—pointing and continuous object tracking—alongside high-quality captioning and QA. By releasing nine new video- and multi-image datasets collected without distilling from proprietary VLMs and a scalable training recipe, Molmo2 enables reproducible progress on real-world tasks that require precise when-and-where localization (robotics, assistive tech, video search, analytics).
1. Executive Summary¶
Molmo2 is a fully open (weights + data + code) family of vision-language models (VLMs) designed to close a practical gap in video understanding: producing not only text answers/captions, but also grounded outputs like spatio-temporal points and object tracks. Its main significance is that it demonstrates strong video QA/captioning performance while substantially advancing video grounding (pointing and tracking) using training data collected without distilling from proprietary VLMs, backed by a detailed and efficient training recipe.
2. Context and Motivation¶
- Problem/gap addressed
  - Strong video-language models are largely proprietary, and many “open” alternatives either:
    - rely on synthetic data generated by closed VLMs (effectively distillation from proprietary systems), or
    - do not release training data/recipes, limiting reproducibility and iteration.
  - Even when video understanding is good at a “high-level,” many downstream applications need grounding:
    - pointing: “click the moment and location where X happens,” and
    - tracking: “track object Y across time.”
  - The paper positions grounding as a missing capability in many systems (including proprietary ones), especially for video, where outputs must be aligned to time as well as pixels.
- Why this matters
  - Real-world tasks like video search, assistive tech, robotics, sports analytics, and monitoring often require precise localization (where/when) rather than only descriptive text (Introduction).
  - Grounding enables more controllable and verifiable outputs (e.g., counting by emitting one point per instance).
- Prior approaches and shortcomings (as framed here)
  - Open models: often lack open data/recipe or rely on proprietary VLM-generated synthetic labels.
  - Existing video grounding datasets: described as narrow in scope/vocabulary, not sufficient for arbitrary user prompts (Introduction).
- How this work positions itself
  - A “fully open” stack: open weights + open datasets + open training code.
  - Data created without closed VLMs for annotation generation; however, the pipeline still uses closed text-only LLMs in places (explicitly acknowledged in Limitations §H).
3. Technical Approach¶
3.1 Reader orientation (approachable technical breakdown)¶
- The system is a multimodal model that takes single images, sets of images, or videos and generates either text (answers/captions) or grounded outputs (points and tracks over time).
- The solution combines a vision transformer encoder with a language model, plus training data and encoding strategies specialized for video-scale token budgets and multi-annotation supervision.
3.2 Big-picture architecture (diagram in words)¶
- (1) Video/Image input → preprocessing: crop/resize images; sample video frames at fixed FPS with a max frame cap.
- (2) Vision encoder (ViT): convert each crop/frame into patch features.
- (3) Connector: pool and project vision features into a set of “visual tokens” usable by the LLM.
- (4) LLM backbone: consume interleaved visual tokens + text (timestamps, image indices, subtitles), produce text and/or structured point/track strings.
- (5) Training infrastructure: “message trees” + custom attention masks + sequence packing + token-weighting to handle diverse tasks efficiently (Figure 3, Section 3.2).
3.3 Roadmap for the deep dive¶
- Explain the model I/O and grounding formats (points/tracks) because they define what the model can emit.
- Explain vision preprocessing and tokenization because video token budgets dominate compute.
- Explain the three-stage training pipeline and data mixture because performance depends heavily on dataset composition.
- Explain efficiency mechanisms (packing + message trees + attention masks) because they enable scaling to many annotations and long sequences.
- Explain evaluation protocols (especially for captioning/grounding) to interpret the reported numbers.
3.4 Detailed, sentence-based technical breakdown¶
This is an empirical paper that combines a system, a dataset suite, and a training recipe. Its core idea is to build a large, fully open, video-centric multimodal corpus (including new grounding datasets) and train a standard ViT→LLM VLM with specialized data encodings and efficiency tricks, so that it handles both video understanding and point/track grounding.
3.4.1 System/data pipeline diagram in words (what happens first → second → third)¶
- Data collection / construction happens first (Section 2, Figure 1), producing:
  - dense video captions (human + model-assisted),
  - long-form QA (human–LLM collaboration),
  - synthetic video QA (captioner + LLM),
  - video pointing (human clicks after LLM-generated queries),
  - video tracking (repurposed tracks + human-written referring queries),
  - multi-image QA and multi-image pointing datasets.
- Examples are formatted into “message trees” (Section 3.2), where:
  - the first message contains the visual input tokens (video frames / images),
  - each annotation (caption, QA turn, pointing request, tracking request) becomes a branch.
- Multiple message trees are packed into a single long training sequence (Section 3.2; Figure 3) using:
  - a custom attention mask to prevent branches (or different examples) from attending to each other improperly,
  - dynamic packing to minimize padding waste.
- The model is trained in three stages (Section 3.2):
  - image-only pre-training (captioning + transcript prediction + pointing + text-only),
  - joint multimodal SFT on a mixture of images/videos/multi-image tasks,
  - short long-context SFT to extend context length and max frames.
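The branch-isolating attention mask mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `(example_id, branch_id, is_visual)` token layout and the name `build_mask` are hypothetical, and letting visual tokens attend forward to one another follows the bi-directional visual attention the paper reports in Section 3.1.

```python
def build_mask(tokens):
    """tokens: list of (example_id, branch_id, is_visual) in sequence order.

    branch_id 0 marks the shared first message (the visual input);
    branch_id >= 1 marks one annotation branch of the message tree.
    Returns mask[i][j] == True when query token i may attend to key token j.
    """
    n = len(tokens)
    mask = [[False] * n for _ in range(n)]
    for i, (ex_i, br_i, vis_i) in enumerate(tokens):
        for j, (ex_j, br_j, vis_j) in enumerate(tokens):
            if ex_i != ex_j:
                continue  # packed examples never see each other
            if br_j not in (0, br_i):
                continue  # sibling branches are hidden from each other
            causal = j <= i
            visual_pair = vis_i and vis_j  # visual tokens may also look forward
            mask[i][j] = causal or visual_pair
    return mask


# Hypothetical packed sequence: example 0 has two visual tokens and two
# one-token branches; example 1 has one visual token and one branch.
tokens = [(0, 0, True), (0, 0, True), (0, 1, False), (0, 2, False),
          (1, 0, True), (1, 1, False)]
mask = build_mask(tokens)
```

Each branch sees the shared visual prefix and its own earlier tokens, so several annotations can be supervised from one encoded video without leaking answers between them.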
3.4.2 Model architecture and interfaces¶
- Overall structure
  - The model follows the common “vision encoder + connector + LLM” pattern (Section 3.1, Figure 2).
- Vision encoder
  - All variants use SigLIP 2 So400m/14, 384px ViT (Appendix A; Table 12).
  - ViT hyperparameters given (Table 12): `dim = 1152`, `layers = 27`, `heads = 16`, `patch size = 14`, `image size = 384×384`, dropout `0.0`.
- Cropping / frame sampling
  - Images: use a low-res full-image crop plus up to `K` overlapping tiled crops (Section 3.1).
    - Training: `K = 8`; inference: `K = 24`.
    - Crops are resized to 378 without black padding (Appendix A), following the ViT’s training procedure described there.
  - Videos:
    - sample frames at `S = 2 fps` as single crops (Section 3.1),
    - cap at `F = 128` frames normally, `F = 384` during long-context training (Section 3.1 / 3.2),
    - if the video is longer than `F/S` seconds, uniformly sample `F` frames; always include the last frame (Section 3.1; Appendix A explains timestamp-based extraction for variable-FPS video).
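The video sampling rule above (fixed rate, frame cap, last frame always kept) can be sketched as follows. This is an illustrative reading of the rule, not the released preprocessing code: the function name `sample_frame_times` is hypothetical, and the exact way the last frame is merged into the fixed-rate grid is an assumption.

```python
def sample_frame_times(duration_s, fps=2.0, max_frames=128):
    """Return the timestamps (in seconds) of the frames to extract.

    Assumes max_frames >= 2. Short videos are sampled on a regular grid at
    `fps`; longer videos fall back to `max_frames` uniformly spaced
    timestamps. In both cases the final timestamp equals the video duration,
    so the last frame is always included.
    """
    if duration_s <= 0:
        return [0.0]
    step = 1.0 / fps
    n_grid = int(duration_s * fps) + 1
    times = [i * step for i in range(n_grid)]
    if times[-1] < duration_s:
        times.append(duration_s)  # assumption: append the last frame to the grid
    if len(times) <= max_frames:
        return times
    # Video longer than the cap allows: spread frames uniformly over the clip.
    return [duration_s * i / (max_frames - 1) for i in range(max_frames)]


short_clip = sample_frame_times(3.2)                   # grid at 2 fps plus last frame
long_clip = sample_frame_times(600.0, max_frames=128)  # capped, uniformly spaced
```

With `fps=2` and `max_frames=128`, the uniform fallback starts once a video exceeds roughly 64 seconds, which matches the `F/S` threshold stated above.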
- Connector (vision → language)
  - Uses features from the third-to-last and ninth-from-last ViT layers (Section 3.1).
  - Pools local patch windows:
    - images: `2×2` pooling windows,
    - video frames: `3×3` pooling windows (reduces token count).
  - Pooling uses multi-head attention where the mean of patches is the query, then a shared projection MLP (Section 3.1).
  - Connector hyperparameters vary by model size (Table 12), with `pool heads = 16`, `pool dim = 1152`, `dropout = 0.0`.
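To see why the pooling window matters for video, the visual token budget can be estimated from the figures quoted above (378-pixel crops, patch size 14). This is back-of-the-envelope arithmetic rather than a number taken from the paper: it assumes partial pooling windows at the border are kept (ceiling division) and ignores timestamps, subtitles, and special tokens. The helper names are hypothetical.

```python
import math

PATCH = 14   # ViT patch size, as listed above
CROP = 378   # crop side length after resizing, as listed above


def tokens_per_crop(pool):
    """Visual tokens for one crop after pool x pool window pooling."""
    side = CROP // PATCH              # 27 patches per side
    pooled = math.ceil(side / pool)   # assumption: border windows are kept
    return pooled * pooled


def video_tokens(num_frames, pool=3):
    """Visual tokens for a video whose frames are each encoded as one crop."""
    return num_frames * tokens_per_crop(pool)


image_crop = tokens_per_crop(2)      # 2x2 pooling used for images
video_frame = tokens_per_crop(3)     # 3x3 pooling used for video frames
standard_video = video_tokens(128)   # default frame cap
long_video = video_tokens(384)       # long-context frame cap
```

Under these assumptions a video frame costs 81 tokens against 196 for an image crop, so 128 frames come to 10,368 visual tokens and 384 frames to 31,104, both below the 16,384 and 36,864 sequence lengths listed in the training stages.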
- LLM backbones and attention behavior
  - The LLM consumes visual tokens interleaved with:
    - timestamps (videos) and/or image indices (multi-image),
    - optional subtitles appended after visual tokens (Section 3.1; Appendix A formatting).
  - They enable bi-directional attention among visual tokens (“allow image tokens … to forward-attend to one another”) and report gains from this (Section 3.1; ablation in Table 8b).
- Model scales
  - Released variants (Abstract, Table 2): `Molmo2-4B`, `Molmo2-8B` (based on `Qwen3` LLMs), `Molmo2-O-7B` (based on `OLMo 3`).
  - LLM hyperparameters (Table 12) include:
    - 4B: `dim 2560`, `layers 36`, `heads 32`, `KV heads 8`, dropout `0.1`.
    - 8B: `dim 4096`, `layers 36`, `heads 32`, `KV heads 8`, dropout `0.1`.
    - O-7B: `dim 4096`, `layers 32`, `KV heads 32`, dropout `0.1`.
  - RoPE `theta` values are listed (Table 12): `1m` (4B/8B) and `0.5m` (O-7B).
3.4.3 Grounding output formats (points and tracks)¶
- Pointing / tracking representation
  - Outputs are encoded in an HTML-like tag format: `<points coords="...">text span</points>` and `<tracks coords="...">text span</tracks>` (Appendix A).
  - Coordinates:
    - `x, y` are normalized to `[0, 1000]`,
    - each point includes an object index (ID) used for:
      - counting (the final object index corresponds to total count),
      - tracking (same ID reused across frames) (Section 3.2; Appendix A),
    - video points include a timestamp (seconds with one decimal), image points include an image index (starting at 1).
  - Points are sorted by time/image index, then by `x, y` (Section 3.2; Appendix A).
  - They explicitly choose this compact format over JSON because it uses fewer tokens (Appendix A).
- Training-time grounding specifics
  - For video pointing, examples can have up to 60 points annotated (Section 3.2).
  - They also create multi-turn conversations with multiple pointing/counting queries per video (Section 3.2).
  - For tracking, they add auxiliary tasks like predicting first/last appearance frames, and tracking from an input query + point (Section 3.2).
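The point representation described above (normalized coordinates, object IDs, timestamps, fixed sort order) can be illustrated with a small data-structure sketch. The exact contents of the `coords` string are not given here, so this example deliberately stops short of serialization; the `Point` class and helper names are hypothetical, and rounding to the nearest integer is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    time_s: float  # timestamp in seconds, one decimal
    obj_id: int    # object index shared across frames for the same object
    x: int         # horizontal position normalized to [0, 1000]
    y: int         # vertical position normalized to [0, 1000]


def make_points(raw, width, height):
    """raw: iterable of (time_s, obj_id, x_px, y_px) in pixel coordinates."""
    points = []
    for t, obj_id, x_px, y_px in raw:
        x = round(1000 * x_px / width)
        y = round(1000 * y_px / height)
        points.append(Point(round(t, 1), obj_id, x, y))
    # Sort by time first, then by x, y, as described above.
    return sorted(points, key=lambda p: (p.time_s, p.x, p.y))


def count_objects(points):
    """Counting falls out of the IDs: one unique ID per object instance."""
    return len({p.obj_id for p in points})


def group_tracks(points):
    """Tracking falls out of the IDs too: same ID, different timestamps."""
    tracks = {}
    for p in points:
        tracks.setdefault(p.obj_id, []).append((p.time_s, p.x, p.y))
    return tracks


raw = [(1.04, 2, 320, 240), (0.5, 1, 640, 480), (1.04, 1, 64, 48)]
points = make_points(raw, width=640, height=480)
```

The same list of points therefore answers a counting query (`count_objects`) and a tracking query (`group_tracks`), which is the unification the format is designed for.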
3.4.4 Data: what is trained on, and how it is collected¶
- Nine new datasets highlighted (Table 1; Figure 1)
  - The paper groups datasets into:
    - captions/long QA,
    - image QA,
    - video QA,
    - image pointing,
    - video pointing,
    - video tracking,
    - NLP text-only SFT.
  - Table 1 reports example counts after formatting into message trees (e.g., total examples per group), and per-group sampling rates.
- Key new datasets and sizes (Section 2)
  - `Molmo2-Cap` (human): `104k` video-level + `431k` clip-level dense captions.
    - Captions are very long: 924 words per video on average; comparisons to prior datasets are given in Section 2.
    - Pipeline: spoken descriptions → transcription via `Whisper-1` → rewrite by text-only LLM → merge with Molmo-generated frame-level captions via another LLM step (Section 2).
  - `Molmo2-AskModelAnything` (human): `140k` human-authored video QA pairs.
    - Uses a human–LLM refinement loop with an LLM (specified as Claude Sonnet 4.5) seeded by an early Molmo2 captioner (Section 2).
    - Filters out counting questions so counting can be handled by pointing (Section 2).
  - `Molmo2-CapQA` and `Molmo2-SubtitleQA` (synthetic):
    - `1M` QA pairs for CapQA (200k videos × 5 QA/video),
    - `300k` QA pairs for SubtitleQA (100k videos × 3 QA/video),
    - captions produced by a captioner trained on `Molmo2-Cap`; subtitles transcribed with `Whisper-1` for SubtitleQA (Section 2).
  - `Molmo2-VideoPoint` (human): `650k+` pointing queries on `280k` videos, avg `6` points/video.
    - Queries derived from LLM over early Molmo2 captions; annotators select the frame and click location; frames at 2 fps (Section 2).
  - `Molmo2-VideoTrack` (human): `3.6k` clips and `15k` natural-language queries, avg `2.28` objects/query (Section 2).
    - Built by re-labeling existing tracking annotations: annotators write complex queries that apply to subsets of tracked objects; validation round included (Section 2).
  - `AcademicVideoPoint` and `AcademicVideoTrack` (curated conversions):
    - Convert existing tracking/segmentation datasets into pointing/counting and tracking supervision.
    - Uses `SAM-2` to convert bbox tracks into segmentation masks and point tasks (Section 2).
- Multi-image datasets:
  - `Molmo2-MultiImageQA` (human): `45k` image sets from `96k` images, `72k` QA pairs.
  - `Molmo2-MultiImagePoint`: `470k` pointing/counting examples derived from PixMo-Points with canonicalized labels, but training samples pre-canonicalized variants to preserve lexical diversity (Section 2).
- SFT mixture and sampling
  - They manually assign category sampling rates (Table 1), and within a category sample datasets proportional to \(\sqrt{\text{dataset size}}\) with manual adjustments (Section 3.2).
  - Table 13 provides a full per-dataset breakdown and rates; Figure 4 visualizes the mixture.
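The two-level sampling rule above (a manual rate per category, then square-root weighting within the category) can be sketched as follows. The category rates and dataset sizes in the usage example are made-up placeholders, not values from Table 1 or Table 13, and the manual per-dataset adjustments mentioned above are not modeled.

```python
import math


def mixture_rates(category_rates, dataset_sizes):
    """Per-dataset sampling probabilities.

    category_rates: {category: rate}, rates assumed to sum to 1.
    dataset_sizes:  {category: {dataset_name: number_of_examples}}.
    Within a category, each dataset is weighted by sqrt(size).
    """
    rates = {}
    for category, category_rate in category_rates.items():
        weights = {name: math.sqrt(size)
                   for name, size in dataset_sizes[category].items()}
        total = sum(weights.values())
        for name, weight in weights.items():
            rates[name] = category_rate * weight / total
    return rates


# Hypothetical categories and sizes, for illustration only.
rates = mixture_rates(
    {"video_qa": 0.6, "video_pointing": 0.4},
    {"video_qa": {"a": 10_000, "b": 40_000},
     "video_pointing": {"c": 90_000}},
)
```

Square-root weighting softens size imbalance: dataset `b` is four times larger than `a` but is sampled only twice as often.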
3.4.5 Training pipeline and hyperparameters¶
- Optimizer and schedule
  - Uses `AdamW` with:
    - `betas = (0.9, 0.95)`, `eps = 1e-6`,
    - cosine LR decay to `10%` of peak,
    - no weight decay (Appendix B; Table 12).
  - Uses separate learning rates for ViT / connector / LLM in pre-training (Section 3.2).
- Stage 1: Image-only pre-training
  - Data mixture: `60%` captioning, `30%` image pointing, `10%` natural language (Section 3.2).
  - Steps and batch size: `32k steps`, `batch size 128` (Section 3.2).
  - Sequence length: `2560` (Table 12).
  - LRs (Table 12): ViT `6e-6`, connector `2e-4`, LLM `2e-4`.
  - Warmups (Table 12): ViT `2000`, connector `200`, LLM `2000`.
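The per-component schedule for Stage 1 can be sketched from the numbers above: separate peak learning rates and warmups for the ViT, connector, and LLM, each followed by cosine decay to 10% of peak. The linear warmup shape is an assumption (the summary gives warmup lengths but not the shape), and `lr_at` and `STAGE1` are hypothetical names.

```python
import math


def lr_at(step, peak_lr, warmup_steps, total_steps, final_frac=0.1):
    """Learning rate at a given step: warmup, then cosine decay to final_frac * peak."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # assumption: linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)


# Stage 1 values as listed above: (peak learning rate, warmup steps).
STAGE1 = {"vit": (6e-6, 2000), "connector": (2e-4, 200), "llm": (2e-4, 2000)}
TOTAL_STEPS = 32_000

# Learning rate of every component at the midpoint of Stage 1.
midpoint = {name: lr_at(16_000, peak, warmup, TOTAL_STEPS)
            for name, (peak, warmup) in STAGE1.items()}
```

The short connector warmup means the connector reaches its peak rate while the ViT and LLM are still ramping up, which is one plausible reason for giving each component its own schedule.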
- Stage 2: Joint supervised fine-tuning (SFT)
  - Trains on mixed images + videos + multi-image + NLP (Section 3.2; Table 1).
  - Steps and batch size: `30k steps`, `batch size 128` (Section 3.2; Table 12).
  - Max sequence length: `16,384` tokens (Section 3.2; Table 12).
  - LRs (Table 12): ViT `5e-6`, connector `5e-6`, LLM `1e-5`.
  - Warmup: `200` steps for each component (Table 12).
- Stage 3: Long-context SFT
  - Sequence length increased to `36,864` tokens; `F = 384` frames (Section 3.2).
  - Train `2k` steps (Section 3.2).
  - Uses context parallelism: each example processed by a group of `8` GPUs, with `Ulysses attention` for LLM context parallelism due to compatibility with custom attention masks (Section 3.2).
  - They also distribute vision encoder frame processing across context-parallel groups to reduce memory footprint (Section 3.2).
- Token weighting (loss balancing)
  - Addresses the issue that tasks have wildly different output lengths (e.g., multiple-choice vs 4,000+ token captions).
  - Fixed loss weights:
    - video captions weight `0.1`,
    - pointing weight `0.2` (Section 3.2).
  - For other tasks, weight by \(4/\sqrt{n}\) where \(n\) is number of answer tokens (Section 3.2).
  - Purpose: prevent long-output tasks from dominating the loss even if sampled infrequently.
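The weighting rule above is small enough to write out directly. This sketch assumes the weight is applied to every answer token of an example; the task labels `video_caption` and `pointing` are hypothetical names for the two fixed-weight cases.

```python
import math

# Fixed weights quoted above; the task names are illustrative labels.
FIXED_WEIGHTS = {"video_caption": 0.1, "pointing": 0.2}


def token_weight(task, num_answer_tokens):
    """Per-token loss weight for one example."""
    if task in FIXED_WEIGHTS:
        return FIXED_WEIGHTS[task]
    return 4.0 / math.sqrt(max(1, num_answer_tokens))


def example_contribution(task, num_answer_tokens):
    """Total loss weight an example carries: tokens times per-token weight."""
    return num_answer_tokens * token_weight(task, num_answer_tokens)


short_answer = token_weight("qa", 1)            # single-token multiple-choice answer
medium_answer = token_weight("qa", 16)          # 16-token free-form answer
long_caption = token_weight("video_caption", 4000)
```

Under this reading, an example's total contribution grows as \(4\sqrt{n}\) rather than linearly in \(n\), so a 16-token answer carries four times the weight of a one-token answer instead of sixteen times.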
- Packing and message trees
  - Packing merges multiple short examples into one long sequence to reduce padding (Section 3.2).
  - On-the-fly packing algorithm (Appendix B):
    - maintains a pool of `M = 48` preprocessed examples,
    - chooses subset maximizing \(T + I \cdot w_i\) subject to \(T \le 16384\) tokens and \(I \le 128\) crops, with \(w_i = 30\),
    - token counts are quantized to multiples of `32` for the solver.
  - Message trees (Section 3.2):
    - visual input is the first message; each annotation is a branch,
    - custom attention mask blocks cross-attention between branches (Figure 3),
    - average of `4` annotations per example,
    - packing fits `3.8` examples into a 16,348-token sequence on average, yielding 15× training efficiency (Section 3.2).
    - (Note: the text says 16,348 tokens here; the configured max length elsewhere is 16,384.)
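The packing objective above can be illustrated with a brute-force search over a tiny pool. This is not the paper's solver, which is not described here: exhaustive search is swapped in purely for clarity and would be infeasible for the real pool of 48 examples. Rounding token counts up to the next multiple of 32 is also an assumption, since the summary only says counts are quantized.

```python
from itertools import combinations

MAX_TOKENS = 16384   # token budget T per packed sequence
MAX_CROPS = 128      # crop budget I per packed sequence
CROP_WEIGHT = 30     # w_i in the objective T + I * w_i
QUANTUM = 32         # token counts are quantized to multiples of 32


def quantize(tokens):
    """Round a token count up to the next multiple of QUANTUM (assumption)."""
    return -(-tokens // QUANTUM) * QUANTUM


def best_subset(pool):
    """pool: list of (tokens, crops). Returns (indices, score) of the best packing."""
    best, best_score = (), -1
    for size in range(1, len(pool) + 1):
        for idxs in combinations(range(len(pool)), size):
            total_tokens = sum(quantize(pool[i][0]) for i in idxs)
            total_crops = sum(pool[i][1] for i in idxs)
            if total_tokens > MAX_TOKENS or total_crops > MAX_CROPS:
                continue  # violates a budget
            score = total_tokens + total_crops * CROP_WEIGHT
            if score > best_score:
                best, best_score = idxs, score
    return best, best_score


# Hypothetical pool of four preprocessed examples: (tokens, crops).
pool = [(9000, 60), (8000, 70), (7000, 50), (500, 4)]
chosen, score = best_subset(pool)
```

Here the first and third examples are chosen: together they use 16,032 of the 16,384 token slots and 110 of the 128 crop slots, and no other feasible subset scores higher.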
- Implementation choices
  - Uses PyTorch `FSDP2`, `torch.compile`, AMP `bfloat16` (Appendix A).
  - Uses PyTorch SDPA instead of FlashAttention because SDPA supports custom attention masks (Appendix A).
  - Gradient normalization detail: divide per-device loss by the average number of loss tokens across all devices to avoid bias toward short answers (Appendix A).
- Compute / training time
  - Training uses NVIDIA H100 GPUs (Table 14).
  - GPU-hours (Table 14):
    - 4B: pre-train `490`, SFT `7.5k`, long-context `3.2k`.
    - 8B: pre-train `780`, SFT `8.1k`, long-context `3.3k`.
    - O-7B: pre-train `720`, SFT `7.6k`, long-context `3.3k`.
  - Total training tokens / PF-days are not provided in the included text, so they cannot be reported precisely here.
4. Key Insights and Innovations¶
- Fully open, video-centric dataset suite with grounding (Figure 1; Table 1; Section 2)
  - The most fundamental contribution is nine new datasets (7 video + 2 multi-image, per Abstract) targeting:
    - dense long video captions,
    - long-form QA,
    - long-video QA,
    - open-vocabulary video pointing,
    - open-vocabulary video tracking.
  - Novelty here is not just scale but capability coverage (spatio-temporal grounding), and a strong emphasis on avoiding closed VLM distillation.
- Extending 2D pointing to spatio-temporal video grounding with a compact token format (Section 3.2; Appendix A)
  - They unify counting, localization, and tracking under a point/ID representation:
    - points carry timestamps (video) and IDs (object identity),
    - IDs enable both “count” (number of unique IDs) and “track” (same ID over time).
  - The compact serialization is explicitly chosen to reduce token cost versus JSON.
- Training throughput improvements via packing + message-tree encoding + custom attention masks (Section 3.2; Figure 3; Appendix B)
  - The “message tree” idea allows one visual input to supervise multiple tasks/annotations without letting them leak information across branches.
  - Packing makes heterogeneous example lengths feasible at scale, and they report ~15× training efficiency from packing + multi-annotation structure (Section 3.2).
- Loss token balancing (“token weighting”) for multi-task multimodal SFT (Section 3.2)
  - Long captions and dense point outputs can dominate training signal; explicit weighting prevents degradation on short-answer tasks.
  - This is a pragmatic multi-task stabilization technique tied to the VLM setting where output lengths vary drastically.
- Ablated modeling choices specific to video: timestamps and bi-directional attention among vision tokens (Table 8b)
  - They report that:
    - removing time tokens hurts captioning notably (Table 8b),
    - disabling “bidir” attention reduces QA and caption metrics (Table 8b).
  - The paper frames temporal markers and vision-token interactions as key for detailed video understanding.
5. Experimental Analysis¶
Evaluation methodology: datasets, metrics, setup¶
- Video understanding benchmarks (Table 2)
  - Includes a wide suite: `NextQA`, `PerceptionTest`, `MVBench`, `Tomato`, `MotionBench`, `TempCompass`, `Video-MME`, `LongVideoBench`, `MLVU`, `LVBench`, `VideoEvalPro`, `EgoSchema`, plus their own: `Molmo2 Caption` test F1 and `Molmo2 Count` accuracy.
- Inference settings (Section 4.1):
  - greedy decoding,
  - up to `384` frames (notably larger than the default `F=128` training cap, except long-context stage),
  - for captioning + human eval: `top_p=0.95`, `temperature=0.7`, `frequency_penalty=0.1`.
- Captioning evaluation (Section 4.1; Appendix C)
  - Uses `Molmo2-CapTest`: `693` Creative Commons videos, each with at least 4 human captions.
  - Metric: statement-level precision/recall/F1 using an LLM-as-judge pipeline (similar to Molmo’s image caption metric).
  - Appendix C details: GPT-4.1 enumerates “atomic statements” and judges matches for precision and recall.
- Counting evaluation (Section 4.1; Appendix C)
  - `Molmo2-VideoCount`: `533` examples collected via the video pointing pipeline, up to `60` points.
  - Also evaluates on `BURST-VideoCount` (Table 3 text): `2.2k` examples derived from BURST tracks, scored with “close accuracy”:
    - a prediction is correct if \(|\text{pred}-\text{gt}| \le \Delta\),
    - \(\Delta = 1 + \lfloor 0.05 \times \text{gt} \rfloor\) (Section 4.2).
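The “close accuracy” rule above translates directly into code. The function names are hypothetical; integer division by 20 is used in place of \(\lfloor 0.05 \times \text{gt} \rfloor\), which is equivalent for non-negative integer counts and avoids floating-point edge cases.

```python
def is_close(pred, gt):
    """True if the predicted count is within the tolerance for this ground truth."""
    delta = 1 + gt // 20  # same as 1 + floor(0.05 * gt) for integer gt >= 0
    return abs(pred - gt) <= delta


def close_accuracy(preds, gts):
    """Fraction of examples whose predicted count is 'close' to the ground truth."""
    hits = sum(is_close(p, g) for p, g in zip(preds, gts))
    return hits / len(gts)


# Tolerance is 1 for small counts and grows slowly: 3 for a ground truth of 40.
accuracy = close_accuracy([5, 6, 43], [4, 4, 40])
```

In this example the first and third predictions count as correct and the second does not, so the accuracy is two thirds.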
- Video pointing evaluation (Section 4.2; Table 3)
  - `Molmo2-VP`: 181 manually filtered examples.
  - Ground-truth masks obtained by running `SAM 2` in a 3-second window around annotated points.
  - Metric: precision/recall/F1 based on whether predicted points fall within the mask.
- Tracking evaluation (Tables 4–5; Appendix C)
  - Benchmarks include `MeViS`, `Ref-YT-VOS`, `Ref-Davis`, `ReasonVOS`, plus their new `Molmo2-Track`.
  - They evaluate point predictions (correct if inside mask) and convert points to segmentation via `SAM 2` for segmentation metrics (Appendix C).
  - Metrics:
    - `J&F` (Jaccard IoU + boundary F-score) for segmentation quality,
    - point `F1` at 1 fps,
    - `HOTA` for identity-aware tracking (Appendix C), adapted to point-within-mask matching.
- Human preference evaluation (Section 4.1; Appendix C; Table 15; Figures 5–6)
  - 450 QA questions + 51 captioning queries; 105k pairwise ratings.
  - Elo computed using Bradley–Terry; bootstrapped for confidence intervals.
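For readers unfamiliar with Bradley–Terry ratings, the sketch below fits model strengths from pairwise preference counts using the standard minorization-maximization update and maps them to an Elo-like scale. It is a generic illustration, not the paper's evaluation code: the 400-point logarithmic scale and the 1000-point anchor are assumptions, the win counts are invented, and the bootstrap step is omitted.

```python
import math


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths.

    wins[a][b] is the number of times model a was preferred over model b.
    Returns {model: strength}, normalized to a geometric mean of 1.
    """
    models = sorted(set(wins) | {b for a in wins for b in wins[a]})
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            total_wins = sum(wins.get(i, {}).values())
            denom = 0.0
            for j in models:
                if j == i:
                    continue
                games = wins.get(i, {}).get(j, 0) + wins.get(j, {}).get(i, 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            # Floor avoids log(0) for a model that never wins.
            updated[i] = max(total_wins, 1e-9) / denom if denom else strength[i]
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}
    return strength


def to_elo(strength, base=1000.0, scale=400.0):
    """Map strengths to an Elo-like scale (base and scale are assumptions)."""
    return {m: base + scale * math.log10(s) for m, s in strength.items()}


# Invented example: model A preferred 3 times, model B preferred once.
elo = to_elo(bradley_terry({"A": {"B": 3}, "B": {"A": 1}}))
```

With a 3:1 preference ratio the fitted strength ratio is 3, which corresponds to a gap of about 191 points on this scale.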
Main quantitative results (with specific numbers)¶
Video benchmarks (Table 2)¶
- `Molmo2-8B`:
  - `Molmo2 Caption` F1: 43.2
  - `Molmo2 Count` accuracy: 35.5
  - `Short QA avg.`: 69.9
  - `Long QA avg.`: 64.1
  - `Average`: 63.1
  - Human preference Elo: 1057 (rank 5 in that table)
- Among open-weight baselines in Table 2:
  - `Qwen3-VL-8B` shows `Molmo2 Count` of 29.6, vs Molmo2-8B 35.5 (also echoed in the Abstract).
  - For overall average, `Qwen3-VL-8B` is 59.5 vs `Molmo2-8B` 63.1.
Video counting + pointing (Table 3)¶
- `Molmo2-8B`:
  - `BURST` video counting accuracy: 60.8
  - `Molmo2-VC` accuracy: 35.5
  - `Molmo2-VP` pointing: 38.4 F1, 39.3 recall, 38.7 precision
- Comparison explicitly visible in Table 3:
  - `Qwen3-VL-8B` has `Molmo2-VP` F1 1.5.
  - `Gemini 3 Pro` has pointing F1 20.0, while Molmo2-8B has 38.4 (also summarized in the Abstract).
Tracking on academic benchmarks (Table 4)¶
- `Molmo2-8B` reports strong numbers across:
  - `MeViS valid`: `J&F 62.3`, point `F1 75.9`, `HOTA 72.6`
  - `Ref-YT-VOS valid`: `J&F 70.2`, `F1 78.7`, `HOTA 77.3`
  - `Ref-Davis test`: `J&F 81.3`, `F1 78.7`, `HOTA 72.7`
  - `ReasonVOS test`: `J&F 65.8`, `F1 70.8`, `HOTA 68.6`
- The paper highlights (in the prose beneath Tables 4–5) that Molmo2 outperforms both general VLM baselines and several specialized segmentation models across these benchmarks.
Tracking on Molmo2’s domain benchmark (Table 5)¶
- Overall (`Molmo2-8B`): `J&F 56.2`, `F1 57.1`, `HOTA 57.5`, with breakdowns by domain (Animals/Person/Sports/Dancers/Misc).
Image and multi-image benchmarks (Table 6)¶
- `Molmo2-8B`:
  - `Img QA avg.`: 81.7
  - `MultiImg QA avg.`: 56.4
  - `Average`: 76.3
- The text notes Molmo2 is behind some open-weight models on OCR-heavy benchmarks (DocVQA/InfoQA), but strong on general QA and counting (Section 4.3).
Image pointing (Table 7)¶
- On `Point-Bench`, `Molmo2-8B` average is 72.7 (Table 7) and the paper claims it surpasses leaderboard baselines as of a stated date in the text below Table 7.
Do the experiments support the claims?¶
- Grounding claim support is strong within the provided evidence
  - Video pointing and tracking results show large gaps versus several baselines (Tables 3–5), including very low pointing F1 for some open-weight models in their tested setup.
  - They also discuss baseline prompting difficulties for pointing (Appendix C), which is relevant: grounding requires precise output formatting, and weak adherence can crater metrics.
- Video understanding claims are supported with broad benchmark coverage
  - Table 2 spans many datasets and includes both short- and long-video benchmarks, plus their own captioning/counting evaluations.
- Ablations add credibility about what matters
  - Video modeling ablations (Table 8b) isolate bidirectional attention, token weighting, time tokens, and pooling size effects.
  - Counting/pointing ablations (Table 9) test strategy (“point then count”), data source complementarity, and sampling of high-count examples.
  - Tracking ablations (Table 10) show pointing helps tracking, and temporal grounding helps on academic VOS.
Ablations, failure cases, robustness checks¶
- Ablations
  - Video ablations (Table 8):
    - “No bidir” reduces QA avg from 64.8 → 64.4 and caption F1 39.5 → 38.5 in their video-only setting (Table 8b).
    - “No time tokens” reduces caption F1 39.5 → 37.4 (Table 8b).
    - Increasing video pool from `3×3` to `4×4` reduces caption F1 39.5 → 37.0 (Table 8b).
  - Counting strategy (Table 9a):
    - direct “Count” vs “Point then count”: on `MVC`, 28.1 → 34.5, showing the importance of pointing for counting.
  - Long-context SFT ablation (Table 11):
    - improves long video QA avg 64.4 → 67.4 but reduces caption F1 42.3 → 39.9.
- Failure cases / limitations observed in outputs
  - Appendix H and Figure 36 describe degenerate grounding outputs (repeated points, lines of points) and caption repetition for very long generations.
6. Limitations and Trade-offs¶
- Not fully open in every component
  - Uses a closed-data vision encoder (`SigLIP 2`) (Limitations §H).
  - Uses closed text-only LLMs in data generation and refinement loops (Limitations §H), reducing full transparency/reproducibility of the dataset construction, even though they avoid closed VLMs.
- Video grounding is still far from “solved”
  - They note that video grounding metrics are substantially lower than typical image grounding scores (Limitations §H), reflecting:
    - larger visual volume (many frames),
    - re-identification difficulty across time,
    - reduced effective resolution for long videos,
    - vision encoders not necessarily pretrained on video.
- Long-video grounding support is limited
  - Grounding training is limited to videos up to ~3 minutes, and extending longer is complicated because annotations are aligned to 2 fps while longer videos would require lower sampling rates (Limitations §H).
- Degenerate grounding outputs
  - Models sometimes emit repeated points (same point every frame, or many points in one frame), especially for high-frequency objects or long videos (Limitations §H), suggesting remaining robustness issues and/or task interference in joint training.
- Captioning degeneration for extremely long outputs
  - Repetition at the end of very long captions is observed, particularly under greedy decoding (Limitations §H; Appendix C notes repetition issues).
- Trade-off: long-context post-training helps long-video QA but can hurt captioning
  - Table 11 quantifies this: long-context SFT improves long-video QA but drops caption F1.
- Compute and data constraints for ultra-long videos
  - The paper attributes lagging behind the best open-weight models partly to limited open long-video training data and compute limitations for extensive ultra-long context training (Section 4.1).
7. Implications and Future Directions¶
- Impact on the field
  - Provides an end-to-end open baseline (data + recipe + code) for video grounding, which is positioned as underdeveloped in open research compared to proprietary systems.
  - The dataset suite and training tricks (message trees, packing, token weighting) give concrete scaffolding for others to replicate and extend.
- Research directions suggested by the paper’s findings/limitations
  - Open vision encoders: replace closed-data `SigLIP 2` with competitive open-data alternatives (Limitations §H).
  - Better long-video grounding alignment: develop sampling/annotation alignment methods when fps must drop for long videos (Limitations §H).
  - Reduce degenerate grounding outputs: mitigate repeated-point failure modes via more targeted data, better decoding constraints, or reduced task interference (Limitations §H).
  - Improve long-caption generation stability: address repetition/nonsense in very long captions (Limitations §H).
- Practical applications / downstream use
  - Any application needing “answer + where/when evidence” can benefit:
    - video search with clickable moments,
    - robotics/monitoring tasks requiring localized events (e.g., grasps, falls),
    - sports analytics requiring tracking/counting.
  - The model’s explicit point/track interface is designed to be consumable by downstream tools (e.g., converting points to masks using `SAM 2` in evaluation).
- Repro/Integration Guidance
  - When to prefer `Molmo2` (based on this paper’s evidence):
    - If you need open weights + open training data + open code and care about video grounding (pointing/tracking), Molmo2 is explicitly engineered for this and shows large gains on their pointing/tracking metrics (Tables 3–5).
    - If your primary workload is OCR-heavy document QA, the paper notes Molmo2 can lag behind some open-weight models (Section 4.3; Table 6), so you may need additional OCR-focused training data or a different model.
  - Training recipe components that appear most important (per ablations):
    - include timestamps/time tokens for video (Table 8b),
    - keep video pooling at 3×3 rather than larger pooling for caption detail (Table 8b),
    - use point-then-count for counting supervision (Table 9a),
    - expect a captioning vs long-video-QA trade-off when adding long-context post-training (Table 11).