
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

ArXiv: 2409.17146

🎯 Pitch

Molmo introduces a family of state-of-the-art open vision-language models (VLMs) trained entirely on PixMo, a suite of fully open, non-distilled multimodal datasets built without synthetic data from proprietary VLMs. Through new data collection methods (dense, speech-driven captions, fine-grained 2D pointing, and interactive QA), Molmo sets a new standard for reproducibility and transparency while delivering performance that rivals or exceeds proprietary models such as Gemini 1.5 Pro and Claude 3.5 Sonnet. This lets the research community study, build, and deploy competitive VLMs without closed data dependencies, supporting broader progress in vision, reasoning, and grounded agent capabilities.


1. Executive Summary

Molmo is a family of open-weight vision–language models (VLMs) trained end-to-end on a fully open multimodal data suite called PixMo—without distilling any supervision from proprietary VLMs. With careful model and training design plus new data collection methods (dense speech-driven captions, interactive QA, and large-scale 2D pointing), Molmo-72B achieves state-of-the-art results among open models and competes with top proprietary systems, ranking second to GPT‑4o in human preference while outperforming many others (Table 1; §5).

2. Context and Motivation

  • Problem addressed
  • Most top-performing VLMs are closed (weights, data, and code are unavailable), and many strong open-weight models are trained on synthetic labels generated by those same closed models (i.e., distillation), obscuring what is required to build performant VLMs from scratch (§1; Table 1 grouping). The key missing piece is high-quality, open, non-distilled multimodal data and a clear, reproducible recipe that achieves competitive performance.
  • Why it matters
  • Scientific impact: Enables reproducible research on how multimodal capabilities emerge and on which data/architectural choices matter (§1).
  • Practical impact: Capabilities like pointing (2D grounding) support downstream agents for robot navigation, UI control, and fine-grained visual explanations (§3 PixMo-Points description).
  • Prior approaches and their limitations
  • Fully open early works (e.g., LLaVA) now lag far behind (§1).
  • Many stronger open-weight models depend on distillation from proprietary VLMs (e.g., via ShareGPT4V captions), so they’re “open weights but closed supervision” (§1; see Table 1 “Open weights + data († distilled)”).
  • High-quality multimodal data is costly; naïve crowd-sourced dense captions are low quality and prone to copy–paste from proprietary models (§3 PixMo-Cap).
  • How this work positions itself
  • Offers an open suite of models (1B–72B) plus open, non-distilled data (PixMo) and open training code and evaluations—explicitly designed to remove dependence on proprietary VLM outputs (Figure 11, VLM Openness Comparison).

3. Technical Approach

This section summarizes the full system: model architecture, image processing, data, training, and implementation details.

  • Core architecture (Figure 2; §2)
  • Pre-processor: Splits each input image into a low-resolution “overview” and multiple high-resolution square crops that tile the image. A novel “overlapping multi-crop” strategy ensures border patches retain neighborhood context (Figure 3; §2 Cropping; Appendix A.1).
  • Vision encoder: Primarily ViT-L/14 (CLIP 336px) but also works with SigLIP and fully open MetaCLIP (Table 2a).
  • Connector: Concatenates features from the third-to-last and tenth-to-last encoder layers, applies 2×2 attention pooling (query = mean of the 4 patches), and maps the result into the LLM embedding space via an MLP (§2 Vision–language connector; Table 2f).
  • Decoder-only LLM: Variants include fully open OLMo-7B, OLMoE‑1B–7B (MoE), and open-weight Qwen2-7B/72B (§2).
  • Token sequencing: Low-res tokens come first, then high-res crop tokens, with special tokens marking image/crop boundaries and row changes (§2 Arranging vision tokens; Figure 5).
  • Dropout scheme: During pre-training on dense captions, dropout is applied only to text tokens (not vision tokens) to encourage reliance on visual evidence (§2 Dropout; Table 2c).
  • Multi-annotation batching: All annotations for the same image are packed into one sequence, with attention masks arranged so that each annotation's tokens attend to the shared image tokens and to their own annotation but never to other annotations. This avoids redundant image encoding and halves training time (§2 Multi-annotated images).
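The 2×2 attention pooling step in the connector can be illustrated with a small sketch. This is not the paper's implementation: the function name `attention_pool_2x2` is hypothetical, and it assumes single-head dot-product attention with identity query/key/value projections, whereas a real connector would use learned projections followed by the MLP.

```python
import math


def attention_pool_2x2(grid):
    """Pool an H x W grid of feature vectors down to (H/2) x (W/2).

    Each 2x2 window is reduced with dot-product attention whose query
    is the mean of the window's four patch vectors.
    Assumption: identity projections (no learned weights).
    """
    h, w = len(grid), len(grid[0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, h, 2):
        row = []
        for c in range(0, w, 2):
            window = [grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1]]
            # Query = mean of the four patches in the window.
            query = [sum(p[d] for p in window) / 4.0 for d in range(dim)]
            # Scaled dot-product scores against each patch (used as key).
            scores = [sum(q * k for q, k in zip(query, p)) / math.sqrt(dim)
                      for p in window]
            # Numerically stable softmax over the four scores.
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Weighted sum of the patches (used as values).
            row.append([sum(wt * p[d] for wt, p in zip(weights, window))
                        for d in range(dim)])
        pooled.append(row)
    return pooled
```

Pooling this way reduces the number of vision tokens by a factor of four while letting the window content decide how the four patches are mixed.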

  • Overlapping multi-crop image encoding (why and how)

  • Why: Standard ViTs accept a single fixed-size square; large scenes lose detail; simple tiling creates contextless borders (§2 Cropping).
  • How: Build a grid of overlapping crops, encode each independently, but pass only non-overlapping features to the LLM so the tiled features exactly cover the high-res image (Figure 3; Appendix A.1). Special learned embeddings flag padding regions so the model can distinguish true black borders from padded background (Appendix A.1).
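The "overlap when encoding, tile when forwarding" idea can be sketched along one image axis, in units of ViT patches. This is a minimal illustration under assumptions: the helper names are hypothetical, 24 patches per crop side follows from a 336px crop with 14px patches, and the overlap margin of 4 patches is chosen only for the example.

```python
def kept_patch_ranges(n_crops, crop_patches, margin):
    """Return, per crop, the [lo, hi) patch range kept along one axis.

    Crops are placed so neighbours overlap by 2 * margin patches.
    Each crop drops `margin` patches on every side that touches a
    neighbour, so the kept ranges tile the axis exactly.
    """
    stride = crop_patches - 2 * margin
    assert stride > 0, "margin too large for crop size"
    ranges = []
    for i in range(n_crops):
        start = i * stride  # first patch covered by crop i
        lo = start + (margin if i > 0 else 0)
        hi = start + crop_patches - (margin if i < n_crops - 1 else 0)
        ranges.append((lo, hi))
    return ranges


def tiles_exactly(ranges):
    """True if the kept ranges are contiguous, with no gaps or overlaps."""
    return all(a[1] == b[0] for a, b in zip(ranges, ranges[1:]))


ranges = kept_patch_ranges(n_crops=3, crop_patches=24, margin=4)
print(ranges)                 # [(0, 20), (20, 36), (36, 56)]
print(tiles_exactly(ranges))  # True
```

Applying the same rule independently to rows and columns gives the 2D case: every crop is encoded with full context, but the LLM receives each image region only once.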

  • The PixMo data suite (Figure 1; §3)

  • PixMo-Cap (dense caption pre-training)
    • 712k images; 1.3M audio transcripts and long-form captions averaging 196 words—orders of magnitude richer than standard captions (§3 PixMo-Cap).
    • Modality-switching trick: annotators speak for 60–90s; speech-to-text yields transcripts. This produces richer details faster, and the audio serves as a “receipt” proving no VLM was used (§3).
    • An LLM cleans/merges transcripts to produce final text (language-only; no VLM).
  • PixMo-AskModelAnything (instruction-style fine-tuning)
    • 162k QA pairs over 73k images, collected with a human-in-the-loop process: annotators write questions; a non‑VLM OCR system and the PixMo-Cap-pretrained model supply evidence to a language-only LLM that drafts answers; annotators then approve or revise them (§3 PixMo-AskModelAnything).
  • PixMo-Points (2D pointing and counting)
    • 2.3M question–point pairs over 223k images; supports (i) referring by pointing, (ii) counting-by-pointing, (iii) pointing as visual explanation (§3 PixMo-Points).
    • Points are lightweight to annotate compared to boxes/masks, enabling scale; includes “not present” cases for negative grounding (§3; Figure 10 distribution in §F).
  • Synthetic datasets (no VLMs used)

    • PixMo-CapQA: 214k QA pairs created from ground-truth captions (language-only LLM) (§3).
    • PixMo-Docs: 255k document/chart/table/diagram renderings + 2.3M QA generated with privileged access to rendering code; supports OCR-free document understanding (§3; Figure 4; §F PixMo-Docs).
    • PixMo-Clocks: 826k synthetic clocks from ~50 watch bodies and ~160k faces for robust time-reading (§3; Figures 1 and 17).
    • PixMo-Count: 36k detector-based counting images (0–10) with points + QA; harder test sets hand-verified (540 images each) (§3).
  • Training pipeline (Figure 4; §4; Appendix A.2, B)

  • Pre-training (on PixMo-Cap only)
    • Task: generate either a long caption or a transcript; inputs include an optional length hint (0–100) 90% of the time to target output length (Table 2e; Appendix B.1; Figure 7).
    • No separate connector-only stage; instead, use higher LR and shorter warmup for connector to align faster (§4 Pre-training). This simplifies the pipeline and avoids large noisy web data stages.
    • Optimizer: AdamW with cosine decay to 10% peak LR, separate LRs for ViT/connector/LLM, gradient clipping per subsystem; 4 epochs total (§4; Appendix A.2).
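The schedule described above (warmup, then cosine decay to 10% of the peak, with separate settings per component) might look like the following sketch. The linear warmup shape, the function name `lr_at`, and every numeric value in `COMPONENTS` are assumptions for illustration, not values from the paper; they only reflect the stated idea that the connector gets a higher learning rate and a shorter warmup.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_steps, final_frac=0.1):
    """Learning rate at `step`: linear warmup, then cosine decay.

    The rate decays to `final_frac * peak_lr` at `total_steps`.
    Assumption: warmup is linear (the summary does not state its shape).
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)


# Hypothetical per-component settings: the connector warms up faster
# and peaks higher than the ViT and the LLM.
COMPONENTS = {
    "connector": {"peak_lr": 2e-4, "warmup_steps": 200},
    "vit": {"peak_lr": 6e-6, "warmup_steps": 2000},
    "llm": {"peak_lr": 2e-5, "warmup_steps": 2000},
}

for name, cfg in COMPONENTS.items():
    print(name, lr_at(500, 20000, cfg["peak_lr"], cfg["warmup_steps"]))
```

With these illustrative settings, at step 500 the connector has already finished warmup while the other two components are still ramping up.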
  • Supervised fine-tuning (SFT)

    • Mixture includes PixMo datasets and standard academic datasets (VQA v2.0, TextVQA, ChartQA, DocVQA, InfographicVQA, AI2D, A-OKVQA, AndroidControl, ScienceQA, TabMWP, ST-VQA, TallyQA, DVQA, FigureQA, PlotQA) (§4).
    • Sampling: roughly proportional to square root of dataset size with manual balancing; pointing data is upweighted since it learns slower (Figure 4; Table 7).
    • Style tags: dataset-specific prefixes (e.g., vqa2:) encourage benchmark-appropriate short answers without leaking those styles into user-facing responses (§4).
    • Point output format: plain-text <point ...> or <points ...> with normalized coordinates 0–100; counting is taught as “point then count” (chain-of-thought) (§4; Table 4a–4d).
    • Resolution at eval: generally 36 crops for academic benchmarks (higher resolution) except counting tasks, which are sensitive to a train–test crop mismatch; a short high-res SFT recovers counting at higher crop counts (Table 12).
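Square-root sampling with manual adjustments can be sketched as follows. The function name `mixture_rates` and the multiplier for pointing data are hypothetical; the dataset sizes are the pair counts quoted earlier in this summary and are used only to illustrate the effect, not to reproduce the actual mixture in Table 7.

```python
import math


def mixture_rates(sizes, multipliers=None):
    """Sampling rate per dataset, proportional to sqrt(size).

    `multipliers` holds optional manual adjustments (for example,
    upweighting a task that learns slowly). Rates sum to 1.
    """
    multipliers = multipliers or {}
    raw = {name: math.sqrt(n) * multipliers.get(name, 1.0)
           for name, n in sizes.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


sizes = {
    "pixmo_ask_model_anything": 162_000,
    "pixmo_points": 2_300_000,
    "pixmo_cap_qa": 214_000,
}
# Hypothetical manual upweighting of pointing data.
rates = mixture_rates(sizes, multipliers={"pixmo_points": 2.0})
for name, rate in rates.items():
    print(f"{name}: {rate:.3f}")
```

Compared with sampling proportionally to size, the square root compresses the gap between large and small datasets, so small datasets are seen more often than their raw share would allow.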
  • Implementation and efficiency (Appendix A.3, B.3)

  • FSDP training with PyTorch SDPA attention; AMP used but model weights and gradient reduction kept at float32 for stability (Figure 6).
  • Gradient normalization uses the global average number of loss tokens across devices to avoid bias from shorter sequences (Appendix A.3).
  • GPU budget: H100s; e.g., pretrain Molmo‑72B on 128 GPUs for 33.3h (~4.2k GPU hours), SFT 256 GPUs for 32.4h (~8.3k GPU hours) (Table 8).
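The motivation for normalizing by the global average number of loss tokens can be shown with a toy simulation of two devices. This sketch uses plain lists in place of a distributed all-reduce, and both function names are hypothetical. Dividing each device's summed loss by its own token count gives extra weight to tokens on devices with short sequences; dividing by the cross-device average token count recovers the true per-token mean.

```python
def per_device_mean(loss_sums, token_counts):
    """Biased variant: each device normalizes by its own token count."""
    per_dev = [s / n for s, n in zip(loss_sums, token_counts)]
    return sum(per_dev) / len(per_dev)


def global_token_mean(loss_sums, token_counts):
    """Each device normalizes by the average token count across devices.

    Averaging the results over devices then equals
    sum(loss_sums) / sum(token_counts), the true per-token mean.
    """
    avg_tokens = sum(token_counts) / len(token_counts)
    per_dev = [s / avg_tokens for s in loss_sums]
    return sum(per_dev) / len(per_dev)


# Device 0: 10 tokens with loss 1.0 each; device 1: 30 tokens with loss 3.0 each.
loss_sums = [10.0, 90.0]
token_counts = [10, 30]
print(per_device_mean(loss_sums, token_counts))    # 2.0
print(global_token_mean(loss_sums, token_counts))  # 2.5
```

In a real FSDP run the average token count would come from a collective operation across ranks rather than a Python list.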

4. Key Insights and Innovations

  • High-quality, non‑distilled multimodal data at scale (§3; Figures 1, 4)
  • Novel speech-driven dense captions with “audio receipts” and interactive QA collection produce detailed, accurate supervision without using any proprietary VLM outputs.
  • Significance: Removes the dependency loop of distilling closed models; enables studying “from-scratch” ingredients and yields strong downstream performance (Table 3a–3c; Table 5).
  • 2D pointing as a first-class supervision and inference modality (§3 PixMo-Points; §4; Table 4)
  • The model outputs normalized coordinates as part of answers, enabling pointing-based grounding and a chain-of-thought style of counting (“point then count”).
  • Significance: Improves counting accuracy (Table 4a) and offers pixel-grounded explanations. This is a capability enabler for agents (robotics/UI control).
  • Overlapping multi-crop image encoding (§2 Cropping; Figure 3)
  • Adds overlap during cropping but only forwards non-overlapped tiles to the LLM, preserving context at patch boundaries without inflating token counts (Appendix A.1).
  • Significance: Large gains over single-crop and over non-overlapping tiling (Table 2d: 62.8 → 75.7 → 76.9 on 11‑avg).
  • Pre-training task design that generalizes (§2 Dropout; §4 Pre-training; Table 2c, 2e; Figure 7)
  • Text-only dropout encourages reliance on vision tokens; length-conditioned captioning is a stronger pre-training objective than vanilla captioning.
  • Significance: Both raise the caption F1 and the average across 11 downstream benchmarks (Table 2c, 2e); cap F1 correlates with 11‑avg (Pearson ρ = 0.82; Figure 9), offering a fast proxy for model selection.
  • Efficient multi-annotation training (§2 Multi-annotated images)
  • Packing all annotations for a given image into one sequence with attention masks avoids redundant vision encoding, cutting training time while preserving correctness.
  • Significance: Practical scalability with large, richly annotated datasets.
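The attention mask used for multi-annotation packing can be sketched as below. This is an illustration under assumptions: the function name `build_packed_mask` is hypothetical, image tokens are placed first, and attention is causal as in a decoder-only LLM. Any position-id handling a real implementation may need is omitted.

```python
def build_packed_mask(n_image_tokens, annotation_lengths):
    """Boolean mask where mask[i][j] is True if token i may attend to token j.

    Layout: image tokens first, then each annotation's tokens in turn.
    A token attends causally to image tokens and to its own annotation,
    never to tokens of a different annotation.
    """
    # Segment id 0 marks image tokens; 1..K mark the K annotations.
    segment = [0] * n_image_tokens
    for k, length in enumerate(annotation_lengths, start=1):
        segment.extend([k] * length)
    n = len(segment)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only current and earlier positions
            mask[i][j] = segment[j] == 0 or segment[j] == segment[i]
    return mask


mask = build_packed_mask(n_image_tokens=2, annotation_lengths=[2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because every annotation sees the same image tokens, the image is encoded once per packed sequence instead of once per annotation.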

5. Experimental Analysis

  • Evaluation methodology (§5; Appendix C, D)
  • Academic suite: 10 common datasets plus a new harder counting test (PixMo‑Count). The paper uses consistent prompts and, where appropriate, dataset-specific style tags to elicit benchmark-specific answer formats (§5; Table 1).
  • Human preference: ~15k image–question pairs across 10 categories, ~325k pairwise comparisons from ~870 annotators; Elo computed via Bradley–Terry model following Chatbot Arena (§5; Appendix C).
  • Additional skills: clock reading benchmark (Table 10), pointing evaluation (Table 11), high-res fine-tuning for resolution shifts (Table 12), and Chatbot Arena ranking (Table 9).
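How pairwise preferences turn into Elo-style scores can be sketched with a small Bradley–Terry fit. This is not the paper's or Chatbot Arena's code: it uses the classic minorization-maximization update, the function names are hypothetical, the Elo scale (400 points per factor of 10, centred at 1000) is an assumed convention, and every model is assumed to have at least one win.

```python
import math


def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i][j] is the number of times model i was preferred over model j.
    Returns strengths normalized to a geometric mean of 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            assert total_wins > 0, "each model needs at least one win"
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(total_wins / denom)
        # Fix the overall scale, which the model leaves undetermined.
        log_mean = sum(math.log(x) for x in new_p) / n
        p = [x / math.exp(log_mean) for x in new_p]
    return p


def to_elo(strengths, base=1000.0, scale=400.0):
    """Map strengths onto an Elo-like scale (assumed convention)."""
    return [base + scale * math.log10(s) for s in strengths]


wins = [[0, 3], [1, 0]]  # model 0 preferred 3 times, model 1 once
strengths = bradley_terry(wins)
print(to_elo(strengths))
```

Under this model the probability that model i is preferred over model j is `p[i] / (p[i] + p[j])`, so a 3-to-1 record yields a fitted preference probability of 0.75.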

  • Headline results (Table 1; §5)

  • Overall average across 11 benchmarks:
    • “API only”: GPT‑4o‑0513 averages 78.5; Claude 3.5 Sonnet 76.7; Gemini 1.5 Pro 78.3; Flash 75.1.
    • “Open weights only”: Qwen2‑VL‑72B 79.4; Llama‑3.2V‑90B 74.5.
    • “Open weights + data († distilled)”: LLaVA OneVision‑72B† 76.6.
    • Molmo family:
      • MolmoE‑1B: 68.6 (nearly matches GPT‑4V’s 71.1 on average and Elo; §5).
      • Molmo‑7B‑O (OLMo‑7B): 74.6.
      • Molmo‑7B‑D (Qwen2‑7B): 77.3.
      • Molmo‑72B: 81.2, which “achieves the highest academic benchmark score and ranks second by human preference, just behind GPT‑4o” (§5).
  • Human preference (Table 1 “Elo score”, plus §5):

    • Molmo‑72B: Elo 1077 (rank 2) vs GPT‑4o‑0513 Elo 1079 (rank 1). It outperforms Gemini 1.5 Pro/Flash and Claude 3.5 Sonnet in Elo (Table 1).
  • Area-specific strengths and weaknesses (§5; Table 1)

  • Strong on natural images and counting:
    • VQA v2.0: Molmo‑72B (86.5) competes with top systems; RealWorldQA: Molmo performs strongly (§5; Table 1).
    • Counting: “point then count” delivers top scores: e.g., Molmo‑72B CountBenchQA 91.2; PixMo‑Count 85.2, leading all models (Table 1; §5).
  • OCR/document/plot tasks:
    • Molmo surpasses many open and some proprietary models on ChartQA/DocVQA/InfoQA/TextVQA but trails Qwen2‑VL on these OCR-centric tasks (Table 1 discussion in §5).
  • Reasoning benchmarks:

    • MMMU and MathVista: Molmo lags (e.g., Molmo‑7B‑D scores 45.3 on MMMU and 51.6 on MathVista vs higher proprietary models; Table 1; §5), likely reflecting the training mix (less advanced reasoning data).
  • Robustness and targeted evaluations

  • Clock reading (Table 10; §D)
    • Molmo models (even 1B) substantially outperform both proprietary APIs and open-weight baselines on COCO/OpenImages/ClockMovies time-reading, but still trail a specialized single-task clock reader (78.9 overall). E.g., Molmo‑7B‑D overall 68.2 vs Qwen2‑VL‑72B 9.1.
  • Pointing quality (Table 11; §D)
    • Molmo‑72B achieves precision/recall/F1 ≈ 75/75/75 on a curated pointing benchmark. Note: pointing degrades when the number of crops at test time differs from the training setting (Table 11; Table 12).
  • High-resolution SFT recovers counting under resolution change (Table 12)
    • Simply increasing test crops from 12→36 hurts counting (e.g., PixMo‑Count 85.2→73.9), but a brief high-res SFT (training crops raised 12→36, evaluated at 36) recovers to 87.4 without hurting overall 11‑avg (Table 12).
  • Human preference ablations (Table 5; Figure 8)

    • Removing PixMo‑Cap (“no PixMo‑Cap data”) reduces win rate vs default Molmo‑7B‑D to 35% and Elo to 990, highlighting dense caption data’s importance.
    • Using GPT‑4o captions on PixMo images is competitive in human preference (Elo 1018, 55% win rate vs default), suggesting strong image coverage and modern captioners matter (Table 3b; Table 5).
  • Ablations (Tables 2–4, 15; §6; §E)

  • Image resolution: more crops help; single-crop hurts badly (11‑avg 62.8). Overlap is superior to non-overlap (75.7→76.9) (Table 2b, 2d).
  • Connector pooling: attention pooling beats stacking (11‑avg 76.1→76.9) (Table 2f).
  • Dropout and length conditioning both help (Table 2c, 2e).
  • Data scaling: more PixMo-Cap → monotonic gains; 0→712k images increases 11‑avg from 74.9→76.9 (Table 3a).
  • Pre-training choices: LAION stage or ShareGPT4V/o underperform PixMo-Cap; GPT‑4o captions on PixMo images perform very well, close to using human transcripts (Table 3b).
  • Pointing/counting mechanics: “point then count” > “count then point”; ordering points (top-down, left-to-right) helps; plain-text coordinate tokens beat special tokens (Table 4a–4d).
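The post-processing side of "point then count" can be sketched with a few helpers. This is an illustration only: the function names are hypothetical, and the sketch deliberately works on already-parsed (x, y) pairs instead of guessing the exact point tag syntax. It assumes coordinates are normalized to 0–100 and that the ordering is by row first, then column.

```python
def to_pixels(points, width, height):
    """Convert points normalized to 0-100 into pixel coordinates."""
    return [(x / 100.0 * width, y / 100.0 * height) for x, y in points]


def order_points(points):
    """Order points top-down, then left-to-right (sort by y, then x)."""
    return sorted(points, key=lambda p: (p[1], p[0]))


def point_then_count(points):
    """Return the ordered points and the count implied by them."""
    ordered = order_points(points)
    return ordered, len(ordered)


normalized = [(50, 80), (10, 20), (90, 20)]
ordered, count = point_then_count(normalized)
print(ordered)  # [(10, 20), (90, 20), (50, 80)]
print(count)    # 3
print(to_pixels(ordered, width=640, height=480))
```

Deriving the count from the emitted points makes the answer checkable: each counted object has a location that can be drawn on the image.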

  • External leaderboard

  • Chatbot Arena (vision, English): Molmo‑72B (1115 ± ~18) is the highest-ranked open(-weight+data) model below a tier of proprietary systems (Table 9; §D).

  • Do the experiments support the claims?

  • The combination of cross-benchmark averages (Table 1), human preference (Table 1; Table 5; Figure 8), and targeted skill probes (Tables 10–12) coherently support the main claims: PixMo data + pipeline yields SOTA open models; pointing enables superior counting; dense caption pretraining plus overlapping crops and pooling are effective design choices. The paper is unusually transparent about evaluation details and ablations (Appendix C–E), which strengthens the evidence.

6. Limitations and Trade-offs

  • Dependence on language-only LLMs for data cleaning/synthesis
  • While no VLMs are used for supervision, the pipelines do use closed LLMs (e.g., Claude 3.5 Sonnet, GPT‑4o‑mini) for code generation and caption post-processing (§F PixMo-Docs; §3 PixMo-Cap). This keeps supervision non-VLM but is not “fully open tooling.” The paper argues open LLMs can replace these as they improve (Related Work §H).
  • Reasoning gaps and NLP regression
  • Advanced reasoning: MMMU/MathVista results lag stronger proprietary models (Table 1; §5).
  • Text-only benchmarks: Multimodal SFT reduces the base LLM’s text-only performance (Table 13), though adding text-only SFT (Tulu 3) partly restores it—especially with 10% down-sampling—while maintaining multimodal performance.
  • Sensitivity to resolution mismatch for pointing/counting
  • Counting accuracy drops when the number of test crops differs from training (Table 12). A short high-res SFT fixes this but adds extra steps to deployment.
  • OCR/document understanding not best-in-class
  • Even though Molmo is strong, Qwen2‑VL models sometimes lead on OCR-heavy tasks (Table 1).
  • Compute cost
  • Training the largest model requires thousands of H100 GPU-hours (Table 8). While competitive for a frontier open system, this is still a significant barrier for many labs.
  • Dataset characteristics and coverage
  • PixMo‑Clocks generalizes well to wild clocks but still trails specialized systems (Table 10); more real clock faces may close the gap (§D).
  • Although “audio receipts” mitigate plagiarism from VLMs, quality and coverage still depend on annotator behavior and the curated source imagery.

7. Implications and Future Directions

  • How this changes the landscape
  • Demonstrates that frontier-level VLMs can be trained with open, non-distilled data. This breaks the previous reliance on closed VLM labelers and provides a reproducible baseline and ablation map (Figures 1–5; Tables 1–4, 15).
  • Positions pointing as a practical output modality, not just a research curiosity—enabling agentic applications (robotics, UI automation) where “where” matters as much as “what” (§3 PixMo‑Points; §4).
  • Follow-up research enabled
  • Data: Expand PixMo-Points to richer expressions, more fine-grained temporal grounding (video), and UI-specific ontologies; broaden PixMo‑Docs with more realistic renderings and noisy capture conditions.
  • Modeling: Integrate cross-attention backbones to maintain text-only performance (Related Work §H), or hybridize with the current connector approach; investigate learned crop policies or deformable token pyramids to reduce inference cost.
  • Training: Explore curriculum schedules that emphasize reasoning (MMMU/MathVista) without sacrificing perception; combine high-res SFT with test-time adaptation for robust resolution shifts.
  • Evaluation: Standardize open evaluation recipes (prompts, style tags, crop counts) to reduce 10% swings noted across settings (§5).
  • Practical applications
  • Visual assistants with reliable fine-grained grounding and counting (retail audits, inventory, traffic analysis).
  • Document-heavy workflows (forms, charts, diagrams) with OCR-free reasoning using render-aware training (PixMo‑Docs).
  • Agentic control with pointing for mobile/desktop automation; the AndroidControl performance (low-level 88.7% / high-level 69.0%) shows feasibility (§5 summary).
  • Accessibility features (point-to-explain) and education (visual explanations).

Representative headline result (Table 1; §5):
“Our best-in-class Molmo-72B model … ranks second by human preference, just behind GPT‑4o,” and “outperforms … Gemini 1.5 Pro and Flash, and Claude 3.5 Sonnet.”

Counting via pointing (Table 4a):
“point then count” (chain-of-thought) achieves 89.4% on CountBenchQA and 86.3% on PixMo‑Count val vs 87.9%/80.2% for “count only”.

Pre-training design (Table 2d, 2e):
Overlapping crops (+1.2 to 11‑avg vs non-overlap) and length-conditioned captioning (+0.7 to 11‑avg vs no conditioning) are both important.

In sum, Molmo+PixMo contribute an open, end-to-end path to strong VLMs; introduce practical innovations (pointing, overlapping crops, length-conditioned captioning); and supply detailed ablations that clarify which modeling/data choices matter. The system is not yet the strongest on advanced reasoning or pure text benchmarks, but it substantially advances openness, capability, and reproducibility in multimodal modeling.