Skip to content

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

ArXiv: 2504.10479

🎯 Pitch

InternVL3 pioneers a native multimodal pre-training paradigm that enables large models to simultaneously learn language and visual capabilities from both text and multimodal data, eliminating the need for complex, post-hoc adaptation stages. By integrating innovations like Variable Visual Position Encoding (V2PE), advanced supervised fine-tuning and mixed preference optimization, and test-time scaling, InternVL3 achieves state-of-the-art performance among open-source models, dramatically simplifying the training pipeline and closing the gap with leading proprietary MLLMs. This approach greatly enhances scalability, efficiency, and robustness for real-world applications such as document understanding, GUI agents, and multimodal reasoning, while fostering transparency and reproducibility through the release of both data and model weights.


1. Executive Summary (2-3 sentences)

InternVL3 proposes a “native multimodal pre‑training” recipe that trains a vision–language model to learn language and visual skills together in a single stage, rather than adapting a text-only LLM afterward. Combined with a variable visual position encoding (V2PE), improved post-training (SFT + Mixed Preference Optimization), and test-time scaling with a visual process reward model, the 78B variant reaches state-of-the-art open-source results on many benchmarks, including 72.2 on MMMU and 906 on OCRBench (Figure 1; Tables 2–3).

2. Context and Motivation

  • Problem/gap
  • Most multimodal LLMs (MLLMs) are built by adapting a text-only LLM through multi-stage “post-hoc” pipelines that add vision later (Section 1). This often requires delicate freezing/unfreezing schedules and extra alignment data to avoid degrading language ability, while still aligning visual and textual representations.
  • These pipelines are resource-heavy and can be brittle for new domains (OCR, GUIs, long videos), leaving a need for a simpler, more scalable approach that preserves language capability and scales to long context (Section 1).

  • Why it matters

  • Practical: Robust, scalable MLLMs underpin real-world tasks like document understanding, GUI agents, spatial reasoning, and multi-image reasoning. Efficiency in training and inference is critical for open-source ecosystems.
  • Scientific: Demonstrates whether joint training on text and multimodal data can replace multi-stage adaptation, and whether position encoding innovations like V2PE help extend multimodal context effectively (Sections 2.1–2.2).

  • Prior approaches and shortcomings

  • “Post-hoc” alignment (e.g., LLaVA-style pipelines) needs complex stages and special datasets (OCR, charts, etc.) to bridge modality gaps (Section 1). It risks catastrophic forgetting of language skills and adds engineering complexity.
  • Existing long-context methods typically treat all tokens uniformly in positional encoding, so visual tokens quickly eat context budget and limit sequence length (Section 2.1: V2PE motivation).

  • Positioning

  • InternVL3 unifies pre-training for text and vision (“native multimodal pre-training,” Section 2.2), adopts a modality-aware positional encoding (V2PE, Section 2.1), adds targeted post-training (SFT + MPO, Section 2.3), and leverages test-time scaling with a visual process reward model (VisualPRM, Section 2.4). Infrastructure enhancements (InternEVO) improve training throughput (Section 2.5).

3. Technical Approach

  • Architecture (“ViT–MLP–LLM,” Section 2.1; Table 1)
  • Vision encoder: InternViT-300M (for 1–14B models) or InternViT-6B (for 38B/78B) at 448 px input with pixel unshuffle, which reduces visual tokens by 4×, mapping each 448×448 tile to 256 tokens (Section 2.1).
  • Language model: pre-trained base LLMs from Qwen2.5 (0.5B–72B) or InternLM3-8B; no instruction-tuned LLMs are used for initialization (Table 1).
  • Connector: a 2-layer MLP maps visual embeddings to the LLM space (randomly initialized; Section 2.1).

  • Variable Visual Position Encoding (V2PE; Section 2.1, Eqs. (1)–(4))

  • Goal: Stretch multimodal context by “spending” fewer position increments on visual tokens.
  • Mechanism:
    • Sequence tokens x1..xL receive position indices p1..pL recursively (Eq. (2)). For text tokens, index increases by 1; for visual tokens, by a fraction δ<1 (Eq. (3)).
    • During training, δ is sampled per image from a set {1, 1/2, 1/4, …, 1/256} (Eq. (4)); during inference, δ can be chosen based on sequence length to keep positions within the context window.
  • Intuition: Visual tokens consume less “positional budget,” enabling longer sequences without losing text positional resolution.

  • Native Multimodal Pre-Training (Section 2.2)

  • Objective: Autoregressive next-token prediction but compute loss only on text tokens (L_text-only, Eq. (6)), while visual tokens serve as conditioning context.
    • Formal: Minimize expected loss over a combined dataset of text-only and multimodal samples, updating all parameters jointly (Eq. (8)).
  • Loss scaling: Uses “square averaging” to balance gradients across short/long answers (Eq. (7)); mitigates bias that either favors long or short responses.
  • Data mixture: Approximately 200B training tokens with a 1:3 ratio of language:multimodal tokens (50B language, 150B multimodal; Section 2.2 “Data”). Multimodal sources cover captioning, OCR, charts, math, documents, video, GUIs, tools, 3D scenes, etc.

  • Post-Training: SFT and Mixed Preference Optimization (MPO; Section 2.3)

  • SFT (supervised fine-tuning): Uses high-quality, diverse instruction data; preserves InternVL2.5 practices (random JPEG compression, square loss re-weighting, multimodal packing). Expanded data for tools/3D/GUI/long-context/video/diagrams/CoT, growing SFT corpus from 16.3M to 21.7M samples.
  • MPO: Addresses exposure bias and improves chain-of-thought by combining:

    • Preference loss (L_p, DPO-style, Eq. (10)) to learn chosen vs rejected responses.
    • Quality loss (L_q, BCO, Eqs. (11)–(13)) to model absolute response quality with a reward shift.
    • Generation loss (L_g, Eq. (6)) to learn to produce preferred responses.
    • Total loss is a weighted sum L = w_p L_p + w_q L_q + w_g L_g (Eq. (9)). MPO uses ~300K preference pairs derived following MMPR v1.2 (Section 2.3 “Data”).
  • Test-Time Scaling with VisualPRM (Section 2.4)

  • Best-of-N (BoN): Sample multiple candidate solutions and select one using an external critic.
  • VisualPRM (Visual Process Reward Model): A separate 8B model that assigns step-level correctness probabilities and averages them to score a full solution (Eq. (14)). Trained on VisualPRM400K (extended with InternVL3 rollouts).

  • Infrastructure: InternEVO extensions (Section 2.5)

  • Flexible sharding of ViT/MLP/LLM modules; supports data/tensor/sequence/pipeline parallelism and combinations; overlaps communication/computation.
  • Dynamic load balancing between visual and language modules to handle varying token proportions.
  • Supports sequences up to 32K via head-parallel + sequence-parallel. Reports 50–200% training speedup vs. InternVL2.5 at the same compute budget.

4. Key Insights and Innovations

  • Native multimodal pre-training (fundamental)
  • What’s new: Jointly optimize all parameters on a mixture of text-only and multimodal data from scratch in one stage (Section 2.2), with loss computed only on text tokens (Eq. (6)).
  • Why it matters: Simplifies pipelines, eliminates fragile freeze/unfreeze schedules, and aligns vision/text representations during the earliest stage. Figure 3 shows that even without post-training, native pre-training alone yields strong multimodal capability; with SFT it surpasses the classic multi-stage pipeline.

  • Variable Visual Position Encoding (V2PE) for long context (fundamental)

  • What’s new: Modality-specific position increments so visual tokens consume fewer position steps (Eqs. (2)–(4)).
  • Why it matters: Extends effective multimodal context without changing model size. Table 12 shows consistent gains on standard tasks even with moderate context, contradicting earlier reports that V2PE helps only for long contexts.

  • Mixed Preference Optimization (MPO) that blends relative and absolute supervision (incremental but impactful)

  • What’s new: Combines DPO-style pairwise preference (Eq. (10)) with BCO-style absolute quality modeling (Eqs. (11)–(13)) plus standard generation loss (Eq. (6)).
  • Why it matters: Improves reasoning stability under self-generated tokens. Table 13 shows across seven reasoning benchmarks, MPO boosts overall scores by +1.5 to +4.5 depending on model size.

  • Test-time scaling with VisualPRM as a critic (incremental but effective)

  • What’s new: Step-level process supervision to score and select the best sampled solution (Section 2.4).
  • Why it matters: Substantial test-time boosts in math/reasoning. For instance, InternVL3-8B improves MathVista from 71.6 to 75.2 and MathVision from 29.3 to 37.5 with BoN=8 (Table 2, “w/ VisualPRM-Bo8”).

  • Practical training system upgrades (incremental)

  • InternEVO enables 32K sequences and 50–200% faster training than InternVL2.5 (Section 2.5), critical for scaling to 78B parameters with long-context inputs.

5. Experimental Analysis

  • Evaluation setup
  • Benchmarks span:
    • Multimodal reasoning and math: MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista (Table 2).
    • OCR, charts, documents: AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED-2-Plus, CharXiv, VCR (Table 3).
    • Multi-image and real-world: BLINK, Mantis-Eval, MMIU, MuirBench, MMT-Bench, MIRB; RealWorldQA, MME-RealWorld, WildVision, R-Bench (Table 4).
    • Comprehensive capability and hallucination: MME, MMBench (EN/CN), MMBench v1.1, MMVet v1/v2, MMStar; HallusionBench, MMHal, CRPE, POPE (Table 5).
    • Grounding: RefCOCO/+/g (Table 6).
    • Multilingual: MMMB, Multilingual MMBench, MTVQA across EN/ZN/PT/AR/TR/RU (Table 7).
    • Video: Video-MME, MVBench, MMBench-Video, MLVU, LongVideoBench, CG-Bench (Table 8).
    • GUI grounding: ScreenSpot and ScreenSpot-V2 (Table 9).
    • Spatial reasoning: VSI-Bench (Table 10).
    • Language-only capability: MMLU, CMMLU, C-Eval, GAOKAO, TriviaQA, NQ, RACE, WinoGrande, HellaSwag, BBH, GSM8K, MATH, TheoremQA, HumanEval, MBPP/MBPP-CN (Table 11).
  • Models: 1B to 78B variants (Table 1). Results compared to Qwen2.5-VL series and closed models like GPT‑4o, Claude 3.5, Gemini 1.5/2.0/2.5 (Figures 1–2; multiple tables).

  • Main quantitative highlights

  • Overall multimodal strength
    • “InternVL3-78B achieves MMMU 72.2” (Table 2), leading open-source MLLMs and close to proprietary models (Figure 1).

    • “OCRBench 906, AI2D 89.7, ChartQA 89.7, DocVQA 95.4” (Table 3), with competitive or superior scores to Qwen2.5‑VL‑72B and GPT‑4o on several tasks.

  • Reasoning and math (Table 2)
    • “InternVL3-78B: MMMU 72.2, MathVista 79.0, MathVision 43.1, MathVerse (vision-only) 51.0.”

    • With VisualPRM-Bo8, 38B and 8B models gain +4–8 points on average (Table 2; best-of-8 rows).
  • Multi-image and real-world (Table 4)
    • “InternVL3-78B: BLINK 66.3, MMT-Bench 73.2, RealWorldQA 78.0, MME-RealWorld 65.4, WildVision 73.6, R-Bench 77.4.”

    • Competes closely with GPT‑4o on RealWorldQA (78.0 vs. 75.4) and is strong on MME-RealWorld.
  • Hallucination & comprehensive (Table 5)
    • “InternVL3-78B: MME sum 2549.8; MMBench EN/CN 89.0/88.7; MMVet 81.3; HallusionBench 59.1; CRPE 79.2; POPE 90.3.”

    • Gemini 2.5 Pro can lead on some hallucination tests (e.g., HallusionBench in Figure 1).
  • Grounding (Table 6)
    • “InternVL3-78B overall 91.4,” slightly below InternVL2.5‑78B (92.3), suggesting limited new grounding-specific data.

  • Multilingual (Table 7)
    • “InternVL3-78B overall 68.9,” edging out InternVL2.5‑78B (68.0) and comparable to Qwen2‑VL‑72B (67.2).

  • Video (Table 8)
    • “InternVL3-78B: Video-MME 72.7/75.7, MVBench 78.7, MLVU 79.5, LongVideoBench 65.7, CG-Bench 48.4/65.3.” Strong scaling trend across sizes.

  • GUI and spatial reasoning (Tables 9–10)

    • “ScreenSpot-V2: InternVL3-72B 90.9,” slightly above UI‑TARS‑72B (90.3); “VSI-Bench overall: 38B 48.9; 78B 48.4,” surpassing open and closed baselines on several sub-tasks.

  • Ablations and diagnostics

  • Native multimodal pre-training vs. classic pipeline (Figure 3)
    • The pre-trained-only model (no SFT) with native pre-training already reaches near full-pipeline performance; with SFT it exceeds it across many tasks—evidence that the unified training is effective.
  • V2PE δ sweep (Table 12)
    • Using V2PE improves many metrics even for short contexts; small δ (e.g., 1/4–1/16) often best. For fairness, the rest of the paper reports δ=1 unless noted (Section 3.14).
  • MPO impact (Table 13)

    • Improves reasoning by +1.5 to +4.5 overall points depending on scale; largest relative gains at larger scales (38B/78B), suggesting synergy with capacity.
  • Are claims supported?

  • The paper triangulates: broad benchmark coverage (Tables 2–10), detailed ablations (Figure 3; Tables 12–13), and comparisons to both open- and closed-source models (Figures 1–2). The gains on MMMU, OCRBench, RealWorldQA, and process-supervised BoN solidly substantiate the core contributions.

  • Notable mixed results

  • Visual grounding plateaus or dips slightly at the largest scale (Table 6), likely due to less grounding-specific data.
  • Hallucination scores are competitive but not dominant on every benchmark (Figure 1; Table 5).

6. Limitations and Trade-offs

  • Data and training assumptions
  • Relies on ~200B tokens total and diverse multimodal corpora (Section 2.2 “Data”). Reproducing results requires large-scale data curation and infrastructure (Section 2.5).
  • The L_text-only objective (Eq. (6)) never predicts visual tokens. While simpler and effective, it may limit the model’s ability to generate dense visual sequences (e.g., pixel-level tasks) without further adaptations.

  • Task coverage gaps

  • Visual grounding: performance saturates or slightly declines at the largest scale (Table 6), likely because the training expansion did not emphasize grounding data (Section 3.8 discussion).
  • Hallucinations: improvements are not uniform across all hallucination benchmarks (Table 5), leaving room for targeted robustness work.

  • Computational costs

  • Largest variant uses InternViT-6B + Qwen2.5-72B (Table 1) and 32K context support with complex parallelism (Section 2.5). Inference/test-time scaling (Best-of-N with VisualPRM) increases compute at evaluation time.

  • Sensitivity and tuning

  • V2PE δ selection impacts results (Table 12). Although the paper fixes δ=1 for fairness outside the ablation, practitioners will need policies to pick δ based on sequence length at inference.

  • Dependency on critic at test time

  • The biggest reasoning gains rely on VisualPRM BoN (Table 2), introducing an extra model and sampling budget during evaluation.

7. Implications and Future Directions

  • Field impact
  • Demonstrates that unified, native multimodal pre-training can replace multi-stage post-hoc adaptation, simplifying pipelines while preserving or improving language skill (Figure 3; Table 11). This can shift the community toward single-stage, mixed-data pre-training for MLLMs.
  • V2PE offers a practical recipe to extend multimodal context without enlarging the model, likely influencing future positional encoding designs for MLLMs (Section 2.1; Table 12).

  • Follow-up research

  • Data-targeted improvements: Add grounding-specific corpora and richer hallucination-robust data; probe how the L_text-only objective interacts with tasks that benefit from visual-token prediction.
  • Position strategies: Learn δ or schedule it adaptively per input instead of sampling from a fixed set; explore cross-modal relative position schemes.
  • Process supervision: Expand VisualPRM beyond math/science into multi-image reasoning and long video; investigate training-time integration (e.g., reinforcement learning with process rewards).
  • Efficiency: Combine BoN with smarter sampling (e.g., conditional early stopping, diversity-promoting decoding) to reduce test-time cost.

  • Applications

  • Strong results suggest readiness for:
    • Document and chart analysis (Table 3: DocVQA 95.4, ChartQA 89.7).
    • Enterprise OCR and form understanding (OCRBench 906).
    • GUI agents and UI automation (Table 9: up to 90.9 on ScreenSpot-V2).
    • Spatial reasoning for robotics/autonomy (Table 10: top object counting/relative distance/appearance order).
    • Long video understanding for surveillance, sports, or education (Table 8: strong scaling on MLVU/LongVideoBench).
  • Public release of code, data, and weights (front page; Figure 1 footers) lowers barriers for practitioners and researchers to build and evaluate next-generation open-source MLLMs.

Headline result: “InternVL3‑78B reaches MMMU 72.2, OCRBench 906, AI2D 89.7, ChartQA 89.7, DocVQA 95.4” (Figure 1; Tables 2–3), with compelling language-only competence (Table 11) and broad real-world robustness (Tables 4–5), achieved via a unified training paradigm plus V2PE, MPO, and test-time process supervision.