InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
ArXiv: 2504.10479
🎯 Pitch
InternVL3 introduces a native multimodal pre-training paradigm in which the model learns language and visual capabilities together from both text-only and multimodal data, removing the need for separate post-hoc adaptation stages. Combined with Variable Visual Position Encoding (V2PE), supervised fine-tuning with mixed preference optimization, and test-time scaling, InternVL3 achieves state-of-the-art performance among open-source models, simplifies the training pipeline, and narrows the gap with leading proprietary MLLMs. The approach targets scalability and robustness in applications such as document understanding, GUI agents, and multimodal reasoning, and the release of both data and model weights supports transparency and reproducibility.
1. Executive Summary (2-3 sentences)
InternVL3 proposes a “native multimodal pre‑training” recipe that trains a vision–language model to learn language and visual skills together in a single stage, rather than adapting a text-only LLM afterward. Combined with a variable visual position encoding (V2PE), improved post-training (SFT + Mixed Preference Optimization), and test-time scaling with a visual process reward model, the 78B variant reaches state-of-the-art open-source results on many benchmarks, including 72.2 on MMMU and 906 on OCRBench (Figure 1; Tables 2–3).
2. Context and Motivation
- Problem/gap
  - Most multimodal LLMs (MLLMs) are built by adapting a text-only LLM through multi-stage “post-hoc” pipelines that add vision later (Section 1). This often requires delicate freezing/unfreezing schedules and extra alignment data to avoid degrading language ability, while still aligning visual and textual representations.
  - These pipelines are resource-heavy and can be brittle for new domains (OCR, GUIs, long videos), leaving a need for a simpler, more scalable approach that preserves language capability and scales to long context (Section 1).
- Why it matters
  - Practical: Robust, scalable MLLMs underpin real-world tasks like document understanding, GUI agents, spatial reasoning, and multi-image reasoning. Efficiency in training and inference is critical for open-source ecosystems.
  - Scientific: Demonstrates whether joint training on text and multimodal data can replace multi-stage adaptation, and whether position encoding innovations like V2PE help extend multimodal context effectively (Sections 2.1–2.2).
- Prior approaches and shortcomings
  - “Post-hoc” alignment (e.g., LLaVA-style pipelines) needs complex stages and special datasets (OCR, charts, etc.) to bridge modality gaps (Section 1). It risks catastrophic forgetting of language skills and adds engineering complexity.
  - Existing long-context methods typically treat all tokens uniformly in positional encoding, so visual tokens quickly eat context budget and limit sequence length (Section 2.1: V2PE motivation).
- Positioning
  - InternVL3 unifies pre-training for text and vision (“native multimodal pre-training,” Section 2.2), adopts a modality-aware positional encoding (V2PE, Section 2.1), adds targeted post-training (SFT + MPO, Section 2.3), and leverages test-time scaling with a visual process reward model (VisualPRM, Section 2.4). Infrastructure enhancements (InternEVO) improve training throughput (Section 2.5).
3. Technical Approach
- Architecture (“ViT–MLP–LLM,” Section 2.1; Table 1)
  - Vision encoder: `InternViT-300M` (for 1–14B models) or `InternViT-6B` (for 38B/78B) at 448 px input with pixel unshuffle, which reduces visual tokens by 4×, mapping each 448×448 tile to 256 tokens (Section 2.1).
  - Language model: pre-trained base LLMs from Qwen2.5 (0.5B–72B) or InternLM3-8B; no instruction-tuned LLMs are used for initialization (Table 1).
  - Connector: a 2-layer MLP maps visual embeddings to the LLM space (randomly initialized; Section 2.1).
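The 256-tokens-per-tile figure above can be checked with a small sketch. It assumes a ViT patch size of 14 px (not stated in this summary) and a 2×2 pixel-unshuffle block; the function names and the channel width of 8 are illustrative only, not taken from the paper's code.

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold each r x r block of spatial positions into the channel axis.

    x has shape (H, W, C); the result has shape (H // r, W // r, C * r * r).
    """
    h, w, c = x.shape
    if h % r or w % r:
        raise ValueError("spatial dimensions must be divisible by r")
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the r x r block next to channels
    return x.reshape(h // r, w // r, c * r * r)

def tokens_per_tile(tile_px: int = 448, patch_px: int = 14, r: int = 2) -> int:
    """Count visual tokens for one tile (patch_px=14 is an assumption)."""
    side = tile_px // patch_px                      # 32 patches per side
    features = np.zeros((side, side, 8), dtype=np.float32)  # 8 channels: arbitrary
    merged = pixel_unshuffle(features, r)           # 16 x 16 grid, 4x the channels
    return merged.shape[0] * merged.shape[1]

print(tokens_per_tile())
```

Under these assumptions a tile starts with 32 × 32 = 1024 patch tokens, and the 2×2 unshuffle leaves a quarter of them while widening each token's channel dimension by 4×.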
- Variable Visual Position Encoding (V2PE; Section 2.1, Eqs. (1)–(4))
  - Goal: Stretch multimodal context by “spending” fewer position increments on visual tokens.
  - Mechanism:
    - Sequence tokens `x1..xL` receive position indices `p1..pL` recursively (Eq. (2)). For text tokens, the index increases by 1; for visual tokens, by a fraction `δ<1` (Eq. (3)).
    - During training, `δ` is sampled per image from the set {1, 1/2, 1/4, …, 1/256} (Eq. (4)); during inference, `δ` can be chosen based on sequence length to keep positions within the context window.
  - Intuition: Visual tokens consume less “positional budget,” enabling longer sequences without losing text positional resolution.
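The recursion described above can be sketched as follows. This is a minimal illustration that assumes the first position is 0 and that each increment is determined by the modality of the current token; `v2pe_positions` and `pick_delta` are hypothetical names, and the "largest δ that fits" rule is only one possible inference-time policy, not the paper's.

```python
from fractions import Fraction

# Candidate increments for visual tokens: 1, 1/2, 1/4, ..., 1/256.
DELTA_CHOICES = [Fraction(1, 2 ** k) for k in range(9)]

def v2pe_positions(is_visual, delta):
    """Return position indices; text tokens advance by 1, visual tokens by delta."""
    positions = []
    current = Fraction(0)
    for i, visual in enumerate(is_visual):
        if i > 0:
            current += delta if visual else 1
        positions.append(current)
    return positions

def pick_delta(is_visual, max_position):
    """Hypothetical policy: the largest delta whose last position fits the window."""
    if not is_visual:
        return DELTA_CHOICES[0]
    for delta in DELTA_CHOICES:  # ordered from largest to smallest
        if v2pe_positions(is_visual, delta)[-1] <= max_position:
            return delta
    return None  # even delta = 1/256 does not fit

# 4 text tokens, one 256-token image tile, then 4 more text tokens.
sequence = [False] * 4 + [True] * 256 + [False] * 4
print(v2pe_positions(sequence, Fraction(1))[-1])     # last index with delta = 1
print(v2pe_positions(sequence, Fraction(1, 4))[-1])  # last index with delta = 1/4
print(pick_delta(sequence, max_position=100))
```

Exact fractions are used here only to keep the arithmetic easy to verify; a real implementation would work with floating-point position ids.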
- Native Multimodal Pre-Training (Section 2.2)
  - Objective: Autoregressive next-token prediction, with the loss computed only on text tokens (`L_text-only`, Eq. (6)), while visual tokens serve as conditioning context.
  - Formal: Minimize expected loss over a combined dataset of text-only and multimodal samples, updating all parameters jointly (Eq. (8)).
  - Loss scaling: Uses “square averaging” to balance gradients across short and long answers (Eq. (7)), mitigating the bias toward longer responses under token averaging and toward shorter responses under sample averaging.
  - Data mixture: Approximately 200B training tokens with a 1:3 ratio of language:multimodal tokens (50B language, 150B multimodal; Section 2.2 “Data”). Multimodal sources cover captioning, OCR, charts, math, documents, video, GUIs, tools, 3D scenes, etc.
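A rough sketch of the text-only objective with length-based re-weighting follows. It assumes each token in a sample with `l` text tokens is weighted by `1 / l**exponent` and that weights are normalized over the batch; the exact normalization in Eq. (7) is not reproduced in this summary, so treat that detail and the name `weighted_text_loss` as assumptions.

```python
import numpy as np

def weighted_text_loss(samples, exponent=0.5):
    """Batch loss over text tokens only.

    samples: iterable of (token_nll, is_text) pairs, where token_nll holds
    per-token negative log-likelihoods and is_text marks text positions.
    exponent = 0 gives token averaging, 1 gives sample averaging,
    and 0.5 gives "square averaging".
    """
    numerator = 0.0
    denominator = 0.0
    for token_nll, is_text in samples:
        nll = np.asarray(token_nll, dtype=np.float64)
        mask = np.asarray(is_text, dtype=bool)
        n_text = int(mask.sum())
        if n_text == 0:
            continue  # a sample with no text tokens contributes nothing
        weight = 1.0 / (n_text ** exponent)
        numerator += weight * float(nll[mask].sum())  # visual positions are masked out
        denominator += weight * n_text
    return numerator / denominator if denominator > 0 else 0.0

batch = [
    ([100.0, 1.0, 1.0, 1.0, 1.0], [False, True, True, True, True]),  # 1 visual + 4 text
    ([3.0], [True]),                                                  # 1 text token
]
for e in (0.0, 0.5, 1.0):
    print(e, weighted_text_loss(batch, exponent=e))
```

In this toy batch the short sample's influence grows as the exponent moves from 0 to 1, with square averaging sitting between the two extremes.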
- Post-Training: SFT and Mixed Preference Optimization (MPO; Section 2.3)
  - SFT (supervised fine-tuning): Uses high-quality, diverse instruction data; preserves InternVL2.5 practices (random JPEG compression, square loss re-weighting, multimodal packing). Expanded data for tools/3D/GUI/long-context/video/diagrams/CoT, growing the SFT corpus from 16.3M to 21.7M samples.
  - MPO: Addresses exposure bias and improves chain-of-thought by combining:
    - Preference loss (`L_p`, DPO-style, Eq. (10)) to learn chosen vs. rejected responses.
    - Quality loss (`L_q`, BCO, Eqs. (11)–(13)) to model absolute response quality with a reward shift.
    - Generation loss (`L_g`, Eq. (6)) to learn to produce preferred responses.
    - Total loss is a weighted sum `L = w_p L_p + w_q L_q + w_g L_g` (Eq. (9)). MPO uses ~300K preference pairs derived following MMPR v1.2 (Section 2.3 “Data”).
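The weighted sum above can be sketched for a single preference pair. Since Eqs. (10)–(13) are not reproduced in this summary, the sketch assumes the commonly published forms of the DPO and BCO losses; `beta`, `shift`, and the default weights of 1.0 are placeholders rather than the paper's settings, and `mpo_loss` is a hypothetical name.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def mpo_loss(logp_c, logp_r, ref_logp_c, ref_logp_r, len_c,
             beta=0.1, shift=0.0, w_p=1.0, w_q=1.0, w_g=1.0):
    """Combine preference, quality, and generation terms for one pair.

    logp_* are summed log-probabilities of the chosen (c) and rejected (r)
    responses under the policy; ref_logp_* are the same under a frozen
    reference model; len_c is the chosen response length in tokens.
    """
    reward_c = beta * (logp_c - ref_logp_c)
    reward_r = beta * (logp_r - ref_logp_r)
    l_p = -log_sigmoid(reward_c - reward_r)            # DPO-style: relative preference
    l_q = (-log_sigmoid(reward_c - shift)              # BCO-style: chosen is "good"
           - log_sigmoid(-(reward_r - shift)))         # BCO-style: rejected is "bad"
    l_g = -logp_c / len_c                              # per-token NLL of chosen response
    total = w_p * l_p + w_q * l_q + w_g * l_g
    return total, {"L_p": l_p, "L_q": l_q, "L_g": l_g}

total, parts = mpo_loss(logp_c=-10.0, logp_r=-12.0,
                        ref_logp_c=-10.0, ref_logp_r=-12.0, len_c=5)
print(total, parts)
```

When the policy equals the reference, both implicit rewards are zero, so the preference term is log 2 and the quality term is 2 log 2; the terms shrink as the policy raises the chosen response's likelihood relative to the rejected one.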
- Test-Time Scaling with VisualPRM (Section 2.4)
  - Best-of-N (BoN): Sample multiple candidate solutions and select one using an external critic.
  - VisualPRM (Visual Process Reward Model): A separate 8B model that assigns step-level correctness probabilities and averages them to score a full solution (Eq. (14)). Trained on VisualPRM400K (extended with InternVL3 rollouts).
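Best-of-N selection with a step-level critic can be sketched as below. The `step_scorer` callable is a hypothetical stand-in for VisualPRM: it is assumed to return one correctness probability per reasoning step, and the toy lookup table exists only to make the example runnable.

```python
from statistics import mean
from typing import Callable, Sequence

Solution = Sequence[str]  # one candidate solution as a list of reasoning steps

def solution_score(step_probs: Sequence[float]) -> float:
    """Average the step-level correctness probabilities into one score."""
    if not step_probs:
        raise ValueError("a solution needs at least one scored step")
    return mean(step_probs)

def best_of_n(candidates: Sequence[Solution],
              step_scorer: Callable[[Solution], Sequence[float]]) -> Solution:
    """Return the candidate whose averaged step score is highest."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: solution_score(step_scorer(c)))

# Toy critic: fixed per-step probabilities instead of a learned reward model.
toy_probs = {"read chart": 0.9, "add wrong bars": 0.1, "add right bars": 0.8}

def toy_scorer(solution: Solution) -> Sequence[float]:
    return [toy_probs[step] for step in solution]

candidates = [
    ["read chart", "add wrong bars"],
    ["read chart", "add right bars"],
]
print(best_of_n(candidates, toy_scorer))
```

The cost trade-off noted later in the document is visible here: every candidate must be both generated and scored, so compute grows roughly linearly with N.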
- Infrastructure: InternEVO extensions (Section 2.5)
  - Flexible sharding of ViT/MLP/LLM modules; supports data/tensor/sequence/pipeline parallelism and combinations; overlaps communication/computation.
  - Dynamic load balancing between visual and language modules to handle varying token proportions.
  - Supports sequences up to 32K via head-parallel + sequence-parallel. Reports 50–200% training speedup vs. InternVL2.5 at the same compute budget.
4. Key Insights and Innovations
- Native multimodal pre-training (fundamental)
  - What’s new: Jointly optimize all parameters on a mixture of text-only and multimodal data in a single pre-training stage, starting from a pre-trained vision encoder and base LLM with a randomly initialized MLP connector (Sections 2.1–2.2), with loss computed only on text tokens (Eq. (6)).
  - Why it matters: Simplifies pipelines, eliminates fragile freeze/unfreeze schedules, and aligns vision/text representations during the earliest stage. Figure 3 shows that even without post-training, native pre-training alone yields strong multimodal capability; with SFT it surpasses the classic multi-stage pipeline.
- Variable Visual Position Encoding (V2PE) for long context (fundamental)
  - What’s new: Modality-specific position increments so visual tokens consume fewer position steps (Eqs. (2)–(4)).
  - Why it matters: Extends effective multimodal context without changing model size. Table 12 shows consistent gains on standard tasks even with moderate context, suggesting that V2PE's benefit is not limited to long-context inputs.
- Mixed Preference Optimization (MPO) that blends relative and absolute supervision (incremental but impactful)
  - What’s new: Combines DPO-style pairwise preference (Eq. (10)) with BCO-style absolute quality modeling (Eqs. (11)–(13)) plus standard generation loss (Eq. (6)).
  - Why it matters: Improves reasoning stability under self-generated tokens. Table 13 shows that, across seven reasoning benchmarks, MPO boosts overall scores by +1.5 to +4.5 depending on model size.
- Test-time scaling with VisualPRM as a critic (incremental but effective)
  - What’s new: Step-level process supervision to score and select the best sampled solution (Section 2.4).
  - Why it matters: Substantial test-time boosts in math/reasoning. For instance, InternVL3-8B improves MathVista from 71.6 to 75.2 and MathVision from 29.3 to 37.5 with BoN=8 (Table 2, “w/ VisualPRM-Bo8”).
- Practical training system upgrades (incremental)
  - InternEVO enables 32K sequences and 50–200% faster training than InternVL2.5 (Section 2.5), critical for scaling to 78B parameters with long-context inputs.
5. Experimental Analysis
- Evaluation setup
  - Benchmarks span:
    - Multimodal reasoning and math: MMMU, MathVista, MathVision, MathVerse, DynaMath, WeMath, LogicVista (Table 2).
    - OCR, charts, documents: AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED-2-Plus, CharXiv, VCR (Table 3).
    - Multi-image and real-world: BLINK, Mantis-Eval, MMIU, MuirBench, MMT-Bench, MIRB; RealWorldQA, MME-RealWorld, WildVision, R-Bench (Table 4).
    - Comprehensive capability and hallucination: MME, MMBench (EN/CN), MMBench v1.1, MMVet v1/v2, MMStar; HallusionBench, MMHal, CRPE, POPE (Table 5).
    - Grounding: RefCOCO/+/g (Table 6).
    - Multilingual: MMMB, Multilingual MMBench, MTVQA across EN/ZH/PT/AR/TR/RU (Table 7).
    - Video: Video-MME, MVBench, MMBench-Video, MLVU, LongVideoBench, CG-Bench (Table 8).
    - GUI grounding: ScreenSpot and ScreenSpot-V2 (Table 9).
    - Spatial reasoning: VSI-Bench (Table 10).
    - Language-only capability: MMLU, CMMLU, C-Eval, GAOKAO, TriviaQA, NQ, RACE, WinoGrande, HellaSwag, BBH, GSM8K, MATH, TheoremQA, HumanEval, MBPP/MBPP-CN (Table 11).
  - Models: 1B to 78B variants (Table 1). Results compared to Qwen2.5-VL series and closed models like GPT‑4o, Claude 3.5, Gemini 1.5/2.0/2.5 (Figures 1–2; multiple tables).
- Main quantitative highlights
  - Overall multimodal strength
    - “InternVL3-78B achieves MMMU 72.2” (Table 2), leading open-source MLLMs and close to proprietary models (Figure 1).
    - “OCRBench 906, AI2D 89.7, ChartQA 89.7, DocVQA 95.4” (Table 3), with competitive or superior scores to Qwen2.5‑VL‑72B and GPT‑4o on several tasks.
  - Reasoning and math (Table 2)
    - “InternVL3-78B: MMMU 72.2, MathVista 79.0, MathVision 43.1, MathVerse (vision-only) 51.0.”
    - With VisualPRM-Bo8, 38B and 8B models gain +4–8 points on average (Table 2; best-of-8 rows).
  - Multi-image and real-world (Table 4)
    - “InternVL3-78B: BLINK 66.3, MMT-Bench 73.2, RealWorldQA 78.0, MME-RealWorld 65.4, WildVision 73.6, R-Bench 77.4.”
    - Exceeds GPT‑4o on RealWorldQA (78.0 vs. 75.4) and is strong on MME-RealWorld (65.4).
  - Hallucination & comprehensive (Table 5)
    - “InternVL3-78B: MME sum 2549.8; MMBench EN/CN 89.0/88.7; MMVet 81.3; HallusionBench 59.1; CRPE 79.2; POPE 90.3.”
    - Gemini 2.5 Pro can lead on some hallucination tests (e.g., HallusionBench in Figure 1).
  - Grounding (Table 6)
    - “InternVL3-78B overall 91.4,” slightly below InternVL2.5‑78B (92.3), suggesting limited new grounding-specific data.
  - Multilingual (Table 7)
    - “InternVL3-78B overall 68.9,” edging out InternVL2.5‑78B (68.0) and comparable to Qwen2‑VL‑72B (67.2).
  - Video (Table 8)
    - “InternVL3-78B: Video-MME 72.7/75.7, MVBench 78.7, MLVU 79.5, LongVideoBench 65.7, CG-Bench 48.4/65.3.” Strong scaling trend across sizes.
  - GUI and spatial reasoning (Tables 9–10)
    - “ScreenSpot-V2: InternVL3-78B 90.9,” slightly above UI‑TARS‑72B (90.3); “VSI-Bench overall: 38B 48.9; 78B 48.4,” surpassing open and closed baselines on several sub-tasks.
- Ablations and diagnostics
  - Native multimodal pre-training vs. classic pipeline (Figure 3)
    - The pre-trained-only model (no SFT) with native pre-training already reaches near full-pipeline performance; with SFT it exceeds it across many tasks, which is evidence that the unified training is effective.
  - V2PE δ sweep (Table 12)
    - Using V2PE improves many metrics even for short contexts; small δ (e.g., 1/4–1/16) often best. For fairness, the rest of the paper reports δ=1 unless noted (Section 3.14).
  - MPO impact (Table 13)
    - Improves reasoning by +1.5 to +4.5 overall points depending on scale; largest relative gains at larger scales (38B/78B), suggesting synergy with capacity.
- Are claims supported?
  - The paper triangulates: broad benchmark coverage (Tables 2–10), detailed ablations (Figure 3; Tables 12–13), and comparisons to both open- and closed-source models (Figures 1–2). The gains on MMMU, OCRBench, RealWorldQA, and process-supervised BoN solidly substantiate the core contributions.
- Notable mixed results
  - Visual grounding plateaus or dips slightly at the largest scale (Table 6), likely due to less grounding-specific data.
  - Hallucination scores are competitive but not dominant on every benchmark (Figure 1; Table 5).
6. Limitations and Trade-offs
- Data and training assumptions
  - Relies on ~200B tokens total and diverse multimodal corpora (Section 2.2 “Data”). Reproducing results requires large-scale data curation and infrastructure (Section 2.5).
  - The `L_text-only` objective (Eq. (6)) never predicts visual tokens. While simpler and effective, it may limit the model’s ability to generate dense visual sequences (e.g., pixel-level tasks) without further adaptations.
- Task coverage gaps
  - Visual grounding: performance saturates or slightly declines at the largest scale (Table 6), likely because the training expansion did not emphasize grounding data (Section 3.8 discussion).
  - Hallucinations: improvements are not uniform across all hallucination benchmarks (Table 5), leaving room for targeted robustness work.
- Computational costs
  - Largest variant uses `InternViT-6B` + `Qwen2.5-72B` (Table 1) and 32K context support with complex parallelism (Section 2.5). Inference/test-time scaling (Best-of-N with VisualPRM) increases compute at evaluation time.
- Sensitivity and tuning
  - V2PE δ selection impacts results (Table 12). Although the paper fixes δ=1 for fairness outside the ablation, practitioners will need policies to pick δ based on sequence length at inference.
- Dependency on critic at test time
  - The biggest reasoning gains rely on VisualPRM BoN (Table 2), introducing an extra model and sampling budget during evaluation.
7. Implications and Future Directions
- Field impact
  - Demonstrates that unified, native multimodal pre-training can replace multi-stage post-hoc adaptation, simplifying pipelines while preserving or improving language skill (Figure 3; Table 11). This can shift the community toward single-stage, mixed-data pre-training for MLLMs.
  - V2PE offers a practical recipe to extend multimodal context without enlarging the model, likely influencing future positional encoding designs for MLLMs (Section 2.1; Table 12).
- Follow-up research
  - Data-targeted improvements: Add grounding-specific corpora and richer hallucination-robust data; probe how the `L_text-only` objective interacts with tasks that benefit from visual-token prediction.
  - Position strategies: Learn δ or schedule it adaptively per input instead of sampling from a fixed set; explore cross-modal relative position schemes.
  - Process supervision: Expand VisualPRM beyond math/science into multi-image reasoning and long video; investigate training-time integration (e.g., reinforcement learning with process rewards).
  - Efficiency: Combine BoN with smarter sampling (e.g., conditional early stopping, diversity-promoting decoding) to reduce test-time cost.
- Applications
  - Strong results suggest readiness for:
    - Document and chart analysis (Table 3: DocVQA 95.4, ChartQA 89.7).
    - Enterprise OCR and form understanding (OCRBench 906).
    - GUI agents and UI automation (Table 9: up to 90.9 on ScreenSpot-V2).
    - Spatial reasoning for robotics/autonomy (Table 10: top object counting/relative distance/appearance order).
    - Long video understanding for surveillance, sports, or education (Table 8: strong scaling on MLVU/LongVideoBench).
  - Public release of code, data, and weights (front page; Figure 1 footers) lowers barriers for practitioners and researchers to build and evaluate next-generation open-source MLLMs.
Headline result: “InternVL3‑78B reaches MMMU 72.2, OCRBench 906, AI2D 89.7, ChartQA 89.7, DocVQA 95.4” (Figure 1; Tables 2–3), with compelling language-only competence (Table 11) and broad real-world robustness (Tables 4–5), achieved via a unified training paradigm plus V2PE, MPO, and test-time process supervision.