Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale¶

Pitch¶

This paper introduces Intern-S1-Pro, the first trillion-parameter scientific multimodal foundation model, designed to bridge the gap between general AI capabilities and deep scientific expertise. By scaling to this unprecedented size, the model masters over 100 specialized tasks across chemistry, materials, life sciences, and earth sciences, while maintaining top-tier reasoning and agent capabilities—proving that a sufficiently large generalist can outperform specialized models even on niche scientific benchmarks.

1. Executive Summary¶

This paper introduces Intern-S1-Pro, the first trillion-parameter scientific multimodal foundation model that integrates general reasoning capabilities with deep expertise across over 100 specialized scientific tasks spanning chemistry, materials science, life sciences, and earth sciences. The core innovations include a Grouped Routing mechanism that achieves absolute load balancing across devices in Mixture-of-Experts training, and a comprehensive engineering framework enabling stable FP8 mixed-precision reinforcement learning at unprecedented scale. Results demonstrate that this model achieves 55.5 accuracy on SciReasoner (versus 14.7 for Gemini-3-Pro and 13.6 for GPT-5.2), and crucially shows that a sufficiently large generalist model with joint training can outperform specialized models—even when trained on identical scientific data—validating the "Specializable Generalist" paradigm.

2. Context and Motivation¶

The Fundamental Challenge: Scientific Domains Require Massive Model Capacity¶

The paper addresses a core tension in foundation model development: scientific domains exhibit far greater diversity than natural language, encompassing specialized fields like chemistry, biology, physics, and earth sciences, each with its own "language" of domain-specific notations, knowledge structures, and reasoning patterns. The authors ground this in prior work from multilingual machine translation, where a single model translating hundreds of language pairs required 90× more parameters than a bilingual model (citing NLLB Team, 2022). The implication is that a scientific foundation model needs sufficient capacity to master diverse scientific tasks while retaining general text and vision capabilities.

Why Existing Approaches Fall Short¶

Prior to this work, a common belief dominated: specialized models are superior for niche tasks. The prevailing approach was to train smaller, domain-specific models on curated scientific data. However, this paper challenges that assumption with empirical evidence: a sufficiently large generalist model, when trained jointly on general and scientific data, can achieve superior performance on specialized tasks compared to smaller specialized models trained on the same data.

The authors identify several technical gaps that prevented scaling to the trillion-parameter regime:

Training instability from expert load imbalance. Mixture-of-Experts (MoE) models route each token to a small subset of experts. Traditional Top-K routing creates uneven expert utilization, which becomes catastrophic at scale—causing memory spikes and Out-of-Memory (OOM) errors during Expert Parallelism training. The conventional solution was to adopt more robust but slower parallelism strategies, trading efficiency for stability.

Router optimization difficulties. As expert count increases, router embeddings—the parameters that decide which experts process which tokens—require efficient learning to handle the expanded expert pool. Standard MoE training updates router parameters only for selected experts, leaving most router embeddings stale.

Training-inference distribution mismatch in RL. Prior work (cited as [48]) identified that discrepancies between training and inference engines cause reinforcement learning instability. Approaches like IcePop employed importance sampling to mask tokens with large distribution shifts, but scaling FP8 quantization to trillion-parameter models with extreme sparsity required new solutions.

Scientific data-general data conflicts. Scientific data exhibits high logical determinism and structured features, while general data emphasizes semantic depth and linguistic diversity. Direct mixing causes "distribution shift" and "negative transfer," leading to logical confusion during inference.

Poor-quality scientific image-text pairs. Existing open-source caption datasets derived from alt-text or webpage context exhibit limited image-text alignment. Scientific images from PDF sources often have brief, misaligned captions that serve as extensions of figures rather than descriptive text.

How This Paper Positions Itself¶

The paper introduces the SAGE framework (Synergistic Architecture for Generalizable Experts), comprising three layers: Foundation (base capabilities), Fusion (integration of modalities and domains), and Evolution (RL-based refinement). Within this framework, Intern-S1-Pro represents a "Specializable Generalist"—a model that demonstrates top-tier performance on general capabilities while outperforming proprietary models on specialized scientific tasks.

The paper explicitly positions itself against the specialized model paradigm through a controlled case study (Section 5.5): comparing Intern-S1-Pro against Biology-Instruction (a specialized model) on identical training data, demonstrating that larger model scale plus joint training yields substantial performance gains.

3. Technical Approach¶

3.1 Reader Orientation¶

Intern-S1-Pro is a trillion-parameter Mixture-of-Experts multimodal foundation model designed to handle text, images, and scientific time-series data across both general reasoning tasks and specialized scientific domains. The solution involves architectural innovations for stable large-scale MoE training, a dedicated data pipeline for scientific image-text alignment, and a mixed-precision reinforcement learning framework that maintains numerical fidelity between training and inference engines.

3.2 Big-Picture Architecture¶

The system comprises five major components:

MoE Backbone with Grouped Routing — A trillion-parameter language model using Mixture-of-Experts architecture, where expert expansion from Intern-S1 and a novel Grouped Routing mechanism ensure both capacity scaling and training stability.
Multimodal Encoders — A Native Vision Transformer (ViT) for images, a Fourier Position Encoding (FoPE) module for enhanced position modeling, and a Time-series Encoder with adaptive subsampling for temporal scientific data.
Scientific Caption Pipeline — A PDF extraction and captioning system that generates high-quality, dense image-text pairs from scientific literature.
Data Integration Framework — Strategies for resolving conflicts between scientific and general data, including Structured Scientific Data Transformation, Scientific Data Diversification, and System Prompt Isolation.
Stable Mixed-Precision RL Framework — A reinforcement learning training system combining FP8 quantization, rollout router replay, operator-level precision alignment, and dual importance sampling for stable policy optimization.

3.3 Roadmap for the Deep Dive¶

First, the MoE architecture with Grouped Routing and Straight-Through Estimator—these enable stable trillion-parameter scaling.
Second, the multimodal encoders (ViT, FoPE, Time-series)—these provide native support for scientific data modalities.
Third, the scientific caption pipeline and data integration strategies—these ensure high-quality training data alignment.
Fourth, the stable mixed-precision RL framework—this enables efficient post-training at scale.
Fifth, the evaluation methodology and results—these validate the "Specializable Generalist" hypothesis.

3.4 Detailed Technical Breakdown¶

This is primarily a system paper presenting a large-scale multimodal foundation model with architectural innovations enabling trillion-parameter training, novel multimodal encoding capabilities, and comprehensive scientific task coverage.

MoE Architecture: Expert Expansion and Grouped Routing¶

Expert expansion from Intern-S1. Intern-S1-Pro is derived from Intern-S1 through expert expansion. The process involves replicating and distributing well-trained experts from the parent model. A critical design choice is the Grouped Routing initialization: experts are distributed into groups such that each group contains experts corresponding to the Top-1 or Top-2 experts from the pre-expansion model.

The authors tested two initialization strategies on a 30B model over 2000 training steps: - Method 1 (Grouped Routing): Assign well-trained Top-1/Top-2 experts across groups. Result: slight improvement over pre-expansion model. - Method 2 (Differentiated Assignment): Assign Top-1 to Top-8 experts across groups corresponding to pre-expansion rankings. Result: over 20 points performance drop.

The hypothesis: experts frequently activated as Top-1 are well-trained and important. Each group needs well-trained experts for stable initialization. After initialization, experts naturally differentiate during training despite initial homogenization.

Grouped Routing mechanism (Section 2.1). Let the total number of experts in an MoE layer be $E$ and the expert parallelism degree be $S$. In Grouped Routing, all experts are uniformly partitioned into $G$ mutually disjoint groups: $\{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_G\}$, with each group containing $E/G$ experts.

For each group $g$, only the top-$(K/G)$ experts with highest scores are selected. The final activated expert set is the union of intra-group top experts. With Intern-S1-Pro's configuration of $K=8$ and EP8 training strategy:

\[G = 8 \text{ groups}$$ $$\text{Top-}1 \text{ expert selected per group}$$ $$\text{Result: Absolute load balancing across 8 devices}\]

This configuration fundamentally eliminates the OOM risk during training and significantly improves training efficiency by ensuring each device processes exactly one expert per token.

Straight-Through Estimator for Router Optimization¶

The router gradient problem. Standard MoE routing computes logits $z = W_r x$, applies softmax to get routing probabilities $p = \text{softmax}(z)$, and selects Top-K experts. The layer output is:

\[y = \sum_{i \in S} \tilde{p}_i \cdot E_i(x)\]

where $\tilde{p}_i = p_i / \sum_{j \in S} p_j$ is the normalized routing weight for expert $E_i$.

The problem: gradients only flow through selected experts, leaving most router embeddings stale as the expert pool expands.

Straight-Through Estimator (STE) solution (Section 2.2). The STE decouples forward and backward passes:

Forward pass: Standard sparse Top-K selection is preserved exactly.
Backward pass: Gradients flow through the full softmax distribution without re-normalization.

The STE routing weight is:

\[\hat{p}^{\text{STE}}_i = \text{sg}(\tilde{p}_i) + (p^\tau_i - \text{sg}(p^\tau_i))\]

where $p^\tau_i = \text{softmax}(z/\tau)_i$ is a temperature-scaled routing probability, and $\text{sg}(\cdot)$ is the stop-gradient operator.

In the forward pass, $\hat{p}^{\text{STE}}_i$ reduces exactly to $\tilde{p}_i$ (the standard sparse routing weight). In the backward pass, the gradient with respect to logit $z_j$ is:

\[\frac{\partial \mathcal{L}}{\partial z_j} = \sum_{i \in S} \frac{\partial \mathcal{L}}{\partial \hat{p}^{\text{STE}}_i} \cdot \frac{\partial p^\tau_i}{\partial z_j}\]

This ensures the router receives consistent data-driven feedback throughout training, accelerating convergence for router embeddings in the expanded expert pool.

Multimodal Encoders¶

Native Vision Transformer (Section 2.3). Intern-S1-Pro uses a Native ViT that processes images at native resolution. Unlike fixed-size image encoders, the visual token count depends on the original input resolution. This preserves fine-grained spatial information in high-resolution scientific images. Visual tokens pass through an MLP projector that maps features into the language model's embedding space.

Training data includes approximately 300 million image-text pairs from: - English caption datasets: CC12M, LAION-COCO, SBU Caption - Chinese caption datasets: LAION-2B-Multi, Wukong

Fourier Position Encoding (FoPE) (Section 2.4). The paper identifies a fundamental mismatch: language models process all modalities as discrete tokens (particle-like representation), but physical signals (light, sound, electromagnetic waves) exhibit wave-like, continuous properties. Traditional position encodings (sinusoidal, RoPE) inject sequential order but fail to model spectral and wave-interference patterns.

FoPE addresses this by modeling each dimension as a Fourier series of different frequency components. Key innovations:

Separates frequency components: Unlike RoPE which treats each dimension as a single-frequency function, FoPE treats each dimension as a multi-frequency function, separating information more effectively.
Clips undertrained frequencies: During pre-training, certain frequency components are inadequately trained. FoPE identifies and zeros out these components to prevent spectral damage during length extrapolation.
Preserves periodic extension: By filtering harmful frequency components, FoPE maintains better generalization to longer sequences.

Time-series Encoder (Section 2.5). Time series data presents challenges: extreme variability in rate, length, value, and dimensionality. Direct serialization into text tokens causes information loss.

The time-series module features:

Adaptive Subsampling Module: Partitions continuous signals into local segments (patches) where patch size and stride are dynamically determined based on signal characteristics and sampling rate. This normalizes heterogeneous time series into a uniform representation space.
Hierarchical Encoding: Captures local dynamics within each patch via CNNs, then models long-range dependencies across segments using Transformer encoders.
Domain Coverage: Supports sequences from $10^2$ to $10^6$ time steps. Domains include astronomy, geoscience, neuroscience, physiological signal analysis, and bioacoustics.

Scientific Caption Pipeline¶

The caption quality problem (Section 3.1). Scientific images from PDFs have captions that are brief extensions of figures, not descriptive text. The paper contrasts:

Natural caption (typical scientific literature): Often < 100 words, e.g., "This image is a scientific plot labeled '(b)' in the top left corner..."
Desired dense caption (generated by pipeline): Average ~1000 words, with explicit visual element references: "Main Plot: Axes: The y-axis is labeled 'Spectral Intensity (a.u.)' with a scale from 0 to 1 in increments of 0.2. Legend: Real: Solid blue line, NPRS: Dashed purple line..."

Pipeline workflow (Figure 7).

PDF Extraction: Use MinerU 2.5 for layout analysis and structural recognition, detecting figures, formulas, and tables. Crop into standardized sub-image samples.
Content Deduplication: Apply perceptual hashing (pHash) to eliminate redundant visual content.
Topic Classification and Model Routing:
Scientific sub-images: Processed by InternVL3.5-241B for domain-specific professional captions.
Non-scientific sub-images: Processed by CapRL-32B (a reinforcement-learning-trained captioning model based on Qwen 2.5 VL 32B).
Quality Filtering: Use a 0.5B-parameter text quality discriminator to filter garbled text, repetitive expressions, and low-information-density content.
Output: Approximately 270 billion tokens of high-quality scientific image-text caption data across life sciences, chemistry, earth sciences, and materials science.

Data Integration: Resolving Scientific-General Data Conflicts¶

Three-strategy framework (Section 3.2).

Strategy 1: Structured Scientific Data Transformation. Scientific data from sources like PubChem is highly structured (tabular, formulas). Two methods:

Template Construction: Convert heterogeneous input-output pairs into grammatically correct, narrative text matching general data representation style.
Task Form Transformation: Map abstract outputs (lists, matrices) to descriptive answers with scientific meaning using domain-specific priors.

Strategy 2: Scientific Data Diversification.

Prompt Diversification: For repeated scientific concepts (e.g., similar protein sequences), provide dozens of varied instructions to prevent overfitting and expand generalization boundaries.
Rollout Mechanism: Transform simple outputs (numerical values, conclusions) into complete reasoning chains by leveraging strong base models. This converts knowledge recall into logical deduction, enhancing zero-shot reasoning.

Strategy 3: System Prompt Isolation. Inject mutually exclusive system-level prefixes for scientific and general data during training, creating independent contextual processing environments. This reduces data conflicts and improves stability.

Stable Mixed-Precision Reinforcement Learning¶

The training-inference discrepancy problem (Section 4.1). Prior work identified this as a primary source of RL training instability. At trillion-parameter scale with extreme sparsity, FP8 quantization requires careful handling.

Four-component stabilization framework.

Component 1: Operator-level precision alignment. The authors performed systematic comparison between LMDeploy (rollout engine) and XTuner (training engine). Identified numerically sensitive components: RMSNorm, router softmax, positional embedding application. Minimized precision gaps in these kernels to ensure rollout distribution is faithfully reflected during training.

Component 2: Rollout router replay. For each token, record selected expert indices per layer during rollout. Replay identical routing decisions during policy updates. To avoid bandwidth/latency bottlenecks, routing traces are transmitted via Ray object references rather than the HTTP channel used for response tokens.

Component 3: Targeted mixed-precision scheme.

Expert MLP layers: Quantized to FP8 (largest memory footprint, GEMM operations tolerant to reduced precision).
Non-expert components: Keep in BF16.
Language modeling head: Use FP32 (small errors in log-probability estimation can be amplified by policy-gradient updates).

This preserves memory and throughput benefits while avoiding degradation in sensitive computation graph regions.

Component 4: Dual importance sampling. The modified REINFORCE loss:

\[\mathcal{L}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}, \{y_i\}_{i=1}^G \sim \pi^{\text{rollout}}_\theta(\cdot|x)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \text{sg}(\mathcal{M}(\rho_{i,t}; \alpha, \beta) \cdot r_{i,t}) \cdot \hat{A}_{i,t} \cdot \log \pi_\theta(y_{i,t}|x, y_{i,<t}) \right]\]

Two importance sampling ratios:

$\rho_{i,t} = \frac{\pi^{\text{train}}_\theta(y_{i,t}|x,y_{i,<t})}{\pi^{\text{rollout}}_\theta(y_{i,t}|x,y_{i,<t})}$: Calibrates for training-inference distribution mismatch.
$r_{i,t} = \frac{\pi^{\text{new}}_\theta(y_{i,t}|x,y_{i,<t})}{\pi^{\text{old}}_\theta(y_{i,t}|x,y_{i,<t})}$: Corrects for off-policy bias from mini-batch updates.

The masking function:

\[\mathcal{M}(\rho_{i,t}; \alpha, \beta) = \begin{cases} \rho_{i,t}, & \alpha < \rho_{i,t} < \beta \\ 0, & \text{otherwise} \end{cases}\]

suppresses tokens with excessively large training-rollout discrepancy.

Advantage estimation. Leave-one-out (LOO) baseline:

\[\hat{A}_{i,t} = R_i - b_i, \quad b_i = \frac{1}{G-1} \sum_{j \neq i} R_j\]

where $R_i$ is the sequence-level reward of sample $y_i$.

Validation (Figure 8). Comparison on a 30B MoE model shows FP8 mixed-precision RL closely matches BF16 training in both validation accuracy and KL divergence between train and rollout engines.

Training Configuration Summary¶

Pre-training: - Total tokens: 6T (image-text and text data) - Model parameters: ~1 trillion (MoE with 4× expert count of Intern-S1) - Expert parallelism: EP8 - Top-K routing: K=8 - Groups: G=8 (Top-1 per group)

RL training: - Precision: FP8 for expert MLP layers, BF16 for non-expert components, FP32 for LM head - Importance sampling thresholds: $\alpha$, $\beta$ (specific values not provided) - Samples per prompt: $G$ responses (specific value not provided)

4. Key Insights and Innovations¶

Innovation 1: Grouped Routing for Absolute Load Balancing¶

The Grouped Routing mechanism represents a fundamental architectural innovation for large-scale MoE training. Unlike traditional Top-K routing that creates cross-device load imbalance, Grouped Routing partitions experts into device-aligned groups and selects top-$(K/G)$ experts within each group. With $K=8$ and $G=8$, selecting Top-1 expert per group achieves absolute load balancing—each of the 8 devices processes exactly one expert per token.

This is significant because it: - Eliminates the OOM risk that plagued large-scale MoE training - Preserves training efficiency without resorting to slower, more robust parallelism strategies - Enables stable training at trillion-parameter scale

The design is non-obvious: the paper's ablation shows that simply distributing pre-expansion Top-1 through Top-8 experts across groups causes over 20 points performance drop. The key insight is that well-trained Top-1/Top-2 experts must be distributed across all groups to maintain initialization stability.

Innovation 2: Straight-Through Estimator for Dense Router Gradient Flow¶

The application of STE to MoE router training addresses a subtle but critical scaling problem. As expert count increases, standard Top-K routing only updates router embeddings for selected experts, leaving most parameters stale. The STE decouples forward and backward passes:

Forward: Use sparse Top-K selection exactly (no approximation)
Backward: Allow gradients to flow through full softmax distribution

This is a mathematically elegant solution that requires no architectural changes but enables efficient router embedding optimization for the expanded expert pool. The technique builds on prior work (cited as [4, 15, 22, 23, 47]) but applies it specifically to the trillion-parameter scaling regime.

Innovation 3: Specializable Generalist Paradigm¶

The most conceptually significant contribution is empirical validation that a large generalist model with joint training can outperform specialized models. The case study in Section 5.5 compares Intern-S1-Pro against Biology-Instruction on identical training data (only text fluency was upgraded for Intern-S1-Pro; core biological information remained the same).

Results: - Protein-Fluorescence: Intern-S1-Pro 78.14 vs. Biology-Instruction 2.57 - Protein-FunctionEC: Intern-S1-Pro 72.70 vs. Biology-Instruction 19.79 - Average score: Intern-S1-Pro 52.45 vs. Biology-Instruction 39.24

This contradicts the common belief that specialized models are superior for niche tasks. The insight: larger model scale provides stronger general reasoning capabilities that enable more effective extraction and utilization of specialized knowledge from the same data.

Innovation 4: Comprehensive Stabilization for FP8 RL Training¶

The four-component stabilization framework (operator-level precision alignment, rollout router replay, targeted mixed-precision, dual importance sampling) represents a systematic engineering solution to RL training instability at scale. While each component builds on prior work, their integration specifically addresses the challenges of trillion-parameter sparse MoE models:

Router replay ensures expert selection consistency between rollout and training
Mixed-precision scheme balances memory efficiency with numerical fidelity
Dual importance sampling calibrates both train-inference mismatch and off-policy bias

Figure 8 shows FP8 training achieves nearly identical validation accuracy and KL divergence curves to BF16, validating the framework's effectiveness.

Innovation 5: Scientific Data Integration via System Prompt Isolation¶

The System Prompt Isolation strategy is a simple but effective technique for managing distribution shift between scientific and general data. By injecting mutually exclusive system-level prefixes, the model creates independent contextual processing environments for each data type during training. This represents a practical alternative to complex multi-domain training procedures, reducing interference between data types without requiring architectural changes.

5. Experimental Analysis¶

Evaluation Methodology¶

Evaluation toolkits. Three frameworks are used: - OpenCompass: General evaluation platform - VLMEvalKit: Multimodal evaluation - AgentCompass: Agent evaluation (developed by the authors, to be released)

Evaluation configurations. Two modes are defined (Table 1):

Parameter	Thinking	Non-Thinking
max tokens	65536	32768
temperature	0.8	0
top_p	0.95	1.0
top_k	50	1

The "thinking" configuration enables sampling for complex reasoning tasks; "non-thinking" uses greedy decoding.

Benchmarks. The evaluation covers:

Scientific benchmarks (11 total): - SciReasoner: 10 disciplines, 149 tasks, scientific reasoning (non-thinking) - SFE: 830 VQA pairs across 66 multimodal tasks, 5 disciplines (thinking) - SmolInstruct: 14 chemistry tasks, 3M+ samples (non-thinking) - MatBench: 13 materials property prediction tasks (non-thinking) - Mol-Instructions: Molecule/protein/biomolecular tasks (non-thinking) - MicroVQA: 1,042 microscopy questions, biological workflows (non-thinking) - Biology-Instruction: Multi-omics sequence understanding (non-thinking) - XLRS-Bench: Ultra-high-resolution remote sensing, 16 sub-tasks (thinking) - MSEarth-MCQ: Earth science multimodal (non-thinking) - SciTS: Time series understanding (subset reported)

General benchmarks (13 total): - MMMU-Pro: Multidisciplinary multimodal tasks (thinking) - MMLU-Pro: Enhanced MMLU with more choices (thinking) - AIME-2025: 30 olympiad-level math problems (thinking) - IMO-Answer-Bench: 400 Olympiad problems with verifiable answers (thinking) - RefCOCO: Referring expression comprehension (non-thinking) - IFBench: Instruction following with 58 constraints (thinking) - OCRBench V2: Visual text localization (non-thinking) - SArena (Icon): SVG generation (thinking) - LCB V6: Code generation (thinking) - GAIA: Real-world agent tasks with tool use (thinking) - $\tau$2-Bench: Conversational agents (thinking) - ScreenSpot V2: GUI grounding (non-thinking)

Baselines. Table 2 compares against: - Qwen3-VL-235B-Thinking (open-source, 235B-A22B) - Kimi-K2.5 (1T-A32B) - GPT-5.2 (proprietary) - Gemini-3-Pro (proprietary)

Note: The paper uses notation like "1T-A22B" indicating model size (1T parameters) and activation size (22B activated parameters), but exact definitions are not provided.

Main Quantitative Results¶

Scientific tasks (Table 2).

Intern-S1-Pro achieves:

Benchmark	Intern-S1-Pro	Best Proprietary	Margin
SciReasoner	55.5	14.7 (Gemini-3-Pro)	+40.8 pts
SFE	52.7	58.9 (Gemini-3-Pro)	-6.2 pts
SmolInstruct	74.8	58.3 (Gemini-3-Pro)	+16.5 pts
MatBench	72.8	64.9 (Gemini-3-Pro)	+7.9 pts
Mol-Instructions	48.8	34.6 (Gemini-3-Pro)	+14.2 pts
MicroVQA	63.3	69.0 (Gemini-3-Pro)	-5.7 pts
Biology-Instruction	52.5	12.0 (Gemini-3-Pro)	+40.5 pts
XLRS-Bench	52.8	51.8 (Gemini-3-Pro)	+1.0 pts
MSEarth-MCQ	65.2	65.8 (Gemini-3-Pro)	-0.6 pts

Key observations: - Intern-S1-Pro significantly outperforms proprietary models on 6 of 9 scientific benchmarks - Largest margins on SciReasoner (+40.8), Biology-Instruction (+40.5), SmolInstruct (+16.5) - Competitive performance on remaining benchmarks (within 6 points)

General tasks (Table 2).

Benchmark	Intern-S1-Pro	Best Model	Notes
MMMU-Pro	72.8	81.0 (Gemini-3-Pro)	-8.2 pts
MMLU-Pro	86.6	89.3 (Gemini-3-Pro)	-2.7 pts
AIME-2025	93.1	100.0 (GPT-5.2)	Perfect score by GPT-5.2
IMO-Answer-Bench	77.3	86.3 (GPT-5.2)	-9.0 pts
RefCOCO-avg	91.9	87.8 (Kimi-K2.5)	Best in class
IFBench	71.2	75.4 (GPT-5.2)	-4.2 pts
OCRBench V2 (ENG/CHN)	60.1/60.6	66.8/63.8 (Qwen3-VL)	Mixed
SArena (Icon)	83.5	82.6 (Gemini-3-Pro)	Competitive
LCB V6	74.3	87.7 (GPT-5.2)	-13.4 pts
GAIA (Text-Only)	77.4	79.9 (Kimi-K2.5)	Competitive
$\tau$2-Bench	80.9	85.4 (Gemini-3-Pro)	-4.5 pts
ScreenSpot V2	93.6	94.7 (Gemini-3-Pro)	-1.1 pts

Key observations: - Intern-S1-Pro is competitive with top-tier models but generally trails the best proprietary models on general tasks - Best-in-class on RefCOCO (visual grounding): 91.9 - Competitive on agent benchmarks: GAIA 77.4, $\tau$2-Bench 80.9, ScreenSpot V2 93.6

Improvement over Intern-S1. The paper reports: - AIME-2025: 86.0 → 93.1 (+7.1 pts) - MMLU-Pro: 83.5 → 86.6 (+3.1 pts) - Expanded coverage to new scientific benchmarks (SciReasoner, Mol-Instructions, Biology-Instruction) - New agent capabilities: GAIA 77.4, $\tau$2-Bench 80.9, ScreenSpot V2 93.6

Time Series Results (Table 3)¶

Results on a subset of SciTS benchmark (F1 scores):

Task ID	Intern-S1-Pro	Best Text LLM	Best VL LLM
ASU01	98.0	67.2 (GPT-4.1-mini)	65.7 (GPT-5-mini)
ASU03	75.9	16.3 (Gemini2.5-Flash)	18.9 (GPT-5-mini)
BIU01	20.8	1.5 (Gemini2.5-Flash)	0.8 (GPT-5-mini)
BIU03	88.3	12.7 (GPT-4.1-mini)	17.9 (GPT-5-mini)
EAU01	99.5	67.6 (Gemini2.5-Flash)	72.5 (Gemini2.5-Flash)
MEU01	65.6	60.9 (Gemini2.5-Flash)	64.1 (Gemini2.5-Flash)
NEU06	71.3	16.1 (GPT-4.1-mini)	13.3 (GPT-5-mini)
PHU01	36.8	28.9 (DeepSeek-V3)	22.7 (Gemini2.5-Flash)
PHU04	93.2	64.8 (Gemini2.5-Flash)	59.0 (Gemini2.5-Flash)

The dedicated time series module provides substantial advantages: EAU01 reaches 99.5 (vs. 67.6 best baseline), BIU03 reaches 88.3 (vs. 17.9 best baseline). This validates the effectiveness of native time series encoding versus text or image conversion approaches.

Biology Case Study: Specializable Generalist Validation (Table 4)¶

This is the paper's most compelling controlled experiment. Intern-S1-Pro compared against Biology-Instruction on identical training data (only text fluency upgraded for Intern-S1-Pro):

Task	Biology-Instruction	Intern-S1-Pro	Improvement
DNA-cpd	44.54	54.60	+10.06
DNA-emp	8.10	14.02	+5.92
DNA-pd	58.18	82.65	+24.47
DNA-tf-h	24.45	54.11	+29.66
DNA-tf-m	39.91	60.80	+20.89
Multi_sequence-antibody_antigen	10.26	44.76	+34.50
Protein-Fluorescence	2.57	78.14	+75.57
Protein-FunctionEC	19.79	72.70	+52.91
RNA-Isoform	59.01	82.95	+23.94
AVG score	39.24	52.45	+13.21

Two tasks show negative results: - Multi_sequence-promoter_enhancer_interaction: 4.77 → -1.30 - RNA-NoncodingRNAFamily: 63.09 → 34.50

The paper notes these results "strongly suggest that the integration of general and specialized capabilities is not merely an aggregation of functions, but a synergistic process that fundamentally promotes the intelligence and problem-solving capacity of the model in professional domains."

Assessment: Do the Experiments Support the Claims?¶

Claim 1: Intern-S1-Pro achieves state-of-the-art on scientific tasks.

Supported. Table 2 shows Intern-S1-Pro outperforms proprietary models (GPT-5.2, Gemini-3-Pro) on 6 of 9 scientific benchmarks, with particularly large margins on SciReasoner (+40.8 pts), Biology-Instruction (+40.5 pts), and SmolInstruct (+16.5 pts). These are substantial, practically significant differences.

Claim 2: The Specializable Generalist paradigm outperforms specialized models.

Strongly supported. The Biology case study (Table 4) provides controlled evidence: same training data, different model scale + joint training, +13.21 average improvement. The improvement on Protein-Fluorescence (+75.57) and Protein-FunctionEC (+52.91) are dramatic. However, two tasks show degradation, indicating the paradigm isn't universally superior.

Claim 3: Grouped Routing enables stable trillion-parameter training.

Supported indirectly. The paper describes ablation experiments showing the alternative initialization causes >20 pt performance drop. However, direct training stability metrics (loss curves, gradient norms) for trillion-parameter models are not provided. The FP8 RL comparison (Figure 8) uses a 30B model, not the full trillion-parameter system.

Claim 4: The time series module provides native advantages.

Strongly supported. Table 3 shows massive improvements over text LLMs and VL LLMs on SciTS benchmark, with several tasks showing >50 pt improvements (BIU03: 88.3 vs. 17.9; ASU01: 98.0 vs. 67.2).

Claim 5: FP8 mixed-precision RL matches BF16 training.

Supported for 30B scale. Figure 8 shows nearly identical validation accuracy and KL divergence curves. However, results are not provided for trillion-parameter scale.

Limitations and Missing Experiments¶

No ablation on Grouped Routing at trillion-parameter scale. The 20 pt drop experiment uses a 30B model, not the full Intern-S1-Pro.
No comparison with other trillion-parameter MoE models. Baselines include proprietary models of unknown architecture. Comparison with other open trillion-parameter MoE models (e.g., Grok-1, DBRX) would strengthen positioning.
No detail on expert count or activation sparsity. The paper mentions "4× the expert count of Intern-S1" but doesn't specify exact numbers, making efficiency comparisons difficult.
Hyperparameters not fully disclosed. Importance sampling thresholds ($\alpha$, $\beta$), number of samples $G$, temperature $\tau$, and other RL hyperparameters are not specified.
Limited negative results discussion. Two biology tasks show negative performance (Table 4), and SFE/MicroVQA/MSEarth-MCQ trail Gemini-3-Pro. The paper doesn't analyze why.
No inference efficiency metrics. Training efficiency is discussed (~20% reduction from Intern-S1 despite 4× scale), but inference throughput, latency, and cost comparisons are absent.
No contamination analysis. The paper uses benchmarks like AIME-2025 and LCB V6 which may have temporal overlap with training data cutoffs.

6. Limitations and Trade-offs¶

Assumption: Larger Scale Always Enables Better Scientific Knowledge Utilization¶

The paper's central claim—that a trillion-parameter generalist model can outperform specialized models—rests on the assumption that model scale provides reasoning capabilities that enable more effective extraction of specialized knowledge. While the Biology case study (Table 4) supports this for most tasks, two notable failures undermine the universality of this assumption:

Multi_sequence-promoter_enhancer_interaction: Biology-Instruction achieves 4.77, Intern-S1-Pro scores -1.30 (negative score indicates worse than random)
RNA-NoncodingRNAFamily: Biology-Instruction achieves 63.09, Intern-S1-Pro drops to 34.50

The paper provides no analysis of why these specific tasks degrade. This is a significant gap: understanding the failure modes of the "Specializable Generalist" paradigm is essential for practitioners deciding whether to adopt this approach. Possible explanations include task-specific architectures in Biology-Instruction that don't transfer to larger models, or distribution shifts that the joint training doesn't adequately address. Without investigation, these remain open questions.

Computational Constraints and Missing Efficiency Details¶

Expert count and sparsity are not disclosed. The paper states Intern-S1-Pro has "4× the expert count of Intern-S1" but never specifies the exact number of experts, the expert hidden dimension, or the true activation ratio (expressed as "1T-A22B" in tables, suggesting 22B activated parameters per forward pass). Without these details:

Training and inference cost comparisons to other trillion-parameter models are impossible
The efficiency claims ("~20% reduction in training efficiency" despite 4× scale) cannot be independently verified
Practitioners cannot estimate hardware requirements for deployment

No inference metrics provided. The paper focuses on training efficiency but omits: - Inference throughput (tokens/second) - Latency measurements - Memory footprint during inference - Cost-per-query estimates

For practical deployment, these metrics are often more critical than training efficiency. A model that is cheaper to train but expensive to serve may not be viable for production systems.

Scaling Validated Only at 30B Scale¶

The ablation studies and stabilization experiments are conducted on smaller models:

Grouped Routing ablation (Section 2): Tested on a 30B model over 2000 steps, showing >20 pt performance drop for the alternative initialization
FP8 RL validation (Figure 8): Comparing FP8 to BF16 on a 30B MoE model

While these experiments provide evidence for the techniques' effectiveness, they do not guarantee the same behavior at trillion-parameter scale. The paper lacks direct evidence that: - Grouped Routing prevents OOM errors in trillion-parameter training - The FP8 mixed-precision framework maintains numerical stability at scale - Router replay scales efficiently with expert parallelism degree > 8

This is a common limitation in large-scale system papers, but it remains a gap between what is validated and what is claimed.

Data Requirements and Reproducibility¶

The caption pipeline requires massive infrastructure. Section 3.1 describes generating 270B tokens of scientific image-text data using: - MinerU 2.5 for PDF extraction - InternVL3.5-241B for scientific captioning - CapRL-32B for general captioning - A 0.5B quality discriminator for filtering

This pipeline is not reproducible for most research groups. The paper does not provide: - Access to the generated dataset - Detailed hyperparameters for the captioning models - Computational cost of the caption pipeline itself

Training data composition is opaque. The 6T token training corpus is described only at a high level: - 270B tokens from scientific image-text caption pipeline - Remaining composition (general text, code, etc.) not specified - Data mixing ratios not disclosed - Tokenization details not provided

This limits reproducibility and makes it difficult to separate the contributions of model architecture from training data quality.

Unaddressed Scenarios and Edge Cases¶

Non-English scientific literature. The vision encoder training data includes Chinese caption datasets (LAION-2B-Multi, Wukong), and OCRBench V2 includes Chinese evaluation. However, the scientific benchmarks are predominantly English. The paper does not address: - Performance on scientific literature in other languages - Multilingual scientific reasoning capabilities - Cross-lingual transfer for scientific tasks

Long-context scientific documents. The thinking configuration supports up to 65536 max tokens (Table 1), but no benchmarks evaluate: - Processing complete scientific papers with multiple figures - Long-form scientific reasoning chains - Multi-document scientific synthesis

Real-world scientific workflows. While the paper demonstrates agent capabilities on benchmarks like GAIA and $\tau$2-Bench, these are not specific to scientific domains. The paper does not evaluate: - Integration with laboratory information systems - Real-time experimental data analysis - Multi-modal scientific report generation

Theoretical Gaps¶

No theoretical analysis of Grouped Routing. The paper empirically shows Grouped Routing works but does not analyze: - Why distributing well-trained experts across groups is superior (beyond the ablation) - The relationship between group count $G$ and optimal $K$ values - Potential limitations of the constraint on expert selection diversity

STE temperature selection unexplained. Equation 2 introduces a temperature parameter $\tau$ for the scaled routing probability, but: - The value of $\tau$ is not specified - No ablation on $\tau$ is provided - The interaction between $\tau$ and the stop-gradient operator is not analyzed

Benchmark-Specific Concerns¶

Contamination potential. AIME-2025, LCB V6, and other temporal benchmarks may overlap with training data. The paper does not: - Provide training data cutoff dates - Analyze potential contamination - Discuss temporal validation methodology

Proprietary baseline opacity. Comparisons against GPT-5.2 and Gemini-3-Pro face standard limitations: - Unknown model sizes and architectures - Unknown training data composition - Potential版本ing inconsistencies

The paper acknowledges the comparison models as proprietary but does not discuss these inherent limitations.

Negative Results Under-Analyzed¶

Beyond the two biology task failures, several benchmarks show Intern-S1-Pro trailing significantly:

MMMU-Pro: 72.8 vs. 81.0 (Gemini-3-Pro) — 8.2 pt gap
LCB V6: 74.3 vs. 87.7 (GPT-5.2) — 13.4 pt gap
IMO-Answer-Bench: 77.3 vs. 86.3 (GPT-5.2) — 9.0 pt gap

The paper does not analyze why Intern-S1-Pro excels on scientific tasks but trails on general reasoning benchmarks requiring similar skills (mathematical reasoning, coding). This is a missed opportunity to understand the trade-offs inherent in the scientific specialization approach.

7. Implications and Future Directions¶

Paradigm Shift: Scientific AI as a Scale Problem¶

This paper fundamentally reframes scientific AI from a domain-specific specialization problem to a scale problem. The central insight—that larger generalist models with joint training can outperform specialized models—challenges the prevailing assumption that niche domains require niche models.

This has profound implications for the field:

Domain adaptation becomes a data problem, not an architecture problem. Rather than designing specialized architectures for chemistry, biology, or materials science, practitioners can leverage large generalist models with appropriate training data. The Biology case study (Table 4) shows that identical scientific data yields dramatically different results based on model scale and training strategy.

The capacity requirements for scientific domains are quantified. The paper's citation of the multilingual translation finding (90× model size for 100× languages) provides a framework for estimating scientific model capacity. If each scientific sub-domain has its own "language" of notations and reasoning patterns, the trillion-parameter scale may be necessary precisely because science encompasses hundreds of such sub-domains.

Reinforcement learning can be stable at trillion-parameter scale. The four-component stabilization framework demonstrates that FP8 RL training is viable for extremely sparse MoE models. This opens the door to scaling RL-based post-training to models previously considered too large for such refinement.

Follow-Up Research Enabled by This Work¶

1. Understanding the Specializable Generalist failure modes. The two biology tasks where Intern-S1-Pro underperforms Biology-Instruction demand investigation. Key questions:

Are there task types where specialized architectures are genuinely superior?
Does the negative transfer stem from data distribution, model architecture, or training dynamics?
Can systematic analysis identify which tasks benefit from generalist vs. specialized approaches?

Future work should map the boundary conditions of the Specializable Generalist paradigm, providing practitioners with decision criteria for model selection.

2. Optimizing Grouped Routing beyond K=8, G=8. The current configuration is tailored to EP8 training. Open questions include:

How does Grouped Routing scale to EP16 or EP32 configurations?
What is the relationship between expert count, group count, and optimal Top-K values?
Can adaptive grouping (dynamic group assignment based on token characteristics) improve performance?

3. Extending STE to other MoE components. The STE technique enables dense router gradient flow. Potential extensions:

Apply STE to expert selection within groups (currently Top-1 per group)
Use STE for load balancing loss computation
Combine STE with expert dropout for regularization

4. Scientific domain expansion. The paper covers chemistry, materials, life sciences, and earth sciences. Natural extensions:

Physics: Particle physics, condensed matter, astrophysics
Engineering: Structural analysis, circuit design, control systems
Medicine: Clinical decision support, medical imaging, drug interactions

Each domain may require specific multimodal capabilities (equations, schematics, waveforms) that test the generalist approach's limits.

5. Efficient scientific inference. The paper focuses on training efficiency. Inference optimization is critical for deployment:

Scientific MoE specialization: Can experts be pruned for domain-specific inference?
Quantization-aware training for scientific tasks: Does FP8 inference preserve scientific reasoning accuracy?
Caching strategies for scientific documents with repeated figure references

6. Contamination-free scientific benchmarks. As training datasets expand, ensuring benchmark integrity becomes harder:

Develop benchmarks from papers published after training data cutoff
Create synthetic scientific problems with verifiable ground truth
Establish standardized temporal validation protocols for scientific AI

Practical Applications and Downstream Use Cases¶

Scientific literature understanding at scale. Intern-S1-Pro's capabilities in scientific image understanding and reasoning enable:

Automated extraction of experimental protocols from papers
Cross-paper synthesis of related findings
Generation of structured summaries from multimodal scientific documents

The dense caption pipeline (Section 3.1) provides a blueprint for creating training data, but the resulting model can be deployed for downstream tasks without replicating the pipeline.

Laboratory automation and agent systems. The agent benchmarks (GAIA 77.4, $\tau$2-Bench 80.9, ScreenSpot V2 93.6) suggest potential for:

Integration with laboratory information management systems (LIMS)
Automated experimental design and hypothesis generation
Real-time analysis of instrument output (spectra, chromatograms, microscopy)

The time-series module's ability to handle $10^2$ to $10^6$ time steps covers most laboratory instrument outputs.

Materials discovery and drug design. The strong performance on:

MatBench (72.8): Materials property prediction
SmolInstruct (74.8): Chemistry tasks including synthesis
Mol-Instructions (48.8): Biomolecular tasks

indicates readiness for:

Inverse design of materials with target properties
Retrosynthetic analysis for drug candidates
Protein function prediction and engineering

Earth science applications. The remote sensing (XLRS-Bench 52.8) and earth science (MSEarth-MCQ 65.2) results enable:

Satellite imagery analysis for climate monitoring
Geological survey automation
Natural disaster risk assessment

The high-resolution image handling (Native ViT) and time-series processing support common earth science data modalities.

Integration Guidance: When to Prefer Intern-S1-Pro¶

Prefer Intern-S1-Pro when:

Task matches covered scientific domains. Chemistry (SmolInstruct), materials (MatBench), biology (Mol-Instructions, Biology-Instruction), and earth science (XLRS-Bench, MSEarth-MCQ) are the model's strengths. Performance significantly exceeds proprietary alternatives.
Multimodal scientific reasoning is required. Tasks involving scientific figures, plots, diagrams, or time-series data benefit from the Native ViT, FoPE, and time-series encoder.
Agent capabilities are needed. The model shows strong grounding (RefCOCO 91.9, ScreenSpot V2 93.6) and agent benchmarks (GAIA, $\tau$2-Bench), suggesting capability for tool-using scientific workflows.
Open-source deployment is required. As an open-source model, Intern-S1-Pro enables local deployment, fine-tuning, and integration without API costs or data privacy concerns.

Consider alternatives when:

Tasks require non-covered domains. Physics, engineering, and clinical medicine are not extensively evaluated. Specialized models may be more appropriate.
General reasoning tasks dominate. Intern-S1-Pro trails on MMMU-Pro, LCB V6, and IMO-Answer-Bench. For non-scientific applications, general-purpose models may be superior.
Inference cost is critical. The trillion-parameter scale (even with 22B activation) is expensive to serve. For high-throughput applications, smaller specialized models may be more cost-effective.
Tasks match the failure cases. Promoter-enhancer interaction prediction and non-coding RNA family classification showed degradation in the Biology case study. Practitioners should validate on similar task types.

Future Directions for the SAGE Framework¶

The paper introduces the SAGE framework (Foundation → Fusion → Evolution) but only partially realizes its potential:

Foundation layer: The trillion-parameter MoE backbone is established, but the relationship between model scale and scientific capability requires further characterization. Scaling laws for scientific domains remain an open question.

Fusion layer: The integration of text, image, and time-series modalities is demonstrated, but scientific domains require additional modalities:

Molecular structures (SMILES, 3D conformations)
Crystallographic data
Genomic sequences (beyond the text-based approach in Biology-Instruction)
Simulation trajectories

Evolution layer: The RL post-training framework stabilizes FP8 training, but the paper does not demonstrate: - Iterative self-improvement cycles - Domain-specific RL reward design - Transfer learning across scientific domains

The SAGE framework provides a conceptual structure, but the evolution from Intern-S1 to Intern-S1-Pro represents primarily scaling within the Foundation and Fusion layers. The Evolution layer remains underexplored.

Closing Perspective¶

Intern-S1-Pro demonstrates that the boundaries between general and scientific AI are more permeable than previously assumed. The Specializable Generalist paradigm suggests that the path to scientific AI may run through scale and data quality rather than architectural specialization. However, the failure cases and general task gaps remind us that scale is not a universal solution—the architecture and training strategies developed in this paper address specific challenges (load balancing, router optimization, RL stability) that enabled scaling, but do not guarantee superiority across all tasks.

The work opens a research agenda: characterize where generalist scale suffices for scientific tasks, develop techniques for the remaining specialized requirements, and establish the scaling laws that predict scientific capability from model and data scale. Intern-S1-Pro provides both a proof of concept and a platform for this exploration.