Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows¶

🎯 Pitch¶

We introduce a principled operational definition of Scientific General Intelligence (SGI) and present SGI-Bench — a scientist-aligned, multi-domain benchmark plus an agentic evaluation framework that measures end-to-end research workflows across Deliberation, Conception, Action, and Perception. By revealing concrete gaps in current LLMs (e.g., numerical fidelity, protocol sequencing, multimodal comparison) and demonstrating Test-Time Reinforcement Learning to boost idea novelty, this work provides a practical roadmap and tools to advance AI systems that can reliably participate in real scientific discovery.

1. Executive Summary¶

This work defines Scientific General Intelligence (SGI) as an AI system that can autonomously carry out the full iterative cycle of scientific inquiry—Deliberation → Conception → Action → Perception—and operationalizes that definition with a benchmark, SGI-Bench, spanning four scientist-aligned task families. It pairs the benchmark with an agent-based evaluation framework (SGIEvalAgent) and reports that current frontier LLMs and agents show fragmented competence: they often produce plausible intermediate reasoning or executable code, but still fail at end-to-end scientific correctness, workflow fidelity, and multimodal comparison. The paper also introduces Test-Time Reinforcement Learning (TTRL) with retrieval-based novelty rewards to improve idea novelty without reference answers.

2. Context and Motivation¶

Problem / gap addressed
“Scientific ability” in LLMs is often evaluated via static QA or isolated reasoning tests that capture only parts of scientific work (e.g., knowledge recall, tool use), but not the full workflow that human scientists follow.
The paper targets the lack of a coherent, measurable framework for “SGI”—i.e., the capability to conceive, investigate, execute, and interpret across domains.
Why this matters
Scientific inquiry is a structured, long-horizon process involving evidence gathering, synthesis, experimentation (computational or lab), and interpretation, often with iterative correction loops.
In debates about “AGI-like” capability, scientific inquiry is positioned as a high bar because it requires planning, numerical rigor, procedural discipline, and evidence-grounded interpretation, not just fluent text.
Prior approaches and shortcomings (as framed in the provided content)
Benchmarks like MMLU and SuperGPQA largely map to Deliberation (multidisciplinary knowledge understanding).
Benchmarks like GAIA emphasize procedural Action (tool use).
HLE increases difficulty but still isolates inquiry stages, and its framing remains largely closed-domain QA rather than workflow integration.
Result: evaluations are fragmented—they don’t measure whether a system can “close the loop” of scientific inquiry.
How this paper positions itself
It anchors SGI in the Practical Inquiry Model (PIM) quadrants:
- Deliberation: search/synthesis/critical evaluation.
- Conception: idea/hypothesis/method generation.
- Action: executing experiments (dry/wet).
- Perception: interpreting results (often multimodal).
It then implements a benchmark explicitly mapping each quadrant to a task family (Figure 1, Figure 2).

3. Technical Approach¶

3.1 Reader orientation (approachable technical breakdown)¶

The system is a workflow-centric benchmark + agentic evaluator that tests whether AI systems can perform key parts of the scientific research loop in a structured, reproducible way.
It solves the evaluation problem by turning “scientific inquiry” into four task categories with task-specific metrics, and evaluates models using a tool-augmented Agent-as-a-judge framework (SGIEvalAgent).

3.2 Big-picture architecture (diagram in words)¶

SGI definition layer (PIM) → defines the four inquiry stages.
SGI-Bench dataset layer → provides curated tasks across 10 scientific domains, split into four task families.
Metrics layer → defines per-task multi-dimensional metrics (exact-match, step-level scoring, code unit tests, sequence similarity, multimodal MCQ accuracy, etc.).
Evaluation layer (SGIEvalAgent):
Questioning Agent selects questions.
Metric Customization Agent optionally adds user-defined metrics and merges them with built-in ones.
LLM/Agent Runner produces model outputs with tool access when configured.
Eval Agent computes scores (including LLM-judge step grading).
Reporting Agent generates an evaluation report (Figure 10).

3.3 Roadmap for the deep dive¶

I first explain the SGI operationalization via PIM and how tasks map to quadrants (Figures 1–2).
Then I detail each task definition (inputs/outputs/formulation) for the four task families (Figures 3–7; Tables 1–4).
Next I explain the metrics for each task family (Section 2.2), including the formulas.
Then I describe data construction (scientist alignment, cleaning, difficulty filtering) and dataset scale (Section 2.3–2.4).
Finally I explain the evaluation framework (SGIEvalAgent) and the TTRL method (Figures 10, 26–28; Table 8).

3.4 Detailed, sentence-based technical breakdown¶

This is an empirical benchmark + evaluation framework paper, with an additional test-time reinforcement learning (TTRL) method to improve idea-generation novelty. The core idea is to operationalize “scientific inquiry capability” as a cycle and to measure each quadrant with tasks that resemble scientist workflows and have explicit metrics.

3.4.1 SGI definition via Practical Inquiry Model (PIM)¶

The paper defines SGI as the ability to “autonomously navigate the complete, iterative cycle of scientific inquiry” (Introduction; Figure 1).
It uses the PIM quadrants as a taxonomy:
Deliberation: searching, synthesizing, critically evaluating knowledge.
Conception: generating ideas (hypotheses and method plans).
Action: executing experiments (computational and lab-protocol).
Perception: interpreting outcomes (often with multimodal evidence).
SGI-Bench maps these quadrants to four task categories (Figure 1; Figure 2):
Scientific Deep Research ↔ Deliberation
Idea Generation ↔ Conception
Dry/Wet Experiment ↔ Action
Experimental Reasoning ↔ Perception

3.4.2 Task family 1: Scientific Deep Research (Deliberation)¶

What it is. A literature-inquiry–centric task combining AI “deep research” style multi-step retrieval with a meta-analysis flavor focused on precise, verifiable outputs (Section 2.1.1).

Task I/O and formulation (Figure 3).
- Inputs:
  - Background (B): topic context and terminology disambiguation.
  - Constraints (C): assumptions, data sources, settings.
  - Data (D): any given empirical data or values referenced by the question.
  - Question (Q): focused query, often quantitative.
  - Response Requirements (R): required units/format (e.g., decimals).
- Outputs:
  - Steps (S): step-by-step procedure (retrieval + reasoning).
  - Answer (A): short numerical/string answer.
- Formally: S, A = LLM/Agent(B, C, D, Q, R).

Why it is constrained. The benchmark narrows deep research to literature inquiry rather than open-ended report writing to keep evaluation reproducible (Section 2.1.1).

Subtypes (Table 1).
- Data: retrieve/analyze structured datasets or counts.
- Property: infer material/molecular/system properties.
- Micro-experiment: small controlled lab-like experiments.
- Macro-experiment: large-scale observational/natural experiments.

Metrics (Section 2.2.1).
- Exact Match (EM): binary correctness of the final answer (1 if it matches the reference exactly, otherwise 0).
- Step-Level Accuracy (SLA): an LLM judge evaluates each reasoning step against a reference; computed as
\[ \mathrm{SLA} = \frac{\#\,\text{correct reasoning steps}}{\#\,\text{total reasoning steps}}. \]
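The two metrics above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the function names are hypothetical, the per-step verdicts are assumed to come from an LLM judge that is not modelled here, and trimming whitespace before comparing answers is an assumption.

```python
from typing import Sequence


def exact_match(prediction: str, reference: str) -> int:
    """Return 1 if the final answers are identical after trimming whitespace, else 0.

    Trimming whitespace is an assumption of this sketch, not a documented rule.
    """
    return int(prediction.strip() == reference.strip())


def step_level_accuracy(step_verdicts: Sequence[bool]) -> float:
    """Fraction of reasoning steps that the judge marked as correct."""
    if not step_verdicts:
        return 0.0  # assumption: an empty step list scores zero
    return sum(step_verdicts) / len(step_verdicts)


# Hypothetical judge verdicts for a four-step solution whose last step is wrong.
verdicts = [True, True, True, False]
sla = step_level_accuracy(verdicts)
em = exact_match("2.23, 3.2, 11", "2.23, 3.2, 10")
print(f"SLA={sla:.2f}, EM={em}")
```

In this toy case three of four steps are judged correct (SLA of 0.75) while the final answer still misses (EM of 0), which is the kind of gap the two metrics are meant to separate.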

Micro-example (from Figure 11). A Chua’s circuit question requires (i) finding a specific paper, (ii) computing the RC time constant \( \tau=RC \) in microseconds, (iii) reading the hysteresis loop voltage range at a given frequency, and (iv) identifying a critical frequency threshold. The example answer format is “2.23, 3.2, 10”, showing how the task forces retrieval + computation + formatting.

3.4.3 Task family 2: Idea Generation (Conception)¶

What it is. A structured “methodology design” task rather than unconstrained hypothesis brainstorming, because fully open-ended idea quality is hard to score reliably at scale (Section 2.1.2).

Task I/O and formulation (Figure 4).
- Inputs:
  - Related Work (RW), Challenge (C), Limitation (L), Motivation (M),
  - Task Objective (TO), Existing Solutions (ES).
- Outputs (structured idea decomposition):
  - Core Idea (CI)
  - Implementation Steps (IS)
  - Implementation Order (IO)
  - Data (D) (data to use / collect)
  - Evaluation Metrics (EM)
  - Expected Outcome (EO)
- Formally: CI, IS, IO, D, EM, EO = LLM/Agent(RW, C, L, M, TO, ES).

Why structure matters. The decomposition is meant to expose whether an “idea” is actually executable: it should specify steps, ordering, data, metrics, and expected results (Section 2.1.2).

Metrics: hybrid subjective + objective (Section 2.2.2).
- Four dimensions: Effectiveness, Novelty, Detailedness, Feasibility.
- Subjective component: pairwise comparison against expert reference ideas using three LLM judges, each casting two votes, for six votes per dimension; the score is the win rate.
- Objective components:
  - Effectiveness: hit rate on 3–5 expert keywords (semantic matches allowed).
  - Novelty: dissimilarity from prior related work (lower similarity = higher novelty).
  - Detailedness: completeness of required components + redundancy penalty via sentence-level similarity.
  - Feasibility: compare an extracted “implementation graph” to an expert template graph.
- Each dimension combines its objective score and the LLM win rate by averaging, e.g.:
\[ \text{Effectiveness} = \frac{\text{Keyword Hit Rate} + \text{LLM Win Rate}}{2} \]
\[ \text{Feasibility} = \frac{\text{Graph Similarity} + \text{LLM Win Rate}}{2} \]
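As a rough illustration of the averaging rule for Effectiveness, the sketch below combines a keyword hit rate with a win rate over six judge votes. The helper names are hypothetical, and plain case-insensitive substring matching stands in for the semantic keyword matching the benchmark allows; all scores here are fractions in [0, 1].

```python
from typing import Sequence


def win_rate(votes: Sequence[bool]) -> float:
    """Share of judge votes that prefer the model idea over the reference idea."""
    return sum(votes) / len(votes)


def keyword_hit_rate(idea_text: str, keywords: Sequence[str]) -> float:
    """Share of expert keywords found in the idea.

    Assumption: substring matching replaces the semantic matching used in practice.
    """
    text = idea_text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


def dimension_score(objective_score: float, llm_win_rate: float) -> float:
    """Average of the objective score and the LLM win rate for one dimension."""
    return (objective_score + llm_win_rate) / 2


# Hypothetical idea, expert keywords, and six votes (three judges, two votes each).
idea = "Pretrain a graph neural network on molecular data, then fine-tune."
keywords = ["graph neural network", "molecular", "contrastive", "uncertainty"]
votes = [True, True, False, True, True, False]

effectiveness = dimension_score(keyword_hit_rate(idea, keywords), win_rate(votes))
print(round(effectiveness, 4))
```

Here two of four keywords are hit (0.5) and four of six votes are won (about 0.667), so the combined Effectiveness is about 0.583. The same `dimension_score` call would apply to Feasibility with a graph-similarity value as the objective input.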

3.4.4 Task family 3: Dry/Wet Experiment (Action)¶

This family measures whether models can operationalize a plan into either (i) code that runs and is correct (dry) or (ii) a step-by-step lab protocol (wet), under constrained formulations for reproducibility (Section 2.1.3; Figures 5–6).

Dry Experiment (code completion)¶

Task setup (Figure 5; Figure 16).
- Input provides:
  - Background (B): scientific context from code/comments.
  - Data Code (D): code that constructs inputs/data.
  - Main Code (M): script with masked/missing functions (signatures + docstrings preserved).
- Output:
  - Functions (F): completed missing functions.
- Formally: F = LLM/Agent(B, D, M).

Function categories (Table 2). Numerical Calculation, Statistical Analysis, Simulation, Metric Calculation, Data Processing, Predictive Modeling.

Metrics (Section 2.2.3; Table 7).
- Each problem has 5 unit tests.
- PassAll@k: proportion of problems where k or more unit tests pass (k ∈ {1, 3, 5} is reported; PassAll@5 is strictest).
\[ \mathrm{PassAll@}k = \frac{\#\,\text{problems with} \ge k \text{ tests passed}}{\#\,\text{total problems}} \]
- SER (Smooth Execution Rate): fraction of generated code that runs without runtime errors.
- AET (Average Execution Time): average runtime across tests:
\[ \mathrm{AET} = \frac{1}{N}\sum_{i=1}^{N} t_i \]
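The three dry-experiment metrics reduce to simple counting and averaging. The sketch below is illustrative only: the function names are hypothetical, and the granularity of SER (one boolean per generated solution here) is an assumption, since the summary does not say whether it is measured per problem or per test.

```python
from typing import Sequence


def pass_all_at_k(tests_passed: Sequence[int], k: int) -> float:
    """Share of problems on which at least k unit tests pass."""
    return sum(1 for passed in tests_passed if passed >= k) / len(tests_passed)


def smooth_execution_rate(ran_cleanly: Sequence[bool]) -> float:
    """Share of generated solutions that ran without a runtime error."""
    return sum(ran_cleanly) / len(ran_cleanly)


def average_execution_time(seconds: Sequence[float]) -> float:
    """Mean runtime t_i over N measured runs."""
    return sum(seconds) / len(seconds)


# Hypothetical results for five problems, each with five unit tests.
tests_passed = [5, 3, 0, 5, 1]
ran_cleanly = [True, True, False, True, True]

for k in (1, 3, 5):
    print(f"PassAll@{k} = {pass_all_at_k(tests_passed, k):.2f}")
print(f"SER = {smooth_execution_rate(ran_cleanly):.2f}")
```

With these toy numbers SER is 0.80 while PassAll@5 is only 0.40: code that runs is not necessarily code that passes every test.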

Why dry experiments expose scientific weakness. The benchmark separates syntactic validity (SER) from scientific correctness (PassAll@k), which becomes important in results (Section 4.5.1).

Wet Experiment (protocol planning)¶

Task setup (Figure 6; Figure 19).
- Input provides:
  - Background (B): experimental objective/context.
  - Action Pool (AP): predefined atomic actions with parameters and I/O schemas.
- Output:
  - Atomic Action Order (AAO)
  - Atomic Action Parameters (AAP)
- Formally: AAO, AAP = LLM/Agent(B, AP).

Metrics (Section 2.2.3).
- Sequence Similarity (SS): inversion-based similarity between action orderings. Let Inv(seq_model, seq_ref) be the number of discordant pairs; for length n:
\[ \mathrm{SS} = 1 - \frac{\mathrm{Inv}(seq_{model}, seq_{ref})}{n(n-1)/2} \]
  - SS = 1 means identical ordering; SS = 0 means a fully reversed ordering (maximal disorder).
- Parameter Accuracy (PA): fraction of correctly specified parameters:
\[ \mathrm{PA} = \frac{\#\,\text{correct parameters}}{\#\,\text{total parameters}} \]
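A minimal sketch of the two wet-experiment metrics follows. It assumes the model's sequence is a permutation of the reference actions with no repeats (how the benchmark handles missing or extra actions is not described here), and it assumes parameters are compared by exact equality. The function names and the toy protocol are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence


def sequence_similarity(model_seq: Sequence[str], ref_seq: Sequence[str]) -> float:
    """1 minus the share of action pairs ordered differently from the reference."""
    if (len(set(ref_seq)) != len(ref_seq)
            or len(model_seq) != len(ref_seq)
            or set(model_seq) != set(ref_seq)):
        raise ValueError("sketch assumes a permutation of distinct reference actions")
    n = len(ref_seq)
    if n < 2:
        return 1.0  # a single action cannot be out of order
    rank = {action: i for i, action in enumerate(ref_seq)}
    ranks = [rank[action] for action in model_seq]
    # A discordant pair is one whose relative order is flipped versus the reference.
    inversions = sum(1 for i, j in combinations(range(n), 2) if ranks[i] > ranks[j])
    return 1 - inversions / (n * (n - 1) / 2)


def parameter_accuracy(model_params: Dict[str, object],
                       ref_params: Dict[str, object]) -> float:
    """Share of reference parameters the model specified with the same value."""
    correct = sum(1 for name, value in ref_params.items()
                  if name in model_params and model_params[name] == value)
    return correct / len(ref_params)


# Hypothetical four-step protocol with two adjacent steps swapped by the model.
ref_order = ["lyse", "centrifuge", "wash", "resuspend"]
model_order = ["lyse", "wash", "centrifuge", "resuspend"]
print(round(sequence_similarity(model_order, ref_order), 4))

ref_params = {"speed_rpm": 3000, "time_min": 10, "temp_c": 4}
model_params = {"speed_rpm": 3000, "time_min": 5, "temp_c": 4}
print(round(parameter_accuracy(model_params, ref_params), 4))
```

One swapped adjacent pair out of six possible pairs gives SS of about 0.833, and two of three matching parameters give PA of about 0.667.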

3.4.5 Task family 4: Experimental Reasoning (Perception)¶

What it is. A multimodal, data-analysis–oriented reasoning task where models interpret images and answer a multiple-choice question with ≥10 options (Section 2.1.4; Figure 7; Figure 23).

Inputs/outputs (Figure 7).
- Input:
  - Multiple Experimental Images (MEI)
  - Question (Q)
- Output:
  - Reasoning (R) (step-by-step explanation)
  - Answer (A) (selected option)
- Formally: R, A = LLM/Agent(MEI, Q).

Modalities (Table 3; Figure 22). Process images, Observation images, Experiment images, Simulation images, Visualization images.

Reasoning paradigms (Table 4). Signal Perception, Attribute Understanding, Comparative Reasoning, Causal Reasoning.

Metrics (Section 2.2.4).
- MCA (Multi-choice Accuracy): exact match on the selected option.
- RV (Reasoning Validity): an LLM judge gives a 0–10 score for reasoning quality against the reference reasoning; averaged across samples.

3.4.6 Dataset construction (“scientist alignment”) and scale¶

Domains and sourcing. Tasks span 10 scientific domains: astronomy, chemistry, earth science, energy, information science, life science, materials science, neuroscience, physics, math (Section 2; Figure 8).

Data construction pipeline (Section 2.3; Figure 2). The pipeline includes:
1. Raw corpus collection with domain experts; seed questions inspired by Science’s 125 Big Questions and expert research directions.
2. Question construction by more than 100 Master's/PhD annotators with scientist feedback; questions reference original sources for traceability.
3. Data cleaning:
  - rule-based filters,
  - model-based consistency checks,
  - expert review (including executability checks for dry experiment code).
4. Difficulty filtering using six high-performance models with web search/deep reasoning enabled; questions solvable by >50% of those models are removed.
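The difficulty-filtering rule in step 4 can be sketched as a threshold on the per-question solve rate. The data layout and the function name below are hypothetical; only the "remove if more than 50% of the reference models solve it" rule comes from the description above.

```python
from typing import Dict, List, Sequence


def filter_by_difficulty(solved_by_model: Dict[str, Sequence[bool]],
                         max_solve_rate: float = 0.5) -> List[str]:
    """Keep question ids whose solve rate does not exceed max_solve_rate."""
    kept = []
    for question_id, outcomes in solved_by_model.items():
        solve_rate = sum(outcomes) / len(outcomes)
        if solve_rate <= max_solve_rate:  # strictly more than 50% means removal
            kept.append(question_id)
    return kept


# Hypothetical outcomes for six reference models on three questions.
results = {
    "q1": [True, True, True, True, False, False],     # 4/6 solved: removed
    "q2": [True, True, True, False, False, False],    # 3/6 solved: kept
    "q3": [False, False, False, False, False, False], # 0/6 solved: kept
}
print(filter_by_difficulty(results))
```

Note the boundary case: a question solved by exactly three of six models is kept under this reading, because the rule removes only questions solved by more than half.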

Dataset size (Section 2.4).
- Total: over 1,000 samples.
- Breakdown:
  - Scientific Deep Research: 318
  - Idea Generation: 315
  - Dry Experiment: 271
  - Wet Experiment: 68
  - Experimental Reasoning: 291

3.4.7 SGIEvalAgent: agentic evaluation framework¶

Why agentic evaluation. The benchmark uses many specialized metrics (unit tests, graph similarity, inversion distance, step grading), so a single “LLM-as-a-judge” is insufficient (Section 3).

Four-stage evaluation (Figure 10; Section 3.1–3.4).
1. Question Selection
  - A Questioning Agent selects a subset based on user query, domain, and intent.
2. Metric Customization
  - A Customization Agent generates novel metrics from user intent and merges them with pre-defined metrics (Section 3.2).
3. Prediction & Evaluation
  - A tool-augmented runner executes the target model/agent; an Eval Agent computes scores and rationales (Section 3.3).
  - Tool pool includes web search, PDF parser, Python interpreter, file reader, metric-specific functions.
4. Report Generation
  - A Reporting Agent compiles results into a report (Section 3.4).

Tool-integrated reasoning analysis (Section 5.2; Figure 29).
- Across evaluated agent workflows:
  - web_search: 539 calls (33.98%)
  - visit_webpage: 385 calls (24.27%)
  - final_answer: 358 calls (22.57%)
  - python_interpreter: 200 calls (12.61%)
  - wikipedia_search: 104 calls (6.56%)
- Latency variability is tool-dominated; visit_webpage ranges from 5.37 s to 114.29 s across models (a 21.28× spread).
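The percentages and the latency spread quoted above follow directly from the raw counts. The short check below recomputes them; the call counts and latencies are the figures reported in this section, while the variable names are just for illustration.

```python
# Tool call counts as reported above.
calls = {
    "web_search": 539,
    "visit_webpage": 385,
    "final_answer": 358,
    "python_interpreter": 200,
    "wikipedia_search": 104,
}

total_calls = sum(calls.values())
# Share of each tool as a percentage of all calls, rounded to two decimals.
shares = {tool: round(100 * count / total_calls, 2) for tool, count in calls.items()}

# Ratio between the slowest and fastest reported visit_webpage latencies (seconds).
latency_spread = round(114.29 / 5.37, 2)

print(total_calls)
print(shares)
print(latency_spread)
```

The counts sum to 1,586 calls, and the recomputed shares and the 21.28× spread match the reported values, with retrieval tools (web_search plus visit_webpage) accounting for a little over 58% of all calls.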

3.4.8 TTRL: Test-Time Reinforcement Learning for idea novelty¶

Motivation. Idea generation lacks ground-truth labels, so standard supervised/RL pipelines struggle to define correctness (Section 5.1).

Core mechanism (Figure 26; Section 5.1.1).
- Use online retrieval as a moving “baseline” and reward ideas that are dissimilar to retrieved related works.
- Training backbone: GRPO (Group Relative Policy Optimization).
- For a query Q, policy π_θ samples a group of outputs {o_1, …, o_k}.

Reward definition (Equations (1)–(5)).
- Total reward, where W = {w_1, …, w_n} are the retrieved related works:
\[ R(o) = R_{format}(o) + R_{novelty}(o, W) \]
- Format reward, which enforces strict XML tags:
\[ R_{format}(o) = \mathbb{I}(\text{output follows }......) \]
- Novelty via embedding dissimilarity:
  - Compute the average cosine similarity:
  \[ S_{avg} = \frac{1}{n}\sum_{j=1}^{n}\frac{e_{idea}\cdot e_{w_j}}{\|e_{idea}\|\,\|e_{w_j}\|} \]
  - Convert to an innovation score:
  \[ S_{inn} = \mathrm{clip}\big((1 - S_{avg})\times 10,\ 0,\ 10\big) \]
  - Gate with threshold τ = 5:
  \[ R_{novelty}(o, W) = \mathbb{I}(S_{inn}>\tau) \]
- Conceptually: the model is rewarded if its idea is far enough (in embedding space) from the retrieved papers and is in the correct format.
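The reward can be sketched directly from the definitions above. This is an illustrative reading, not the paper's code: embeddings are plain lists of floats supplied by the caller (the embedding model is not specified here), the format check is passed in as a boolean because the required tag layout is elided in this summary, and all function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def innovation_score(idea_emb: Vector, work_embs: Sequence[Vector]) -> float:
    """S_inn: (1 - average cosine similarity) * 10, clipped to [0, 10]."""
    s_avg = sum(cosine_similarity(idea_emb, w) for w in work_embs) / len(work_embs)
    return min(max((1.0 - s_avg) * 10.0, 0.0), 10.0)


def novelty_reward(idea_emb: Vector, work_embs: Sequence[Vector], tau: float = 5.0) -> int:
    """Indicator that the innovation score exceeds the threshold tau."""
    return int(innovation_score(idea_emb, work_embs) > tau)


def total_reward(format_ok: bool, idea_emb: Vector,
                 work_embs: Sequence[Vector], tau: float = 5.0) -> int:
    """Format indicator plus novelty indicator (1:1 weighting)."""
    return int(format_ok) + novelty_reward(idea_emb, work_embs, tau)


# Toy 2-D embeddings: one idea orthogonal to the retrieved works, one nearly parallel.
idea = [1.0, 0.0]
far_works = [[0.0, 1.0], [0.0, 1.0]]
near_works = [[1.0, 0.0], [0.9, 0.1]]
print(total_reward(True, idea, far_works))
print(total_reward(True, idea, near_works))
```

With τ = 5, the gate S_inn > τ is equivalent to an average cosine similarity below 0.5, so the well-formatted orthogonal idea earns 2 and the well-formatted near-duplicate earns 1.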

Hyperparameters (Table 8).
- Base model: Qwen3-8B
- RL algorithm: GRPO
- Precision: bfloat16
- Learning rate: 5 × 10^-7
- Max length: 2048
- Generations G: 8
- Temperature: 1.0
- Batch size: 4
- Related works n: 4
- Reward weights: 1:1 (format:novelty)

Observed training dynamics (Figure 27). Format reward saturates quickly near 1.0; novelty reward increases more gradually.

4. Key Insights and Innovations¶

A principle-grounded operational definition of SGI via PIM (Figure 1; Section 1–2).
Novelty: SGI is not treated as a vague aspiration but decomposed into four interdependent capabilities tied to a known inquiry model.
Significance: enables benchmark design that checks whether systems can cover all quadrants, not just recall or reasoning.
SGI-Bench: workflow-centric, cross-disciplinary tasks with scientist-aligned construction (Figure 2; Sections 2.3–2.4).
Novelty: tasks are built from expert-provided materials and iteratively vetted, with explicit filtering for difficulty (removing questions solvable by >50% of strong models under tool-augmented settings).
Significance: pushes evaluation beyond “easy-to-grade QA” toward workflow competence, while keeping tasks reproducible via constrained formulations.
Multi-dimensional, task-specific metrics designed to expose “local plausibility vs global correctness” gaps (Section 2.2).
Examples:
- SLA vs EM for deep research isolates step correctness vs final answer correctness.
- SER vs PassAll@k isolates code executability vs scientific correctness.
- RV vs MCA isolates explanation plausibility vs correct discrimination in multimodal reasoning.
Significance: the benchmark is engineered to reveal characteristic failure modes where LLMs sound right but are wrong.
SGIEvalAgent: an agent-based evaluation stack with tool support and customizable metrics (Figure 10; Section 3; Figure 30).
Novelty: evaluation is framed as an agentic pipeline rather than a single judge call, enabling computation-heavy scoring (unit tests, graph similarity) and user-defined rubrics.
Significance: makes the benchmark more extensible for real-world “scientist needs” (e.g., custom rigor metrics).
TTRL for open-ended idea novelty without ground truth (Section 5.1; Figure 26–28; Table 8).
Novelty: test-time RL uses retrieval-conditioned novelty rewards to improve idea generation even when no reference answer exists.
Significance: points to SGI as a dynamic capacity that can improve at inference time with minimal feedback loops, not purely via offline training.

5. Experimental Analysis¶

Evaluation methodology and setup (Section 4.1)¶

Models evaluated include a large list of open-weight and closed-weight LLMs and multiple agent systems (Section 4.1).
Inference settings:
Temperature is set to 0 “for benchmarking consistency” (Section 4.1).
A “standard zero-shot, task-specific prompt template” is used across tasks (Section 4.1).
Important missing details (cannot be filled from provided content):
The paper does not provide per-model training hyperparameters (optimizer, batch size, training tokens, etc.) for the evaluated proprietary/open models in the excerpt provided; therefore, those cannot be summarized reliably here.

Main quantitative results (Tables 5–7; Section 4.2)¶

Overall SGI-Score remains low.
- Table 5 reports SGI-Score as an average across tasks, with the best aggregate around 33.83/100 (Gemini-3-Pro).
- The paper emphasizes that even the best systems are far from “proficient” integrated scientific inquiry.

Cross-task snapshot highlights (Table 5).
- Deep Research (Exact Match, strict):
  - Best reported: 18.48 (Gemini-3-Pro).
  - Many models cluster in the 8–16 range.
- Idea Generation (average of four metrics):
  - Highest reported: 55.40 (GPT-5) and 55.03 (GPT-5.2-Pro).
- Dry Experiments (PassAll@5, strict):
  - Best reported: 36.64 (Gemini-3-Pro).
  - Other high scores include 35.79 (o4-mini, Claude-Sonnet-4.5) and 34.69 (Claude-Opus-4.1).
- Wet Experiments (average of SS and PA):
  - Example strong scores: 37.92 (Grok-3), 36.63 (GPT-4.1), 33.62 (Qwen3-Max).
- Experimental Reasoning (MCA, strict):
  - Best reported: 41.92 (Gemini-3-Pro), with other strong results in the high 30s (Table 5 shows 39.18 for GPT-5.2-Pro, 38.83 for Claude-Opus-4.1, 38.49 for GPT-4.1).

Closed vs open performance gap is described as marginal (Section 4.2).
- Table 5 includes an example comparison: Claude-Sonnet-4.5 SGI-Score 32.16 vs Qwen3-Max 31.97 (marked with * for multimodal series versions).
- The paper interprets this as “scale and access alone” not yielding robust scientific cognition.

Deep Research detailed findings (Section 4.3; Figures 12–14)¶

Key pattern: SLA is much higher than EM.
Several systems exceed 50% SLA, with the best at roughly 65% (Section 4.3), but EM stays near 10% and is “seldom above 20%”.
Interpretation in the paper: models often produce locally consistent steps but fail globally (reasoning-chain collapse).
Task-type breakdown (Figure 14; Table 10):
The Data and Property types are hardest; the Micro- and Macro-experiment types fare somewhat better but remain low, with accuracy that “rarely exceeds 30%”.

Idea Generation findings (Section 4.4; Table 6)¶

Table 6 shows:
GPT-5 average 55.40 with very high Novelty 76.08 and Detailedness 85.72, but low Feasibility 18.87.
Best Feasibility is 22.90 (o3), while many models are in the ~14–20 range.
The paper’s diagnosis:
LLMs produce novel, elaborate ideas but omit executable details (data acquisition, hyperparameters/resources, module interfaces, ordering).

Dry Experiment findings (Section 4.5.1; Table 7; Figure 17; Figure 18)¶

Executability vs correctness gap:
Many top models have high SER (often >90%), but PassAll@5 remains ≤ 36.64% (Table 7).
Example: Gemini-3-Pro has SER 98.85% but PassAll@5 36.64%.
Hardest function types:
The paper highlights Numerical Calculation and Simulation functions as hardest (Figure 17; narrative in Section 4.5.1).
Case study demonstrates methodological sensitivity (Figure 18):
Two solutions for comoving volume differ by numerical integration approach:
- One uses scipy.integrate.quad (adaptive integration).
- Another uses cumulative sum / trapezoidal-style approximation that accumulates error.
The case is used to argue that scientific code requires choosing numerically appropriate methods, not just writing runnable code.
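To make the point about method choice concrete, the sketch below contrasts a coarse cumulative (left Riemann) sum with composite Simpson's rule on an integral with a known closed form. It is a stand-in, not the paper's case study: the comoving-volume integrand and `scipy.integrate.quad` are not used here, Simpson's rule simply plays the role of the more accurate method, and all function names are hypothetical.

```python
import math
from typing import Callable


def left_riemann(f: Callable[[float], float], a: float, b: float, n: int) -> float:
    """Cumulative-sum style estimate: sum of f at left endpoints times the step."""
    h = (b - a) / n
    return sum(f(a + i * h) for i in range(n)) * h


def simpson(f: Callable[[float], float], a: float, b: float, n: int) -> float:
    """Composite Simpson's rule on n equal subintervals (n must be even)."""
    if n % 2:
        raise ValueError("n must be even for Simpson's rule")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Interior points alternate weights 4 (odd index) and 2 (even index).
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


# Integral of exp(x) on [0, 1]; the exact value is e - 1.
exact = math.e - 1.0
coarse = left_riemann(math.exp, 0.0, 1.0, 10)
refined = simpson(math.exp, 0.0, 1.0, 10)

print(f"left Riemann error: {abs(coarse - exact):.2e}")
print(f"Simpson error:      {abs(refined - exact):.2e}")
```

Both functions run without error on the same ten-interval grid, yet their errors differ by several orders of magnitude, which is the kind of gap that separates executable code from numerically correct code.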

Wet Experiment findings (Section 4.5.2; Figure 20; Figure 21)¶

SS remains uniformly low; closed models are somewhat higher but still poor (Figure 20).
PA is modest and sometimes competitive across open/closed models.
Case study (Figure 21) shows failures in:
temporal sampling logic,
branch-aware protocol structure,
multi-sample bookkeeping (e.g., repeated PBMC isolation per time point).

Experimental Reasoning findings (Section 4.6; Figures 24–25; Table 13)¶

RV > MCA is common (Figure 24): models can produce plausible reasoning even when wrong.
Comparative reasoning is weakest (Figure 25; Table 13 shows comparative reasoning scoring lower than signal perception and causal reasoning for many models).
Domain variability:
Higher accuracy in astronomy/chemistry; lower in materials/life/earth (Section 4.6; Figure 25).

Does the evidence support the claims?¶

The reported metrics align well with the claims about “fragmentation” because multiple task families show the same structural gap:
SLA > EM (Deep Research),
SER > PassAll@k (Dry Experiments),
RV > MCA (Experimental Reasoning).
A potential caveat is that several metrics rely on LLM judges (SLA, RV, and subjective idea comparisons), which introduces evaluation-model dependence; the paper mitigates this partially for idea generation via multiple judges and multiple votes, but the excerpt does not provide detailed inter-rater reliability statistics.

6. Limitations and Trade-offs¶

(From Section 6.4 plus limitations implied by the evaluation design.)

Partial coverage of real scientific workflows
The four stages are “probes” rather than a complete representation of scientific practice (Section 6.4).
Topics like integration across disciplines and risk/safety assessment are noted as outside scope.
Deep Research scope is restricted
The benchmark focuses on “literature-inquiry–centric” deep research rather than broader report writing (Section 6.4), trading realism for reproducibility.
Idea Generation evaluates method-design, not full hypothesis quality
The benchmark focuses on structured methodological plans because open-ended hypothesis evaluation at scale is difficult (Sections 2.1.2, 6.4).
Limited code/action-space coverage
Dry Experiment tasks support Python only (Section 6.4).
Wet Experiment action pools cover only a subset of disciplines (Section 2.4; Wet Experiments = 68 items).
Experimental Reasoning uses multiple-choice
Multiple-choice improves automatable scoring but constrains the range of valid scientific explanations (Section 6.4).
Deductive bias
The workflow primarily reflects a deductive paradigm starting from literature and moving toward experiments (Section 6.4), leaving inductive discovery (hypothesis emergence from novel observations) for future work.
TTRL optimizes novelty only
The novelty reward is a gated dissimilarity score vs retrieved works (Equations (3)–(5)); optimizing novelty alone can trade off against feasibility/rigor, and the current method does not incorporate feasibility rewards in the described formulation.

7. Implications and Future Directions¶

How this changes the landscape
It reframes SGI evaluation from isolated QA into workflow-faithful testing aligned with a scientific inquiry cycle (PIM).
It introduces a practical benchmark with explicit metrics that diagnose where models fail (numerical aggregation, protocol sequencing, comparative multimodal reasoning).
Research directions suggested by the results (Section 6.3)
Meta-analytic reasoning with numerical robustness
- Target the Data/Properties brittleness and the SLA–EM gap by training retrieval-conditioned quantitative reasoning and verification-aware aggregation.
Planning-aware conception
- Raise feasibility by enforcing parameter-complete steps, dependency consistency, and tool-augmented checks during decoding.
Scientific coding beyond syntax
- Address the SER–PassAll@k gap with training that emphasizes numerical stability and algorithm choice (e.g., sensitivity to discretization/integration methods).
Branch- and time-aware wet-lab reasoning
- Improve protocol coherence with stateful workflow verification and training signals for temporal/branch logic.
Comparative multimodal scientific reasoning
- Since comparative reasoning is consistently weakest (Figure 25), future training should reward precise cross-sample discrimination and extraction of quantitative cues from visuals.
Multi-objective test-time learning
- Extend TTRL beyond novelty to incorporate scientific rigor and feasibility rewards, preventing “novel but unusable” outputs.
Tool ecosystem optimization
- Since tool latency dominates (Figure 29), engineering improvements (caching, selective browsing, better extraction) can materially improve agentic SGI throughput.
Practical applications / downstream use cases
Based on the task families, SGI-Bench-style evaluation can guide deployment decisions:
- Use Experimental Reasoning metrics to choose multimodal models for interpreting scientific figures/charts.
- Use Dry Experiment unit-test outcomes to assess whether a model is safe to trust for scientific code completion.
- Use Wet Experiment sequence/parameter metrics to assess whether a model can draft lab protocols without hazardous omissions or ordering errors.
Repro/Integration Guidance (from what is described)
If evaluating an agentic system, expect performance and latency to be strongly influenced by retrieve–browse loops and tool routing (Figure 29).
If optimizing idea generation without labels, the paper’s TTRL setup provides a concrete recipe (Table 8), but it should likely be paired with additional rewards if feasibility matters.