Skip to content

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

ArXiv: 2512.20848

🎯 Pitch

Nemotron 3 Nano introduces a hybrid Mamba–Transformer with sparse Mixture-of-Experts (31.6B total params, ~3.2B active per pass) plus a large-scale pretraining and multi-stage post-training pipeline (SFT + RLVR + RLHF) to deliver strong reasoning and agentic tool-use while greatly improving inference throughput. By trading dense parameter activation for expert sparsity and careful long-context and alignment training, it offers a practical path to high-capability, long-context LLMs that are faster and more cost-effective to serve for agentic and reasoning-heavy applications.


1. Executive Summary (2-3 sentences)

Nemotron 3 Nano 30B-A3B is an open-weight language model that targets the throughput–accuracy trade-off by combining a hybrid Mamba–Transformer backbone with sparse Mixture-of-Experts (MoE) layers, so that only a small subset of parameters is used per token. It is pretrained on 25 trillion text tokens, then post-trained with supervised fine-tuning (SFT) plus large-scale reinforcement learning (verifiable rewards + human-feedback via a generative reward model), and it supports up to 1M-token context after a long-context extension phase. The headline claim is that it improves accuracy relative to prior Nemotron 2 Nano while activating ~3.2B parameters per forward pass (out of 31.6B total) and achieves higher inference throughput than similarly-sized open models in the paper’s H200 tests (Figure 1).

2. Context and Motivation

  • Problem / gap addressed
  • Modern LLMs often face a three-way tension between:
    • Accuracy/capability (especially “reasoning” and “agentic/tool-use” behaviors),
    • Inference efficiency (tokens/sec/GPU, memory footprint),
    • Long-context support (hundreds of thousands to 1M tokens).
  • Dense models pay inference cost on all parameters every token, which can be inefficient for deployment.
  • Long-context capability often requires additional training stages and can degrade short-context quality if not handled carefully (the paper describes such an effect during long-context CPT and how it is mitigated in §2.5).

  • Why it matters

  • If only a fraction of parameters are active per token (sparse MoE), inference can be faster for similar total parameter counts—important for serving and agentic workloads with long generations.
  • Agentic settings (tool calls, terminal tasks, software engineering) require robustness and calibrated behaviors; post-training needs to improve these without destroying general capability.

  • Prior approaches and shortcomings (as positioned here)

  • The model builds on NVIDIA’s prior hybrid Mamba–Transformer Nemotron lines (Nemotron-H; Nemotron 2 Nano) but replaces dense FFNs with sparse MoE to improve the throughput-to-accuracy frontier (§2.1).
  • Post-training in Nemotron 3 Nano is presented as a scale-up of reinforcement learning compared to Nemotron 2 Nano, particularly via multi-environment RL to avoid single-task RL causing regressions in other skills (§3).

  • Positioning relative to existing work

  • The paper positions Nemotron 3 Nano as:
    • Competitive on common reasoning/code/math/chat/agentic benchmarks (Figure 1, Table 3),
    • Faster in a generation-heavy throughput scenario on H200 (Figure 1),
    • Supporting longer context (up to 1M) with strong RULER results (Table 3; also base long-context results in Table 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a sparse-activated LLM: a 31.6B-parameter model where only a small subset of parameters (“experts”) runs on each token.
  • It solves the “how do we get strong reasoning/agentic ability while staying fast and supporting very long context” problem via a hybrid Mamba–Transformer stack + MoE + multi-stage post-training (SFT → RLVR → RLHF → RLVR) + selective FP8 quantization.

3.2 Big-picture architecture (diagram in words)

  • Input tokens → hybrid backbone (52 layers) consisting of:
  • Many Mamba-2 blocks (state-space sequence modeling) and a smaller number of attention blocks (Grouped-Query Attention),
  • MoE layers replacing standard dense FFNs, where a router selects a subset of experts per token,
  • → output logits for next-token prediction.
  • After base pretraining:
  • SFT adds chat/agentic/reasoning behaviors via curated and synthetic instruction data.
  • RLVR trains with automatically verifiable rewards across multiple environments (math/code/tool-use/etc.).
  • RLHF uses a generative reward model (GenRM) to score chat responses efficiently.
  • Quantization (PTQ) produces an FP8 checkpoint with selective BF16 retention for sensitive submodules.

3.3 Roadmap for the deep dive

  • I will explain, in order:
  • Core model architecture (hybrid Mamba–Transformer + MoE routing) because it determines active compute and throughput.
  • Pretraining data + schedule because the model’s base capabilities and long-context extension come from these choices.
  • Long-context extension (CPT) because 1M context is a core capability claim and has explicit training tactics.
  • Post-training (SFT, RLVR, RLHF) because the agentic/reasoning/chat performance depends on this pipeline.
  • Quantization (BF16→FP8 PTQ) because throughput claims depend on precision strategy and what stays BF16.

3.4 Detailed, sentence-based technical breakdown

This is an empirical systems + training recipe paper: it combines an architectural choice (hybrid backbone + sparse MoE) with a large-scale training/post-training pipeline to optimize capability per unit inference cost.

3.4.1 Architecture: hybrid Mamba–Transformer with granular MoE

  • Hybrid backbone components
  • The model combines Mamba-2 blocks and attention blocks (specifically GQA, grouped-query attention) as in prior Nemotron hybrids (§1, §2.1, Figure 2).
  • GQA reduces the number of key/value heads relative to query heads to lower KV-cache and attention compute; here the architecture uses 32 Q heads and 2 KV heads with head dimension 128 (Table 1).
  • Mamba configuration is explicitly given (Table 1): Mamba state dimension = 128, Mamba groups = 8, Mamba heads = 64, Mamba head dimension = 64.

  • MoE instead of dense FFN

  • Instead of a dense feed-forward network at each layer, the model uses sparse Mixture-of-Experts layers (§2.1).
  • An MoE layer contains multiple “expert” MLPs; a router chooses which experts to activate for each token.
  • Nemotron 3 Nano uses a granular MoE setup with 128 routable experts, activating 6 experts per token, plus 2 shared experts (Table 1; §1 says “activates 6 out of 128 experts (§2.1)”).
  • Parameter counts: 31.6B total parameters, with 3.2B active per forward pass (3.6B including embeddings) (§1, §2.1). This is the core mechanism behind the “less than half the parameters per forward pass” claim in the abstract.

  • Router + activation details

  • The MoE uses squared ReLU activation and a learned MLP router with sigmoid gating (§2.1).
  • Load balancing: during pretraining it uses DeepSeek’s aux-loss-free load balancing strategy with update rate 1e-3, together with a standard load balancing loss (Lepikhin et al., 2020) with coefficient 1e-4 (§2.4).

  • Other architectural “defaults”

  • The model uses RMSNorm, no positional embeddings, no dropout, and no bias on linear layers, and it does not tie embedding and projection weights (§2.1).
  • Assumption note: the paper excerpt does not specify the tokenizer, vocabulary size, or the exact number/placement of attention layers beyond “6 out of 52 attention layers” mentioned in quantization sensitivity (§4.2). Figure 2 shows a layer pattern visually, but the exact per-layer enumeration is not fully spelled out in the provided text.

3.4.2 Pretraining pipeline: data, curriculum, and hyperparameters

  • Data scale and novelty
  • Pretraining uses 25 trillion text tokens total (Abstract; §2.4), including >3 trillion new unique tokens over Nemotron 2 (Abstract).
  • The corpus spans 15 categories (§2.3). The mixture includes:

    • Web crawl buckets using the Nemotron-CC taxonomy (crawl-medium, crawl-medium-high, syn-crawl-medium-high, crawl-high, syn-crawl-high),
    • Domain data: math, Wikipedia, code, nemotron-cc-code, academic text, Crawl++ (OpenWebText/BigScience/Reddit), multilingual (19 languages listed), and synthetic SFT-style datasets grouped into general/stem/code (§2.3).
  • Two-phase pretraining curriculum

  • Phase 1 emphasizes diversity: 23.5T tokens (§1).
  • Phase 2 emphasizes high quality (e.g., more Wikipedia): 1.5T tokens (§1).
  • The transition happens at 94% of training (§2.3), and Figure 3 provides the exact category percentages for each phase.

    • Example: Phase 1 has syn-crawl-high = 20.4%, code = 14.0%, syn-crawl-medium-high = 11.7%, stem-sft = 11.1% (Figure 3a).
    • Phase 2 increases stem-sft to 22.3% and math to 12.5%, keeps syn-crawl-high = 20.4% and code = 14.0% (Figure 3b).
  • Key newly described datasets (what they do mechanically)

  • Nemotron-CC-Code-v1 (427.92B tokens) is built from Common Crawl code pages by:
    1. Identifying likely code pages via a fast classifier,
    2. Rendering raw HTML with Lynx to preserve code formatting,
    3. Cleaning with an LLM stage (Phi-4) that removes boilerplate but preserves code/math,
    4. Applying a code-quality relevance classifier to filter non-technical pages,
    5. Standardizing equations to LaTeX and preserving code block structure (§2.2.1).
  • Nemotron-Pretraining-Code-v2 augments code via:
    • New GitHub curation up to April 15, 2025, with deduplication against existing corpus (§2.2.2),
    • Synthetic Q&A / student-teacher / code review dialogs generated with Qwen3 32B seeded by source code,
    • LLM-based code rewriting (SGCR/SCOR-style prompts) with post-checks for syntax errors and Pylint scoring,
    • Python→C++ transpilation as augmentation (§2.2.2).
  • Nemotron-CC-v2.1 adds:
    • Three newer CC snapshots (CC-MAIN-2025-18/21/26),
    • Much larger synthetic rephrasing: five prompts applied to Medium-High-Quality data across 110 snapshots to produce 2.1T new tokens,
    • Translation-to-English from 9 languages followed by quality filtering; later an extra LLM-based filter removes ~10.6% of tokens due to “uninformative translated documents” (§2.2.3).
  • Nemotron-Pretraining-Specialized-v1 is synthetic specialized corpora:

    • Revised Wikipedia, synthetic math textbooks, synthetic scientific coding, and the InfiniByte cross-domain code generation approach,
    • A large RQA dataset: from ~14M filtered Essential-Web STEM docs → stratified sampling → two-step question generation and answering with Qwen3-235B-A22B-Thinking-2507 → 4.3M demonstrations totaling ~31.7B unique tokens (§2.2.4).
  • Pretraining optimization hyperparameters

  • Learning rate schedule: Warmup–Stable–Decay (§2.4):
    • Warm up over 8.4B tokens to max LR 1e-3,
    • Hold max LR for 80% of training (20T tokens),
    • Decay to min LR 1e-5 over the last 20% (5T tokens).
  • Optimizer: AdamW with weight_decay=0.1, β1=0.9, β2=0.95 (§2.4).
  • Training sequence length: 8192.
  • Batch size: 3072, yielding ~25 million tokens per batch (§2.4).
  • Missing detail note: the excerpt does not provide gradient clipping, Adam epsilon, tokenizer details, or total training FLOPs / PF-days.

3.4.3 Long-context extension to reach 1M tokens (continuous pretraining phase)

  • Why a separate phase exists
  • Base pretraining runs at sequence length 8,192 tokens (§2.4), which is not enough to directly support 256K–1M contexts.
  • The model adds an LC-Phase at the end of pretraining for long-context ability via continuous pretraining (CPT) (§2.5).

  • LC-Phase training configuration

  • LR: constant 1e-5.
  • Global batch size: 48.
  • Parallelism (H100 GPUs): 8-way context parallelism, 8-way tensor parallelism, 8-way expert parallelism, 4-way pipeline parallelism (§2.5).
  • Max sequence length: 256K tokens for synthetic retrieval-focused data; training also uses 512K sequences in experiments (§2.5).
  • Data blend:
    • 20% long-context document QA (reused from Nemotron 2 Nano, scaled 3Ă— larger),
    • 1% synthetic retrieval-focused data,
    • 79% downscaled Phase 2 data (§2.5).
  • Total LC-Phase tokens: 121B tokens (§2.5).

  • Mitigating short-context regressions

  • Attempting CPT only with 512K sequence batches slightly hurts short-context benchmarks.
  • The fix is mixing 512K and 4K sequences, which improves short-context scores (notably MMLU-Pro and Code) while also improving long-context scores (§2.5).

3.4.4 Post-training: SFT → RLVR (multi-environment) → RLHF (GenRM) → RLVR

  • SFT stage (what changes and how)
  • The SFT data differs from pretraining SFT-style inclusions: it is more focused on agentic tasks and uses a chat template (§3.1).
  • Chat template reasoning behavior (§3.1.1, Figure 4):

    • The template supports reasoning vs non-reasoning mode.
    • In multi-step tool workflows, “existing reasoning tokens are preserved” across assistant calls within a step sequence.
    • In multi-turn conversation, when a new user message appears, previous-turn reasoning is dropped, and only current-turn reasoning is materialized into the prompt (Figure 4).
    • Tool calls use XML-style tags to reduce escaping issues (§3.1.1).
  • SFT data sources and what each teaches

  • Math and tool-integrated math traces distilled from GPT-OSS 120B (§3.1.2).
  • Code data from Nemotron 2 Nano prompts with DeepSeek-R1-0528 responses (§3.1.2).
  • Synthetic multi-turn tool-use trajectories generated by multiple teacher models and filtered by an LLM judge for consistency (§3.1.2).
  • Long-context synthetic data with mean length 128K and max 256K tokens validated against RULER subsets (§3.1.2).
  • Formal proofs: Lean 4 theorem statements and proof traces (final: 300K examples) built via autoformalization + proof search with compiler verification (§3.1.2).
  • Terminal use: verifiable tasks based on Terminal Bench + SWE-Smith, with trajectories generated by large teacher agents (§3.1.2).
  • Software engineering: tasks from SWE-Gym and R2E-Gym, distilled from agent harnesses (OpenHands, SWE-Agent, Mini-SWE-Agent) with Qwen3-Coder-480B-A35B as teacher (§3.1.2).
  • Safety: unsafe prompts wrapped to train refusal behaviors; responses filtered by a content-safety classifier (§3.1.2).
  • Instruction following: data generated and filtered using IFBench/IFEval verifiers and an LLM judge to avoid trivial compliance (§3.1.2).

  • SFT filtering and control knobs

  • Filtering removes malformed examples, repetitive/pathological reasoning, and targeted keyword/regex patterns to mitigate politically nationalistic narratives in synthetic traces (§3.1.3).
  • Reasoning controls (§3.1.5):

    • Reasoning on/off: strip reasoning traces from a random 10% of samples.
    • Reasoning budget control: truncate 3% of reasoning traces to different budgets and continue with the post-reasoning response.
  • SFT optimization hyperparameters

  • Steps: 13,000.
  • Batch size: 64.
  • Sequence packing to length: 256K.
  • LR: 5e-5 with 800 warmup steps.
  • MoE load balancing regularizer coefficient: 1e-4 (§3.1.6).

  • RLVR: reinforcement learning from verifiable rewards (multi-environment)

  • RLVR here means the reward is computed by automatic verifiers (unit tests, schema checks, database state checks, judges) rather than purely human preference.
  • Environments include (§3.2.1):
    • Math: DAPO (17K tasks) + SkyWorks math (104K).
    • Competitive coding: 22K tasks after limiting unit tests to 50.
    • Document-grounded STEM MCQ QA: 135K tasks.
    • JSON schema adherence: 9K tasks with reward only for exact schema match (no semantic reward).
    • Instruction following: 46K (verifier-based) + 3K (LLM-judge multi-turn subtle constraints).
    • Long-context QA: 12K tasks with 32K input limit referencing ≥5 documents; judged by an LLM.
    • Tool use agentic: Workplace Assistant (690 tasks; 26 tools; verified by DB state) + banking multi-turn tool environment (~1K tasks; verified by DB state).
  • Curriculum sampling (§3.2.2, Figures 6–7):

    • Tasks are profiled with the SFT checkpoint; tasks with 100% pass rate are filtered out.
    • Sampling targets a Gaussian pass-rate distribution that shifts from easier to harder over training, aiming for stable multi-domain improvement.
    • Figure 7 shows curriculum sampling improves stability vs random sampling on GPQA, LiveCodeBench, AIME 2025, and IFBench Prompt (qualitatively; exact numeric deltas are not printed in the figure).
  • RLVR algorithm and key hyperparameters

  • Algorithm: synchronous GRPO with masked importance sampling to reduce training–inference misalignment (§3.2.5).
  • Per step: 128 prompts, 16 generations per prompt.
  • Batch size for updates: 2048, described as making updates on-policy (§3.2.5).
  • Stability choices:
    • Freeze the MoE router weights during RL.
    • Continue aux-loss-free load balancing updates for expert bias (§3.2.5).
  • Max generation length: 49K tokens.
  • Uses overlong filtering (Yu et al., 2025) to boost reasoning benchmarks (§3.2.5).
  • Figure 9 plots benchmark trajectories across RL training (AALCR, AIME25, GPQA, IFBench, LiveCodeBench, MMLU Pro, SciCode, Tau Average), showing generally rising curves for several metrics, though Tau Average appears noisy.

  • RLHF with a generative reward model (GenRM)

  • Instead of a scalar reward model (e.g., Bradley–Terry classifier), the paper trains a generative reward model that reads a prompt + two candidate responses and outputs:
    • Reasoned critique,
    • Helpfulness scores for each response,
    • A preference ranking (§3.3.1).
  • GenRM training objective (Equation 1) uses a reward:
    • Penalize format violations (I_format),
    • Penalize absolute errors between predicted and ground-truth helpfulness scores for both responses,
    • Penalize absolute error in predicted vs ground-truth ranking.
    • Hyperparameters set to C1=10, C2=1 (§3.3.1).
  • Efficient RLHF comparisons (§3.3.2):

    • For N=16 responses per prompt, full pairwise comparisons would be 120; instead the paper uses a circular comparison graph requiring exactly N comparisons per prompt.
    • Each response is judged twice in different positions to reduce positional bias.
  • Group Relative Length Control (GRLC) to reduce verbosity during RLHF

  • Problem: RLHF tends to increase response length (especially reasoning traces) even when not needed (§3.3.2).
  • Mechanism: adjust rewards relative within a prompt’s group so shorter (but still high-quality) responses get bonuses.
  • Key equations (Equations 4–6):
    • For each response i in a group of N, split it into thinking and answer parts with lengths â„“_i^(think) and â„“_i^(answer).
    • Compute normalized within-group “shortness weight” for thinking:
    • w_i^(think) = 1 - (â„“_i^(think) - â„“_min^(think)) / (â„“_max^(think) - â„“_min^(think))
    • Then center it to be zero-mean across the group: ŵ_i^(think) = w_i^(think) - (1/N) ÎŁ_j w_j^(think)
    • Do the same for the answer part to get ŵ_i^(answer).
    • Final reward:
    • R_i = R_i^(base) + λ^(think) ŵ_i^(think) + λ^(answer) ŵ_i^(answer)
    • With λ^(think)=0.5, λ^(answer)=0.5 (§3.3.2).
  • Optional conciseness bonus:
    • Give a bonus β^(think) to the shortest-thinking response if its base reward is in the top pth percentile; same for shortest answer with β^(answer); set β^(think)=0.5, β^(answer)=0.5, p=80 (§3.3.2).
  • Reported effect: verbosity reduces ~30% without sacrificing accuracy (§3.3.2). (No table of before/after verbosity is included in the excerpt, so this is a qualitative claim.)

  • Worked micro-example: how GRLC changes rewards (toy numbers)

  • Suppose a prompt produces N=4 candidate responses with base rewards (from GenRM comparisons) and thinking lengths:
    • R^(base) = [2.0, 1.8, 1.9, 2.1]
    • â„“^(think) = [100, 200, 400, 100] tokens
  • Compute â„“_min=100, â„“_max=400. Then
    • w^(think) = [1-(0/300), 1-(100/300), 1-(300/300), 1-(0/300)]
    • w^(think) = [1.00, 0.67, 0.00, 1.00]
  • Mean is (1 + 0.67 + 0 + 1)/4 = 0.6675. Center:
    • ŵ^(think) = [0.3325, ~0.0025, -0.6675, 0.3325]
  • With λ^(think)=0.5, the thinking-length reward adjustment is:
    • ΔR = 0.5 * ŵ^(think) = [0.166, 0.001, -0.334, 0.166] (approx.)
  • So the long-thinker (400 tokens) gets a noticeable penalty, while the short-thinkers get small bonuses—but only relative to the other candidates for the same prompt, which is the key design intent in §3.3.2.

3.4.5 Quantization: BF16 model → FP8 PTQ with selective BF16 retention

  • Goal
  • Increase inference throughput with minimal accuracy loss by quantizing to FP8 after post-training (§4).

  • Calibration dataset

  • PTQ calibration uses 1K samples from the post-training reasoning SFT dataset, and this performs slightly better than using cnn_dailymail for accuracy recovery (§4.1).
  • An ablation using on-policy generations as calibration data shows no benefit over SFT-based calibration (§4.1).

  • Selective quantization strategy

  • Sensitivity analysis finds attention layers are most sensitive:
    • Nemotron 3 Nano has 6 self-attention layers out of 52, and these are kept in BF16 (§4.2).
    • The Mamba layers feeding into those attention layers are also sensitive and kept in BF16.
    • Conv1D in all Mamba layers is kept BF16 (§4.2).
  • Weights, activations, and KV cache are quantized to FP8 in the FP8 checkpoint, with the above exceptions (§4.2).

  • Accuracy recovery after quantization

  • Table 4 shows small degradations on several benchmarks; examples:
    • MMLU-Pro: 78.30 → 78.10
    • AIME25 (no tools): 89.06 → 87.71
    • LiveCodeBench v6: 68.25 → 67.62
    • Some metrics slightly improve (e.g., IFBench (prompt) 71.51 → 72.19; AA-LCR 35.85 → 36.06).
  • The paper summarizes this as ~99% median accuracy recovery (Table 4 narrative in §4.3).

  • Throughput trade-off evidence

  • Figure 11 plots throughput improvement vs accuracy recovery across quantization configurations and indicates the “selective quant” setting yields a strong trade-off; exact coordinates are not numerically listed in the text, but the axis suggests roughly ~300–350% throughput improvement vs BF16 at near ~99–100% recovery for the final FP8 config.

4. Key Insights and Innovations

  • (1) Sparse scaling with granular MoE in a hybrid Mamba–Transformer backbone
  • The model combines hybrid sequence modeling (Mamba + attention) with MoE replacing dense FFNs (§2.1).
  • The key novelty in this report’s framing is the very low active parameter count relative to total: 3.2B active / 31.6B total per forward pass (§2.1), aiming to improve the throughput–accuracy frontier.

  • (2) Two-phase pretraining plus explicit long-context CPT with sequence-length mixing

  • Two-phase token curriculum (diversity then quality, switching at 94%) is central to the base model training design (§2.3).
  • The long-context CPT phase includes a concrete mitigation for short-context regressions: mixing 512K and 4K sequences instead of only 512K (§2.5). This is a practical training insight for long-context extension.

  • (3) Multi-environment RLVR as a stability mechanism

  • The post-training emphasizes training on many environments simultaneously to avoid “un-recoverable degradation” that can happen in single-environment RL (§3.2).
  • Curriculum sampling based on pass-rate profiling is presented as important for learning hard tasks rather than collapsing to easy ones (Figures 6–7).

  • (4) RLHF efficiency + verbosity control via group-relative reward shaping

  • The paper’s RLHF design reduces GenRM comparison cost from O(N^2) to O(N) via circular comparisons (§3.3.2).
  • The Group Relative Length Control mechanism (Equations 4–6) is a concrete, mathematically specified approach to discouraging unnecessary long reasoning traces while keeping the competition “fair” within each prompt group.

  • (5) Selective FP8 PTQ with BF16 islands for sensitive submodules

  • Rather than quantizing everything uniformly, the method keeps specific layers (attention + preceding Mamba, plus Conv1D) in BF16 based on sensitivity (§4.2), and shows near-full accuracy recovery (Table 4) with substantial throughput gains (Figure 11).

5. Experimental Analysis

  • Evaluation methodology and tooling
  • Base model evaluation uses Nemo Evaluator SDK and LM Evaluation Harness with benchmark-specific few-shot settings (§2.6).
  • Post-trained evaluation uses Nemo Evaluator SDK and Nemo Skills Harness for most benchmarks, plus official implementations for Terminal Bench, SWE-Bench, and Scale AI Multi Challenge (§3.4.1).
  • Some comparisons to other models use officially reported numbers or ArtificialAnalysis when not available, and sometimes the authors compute scores themselves (§3.4.1). This means baseline comparability may vary by benchmark depending on source.

  • Base model results (Nemotron 3 Nano 30B-A3B Base)

  • Table 2 compares against Qwen3-30B-A3B-Base across categories.
  • Strong gains for Nemotron base on several tasks, e.g.:
    • MMLU-Pro (5-shot, CoT EM): 65.05 vs 61.71
    • AGIEval-En: 68.32 vs 63.12
    • HumanEval: 78.05 vs 70.73
    • GSM8K: 92.34 vs 89.01
    • MATH: 82.88 vs 61.14
    • Long-context RULER shows very large gains:
    • RULER (64K): 87.50 vs 63.55
    • RULER (128K): 82.92 vs 60.69
    • RULER (256K): 75.44 (Qwen base not reported in Table 2)
  • Regressions also exist:

    • MMLU (5-shot): 78.56 vs 81.07
    • ARC-Challenge: 91.89 vs 94.45
    • RACE: 88.04 vs 90.05
    • MMLU Global Lite: 74.47 vs 76.84
  • Post-trained final model results

  • Table 3 compares Nemotron 3 Nano against Qwen3-30B-A3B-Thinking-2507 and GPT-OSS 20B.
  • Examples of where Nemotron 3 Nano leads in Table 3:
    • IFBench (prompt): 71.51 vs 51.00 (Qwen3) vs 65.00 (GPT-OSS)
    • SWE-Bench (OpenHands): 38.76 vs 22.00 vs 34.00
    • RULER-100 @ 1M: 86.34 vs 77.50 (Qwen3) (GPT-OSS not reported; GPT-OSS context length noted as 128K in Figure 1 caption)
    • LiveCodeBench v6: 68.25 vs 66.00 vs 61.00
    • MMLU-Pro: 78.30 vs 80.90 vs 75.00 (Nemotron is not best here; Qwen3 is higher)
    • AIME25 (no tools): 89.06 vs 85.00 vs 91.70 (GPT-OSS higher)
    • AIME25 (with tools): 99.17 vs 98.7 (GPT-OSS; Qwen not listed)
  • Agentic benchmarks show mixed per-subtask outcomes:
    • TauBench V2 average: 49.04 vs 47.70 vs 47.50, but Qwen3 beats Nemotron on Airline and Retail while Nemotron leads on Telecom.
    • Terminal Bench (hard subset): 8.51 vs 5.00 vs 10.00 (GPT-OSS higher).
  • A notable outlier is multilingual reasoning:

    • MMLU-ProX (avg over langs): 59.50 vs 77.60 vs 69.10 (Nemotron trails both baselines in Table 3).
  • Throughput results

  • Figure 1 reports relative throughput (output tokens/s/GPU) in an 8K input / 16K output scenario:
    • Nemotron 3 Nano shows 3.3Ă— throughput relative to Qwen3-30B-A3B-Thinking-2507 and 2.2Ă— relative to GPT-OSS-20B, measured on a single H200 GPU using best-of vLLM vs TRT-LLM configurations.
    • Precision differs per model in this throughput test (Figure 1 caption):
    • Nemotron 3 Nano and Qwen3 measured with FP8 weights + activations,
    • GPT-OSS-20B measured with mxfp4 weights and bfloat16 activations.
  • Interpretation caveat: because precision settings differ, the throughput comparison is not “same precision across models,” though it is arguably “best available configuration per model on H200” as described in the caption.

  • Do the experiments support the claims?

  • The paper provides:
    • Clear architectural/accounting evidence for sparse activation (Table 1; §2.1).
    • Quantitative accuracy comparisons for base (Table 2) and post-trained (Table 3) models.
    • A specified throughput comparison scenario (Figure 1) plus quantization trade-off evidence (Table 4, Figure 11).
  • The evidence most strongly supports:
    • High long-context benchmark performance (RULER) at multiple lengths (Table 2, Table 3).
    • Throughput advantage in the chosen generation-heavy scenario on H200 (Figure 1).
    • Generally strong post-training on several agentic/instruction-following benchmarks (Table 3).
  • The evidence is more mixed for:
    • General knowledge and multilingual reasoning where some baselines outperform Nemotron (Table 2, Table 3).
    • “Best-in-class” style language: Table 3 has several benchmarks where Nemotron is not strictly best (e.g., MMLU-Pro, AIME25 no-tools, Terminal Bench hard subset, MMLU-ProX).

6. Limitations and Trade-offs

  • Comparability and evaluation sourcing
  • Baseline numbers come from multiple sources (official reports, ArtificialAnalysis, and sometimes the authors’ own runs) (§3.4.1). This can introduce inconsistency in exact evaluation settings across models.

  • Multilingual reasoning weakness (relative to baselines shown)

  • On MMLU-ProX, Nemotron 3 Nano is substantially lower than both Qwen3 and GPT-OSS in Table 3 (59.50 vs 77.60 and 69.10), suggesting multilingual reasoning is a clear gap in the presented comparison.

  • Trade-off between short-context and long-context training

  • Long-context CPT initially harmed short-context scores when using only 512K sequences; the final method mitigates this by mixing 512K and 4K (§2.5), but this indicates long-context extension is delicate and may require careful tuning.

  • MoE-specific complexity

  • Sparse MoE introduces:
    • Router/load-balancing design and stability considerations (aux-loss-free load balancing, load balancing loss coefficients, router freezing during RL) (§2.4, §3.2.5).
    • Parallelism complexity for training (expert parallelism is explicitly used in LC-Phase, §2.5).
  • These can complicate reproduction and deployment compared to dense models, even if per-token active compute is smaller.

  • Quantization sensitivity and partial BF16 retention

  • The FP8 model is not “fully FP8 everywhere”: attention layers and certain Mamba components remain BF16 (§4.2). This is a practical trade-off: better accuracy recovery at the cost of less aggressive compression.

  • Missing implementation specifics in the provided excerpt

  • Some details that often matter for full reproduction are not present in the excerpt (e.g., tokenizer/vocab, exact attention-layer placement pattern beyond Figure 2, training compute budget in FLOPs, full hardware counts for the 25T-token pretraining run).

7. Implications and Future Directions

  • How this work changes the landscape (based on the provided content)
  • It demonstrates a concrete recipe for combining:
    • Hybrid Mamba–Transformer backbones,
    • Sparse MoE with very low active parameter counts,
    • Multi-environment RL for agentic/reasoning improvements,
    • Long-context CPT to reach 1M tokens,
    • Selective FP8 PTQ for throughput gains,
    • While releasing checkpoints + substantial data/recipes (Introduction, §1; dataset bullets; Conclusion).
  • The combination targets deployment constraints directly: the model is designed to be fast in generation-heavy settings (Figure 1) while remaining competitive on a broad benchmark suite (Table 3).

  • Follow-up research directions suggested by the paper’s choices

  • Better multilingual capability: The large MMLU-ProX gap (Table 3) suggests room to adjust multilingual data mixture, post-training data, or RL environments toward multilingual reasoning.
  • Long-context robustness beyond RULER: The model reports strong RULER-100 up to 1M (Table 3), but AA-LCR is much lower than Qwen3 (35.85 vs 59.00). This suggests future work could focus on improving “coherence over extreme context” tasks like AA-LCR while preserving RULER gains.
  • Reward design for structured outputs: The RL structured-output environment rewards only schema validity, not semantic correctness (§3.2.1). A natural extension is adding semantic rewards while keeping verifiability.
  • More principled verbosity control: Group Relative Length Control is a clear first step (Equations 4–6). Future work could test more granular controls (e.g., per-section budgets) or explicit task-conditioned length targets.

  • Practical applications / downstream use cases (as implied by benchmarks and datasets)

  • Agentic tool use in controlled environments (Workplace Assistant DB tools; banking tool simulation) (§3.2.1).
  • Software engineering agents (SWE-Bench via OpenHands; SWE-Gym/R2E-Gym derived tasks) (§3.1.2, Table 3).
  • Terminal-based automation (Terminal Bench-style tasks) (§3.1.2, Table 3).
  • Long-context retrieval and QA up to 1M tokens (RULER-100 results, Table 3).

  • Repro/Integration Guidance (grounded in the paper’s reported setup)

  • If your priority is inference throughput while keeping a ~30B-parameter-class model, the paper’s strategy is:
    • Use a sparse MoE design with low active parameters (3.2B active) (§2.1),
    • Serve with FP8 where possible (Figure 1 uses FP8 for Nemotron and Qwen3 throughput),
    • Consider selective BF16 retention for sensitive layers if using PTQ (§4.2–4.3).
  • If your priority is agentic + reasoning across diverse tasks, the paper suggests:
    • SFT on diverse agentic traces with explicit reasoning controls (§3.1.5),
    • RLVR on many environments simultaneously to avoid regressions (§3.2),
    • RLHF with a scalable GenRM scoring method plus verbosity controls (§3.3).
  • If your priority is 1M context, the key integration insight is that long-context capability comes from an explicit CPT phase with:
    • Long sequences (up to 512K) but mixed with short ones (4K) to protect short-context quality (§2.5),
    • A blend that includes document QA and retrieval-focused synthetic data (§2.5).