NVIDIA Nemotron 3: Efficient and Open Intelligence

ArXiv: 2512.20856

🎯 Pitch

Nemotron 3 introduces a family of open models (Nano, Super, Ultra) that combine a hybrid Mamba–Transformer Mixture-of-Experts architecture with hardware-aware LatentMoE, Multi-Token Prediction, NVFP4 training, and multi-environment RL post-training to deliver high accuracy while dramatically improving inference throughput and scaling context to 1M tokens. This matters because it lowers the compute and latency cost of agentic, long-horizon reasoning and tool-using systems, enabling practical, scalable deployments that keep performance competitive on reasoning, coding, and long-context benchmarks.


1. Executive Summary

Nemotron 3 is a family of open models (Nano, Super, Ultra) designed to maximize accuracy per unit inference cost for agentic and reasoning workloads by combining Mixture-of-Experts (MoE) sparsity with a hybrid architecture of Mamba-2 layers and a small number of self-attention layers (Figure 1, §2.1). The paper’s main significance is showing how this hybrid design, plus hardware-aware MoE design (LatentMoE, §2.2), Multi-Token Prediction (MTP, §2.3), and large-scale multi-environment RL post-training (§2.6), can push throughput and context length (up to 1M tokens, §2.5) without giving up strong benchmark performance (Figure 2, Tables 1–3).

2. Context and Motivation

  • Problem / gap addressed
  • Building agentic AI systems (multi-step reasoning, tool use, long conversations, codebases, RAG over large documents) stresses two bottlenecks simultaneously:
    1. Inference efficiency (throughput + latency) during long generations and large-scale rollout sampling.
    2. Long-context capability (hundreds of thousands to a million tokens).
  • Standard Transformer inference gets expensive at long generation lengths because self-attention must attend over a growing KV cache (memory and compute scale with sequence length), which the paper frames as particularly costly for “common reasoning workloads” with long inputs and outputs (§2.1).

  • Why this matters

  • Agentic systems often require repeated calls to the model (planning, tool calls, verification, reflection, multi-agent collaboration). If each call is expensive, overall system cost and latency explode.
  • Long-context support reduces the need to aggressively chunk documents/code or rely on lossy retrieval heuristics.

  • Prior approaches and shortfalls (as positioned here)

  • The paper contrasts with “standard Transformers” and “Transformer MoEs” by arguing attention-heavy designs incur high KV-cache costs during generation (§2.1).
  • It also highlights MoE deployment bottlenecks: memory-bandwidth bound in latency-focused settings and all-to-all communication bound in throughput-focused settings (§2.2). Standard MoE does not directly optimize for both.

  • Positioning

  • Nemotron 3 positions itself as an efficiency-first open model family for agentic AI: hybrid Mamba–Transformer MoE for throughput (§2.1), LatentMoE for “accuracy per byte” (§2.2), MTP for faster generation (§2.3), NVFP4 training for efficient pretraining (§2.4), long-context to 1M (§2.5), and multi-environment RL post-training for agentic competence (§2.6), with inference-time “reasoning budget control” (§2.7).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a family of large language models optimized to be fast to run while still being good at reasoning and agentic tasks.
  • It achieves this by combining sparse computation (MoE) with cheap sequence modeling (Mamba-2) and only a small amount of expensive global mixing (self-attention), then post-training with reinforcement learning across many task environments.

3.2 Big-picture architecture (diagram in words)

  • Input tokens → stack of layers that are mostly Mamba-2 and MoE → a few self-attention layers interspersed for global routing (Figure 1, §2.1) → (for larger models) MTP head/module to predict multiple next tokens (§2.3) → generated tokens.
  • For Super/Ultra, MoE layers use LatentMoE (operate experts in a smaller latent dimension) and are trained with NVFP4 quantization-aware numeric choices (§2.2, §2.4).
  • After pretraining/SFT, models undergo multi-environment RL with an asynchronous sampling/training system (§2.6).

3.3 Roadmap for the deep dive

  • First, explain why the model minimizes attention and uses Mamba-2 + MoE (§2.1) because that drives the throughput story.
  • Second, explain LatentMoE (§2.2) because it changes the core MoE compute/communication trade-off.
  • Third, explain MTP (§2.3) because it affects generation speed and training signal.
  • Fourth, explain NVFP4 training details (§2.4) because they’re crucial for training feasibility and stability.
  • Fifth, explain long-context design choices (§2.5) and then RL post-training for agentic behavior (§2.6–§2.7).

3.4 Detailed, sentence-based technical breakdown

This is primarily an empirical systems + modeling paper: it introduces an efficiency-oriented model architecture (hybrid Mamba–Transformer MoE), plus training/post-training techniques (LatentMoE, MTP, NVFP4, multi-environment RL) aimed at agentic performance at high throughput.

3.4.1 Core model architecture: Hybrid Mamba-2 + MoE + a few self-attention layers (§2.1, Figure 1)

  • A standard Transformer layer’s key expensive operation at inference is self-attention, because generating each new token requires attending over all previous tokens and maintaining a growing KV cache (§2.1).
  • Mamba-2 (a structured state-space model layer) is used because it requires storing only a constant-sized state during generation rather than a KV cache that grows with sequence length (§2.1). In the paper’s framing, this makes Mamba-2 “cheaper” for long generation.
  • Nemotron 3 therefore interleaves MoE layers with Mamba-2 layers most of the time, and includes only a select few self-attention layers (Figure 1, §2.1).
  • The architecture aims to keep:
  • MoE for sparse scaling (many parameters, but only a subset active per token) to increase accuracy per compute budget (§2.1).
  • A few self-attention layers for high-fidelity all-to-all information routing when needed (§2.1).
  • Many Mamba-2 layers for fixed inference-time computation and memory with respect to sequence length (§2.1).
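
The memory argument above can be made concrete with a back-of-the-envelope sketch. Only the "2 KV heads" value comes from the summary (§2.4); the layer counts, head dimension, and recurrent state size are hypothetical placeholders chosen to show the scaling, not Nemotron 3's real configuration.

```python
def kv_cache_bytes(seq_len, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-sequence KV cache size; grows linearly with seq_len."""
    # The factor 2 covers keys and values.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_ssm_layers, state_elems_per_layer, bytes_per_elem=2):
    """Per-sequence recurrent state size; independent of seq_len."""
    return n_ssm_layers * state_elems_per_layer * bytes_per_elem


# Hypothetical configurations (assumptions, not taken from the paper).
ATTENTION_ONLY = dict(n_attn_layers=48, n_kv_heads=8, head_dim=128)
HYBRID_ATTENTION = dict(n_attn_layers=6, n_kv_heads=2, head_dim=128)
HYBRID_SSM = dict(n_ssm_layers=21, state_elems_per_layer=500_000)

# 24,576 tokens corresponds to the 8k-input / 16k-output workload.
for seq_len in (8_192, 24_576, 1_000_000):
    dense = kv_cache_bytes(seq_len, **ATTENTION_ONLY)
    hybrid = kv_cache_bytes(seq_len, **HYBRID_ATTENTION) + recurrent_state_bytes(**HYBRID_SSM)
    print(f"{seq_len:>9} tokens: attention-only {dense / 2**20:10.1f} MiB, "
          f"hybrid {hybrid / 2**20:10.1f} MiB")
```

Whatever the exact numbers, the attention term scales with sequence length while the recurrent-state term does not, which is why fewer attention layers and fewer KV heads matter most at long generation lengths.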

Claimed effect (anchored to paper evidence):

  • For a “common reasoning workload” of 8k input / 16k output tokens, the paper reports Nemotron 3 Nano 30B-A3B achieves 3.3× higher throughput than Qwen3-30B-A3B (Figure 2, §2.1).
  • Figure 2 also positions the hybrid architecture as strong on reasoning and long-context tasks, but the most precise numbers in the provided excerpt are the throughput ratio and the benchmark bars (Figure 2). (The excerpt does not provide full methodological details for these benchmarks.)

3.4.2 LatentMoE: routing and expert compute in a smaller latent dimension (§2.2, Figure 3, Table 1)

Definitions (paper-specific / important):

  • MoE (Mixture-of-Experts) means each token is routed to a small set of “expert” feed-forward networks; only the selected experts run for that token.
  • Top-K (written as K) means the router activates the best K experts per token.
  • Hidden dimension (d) is the model’s main embedding width.
  • Latent dimension (ℓ) is a smaller width used only inside the routed expert computation in LatentMoE.

Why LatentMoE is introduced (hardware bottlenecks in §2.2):

  • The paper separates two deployment regimes:
    • Latency-focused: small batches; MoE cost is dominated by memory bandwidth to read expert weights. Expert matrices are size d × m (where m is the FFN intermediate dimension), so reducing d or m reduces weight-read cost (§2.2).
    • Throughput-focused: large batches; MoE cost is dominated by all-to-all communication to dispatch tokens to experts and combine outputs. Communication scales with K × d and is independent of m (§2.2).
  • The goal is to improve quality “without compromising inference throughput or latency,” so the design targets both weight-load and communication payload.

Mechanism (Figure 3b):

  • In LatentMoE, each token is:
    1. Projected down from hidden dimension d to latent dimension ℓ < d.
    2. Routed and processed by experts entirely in latent space.
    3. Projected back up to d.
  • Because expert compute and all-to-all traffic happen in ℓ instead of d, per-expert weight loads and communication payload are reduced by a factor of d/ℓ (§2.2, Figure 3 caption).
  • The paper “reinvests” these savings by scaling:
    • number of experts: N′ = N · d/ℓ
    • top-K active experts: K′ = K · d/ℓ
  • The model therefore gets more expert diversity and nonlinear capacity while keeping overall cost “approximately constant” (§2.2).
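
The "reinvestment" arithmetic can be checked with a small cost-proxy sketch. It counts elements rather than bytes, treats each expert as a single width × m matrix, and ignores the cost of the down/up projections; the value of m is a hypothetical placeholder, while d = 4096, ℓ = 1024, N = 128, and K = 6 are the ablation values reported in Table 1.

```python
def moe_cost_proxies(width, ffn_dim, top_k):
    """Per-token cost proxies for the routed experts (element counts, not bytes)."""
    return {
        "per_expert_weights": width * ffn_dim,          # weight read for one expert
        "active_weight_read": top_k * width * ffn_dim,  # weight read for all active experts
        "all_to_all": top_k * width,                    # dispatched payload per token
    }


d, latent, m = 4096, 1024, 2048  # m is a hypothetical FFN width
N, K = 128, 6

shrink = d // latent                      # d / l
N_scaled, K_scaled = N * shrink, K * shrink

standard = moe_cost_proxies(d, m, K)
latent_same_k = moe_cost_proxies(latent, m, K)       # savings before reinvestment
latent_scaled = moe_cost_proxies(latent, m, K_scaled)  # savings reinvested in top-K

print("shrink factor:", shrink)
print("scaled experts / top-K:", N_scaled, K_scaled)
print("standard:        ", standard)
print("latent, same K:  ", latent_same_k)
print("latent, scaled K:", latent_scaled)
```

Under these proxies, scaling K by d/ℓ brings the all-to-all payload and the active weight read back to the standard-MoE level, which is one way to read "approximately constant". The nominal scaled top-K here is 24, whereas Table 1 reports 22 active experts for the LatentMoE ablation; the excerpt does not explain the difference.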

What stays at full dimension d (quality preservation):

  • Non-routed computations remain in d, including:
    • the router/gating network,
    • shared expert computation,
    • non-expert layers (§2.2).
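
A minimal single-token sketch of the data flow described above, in plain Python. It is illustrative only: the class name, the ReLU expert FFN, the softmax over the selected experts' scores, and the omission of the shared expert are all assumptions, not details confirmed by the excerpt.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    s = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-s, s) for _ in range(cols)] for _ in range(rows)]


def softmax(z):
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    total = sum(e)
    return [v / total for v in e]


class LatentMoESketch:
    """Hypothetical illustration of routing in width d with experts in width latent."""

    def __init__(self, d, latent, ffn_dim, n_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = rand_matrix(n_experts, d, rng)  # gating stays in full width d
        self.down = rand_matrix(latent, d, rng)       # d -> latent
        self.up = rand_matrix(d, latent, rng)         # latent -> d
        self.experts = [
            (rand_matrix(ffn_dim, latent, rng), rand_matrix(latent, ffn_dim, rng))
            for _ in range(n_experts)
        ]

    def forward(self, x):
        scores = matvec(self.router, x)
        chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[: self.top_k]
        gates = softmax([scores[i] for i in chosen])  # assumed gate normalization
        z = matvec(self.down, x)                      # only this latent vector is dispatched
        mixed = [0.0] * len(z)
        for gate, i in zip(gates, chosen):
            w_in, w_out = self.experts[i]
            hidden = [max(0.0, v) for v in matvec(w_in, z)]  # assumed ReLU FFN
            out = matvec(w_out, hidden)
            mixed = [a + gate * b for a, b in zip(mixed, out)]
        return matvec(self.up, mixed), chosen


layer = LatentMoESketch(d=16, latent=4, ffn_dim=8, n_experts=8, top_k=2)
token = [random.Random(1).uniform(-1.0, 1.0) for _ in range(16)]
output, chosen = layer.forward(token)
print(len(output), chosen)
```

The point of the sketch is where each width appears: the router reads the full-width token, while everything that would cross the all-to-all boundary (the dispatched vector and the expert outputs) has width ℓ.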

Quantitative evidence (Table 1):

  • Comparison between a Standard MoE and LatentMoE, both around 8B active and ~73B total parameters, trained for 1T tokens with identical hyperparameters (Table 1):
    • Standard MoE: d = 4096, 128 total experts, 6 active experts.
    • LatentMoE: ℓ = 1024, 512 total experts, 22 active experts.
  • Accuracy improvements (all in %, Table 1):
    • MMLU-Pro: 48.30 → 52.87
    • MMLU: 70.10 → 72.11
    • Code: 51.95 → 55.14
    • Math: 78.32 → 80.19
    • Commonsense Understanding: 81.73 → 82.10
  • The paper interprets this as “consistently outperforms” across the evaluated tasks (Table 1).

3.4.3 Multi-Token Prediction (MTP) for speed and quality (§2.3, Table 2)

Definition: MTP trains the model to predict multiple future tokens (not just the next token) as auxiliary outputs (§2.3).

Why it helps (as argued here):

  • Predicting multiple future tokens provides richer training signal and encourages planning several steps ahead (§2.3).
  • These auxiliary predicted tokens can serve as draft tokens for speculative decoding, potentially accelerating generation without a separate draft model (§2.3).

Evidence (Table 2):

  • Ablation on an “8B active parameter transformer MoE base model trained on 1T tokens” shows MTP improves multiple tasks (Table 2):
    • MMLU (5-shot, acc): 70.06 → 71.26
    • MMLU-Pro (5-shot, CoT EM): 45.05 → 47.84
    • MBPP-Sanitized (3-shot): 65.58 → 66.89
    • ARC-Challenge (25-shot, acc_norm): 86.43 → 88.05
    • WinoGrande (0-shot, acc): 74.59 → 75.45
    • RACE (0-shot, acc): 84.02 → 85.36
    • GSM8K (8-shot, acc): 82.49 → 84.46
  • The paper summarizes this as ~2.4% average improvement across benchmarks in that ablation (§2.3).
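
The ~2.4% figure can be cross-checked against the Table 2 scores listed above. The sketch below computes both the mean absolute gain (in points) and the mean relative gain (in percent); which of the two the paper intends is not stated in the excerpt, so the interpretation is an assumption.

```python
# (baseline, baseline + MTP) scores as listed from Table 2.
TABLE_2 = {
    "MMLU": (70.06, 71.26),
    "MMLU-Pro": (45.05, 47.84),
    "MBPP-Sanitized": (65.58, 66.89),
    "ARC-Challenge": (86.43, 88.05),
    "WinoGrande": (74.59, 75.45),
    "RACE": (84.02, 85.36),
    "GSM8K": (82.49, 84.46),
}

abs_gains = {name: after - before for name, (before, after) in TABLE_2.items()}
rel_gains = {name: 100.0 * (after - before) / before for name, (before, after) in TABLE_2.items()}

mean_abs = sum(abs_gains.values()) / len(abs_gains)
mean_rel = sum(rel_gains.values()) / len(rel_gains)

print(f"mean absolute gain: {mean_abs:.2f} points")
print(f"mean relative gain: {mean_rel:.2f} %")
```

On these seven benchmarks the mean absolute gain is about 1.6 points and the mean relative gain is about 2.4%, so the paper's summary figure appears consistent with an average of relative improvements.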

Systems detail: The paper reports a lightweight MTP module gets ~97% acceptance on the first two predicted tokens in an ablation study on an 8B active MoE model (§2.3), which is directly relevant to speculative decoding speedups.
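
To see why the acceptance rate matters, here is a hedged sketch of greedy draft verification and of the expected tokens emitted per verification pass. The function names are hypothetical, and treating the ~97% figure as an independent per-token acceptance probability is a simplifying assumption; the excerpt does not define the metric that precisely.

```python
def accepted_prefix_len(draft_tokens, verified_tokens):
    """Number of leading draft tokens that match the main model's own choices."""
    n = 0
    for drafted, verified in zip(draft_tokens, verified_tokens):
        if drafted != verified:
            break
        n += 1
    return n


def expected_tokens_per_pass(accept_rate, n_draft):
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    accept_rate and that drafts after the first rejection are discarded.
    The leading 1 is the token the main model always contributes itself.
    """
    return 1 + sum(accept_rate ** k for k in range(1, n_draft + 1))


print(accepted_prefix_len([5, 7, 9], [5, 7, 2]))
print(round(expected_tokens_per_pass(0.97, 2), 4))
print(round(expected_tokens_per_pass(0.70, 2), 4))
```

Under these assumptions, two drafts at 97% acceptance yield close to three tokens per pass, while the same setup at 70% acceptance yields noticeably fewer; the realized wall-clock speedup also depends on the cost of the MTP module and of the verification pass, which the excerpt does not quantify.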

3.4.4 NVFP4 training for efficient pretraining at scale (§2.4, Figure 4, Figure 5)

Definition (paper-specific): NVFP4 is a 4-bit numeric format and associated training recipe where weights/activations/gradients are quantized to enable FP4 GEMMs (matrix multiplications) in forward and backward passes (§2.4).

Claimed capability: The paper reports “stable and accurate pretraining” of the hybrid Mamba–MoE architecture for up to 25T tokens using NVFP4 (§2.4). (The excerpt does not specify which Nemotron 3 model(s) reach 25T; it is framed as a demonstrated capability.)

Training-recipe mechanisms (as described in §2.4):

  • Quantization uses:
    • fine-grained micro-block scaling (16 elements),
    • scaling factors in E4M3,
    • a second-level FP32 global scale,
    • element format E2M1 (§2.4).
  • They apply:
    • 2D block scaling for weight quantization,
    • Random Hadamard Transforms (RHTs) on inputs to wgrad,
    • stochastic rounding on gradients (§2.4).
  • They keep the last 15% of the network in high precision to maintain stability (§2.4).
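
The micro-block scaling and stochastic rounding steps can be illustrated with a simplified, pure-Python simulation. It uses the representable E2M1 magnitudes and a 16-element block, but keeps the per-block scale as an ordinary float instead of quantizing it to E4M3 with an FP32 global scale, and it omits 2D block scaling and Hadamard transforms. Treat it as a teaching sketch under those stated simplifications, not as the paper's kernel.

```python
import random

E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # representable |values| in E2M1
BLOCK = 16                                                  # micro-block size


def round_nearest(mag):
    # Ties resolve to the smaller magnitude here; real hardware rounding may differ.
    return min(E2M1_MAGNITUDES, key=lambda g: abs(g - mag))


def round_stochastic(mag, rng):
    """Pick one of the two neighbouring grid points, unbiased in expectation."""
    lo = max(g for g in E2M1_MAGNITUDES if g <= mag)
    hi = min(g for g in E2M1_MAGNITUDES if g >= mag)
    if hi == lo:
        return lo
    return hi if rng.random() < (mag - lo) / (hi - lo) else lo


def quantize_block(block, rng=None):
    """Quantize one block and return (dequantized values, scale).

    The scale maps the block's largest magnitude onto 6.0, the largest
    E2M1 value. Passing rng enables stochastic rounding.
    """
    amax = max(abs(v) for v in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # clamp guards against float round-off
        q = round_stochastic(mag, rng) if rng is not None else round_nearest(mag)
        out.append((q if v >= 0 else -q) * scale)
    return out, scale


values = [float(i) for i in range(BLOCK)]
nearest, scale = quantize_block(values)
stochastic, _ = quantize_block(values, rng=random.Random(0))
print("scale:", scale)
print("nearest:   ", nearest)
print("stochastic:", stochastic)
```

Because one scale is shared by only 16 elements, an outlier distorts at most its own block; stochastic rounding trades a little per-element noise for the absence of systematic bias, which is why the recipe applies it to gradients.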

Selective high-precision exceptions (important practical detail):

  • For Super/Ultra:
    • Latent projections are kept in BF16 (minimal step-time impact).
    • MTP layers are kept in BF16 due to being at the end of the network and to preserve MTP capabilities (§2.4).
  • For attention layers:
    • The model has a small ratio of attention to Mamba-2 layers, and each attention layer uses GQA with only 2 KV heads (§2.4).
    • QKV and attention projections are kept in BF16 “to maintain fidelity” (§2.4).
  • For Mamba:
    • Mamba output projection layers can have high “flushes to zero” (up to 40% on Nano) when quantized to NVFP4, so they keep these layers in MXFP8 (§2.4).
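
One way to picture these exceptions is as a per-layer precision policy. The sketch below is a hypothetical configuration helper: the layer-kind labels, the function name, and the precedence between the named exceptions and the "last 15%" rule are assumptions made for illustration, and "high_precision" is left unspecified because the excerpt does not name the format used for the tail.

```python
BF16_KINDS = {"attn_qkv", "attn_out_proj", "latent_proj", "mtp"}  # hypothetical labels


def layer_precision(kind, index, n_layers, high_precision_tail=0.15):
    """Return the GEMM precision for a linear layer under the described recipe.

    kind   -- hypothetical label for the layer's role
    index  -- zero-based position of the enclosing block in the network
    """
    if kind in BF16_KINDS:
        return "bf16"
    if kind == "mamba_out_proj":
        return "mxfp8"  # avoids the flush-to-zero issue reported for NVFP4
    tail_start = n_layers - int(round(n_layers * high_precision_tail))
    if index >= tail_start:
        return "high_precision"  # exact format not stated in the excerpt
    return "nvfp4"


N_LAYERS = 40  # hypothetical depth
for kind, index in [("moe_expert", 3), ("moe_expert", 36), ("mamba_out_proj", 3),
                    ("attn_qkv", 10), ("latent_proj", 10)]:
    print(f"{kind:>15} @ layer {index:>2}: {layer_precision(kind, index, N_LAYERS)}")
```

Expressing the recipe as a lookup makes it explicit that NVFP4 is the default and that every deviation is a targeted exception tied to a specific observed sensitivity.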

Quantitative stability evidence:

  • Figure 4 reports the relative loss gap between NVFP4 and BF16:
    • On Nano, the paper reports < 1% relative loss difference for their improved recipe (green) (§2.4, Figure 4).
    • On a larger 8B-active MoE model, the gap decreases to < 0.6% (§2.4, Figure 4).
  • Figure 5 shows downstream evaluation trajectories for BF16 vs NVFP4 on an 8B-active MoE model trained to 1T tokens, and the paper states NVFP4 accuracy “closely follows BF16 trajectories” (evaluations performed in BF16) (§2.4, Figure 5).

3.4.5 Long-context to 1M tokens without RoPE in attention layers (§2.5, Table 3, Figure 6)

Key design choice:

  • The paper states Rotary Position Embeddings (RoPE) are a hurdle for extending context beyond the training length, and Nemotron 3 avoids RoPE in attention layers because Mamba layers provide implicit positional information (§2.5).
  • As a result, Nemotron 3 “does not suffer from out-of-distribution RoPE issues” during context extension (§2.5).

Training stages for long context (Nano details are explicit):

  • For Nemotron 3 Nano:
    • Continued pre-training (CPT) at 512k sequence length.
    • Supervised fine-tuning (SFT) at 256k sequence length.
    • RL includes a long-context environment with inputs up to 32k tokens (§2.5).
  • All three stages include synthetic data aimed at long-range retrieval, multi-hop reasoning, multi-document aggregation, etc. (§2.5).
  • The paper notes they did not need a staged increase from 8k to 512k during CPT (§2.5).

Evidence for context extrapolation (Table 3):

  • RULER scores across context lengths for two base models trained up to 512k (Table 3):
    • Nemotron-Nano-12B-v2-Base (Dense Hybrid):
      • 128k: 85.13
      • 256k: 79.85
      • 512k: 75.12
      • 1M: 23.43
    • Nemotron-3-Nano-30B-A3B-Base (MoE hybrid):
      • 128k: 74.48
      • 256k: 71.67
      • 512k: 66.02
      • 1M: 54.19
  • The paper interprets this as the MoE hybrid being more robust to length extrapolation, showing “graceful degradation” versus an abrupt dropoff for the dense hybrid (Table 3, §2.5).
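
The "graceful degradation" reading can be quantified directly from the Table 3 scores. The sketch below computes the drop from 512k to 1M and the fraction of the 128k score retained at 1M; the choice of these two summary statistics is ours, not the paper's.

```python
RULER = {
    "Nemotron-Nano-12B-v2-Base": {"128k": 85.13, "256k": 79.85, "512k": 75.12, "1M": 23.43},
    "Nemotron-3-Nano-30B-A3B-Base": {"128k": 74.48, "256k": 71.67, "512k": 66.02, "1M": 54.19},
}


def drop(scores, start, end):
    """Absolute score lost between two context lengths (points)."""
    return scores[start] - scores[end]


def retention(scores, reference, target):
    """Fraction of the reference-length score retained at the target length."""
    return scores[target] / scores[reference]


for model, scores in RULER.items():
    print(f"{model}: 512k -> 1M drop {drop(scores, '512k', '1M'):.2f} points, "
          f"retains {100 * retention(scores, '128k', '1M'):.1f}% of its 128k score at 1M")
```

The dense hybrid starts higher at every trained length but loses roughly 52 points when pushed from 512k to 1M, while the MoE hybrid loses roughly 12, keeping about 73% of its 128k score versus about 28% for the dense model.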

Evidence from next-token likelihood across long sequences (Figure 6): The paper evaluates cumulative average Negative Log-Likelihood (NLL) by token position on code sequences larger than 1M tokens and reports NLL decreases with sequence length (Figure 6, §2.5), suggesting the model continues using additional context up to the tested range.
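
For readers unfamiliar with the metric, a cumulative average NLL curve can be computed as follows. The function name and the toy log-probabilities are illustrative assumptions; the paper's exact evaluation procedure is not given in the excerpt.

```python
import math


def cumulative_average_nll(token_log_probs):
    """Running mean of negative log-likelihood up to each token position."""
    total = 0.0
    curve = []
    for position, log_prob in enumerate(token_log_probs, start=1):
        total -= log_prob
        curve.append(total / position)
    return curve


# Toy sequence in which later tokens become easier to predict.
toy_log_probs = [math.log(0.25), math.log(0.5), math.log(0.5), math.log(1.0)]
for position, value in enumerate(cumulative_average_nll(toy_log_probs), start=1):
    print(position, round(value, 4))
```

A curve that keeps falling as position grows means later tokens are, on average, predicted better than earlier ones, which is the behavior the paper reports out to the tested range. Note that a cumulative average changes slowly at large positions, so it smooths over local fluctuations.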

3.4.6 Post-training: Multi-environment RL for agentic skills + inference-time reasoning budget control (§2.6–§2.7, Figure 7–8)

Multi-environment RL design (§2.6):

  • The RL stage spans many environments: math/science reasoning, competitive coding, instruction following, software engineering, search, chat, tool use, long context, and more (§2.6).
  • Unlike “separate training stages for different tasks,” Nemotron 3 trains on all tasks simultaneously, which the paper claims is:
    • more stable,
    • less prone to reward hacking,
    • better than staged approaches that can degrade some capabilities (§2.6).

RL systems + algorithm choices (§2.6):

  • The paper uses an asynchronous RL architecture that decouples training from inference, and leverages MTP to accelerate rollout generation (§2.6).
  • For stable learning, it uses GRPO with masked importance sampling to account for discrepancies between training and rollout policies (§2.6).
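
The excerpt names the algorithm but gives no formulas, so the following is only one plausible reading. Group-relative advantages (reward minus group mean, divided by group standard deviation) are the standard GRPO ingredient; the masking rule, its thresholds `lo` and `hi`, and the function names are hypothetical. The sketch is scalar and has no autograd; in a real trainer the ratio would weight the gradient of the token log-probabilities.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within a group of responses sampled for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def masked_is_surrogate(train_logps, rollout_logps, advantage, lo=0.5, hi=2.0):
    """Token-averaged surrogate loss for one sampled response.

    ratio = p_train(token) / p_rollout(token) corrects for the rollout engine
    lagging or numerically differing from the trainer. Tokens whose ratio
    falls outside [lo, hi] are masked out (thresholds are assumptions).
    """
    kept = []
    for lt, lr in zip(train_logps, rollout_logps):
        ratio = math.exp(lt - lr)
        if lo <= ratio <= hi:
            kept.append(-ratio * advantage)
    return sum(kept) / len(kept) if kept else 0.0


rewards = [1.0, 0.0, 0.0, 1.0]  # e.g. pass/fail for four sampled answers
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# Second token has a large train/rollout mismatch and is masked out.
print(masked_is_surrogate([-1.0, -1.0], [-1.0, -5.0], advantages[0]))
```

Masking, rather than clipping, simply removes tokens where the two policies disagree too strongly, which limits the influence of stale or mismatched rollouts in an asynchronous setup.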

Training signal evidence (Figure 7): Figure 7 shows multiple benchmark curves trending upward during a single RL run (e.g., AIME25, GPQA, IFBench Prompt, LiveCodeBench, MMLU Pro, etc.), which the paper presents as evidence that simultaneous optimization improves diverse capabilities (§2.6).

Granular reasoning budget control (§2.7, Figure 8):

  • At inference time, a user can impose a maximum token budget for the model’s “thinking trace”; when it reaches the budget, one can append </think> and the model generates an answer based on the partial trace (§2.7).
  • Figure 8 illustrates an accuracy–efficiency trade-off as the “average number of generated tokens per query” changes, though the provided excerpt does not include exact numeric points for the curve.
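
The budget mechanism can be sketched as a two-call wrapper around any text-generation function. Everything here is illustrative: `generate` is a hypothetical callable (not a real library API), `toy_generate` is a canned stand-in so the example runs on its own, and "tokens" are whitespace-separated words.

```python
END_THINK = "</think>"


def answer_with_thinking_budget(prompt, generate, thinking_budget, answer_budget=64):
    """Cap the thinking trace, close it with </think>, then ask for the answer.

    `generate(text, max_new_tokens, stop)` is assumed to return new text and
    to stop before emitting `stop`.
    """
    trace = generate(prompt, max_new_tokens=thinking_budget, stop=END_THINK)
    # Close the trace ourselves, whether the model finished or hit the budget.
    closed = f"{prompt} {trace} {END_THINK}".strip()
    answer = generate(closed, max_new_tokens=answer_budget, stop=None)
    return trace, answer


def toy_generate(text, max_new_tokens, stop=None):
    """Canned stand-in for a model, used only to make the sketch runnable."""
    if text.endswith(END_THINK):
        script = ["Answer:", "42"]
    else:
        script = ["step1", "step2", "step3", "step4", END_THINK]
    out = []
    for token in script[:max_new_tokens]:
        if token == stop:
            break
        out.append(token)
    return " ".join(out)


print(answer_with_thinking_budget("Q: 6*7? <think>", toy_generate, thinking_budget=2))
print(answer_with_thinking_budget("Q: 6*7? <think>", toy_generate, thinking_budget=10))
```

The first call truncates the trace after two tokens and still produces an answer; the second lets the trace end naturally. Sweeping `thinking_budget` in this way is how one would trace out an accuracy versus generated-tokens curve of the kind Figure 8 shows.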

3.4.7 What the excerpt does not specify

The paper excerpt omits many training/inference configuration details that would normally be needed for full reproducibility, including:

  • Optimizer name and settings (e.g., AdamW betas/eps), learning rate schedule, weight decay.
  • Global batch size / microbatching, sequence packing details, gradient clipping.
  • Full architecture scalars for Nemotron 3 Nano/Super/Ultra (number of layers, hidden size d, heads, FFN sizes, exact attention-layer placement), except for the LatentMoE ablation model (d=4096, ℓ=1024, experts and top-K in Table 1) and attention detail of GQA with 2 KV heads (§2.4).
  • Tokenizer/vocabulary and context window settings beyond the stated sequence lengths for CPT/SFT/RL (§2.5).
  • Total training tokens per Nemotron 3 model (beyond statements like “over 10 trillion tokens of datasets” in the introduction and “up to 25T tokens” for NVFP4 stability).
  • Compute budget (e.g., PF-days), hardware counts, and end-to-end throughput numbers beyond the relative throughput bar in Figure 2.

4. Key Insights and Innovations

  • Hybrid Mamba–Transformer MoE layering to minimize attention cost (§2.1, Figure 1)
  • Novelty/significance: Instead of the common “attention everywhere” Transformer stack, the model uses mostly Mamba-2 and MoE with only a few attention layers to reduce KV-cache-driven costs, yielding large throughput gains (Figure 2).
  • Why it matters: Higher throughput directly benefits agentic systems and RL rollouts where many tokens must be generated.

  • LatentMoE: shrinking routed dimension to buy more experts and higher top-K (§2.2, Figure 3, Table 1)

  • Novelty/significance: The routed expert computation happens in a smaller latent dimension ℓ, reducing both memory traffic and all-to-all payload by d/ℓ, and the savings are used to scale expert count and top-K.
  • Evidence: Clear accuracy gains at similar active/total parameter counts and same token budget (Table 1).

  • MTP integrated as both quality and inference-speed lever (§2.3, Table 2)

  • Novelty/significance: MTP is used not just as an auxiliary objective but as a practical speculative-decoding enabler without a separate draft model.
  • Evidence: Broad accuracy improvements in Table 2 and high acceptance rate (~97% for first two predicted tokens) (§2.3).

  • Practical NVFP4 recipe with selective BF16/MXFP8 exceptions (§2.4, Figure 4–5)

  • Novelty/significance: The paper emphasizes native NVFP4 GEMMs and provides concrete “sensitive layers” handling (keep QKV/projections BF16; Mamba output projection MXFP8; last 15% high precision).
  • Evidence: Small loss gaps (<1% on Nano; <0.6% on an 8B-active MoE model) and similar evaluation trajectories (Figure 4–5).

  • Multi-environment RL done simultaneously + reasoning budget control (§2.6–§2.7, Figure 7–8)

  • Novelty/significance: Simultaneous RL across heterogeneous environments is positioned as more stable than staged RL and directly supports agentic tool use and controllable reasoning cost.
  • Evidence: Multi-metric improvements over RL steps in Figure 7; trade-off curves in Figure 8.

5. Experimental Analysis

  • Evaluation methodology (what’s explicitly described)
  • The excerpt provides benchmark names and some results:
    • RULER context extrapolation (Table 3; also referenced in Figure 2).
    • Downstream aggregated categories for LatentMoE vs standard MoE (Table 1).
    • A benchmark suite for MTP ablation on an 8B active MoE model trained on 1T tokens (Table 2).
    • NVFP4 vs BF16 loss comparisons (Figure 4) and downstream evaluation trajectories (Figure 5).
    • RL training curves across multiple benchmarks (Figure 7).
  • It does not provide full details of evaluation prompts, sampling settings, decoding parameters, or variance estimates.

  • Main quantitative results (specific numbers)

  • Throughput: Nemotron 3 Nano 30B-A3B shows 3.3× relative throughput vs Qwen3-30B-A3B (Figure 2, §2.1).
  • LatentMoE improves accuracy at similar model size and training budget (Table 1):
    • e.g., MMLU-Pro +4.57 points (48.30 → 52.87), Code +3.19 points (51.95 → 55.14).
  • MTP improves downstream accuracy on many tasks (Table 2):
    • e.g., GSM8K +1.97 points (82.49 → 84.46), MMLU-Pro +2.79 points (45.05 → 47.84).
  • NVFP4 training stability: loss gap vs BF16 is reported as <1% on Nano and <0.6% on an 8B-active MoE model (Figure 4, §2.4), with similar downstream trajectories (Figure 5).
  • Long-context robustness: at 1M context length on RULER, Nemotron-3-Nano-30B-A3B-Base scores 54.19 vs Nemotron-Nano-12B-v2-Base 23.43 (Table 3).

  • Do the experiments support the claims?

  • The provided evidence supports several targeted claims well:
    • LatentMoE’s accuracy gains are backed by a controlled comparison (same training tokens and hyperparameters stated, Table 1).
    • MTP’s effect is shown via direct ablation with concrete scores (Table 2).
    • NVFP4 stability is supported via loss curves and downstream tracking (Figure 4–5), plus concrete layer-precision adjustments (§2.4).
    • Long-context extrapolation robustness is clearly quantified (Table 3), and Figure 6 adds an auxiliary likelihood-based analysis.
  • However, some high-level claims are only partially evidenced in the excerpt:

    • “State-of-the-art accuracy” is asserted in the abstract/introduction and implied in Figure 2, but the excerpt does not provide a comprehensive baseline suite, statistical testing, or detailed settings to rigorously substantiate a SOTA claim across tasks.
    • For Super and Ultra, the excerpt contains feature descriptions but little direct empirical reporting specific to those models.
  • Ablations / robustness / failure cases

  • Ablations included:
    • Standard MoE vs LatentMoE (Table 1).
    • Baseline vs Baseline + MTP (Table 2).
    • NVFP4 recipe ablations around quantizing “sensitive layers” (described around Figure 4).
  • The excerpt does not include:
    • Failure cases, safety analyses, red-teaming outcomes, or robustness to adversarial prompting.
    • Detailed long-context adversarial tests beyond RULER and NLL trends.

6. Limitations and Trade-offs

  • Missing reproducibility-critical details
  • The excerpt lacks optimizer/schedule, batch sizes, full architecture sizes for Nano/Super/Ultra, compute budget, and hardware scale, which limits the ability to independently reproduce results.

  • Limited clarity on Super/Ultra specifics

  • While Super/Ultra are described as using NVFP4, LatentMoE, and MTP (abstract; §2.4), the excerpt provides most quantitative evidence on Nano and ablation models, not full Super/Ultra evaluations.

  • Long-context training vs inference mismatch

  • Nano’s CPT is at 512k and SFT at 256k, while the model claims support up to 1M tokens (§2.5). The RULER results in Table 3 suggest extrapolation is plausible, but 1M-token use still relies on generalization beyond the trained length.

  • Reasoning budget control depends on interface convention

  • The budget control mechanism relies on using a </think> token to truncate “thinking traces” (§2.7). The excerpt does not discuss how reliably this works across domains, whether it affects faithfulness, or how it interacts with tool-use policies.

  • MoE operational complexity

  • Even with LatentMoE, MoE introduces routing, load balancing, and communication complexity. The excerpt focuses on bottlenecks and mitigation, but does not detail load-balancing losses, expert collapse issues, or distributed systems edge cases.

7. Implications and Future Directions

  • How this changes the landscape
  • The paper pushes an architectural thesis: for agentic and long-generation workloads, minimizing attention layers and leaning on Mamba-2 plus MoE can shift the accuracy–throughput frontier (Figure 2, §2.1).
  • It also presents a hardware-aware MoE redesign (LatentMoE) that explicitly targets memory bandwidth and all-to-all bottlenecks rather than treating MoE as purely a modeling choice (§2.2).

  • Follow-up research enabled

  • Better understanding of where a few attention layers are necessary in predominantly Mamba-based stacks (Figure 1 suggests specific patterns, but the excerpt does not formalize design rules).
  • More systematic study of LatentMoE scaling laws: how ℓ, N, and K should scale with model size and deployment regime.
  • Stronger long-context evaluations beyond lookup tasks (e.g., multi-document reasoning reliability across 1M tokens), since the excerpt shows promising indicators (Table 3, Figure 6) but not broad task coverage at 1M.

  • Practical applications / downstream use cases (as stated)

  • Nano: positioned as cost-efficient and strong vs comparable models for accuracy while remaining inexpensive for inference (abstract).
  • Super: positioned for collaborative agents and high-volume workloads like IT ticket automation (abstract).
  • Ultra: positioned for highest accuracy and reasoning performance (abstract).

  • Repro/Integration Guidance (based on the provided excerpt)

  • Prefer this approach when:
    • You need high throughput for long input/output reasoning traces (Figure 2, §2.1).
    • You need very long context up to 1M tokens and want to avoid RoPE-based extrapolation issues (§2.5).
    • You plan to do RL post-training at scale and benefit from fast rollout generation (§2.6).
  • LatentMoE is most relevant if you deploy MoE in regimes where either:
    • expert weight reads dominate (latency-focused), or
    • all-to-all dispatch dominates (throughput-focused), since LatentMoE directly shrinks the communicated/routed dimension (§2.2).
  • If you want speculative decoding without a separate draft model, MTP is presented as a natural fit, with reported high acceptance of early predicted tokens (§2.3).