Olmo 3

ArXiv: 2512.13961

🎯 Pitch

Olmo 3 is a fully-open family of state-of-the-art 7B and 32B language models that uniquely releases the entire “model flow” — every dataset, intermediate checkpoint, training recipe, and code artifact — plus post-training variants for thinking, instructing, and RL. By exposing the complete lifecycle and providing new data curation, decontamination, and RL tooling, it enables reproducible, causal study of how data and training stages produce capabilities and delivers the strongest fully-open thinking model at this scale, narrowing the gap to top closed-weight systems.


1. Executive Summary

OLMo 3 is a “fully-open” release of 7B- and 32B-parameter language models where not only the final weights but the entire training lifecycle (“model flow”)—data mixes, code, intermediate checkpoints, and dependencies—is released, enabling intervention and study at every stage (Figure 1–2). Technically, it combines (i) multi-stage base training optimized for reasoning and long context (up to 65,536 tokens) with (ii) post-training pipelines for “thinking” (explicit reasoning traces), “instruct” (concise assistant behavior + tool use), and “RL-Zero” (RL directly from base to study RL effects without hidden pretraining). Across their evaluation suites, the flagship Olmo 3.1 Think 32B is positioned as the strongest fully-open thinking model of its scale, while remaining competitive with leading open-weight baselines on many reasoning/coding tasks (Table 1, Table 14).

2. Context and Motivation

  • Problem/gap addressed
    • Many “open” model releases are open-weights: they provide the final checkpoint but not the training data, intermediate checkpoints, or full reproducible pipeline (Figure 1).
    • This blocks research questions such as:
      • How specific data sources or training stages cause capabilities to appear.
      • Whether benchmark results are inflated by contamination.
      • How RL outcomes depend on base-model data, given that most RLVR work starts from models with undisclosed pretraining/midtraining data (Section 2.2, Section 6).

  • Why it matters
    • Scientific reproducibility & mechanistic understanding: releasing all stages supports controlled experiments (e.g., swap a data bucket, change long-context recipe, re-run RL) rather than treating the final weights as a black box.
    • Benchmark integrity: OLMo 3 invests heavily in decontamination and shows that contamination can be subtle and widespread in popular datasets; midtraining explicitly incorporates decontamination tooling (decon) and analyzes contamination impact (Section 3.5.3–3.5.4, Figure 12).
    • Long-context practicality: modern use cases (long documents, long reasoning traces) require long context windows; OLMo 3 is their first model family with long-context support up to 65K (Section 3.6).

  • Prior approaches and shortcomings (as positioned in the document)
    • Prior fully-open baselines exist (e.g., Marin, Apertus), but OLMo 3 aims to exceed them at similar scale on base-model quality (Figure 1; Tables 2–3).
    • Many strong open-weight “thinking” models do not release a base model or complete pipeline, limiting research into the roots of reasoning behavior (Section 1, Figure 1).
    • Small-scale development decisions are hard because many benchmarks are noisy or near random-chance at small compute; OLMo 3 introduces an evaluation framework to make data decisions more compute-efficient and reliable (Section 3.3).

  • How OLMo 3 positions itself
    • As a fully-open model flow release (data + code + checkpoints), not just a final model (Figure 1–2).
    • As a capability-driven pipeline:
      • Olmo 3 Base → foundation (reasoning/tool-use ready).
      • Olmo 3 Think → explicit reasoning traces.
      • Olmo 3 Instruct → concise chat + function calling.
      • Olmo 3 RL-Zero → RLVR directly from base to study data/RL interactions (Figure 2).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system is a full lifecycle pipeline for building and releasing language models (7B and 32B) including data processing, training, evaluation, and post-training variants.
  • It solves “how do we build strong long-context + reasoning-capable models and make them scientifically reproducible?” via staged base training (pretrain → midtrain → long-context extension) plus staged post-training (SFT → DPO → RLVR), with released artifacts at each step.

3.2 Big-picture architecture (diagram in words)

  • Data tooling (datamap-rs, duplodocus, dolma3) → produces curated pools and mixes (Dolma 3 Mix, Dolmino Mix, Longmino Mix; Figure 8, Figure 11).
  • Base training (OLMo-core) → trains Olmo 3 Base in 3 stages (pretraining, midtraining, long-context extension; Figure 2, Table 35).
  • Evaluation & decontamination (OLMES, OlmoBaseEval, decon) → guides decisions and filters leakage (Section 3.3, Section 3.5.3).
  • Post-training (OLMo-core for SFT; Open Instruct for RL infra) → produces:
    • Think via SFT + DPO + RLVR (Figure 15).
    • Instruct via Think-SFT warm start + SFT + DPO (+ length control) + RL (Section 5).
    • RL-Zero via RLVR directly from base (Section 6).
  • Release artifacts → final models + every checkpoint and data point in the flow (Abstract, Section 2.1 “Open artifacts”).

3.3 Roadmap for the deep dive

  • I first explain Olmo 3 Base training stages (pretrain → midtrain → long-context), because all post-trained variants depend on this foundation.
  • Then I explain the evaluation framework (OlmoBaseEval) and decontamination (decon), because they shape how data/training choices are made and how results should be trusted.
  • Next I explain post-training pipelines:
    • Think (SFT → DPO → RLVR) including the RL objective and verifiers.
    • Instruct (concise responses + function calling) including the special data and length-control choices.
    • RL-Zero (RL from base) as a research testbed.
  • Finally, I connect these mechanisms to the reported quantitative results and trade-offs.

3.4 Detailed, sentence-based technical breakdown

This is an empirical system + algorithmic methods paper centered on a practical recipe: build a strong base model with long-context readiness and “post-trainability,” then produce multiple specialized variants via staged post-training—while releasing the entire reproducible pipeline.

3.4.1 “Model flow” and released artifacts (what “fully-open” means here)

  • OLMo 3 releases the entire model flow: training data mixes, intermediate checkpoints, dependencies, and code across all stages (Abstract; Figure 1–2; Section 2.1 “Open artifacts”).
  • The release includes:
    • Models: Base (Olmo-3-1025-7B, Olmo-3-1125-32B), Think, Instruct, RL-Zero variants (header list).
    • Data: Dolma 3 Mix (pretrain), Dolmino Mix (midtrain), Longmino Mix (long context), Dolci suites for post-training (Think/Instruct/RL-Zero).
    • Code: OLMo-core (pretrain/midtrain/LC/SFT), Open Instruct (posttrain infrastructure), duplodocus (dedup), datamap-rs (processing), decon (decontam), OLMES (eval).
    • Logs and checkpoints for multiple stages (Section 2.1–2.4).

3.4.2 Base model architecture and training system

Architecture

  • OLMo 3 Base is a decoder-only transformer with key specs summarized in Appendix Table 33:
    • 7B / 32B use 32 / 64 layers and d_model 4096 / 5120.
    • Attention heads: Q heads 32 / 40; KV heads 32 / 8 (32B uses grouped-query attention; Table 33).
    • Activation: SwiGLU; normalization: RMSNorm; uses QK-Norm.
    • Sliding-window attention (SWA) is used for efficiency: in 3 out of every 4 layers, each token attends to a 4096-token window; the last layer always uses full attention (Section 3.2).
    • Positional encoding uses RoPE, with long-context scaling via YaRN on full-attention layers during extension (Table 33; Section 3.6.4; Figure 13a).
  • Context lengths
    • Pretraining and midtraining use an 8,192-token context window (Section 3.2).
    • After long-context extension, the model supports 65,536 tokens (Section 3.6).
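
The sliding-window layout described above can be sketched in a few lines. This is only an illustration: the summary gives the 3-in-4 ratio, the 4096-token window, and the full-attention last layer, but not which position in each group of four is the full-attention layer. Placing it fourth is an assumption, and `attention_pattern` and `visible_positions` are hypothetical helper names, not OLMo-core functions.

```python
from typing import List

WINDOW = 4096  # sliding-window size quoted in the summary


def attention_pattern(n_layers: int, full_every: int = 4) -> List[str]:
    """Return 'swa' or 'full' for each layer.

    Assumption: every `full_every`-th layer uses full attention. The source
    only states the 3-in-4 ratio and that the last layer is always full.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    pattern = ["full" if (i + 1) % full_every == 0 else "swa" for i in range(n_layers)]
    pattern[-1] = "full"  # the last layer always uses full attention
    return pattern


def visible_positions(layer_kind: str, query_pos: int, window: int = WINDOW) -> range:
    """Key positions a causal query at `query_pos` can attend to.

    Assumption: the window counts the query token itself.
    """
    start = 0 if layer_kind == "full" else max(0, query_pos - window + 1)
    return range(start, query_pos + 1)


pattern_7b = attention_pattern(32)   # 7B has 32 layers
pattern_32b = attention_pattern(64)  # 32B has 64 layers
```

Under this assumed placement, a query at position 8,191 sees 4,096 keys in an SWA layer and all 8,192 keys in a full-attention layer, which is where the efficiency gain comes from.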

Tokenizer

  • Uses the same tokenizer as OLMo 2, derived from OpenAI’s cl100k tokenizer (Section 3.2).

Training system performance

  • Trained with OLMo-core.
  • At sequence length 8,192:
    • 7B: ~7,700 tokens/s/GPU, ~43% MFU.
    • 32B: ~1,960 tokens/s/GPU, ~41% MFU.
  • Optimizations include torch.compile(), custom kernels (attention, LM head), async metrics and checkpointing (Section 3.2).

3.4.3 Base training stages: what happens first, second, third

Stage 1: Pretraining (foundation)

  • What happens first: Build Dolma 3 Mix by collecting and cleaning large-scale sources (Figure 8).
  • Data scale and composition
    • Train up to 5.9T tokens (Section 2.1; Table 35 shows 5.93T for 7B and 5.5T for 32B).
    • Dolma 3 Mix composition snapshot (Table 4):
      • CommonCrawl web pages: 4.51T tokens (76.1%).
      • olmOCR science PDFs: 805B tokens (13.6%).
      • Stack-Edu (rebalanced) code: 409B tokens (6.89%).
      • FineMath 3+ math pages: 152B (2.56%).
      • arXiv LaTeX: 50.8B (0.86%).
      • Wikipedia/Wikibooks: 2.51B (0.04%).
  • Key processing steps
    • Web pipeline: HTML extraction (Resiliparse), heuristic filters (spam/adult/length/symbol/repetition/language), then large-scale deduplication (Section 3.4.1; Appendix A.2).
    • Deduplication is multi-stage:
      1. Exact dedup removes 67% of web pool docs (38.7B → 12.8B).
      2. MinHash fuzzy dedup removes 23% more (→ 9.8B).
      3. Substring dedup removes 14% of text bytes via fuzzy suffix-array approach (Section 3.4.1).
    • Topic + quality classification partitions web into 24 topics × 20 quality tiers = 480 buckets for controllable mixing (Section 3.4.1).
    • New PDF source: crawl 238M PDFs (cutoff Dec 2024), text extracted via olmOCR, deduped, then a model-based PII filtering pipeline removes 4.9%, followed by heuristic filtering to a final 108M documents (Section 3.4.2).
  • How token selection is optimized
    • Token-constrained mixing: train many small proxy models (30M params for 3B tokens) across candidate mixtures, regress per-task BPB, then optimize mixture weights under constraints like max repetition (Section 3.4.4).
    • Conditional mixing: efficiently update mixtures when domains change by treating the existing mix as a “virtual domain” and re-optimizing only over new/changed domains (Section 3.4.4).
    • Quality-aware upsampling: rather than flat filtering, use monotonic upsampling curves per topic with max upsample factor (Figure 10; Section 3.4.4; Appendix A.2.4).
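
The staged deduplication figures quoted above can be checked with simple arithmetic, and the first stage is easy to illustrate. The sketch below is a toy, not the duplodocus implementation: `exact_dedup` is a hypothetical helper that hashes raw text, and the MinHash and suffix-array stages are not reproduced.

```python
import hashlib
from typing import Iterable, List


def removed_fraction(before: float, after: float) -> float:
    """Fraction of items removed when going from `before` to `after`."""
    return 1.0 - after / before


# Document counts in billions, as quoted in the summary.
pool, after_exact, after_minhash = 38.7, 12.8, 9.8
exact_removed = removed_fraction(pool, after_exact)           # about 0.67
fuzzy_removed = removed_fraction(after_exact, after_minhash)  # about 0.23
total_removed = removed_fraction(pool, after_minhash)         # about 0.75


def exact_dedup(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct text (toy exact dedup)."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept
```

Note that the 23% figure is relative to what survives exact dedup, so the first two stages together remove roughly three quarters of the original document pool.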

Stage 2: Midtraining (capability shaping)

  • What happens second: Continue training for 100B tokens on a curated midtraining mix Dolma 3 Dolmino Mix aimed at boosting math, code, QA, instruction-following, and reasoning traces (Section 3.5; Table 5).
  • Data composition (Table 5)
    • A large “HQ subset” of pretraining web remains important: CommonCrawl HQ subset contributes 22.4B tokens (22.5%).
    • Math sources include synthetic and rewritten sets (e.g., Dolmino Math 10.7B (10.7%), CraneMath 5.62B (5.63%), etc.).
    • Code includes StackEdu FIM 10B (10.0%) and synthetic Python (CraneCode 10B (10.0%)).
    • QA includes synthetic datasets (e.g., Reddit-to-Flashcards 5.9B, Wiki-to-RCQA 3B, Nemotron synth QA 5B).
    • Thinking traces and instruction data are explicitly included to prepare for post-training (multiple entries under “Thinking (synth)” and “Instruction (synth)”).
  • Methodological framework (Figure 11)
    • Distributed exploration via “microanneals”: for a candidate dataset, take 5B tokens from the dataset + 5B web tokens, “anneal” (continue training) and compare against a matched 10B web-only baseline to estimate marginal value (Section 3.5.1).
    • Centralized integration tests: run full 100B-token midtraining candidates; then optionally quick SFT + eval to estimate “post-trainability” (Section 3.5.1).
  • Decontamination
    • A dedicated tool decon removes overlaps between midtraining data and evaluation benchmarks via an n-gram detection phase + cluster expansion phase (Section 3.5.3).
    • The paper reports that contamination can be large in some sources (e.g., template-driven datasets) and that removing it sometimes changes measured performance, but not uniformly (Figure 12).
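
To make the n-gram detection phase concrete, here is a minimal sketch of flagging training documents that share an n-gram with a benchmark item. It is not the decon tool or its API: the whitespace tokenization, the choice of n, and the function names are all assumptions for illustration, and the cluster expansion phase is omitted.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int) -> Set[NGram]:
    """Lowercased whitespace-token n-grams of `text` (simplified tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_items: Iterable[str], n: int) -> Set[NGram]:
    """Union of all n-grams that appear in any benchmark item."""
    index: Set[NGram] = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def flag_contaminated(docs: List[str], index: Set[NGram], n: int) -> List[int]:
    """Indices of documents sharing at least one n-gram with the benchmark."""
    return [i for i, doc in enumerate(docs) if ngrams(doc, n) & index]


benchmark = ["What is the capital of France and why"]
training_docs = [
    "Quiz: what is the capital of France and why does it matter",
    "The weather today is sunny with a light breeze outside",
]
flagged = flag_contaminated(training_docs, build_benchmark_index(benchmark, 5), 5)
```

Here `flagged` contains only the first document, which embeds the benchmark question almost verbatim. A production tool would need a real tokenizer and a policy for how much overlap counts as contamination.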

Stage 3: Long-context extension (8K → 65K)

  • What happens third: Extend the context window to 65,536 tokens using Dolma 3 Longmino Mix for 50B tokens (7B) or 100B tokens (32B) (Section 3.6; Table 35).
  • Long-context data pool
    • Backbone is long olmOCR PDFs: 22.3M documents > 8K tokens totaling 639B tokens in the pool (Section 2.1; Table 11).
    • They additionally create synthetic long-context tasks (CWE/REX) injected into documents, derived from OLMo 2 Instruct 32B generations (Section 3.6.2).
  • Mixing strategy
    • Extension training mixes 34% long-context data with 66% short-context midtraining data to preserve short-context performance (Section 3.6.3).
  • Key recipe choices (Figure 13)
    • YaRN applied only to full-attention layers performs best among their tested attention-scaling options (Figure 13a).
    • Their Longmino PDFs outperform alternative recipes like ProLong in their tests (Figure 13b).
    • Synthetic augmentation (CWE/REX) improves RULER long-context performance (Figure 13c).
    • Best-fit document packing improves long-context training (Figure 13d).
    • Larger extension token budgets improve long-context scores, especially at longer lengths (Figure 13e).
  • Training infrastructure
    • Uses 8-way context parallelism so each device handles ~8K tokens in a 65K sequence (Section 3.6.4; Appendix Table 34).
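
Best-fit packing, mentioned above as one of the recipe choices, can be illustrated with a standard best-fit-decreasing bin-packing sketch. The summary does not describe the exact variant used, so the descending sort, the assumption that every document already fits in one sequence, and the name `best_fit_pack` are illustrative choices.

```python
from typing import List

SEQ_LEN = 65_536          # extended context length
CONTEXT_PARALLEL = 8      # 8-way context parallelism
TOKENS_PER_DEVICE = SEQ_LEN // CONTEXT_PARALLEL  # 8,192 tokens per device


def best_fit_pack(doc_lengths: List[int], capacity: int = SEQ_LEN) -> List[List[int]]:
    """Pack document lengths into sequences of at most `capacity` tokens.

    Each document goes into the open sequence with the least remaining space
    that still fits it; a new sequence is opened only when none fits.
    Assumption: documents longer than `capacity` were split beforehand.
    """
    bins: List[List[int]] = []
    free: List[int] = []
    for length in sorted(doc_lengths, reverse=True):
        if length > capacity:
            raise ValueError("document longer than capacity")
        best = None
        for b, space in enumerate(free):
            if length <= space and (best is None or space < free[best]):
                best = b
        if best is None:
            bins.append([length])
            free.append(capacity - length)
        else:
            bins[best].append(length)
            free[best] -= length
    return bins


packed = best_fit_pack([40_000, 30_000, 25_000, 20_000, 10_000, 5_000])
```

In this toy input the six documents fit into two sequences of 65,000 tokens each, leaving only 536 padding tokens per sequence. Compared with naive concatenate-and-chunk, this kind of packing avoids splitting documents across sequence boundaries.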

3.4.4 Evaluation: OlmoBaseEval for base-model development

  • The core challenge: small models can look like random chance on hard tasks, and benchmark noise can swamp small deltas (Section 3.3).
  • OLMo 3’s approach:
    • Task clustering: cluster benchmarks into capability groups by how they rank models, using 23K benchmark scores from 70 open-weight models, then manual adjustments to keep formats consistent (Section 3.3.1; Figure 5). Final clusters include MC STEM, MC Non-STEM, GenQA, Math, Code, and Code FIM.
    • Proxy metrics (Base Easy): use bits-per-byte (BPB), the negative log-likelihood of gold answers normalized by UTF-8 bytes, to get signal earlier than pass@1 in the “noise floor” regime (Section 3.3.2; Figure 6).
    • Signal-to-noise ratio (SNR) analysis: drop or isolate noisy tasks and tune evaluation configs to improve SNR (Section 3.3.3; Figure 7).
    • Held-out suite: 4 held-out tasks (MMLU Pro, DeepMind Math, LBPP, BBH) to reduce overfitting to dev benchmarks (Section 3.3.4).
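
The bits-per-byte metric described above is simple to compute once per-token log-probabilities of the gold answer are available. The sketch below assumes natural-log probabilities as input; the function name and the example values are illustrative, not taken from OLMES.

```python
import math
from typing import Sequence


def bits_per_byte(token_logprobs: Sequence[float], answer_text: str) -> float:
    """Negative log-likelihood of the gold answer, in bits, per UTF-8 byte.

    `token_logprobs` are natural-log probabilities of the answer tokens
    (an assumption about the input convention).
    """
    n_bytes = len(answer_text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("answer_text must be non-empty")
    nll_nats = -sum(token_logprobs)
    return nll_nats / (math.log(2) * n_bytes)


# Two tokens with probabilities 0.5 and 0.25 cost 1 + 2 = 3 bits in total;
# "Paris" is 5 bytes, so the result is 3 / 5 = 0.6 bits per byte.
bpb_example = bits_per_byte([math.log(0.5), math.log(0.25)], "Paris")
```

Because the metric is continuous, it keeps moving as a small model improves even while pass@1 on the same task is still stuck near zero. Normalizing by bytes rather than tokens keeps values comparable across tokenizers.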

3.4.5 Post-training variants: Think, Instruct, RL-Zero

A) OLMo 3 Think (reasoning traces)

  • Goal: Generate a structured “thinking trace” then a final answer (Figure 2; Section 4).
  • Pipeline (Figure 15):
    1. SFT on Dolci Think SFT: prompts + synthetic reasoning traces across math, code, IF, chat, safety, etc. (Section 4.2; Table 17).
    2. DPO on Dolci Think DPO: preference pairs designed to have capability-relevant “deltas” (Section 4.3; Table 19).
    3. RLVR using OlmoRL: reinforcement learning with verifiable rewards across math/code/IF plus judge-based rewards for chat (Section 4.4; Table 20; Figure 16).
  • Why the DPO data is unusual (delta learning)
    • They emphasize that preference tuning acts as contrastive learning: improvement depends on the quality gap between chosen and rejected outputs, not necessarily on the absolute quality of chosen outputs (Section 4.3).
    • Concretely, they create pairs by sampling a chosen answer from a stronger thinking model (Qwen 3 32B thinking) and a rejected answer from a weaker one (Qwen 3 0.6B thinking) (Section 4.3.1).
    • They show a case where continued SFT on the “chosen” outputs hurts, but DPO with a strong chosen-vs-rejected contrast helps (Table 21).
  • RLVR mechanism and equations
    • RLVR optimizes expected reward of generated answers using verifiers (math equivalence, code tests, IF constraint checks, LM-judge scores; Figure 16).
    • Their OlmoRL objective (Eq. (1)) is a GRPO-style clipped policy-gradient loss with:
      • Token-level normalization (normalize by total tokens in batch).
      • Truncated importance sampling to correct differences between inference engine (vLLM) and training probabilities.
      • No KL term (they remove KL regularization).
      • Asymmetric clipping (ε_low, ε_high) and “clip higher.”
    • Definitions (from Eq. (1)–(2), paraphrased):
      • Let x be a prompt, and y_i be the i-th sampled response in a group of size G.
      • Let r(x, y_i) be the verifier reward for that response.
      • The advantage for each token is the response reward minus the group mean reward: A_{i,t} = r(x, y_i) - mean_j r(x, y_j) (Eq. (2)).
      • The policy ratio compares new vs old policy probabilities per token, then applies clipping to limit update size (Eq. (1)).
    • Micro-example (illustrative of Eq. (2) mechanics):
      • Suppose a math verifier yields rewards for a group of 8 rollouts: [1,1,1,0,0,0,0,0].
      • The mean reward is 3/8 = 0.375.
      • Any token in a successful rollout gets A = 1 - 0.375 = +0.625 (positive push), while failures get A = 0 - 0.375 = -0.375 (negative push).
      • If all rewards in a group were identical, every advantage would be zero and the group would contribute no gradient; they filter out such groups (“zero gradient signal filtering”) to avoid wasted steps (Section 4.4.1).
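
The advantage computation and clipping described above can be sketched as follows. This is a simplified illustration rather than the OlmoRL implementation: it omits token-level normalization and truncated importance sampling, the ε values in the usage note are placeholders rather than the paper's settings, and the function names are hypothetical.

```python
from typing import List, Optional


def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """Eq. (2)-style advantages: each reward minus the group mean.

    Returns None for groups whose rewards are all identical, mirroring the
    zero-gradient-signal filtering described in the text.
    """
    if len(set(rewards)) <= 1:
        return None
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_token_term(ratio: float, advantage: float,
                       eps_low: float, eps_high: float) -> float:
    """Clipped surrogate for one token with asymmetric clipping bounds.

    `ratio` is new-policy probability over old-policy probability.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


advantages = group_advantages([1, 1, 1, 0, 0, 0, 0, 0])
# -> [0.625, 0.625, 0.625, -0.375, -0.375, -0.375, -0.375, -0.375]
skipped = group_advantages([1, 1, 1, 1])  # -> None, group is filtered out
```

With placeholder bounds `eps_low=0.2` and `eps_high=0.3`, a successful token whose ratio has grown to 1.5 contributes `1.3 * 0.625` instead of `1.5 * 0.625`, which caps how far a single update can push the policy.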

B) OLMo 3 Instruct (concise assistant + tool use)

  • Goal: Provide shorter, direct responses without exposing thinking traces, optimized for chat and function calling (Section 1; Section 5).
  • Warm-start: They start Instruct SFT from the Think SFT checkpoint, and report it improves instruct performance without leaving “thinking trace remnants” (Section 5.2.2; Table 29).
  • Function calling training (Section 5.2.1)
    • Two trajectory types:
      • Real interactions with MCP servers:
        • Science QA via ASC MCP server (Semantic Scholar tools).
        • Web search QA via Serper API (search + fetch webpages).
      • Simulated interactions (SimFC): LLM-generated synthetic tool trajectories over diverse APIs/MCP servers.
    • They unify tool definitions via OpenAPI specs and represent tool calls as pythonic code blocks with dedicated special tokens (Section 5.2.1).
  • Preference tuning with usability constraints
    • Dolci Instruct DPO combines:
      • Delta-learning heuristic pairs (strong vs weak model completions).
      • Delta-aware GPT-judged pairs (ensuring the rejected completion is meaningfully worse).
      • Multi-turn preference data.
    • Length control: filter chat and multi-turn pairs so chosen-vs-rejected length difference ≤ 100 tokens to counter “verbosity bias” (Section 5.3.1; Figure 23; Figure 22).
  • RL stage for Instruct
    • Uses OlmoRL on a mixture of domains but with shorter max response lengths (8K for 7B, 16K for 32B) to keep outputs concise (Section 5.4.1).
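
The length-control rule for preference pairs amounts to a one-line filter. The sketch below assumes pairs are dictionaries with `chosen` and `rejected` text fields and uses whitespace splitting as a stand-in for the real tokenizer; both are illustrative assumptions, not the Dolci data format.

```python
from typing import Callable, Dict, List

Pair = Dict[str, str]


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter; a real pipeline would use the model tokenizer."""
    return len(text.split())


def length_controlled(pairs: List[Pair],
                      count_tokens: Callable[[str], int] = whitespace_token_count,
                      max_gap: int = 100) -> List[Pair]:
    """Keep pairs whose chosen and rejected lengths differ by at most `max_gap` tokens."""
    return [
        p for p in pairs
        if abs(count_tokens(p["chosen"]) - count_tokens(p["rejected"])) <= max_gap
    ]


pairs = [
    {"chosen": "a " * 50, "rejected": "b " * 20},   # gap of 30 tokens: kept
    {"chosen": "a " * 300, "rejected": "b " * 50},  # gap of 250 tokens: dropped
]
kept = length_controlled(pairs)
```

Filtering on the gap rather than on absolute length removes pairs where the chosen response could be preferred simply for being longer, while leaving long but evenly matched pairs in the data.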

C) OLMo 3 RL-Zero (RL directly from base)

  • Goal: Study RLVR without hidden pretraining/midtraining confounds by starting RL from Olmo 3 Base and decontaminating RL datasets against base data (Section 6).
  • Key detail: They find that simple prompt templates work better for RL-from-base because midtraining excluded special formatting, and they also “clean” eval prompts to match training (Section 6.1).
  • Negative control for contamination: training with random (signal-free) rewards yields no benchmark gains (Figure 27), used as evidence that their decontamination prevents spurious-reward improvements from memorization.

3.4.6 Core training hyperparameters and infrastructure (as provided)

Base training hyperparameters (Table 35; Appendix Table 34)

  • 7B
    • Pretraining: seq len 8192, batch 4,194,304 tokens, total 5.93T tokens, warmup 2000 steps, peak LR 3e-4, final LR 3e-5 (10%).
    • Midtraining: 100B tokens, linear decay, peak LR 2.074e-4, batch 2,097,152 tokens.
    • Long-context ext: seq len 65,536, 50B tokens, warmup 200 steps, peak LR 2.074e-4, batch 4,194,304 tokens, context parallelism 8.
  • 32B
    • Pretraining: cosine over 5.93T but truncated at 5.5T tokens, peak LR 6e-4, warmup 2000 steps, batch 8,388,608 tokens.
    • Midtraining: 100B tokens twice (two runs + merge), peak LR 2.071e-4, batch 4,194,304 tokens.
    • Long-context ext: seq len 65,536, 100B tokens, warmup 200 steps, batch 8,388,608 tokens, CP 8.
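
As a rough illustration of the 7B pretraining numbers above, the sketch below derives the step count from the token budget and batch size and evaluates a warmup-plus-cosine schedule. The cosine shape for the 7B run and the linear warmup from zero are assumptions (the summary states cosine only for 32B), and `lr_at` is a hypothetical helper, not an OLMo-core scheduler.

```python
import math

TOTAL_TOKENS = 5_930_000_000_000   # 5.93T tokens (7B pretraining)
BATCH_TOKENS = 4_194_304           # tokens per optimizer step
TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS  # 1,413,822 steps


def lr_at(step: int, total_steps: int = TOTAL_STEPS, warmup_steps: int = 2000,
          peak_lr: float = 3e-4, final_lr: float = 3e-5) -> float:
    """Linear warmup to `peak_lr`, then cosine decay to `final_lr`.

    Assumption: the decay is cosine and the warmup starts at zero.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = lr_at(0)              # 0.0
lr_peak = lr_at(2000)            # peak_lr, reached at the end of warmup
lr_end = lr_at(TOTAL_STEPS)      # final_lr, 10% of the peak
```

The 2,000 warmup steps are a tiny fraction of the roughly 1.4 million total steps, so almost the entire run is spent on the decay portion of the schedule.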

Throughput & parallelism (Appendix Table 34)

  • 7B pretraining: 512 devices, ~7.7K tokens/s/device.
  • 32B pretraining: 1024 devices, ~2.0K tokens/s/device.
  • Long-context ext reduces throughput: 7B ~4.0K tokens/s/device, 32B ~1.3K tokens/s/device.
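
The throughput figures above allow a back-of-the-envelope estimate of pretraining time. The calculation below is an idealized sanity check, not a number reported in the paper: it assumes constant throughput with no restarts, evaluation pauses, or other downtime, and it takes the pretraining token budgets (5.93T for 7B, 5.5T for 32B) from the hyperparameter listing.

```python
SECONDS_PER_DAY = 86_400


def ideal_days(total_tokens: float, devices: int, tokens_per_sec_per_device: float) -> float:
    """Days needed at constant aggregate throughput (no downtime assumed)."""
    aggregate = devices * tokens_per_sec_per_device
    return total_tokens / aggregate / SECONDS_PER_DAY


days_7b = ideal_days(5.93e12, 512, 7_700)    # about 17.4 days
days_32b = ideal_days(5.5e12, 1024, 2_000)   # about 31.1 days
```

Real wall-clock time will be higher than these idealized values, since midtraining, long-context extension, and any cluster instability come on top of the pretraining stage.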

Wall-clock cost reporting (Section 2.4)

  • ~56 days elapsed from start to evaluation of Olmo 3 Think 32B on a dedicated cluster of 1024 H100 GPUs.
  • Breakdown:
    • Pretraining (including midtraining + long-context): ~47 days.
    • Post-training (SFT/DPO/RL): ~9 days, with RL hyperparameter uncertainty and cluster instability noted.
  • They also mention an additional 21-day extended RL continuation on 224 GPUs to produce Olmo 3.1 Think 32B (Section 2.4; footnote in Section 4.1).

4. Key Insights and Innovations

  1. “Fully-open model flow” as a first-class research artifact
    • Novelty is not just weights: releasing checkpoints + data mixes + tooling across stages enables tracing outputs (including reasoning chains) back to specific training sources and interventions (Abstract; Section 1; Figure 1–2).
    • Significance: supports controlled science on data/capability causality, contamination, and RL dynamics.

  2. Compute-efficient evaluation design for base-model iteration (OlmoBaseEval)
    • Combines task clustering (Figure 5), proxy metrics (BPB) for low-compute regimes (Figure 6), and SNR-driven benchmark selection/tuning (Figure 7).
    • Significance: makes it feasible to choose data and training strategies based on small models, reducing cost of large-scale mistakes.

  3. Trillion-token-scale dedup + controllable data mixing and quality-aware upsampling
    • Dedup pipeline includes a novel substring-level dedup stage via fuzzy suffix arrays (Section 3.4.1).
    • Mixing uses swarm-based proxy training plus conditional mixing to cheaply update mixtures as data evolves (Section 3.4.4).
    • Significance: enables targeted capability shaping (e.g., STEM upweighting) while managing repetition and token budgets (Figure 9, Figure 10).

  4. Long-context extension recipe grounded in large open PDF corpora (olmOCR science PDFs)
    • Uses a very large open long-document pool (e.g., 22.3M docs > 8K tokens, 4.5M docs > 32K tokens cited in Section 2.1) and combines it with recipe choices like YaRN-on-full-attn-only and document packing (Figure 13).
    • Significance: yields a 65K-context model with relatively short extension budgets (50B/100B tokens) and competitive benchmark performance (Table 12).

  5. Post-training framed as staged capability building, with “delta learning” as the bridge past SFT saturation
    • They show that DPO on high-contrast pairs can improve even when further SFT on “good” completions hurts (Table 21), and that DPO is a better initialization for RLVR than SFT alone (Table 22).
    • Significance: offers a practical recipe for building thinking models when imitation learning stops helping.

5. Experimental Analysis

5.1 Evaluation methodology (datasets, metrics, setup)

  • Base-model evaluation: OlmoBaseEval with clustered aggregates across math, code, STEM/non-STEM MCQA, GenQA, plus held-out tasks (Section 3.3.4; Tables 2–3).
  • Long-context evaluation: RULER used as dev suite and HELMET as mostly-held-out suite (Section 3.6 “Overall results”; Table 12).
  • Post-training evaluation suite: includes math (MATH, AIME 2024/2025, OMEGA), reasoning (BBH, ZebraLogic, AGI Eval), coding (HumanEvalPlus, MBPP+, LiveCodeBench v3), instruction following (IFEval, IFBench), knowledge/QA (MMLU, PopQA, GPQA), chat (AlpacaEval 2 LC), and for Instruct also tool-use benchmarks (SimpleQA, LitQA2, BFCL) (Section 4.1; Table 16; Section 5.1).
  • Variance awareness: they measure per-evaluation variance over multiple runs and categorize tasks as high-variance vs stable vs very stable (Section 4.1.1).
  • Generation settings: post-training evaluations use sampling temp=0.6, top-p=0.95, and strip <think>...</think> traces before scoring (Table 16; Appendix A.8.1).
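
Stripping the reasoning trace before scoring, as described in the generation settings, can be sketched with a small regular expression. This is an illustrative helper, not the evaluation harness's actual code; how unclosed `<think>` tags are handled is an assumption (here they are left untouched).

```python
import re

# Non-greedy match so multiple traces are removed separately; DOTALL lets
# the trace span several lines.
_THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> spans and surrounding whitespace before scoring."""
    return _THINK_RE.sub("", text).strip()


raw = "<think>2 + 2 = 4\nso the answer is four</think>\nThe answer is 4."
answer_only = strip_think(raw)  # "The answer is 4."
```

Scoring only the text outside the trace keeps answer extractors from picking up intermediate guesses that appear inside the reasoning.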

5.2 Main quantitative results (selected, with specific numbers)

5.2.1 Base model results (7B and 32B)

  • OLMo 3 Base 32B vs other base models on OlmoBaseEval (Table 2):
    • OlmoBaseEval Math: 61.9
    • OlmoBaseEval Code: 39.7
    • MMLU: 70.8 (STEM) / 78.3 (Humanities) / 84.0 (Social Sci.) etc. (see Table 2 rows)
  • OLMo 3 Base 7B (Table 3):
    • OlmoBaseEval Math: 54.7
    • OlmoBaseEval Code: 30.7
  • Across training stages (Table 13):
    • For 32B, math aggregate jumps notably from Stage 1 (48.4) to Stage 2 soup (69.7), while code stays strong (~39.7), and stage 3 long-context keeps code (39.7) but reduces math to 61.4, indicating a trade-off during long-context extension.
    • For 7B, Stage 2 substantially improves Math (59.8) and Code (31.9) relative to Stage 1 (Math 23.5, Code 19.8), and Stage 3 slightly reduces math (54.4) and code (30.6) while adding long-context capability (Table 13).

5.2.2 Long-context results (Table 12)

  • Olmo 3 32B RULER averages:
    • 4K: 96.10, 8K: 94.57, 16K: 90.42, 32K: 86.22, 65K: 79.70
  • HELMET for Olmo 3 32B:
    • 8K: 52.11, 16K: 49.36, 32K: 48.60, 65K: 43.15
  • Olmo 3 7B scores drop more sharply at 65K (RULER 67.96, HELMET 36.80), showing the expected scaling gap vs 32B.

5.2.3 Think model results (Table 14–15, plus Table 1 snapshot)

  • Flagship Olmo 3.1 Think 32B (Table 14 / Table 1):
    • Math: MATH 96.2, AIME 2024 80.6, AIME 2025 78.1, OMEGA 53.4
    • Reasoning: BBH 88.6, ZebraLogic 80.1, AGI Eval English 89.2
    • Coding: HumanEvalPlus 91.5, MBPP+ 68.3, LiveCodeBench v3 83.3
    • IF: IFEval 93.8, IFBench 68.1
    • Knowledge: MMLU 86.4, GPQA 57.5, PopQA 30.9
    • Chat: AlpacaEval 2 LC 69.1
  • Olmo 3 Think 7B final (Table 15) is strong on math/coding/reasoning vs several 7B-ish baselines, e.g. MATH 95.1, HumanEvalPlus 89.9, LiveCodeBench v3 75.2, but weaker on knowledge (MMLU 77.8, PopQA 23.7).
  • Effect of extended RL in 3.1 Think 32B
    • The paper highlights improvements from longer RL training (Table 14 narrative): e.g., AIME 2024 rises from 76.8 → 80.6, ZebraLogic 76.0 → 80.1, IFEval 89.0 → 93.8, IFBench 47.6 → 68.1, while AlpacaEval drops (74.2 → 69.1) (Section 4.1.2 discussion).

5.2.4 Instruct model results (Table 25–26)

  • Olmo 3.1 32B Instruct final (Table 25):
    • Math: MATH 93.4, AIME 2024 67.8, AIME 2025 57.9
    • IF: IFBench 39.7, IFEval 88.8
    • Chat: AlpacaEval 2 LC 59.8
    • Tool use: SimpleQA 84.7, LitQA2 55.6, BFCL 58.8
    • Safety: 89.5
  • Olmo 3 7B Instruct final (Table 26):
    • Math: MATH 87.3, AIME 2024 44.3, AIME 2025 32.5
    • Chat: AlpacaEval 2 LC 40.9
    • Tool use: SimpleQA 79.3, LitQA2 38.2, BFCL 49.8
    • Safety: 87.6

5.2.5 RL-Zero behavior and controls

  • RL-Zero demonstrates training reward increases across domains and AIME improvements over steps (Figure 24).
  • Negative control with random rewards shows no performance gains across many benchmarks (Figure 27), used to support the claim that evaluation leakage is not driving RL improvements.

5.3 Do experiments support the claims?

  • Supportive evidence
    • The positioning as the strongest fully-open model family at 32B is backed by comparisons to other fully-open models in base evaluation (Table 2) and to a set of open-weight thinking baselines in post-training evals (Table 1 / Table 14).
    • Staged training benefits are demonstrated:
      • Base stage contributions are quantified (Table 13).
      • DPO’s role as a bridge past SFT saturation is concretely shown (Table 21) and its synergy with RL is shown (Table 22).
    • Long-context recipe choices are explicitly ablated on RULER (Figure 13).

  • Where evidence is conditional or mixed
    • Some gains trade off against others:
      • Extended RL improves several reasoning/IF metrics but may reduce AlpacaEval (Section 4.1.2).
      • Long-context extension can slightly degrade some short-context capability aggregates (Section 3.6.3; Table 13 shows math drop after stage 3).
    • Some evaluations are high variance (GPQA, AlpacaEval, IFEval variance numbers listed in Section 4.1.1), so single-number comparisons should be treated cautiously.

6. Limitations and Trade-offs

  • Language and domain focus
    • Web filtering keeps English for CommonCrawl (Section 3.4.1), and multiple filters remove high Chinese-character content in post-training traces (Section 4.2.1; Appendix A.7.1), so multilingual breadth is limited compared to multilingual-focused projects.

  • Capability trade-offs from data skew
    • The recipe intentionally upweights STEM and code domains; they note this emphasis can slightly degrade general-knowledge benchmarks (Section 3.7).
    • Midtraining experiments show clear trade-offs: pushing math/code/thinking harder can reduce MCQA/GenQA and vice versa (Table 7, Table 8–9).

  • Reliance on synthetic and model-generated data
    • Many midtraining and post-training datasets are synthetic or rewritten using other models (e.g., Qwen3, GPT-4.1, DeepSeek R1, OLMo 2 Instruct for long-context synthetic augmentation) (Sections 3.5.2, 3.6.2, 4.2.1, 4.3.1, 4.4.2).
    • This introduces dependencies on generator behavior and filtering/verifier correctness.

  • Evaluation and development complexity
    • They report evaluation can cost 10–20% of compute during recipe development (Section 4.1.1).
    • Some benchmarks are noisy/high variance; they mitigate but do not eliminate this issue (Section 4.1.1; Section 3.3.3).

  • RL infrastructure cost asymmetry
    • RLVR is inference-dominated: for 32B they report using 20 nodes for inference vs 8 nodes for training, and for 7B 7 nodes inference vs 2 nodes learner, implying large inference overhead (Section 4.4.3).
    • They suspect sharding configs are suboptimal for 32B learner (Section 4.4.3).

  • Tool-use evaluation gaps
    • They note missing tool-use evaluations for some baselines due to compatibility issues (Section 5.1 footnote), so tool-use comparisons are not fully comprehensive across all model families.

  • Prompt/template sensitivity
    • Special tokens introduced too early can break behavior (midtraining experiments show dramatic evaluation drops due to special-token emission; Section 3.5.4).
    • RL-from-base requires simplified prompt templates and “cleaned” evaluation prompts to align training/eval distributions (Section 6.1).

7. Implications and Future Directions

  • How this changes the landscape
    • OLMo 3 sets a strong precedent for reproducible open LLM science by treating data, checkpoints, and training code as first-class release artifacts (Figure 1–2).
    • It enables new research modes:
      • Tracing long reasoning behaviors back to specific training examples or midtraining sources.
      • Testing decontamination methods and quantifying benchmark leakage effects with fully auditable pipelines.
      • Studying RLVR dynamics without hidden pretraining confounds via RL-Zero.

  • Follow-up research it enables (grounded in paper’s own directions)
    • Longer RL runs and stability: they show continued RL past initial release improves several benchmarks and note performance had not saturated at 2300 steps for Olmo 3.1 Think 32B (Section 4.1.2 footnote), suggesting additional RL work could further shift trade-offs.
    • Architecture choices for long-context: they reference a deeper investigation of architectural decisions affecting long-context extension (Section 3.6; Bertsch et al., 2026 mentioned), indicating continued exploration of where to apply RoPE scaling and attention patterns.
    • Multi-objective RLVR: RL-Zero “Mix” provides a benchmark for interactions between domains (math/code/IF/general chat) and under-optimization compared to single-domain runs (Section 6.2; Figure 24), suggesting future work on balancing objectives.

  • Practical applications / downstream use cases
    • Olmo 3 Think: best suited for tasks where longer reasoning traces and verifiable correctness matter (math, coding, constraint-heavy IF), especially when you can tolerate higher latency due to long generations.
    • Olmo 3 Instruct: aimed at real-world assistant usage with shorter responses, plus function calling and tool-using agents via MCP-style servers (Section 5.2.1).
    • Olmo 3 Base: a strong foundation model for custom post-training, especially if you want to experiment with alternative reasoning/tool-use pipelines with full provenance.

  • Repro/Integration Guidance (when to prefer what)
    • Prefer Instruct over Think when:
      • Latency matters and you do not need explicit long reasoning traces (Section 1; Section 5).
      • Your application relies on tool calls (BFCL, MCP agents) and concise outputs.
    • Prefer Think when:
      • You can leverage RLVR-style verifiers or benefit from explicit step-by-step reasoning traces for accuracy.
      • You are studying or auditing reasoning behavior using full model-flow provenance.
    • Prefer RL-Zero when:
      • Your research question is about RL dynamics from base or the effect of pretraining/midtraining data on RL outcomes, and you need strict decontamination and transparency (Section 6).