OLMES: A Standard for Language Model Evaluations

🎯 Pitch

This paper introduces OLMES (Open Language Model Evaluation Standard), a fully specified, open, and practical protocol that fixes prompt formats, curated few-shot examples, probability normalizations, and a principled rule for choosing between the completion/cloze formulation (CF) and the multiple-choice formulation (MCF), so that LLM benchmark scores are reproducible and comparable. By removing under-specified evaluation choices that can substantially change measured performance, OLMES makes claims like “model X scores Y on dataset Z” precise and trustworthy, enabling fairer model comparisons, more reliable leaderboards, and better scientific progress in LLM development.

1. Executive Summary

OLMES (Open Language Model Evaluation Standard) defines a fully specified, open, and practical standard for reproducible evaluation of base language models on multiple-choice question answering (MCQA) benchmarks, addressing the fact that small implementation choices can drastically change reported scores (Abstract; Section 1). The core contribution is a documented set of decisions (prompt formats, curated few-shot examples, probability normalizations, and a principled way to choose between the multiple-choice formulation, MCF, and the completion/cloze formulation, CF) that makes “model X scores Y on dataset Z” a well-defined and reproducible claim across papers and leaderboards (Sections 3–4; Table 1).

2. Context and Motivation

What problem/gap is addressed?
Reported LLM benchmark numbers are often not comparable or reproducible because different works evaluate “the same” task using different (sometimes under-specified) setups (Section 1; Table 1; Table 14).
Evaluation choices that can vary include prompt wording and formatting, which few-shot examples are used, how answer probabilities are normalized, and even whether the task is posed as “choose A/B/C/D” vs “fill in the answer” (Section 1; Sections 3.1–3.5).
Why this matters (impact)?
Benchmark scores are used to justify progress and model releases; if the measurement procedure is ambiguous, conclusions about “which model is better” can flip across references (Section 1; Table 1).
The paper cites evidence that formatting and in-context examples can cause very large accuracy swings (Section 1 mentions claims “as much as an 80% difference” from Sclar et al., 2023), motivating a standard that removes ambiguity in measurement.
Prior approaches and where they fall short (as positioned here):
Existing efforts like HELM and the Hugging Face Open LLM Leaderboard improve reproducibility by running many models under one implementation, but the paper argues that:
- The rationale for default choices is often under-documented, and
- Model creators frequently do not follow those exact setups in their own reporting, so cross-paper comparisons remain hard (Section 1; Section 5; Table 1; Table 14).
How this paper positions itself:
OLMES is presented as a completely documented evaluation standard with justified recommendations and an open reference implementation (Abstract; Section 1; Section 3; Section 4).
A key positioning point is that OLMES is intended to work throughout the base-model development cycle, including early training when some evaluation formats provide no useful signal (Section 1; Section 3.4; Figure 1).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

The system is an evaluation protocol plus a reference implementation that takes a dataset of multiple-choice questions and a base language model and outputs reproducible accuracy scores.
It solves the problem of evaluation ambiguity by fixing (and documenting) the entire evaluation pipeline: which examples are used, exactly how prompts are formatted, how probabilities are computed/normalized, and how a final score is chosen (Sections 3.1–3.5; Section 4).

3.2 Big-picture architecture (diagram in words)

Task spec layer: Defines dataset split, sampling rules, prompt templates, and task-specific settings like probability normalization and averaging method (Table 2; Sections 3.1, 3.3, 3.5).
Few-shot prompt builder: Inserts a fixed, curated 5-shot demonstration set per task into the prompt (Section 3.2; Appendix G; Figures 5–25 show examples).
Evaluator (two formulations):
Runs the task in MCF (multiple-choice formulation) and CF (completion/cloze formulation) (Section 2.1; Section 3.4).
Computes scores using specified normalization for CF (Section 3.3).
Score selector: For each model×task, reports the better of MCF and CF as the OLMES task score (Section 3.4; Section 4; Tables 6–7).
Aggregator: Computes overall averages across tasks (Table 4), and applies task-specific aggregation choices like MMLU macro-average (Section 3.5; Table 8).

3.3 Roadmap for the deep dive

I will explain, in order:
The tasks, models, and scope OLMES targets, because that determines what must be standardized (Sections 2.2–2.3; Table 2).
The two MCQA task formulations (MCF vs CF) and why both are needed (Section 2.1; Section 3.4; Figures 1–2).
The prompt/instance formatting standard and why tokenization details matter (Section 3.1; Figure 3).
The few-shot example policy and its compute/practicality motivation (Section 3.2).
The CF probability normalization choices and how OLMES selects them per task (Section 3.3; Table 3; Tables 10–12).
The remaining implementation constraints that affect reproducibility (Section 3.5; Appendix F).

3.4 Detailed, sentence-based technical breakdown

This is an empirical methodology and standardization paper. Its core idea is to make LLM benchmark numbers reproducible by fully specifying every evaluation decision, and to resolve key open choices (such as normalization and formulation) with experiments across many models and tasks (Sections 3–4; Table 3; Figure 2).

3.4.1 Scope: what OLMES standardizes and evaluates

OLMES focuses on multiple-choice question answering (MCQA) because it is widely used to evaluate base (non-instruction-tuned) models and provides useful signal early in training when generative tasks may not (Section 2.1).
The paper implements OLMES for 10 common MCQA benchmarks: ARC-CHALLENGE, ARC-EASY, BOOLQ, COMMONSENSEQA, HELLASWAG, MMLU, OPENBOOKQA, PIQA, SOCIAL IQA, and WINOGRANDE (Table 2).
It develops and validates the standard on 15 openly available base LLMs spanning roughly 1B to 70B parameters, including Pythia, OLMo, TinyLlama, StableLM2, RPJ-INCITE, MPT, Falcon, Llama2, Mistral, and Llama3 variants (Section 2.3; Table 4).

3.4.2 System / data pipeline diagram in words (what happens first, second, third…)

Select dataset split and (optional) sample instances.
OLMES uses the test split when labels are public; otherwise it uses validation (Section 3.1).
If a dataset has more than 1500 instances, it samples 1000 instances for practicality, using Python's random module with a fixed seed:
- Random(1234).sample(all_instances, 1000) (Section 3.1; Table 2 lists which tasks are sampled).
Table 2 specifies, per task, the split used, number of instances evaluated (and total size if sampled), and the prescribed CF normalization scheme.
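The sampling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `select_instances` and the constant names are assumptions, while the threshold (1500), sample size (1000), and seed (1234) come from the text.

```python
from random import Random

SAMPLE_THRESHOLD = 1500  # datasets larger than this are subsampled
SAMPLE_SIZE = 1000       # number of instances kept after subsampling
SEED = 1234              # fixed seed quoted in the text


def select_instances(all_instances):
    """Return the instances to evaluate (hypothetical helper name)."""
    if len(all_instances) > SAMPLE_THRESHOLD:
        # A fresh Random(SEED) makes the draw depend only on the input order.
        return Random(SEED).sample(all_instances, SAMPLE_SIZE)
    return list(all_instances)


small = select_instances(list(range(1200)))  # below the threshold: kept whole
large = select_instances(list(range(5000)))  # above the threshold: 1000 sampled
```

Because the sample depends on the order of `all_instances`, reproducing a published subset also requires loading the dataset in the same order.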
Build a prompt for each evaluation instance using standardized formatting.
For most tasks, OLMES uses a consistent "Question: <question>" prefix and "Answer:" suffix (Section 3.1).
There are task-specific exceptions meant to preserve semantics:
- PIQA uses "Goal: <goal>" (Section 3.1).
- In MCF, HELLASWAG uses "Choose the best continuation:" and WINOGRANDE uses "Fill in the blank:" (Section 3.1).
- For CF on HELLASWAG and WINOGRANDE, OLMES removes certain prefixes/suffixes so the task becomes closer to pure “continue the text” language modeling (Section 3.1; examples in Figures 14–15 and 24–25).
Insert a fixed 5-shot demonstration block (few-shot prompting).
OLMES uses manually curated 5-shot examples per task, taken from the task’s training split (Section 3.2).
The curation goals include “good quality,” “balanced label coverage,” and avoiding skew like “4 A’s and 1 B” among the five answers (Section 3.2; Appendix G describes the procedure; Figure 5 shows an ARC-CHALLENGE 5-shot prompt example for CF).
Prompts separate in-context examples using two newlines (Section 3.5).
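The prompt-formatting and few-shot steps above could be combined roughly as follows for the CF case. This is a sketch under assumptions: the helper names, the single newline between the question line and "Answer:", and the placeholder demonstrations are illustrative; only the "Question:" / "Answer:" wrapping, the task-specific prefix (e.g. "Goal" for PIQA), and the two-newline separator come from the text.

```python
def format_cf_example(question, answer=None, prefix="Question"):
    """Format one instance; pass prefix="Goal" for a PIQA-style prompt."""
    text = f"{prefix}: {question}\nAnswer:"
    if answer is not None:
        # Demonstrations include the gold answer after a single space.
        text += f" {answer}"
    return text


def build_cf_prompt(shots, question, prefix="Question"):
    """Join the fixed demonstrations and the test question with blank lines."""
    blocks = [format_cf_example(q, a, prefix) for q, a in shots]
    blocks.append(format_cf_example(question, prefix=prefix))
    return "\n\n".join(blocks)  # two newlines between in-context examples


# Placeholder demonstrations; the curated OLMES shots are not reproduced here.
demo_shots = [("placeholder question 1", "placeholder answer 1"),
              ("placeholder question 2", "placeholder answer 2")]
prompt = build_cf_prompt(demo_shots, "placeholder test question")
```

In a CF evaluation, each candidate answer would then be scored as a continuation of `prompt`.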
Evaluate each task in both formulations: MCF and CF.
MCF (multiple-choice formulation) presents answer options labeled A/B/C/... and expects the model to choose a label (Section 2.1; Figures 6, 8, 10, 12, 14, 16, 18, 20, 22, 24).
CF (completion/cloze formulation) substitutes each answer choice into an answer slot (or scores each choice completion) and ranks choices by model probabilities (Section 2.1; Figures 7, 9, 11, 13, 15, 17, 19, 21, 23, 25).
The paper highlights that CF was originally used because it better matches next-token prediction and can give much better performance for weaker models, but it introduces normalization ambiguities and has limitations for certain answer types (Section 2.1).
For CF, compute normalized scores to avoid length bias (task-specific).
In CF, for question prompt q and answer choice string a_i, the model gives a conditional probability P(a_i | q) (Section 3.3).
OLMES categorizes and compares four normalization strategies (Section 3.3):
- none: ln(P(a_i | q))
- token: ln(P(a_i | q)) / num_tokens(a_i)
- character: ln(P(a_i | q)) / num_characters(a_i)
- pmi: ln(P(a_i | q) / P(a_i | u)) with an “unconditional” prompt u = "Answer:"
Worked micro-example (symbolic, to illustrate mechanics):
- Suppose a question has two candidate answers: a1 = "iron" and a2 = "basalt".
- Under none, you compare ln P("iron"|q) vs ln P("basalt"|q), which can favor shorter/easier strings.
- Under character, you compare ln P(a|q) / chars(a) to reduce the advantage of shorter answers.
- Under pmi, you compare ln P(a|q) - ln P(a|u), which discounts answers that are intrinsically “more likely to say” even without the question (Section 3.3’s definition).
- OLMES then chooses the answer with the highest normalized score.
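The four normalization schemes can be made concrete with a small numeric sketch. The log-probabilities and token counts below are invented for illustration (a real evaluation would obtain them from the model), and the function names are assumptions. The continuation strings carry a leading space so that character counts include it.

```python
def normalized_score(scheme, logp_q, continuation, num_tokens, logp_u=None):
    """Score one CF answer continuation under a named normalization scheme."""
    if scheme == "none":
        return logp_q
    if scheme == "token":
        return logp_q / num_tokens
    if scheme == "character":
        # len() counts the leading space of the continuation.
        return logp_q / len(continuation)
    if scheme == "pmi":
        if logp_u is None:
            raise ValueError("pmi needs the unconditional log-probability")
        return logp_q - logp_u  # ln P(a|q) - ln P(a|u)
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical values: logp_q = ln P(a|q), logp_u = ln P(a|u).
candidates = {
    " iron":   {"logp_q": -4.0, "logp_u": -6.0, "num_tokens": 1},
    " basalt": {"logp_q": -4.5, "logp_u": -9.0, "num_tokens": 2},
}


def pick(scheme):
    """Return the continuation with the highest normalized score."""
    return max(
        candidates,
        key=lambda c: normalized_score(
            scheme, candidates[c]["logp_q"], c,
            candidates[c]["num_tokens"], candidates[c]["logp_u"]),
    )
```

With these made-up numbers, `pick("none")` selects " iron", while the token, character, and pmi schemes all select " basalt", which shows how the choice of normalization alone can change the predicted answer.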
Choose CF normalization per task based on broad experiments across 15 models.
The paper runs all four normalizations across all tasks and models and summarizes which normalization “wins” most often (Table 3; detailed per-model tables in Appendix C.2 / Tables 10–12).
OLMES then fixes a recommendation per task (Section 3.3; Table 2):
- pmi for ARC-CHALLENGE, COMMONSENSEQA, OPENBOOKQA.
- character for ARC-EASY, HELLASWAG, PIQA, SOCIAL IQA, MMLU.
- none for BOOLQ and WINOGRANDE (with reasoning tied to single-token choices in BoolQ and identical continuations in WinoGrande) (Section 3.3).
The paper reports that the “diff oracle” (gap between OLMES choice and per-model best normalization) is generally small (Table 3), arguing the standard is close to per-model optimal without “tuning” evaluation choices per model.
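One plausible way to compute the two summary statistics mentioned above is sketched here. The exact definitions used in Table 3 are not given in this summary, so both are assumptions: "win percentage" is taken as the share of models for which a scheme gives the highest accuracy, and "diff oracle" as the mean gap between each model's best scheme and the fixed scheme. The accuracy table is invented.

```python
# Hypothetical accuracies (%) for one task: model -> scheme -> accuracy.
acc = {
    "model_a": {"none": 40.0, "token": 42.0, "character": 45.0, "pmi": 44.0},
    "model_b": {"none": 50.0, "token": 51.0, "character": 53.0, "pmi": 55.0},
    "model_c": {"none": 60.0, "token": 61.0, "character": 66.0, "pmi": 64.0},
}


def win_percentage(acc, scheme):
    """Share of models (in %) whose best-scoring scheme is `scheme`."""
    wins = sum(1 for scores in acc.values()
               if max(scores, key=scores.get) == scheme)
    return 100.0 * wins / len(acc)


def diff_oracle(acc, scheme):
    """Mean accuracy lost by fixing `scheme` instead of each model's best."""
    gaps = [max(scores.values()) - scores[scheme] for scores in acc.values()]
    return sum(gaps) / len(gaps)


character_wins = win_percentage(acc, "character")  # 2 of 3 models
character_gap = diff_oracle(acc, "character")      # (0 + 2 + 0) / 3 points
```

A small gap under this reading means that fixing one scheme per task costs little compared with tuning the scheme separately for every model.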
Compute MCF score and handle tokenization pitfalls.
In MCF, OLMES uses canonical labels A/B/C/... formatted as "\n A. <choice>" with a leading space before the letter (Section 3.1).
The reason is tokenizer compatibility: many tokenizers treat "A" at the start of a line differently from " A" after a space, and OLMES wants the answer-label token in the options to match the answer-label token in "Answer: A" (Section 3.1; Figure 3 shows tokenizer examples for Llama and OLMo).
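A minimal sketch of the MCF layout with the leading space before each label is shown below. The function name and the "Question:" / "Answer:" wrapping of the options are assumptions for illustration; the " A." label format is the detail taken from the text.

```python
import string


def format_mcf_example(question, choices, answer_label=None):
    """Lay out a question with labeled options (hypothetical helper)."""
    lines = [f"Question: {question}"]
    for label, choice in zip(string.ascii_uppercase, choices):
        # Leading space: the label is written as " A", the same string that
        # follows "Answer:" below, so both spots see the same characters.
        lines.append(f" {label}. {choice}")
    text = "\n".join(lines) + "\nAnswer:"
    if answer_label is not None:
        text += f" {answer_label}"
    return text


example = format_mcf_example("placeholder question",
                             ["first option", "second option"],
                             answer_label="B")
```

Whether " A" maps to a single token still depends on the tokenizer, so it is worth checking the model's tokenizer output directly when adding a new model.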
Select the final task score as max(MCF, CF) for each model.
OLMES standardizes that for each model×task, it evaluates both and keeps the best performing formulation (Section 3.4; Section 4; Tables 6–7 indicate this with † marks for MCF usage in final scores).
The motivation is explicitly about signal quality across model capability levels:
- Weaker models often perform near random under MCF (Figure 2; Tables 6–7 show low MCF scores for these models), whereas CF can still distinguish between them.
- Stronger models often do much better under MCF, while CF can “lag behind” due to its limitations (Section 3.4; Figure 2).
Figure 1 provides a training-time example on MMLU for OLMo-7B-0424, where MCF starts near random early and becomes informative after about 400B training tokens (Figure 1; Section 3.4).
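The best-of rule reduces to a one-line comparison per model and task. In this sketch the function name and the tie-breaking choice (prefer MCF on ties) are assumptions; the strong-model numbers are the Llama3-70B ARC-CHALLENGE figures quoted in this summary, and the weak-model numbers are invented.

```python
def olmes_task_score(mcf_acc, cf_acc):
    """Return (score, formulation) using the better of the two formulations."""
    if mcf_acc >= cf_acc:  # tie-breaking toward MCF is an assumption
        return mcf_acc, "MCF"
    return cf_acc, "CF"


strong = olmes_task_score(mcf_acc=93.7, cf_acc=69.0)  # figures quoted in the summary
weak = olmes_task_score(mcf_acc=25.0, cf_acc=40.0)    # hypothetical weak model
```

Recording which formulation produced the score, as the † marks do in the paper's tables, keeps the reported number traceable to a concrete prompt format.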
Apply remaining implementation constraints for reproducibility and practicality.
MMLU averaging: OLMES uses macro-average over 57 subjects, not micro-average over 14,042 questions (Section 3.5; Table 8 shows differences are generally small).
Prompt tokens: Add <bos> when a model requires it (e.g., Gemma) (Section 3.5).
Character normalization detail: include the leading space in answer-length calculation for character normalization (Section 3.5).
Context length cap: restrict all inputs (with completions) to 2048 tokens for consistency across models (Section 3.5).
Model precision: evaluate using the model's default precision; avoid options like load_in_8bit unless they produce identical results (Section 3.5).
Floating-point nondeterminism note: batch size / GPU state can cause near-ties to flip; OLMES does not currently introduce tie-handling rules beyond acknowledging this as future work (Section 3.5).
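To illustrate the MMLU averaging constraint listed above, the sketch below contrasts macro- and micro-averaging on two invented subjects of very different size. The subject names and counts are hypothetical; MMLU itself has 57 subjects.

```python
# Hypothetical per-subject results: subject -> (num_correct, num_questions).
per_subject = {
    "subject_a": (90, 100),
    "subject_b": (150, 300),
}


def macro_average(per_subject):
    """Mean of per-subject accuracies: every subject counts equally."""
    accs = [100.0 * c / n for c, n in per_subject.values()]
    return sum(accs) / len(accs)


def micro_average(per_subject):
    """Accuracy over all questions pooled: every question counts equally."""
    correct = sum(c for c, _ in per_subject.values())
    total = sum(n for _, n in per_subject.values())
    return 100.0 * correct / total


macro = macro_average(per_subject)  # (90% + 50%) / 2
micro = micro_average(per_subject)  # 240 correct out of 400
```

The two averages diverge only when accuracy correlates with subject size, which is consistent with the summary's note that the differences in Table 8 are generally small.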
Publish artifacts to enable exact reproduction.
The paper states all prompts, examples, and code are released (Section 1; Section 3.5), and the repository is linked in multiple places (Abstract footnote; Section 1).

3.4.3 Practical compute configuration reported

Inference is run on NVIDIA RTX A6000 GPUs, totaling ~400 GPU hours (Appendix F).
OLMES explicitly aims to avoid unnecessary compute by capping to 1000 instances for large datasets and using 5-shot prompts (Sections 3.1–3.2).

4. Key Insights and Innovations

(1) “A score is only meaningful if the whole evaluation pipeline is fixed.”
The paper demonstrates large cross-reference discrepancies even on a single dataset (ARC-CHALLENGE) due to differing shots, formulations, and normalization choices (Table 1) and expands this to include OPENBOOKQA (Table 14).
The innovation is not a new benchmark, but a fully specified measurement standard that removes degrees of freedom.
(2) Dual-formulation evaluation (MCF and CF) with a best-of rule to span weak→strong models.
OLMES’s design explicitly treats CF as an “early / weaker model” signal and MCF as the “later / stronger model” signal, and standardizes reporting the best of the two (Section 3.4; Figure 1; Figure 2; Tables 6–7).
This is particularly targeted at making comparisons meaningful for base models across training stages and sizes, where a single formulation can be misleading (Section 3.4).
(3) Task-specific, empirically grounded CF normalization recommendations.
Instead of picking a single universal normalization, OLMES fixes per-task choices justified by answer-type properties (e.g., rarer phrases benefiting from pmi) and supported by multi-model experiments (Section 3.3; Table 3; Tables 10–12).
The paper emphasizes practicality: pmi requires extra unconditional likelihood computation, so it is avoided when not strongly motivated (Section 3.3).
(4) Standardized prompt formatting that accounts for tokenizer behavior.
The leading-space requirement in answer labels is a concrete standardization detail aimed at cross-model fairness and reproducibility across tokenizers (Section 3.1; Figure 3).
(5) Practical, curated 5-shot examples to stabilize evaluation without large compute overhead.
OLMES standardizes fixed curated shots and argues that using more than 5 shots typically does not meaningfully change scores, citing prior work and choosing 5 for practicality (Section 3.2).

5. Experimental Analysis

Evaluation methodology (what is tested):
The paper runs evaluations across 10 tasks (Table 2) and 15 base models (Section 2.3; Table 4).
Key experimental questions include:
- Which CF normalization should be used per task? (Section 3.3; Table 3; Tables 10–12)
- When does MCF vs CF provide a better measure of capability? (Section 3.4; Figures 1–2; Tables 6–7)
- How stable are results under minor prompt/example variants? (Appendix A / Table 5)
Metrics and reporting:
The main metric is accuracy (%) on each multiple-choice task (implied throughout tables; Tables 4, 6, 7).
For MMLU, OLMES uses macro-average across 57 subjects (Section 3.5; Table 8).
OLMES reports an overall average across the 10 tasks (Table 4).
Main quantitative results (specific numbers):
Normalization selection evidence: Table 3 reports “win percentage” per task across 15 models, e.g.:
- HELLASWAG: character wins 100% of the time among tested normalizations (Table 3).
- OPENBOOKQA: pmi wins 100% (Table 3).
- WINOGRANDE: none wins 100% (Table 3).
MCF vs CF divergence for strong models: For Llama3-70B on ARC-CHALLENGE, the paper highlights MCF = 93.7% vs CF = 69.0% (Section 3.4; Table 6), framing this as a large difference in error rate.
Full OLMES leaderboard-style results: Table 4 provides OLMES scores across 15 models and 10 tasks. Examples:
- Pythia-1B average: 49.0 (Table 4).
- Llama3-70B average: 88.4 (Table 4).
- Many strong-model entries are marked † indicating the final score used MCF for that task (Table 4).
Stability under minor variants: Table 5 shows that averaging over 4 small variants yields scores close to the original OLMES setup; the largest difference in the reported slices is 1.4%, and many differences are ≤1% (Table 5).
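The "difference in error rate" framing for the Llama3-70B ARC-CHALLENGE result above can be checked with simple arithmetic. The helper name is an assumption; the two accuracies are the ones quoted in this summary.

```python
def relative_error_reduction(acc_before, acc_after):
    """Percentage of the original error removed when accuracy improves."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return 100.0 * (err_before - err_after) / err_before


# CF = 69.0% leaves 31.0% error; MCF = 93.7% leaves 6.3% error.
reduction = relative_error_reduction(acc_before=69.0, acc_after=93.7)
```

Going from 31.0% to 6.3% error removes roughly four fifths of the errors, which is why a 24.7-point accuracy gap reads as a much larger change when expressed in error terms.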
Do the experiments support the claims?
The experiments directly support two central claims:
1. Normalization matters and is task-dependent, and OLMES picks norms that are near-oracle on average (“diff oracle” small in Table 3).
2. Formulation matters and depends on model strength/training stage, shown via:
   - Training curve crossover behavior on MMLU (Figure 1),
   - Per-task scatter across models (Figure 2),
   - Detailed per-model MCF vs CF scores (Tables 6–7).
A limitation of the evidence is that OLMES is validated primarily on MCQA tasks and base models, so the standard’s broader generality (e.g., generative tasks, chat/instruction settings) is deferred to future work rather than shown empirically here (Section 6; “Limitations”).
Ablations / robustness checks:
Robustness to “minor prompt wording” and “changing few-shot examples” is tested in Appendix A via Variants 1–3 (Table 5).
The paper also discusses macro vs micro averaging for MMLU and shows differences are small (Section 3.5; Table 8).

6. Limitations and Trade-offs

Scope limitation: only MCQA in this version.
The current OLMES standard focuses on multiple-choice tasks and does not standardize evaluation for generative tasks, chain-of-thought prompting, answer extraction, or chat message formatting (Section 6; “Future work and limitations”; Limitations section).
Designed for base models, not instruction-tuned/chat models.
The paper emphasizes base (not instruction-tuned) models as the main target and notes prompting strategies like CoT/self-reflection tend to be less effective for base models (FAQ; “Why not use prompting techniques such as CoT…”).
Best-of (MCF, CF) introduces a particular trade-off.
Reporting the max across formulations improves comparability across weak and strong models (Section 3.4), but it also means the reported score is not tied to a single fixed interaction format; rather, it is tied to OLMES’s rule for selecting a format.
Compute/practicality choices may reduce statistical precision.
Sampling to 1000 instances for datasets >1500 is motivated as a practical compute cap (Section 3.1), but it necessarily introduces sampling variance (the paper partially addresses variability via standard error reporting in Table 5 for selected tasks/variants).
Some sources of nondeterminism remain.
The paper notes that GPU/batch/precision details can flip decisions when answer probabilities are close, and OLMES does not yet implement tie-handling rules (Section 3.5).
Not designed for adversarial robustness to evaluation manipulation.
The Limitations section explicitly says adversarial setups (e.g., unnatural answer labels, noisy random shots) are left for future study, and larger differences are expected when diverging from OLMES’s “natural” prompt principles (Limitations; Appendix A discussion).
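For the sampling trade-off noted earlier in this section, a rough sense of the variance introduced by evaluating on 1000 instances comes from the standard binomial formula. This sketch assumes independent instances and is not the paper's own standard-error procedure; the function name is hypothetical.

```python
import math


def accuracy_standard_error(accuracy_pct, n):
    """Binomial standard error of an accuracy estimate, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Worst case (p = 0.5) for the 1000-instance cap.
se_capped = accuracy_standard_error(50.0, 1000)
```

Under these assumptions the worst-case standard error for a 1000-instance sample is about 1.6 percentage points, so differences between models smaller than that should be read with caution.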

7. Implications and Future Directions

How this changes the field’s evaluation landscape (as framed here):
OLMES is meant to turn ambiguous claims like “25-shot ARC-CHALLENGE” into a single, well-defined protocol (“score on ARC Challenge using OLMES”), enabling reproducible comparisons across papers and leaderboards (FAQ; Section 1; Tables 1 and 14).
By documenting the rationale behind choices, it aims to increase adoption beyond a single leaderboard implementation (Section 1; Section 5).
Follow-up research enabled/suggested:
Extend OLMES beyond MCQA to:
- generative tasks,
- chain-of-thought prompting,
- standardized answer extraction,
- standardized chat prompt/message splitting (Section 6).
Add additional benchmarks and model types (e.g., multimodal models are mentioned as a future direction) (Limitations section).
Practical applications / downstream use cases:
Model development monitoring: Figure 1 illustrates OLMES’s motivation for giving a usable signal during training, where CF can be informative before MCF becomes meaningful (Section 3.4; Figure 1).
Model selection: Table 4 (and the extended Table 13 in Appendix D) show how OLMES enables consistent ranking/comparison across many base models.
Reproduction and integration guidance (when to prefer OLMES over alternatives):
Prefer OLMES when you need reproducible, cross-paper comparable base-model MCQA evaluations, because it fixes:
- dataset split and sampling (Section 3.1; Table 2),
- exact prompt formats (Section 3.1; Appendix H figures),
- fixed curated 5-shot examples (Section 3.2; Appendix G),
- CF normalization per task (Section 3.3; Table 2),
- and the MCF/CF selection rule (Section 3.4).
The paper suggests OLMES is straightforward to incorporate into existing evaluation codebases such as Eleuther LM Evaluation Harness and HELM (Section 1), since it is primarily a set of standardized configuration and formatting decisions plus a reference implementation (Section 3.5).
Use the OLMES-specific implementation constraints—like the 2048 token cap, default precision, and tokenizer-aware label formatting—to reduce accidental variation across runs and models (Section 3.5; Figure 3).