Humanity’s Last Exam

ArXiv: 2501.14249

🎯 Pitch

A 2,500-question, multi-modal benchmark (HLE) of extremely challenging, closed-ended academic problems designed by subject-matter experts to remain unsolved by frontier LLMs. By intentionally avoiding searchable or ambiguous items and filtering out questions solvable by current models, HLE preserves a measurable expert frontier—revealing large capability gaps and severe miscalibration in state-of-the-art models and providing a durable tool for researchers and policymakers to track true progress and reliability at the highest levels of technical knowledge.


1. Executive Summary

Humanity’s Last Exam (HLE) introduces a 2,500-question, multimodal, closed-ended benchmark intentionally constructed to be far harder than academic benchmarks that have saturated (those on which frontier LLMs now exceed 90% accuracy), so that progress at the “expert frontier” remains measurable (Abstract; Section 1; Figure 1). Its core significance is that current frontier models score between 2.7% and 13.4% on HLE and are badly miscalibrated, often giving wrong answers with high confidence. This makes HLE useful for tracking capability and for exposing reliability gaps on difficult, verifiable questions (Section 4.2; Table 1).

2. Context and Motivation

  • What specific problem/gap is addressed?
    • Many widely used academic benchmarks are saturated: frontier LLMs score >90% on them, so those benchmarks no longer distinguish between strong models or measure further progress meaningfully (Abstract; Section 1; Figure 1).
    • The benchmark ecosystem needs evaluations that remain difficult even as models improve, especially for closed-ended questions that can be scored objectively at scale (Abstract; Section 1).

  • Why this matters
    • If benchmarks saturate, it becomes hard for:
      • Researchers to quantify improvements or regressions in model capability.
      • Policymakers and the public to interpret claims about “human-level” performance, because the tests no longer represent a frontier (Section 1; Section 5 “Impact”).

  • Prior approaches and shortfalls (as positioned in the paper)
    • Existing benchmarks cover scientific reasoning, math reasoning, coding, and general-purpose assistance, often using multiple-choice or short-answer formats (Section 2 “LLM Benchmarks”).
    • Recent work has tried to build harder benchmarks by adding multimodality, filtering via review, strengthening existing benchmarks, and using expert-written questions (Section 2 “Saturation and Frontier Benchmark Design”).
    • The shortfall motivating HLE is that, despite these efforts, many popular benchmarks still saturate quickly or do not aim to be a broad “final exam”-style academic evaluation at the frontier (Section 1; Section 2).

  • How HLE positions itself
    • HLE explicitly aims to be the “final closed-ended academic benchmark of its kind” with broad subject coverage, expert-written questions, multimodality, and multi-stage filtering to keep accuracy low for current frontier models (Abstract; Section 1; Section 2; Figure 4).

3. Technical Approach

3.1 Reader orientation (approachable technical breakdown)

  • The system built here is a benchmark + dataset creation and evaluation pipeline that produces extremely hard, closed-ended academic questions and measures LLM accuracy and calibration on them.
  • It solves the “benchmark saturation” problem by combining expert authoring, LLM-based difficulty filtering, human review, and automated grading/judging into an end-to-end pipeline (Section 3; Section 4; Figure 4).

3.2 Big-picture architecture (diagram in words)

  • (1) Expert question writing → (2) LLM difficulty check / filtering → (3) Expert peer review and refinement (two rounds) → (4) Organizer/expert approval → (5) Public release (2,500 questions) plus private held-out test set(s) → (6) Standardized evaluation of frontier models with automatic judging and calibration measurement (Section 3.1–3.2; Figure 4; Section 4).

3.3 Roadmap for the deep dive

  • I first explain what HLE contains (formats, modalities, subject coverage) because that defines what is being measured (Section 3; Figure 3).
  • Then I explain how questions are created and filtered to maintain difficulty (Section 3.1–3.2; Figure 4).
  • Next I explain review/auditing and data quality controls, including disagreement rates and “searchability” checks (Section B.2–B.3).
  • Finally I detail the evaluation protocol, including prompting, judging, metrics (accuracy + calibration error), and reported results (Section 4; Tables 1–4; Figure 5).

3.4 Detailed, sentence-based technical breakdown

This is an empirical benchmark/dataset paper whose core idea is to build an extremely challenging but automatically scorable set of questions by filtering submissions against frontier LLMs and refining them with expert review, then evaluating frontier models on accuracy and calibration (Abstract; Sections 3–4; Figure 4).

3.4.1 What HLE is (dataset composition and formats)

  • Size and scope. HLE contains 2,500 questions spanning over a hundred subjects (Abstract; Section 3; Figure 3).
  • Broad subject grouping and proportions. The dataset is grouped into high-level categories with approximate shares shown in Figure 3:
    • Math 41%
    • Biology/Medicine 11%
    • Computer Science/Artificial Intelligence 10%
    • Physics 9%
    • Humanities/Social Science 9%
    • Other 9%
    • Chemistry 7%
    • Engineering 4%
  • Two question formats.
    • Exact-match questions require the model to output an exact string (with the benchmark expecting a specific answer format that can be checked automatically) (Section “Question Style”).
    • Multiple-choice questions require selecting one option among five or more choices (Section “Question Style”; see also sample in Figure 2).
  • Multimodality.
    • About 14% of questions require using both text and an image reference (Section “Question Style”).
  • Format split.
    • About 24% of questions are multiple-choice, and the remainder are exact-match (Section “Question Style”).
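
To make the composition figures above concrete, the following sketch converts the rounded percentages into approximate question counts. The shares are the rounded values quoted from Figure 3 and the “Question Style” section, so the resulting counts are estimates rather than the dataset’s exact tallies; the names `CATEGORY_SHARE` and `approx_count` are illustrative and do not come from the paper.

```python
TOTAL_QUESTIONS = 2500  # size of the public HLE set

# Rounded category shares (percent) as quoted from Figure 3.
CATEGORY_SHARE = {
    "Math": 41,
    "Biology/Medicine": 11,
    "Computer Science/Artificial Intelligence": 10,
    "Physics": 9,
    "Humanities/Social Science": 9,
    "Other": 9,
    "Chemistry": 7,
    "Engineering": 4,
}


def approx_count(share_percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Convert a rounded percentage share into an approximate question count."""
    return round(total * share_percent / 100)


# Approximate counts only: the inputs are rounded percentages.
approx_counts = {name: approx_count(share) for name, share in CATEGORY_SHARE.items()}
multimodal_count = approx_count(14)       # about 14% need text plus an image
multiple_choice_count = approx_count(24)  # about 24% are multiple-choice
exact_match_count = TOTAL_QUESTIONS - multiple_choice_count
```

Under these rounded shares, Math accounts for roughly 1,025 questions, about 350 questions are multimodal, and about 600 are multiple-choice.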

3.4.2 Submission requirements (how questions are constrained to be scorable and hard)

  • Each submission must include the question text, answer specification (exact-match answer or multiple-choice options with correct answer marked), a detailed solution rationale, subject, and contributor identity/affiliation “to maintain accountability and accuracy” (Section “Question Style”).
  • The benchmark enforces submission criteria aimed at making questions:
  • Precise and unambiguous (so there is a single verifiable correct answer).
  • Solvable with known solutions.
  • Non-searchable, meaning models should not be able to answer via straightforward internet lookup or database retrieval (Abstract; Section 1; Section “Submission Format”).
  • The benchmark explicitly rejects:
  • Open-ended prompts, subjective interpretation questions, and (notably) content related to weapons of mass destruction (Section “Submission Format”).
  • For exact-match questions, answers are required to be short and easily verifiable to support automatic grading (Section “Submission Format”).

3.4.3 Pipeline “diagram in words” (what happens first, second, third…)

A concrete flow consistent with Figure 4 and Sections 3–4 is:

  1. Launch + question submission. Experts globally submit candidate questions with answer specs and rationales (Section 3.1; Figure 4).
  2. LLM difficulty check (pre-submission filtering). Each candidate question is tested against several frontier LLMs; questions that are too easy (models succeed) are rejected at this stage (Section 3.2; Section B.1; Figure 4).
    • The paper reports over 70,000 attempts logged during this process, resulting in about 13,000 questions that “stumped” the tested LLMs and were forwarded to expert review (Section 3.2).
    • The filtering differs by question type:
      • Exact-match must stump all tested models.
      • Multiple-choice must stump all but one model (to reduce the chance that a lucky guess blocks a good question) (Section B.1).
    • Models used for this stage include multimodal models for text+image questions and additional text-only models for text-only questions; the paper lists: GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, o1 (multimodal), plus o1-mini and o1-preview (text-only additions) (Section B.1).
  3. Expert review and iterative refinement (Round 1). Graduate-degree reviewers (Master’s/PhD/JD/etc.) review and provide feedback; each question receives 1–3 reviews, with the goal of making questions more robust, closed-ended, and high quality (Section 3.2; Figure 4; Section C.7.1).
  4. Selection and approval (Round 2 + organizer/expert sign-off). “Good and outstanding” questions are identified and then approved by organizers and trained reviewers for inclusion in the final dataset (Section 3.2; Figure 4; Section C.7.2).
    • Figure 4 also depicts an intermediate “candidates” stage (labeled 6,000 candidates) before final inclusion into HLE (Figure 4). The provided excerpt does not further quantify how that maps to the final 2,500 beyond the diagram.
  5. Dataset release strategy.
    • A public set of 2,500 questions is released, while a private held-out test set is kept to assess overfitting/gaming on the public benchmark (Abstract; Section 3; Figure 4).
    • Post-release, the project also collects late contributions and creates “a second held-out private set” for future evaluations (Section B.2 “Late Contributions”).
  6. Evaluation of frontier LLMs.
    • The paper evaluates “additional frontier multi-modal LLMs” on the final dataset, using a standardized prompting + judging setup (Section 4.1).
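
The difficulty gate in the flow above can be sketched as a small predicate. This is a minimal illustration, not the authors’ code: it reads “stump all but one model” as “at most one tested model may answer a multiple-choice question correctly”, and the names `Attempt` and `passes_difficulty_check` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    model: str     # name of the frontier model that was queried
    correct: bool  # whether its answer was judged correct


def passes_difficulty_check(question_type: str, attempts: list[Attempt]) -> bool:
    """Return True if a candidate question is hard enough to go to expert review.

    Assumed rule (one reading of Section B.1 as summarized above):
      - exact-match: no tested model may answer correctly
      - multiple-choice: at most one tested model may answer correctly,
        which tolerates a single lucky guess
    """
    n_correct = sum(a.correct for a in attempts)
    if question_type == "exact_match":
        return n_correct == 0
    if question_type == "multiple_choice":
        return n_correct <= 1
    raise ValueError(f"unknown question type: {question_type!r}")
```

For example, a multiple-choice question that exactly one model answers correctly still passes, whereas an exact-match question with the same outcome is rejected.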

3.4.4 Quality control after release (audits, disagreement, and searchability)

  • Community feedback / bug bounty refinement. Because reviewers are not expected to fully verify every solution rationale if it would take more than ~5 minutes, the benchmark adds a post-release process to identify and remove major dataset errors, especially label errors or major question statement issues (Section B.2 “Refinement Community Feedback”).
  • Audits by students. The team recruits students from top US universities to fully solve a sample; flagged errors are adjudicated among organizers, question authors, and auditors until consensus (Section B.2 “Audit”).
  • Searchability audit using search-enabled models.
    • The benchmark defines a question as “potentially searchable” if a model with search tools answers correctly but fails without search; such questions are manually audited and removed if easily found via web search (Section B.2 “Searchable Questions”).
    • Models used in this procedure include GPT-4o mini/GPT-4o search and Perplexity Sonar (Section B.2).
    • They report that model performance after this procedure is “similar” to before, but no specific before/after numbers are provided in the excerpt (Section B.2).
  • Estimated expert disagreement rate.
    • Two main rounds of auditing were performed on samples of 200 questions each, with rebuttals by original authors on disagreements; after iteration, the paper estimates an expert disagreement rate of 15.4% for the public set (Section B.3).
    • In biology/chemistry/health, targeted review finds disagreement of about 18%; the paper notes that disagreement rises to 25% in that subset under a more stringent “single dissent flags a question” rule (Section B.3).
  • Dynamic variant: HLE-ROLLING.
    • The paper announces a regularly updated fork, HLE-ROLLING, intended to incorporate community feedback and new questions and to provide a migration path once the original HLE saturates (Section B.3).
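
The “potentially searchable” definition in the searchability audit above reduces to a simple boolean rule. The sketch below encodes only that flagging step; the manual audit that decides removal is not modeled, and the function names and the `results` layout are assumptions made for illustration.

```python
def is_potentially_searchable(correct_with_search: bool, correct_without_search: bool) -> bool:
    """Flag a question when search access flips a wrong answer to a right one."""
    return correct_with_search and not correct_without_search


def flag_for_manual_audit(results: dict[str, tuple[bool, bool]]) -> list[str]:
    """Return the IDs of questions to hand to human auditors.

    `results` maps a question ID to (correct_with_search, correct_without_search).
    This hypothetical layout is chosen for the example, not taken from the paper.
    """
    return sorted(
        qid
        for qid, (with_search, without_search) in results.items()
        if is_potentially_searchable(with_search, without_search)
    )
```

A question answered correctly both with and without search is not flagged by this rule, since search access made no difference to the outcome.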

3.4.5 Evaluation protocol (prompts, judging, metrics, and model versions)

  • Standardized response format. For evaluation, models are instructed to output:
    • Explanation: ...
    • Answer: ...
    • Confidence: ... (0%–100%) (Section C.1.1)
  • Automatic judging of correctness.
    • The paper uses o3-mini as a judge to verify correctness “while accounting for equivalent formats (e.g., decimals vs. fractions or estimations)” (Section 4.1).
    • The judge prompt demands extraction of the final answer, comparison to the correct answer, and a yes/no correctness decision; it allows a “small margin of error” for numerical problems (Section C.1.1).
    • The judge configuration is specified as o3-mini-2025-01-31 with “structured decoding enabled” producing fields like extracted_final_answer, reasoning, correct, and confidence (Section C.1.1).
  • Accuracy metric. Standard percentage correct on the benchmark (Section 4.2; Table 1).
  • Calibration metric.
    • The paper measures calibration via RMS calibration error, where a well-calibrated model’s stated confidence should match empirical accuracy at that confidence level (Section 4.2).
    • They cite implementation provenance (“from Hendrycks et al. [24]”) and mention using a setup from Wei et al. [56], but the excerpt does not provide the explicit RMS formula (Section 4.2).
  • Temperature and model versioning (important for reproducibility).
    • Model versions are listed in Table 4 (Section C.5).
    • Temperature settings:
      • “All models use temperature 0.0 when configurable and not otherwise stated.”
      • o3-mini and o1 “only support temperature 1.0.”
      • Gemini 2.0 Flash Thinking is sampled at temperature 0.7 (and the paper notes an earlier deprecated version used different settings) (Table 4).
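
The fixed Explanation / Answer / Confidence response format described above lends itself to a lightweight local parser. The sketch below is a hypothetical convenience for sanity-checking responses; in the paper, answer extraction and the correctness decision are performed by the o3-mini judge, not by a regular expression. The names `ParsedResponse` and `parse_response` are assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedResponse:
    explanation: str
    answer: str
    confidence: float  # stated confidence rescaled to [0, 1]


# Matches the three labeled fields in order; DOTALL lets fields span lines.
_RESPONSE_PATTERN = re.compile(
    r"Explanation:\s*(?P<explanation>.*?)\s*"
    r"Answer:\s*(?P<answer>.*?)\s*"
    r"Confidence:\s*(?P<confidence>\d+(?:\.\d+)?)\s*%?",
    re.DOTALL,
)


def parse_response(text: str) -> Optional[ParsedResponse]:
    """Parse a model response; return None if it does not follow the format."""
    match = _RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    percent = float(match.group("confidence"))
    percent = min(max(percent, 0.0), 100.0)  # clamp to the stated 0%-100% range
    return ParsedResponse(
        explanation=match.group("explanation"),
        answer=match.group("answer"),
        confidence=percent / 100.0,
    )
```

Responses that omit one of the three labels return `None`, which makes format violations easy to count before any judging takes place.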

3.4.6 What the token-count analysis adds (compute at inference time)

  • The paper records average completion token counts (including “both reasoning and output tokens”) for “reasoning models” and argues that improved accuracy currently comes with increased inference-time computation (Figure 5; Section 4.2 “Token Counts”).
  • The excerpt provides the plots (Figure 5 and Figure 6) but not precise per-category numeric token averages, so the main grounded takeaway is qualitative: reasoning models generate substantially more tokens than non-reasoning models in this evaluation setup (Section 4.2; Figure 5; Figure 6).

4. Key Insights and Innovations

  • 1) Difficulty-preserving pipeline that filters against frontier models before expert review.
  • Novelty: Rather than only expert-writing questions, HLE uses an explicit “LLM difficulty check” gate that rejects questions if current frontier models can answer them (Section 3.2; Section B.1; Figure 4).
  • Significance: This operationalizes “frontier difficulty” and makes low accuracy partly a design target, not an accident of weak modeling (Section 4.2 “Accuracy”).

  • 2) Broad, multimodal, closed-ended benchmark positioned as a “final exam”-style evaluation.

  • Novelty: HLE combines wide subject coverage (over 100 subjects; Figure 3) with multimodal questions (~14%) and closed-ended scoring (exact-match and multiple choice) (Section 3; “Question Style”).
  • Significance: It aims to retain the scalability and objectivity of closed-ended grading while still testing the frontier of specialized knowledge (Abstract; Section 1).

  • 3) Explicit measurement of calibration (not just accuracy) on extremely hard questions.

  • Novelty: The evaluation requires models to report confidence and then computes RMS calibration error, exposing a failure mode where models answer incorrectly with high confidence (Section 4.2; Table 1).
  • Significance: On tasks where accuracy is low by design, calibration becomes especially important for safe deployment; the reported errors are very high across models (Table 1).

  • 4) Release strategy for robustness against benchmark gaming and ongoing correction.

  • Novelty: The benchmark releases a public set but keeps private held-out sets for overfitting detection and later adds HLE-ROLLING to incorporate fixes and new questions (Section 3; Figure 4; Section B.2–B.3).
  • Significance: This acknowledges that public benchmarks can be overfit and that extremely specialized questions have nontrivial error rates requiring iteration (Section B.2–B.3).

5. Experimental Analysis

Evaluation methodology (datasets, metrics, baselines, setup)

  • Dataset evaluated: HLE public set of 2,500 questions (Section 3; Abstract).
  • Metrics:
  • Accuracy (%) (higher is better) (Section 4.2; Table 1).
  • RMS calibration error (%) (lower is better) computed from model-reported confidence vs. empirical correctness (Section 4.2; Table 1).
  • Prompting / response format: The fixed format with Explanation, Answer, Confidence is enforced (Section C.1.1).
  • Automatic judge: o3-mini-2025-01-31 is used to judge equivalence / correctness and extract the final answer (Section 4.1; Section C.1.1).
  • Models evaluated and versions: Provided in Table 4 (Section C.5), including:
  • gpt-4o-2024-11-20
  • grok-2-latest
  • claude-3-5-sonnet-20241022
  • gemini-1.5-pro-002
  • gemini-2.0-flash-thinking-exp-01-21
  • o1-2024-12-17
  • DeepSeek-R1 (Jan 20, 2025 release)
  • o3-mini-2025-01-31 (Table 4)
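
Because the excerpt does not give the explicit formula for the RMS calibration error listed among the metrics above, the following sketch shows one common binned formulation. It is an assumption, not the paper’s confirmed implementation: predictions are sorted by confidence, grouped into equal-size bins, and the squared gap between mean confidence and accuracy in each bin is averaged with weights proportional to bin size. The default `bin_size` is an illustrative choice.

```python
import math


def rms_calibration_error(confidences: list[float], correct: list[bool], bin_size: int = 100) -> float:
    """Binned RMS calibration error, with confidences given in [0, 1].

    Assumed formulation: sqrt( sum_b (n_b / N) * (mean_conf_b - acc_b)^2 ),
    where bins hold consecutive predictions after sorting by confidence.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    if bin_size < 1:
        raise ValueError("bin_size must be positive")

    pairs = sorted(zip(confidences, correct), key=lambda pair: pair[0])
    n = len(pairs)
    total = 0.0
    for start in range(0, n, bin_size):
        chunk = pairs[start:start + bin_size]
        mean_conf = sum(conf for conf, _ in chunk) / len(chunk)
        accuracy = sum(1.0 for _, ok in chunk if ok) / len(chunk)
        total += (len(chunk) / n) * (mean_conf - accuracy) ** 2
    return math.sqrt(total)
```

Under this formulation, a model that states 90% confidence on every question and is always wrong scores 0.9, which illustrates how very low accuracy combined with high stated confidence produces errors of the magnitude reported in Table 1.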

Main quantitative results (with specific numbers)

From Table 1 (full HLE unless otherwise marked):

Accuracy (%) ↑ / RMS calibration error (%) ↓

  • GPT-4o: 2.7% accuracy, 89% calibration error
  • Grok 2: 3.0%, 87%
  • Claude 3.5 Sonnet: 4.1%, 84%
  • Gemini 1.5 Pro: 4.6%, 88%
  • Gemini 2.0 Flash Thinking: 6.6%, 82%
  • o1: 8.0%, 83%
  • DeepSeek-R1*: 8.5%, 73% (text-only subset; not multimodal)
  • o3-mini (high)*: 13.4%, 80% (text-only subset; not multimodal)
    (Table 1)

Additional breakdowns:

  • Text-only subset results (Table 2) are very similar to Table 1 for the models reported there; e.g., o3-mini (high) remains 13.4% accuracy and 80% calibration error on text-only (Table 2).
  • Category-wise performance (Table 3) shows substantial variation by domain and model. Two concrete examples from the text-only section of Table 3:
  • o3-mini (high) reaches 18.6% on Math and 15.3% on Physics, but only 5.2% on Humanities (Table 3, text-only).
  • DeepSeek-R1 reaches 14.5% on Engineering and 10.4% on Humanities, but 5.0% on Chemistry (Table 3, text-only).

Do the experiments support the claims?

  • Claim: Frontier models have low accuracy on HLE. Supported directly by Table 1, where all reported accuracies are in the 2.7%–13.4% range under this evaluation (Table 1).
  • Claim: Models are poorly calibrated on HLE. Supported by high RMS calibration errors across all models, ranging from 73% to 89% in Table 1 (Table 1; Section 4.2 “Calibration Error”).
  • Claim: HLE resists benchmark saturation and is effective at measuring advanced capabilities.
  • Figure 1 visually contrasts low HLE accuracy with high accuracies on older benchmarks (Figure 1; Section 1). The excerpt does not tabulate the underlying numeric values for those other benchmarks and instead cites their sources in Section C.6. The claim is therefore supported qualitatively by the paper’s presentation, while precise cross-benchmark numbers are not contained in the excerpt itself (Figure 1; Section C.6).

Ablations, robustness checks, and failure cases

  • The paper does not present classic modeling ablations (since it does not propose a new model), but it does include:
  • A text-only vs. full dataset reporting split (Table 2; Table 3).
  • A token-count analysis to contextualize compute trade-offs between “reasoning” and “non-reasoning” models (Figure 5; Figure 6).
  • Multiple audit and disagreement estimates as a form of dataset robustness analysis (Section B.3).
  • A post-release searchability filtering audit (Section B.2).
  • A key “failure mode” emphasized is high-confidence wrong answers (miscalibration / hallucination framing) (Section 4.2; Table 1).

6. Limitations and Trade-offs

  • Label correctness and expert disagreement are nontrivial at the frontier.
  • The benchmark estimates 15.4% expert disagreement on the public set and ~18% in bio/chem/health (Section B.3). This implies that evaluation scores may partially reflect ambiguity or adjudication complexity, especially in certain domains.

  • Difficulty filtering can entangle evaluation with the specific models used during construction.

  • Because questions are filtered by whether a set of frontier LLMs fail them at submission time (Section 3.2; Section B.1), HLE is, by design, tuned to be hard for those models. This is intentional, but it means:

    • Small improvements near zero accuracy may be hard to interpret (the paper explicitly cautions that “small inflections close to zero accuracy are not strongly indicative of progress”) (Section 4.2 “Accuracy”).
    • Some residual non-zero accuracy may be due to stochastic guessing/noise rather than systematic capability (Section 4.2 “Accuracy”).
  • Automated judging introduces its own potential error surface.

  • The benchmark uses an LLM judge (o3-mini) to determine correctness and equivalence classes of answers (Section 4.1; Section C.1.1). While structured, this can introduce:

    • False positives/negatives if equivalence handling is imperfect.
    • Dependence on judge behavior, especially for numeric tolerances (“small margin of error”) (Section C.1.1).
  • Closed-ended format limits what “high score” implies.

  • The paper explicitly notes that high HLE accuracy would indicate expert-level performance on closed-ended verifiable questions, but would not by itself demonstrate autonomous research or “AGI,” because HLE does not test open-ended research and creativity (Section 5 “Future Model Performance”).

  • Compute trade-offs at inference time are highlighted but not fully quantified in-text.

  • The token-count plots suggest reasoning models use many more completion tokens for improved performance (Figure 5; Section 4.2), but the excerpt does not give exact token counts per model/category, limiting precise compute-efficiency comparisons.
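
The caution about interpreting small accuracy differences near the floor can be made concrete with a back-of-the-envelope sampling-error estimate. This calculation is not from the paper: it assumes questions are independent Bernoulli trials and uses the normal approximation, which is rough at very low accuracies, and it ignores label noise and judge error. The function names are illustrative.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int = 2500) -> float:
    """Binomial standard error of an accuracy estimate (accuracy in [0, 1])."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def approx_95_interval(accuracy: float, n_questions: int = 2500) -> tuple[float, float]:
    """Approximate 95% interval via the normal approximation, clipped to [0, 1]."""
    half_width = 1.96 * accuracy_standard_error(accuracy, n_questions)
    return (max(0.0, accuracy - half_width), min(1.0, accuracy + half_width))
```

Under these assumptions, an accuracy of 2.7% on 2,500 questions has a standard error of roughly 0.3 percentage points and an approximate 95% interval of about ±0.6 percentage points, so differences of a few tenths of a point between models fall within sampling noise alone.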

7. Implications and Future Directions

  • How this changes the landscape
  • HLE provides a deliberately hard, broad academic benchmark that can serve as a new reference point once older benchmarks saturate (Section 1; Figure 1; Section 5 “Impact”).
  • By reporting both accuracy and calibration error, it pressures the field to improve not only correctness but also uncertainty awareness on difficult questions (Section 4.2; Table 1).

  • Follow-up research enabled

  • Better calibration methods for LLMs on out-of-distribution / ultra-hard questions are directly motivated by the very high RMS calibration errors (Section 4.2; Table 1).
  • Methods for compute-optimal reasoning are motivated by the observation that “reasoning models require substantially more inference time compute,” with token counts used as a proxy (Section 4.2; Figure 5).
  • Benchmark maintenance and dataset quality work: HLE-ROLLING is explicitly proposed as an evolving benchmark to integrate fixes and new questions (Section B.3).

  • Practical applications / downstream use

  • For organizations tracking frontier model capability, HLE can be used as a stress test for:
    • Specialized academic knowledge across many domains (Figure 3).
    • Robustness to multimodal inputs (~14% image-referenced items) (Section “Question Style”).
    • Reliability via confidence reporting and calibration (Section 4.2).
  • Because the benchmark keeps private held-out sets, it is structured to support more credible evaluation under potential public-set overfitting (Section 3; Figure 4; Section B.2).

  • Repro/Integration Guidance

  • Prefer HLE over saturated academic benchmarks when you need:
    • A measurement regime where frontier models are far from ceiling and improvements are more visible (Section 1; Figure 1; Table 1).
    • A closed-ended evaluation where automated scoring is feasible at scale (Abstract; Section “Question Style”; Section 4.1).
  • When reproducing reported numbers, match the evaluation specifics that materially affect results:
    • Use the fixed response format (Explanation, Answer, Confidence) (Section C.1.1).
    • Use the same model versions (Table 4) and temperature constraints (Table 4).
    • Use the same judging approach (o3-mini-2025-01-31 structured decoding) and equivalence handling (Section 4.1; Section C.1.1).
  • Interpret low accuracies cautiously near the floor because the benchmark intentionally filters out questions models can answer, and the paper warns that small changes near zero may not reflect real progress (Section 4.2 “Accuracy”).
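
The reproduction settings summarized above can be collected into a small configuration sketch. The model identifiers and temperatures are those quoted from Table 4 in this summary; the dictionary layout, the constant names, and `temperature_for` are assumptions for illustration. DeepSeek-R1 is omitted from the identifier list because the summary gives only a release date for it, not an API identifier.

```python
# Judge model as specified in Section C.1.1 (structured decoding enabled).
JUDGE_MODEL = "o3-mini-2025-01-31"

# Default per the paper: temperature 0.0 "when configurable and not otherwise stated".
DEFAULT_TEMPERATURE = 0.0

# Exceptions quoted from Table 4.
TEMPERATURE_OVERRIDES = {
    "o1-2024-12-17": 1.0,                        # only supports temperature 1.0
    "o3-mini-2025-01-31": 1.0,                   # only supports temperature 1.0
    "gemini-2.0-flash-thinking-exp-01-21": 0.7,  # sampled at 0.7
}

# Versioned identifiers listed in Table 4 (DeepSeek-R1 omitted, see lead-in).
EVALUATED_MODEL_IDS = [
    "gpt-4o-2024-11-20",
    "grok-2-latest",
    "claude-3-5-sonnet-20241022",
    "gemini-1.5-pro-002",
    "gemini-2.0-flash-thinking-exp-01-21",
    "o1-2024-12-17",
    "o3-mini-2025-01-31",
]


def temperature_for(model_id: str) -> float:
    """Return the sampling temperature to use when reproducing reported numbers."""
    return TEMPERATURE_OVERRIDES.get(model_id, DEFAULT_TEMPERATURE)
```

Pinning versioned identifiers rather than aliases matters here: an alias such as grok-2-latest may resolve to a different model over time, which is one reason reproduced numbers can drift from those in Table 1.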