Evaluation

Papers on benchmarking, evaluation methods, and model monitoring.

Overview

This section contains five papers:

  • LLM-as-a-Judge - Using GPT-4 to evaluate chat models with MT-bench
  • LLM Evaluation Survey - Comprehensive framework for LLM assessment
  • GPT Model Drift - Tracking performance changes over time
  • LMSYS-Chat-1M - Dataset of one million real-world LLM conversations
  • GPQA - Graduate-level science benchmark for scalable oversight