Evaluation
Papers on benchmarking, evaluation methods, and model monitoring.
Overview
This section contains five papers, covering:
- LLM-as-a-Judge - Using GPT-4 to evaluate chat models with MT-bench
- LLM Evaluation Survey - Comprehensive framework for LLM assessment
- GPT Model Drift - Tracking performance changes over time
- LMSYS-Chat-1M - Real-world conversation dataset
- GPQA - Graduate-level science benchmark for scalable oversight