Evaluation

Papers on benchmarking, evaluation methods, and model monitoring.

Overview

This section contains 5 papers covering:

LLM-as-a-Judge - Using GPT-4 to evaluate chat models with MT-bench
LLM Evaluation Survey - Comprehensive framework for LLM assessment
GPT Model Drift - Tracking how GPT-3.5 and GPT-4 behavior changes over time
LMSYS-Chat-1M - Dataset of one million real-world LLM conversations
GPQA - Graduate-level science benchmark for scalable oversight