Evaluating Large Language Models Trained on Code

🎯 Pitch

This paper introduces Codex, a large language model fine-tuned on public GitHub code, and evaluates code generation by functional correctness rather than text similarity. It releases the HumanEval benchmark and shows large gains in generating executable, correct code: with 100 samples per problem, the supervised variant Codex‑S solves 77.5% of the problems. The result sets a baseline for later research and for AI-assisted programming in software development and education.

1. Executive Summary

This paper introduces Codex, a family of large language models fine-tuned on public GitHub code, and presents a rigorous way to evaluate program synthesis by “functional correctness” rather than text similarity. On the HumanEval benchmark of 164 hand-written problems with unit tests, a 12B-parameter Codex solves 28.8% with one try and, when sampling 100 candidates per problem, solves up to 72.3%; a further supervised fine-tuning variant (“Codex‑S”) reaches 77.5% with 100 samples (Fig. 1; Table 1).

2. Context and Motivation

Problem addressed
Generating correct, executable code from natural language prompts (e.g., Python docstrings) and evaluating such generation reliably.
Existing benchmarks often use match-based metrics like BLEU, which do not align with whether code actually works.
Why it matters
Practical: Correctness in code is judged by tests, not by similarity to a reference. The ability to generate functionally correct code can accelerate software development, onboarding, and education.
Scientific: Establishes evaluation methods and data for measuring program synthesis ability in generative models.
Shortcomings of prior approaches
BLEU and other text-similarity metrics correlate poorly with code semantics; functionally wrong programs can score well and vice versa (Fig. 8).
General-purpose LMs (e.g., GPT-3) show limited ability to synthesize working code without code-specific training (Introduction; Table 1).
Lack of a standardized, hand-written, test-driven benchmark tailored to docstring-conditional function synthesis.
Positioning relative to existing work
Builds on large Transformer LMs but specializes via fine-tuning on code and a curated function-synthesis distribution.
Compares against GPT‑Neo and GPT‑J trained on The Pile, and a production code autocomplete system (Tabnine) (Sec. 3.4; Table 1).
Contributes a new dataset (HumanEval) and an unbiased estimator for the “pass@k” metric (Sec. 2.1; Eq. (1); Fig. 3).

3. Technical Approach

Step-by-step overview

Task formulation
Input: a prompt containing a header, a Python function signature, and a natural-language docstring.
Output: a function body that satisfies the docstring and passes unit tests.
Models generate tokens until encountering a stop sequence such as “\nclass”, “\ndef”, “\n#”, “\nif”, or “\nprint” (Sec. 3.2).
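The stop-sequence rule above can be sketched as a post-processing step on a raw completion. This is a minimal illustration, not the paper's implementation: the function name `truncate_at_stop` and the example completion are hypothetical, and only the five stop strings come from the text.

```python
# Stop sequences listed in the text (Sec. 3.2).
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]

def truncate_at_stop(completion, stops=STOP_SEQUENCES):
    """Cut a raw completion at the earliest stop sequence, if one occurs."""
    cut = len(completion)
    for stop in stops:
        idx = completion.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return completion[:cut]

# Hypothetical raw completion: a function body followed by an unrelated definition.
raw = "    return a + b\n\ndef unrelated():\n    pass\n"
body = truncate_at_stop(raw)  # keeps only the indented body line
```

Because every stop string begins with a newline followed by a non-indented token, indented statements inside the function body (for example an indented `if`) do not trigger truncation.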
Data and tokenization
Training data: 54M public GitHub repositories collected in May 2020, containing 179 GB of unique Python files under 1 MB; filtering out likely autogenerated files and other noise leaves 159 GB (Sec. 3.1).
Tokenization: Starts from GPT‑3’s text tokenizer but adds special tokens for runs of whitespace, reducing token count ~30% for code (Sec. 3.2).
Model training
Architecture: GPT-style autoregressive Transformers up to 12B non-embedding parameters (Sec. 3).
Optimization: Adam with β1=0.9, β2=0.95, ε=1e-8, weight decay 0.1; 175-step linear warmup and cosine decay; total 100B tokens (Sec. 3.2).
Initialization: Fine-tuning from GPT‑3 does not improve final loss but speeds convergence, so it is used (Sec. 3.2).
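The learning-rate schedule described above (175-step linear warmup, then cosine decay) can be sketched as follows. The peak learning rate, the total step count, and the decay floor are not given in this summary, so `peak_lr`, `total_steps`, and `final_lr` are placeholder assumptions, as is the function name.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps=175, final_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay toward final_lr.

    Only warmup_steps=175 comes from the text; the other values are
    illustrative assumptions.
    """
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run of 1,000 steps with a placeholder peak of 1e-4.
schedule = [lr_at_step(s, total_steps=1000, peak_lr=1e-4) for s in range(1001)]
```

The schedule rises during the first 175 steps, peaks, and then decreases monotonically to the floor at the final step.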
Inference and sampling
Nucleus (top‑p) sampling with p=0.95 (Sec. 3.2).
Temperature tuning matters: lower T maximizes single-shot accuracy; higher T boosts diversity for large k (Fig. 5).
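To make the interaction between temperature and nucleus sampling concrete, here is a minimal single-step sketch over a toy logit vector. It is a generic illustration of top-p sampling, not the paper's decoding code; the function name and the toy logits are hypothetical.

```python
import math
import random

def sample_top_p(logits, temperature=0.8, top_p=0.95, rng=random):
    """Sample one token index with temperature scaling and nucleus filtering."""
    # Temperature: T < 1 sharpens the distribution, T > 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus: smallest set of most-likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, mass = [], 0.0
    for i in order:
        nucleus.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights within the nucleus.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]

rng = random.Random(0)
toy_logits = [2.0, 1.5, 0.5, -1.0]  # hypothetical values
low_t = [sample_top_p(toy_logits, temperature=0.2, rng=rng) for _ in range(10)]
high_t = [sample_top_p(toy_logits, temperature=0.8, rng=rng) for _ in range(10)]
```

At low temperature the samples concentrate on the most likely tokens, which favors pass@1; at higher temperature they spread out, which gives a large sample budget more distinct candidates to test.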
Evaluation metric and its computation
pass@k: probability that at least one of k generated samples passes all tests.
Unbiased estimation: generate n≥k samples per problem and count the c that pass all tests; the estimate is 1 − C(n−c, k)/C(n, k), averaged over problems (Eq. (1)). Fig. 3 evaluates this stably as 1 − Π_{i=n−c+1 to n} (1 − k/i), returning 1 when n − c < k.
Rationale: avoids bias of the common 1 − (1 − p̂)^k estimator (Appendix A; Fig. 13).
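The product form quoted above translates directly into a few lines of code. The sketch below follows that formula in plain Python (the paper's Fig. 3 uses NumPy); `naive_pass_at_k` is included only to contrast with the plug-in estimator the text calls biased, and the per-problem counts in the usage lines are made up.

```python
def pass_at_k(n, c, k):
    """Unbiased per-problem estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that pass all tests, k: budget (k <= n).
    """
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a passing sample.
        return 1.0
    failure = 1.0
    for i in range(n - c + 1, n + 1):
        failure *= 1.0 - k / i
    return 1.0 - failure

def naive_pass_at_k(n, c, k):
    """Plug-in estimate 1 - (1 - c/n)^k, shown only for comparison."""
    return 1.0 - (1.0 - c / n) ** k

# Hypothetical pass counts for three problems, each with n = 200 samples.
correct_counts = [0, 3, 120]
benchmark_pass_at_10 = sum(pass_at_k(200, c, 10) for c in correct_counts) / len(correct_counts)
```

The benchmark-level number is the mean of the per-problem estimates. For small n the two estimators differ visibly: with n=5, c=2, k=2 the unbiased value is about 0.70 while the plug-in value is about 0.64.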
Benchmarks and environment
HumanEval: 164 problems, each with signature, docstring, body, and ~7.7 tests on average; hand-written to avoid training leakage (Sec. 2.2).
Secure execution: gVisor-based sandbox with eBPF firewall to safely run untrusted code (Sec. 2.3).
Supervised fine-tuning on function synthesis (“Codex‑S”)
Goal: reduce distribution mismatch by training specifically on standalone, correct functions (Sec. 4).
Data construction:
- 10,000 problems from competitive programming and interview sites; unit tests built from examples and exploratory submissions (Sec. 4.1).
- ~40,000 problems from tracing functions during CI runs across open-source projects and PyPI packages; inputs/outputs captured with sys.setprofile (Sec. 4.2).
Quality filtering: keep only problems for which Codex‑12B can produce at least one solution among 100 samples; repeat to remove non-deterministic ones (Sec. 4.3).
Training: left-pad prompts to align outputs; minimize NLL on reference solutions; lower learning rate (1/10 of Codex fine-tuning); stop when validation loss plateaus (Sec. 4.4).
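The left-padding step can be pictured with a small batching sketch. Everything here is illustrative: the token ids, the pad id, and the helper name `build_batch` are hypothetical, and restricting the loss to solution tokens is an assumption of this sketch, chosen to match "minimize NLL on reference solutions".

```python
PAD_ID = 0  # hypothetical padding token id

def build_batch(examples, pad_id=PAD_ID):
    """Left-pad prompts so every reference solution starts at the same column.

    examples: list of (prompt_token_ids, solution_token_ids) pairs.
    Returns (inputs, loss_mask); loss_mask is 1 only on solution tokens.
    """
    max_prompt = max(len(prompt) for prompt, _ in examples)
    max_solution = max(len(solution) for _, solution in examples)
    inputs, loss_mask = [], []
    for prompt, solution in examples:
        left = [pad_id] * (max_prompt - len(prompt))
        right = [pad_id] * (max_solution - len(solution))
        inputs.append(left + prompt + solution + right)
        loss_mask.append([0] * max_prompt + [1] * len(solution) + [0] * len(right))
    return inputs, loss_mask

# Two toy examples with made-up token ids.
inputs, loss_mask = build_batch([([5, 6, 7], [8, 9]), ([4], [1, 2, 3])])
```

In this toy batch both solutions begin at column 3, so the positions that contribute to the loss line up across rows.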
Reverse task: docstring generation (“Codex‑D”)
Train on code-to-docstring by concatenating signature + body → docstring and minimizing NLL (Sec. 5).
Use “back-translation” as a ranking heuristic: select the code sample that maximizes the probability Codex‑D assigns to the ground-truth docstring given that sample (Sec. 5; Fig. 7).
Sample selection heuristics (when you must return only one sample)
Rank candidates by mean token log-probability; this consistently beats random selection and sum log-probability (Sec. 3.3; Fig. 7).
Back-translation ranking helps over random but trails mean log-prob (Sec. 5; Fig. 7).
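The difference between mean and sum log-probability ranking is easy to see on a toy example. The sketch below assumes per-token log-probabilities are available for each candidate; the candidate strings, scores, and function names are invented for illustration.

```python
def mean_logprob(token_logprobs):
    """Average per-token log-probability (length-normalized score)."""
    return sum(token_logprobs) / len(token_logprobs)

def pick_by_mean_logprob(candidates):
    """candidates: list of (code, token_logprobs). Returns the top-ranked code."""
    return max(candidates, key=lambda cand: mean_logprob(cand[1]))[0]

def pick_by_sum_logprob(candidates):
    """Unnormalized alternative; it favors short completions."""
    return max(candidates, key=lambda cand: sum(cand[1]))[0]

# Made-up candidates: a short, low-confidence body and a longer, confident one.
candidates = [
    ("    pass", [-1.0, -1.0]),
    ("    return sorted(xs)[len(xs) // 2]", [-0.3] * 10),
]
by_mean = pick_by_mean_logprob(candidates)
by_sum = pick_by_sum_logprob(candidates)
```

Here the sum score picks the two-token body (total -2.0 versus -3.0), while the mean score picks the longer candidate (-0.3 versus -1.0 per token). That length bias is one plausible reading of why sum log-probability ranks poorly in Fig. 7.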

4. Key Insights and Innovations

Functional-correctness evaluation with an unbiased pass@k estimator
What’s new: Defines pass@k as the central metric and provides an unbiased estimator (Eq. (1); Fig. 3; Appendix A), addressing high-variance and bias issues with naïve estimators.
Why it matters: Aligns evaluation with how developers actually judge code (by tests), and enables fair comparison across different sample budgets.
HumanEval and a secure execution harness
What’s new: A hand-written, unit-test-based benchmark tailored to docstring-conditioned function synthesis (Sec. 2.2) and a gVisor-based sandbox for safe execution (Sec. 2.3).
Why it matters: Reduces contamination risk, standardizes problem format, and makes automated, safe correctness checking practical.
Demonstration that code-specific fine-tuning plus sampling dramatically lifts capability
What’s new: Fine-tuning on a large Python corpus yields strong scaling (loss follows a power law in parameters; Fig. 4) and high pass rates; diverse sampling (higher T) becomes beneficial as k increases (Fig. 5–6).
Why it matters: Shows that the bottleneck is not just model size, but also domain specialization and search via sampling.
Distribution matching via supervised fine-tuning (Codex‑S)
What’s new: Curates tens of thousands of standalone function tasks and shows large gains: averaged across model sizes, Codex‑S beats the corresponding Codex by 6.5 percentage points on pass@1 and 15.1 on pass@100 (Sec. 4.5; Fig. 10).
Why it matters: Highlights that task-specific supervision can be more parameter-efficient than scaling alone (Fig. 10).
Careful failure analysis and hazard assessment
What’s new: Documents performance collapse as instruction chains grow (roughly 2–3× drop per added component; Fig. 11), variable-binding errors, misalignment (Fig. 12, Fig. 14), insecure code suggestions (Fig. 15), and economic/security implications (Sec. 7; Appendices E–H).
Why it matters: Moves beyond topline accuracy to characterize where and why current models fail, informing safety and future research.

5. Experimental Analysis

Evaluation methodology
Datasets
- HumanEval: 164 problems with tests (Sec. 2.2).
- APPS: coding competition-style tasks; measures “strict accuracy” (problem solved) (Sec. 3.5; Table 2).
Metrics
- pass@k computed via Eq. (1), using n=200 samples per problem for HumanEval unless otherwise noted (Sec. 2.1).
- BLEU used diagnostically to show its weak correlation with correctness (Fig. 8); not a primary metric.
Setup
- Sampling with top‑p=0.95; temperature tuned per k (Fig. 5).
- Stop sequences to avoid generating extraneous code (Sec. 3.2).
- Secure sandbox execution (Sec. 2.3).
Main quantitative results
HumanEval, Codex vs. baselines (Table 1; figures are pass@1 / pass@100)
- Codex‑12B: 28.81% / 72.31%.
- GPT‑J‑6B: 11.62% / 27.74%.
- GPT‑Neo‑2.7B: 6.41% / 21.37%.
- Tabnine (largest free model): 2.58% / 7.59%.
Supervised fine-tuning on function tasks (Codex‑S)
- Single-sample accuracy improves from 28.8% to 37.7% for 12B (Fig. 1).
- With 100 samples per problem, Codex‑S reaches 77.5% solved (Fig. 1).
Sample selection without test execution
- Selecting by highest mean log-probability yields 44.5% solved with 100 samples, compared to the oracle upper bound of 77.5% that picks by tests (Fig. 1; Fig. 7).
Temperature vs. k
- Optimal T increases with k; example: for a 679M model, T=0.2 (pass@1) and T=0.8 (pass@100) (Fig. 5).
Scaling behavior
- Test loss follows a power law in parameter count: roughly (N / 5.92×10^7)^(-0.13) (Fig. 4).
- pass@1 and pass@100 improve smoothly with size (Fig. 6).
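As a worked illustration of the power-law fit quoted above, the snippet below evaluates (N / 5.92×10^7)^(-0.13) at a few parameter counts. The constants are the ones stated in the text; the function name and the chosen model sizes are only for illustration, and the fit should not be read as valid outside the range shown in Fig. 4.

```python
def predicted_test_loss(n_params, n_c=5.92e7, alpha=0.13):
    """Power-law fit from the text: loss ~ (N / n_c) ** (-alpha)."""
    return (n_params / n_c) ** (-alpha)

# Illustrative parameter counts (non-embedding), from 12M to 12B.
sizes = [12e6, 85e6, 679e6, 12e9]
losses = [predicted_test_loss(n) for n in sizes]

# Each 10x increase in parameters multiplies the predicted loss by 10 ** -0.13.
ratio_per_decade = predicted_test_loss(10 * 679e6) / predicted_test_loss(679e6)
```

Under this fit, every tenfold increase in parameters lowers the predicted loss by roughly a quarter, which matches the smooth improvement described for pass@1 and pass@100.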
BLEU vs. correctness
- Distributions for correct and incorrect samples overlap substantially (Fig. 8), showing BLEU is unreliable for functional correctness.
APPS results (Table 2)
- 1-shot Codex, raw pass@1000: 25.02% (Introductory), 3.70% (Interview), 3.23% (Competition).
- Keeping only samples that pass the three public examples before selecting one boosts pass@1 (e.g., 22.78% on Introductory).
Robustness and ablations
Sample ranking: mean log-probability > random; sum log-probability can underperform random (Fig. 7).
Instruction-chain stress test: each added operation cuts the pass rate by roughly a factor of 2–3 (Fig. 11).
Docstring generation: Codex‑D pass@10 is 46.5% vs. Codex‑S’s 59.5% at similar temperature; docstrings graded by hand (Table 3).
Security probes: models sometimes suggest insecure cryptographic parameters (e.g., AES‑ECB, RSA < 2048 bits) across sizes (Fig. 15).
Do the experiments support the claims?
Yes: Multiple lines of evidence show specialization and sampling are crucial. The unbiased pass@k estimator, task-specific fine-tuning, and rigorous sandboxed execution lend credibility. Failure analyses (Figs. 11–15) give a balanced view.
Notable failure cases
Misbinding actions to variables; e.g., a prompt requiring updating all four variables but returning only a partial product (Sec. 6).
Context sensitivity: subtle bugs in the in-context examples reduce performance even with an instruction to “write correct code” (Fig. 12; Fig. 14).

6. Limitations and Trade-offs

Assumptions and dependencies
Assumes unit tests sufficiently specify correctness; weak tests can mislabel incorrect programs as correct and vice versa (Sec. 2.1; broader discussion in Sec. 7).
Training on public GitHub code; inherits its biases, insecure patterns, and quality issues (Appendix F–G).
Scenarios not addressed or underexplored
Complex multi-file or system-level synthesis; most tasks are single-function (Sec. 1; Sec. 6; Appendix D).
Robust long-horizon reasoning: performance decays steeply with chained operations (Fig. 11).
Computational and data costs
Large-scale training on ~159 GB of Python and 100B tokens; fine-tuning a 12B model plus pre-training has significant compute and environmental cost (Sec. 7.6).
Not sample-efficient relative to human programmers; needs many generations per problem for high pass@k (Sec. 6).
Alignment and safety trade-offs
Misalignment: tends to mirror buggy or insecure context even when capable of doing better (Fig. 12; Fig. 14).
Security: can suggest insecure crypto usage or typosquatted packages; non-determinism complicates malware detection (Sec. 7.5; Fig. 15; Appendix G).
Evaluation constraints
Contamination or leakage risks are mitigated but not eliminated; HumanEval is hand-written, yet its novelty is not formally guaranteed (Fig. 2 caption).
Docstring quality in training data is variable; docstring generation requires manual grading (Sec. 5; Table 3).

7. Implications and Future Directions

How this work changes the landscape
Establishes functional-correctness benchmarks and methodology for code LMs (HumanEval, unbiased pass@k), shifting evaluation standards away from BLEU to unit-test-driven metrics.
Demonstrates that combining domain-specific fine-tuning with sampling and smart ranking produces substantial practical gains, foreshadowing tools like code assistants.
Follow-up research enabled or suggested
Alignment-focused training: dataset curation to filter buggy/insecure code; conditioning on quality labels; reinforcement learning from human (and tool-assisted) feedback to prefer secure, correct, and helpful code (Appendix E.4; Sec. 7.8).
Better selection without tests: develop ranking signals beyond mean log-probability and back-translation; integrate static/dynamic analysis to predict test pass likelihood.
Stronger benchmarks: more diverse, multi-file, and system-level tasks; richer test suites; robustness suites for long chained instructions and variable-binding stressors (Sec. 6; Appendix D).
Safety and security: automated detection of insecure patterns; package recommendation hygiene; rate-limiting and abuse monitoring for deployment (Sec. 7.5; 7.8).
Efficiency: methods to reduce sample budgets (e.g., guided decoding, program repair loops, or search) and compute-efficient training.
Practical applications and downstream use
Autocomplete and pair-programming assistants (e.g., Copilot-like tools).
Educational aids for learning programming via examples and docstrings.
Codebase navigation and onboarding by proposing tests, documentation, and simple refactors.
However, safe use requires human oversight, careful UI design, and guardrails to mitigate over-reliance, bias, and security risks (Sec. 7.1, 7.3, 7.5, 7.8).

Key headline results to remember:
- Codex‑12B achieves pass@1 28.81% and pass@100 72.31% on HumanEval (Table 1).
- Codex‑S (function‑focused supervised fine‑tuning) boosts pass@1 to 37.7% and reaches 77.5% with 100 samples (Fig. 1; Sec. 4.5).
- Selecting a single sample by highest mean log‑probability yields 44.5% solved, vs. an oracle 77.5% when selecting by tests (Fig. 1; Fig. 7).
- BLEU does not reliably track functional correctness: distributions overlap for correct and incorrect code (Fig. 8).
- Performance drops roughly 2–3× per added operation in chained-instruction prompts (Fig. 11).