Competitive Programming with Large Reasoning Models¶
ArXiv: 2502.06807
🎯 Pitch¶
This paper demonstrates that scaling reinforcement learning in large language models enables them to autonomously develop sophisticated reasoning abilities, achieving gold-medal performance at the International Olympiad in Informatics and outcompeting nearly all human programmers on CodeForces—without any hand-crafted, domain-specific strategies. By rigorously benchmarking three OpenAI models (o1, o1-ioi, o3), the authors show that end-to-end RL alone is sufficient to yield state-of-the-art results on complex coding tasks, marking a pivotal shift from engineered solutions to scalable and general-purpose AI reasoning. This finding signals that learning-based approaches can unlock robust, broadly applicable AI systems for high-stakes reasoning domains beyond programming, including math and software engineering.
1. Executive Summary (2-3 sentences)¶
This paper investigates how reinforcement learning (RL) applied to large language models can turn them into large reasoning models (LRMs) that solve complex coding tasks competitively. It compares three systems—o1 (general reasoning with tool use), o1-ioi (o1 fine‑tuned plus hand‑engineered test‑time strategies for the International Olympiad in Informatics), and an early o3 checkpoint (scaled general-purpose RL)—and shows that scaling general-purpose RL yields gold‑medal IOI performance and a CodeForces rating near top human competitors without domain‑specific heuristics (Figures 5 and 7).
2. Context and Motivation¶
- Gap addressed:
- Competitive programming demands precise algorithmic reasoning under strict time and memory limits. Previous LLM systems that performed well (e.g., AlphaCode/AlphaCode2) relied on massive sampling (up to 1M candidates per problem) and hand‑crafted selection heuristics at inference, which are expensive and domain‑specific (Section 1).
-
The open question: are domain‑specific, hand‑designed pipelines necessary, or can end‑to‑end RL training on general reasoning tasks make models discover equally effective test‑time strategies on their own (Section 1)?
-
Why it matters:
- Real‑world impact: Programmers and organizations need reliable AI that can synthesize correct, efficient code, not just plausible snippets. Competitive programming offers objective, rigorous grading at scale (Section 1).
-
Theoretical significance: Understanding whether reasoning strategies can be learned (rather than engineered) informs how to scale AI reasoning ability in broader domains like math and software engineering.
-
Prior approaches and limitations:
- Program synthesis with LLMs shows accuracy grows with model size and fine‑tuning (Section 1; [1], [2]).
- AlphaCode/AlphaCode2 improved substantially via large‑scale sampling and selection heuristics, but still depended on hand‑engineered, coding‑specific inference logic (Section 1; [6], [7]).
-
Early chain‑of‑thought (CoT)–style LRMs (e.g.,
o1, independent work like DeepSeek‑R1, Kimi k1.5) improved reasoning but left unclear whether domain‑specific tricks remained necessary at test time (Section 1; [3], [12], [15]). -
Positioning:
o1: general LRM trained with RL that reasons using extended internal chains of thought and can call tools for code execution and testing (Section 2).o1-ioi: a domain‑tailored system that adds coding‑focused RL fine‑tuning plus a sophisticated, hand‑designed test‑time pipeline for IOI (Section 3).o3: a later, scaled general-purpose LRM trained with more RL compute that learns its own test‑time behaviors without domain‑specific heuristics, outperformingo1-ioi(Section 4).
3. Technical Approach¶
This section unpacks what the three systems do and how the evaluations are run.
- Definitions (paper‑specific or uncommon):
Large reasoning model (LRM): a language model trained via RL to produce explicit, extended “thinking” (chain of thought) before final answers, and to use tools (e.g., code execution) effectively (Section 2).Test-time compute: extra computation the model uses while answering a single prompt (e.g., producing long chains of thought, running code, trying multiple variants). Figures 2, 3 show performance increases from more test-time compute.IOI subtask: IOI problems are split into subtasks, each with separate constraints and points; scores are the sum over subtasks. Competitors can submit multiple programs; the best score on each subtask counts (Section 3.2).Public tests: sample tests provided in a problem statement or portal;full testsare hidden evaluator tests. Humans see pretests during contests; models here are evaluated against full tests offline (Appendix B.2).-
pass@k: probability of solving with up tokindependent attempts (here used in analysis like “pass@10” for CodeForces simulation; Appendix B.2). -
System 1 —
o1(general-purpose LRM with tools; Section 2): - RL training produces longer internal reasoning that helps break down complex problems, explore alternatives, and self‑correct.
-
Tool use: runs generated code in a sandbox to compile, execute, and verify against tests, then iteratively refines the solution in a single sample (Section 2).
-
System 2 —
o1-ioi(domain‑specific fine‑tune + engineered inference; Section 3): - Coding‑focused RL fine‑tuning (Section 3.1):
- Resume RL from an
o1checkpoint. - Emphasize harder programming problems and C++ code generation, including runtime checks.
- Produce outputs in IOI submission format.
- Resume RL from an
-
Hand‑engineered test‑time pipeline (Section 3.2):
- Problem decomposition: create one prompt per IOI subtask.
- Massive sampling: generate 10,000 samples per subtask.
- Model‑generated tests:
- Ask the model to write C++ test input generators and validators per subtask.
- Generate 256 random test inputs and accept each input that passes at least 75% of validators (Section 3.2, “Clustering”).
- Clustering by behavior: group programs if they produce identical outputs on the 256 accepted inputs.
- Reranking:
- Score each sample by (i) a learned scoring function, (ii) errors on model‑generated tests, and (iii) failing public tests.
- Give each cluster an aggregate score (mean of its samples minus a penalty for repeated submissions from that cluster). Tune penalty weights via random search on prior IOI problems by simulating submissions (Section 3.2, “Reranking”).
- Submission strategy:
- Up to 50 submissions total (IOI limit), allocated round‑robin across subtasks starting from the hardest.
- When a subtask is solved, stop sampling for it.
- For supersets of solved subtasks, filter candidates that don’t match outputs on previously solved easier subtasks to rapidly narrow viable options (Section 3.2, “Submission”).
-
System 3 —
o3(scaled general-purpose LRM; Section 4): - Goal: rely on learned test‑time behaviors rather than human‑designed pipelines.
- Observed emergent behavior: for some problems,
o3writes a simple brute‑force solution to verify correctness of its optimized solution by cross‑checking outputs, thereby catching subtle errors (Figure 6). -
Evaluation procedure differs from
o1-ioi(Section 4.2):- Single prompt per IOI problem (no subtask‑specific prompts).
- Sample 1,024 solutions per problem; choose the 50 with the highest test‑time compute (i.e., the most deliberate reasoning).
- No clustering or manual selection heuristics; yet, solutions often cover many or all subtasks.
-
CodeForces contest simulation (Appendix B):
- Data: Division 1 contests from late 2023 and 2024 (post‑training cutoff), excluding one with an interactive problem (Appendix B.1).
- Contamination check: embedding search to ensure test problems are unseen (Appendix B.1).
- Grading: run against full hidden tests; following AlphaCode, allow up to 10 independent submissions; mark solved if any one passes (
pass@10, Appendix B.2). - Time scoring fairness: because models can think in parallel, the contest score for solved problems is set to the median score of humans who solved the same problem with the same number of failed attempts (Appendix B.3).
-
Rating estimation: compute the single model rating that maximizes the likelihood of observed rankings vs human ratings, using the Elo‑style logistic model described in Appendix B.4.
-
IOI 2024 evaluations:
- Live competition with
o1-ioi: 10 hours, six problems, 50 submissions per problem (Section 3.4). - Retrospective eval with
o3: same six problems, 50‑submission limit; 1,024 samples/problem; pick 50 highest compute (Section 4.2). -
“Gold” threshold ≈ 360 points (Section 4.2 and Figure 7).
-
Software engineering benchmarks (Section 5):
- HackerRank Astra: 65 realistic, multi‑file coding tasks across frameworks. No public tests, so test‑time heuristics like public‑test filtering are not available (Section 5.1).
- SWE‑bench Verified: curated 500 GitHub issues with corrected grading and clearer specs; each model gets up to 5 tries to produce a patch; evaluations averaged over 3 trials; system failures aren’t penalized but retried (Section 5.2).
4. Key Insights and Innovations¶
- Emergent learned test‑time strategies from general RL (fundamental):
-
o3learns to self‑validate by generating a brute‑force checker and cross‑comparing with its optimized solution (Figure 6). This behavior replaces much of the human‑designed selection pipeline needed foro1-ioi, demonstrating that end‑to‑end RL can teach models how to test their own code. -
General-purpose RL scaling beats domain‑specific pipelines (fundamental):
-
Despite
o1-ioi’s strong domain tailoring and complex test‑time engineering, a latero3checkpoint surpasses it with a simpler sampling policy (Section 4.2). On IOI 2024,o3achieves gold within the standard 50‑submission limit, whileo1-ioineeded either hand‑engineered selection or relaxed submission limits to reach gold (Figure 7). -
Two knobs that reliably raise reasoning performance: more RL training compute and more test‑time compute (broad, empirically grounded insight):
-
Figure 2 shows monotonic gains in competitive mathematics (AIME) both when increasing RL training compute (“o1 AIME accuracy during training”) and when allocating models more test‑time compute.
-
A practical, reproducible contest‑simulation methodology for CodeForces (incremental but important):
- The paper details how to approximate fairness (Appendix B.3), guard against data leakage (Appendix B.1), and map performance to an estimated CodeForces rating via an Elo‑like model (Appendix B.4). This framework helps the community compare models against strong human baselines.
5. Experimental Analysis¶
- Evaluation methodology
- CodeForces (Sections 2.1, 3.3, 4.1; Appendix B):
- Division 1 contests after training cutoff.
- Full hidden tests;
pass@10acceptance; rating estimation as in Appendix B.4. - Time advantage corrected by replacing model time with median human score at the same failure count (Appendix B.3).
- IOI 2024 (Sections 3.4, 4.2):
o1-ioicompeted live under official rules (50 submissions/problem).o3evaluated retrospectively with the same limits and a simple selection rule (top‑compute 50 of 1,024 samples).
- HackerRank Astra (Section 5.1):
- 65 multi‑file tasks; report
pass@1and average proportion of tests passed. No public tests; hand‑crafted test-time heuristics are moot here.
- 65 multi‑file tasks; report
-
SWE‑bench Verified (Section 5.2):
- 500 curated GitHub issues; 5 attempts per task; percent of issues resolved.
-
Main quantitative results
- CodeForces ratings
- Figure 1:
gpt‑4o: 808 (11th percentile)o1‑preview: 1258 (62nd)o1: 1673 (89th)- Figure 3 (o1-ioi ablations):
o1-ioivanilla: 1807 (93rd)-
- public‑test filtering: 2092 (96th)
-
- full test‑time strategy (clustering + reranking): 2214 (98th)
- Figure 5:
o3(early checkpoint): 2724 (99.8th)- Appendix B.6 Table 1 provides a per‑contest, per‑problem breakdown (e.g., contest 1951 shows solutions up to G=3100 rating, H/I unsolved; contest 1916 solves up to H2=2700).
- IOI 2024 scores
- Figure 4 (o1‑ioi, live contest):
- 50 submissions with hand‑crafted selection: 213 points (49th percentile).
- If randomly selecting 50 submissions from the 10k pool: 156 points.
- If allowed 10k submissions (no selection strategy): 362.14 points (above gold threshold).
- Figure 7 (o3 retrospective vs o1‑ioi):
o1‑ioi(50 submissions): 213o1‑ioi(10k submissions): 362.14o3(50 submissions, simple top‑compute selection): 395.64 (gold)
- Emergent test‑time behavior
- Figure 6:
o3writes and runs a brute‑force solver to validate outputs from an optimized solution, a behavior not specified by humans.
- Figure 6:
- HackerRank Astra (Figure 8):
gpt‑4o: 50.91% pass@1; 69.52% avg scoreo1‑preview: 60.89% pass@1; 75.55% avg scoreo1: 63.92% pass@1; 75.80% avg score
-
SWE‑bench Verified (Figure 9):
gpt‑4o: 33.2% correcto1‑preview: 41.3%o1: 48.9%o3: 71.7%
-
How convincing are the experiments?
- The CodeForces methodology addresses key fairness issues (parallel “thinking”) and data leakage (Appendix B.1–B.4). The large, unseen contest set and per‑problem breakdowns in Table 1 support the headline rating claims.
- The IOI analysis isolates the impact of test‑time pipelines vs general RL. Figure 4 acts as an ablation: random 50 vs engineered selection vs lifting the submission limit. Figure 7 then shows a later
o3checkpoint beatingo1‑ioiwith a simpler selection rule. -
The software‑engineering results (HackerRank Astra, SWE‑bench Verified) generalize the reasoning gains beyond competitive programming (Figures 8–9).
-
Failure cases and caveats
- The very hardest CodeForces problems (e.g., 3300–3500 rating) remain frequently unsolved, and overall solve rate still trails the very top humans (Figure 10).
- The CodeForces simulation allows
pass@10against full tests (Appendix B.2), different from how humans experience pretests during the contest, though authors argue Division 1 pretests are typically strong. - Interactive problems were excluded (Appendix B.1), so results do not cover that category.
6. Limitations and Trade-offs¶
- Assumptions and scope:
- Heavy reliance on a secure code‑execution sandbox and tool use (Section 2). Real‑world environments with nondeterminism, external services, or complex build systems may pose additional challenges not captured here.
-
The CodeForces simulation and IOI retrospective results are careful but still approximations of live human competition dynamics (Appendix B.2–B.4).
-
Compute and scalability:
o1-ioidepends on very large test‑time sampling (10k samples/subtask), plus custom clustering and reranking (Section 3.2). This is expensive and domain‑specific.-
o3uses less hand engineering but is a larger, more compute‑intensive model trained with substantially more RL compute (Sections 4–5). The paper does not disclose precise compute budgets, which makes cost–benefit analysis hard. -
Transparency:
-
While the paper explains inference behaviors and evaluation in detail, it does not detail
o3’s architecture or exact RL training algorithmic choices. That limits reproducibility of the training process even though evaluation is well described. -
Generalization boundaries:
- Interactive programming tasks were excluded in CodeForces (Appendix B.1).
- Some contest‑specific features (e.g., pretests vs full tests, time pressure, hacking in CodeForces) are not fully replicated.
-
The retrospective IOI evaluation for
o3samples only 1,024 solutions per problem and uses highest‑compute selection; different sampling/selection policies could change scores. -
Measurement choices:
- CodeForces time‑scoring adjustment uses median human scores for similar fail counts (Appendix B.3). Although sensible, alternative adjustments (e.g., exact wall‑clock model times at fixed GPU budgets) could yield different estimated ratings.
7. Implications and Future Directions¶
- How this work shifts the landscape:
-
It provides strong evidence that scaling general‑purpose RL produces LRMs that learn sophisticated test‑time strategies (like self‑verification) without domain‑specific heuristics, while achieving state‑of‑the‑art results in competitive programming (Figures 5–7). This is a step from “engineer the inference pipeline” toward “train the behavior.”
-
Practical applications:
- Competitive programming assistance, automated bug‑fixing, and feature implementation in real codebases, suggested by gains on SWE‑bench Verified (Figure 9) and HackerRank Astra (Figure 8).
-
Education and contest training: models that can generate brute‑force checkers and cross‑validate solutions offer novel pedagogy and debugging workflows (Figure 6).
-
Research directions enabled:
- Learned testing strategies: formalize and strengthen the emergent behaviors (e.g., test generation, property checks, differential testing) so models can systematically validate their own code as problems get harder.
- Cost‑aware reasoning: optimize the trade‑off between test‑time compute and accuracy (Figure 2) with dynamic compute allocation, e.g., deciding when to invoke brute‑force validation vs trusting an optimized solution.
- Broader benchmarks: extend beyond non‑interactive problems to interactive protocols and multi‑component systems; include runtime constraints that penalize slow brute‑force checks to encourage algorithmic efficiency.
-
Interpretability of reasoning: analyze chains of thought and learned verification routines to understand failure modes on very hard problems (e.g., 3300–3500 CodeForces tasks in Table 1).
-
Standards and evaluation:
- The contest‑simulation blueprint in Appendix B (data leakage checks, pass@k policy, rating estimation, time‑scoring correction) can serve as a community reference for fair, apples‑to‑apples evaluation of LRMs on competitive programming.
Bottom line: With
o3, general-purpose RL training—not domain‑specific test‑time heuristics—emerges as the more robust route to top‑tier coding and reasoning performance, achieving a 2724 CodeForces rating (99.8th percentile; Figure 5) and a gold‑medal score on IOI 2024 with only 50 submissions (Figure 7), while also transferring to real‑world software tasks (Figures 8–9).