Why AI Isn't a Coding Grandmaster Yet (And How We're Cracking it Efficiently)
Published on June 29, 2025
Read time: 6 mins

Recent advances in Large Language Models (LLMs) have significantly improved their reasoning capabilities. Models such as OpenAI's 'o' series and DeepSeek-R1, which incorporate Chain-of-Thought (CoT) reasoning to facilitate step-by-step logical thinking and effective test-time scaling, demonstrate promising performance across a variety of tasks. Among these tasks, coding is a highly challenging domain because it requires absolute syntactic precision and the tracking of long-range dependencies. Within coding, competitive programming (CP) stands out as a particularly intellectually demanding discipline. Compared to general software engineering tasks, CP problems rarely offer straightforward solutions and instead require specialized, often non-standard algorithms derived through sharp analytical thinking.

Measuring the Gap: Introducing LiveCodeBench Pro

Our latest research reveals that a significant reasoning gap persists between even frontier models and elite human programmers. To precisely measure this gap, we introduced LiveCodeBench Pro, a continuously updated benchmark designed to rigorously evaluate LLMs on competitive programming challenges.

Developed in collaboration with researchers from New York University, Princeton University, the University of California San Diego, and other institutions, LiveCodeBench Pro overcomes limitations of existing evaluations such as data contamination. It sources high-quality problems in real time from top-tier contests like Codeforces, the ICPC, and the IOI, and it features detailed annotations from a team of Olympiad medalists who analyze each problem and every failed model submission.

Figure 1. Model performances across easy, medium, and hard CP problems

LiveCodeBench Pro begins where most benchmarks end: at the instant a Codeforces, ICPC, or IOI round finishes. A crawler captures every statement, input-generator, and judge immediately—before editorials appear or solutions propagate onto GitHub—then freezes the original time and memory limits so evaluation mirrors the live-contest sandbox. Each task is hammered by Codeforces’ “hack” phase, plus an in-house fuzz-tester that mutates edge cases until coverage plateaus, ensuring the hidden tests are at least as adversarial as what human finalists face. Finally, all submissions run inside a uniform Docker image with deterministic compilers and wall-clock timers, so results are free from runtime skew across languages or hardware.
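The exact harness is internal to the benchmark, but the core of the judging loop can be illustrated with a minimal sketch: run an already-compiled submission under a frozen time and memory limit, time it on a wall clock, and compare its output against the hidden answer. The limits, binary path, and verdict strings below are hypothetical placeholders rather than LiveCodeBench Pro's real configuration.

```python
import resource
import subprocess
import time

TIME_LIMIT_S = 2.0                        # hypothetical per-problem limit, frozen from the contest
MEMORY_LIMIT_BYTES = 256 * 1024 * 1024    # hypothetical 256 MB memory cap


def _apply_limits():
    # Runs in the child process just before exec: cap CPU time and address space.
    resource.setrlimit(resource.RLIMIT_CPU, (int(TIME_LIMIT_S) + 1, int(TIME_LIMIT_S) + 1))
    resource.setrlimit(resource.RLIMIT_AS, (MEMORY_LIMIT_BYTES, MEMORY_LIMIT_BYTES))


def judge(binary_path: str, test_input: str, expected_output: str) -> str:
    """Run one compiled submission on one hidden test and return a verdict string."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            [binary_path],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=TIME_LIMIT_S,
            preexec_fn=_apply_limits,  # POSIX-only, mirrors a per-run sandbox
        )
    except subprocess.TimeoutExpired:
        return "TLE"
    elapsed = time.monotonic() - start
    if proc.returncode != 0:
        return "RUNTIME_ERROR"
    if proc.stdout.split() != expected_output.split():
        return "WRONG_ANSWER"
    return f"ACCEPTED ({elapsed:.2f}s)"
```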

Rather than lumping everything into a single pass@k metric, we assign every problem an Elo-style difficulty derived from historical solve rates of top human contestants (≤ 2000 Easy, 2000-3000 Medium, > 3000 Hard). Because these ratings update as more humans and models attempt the tasks, the corpus stays balanced even as the leaderboard evolves. This three-tier structure prevents easy problems from swamping the signal and lets us report model scores the same way Codeforces ranks people—making human-versus-LLM comparisons intuitive and fair.
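In code, the tier assignment reduces to a simple threshold rule over a problem's Elo-style rating; a minimal sketch using the cutoffs stated above:

```python
def difficulty_tier(elo_rating: float) -> str:
    """Map a problem's Elo-style difficulty rating to its LiveCodeBench Pro tier."""
    if elo_rating <= 2000:
        return "Easy"
    if elo_rating <= 3000:
        return "Medium"
    return "Hard"


# Example: a problem rated 2750 (hypothetical value) lands in the Medium bucket.
assert difficulty_tier(2750) == "Medium"
```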

Olympiad medalists review each problem and tag its dominant cognitive burden using a twenty-label ontology collapsed into three headline buckets:

  • Knowledge-heavy: unlocks once you recall a canonical technique or data-structure template (e.g., Fenwick tree, FFT).
  • Logic-heavy: demands step-wise deduction, proofs, or DP state design before any code can be written.
  • Observation-heavy: hinges on a short, creative “aha!” that shrinks the search space; code is trivial once the insight lands.

These tags accompany the public test logs, so researchers can slice results by bucket and see, for example, that a model's accuracy might be 70% on knowledge-heavy Easy tasks yet just 8% on observation-heavy Medium ones.
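Because the tags travel with the public test logs, such per-bucket slices are a simple aggregation. Below is a hedged sketch with pandas, assuming hypothetical column names (`bucket`, `tier`, `solved`) rather than the logs' actual schema:

```python
import pandas as pd

# Hypothetical log layout: one row per (model, problem) attempt.
logs = pd.DataFrame(
    {
        "bucket": ["knowledge", "observation", "observation", "logic"],
        "tier": ["Easy", "Medium", "Medium", "Hard"],
        "solved": [True, False, False, True],
    }
)

# Accuracy per (bucket, tier) cell, the kind of slice behind the 70% vs. 8% comparison above.
accuracy = logs.groupby(["bucket", "tier"])["solved"].mean().unstack("tier")
print(accuracy)
```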

Every failed run is stored with compiler output, runtime traces, and a verdict label—syntax error, wrong answer, TLE, or memory fault—plus a short, human-authored note explaining the conceptual misstep when it’s more subtle than a crash. Because tasks, tags, and traces are all version-controlled, anyone can replicate experiments, add ablations, or retrain models without worrying that the ground truth has drifted. The result is a benchmark that doesn’t just rank models; it tells you why they lose, and exactly which kind of reasoning they need next.
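As a rough illustration of what one stored failure record could contain, here is a hedged sketch; the class, field, and label names are illustrative and not the benchmark's actual schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    SYNTAX_ERROR = "syntax_error"
    WRONG_ANSWER = "wrong_answer"
    TLE = "time_limit_exceeded"
    MEMORY_FAULT = "memory_fault"


@dataclass
class FailedRun:
    problem_id: str                        # versioned task identifier
    model: str                             # which model produced the submission
    verdict: Verdict                       # coarse failure label
    compiler_output: str                   # compiler/interpreter diagnostics
    runtime_trace: str                     # stdout/stderr and timing from the sandboxed run
    annotator_note: Optional[str] = None   # human-authored note on the conceptual misstep
```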

Our analysis with LiveCodeBench Pro revealed several key insights:

  • Performance varies by problem type. Models perform well on “knowledge-heavy” and “logic-heavy” problems that require applying known templates or structured thinking. However, their performance collapses on “observation-heavy” problems (like game theory or greedy algorithms) that demand novel insights.
  • Models fail differently than humans. A detailed diagnosis of failed submissions from the o3-mini model showed that it makes significantly more conceptual errors in algorithm logic than human competitors of a similar rating. In contrast, the model's implementation is a strong suit, with fewer low-level coding errors than humans.
  • Performance is heavily influenced by external aids. Allowing models multiple attempts (pass@k; see the estimator sketch below) substantially improves their ratings, though they still fail in the hardest tier. The highest reported scores for models like o4-mini are also largely attributable to tool use, such as terminal access for local compilation and testing.
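For reference, pass@k is typically computed with the standard unbiased estimator: draw n samples per problem, count the c that pass, and estimate 1 - C(n-c, k) / C(n, k), averaged over problems. A minimal sketch:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 correct solutions out of 20 samples still gives a fairly high pass@5.
print(round(pass_at_k(n=20, c=3, k=5), 3))
```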

This work confirmed that while LLMs are powerful, conquering the nuanced reasoning of competitive programming requires more than raw scale. It also motivated us to better understand how coding performance scales with training data, and how efficiently that scaling can be achieved.

Cracking the Code Efficiently with Dobby-CP (Competitive Programmer)

To investigate this systematically, we explored how best to teach smaller, open-source models these complex skills through supervised fine-tuning (SFT). We drew inspiration from several areas:

  • Knowledge distillation. Existing work uses thinking trajectories generated by larger models to teach smaller ones. This form of knowledge distillation allows more efficient models to learn complex reasoning patterns and problem-solving strategies from their larger counterparts.
  • Human expertise. Human programmers typically learn from curated educational materials written by experts. Translating this expertise into data that AI models can learn from is another method for improving LLM coding abilities, and it serves as a high-quality complement to model-generated data.
  • Cross-domain knowledge. Insights from human cognition suggest that proficiency in programming is often augmented by knowledge and skills from other domains such as math.

Motivated by these insights, our work on the Dobby-CP models investigates the efficiency laws that govern improvements in LLM performance on competitive programming tasks. Specifically, we examine how training efficiency is influenced by three core factors (a brief data-mixing sketch follows the list):

  • Data source. Which types of training data (e.g., human-annotated solutions, data distilled purely from advanced models, or distillation supervised by human editorials) yield the most efficient learning?
  • Data domain. Is learning most effective using purely CP-focused data, or does incorporating data from broader domains improve efficiency?
  • Data quantity. How does model performance scale as the amount of training data increases?
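To make these three factors concrete, here is a purely illustrative sketch of how an SFT mixture could be described along the source, domain, and quantity axes; the labels and counts are hypothetical and are not the actual Dobby-CP recipe:

```python
from dataclasses import dataclass


@dataclass
class MixtureSlice:
    source: str    # e.g., "human_editorial", "pure_distillation", "editorial_guided_distillation"
    domain: str    # e.g., "competitive_programming", "math", "general_reasoning"
    examples: int  # how many SFT examples this slice contributes


# Hypothetical blend: mostly CP data, with a cross-domain slice mixed in.
mixture = [
    MixtureSlice("editorial_guided_distillation", "competitive_programming", 15_000),
    MixtureSlice("pure_distillation", "competitive_programming", 5_000),
    MixtureSlice("editorial_guided_distillation", "math", 4_000),
]

total = sum(s.examples for s in mixture)
for s in mixture:
    print(f"{s.source:>32} | {s.domain:<24} | {s.examples / total:.0%}")
```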

We introduce the Dobby-CP (7B and 14B) models, a family of efficient yet powerful competitive programming assistants trained on small datasets (24k training examples for the 14B model and 40k for the 7B model) that achieve competitive results on LiveCodeBench. The Pass@1 of the Dobby-CP models rivals that of models trained on significantly more data, such as R1-Distill-Qwen-7B/14B (800k reported examples), and approaches the performance of closed-source models like OpenAI-o1-mini. Our key findings provide a practical guide for resource-efficient fine-tuning:

  1. Finding the right teacher. We found that CoT data distilled from LLMs under the supervision of human editorials provides better learning signals than either human-written or purely distilled data.
  2. Blended curriculum helps training. A blended curriculum combining competitive programming tasks with general, cross-domain reasoning problems outperforms purely domain-specific training.
  3. Efficient scaling law. Most critically, we found that increasing problem diversity is the most important factor for performance gains, while increasing the number of solution samples per problem yields diminishing returns (see the selection sketch below). Figure 2 illustrates how fine-tuning data volume correlates with model performance, positioning Dobby-CP on the most efficient frontier: it delivers top-tier reasoning accuracy with minimal fine-tuning cost.
Figure 2. SFT performance frontier over increasing dataset size
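As a rough illustration of the diversity-first finding, a selection procedure that exhausts new problems before adding extra solutions per problem could look like the following; the function, budget, and data layout are hypothetical and not the actual Dobby-CP pipeline:

```python
from collections import defaultdict


def select_sft_examples(candidates, budget):
    """Pick up to `budget` (problem_id, solution) pairs, preferring unseen problems first.

    Each round adds at most one extra solution per problem, so problem diversity
    saturates before the per-problem solution count starts to grow.
    """
    by_problem = defaultdict(list)
    for problem_id, solution in candidates:
        by_problem[problem_id].append(solution)

    selected = []
    round_idx = 0
    while len(selected) < budget:
        added = False
        for problem_id, solutions in by_problem.items():
            if round_idx < len(solutions) and len(selected) < budget:
                selected.append((problem_id, solutions[round_idx]))
                added = True
        if not added:  # every available solution has been used
            break
        round_idx += 1
    return selected


# Example: with a budget of 4, all three problems appear before any problem
# contributes a second solution.
data = [("A", "sol_a1"), ("A", "sol_a2"), ("B", "sol_b1"), ("C", "sol_c1")]
print(select_sft_examples(data, budget=4))
```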

Conclusion

LiveCodeBench Pro has made the reasoning gap between today’s best LLMs and seasoned competitive coders impossible to ignore—yet Dobby-CP shows that this gap can shrink rapidly when we pair the right teachers with the right curriculum. By blending distilled chain-of-thought traces from frontier models, Olympiad-level human insights, and a diverse mix of cross-domain puzzles, we found an efficiency law that prizes problem variety over sheer data volume. The result is a lean, openly available 7B/14B family that hits performance once reserved for much larger or closed models while demanding only a fraction of the fine-tuning budget. Bridging the last mile to true “coding grandmaster” status will still require breakthroughs in observation-heavy reasoning and real-time tool use—but with LiveCodeBench Pro continuously surfacing fresh challenges, and Dobby-CP charting a clear path toward data-efficient mastery, the playbook for closing that gap has never been clearer.