Key Takeaways
- SWE-bench, released in October 2023, became the de facto standard for AI coding agent evaluation following Cognition's Devin launch in March 2024.
- The SWE-bench ecosystem has expanded significantly and now includes multilingual and multimodal versions spanning nine languages and 40 repositories.
- New benchmarks like CodeClash are emerging to evaluate long-horizon development, with AI agents competing across diverse programming arenas.
- The field of code evaluation is seeing a proliferation of specialized benchmarks, addressing areas from code optimization to security and user simulation.
- A key tension exists between developing fully autonomous AI agents and integrating interactive human-AI collaboration for software engineering tasks.
Deep Dive
- SWE-bench, initially released in October 2023, gained industry prominence after Cognition's Devin launch; Cognition contacted John Yang two weeks before the announcement.
- The benchmark ecosystem expanded to include SWE-bench Verified, an independent SWE-bench Pro, and new multilingual and multimodal versions.
- These expanded versions cover nine programming languages across 40 repositories, moving beyond the original Django-heavy focus.
- The guest considers the independent SWE-bench Pro a 'great benchmark', despite its use of the SWE-bench name without authorization.
- CodeClash is a new benchmark for long-horizon development, in which AI agents maintain and improve their own codebases over multiple rounds of programming tournaments.
- Tournaments use diverse arenas, from the classic game Halite to economically valuable tasks, with agent codebases scored by an LM judge or by task performance.
- SWE-ficiency, developed by Jeffrey Maugh, focuses on optimizing code performance through techniques like parallelization and SIMD operations without altering behavior.
- AlgoTune also addresses code optimization for speed, while SciCode is designed for scientific computing with advanced human evaluation.
- The conversation highlighted the cost of running agentic benchmarks, suggesting SWE-bench Verified as a less expensive stepping stone.
- Recent work includes METR's time-horizon analysis, which uses SWE-bench Verified to relate task runtime to completion rates, as well as specialized evaluations like Terminal-Bench, SecBench, and SRE-bench.
- User simulator benchmarks, such as Tau-bench, are in early development, aiming to replicate real-world coding environments beyond traditional GitHub issues.
- The inclusion of 'impossible tasks' in Tau-bench is considered a feature: because some tasks are unsolvable by design, a model scoring above roughly 75% likely indicates cheating or attempts to bypass them.
- The future of code evaluations is expected to bring more SWE-bench variants and continued growth of Terminal-Bench, which encourages greater creativity in environment design.
- Emphasis is placed on long-running agent tasks where models achieve goals with minimal human guidance as a future direction for AI agents.
- The guest expresses caution regarding the push for extensive AI agent autonomy, contrasting it with Cognition's focus on rapid, interactive feedback loops for developers.
- Questions are raised about whether extended autonomous runs materially advance the industry beyond serving as proofs of concept.
- Enabling different levels of abstraction based on task needs is crucial for human-AI collaboration, supporting both hands-on interaction and autonomous completion.
- There is a call for more user interaction data, similar to that collected by companies like Cognition and Cursor, for academic research.
- Academics need to either build compelling products or develop advanced user simulators to acquire comparable insights into user interaction.
- CodeClash is also envisioned as a testbed for human-AI collaboration: freeze model capabilities and observe how interaction patterns change across setups (solo agent, multi-agent, human plus agent).
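The kind of behavior-preserving speedup that SWE-ficiency and AlgoTune reward can be illustrated with a minimal, hypothetical sketch (the function names and workload below are illustrative, not from either benchmark): the optimized version must return exactly the same result as the baseline, only faster.

```python
def pairwise_sums_naive(values):
    """Baseline: O(n^2) double loop summing every ordered pair (a, b)."""
    total = 0
    for a in values:
        for b in values:
            total += a + b
    return total


def pairwise_sums_fast(values):
    """Optimized: each element appears in 2*n ordered pairs, so the total
    is 2 * n * sum(values) -- O(n), with identical output to the baseline."""
    return 2 * len(values) * sum(values)


data = list(range(500))
# The contract such benchmarks check: behavior is unchanged, only speed differs.
assert pairwise_sums_naive(data) == pairwise_sums_fast(data)
```

Real benchmark tasks involve heavier machinery (parallelization, SIMD), but the grading idea is the same: an equivalence check over outputs plus a wall-clock comparison.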
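The user-simulator direction discussed above (Tau-bench and its successors) can be sketched as a simple turn-taking loop. Everything here is a hypothetical illustration, not the benchmark's real API: a scripted stand-in plays the LM "user", the agent replies, and the episode transcript is what a checker would score.

```python
from dataclasses import dataclass


@dataclass
class SimulatedUser:
    """Scripted stand-in for an LM user: replays a goal as chat turns."""
    turns: list
    _i: int = 0

    def next_message(self, agent_reply):
        # Return the next user turn, or None when the user is done.
        if self._i >= len(self.turns):
            return None
        msg = self.turns[self._i]
        self._i += 1
        return msg


def run_episode(user, agent, max_turns=10):
    """Alternate user/agent turns until the user stops or the budget runs out."""
    transcript = []
    reply = ""
    for _ in range(max_turns):
        msg = user.next_message(reply)
        if msg is None:
            break
        reply = agent(msg)
        transcript.append((msg, reply))
    return transcript


# Toy agent that just acknowledges each request.
echo_agent = lambda msg: f"ack: {msg}"
log = run_episode(SimulatedUser(turns=["fix the bug", "add a test"]), echo_agent)
```

In a real benchmark the scripted user would be an LM conditioned on a hidden goal, and scoring would inspect the resulting code state rather than the transcript; the loop structure is the part that generalizes beyond GitHub-issue-style evaluation.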