
Latent Space: The AI Engineer Podcast

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Overview

  • AI agents for software engineering are evolving beyond simple code generation to handle complex repository-level tasks, as demonstrated by SWE-bench, which evaluates models on realistic GitHub issues rather than isolated coding puzzles.
  • Effective AI tool design requires balancing autonomy with guidance - giving models freedom to determine their approach while providing clear, foolproof tools with detailed explanations and avoiding unnecessary constraints or complex frameworks.
  • Claude 3.5 models demonstrate significant improvements in self-correction and persistence, with the ability to try multiple approaches when initial attempts fail and maintain coherence through complex, multi-step processes spanning many iterations.
  • The future of AI agents faces critical challenges in building trust and reliability, requiring systems that produce auditable, transparent work with near-perfect reliability (beyond 99.9%) to meet human expectations for practical applications.

Content

Background and Career Path

  • Erik Schluntz's Background:
    • Previously CTO and co-founder of Cobalt Robotics, which built security and inspection robots
    • Became excited about AI after using tools like GitHub Copilot
    • Took a sabbatical to research AI before joining Anthropic

  • Reasons for Joining Anthropic:
    • Felt burnt out from robotics
    • Knew and trusted many smart people at the company
    • Aligned with his interest in AI safety
    • Appreciated the company's culture

  • Professional Approach at Anthropic:
    • Joined with an interest in code generation and AI that can help people build things
    • Initially worked on various projects like function calling and tool use
    • Part of the Product Research team focused on understanding customer needs
    • Flexible work environment that allows exploration of different projects
    • Currently working on SWE-bench and involved in research around computer use and tool use
    • Part of the team behind Claude 3.5 Sonnet developments

SWE-bench Development and Characteristics

  • Key Points about SWE-bench:
    • Focused on real engineering tasks, not abstract benchmarks
    • Specifically evaluates coding agents and their performance
    • Limited to 12 Python repositories
    • Only includes issues with matching commits and passing tests
    • Represents more authentic engineering work than traditional coding benchmarks

  • Benchmark Comparison:
    • Unlike HumanEval (which tests isolated coding puzzles), SWE-bench:
      - Operates within full repository contexts
      - Requires finding relevant files and understanding system interactions
      - Simulates more realistic software engineering scenarios
    • HumanEval is still valuable for:
      - Greenfield code generation testing
      - Quick, easy implementation
      - Providing baseline performance signals

  • Challenges and Considerations:
    • SWE-bench is more complex and expensive to run
    • Requires parsing repositories and multiple code iterations
    • Uses more tokens than simpler benchmarks

  • SWE-bench and SWE-bench Verified Details:
    • SWE-bench is a set of around 2,000 tasks scraped from GitHub
    • Many original tasks were effectively impossible due to overly specific test conditions
    • SWE-bench Verified was created in partnership with OpenAI to address these issues
    • Tasks were reviewed by humans and categorized by estimated difficulty (15 minutes to over 4 hours)
    • Current SWE-bench Verified performance is around 49% (up from ~30% previously)

Model Performance and Behavior Insights

  • Key Observations about Model Performance:
    • The model often operates at the wrong level of abstraction
    • Tends to apply smaller fixes instead of comprehensive refactoring
    • Lacks multimodal capabilities, especially in visual tasks
    • Successfully solved around 92% of SWE-bench Verified tasks

  • Prompt and Model Behavior Insights:
    • Different tasks may require different prompting strategies
    • Separating tasks and using specialized prompts can be more effective
    • Classifying and routing problems to specific prompts can simplify system design
    • Models sometimes prefer generating shorter code diffs rather than full implementations
    • This behavior might mirror human communication patterns
    • Being explicit in prompts about desired output (full code vs. changes) is important
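
The classify-and-route idea above can be sketched as a thin dispatcher: a cheap classification step picks one of several specialized prompts instead of relying on a single mega-prompt. This is a minimal illustration, not Anthropic's implementation; the categories and the keyword-based `classify` stub are hypothetical (a real system would typically use a small model call for classification).

```python
# Sketch: route each task to a specialized prompt rather than one mega-prompt.
# The categories and keyword classifier are illustrative stand-ins only.

PROMPTS = {
    "bugfix": "You are fixing a bug. Output the FULL corrected file, not a diff.",
    "refactor": "You are refactoring. Preserve behavior; explain each change.",
    "question": "Answer the question about the codebase concisely.",
}

def classify(task: str) -> str:
    """Toy stand-in for an LLM classification call."""
    lowered = task.lower()
    if "refactor" in lowered:
        return "refactor"
    if "?" in task:
        return "question"
    return "bugfix"

def build_prompt(task: str) -> str:
    # Route to the specialized prompt; each prompt can be explicit about
    # the desired output format (full code vs. changes), per the point above.
    return f"{PROMPTS[classify(task)]}\n\nTask: {task}"
```

The design payoff is that each specialized prompt stays short and unambiguous, rather than one prompt trying to cover every case.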

  • Meta Prompting Observations:
    • Meta prompting allows dynamic prompt generation based on task complexity
    • Anthropic's meta prompting system was highlighted as an innovative approach
    • Much as human experts use checklists and scaffolding, meta prompting provides structured guidance for AI models
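
The checklist analogy above can be made concrete with a small sketch: the working prompt is generated per task, with more scaffolding for harder tasks. Everything here is hypothetical, `call_model` is a stub for a real LLM call, and the difficulty threshold and checklist items are invented for illustration.

```python
# Sketch: meta prompting - generate the working prompt from the task itself,
# scaling the scaffolding with estimated difficulty. Illustrative only.

def call_model(meta_prompt: str) -> str:
    """Stand-in for an LLM call that turns a plan into a working prompt."""
    return f"Follow this plan step by step:\n{meta_prompt}"

def meta_prompt_for(task: str, est_minutes: int) -> str:
    checklist = ["Restate the task", "List affected files"]
    if est_minutes > 60:  # harder tasks get extra checklist items
        checklist += ["Write a failing test first", "Verify each change runs"]
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(checklist))
    return call_model(f"Task: {task}\nChecklist:\n{steps}")
```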

Agent Architecture and Implementation

  • Agent Architecture and Runtime:
    • The approach focuses on giving the AI (Claude) maximum autonomy with minimal constraints
    • Key principles include:
      - Providing tools without forcing specific workflow steps
      - Letting the model decide how to approach problems
      - Allowing the model to keep calling tools until it believes the task is complete
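
The runtime described above reduces to a simple loop: hand the model its tools, then let it act until it declares the task done. This is a minimal sketch of that shape, not the production agent; `model_step` is a stub standing in for a real LLM turn with the transcript and tool schemas.

```python
# Sketch of the agent runtime: no forced workflow, just a loop in which the
# model either calls a tool or declares completion. `model_step` is a stub.

def model_step(transcript):
    """Stand-in for one LLM turn: ('tool', name, arg) or ('done', answer)."""
    if not any(kind == "tool_result" for kind, _ in transcript):
        return ("tool", "ls", ".")        # toy policy: look around once
    return ("done", "task complete")

def run_tool(name, arg):
    tools = {"ls": lambda path: f"listing of {path}"}
    return tools[name](arg)

def run_agent(task, max_turns=50):
    transcript = [("user", task)]
    for _ in range(max_turns):            # model decides when to stop
        action = model_step(transcript)
        if action[0] == "done":
            return action[1]
        _, name, arg = action
        transcript.append(("tool_result", run_tool(name, arg)))
    return "gave up after max_turns"
```

The `max_turns` cap is the only imposed constraint; everything else, including which tool to call and when to finish, is the model's decision.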

  • Model Capabilities:
    • Claude (especially Sonnet 3.5) demonstrates strong capabilities:
      - Effective self-correction
      - Ability to try different approaches after initial failures
      - Less prone to getting "stuck" than previous models
    • Haiku 3.5 performed well on SWE-bench (scoring 40.6)
    • Demonstrates that even smaller models can handle complex agentic tasks
    • Newer models (Haiku and Sonnet) outperformed the original Opus model

  • Search and Context Handling:
    • Used "agentic search", where the model decides:
      - How to search for information
      - When to stop searching
      - Which files/directories to explore
    • Did not rely on external code indexers or vector databases
    • Used basic tools like Bash commands (ls, cat) and a custom "view" file editing tool
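
The plain tools mentioned above are deliberately simple; a sketch of what such primitives might look like follows (these are illustrative stand-ins, not Anthropic's actual tool implementations). Note that errors are returned as text rather than raised, so the model can read them and react.

```python
# Sketch: ls/cat-style primitives for agentic search - no index, no vector
# store, just directory listing and bounded file reads. Illustrative only.
import os

def tool_ls(path: str) -> str:
    """List a directory so the model can decide what to open next."""
    try:
        return "\n".join(sorted(os.listdir(path)))
    except OSError as e:
        return f"error: {e}"  # surfaced as text the model can recover from

def tool_cat(path: str, max_bytes: int = 20_000) -> str:
    """Read a file, truncated to keep token usage bounded."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.read(max_bytes)
    except OSError as e:
        return f"error: {e}"
```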

Tool Design and Implementation

  • File Editing Approaches:
    • The string replace method was found most reliable for AI file edits
    • Alternative methods like writing full diffs or regenerating entire files have drawbacks:
      - Diff writing requires pre-determining line changes
      - Full file regeneration is accurate but prohibitively token-expensive
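
The core of the string-replace approach can be sketched in a few lines: the model supplies an exact snippet and its replacement, and the edit is rejected unless the snippet matches exactly once. The uniqueness check is what makes the method hard to misapply, and the error messages are written so the model can self-correct. This is a minimal sketch, not the actual tool.

```python
# Sketch: string-replace file editing. The edit only applies if `old`
# occurs exactly once, so the model can never target the wrong location.

def str_replace(text: str, old: str, new: str) -> str:
    count = text.count(old)
    if count == 0:
        raise ValueError("old_str not found; re-read the file and retry")
    if count > 1:
        raise ValueError(f"old_str matches {count} times; add more context")
    return text.replace(old, new, 1)

src = "def add(a, b):\n    return a - b\n"
fixed = str_replace(src, "return a - b", "return a + b")
```

Compared with diffs (which require the model to predict line numbers) or full-file rewrites (which burn tokens), this targets exactly the span the model can already quote.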

  • Tool Design Philosophy:
    • Need to iterate on tools, not just create minimal API interfaces
    • Importance of detailed explanations and examples for AI tool usage
    • Parallels drawn between human interface design and AI computer interfaces
    • Focus on making tools "foolproof" (referencing the Japanese concept of poka-yoke)

  • Specific Tool Improvements:
    • Forcing absolute path usage to prevent model confusion
    • Avoiding non-returnable commands (like Vim)
    • Adding clear instructions directly in tool descriptions
    • Preference for XML over JSON for its descriptive nature
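
The first item above is a good example of poka-yoke in tool design: rather than letting a relative path silently resolve against the wrong working directory, the tool rejects it up front with an error message that tells the model how to fix its call. A minimal sketch of such a guard (illustrative, not the actual implementation):

```python
# Sketch: reject relative paths before a tool runs, and suggest the likely
# intended absolute path so the model can immediately retry correctly.
import os

def check_path(path: str) -> str:
    if not os.path.isabs(path):
        raise ValueError(
            f"path must be absolute; got {path!r}. "
            f"Did you mean {os.path.abspath(path)!r}?"
        )
    return os.path.normpath(path)
```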

  • Agent Framework Selection:
    • Chose SWE-agent primarily because:
      - Same authors as SWE-bench
      - High-quality framework
      - Easy to modify
      - Uses a "think, act, observe" loop approach

Development Approach and Recommendations

  • Agent Frameworks and Development Approach:
    • Caution against over-relying on agent frameworks, recommending:
      - Starting without frameworks to understand raw prompts
      - Avoiding unnecessary complexity
      - Being wary of frameworks that obfuscate model behavior
    • Prefer building custom, bespoke tools and utilities
    • Referenced the ReAct paradigm (think, act, observe)
    • Favor creating utility functions as building blocks
    • Maintain a flexible, custom "utils" approach

  • Experimental Observations:
    • Agent experiments involved:
      - Runs with over 100 turns
      - Context limits around 200K tokens
    • Compared agent task complexity to human work processes
    • Noted that complex tasks may require many iterations

  • Research and Development Challenges:
    • Key research areas include:
      - How agents can work beyond their context length
      - Methods for pruning unproductive paths
      - Lossless summarization of learned approaches
      - Efficient token usage

Computer Use and Robotics Insights

  • Computer Use Capabilities:
    • Discussion centers on "computer use" capabilities in AI
    • Initially surprising that AI could perform tasks like opening Minecraft
    • Currently in beta with limitations
    • Seen as a low-friction way to implement tool use
    • Runs in a sandboxed environment (Docker/VM) for security reasons

  • Potential Applications:
    • Repetitive work automation
    • End-to-end testing
    • Front-end and web testing
    • Research idea generation
    • Customer support automation
    • Gaming interactions
    • Robotic process automation (RPA)

  • Robotics Innovations:
    • Two major technological advances are emerging:
      - Large Language Models (LLMs) for adding "common sense" to task descriptions
      - Diffusion-inspired path planning algorithms for motion control
    • Moving from hard-coded motion programming to learning-based approaches
    • Models can learn tasks through demonstrations
    • Goal is generalization across different objects/tasks

  • Hardware Challenges in Robotics:
    • Hardware development is extremely difficult, with significant variability between seemingly identical components
    • Building multiple robots consistently is much harder than creating a single prototype
    • Manufacturing challenges include inconsistent motor performance, variations in component behavior, and unpredictable hardware failures
    • Even small variations in components like USB cables can cause system failures

Future Considerations and Challenges

  • Reliability Challenges:
    • Current robotics technology is similar to self-driving cars 10 years ago
    • High reliability is critical - a 99% success rate is actually quite problematic
    • Achieving 99.9% reliability will be a significant hurdle
    • Humans expect near-perfect performance for household/industrial tasks

  • Economic and Practical Limitations:
    • Robots will be expensive to build
    • Unit economics may be challenging due to labor replacement constraints
    • Difficult to replace precision manufacturing robots due to extremely low tolerance for error
    • Skepticism about autonomous vehicle profitability (e.g., Waymo's economics)

  • LLM Agents Future Challenges:
    • Excitement about increasing LLM agent capabilities
    • Key future challenge: building trust in agent-generated outputs
    • Importance of creating:
      - Trustable work
      - Auditable processes
      - Transparent explanations of the agent's reasoning and methodology
