
Latent Space: The AI Engineer Podcast

The new Claude 3.5 Sonnet, Computer Use, and Building SOTA Agents — with Erik Schluntz, Anthropic

Overview

  • AI agents for software engineering are evolving beyond simple code generation to handle complex repository-level tasks, as demonstrated by SWE-bench, which evaluates models on realistic GitHub issues rather than isolated coding puzzles.
  • Effective AI tool design requires balancing autonomy with guidance - giving models freedom to determine their approach while providing clear, foolproof tools with detailed explanations and avoiding unnecessary constraints or complex frameworks.
  • Claude 3.5 models demonstrate significant improvements in self-correction and persistence, with the ability to try multiple approaches when initial attempts fail and maintain coherence through complex, multi-step processes spanning many iterations.
  • The future of AI agents faces critical challenges in building trust and reliability, requiring systems that produce auditable, transparent work with near-perfect reliability (beyond 99.9%) to meet human expectations for practical applications.

Content

Background and Career Path

  • Erik Schluntz's Background:
    • Previously CTO and co-founder of Cobalt Robotics, which built security and inspection robots
    • Became excited about AI after using tools like GitHub Copilot
    • Took a sabbatical to research AI before joining Anthropic

  • Reasons for Joining Anthropic:
    • Felt burnt out from robotics
    • Knew and trusted many smart people at the company
    • Aligned with his interest in AI safety
    • Appreciated the company's culture

  • Professional Approach at Anthropic:
    • Joined with an interest in code generation and AI that can help people build things
    • Initially worked on various projects like function calling and tool use
    • Part of the Product Research team focused on understanding customer needs
    • Flexible work environment that allows exploration of different projects
    • Currently working on SWE-bench and involved in research around computer use and tool use
    • Part of the team behind Claude 3.5 Sonnet developments

SWE-bench Development and Characteristics

  • Key Points about SWE-bench:
    • Focused on real engineering tasks, not abstract benchmarks
    • Specifically evaluates coding agents and their performance
    • Limited to 12 Python repositories
    • Only includes issues with matching commits and passing tests
    • Represents more authentic engineering work than traditional coding benchmarks

  • Benchmark Comparison:
    • Unlike HumanEval (which tests isolated coding puzzles), SWE-bench:
      - Operates within full repository contexts
      - Requires finding relevant files and understanding system interactions
      - Simulates more realistic software engineering scenarios
    • HumanEval is still valuable for:
      - Greenfield code generation testing
      - Quick, easy implementation
      - Providing baseline performance signals

  • Challenges and Considerations:
    • SWE-bench is more complex and expensive to run
    • Requires parsing repositories and multiple code iterations
    • Uses more tokens than simpler benchmarks

  • SWE-bench and SWE-bench Verified Details:
    • SWE-bench is a set of around 2,000 tasks scraped from GitHub
    • Many original tasks were effectively impossible due to overly specific test conditions
    • SWE-bench Verified was created in partnership with OpenAI to address these issues
    • Tasks were reviewed by humans and categorized by estimated difficulty (15 minutes to over 4 hours)
    • Current SWE-bench Verified performance is around 49% (up from ~30% previously)

Model Performance and Behavior Insights

  • Key Observations about Model Performance:
    • The model often operates at the wrong level of abstraction
    • Tends to apply smaller fixes instead of comprehensive refactoring
    • Lacks multimodal capabilities, especially in visual tasks
    • Successfully solved around 92% of SWE-bench Verified tasks

  • Prompt and Model Behavior Insights:
    • Different tasks may require different prompting strategies
    • Separating tasks and using specialized prompts can be more effective
    • Classifying and routing problems to specific prompts can simplify system design
    • Models sometimes prefer generating shorter code diffs rather than full implementations
    • This behavior might mirror human communication patterns
    • Being explicit in prompts about desired output (full code vs. changes) is important
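
The classify-and-route idea above can be sketched as a thin dispatcher: a cheap classification step picks one of several specialized prompts instead of relying on a single mega-prompt. This is a minimal illustration, not Anthropic's implementation; the categories and the keyword-based `classify` stub are hypothetical (a real system would typically use a small model call for classification).

```python
# Sketch: route each task to a specialized prompt rather than one mega-prompt.
# The categories and keyword classifier are illustrative stand-ins only.

PROMPTS = {
    "bugfix": "You are fixing a bug. Output the FULL corrected file, not a diff.",
    "refactor": "You are refactoring. Preserve behavior; explain each change.",
    "question": "Answer the question about the codebase concisely.",
}

def classify(task: str) -> str:
    """Toy stand-in for an LLM classification call."""
    lowered = task.lower()
    if "refactor" in lowered:
        return "refactor"
    if "?" in task:
        return "question"
    return "bugfix"

def build_prompt(task: str) -> str:
    # Route to the specialized prompt; each prompt can be explicit about
    # the desired output format (full code vs. changes), per the point above.
    return f"{PROMPTS[classify(task)]}\n\nTask: {task}"
```

The design payoff is that each specialized prompt stays short and unambiguous, rather than one prompt trying to cover every case.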

  • Meta Prompting Observations:
    • Meta prompting allows dynamic prompt generation based on task complexity
    • Anthropic's meta prompting system was highlighted as an innovative approach
    • Much as human experts use checklists and scaffolding, meta prompting provides structured guidance for AI models
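
The checklist analogy above can be made concrete with a small sketch: the working prompt is generated per task, with more scaffolding for harder tasks. Everything here is hypothetical, `call_model` is a stub for a real LLM call, and the difficulty threshold and checklist items are invented for illustration.

```python
# Sketch: meta prompting - generate the working prompt from the task itself,
# scaling the scaffolding with estimated difficulty. Illustrative only.

def call_model(meta_prompt: str) -> str:
    """Stand-in for an LLM call that turns a plan into a working prompt."""
    return f"Follow this plan step by step:\n{meta_prompt}"

def meta_prompt_for(task: str, est_minutes: int) -> str:
    checklist = ["Restate the task", "List affected files"]
    if est_minutes > 60:  # harder tasks get extra checklist items
        checklist += ["Write a failing test first", "Verify each change runs"]
    steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(checklist))
    return call_model(f"Task: {task}\nChecklist:\n{steps}")
```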

Agent Architecture and Implementation

  • Agent Architecture and Runtime:
    • The approach focuses on giving the AI (Claude) maximum autonomy with minimal constraints
    • Key principles include:
      - Providing tools without forcing specific workflow steps
      - Letting the model decide how to approach problems
      - Allowing the model to keep calling tools until it believes the task is complete
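
The runtime described above reduces to a simple loop: hand the model its tools, then let it act until it declares the task done. This is a minimal sketch of that shape, not the production agent; `model_step` is a stub standing in for a real LLM turn with the transcript and tool schemas.

```python
# Sketch of the agent runtime: no forced workflow, just a loop in which the
# model either calls a tool or declares completion. `model_step` is a stub.

def model_step(transcript):
    """Stand-in for one LLM turn: ('tool', name, arg) or ('done', answer)."""
    if not any(kind == "tool_result" for kind, _ in transcript):
        return ("tool", "ls", ".")        # toy policy: look around once
    return ("done", "task complete")

def run_tool(name, arg):
    tools = {"ls": lambda path: f"listing of {path}"}
    return tools[name](arg)

def run_agent(task, max_turns=50):
    transcript = [("user", task)]
    for _ in range(max_turns):            # model decides when to stop
        action = model_step(transcript)
        if action[0] == "done":
            return action[1]
        _, name, arg = action
        transcript.append(("tool_result", run_tool(name, arg)))
    return "gave up after max_turns"
```

The `max_turns` cap is the only imposed constraint; everything else, including which tool to call and when to finish, is the model's decision.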

  • Model Capabilities:
    • Claude (especially Sonnet 3.5) demonstrates strong capabilities:
      - Effective self-correction
      - Ability to try different approaches after initial failures
      - Less prone to getting "stuck" than previous models
    • Haiku 3.5 performed well on SWE-bench (scoring 40.6)
    • Demonstrates that even smaller models can handle complex agentic tasks
    • Newer models (Haiku and Sonnet) outperformed the original Opus model

  • Search and Context Handling:
    • Used "agentic search", where the model decides:
      - How to search for information
      - When to stop searching
      - Which files/directories to explore
    • Did not rely on external code indexers or vector databases
    • Used basic tools like Bash commands (ls, cat) and a custom "view" file editing tool
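
The plain tools mentioned above are deliberately simple; a sketch of what such primitives might look like follows (these are illustrative stand-ins, not Anthropic's actual tool implementations). Note that errors are returned as text rather than raised, so the model can read them and react.

```python
# Sketch: ls/cat-style primitives for agentic search - no index, no vector
# store, just directory listing and bounded file reads. Illustrative only.
import os

def tool_ls(path: str) -> str:
    """List a directory so the model can decide what to open next."""
    try:
        return "\n".join(sorted(os.listdir(path)))
    except OSError as e:
        return f"error: {e}"  # surfaced as text the model can recover from

def tool_cat(path: str, max_bytes: int = 20_000) -> str:
    """Read a file, truncated to keep token usage bounded."""
    try:
        with open(path, "r", errors="replace") as f:
            return f.read(max_bytes)
    except OSError as e:
        return f"error: {e}"
```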

Tool Design and Implementation

  • File Editing Approaches:
    • The string replace method was found most reliable for AI file edits
    • Alternative methods like writing full diffs or regenerating entire files have drawbacks:
      - Diff writing requires pre-determining line changes
      - Full file regeneration is accurate but prohibitively token-expensive
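
The core of the string-replace approach can be sketched in a few lines: the model supplies an exact snippet and its replacement, and the edit is rejected unless the snippet matches exactly once. The uniqueness check is what makes the method hard to misapply, and the error messages are written so the model can self-correct. This is a minimal sketch, not the actual tool.

```python
# Sketch: string-replace file editing. The edit only applies if `old`
# occurs exactly once, so the model can never target the wrong location.

def str_replace(text: str, old: str, new: str) -> str:
    count = text.count(old)
    if count == 0:
        raise ValueError("old_str not found; re-read the file and retry")
    if count > 1:
        raise ValueError(f"old_str matches {count} times; add more context")
    return text.replace(old, new, 1)

src = "def add(a, b):\n    return a - b\n"
fixed = str_replace(src, "return a - b", "return a + b")
```

Compared with diffs (which require the model to predict line numbers) or full-file rewrites (which burn tokens), this targets exactly the span the model can already quote.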

  • Tool Design Philosophy:
    • Need to iterate on tools, not just create minimal API interfaces
    • Importance of detailed explanations and examples for AI tool usage
    • Parallels drawn between human interface design and AI computer interfaces
    • Focus on making tools "foolproof" (referencing the Japanese concept of poka-yoke)

  • Specific Tool Improvements:
    • Forcing absolute path usage to prevent model confusion
    • Avoiding non-returnable commands (like Vim)
    • Adding clear instructions directly in tool descriptions
    • Preference for XML over JSON for its descriptive nature
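
The first item above is a good example of poka-yoke in tool design: rather than letting a relative path silently resolve against the wrong working directory, the tool rejects it up front with an error message that tells the model how to fix its call. A minimal sketch of such a guard (illustrative, not the actual implementation):

```python
# Sketch: reject relative paths before a tool runs, and suggest the likely
# intended absolute path so the model can immediately retry correctly.
import os

def check_path(path: str) -> str:
    if not os.path.isabs(path):
        raise ValueError(
            f"path must be absolute; got {path!r}. "
            f"Did you mean {os.path.abspath(path)!r}?"
        )
    return os.path.normpath(path)
```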

  • Agent Framework Selection:
    • Chose SWE-agent primarily because:
      - Same authors as SWE-bench
      - High-quality framework
      - Easy to modify
      - Uses a "think, act, observe" loop approach

Development Approach and Recommendations

  • Agent Frameworks and Development Approach:
    • Caution against over-relying on agent frameworks, recommending:
      - Starting without frameworks to understand raw prompts
      - Avoiding unnecessary complexity
      - Being wary of frameworks that obfuscate model behavior
    • Prefer building custom, bespoke tools and utilities
    • Referenced the ReAct paradigm (think, act, observe)
    • Favor creating utility functions as building blocks
    • Maintain a flexible, custom "utils" approach

  • Experimental Observations:
    • Agent experiments involved:
      - Runs with over 100 turns
      - Context limits around 200K tokens
    • Compared agent task complexity to human work processes
    • Noted that complex tasks may require many iterations

  • Research and Development Challenges:
    • Key research areas include:
      - How agents can work beyond their context length
      - Methods for pruning unproductive paths
      - Lossless summarization of learned approaches
      - Efficient token usage

Computer Use and Robotics Insights

  • Computer Use Capabilities:
    • Discussion centers on "computer use" capabilities in AI
    • Initially surprising that AI could perform tasks like opening Minecraft
    • Currently in beta with limitations
    • Seen as a low-friction way to implement tool use
    • Runs in a sandboxed environment (Docker/VM) for security reasons

  • Potential Applications:
    • Repetitive work automation
    • End-to-end testing
    • Front-end and web testing
    • Research idea generation
    • Customer support automation
    • Gaming interactions
    • Robotic process automation (RPA)

  • Robotics Innovations:
    • Two major technological advances are emerging:
      - Large Language Models (LLMs) for adding "common sense" to task descriptions
      - Diffusion-inspired path planning algorithms for motion control
    • Moving from hard-coded motion programming to learning-based approaches
    • Models can learn tasks through demonstrations
    • Goal is generalization across different objects/tasks

  • Hardware Challenges in Robotics:
    • Hardware development is extremely difficult, with significant variability between seemingly identical components
    • Building multiple robots consistently is much harder than creating a single prototype
    • Manufacturing challenges include inconsistent motor performance, variations in component behavior, and unpredictable hardware failures
    • Even small variations in components like USB cables can cause system failures

Future Considerations and Challenges

  • Reliability Challenges:
    • Current robotics technology is similar to self-driving cars 10 years ago
    • High reliability is critical - a 99% success rate is actually quite problematic
    • Achieving 99.9% reliability will be a significant hurdle
    • Humans expect near-perfect performance for household/industrial tasks

  • Economic and Practical Limitations:
    • Robots will be expensive to build
    • Unit economics may be challenging due to labor replacement constraints
    • Difficult to replace precision manufacturing robots due to extremely low tolerance for error
    • Skepticism about autonomous vehicle profitability (e.g., Waymo's economics)

  • LLM Agents Future Challenges:
    • Excitement about increasing LLM agent capabilities
    • Key future challenge: building trust in agent-generated outputs
    • Importance of creating:
      - Trustable work
      - Auditable processes
      - Transparent explanations of the agent's reasoning and methodology
