
Latent Space: The AI Engineer Podcast

2024 in Agents [LS Live! @ NeurIPS 2024]

Overview

* AI agents are rapidly evolving with major companies focusing on consumer coding agents, vision-based computer-using agents, and multi-agent systems, positioning 2025 as the potential "year of agents" with applications spanning coding, developer tools, customer support, and search.

* The most effective agent designs use a minimalist toolset approach (5-6 core tools including bash execution, file editing, and web browsing) rather than numerous specific API calls, prioritizing powerful programming capabilities while maintaining flexibility to adapt to different tasks.

* Current agent performance reaches 30-40% full autonomy, with Claude leading among models, though agents still struggle with information gathering, complex tasks, and thorough investigation of problems before attempting solutions.

* The field is moving toward democratized AI access through open-source solutions and more affordable models, while facing challenges in authentication, effective memory systems, and creating self-improving agents that learn from past experiences.

* Research innovations like workflow memory (showing 22.5% performance improvement after learning from 40 examples) and hybrid web interaction approaches (combining screenshots with textual summaries) represent promising directions for enhancing agent capabilities.

Content: State of LLM Agents in 2024 (NeurIPS Recap)

Introduction and Context

* This is a recap of the Latent Space Live mini-conference at NeurIPS 2024 in Vancouver
* The episode features Professor Graham Neubig from CMU, who is also Chief Scientist at All Hands AI and a maintainer of OpenHands (an open-source coding agent framework)
* 2025 is predicted to be the "year of agents," with major AI companies (OpenAI, DeepMind, Anthropic) focusing on consumer coding agents, vision-based computer-using agents, and multi-agent systems

Notable Agent Developments in 2024

* OpenHands (formerly OpenDevin) leads the SWE-bench leaderboard
* Significant agent innovations across domains:
  - Coding: Devin, Cursor Composer, Codeium's Windsurf
  - Developer Tools: StackBlitz's Bolt, Vercel's v0
  - Customer Support: Sierra ($4B valuation)
  - Search: Perplexity ($9B valuation)

Agent Capabilities and Use Cases

* The speaker demonstrates using agents for data science tasks, creating new software, and improving existing software
* Uses coding agents 5-10 times daily
* Demonstrated using an AI agent to analyze GitHub repository data and create visualizations
* Agents can interact with the GitHub API (checking issue comments, GitHub Actions)
* Python agents can leverage ecosystem libraries for tasks like data visualization

Agent-Computer Interface Design

* Two main approaches to providing tools for agents:
  - Granular, specific API calls (many individual tool calls)
  - Providing coding capability to call arbitrary Python code (more flexible)

* The OpenHands approach uses a limited set of approximately 5-6 core tools:
  1. Bash program execution
  2. Jupyter notebook execution
  3. File editing (browsing/reading/overwriting)
  4. Global search and replace
  5. Web browsing (with sub-capabilities like scrolling, text input, clicking)

* Philosophy: Focus on giving agents powerful, flexible programming capabilities while minimizing the number of specific tools
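As an illustration of this minimalist-toolset philosophy, an agent runtime can route all tool calls through a small dispatch table of general-purpose tools. The function names and registry below are hypothetical, not the actual OpenHands implementation:

```python
# A minimal sketch of a small, general toolset for an agent: a few
# powerful tools (shell, file editing) instead of many narrow API wrappers.
import subprocess

def run_bash(command: str, timeout: int = 30) -> str:
    """Execute a shell command and return its combined output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def edit_file(path: str, content: str) -> str:
    """Overwrite a file with new content."""
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} chars to {path}"

# The full registry stays small: roughly 5-6 entries the model chooses from.
TOOLS = {
    "bash": run_bash,
    "edit_file": edit_file,
    # "jupyter": ..., "search_replace": ..., "browse": ...
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Route a model-issued tool call to its implementation."""
    return TOOLS[tool_name](**kwargs)
```

Because `run_bash` can invoke arbitrary programs, most narrow tools (git, package managers, test runners) come for free through it.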

Human-Agent Interface Principles

* Present actions in clear English descriptions
* Show high-level summaries with the option to explore details
* Integrate into the user's existing interaction settings
* Adapt the interface based on the specific use case (e.g., coding vs. insurance)

Language Model Requirements and Evaluation

* Requirements for effective agents:
  - Excellent instruction-following ability
  - Strong tool-use and coding capabilities
  - Environment understanding
  - Error awareness and recovery

* Current evaluation results:
  - Claude is currently considered the best agent language model
  - Strengths include good error recovery and adaptability
  - The evaluation compared Claude, GPT-4o, o1-mini, Llama, and others
  - Open-source Llama 3.1 (405B) performed best among open models
  - The evaluation is a few months old, and the field is rapidly evolving

Problem-Solving Approaches for Agents

* Recommended workflow for coding:
  - Write tests that reproduce the issue
  - Run the tests to confirm they fail
  - Fix the code
  - Verify the tests now pass
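The coding workflow above (reproduce, confirm failure, fix, verify) can be sketched as a simple loop. Here `run_tests` and `propose_fix` are hypothetical stand-ins for the agent's actions, not a real framework API:

```python
# Sketch of the test-driven agent loop: a test must fail on the buggy
# code first (proving it reproduces the issue), then fixes are attempted
# until the test passes.
def run_tests(code: str, test: str) -> bool:
    """Execute the test against the code; True means the test passes."""
    env: dict = {}
    exec(code, env)
    try:
        exec(test, env)
        return True
    except AssertionError:
        return False

def fix_with_agent(code: str, test: str, propose_fix, max_iters: int = 3) -> str:
    # Steps 1-2: the test should fail on the original code,
    # otherwise it does not actually reproduce the bug.
    if run_tests(code, test):
        raise ValueError("test already passes; it does not reproduce the bug")
    for _ in range(max_iters):
        code = propose_fix(code)       # Step 3: agent attempts a fix
        if run_tests(code, test):      # Step 4: verify the tests pass
            return code
    raise RuntimeError("no passing fix found")
```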

* Two main approaches to structuring agent tasks:
  - Explicit structure (multi-agent systems with defined roles)
  - Implicit structure (a single prompt with sequential instructions)

* Preference for single-agent systems due to flexibility to deviate from original plan and better instruction-following capabilities

Research and Innovation in Agent Workflows

* Notable research papers and approaches:
  - CoAct: Generating and fixing plans dynamically
  - SteP: Manual workflow creation for web navigation
  - Agent Workflow Memory: Self-improving agents that learn from past successes (demonstrated a 22.5% performance increase after 40 examples)
  - Agentless: Creating repository maps for better navigation
  - BAGEL: Web agent exploration through random tasks
  - Tree Search for Language Agents: Exploring multiple solution paths

Web Agent Interaction Methods

* Three main ways agents interact with websites:
  1. Screenshot pixel clicking (currently not very reliable)
  2. HTML/accessibility tree element identification
  3. A hybrid approach using a screenshot plus a textual summary (most promising)

* Current implementation details:
  - Currently using text-based web interaction
  - Recently implemented two web-browsing modalities:
    - Full website interaction with clickable elements
    - Markdown conversion for easier information gathering
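A rough sketch of the markdown-conversion modality, using only the standard library. A real agent would use a proper HTML-to-markdown converter; this stripped-down version is purely illustrative:

```python
# Reduce an HTML page to a markdown-ish text view so the agent can
# read content without pixel-level interaction.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # non-content tags to drop entirely

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif len(tag) == 2 and tag[0] == "h" and tag[1].isdigit():
            self.parts.append("\n" + "#" * int(tag[1]) + " ")  # h2 -> "## "
        elif tag in ("p", "li", "br", "div"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def page_to_markdown(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts).strip()
```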

* Agent architecture insights:
  - Uses a general coding-and-browsing agent with broad instructions
  - Implements "micro agents": specialized prompt additions triggered by specific keywords
  - Current micro agents include GitHub and NPM interaction instructions
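A minimal sketch of the micro-agent mechanism: keyword-triggered prompt snippets appended to a base system prompt only when relevant. The triggers and instruction text below are invented examples, not the actual OpenHands micro agents:

```python
# Map trigger keywords to specialized instruction snippets.
MICRO_AGENTS = {
    "github": "When working with GitHub, use the `gh` CLI and check "
              "CI results on the pull request before declaring success.",
    "npm": "When installing npm packages, run non-interactively and "
           "verify the lockfile is updated.",
}

def build_system_prompt(base_prompt: str, task: str) -> str:
    """Append micro-agent instructions whose keyword appears in the task."""
    extras = [text for keyword, text in MICRO_AGENTS.items()
              if keyword in task.lower()]
    return "\n\n".join([base_prompt] + extras)
```

This keeps the base prompt short for generic tasks while still injecting domain know-how when a task mentions, say, GitHub.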

Evaluation and Benchmarks

* Two main approaches to evaluation:
  - Fast, cheap sanity checks (e.g., MiniWoB, the Aider code editing benchmark)
  - Highly realistic evaluations (e.g., WebArena for web navigation, SWE-bench for coding)

* Benchmark evolution cycle:
  - 2023: Benchmarks were too easy
  - 2024: Agents were too weak
  - 2025: More challenging benchmarks anticipated

* Suggestions for composite benchmarks testing multiple coding-related abilities
* Concerns about benchmark data leakage into language model training

Current Performance and Limitations

* Agent performance is currently around 30-40% full autonomy
* Main failure points:
  - Insufficient information gathering before attempting to solve a task
  - Struggles with more complex tasks
  - Failure to understand GitHub workflows before attempting to solve problems
  - May not thoroughly investigate potential causes of bugs

* Performance improves when explicitly instructed to gather information or generate multiple hypotheses

Predictions for AI Agents

* By mid-2025:
  - Every major language model provider will focus on agent training
  - Competition will increase and prices will decrease
  - Smaller models will become competitive as agents

* Improvements expected in:
  - Instruction-following abilities
  - Error correction
  - Reduced tendency to get stuck in loops

Accessibility and Democratization of AI

* Strong emphasis on making powerful AI tools:
  - Affordable
  - Accessible to more people
  - Not limited to a select group

* Strategies for democratization:
  - Use open-source solutions
  - Contribute to open-source projects
  - Train affordable, strong open-source models
  - Create models that help people improve their economic opportunities

Future Developments and Challenges

* Authentication challenges for AI agents:
  - Current solutions are limited in scope
  - GitHub's fine-grained authentication tokens suggested as a potential model
  - Most APIs lack granular authentication controls
  - Need to "prepare the world" for agent interactions
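One way to picture the fine-grained authentication idea, loosely modeled on GitHub's fine-grained tokens: each agent credential carries an explicit set of scopes, and every action is checked against it before running. The scope names and classes here are illustrative, not any real API:

```python
# Sketch of scope-checked agent credentials: actions outside the
# granted scopes are refused rather than executed.
from dataclasses import dataclass, field

@dataclass
class AgentCredential:
    owner: str
    scopes: frozenset = field(default_factory=frozenset)

    def allows(self, action: str) -> bool:
        return action in self.scopes

def perform(cred: AgentCredential, action: str) -> str:
    """Run an agent action only if the credential grants its scope."""
    if not cred.allows(action):
        raise PermissionError(f"{cred.owner}: scope '{action}' not granted")
    return f"performed {action}"
```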

* Self-improving agents approach:
  - Using powerful language models with effectively infinite context
  - Ability to review past experiences
  - A mechanism to filter out negative past experiences
  - The workflow memory concept shows incremental improvement
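A minimal sketch of the workflow-memory idea: successful trajectories are stored and the most relevant ones are retrieved for similar new tasks. The Agent Workflow Memory paper induces reusable workflows with a language model; the keyword-overlap retrieval below is a simplified stand-in:

```python
# Store (task, steps) pairs from successful runs; retrieve the stored
# workflows most similar to a new task by word overlap, so they can be
# prepended to the agent's prompt as guidance.
class WorkflowMemory:
    def __init__(self):
        self.workflows = []  # list of (task_description, steps) pairs

    def record_success(self, task: str, steps: list):
        """Keep only successful trajectories, filtering out failures."""
        self.workflows.append((task, steps))

    def retrieve(self, task: str, k: int = 1):
        """Return the k stored workflows with the most word overlap."""
        words = set(task.lower().split())
        ranked = sorted(
            self.workflows,
            key=lambda w: len(words & set(w[0].lower().split())),
            reverse=True,
        )
        return ranked[:k]
```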

* Current challenges include:
  - Large model size
  - High cost of infinite context
  - Lack of effective indexing methods
  - Retrieval-augmented generation (RAG) not working well for code

* The speaker believes AI progress will accelerate rapidly, with agents building better agents
