
Latent Space: The AI Engineer Podcast

ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

Overview

Content: Latent Space Podcast ICLR Coverage (Part 2)

Podcast Introduction and Context

* This is part two of the Latent Space Podcast's ICLR (International Conference on Learning Representations) coverage
* Hosted by Charlie, with a special interview featuring Graham Neubig from Carnegie Mellon University
* Aman Sanger from Cursor joins as the first guest co-host

Graham Neubig Interview

Background and Experience

* Professor at Carnegie Mellon University
* Has taught an advanced NLP course for 7 years
* Spent 11 years in Japan as a language teacher and grad student
* Actively involved in the open-source AI software ecosystem (e.g., OpenDevin)

Teaching Approach

* Completely revised his NLP course after ChatGPT
* Focuses on providing practical, cutting-edge knowledge
* Aims to prepare students for research and innovation
* Prioritizes modern model-building techniques over older algorithmic approaches
* Still teaches n-gram language models as foundational concepts, noting a potential comeback via speculative decoding

Research Focus

* Shifted focus to benchmarking in 2022, motivated by pushing the boundaries of language model capabilities
* Goal is to create rigorous, meaningful academic benchmarks demonstrating real capabilities
* Mentioned three poster presentations: WebArena, Sotopia, and performance-improving code edits

WebArena Project

Concept and Motivation

* A "mini internet" sandbox for testing language model agents
* Originated from interest in creating agents that can perform real-world tasks
* Explores long-horizon planning and the application of world knowledge
* Robotics was seen as a current bottleneck, making web-based agents a more practical starting point

Benchmark Design

* Created using production-grade open-source sites mimicking real platforms:
  - One Stop Shop (mimicking Amazon)
  - Postmill (mimicking Reddit)
  - GitLab (mimicking GitHub)
* Tasks derived from researchers' actual browsing histories
* Aimed to create a realistic evaluation environment for web browsing tasks
* Example tasks include calculating monthly food expenses

Performance and Challenges

* Initial model performance was under 15%, while humans achieved around 78% success
* Performance has increased from 14% to 25-30% in six months
* Improvements came primarily from agent design, not just LLM advancements
* Models struggle with:
  - Web navigation and planning
  - Filtering information
  - Mathematical tasks
  - Recognizing clickable elements like dropdown menus
  - Common sense reasoning

Key Agent Improvement Strategies

* Optimizing prompts and action spaces
* Implementing self-refinement/self-reflection mechanisms
* Creating "documentation" or "world knowledge" about website interactions
* Dynamically feeding site-specific information to agents
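The self-refinement mechanism mentioned above can be sketched as a simple loop: the agent acts, a critic (often the same LLM) judges the attempt, and the feedback drives the next try. This is a minimal illustration with toy stand-ins for the model calls; `toy_act` and `toy_critique` are hypothetical placeholders, not the WebArena authors' code.

```python
def self_refine(task, act, critique, max_rounds=3):
    """Act, self-critique, and retry with feedback until the critique passes."""
    attempt = act(task, feedback=None)
    for _ in range(max_rounds):
        ok, feedback = critique(task, attempt)
        if ok:                                   # critic accepts the attempt
            return attempt
        attempt = act(task, feedback=feedback)   # retry, conditioned on feedback
    return attempt                               # best effort after max_rounds

# Toy stand-ins: the first attempt is wrong; with critic feedback the
# agent "corrects" itself on the next try.
def toy_act(task, feedback):
    return task["wrong_answer"] if feedback is None else task["answer"]

def toy_critique(task, attempt):
    if attempt == task["answer"]:
        return True, ""
    return False, "result does not verify; recompute"

task = {"answer": 42, "wrong_answer": 41}
print(self_refine(task, toy_act, toy_critique))  # → 42
```

In a real web agent the "critique" step would be another LLM call that inspects the page state after acting, but the control flow is the same.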

Web Interaction Methods

* Most websites do not have APIs, making point-and-click navigation crucial
* Current approaches include:
  - Accessibility tree representations
  - Visual understanding of websites
* Multimodal approaches (visual + text) are seen as potentially more effective
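An accessibility-tree observation is essentially the page flattened into a numbered, indented list of element roles and names, which the agent can then reference in its actions (e.g., `click [2]`). A toy illustration of the idea, not tied to any particular agent framework:

```python
def a11y_lines(node, depth=0):
    """Flatten a toy accessibility node into (depth, role, name) rows."""
    rows = [(depth, node["role"], node["name"])]
    for child in node.get("children", []):
        rows += a11y_lines(child, depth + 1)
    return rows

def render(node):
    """Number each element so the agent can act on it, e.g. `click [2]`."""
    rows = a11y_lines(node)
    return "\n".join(f"{'  ' * d}[{i}] {role} {name!r}"
                     for i, (d, role, name) in enumerate(rows, 1))

page = {"role": "combobox", "name": "Sort by", "children": [
    {"role": "option", "name": "Price: Low to High"},
    {"role": "option", "name": "Rating"},
]}
print(render(page))
```

Dropdown menus like this combobox are exactly the kind of element the summary notes models fail to recognize as clickable when given raw HTML or screenshots alone.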

Future Work

* Exploring web browsing tasks in professional contexts (e.g., software engineering scenarios)
* Investigating improvements through:
  - Synthetic data training
  - Reinforcement learning methods
  - Alternative website interaction methods

Sotopia Project

Concept and Focus

* Simulates social interactions between AI agents
* Explores AI's capability in socially complex scenarios
* Examines six interaction types:
  - Negotiation
  - Exchange
  - Competition
  - Collaboration
  - Accommodation
  - Persuasion

Research Methodology

* Conducted experiments in two settings:
  - Language models talking to other language models
  - Language models talking to humans
* Evaluation methods included:
  - Human evaluation
  - Language model-based evaluation
  - Measuring correlation between the two evaluation methods

Key Findings

* Language models are "okay" at navigating and evaluating social situations
* Correlation between evaluation methods was around 74%
* Performance varied by interaction pairing:
  - GPT-4 with GPT-4: 3.3/7
  - GPT-4 with humans: 4.8/7
  - Humans with humans: 6.15/7
* Agents are evaluated on:
  - Goal achievement
  - Social interaction believability
  - Adherence to social rules
  - Secret preservation

Follow-up Research

* Training better evaluators for social skills
* Improving models' ability to navigate social situations
* Trained a Mistral 7B model using:
  - Behavior cloning
  - Self-reinforcement
* Discovered models can optimize for machine judgments but still fall short in human evaluations

Code Optimization Research

Paper Focus and Methodology

* Improving program efficiency using large language models
* Used competitive programming problems
* Compared slow and fast implementations of the same problem
* Created an evaluation harness with virtualized CPUs
* Claimed "superhuman performance" in program optimization, with caveats

Performance Conditional Generation

* Prefixes generated sequences with performance tags (on a 0-10 scale)
* Goal is to fine-tune models to generate both slow and fast implementations
* At test time, the "fastest" tag is always supplied
* A similar technique is used in other domains, such as generating non-toxic text
* Potentially provides benefits similar to reinforcement learning by learning from both good and bad examples
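The tagging scheme described can be sketched as below. The tag format (`<perf:N>`), the binning rule, and the speedup cap are assumptions for illustration, not the paper's exact tokens:

```python
def perf_tag(speedup, n_bins=10, max_speedup=5.0):
    """Map a measured speedup onto an integer performance bin (0..n_bins)."""
    frac = min(speedup / max_speedup, 1.0)   # cap, then normalize to [0, 1]
    return round(frac * n_bins)

def format_example(problem, solution, speedup):
    """Training time: prefix each target solution with its performance tag,
    so the model learns from slow and fast implementations alike."""
    return f"<perf:{perf_tag(speedup)}> {problem}\n{solution}"

def inference_prompt(problem, n_bins=10):
    """Test time: always condition on the fastest tag."""
    return f"<perf:{n_bins}> {problem}"

print(format_example("sum 1..n", "n * (n + 1) // 2", speedup=5.0))
print(inference_prompt("sum 1..n"))
```

Because slow solutions carry low tags rather than being discarded, the model sees contrastive good/bad examples, which is the RL-like benefit the talk alludes to.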

Challenges in Real-World Applications

* Difficult to isolate code performance in real-world environments
* No optimized sandbox exists for performance measurement
* Requires good test coverage to prevent models from generating overly simplistic solutions
* Current approach relies on the user providing performance information

OpenDevin Project

Origins and Development

* Inspired by Devin's demo and the community excitement around it
* Jun Yang from Alibaba's Qwen team created a GitHub repo that quickly gained 1,000 stars
* Initial development was experimental, with a non-functional React-like interface
* Community interest came from both developers and non-developers

Current Implementation and Performance

* Achieves 21% on SWE-bench Lite without explicit planning
* Planning is not necessarily a "secret sauce"
* Agents generate and adapt plans during execution
* Running full benchmarks can be expensive (e.g., $6,000 per SWE-bench run with GPT-4)

Agent Capabilities and Limitations

* Challenging for agents to handle complex software engineering projects
* Potentially more useful for smaller tasks like setting up simple web apps
* Not seen as an immediate threat to developer jobs
* Could be helpful for routine, low-complexity tasks that take minimal developer time
* Managing an AI agent feels similar to managing junior developers

Future Vision

* Goal is to create "agentic" coding assistants that can:
  - Spawn off complex work units in the background
  - Implement helper functions automatically
  - Preserve human workflow and control
* Expects performance improvements with:
  - More advanced models (GPT-5, Llama 3)
  - Scaled inference-time compute
  - Potentially smarter model chaining/looping

Project Focus

* Started as a Devin clone
* Now focused on:
  - An open-source approach
  - A pluggable agent system
  - The ability to use different language models
  - Plans to incorporate multiple evaluation benchmarks (SWE-bench, WebArena, BrowserGym)

Code Search and Retrieval

* Different approaches exist for code search and context retrieval
* Mentioned tools like Morph (by Jesse Han) for code indexing and searching
* Current methods include:
  - Embedding-based retrieval
  - Re-rankers
  - Occasionally using LSP (Language Server Protocol) information

Benchmarking and Evaluation

SWE-bench (Software Engineering Benchmark)

* Designed to evaluate AI's ability to solve real-world software problems
* Involves giving language models:
  - A full code base
  - A problem statement (bug fix or feature request)
  - The task of generating appropriate code edits
* Uses real open-source GitHub repositories as source material
* Scraped 12 popular Python repositories
* Collected over 2,000 verified task instances
* Repositories contain approximately 3,000 files on average
* Includes problem statements, gold patches, and test patches
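For concreteness, a SWE-bench task instance looks roughly like the dictionary below (field names follow the public dataset release; the values are placeholders), and an instance counts as resolved when the model's patch makes the instance's newly added tests pass:

```python
# Illustrative shape of one SWE-bench task instance (sketch, not full schema).
instance = {
    "repo": "astropy/astropy",                   # a real scraped repository
    "instance_id": "astropy__astropy-12907",     # example instance id
    "base_commit": "<sha of the pre-fix commit to check out>",
    "problem_statement": "<issue text: bug report or feature request>",
    "patch": "<gold code patch, hidden from the model>",
    "test_patch": "<new tests that the fix must make pass>",
}

def resolved(run_tests, model_patch, inst):
    """Resolved iff applying the model's patch at the base commit makes
    the tests introduced by the test patch pass."""
    return run_tests(inst["base_commit"], model_patch, inst["test_patch"])

def toy_run(commit, model_patch, tests):
    # Toy harness: "passes" only when the model patch equals the gold patch.
    # A real harness applies both patches and actually runs the test suite.
    return model_patch == instance["patch"]

print(resolved(toy_run, instance["patch"], instance))  # → True
```

The test-patch-based check is what lets the benchmark score thousands of instances automatically instead of reviewing diffs by hand.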

Performance Insights

* Current state-of-the-art models perform poorly, with the best model (Claude 3 Opus) resolving only 3.8% of issues
* Performance challenges include:
  - Weak retrieval systems
  - Context length negatively correlating with performance
  - Models generating simpler, more primitive code compared to gold patches
* Researchers fine-tuned Code Llama to create SWE-Llama 7B and 13B models
* These are the first open-source models with non-zero performance on SWE-bench
* A follow-up work called SWE-agent explores an interactive agent-computer interface
* SWE-agent improved performance to 12.5% issue resolution

SWE-bench Lite

* Created a subset of 300 task instances, filtered from the 2,294 original instances
* Filtering aimed to focus on more reproducible, single-file change scenarios

Evaluation Challenges

* Questioning how "human performance" should be defined
* Recognizing the complexity of measuring the difficulty of code changes
* Exploring ways to categorize issues by difficulty (easy/medium/hard)
* High evaluation costs, especially for RAG (roughly 20 cents per instance)
* Costs can quickly add up, potentially reaching $100 or more per benchmark run

Academic Research Landscape

Shifting Dynamics

* Academic leadership in research peaked around 2010-2013
* Compute requirements are increasingly challenging for academic institutions
* Post-ChatGPT, there was initial academic uncertainty about research relevance

Adaptation Strategies

* Focusing on model evaluation and identifying current AI model limitations
* Emergence of open-source models and training frameworks (e.g., DeepSpeed, LLaMA-Factory)
* Universities investing in GPU clusters and specialized computing infrastructure
* Developing new systems for computational resource allocation

Compute Resources Challenges

* Limited availability of modern GPU hardware in academic and national supercomputing centers
* Most centers have older V100 or limited A100 GPUs
* Some potential compute resource providers mentioned:
  - EleutherAI (compute grants)
  - Andromeda (potential research compute)
  - Crusoe Energy
  - Strong Compute
  - RunPod and NetMind

Industry-Academia Interaction

* Industry has become more secretive about large language model developments
* Less transparency makes it difficult for academics to showcase their contributions
* Example: Matryoshka embeddings from UW were used by OpenAI with minimal acknowledgment
* This is a potential disincentive for grad students, due to the reduced visibility of their work

Academic Research Perspectives

* Preference for simple, incremental improvements that demonstrably work
* Values papers that make small but meaningful tweaks (e.g., the DPO method)
* Emphasizes the importance of open-source contributions
* Recognizes Hugging Face as a committed open-source organization
* Identifies lack of organization and focus as a challenge for academic open-source efforts

Model Architecture Insights

Current State

* Most current models use similar, Llama-style architectures
* Architecture engineering has reached a "local optimum"
* Common architectural elements include:
  - RoPE (rotary position embeddings)
  - SwiGLU activations
  - Small incremental improvements

Key Observations

* Data and training methods now matter more than architecture
* Some architectural innovations (like predicting the next 4 tokens) only show benefits at larger scales
* Alternative architectures like Mamba and RWKV are promising but have limitations
* Hybrid architectures (mixing linear and transformer layers) might address recall issues

Future Directions

* Interest in alternative linear architectures
* Exploration of hybrid model designs
* Continued focus on improving model performance through data and training techniques
* Potential innovations include:
  - Sublinear retrieval-based attention
  - K-nearest-neighbor operators for more efficient token processing

Test Set Contamination Research

The Problem

* Modern pre-training datasets are massive (trillions of tokens)
* It is difficult to verify whether benchmarks are truly independent of the training data
* Example: Codeforces problems were potentially included in web-crawled training data

Contamination Evidence

* GPT-4 scored 100% on pre-2021 Codeforces problems, but 0% on recent ones
* Phi-1.5 perfectly completed math problem examples
* There is limited transparency in industry about pre-training data sources

Research Approach

* Goal: Develop a statistical method to detect test set contamination
* The proposed method leverages the concept of "exchangeability" in test sets
* Key insight: A model trained on a test set will show a preference for the specific ordering of its examples

Methodology

* Treat contamination as a statistical dependence between the model and the test set
* Use a permutation test comparing the likelihood of the original test set ordering against randomly shuffled orderings
* Aim to prove contamination with statistical guarantees and a low false positive rate
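A simplified version of this permutation test can be sketched with a toy log-likelihood function standing in for the model (the paper's actual method uses a more efficient sharded likelihood comparison, so treat this as a conceptual sketch only):

```python
import random

def permutation_pvalue(loglik, examples, n_perm=200, seed=0):
    """p-value for H0: the model is indifferent to the test set's ordering
    (i.e., the test set is exchangeable under the model)."""
    rng = random.Random(seed)
    observed = loglik(examples)            # likelihood of the canonical order
    hits = 0
    for _ in range(n_perm):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if loglik(shuffled) >= observed:   # a shuffle scores at least as high
            hits += 1
    return (hits + 1) / (n_perm + 1)       # small p-value => likely contaminated

# Toy "contaminated" model: it memorized the canonical order, so orderings
# that preserve consecutive canonical pairs score higher.
canonical = list(range(20))

def toy_loglik(seq):
    return sum(1.0 for a, b in zip(seq, seq[1:]) if b == a + 1)

print(permutation_pvalue(toy_loglik, canonical))   # small: order preference detected
```

An uncontaminated model, by contrast, assigns the same likelihood to every ordering, so the p-value stays near 1 and the test makes no false accusation.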

Findings

* Can detect test set contamination with 100% accuracy when the test set is duplicated 10+ times in the training data
* Detection becomes progressively harder at lower duplication counts
* At 4 duplications, contamination is detected about 50% of the time
* Detection at a single duplication remains very difficult
* Tested several popular language models for benchmark data contamination
* Found mild evidence of contamination only for Mistral 7B, on the ARC-Easy benchmark
* Results suggest popular benchmarks are likely not extensively duplicated in training data

GAIA (General AI Assistant) Benchmark

Overview

* Created by Meta AI to test AI assistants' capabilities
* Designed to evaluate models on complex, multi-step reasoning tasks
* Aims to identify where current AI models fail

Characteristics

* Tasks range from level 1 (1-2 reasoning steps) to level 3 (10-20 complex steps)
* Requires open-world browsing and information processing
* Tasks are designed to have zero ambiguity, enabling automated evaluation

Performance Insights

* Random humans achieve ~90% success
* GPT-4 achieves only ~10% success on level 1 tasks
* Recent agent systems (like FRIDAY, Copilot, AutoGen) have improved to 10-40% performance

Anti-Cheating Strategies

* Keeping test sets private
* Requiring manual verification of answers
* Asking participants to provide answer traces
* Creating questions that can't be solved by simple memorization

Benchmarking Theory and Evolution

Four Eras of Benchmarking

1. DARPA Era (1980s): Comparing scientists' contributions
2. MNIST Era: Academic adoption of public leaderboards
3. ImageNet Era: Deep learning revolution and a dominant benchmark
4. Polymorphic Era (current): Radical plurality of benchmarks

Benchmark Challenges

* Traditional view: benchmarks as a simple "holdout method"
* Reality: the machine learning community continuously reuses test sets
* Example: the MMLU benchmark, with 14,000 data points and 5 million downloads monthly
* Consequence: reduces test set "longevity" from exponential to linear

Multitask Benchmarks

* There is a fundamental trade-off between diversity and sensitivity
* As benchmark diversity increases, sensitivity to irrelevant changes also increases
* Some benchmarks, like BIG-Bench Hard and MMLU, show extreme sensitivity
* Irrelevant metric transformations can shift up to 80% of model rankings

Dynamic Benchmarks

* Proposed as an evolving, time-dependent benchmarking approach
* Involves an iterative process of building models, finding failure cases, and adding them to the benchmark
* Standard designs alternate between model building and adversarial data collection
* A theorem suggests progress can stall after a small number of rounds
* More sophisticated "hierarchical dynamic benchmarks" with parallel threads can potentially guarantee more progress

Self-RAG (Retrieval-Augmented Generation)

Framework Innovations

* Introduces special tokens for:
  - Deciding when to retrieve
  - Evaluating document relevance
  - Generating responses
  - Self-evaluating output

Key Features

* Allows language models to:
  - Determine whether retrieval is necessary
  - Select only helpful documents
  - Skip retrieval for queries not requiring factual grounding
  - Improve answer reliability and efficiency
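The control flow these special tokens enable can be sketched as ordinary branching logic. The helper functions below are hypothetical stand-ins for the reflection-token predictions the real model emits inline during generation:

```python
def self_rag_answer(query, needs_retrieval, retrieve, is_relevant,
                    generate, score):
    """Retrieve only when needed, keep only relevant documents, and return
    the highest self-scored candidate answer."""
    if not needs_retrieval(query):                 # "retrieve?" decision: no
        return generate(query, doc=None)
    docs = [d for d in retrieve(query) if is_relevant(query, d)]  # relevance check
    candidates = ([generate(query, doc=d) for d in docs]
                  or [generate(query, doc=None)])  # fall back if nothing relevant
    return max(candidates, key=lambda a: score(query, a))  # self-evaluation

# Toy stand-ins for the model's decisions:
needs = lambda q: "capital" in q
retr = lambda q: ["France's capital is Paris.", "Bananas are yellow."]
rel = lambda q, d: "capital" in d
gen = lambda q, doc: f"Answer based on: {doc}" if doc else "I think so."
sc = lambda q, a: len(a)

print(self_rag_answer("capital of France?", needs, retr, rel, gen, sc))
```

Skipping retrieval for queries that need no factual grounding is what buys the efficiency gain: the expensive retrieve-and-rerank path only runs when the model's own "retrieve" decision fires.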

Training Approach

* Uses a "critic" language model to teach generation
* Generates synthetic training data via GPT-4
* Created 150,000 instruction training data points
* Enables training without extensive manual annotations

Performance

* Significantly outperforms baseline models, especially on knowledge composition and citation precision tasks
* Matches ChatGPT performance on 5 out of 6 tasks, despite being a smaller model (7B-13B parameters)
* Has been widely adopted in academic and industry applications (e.g., LangChain, LlamaIndex)

Process Supervised Reward Models

Concept and Approach

* Breaks down solution verification into individual step evaluations
* Trained using human annotators who label each step as correct, incorrect, or neutral
* Allows more granular assessment of solution quality
* Process supervision provides more direct feedback than outcome supervision
* Lets researchers more precisely reinforce specific behaviors
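The contrast with outcome supervision can be made concrete with a toy scorer: one common way to aggregate step-level labels is to multiply per-step correctness probabilities, so a single bad step sinks the chain even when the final answer happens to be right. The aggregation rule here is illustrative, not necessarily the one used in the work discussed:

```python
def process_reward(step_scores):
    """Process supervision: score a solution as the product of per-step
    correctness probabilities; one bad step sinks the whole chain."""
    r = 1.0
    for s in step_scores:
        r *= s
    return r

def outcome_reward(final_answer, target):
    """Outcome supervision: only the final answer is checked."""
    return 1.0 if final_answer == target else 0.0

# A chain that luckily reaches the right answer via a wrong middle step:
steps = [0.95, 0.10, 0.90]            # step 2 is labelled almost surely wrong
print(process_reward(steps))           # low: penalized despite the lucky answer
print(outcome_reward(42, 42))          # 1.0: outcome supervision can't tell
```

This is why process supervision gives the more direct training signal the summary describes: it localizes credit and blame to individual steps instead of rewarding lucky final answers.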

