Overview
- The podcast explores the development of an agentic software engineer at OpenAI, capable of writing code, running tests, and working independently for extended periods (up to 1-2 hours), representing a significant advancement in AI coding capabilities.
- The team implemented an agents.md approach for providing hierarchical instructions to AI agents, similar to onboarding a new employee, which compresses "exploration time" and provides context without overly deterministic guardrails.
- The development philosophy balances safety constraints with autonomy, currently limiting internet access while allowing the AI to make intelligent decisions independently, with a vision of gradually shifting more decision-making complexity to the model itself.
- Future directions include enabling multimodal inputs, network access, more integrated development environments, and potentially teams of hierarchical agents, as part of a broader progression toward more advanced AI coding assistants.
Content
Background and Origins
- The podcast hosts, Alessio Fanelli (CTO at Decibel) and swyx (founder of Smol AI), discuss ChatGPT Codex with guests from OpenAI.
- One speaker previously worked on Multi, a macOS pair programming tool focused on human-to-human collaboration, before transitioning to thinking about human-AI pair programming and joining OpenAI.
- Another speaker previously founded Airplane, an internal tool platform that was exploring early AI development capabilities, before joining OpenAI after conversations about developing Codex.
- The Codex project focuses on giving AI agents access to computers, exploring the concept of an "agentic software engineer."
Early Development and Key Insights
- OpenAI was experimenting with giving reasoning models access to terminals; a memorable moment came when the team saw a demo of an AI that could modify its own code.
- The team developed Codex CLI with a focus on safe terminal access, a full-auto mode backed by stronger sandboxing, and letting the model think longer and do more.
- The speaker experienced a "moon landing moment" early this year, believing we are on the cusp of creating an "agentic software engineer."
- Key considerations in developing Codex included:
- The team acknowledges experiencing significant "scope creep" in developing the project, but each incremental change made sense in pursuing their vision.
Agent Capabilities and Design
- The AI software engineering agent aims to go beyond basic coding: it can write code, create scripts to modify its own changes, run and test its own code, and provide detailed, referenced feedback on its work.
- Using the agent requires an initial "leap of faith" but can be surprisingly effective, demonstrating long-running independent work capabilities.
Best Practices for AI Coding Agents
- The team recommends using agents.md files to provide hierarchical instructions and context for AI agents, similar to onboarding a new employee.
- Implementing basic linting and formatting tools improves code quality.
- Making the codebase discoverable by providing clear directory and project context is essential.
- The agents.md file compresses "exploration time" for AI agents, giving them their "first few years of job experience" through training and context.
- The team recommends starting simple with agents.md and gradually building complexity, using tools like Codex to auto-generate these files.
- Future plans include auto-generating agents.md based on PRs and feedback, making context-setting more automatic.
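One way to picture the hierarchical instructions described above is a helper that gathers every agents.md file between the repository root and the directory being worked on, general guidance first and directory-specific guidance last. This is a minimal sketch under stated assumptions: the source does not describe how Codex actually locates or orders these files, and the function name `collect_agent_instructions` and the "# From ..." labels are hypothetical.

```python
from pathlib import Path


def collect_agent_instructions(repo_root: Path, working_dir: Path,
                               filename: str = "agents.md") -> str:
    """Concatenate instruction files found from repo_root down to working_dir.

    Files closer to working_dir come later, so general guidance is read
    first and more specific guidance last. (Illustrative ordering only;
    not a documented Codex behavior.)
    """
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    # Raises ValueError if working_dir is not inside repo_root.
    relative = working_dir.relative_to(repo_root)

    # Build the chain of directories: root, root/a, root/a/b, ...
    directories = [repo_root]
    for part in relative.parts:
        directories.append(directories[-1] / part)

    sections = []
    for directory in directories:
        candidate = directory / filename
        if candidate.is_file():
            label = candidate.relative_to(repo_root).as_posix()
            text = candidate.read_text(encoding="utf-8").strip()
            sections.append(f"# From {label}\n{text}")
    return "\n\n".join(sections)
```

Calling `collect_agent_instructions(Path("."), Path("services/api"))` in a repository with a top-level agents.md and a second one under services/api would return both files' contents, the top-level one first.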
Software Development Practices with AI
- Recommended practices include using TypeScript for type safety and creating modular, testable code architectures.
- Good architecture is increasingly important with AI tools, while humans remain strong at architectural design.
- Strategic code naming for AI efficiency is important - their project's internal code name "Wham" was chosen to be easily identifiable by AI agents.
- The team believes AI systems are still fundamentally rooted in human communication, with human involvement (code reviews, deployment, requirements) continuing to be essential.
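The naming point above can be illustrated with a small search experiment: a distinctive code name matches only the files that belong to the project, while a generic word matches far more. This is a sketch, assuming the agent finds code by case-insensitive text search; the helper `count_name_hits` and the file paths are hypothetical and not taken from the episode.

```python
from typing import Dict, Iterable


def count_name_hits(lines: Iterable[str], names: Iterable[str]) -> Dict[str, int]:
    """Count how many lines contain each name (case-insensitive substring)."""
    lowered = [line.lower() for line in lines]
    return {
        name: sum(1 for line in lowered if name.lower() in line)
        for name in names
    }


# Hypothetical repository listing: two files belong to the "wham" project.
paths = [
    "services/wham/scheduler.py",
    "services/wham/README.md",
    "lib/core/task_runner.py",
    "lib/core/task_queue.py",
    "docs/task_tracking.md",
    "tools/build_tasks.py",
]
hits = count_name_hits(paths, ["wham", "task"])
```

Here the distinctive name returns only the project's own files, whereas the generic term also matches unrelated ones that the agent would then have to sift through.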
Documentation and Agent Instruction Approach
- The team decided to create an `agents.md` file for instructions specific to agents, choosing a generic, non-branded approach.
- Key design considerations include:
- The team favors a "prompt and trust the model" strategy, letting models operate at full capacity rather than using overly deterministic approaches.
- Agents are trained to aggressively search for and parse `agents.md`, which is treated as a system prompt that guides agent behavior.
AI System Design Philosophy
- The discussion explores appropriate "deterministic guardrails" versus allowing models to make intelligent decisions independently.
- Three key decision-makers in AI product development are identified: users, developers, and AI models.
- Their product has two UI buttons ("ask" and "code") that spawn different model containers, with the goal of gradually shifting more decision-making complexity into the model itself.
- The long-term vision involves moving away from complex, developer-built state machines toward enabling models to solve increasingly complex problems independently.
- The team is exploring the potential for teams of agents with hierarchical management, focusing on training approaches that help models learn.
Context Window Management and Model Development
- Three potential approaches to context management were proposed:
- The long-term vision involves building specialized models for specific purposes, with the learnings carried back into larger, more general models.
- The GPT-4.1 development process involved working closely with developers, gathering feedback, creating evaluation metrics, and incorporating learnings into mainline models.
- Current development cycles are slow and potentially expensive, requiring data collection, human raters, and repeated testing.
Compute Platform Capabilities
- Tasks currently take between 1 and 30 minutes in most cases, with a hard cutoff around 1 hour.
- Concurrency limits allow 5-10 simultaneous tasks, with a current limit of approximately 60 tasks per hour.
- The recommended usage approach is to have an "abundance mindset" - quickly generate prompts, delegate tasks to the AI, and continue with other work.
- Users can set up environments and scripts, primarily for installing dependencies, and there's a REPL environment for interactive editing.
- Internet access is currently cut off for agents during runtime for safety reasons, though future plans include potentially allowing limited network/repository access.
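The "abundance mindset" combined with the limits above suggests a client-side pattern: queue many prompts at once and let a small scheduler keep within the concurrency and hourly caps. The following is a minimal sketch under stated assumptions: `delegate_tasks` and `run_task` are hypothetical names (the source describes no such API), the defaults of 5 concurrent tasks and 60 per hour simply mirror the approximate figures quoted above, and the hourly cap is approximated by evenly spacing task starts rather than by a sliding window.

```python
import asyncio
import time
from typing import Awaitable, Callable, Iterable, List


async def delegate_tasks(
    prompts: Iterable[str],
    run_task: Callable[[str], Awaitable[str]],
    max_concurrent: int = 5,
    max_per_hour: int = 60,
) -> List[str]:
    """Run every prompt through run_task while respecting both limits.

    run_task is a caller-supplied coroutine standing in for "submit a
    task to the agent and wait for its result".
    """
    semaphore = asyncio.Semaphore(max_concurrent)
    min_interval = 3600.0 / max_per_hour  # seconds between task starts
    start_lock = asyncio.Lock()
    next_start = time.monotonic()

    async def worker(prompt: str) -> str:
        nonlocal next_start
        async with semaphore:  # at most max_concurrent tasks in flight
            async with start_lock:  # serialize the pacing decision
                delay = next_start - time.monotonic()
                if delay > 0:
                    await asyncio.sleep(delay)
                next_start = max(next_start, time.monotonic()) + min_interval
            return await run_task(prompt)

    # Results come back in the same order as the prompts.
    return await asyncio.gather(*(worker(p) for p in prompts))
```

With this shape, the caller fires off a whole batch with `asyncio.run(delegate_tasks(prompts, run_task))` and moves on to other work, which matches the delegate-and-continue style described above.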
Safety and Development Approach
- The team is being conservative about agent capabilities while aiming to give humans and agents as much access as possible within safety constraints.
- Safety tests have shown resilience against prompt injection.
- The current focus is on "one-shot" task completion with minimal human intervention, though they are exploring ways to integrate human-agent interaction.
- The team doesn't claim superiority over other agent approaches but emphasizes their unique "one-shot" approach.
Future Vision and Research Direction
- Codex is viewed as a research experiment exploring autonomous software engineering, with the goal of understanding how AGI can benefit humanity.
- The vision includes humans focusing on ambiguous/creative tasks while delegating routine work to AI agents.
- Potential improvements for a full release include multimodal inputs, network access, a more integrated development environment, and a seamless transition between cloud and CLI.
- The team is interested in community feedback, especially regarding environment customization.
Current Status and Call to Action
- The tool's environment customization and user experience are still evolving.
- The team is offering generous rate limits to encourage user exploration.
- Users are encouraged to try out the tool extensively, experiment with different prompting approaches, and share feedback.
- Pricing discussion is considered premature at this stage, with focus on demonstrating economic value to users first.
- The long-term goal appears to be developing a general AGI assistant, with the current tool being part of a broader progression toward more advanced AI coding agents.