Latent Space: The AI Engineer Podcast

Building AGI in Real Time (OpenAI Dev Day 2024)


OpenAI Dev Day 2024 Highlights and Technical Innovations

Realtime API:

- Uses persistent WebSocket connections
- Implements function calling to access external information
- Allows for more natural, conversational AI interactions
- OpenAI's first WebSocket-based API
- Required significant safety-mitigation work for real-time audio
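
The session setup described above can be sketched as the JSON event a client sends after opening the WebSocket. Event and field names follow the Realtime API beta (`session.update`, tool definitions, modalities), but treat the exact shapes and the endpoint URL as assumptions to check against the API reference.

```python
import json

# Realtime API endpoint (model name is illustrative).
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def session_update(instructions: str, tools: list) -> str:
    """Serialize a session.update event that sets ground rules and tools."""
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,    # behavioral ground rules via prompt
            "tools": tools,                  # function-calling definitions
            "modalities": ["audio", "text"],
        },
    })

# A hypothetical weather tool, purely for illustration.
event = session_update(
    "You are a concise voice assistant.",
    [{"type": "function", "name": "get_weather",
      "parameters": {"type": "object",
                     "properties": {"city": {"type": "string"}}}}],
)
```

Once the session is configured, each spoken input continues the same session automatically; no client-side state machine is required.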

- Example use cases: travel apps and language assistants
- A compelling demo showing an AI assistant ordering 400 chocolate-covered strawberries via phone call
- The "Wanderlust" app, which evolved from a travel demo to showcase voice interaction capabilities

The o1 model:

- Uses reinforcement learning
- Can reason and learn from mistakes
- Excels at advanced math and complex coding
- Requires more computational power and time compared to previous models

Technical Features and Development Approach

Fine-tuning:

- Can customize AI models for specific business or professional needs
- Now includes vision fine-tuning (e.g., medical image analysis)
- Vision fine-tuning requires only ~100 images to get started
- Particularly useful for OCR on specialized document formats and for improving bounding-box accuracy
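
A vision fine-tuning dataset is a JSONL file of chat-format examples that mix text and image content. Here is a minimal sketch of one training line; the prompt, image URL, and label are invented for illustration.

```python
import json

def make_example(image_url: str, label: str) -> str:
    """Build one JSONL line pairing an image-bearing user turn with the target answer."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": "What does this scan show?"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
            {"role": "assistant", "content": label},  # the answer to learn
        ]
    })

# Per the talk, ~100 such lines are enough to start a vision fine-tune.
line = make_example("https://example.com/scan-001.png", "No abnormality detected.")
```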

- Achieving human-level latency (around 300 milliseconds for conversation)
- Creating a "pit of success" for developers, minimizing complexity
- Automatic prompt caching to reduce costs
- Working with system prompts to set behavioral ground rules
- Implementing structured outputs (like JSON formatting)
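
Structured outputs and prompt caching combine naturally in one request: keep the long system prompt as a stable prefix so caching can reuse it, and pin the response shape with a JSON schema. A sketch of such a Chat Completions request body (the schema and model name are illustrative):

```python
# Stable prefix: automatic prompt caching matches on repeated leading tokens,
# so the system prompt should not change between calls.
SYSTEM_PROMPT = "You extract flight details. Always answer in JSON."

def build_request(user_text: str) -> dict:
    """Assemble a request body using structured outputs (json_schema mode)."""
    return {
        "model": "gpt-4o",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},  # cacheable prefix
            {"role": "user", "content": user_text},        # varying suffix
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "flight",
                "schema": {
                    "type": "object",
                    "properties": {
                        "origin": {"type": "string"},
                        "destination": {"type": "string"},
                    },
                    "required": ["origin", "destination"],
                    "additionalProperties": False,
                },
            },
        },
    }
```

With `json_schema` mode the model's reply is constrained to valid JSON matching the schema, removing a class of parsing failures.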

- Working with audio and streaming is complex, separate from AI capabilities
- The API feels similar to assistant/chat-completion streaming
- Function calling involves events like "function call started", "argument started", etc.
- The API is session-based, automatically continuing with each spoken input
- No state machine needed; conversation flow is managed through prompts

Developer Considerations and Best Practices

- Consent is crucial before calling individuals or businesses
- Businesses may be more receptive to AI interactions than individuals
- The legal landscape is still evolving and unclear
- Developers should be careful about randomly calling people with AI

- A direct WebSocket connection from the browser is possible but not recommended for production
- Current limitation: exposing API keys in client-side source code is a security risk
- Developers need to build WebSocket proxies to protect API keys
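
The proxy pattern: the browser connects to your server, and only the server attaches the real API key before relaying frames upstream. The relay loop itself depends on your web framework, but the key-handling step can be sketched in isolation (function name and header set are illustrative):

```python
import os

def upstream_headers(client_headers: dict) -> dict:
    """Build headers for the server-to-OpenAI connection: drop anything the
    client sent for auth, then attach the server-held key from the environment."""
    headers = {k: v for k, v in client_headers.items()
               if k.lower() != "authorization"}  # never forward client auth
    headers["Authorization"] = f"Bearer {os.environ.get('OPENAI_API_KEY', '')}"
    return headers
```

The browser never sees the key; rotating it is then a server-side change only.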

Platform Strategy and Future Direction

- Moving from one-size-fits-all models to automated, continuously fine-tuned models
- Future models could automatically improve as developers use them and provide more data
- The aim is to reduce manual model updates and snapshots

- OpenAI offers free evaluations in exchange for developers sharing completion data
- Default policy: no training on API data unless users opt in
- Data is sanitized to remove personally identifiable information (PII)
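
This is not OpenAI's actual sanitizer, but the idea of PII scrubbing can be illustrated with a couple of regex passes that redact obvious patterns (emails, phone numbers) before data is stored:

```python
import re

# Deliberately simple patterns; real PII removal needs far broader coverage
# (names, addresses, IDs) and is usually model-assisted.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```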

- Real-time audio is becoming a first-class experience in application development
- Most people prefer speaking and listening over typing and writing
- Current voice modes have been overly restrictive with refusals
- OpenAI is moving toward a more nuanced approach to content moderation
- A "safety API" with controllable sensitivity settings is planned

Developer Feedback and Community Focus

- Investment in reasoning capabilities
- Multi-modality capabilities
- Tool use and function calling
- What capabilities developers need that current models can't yet provide

Model Distillation and Performance

- Demonstrated distilling from GPT-4o to GPT-4o mini
- Achieved only a 2% performance hit while being 15x cheaper
- Particularly useful for low-latency, low-cost, high-performance applications
- Creates smaller, more efficient AI models that are more accessible
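
The distillation recipe described here amounts to capturing the large model's completions and replaying them as fine-tuning data for the small model. A sketch of that data-preparation step (the `teacher_logs` structure is illustrative, not an OpenAI API):

```python
import json

# Logged (prompt, teacher completion) pairs from the large model.
teacher_logs = [
    {"prompt": "Summarize: the cat sat on the mat.",
     "completion": "A cat sat on a mat."},
]

def to_finetune_jsonl(logs: list) -> str:
    """Convert teacher logs into chat-format JSONL for fine-tuning the student."""
    lines = []
    for log in logs:
        lines.append(json.dumps({
            "messages": [
                {"role": "user", "content": log["prompt"]},
                {"role": "assistant", "content": log["completion"]},  # teacher output as target
            ]
        }))
    return "\n".join(lines)
```

Fine-tuning the smaller model on this file is what trades a small accuracy hit for the large cost and latency reduction quoted above.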

- o1-mini is strong in STEM and coding tasks
- o1-preview is better for broader knowledge and more complex scenarios
- Developers are encouraged to try both models depending on their specific needs

Organizational Changes and Leadership Perspective

OpenAI's five-level framework for AI capability:

- Level 1: Chatbots
- Level 2: Reasoners
- Level 3: Agents
- Level 4: Innovators
- Level 5: Organizations

AI Agents and Future Capabilities

- Personalized, adaptive technology across devices
- Real-time translation during international business conversations
- More intuitive voice assistants and context-aware interfaces
- Significant advances in context length, potentially reaching 10 million tokens within months and "infinite context" within a decade

Alignment and Safety Approach

- Adapting to where the science actually leads
- Making models safer as capabilities increase
- Solving real-world alignment challenges as they emerge

Startup and Business Considerations
