
Latent Space: The AI Engineer Podcast

From API to AGI: Structured Outputs, OpenAI API platform and O1 Q&A — with Michelle Pokrass & OpenAI Devrel + Strawberry team

Overview

Content

Background and Career Journey

- A bank (working with Visual Basic)
- Google
- Coinbase (during the 2018–2020 crypto era)
  - Worked on ACH rails
  - Learned critical engineering skills
  - Got early production experience

Early Career Progression

- Improving notification speed (initially took 10 minutes)
- System reliability improvements
- Performance issues in Postgres
- Inefficient job-queuing infrastructure
- Database scaling challenges

Joining OpenAI

ChatGPT Release Challenges

- Postgres authorization system issues
- Complex GPU resource allocation decisions
- Linking developer and ChatGPT accounts

JSON Mode and Structured Outputs Development

- Ensuring output is always valid JSON
- Attempting to match specified schemas
- Engineering approach: constraining model outputs through token masking
- Modeling approach: training the model to better follow desired formats
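The token-masking idea above can be sketched in a few lines: at each decoding step, tokens that would violate the target grammar are masked out before sampling. This is a minimal illustration with a toy vocabulary and made-up scores, not OpenAI's implementation.

```python
import math

def mask_and_pick(logits, allowed):
    """Mask every token the grammar forbids at this step to -inf,
    renormalize with a softmax, and pick the highest-scoring survivor."""
    masked = {t: (s if t in allowed else float("-inf"))
              for t, s in logits.items()}
    z = sum(math.exp(s) for s in masked.values() if s != float("-inf"))
    probs = {t: (math.exp(s) / z if s != float("-inf") else 0.0)
             for t, s in masked.items()}
    return max(probs, key=probs.get), probs

# After emitting `{"name":`, only a string opener is grammatically valid,
# so even a token the model prefers (here `{`) gets zero probability.
logits = {'"': 1.2, '{': 2.5, '3': 0.7, 'true': 0.4}
token, probs = mask_and_pick(logits, allowed={'"'})
```

The key property is that masking happens before sampling, so invalid JSON can never be emitted regardless of what the raw model scores prefer.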

- Designed for developers who need precise, system-compatible function calls
- Works with tools like Pydantic or Zod objects
- Eliminates manual serialization complexity
- Allows easy parsing of model responses

- JSON mode is better for more creative, open-ended JSON generation
- Most developers are likely to prefer Structured Outputs

- Function calling: intended for actual tool/function invocation (e.g., querying databases, sending emails)
- Structured Outputs: focused on getting model responses in a specific format
- Previously, developers were "hacking" function calling to get desired response formats
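The distinction can be made concrete by comparing the two request shapes. The sketch below uses hypothetical names (`send_email`, `support_ticket`); the point is only that a tool definition describes an action to invoke, while a response format constrains the answer itself.

```python
# Function calling: the model decides to *invoke* an action.
send_email_tool = {
    "type": "function",
    "function": {
        "name": "send_email",                  # hypothetical tool name
        "description": "Send an email to a recipient.",
        "parameters": {
            "type": "object",
            "properties": {
                "to":      {"type": "string"},
                "subject": {"type": "string"},
                "body":    {"type": "string"},
            },
            "required": ["to", "subject", "body"],
            "additionalProperties": False,
        },
    },
}

# Structured Outputs: the model's *answer itself* must match a schema.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "support_ticket",              # hypothetical schema name
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "priority": {"type": "integer"},
            },
            "required": ["category", "priority"],
            "additionalProperties": False,
        },
    },
}
```

Before Structured Outputs shipped, developers who only wanted the second behavior had to define a dummy tool like the first and read its arguments back — the "hack" mentioned above.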

Technical Implementation Details

- Implemented a "refusal field" so models can refuse requests that violate policies
- Goals:
  - Preserve the model's ability to refuse inappropriate requests
  - Provide a clear developer experience
  - Make refusals easy to handle programmatically
- Chose a refusal field over traditional error codes because developers pay for the generated tokens
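In practice this means branching on a field of the parsed message rather than catching an HTTP error. A minimal sketch, using a plain dict in place of a real SDK response object:

```python
def handle_structured_response(message: dict) -> str:
    """Branch on the refusal field rather than an HTTP status code:
    a refusal is a normal, billed completion, not a transport failure."""
    if message.get("refusal"):
        # Surface the refusal to the end user; don't retry blindly.
        return f"Model declined: {message['refusal']}"
    return message["content"]

ok = handle_structured_response({"content": '{"step": 1}', "refusal": None})
declined = handle_structured_response(
    {"content": None, "refusal": "I can't help with that."}
)
```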

- Challenges in error handling for AI models, particularly around HTTP error codes
- Proposed new error-code ranges (a 600 series) specifically for AI model errors; potential codes include:
  - 601: auto-refusal
  - 602: ChatML format violations

- OpenAI transitioned from the completions endpoint to the chat completions API
- Chat completions use the ChatML format with defined message roles (user, assistant, system)
- Recent improvements in Structured Outputs have reduced certain model errors
- Models can now better constrain themselves to the ChatML format

Evaluation (Evals) Landscape

- Robust evaluation pipelines are difficult to create
- Many current evals are saturated
- Models can often achieve high performance with different prompting
- BFCL (Berkeley Function-Calling Leaderboard)
- SWE-bench (tests model performance on GitHub issues)

Parallel Function Calling and Latency

Agents and Structured Outputs

Assistants API

- Hosted tools (file search, code interpreter)
- Introducing statefulness to the API

Structured Output and Model Capabilities

- Extracting structured data from unstructured data
- Works with both text and vision inputs
- Dynamic UI generation
- Enterprise application improvements

- Supports recursive schemas for UI generation
- Provides reliability gains by ensuring strict type matching
- Uses a custom JSON Schema dialect with specific design choices:
  - Standard JSON Schema allows additional properties by default and requires explicit configuration for stricter enforcement
  - The dialect makes all keys required by default to match developer expectations
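The dialect's defaults can be expressed as a small transformation over an ordinary JSON Schema: forbid extra keys and mark every declared property required. This helper is a sketch of that tightening rule, not an official utility.

```python
def make_strict(schema: dict) -> dict:
    """Recursively tighten a JSON Schema the way the strict dialect
    described above does: forbid additional properties and require
    every declared key."""
    out = dict(schema)
    if out.get("type") == "object":
        props = out.get("properties", {})
        out["additionalProperties"] = False
        out["required"] = list(props)          # every key becomes required
        out["properties"] = {k: make_strict(v) for k, v in props.items()}
    elif out.get("type") == "array" and "items" in out:
        out["items"] = make_strict(out["items"])
    return out

loose = {"type": "object",
         "properties": {"name": {"type": "string"},
                        "age": {"type": "integer"}}}
strict = make_strict(loose)
```

This inversion of JSON Schema's permissive defaults is what lets the runtime guarantee that every generated object has exactly the declared shape.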

- Aim to be explicit
- Provide clear definitions
- Allow developers to choose the most suitable model
- Maintain compatibility with existing schema standards

- Developers can make keys nullable by using union types (e.g., integer and null)
- Allows specifying chain-of-thought fields before final answers
- Enables step-by-step rendering of model responses (e.g., in math-tutoring scenarios)
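Both ideas fit in one small schema. In this hypothetical math-tutor shape, `steps` is declared before `final_answer` so the model reasons before committing, and the answer is made nullable with a type union:

```python
import json

# Chain-of-thought field first, answer last (hypothetical tutor schema).
tutor_schema = {
    "type": "object",
    "properties": {
        "steps": {"type": "array", "items": {"type": "string"}},
        "final_answer": {"type": ["integer", "null"]},  # nullable via union
    },
    "required": ["steps", "final_answer"],
    "additionalProperties": False,
}

# A conforming reply can be rendered step by step as tokens stream in.
reply = json.loads('{"steps": ["2x = 6", "x = 3"], "final_answer": 3}')
unsolved = json.loads('{"steps": ["need more info"], "final_answer": null}')
```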

Prompt and Instruction Strategies

1. System message (good for function-call guidelines)
2. Function descriptions (best for explaining how to call a function)
3. Property names and descriptions can guide model behavior

- AI is early enough that developers are still discovering best practices
- Renaming properties to be more descriptive can potentially improve model performance
- No official benchmarks exist for very granular instruction techniques

- Focus on raising overall capabilities for users
- Recommend customers develop their own evaluations
- Goal is to make evaluation processes easier for developers

- One developer implemented Structured Outputs for AI News
- Reduced code by 20 lines
- Decreased API costs by approximately 55%

Advanced Features and Future Directions

Model Selection and Fine-Tuning

- Start with GPT-4o mini (cheapest; good for most use cases)
- Move to GPT-4o if more performance is needed
- Consider fine-tuning for advanced use cases
- Discussed challenges with rating systems and structured outputs
- Suggested using log probabilities for more accurate classification tasks
- Highlighted the importance of model calibration when generating structured responses
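The log-probability suggestion works like this: instead of parsing a sampled label out of generated text, exponentiate the returned logprobs of the candidate label tokens and normalize them into a distribution. A minimal sketch with made-up logprob values:

```python
import math

def class_probabilities(label_logprobs: dict) -> dict:
    """Turn log-probabilities of candidate label tokens into a
    normalized distribution; more calibrated than sampling a label."""
    weights = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Hypothetical logprobs for the first token of each sentiment label.
probs = class_probabilities({"pos": -0.2, "neg": -1.8})
best = max(probs, key=probs.get)
```

Beyond picking `best`, the normalized probabilities give a confidence score you can threshold on, which a sampled text label cannot.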

- Exploring parallel function calling
- Interested in developer feedback on potential new features
- Aiming to make model migration easier for users

Prompt Engineering and Model Performance

Model Releases and Distinctions

- One is function-calling-tuned (for the API)
- One is chat-tuned (for the ChatGPT interface)
- OpenAI aims to be transparent about model capabilities
- They're still learning how to communicate model changes effectively
- The goal is to give developers flexibility in model selection
- Future release notes will aim to be more comprehensive about model improvements

OpenAI API and Strategy

- Engineering remains the primary way to access AI models
- There's significant potential ("alpha") in writing code and deploying AI solutions
- "AI engineering" is emerging as a distinct discipline

Assistants API Development

API and Product Updates

- The seed parameter is not fully deterministic
- More determinism in initial tokens
- Challenges in balancing determinism with system reliability

- Primarily used for classification tasks
- Can help refine output by biasing toward specific tokens
- Considered a power-user feature with limited widespread adoption
- Potential use cases include guiding classification outputs and controlling punctuation
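Mechanically, logit bias is an additive nudge applied to the raw scores before the softmax. This toy sketch (made-up tokens and scores, not the real tokenizer) shows how a positive bias flips which token wins:

```python
import math

def softmax(logits: dict) -> dict:
    z = sum(math.exp(v) for v in logits.values())
    return {k: math.exp(v) / z for k, v in logits.items()}

def apply_logit_bias(logits: dict, bias: dict) -> dict:
    """Add a per-token bias to the raw logits before the softmax,
    the way the API's logit_bias parameter (roughly -100..100) does."""
    return {k: v + bias.get(k, 0.0) for k, v in logits.items()}

logits = {"yes": 1.0, "no": 1.5, ".": 0.2}
unbiased = softmax(logits)                       # "no" wins
biased = softmax(apply_logit_bias(logits, {"yes": 5.0}))
```

In the real API the keys are token IDs rather than strings, and a bias near +100 or -100 effectively forces or bans a token — which is how people steer classification outputs or suppress punctuation.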

- More transparent tier system
- Developers can now see their current tier in the dashboard
- Tier two and above have full access to fine-tuning
- Rollouts and feature access are tied to tier levels

Developer Ecosystem Strategy

Batch API Features

- User activation workflows
- Offline evaluations

Vision API Highlights

- Assistants API
- Batch API
- Structured Outputs

Video and API Development

Whisper API Insights

- Supports translation in ~50 languages
- Prompt feature allows vocabulary biasing, helpful for acronyms and specialized terms
- Developers use workarounds like dictionary replacement for transcription accuracy
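The dictionary-replacement workaround is just a post-processing pass over the transcript: map each common mis-hearing of an acronym or jargon term to its intended spelling. A minimal sketch (the mis-hearings here are hypothetical examples):

```python
import re

def fix_vocabulary(transcript: str, replacements: dict) -> str:
    """Swap known mis-hearings of acronyms/jargon for the intended
    spelling, matching on word boundaries, case-insensitively."""
    for wrong, right in replacements.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right,
                            transcript, flags=re.IGNORECASE)
    return transcript

fixed = fix_vocabulary(
    "we called the open a i whisper a p i",
    {"open a i": "OpenAI", "a p i": "API"},   # hypothetical mis-hearings
)
```

Passing the expected vocabulary through the prompt parameter biases the model up front; a replacement table like this catches whatever still slips through.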

- Exploring new API shapes for advanced voice / speech-to-speech modes
- Likely to use a socket-based approach (like LiveKit) instead of traditional request-response
- Goal is to make new API paradigms easy for developers to adopt

OpenAI Enterprise Features

University of Waterloo Insights

Book Recommendations

OpenAI Hiring and Team Dynamics

- Low ego
- User-focused
- Driven
- Willing to "roll up their sleeves"
- Unpretentious

o1 Model Release and Features

- No system role
- No temperature setting
- No tool calling
- No streaming
- Limited visibility into token usage
- Introduces new evaluation benchmarks
- Highlights scaling laws for both training and test-time compute
- o1-mini is notably interesting for its relative outperformance compared to larger models
- Trained specifically for certain domains (e.g., o1-mini performs better in STEM)

OpenAI Model Strategy

- GPT-4o remains the "workhorse" for standard tasks like summarization
- o1 is designed for more complex, reasoning-intensive problems
- The goal is for developers to use both models in complementary ways

- OpenAI is working on improvements to speed up reasoning processes
- The API is in early stages, with more features to be added over time
- Focus on optimizing context and processing speed
- Committed to continuous learning and iteration

o1 Model Capabilities and Access

- Function calling
- Code interpreter
- Web browsing
- Streaming capabilities
- Generates hidden chains of thought during reasoning
- Strong performance in lateral tasks and philosophical reasoning
- Can creatively solve complex problems
- Demonstrates ability to generalize and handle challenging tasks

