Overview
- AI-first interfaces are revolutionizing human-computer interaction, with Canvas exemplifying how AI can invert traditional workflows by starting with generated content that users refine rather than creating from scratch.
- The development of effective AI agents requires balancing technical capability with behavioral design - defining appropriate personalities, collaboration styles, and value systems (helpful, harmless, honest) that may sometimes conflict.
- AI product development is evolving toward a rapid iteration methodology where features are initially deployed as separate models, refined through user feedback, and eventually integrated into core models - as demonstrated by both the Canvas and Tasks projects.
- The future of AI interaction is shifting from website-based interactions toward "personal models" that deeply integrate with users' systems, adapt to individual preferences, and generate dynamic, interactive outputs beyond text.
- Effective AI evaluation remains challenging due to non-standardized testing conditions, varying output formats, and the difficulty of selecting appropriate metrics - creating significant obstacles for meaningful model comparisons.
Content
Podcast Context and Guest Introduction
- 2025 is being called the "year of agents" by industry leaders
- OpenAI has significantly reduced pricing for o1-mini (output tokens from $12 to $4.40 per million)
- Released o3-mini with competitive performance at the same price point
- ChatGPT has been shipping new features like Canvas, recurring tasks, and virtual agent responses
- Karina Nguyen is the guest, previously at Anthropic and now at OpenAI
Karina's Background and Career Path
- Started in computer vision for investigative journalism at Berkeley
- Worked with human rights centers and media organizations
- Became interested in AI through vision transformers
- Initially worked as an intern at the New York Times, focusing on product engineering and R&D prototypes
- Wanted to work in AI and applied to Anthropic, getting rejected initially
- Successfully applied when a front-end engineering role opened up
- Used Twitter to gain visibility for side projects
- Early projects included using CLIP for fashion recommendation search
- Considered starting a startup but lacked confidence
Anthropic Journey
- Joined Anthropic in August 2022, pre-ChatGPT
- First employee doing both product design and front-end engineering
- Believed in Anthropic's vision of funding safety research through product development
- Initial product work included building Claude for Slack, which ran into UX constraints of the Slack platform
Claude.ai Development
- After ChatGPT launched, was challenged to reproduce a similar interface in two weeks
- Wrote the first 50,000 lines of code for claude.ai with no code reviews
- Worked in a small deployment team of 6-7 people
- Claude 1.3 had significant hallucinations
- Leadership was uncertain about deploying the model
- Noted that Claude 1.3 was "extremely creative"
AI Development and Innovation
- The AI landscape was considered "nascent" two years ago, with few product designers thinking about AI integration
- Jason Yuan was noted as an early designer thinking about AI possibilities
- Karina was part of the post-training fine-tuning team for Claude 3
- The team was small (10-12 people)
- Involved in developing Haiku model, evaluations, and writing the model card
- Karina's early workspace concept was inspired by Tom Riddle's diary in Harry Potter, where a document communicates back interactively
Model Training Insights
- Training can produce many models, each with potential "brain damage" or unique characteristics
- Key research challenge: understanding data set interactions and potential side effects
- Debugging techniques from software engineering are valuable in model training
- Prioritizing compute resources and experiments is crucial in research management
- Rapid iteration and debugging are essential
Evaluation and Benchmarking Challenges
- GPQA was an early benchmark that revealed high variance in model evaluations
- Model performance comparisons are difficult due to non-standardized testing conditions
- There's no industry-standard evaluation harness, which complicates model comparisons
- Different models require different prompting techniques
- Choosing appropriate metrics and handling varying output formats add further difficulty
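The harness problem above can be made concrete with a minimal sketch: run the same fixed prompt template against a model several times and report mean accuracy plus run-to-run variance. The questions, template, and scoring rule are invented for illustration, and a trivial callable stands in for a real model API.

```python
# Minimal evaluation-harness sketch: standardized prompting plus repeated
# runs to expose variance. All data and the toy "model" are illustrative.
import statistics

def run_eval(generate, questions, prompt_template, n_runs=5):
    """Score a model over several runs; `generate` maps prompt -> answer."""
    scores = []
    for _ in range(n_runs):
        correct = 0
        for q in questions:
            answer = generate(prompt_template.format(question=q["question"]))
            if q["expected"].lower() in answer.lower():
                correct += 1
        scores.append(correct / len(questions))
    # Mean accuracy and spread across runs - high spread was exactly the
    # issue observed with early benchmarks like GPQA.
    return statistics.mean(scores), statistics.pstdev(scores)

questions = [{"question": "2 + 2?", "expected": "4"}]
mean, stdev = run_eval(lambda p: "The answer is 4.", questions,
                       "Answer concisely: {question}")
```

With a deterministic stand-in model the spread is zero; with a sampled model it generally is not, which is why single-run benchmark numbers can mislead.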
o1 Prompting Insights
- o1 prompting works well with hard constraints and specific criteria
- Particularly effective for specialized domains like biology and chemistry
- Helps models select candidates that best match given criteria
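The constraint-then-rank pattern described above can be sketched in plain code: filter candidates that fail any hard constraint, then rank survivors by a soft score. The molecule data, constraint, and scorer are all hypothetical, standing in for criteria a model like o1 would be prompted with in a domain such as chemistry.

```python
# Sketch of candidate selection under hard constraints, the pattern the
# discussion attributes to o1-style prompting. Data and criteria are invented.
def select_candidates(candidates, hard_constraints, scorer):
    """Drop candidates failing any hard constraint, then rank by score."""
    viable = [c for c in candidates if all(rule(c) for rule in hard_constraints)]
    return sorted(viable, key=scorer, reverse=True)

molecules = [
    {"name": "A", "toxicity": 0.1, "solubility": 0.8},
    {"name": "B", "toxicity": 0.6, "solubility": 0.9},
    {"name": "C", "toxicity": 0.2, "solubility": 0.5},
]
constraints = [lambda m: m["toxicity"] < 0.5]   # hard constraint: must hold
ranked = select_candidates(molecules, constraints,
                           lambda m: m["solubility"])  # soft preference
# ranked[0]["name"] == "A": B is excluded outright, A beats C on solubility
```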
Model Usage and Behavioral Design
- OpenAI researchers acknowledge they don't fully understand how to maximize model capabilities
- External feedback and user interactions are crucial for discovering emergent model capabilities
- Verifying model outputs can be challenging, especially for non-expert users
- Emerging field of "behavioral design" focuses on shaping AI model behavior in different contexts
- Core values include: Helpful, Harmless, Honest (HHH)
- Challenges arise when these values potentially contradict each other
- Designing model behavior is more "art than science"
Model Personality Development
- In collaborative contexts like Canvas, considerations include appropriate collaboration tone, when to ask follow-up questions, and adapting communication style
- Synthetic data and constitutional AI are used to shape model behaviors
- Companies are increasingly focusing on "tastemaker" roles to define model personalities
- Claude's personality development was initially unintentional but became more intentional with Claude 3
- Model personality is seen as a reflection of the company and its creators
- Methodology similar to character design in video games, involving defining core personality traits and principles
OpenAI Canvas Project
- The Canvas project was spontaneously formed on July 4th
- Emerged from an impromptu pitch by Karina to their manager Barret Zoph
- The project was unique in bringing together designers, engineers, product managers, and researchers from the beginning
- Thomas Dimson created the initial engineering prototype
- The team was staffed with 5-6 engineers and Karina as a researcher
- The project emphasized collaborative cross-functional development, with team members able to push back on each other's ideas
Canvas Development and Iteration
- The team developed a canvas tool within ChatGPT, initially using prompted baseline approaches
- They discovered edge cases that required post-training to address
- They retrained the entire model (GPT-4o) with canvas-specific improvements
- Motivations for retraining included ability to rapidly iterate based on user feedback, faster deployment compared to integrating a completely new model, and reduced time from beta to general availability (which took ~3 months)
- Defined specific behavioral parameters for the canvas, such as when to write comments, when to edit documents, when to make partial vs. full document rewrites, and when to trigger canvas functionality
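The behavioral parameters listed above amount to a routing decision the model makes on each turn. A hypothetical sketch (the action names and keyword heuristics are invented; the real behavior was trained in, not hand-coded):

```python
# Illustrative-only router for canvas behaviors: comment vs. partial edit
# vs. full rewrite vs. plain chat. In practice this policy is learned via
# post-training, not implemented as rules.
from enum import Enum, auto

class CanvasAction(Enum):
    COMMENT = auto()        # annotate the document without changing it
    PARTIAL_EDIT = auto()   # targeted edits to a section
    FULL_REWRITE = auto()   # replace the whole document
    CHAT_ONLY = auto()      # no canvas trigger

def route(user_request: str, has_open_document: bool) -> CanvasAction:
    text = user_request.lower()
    if not has_open_document:
        return CanvasAction.CHAT_ONLY
    if "feedback" in text or "review" in text:
        return CanvasAction.COMMENT
    if "rewrite" in text or "start over" in text:
        return CanvasAction.FULL_REWRITE
    return CanvasAction.PARTIAL_EDIT  # default: make targeted edits
```

Framing the behaviors as a small, explicit action space is what makes them testable: each (request, context) pair has an expected action that evaluation prompts can check.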
Canvas Writing Quality and Evaluation
- Writing quality was assessed and improved through a layered evaluation process:
- Conducted internal human evaluation
- Worked with model writers to develop rubrics
- Created test sets of prompts to assess model performance
- Developed specific criteria for evaluating writing quality
- Canvas improvements are now being integrated back into the core model
- Some differences still exist between canvas and API outputs
- Ongoing A/B testing to refine model performance
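The rubric-based evaluation described above can be sketched as weighted per-criterion scoring that rolls up to a single number per output. The criteria, weights, and 1-5 scale here are assumptions for illustration, not the team's actual rubric.

```python
# Hypothetical rubric aggregation: raters score each criterion on a 1-5
# scale; weights (summing to 1) combine them into one comparable number.
rubric = {"clarity": 0.4, "structure": 0.3, "voice": 0.3}

def score_output(ratings: dict) -> float:
    """Weighted average of per-criterion ratings."""
    return sum(rubric[c] * ratings[c] for c in rubric)

ratings = {"clarity": 4, "structure": 5, "voice": 3}
overall = score_output(ratings)   # 0.4*4 + 0.3*5 + 0.3*3 = 4.0
```

Aggregated scores like this are what make A/B comparisons between model variants tractable: two candidates can be ranked on the same test set of prompts.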
Canvas Usage and Perspectives
- Karina is a strong advocate for Canvas, viewing it as more than just a writing aid
- Primary use case is drafting and iterative content creation
- Describes Canvas as an "AI-first" interface, inverting traditional document editing workflows
- Key use cases include drafting various content types (conference copy, blog posts), collaborative iterative writing with AI, and potential for code execution and development
- Vision for Canvas is to dynamically adapt its interface based on user intent, potentially morphing into a specialized writing or coding environment
- Current versions of Canvas can have significant differences even with similar inputs
- The technology is still emerging, with many potential users not yet understanding its full potential
Tasks Project Development
- Developed in less than two months
- Led by Karina with a resident named Vivek
- Designed as a simple feature that becomes powerful when integrated with a general model
- Initially launched as a separate model to gather user feedback
- Product development followed a consistent operational strategy: deploy features as separate models initially, iterate and improve quickly, integrate successful features into the core model, and maintain synergy between different teams and disciplines
AI Agents and Tasks Vision
- Current AI models struggle with multi-task queries and require explicit instructions
- Ideal AI agent would learn from user behavior, proactively suggest actions, understand user patterns, and act like a "natural friend"
- B2B possibilities include organizational productivity tools, automated user feedback processing, data analysis and insights generation, and collaborative task management
- Trust-building approach mirrors human collaboration: start with simple tasks, gradually increase complexity, build reliability through consistent performance, adapt to working styles over time
- Agents seen as progressive systems evolving from single actions to collaborative interactions to potentially full task delegation in complex environments
Computer Use Capabilities for AI
- Computer use is considered a core capability for AI agents
- Enables potential tasks like booking flights, shopping, or searching
- Vision models and perception technologies have improved to support this
- Latency and accurate context understanding are crucial challenges
- Computer use agents open new research questions
- Potential applications include real-time to asynchronous collaboration, coding task delegation, and testing features in virtual environments
- Consumer applications like shopping assistance, Facebook Marketplace scanning, ticket booking
- Coding-related tasks seem most promising
- Current computer use agents are slow, expensive, and imprecise
- Accuracy remains a significant concern
Future AI Interaction Predictions
- Interest in the progression of smaller, powerful AI models (e.g., o3-mini, Claude 3 Haiku)
- Potential for reducing latency in computer use agents
- Prediction that website interactions will decrease as model-mediated internet access increases
- AI will need deeper integration with personal systems (calendars, emails)
- Task completion becoming more personalized and adaptive
- Generative UI that dynamically adjusts to user preferences
- AI outputs potentially becoming interactive (e.g., React apps, charts) instead of just text
- Expense reporting is seen as a benchmark for AI task complexity and a tractable problem for AI to solve in the near future
- Shift from "personal computer" to "personal model" concept
Organizational Insights and Career Reflections
- Anthropic grew from ~60-70 to ~700 people during Karina's tenure
- Compared the product mindsets of OpenAI and Anthropic
- Learned from Mira about interdisciplinary thinking, balancing product and research, and adopting a systems perspective
- Valued creative freedom and adaptability in research environments
- Emphasized importance of being open to changing research directions
- Currently hiring research engineers who are skilled in model training, interested in product deployment, and product-minded
- Encourages people to experiment with AI models, explore creative potential in interface and product design, and rethink existing software and internet paradigms through AI lens