Overview
- AI-first interfaces are revolutionizing human-computer interaction, with Canvas exemplifying how AI can invert traditional workflows by starting with generated content that users refine rather than creating from scratch.
- The development of effective AI agents requires balancing technical capability with behavioral design - defining appropriate personalities, collaboration styles, and value systems (helpful, harmless, honest) that may sometimes conflict.
- AI product development is evolving toward a rapid iteration methodology where features are initially deployed as separate models, refined through user feedback, and eventually integrated into core models - as demonstrated by both the Canvas and Tasks projects.
- The future of AI interaction is shifting from website-based interactions toward "personal models" that deeply integrate with users' systems, adapt to individual preferences, and generate dynamic, interactive outputs beyond text.
- Effective AI evaluation remains challenging due to non-standardized testing conditions, varying output formats, and the difficulty of selecting appropriate metrics - creating significant obstacles for meaningful model comparisons.
Content
Podcast Context and Guest Introduction
- 2025 is being called the "year of agents" by industry leaders
- OpenAI has significantly reduced pricing for o1-mini (output tokens from $12 to $4.40 per million)
- Released o3-mini with competitive performance at the same price point
- ChatGPT has been shipping new features like Canvas, recurring tasks, and virtual agent responses
- Karina Nguyen is the guest, previously at Anthropic and now at OpenAI
Karina's Background and Career Path
- Started in computer vision for investigative journalism at Berkeley
- Worked with human rights centers and media organizations
- Became interested in AI through vision transformers
- Initially worked as an intern at the New York Times, focusing on product engineering and R&D prototypes
- Wanted to work in AI and applied to Anthropic, getting rejected initially
- Successfully applied when a front-end engineering role opened up
- Used Twitter to gain visibility for side projects
- Early projects included using CLIP for fashion recommendation search
- Considered starting a startup but lacked confidence
Anthropic Journey
- Joined Anthropic in August 2022, pre-ChatGPT
- First employee doing both product design and front-end engineering
- Believed in Anthropic's vision of funding safety research through product development
- Initial product work included building Claude for Slack, which ran into UX constraints of the Slack platform
Claude.ai Development
- After ChatGPT launched, was challenged to reproduce a similar interface in two weeks
- Wrote the first 50,000 lines of code for claude.ai with no code reviews
- Worked in a small deployment team of 6-7 people
- Claude 1.3 had significant hallucinations
- Leadership was uncertain about deploying the model
- Noted that Claude 1.3 was "extremely creative"
AI Development and Innovation
- The AI landscape was considered "nascent" two years ago, with few product designers thinking about AI integration
- Jason Yuan was noted as an early designer thinking about AI possibilities
- Karina was part of the post-training fine-tuning team for Claude 3
- The team was small (10-12 people)
- Involved in developing Haiku model, evaluations, and writing the model card
- Karina's early workspace concept was inspired by Tom Riddle's diary in Harry Potter, where a document communicates back interactively
Model Training Insights
- Training can produce many models, each with potential "brain damage" or unique characteristics
- Key research challenge: understanding data set interactions and potential side effects
- Debugging techniques from software engineering are valuable in model training
- Prioritizing compute resources and experiments is crucial in research management
- Rapid iteration and debugging are essential
Evaluation and Benchmarking Challenges
- GPQA was an early benchmark that revealed high variance in model evaluations
- Model performance comparisons are difficult due to non-standardized testing conditions
- There's no industry-standard evaluation harness, which complicates model comparisons
- Different models require different prompting techniques
- Choosing appropriate metrics and handling varying output formats add further difficulty
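The harness problem above can be made concrete with a minimal sketch: run the same fixed prompt template against a model several times and report mean accuracy plus run-to-run variance. The questions, template, and scoring rule are invented for illustration, and a trivial callable stands in for a real model API.

```python
# Minimal evaluation-harness sketch: standardized prompting plus repeated
# runs to expose variance. All data and the toy "model" are illustrative.
import statistics

def run_eval(generate, questions, prompt_template, n_runs=5):
    """Score a model over several runs; `generate` maps prompt -> answer."""
    scores = []
    for _ in range(n_runs):
        correct = 0
        for q in questions:
            answer = generate(prompt_template.format(question=q["question"]))
            if q["expected"].lower() in answer.lower():
                correct += 1
        scores.append(correct / len(questions))
    # Mean accuracy and spread across runs - high spread was exactly the
    # issue observed with early benchmarks like GPQA.
    return statistics.mean(scores), statistics.pstdev(scores)

questions = [{"question": "2 + 2?", "expected": "4"}]
mean, stdev = run_eval(lambda p: "The answer is 4.", questions,
                       "Answer concisely: {question}")
```

With a deterministic stand-in model the spread is zero; with a sampled model it generally is not, which is why single-run benchmark numbers can mislead.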
o1 Prompting Insights
- o1 prompting works well with hard constraints and specific criteria
- Particularly effective for specialized domains like biology and chemistry
- Helps models select candidates that best match given criteria
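The constraint-then-rank pattern described above can be sketched in plain code: filter candidates that fail any hard constraint, then rank survivors by a soft score. The molecule data, constraint, and scorer are all hypothetical, standing in for criteria a model like o1 would be prompted with in a domain such as chemistry.

```python
# Sketch of candidate selection under hard constraints, the pattern the
# discussion attributes to o1-style prompting. Data and criteria are invented.
def select_candidates(candidates, hard_constraints, scorer):
    """Drop candidates failing any hard constraint, then rank by score."""
    viable = [c for c in candidates if all(rule(c) for rule in hard_constraints)]
    return sorted(viable, key=scorer, reverse=True)

molecules = [
    {"name": "A", "toxicity": 0.1, "solubility": 0.8},
    {"name": "B", "toxicity": 0.6, "solubility": 0.9},
    {"name": "C", "toxicity": 0.2, "solubility": 0.5},
]
constraints = [lambda m: m["toxicity"] < 0.5]   # hard constraint: must hold
ranked = select_candidates(molecules, constraints,
                           lambda m: m["solubility"])  # soft preference
# ranked[0]["name"] == "A": B is excluded outright, A beats C on solubility
```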
Model Usage and Behavioral Design
- OpenAI researchers acknowledge they don't fully understand how to maximize model capabilities
- External feedback and user interactions are crucial for discovering emergent model capabilities
- Verifying model outputs can be challenging, especially for non-expert users
- Emerging field of "behavioral design" focuses on shaping AI model behavior in different contexts
- Core values include: Helpful, Harmless, Honest (HHH)
- Challenges arise when these values potentially contradict each other
- Designing model behavior is more "art than science"
Model Personality Development
- In collaborative contexts like Canvas, considerations include appropriate collaboration tone, when to ask follow-up questions, and adapting communication style
- Synthetic data and constitutional AI are used to shape model behaviors
- Companies are increasingly focusing on "tastemaker" roles to define model personalities
- Claude's personality development was initially unintentional but became more intentional with Claude 3
- Model personality is seen as a reflection of the company and its creators
- Methodology similar to character design in video games, involving defining core personality traits and principles
OpenAI Canvas Project
- The Canvas project was spontaneously formed on July 4th
- Emerged from an impromptu pitch by Karina to their manager Barret Zoph
- The project was unique in bringing together designers, engineers, product managers, and researchers from the beginning
- Thomas Dimson created the initial engineering prototype
- The team was staffed with 5-6 engineers and Karina as a researcher
- The project emphasized collaborative cross-functional development, with team members able to push back on each other's ideas
Canvas Development and Iteration
- The team developed a canvas tool within ChatGPT, initially using prompted baseline approaches
- They discovered edge cases that required post-training to address
- They retrained the entire model (GPT-4o) with canvas-specific improvements
- Motivations for retraining included ability to rapidly iterate based on user feedback, faster deployment compared to integrating a completely new model, and reduced time from beta to general availability (which took ~3 months)
- Defined specific behavioral parameters for the canvas, such as when to write comments, when to edit documents, when to make partial vs. full document rewrites, and when to trigger canvas functionality
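The behavioral parameters listed above amount to a routing decision the model makes on each turn. A hypothetical sketch (the action names and keyword heuristics are invented; the real behavior was trained in, not hand-coded):

```python
# Illustrative-only router for canvas behaviors: comment vs. partial edit
# vs. full rewrite vs. plain chat. In practice this policy is learned via
# post-training, not implemented as rules.
from enum import Enum, auto

class CanvasAction(Enum):
    COMMENT = auto()        # annotate the document without changing it
    PARTIAL_EDIT = auto()   # targeted edits to a section
    FULL_REWRITE = auto()   # replace the whole document
    CHAT_ONLY = auto()      # no canvas trigger

def route(user_request: str, has_open_document: bool) -> CanvasAction:
    text = user_request.lower()
    if not has_open_document:
        return CanvasAction.CHAT_ONLY
    if "feedback" in text or "review" in text:
        return CanvasAction.COMMENT
    if "rewrite" in text or "start over" in text:
        return CanvasAction.FULL_REWRITE
    return CanvasAction.PARTIAL_EDIT  # default: make targeted edits
```

Framing the behaviors as a small, explicit action space is what makes them testable: each (request, context) pair has an expected action that evaluation prompts can check.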
Canvas Writing Quality and Evaluation
- Writing quality was assessed and improved through a layered evaluation process:
- Conducted internal human evaluation
- Worked with model writers to develop rubrics
- Created test sets of prompts to assess model performance
- Developed specific criteria for evaluating writing quality
- Canvas improvements are now being integrated back into the core model
- Some differences still exist between canvas and API outputs
- Ongoing A/B testing to refine model performance
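The rubric-based evaluation described above can be sketched as weighted per-criterion scoring that rolls up to a single number per output. The criteria, weights, and 1-5 scale here are assumptions for illustration, not the team's actual rubric.

```python
# Hypothetical rubric aggregation: raters score each criterion on a 1-5
# scale; weights (summing to 1) combine them into one comparable number.
rubric = {"clarity": 0.4, "structure": 0.3, "voice": 0.3}

def score_output(ratings: dict) -> float:
    """Weighted average of per-criterion ratings."""
    return sum(rubric[c] * ratings[c] for c in rubric)

ratings = {"clarity": 4, "structure": 5, "voice": 3}
overall = score_output(ratings)   # 0.4*4 + 0.3*5 + 0.3*3 = 4.0
```

Aggregated scores like this are what make A/B comparisons between model variants tractable: two candidates can be ranked on the same test set of prompts.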
Canvas Usage and Perspectives
- Karina is a strong advocate for Canvas, viewing it as more than just a writing aid
- Primary use case is drafting and iterative content creation
- Describes Canvas as an "AI-first" interface, inverting traditional document editing workflows
- Key use cases include drafting various content types (conference copy, blog posts), collaborative iterative writing with AI, and potential for code execution and development
- Vision for Canvas is to dynamically adapt its interface based on user intent, potentially morphing into a specialized writing or coding environment
- Current versions of Canvas can have significant differences even with similar inputs
- The technology is still emerging, with many potential users not yet understanding its full potential
Tasks Project Development
- Developed in less than two months
- Led by Karina with a resident named Vivek
- Designed as a simple feature that becomes powerful when integrated with a general model
- Initially launched as a separate model to gather user feedback
- Product development followed a consistent operational strategy: deploy features as separate models initially, iterate and improve quickly, integrate successful features into the core model, and maintain synergy between different teams and disciplines
AI Agents and Tasks Vision
- Current AI models struggle with multi-task queries and require explicit instructions
- Ideal AI agent would learn from user behavior, proactively suggest actions, understand user patterns, and act like a "natural friend"
- B2B possibilities include organizational productivity tools, automated user feedback processing, data analysis and insights generation, and collaborative task management
- Trust-building approach mirrors human collaboration: start with simple tasks, gradually increase complexity, build reliability through consistent performance, adapt to working styles over time
- Agents seen as progressive systems evolving from single actions to collaborative interactions to potentially full task delegation in complex environments
Computer Use Capabilities for AI
- Computer use is considered a core capability for AI agents
- Enables potential tasks like booking flights, shopping, or searching
- Vision models and perception technologies have improved to support this
- Latency and accurate context understanding are crucial challenges
- Computer use agents open new research questions
- Potential applications include real-time to asynchronous collaboration, coding task delegation, and testing features in virtual environments
- Consumer applications like shopping assistance, Facebook Marketplace scanning, ticket booking
- Coding-related tasks seem most promising
- Current computer use agents are slow, expensive, and imprecise
- Accuracy remains a significant concern
Future AI Interaction Predictions
- Interest in the progression of smaller, powerful AI models (e.g., o3-mini, Claude 3 Haiku)
- Potential for reducing latency in computer use agents
- Prediction that website interactions will decrease as model-mediated internet access increases
- AI will need deeper integration with personal systems (calendars, emails)
- Task completion becoming more personalized and adaptive
- Generative UI that dynamically adjusts to user preferences
- AI outputs potentially becoming interactive (e.g., React apps, charts) instead of just text
- Expense reporting is seen as a benchmark for AI task complexity and a tractable problem for AI to solve in the near future
- Shift from "personal computer" to "personal model" concept
Organizational Insights and Career Reflections
- Anthropic grew from ~60-70 to ~700 people during Karina's tenure
- Compared the product mindsets of OpenAI and Anthropic
- Learned from Mira about interdisciplinary thinking, balancing product and research, and adopting a systems perspective
- Valued creative freedom and adaptability in research environments
- Emphasized importance of being open to changing research directions
- Currently hiring research engineers who are skilled in model training, interested in product deployment, and product-minded
- Encourages people to experiment with AI models, explore creative potential in interface and product design, and rethink existing software and internet paradigms through AI lens