Overview

The AI industry has seen dramatic efficiency improvements in both cost and performance, with GPT-3 level intelligence dropping from $60 to $0.27 per million tokens (2020-2023) through advances in hardware, quantization, pruning, and model distillation.

Optimization strategies must balance latency vs. throughput considerations, with techniques like batch processing, quantization, and pruning yielding different benefits depending on specific use cases and constraints.

The development of synthetic data generation has become crucial as many AI applications hit "data walls," with game engines providing valuable physics-informed training environments that may become increasingly important for LLMs.

Current 3D AI character development is advancing toward fully interactive experiences with conversational NPCs featuring emotional responsiveness, contextual awareness, and multimodal perception capabilities for applications beyond gaming.

The future of AI may involve more efficient training approaches that prioritize data quality over quantity, with model distillation emerging as a promising technique to transfer capabilities from larger to smaller models while maintaining performance.

Content

Introduction and Background

Podcast is Latent Space, hosted by Alessio and swyx
Guest is Nyla Worker, currently at Google AI, previously at NVIDIA and Convai
Background spans astrophysics research, machine learning, and AI efficiency

Career Journey

Started in astrophysics manually categorizing astronomical images
Discovered machine learning's potential through a 1996 paper using neural networks for astronomical classification
Transitioned from manual image classification to exploring machine learning's broader applications
Moved from CPU to GPU-based research, working on computer vision and edge devices
Career focused on AI model training and inference optimization, particularly at eBay
Later joined NVIDIA's solutions architect program, supporting various AI customers across sectors:

- Retail (Amazon Go, retail tech)
- Edge AI (robotics, manufacturing)
- Autonomous vehicles

Recently worked at NVIDIA on 3D content creation acceleration
Currently works at Convai, focusing on embodied conversational 3D characters

AI Efficiency Trends

Dramatic price drops in AI intelligence:

- GPT-3 level intelligence: $60 to $0.27 per million tokens (2020-2023)
- GPT-4 level intelligence: from over $30 to under $3 per million tokens
- OpenAI's GPT-4o mini is 3.5% the price of GPT-4o
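As a quick check on the figures above, the price drops can be expressed as fold reductions. This is a minimal sketch; the helper name `fold_reduction` is made up for illustration, and the GPT-4 result is only a lower bound because the notes give "over $30" and "under $3" rather than exact prices.

```python
def fold_reduction(old_price: float, new_price: float) -> float:
    """Return how many times cheaper new_price is than old_price."""
    return old_price / new_price


# Prices in USD per million tokens, taken from the list above.
gpt3_fold = fold_reduction(60.0, 0.27)

# "Over $30" to "under $3" means the true reduction is at least this large.
gpt4_fold_lower_bound = fold_reduction(30.0, 3.0)

# A model priced at 3.5% of another is 1 / 0.035 times cheaper.
mini_fold = 1 / 0.035

print(f"GPT-3 level: about {gpt3_fold:.0f}x cheaper")
print(f"GPT-4 level: at least {gpt4_fold_lower_bound:.0f}x cheaper")
print(f"GPT-4o mini vs GPT-4o: about {mini_fold:.1f}x cheaper")
```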

Parallels with computer vision, which saw ~3,000x throughput improvement in six years
Efficiency improvements driven by:

- Efficient GPUs
- Quantization
- Pruning
- Model distillation

Inference Optimization Insights

Inference is critical and should be a primary focus
It's not just about speed, but meeting human-perceived latency requirements
Optimization involves techniques like kernel fusion and model quantization
Specific example from eBay:

- Optimized a ResNet-50 computer vision model for image search
- Used TensorRT to improve performance
- Increased throughput from 1 image to 4 images processed in 7 milliseconds
- Achieved around 571 images per second on a V100 GPU in 2018
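The 571 images per second figure follows directly from the batch size and latency quoted above. A minimal sketch of that arithmetic, assuming the unoptimized single-image case also took roughly 7 milliseconds (the notes do not state this explicitly); the function name is illustrative only.

```python
def throughput_per_second(batch_size: int, latency_ms: float) -> float:
    """Images processed per second when one batch completes every latency_ms."""
    return batch_size * 1000.0 / latency_ms


# Assumption: both cases complete one batch in about 7 ms.
before = throughput_per_second(batch_size=1, latency_ms=7.0)
after = throughput_per_second(batch_size=4, latency_ms=7.0)

print(f"before: {before:.0f} images/s")
print(f"after:  {after:.0f} images/s")
print(f"speedup: {after / before:.0f}x at the same latency")
```

Holding latency fixed while fitting more images into each batch is what lets throughput rise without making any single request feel slower.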

Hardware Evolution and Strategy

Dramatic performance improvements over time
V100 (130 teraflops) compared to newer GB200 (20,000 teraflops)
Hardware capabilities have significantly expanded
Optimization strategy considerations:

- Difficult to forecast hardware changes beyond two years
- Optimization depends on specific use cases
- Hardware that seemed powerful (like V100) can quickly become obsolete
- Modern equivalent tasks might now run on much cheaper devices like Jetson

Observed significant learning gaps between hardware, platform, and AI research teams

Performance Optimization Techniques

Batch size increases can lead to significant efficiency gains
Dynamic/continuous batching provides performance improvements
Quantization techniques evolved over time (FP16, bfloat16, quantization-aware training)
Pruning networks was effective in computer vision, less so currently for LLMs
Performance optimization considerations:

- Improvements depend on specific use case and constraints
- Latency and throughput are key metrics with different optimization approaches
- Trade-offs exist between efficiency and accuracy/precision
- Strategies vary based on end application (e.g., manufacturing vs. general use)
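The latency versus throughput trade-off in batching can be illustrated with a toy dynamic batcher. This is a simplified sketch, not how any particular serving framework implements it: the function `form_batches` and its parameters are hypothetical, it only groups request arrival times, and it does not model continuous (iteration-level) batching.

```python
from typing import List


def form_batches(
    arrivals_ms: List[float], max_batch_size: int, max_wait_ms: float
) -> List[List[float]]:
    """Group sorted arrival times into batches.

    A batch is dispatched when it is full, or when the next request arrives
    after the oldest queued request has already waited longer than max_wait_ms.
    """
    batches: List[List[float]] = []
    current: List[float] = []
    for t in arrivals_ms:
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and (t - current[0] > max_wait_ms)
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# Hypothetical request arrival times in milliseconds.
arrivals = [0, 1, 2, 3, 20, 21, 50]
batches = form_batches(arrivals, max_batch_size=4, max_wait_ms=5)
print(batches)
```

Raising `max_wait_ms` or `max_batch_size` tends to produce larger batches (better throughput per GPU pass) at the cost of making early requests wait longer, which is why the right setting depends on the application.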

Model Quantization Insights

Quantization reduces precision by storing information in fewer bits
Vision models may preserve principal feature components more robustly
Language models might be more sensitive to precision loss due to complex word interactions
Smaller models potentially more impacted by quantization than larger models
Discussion of extreme quantization techniques, including ternary models and 1.58 bit models
Hypothesis that for large models, directional information (yes/no) might matter more than precise numerical weights
Analogy drawn to physics constants, where directionality is key
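To make the idea of storing weights in fewer bits concrete, here is a minimal pure-Python sketch of two schemes: symmetric int8 quantization and ternary quantization. The function names and example weights are illustrative assumptions, and the absolute-mean scale used for the ternary case is one common choice rather than something stated in the episode. A ternary weight has three possible values, so it carries log2(3), about 1.58, bits of information, which is where the "1.58 bit" name comes from.

```python
import math
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def quantize_ternary(weights: List[float]) -> Tuple[List[int], float]:
    """Ternary quantization: each weight becomes -1, 0, or +1 times a shared scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # absolute-mean scale (assumed)
    if scale == 0:
        scale = 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map quantized integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.8, -0.05, 0.3, -0.6, 0.02]  # made-up example weights
q8, s8 = quantize_int8(weights)
q3, s3 = quantize_ternary(weights)
bits_per_ternary_weight = math.log2(3)

print("int8:   ", q8)
print("ternary:", q3)
print(f"bits per ternary weight: {bits_per_ternary_weight:.2f}")
```

In the ternary output, every nonzero value keeps the sign of the original weight while the magnitude is discarded, which matches the hypothesis above that direction may matter more than precise values in large models.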

Synthetic Data Development

Identified critical data challenges across industries
Developed synthetic data solutions for specific use cases:

- Amazon: Replacing tape detection in 3D
- Robotics: Object pose detection without physical tags

Collaborated with researchers like Jonathan Tremblay
Coined the concept of "hitting a data wall" in AI development
Predicts similar data limitations will emerge for Large Language Models (LLMs)
Generating synthetic data requires specialized skills, considered an "art"
In 3D environments, synthetic data generation is still relatively limited
Game engines valuable for creating temporally coherent, physics-informed synthetic data

3D Content Creation and AI Models

Recent AI models are augmenting 3D content creation processes, including:

- Text-to-texture generation
- Text-to-material generation
- Image-to-3D conversion

Current 3D generation technologies are still imperfect, often producing flawed outputs
Ongoing research focuses on improving asset topology and generation quality
Anticipates convergence of video and 3D generation technologies
Envisions future interactive experiences with:

- Fully generated 3D environments
- Conversational 3D characters
- Procedurally generated worlds tailored to individual user interests

Training and Model Development

Current large language model training is relatively "brute force" with massive data ingestion
Recognition that not all training data is equally valuable for specific use cases
Computational efficiency has been a key driver in model architecture choices
Potential for more efficient training by identifying truly valuable data
Model distillation as a promising approach to reduce computational requirements
Different models/approaches for different tasks (e.g., Databricks assistant using model collage)
Multiple types of distillation emerging:

- Knowledge distillation
- Preference distillation (transferring RLHF capabilities)
- Reasoning distillation
- Benchmark performance distillation

Specific examples:

- GitHub Copilot uses a smaller, distilled model compared to GPT-4
- Uncertainty remains about fully replicating large model performance through distillation
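Knowledge distillation, the first type listed above, is commonly implemented by training the student to match the teacher's temperature-softened output distribution. The sketch below shows that loss in pure Python; it is a generic textbook formulation, not a description of how any product mentioned here was trained, and the logits are made-up values.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float], student_logits: List[float], temperature: float = 2.0
) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]        # hypothetical teacher logits for one input
close_student = [3.5, 1.2, 0.4]  # student that mostly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # student that prefers a different class

print(f"close student loss: {distillation_loss(teacher, close_student):.4f}")
print(f"far student loss:   {distillation_loss(teacher, far_student):.4f}")
```

A higher temperature spreads probability across more classes, so the student also learns from the teacher's relative ranking of wrong answers rather than only its top choice.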

Benchmarks and Data Quality

Concerns about benchmark gaming in AI, particularly in computer vision
Researchers sometimes submit papers with checkpoints that are not reproducible
Close benchmark numbers often considered unreliable due to potential manipulation
FineWeb dataset from HuggingFace demonstrates potential for improving data quality using LLMs
Initial results suggest training with fewer, higher-quality tokens can achieve similar or better model performance
Draws parallel to education: quality of information matters more than quantity
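The "fewer, higher-quality tokens" idea can be sketched as a simple filter over scored documents. Everything below is a hypothetical illustration: the `score` field stands in for a quality rating that an LLM or a classifier would assign, the threshold is arbitrary, and whitespace splitting is only a crude proxy for token counts.

```python
from typing import Dict, List


def filter_by_quality(docs: List[Dict], min_score: int) -> List[Dict]:
    """Keep only documents whose quality score meets the threshold."""
    return [d for d in docs if d["score"] >= min_score]


def count_tokens(docs: List[Dict]) -> int:
    """Approximate token count using whitespace-separated words."""
    return sum(len(d["text"].split()) for d in docs)


# Made-up corpus; scores are placeholders for model-assigned quality ratings.
corpus = [
    {"text": "Photosynthesis converts light energy into chemical energy in plants.", "score": 4},
    {"text": "click here click here best deals buy now", "score": 0},
    {"text": "A binary search halves the search interval at every step.", "score": 5},
    {"text": "lol ok", "score": 1},
]

kept = filter_by_quality(corpus, min_score=3)
retained = count_tokens(kept) / count_tokens(corpus)

print(f"kept {len(kept)} of {len(corpus)} documents")
print(f"retained {retained:.0%} of tokens")
```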

AGI Perspectives

AGI challenges include optimizing for "everything" without a specific problem domain
Feedback loops are crucial for AI development (e.g., coding environments provide clear feedback)
Robotics and reinforcement learning show promise, but LLMs are still approximating available knowledge
Views text (especially structured sources like textbooks) as inherently labeled data
Sees current LLM approach as a good approximation of human intelligence, but not necessarily achieving true AGI
Defines potential AGI as self-improving and significantly surpassing human capabilities
Skeptical about current LLM approaches achieving true AGI

Convai and Conversational 3D AI Characters

Speaker discusses their work at Convai, creating conversational 3D AI characters
Key technical capabilities:

- Large language models with retrieval augmented generation
- Text-to-speech and automatic speech recognition
- Integration with avatar creation platforms (Reallusion, MetaHuman)
- Facial and action animations
- Multimodal perception for NPCs (non-player characters)
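The capabilities listed above chain together into a single turn of conversation: speech recognition, retrieval, language model generation, speech synthesis, and animation. The sketch below wires those stages together with stub functions; every function here is a hypothetical stand-in written for illustration, and none of it reflects Convai's actual API or architecture.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CharacterReply:
    text: str
    audio: bytes
    animation: str


def transcribe(audio: bytes) -> str:
    """Stand-in for automatic speech recognition."""
    return audio.decode("utf-8")


def retrieve(query: str, knowledge: List[str], top_k: int = 1) -> List[str]:
    """Toy retrieval: rank knowledge snippets by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        knowledge,
        key=lambda s: len(q_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def generate(query: str, context: List[str]) -> str:
    """Stand-in for a language model call that uses the retrieved context."""
    grounding = context[0] if context else "no context found"
    return f"(answer grounded in: {grounding})"


def synthesize(text: str) -> bytes:
    """Stand-in for text-to-speech."""
    return text.encode("utf-8")


def pick_animation(text: str) -> str:
    """Stand-in for choosing a facial or body animation from the reply."""
    return "smile" if "!" in text else "idle_talk"


def respond(audio: bytes, knowledge: List[str]) -> CharacterReply:
    """Run one conversational turn through all stages in order."""
    query = transcribe(audio)
    context = retrieve(query, knowledge)
    text = generate(query, context)
    return CharacterReply(text, synthesize(text), pick_animation(text))


knowledge = ["The tavern opens at dusk.", "The blacksmith sells iron swords."]
reply = respond(b"when does the tavern open", knowledge)
print(reply.text, reply.animation)
```

Because the stages run in sequence, their latencies add up, which is one reason end-to-end responsiveness is hard to achieve for characters like these.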

Use cases:

- Gaming: Interactive NPCs with complex social mechanics
- Brand representation: Digital brand agents/ambassadors with personalized interactions
- Potential applications in medical assistance and customer support

Technological gaps/challenges:

- Need for comprehensive "full stack" AI agent development
- Ongoing work on facial animations, gesture animations, and visual perception

Technology and Interaction Advancements

Emerging technologies creating more realistic AI character interactions with improved:

- Facial gestures
- Eye tracking
- Emotional responsiveness
- Conversational adaptability

Observed holographic displays at Computex with screens embedded in transparent glass
Latency identified as the most critical optimization factor for natural interactions
Goal is to create AI interactions that feel seamless and responsive
Emotional tone detection and appropriate character reactions are crucial
NVIDIA/Convai demo showcased AI characters with:

- Dynamic, non-scripted conversations
- Ability to interact with scene objects
- Context-aware interactions
- Personality-driven dialogue

NPCs Beyond Gaming

Discussion on using NPCs beyond video games for simulation and training
Potential enterprise use cases include:

- Simulating conversations between different characters/roles
- Staff training scenarios (e.g., medical training with different patient personalities)
- Testing interactions between simulated agents

Gaming industry noted as somewhat conservative about adopting new mechanics
Indie developers more experimental with AI-driven game experiences
Speaker created an entirely AI-generated podcast as an early experiment
While the video game market is limited, commercial applications for AI NPCs seem promising

Future Possibilities

Potential for AI to expand and repurpose intellectual property across different media formats
Excitement about AI's ability to extend the lifespan of existing games through modding and AI characters
Anticipation of legal challenges surrounding AI and IP in the coming years
Creating interactive experiences with virtual characters, including:

- Talking to historical figures like Einstein
- Interacting with favorite science fiction characters
- Accessing "on-demand" versions of experts

Potential for "sanctioned" AI models approved by the original person/entity
Challenges of accurately representing historical figures and their contexts

Contact Information

Nyla is open to connections from people interested in:

- AI characters
- 3D characters
- Synthetic data

Contact methods:

- LinkedIn
- Email: naila@convey.com (work)
- Email: naila.worker@gmail.com (personal)

Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation