Latent Space: The AI Engineer Podcast

Efficiency is Coming: 3000x Faster, Cheaper, Better AI Inference from Hardware Improvements, Quantization, and Synthetic Data Distillation

Introduction and Background

Career Journey

- Retail (Amazon Go, retail tech)
- Edge AI (robotics, manufacturing)
- Autonomous vehicles

AI Efficiency Trends

- GPT-3 level intelligence: from $60 to $0.27 per million tokens (2020-2023)
- GPT-4 level intelligence: from over $30 to under $3 per million tokens
- OpenAI's GPT-4o mini is 3.5% of the price of GPT-4o
- Efficient GPUs
- Quantization
- Pruning
- Model distillation
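
The price drops listed above can be turned into reduction factors with a few lines of arithmetic. This is a minimal sketch: the prices are the ones quoted in the list, and the helper name `reduction_factor` is an illustrative assumption, not something from the episode.

```python
def reduction_factor(old_price: float, new_price: float) -> float:
    """Return how many times cheaper new_price is than old_price."""
    if old_price <= 0 or new_price <= 0:
        raise ValueError("prices must be positive")
    return old_price / new_price


# Prices in USD per million tokens, taken from the list above.
gpt3_factor = reduction_factor(60.0, 0.27)  # GPT-3 level, 2020 to 2023
gpt4_factor = reduction_factor(30.0, 3.0)   # GPT-4 level, using the quoted bounds

# "3.5% of the price" expressed as a factor (GPT-4o price normalized to 1.0).
mini_factor = reduction_factor(1.0, 0.035)

print(f"GPT-3 level: about {gpt3_factor:.0f}x cheaper")
print(f"GPT-4 level: at least {gpt4_factor:.0f}x cheaper")
print(f"GPT-4o mini vs GPT-4o: about {mini_factor:.1f}x cheaper")
```

Because the GPT-4 level figures are quoted as "over $30" and "under $3", the factor of 10 is a lower bound rather than an exact value.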

Inference Optimization Insights

- Optimized a ResNet-50 computer vision model for image search
- Used TensorRT to improve performance
- Increased throughput from 1 image to 4 images in 7 milliseconds
- Achieved around 571 images per second on a V100 GPU in 2018
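
The 571 images per second figure follows from the batch size and latency quoted above. The sketch below assumes that both the 1-image and 4-image runs took the same 7 ms window, which is one reading of the summary; `throughput_per_second` is a hypothetical helper name.

```python
def throughput_per_second(batch_size: int, latency_ms: float) -> float:
    """Images processed per second for one batch of batch_size in latency_ms."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return batch_size * 1000.0 / latency_ms


before = throughput_per_second(1, 7.0)  # one image per 7 ms window
after = throughput_per_second(4, 7.0)   # four images in the same 7 ms window

print(f"before: {before:.0f} images/s")
print(f"after:  {after:.0f} images/s")
print(f"speedup: {after / before:.1f}x")
```

Under this assumption the gain comes from fitting more images into the same window, so throughput improves without per-batch latency getting worse.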

Hardware Evolution and Strategy

- Difficult to forecast hardware changes beyond two years
- Optimization depends on specific use cases
- Hardware that seemed powerful (like the V100) can quickly become obsolete
- Modern equivalent tasks might now run on much cheaper devices like Jetson

Performance Optimization Techniques

- Improvements depend on specific use case and constraints
- Latency and throughput are key metrics with different optimization approaches
- Trade-offs exist between efficiency and accuracy/precision
- Strategies vary based on end application (e.g., manufacturing vs. general use)
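
The efficiency versus precision trade-off can be illustrated with quantization, one of the techniques named earlier. The following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. The scheme, the function names, and the sample weights are all assumptions chosen for illustration; the episode summary does not specify a particular method.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, not taken from any real model.
weights = [0.82, -1.27, 0.003, 0.41, -0.66]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale)
print("max absolute error:", max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale. Here the smallest weight is rounded to zero, which is the kind of precision loss that has to be weighed against the memory and speed gains.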

Model Quantization Insights

Synthetic Data Development

- Amazon: Replacing tape detection in 3D
- Robotics: Object pose detection without physical tags

3D Content Creation and AI Models

- Text-to-texture generation
- Text-to-material generation
- Image-to-3D conversion
- Fully generated 3D environments
- Conversational 3D characters
- Procedurally generated worlds tailored to individual user interests

Training and Model Development

- Knowledge distillation
- Preference distillation (transferring RLHF capabilities)
- Reasoning distillation
- Benchmark performance distillation
- GitHub Copilot uses a smaller, distilled model compared to GPT-4
- Uncertainty remains about fully replicating large model performance through distillation
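
To make the first item concrete, knowledge distillation is commonly set up as training a small student model to match the softened output distribution of a large teacher. The sketch below computes that soft-target loss (KL divergence at temperature T, scaled by T squared) in plain Python. It is an assumption about one standard formulation, not the method discussed in the episode, and the logits are made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.2]
close_student = [3.8, 1.1, 0.3]
far_student = [0.2, 1.0, 4.0]

close_loss = distillation_loss(teacher, close_student)
far_loss = distillation_loss(teacher, far_student)

print("loss when student agrees with teacher:", close_loss)
print("loss when student disagrees:", far_loss)
```

A student whose outputs track the teacher gets a loss near zero, while one that ranks the classes differently is penalized heavily. In practice this term is usually combined with the ordinary loss on ground-truth labels.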

Benchmarks and Data Quality

AGI Perspectives

Convai and Conversational 3D AI Characters

- Large language models with retrieval augmented generation
- Text-to-speech and automatic speech recognition
- Integration with avatar creation platforms (Reallusion, MetaHuman)
- Facial and action animations
- Multimodal perception for NPCs (non-player characters)
- Gaming: Interactive NPCs with complex social mechanics
- Brand representation: Digital brand agents/ambassadors with personalized interactions
- Potential applications in medical assistance and customer support
- Need for comprehensive "full stack" AI agent development
- Ongoing work on facial animations, gesture animations, and visual perception

Technology and Interaction Advancements

- Facial gestures
- Eye tracking
- Emotional responsiveness
- Conversational adaptability
- Dynamic, non-scripted conversations
- Ability to interact with scene objects
- Context-aware interactions
- Personality-driven dialogue

NPCs Beyond Gaming

- Simulating conversations between different characters/roles
- Staff training scenarios (e.g., medical training with different patient personalities)
- Testing interactions between simulated agents

Future Possibilities

- Talking to historical figures like Einstein
- Interacting with favorite science fiction characters
- Accessing "on-demand" versions of experts

Contact Information

- AI characters
- 3D characters
- Synthetic data
- LinkedIn
- Email: naila@convey.com (work)
- Email: naila.worker@gmail.com (personal)
