AI Engineer Summit NYC today. Last call!">

Latent Space: The AI Engineer Podcast

[Ride Home] Simon Willison: Things we learned about LLMs in 2024

Overview

* AI model efficiency has dramatically improved, with costs dropping 100x since 2022-2023 and previously high-end models now running on personal laptops, challenging assumptions about escalating AI development costs.

* While AI models haven't experienced a massive intelligence leap beyond GPT-4, they've become significantly more versatile and accessible through multimodal capabilities (handling images, video, audio), longer context lengths, and improved interaction methods.

* Current AI agents show promise in specific domains like research assistance and coding but face fundamental reliability challenges - particularly their inability to distinguish truth from fiction - limiting their autonomous capabilities.

* The user interface for AI remains a critical bottleneck, with current LLM interfaces compared to "dropping users into a Linux terminal," highlighting the urgent need for more intuitive interaction methods beyond text prompts.

* Creative industries are beginning to incorporate AI tools into workflows, with the most effective implementations maintaining human curation and editorial oversight to establish credibility and avoid low-quality "slop" content.

Content

AI Landscape in Early 2025

  • Significant improvements in AI models throughout 2024, characterized by:
- Cheaper and faster models - Multimodal capabilities (images, video, audio) - Longer context lengths - Improved interaction methods

  • Model development observations:
- No massive step change from GPT-4 as initially expected - 18 organizations have developed models that beat the original GPT-4 - Models didn't get dramatically "smarter" but became more versatile - Computational efficiency and inference time are key development areas - Open-source models are now competitive with previous state-of-the-art models - Some advanced models can now run on personal laptops

Model Efficiency and Cost Trends

  • AI models are becoming more efficient, smaller, and cheaper to run:
- Microsoft's 5.4 model can now run on a MacBook Pro - DeepSeek V3 is currently the best open weights model, trained for only $5.5 million

  • Dramatic pricing developments:
- OpenAI models are now 100x cheaper compared to 2022-2023 - Google's Gemini 1.5 Flash model costs 0.075 dollars per million tokens - Gemini 1.5 Flash is 27 times cheaper than GPT 3.5 Turbo from a year ago

  • Cost reduction factors:
- Intense competition in the AI model market is driving prices down - Some providers like Google Gemini and Amazon Nova are operating profitably at these low prices - Cost of achieving GPT-4 level intelligence dropped approximately 1,000x from start to end of last year

DeepSeek's Impact and AI Efficiency

  • DeepSeek's model training achievement is considered a "bombshell" - challenging previous assumptions about escalating AI training costs
  • Key insights about DeepSeek:
- Relatively small company (around 150 employees) - Part of a quant hedge fund - Potentially demonstrating technological capabilities - Raising questions about their motivations and progress

  • Speculation about DeepSeek's model development:
- Possible copying or borrowing of ideas from other AI labs - Rapid rate of progress attracting attention

  • Emerging developments:
- DeepSeek R1: A reasoning model that can run on a laptop - Growing interest in more efficient, accessible AI models - Uncertainty about whether this efficiency improvement is sustainable - Hypothesis that "low-hanging fruit" in AI efficiency is being rapidly discovered

AI Reasoning and Limitations

  • Interesting AI reasoning observations:
- QWQ and QVQ models demonstrate unique "thinking out loud" characteristics - Anecdote about an AI drawing a pelican on a bicycle in SVG, which processed thoughts in Chinese - Reference to Andrei Karpathy's observation about advanced AI reasoning potentially happening in non-English languages

  • AI agent limitations:
- Skepticism about current AI agents due to fundamental reliability issues - Main critique centers on AI models' "gullibility" - their tendency to believe anything presented to them - Highlighted problem: AI agents cannot reliably distinguish truth from fiction - Lack of clear definition of what constitutes an "AI agent" across different contexts - Security risks demonstrated by the Claud example, where an AI was tricked into downloading malware

Agent Technology Perspectives

  • Cautiously optimistic outlook on agent technology:
- Development compared to the gradual progress of self-driving cars - Technological advancement viewed as a "slow cook" process over the next 10 years

  • Promising agent types:
- Research Assistant Agents: Viewed as most credible * Can analyze multiple sources (e.g., Google Gemini 1.5 Pro) * Capable of comprehensive research and reporting - Coding Agents: Proven to work well * Can write code, execute, and self-correct based on error messages * Continuously improving

  • Skeptical areas:
- Autonomous agents making independent financial decisions - Agents acting completely independently without human oversight - Fully autonomous financial agents seen as an "AGI level problem"

  • Specific examples:
- Stripe released an agent toolkit with virtual spending cards - Travel booking agents are a recurring technological "promise" across generations - Existing solutions like Google Flights already work effectively - Notebook LM viewed as two products: a good RAG tool with an interesting but somewhat "gimmicky" podcast feature

Multimodal AI Capabilities

  • Rapid advancement in multimodal AI capabilities:
- Vision and audio models have made significant progress - Video processing now involves capturing images per second and feeding them into AI models - ChatGPT iPhone app allows real-time video interaction and object/scene identification

  • Specific model observations:
- GPT-4 Vision was impressive initially - Google Gemini 1.5 Pro improved multimodal capabilities - Recent models can process audio and images simultaneously - Gemini Flash offers free-tier capabilities like continuous photo capture and prompting

  • Cost and scalability improvements:
- Processing images has become extremely cost-effective - Example: Generating captions for 68,000 photos would cost only $1.68 - Video processing is now economically feasible

  • Video generation models:
- Sora and Google's VO2 were significant launches this year - Discussion suggests VO2 might be technically superior to Sora - The publicly released Sora might be a "lite" version

AI in Creative Industries

  • Sora and generative AI in filmmaking:
- OpenAI's strategy appears to be developing Sora for Hollywood studios while letting others experiment with Sora Lite - Generative AI likely to first impact film production in background and marginal elements, similar to early CGI techniques - The VFX team for "Everything Everywhere All at Once" already used RunwayML in their workflow - Potential AI applications include generating background video, music, and sound effects

  • Industry and cultural perspectives:
- Perceived cultural resistance in Hollywood against AI technologies - Some collaborative efforts emerging, like generative AI video creative hackathons - Chinese AI models (Hyluo, Kling) making significant progress in video generation - Video avatar companies like HeyGen and Swix developing specialized AI video technologies

  • AI content creation workflows:
- HeyGen used to create avatar-based lip-synced videos from audio recordings - Potential for AI influencers, though current attempts are often novelty-driven - AI tools seen as workflow simplifiers that help creators do more ambitious work

  • Credibility considerations:
- Critical importance of human credibility in content creation - AI can generate variations, with humans selecting and endorsing final output - Credibility comes from human review and willingness to "put your name" behind content - LLMs cannot inherently establish credibility

UI Challenges and Innovation

  • Concept of "Slop" in AI content:
- Defined as AI-generated content that is unrequested and unreviewed - Crucial distinction is human curation/editorial review

  • LLM user interface challenges:
- Current LLM interfaces compared to dropping users into a Linux terminal - Urgent need for a more intuitive, user-friendly interface - Parallel drawn to how GUI replaced command-line interfaces

  • Potential UI innovation directions:
- OpenAI's canvas collaboration interface - Drawing-based UI that translates sketches into functional interfaces - Prompt-driven UI development * Generating custom HTML/JavaScript interfaces based on user prompts * Potential for dynamic, interactive interfaces

  • Future UI vision:
- LLMs creating custom interfaces with interactive elements, sliders, map selection tools - Goal: More precise, intuitive interaction with AI models - Current limitation: Interfaces aren't yet "closing the loop" by learning from user interactions

Software Creation and Local AI Models

  • AI-powered software development:
- LLMs enabling easier software creation, with tools like Bolt allowing zero-shot app generation - Potential for creating custom dashboards, web applications, and data exploration tools via prompts - AI integration in productivity tools (like Gemini in Gmail and Google Sheets) simplifying complex tasks

  • Complexity and usability challenges:
- While AI tools are becoming more powerful, they're also becoming more complex - Many AI features have undocumented limitations and edge cases - Understanding the full scope of what's possible requires increasing technical expertise - Challenges include issues like CORS headers and API access limitations

  • Local LLMs and personal computing:
- Local AI models have significantly improved in the past three months - While not yet matching top-tier hosted models like Claude 3.5 Sonic, local models are now practically usable - RAM limitations currently pose a challenge for running large models (e.g., Llama 370B) - Future laptops with more RAM could make local AI models more viable - NVIDIA's new $3,000 128GB machine represents an interesting development in local AI computing

  • Recommended local AI applications:
- MLC Chat (iPhone): Runs Llama 3B, good for creative tasks like generating movie plot outlines - Llama: Recommended as an easy entry point for running local models - LM Studio: Best user interface for local AI models - Open Web UI: Provides a good interface for Ollama models - Local models range from 2GB to 20-30GB in size - Smaller models (3B) are becoming more capable

Practical AI Tools and Industry Landscape

  • Recommended AI tools:
- Mac Whisper: Desktop app that can pull audio/transcripts from YouTube videos, works with MP3 files - Super Whisper: Speech-to-text tool that uses GPT-4 to clean up and rewrite transcripts - Rosebud: AI journaling app, highlighting potential for AI in mental health applications - Riverside: Recording platform with smart editing feature that automatically manages multi-track video editing, saving significant time in post-production

  • OpenAI and AI landscape assessment:
- OpenAI no longer the unambiguous market leader - Facing talent loss challenges - Competitive pressure from Google Gemini and Anthropic - GPT-4 helped maintain their position

  • LLM criticism perspectives:
- Current AI discourse lacks nuanced criticism - Typical criticism focuses on environmental impact, training data plagiarism, unlicensed data usage - The speaker argues that LLMs are valuable when used correctly, despite their tendency to hallucinate

  • Key concerns about AI technology:
- Training data usage (potentially legal but perceived as unfair) - Environmental impact - Potential job displacement in unexpected sectors like art and law

Regulatory Considerations and Emerging Technologies

  • Regulatory perspectives:
- Previous AI regulation attempts (e.g., White House, California) have been ineffective - Regulations often try to "regulate the last war" instead of addressing current technological developments

  • Two areas of potential AI regulation:
1. Preventing opaque AI decision-making in critical areas like insurance claims 2. Strengthening privacy laws to protect user data and prevent unauthorized training

  • Wearable technology resurgence:
- AI-enabled wearables becoming surprisingly affordable, increasingly capable, with decent battery life - Companies like Limitless (formerly Rewind) developing voice-recording wearables - Potential use cases include workplace meeting recording, personal memory preservation - AI transforming smart glasses as a product category - Future potential for integrated technologies combining smart glasses, advanced earbuds, LLMs, with smartphone as a central device

  • Privacy considerations:
- Ongoing societal discussions needed about acceptable use of recording technologies - Importance of user choice and consent in data collection

Upcoming Events and Projects

  • AI.engineer conference in New York City on February 20-21:
- February 20th: Leadership day for management/VPs/CTOs - February 21st: Engineer day for individual contributors - Will feature labs from DeepMind, Anthropic, Meta, OpenAI - Registration website: apply.ai.engineer

  • Simon Willison's work:
- Open source tools for data journalism - Dataset.io (data publishing/exploration platform) - Currently developing AI tools for the platform - Plans to add LLM-powered features for query crafting and dashboard building - Hosts Tech Meme Ride Home, a daily 15-minute tech news podcast - Personal blog: simonwillison.net

More from Latent Space: The AI Engineer Podcast

Explore all episode briefs from this podcast

View All Episodes →

Listen smarter with PodBrief

Get AI-powered briefs for all your favorite podcasts, plus a daily feed that keeps you informed.

Download on the App Store