
Latent Space: The AI Engineer Podcast

AI Magic: Shipping 1000s of successful products with no managers and a team of 12 — Jeremy Howard of Answer.ai

Overview

  • Multi-phase training approaches are evolving beyond rigid pre-training/fine-tuning distinctions toward more continuous, flexible parameter scheduling that incorporates original datasets into later training stages, challenging the necessity of random initialization.
  • Answer AI exemplifies a non-traditional organizational structure with no management hierarchy, emphasizing talent over credentials and operating as "a large yard with narrow fences" - giving team members flexibility while maintaining a shared vision for more practical, accessible AI research.
  • The team advocates for encoder-decoder architectures for their superior feature representation capabilities, while developing techniques to fine-tune large language models with limited computational resources through approaches like FSDP, QLoRA, and adapter-based fine-tuning.
  • Projects like FastHTML (creating web applications in a single Python file) and "Dialogue Engineering" aim to bridge gaps in current development and AI interaction paradigms, moving beyond both ChatGPT-style interfaces and traditional coding environments.
  • Future research focuses on helping people learn to maintain AI-generated code, optimizing model inference through adapter distribution rather than full model merging, and exploring conceptual innovations like models that develop "sketches" before generating tokens.

Content

Podcast Context and Introduction

  • This is an episode of the "Latent Space" podcast featuring Jeremy Howard
  • Hosts are Alessio and swyx
  • This is Jeremy Howard's second or third appearance on the podcast
  • Episode was recorded over a month ago and delayed due to the Llama 3.1 and SAM 2 paper releases

Evolving Perspectives on Model Training

  • Jeremy Howard discusses evolving perspectives on model training approaches
  • Training steps (pre-training, instruction tuning, task training) are not as separate as originally thought
  • These steps should be treated more as a continuum
  • Howard advocates for incorporating original dataset into later training stages
  • He highlights the ability to significantly modify model behavior without starting from random weights
  • Howard is skeptical of starting model training from random weights
  • Argues there's likely always some data similarity that makes random initialization unnecessary

Multi-Phase Pre-Training and Optimization

  • Snowflake released a model called Snowflake Arctic with a three-phase training approach
  • Training phases involved gradually reducing web text and increasing code percentage
  • Discussion suggests multi-phase training is becoming more explicitly discussed in research
  • Preference for flexible parameter scheduling rather than rigid schedules
  • Mention of Meta's work on "schedule-free" optimizers
  • Emphasis on having configurable hyperparameters with good default settings
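
The phase-based mixture idea can be sketched in a few lines. This is an illustrative sketch only — the function name, phase boundaries, and percentages below are hypothetical, not Snowflake Arctic's actual recipe:

```python
# Illustrative sketch of a phase-based data mixture (hypothetical numbers,
# not Snowflake Arctic's actual recipe): sampling weight shifts from web
# text toward code as training progresses.
def mixture_weights(step: int, total_steps: int) -> dict[str, float]:
    """Return per-source sampling weights at a given training step."""
    frac = step / total_steps
    if frac < 1 / 3:      # phase 1: mostly general web text
        web = 0.75
    elif frac < 2 / 3:    # phase 2: shift toward code
        web = 0.5
    else:                 # phase 3: code-heavy
        web = 0.25
    return {"web": web, "code": 1.0 - web}

print(mixture_weights(0, 300))    # {'web': 0.75, 'code': 0.25}
print(mixture_weights(150, 300))  # {'web': 0.5, 'code': 0.5}
print(mixture_weights(299, 300))  # {'web': 0.25, 'code': 0.75}
```

A "flexible schedule" in this spirit would expose the boundaries and weights as configurable hyperparameters with sensible defaults, rather than hard-coding them.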

OpenAI Governance and Organizational Design

  • Critique of OpenAI's governance structure before Sam Altman's firing
  • Observation that the non-profit/for-profit hybrid model was fundamentally flawed
  • Argument that financial incentives (equity) made the governance model unsustainable
  • Reflection on how companies tend to become "sociopathic" and devour their original mission

Creating Better Company Structures

  • Discussion focuses on creating companies that are less "sociopathic" and more aligned with founders' intentions
  • Key strategies for maintaining company values include:
    - Setting up legal structures that enforce long-term value principles
    - Using specific legal mechanisms like voting agreements
    - Becoming a Public Benefit Corporation (PBC)
  • PBC Benefits:
    - Allows companies to reject acquisition offers that conflict with their stated public benefit
    - Provides legal protection against being forced to make decisions solely for short-term financial gain
    - Can be implemented with minimal legal complexity

Hiring Philosophy and Team Building

  • Emphasis on talent and potential over traditional institutional credentials
  • Interest in candidates with non-traditional backgrounds
  • Recognition that exceptional work often comes from people with unique life experiences
  • Valuing people who:
    - Succeed despite constraints
    - Take risks
    - Are creative
    - Are tenacious
    - Are open-minded
  • At Answer AI, many team members experience imposter syndrome
  • Mutual intimidation between developers and researchers, who each view the other group as impressive
  • Key philosophical points:
    - It's unreasonable to expect to be the best at everything
    - Being in an environment where you're not the best at everything can be healthy
    - The goal is collective learning and bringing different skills together

Answer AI's Organizational Approach

  • Brief mention of Answer AI, Howard's startup
  • Noted for shipping multiple open-source projects quickly
  • Small team (maximum 12 members)
  • No traditional management hierarchy:
    - No managers telling people what to do
    - Collaborative, experimental approach
    - Focus on learning from each other
  • Specific collaborative examples:
    - Ben Clavier initiating a new BERT project by gathering experts
    - Benjamin Warner creating a hackable Transformers implementation
    - Organic, self-driven collaboration without top-down direction
  • No required meetings, but regular meetings across time zones
  • Everyone is interviewed by the entire company during recruitment
  • Nearly all candidates in the recruiting pipeline have been hired
  • "A large yard with narrow fences" - giving team members flexibility while maintaining a shared vision

Research and Technical Focus

  • The group shares a common vision and critique of current AI research
  • They believe current research is:
    - Too expensive
    - Too complicated
    - Focused on unnecessary foundation models
  • They prioritize practical research with real-world outcomes
  • Technical improvements like:
    - Transitioning from LoRA to DoRA
    - Creating vLLM extensions
    - Exploring quantized model training
    - Improving WebGPU programming

Notable Team Members and Talent

  • Examples of standout people include:
    - Ben Clavier: Writes distinctive, high-quality code
    - Vic (ex-DataQuest CEO): Successful startup founder, won a Kaggle NLP competition
    - Karim: Created a state-of-the-art Turkish language model independently
    - Jono Whittaker: Creative tinkerer
    - Benjamin: Strong community contributor without formal qualifications
    - Austin: Experienced AI leader with a diverse background

Model Architecture Perspectives

  • Discussion of encoder-decoder vs. decoder-only model paradigms
  • Key arguments for encoder-decoder models:
    - Better feature representation of input information
    - More effective for tasks requiring context understanding
    - Critical for translation and complex encoding tasks
  • Model architecture observations:
    - Encoder-only models work well for classification tasks
    - Decoder-only models require more training resources and larger model sizes to be competitive
  • Current research trend tends to focus on incremental improvements rather than exploring promising approaches
  • Renewed interest in BERT models, particularly BERT24
  • Growing interest in state-based models (RNNs, LSTMs, xLSTM)
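
The encoder/decoder distinction discussed above largely comes down to the attention mask. As a minimal plain-Python sketch (illustrative only, not any specific model's code): a causal mask lets position i see only earlier positions, while a bidirectional mask lets every token represent the full input.

```python
# Minimal sketch: the attention mask is what separates encoder-style
# (bidirectional) from decoder-style (causal) attention.
def causal_mask(n: int) -> list[list[int]]:
    """Position i may attend only to positions <= i (decoder-only)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n: int) -> list[list[int]]:
    """Every position attends to every position (encoder / BERT-style)."""
    return [[1] * n for _ in range(n)]

# With 4 tokens, the first token of a causal model sees only itself,
# while an encoder token always sees the whole sequence.
print(causal_mask(4)[0])         # [1, 0, 0, 0]
print(bidirectional_mask(4)[0])  # [1, 1, 1, 1]
```

This is one intuition for why encoders can build richer features of the input: every position conditions on the entire context, not just the prefix.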

Technical Development Challenges

  • Developing techniques for fine-tuning large language models with limited computational resources
  • Key techniques mentioned include:
    - FSDP (Fully Sharded Data Parallel)
    - QLoRA (Quantized Low-Rank Adaptation)
    - Adapter-based fine-tuning
  • The development process was extremely complex and challenging
  • Significant obstacles included:
    - Poorly documented libraries
    - Complicated CUDA code
    - Interconnected Hugging Face ecosystem components
    - Lack of clear, minimal working examples
  • The goal was to prove it's possible to fine-tune large models (like 70B) on more modest hardware (e.g., RTX 4090 GPUs)
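
The core QLoRA idea can be shown with a toy sketch. This is a conceptual illustration only (hypothetical helper functions, not the bitsandbytes implementation): base weights are frozen in a low-bit format, and only a small full-precision adapter is trained.

```python
# Toy sketch of the QLoRA idea (not the bitsandbytes implementation):
# freeze base weights in a 4-bit format and train only a small
# full-precision adapter on top of them.
def quantize_4bit(weights: list[float]) -> tuple[list[int], float]:
    """Absmax quantization onto symmetric 4-bit levels (-7..7)."""
    scale = max(abs(w) for w in weights) / 7
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

w = [0.21, -0.07, 0.35, -0.14]       # original full-precision weights
q, scale = quantize_4bit(w)          # stored at ~4 bits per weight
w_hat = dequantize(q, scale)         # frozen, approximate base weights
adapter = [0.0, 0.0, 0.0, 0.0]       # small trainable fp weights (the LoRA part)

# The model computes with quantized base + trainable adapter delta.
effective = [b + a for b, a in zip(w_hat, adapter)]
print(q)  # [4, -1, 7, -3]
print(max(abs(a - b) for a, b in zip(w, w_hat)) < 0.05)  # True: small error
```

The memory win comes from storing the frozen base in 4 bits while gradients flow only through the tiny adapter; combining this with FSDP sharding is what made 70B fine-tuning feasible on consumer GPUs.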

Performance and Validation Challenges

  • Significant challenges in the open source AI ecosystem with performance evaluation
  • Many claims about model capabilities don't hold up under actual testing
  • Developing AI models requires extensive "janitorial work" and tenacious effort
  • Systems implementation is complex, not just theoretical mathematics

Inference and Model Optimization

  • Working on inference performance optimization
  • Goals include:
    - Avoiding model merging
    - Promoting quantized models with adapters
    - Making model downloads and inference faster
  • Collaborating with communities like CUDA Mode, PyTorch team, and HQQ quantization library
  • Recommendation is to distribute merged adapters, not full merged models
  • Merging adapters can often produce better results
  • Adapter-based approach allows smaller downloads, faster inference, and more efficient customization
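
The reason an adapter can stand in for a fully merged model is linear algebra: for a linear layer, (W + BA)x = Wx + B(Ax), so shipping the small low-rank factors B and A reproduces the merged weights' outputs exactly (ignoring LoRA's scaling factor). A small sketch with hand-rolled matrix helpers, using made-up numbers:

```python
# Sketch of why shipping an adapter matches shipping a merged model:
# (W + B @ A) @ x == W @ x + B @ (A @ x), so the low-rank factors
# B and A reproduce the merged weights' outputs exactly.
def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

W = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 frozen base weights
B = [[0.5], [0.25]]            # 2x1 low-rank factor
A = [[2.0, 4.0]]               # 1x2 low-rank factor
x = [3.0, 5.0]                 # input vector

merged = [[w + d for w, d in zip(wr, dr)]
          for wr, dr in zip(W, matmul(B, A))]
y_merged = matvec(merged, x)                              # full merged model
y_adapter = [b + d for b, d in
             zip(matvec(W, x), matvec(B, matvec(A, x)))]  # base + adapter
print(y_merged == y_adapter)  # True
```

The download-size win follows: B and A together are n·r + r·n numbers versus n·n for the merged delta, which is tiny when the rank r is small.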

FastHTML Web Development Project

  • Working on a new web development tool called FastHTML
  • FastHTML allows creating complete web applications in a single Python file
  • Unlike Streamlit and Gradio, it works directly with web foundations
  • Built on top of Starlette and closely matches FastAPI's interface
  • Key features:
    - No separate template, CSS, or JavaScript files
    - Can create components using libraries like Daisy UI, Bootstrap, Shoelace
    - Provides built-in session and security features
    - Designed to be easy to use "out of the box"
  • Received support from developers like Sebastián Ramírez (FastAPI creator), Carson Gross (htmx creator), and the Django community
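
The components-as-Python idea can be illustrated with a few lines. This is a hypothetical sketch of the concept, not FastHTML's real API: pages are built from plain functions that return HTML, so no separate template files are needed.

```python
# Illustrative sketch of the components-as-Python idea behind FastHTML
# (hypothetical helpers, NOT FastHTML's real API): pages are built from
# plain functions, so no separate template files are needed.
def tag(name: str):
    def make(*children: str, **attrs: str) -> str:
        # Map the Python-friendly `cls` keyword to HTML's `class` attribute.
        attr_str = "".join(
            f' {"class" if k == "cls" else k}="{v}"' for k, v in attrs.items()
        )
        return f"<{name}{attr_str}>{''.join(children)}</{name}>"
    return make

Div, H1, P = tag("div"), tag("h1"), tag("p")

page = Div(H1("Hello"), P("Built in one Python file"), cls="card")
print(page)
# <div class="card"><h1>Hello</h1><p>Built in one Python file</p></div>
```

In the real library, such components are returned directly from route handlers, which is how a whole app fits in one Python file.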

AI Interaction and "Dialogue Engineering"

  • Developing a new approach called "Dialogue Engineering"
  • Created a system named "AI Magic" that increases personal productivity
  • Compares current AI interfaces to 1970s teletype interactions
  • Different approaches to AI interaction:
    - ChatGPT-style chat interface (beginner-friendly)
    - Traditional coding environments like Visual Studio Code
    - A proposed middle ground focused on interactive dialogue
  • Developed libraries to improve AI API interactions:
    - Claudette: A library specifically optimized for Claude
    - Cosette: A library optimized for OpenAI APIs
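
A dialogue-engineering-style loop differs from one-shot prompting in that the full conversation state is kept and replayed each turn. The sketch below is hypothetical and uses a stub in place of a real model call (it is not Claudette's or Cosette's actual interface):

```python
# Hypothetical sketch of a stateful dialogue loop: the whole history is
# kept and sent on every turn. `fake_model` stands in for a real API
# call (e.g. to Claude or OpenAI) so the example runs without a key.
def fake_model(history: list[dict]) -> str:
    # Echo-style stub: a real model would generate a reply from the history.
    return f"Turn {len(history)}: you said {history[-1]['content']!r}"

class Dialogue:
    def __init__(self):
        self.history: list[dict] = []

    def __call__(self, prompt: str) -> str:
        self.history.append({"role": "user", "content": prompt})
        reply = fake_model(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply

chat = Dialogue()
print(chat("write a sort function"))  # Turn 1: you said 'write a sort function'
print(chat("now make it stable"))     # Turn 3: you said 'now make it stable'
```

Keeping the dialogue object around is what enables iterative refinement of code across turns, rather than starting from scratch each time.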

Future Projects and Research Directions

  • Planned "How to Solve It with Code" course:
    - Aimed at people learning to code through AI tools
    - Helps learners understand how to maintain and extend AI-generated code
  • Exploring emerging application strategies:
    - Comparing and combining Retrieval-Augmented Generation (RAG), in-context learning, and prompt engineering
    - KV cache creation and persistent context storage
  • Interest in conceptual model innovations:
    - JEPA and diffusion models as potential solutions for generative processes
    - Models that can develop a conceptual "sketch" before generating tokens
    - Gradually refining solutions and updating state dynamically
  • Determining optimal use cases for different approaches and how various model architectures can complement each other
