Overview
- Multi-phase training approaches are evolving beyond rigid pre-training/fine-tuning distinctions toward more continuous, flexible parameter scheduling that incorporates original datasets into later training stages, challenging the necessity of random initialization.
- Answer.AI exemplifies a non-traditional organizational structure with no management hierarchy, emphasizing talent over credentials and operating as "a large yard with narrow fences" - giving team members flexibility while maintaining a shared vision for more practical, accessible AI research.
- The team advocates for encoder-decoder architectures for their superior feature representation capabilities, while developing techniques to fine-tune large language models with limited computational resources through approaches like FSDP, QLoRA, and adapter-based fine-tuning.
- Projects like FastHTML (creating web applications in a single Python file) and "Dialogue Engineering" aim to bridge gaps in current development and AI interaction paradigms, moving beyond both ChatGPT-style interfaces and traditional coding environments.
- Future research focuses on helping people learn to maintain AI-generated code, optimizing model inference through adapter distribution rather than full model merging, and exploring conceptual innovations like models that develop "sketches" before generating tokens.
Content
Podcast Context and Introduction
- This is an episode of the "Latent Space" podcast featuring Jeremy Howard
- Hosts are Alessio and swyx
- This is Jeremy Howard's second or third appearance on the podcast
- Episode was recorded over a month ago and delayed due to Llama 3 and SAM 2 paper releases
Evolving Perspectives on Model Training
- Jeremy Howard discusses evolving perspectives on model training approaches
- Training steps (pre-training, instruction tuning, task training) are not as separate as originally thought
- These steps should be treated more as a continuum
- Howard advocates for incorporating original dataset into later training stages
- He highlights the ability to significantly modify model behavior without starting from random weights
- Howard is skeptical of starting model training from random weights
- Argues there's likely always some data similarity that makes random initialization unnecessary
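The "continuum" idea can be sketched as a batch sampler that blends a fraction of the original pre-training data into later fine-tuning batches, so the model never fully leaves its earlier distribution. The function and parameter names below (`mixed_batches`, `pretrain_frac`) are hypothetical illustrations, not code from the episode.

```python
import random

def mixed_batches(finetune_data, pretrain_data, pretrain_frac=0.2,
                  batch_size=8, num_batches=3, seed=0):
    """Yield fine-tuning batches that mix in a slice of the original
    pre-training data, treating the training stages as a continuum
    rather than fully separate phases."""
    rng = random.Random(seed)
    n_pre = int(batch_size * pretrain_frac)  # pre-training examples per batch
    n_ft = batch_size - n_pre                # fine-tuning examples per batch
    for _ in range(num_batches):
        batch = rng.sample(finetune_data, n_ft) + rng.sample(pretrain_data, n_pre)
        rng.shuffle(batch)
        yield batch
```

With `pretrain_frac=0.2` and a batch size of 8, each batch carries one pre-training example alongside seven fine-tuning examples.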
Multi-Phase Pre-Training and Optimization
- Snowflake released a model called Snowflake Arctic with a three-phase training approach
- Training phases involved gradually reducing web text and increasing code percentage
- Discussion suggests multi-phase training is becoming more explicitly discussed in research
- Preference for flexible parameter scheduling rather than rigid schedules
- Mention of Meta's work on "schedule-free" optimizers
- Emphasis on having configurable hyperparameters with good default settings
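A multi-phase data-mixture schedule of the kind described for Snowflake Arctic can be sketched as a simple per-phase lookup: web text shrinks while code grows across three phases. The percentages here are illustrative placeholders, not Arctic's actual recipe.

```python
def phase_mixture(phase):
    """Toy three-phase data-mixture schedule: web text decreases and
    code increases from phase to phase. Percentages are illustrative."""
    schedules = {
        1: {"web": 0.75, "code": 0.15, "other": 0.10},
        2: {"web": 0.55, "code": 0.35, "other": 0.10},
        3: {"web": 0.30, "code": 0.60, "other": 0.10},
    }
    return schedules[phase]
```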
OpenAI Governance and Organizational Design
- Critique of OpenAI's governance structure before Sam Altman's firing
- Observation that the non-profit/for-profit hybrid model was fundamentally flawed
- Argument that financial incentives (equity) made the governance model unsustainable
- Reflection on how companies tend to become "sociopathic" and devour their original mission
Creating Better Company Structures
- Discussion focuses on creating companies that are less "sociopathic" and more aligned with founders' intentions
- Key strategies for maintaining company values include:
- PBC (Public Benefit Corporation) benefits:
Hiring Philosophy and Team Building
- Emphasis on talent and potential over traditional institutional credentials
- Interest in candidates with non-traditional backgrounds
- Recognition that exceptional work often comes from people with unique life experiences
- Valuing people who:
- At Answer.AI, many team members experience imposter syndrome
- Mutual intimidation between developers and researchers, who each view the other group as impressive
- Key philosophical points:
Answer.AI's Organizational Approach
- Brief mention of Answer.AI, Howard's startup
- Noted for shipping multiple open-source projects quickly
- Small team (maximum 12 members)
- No traditional management hierarchy:
- Specific collaborative examples:
- No required meetings, but regular meetings across time zones
- Everyone is interviewed by the entire company during recruitment
- Nearly all candidates in the recruiting pipeline have been hired
- "A large yard with narrow fences" - giving team members flexibility while maintaining a shared vision
Research and Technical Focus
- The group shares a common vision and critique of current AI research
- They believe current research is:
- They prioritize practical research with real-world outcomes
- Technical improvements like:
Notable Team Members and Talent
- Examples of standout people include:
Model Architecture Perspectives
- Discussion of encoder-decoder vs. decoder-only model paradigms
- Key arguments for encoder-decoder models:
- Model architecture observations:
- Current research trend tends to focus on incremental improvements rather than exploring promising approaches
- Renewed interest in BERT models, particularly BERT24
- Growing interest in recurrent, state-based models (RNNs, LSTMs, xLSTM)
Technical Development Challenges
- Developing techniques for fine-tuning large language models with limited computational resources
- Key techniques mentioned include:
- The development process was extremely complex and challenging
- Significant obstacles included:
- The goal was to prove it's possible to fine-tune large models (e.g., 70B parameters) on more modest hardware (e.g., consumer RTX 4090 GPUs)
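Some back-of-the-envelope arithmetic shows why quantization plus sharding both matter here: a 70B-parameter model's weights alone exceed a single 24 GB RTX 4090 even at 4-bit, which is why techniques like QLoRA and FSDP are combined. The helper below is an illustrative estimate that ignores activations, optimizer state, and quantization metadata.

```python
def quantized_weights_gib(n_params_billion, bits_per_param):
    """Rough size of model weights in GiB at a given precision.
    Ignores activations, KV cache, optimizer state, and quantization
    metadata, so real memory footprints are larger."""
    n_bytes = n_params_billion * 1e9 * bits_per_param / 8
    return n_bytes / 2**30

fp16_gib = quantized_weights_gib(70, 16)  # ~130 GiB: far beyond one consumer GPU
nf4_gib = quantized_weights_gib(70, 4)    # ~33 GiB: still more than one 24 GB 4090
```

Even at 4-bit, the weights must be sharded across at least two 24 GB cards, hence FSDP on top of QLoRA.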
Performance and Validation Challenges
- Significant challenges in the open source AI ecosystem with performance evaluation
- Many claims about model capabilities don't hold up under actual testing
- Developing AI models requires extensive "janitorial work" and tenacious effort
- Systems implementation is complex, not just theoretical mathematics
Inference and Model Optimization
- Working on inference performance optimization
- Goals include:
- Collaborating with communities like CUDA Mode, PyTorch team, and HQQ quantization library
- Recommendation is to distribute adapters rather than fully merged models
- Merging adapters can often produce better results
- Adapter-based approach allows smaller downloads, faster inference, and more efficient customization
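The adapter argument can be made concrete with the LoRA update rule W' = W + B·A: distributing only the small A (r × d_in) and B (d_out × r) matrices means shipping r·(d_in + d_out) numbers instead of d_out·d_in. A dependency-free sketch with hypothetical helper names:

```python
def matmul(A, B):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def apply_lora(W, A, B, scale=1.0):
    """Apply a low-rank adapter at load time: W' = W + scale * (B @ A).
    Only A and B need to be downloaded; the base weights W are shared."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

For a rank-1 adapter on a 2×2 weight, the adapter holds 4 numbers; at realistic sizes (e.g., rank 16 on 8192×8192 layers) the savings are what make per-customer adapters cheap to distribute.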
Fast HTML Web Development Project
- Working on a new web development tool called FastHTML
- FastHTML allows creating complete web applications in a single Python file
- Unlike Streamlit and Gradio, it works directly with web foundations
- Built on top of Starlette and closely matches FastAPI's interface
- Key features:
- Received support from developers like Sebastián Ramírez (FastAPI creator), Carson Gross (HTMX), and the Django community
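A minimal single-file app in the style FastHTML advertises looks roughly like this; it assumes the `python-fasthtml` package and its `fast_app`/`serve` helpers, and is a sketch rather than code from the episode.

```python
# pip install python-fasthtml  (assumed package name)
from fasthtml.common import *

app, rt = fast_app()

@rt("/")
def get():
    # Routing, HTML generation, and the server all live in this one
    # file; components like Titled and P are plain Python callables.
    return Titled("Hello", P("A complete web app in a single Python file"))

serve()
```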
AI Interaction and "Dialogue Engineering"
- Developing a new approach called "Dialogue Engineering"
- Created a system named "AI Magic" that increases personal productivity
- Compares current AI interfaces to 1970s teletype interactions
- Different approaches to AI interaction:
- Developed libraries to improve AI API interactions:
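One way to picture the tooling behind "dialogue engineering" is a stateful chat object that owns the conversation history, so each call is one turn of an ongoing dialogue rather than a stateless prompt. The `Dialogue` class below is a hypothetical sketch, not the API of any Answer.AI library.

```python
class Dialogue:
    """Hypothetical stateful chat wrapper: the object carries the full
    conversation history, and the backing model is a pluggable callable
    so any provider's API can sit behind the same interface."""

    def __init__(self, model_fn, system=""):
        self.model_fn = model_fn  # callable: (system, history) -> reply text
        self.system = system
        self.history = []         # list of {"role", "content"} dicts

    def __call__(self, prompt):
        self.history.append({"role": "user", "content": prompt})
        reply = self.model_fn(self.system, self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply
```

Because history accumulates inside the object, follow-up prompts like "now refactor that" work without the caller re-sending context each turn.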
Future Projects and Research Directions
- Planned "How to Solve It with Code" course:
- Exploring emerging application strategies:
- Interest in conceptual model innovations:
- Determining optimal use cases for different approaches and how various model architectures can complement each other