The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka
Overview
Transformative career journey from Google Brain researcher to Reka AI co-founder, bringing expertise from working on PaLM 2, UL2, Flan, and Bard to build a startup focused on universal intelligence and multimodal agents, despite initial reluctance to leave Google.
The AI research paradigm has shifted dramatically from single-task fine-tuning to foundation models, with ChatGPT serving as a "forcing function" that changed the landscape, while architectural innovations like encoder-decoder models and Mixture of Experts (MoE) continue to evolve the field.
High-performance AI research requires Olympic-level commitment and adaptability rather than specific technical skills, with success often coming through mentorship, collaboration, and helping established researchers with their overflow work.
Current challenges in AI development include compute infrastructure reliability, benchmark contamination, and evaluation methodology, with emerging trends focusing on multi-modal integration, extremely long context windows, and the balance between open and closed-source development.
Personal productivity approaches have evolved from constant work with minimal separation between professional and personal life to seeking more balance, particularly after becoming a father, while still maintaining effective information management systems.
Content
Background and Career Trajectory
Yi Tay is the chief scientist of Reka AI and was previously a key researcher at Google Brain, with an impressive background:
- Worked on the PaLM 2 architecture
- Inventor of UL2
- Core contributor to Flan
- Member of Bard core team
- Worked on generative retrieval
Reka AI Details:
- Founded in March 2023
- Raised $58 million Series A (valuation around $250 million)
- Goals include universal intelligence, multimodal/multilingual agents, self-improving AI
- Recent releases: Reka Flash, Reka Core, Reka Edge, and Vibe-Eval
Career Path:
- Joined Google in late 2019
- Initially focused on model architecture and efficient transformers
- Worked during early transformer research period
- Researched long-range transformer alternatives
- Predates the GPT-3 and large language model era
- Worked extensively on fine-tuning models like T5 and BERT
Evolution of AI Research
The AI research paradigm shifted dramatically with GPT releases:
- Moved from single-task fine-tuning to more universal "foundation models"
- Previously, conferences were organized around specific applications (e.g., question answering)
- Early research focused on incremental improvements in specific domains
- ChatGPT (launched November 2022) was a "forcing function" that dramatically changed the research landscape
Personal Research Approach:
- Describes his research evolution as organic, not strategically planned
- Prioritized working on what seemed most interesting and impactful
- Collaboration and environment influenced research direction
- Emphasized being adaptable and moving with the field rather than getting stuck in one area
Research Context at Google:
- Projects could be top-down (large team efforts) or bottom-up (individual researcher initiatives)
- Worked on projects like UL2, PaLM, Emergent Abilities, and the Differentiable Search Index around 2021
- Research labs like Google, Meta, and OpenAI were often working on advanced concepts several years ahead of broader academic research
Key Projects and Collaborations
Research and Project Categories:
- Broader, org-level efforts
- Personal projects using available compute (like UL2)
- Collaborative projects with friends (like Flan and Emergent Abilities)
PaLM 2 Project:
- Became a co-lead on the PaLM 2 project
- Involvement originated from adding UL2 to PaLM 2
- Project leadership was a mix of bottom-up and top-down selection
- Leadership selection was based on visibility and contributions
UL2 Project:
- Was a personal project
- At the time, it was the largest model Google had released
- A 20B model based on T5, pre-trained purely on the C4 dataset
- Released as an "updated" version of T5
Emergent Abilities Research:
- Collaborated with Jason Wei, who was the primary author of the Emergent Abilities paper
- Both strongly believe in the concept of emergence in AI development
- Acknowledged the "Mirage" paper by Rylan Schaeffer, which contested some aspects of emergence
- While some benchmarks may challenge emergence, they believe it fundamentally exists
Mentors and Influences
About Quoc Le:
- Described as having good research intuition and "research taste"
- Viewed as more of a researcher than a corporate manager
- Discussed topics like AGI and singularity
- Served as a friend and mentor figure
About Jason Wei:
- Influenced the speaker's online persona and approach to social media
- Known for posting "spicy" takes on Twitter
- Introduced a philosophy of only posting opinions one can fully stand behind
- Emphasized the importance of marketing and PR in research
- Has a high "hit rate" and "impact density" for research papers, often achieving around 1,000 citations per paper in a year
About Hyung Won Chung:
- Described as a systematic engineer with unique work habits
- Known for working efficiently on a single small screen, believing keyboard switching is more optimal than multiple monitors
- Maintains a friendship and has interesting philosophical discussions beyond research
Career Advice for Researchers
Key strategies for emerging researchers:
- Find a mentor or co-author with some visibility in the research community
- Collaboration is crucial - seek opportunities to work with researchers who have a good reputation
- Don't be afraid to reach out via direct messages (DMs) - many researchers are more approachable than they seem
- "Pick up what others put down" - help mentors with work they don't have time to complete
On high performance in AI research:
- Success requires intense dedication and willingness to work outside normal hours
- Being a high performer often means addressing critical issues immediately, even at odd hours
- Compares the intensity to Olympic-level commitment, where work-life balance is sacrificed
- Acknowledges this approach can be unhealthy but believes it enables faster progress
Technical skills:
- No single definitive tech stack is recommended for AI researchers
- More important is the ability to continuously learn and adapt
- Willingness to learn new frameworks is key, as technologies change rapidly
- Specific technical knowledge (like CUDA kernels) is less critical than learning adaptability
Startup Journey - Reka AI
Founding and Decision:
- Founded after collaboration between team members at DeepMind and Google Brain
- The speaker initially identified more as a researcher than a startup founder
- Joined after a 6-month persuasion period
- Was happy at Google and didn't have a strong intention to leave initially
- The experience of being a co-founder was the primary motivation for the move
- Left Google in early April 2023
Compute and Infrastructure Challenges:
- Experienced significant delays in obtaining H100 GPUs
- Initially had 500 A100 GPUs with prolonged waiting periods
- Compute infrastructure was often unreliable and "broken"
- Most compute resources arrived in December
- GPU nodes are often unreliable when first provisioned
- Node quality improves over time through testing and returns
- One bad node can potentially kill an entire training job
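The burn-in dynamic described above can be sketched as a simple admission filter: stress-test each freshly provisioned node and only admit those that pass. This is a hypothetical illustration (the node names, thresholds, and metrics are invented for the example, not Reka's actual tooling).

```python
# Hypothetical sketch: screen freshly provisioned GPU nodes before admitting
# them to a training cluster, since one bad node can kill an entire job.
# The metrics and thresholds are illustrative assumptions.

def screen_nodes(burn_in_results, max_ecc_errors=0, min_allreduce_gbps=100.0):
    """Partition nodes into (healthy, suspect) from burn-in measurements.

    burn_in_results maps node name -> dict with:
      'ecc_errors'     - uncorrectable GPU memory error count during stress test
      'allreduce_gbps' - measured all-reduce bandwidth across the node
    """
    healthy, suspect = [], []
    for node, stats in sorted(burn_in_results.items()):
        ok = (stats["ecc_errors"] <= max_ecc_errors
              and stats["allreduce_gbps"] >= min_allreduce_gbps)
        (healthy if ok else suspect).append(node)
    return healthy, suspect

results = {
    "node-01": {"ecc_errors": 0, "allreduce_gbps": 180.0},
    "node-02": {"ecc_errors": 3, "allreduce_gbps": 175.0},  # flaky memory
    "node-03": {"ecc_errors": 0, "allreduce_gbps": 40.0},   # bad interconnect
}
healthy, suspect = screen_nodes(results)
print(healthy)   # ['node-01']
print(suspect)   # ['node-02', 'node-03']
```

This mirrors the observation that node quality improves over time: each burn-in cycle returns suspect nodes to the provider and keeps the survivors.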
Technology Stack Decisions:
- Chose to build on PyTorch and GPUs instead of TPUs
- TPUs outside Google were perceived differently from internal Google TPUs
- Switching to TPUs would have been costly and complex
- Existing infrastructure and training setup made TPU migration impractical
Risk and Cost Management:
- Current compute providers rarely share risk for node failures
- Large model training runs are significantly impacted by node failures
- Startups currently bear most of the financial risk of node failures
- An ideal GPU provider would offer to share costs of node failures
- Most providers are confused by large model training specific requirements
- Refund policies are typically minimal and inadequate
Technical Insights on Model Training
Checkpointing and Model Training:
- Checkpointing frequency depends on job stability and file system performance
- For large models (20B-200B), checkpointing can take 30 minutes or more
- Goal is to minimize performance slowdown (1-2% speed reduction is acceptable)
- Storage can become expensive with very large models
- Perfect restart is challenging; some time/work is always lost during checkpointing
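The trade-off between checkpoint duration and the 1-2% overhead budget can be made concrete with a little arithmetic. This is a back-of-the-envelope sketch (assuming checkpointing fully blocks training, which is the worst case), not a description of Reka's actual setup.

```python
def min_checkpoint_interval(ckpt_seconds, max_overhead):
    """Smallest training interval between checkpoints so that blocking
    checkpoint time stays within the overhead budget:

        overhead = ckpt_seconds / (interval + ckpt_seconds) <= max_overhead
    """
    return ckpt_seconds * (1.0 - max_overhead) / max_overhead

# A 30-minute blocking checkpoint with a 2% slowdown budget:
interval = min_checkpoint_interval(30 * 60, 0.02)
print(round(interval / 3600, 1))  # 24.5 -> roughly one checkpoint per day
```

The math shows why slow checkpointing is painful: a 30-minute save forces checkpoints roughly a day apart to stay under 2% overhead, so a node failure between checkpoints can cost many hours of work.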
Team and Recruitment:
- The team values their ability to solve problems quickly rather than specific technical artifacts
- Team members were a mix of ex-Google colleagues and fresh PhD graduates
- Confidence stems from general problem-solving skills, not identical technical infrastructure
Architecture Discussions:
- GQA (Grouped-Query Attention): Considered a "no-brainer" improvement over MQA (Multi-Query Attention)
- SwiGLU: Introduced in an initially underappreciated paper by Noam Shazeer; later gained more recognition
- Most architectural modifications don't significantly impact performance
- RoPE (Rotary Positional Embedding) is now a default approach, with advantages like better context extrapolation
- The vanilla transformer has evolved incrementally over 4-5 years
- Current transformer recipes (such as the "Noam" configuration associated with Noam Shazeer's work) are strong baselines that are difficult to significantly improve upon
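The GQA idea mentioned above can be shown in a few lines: several query heads share each key/value head, shrinking the KV cache without giving up multi-head queries. This is a minimal NumPy sketch (toy shapes, no masking or batching), not a production implementation.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention: q has n_q_heads, k/v have only
    n_kv_heads, and each KV head serves a contiguous group of query heads.
    Shapes: q (n_q_heads, seq, d); k, v (n_kv_heads, seq, d)."""
    n_q_heads, seq, d = q.shape
    group = n_q_heads // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))   # 8 query heads
k = rng.standard_normal((2, 4, 16))   # 2 KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v, n_kv_heads=2)
print(out.shape)  # (8, 4, 16)
```

With `n_kv_heads=1` this reduces to MQA, and with `n_kv_heads=8` to standard multi-head attention, which is why GQA is a clean interpolation between the two.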
Key Architectural Decisions:
- Encoder-decoder vs. decoder-only models is a fundamental choice
- Encoder-decoder models have unique properties:
* Similar to prefix language models
* Provides "intrinsic sparsity" (encoder-decoder with N params is computationally equivalent to a decoder model with N/2 params)
* Allows more flexibility in handling long contexts
* Encoder can use aggressive pooling or sparse attention techniques
- Architectural changes need to be simple to implement to gain widespread adoption
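The "intrinsic sparsity" point can be checked with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token. This is a back-of-the-envelope sketch under that assumption, not an exact FLOP count.

```python
def per_token_flops(params_active):
    """Rough forward-pass FLOPs per token ~= 2 * active parameters
    (a standard rule of thumb for dense transformers)."""
    return 2 * params_active

N = 20e9  # total parameters, e.g. a UL2-sized 20B model

decoder_only = per_token_flops(N)       # every token activates all N params
# In an encoder-decoder with N params split evenly, an input token passes
# only through the encoder half and an output token only through the
# decoder half, so each token activates ~N/2 params: "intrinsic sparsity".
encoder_decoder = per_token_flops(N / 2)

print(decoder_only / encoder_decoder)   # 2.0
```

Under this rough model, an encoder-decoder with N parameters matches the per-token compute of a decoder-only model with N/2 parameters, which is the equivalence claimed above.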
Model Evaluation and Benchmarking
Current Model Landscape:
- Llama 3 is seen as a significant improvement, potentially catching up to Google's capabilities
- Llama 1 and Llama 2 are viewed as "mid-tier" models
- Phi models (Phi-1, Phi-2, Phi-3) are discussed, with some skepticism about their synthetic training approach
Evaluation Challenges:
- Traditional benchmarks like GSM8K are becoming saturated and less meaningful
- LMSYS is considered the most legitimate current evaluation platform
- Existing benchmarks are seen as potentially contaminated by community exposure
- There's a need for new, robust evaluation methods
- A good evaluation set is partially hidden and not fully exposed to the community
- Academics are seen as potentially key to steering evaluation methodologies
Evaluation Methods:
- Two potential evaluation approaches are discussed:
1. LLM as judge
2. Arena-style comparisons
- They use third-party data companies for evaluations, not internal staff
- Researchers often conduct their own evaluations by examining model outputs
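Arena-style comparisons like LMSYS are typically aggregated with Elo-style ratings, where each pairwise preference (from a human rater or an LLM judge) nudges the two models' scores. This is a minimal sketch of the standard Elo update, not LMSYS's exact rating code (which uses a more elaborate Bradley-Terry fit).

```python
def elo_update(r_a, r_b, winner, k=32.0):
    """One arena-style pairwise comparison: update the Elo ratings of
    models A and B given which one the judge preferred ('a' or 'b')."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if winner == "a" else 0.0
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Two models start at 1000; model A wins the comparison.
a, b = elo_update(1000.0, 1000.0, "a")
print(a, b)  # 1016.0 984.0
```

Because the update is zero-sum and depends only on relative ratings, the leaderboard stays meaningful even as raw benchmark numbers saturate.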
Benchmark Challenges:
- There are significant challenges with benchmark contamination and reuse
- Withholding test sets can make benchmarks unpopular or impractical
- Academic and research incentives often push for benchmark accessibility and citations
- Benchmarking is fundamentally an "incentive problem"
- SWE-bench is highlighted as a popular coding-agent benchmark this year
Emerging Trends in AI
Multi-Modal AI Models:
- Discussion focuses on multi-modal AI models, particularly language and vision integration
- There's a preference for "early fusion" approaches in multi-modal models
- Current model development is influenced by organizational structures (referencing Conway's Law)
- The trend is moving towards unifying different modalities in AI models
- GPT-4o and Meta's Chameleon paper are mentioned as examples of early fusion models
Screen vs. Natural Image Intelligence:
- There's a distinction between "screen modality" and "general vision" in AI models
- Future models should be capable of understanding both screen and natural images
- Screen intelligence is seen as an area needing more capability enhancement
- Ideal models would be versatile across different image types (mobile screens, laptop screens, websites, etc.)
Scaling Laws and Model Training:
- Chinchilla scaling laws (scaling training data in proportion to model parameters) are considered not definitive
- Different perspectives on optimal scaling include compute and data scaling, and inference-based scaling
- Model training continues until performance saturation, but the exact point of saturation is unclear
- Examples like Llama 3 (15T tokens) demonstrate ongoing experimentation with scaling
- Critique of early models that trained large models with few tokens
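The contrast between Chinchilla-style scaling and Llama 3's training budget is easy to quantify with the widely cited ~20-tokens-per-parameter heuristic. This is a rough illustration of that rule of thumb, not the full scaling-law fit from the Chinchilla paper.

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Chinchilla rule of thumb: ~20 training tokens per parameter
    for compute-optimal training."""
    return params * tokens_per_param

p = 8e9  # e.g. an 8B-parameter model
print(chinchilla_optimal_tokens(p) / 1e9)    # 160.0 -> ~160B tokens

# Llama 3 8B trained on ~15T tokens, nearly 100x past "Chinchilla optimal",
# trading extra training compute for a stronger small model at inference.
print(15e12 / chinchilla_optimal_tokens(p))  # 93.75
```

This is the sense in which Chinchilla is "not definitive": it optimizes training compute alone, while overtraining a small model can be the better deal once inference cost is counted.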
Long Context Models:
- Emerging trend of models with extremely long context windows (million to hundred million tokens)
- Debate about the future of long context vs. retrieval-augmented generation (RAG)
- Key challenges include lack of robust evaluation benchmarks, determining appropriate use cases, and cost considerations
- Long context seen as potentially valuable for complex tasks requiring comprehensive understanding
- RAG remains useful for specific tasks like fact retrieval
- Potential for hybrid approaches combining long context and retrieval methods
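The long-context vs. RAG trade-off can be framed as a single knob: how many tokens of context you can afford to fill. The toy sketch below (invented scoring by word overlap, crude whitespace token counts) packs the best-scoring chunks into a budget; as the budget grows toward millions of tokens, retrieval matters less because everything fits.

```python
def fill_context(chunks, query_terms, budget_tokens):
    """Toy retrieval: score chunks by query-term overlap and pack the
    best ones into the context window until the token budget runs out.
    A very large budget degenerates to 'include everything', i.e. the
    long-context end of the spectrum."""
    scored = sorted(
        chunks,
        key=lambda c: -len(set(c.split()) & set(query_terms)),
    )
    context, used = [], 0
    for chunk in scored:
        cost = len(chunk.split())  # crude whitespace token count
        if used + cost <= budget_tokens:
            context.append(chunk)
            used += cost
    return context

chunks = [
    "the eiffel tower is in paris",
    "transformers use attention layers",
    "paris is the capital of france",
]
print(fill_context(chunks, ["paris"], budget_tokens=8))
# ['the eiffel tower is in paris']
```

With a tight budget only the most relevant chunk survives; raising `budget_tokens` above the corpus size returns every chunk, which is why hybrid approaches treat retrieval as a cost-control mechanism rather than a competitor to long context.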
Mixture of Experts (MoE) Models:
- Viewed as a promising architectural approach for AI models
- Allow for more parameters while keeping computational costs (FLOPs) relatively low
- Current trend seems to settle on around 8 experts, but more might provide additional gains
- Potentially offer a way to slightly "cheat" scaling laws by activating fewer parameters
- Unclear how MoE architecture impacts model capabilities
- One participant is "bullish" on MoEs, while another sees them as a modest performance improvement
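The "more parameters, similar FLOPs" property of MoE comes from the router activating only a few experts per token. This is a minimal token-level sketch in NumPy (toy shapes, no load balancing or capacity limits), not any particular production MoE.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy token-level MoE layer: a linear gate picks top_k of the experts
    for each token, and only those experts' weights are used. Total
    parameter count grows with len(experts), while per-token FLOPs stay
    roughly proportional to top_k."""
    logits = x @ gate_w                      # (tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-top_k:]        # chosen expert indices
        weights = np.exp(logits[t][top])
        weights /= weights.sum()                    # softmax over top_k
        for w, e in zip(weights, top):
            out[t] += w * (x[t] @ experts[e])       # only top_k experts run
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8                 # the "around 8 experts" setting above
x = rng.standard_normal((4, d))
gate_w = rng.standard_normal((d, n_experts))
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
out = moe_forward(x, gate_w, experts)
print(out.shape)  # (4, 16)
```

With 8 experts and `top_k=2`, the layer holds 8 experts' worth of parameters but spends only ~2 experts' worth of FLOPs per token, which is the sense in which MoE "cheats" dense scaling laws.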
Efficiency and Open Source Considerations
Technical Efficiency Challenges:
- Implementing certain machine learning methods can significantly slow down performance, sometimes making them 10x slower in practice
- Theoretical efficiency doesn't always translate to actual computational efficiency
- Challenges in accurately comparing models due to throughput considerations and hardware optimization issues
- Researchers can market methods as "efficient" by highlighting reduced parameters, even if the method is actually slower
- Adding complexity to models can reduce performance, especially if not hardware-optimized
Open Source vs. Closed Source Discussion:
- The speaker is not fundamentally against open source but critical of certain narratives
- Distinguishes between large organizations releasing open weights (like Meta's Llama) and grassroots, bottom-up open source development
- Observes that major progress is still primarily driven by big labs and academic institutions
- Critiques open source community's tendency to rename existing models and create "quick wins" based on others' work
Personal Productivity and Work-Life Balance
Work Practices Evolution:
- Historically involved little separation between work and personal life
- Used to work constantly, especially during PhD and Google periods
- Enjoyed writing code and running experiments, which made work feel less like work
- Recognizes this approach wasn't the most healthy
- Recently becoming a father and trying to have more balance in life
Information Management:
- Previously used to check arXiv every morning as soon as new submissions were posted
- Would skim 1-2 interesting papers
- Now reads fewer papers
- Acknowledges that constantly checking for new papers might be a "newness bias"
- Runs an AI newsletter that summarizes Twitter and Reddit content
- Uses this as a personal method to stay informed about AI news
- Now relies on Twitter to surface important research papers instead of comprehensive reading
Research Approach:
- Starts writing with the end goal/story in mind
- Creates draft titles early in research projects to maintain motivation and focus
- Emphasizes "shipping" the final research product as critical
Academic Background and International Perspectives
Academic Journey:
- Completed PhD at Nanyang Technological University (NTU) in Singapore
- Describes himself as a "regular" undergraduate with decent but not exceptional grades
- Entered PhD program somewhat accidentally, driven by curiosity and desire to stay in Singapore
- Discovered research aptitude during PhD process
- Had minimal advisor guidance, describing his advisor as essentially a "Grammarly-like" editing tool
Research Culture Observations:
- Experienced significant culture shock transitioning from Singapore's research environment to Google
- Critiques Singapore's research community as focused on paper publication rather than impact, and as less sophisticated than US research culture
- Experienced substantial personal growth adapting to different research standards at Google
- Learned research skills largely through self-supervision
AI and National Development Perspectives:
- On building an AI ecosystem like Silicon Valley, suggests governments should:
  * Avoid over-intervention
  * Respect local talent
  * Be cautious about moving in the wrong direction
  * Support funding for startups and academic conferences
  * Facilitate international exposure
- Acknowledges a significant "brain drain" to the US in AI
- Expresses concern about US cultural and technological hegemony in AI development
- Highlights the importance of "AI sovereignty" for different nations
- Suggests that exposure and international experience can dramatically expand one's worldview
Shift in AI Development:
- Emphasizes a shift towards individual contributors (ICs) rather than traditional management structures
- Senior technical people should be hands-on and writing code
- Impact is now driven by practitioners actually doing the work, not managers
- The complexity of AI requires deep technical expertise
- The era of "people who talk" is over; success in AI now depends on skilled individual contributors who can directly implement and innovate