Latent Space: The AI Engineer Podcast

The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka


Background and Career Trajectory

- Worked on the PaLM 2 architecture
- Inventor of UL2
- Core contributor to Flan
- Member of the Bard core team
- Worked on generative retrieval

- Founded in March 2023
- Raised a $58 million Series A (valuation around $250 million)
- Goals include universal intelligence, multimodal/multilingual agents, and self-improving AI
- Recent releases: Reka Flash, Reka Core, Reka Edge, and Vibe-Eval

- Joined Google in late 2019
- Initially focused on model architecture and efficient transformers
- Worked during the early transformer research period
- Researched long-range transformer alternatives
- This work predates GPT-3 and the large-language-model era
- Worked extensively on fine-tuning models like T5 and BERT

Evolution of AI Research

- Moved from single-task fine-tuning to more universal "foundation models"
- Previously, conferences were organized around specific applications (e.g., question answering)
- Early research focused on incremental improvements in specific domains
- ChatGPT (launched November 2022) was a "forcing function" that dramatically changed the research landscape

- Describes his research evolution as organic, not strategically planned
- Prioritized working on what seemed most interesting and impactful
- Collaboration and environment influenced his research direction
- Emphasized being adaptable and moving with the field rather than getting stuck in one area

- Projects could be top-down (large team efforts) or bottom-up (individual researcher initiatives)
- Worked on projects like UL2, PaLM, Emergent Abilities, and the Differentiable Search Index around 2021
- Research labs like Google, Meta, and OpenAI were often working on advanced concepts several years ahead of broader academic research

Key Projects and Collaborations

- Broader, all-level efforts
- Personal projects using available compute (like UL2)
- Collaborative projects with friends (like Flan and Emergent Abilities)

- Became a co-lead on the PaLM 2 project
- Involvement originated from adding UL2 to PaLM 2
- Project leadership was a mix of bottom-up and top-down selection
- Leadership selection was based on visibility and contributions

- Was a personal project
- At the time, it was the largest model Google had released
- A 20B model based on T5, pre-trained purely on the C4 dataset
- Released as an "updated" version of T5

- Collaborated with Jason Wei, the primary author of the Emergent Abilities paper
- Both strongly believe in the concept of emergence in AI development
- Acknowledged the "Mirage" paper by Rylan Schaeffer, which contested some aspects of emergence
- While some benchmarks may challenge emergence, they believe it fundamentally exists

Mentors and Influences

- Described as having good research intuition and "research taste"
- Viewed as more of a researcher than a corporate manager
- Discussed topics like AGI and the singularity
- Served as a friend and mentor figure

- Influenced the speaker's online persona and approach to social media
- Known for posting "spicy" takes on Twitter
- Introduced a philosophy of only posting opinions one can fully stand behind
- Emphasized the importance of marketing and PR in research
- Has a high "hit rate" and "impact density" for research papers, often reaching around 1,000 citations per paper within a year

- Described as a systematic engineer with unique work habits
- Known for working efficiently on a single small screen, believing keyboard switching is more optimal than multiple monitors
- Maintains a friendship and has interesting philosophical discussions beyond research

Career Advice for Researchers

- Find a mentor or co-author with some visibility in the research community
- Collaboration is crucial: seek opportunities to work with researchers who have a good reputation
- Don't be afraid to reach out via direct messages (DMs); many researchers are more approachable than they seem
- "Pick up what others put down": help mentors with work they don't have time to complete

- Success requires intense dedication and willingness to work outside normal hours
- Being a high performer often means addressing critical issues immediately, even at odd hours
- Compares the intensity to Olympic-level commitment, where work-life balance is sacrificed
- Acknowledges this approach can be unhealthy but believes it enables faster progress

- No single definitive tech stack is recommended for AI researchers
- More important is the ability to continuously learn and adapt
- Willingness to learn new frameworks is key, as technologies change rapidly
- Specific technical knowledge (like writing CUDA kernels) is less critical than adaptability

Startup Journey - Reka AI

- Founded after collaboration between team members at DeepMind and Google Brain
- The speaker initially identified more as a researcher than a startup founder
- Joined after a six-month persuasion period
- Was happy at Google and didn't have a strong intention to leave initially
- The experience of being a co-founder was the primary motivation for the move
- Left Google in early April

- Experienced significant delays in obtaining H100 GPUs
- Initially had 500 A100 GPUs, with prolonged waiting periods
- Compute infrastructure was often unreliable and "broken"
- Most compute resources arrived in December
- GPU nodes are often unreliable when first provisioned
- Node quality improves over time through testing and returns
- One bad node can kill an entire training job

- Chose a PyTorch/GPU stack rather than TPUs
- TPUs outside Google were perceived differently from internal Google TPUs
- Switching to TPUs would have been costly and complex
- Existing infrastructure and training setup made TPU migration impractical

- Current compute providers rarely share the risk of node failures
- Large model training runs are significantly impacted by node failures
- Startups currently bear most of the financial risk of node failures
- An ideal GPU provider would offer to share the costs of node failures
- Most providers are confused by the specific requirements of large model training
- Refund policies are typically minimal and inadequate

Technical Insights on Model Training

- Checkpointing frequency depends on job stability and file system performance
- For large models (20B-200B parameters), writing a checkpoint can take 30 minutes or more
- The goal is to minimize the slowdown (a 1-2% speed reduction is acceptable)
- Storage can become expensive with very large models
- A perfect restart is challenging; some time/work is always lost around checkpoints
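The trade-off between checkpoint overhead and lost work can be sketched with the classic Young/Daly approximation. The 30-minute write time comes from the discussion above; the once-a-week failure rate is an illustrative assumption, not a figure from the episode.

```python
import math

def optimal_checkpoint_interval(write_secs: float, mtbf_secs: float) -> float:
    """Young/Daly approximation: the interval that balances checkpoint
    write overhead against expected lost work when a node fails."""
    return math.sqrt(2 * write_secs * mtbf_secs)

def overhead_fraction(write_secs: float, interval_secs: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints."""
    return write_secs / (write_secs + interval_secs)

# 30-minute checkpoint writes; assume one node failure per week on average.
write = 30 * 60
mtbf = 7 * 24 * 3600
interval = optimal_checkpoint_interval(write, mtbf)
print(f"checkpoint every {interval / 3600:.1f} h, "
      f"overhead {overhead_fraction(write, interval):.1%}")
```

Under these assumptions the overhead lands a bit above the 1-2% target mentioned above, which is one reason teams push checkpoint writes to be asynchronous or faster.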

- The team values its ability to solve problems quickly rather than specific technical artifacts
- Team members were a mix of ex-Google colleagues and fresh PhD graduates
- Confidence stems from general problem-solving skills, not identical technical infrastructure

- GQA (grouped-query attention): considered a "no-brainer" improvement over MQA
- SwiGLU: an initially under-appreciated paper by Noam Shazeer that later gained recognition
- Most architectural modifications don't significantly impact performance
- RoPE (rotary positional embedding) is now a default choice, with advantages like better context extrapolation
- The vanilla transformer has evolved incrementally over 4-5 years
- Current transformer recipes (like the "Noam" architecture) are strong baselines that are difficult to improve upon significantly
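For reference, the SwiGLU feed-forward block mentioned above is simple to state. This is a minimal NumPy sketch of the forward pass only (the dimensions and weight names are illustrative, not from any particular model):

```python
import numpy as np

def swish(x):
    """SiLU/Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: down( swish(x @ w_gate) * (x @ w_up) ).
    It replaces the ReLU MLP of the vanilla transformer with a gated unit."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((4, d_model))   # a batch of 4 token vectors
out = swiglu_ffn(x,
                 rng.standard_normal((d_model, d_ff)),
                 rng.standard_normal((d_model, d_ff)),
                 rng.standard_normal((d_ff, d_model)))
print(out.shape)  # (4, 8)
```

Note the block carries three weight matrices instead of two, which is why SwiGLU variants usually shrink `d_ff` to keep the parameter count comparable.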

- Encoder-decoder vs. decoder-only is a fundamental architectural choice
- Encoder-decoder models have unique properties:
  * Similar to prefix language models
  * Provide "intrinsic sparsity" (an encoder-decoder with N params is computationally equivalent to a decoder-only model with N/2 params)
  * Allow more flexibility in handling long contexts
  * The encoder can use aggressive pooling or sparse attention techniques
- Architectural changes need to be simple to implement to gain widespread adoption
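The "intrinsic sparsity" point can be made concrete with the usual rough rule of thumb that a forward pass costs about 2 FLOPs per parameter per token. This sketch deliberately ignores cross-attention and assumes an even encoder/decoder parameter split, so it is an approximation, not an exact accounting:

```python
def forward_flops(params: float, tokens: int) -> float:
    """Rule of thumb: ~2 * params FLOPs per token on the forward pass."""
    return 2.0 * params * tokens

N = 2e9      # total parameters
seq = 1024   # tokens processed

# Decoder-only: every token is processed by all N parameters.
dec_only = forward_flops(N, seq)

# Encoder-decoder with N params split evenly: each token passes through
# only its half of the model (encoder for inputs, decoder for outputs),
# so per-token compute matches a decoder-only model of N/2 params.
enc_dec = forward_flops(N / 2, seq)

print(enc_dec / dec_only)  # 0.5
```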

Model Evaluation and Benchmarking

- Llama 3 is seen as a significant improvement, potentially catching up to Google's capabilities
- Llama 1 and Llama 2 are viewed as "mid-tier" models
- The Phi models (Phi-1, Phi-2, Phi-3) are discussed, with some skepticism about their synthetic-data training approach

- Traditional benchmarks like GSM8K are becoming saturated and less meaningful
- LMSYS is considered the most legitimate current evaluation platform
- Existing benchmarks are seen as potentially contaminated by community exposure
- There's a need for new, robust evaluation methods
- A good evaluation set is partially hidden and not fully exposed to the community
- Academics are seen as potentially key to steering evaluation methodologies

- Two potential evaluation approaches are discussed:
  1. LLM-as-judge
  2. Arena-style pairwise comparisons
- They use third-party data companies for evaluations, not internal staff
- Researchers often conduct their own evaluations by examining model outputs
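Arena-style comparisons are typically aggregated into a rating, most commonly with Elo-style updates over pairwise outcomes. This is a generic sketch of that update rule, not LMSYS's exact methodology; the starting ratings and K-factor are illustrative:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update from a single pairwise comparison.
    score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two models start at 1000; A wins three head-to-head votes, loses one.
a, b = 1000.0, 1000.0
for score in (1.0, 1.0, 0.0, 1.0):
    a, b = elo_update(a, b, score)
print(round(a), round(b))
```

Rating points are zero-sum between the two models in each update, so the total stays constant while the gap reflects the win record.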

- There are significant challenges with benchmark contamination and reuse
- Withholding test sets can make benchmarks unpopular or impractical
- Academic and research incentives often push for benchmark accessibility and citations
- Benchmarking is fundamentally an "incentive problem"
- SWE-bench is highlighted as a popular coding-agent benchmark this year

Emerging Trends in AI

- Discussion focuses on multimodal AI models, particularly language-vision integration
- There's a preference for "early fusion" approaches in multimodal models
- Current model development is influenced by organizational structures (referencing Conway's Law)
- The trend is moving towards unifying different modalities in AI models
- GPT-4o and Meta's Chameleon paper are mentioned as examples of early fusion models

- There's a distinction between "screen modality" and "general vision" in AI models
- Future models should be capable of understanding both screens and natural images
- Screen intelligence is seen as an area needing more capability enhancement
- Ideal models would be versatile across image types (mobile screens, laptop screens, websites, etc.)

- Chinchilla scaling laws (scaling model parameters and data in a fixed proportion) are considered not definitive
- Different perspectives on optimal scaling include compute and data scaling, and inference-based scaling
- Model training continues until performance saturation, but the exact point of saturation is unclear
- Examples like Llama 3 (15T tokens) demonstrate ongoing experimentation with scaling
- Critique of early models that trained large parameter counts on few tokens
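The Llama 3 example can be put in numbers with two standard approximations: training compute C ≈ 6ND, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. The 8B parameter count below is chosen for illustration:

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: training compute C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float) -> float:
    """Chinchilla rule of thumb: ~20 training tokens per parameter."""
    return 20.0 * n_params

n = 8e9                                       # an 8B-parameter model
d_chinchilla = chinchilla_optimal_tokens(n)   # ~160B tokens
d_llama3 = 15e12                              # Llama 3's ~15T training tokens

print(d_llama3 / d_chinchilla)        # ~94x past the Chinchilla-optimal point
print(train_flops(n, d_llama3))       # ~7.2e23 FLOPs
```

Training far past the compute-optimal point trades extra training compute for a smaller, cheaper-to-serve model, which is the inference-based-scaling perspective mentioned above.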

- Emerging trend of models with extremely long context windows (a million to a hundred million tokens)
- Debate about the future of long context vs. retrieval-augmented generation (RAG)
- Key challenges include the lack of robust evaluation benchmarks, determining appropriate use cases, and cost
- Long context is seen as potentially valuable for complex tasks requiring comprehensive understanding
- RAG remains useful for specific tasks like fact retrieval
- Potential for hybrid approaches combining long context and retrieval methods

- Mixture-of-experts (MoE) models are viewed as a promising architectural approach
- They allow more parameters while keeping computational costs (FLOPs) relatively low
- The current trend seems to settle on around 8 experts, but more might provide additional gains
- Potentially offer a way to slightly "cheat" scaling laws by activating fewer parameters per token
- It's unclear how the MoE architecture impacts model capabilities
- One participant is "bullish" on MoEs, while another sees them as a modest performance improvement
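The "more parameters, fewer active FLOPs" point comes from top-k routing: each token is sent to only a couple of experts. This is a toy NumPy sketch of that routing for a single token; the expert count, gating weights, and linear "experts" are all illustrative simplifications:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Top-k mixture-of-experts routing for a single token vector x.
    Only top_k experts run, so active compute stays low even though
    total parameters grow with the number of experts."""
    logits = x @ gate_w                   # one gating logit per expert
    top = np.argsort(logits)[-top_k:]     # indices of the top_k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()              # softmax over the chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 8
gate_w = rng.standard_normal((d, n_experts))
# Each "expert" here is just a linear map, to keep the sketch small.
mats = [rng.standard_normal((d, d)) for _ in range(n_experts)]
experts = [lambda v, m=m: v @ m for m in mats]

x = rng.standard_normal(d)
y = moe_forward(x, gate_w, experts)
print(y.shape)  # (8,)
```

With 8 experts and top-2 routing, only a quarter of the expert parameters are active per token, which is the sense in which MoEs "cheat" dense scaling laws.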

Efficiency and Open Source Considerations

- Implementing certain machine learning methods can significantly slow down performance, sometimes making them 10x slower in practice
- Theoretical efficiency doesn't always translate to actual computational efficiency
- Comparing models accurately is hard because of throughput considerations and hardware optimization issues
- Researchers can market methods as "efficient" by highlighting reduced parameters, even if the method is actually slower
- Adding complexity to models can reduce performance, especially if it isn't hardware-optimized

- The speaker is not fundamentally against open source but is critical of certain narratives
- Distinguishes between large organizations releasing open weights (like Meta's Llama) and grassroots, bottom-up open source development
- Observes that major progress is still primarily driven by big labs and academic institutions
- Critiques the open source community's tendency to rename existing models and create "quick wins" based on others' work

Personal Productivity and Work-Life Balance

- Historically had little separation between work and personal life
- Used to work constantly, especially during his PhD and Google periods
- Enjoyed writing code and running experiments, which made work feel less like work
- Recognizes this approach wasn't the healthiest
- Recently became a father and is trying to have more balance in life

- Previously checked arXiv every morning when the day's new submissions were posted
- Would skim 1-2 interesting papers
- Now reads fewer papers
- Acknowledges that constantly checking for new papers might reflect a "newness bias"
- Runs an AI newsletter that summarizes Twitter and Reddit content
- Uses this as a personal method to stay informed about AI news
- Now relies on Twitter to surface important research papers instead of comprehensive reading

- Starts writing with the end goal/story in mind
- Creates draft titles early in research projects to maintain motivation and focus
- Emphasizes "shipping" the final research product as critical

Academic Background and International Perspectives

- Completed his PhD at Nanyang Technological University (NTU) in Singapore
- Describes himself as a "regular" undergraduate with decent but not exceptional grades
- Entered the PhD program somewhat accidentally, driven by curiosity and a desire to stay in Singapore
- Discovered his research aptitude during the PhD process
- Had minimal advisor guidance, describing his advisor as essentially a "Grammarly-like" editing tool

- Experienced significant culture shock transitioning from Singapore's research environment to Google
- Critiques Singapore's research community as focused on paper publication rather than impact, and as less sophisticated than US research culture
- Experienced substantial personal growth adapting to Google's research standards
- Learned research skills largely through self-supervision

- On building an AI ecosystem like Silicon Valley, suggests governments should avoid over-intervention, respect local talent, be cautious about moving in the wrong direction, fund startups and academic conferences, and facilitate international exposure
- Acknowledges a significant "brain drain" to the US in AI
- Expresses concern about US cultural and technological hegemony in AI development
- Highlights the importance of "AI sovereignty" for different nations
- Suggests that exposure and international experience can dramatically expand one's worldview

- Emphasizes a shift towards individual contributors (ICs) rather than traditional management structures
- Senior technical people should be hands-on and writing code
- Impact is now driven by practitioners actually doing the work, not managers
- The complexity of AI requires deep technical expertise
- The era of "people who talk" is over; success in AI now depends on skilled individual contributors who can directly implement and innovate
