Latent Space: The AI Engineer Podcast

Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

Overview

Content

Early Career and Background

- First-generation Indian immigrant with doctor parents
- Explored career paths after rejecting traditional routes (big tech, academia)
- Sought a high-impact, creative work environment

- Internships at Microsoft working on Bing distributed compute infrastructure
- Felt unsatisfied with low-intensity work at large companies and in academic research
- Moved to San Francisco and connected with a recruiter
- Joined MemSQL (now SingleStore) as employee #2, despite thinking he had failed the interview
- Dropped out of school to pursue the opportunity
- Worked there for almost six years, running the engineering team

SingleStore Experience and Technology Insights

- Not suitable for small weekend projects due to hardware and software costs
- More expensive than alternatives but still cheaper than Oracle Exadata or SAP HANA
- Limited adoption due to complexity and pricing model
- Founder Nikita Shamgunov is now pursuing a different strategy with Neon (offering free/inexpensive Postgres)

- Advanced technology doesn't guarantee widespread usability
- Pricing and packaging can significantly limit technology adoption
- Some technologies have excellent technical capabilities but are constrained by economic/deployment limitations

- Discussed variant types in technologies like Snowflake, Redshift, ClickHouse
- Snowflake's variant type is considered an engineering marvel for semi-structured data storage
- DuckDB's struct type has limitations compared to variant types

Impira Journey and Challenges

- Make unstructured data as easy to use as structured data, leveraging machine learning
- Pre-LLM models and limited data collection made the initial concept challenging
- Worked with top financial services and public enterprises

- Initially misunderstood his own sales capabilities
- Realized the complexity of getting initial meetings, closing deals at scale, and managing revenue retention
- Received guidance from his father-in-law (a seasoned sales leader at Cloudflare/Palo Alto Networks)
- Hired Jason, an exceptional account executive who closed 90-95% of Impira's business

- Shifted from selling to technical customers (like at MemSQL) to selling to line-of-business customers
- Discovered that lacking a deep, intuitive understanding of customers makes everything more challenging

- The fundamental challenge is not technological, but organizational prioritization
- CEOs/CTOs are more likely to prioritize projects that create new user experiences rather than solving existing inefficient processes
- Unstructured data solutions often remain a second or third-tier priority for large organizations

Impira's Technical Evolution and Acquisition

- Prior to acquisition, Impira's key advantage was extracting data from PDF documents with minimal training examples
- Their approach was primarily computer vision-based, leveraging visual signals in documents
- The emergence of advanced language models like BERT and ChatGPT dramatically changed the landscape
- Text-based extraction techniques began to outperform previous computer vision methods

- The speaker had difficulty convincing his team about the potential of new AI technologies like LayoutLM and GPT-3
- Became a top non-employee contributor to Hugging Face
- Experimented with LayoutLM and GPT-3
- Noticed GPT-3 significantly outperformed their existing technology
- Recognized emerging AI models were rapidly improving and could potentially "cannibalize" their existing technology

- Acquisition by Figma occurred in December 2022
- Received inbound interest due to growing AI awareness
- Worked closely with an investor (Elad) during the process
- Ultimately chose to be acquired by Figma (before Adobe's acquisition)

- Found the process of shutting down the company "extremely devastating"
- Experienced significant sadness for 3-4 months
- Recognized the emotional complexity of shutting down a startup and letting customers down

Figma Experience and Startup Closure

- Worked to provide generous refunds and support for customers during the closure
- Emphasized making difficult but right entrepreneurial decisions, even when uncomfortable

- Figma was in a unique position, dealing with an acquisition, exploring identity beyond a design tool, and maintaining an annual release cycle
- Introducing AI into Figma was complex due to high product quality standards, challenges with visual AI, and technical difficulties in applying AI to design formats

- Designers are generally skeptical of AI replacing design work
- Potential AI value for designers lies in code generation, bridging UI engineering and design, and enhancing collaboration

- Found Figma's slower iteration pace challenging
- Appreciated the company and people, but preferred a more rapid shipping environment

Birth of Braintrust

- Developed an evaluation (eval) system that helped resolve technical disagreements
- The eval system transformed discussions from hypothetical to more scientific and data-driven

- Transformers and large language models have made AI development more accessible to software engineers
- Existing ML tools are difficult for software engineers to use
- Need for evaluation tools designed specifically for software engineers' workflows

- An end-to-end developer platform for building AI products
- Core belief: Embrace evaluation as a central workflow in AI engineering
- Started by creating a highly regarded evaluation product, initially targeting software engineers

- Began as an evaluation dashboard
- Evolved into a debugger-like tool
- Progressing towards becoming an integrated development environment (IDE)

Braintrust Platform Features

- Simplified data collection and ETL process
- Logging functionality that automatically captures data in eval-ready format
- Allows users to analyze eval results, investigate performance variations, compare metrics, and modify prompts or models for quick re-testing
- Offers a collaborative, save-friendly environment for working with AI prompts and models
- Users can compare multiple prompts and models side-by-side
- Recently added capability to run evaluations directly in the playground

- Allows testing prompts against different models, including fine-tuned models
- Enables creating custom evaluations with scoring mechanisms
- Supports running pre-built and user-created evaluations
- Demonstrates evaluating summary quality of press release documents

- Support for defining custom tools using TypeScript
- Integration of external APIs (like Exa search) directly into the platform
- Ability to run tool-augmented prompts in the playground environment
- Dynamic code evaluation in a sandbox environment
- Granular comparison of different AI-generated outputs
- Easy deployment of custom tools via a simple command
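To make the custom-tool idea concrete, here is a hypothetical sketch of what defining a tool in TypeScript can look like: a named handler plus a description that a prompt can invoke. The shape, the `defineTool` helper, and the `search_web` tool are all invented for illustration; this is not Braintrust's actual tool API.

```typescript
// Hypothetical custom-tool shape: a name, a description for the model,
// and a typed handler that runs when the tool is called.
interface ToolDef<A, R> {
  name: string;
  description: string;
  handler: (args: A) => R;
}

// In a real platform this would register and deploy the tool; here it
// simply returns the definition so it can be invoked locally.
function defineTool<A, R>(tool: ToolDef<A, R>): ToolDef<A, R> {
  return tool;
}

const searchWeb = defineTool({
  name: "search_web",
  description: "Search the web and return matching result titles",
  handler: ({ query }: { query: string }): string[] =>
    // Stubbed result; a real tool would call an external search API
    // (e.g. a service like Exa) and return its hits.
    [`stub result for: ${query}`],
});

const hits = searchWeb.handler({ query: "evals" });
// hits[0] === "stub result for: evals"
```

Keeping the handler a plain typed function is what makes tools easy to test and deploy independently of any prompt that calls them.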

Technical Approach and API Integration

- Developed a novel syntax for running evaluations without complex for loops
- Eval consists of an argument with data, a task function, and one or more scoring functions
- Enables parallel and efficient eval running
- Supports caching and async processing
- Provides consistent interfaces across Python and TypeScript
- Converts evals into a declarative data structure
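The declarative shape described above (data + task + scoring functions) can be sketched as a toy harness. This is an illustrative example written for this summary, not the actual Braintrust SDK API; the `EvalSpec` type and `runEval` function are invented here.

```typescript
// A scorer compares an expected value against the task's output.
type Scorer<O> = (expected: O, output: O) => number;

interface EvalSpec<I, O> {
  data: { input: I; expected: O }[]; // the cases to evaluate
  task: (input: I) => O;             // the thing being tested
  scores: Scorer<O>[];               // one or more scoring functions
}

// Run every case through the task and average each scorer's results —
// no hand-written for loops at the call site.
function runEval<I, O>(spec: EvalSpec<I, O>): number[] {
  return spec.scores.map(
    (score) =>
      spec.data.reduce(
        (sum, { input, expected }) => sum + score(expected, spec.task(input)),
        0
      ) / spec.data.length
  );
}

// Toy usage: the "task" uppercases text; one exact-match scorer.
const avgScores = runEval({
  data: [
    { input: "hi", expected: "HI" },
    { input: "ok", expected: "OK" },
  ],
  task: (s: string) => s.toUpperCase(),
  scores: [(expected, output) => (expected === output ? 1 : 0)],
});
// avgScores[0] === 1: both cases pass the exact-match scorer
```

Because the whole eval is a plain data structure, a real runner can cache results, execute cases in parallel, and expose the same shape in both Python and TypeScript.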

- Provides a REST API endpoint for each prompt
- Allows users to spend more time crafting use cases and reusing tools
- Integrates development process tightly with evaluation

- Allows sharing and publishing eval histories
- Enables team discussions around evaluation results
- Provides interactive debugging of task and scoring functions

Development Journey and Strategy

- Worked closely with early users like Brian from Zapier for critical feedback
- Iteratively improved the product based on user suggestions
- Developed features like prompt rerunning, model comparison, and token count correlation

- Initially considered "stupid" by many VCs and industry observers
- Deemed necessary by early customers like Zapier, Coda, and Airtable who wanted data to remain in their cloud
- Supported by some investors who saw value in the approach
- Compared to Databricks' similar hybrid model
- Leveraged serverless technology as a key unlock

- Hybrid on-premises approach
- Prioritizing TypeScript SDK (now used by ~75% of users)
- Focusing initially on evaluations (evals) as a critical pain point

- Initially, some VCs were skeptical, comparing the market to CI/CD
- Subsequent market interest validated their approach
- Impressive customer logos including Stripe and Vercel
- Notable quote from Malte (former Google search team member) praising Braintrust's workflow transformation

Market Insights and Technology Perspectives

- Parallels between the current AI market and the early cloud computing era
- The market is highly dynamic, with significant technological shifts happening rapidly
- Companies are treating AI as an existential question fundamentally changing software development

- Vector search is not typically a storage or performance bottleneck
- The real challenge is integrating vector search with other data systems
- Databases are not just storage, but also compilers
- Examples of innovative database approaches include Snowflake separating storage from compute and Databricks making arbitrary code a first-class citizen

- Fine-tuning is not necessarily a business in itself
- The core goal is "automatic optimization" of use cases
- Alternative optimization methods include DSPy-style prompt optimization, hand-crafting prompts, and in-context learning
- Very few customers are currently fine-tuning models in production
- The landscape of model optimization is rapidly changing

AI Model Landscape and Trends

- Pre-Claude 3, OpenAI dominated nearly 100% of the market
- Post-Claude 3, customers are now evaluating both OpenAI and Anthropic
- Anthropic's Haiku was particularly notable for being cheap, fast, and supporting tool calling
- Sonnet is now seen as both affordable and capable
- OpenAI remains the overwhelming majority choice in production environments

- OpenAI excels in model availability, rate limits, and reliability
- Their single endpoint approach is a significant engineering achievement
- Managing multiple cloud endpoints is complex and requires substantial engineering effort

- Big companies don't exclusively use specific cloud providers for AI models
- There's a diverse ecosystem with multiple options and tradeoffs
- Different model labs (OpenAI, Anthropic, Meta) are actively competing and innovating
- OpenAI's GPT-4o release is seen as potentially invigorating competition

AI Use Cases and Future Directions

- Approximately 50% involve single prompt manipulations (auto-generating ticket titles, video/document summaries)
- About 25% involve simple agents (prompt + tools, often RAG-based)
- Remaining 25% are advanced agents with more complex interactions

- Initial AI integration involved complex, mathematically oriented programming
- Current trend is "sprinkling intelligence" throughout applications
- Goal is to make AI implementation easy and low-friction
- Developers want AI to feel like a natural part of building software, not a separate paradigm

- Advocates for designing AI agents focused on user experience rather than complex technical implementations
- Suggests writing more UI code between LLM calls to craft user interactions
- Introduces the concept of "code core versus LLM core" — keeping the core system well-defined and using LLMs sparingly
- Highlights the Voyager agent as an innovative approach (writes and persists code for future reuse)
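The "code core versus LLM core" idea can be sketched in a few lines: keep control flow in ordinary code and isolate the model behind one narrow, typed function. The `generateTicketTitle` pipeline and the `LLMCall` stand-in below are invented for illustration, not taken from any real system.

```typescript
// The model is hidden behind a single narrow interface.
type LLMCall = (prompt: string) => string;

// The core pipeline is deterministic code; the model is only asked
// for the one step that genuinely needs it (drafting a title).
function generateTicketTitle(description: string, callLLM: LLMCall): string {
  const trimmed = description.trim().slice(0, 500);              // code: normalize input
  const draft = callLLM(`Write a short ticket title for: ${trimmed}`); // LLM: one call
  return draft.replace(/\s+/g, " ").trim().slice(0, 80);         // code: enforce invariants
}

// With a stubbed model, the surrounding code is fully testable.
const fakeLLM: LLMCall = () => "  Fix   login  crash  ";
const title = generateTicketTitle("Users report a crash on login...", fakeLLM);
// title === "Fix login crash"
```

The payoff of keeping the core in code is that output length, whitespace, and formatting are guaranteed by the pipeline regardless of what the model returns.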

Personal Reflections and Company Update

- Acknowledges a shift in perspective about technical expertise
- Recognizes that practical problem understanding is as valuable as deep technical knowledge
- Currently in a unique position of understanding AI tools from a user's perspective

- Deeply enjoying his current work environment at Braintrust
- Values working with a team he respects, including his brother, Eden (head of product, first designer at Airtable and Cruise), and Albert (handles business operations)
- Prioritizes working on meaningful problems, enjoying his work environment, and collaborating with people he respects

- Braintrust is currently hiring software engineers, salespeople, DevRel, and one designer
- Primarily seeking San Francisco-based candidates, with some flexibility for remote candidates
- Building AI software and passionate about their problem space
- Interested in working with high-quality customers and team members
