Latent Space: The AI Engineer Podcast

Open Operator, Serverless Browsers and the Future of Computer-Using Agents

BrowserBase and Paul Klein's Background

- Previously served as CTO of Stream Club, which was acquired by Mux partly because of its internal headless browser infrastructure.
- Committed to starting another company only if it was related to browser infrastructure.
- Part of an AI Grant batch, but focused on infrastructure rather than AI models.
- Recently took his first vacation, which coincided with new AI tool releases.
- Considers himself an expert in headless browser infrastructure at scale.

Web Scraping and Automation Evolution

- Headless browsers are now necessary because many modern websites require JavaScript to load content.
- LLMs enable more generic, adaptable automation scripts that can work across different websites.
- Previously, developers had to write unique scripts for each website; now one script can generate site-specific instructions in real time.

- Websites like Airbnb dynamically load content, requiring JavaScript rendering.
- Traditional methods like curl can't capture full page content.
- LLMs can generate code to interact with websites more flexibly.
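
The gap between a plain HTTP fetch and a rendered page can be illustrated with a small heuristic. This is a minimal sketch, not something described in the episode: `visible_text`, `looks_client_rendered`, and the `min_chars` threshold are hypothetical names and values chosen for illustration. It checks whether the HTML returned by a curl-style request contains any visible text at all.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that appears outside <script> and <style> tags."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the human-visible text of an HTML document."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_client_rendered(html: str, min_chars: int = 40) -> bool:
    """Heuristic (assumed threshold): almost no visible text in the raw HTML
    suggests the content is filled in later by JavaScript."""
    return len(visible_text(html)) < min_chars
```

A page that returns only an empty root element and a script tag would be flagged, which is the case where a headless browser is needed to see the real content.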

- Computer vision models are increasingly driving UI automation.
- Rendering web pages and taking screenshots enables more advanced interaction.
- Paul was initially skeptical about vision's importance in web automation.

Technical Challenges of Browser Infrastructure

- Browsers like Chrome are large (250+ MB), making serverless deployment difficult.
- Lambda has resource limitations that make running browsers inefficient.
- Scaling browser instances across multiple users requires sophisticated infrastructure solutions.
- Stateful distributed systems are difficult to manage.

- Configuring browser extensions
- Installing fonts
- Ensuring emoji support
- Recording and observing browser sessions

- Built primarily for personal needs
- Inspired by Andrej Karpathy's talk about browsers as a core infrastructure primitive for future AI/LLM systems
- Aims to create a category-defining infrastructure company

BrowserBase Solution and Infrastructure

- Allows customers to use existing frameworks (Playwright, Selenium)
- Abstracts away complex distributed system management
- Enables easy browser connection and disconnection

- Uses Kubernetes for scheduling
- Aims to spin up thousands of browsers in milliseconds
- Employs Firecracker VM technology for quick scaling, strong multi-tenancy, and nimble infrastructure

- Moved away from Fargate because of the need for deeper control
- Believes infrastructure companies need ownership of critical-path components
- Willing to build in-house solutions for core infrastructure
- Focuses on providing flexibility (trade-offs between startup speed and cost)

- Emphasized the importance of presentation for developer tools
- Invested in a professional website (by Herb.Paris) and video production
- Believes developers evaluate companies not just on technical reliability, but also on trust and clear messaging

Localization, Proxies, and CAPTCHA Handling

- Browsers can set a locale (e.g., en-US) to determine which localized content is served
- Some websites use IP-based routing for regional content
- BrowserBase offers proxy features to route connections from specific regions
- They have a "proxy super network" that selects appropriate proxies for web automation

- Proxying at scale is complex and challenging
- BrowserBase works with multiple web proxy providers
- They conduct due diligence to ensure ethical proxy sourcing
- Can intelligently route around non-functioning proxies
- Does not operate its own proxy servers, since it considers proxies a mature market
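
Routing around non-functioning proxies can be sketched as a small health-tracking pool. The episode does not describe how BrowserBase implements this, so everything here is an assumption for illustration: the `Proxy` and `ProxyPool` classes, the failure counter, and the `max_failures` cutoff are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Proxy:
    url: str
    region: str
    failures: int = 0  # consecutive failures observed so far


@dataclass
class ProxyPool:
    """Hypothetical pool: picks a proxy per region and skips unhealthy ones."""

    proxies: list[Proxy] = field(default_factory=list)
    max_failures: int = 3  # assumed cutoff, not a documented value

    def pick(self, region: str) -> Proxy:
        """Return the proxy with the fewest failures in the requested region."""
        candidates = [
            p for p in self.proxies
            if p.region == region and p.failures < self.max_failures
        ]
        if not candidates:
            raise LookupError(f"no healthy proxy for region {region!r}")
        return min(candidates, key=lambda p: p.failures)

    def report(self, proxy: Proxy, ok: bool) -> None:
        """Record an outcome: success resets the counter, failure increments it."""
        proxy.failures = 0 if ok else proxy.failures + 1
```

Callers would `pick` a proxy for the target region, attempt the request, and `report` the result so that repeatedly failing proxies drop out of rotation.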

- BrowserBase integrates multiple CAPTCHA solvers
- Aims to provide reliable infrastructure for handling CAPTCHAs
- Monitors and maintains CAPTCHA solving capabilities
- Recognizes current limitations of CAPTCHA technology
- Believes future CAPTCHAs might distinguish between "good" and "bad" bots
- Focuses on minimizing platform abuse through careful vetting of users

The Future of Web and AI Bot Interactions

- Traditional methods of content protection (like CAPTCHAs) are seen as short-term solutions
- The focus is shifting towards identifying and managing "good" vs "bad" bots

- Future authentication will likely involve "agent authentication" (agent auth)
- Proposed model: Each human user would have an associated agent token with specific, limited permissions
- Similar to OAuth, agents would request access with defined scopes (e.g., booking an Airbnb apartment, but not messaging)
- Authentication would involve user approval and role-based access control
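
The scoped agent-token idea above can be sketched in a few lines. No such standard exists yet, so this is purely illustrative: `AgentToken`, `issue_token`, `authorize`, and the scope strings are hypothetical names, loosely modeled on how OAuth scopes narrow what a client may do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentToken:
    """Hypothetical token tying an agent to a human user and a set of scopes."""

    user_id: str
    agent_id: str
    scopes: frozenset[str]


def issue_token(user_id: str, agent_id: str,
                requested: set[str], approved_by_user: set[str]) -> AgentToken:
    """Grant only the requested scopes that the user explicitly approved."""
    granted = frozenset(requested) & frozenset(approved_by_user)
    return AgentToken(user_id, agent_id, granted)


def authorize(token: AgentToken, required_scope: str) -> bool:
    """Allow an action only if the token carries the matching scope."""
    return required_scope in token.scopes
```

For example, an agent that requests both `bookings:create` and `messages:send` but is approved only for the first could book an apartment while being refused when it tries to send a message.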

- Cloudflare is considering blocking AI bots by default
- Authentication providers might develop "hidden login as agent" features
- Potential solution involves push notifications for agent login requests
- Authentication protocols like SAML, SSO, and WebAuthn provide foundational insights

Live View and Browser Interaction Features

- Builds trust with customers
- Users can embed and watch an AI agent operating a browser in real time
- Two-way communication allows human intervention during browser tasks
- Uses the Chrome DevTools Protocol to stream browser interactions
- Useful for handling complex scenarios like two-factor authentication

- Can stream browser interactions via PNGs
- Supports pausing/resuming browser tasks
- Enables human-in-the-loop workflows for complex web interactions
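
One way to picture a pausable, human-in-the-loop task is as a Python generator that yields whenever it needs a person and resumes when given an answer. This is a sketch of the pattern only, not of BrowserBase's Live View API: `login_task` and `run_with_human` are hypothetical, and the browser steps are left as comments.

```python
from typing import Callable, Generator


def login_task(username: str) -> Generator[str, str, str]:
    """Hypothetical browser task that pauses when it needs a human."""
    # Automated steps would run here (navigate, fill in the username, ...).
    code = yield "Enter the two-factor code sent to your phone"
    # Execution resumes here once the human supplies the code.
    return f"logged in as {username} with code {code}"


def run_with_human(task: Generator[str, str, str],
                   ask_human: Callable[[str], str]) -> str:
    """Drive a task, pausing at each yield until ask_human returns an answer."""
    try:
        prompt = next(task)
        while True:
            prompt = task.send(ask_human(prompt))
    except StopIteration as done:
        return done.value
```

In a real system `ask_human` might surface the prompt in an embedded live view; here it can be any callable, such as `input` in a terminal.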

- Some tools are exploring desktop-first approaches for consumer use
- Examples include The Browser Company's Dia Browser (an AI sidebar that controls the local browser) and Google's Project Mariner
- Web agents may increasingly live alongside or integrate directly with browsers

StageHand Framework and Browser Exploration

- A web browsing framework for AI agents with three core components:
  - Observe: Identify possible actions on a webpage
  - Extract: Pull specific data using natural language instructions
  - Act: Perform actions like clicking buttons or filling forms

- Builds upon existing web automation tools like Puppeteer, Playwright, Selenium
- Aims to simplify web automation by using natural language API inputs
- Designed as a tool for building web agents, not a complete agent solution
- Allows developers to integrate browser interaction tools into their agent loops

- Completely open source with MIT license
- Users bring their own API key and LLM
- Focused on reliability rather than cost optimization
- Not primarily a web scraping tool, but more suited for AI agents and web automation
- Can serve as an integration test framework for web interactions

- Discussion about potential browser forking for AI agents
- Exploring the concept of parallel path exploration when crawling websites
- Technical challenges of truly forking browser state

Computer Use Agents and Open Operator

- Seen as demonstrating potential of AI automation, not necessarily a "company killer"
- Most exciting aspects include screenshot reasoning and ability to output step-by-step processes
- Current limitations include unreliable mouse coordinate interactions and limited viewport visibility
- StageHand's approach anchors interactions directly to DOM elements for more accuracy

- Shows how to build browser-based agents
- It's an agent loop that:
  - Takes a high-level goal
  - Breaks it down into steps
  - Uses tool calling to accomplish steps
  - Takes screenshots and uses LLM to generate actions
  - Uses Stagehand to execute actions
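
The loop described above can be sketched as follows. This is not Open Operator's actual code: `run_agent`, `AgentRun`, and the three callables are hypothetical stand-ins for the screenshot capture, the LLM call that proposes the next action, and the executor (the role Stagehand plays in the list above). The `"done"` sentinel and `max_steps` cap are also assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentRun:
    goal: str
    history: list[str] = field(default_factory=list)  # actions executed so far


def run_agent(
    goal: str,
    take_screenshot: Callable[[], bytes],
    choose_action: Callable[[str, bytes, list[str]], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> AgentRun:
    """Screenshot, ask the model for the next action, execute it, repeat."""
    run = AgentRun(goal)
    for _ in range(max_steps):
        screenshot = take_screenshot()
        # The model sees the goal, the current screen, and what was done so far.
        action = choose_action(goal, screenshot, run.history)
        if action == "done":  # assumed sentinel meaning the goal is reached
            break
        execute(action)
        run.history.append(action)
    return run
```

Passing the three steps in as callables keeps the loop testable with stubs and makes it clear that the model and the browser executor are separate, swappable pieces.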

- Paul is optimistic about computer use models' potential
- Expects more labs to launch similar technologies
- Uncertain about whether Operator will be released as an API
- Views current implementations as early-stage demonstrations of possibilities

Use Cases and Market Perspective

1. Workflow Automation (competing with UiPath)
2. Agents
3. Web Scraping

- Recommended "waterfall" approach:
  1. First, try a curl request
  2. Then try a scraping-specific API
  3. Use BrowserBase as a last resort when other methods fail
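
The waterfall can be sketched as an ordered list of fetchers tried from cheapest to most expensive. This is a minimal illustration under stated assumptions: `fetch_with_waterfall` and the `Fetcher` type are hypothetical, each tier is any callable that returns page content or `None`, and a tier that raises is treated as a failure.

```python
from typing import Callable, Optional

# A fetcher takes a URL and returns content, or None if it could not get any.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_waterfall(url: str, tiers: list[tuple[str, Fetcher]]) -> tuple[str, str]:
    """Try each tier in order (cheapest first); return (tier_name, content)."""
    for name, fetch in tiers:
        try:
            content = fetch(url)
        except Exception:
            content = None  # a failing tier simply falls through to the next
        if content:
            return name, content
    raise RuntimeError(f"all tiers failed for {url}")
```

The tiers would be supplied by the caller, for example a plain HTTP request first, then a scraping API client, then a headless browser session, so that the expensive option runs only when the cheaper ones return nothing.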

- Discussion of how many daily tasks involve complex, time-consuming form submissions
- Example of Benny app, which automates receipt submission for rebates
- Observation that millions of forms (visas, government documents) consume human time
- Hope for software that can automate unnecessary web forms

- Currently seen as a non-zero-sum market with massive potential for automation
- Expectation of future "agent platforms" that integrate multiple tools
- Recognition that complex primitives (like browsers) may require specialized solutions

Developer Tooling and Future of Software

- Search APIs
- JSON/markdown extractors
- Virtual browsers
- Virtual machines
- Code interpreters

- Argument that browsers are becoming primary computing environments
- BrowserBase founder's perspective that browsers can be run more efficiently than full operating systems
- Claim that browser-based tools can provide 90% of the functionality at 10% of the cost of a full OS
- Reference to the Marc Andreessen quote about browsers turning the OS into "device drivers"

- Paul believes future software will be more autonomous, with systems that can perform complex tasks with minimal human intervention
- Software will increasingly use other software/APIs to complete tasks, moving beyond simple button clicks and computations
- This shift requires new infrastructure, UI approaches, and developer thinking
- Emerging trends include chat interfaces, human-in-the-loop workflows, and more asynchronous processes
- Best practices for AI-driven software are still developing

Company Culture and Team Building

- Paul is a solo founder who believes in the benefits of this approach
- Advantages include faster decision-making, no co-founder alignment overhead, and direct team communication
- For DevTools, a solo founder who can build product and talk to customers can be successful
- Key requirements include ability to discuss product, willingness to engage customers, and clear core principles

- Emphasizes building a strong team with high agency and ownership
- Fully in-person work culture with balanced hours (10am start, 5-6pm end, Monday-Friday)
- Critiques extreme work models like "996" (9am-9pm, 6 days a week)
- Allows for flexible weekend "fun work" where team can explore non-roadmap projects
- The company has a diverse workforce with employees at different career stages
- Creating binary/clear-cut cultural choices (like office attendance) can help create cohesion

- Primarily recruits through personal referrals and targeted outreach
- Prefers hiring former Y Combinator CTOs, ex-founders, and future founders
- Values engineers who can make immediate impact at a company with product-market fit
- Personally messages interesting potential candidates rather than using broad recruiting tools
- Runs a weekly "run club" on Mondays for team bonding
- Encourages team members to be active on social media and "build in public"

- Using AI/web browsing to extract insights from publicly recorded meetings
- Potential strategy: Monitor local government meetings to predict real estate opportunities
- Example: Identifying when a new Walmart might be approved and buying nearby real estate
