
Latent Space: The AI Engineer Podcast

Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)

Overview

Content

DeepSeek V3 Model and Industry Context

- Currently ranked 7th on the LM Arena leaderboard with a score of 1319
- Considered the best open-weights model as of January 2025
- Uses native FP8 mixed-precision training
- Features multi-head latent attention (MLA) and a new multi-token prediction objective
- Contains 256 experts
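The 256-expert design is a mixture-of-experts (MoE) setup: a learned gate scores every expert per token and only the top-k highest-scoring experts run. A minimal illustrative sketch (random logits stand in for a learned gate; real routing happens per layer, with DeepSeek V3 activating 8 routed experts per token):

```python
import numpy as np

# Toy top-k expert routing: score all experts for one token, then keep
# only the k highest-scoring experts and normalize their weights.
NUM_EXPERTS = 256
TOP_K = 8  # DeepSeek V3 activates 8 routed experts per token

rng = np.random.default_rng(0)
gate_logits = rng.normal(size=NUM_EXPERTS)  # stand-in for a learned gate

top_idx = np.argsort(gate_logits)[-TOP_K:]   # indices of chosen experts
weights = np.exp(gate_logits[top_idx])       # softmax over the top-k only
weights /= weights.sum()

print(sorted(top_idx.tolist()), weights.sum())
```

Only the selected experts' parameters are exercised per token, which is why total parameter count can be huge while per-token compute stays modest.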

- Extremely large model size makes serving difficult
- Requires specialized hardware (H200 clusters)
- Needs ~671 GB for weights in FP8, plus additional KV-cache memory
- Cannot be effectively run on standard H100 GPUs
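A back-of-the-envelope check on why H100s fall short (assuming 671B parameters at one byte each in FP8, 80 GB per H100, and 141 GB per H200):

```python
# Rough memory math for serving DeepSeek V3 in FP8 (weights only,
# before any KV cache).
PARAMS = 671e9
BYTES_PER_PARAM = 1  # FP8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~671 GB for weights alone

H100_GB = 80
H200_GB = 141

h100_node = 8 * H100_GB  # 640 GB: too small even before KV cache
h200_node = 8 * H200_GB  # 1128 GB: weights fit, with headroom for KV cache

print(weights_gb, h100_node, h200_node)
```

An 8x H100 node tops out at 640 GB, so the FP8 weights alone overflow it; an 8x H200 node leaves several hundred gigabytes for KV cache.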

Model Performance and Customer Considerations

- Rate limiting from cloud providers
- High pricing from existing services
- Specific latency or time-to-first-token requirements
- Desire for full model control
- Avoiding potential API model changes

- Large models having long loading times
- Difficult debugging due to model size
- Limited support for FP8 quantization in current frameworks

Baseten's Approach and MoE Architecture

- Focuses on customers with custom models and specific workflow requirements
- Provides dedicated inference resources
- Prioritizes customer needs over generic shared endpoints
- Avoids unilateral model quantization without customer collaboration

- Potential benefits of quantization down to 6-bit during training
- Challenges in implementing block-wise FP8 kernels
- Promising benchmark results with quantized models (e.g., GSM8K score of 94.6)
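The block-wise idea is that each small block of weights gets its own scale factor, so one outlier only degrades its own block. A minimal sketch of per-block scaled quantization (int8 stands in for FP8 here, since NumPy has no native FP8 dtype; block size 128 is an illustrative choice):

```python
import numpy as np

BLOCK = 128  # elements per scaling block (illustrative)

def quantize_blockwise(w: np.ndarray, block: int = BLOCK):
    """Quantize each contiguous block with its own scale."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=1024).astype(np.float32)
q, s = quantize_blockwise(weights)
recovered = dequantize(q, s)
max_err = float(np.abs(weights - recovered).max())
print(q.dtype, max_err)
```

Because the scale is computed per 128-element block rather than per tensor, the worst-case rounding error stays proportional to each block's local magnitude.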

- Some companies, like Baidu, already use internal MoE models
- Interesting that major AI labs haven't widely adopted MoE
- Speculation that Gemini 1.5 Pro might be an MoE model
- Meta potentially didn't open-source an MoE Llama model due to training challenges
- MoE models are perceived as potentially underperforming compared to other model types
- Benchmark scores for MoE models are lower than some alternative models

Pricing and Infrastructure

- Pricing for models running on their infrastructure
- Pricing for customers using their own cloud resources
- Multi-cloud capabilities allowing horizontal scaling across different clouds
- Flexible approach enabling customers to use their committed cloud resources efficiently

- Truss, an open-source model packaging and deployment library
- Deep support for the TensorRT-LLM framework
- Custom versions of NVIDIA's Triton Inference Server for performance
- SGLang for improved developer experience
- Framework-agnostic approach supporting multiple deployment options
- NVIDIA CUDA kernels for high-performance computing

Truss Platform and Customer Priorities

- Initial goal was to make "easy things easy, hard things possible"
- Started with simple model serving requiring just two functions
- Evolved to handle more complex use cases over time

- Developed to handle multi-model, multi-step inference workflows
- Enables low-latency processing by streaming data between models
- Example use case: AI phone calls with sub-400-millisecond latency
- Requires all models to be hosted on the Baseten platform
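The streaming idea above can be sketched with Python async generators (the `transcribe` and `respond` stages here are hypothetical stand-ins for real models): each downstream stage consumes chunks as they arrive instead of waiting for the previous stage to finish, which is what keeps end-to-end latency low.

```python
import asyncio

# Toy two-stage pipeline: stage 2 starts work on each chunk as soon
# as stage 1 emits it, rather than waiting for the full output.
async def transcribe(audio_chunks):
    for chunk in audio_chunks:      # stage 1: pretend speech-to-text
        await asyncio.sleep(0)      # yield control, as real I/O would
        yield chunk.upper()

async def respond(text_stream):
    async for text in text_stream:  # stage 2: pretend LLM response
        yield f"reply:{text}"

async def main():
    stream = respond(transcribe(["hi", "there"]))
    return [out async for out in stream]

print(asyncio.run(main()))  # ['reply:HI', 'reply:THERE']
```

With both stages co-located on one platform, the hand-off between them is an in-process stream rather than a round trip over the network.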

- Transparency in platform operations
- Maintaining model output quality
- Performance metrics (latency, time to first token, throughput consistency, P95/P99)
- Security and compliance considerations
- Geographical data restrictions

Framework Comparison and Mission-Critical Inference

- SGLang has better performance than vLLM for common use cases
- SGLang offers better usability compared to TensorRT-LLM
- SGLang uniquely supports optimizations like multi-head latent attention (MLA), data-parallel (DP) attention for DeepSeek, and block-wise FP8 kernels

1. Performance at the model level (single-GPU model performance, speculative decoding, framework-specific optimizations)
2. Horizontal scaling infrastructure (beyond simple Kubernetes autoscaling, scaling across multiple replicas)
3. Developer experience (supporting complex, multi-step, multi-model workflows)

SGLang Framework Development

- Created around August 2023 as a new LLM inference engine
- January 2024: Introduced RadixAttention prefix caching
- February 2024: Added constrained decoding and jump-forward capabilities
- Mid-2024: Positioned as a full LLM inference engine

- RadixAttention KV cache using a block size of 1 (vs. the typical block size of 32)
- Higher cache hit rates, particularly beneficial for scenarios with shared system prompts
- Initially claimed performance 3x faster than vLLM
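Token-level granularity matters because a prefix match can stop at any token, not only at a 32-token block boundary. A toy sketch of the prefix-matching idea (a plain trie over token IDs; the real RadixAttention stores KV tensors at the tree nodes and handles eviction):

```python
# Toy token-level prefix cache (block size 1): a trie keyed by single
# token IDs, so any shared prefix -- e.g. a common system prompt -- is
# reusable, not just whole 32-token blocks.
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a sequence whose KV entries are now cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Count leading tokens whose KV entries can be reused."""
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, hit = node[t], hit + 1
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])              # a cached system prompt
print(cache.longest_prefix([1, 2, 3, 9]))  # 3 tokens reusable
```

A block-granular cache would report zero reusable tokens for the same query, since the shared prefix is shorter than one block.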

- vLLM: Easy to use, but weaker performance and messy code
- TensorRT-LLM: High performance, but difficult to extend

Advanced Techniques and Future Outlook

- Jump-forward decoding using tools like Outlines and XGrammar
- Ability to constrain output structure and potentially improve decoding speed
- XGrammar (from the MLC team) preferred over Outlines for better performance
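The speed-up comes from a simple observation: whenever the output grammar permits exactly one next token, that token can be emitted directly with no model forward pass. A minimal sketch (a fixed JSON-like template stands in for a real grammar; `model_fill` is a hypothetical stand-in for the model):

```python
# Jump-forward sketch: grammar-forced tokens are emitted directly;
# only free slots (None) require a model call.
TEMPLATE = ['{', '"name"', ':', None, ',', '"age"', ':', None, '}']

def decode(template, model_fill):
    out, calls = [], 0
    for slot in template:
        if slot is not None:   # grammar forces this token: jump forward
            out.append(slot)
        else:                  # free slot: the model must decode it
            out.append(model_fill())
            calls += 1
    return "".join(out), calls

filled, model_calls = decode(TEMPLATE, lambda: '"x"')
print(filled, model_calls)  # 2 model calls for 9 output tokens
```

Libraries like Outlines and XGrammar generalize this: they compile a grammar or JSON schema into a token-level automaton and skip ahead through any deterministic stretch.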

- Core contributors are affiliated with xAI (Elon Musk's AI company)
- Offers speculative execution for API calls and control flow in its front-end language
- Rapid growth from 2,000 to 7,000 GitHub stars
- Maintains a public roadmap and hosts bi-weekly community meetings

- EAGLE is a speculative decoding technique now supported in some frameworks
- Training of draft models is crucial for performance
- Growing interest in RL trainers for large language models
- Current RL approaches may be limited by base-model architectures
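Speculative decoding's core loop is draft-and-verify: a cheap draft model proposes several tokens, the target model checks them in a single forward pass, and only the longest agreeing prefix is kept. A greedy-acceptance sketch of one such step (token IDs are illustrative; EAGLE additionally conditions the draft on the target model's hidden features):

```python
# One step of draft-and-verify speculative decoding (greedy variant):
# accept draft tokens until they diverge from the target model's choices.
def speculative_step(draft_tokens, target_tokens):
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    kept = draft_tokens[:accepted]
    # On mismatch, keep the target's token at the divergence point, so
    # every step still makes at least one token of progress.
    if accepted < len(target_tokens):
        kept.append(target_tokens[accepted])
    return kept

print(speculative_step([5, 7, 9, 2], [5, 7, 8, 1]))  # [5, 7, 8]
```

This is why draft-model quality is crucial: the speed-up is proportional to how many proposed tokens the target model accepts per verification pass.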

- Fine-tuning examples in healthcare (medical document extraction, jargon understanding)
- Skepticism that fine-tuning will completely disappear
- Changing prompts might be easier than full fine-tuning in many cases
- Cost-effectiveness of advanced reasoning models remains a question

Production AI Model Deployment Requirements

1. Model performance (frameworks, training/fine-tuning draft models, server reliability)
2. Infrastructure scaling (horizontal scaling, distribution across regions/clouds, resource availability)
3. Workflow enablement (multi-step/multi-model inference, low latency, reliable workflows)

- Strict latency requirements
- Large throughput support
- Prevention of performance interference between customers
- Compliance (HIPAA, SOC)
- Geo-aware traffic routing
- Minimal latency impact (50-100 milliseconds matters)
