
Latent Space: The AI Engineer Podcast

Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)

Overview

Content

DeepSeek V3 Model and Industry Context

- Currently ranked 7th on the LM Arena leaderboard with a score of 1319
- Considered the best open-weights model as of January 2025
- Uses native FP8 mixed-precision training
- Features multi-head latent attention (MLA) and a new multi-token prediction objective
- Contains 256 experts
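The 256-expert design is a mixture-of-experts (MoE) setup: a learned gate scores every expert per token and only the top-k highest-scoring experts run. A minimal illustrative sketch (random logits stand in for a learned gate; real routing happens per layer, with DeepSeek V3 activating 8 routed experts per token):

```python
import numpy as np

# Toy top-k expert routing: score all experts for one token, then keep
# only the k highest-scoring experts and normalize their weights.
NUM_EXPERTS = 256
TOP_K = 8  # DeepSeek V3 activates 8 routed experts per token

rng = np.random.default_rng(0)
gate_logits = rng.normal(size=NUM_EXPERTS)  # stand-in for a learned gate

top_idx = np.argsort(gate_logits)[-TOP_K:]   # indices of chosen experts
weights = np.exp(gate_logits[top_idx])       # softmax over the top-k only
weights /= weights.sum()

print(sorted(top_idx.tolist()), weights.sum())
```

Only the selected experts' parameters are exercised per token, which is why total parameter count can be huge while per-token compute stays modest.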

- Extremely large model size makes serving difficult
- Requires specialized hardware (H200 clusters)
- Needs ~671 GB for weights in FP8, plus additional KV-cache memory
- Cannot be effectively run on standard H100 GPUs
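A back-of-the-envelope check on why H100s fall short (assuming 671B parameters at one byte each in FP8, 80 GB per H100, and 141 GB per H200):

```python
# Rough memory math for serving DeepSeek V3 in FP8 (weights only,
# before any KV cache).
PARAMS = 671e9
BYTES_PER_PARAM = 1  # FP8

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # ~671 GB for weights alone

H100_GB = 80
H200_GB = 141

h100_node = 8 * H100_GB  # 640 GB: too small even before KV cache
h200_node = 8 * H200_GB  # 1128 GB: weights fit, with headroom for KV cache

print(weights_gb, h100_node, h200_node)
```

An 8x H100 node tops out at 640 GB, so the FP8 weights alone overflow it; an 8x H200 node leaves several hundred gigabytes for KV cache.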

Model Performance and Customer Considerations

- Rate limiting from cloud providers
- High pricing from existing services
- Specific latency or time-to-first-token requirements
- Desire for full model control
- Avoiding potential API model changes

- Large models having long loading times
- Difficult debugging due to model size
- Limited support for FP8 quantization in current frameworks

Baseten's Approach and MoE Architecture

- Focuses on customers with custom models and specific workflow requirements
- Provides dedicated inference resources
- Prioritizes customer needs over generic shared endpoints
- Avoids unilateral model quantization without customer collaboration

- Potential benefits of quantization down to 6-bit during training
- Challenges in implementing block-wise FP8 kernels
- Promising benchmark results with quantized models (e.g., GSM8K score of 94.6)
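The block-wise idea is that each small block of weights gets its own scale factor, so one outlier only degrades its own block. A minimal sketch of per-block scaled quantization (int8 stands in for FP8 here, since NumPy has no native FP8 dtype; block size 128 is an illustrative choice):

```python
import numpy as np

BLOCK = 128  # elements per scaling block (illustrative)

def quantize_blockwise(w: np.ndarray, block: int = BLOCK):
    """Quantize each contiguous block with its own scale."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scales).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.02, size=1024).astype(np.float32)
q, s = quantize_blockwise(weights)
recovered = dequantize(q, s)
max_err = float(np.abs(weights - recovered).max())
print(q.dtype, max_err)
```

Because the scale is computed per 128-element block rather than per tensor, the worst-case rounding error stays proportional to each block's local magnitude.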

- Some companies, like Baidu, already use internal MoE models
- Interesting that major AI labs haven't widely adopted MoE
- Speculation that Gemini 1.5 Pro might be an MoE model
- Meta potentially didn't open-source an MoE Llama model due to training challenges
- MoE models are perceived as potentially underperforming compared to other model types
- Benchmark scores for MoE models are lower than some alternative models

Pricing and Infrastructure

- Pricing for models running on their infrastructure
- Pricing for customers using their own cloud resources
- Multi-cloud capabilities allowing horizontal scaling across different clouds
- Flexible approach enabling customers to use their committed cloud resources efficiently

- Truss, an open-source model packaging and deployment library
- Deep support for the TensorRT-LLM framework
- Custom versions of NVIDIA's Triton Inference Server for performance
- SGLang for improved developer experience
- Framework-agnostic approach supporting multiple deployment options
- NVIDIA CUDA kernels for high-performance computing

Truss Platform and Customer Priorities

- Initial goal was to make "easy things easy, hard things possible"
- Started with simple model serving requiring just two functions
- Evolved to handle more complex use cases over time

- Developed to handle multi-model, multi-step inference workflows
- Enables low-latency processing by streaming data between models
- Example use case: AI phone calls with sub-400-millisecond latency
- Requires all models to be hosted on the Baseten platform
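The streaming idea above can be sketched with Python async generators (the `transcribe` and `respond` stages here are hypothetical stand-ins for real models): each downstream stage consumes chunks as they arrive instead of waiting for the previous stage to finish, which is what keeps end-to-end latency low.

```python
import asyncio

# Toy two-stage pipeline: stage 2 starts work on each chunk as soon
# as stage 1 emits it, rather than waiting for the full output.
async def transcribe(audio_chunks):
    for chunk in audio_chunks:      # stage 1: pretend speech-to-text
        await asyncio.sleep(0)      # yield control, as real I/O would
        yield chunk.upper()

async def respond(text_stream):
    async for text in text_stream:  # stage 2: pretend LLM response
        yield f"reply:{text}"

async def main():
    stream = respond(transcribe(["hi", "there"]))
    return [out async for out in stream]

print(asyncio.run(main()))  # ['reply:HI', 'reply:THERE']
```

With both stages co-located on one platform, the hand-off between them is an in-process stream rather than a round trip over the network.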

- Transparency in platform operations
- Maintaining model output quality
- Performance metrics (latency, time to first token, throughput consistency, P95/P99)
- Security and compliance considerations
- Geographical data restrictions

Framework Comparison and Mission-Critical Inference

- SGLang has better performance than vLLM for common use cases
- SGLang offers better usability compared to TensorRT-LLM
- SGLang uniquely supports optimizations like multi-head latent attention (MLA), data-parallel (DP) attention for DeepSeek, and block-wise FP8 kernels

1. Performance at the model level (single-GPU model performance, speculative decoding, framework-specific optimizations)
2. Horizontal scaling infrastructure (beyond simple Kubernetes autoscaling, scaling across multiple replicas)
3. Developer experience (supporting complex, multi-step, multi-model workflows)

SGLang Framework Development

- Created around August 2023 as a new LLM inference engine
- January 2024: Introduced RadixAttention prefix caching
- February 2024: Added constrained decoding and jump-forward capabilities
- Mid-2024: Positioned as a full LLM inference engine

- RadixAttention KV cache using a block size of 1 (vs. the typical block size of 32)
- Higher cache hit rates, particularly beneficial for scenarios with shared system prompts
- Initially claimed performance 3x faster than vLLM
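Token-level granularity matters because a prefix match can stop at any token, not only at a 32-token block boundary. A toy sketch of the prefix-matching idea (a plain trie over token IDs; the real RadixAttention stores KV tensors at the tree nodes and handles eviction):

```python
# Toy token-level prefix cache (block size 1): a trie keyed by single
# token IDs, so any shared prefix -- e.g. a common system prompt -- is
# reusable, not just whole 32-token blocks.
class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Record a sequence whose KV entries are now cached."""
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_prefix(self, tokens):
        """Count leading tokens whose KV entries can be reused."""
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, hit = node[t], hit + 1
        return hit

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])              # a cached system prompt
print(cache.longest_prefix([1, 2, 3, 9]))  # 3 tokens reusable
```

A block-granular cache would report zero reusable tokens for the same query, since the shared prefix is shorter than one block.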

- vLLM: Easy to use, but weaker performance and messy code
- TensorRT-LLM: High performance, but difficult to extend

Advanced Techniques and Future Outlook

- Jump-forward decoding using tools like Outlines and XGrammar
- Ability to constrain output structure and potentially improve decoding speed
- XGrammar (from the MLC team) preferred over Outlines for better performance
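The speed-up comes from a simple observation: whenever the output grammar permits exactly one next token, that token can be emitted directly with no model forward pass. A minimal sketch (a fixed JSON-like template stands in for a real grammar; `model_fill` is a hypothetical stand-in for the model):

```python
# Jump-forward sketch: grammar-forced tokens are emitted directly;
# only free slots (None) require a model call.
TEMPLATE = ['{', '"name"', ':', None, ',', '"age"', ':', None, '}']

def decode(template, model_fill):
    out, calls = [], 0
    for slot in template:
        if slot is not None:   # grammar forces this token: jump forward
            out.append(slot)
        else:                  # free slot: the model must decode it
            out.append(model_fill())
            calls += 1
    return "".join(out), calls

filled, model_calls = decode(TEMPLATE, lambda: '"x"')
print(filled, model_calls)  # 2 model calls for 9 output tokens
```

Libraries like Outlines and XGrammar generalize this: they compile a grammar or JSON schema into a token-level automaton and skip ahead through any deterministic stretch.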

- Core contributors are affiliated with xAI (Elon Musk's AI company)
- Offers speculative execution for API calls and control flow in its front-end language
- Rapid growth from 2,000 to 7,000 GitHub stars
- Maintains a public roadmap and hosts bi-weekly community meetings

- EAGLE is a speculative decoding technique now supported in some frameworks
- Training of draft models is crucial for performance
- Growing interest in RL trainers for large language models
- Current RL approaches may be limited by base-model architectures
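Speculative decoding's core loop is draft-and-verify: a cheap draft model proposes several tokens, the target model checks them in a single forward pass, and only the longest agreeing prefix is kept. A greedy-acceptance sketch of one such step (token IDs are illustrative; EAGLE additionally conditions the draft on the target model's hidden features):

```python
# One step of draft-and-verify speculative decoding (greedy variant):
# accept draft tokens until they diverge from the target model's choices.
def speculative_step(draft_tokens, target_tokens):
    accepted = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        accepted += 1
    kept = draft_tokens[:accepted]
    # On mismatch, keep the target's token at the divergence point, so
    # every step still makes at least one token of progress.
    if accepted < len(target_tokens):
        kept.append(target_tokens[accepted])
    return kept

print(speculative_step([5, 7, 9, 2], [5, 7, 8, 1]))  # [5, 7, 8]
```

This is why draft-model quality is crucial: the speed-up is proportional to how many proposed tokens the target model accepts per verification pass.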

- Fine-tuning examples in healthcare (medical document extraction, jargon understanding)
- Skepticism that fine-tuning will completely disappear
- Changing prompts might be easier than full fine-tuning in many cases
- Cost-effectiveness of advanced reasoning models remains a question

Production AI Model Deployment Requirements

1. Model performance (frameworks, training/fine-tuning draft models, server reliability)
2. Infrastructure scaling (horizontal scaling, distribution across regions/clouds, resource availability)
3. Workflow enablement (multi-step/multi-model inference, low latency, reliable workflows)

- Strict latency requirements
- Large throughput support
- Prevention of performance interference between customers
- Compliance (HIPAA, SOC)
- Geo-aware traffic routing
- Minimal latency impact (50-100 milliseconds matters)
