Overview
- The open model landscape expanded dramatically from just five major players in 2023 to a diverse ecosystem in 2024, with new entrants like Google's Gemma, AI2's OLMo family, and Meta's Llama 3 serving both research purposes and practical applications.
- A significant trend toward "fully open" models emerged in 2024, with organizations releasing complete development pipelines (checkpoints, training data, code, logs) to enable collaborative research, exemplified by AI2's OLMo 2, currently the state-of-the-art fully open language model.
- The industry faces a growing compute divide between organizations with access to massive GPU infrastructure (10,000-50,000 GPUs) and those with limited resources, creating rising computational barriers to innovation in AI development.
- Training data availability is becoming a critical challenge as more websites block web crawling in response to AI models, disproportionately disadvantaging open-source AI development while benefiting established closed AI labs.
- Mistral AI exemplifies rapid innovation in open models, releasing multiple models since its 2023 founding, including specialized offerings for edge devices, coding, and multimodal applications, with a tiered licensing approach balancing openness with commercial viability.
Content: Latent Space Live Mini Conference at NeurIPS 2024
Open Models Landscape Overview
* Latent Space Live held its first mini conference at NeurIPS 2024 in Vancouver with 200 in-person attendees and 2,200 online viewers, focusing on the state of open models in 2024.
* Luca Soldaini, Research Scientist at the Allen Institute for AI (AI2), presented a recap of open model themes in 2024, highlighting the dramatic expansion of the ecosystem:
  * 2023 had only five major open LLM players (Mistral, MPT, Falcon, Yi, Llama)
  * 2024 saw significant expansion with new models from Google (Gemma), Cohere (Command R), Alibaba (Qwen), DeepSeek, the Allen Institute (OLMo, OLMoE, Pixmo, Molmo), and Meta (Llama 3)
* Open models serve multiple purposes:
  * Research: enable studies of model behavior and support evaluation and interpretability research
  * Practical applications: better performance in specific domains (e.g., retrieval), edge AI constraints, model stability, and predictable behavior
Open Source Principles and Definitions
* Open models embody core open source principles, particularly collaboration, allowing researchers to build upon existing innovations and benefit from shared resources.
* The Open Source Initiative (OSI) released the first open source AI definition in 2024, requiring:
  * Weights must be freely available
  * Code must have an open source license
  * No restrictive use-case clauses
  * Under this definition, models like Llama would not qualify
* Data transparency remains contentious:
  * OSI takes a "soft stance," requiring only "sufficient detail" to replicate the data pipeline
  * Critics note the vague language around data accessibility
  * Full data availability is not mandated
Computational Resources and Development Stages
* The AI development pipeline has varying resource requirements:
  * Pre-training requires massive GPU resources (1,000-50,000+ GPUs)
  * Post-training can be done with far fewer GPUs (as few as 8)
  * Inference and evaluation can be done with minimal resources
* 2024 saw the emergence of a "compute-rich club" with access to massive GPU infrastructure (10,000-50,000 GPUs), creating rising computational barriers to innovation.
Fully Open Models Trend
* A significant 2024 trend was "fully open" models that release the complete development pipeline:
  * Final model checkpoint, training data, code, logs, and intermediate checkpoints
  * Enables collaborative research and model improvement
* AI2's notable open model releases include:
  * OLMoE: state-of-the-art mixture-of-experts model
  * Molmo: multimodal model development recipe
  * Tülu 3: post-training model development recipe
  * OLMo 2: currently the state-of-the-art fully open language model (a minimal loading sketch follows this list)
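Because fully open models publish their weights on standard hubs, they can be run with off-the-shelf tooling. A minimal sketch, assuming a recent Hugging Face `transformers` release with OLMo 2 support and the checkpoint id `allenai/OLMo-2-1124-7B` (the exact model id, library version, and hardware requirements are assumptions, not confirmed by the talk):

```python
# Minimal sketch: load a fully open model (OLMo 2) and generate text.
# Assumes the Hugging Face `transformers` library (recent version with OLMo 2
# support) plus `accelerate` for device_map="auto"; the checkpoint id below
# is an assumed example and may differ from the actual release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Run a short generation to sanity-check the loaded checkpoint.
inputs = tokenizer("Open language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```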
Training Data Challenges
* Common Crawl analysis reveals that many websites are blocking web crawling, especially in response to closed AI models (a minimal robots.txt check is sketched after this list):
  * Content owners increasingly prevent data collection, often without realizing it, via technologies like Cloudflare
  * This trend disproportionately advantages established closed AI labs
  * The community is effectively "running out of open training data"
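One visible blocking mechanism is a robots.txt rule that disallows AI crawlers. A minimal sketch using Python's standard library to check whether a given site permits a few commonly cited crawler user agents (the agent names and the `crawl_permissions` helper are illustrative assumptions, not part of the talk):

```python
# Minimal sketch: check whether a site's robots.txt disallows common AI crawlers.
# The user-agent strings below are commonly cited examples; real policies vary
# and Cloudflare-style blocking happens outside robots.txt entirely.
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]  # illustrative agent names

def crawl_permissions(domain: str, path: str = "/") -> dict[str, bool]:
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return {agent: rp.can_fetch(agent, f"https://{domain}{path}") for agent in AI_CRAWLERS}

if __name__ == "__main__":
    # Example: report which crawlers the site allows on its root path.
    print(crawl_permissions("example.com"))
```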
AI Regulation and Lobbying
* Strong lobbying efforts portray open-source AI as extremely risky, often:
  * Exaggerating AI risks
  * Ignoring risk management approaches already established in the software industry
  * Overlooking ongoing safety efforts in open model development
* Recent concerns, such as claims of bio-risk, have proven largely unfounded.
* California's SB 1047 bill was potentially harmful to open AI development, but the open-source and economic communities successfully collaborated to challenge the legislation.
Mistral AI Case Study
* Founded in Paris in May 2023, Mistral AI has rapidly released multiple open-source models:
  * September 2023: first open-source model, Mistral 7B
  * December 2023: Mixtral 8x7B model
  * February 2024: Mistral Small, Mistral Large, the Le Chat chat interface, and an embedding model
  * April-May 2024: Mixtral 8x22B MoE model and the Codestral code model
  * July-November 2024: multiple releases including Ministral 3B (edge devices), Ministral 8B, Mistral NeMo 12B (NVIDIA collaboration), multimodal models (Pixtral 12B, Pixtral Large), and research models like Codestral Mamba
* Mistral's licensing approach includes:
  * Models available on major cloud platforms (Google Cloud, AWS, Azure)
  * Fine-tuning services and an open-source fine-tuning codebase
  * Premium models under the Mistral research license (free for exploration; purchase required for enterprise/production use)
* Mistral's model portfolio addresses various use cases:
  * Mistral Small: low latency
  * Mistral Large: sophisticated use cases
  * Pixtral Large: frontier multimodal model
  * Codestral: coding-focused
  * Mistral embedding model
Le Chat Demonstration and Closing
* Le Chat (Mistral's chat interface) was demonstrated at chat.mistral.ai, showcasing:
  * Image understanding/OCR
  * Code generation and execution
  * Web search
  * Image generation
  * Interactive coding (e.g., creating a Tetris game)
* Laura Hamilton from Notable Capital (formerly GGV) closed the session, inviting collaboration with entrepreneurs and researchers and mentioning partnerships with companies like HashiCorp and Brassell.