Overview
- The open model landscape expanded dramatically from just five major players in 2023 to a diverse ecosystem in 2024, with new entrants like Google's Gemma, AI2's OLMo family, and Meta's Llama 3 serving both research purposes and practical applications.
- A significant trend toward "fully open" models emerged in 2024, with organizations releasing complete development pipelines (checkpoints, training data, code, logs) to enable collaborative research, exemplified by AI2's OLMo 2, currently the state-of-the-art fully open language model.
- The industry faces a growing compute divide between organizations with access to massive GPU infrastructure (10,000-50,000 GPUs) and those with limited resources, creating rising computational barriers to innovation in AI development.
- Training data availability is becoming a critical challenge as more websites block web crawling in response to AI models, disproportionately disadvantaging open-source AI development while benefiting established closed AI labs.
- Mistral AI exemplifies rapid innovation in open models, releasing multiple models since its 2023 founding, including specialized offerings for edge devices, coding, and multimodal applications, with a tiered licensing approach balancing openness with commercial viability.
Content: Latent Space Live Mini Conference at NeurIPS 2024
Open Models Landscape Overview
* Latent Space Live held its first mini conference at NeurIPS 2024 in Vancouver with 200 in-person attendees and 2,200 online viewers, focusing on the state of open models in 2024.
* Luca Soldaini, Research Scientist at the Allen Institute for AI (AI2), presented a recap of open model themes in 2024, highlighting the dramatic expansion of the ecosystem:
  * 2023 had only five major open LLM players (Mistral, MPT, Falcon, Yi, Llama)
  * 2024 saw significant expansion with new models from Google (Gemma), Cohere (Command R), Alibaba (Qwen), DeepSeek, the Allen Institute (OLMo, OLMoE, Pixmo, Molmo), and Meta (Llama 3)
* Open models serve multiple purposes:
  * Research: enable studies of model behavior and support evaluation and interpretability research
  * Practical applications: better performance in specific domains (e.g., retrieval), edge AI constraints, model stability, and predictable behavior
Open Source Principles and Definitions
* Open models embody core open source principles, particularly collaboration, allowing researchers to build upon existing innovations and benefit from shared resources.
* The Open Source Initiative (OSI) released the first open source AI definition in 2024, requiring:
  * Weights must be freely available
  * Code must have an open source license
  * No restrictive use-case clauses
  * Under this definition, models like Llama would not qualify
* Data transparency remains contentious:
  * OSI takes a "soft stance," requiring only "sufficient detail" to replicate the data pipeline
  * Critics note the vague language around data accessibility
  * Full data availability is not mandated
Computational Resources and Development Stages
* The AI development pipeline has varying resource requirements:
  * Pre-training requires massive GPU resources (1,000-50,000+ GPUs)
  * Post-training can be done with far fewer GPUs (as few as 8)
  * Inference and evaluation can be done with minimal resources
* 2024 saw the emergence of a "compute-rich club" with access to massive GPU infrastructure (10,000-50,000 GPUs), creating rising computational barriers to innovation.
Fully Open Models Trend
* A significant 2024 trend was "fully open" models that release the complete development pipeline:
  * Final model checkpoint, training data, code, logs, and intermediate checkpoints
  * Enables collaborative research and model improvement
* AI2's notable open model releases include:
  * OLMoE: state-of-the-art mixture-of-experts model
  * Molmo: multimodal model development recipe
  * Tülu 3: post-training model development recipe
  * OLMo 2: currently the state-of-the-art fully open language model (a minimal loading sketch follows this list)
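Because fully open models publish their weights on standard hubs, they can be run with off-the-shelf tooling. A minimal sketch, assuming a recent Hugging Face `transformers` release with OLMo 2 support and the checkpoint id `allenai/OLMo-2-1124-7B` (the exact model id, library version, and hardware requirements are assumptions, not confirmed by the talk):

```python
# Minimal sketch: load a fully open model (OLMo 2) and generate text.
# Assumes the Hugging Face `transformers` library (recent version with OLMo 2
# support) plus `accelerate` for device_map="auto"; the checkpoint id below
# is an assumed example and may differ from the actual release.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed checkpoint identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Run a short generation to sanity-check the loaded checkpoint.
inputs = tokenizer("Open language models are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```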
Training Data Challenges
* Common Crawl analysis reveals that many websites are blocking web crawling, especially in response to closed AI models (a minimal robots.txt check is sketched after this list):
  * Content owners increasingly prevent data collection, often without realizing it, via technologies like Cloudflare
  * This trend disproportionately advantages established closed AI labs
  * The community is effectively "running out of open training data"
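One visible blocking mechanism is a robots.txt rule that disallows AI crawlers. A minimal sketch using Python's standard library to check whether a given site permits a few commonly cited crawler user agents (the agent names and the `crawl_permissions` helper are illustrative assumptions, not part of the talk):

```python
# Minimal sketch: check whether a site's robots.txt disallows common AI crawlers.
# The user-agent strings below are commonly cited examples; real policies vary
# and Cloudflare-style blocking happens outside robots.txt entirely.
from urllib import robotparser

AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]  # illustrative agent names

def crawl_permissions(domain: str, path: str = "/") -> dict[str, bool]:
    rp = robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt
    return {agent: rp.can_fetch(agent, f"https://{domain}{path}") for agent in AI_CRAWLERS}

if __name__ == "__main__":
    # Example: report which crawlers the site allows on its root path.
    print(crawl_permissions("example.com"))
```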
AI Regulation and Lobbying
* Strong lobbying efforts portray open-source AI as extremely risky, often:
  * Exaggerating AI risks
  * Ignoring risk management approaches already established in the software industry
  * Overlooking ongoing safety efforts in open model development
* Recent concerns, such as claims of bio-risk, have proven largely unfounded.
* California's SB 1047 bill was potentially harmful to open AI development, but the open-source and economic communities successfully collaborated to challenge the legislation.
Mistral AI Case Study
* Founded in Paris in May 2023, Mistral AI has rapidly released multiple open-source models:
  * September 2023: first open-source model, Mistral 7B
  * December 2023: Mixtral 8x7B model
  * February 2024: Mistral Small, Mistral Large, the Le Chat chat interface, and an embedding model
  * April-May 2024: Mixtral 8x22B MoE model and the Codestral code model
  * July-November 2024: multiple releases including Ministral 3B (edge devices), Ministral 8B, Mistral NeMo 12B (NVIDIA collaboration), multimodal models (Pixtral 12B, Pixtral Large), and research models like Codestral Mamba
* Mistral's licensing approach includes:
  * Models available on major cloud platforms (Google Cloud, AWS, Azure)
  * Fine-tuning services and an open-source fine-tuning codebase
  * Premium models under the Mistral research license (free for exploration; purchase required for enterprise/production use)
* Mistral's model portfolio addresses various use cases:
  * Mistral Small: low latency
  * Mistral Large: sophisticated use cases
  * Pixtral Large: frontier multimodal model
  * Codestral: coding-focused
  * Mistral embedding model
Le Chat Demonstration and Closing
* Le Chat (Mistral's chat interface) was demonstrated at chat.mistral.ai, showcasing:
  * Image understanding/OCR
  * Code generation and execution
  * Web search
  * Image generation
  * Interactive coding (e.g., creating a Tetris game)
* Laura Hamilton from Notable Capital (formerly GGV) closed the session, inviting collaboration with entrepreneurs and researchers and mentioning partnerships with companies like HashiCorp and Brassell.