
Latent Space: The AI Engineer Podcast

2024 in Synthetic Data and Smol Models [LS Live @ NeurIPS]

Overview

  • Synthetic data has evolved from a post-training tool to a comprehensive resource used throughout the entire LLM pipeline, with research showing it can actually improve model performance rather than cause the feared "model collapse" when properly implemented.
  • The generation of high-quality synthetic data has become increasingly sophisticated, with approaches like Hugging Face's Cosmopedia using web seeds and varied generation styles, alongside advanced filtering techniques that can transform trillions of tokens into curated, high-value training sets.
  • Small language models made remarkable progress in 2024, with 1-4B parameter models now matching the performance of models that were much larger only a generation ago, driven by extended training on more tokens and architectural optimizations that prioritize depth over width.
  • On-device AI is gaining momentum through small, efficient models that offer enhanced privacy, lower inference costs, and specialized capabilities, suggesting a shift from the "bigger is always better" paradigm toward more practical, targeted applications.
  • The field is witnessing a potential return to fine-tuning over generic prompt engineering, with domain-specific applications and synthetic data generation enabling more cost-effective specialized models across text, vision, and audio modalities.

Content

Synthetic Data in 2024

* Synthetic data has expanded across the entire LLM pipeline (pre-training, post-training, evaluation)
  * Initially used for post-training to replace human annotators
  * Now used in pre-training to generate controlled, high-quality web-like content
  * It is now possible to train an entire LLM using 100% synthetic data

* Advantages of synthetic data:
  * Enabled by strong open and closed models
  * Cheaper and faster than human annotations
  * Provides more control over data generation
  * Supported by improved inference frameworks

* Model collapse concerns:
  * Significant media and academic concern about potential data quality degradation
  * Challenges in detecting synthetic content on the web
  * Researchers are exploring methods to measure the prevalence of synthetic data, e.g., tracking proxy words (see the sketch below)
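
A toy sketch of the proxy-word measurement, assuming two crawl snapshots to compare; the word list and per-million normalization are illustrative assumptions, not details from the talk:

```python
import re
from collections import Counter

# Words whose frequency spiked after ChatGPT's release, used here as
# proxies for LLM-generated text (illustrative list, not from the talk).
PROXY_WORDS = {"delve", "tapestry", "showcasing", "underscores"}

def proxy_rate(documents):
    """Return proxy-word occurrences per million tokens for a corpus."""
    counts, total = Counter(), 0
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        total += len(tokens)
        counts.update(t for t in tokens if t in PROXY_WORDS)
    return 1e6 * sum(counts.values()) / max(total, 1)

# Compare a pre-ChatGPT crawl snapshot with a recent one:
# rate_2022 = proxy_rate(crawl_2022)   # hypothetical corpus iterables
# rate_2024 = proxy_rate(crawl_2024)
```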

* Research findings on synthetic data:
  * Researchers observed an increase in synthetic data on the web after ChatGPT's release
  * Contrary to concerns, synthetic data does not necessarily make models worse
  * Models trained on the latest data dumps showed improved performance on NLP benchmarks
  * Model-collapse concerns are more relevant in small-scale, iterative generation scenarios

Synthetic Data Generation Approaches

* Hugging Face's Cosmopedia dataset:
  * Contains ~30 billion tokens of synthetic data
  * Generation strategy (see the sketch below):
    - Use web page extracts as seed prompts
    - Generate content related to specific topics
    - Vary generation styles (e.g., middle school vs. college textbooks)
  * Consistently performed better in training compared to FineWeb
  * Enabled training 1B models on 150 billion synthetic tokens
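
A simplified sketch of this seeded-prompt construction (Cosmopedia itself was generated with Mixtral-8x7B-Instruct); the style list, truncation length, and function names are illustrative assumptions:

```python
import random

# Varying the target audience/format diversifies the generated corpus.
STYLES = [
    "a textbook chapter for middle school students",
    "a college-level textbook section",
    "a blog post for a general audience",
]

def build_prompt(web_extract: str, topic: str) -> str:
    """Compose a Cosmopedia-style generation prompt from a web seed.

    The real pipeline retrieves extracts with search tools and keeps
    them topically clustered; here both are passed in directly.
    """
    style = random.choice(STYLES)
    return (
        f"Here is an extract from a webpage about {topic}:\n"
        f"{web_extract[:1000]}\n\n"
        f"Write {style} related to this extract. Stay on topic, "
        f"but do not copy the extract verbatim."
    )

# prompt = build_prompt(extract, "photosynthesis")
# completion = teacher_model.generate(prompt)   # hypothetical LLM call
```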

* Methodology highlights:
  * Use search tools to find relevant web pages
  * Create prompts with diverse seeds
  * Maintain topical relevance in synthetic data generation
  * Different generation styles perform better on different benchmarks

* Rephrasing web content:
  * Researchers use LLMs to rewrite web samples into different formats (e.g., Wikipedia passages, Q&A pages)
  * Rewriting does not demand extensive knowledge from the model, since the content is already present in the source sample
  * Can improve dataset quality and create more diverse training data (see the sketch below)
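
A minimal sketch of format-conditioned rewriting; the templates are illustrative, not the prompts from published work, and `generate` is a placeholder for any LLM call:

```python
# Illustrative rewrite templates in the spirit of "rephrasing the web".
REWRITE_TEMPLATES = {
    "wikipedia": ("Rewrite the following web text as a clear, factual "
                  "encyclopedia-style passage:"),
    "qa": ("Convert the following web text into a question-and-answer "
           "page covering the same information:"),
}

def rephrase(sample: str, fmt: str, generate) -> str:
    """Rewrite one web sample into the requested format.

    Because the answer is grounded in `sample`, even a modest model
    can do this reliably -- no external knowledge is required.
    """
    return generate(f"{REWRITE_TEMPLATES[fmt]}\n\n{sample}")
```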

* Dataset filtering techniques:
  * FineWeb-Edu approach: used Llama 3 to rate the educational content of web pages on a 0-5 scale
  * Trained a BERT-style classifier on these synthetic annotations (see the sketch below)
  * Reduced the 15-trillion-token dataset to 1.5 trillion high-quality educational tokens
  * Demonstrated significant performance improvements on benchmarks
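
A hedged sketch of this two-stage recipe, assuming a pool of (page text, 0-5 score) annotations produced by prompting Llama 3; the `bert-base-uncased` backbone and the dataset wiring are illustrative stand-ins, not the released classifier:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Regression head: predict the 0-5 educational score assigned by Llama 3.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

def tokenize(batch):
    # The dataset's "labels" column must hold the float scores.
    return tokenizer(batch["text"], truncation=True, padding=True,
                     max_length=512)

# train_ds = annotations.map(tokenize, batched=True)  # hypothetical HF Dataset
# Trainer(model=model, args=TrainingArguments("edu-classifier"),
#         train_dataset=train_ds, tokenizer=tokenizer).train()
# Filtering pass: keep only pages whose predicted score clears a
# threshold (FineWeb-Edu keeps pages scoring >= 3).
```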

* Other notable approaches:
  * DCLM dataset: trained a classifier on instruction-tuning and Reddit content
  * Nemotron-CC: used an ensemble of classifiers for higher-quality data filtering
  * The Nemotron-CC project also generated 1.9 trillion tokens of synthetic data, significantly expanding the scale

* Synthetic data for post-training:
  * Microsoft's AgentInstruct dataset targets specific skills
  * Focuses on diverse content generation (code, brain teasers, open-domain QA)
  * Demonstrated the ability to outperform the original instruction-tuned models (e.g., an improved Mistral 7B)

* Advanced generation methods:
  * Using persona datasets to generate varied content
  * Employing multiple teacher models to create multilingual datasets
  * Using reward models to filter and select the best generations (see the sketch below)
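
The reward-model filtering step reduces to best-of-n selection; a minimal sketch where `generate` and `reward` are hypothetical stand-ins for a teacher model and a reward model:

```python
def best_of_n(prompt, generate, reward, n=8):
    """Sample n candidate completions, keep the one the reward model
    scores highest; the rest are discarded from the training set."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```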

Advances in Small Language Models

* Significant progress in small model performance in 2024:
  * Llama 3.2 1B matches the benchmark scores of earlier, much larger models
  * 3-4 billion parameter models now score high on MMLU
  * Small models can now run efficiently on devices like iPhones

* Model scaling perspective:
  * Shifting away from the "bigger is always better" narrative
  * Recognizing the trade-offs in model size:
    - Larger models improve performance
    - But they also significantly increase inference costs and computational requirements
  * Emerging focus on model efficiency over pure size

* Training strategies for small models:
  * Training smaller models for longer is becoming a significant trend
  * Meta's Llama 3 was trained on 15 trillion tokens vs. 1 trillion for the original Llama, showing performance gains through extended training
  * Meta's MobileLLM paper offers insights into small model architectures (see the sketch below):
    - Depth is more important than width
    - Techniques like grouped-query attention (GQA) and embedding tying are beneficial
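
A rough illustration of these architectural choices using a generic Llama-style config from `transformers`; the specific numbers are assumptions for a model in the ~135M-parameter class, not MobileLLM's published hyperparameters:

```python
from transformers import LlamaConfig, LlamaForCausalLM

# Many thin layers rather than few wide ones, GQA, and tied embeddings.
config = LlamaConfig(
    vocab_size=32000,
    hidden_size=576,           # narrow width...
    intermediate_size=1536,
    num_hidden_layers=30,      # ...but deep
    num_attention_heads=9,
    num_key_value_heads=3,     # GQA: 3 query heads share each KV head
    tie_word_embeddings=True,  # input/output embedding tying saves params
)
model = LlamaForCausalLM(config)  # randomly initialized, for counting only
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```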

* Notable small model developments:
  * Apple Intelligence's ~3 billion parameter on-device model, built with pruning and distillation
  * NVIDIA's hybrid transformer/state-space model with strong performance
  * SmolLM2 series:
    - Best-in-class models at several sizes
    - Trained on up to 11 trillion tokens
    - Fully open source, with training code and datasets released
  * Small vision models like SmolVLM and Moondream show promising performance

* Advantages of small models:
  * On-device inference
  * Enhanced privacy (data stays local)
  * Multiple inference frameworks available (MLX, MLC, llama.cpp, etc.; see the sketch below)
  * Can be specialized for specific tasks with fine-tuning
  * Potential for running models in the browser without an internet connection
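
For example, a quantized GGUF checkpoint can be served locally with llama-cpp-python (the Python bindings for llama.cpp); the model filename below is a placeholder:

```python
from llama_cpp import Llama

# Load a quantized small model entirely on the local machine; no data
# leaves the device. The path is a placeholder for any GGUF checkpoint.
llm = Llama(model_path="smollm2-1.7b-instruct-q4_k_m.gguf", n_ctx=2048)

out = llm("Summarize why small models suit on-device use:", max_tokens=128)
print(out["choices"][0]["text"])
```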

Practical Applications and Future Trends

* GitHub issue extraction tool:
  * Demonstrates a model that can transform free-text complaints into structured GitHub issues
  * Extracts key information such as priority, issue type, title, and estimated fix time
  * Allows issues to be created directly in the browser (see the sketch below)
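
A server-side sketch of the idea (the actual demo ran in the browser); the schema fields mirror those listed above, and `generate` is a placeholder for any small instruct model:

```python
import json

# Fields the model is asked to fill in, as described in the demo.
ISSUE_SCHEMA = {
    "title": "string",
    "priority": "low | medium | high",
    "issue_type": "bug | feature | question",
    "estimated_fix_time": "string",
}

def extract_issue(complaint: str, generate) -> dict:
    """Turn a free-text complaint into a structured GitHub issue.

    Assumes the model returns valid JSON; production use would
    validate or use constrained decoding.
    """
    prompt = (
        "Extract a GitHub issue from the complaint below. Reply with "
        f"JSON matching this schema: {json.dumps(ISSUE_SCHEMA)}\n\n"
        f"Complaint: {complaint}"
    )
    return json.loads(generate(prompt))
```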

* Domain-specific applications:
  * Domain-specific synthetic data is becoming increasingly important
  * Generating synthetic data for specialized domains (e.g., math) can improve model reasoning
  * Many researchers are working on domain-specific synthetic data generation

* Small model strategy:
  * Fine-tuning small models can be more cost-effective than using large models (see the sketch below)
  * Small models can achieve solid performance on specific tasks
  * This applies across text, vision, and audio modalities
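
One common way to keep that specialization cheap is parameter-efficient fine-tuning such as LoRA; a minimal sketch with the `peft` library, where the model choice and hyperparameters are illustrative:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Wrap a small base model with low-rank adapters; only the adapters
# train, keeping compute and memory costs low.
base = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct")
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # typically <1% of base parameters
```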

* Emerging trends:
  * Increasing popularity of on-device AI frameworks (e.g., PocketPal, Ollama)
  * Predicted growth of such frameworks in 2025
  * The field shifted from fine-tuning to prompt engineering, and may now be swinging back to fine-tuning
  * Motivation: reducing costs and improving model specialization
  * Anticipating more focus on fine-tuning and less on generic prompt engineering
