Overview
* Video generation has seen a transformative leap with Sora's ability to create minute-long, high-detail videos at 1080p, marking a shift from image-based to video-based models and utilizing increased compute power through diffusion transformers rather than autoregressive approaches.
* Object detection is evolving beyond YOLO with emerging DETR (DEtection TRansformer) models that break through YOLO's performance plateau, converge faster (50-60 epochs), and generalize better to real-world datasets, particularly when leveraging pre-training techniques.
* Multimodal large language models struggle with fine-grained visual perception, performing only slightly better than random guessing on tasks requiring detailed visual discrimination, which highlights a fundamental limitation in current vision-language integration.
* New multimodal architectures like Florence-2 and PaliGemma 2 are addressing vision-language challenges through innovative approaches including spatial hierarchy, semantic granularity, and treating vision tasks as language tasks with specialized tokens that point to locations in pixel space.
* Grounded chain of thought approaches show promise for improving visual reasoning capabilities in VLMs, particularly for complex tasks like gauge reading, by implementing point-based annotation and explicit reasoning steps that can be traced for debugging.
Content
Vision Trends and Developments in 2024
* Latent Space Live held a mini-conference at NeurIPS 2024 in Vancouver, focusing on 2024 domain recaps, with vision being the highest interest domain (200 in-person and 2,200 online attendees)
* Major shifts in computer vision for 2024:
  - Transition from per-image to video-based models
  - DETRs emerging as alternatives to YOLO in real-time object detection
  - Widespread adoption of multimodal vision-language models
Video Generation Breakthroughs
* Sora (February 2024) highlighted as the biggest breakthrough:
  - Capable of generating 1080p, minute-long videos with high detail
  - No official paper yet; insights drawn from replication efforts
  - Previous state-of-the-art (MAGVIT) was limited to 5-second videos
* Open-Sora technical approach:
  - Uses a 3D VAE (possibly MAGVIT-v2) and a diffusion transformer
  - Shifted from an autoregressive to a diffusion transformer approach
  - Emphasizes increasing compute as a primary method for improvement
  - Highlights the challenge of compute budget limitations
* Key video generation techniques:
  - LLM-based video captioning
  - Aesthetic and motion filtering
  - Space-time latent encoding
* Diffusion Model Developments:
  - Moving from DDPM to rectified flows
  - Rectified flows can generate high-quality samples with fewer steps
  - Increasing compute appears more impactful than specific hyperparameter choices
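The rectified-flow objective mentioned above can be sketched in a few lines: interpolate linearly between a noise sample and a data sample, then regress the constant velocity along that straight path. Everything here (shapes, the oracle "model") is a toy stand-in, not actual training code from any of these systems.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_pair(x0, x1, t):
    """Linearly interpolate noise x0 and data x1 at time t in [0, 1]."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0          # constant velocity along the straight path
    return x_t, v_target

# Toy data: 1-D "images"; Gaussian noise as the source distribution.
x1 = rng.normal(loc=3.0, scale=0.1, size=(8, 4))   # data samples
x0 = rng.normal(size=(8, 4))                       # noise samples
t = rng.uniform(size=(8, 1))                       # per-sample timestep

x_t, v_target = rectified_flow_pair(x0, x1, t)

# The training loss is the MSE between the model's predicted velocity and
# v_target; here an oracle prediction stands in for the network.
v_pred = x1 - x0
loss = np.mean((v_pred - v_target) ** 2)
print(loss)  # 0.0 for the oracle
```

Because the target path is a straight line, a well-trained velocity model can integrate it with far fewer sampling steps than a DDPM-style curved trajectory, which is the sampling-efficiency point made above.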
SAM2 (Segment Anything Model 2)
* Video segmentation model with significant labeling time reduction
* Can track objects across video frames, even when objects temporarily disappear
* Key improvements over the original SAM:
  - Uses the Hiera hierarchical encoder instead of a standard ViT
  - 6x faster inference
  - Maintains a memory bank of features to cross-attend to
  - Supports various prompting methods (bounding boxes, points, masks)
* Memory Bank implementation:
  - Unique data creation paradigm that tightly integrates model and training set
  - Uses the last few video frames and prompts for real-time object tracking
  - Counterintuitively, attending to only a limited number of frames maintains performance
  - Benchmarking shows improvement over previous state-of-the-art methods
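A toy sketch of the memory-bank idea: features from the current frame cross-attend to features stored from the last K frames. The shapes, K, and the plain NumPy attention step are all stand-ins for SAM2's real architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, n_tokens = 16, 6, 4       # assumed toy dimensions

memory_bank = []                # holds feature maps from recent frames

def update_memory(frame_feats):
    """Append the newest frame's features, keeping only the last K frames."""
    memory_bank.append(frame_feats)
    if len(memory_bank) > K:
        memory_bank.pop(0)

def cross_attend(query_feats):
    """Single-head attention from current-frame tokens to all memory tokens."""
    mem = np.concatenate(memory_bank, axis=0)        # (K * n_tokens, d)
    scores = query_feats @ mem.T / np.sqrt(d)        # (n_tokens, K * n_tokens)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # softmax over memory
    return weights @ mem                             # memory-conditioned features

for _ in range(10):                                  # simulate a 10-frame video
    update_memory(rng.normal(size=(n_tokens, d)))

out = cross_attend(rng.normal(size=(n_tokens, d)))
print(out.shape)  # (4, 16)
```

The rolling window is what makes the "limited frames" observation above concrete: the model conditions on a small, fixed-size memory rather than the whole video.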
Object Detection Evolution
* YOLO object detection models have plateaued in performance
* Emerging DETR models are breaking through those performance limitations
* Three key developments in real-time object detection:
  - RT-DETR: matched or outperformed YOLO speeds
  - LW-DETR: demonstrated the effectiveness of pre-training
  - D-FINE: added advanced features and further improvements
* Pre-Training Insights:
  - Pre-training shows significant performance boosts for DETR models
  - Pre-training benefits diminish with longer training cycles
  - DETR models converge faster (50-60 epochs)
  - Demonstrated superior performance on real-world datasets
* Key Technical Innovations:
  - More efficient transformer encoder with decoupled multi-scale features
  - Comprehensive latency benchmarking, including Non-Maximum Suppression (NMS)
  - Adoption of complex loss functions from YOLO models
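Since NMS latency features in these benchmarks (DETRs are NMS-free, while YOLO-style detectors pay its cost at inference), here is a plain NumPy version of classic greedy NMS; the 0.5 IoU threshold is just a common default, not a value from the talk.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop heavy overlaps, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```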
* Research Focus:
  - Developing models that perform well with less data
  - Interest in exploring Co-DETR models and benchmarking across different inference scales
  - Desire to see more research on pre-training techniques
Multimodal LLM Visual Perception Challenges
* Current multimodal LLMs cannot effectively "see" fine-grained visual details, demonstrated by their inability to:
  - Accurately describe precise image details
  - Correctly identify object orientations
  - Distinguish between visually similar images
* MMVP Paper Highlights:
  - Investigates why multimodal LLMs struggle with fine-grained visual perception
  - Hypothesis: vision encoders initialized from CLIP lack detailed feature extraction
  - Methodology: identifies image pairs that are similar in CLIP embedding space but different in DINOv2 space
  - Creates a benchmark of multiple-choice questions from these image pairs
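The pair-mining step can be sketched as a cosine-similarity filter over two embedding spaces. The embeddings below are random stand-ins, with one "CLIP-blind pair" planted by hand so the filter has something to find; the 0.95 and 0.6 thresholds are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

n, d = 50, 32
clip_emb = rng.normal(size=(n, d))   # stand-in for CLIP image embeddings
dino_emb = rng.normal(size=(n, d))   # stand-in for DINOv2 embeddings

# Plant a known pair: near-identical to "CLIP", opposite to "DINOv2".
clip_emb[1] = clip_emb[0] + 0.01 * rng.normal(size=d)
dino_emb[1] = -dino_emb[0]

pairs = [
    (i, j)
    for i in range(n) for j in range(i + 1, n)
    if cosine(clip_emb[i], clip_emb[j]) > 0.95     # "same" to CLIP
    and cosine(dino_emb[i], dino_emb[j]) < 0.6     # "different" to DINOv2
]
print(pairs)  # the planted pair (0, 1) survives the filter
```

Each surviving pair is a candidate for a multiple-choice question that a CLIP-based vision encoder is likely to get wrong.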
* Performance Findings:
  - Current models (e.g., GPT-4V, LLaVA) perform poorly on fine-grained visual tasks
  - Models perform only slightly better than random guessing
  - Significantly below human performance in distinguishing subtle image differences
* Attempts to improve the LLaVA model:
  - Incorporating DINOv2 features showed mixed results
  - Additive mixing of features led to worse language modeling performance
  - DINOv2 features, trained in image space, are not directly compatible with text models
  - Interleaving features only marginally improved performance
Florence-2 and PaliGemma 2 Innovations
* Florence-2 Paper Highlights:
  - Aims to solve vision-language model challenges through spatial hierarchy and semantic granularity
  - Introduces three annotation paradigms: text captioning, region-text pairs, and text-phrase-region annotations
  - Feeds vision encoder features into an encoder-decoder transformer
  - Trains vision tasks as language tasks
  - Uses new tokens that point to locations in pixel space
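A minimal sketch of what "tokens that point to pixel space" can look like: normalized box coordinates quantized into a fixed number of bins and emitted as location-token strings. The 1000-bin count and the `<loc_XXX>` format here are assumptions for illustration, not the exact vocabulary either model uses.

```python
NUM_BINS = 1000  # assumed vocabulary of location tokens

def box_to_tokens(box, width, height):
    """Quantize an (x1, y1, x2, y2) pixel box into location-token strings."""
    x1, y1, x2, y2 = box
    coords = [x1 / width, y1 / height, x2 / width, y2 / height]
    bins = [min(int(c * NUM_BINS), NUM_BINS - 1) for c in coords]
    return [f"<loc_{b}>" for b in bins]

tokens = box_to_tokens((64, 32, 512, 480), width=640, height=480)
print(tokens)  # ['<loc_100>', '<loc_66>', '<loc_800>', '<loc_999>']
```

Because boxes become ordinary token sequences, detection can be trained with the same next-token objective as captioning, which is the "vision tasks as language tasks" point above.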
* Florence-2 Performance:
  - Pre-trained models transfer well
  - Achieved ~60% mAP on the COCO dataset
  - Converges faster and more efficiently
  - Currently limited by small model size (0.2-0.7 billion parameters)
  - Image- and region-level annotations perform better for object detection than also including pixel-level annotations
* PaliGemma 2 Highlights:
  - Released recently; available on Roboflow within 14 hours of launch
  - Uses a decoder-only transformer model with location tokens
  - Introduces multiple sizes of language encoders
  - Uses a prefix-loss technique for improved token attention
  - Performance increases with higher resolution and larger parameter counts
  - Achieved 47.3 on MMVP, which is significant for a 2B parameter model
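The prefix-loss technique amounts to a mixed attention mask: the image-plus-prompt prefix attends bidirectionally, while the generated suffix remains causal. A minimal sketch (the sequence sizes are arbitrary):

```python
import numpy as np

def prefix_lm_mask(prefix_len, total_len):
    """True where a token (row) may attend to another token (column)."""
    mask = np.tril(np.ones((total_len, total_len), dtype=bool))  # causal base
    mask[:prefix_len, :prefix_len] = True  # full attention within the prefix
    return mask

m = prefix_lm_mask(prefix_len=3, total_len=5)
print(m.astype(int))
# [[1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```

Suffix tokens (rows 3-4) see the whole prefix plus earlier outputs, while prefix tokens never see the answer, so image and prompt tokens can attend to each other freely without leaking the target.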
* AIMV2 Key Innovations:
  - Simplifies image-language model integration
  - Uses a vision encoder with a decoder-only transformer
  - Learns by reconstructing image tokens via mean squared error
  - Uses a randomly sampled prefix length for image tokens
  - Trained on high-quality internet-scale data
  - Shows continued performance improvement with more samples
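The random-prefix reconstruction signal can be sketched like this: sample a prefix length, condition on that prefix of image tokens, and regress the remaining tokens with MSE. A trivial mean predictor stands in for the actual model, and all shapes are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = rng.normal(size=(16, 8))          # 16 image tokens of dimension 8
prefix_len = int(rng.integers(1, len(tokens)))  # randomly sampled prefix length
prefix, targets = tokens[:prefix_len], tokens[prefix_len:]

# Stand-in "model": predict every remaining token as the mean of the prefix.
pred = np.repeat(prefix.mean(axis=0, keepdims=True), len(targets), axis=0)
loss = np.mean((pred - targets) ** 2)      # MSE reconstruction objective
print(round(loss, 3))
```

Varying the prefix length each step forces the encoder to produce features useful at every conditioning horizon, rather than only for a fixed split.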
Vision Model Performance and Challenges
* The AIMV2 model shows promising results in image classification and visual feature extraction
* Performance on object detection benchmarks like COCO is competitive with existing models like CLIP
* Key Insights on Vision Model Development:
  - Increasing model parameter count and training data generally improves performance
  - Image resolution impacts model accuracy and feature detection
  - Object detection remains challenging compared to image classification
* Challenges in Vision Model Generalization:
  - Current foundation models are not as effective at object detection as specialized models
  - Object detection architectures are highly domain-specific
  - Real-time object detectors historically showed limited benefits from pre-training
Moondream Project
* Model Capabilities and Features:
  - Supports multiple output modalities: a query mode for English-language questions about images, image captioning, open-vocabulary object detection, and a pointing capability
  - Two current models: a 2B parameter general-purpose model and a 0.5B parameter model for edge devices and mobile
* 0.5B Model Development:
  - Developed through a pruning technique, starting from the 2B model
  - Used gradient-based importance estimation
  - Iteratively pruned and retrained to preserve performance
  - Allows developers to prototype with the larger model, then optimize for deployment
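A toy version of gradient-based importance pruning, using the common first-order saliency |weight x gradient| to score parameters; the exact criterion and schedule Moondream used are not specified in these notes, so this is only an illustrative sketch of one iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))        # a layer's weights
grad = rng.normal(size=(8, 8))     # gradient from a backward pass (stand-in)

# First-order Taylor saliency: how much the loss would move if the weight
# were removed, approximated by |w * dL/dw|.
importance = np.abs(w * grad)
threshold = np.quantile(importance, 0.5)   # prune the bottom 50% (assumed ratio)
mask = importance >= threshold
w_pruned = w * mask                         # zero out low-importance weights

print(mask.mean())  # fraction of weights kept: 0.5
```

In the iterative scheme described above, this score-prune step would alternate with retraining so the remaining weights can recover the lost accuracy before the next pruning round.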
* Gauge Reading Challenge:
  - Identified a problem with existing models reading analog gauges
  - Current models struggle due to training data limitations: most gauge images online are product detail shots, so models learn the printed metadata but not the actual needle reading
  - Proposed solution: synthetic data generation
  - Discovered that reading gauges is a complex, multi-step process
  - Suggested using a "chain of thought" approach
Grounded Chain of Thought Approach
* Developing a method for vision-language models (VLMs) to improve image perception and reasoning
* Using a "grounded chain of thought" approach for tasks like reading clocks and gauges
* Implemented point-based annotation instead of bounding boxes for more precise image interaction
* Technical Implementation:
  - Created a clock-reading benchmark with 500 images
  - Demonstrated improved sample efficiency using the chain of thought methodology
  - Allows model errors to be understood and debugged by tracing reasoning steps
  - Supports few-shot prompting and test-time training for specific use cases
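One example of the kind of grounded reasoning step this enables: once a model has pointed at a clock's pivot and hand tips, converting those pixel coordinates into a time is pure geometry, and each intermediate value can be inspected when the answer is wrong. The coordinates below are hypothetical.

```python
import math

def hand_angle(center, tip):
    """Clockwise angle of a hand from 12 o'clock, in degrees.

    Note the image y-axis points down, hence the flipped dy.
    """
    dx, dy = tip[0] - center[0], center[1] - tip[1]
    return math.degrees(math.atan2(dx, dy)) % 360

def read_clock(center, hour_tip, minute_tip):
    minutes = round(hand_angle(center, minute_tip) / 6) % 60  # 6 deg per minute
    hours = int(hand_angle(center, hour_tip) // 30) % 12      # 30 deg per hour
    return f"{hours or 12}:{minutes:02d}"

# Hypothetical pixel coordinates for a clock showing 3:00: minute hand
# pointing straight up, hour hand pointing right.
print(read_clock(center=(100, 100), hour_tip=(140, 100), minute_tip=(100, 60)))
# 3:00
```

If the model misreads a clock, tracing whether the failure was in the pointed coordinates or in the angle-to-time step is exactly the debugging benefit described above.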
* Insights on VLM Limitations:
  - VLMs currently lag behind language models in reasoning capabilities
  - Hypothesis: a lack of comprehensive perception training data
  - Internet image-text pairs are low-quality sources for perception learning
  - Humans are naturally good at perception but rarely document perception techniques explicitly
* Research Goals:
  - Build versatile VLMs that can operate across different tasks
  - Develop methods to teach models how to reason with visual information
  - Explore scaling perception and reasoning capabilities