Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models Paper • 2605.21573 • Published 10 days ago • 103
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion Paper • 2605.23902 • Published 8 days ago • 41
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs Paper • 2603.16932 • Published Mar 14 • 90
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper • 2508.18265 • Published Aug 25, 2025 • 224
Video Analysis and Generation via a Semantic Progress Function Paper • 2604.22554 • Published Apr 24 • 63
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation Paper • 2604.24763 • Published Apr 27 • 71
Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation Paper • 2604.19141 • Published Apr 21 • 1
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model Paper • 2604.20796 • Published Apr 22 • 242