STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation Paper • 2605.08029 • Published 5 days ago • 10
Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers Paper • 2509.24317 • Published Sep 29, 2025 • 11
AIMv2 Collection A collection of AIMv2 vision encoders that support a number of fixed resolutions and native resolution, plus a distilled checkpoint. • 16 items • Updated Mar 2 • 83
Stabilizing Transformer Training by Preventing Attention Entropy Collapse Paper • 2303.06296 • Published Mar 11, 2023 • 1
Learning Controllable 3D Diffusion Models from Single-view Images Paper • 2304.06700 • Published Apr 13, 2023
What Algorithms can Transformers Learn? A Study in Length Generalization Paper • 2310.16028 • Published Oct 24, 2023 • 2
Value function estimation using conditional diffusion models for control Paper • 2306.07290 • Published Jun 9, 2023
Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization Paper • 2401.15914 • Published Jan 29, 2024 • 7
How Far Are We from Intelligent Visual Deductive Reasoning? Paper • 2403.04732 • Published Mar 7, 2024 • 21
NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion Paper • 2302.10109 • Published Feb 20, 2023
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation Paper • 2410.08159 • Published Oct 10, 2024 • 26
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21, 2024 • 47
Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling Paper • 2405.21048 • Published May 31, 2024 • 16