Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO Paper • 2505.22453 • Published May 28, 2025 • 46
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning Paper • 2505.23380 • Published May 29, 2025 • 22
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models Paper • 2505.21523 • Published May 23, 2025 • 13
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces Paper • 2506.00123 • Published May 30, 2025 • 35
Discrete Diffusion in Large Language and Multimodal Models: A Survey Paper • 2506.13759 • Published Jun 16, 2025 • 43
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation Paper • 2506.14028 • Published Jun 16, 2025 • 93
OmniGen2: Exploration to Advanced Multimodal Generation Paper • 2506.18871 • Published Jun 23, 2025 • 78
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Paper • 2506.17218 • Published Jun 20, 2025 • 29
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation Paper • 2506.17202 • Published Jun 20, 2025 • 10
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context Paper • 2506.21277 • Published Jun 26, 2025 • 14
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning Paper • 2507.01006 • Published Jul 1, 2025 • 252
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents Paper • 2507.04590 • Published Jul 7, 2025 • 17
Robust Multimodal Large Language Models Against Modality Conflict Paper • 2507.07151 • Published Jul 9, 2025 • 6
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models Paper • 2507.07104 • Published Jul 9, 2025 • 46
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Paper • 2507.10787 • Published Jul 14, 2025 • 13
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning Paper • 2507.22607 • Published Jul 30, 2025 • 47
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation Paper • 2508.03320 • Published Aug 5, 2025 • 64
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation Paper • 2508.18032 • Published Aug 25, 2025 • 41
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency Paper • 2508.18265 • Published Aug 25, 2025 • 217
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning Paper • 2508.20751 • Published Aug 28, 2025 • 90
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning Paper • 2509.01644 • Published Sep 1, 2025 • 34
Visual Representation Alignment for Multimodal Large Language Models Paper • 2509.07979 • Published Sep 9, 2025 • 84
Reconstruction Alignment Improves Unified Multimodal Models Paper • 2509.07295 • Published Sep 8, 2025 • 40
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer Paper • 2509.16197 • Published Sep 19, 2025 • 58
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation Paper • 2509.18824 • Published Sep 23, 2025 • 23
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training Paper • 2509.26625 • Published Sep 30, 2025 • 43
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models Paper • 2509.25848 • Published Sep 30, 2025 • 81
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance Paper • 2509.26231 • Published Sep 30, 2025 • 18
Self-Improvement in Multimodal Large Language Models: A Survey Paper • 2510.02665 • Published Oct 3, 2025 • 21
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning Paper • 2511.06805 • Published Nov 10, 2025 • 13
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation Paper • 2511.12207 • Published Nov 15, 2025 • 10
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark Paper • 2511.17729 • Published Nov 21, 2025 • 17
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward Paper • 2511.20561 • Published Nov 25, 2025 • 33
Architecture Decoupling Is Not All You Need For Unified Multimodal Model Paper • 2511.22663 • Published Nov 27, 2025 • 29
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices Paper • 2512.14052 • Published Dec 16, 2025 • 42
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation Paper • 2601.02204 • Published Jan 5, 2026 • 63
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning Paper • 2601.10129 • Published Jan 15, 2026 • 13
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents Paper • 2601.16973 • Published Jan 23, 2026 • 40
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods Paper • 2601.21821 • Published Jan 29, 2026 • 62
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Paper • 2601.22060 • Published Jan 29, 2026 • 155
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models Paper • 2602.04804 • Published Feb 4, 2026 • 49
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models Paper • 2602.07026 • Published Feb 2, 2026 • 140
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device Paper • 2602.20161 • Published Feb 23, 2026 • 23
Beyond Language Modeling: An Exploration of Multimodal Pretraining Paper • 2603.03276 • Published Mar 2026 • 102
Mario: Multimodal Graph Reasoning with Large Language Models Paper • 2603.05181 • Published Mar 2026 • 8
Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs Paper • 2603.09095 • Published Mar 2026 • 28
Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought Paper • 2603.22847 • Published Mar 2026 • 25
GEMS: Agent-Native Multimodal Generation with Memory and Skills Paper • 2603.28088 • Published Mar 2026 • 71
LongCat-Next: Lexicalizing Modalities as Discrete Tokens Paper • 2603.27538 • Published Mar 2026 • 122
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome Paper • 2603.28407 • Published Mar 2026 • 52
HippoCamp: Benchmarking Contextual Agents on Personal Computers Paper • 2604.01221 • Published Apr 2026 • 16