Papers
arxiv:2603.05181

Mario: Multimodal Graph Reasoning with Large Language Models

Published on Mar 5
· Submitted by Yuanfu Sun on Mar 9
Abstract

Mario is a unified framework that enables large language model-based reasoning on multimodal graphs by addressing cross-modal consistency and heterogeneous modality preferences through graph-conditioned vision-language modeling and modality-adaptive instruction tuning.

AI-generated summary

Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning over such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preferences. To address these, we propose Mario, a unified framework that resolves both challenges simultaneously and enables effective LLM-based reasoning over MMGs. Mario consists of two stages: first, a graph-conditioned VLM that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology; second, a modality-adaptive graph instruction tuning mechanism that organizes the aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot settings for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
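The code has not been released yet, so as a rough illustration only: Stage 1's topology-guided cross-modal alignment could take the form of an InfoNCE-style objective in which a node's image is a positive not just for its own text but also for its graph neighbors' text. All function names, shapes, and the exact positive-set construction below are our assumptions, not the authors' implementation.

```python
import numpy as np

def graph_contrastive_loss(text_emb, img_emb, adj, tau=0.07):
    """Hypothetical sketch of graph-conditioned cross-modal contrastive loss.

    text_emb, img_emb: [N, d] per-node features from the two modalities.
    adj:               [N, N] binary adjacency (no self-loops).
    Each node's image is treated as a positive for its own text and for
    the text of its neighbors, so alignment is conditioned on topology
    rather than on isolated image-text pairs.
    """
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    v = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    logits = t @ v.T / tau                           # [N, N] cosine similarities
    pos = adj + np.eye(adj.shape[0])                 # self + neighbors as positives
    log_prob = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    per_node = -(pos * log_prob).sum(-1) / pos.sum(-1)
    return per_node.mean()
```

With `adj` set to all zeros this reduces to ordinary per-pair image-text contrastive learning; the graph conditioning is entirely in how `pos` widens the positive set.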

Community


[CVPR 2026]. A new framework designed for relational text-vision data.

In our work, we study multimodal graphs (MMGs), where each node comes with both text and image information, while edges provide additional structural context. We find that reasoning over such graphs is harder than it looks, mainly because of two challenges:
🐢 weak cross-modal consistency — text and image are often only loosely aligned, and
🐢 heterogeneous modality preference — different nodes may prefer different modality information for correct reasoning.

To address this, we propose Mario, a unified two-stage framework:
✨ Stage 1: a graph-conditioned vision-language model that performs structure-aware image-text alignment under graph topology
✨ Stage 2: a modality-adaptive graph instruction tuning mechanism with a learnable router that selects the most informative modality view for each node and its local neighborhood
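Before the official release, Stage 2's learnable router can be pictured as a small gating network: it scores a stack of per-node modality views (e.g. text-only, image-only, fused) and returns a soft mixture. The shapes, view set, and gating design below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def route_modalities(views, gate_w, gate_b):
    """Hypothetical sketch of a learnable modality router.

    views:  [N, K, d] stacked modality views per node (e.g. K=3 for
            text-only, image-only, fused), each of dimension d.
    gate_w: [d, K] and gate_b: [K] learnable gating parameters.
    The router scores each view from a node-level summary and returns
    a softmax-weighted mixture, surfacing the most informative
    modality configuration per node.
    """
    ctx = views.mean(axis=1)                         # [N, d] node summary
    scores = ctx @ gate_w + gate_b                   # [N, K] view logits
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)               # [N, K] routing weights
    return (w[:, :, None] * views).sum(axis=1)       # [N, d] routed output
```

A hard top-1 router (argmax over `scores`) would pick a single view per node instead; the soft version keeps everything differentiable for end-to-end tuning.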

Extensive evaluations across diverse MMG benchmarks demonstrate Mario’s state-of-the-art performance in multiple graph reasoning tasks. Notably, Mario consistently outperforms leading baselines, achieving up to 1.6× gains in zero-shot transfer settings. More broadly, this work is our step toward enabling LLMs to reason not just over text or isolated image-text pairs, but over structured multimodal worlds.

We are actively organizing and refining our codebase to make it clean, stable, and easy to reproduce. Due to our current busy schedule, we plan to gradually release the entire code starting in April. Thank you for your interest in our work. We truly appreciate your attention and support💗!

