Papers
arxiv:2602.03210

VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers

Published on Feb 3
Abstract

VIRAL is a framework that uses visual analogy and diffusion transformers to replicate in-context learning in computer vision, achieving superior performance across diverse visual tasks through role-aware conditioning and expert mixing techniques.

AI-generated summary

Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating ICL as conditional generation via visual analogy (x_s : x_t :: x_q : y_q). We adapt a frozen Diffusion Transformer (DiT) using role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA to mitigate gradient interference across diverse tasks. Additionally, to bridge the gaps in current visual context datasets, we curate a large-scale dataset spanning perception, restoration, and editing. Experiments demonstrate that VIRAL outperforms existing methods, validating that a unified visual ICL (V-ICL) paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
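To make the Mixture-of-Experts LoRA idea concrete, here is a minimal NumPy sketch of the general pattern: a frozen base weight, several low-rank expert adapters, and a learned router that mixes their updates per input. This is an illustration of the generic technique only, not the paper's actual architecture; all names, dimensions, and initializations below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3  # hypothetical model dim, LoRA rank, expert count

W = rng.standard_normal((d, d)) * 0.02        # frozen base weight (not trained)
A = rng.standard_normal((n_experts, r, d)) * 0.02  # per-expert LoRA down-projections
B = np.zeros((n_experts, d, r))               # per-expert LoRA up-projections (zero-init)
G = rng.standard_normal((d, n_experts)) * 0.02     # router weights

def moe_lora_forward(x):
    """Frozen path plus a router-gated sum of low-rank expert updates."""
    logits = x @ G
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                      # softmax over experts
    out = W @ x                               # frozen base transformation
    for e in range(n_experts):
        out += gates[e] * (B[e] @ (A[e] @ x)) # add each expert's low-rank delta
    return out

x = rng.standard_normal(d)
y = moe_lora_forward(x)
```

Because the up-projections B start at zero (standard LoRA initialization), the adapted layer initially reproduces the frozen model exactly; training only the A, B, and G matrices then lets each expert specialize on a subset of tasks, which is one common way to reduce gradient interference across heterogeneous objectives.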

