VIRAL: Visual In-Context Reasoning via Analogy in Diffusion Transformers
Abstract
Replicating In-Context Learning (ICL) in computer vision remains challenging due to task heterogeneity. We propose VIRAL, a framework that elicits visual reasoning from a pre-trained image editing model by formulating visual ICL (V-ICL) as conditional generation via the visual analogy x_s : x_t :: x_q : y_q. We adapt a frozen Diffusion Transformer (DiT) with role-aware multi-image conditioning and introduce a Mixture-of-Experts LoRA (MoE-LoRA) to mitigate gradient interference across diverse tasks. To close the gaps in existing visual in-context datasets, we further curate a large-scale dataset spanning perception, restoration, and editing. Experiments show that VIRAL outperforms existing methods, validating that a unified V-ICL paradigm can handle the majority of visual tasks, including open-domain editing. Our code is available at https://anonymous.4open.science/r/VIRAL-744A
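To make the analogy formulation concrete, the following is a minimal sketch of role-aware multi-image conditioning, assuming the common pattern of adding learned role embeddings to each conditioning image's token sequence before it enters the frozen DiT. The class and function names (RoleAwareConditioner, build_context) and the three-role scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: tag each image's tokens with its role in the analogy
# x_s : x_t :: x_q : y_q, then concatenate them as conditioning context.
import torch
import torch.nn as nn

class RoleAwareConditioner(nn.Module):
    """Adds a learned role embedding to each conditioning image's tokens."""
    ROLES = {"source": 0, "target": 1, "query": 2}  # assumed role set

    def __init__(self, dim: int):
        super().__init__()
        self.role_emb = nn.Embedding(len(self.ROLES), dim)

    def forward(self, tokens: torch.Tensor, role: str) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) image tokens from a VAE/patchifier
        idx = torch.full(tokens.shape[:-1], self.ROLES[role],
                         dtype=torch.long, device=tokens.device)
        return tokens + self.role_emb(idx)

def build_context(cond: RoleAwareConditioner,
                  x_s: torch.Tensor, x_t: torch.Tensor,
                  x_q: torch.Tensor) -> torch.Tensor:
    # x_s -> x_t demonstrates the task; x_q is the new input whose output
    # y_q the frozen DiT is asked to generate, conditioned on this context.
    return torch.cat([cond(x_s, "source"),
                      cond(x_t, "target"),
                      cond(x_q, "query")], dim=1)
```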
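The MoE-LoRA component can likewise be sketched under standard assumptions: a frozen base linear layer is augmented with several low-rank expert adapters, and a learned router mixes a sparse subset of them per token, so tasks with conflicting gradients can route to different experts. The expert count, rank, and top-k routing below are placeholder hyperparameters, not values from the paper.

```python
# Hypothetical MoE-LoRA layer, assuming the usual mixture-of-LoRA-experts
# pattern: output = frozen W x + scale * sum_e gate_e * (B_e A_e x).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, alpha: float = 16.0, top_k: int = 2):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep pre-trained weights frozen
            p.requires_grad_(False)
        in_f, out_f = base.in_features, base.out_features
        # One low-rank (A, B) pair per expert; B starts at zero so every
        # expert initially leaves the frozen model unchanged.
        self.A = nn.Parameter(torch.randn(num_experts, rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, out_f, rank))
        self.router = nn.Linear(in_f, num_experts)  # token-wise gating
        self.scale = alpha / rank
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x)                              # frozen path
        logits = self.router(x)                       # (..., num_experts)
        topv, topi = logits.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(logits).scatter(-1, topi,
                                                 F.softmax(topv, -1))
        # Low-rank updates for all experts, mixed with the sparse gates.
        u = torch.einsum('...d,erd->...er', x, self.A)       # (..., E, r)
        delta = torch.einsum('...er,eor->...eo', u, self.B)  # (..., E, out)
        return y + self.scale * torch.einsum('...e,...eo->...o',
                                             gates, delta)
```

In this sketch only the expert matrices and the router receive gradients, which matches the abstract's description of adapting a frozen DiT while mitigating cross-task gradient interference.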