UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
Abstract
UniVidX is a unified multimodal framework that uses video diffusion model priors for versatile video generation through stochastic condition masking, decoupled gated LoRA, and cross-modal self-attention mechanisms.
Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
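To make the three designs more concrete, here is a minimal PyTorch-style sketch, not the authors' implementation, of how the condition mask, the per-modality gated adapters, and the shared key/value attention could fit together. The names (`GatedLoRA`, `CrossModalSelfAttention`, `stochastic_condition_mask`), the token dimensions, the LoRA rank, and the two-modality toy usage are all illustrative assumptions.

```python
# Minimal sketch of SCM + DGL + CMSA under assumed shapes; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLoRA(nn.Module):
    """Per-modality low-rank adapter, active only when its modality is a generation target (DGL)."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: the adapter starts as a no-op over the backbone

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D) tokens of one modality; gate: (B,), 1.0 if the modality is a noisy target, else 0.0
        return x + gate.view(-1, 1, 1) * self.up(self.down(x))


class CrossModalSelfAttention(nn.Module):
    """Modality-specific queries attend over keys/values pooled from all modalities (CMSA)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def _split(self, x: torch.Tensor) -> torch.Tensor:
        b, l, d = x.shape
        return x.view(b, l, self.num_heads, d // self.num_heads).transpose(1, 2)

    def _merge(self, x: torch.Tensor) -> torch.Tensor:
        b, h, l, dh = x.shape
        return x.transpose(1, 2).reshape(b, l, h * dh)

    def forward(self, tokens_per_modality: list[torch.Tensor]) -> list[torch.Tensor]:
        shared = torch.cat(tokens_per_modality, dim=1)          # shared K/V pool across modalities
        k, v = self._split(self.k_proj(shared)), self._split(self.v_proj(shared))
        outputs = []
        for x in tokens_per_modality:                           # one query set per modality
            q = self._split(self.q_proj(x))
            attn = F.scaled_dot_product_attention(q, k, v)
            outputs.append(self.out_proj(self._merge(attn)))
        return outputs


def stochastic_condition_mask(batch: int, num_modalities: int) -> torch.Tensor:
    """Per sample, mark each modality as clean condition (0) or noisy target (1) (SCM);
    rows with no target are repaired so at least one modality is generated."""
    mask = torch.randint(0, 2, (batch, num_modalities)).float()
    empty = mask.sum(dim=1) == 0
    if empty.any():
        cols = torch.randint(0, num_modalities, (int(empty.sum()),))
        mask[empty, cols] = 1.0
    return mask


if __name__ == "__main__":
    # Toy forward pass: two pixel-aligned modalities (say RGB and albedo), 10 tokens each.
    mask = stochastic_condition_mask(batch=2, num_modalities=2)
    attn = CrossModalSelfAttention(dim=64)
    adapters = nn.ModuleList([GatedLoRA(64), GatedLoRA(64)])    # one decoupled adapter per modality
    rgb, albedo = torch.randn(2, 10, 64), torch.randn(2, 10, 64)
    fused = attn([rgb, albedo])
    fused = [adapters[i](fused[i], gate=mask[:, i]) for i in range(2)]
```

In this sketch the mask only gates the adapters; in the full framework it would also determine which modality latents receive diffusion noise and which stay clean as conditions, and the zero-initialized, target-only adapters are what allow conditioning modalities to pass through the backbone with its priors untouched.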
Community
The SCM idea is a clean way to turn a single diffusion backbone into omni-directional multimodal generation, and the combination of CMSA and DGL feels like a good recipe for cross-modal coherence.
One question: since SCM randomizes which modalities are conditions versus targets, how robust is cross-modal alignment when a modality is missing or very noisy at test time? Does the model gracefully fall back to the backbone's priors, or does it struggle?
The arXivLens breakdown helped me parse the method details and catch how CMSA shares keys and values across modalities while keeping per-modality queries.
A small ablation comparing the stochastic mask distribution against fixed mappings would be nice, to separate the benefit of SCM itself from that of the gating adapters.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator (2026)
- Identity as Presence: Towards Appearance and Voice Personalized Joint Audio-Video Generation (2026)
- MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling (2026)
- Improving Joint Audio-Video Generation with Cross-Modal Context Learning (2026)
- MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation (2026)
- UniVid: Pyramid Diffusion Model for High Quality Video Generation (2026)
- RefAlign: Representation Alignment for Reference-to-Video Generation (2026)