MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Abstract
A unified dual-stream diffusion transformer model enables synergistic multimodal face synthesis by jointly processing spatial and semantic tokens through shared attention mechanisms while maintaining flexibility across different input conditions.
Recent multimodal face generation models address the spatial control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty is a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel and deeply fuses them through a shared attention mechanism with Rotary Position Embeddings (RoPE). This design prevents either modality from dominating and enforces adherence to both the text prompt and the structural prior, yielding strong spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to adapt dynamically to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
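To make the dual-stream idea concrete, below is a minimal PyTorch sketch of a block in which spatial and semantic tokens keep separate projections and MLPs but share a single joint attention call, plus a modality embedder that tags the spatial stream with its condition type. All class and parameter names, dimensions, and the omission of RoPE are illustrative assumptions based only on the abstract, not the authors' released implementation; because both streams are queried inside the same softmax, neither one can dominate the attention, which is the intuition the abstract describes.

```python
# Minimal PyTorch sketch of a dual-stream block with shared (joint) attention
# and a modality embedder, assembled from the abstract's description.
# All names, shapes, and the omission of RoPE are illustrative assumptions,
# not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEmbedder(nn.Module):
    """Learned embedding per spatial-condition type (e.g. 0=mask, 1=sketch,
    2=edge map), added to the spatial tokens so one model can switch between
    conditions without retraining (assumed mechanism)."""

    def __init__(self, num_modalities: int, dim: int):
        super().__init__()
        self.table = nn.Embedding(num_modalities, dim)

    def forward(self, modality_id: torch.Tensor) -> torch.Tensor:
        return self.table(modality_id)  # (B, dim)


class DualStreamBlock(nn.Module):
    """Spatial and semantic streams keep separate projections and MLPs but
    attend jointly over the concatenated token sequence."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.norm1_sp, self.norm1_se = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm2_sp, self.norm2_se = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.qkv_sp, self.qkv_se = nn.Linear(dim, 3 * dim), nn.Linear(dim, 3 * dim)
        self.proj_sp, self.proj_se = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.mlp_sp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_se = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _heads(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        return x.view(b, n, self.n_heads, d // self.n_heads).transpose(1, 2)

    def forward(self, x_sp: torch.Tensor, x_se: torch.Tensor):
        n_sp = x_sp.shape[1]
        # Per-stream QKV projections ...
        q_sp, k_sp, v_sp = self.qkv_sp(self.norm1_sp(x_sp)).chunk(3, dim=-1)
        q_se, k_se, v_se = self.qkv_se(self.norm1_se(x_se)).chunk(3, dim=-1)
        # ... but one shared attention call over the concatenated sequence,
        # so spatial and text tokens attend to each other symmetrically.
        q = self._heads(torch.cat([q_sp, q_se], dim=1))
        k = self._heads(torch.cat([k_sp, k_se], dim=1))
        v = self._heads(torch.cat([v_sp, v_se], dim=1))
        out = F.scaled_dot_product_attention(q, k, v)       # (B, H, N, d/H)
        out = out.transpose(1, 2).flatten(2)                 # (B, N, d)
        out_sp, out_se = out[:, :n_sp], out[:, n_sp:]
        # Stream-specific residual projections and MLPs.
        x_sp = x_sp + self.proj_sp(out_sp)
        x_se = x_se + self.proj_se(out_se)
        x_sp = x_sp + self.mlp_sp(self.norm2_sp(x_sp))
        x_se = x_se + self.mlp_se(self.norm2_se(x_se))
        return x_sp, x_se


if __name__ == "__main__":
    B, N_SP, N_SE, D = 2, 256, 77, 512
    block = DualStreamBlock(dim=D, n_heads=8)
    embedder = ModalityEmbedder(num_modalities=3, dim=D)
    sketch_id = torch.tensor([1, 1])  # both samples conditioned on a sketch
    x_sp = torch.randn(B, N_SP, D) + embedder(sketch_id).unsqueeze(1)
    x_se = torch.randn(B, N_SE, D)
    y_sp, y_se = block(x_sp, x_se)
    print(y_sp.shape, y_se.shape)  # torch.Size([2, 256, 512]) torch.Size([2, 77, 512])
```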
Community
MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026.
Project Page: https://vcbsl.github.io/MMFace-DiT/
GitHub Repo: https://github.com/Bharath-K3/MMFace-DiT
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Layout-Guided Controllable Pathology Image Generation with In-Context Diffusion Transformers (2026)
- A2BFR: Attribute-Aware Blind Face Restoration (2026)
- DualTSR: Unified Dual-Diffusion Transformer for Scene Text Image Super-Resolution (2026)
- Disentangled Textual Priors for Diffusion-based Image Super-Resolution (2026)
- PokeFusion Attention: A Lightweight Cross-Attention Mechanism for Style-Conditioned Image Generation (2026)
- Component-Aware Sketch-to-Image Generation Using Self-Attention Encoding and Coordinate-Preserving Fusion (2026)
- Multi-Level Conditioning by Pairing Localized Text and Sketch for Fashion Image Generation (2026)