# SGT: Semantic Generative Tuning for Unified Multimodal Models
This repository hosts checkpoints fine-tuned with Semantic Generative Tuning (SGT), a training paradigm that couples visual understanding and generation in Unified Multimodal Models (UMMs) by using image segmentation as a generative proxy.
Unified multimodal models typically optimize understanding and generation with misaligned objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities. SGT introduces segmentation, a high-level semantic task, as a unified generative objective that aligns the two branches, improves feature linear separability, and optimizes visual-textual attention allocation.
## Method Overview
SGT reformulates classical visual tasks as generative proxies and organizes them in a hierarchical taxonomy (low-/mid-/high-level). Extensive experiments show that high-level semantic tasks (e.g., image segmentation) are the optimal proxy, outperforming depth, edge, pixel reconstruction, and MAE/inpainting for synergizing understanding and generation.
Key findings:
- High-level > low-level: segmentation gives larger gains in both understanding and generation than depth / edge / pixel reconstruction.
- Perception, not reasoning: visual supervision mainly strengthens vision-centric perception (spatial understanding, hallucination resistance, OCR) rather than abstract reasoning.
- Architecture-agnostic: the gains hold for both BAGEL and OmniGen2.
## Released Artifacts
| Repo | Type | Base Model | Content |
|---|---|---|---|
| Two-hot/SGT-BAGEL | model | BAGEL-7B-MoT | SGT fine-tuned BAGEL checkpoint |
| Two-hot/SGT-Gen2 | model | OmniGen2 | SGT fine-tuned OmniGen2 checkpoint (`transformer/` only) |
| Two-hot/SAM-SGT | dataset | N/A | Segmentation training data (tar-sharded) used by SGT |
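The checkpoints can be fetched with the `huggingface_hub` library. The snippet below is a minimal download sketch, not an official loading pipeline: the repo IDs come from the table above, while the local directory names are arbitrary illustrations. Loading the weights requires the corresponding BAGEL / OmniGen2 inference code from the base models' repositories.

```python
# Minimal sketch: download the SGT checkpoints from the Hugging Face Hub.
# Repo IDs are taken from the table above; local_dir paths are arbitrary examples.
from huggingface_hub import snapshot_download

# SGT fine-tuned BAGEL checkpoint
snapshot_download(repo_id="Two-hot/SGT-BAGEL", local_dir="./SGT-BAGEL")

# SGT fine-tuned OmniGen2 checkpoint (contains the transformer/ subfolder only)
snapshot_download(repo_id="Two-hot/SGT-Gen2", local_dir="./SGT-Gen2")
```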
### Use the SAM-SGT dataset

See Two-hot/SAM-SGT for the data layout and extraction instructions (files are stored as 5 GB tar shards to stay within Hugging Face file-size limits).
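As a rough sketch (not an official loader), the shards can be downloaded and unpacked with `huggingface_hub` and Python's `tarfile`. The `*.tar` filename pattern and the output directory are assumptions for illustration; check the dataset card for the exact layout.

```python
# Sketch: download the SAM-SGT tar shards and extract them locally.
# The "*.tar" glob pattern is an assumption; see the dataset card for the real layout.
import glob
import os
import tarfile

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the dataset repo and get its local path.
data_dir = snapshot_download(repo_id="Two-hot/SAM-SGT", repo_type="dataset")

out_dir = "./sam_sgt_extracted"
os.makedirs(out_dir, exist_ok=True)

# Unpack every ~5 GB tar shard into the same output folder.
for shard in sorted(glob.glob(os.path.join(data_dir, "**", "*.tar"), recursive=True)):
    with tarfile.open(shard) as tf:
        tf.extractall(path=out_dir)
```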
## Highlights
- +6.02% average gain over BAGEL on the CV-Bench evaluation.
- Consistent improvements in spatial reasoning, hallucination resistance, and OCR.
- Generation: gains across GenEval dimensions (Position / Color / Counting / Single-Object / etc.).
- Verified on two representative UMM architectures (BAGEL, OmniGen2).
## License
Apache-2.0. Base models remain under their original licenses: BAGEL (Apache-2.0, based on Qwen2.5-7B + SigLIP + FLUX VAE) and OmniGen2 (based on Qwen2.5-VL + diffusion transformer).
## Citation
If you find this work useful, please cite our paper (anonymous ECCV 2026 submission, paper ID #3064):
@article{sgt2026,
  title   = {Semantic Generative Tuning for Unified Multimodal Models},
  author  = {Songsong Yu and Yuxin Chen and Ying Shan and Yanwei Li},
  journal = {arXiv preprint},
  year    = {2026}
}