# SGT: Semantic Generative Tuning for Unified Multimodal Models
This repository hosts checkpoints fine-tuned with Semantic Generative Tuning (SGT), a training paradigm that couples visual understanding and generation in Unified Multimodal Models (UMMs) by using image segmentation as a generative proxy.
Unified multimodal models typically optimize understanding and generation with misaligned objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities. SGT introduces segmentation, a high-level semantic task, as a unified generative objective that aligns the two branches, improves feature linear separability, and optimizes visual-textual attention allocation.
## Method Overview
SGT reformulates classical visual tasks as generative proxies and organizes them in a hierarchical taxonomy (low-/mid-/high-level). Extensive experiments show that high-level semantic tasks (e.g., image segmentation) are the optimal proxy, outperforming depth, edge, pixel reconstruction, and MAE/inpainting for synergizing understanding and generation.
Key findings:
- High-level > low-level: segmentation gives larger gains in both understanding and generation than depth / edge / pixel reconstruction.
- Perception, not reasoning: visual supervision mainly strengthens vision-centric perception (spatial understanding, hallucination resistance, OCR) rather than abstract reasoning.
- Architecture-agnostic: the gains hold for both BAGEL and OmniGen2.
## Released Artifacts
| Repo | Type | Base Model | Content |
|---|---|---|---|
| Two-hot/SGT-BAGEL | model | BAGEL-7B-MoT | SGT fine-tuned BAGEL checkpoint |
| Two-hot/SGT-Gen2 | model | OmniGen2 | SGT fine-tuned OmniGen2 checkpoint (`transformer/` only) |
| Two-hot/SAM-SGT | dataset | N/A | Segmentation training data (tar-sharded) used by SGT |
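The checkpoints can be fetched with the `huggingface_hub` library. The snippet below is a minimal download sketch, not an official loading pipeline: the repo IDs come from the table above, while the local directory names are arbitrary illustrations. Loading the weights requires the corresponding BAGEL / OmniGen2 inference code from the base models' repositories.

```python
# Minimal sketch: download the SGT checkpoints from the Hugging Face Hub.
# Repo IDs are taken from the table above; local_dir paths are arbitrary examples.
from huggingface_hub import snapshot_download

# SGT fine-tuned BAGEL checkpoint
snapshot_download(repo_id="Two-hot/SGT-BAGEL", local_dir="./SGT-BAGEL")

# SGT fine-tuned OmniGen2 checkpoint (contains the transformer/ subfolder only)
snapshot_download(repo_id="Two-hot/SGT-Gen2", local_dir="./SGT-Gen2")
```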
### Use the SAM-SGT dataset

See Two-hot/SAM-SGT for the data layout and extraction instructions (files are stored as 5 GB tar shards to stay within Hugging Face file-size limits).
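As a rough sketch (not an official loader), the shards can be downloaded and unpacked with `huggingface_hub` and Python's `tarfile`. The `*.tar` filename pattern and the output directory are assumptions for illustration; check the dataset card for the exact layout.

```python
# Sketch: download the SAM-SGT tar shards and extract them locally.
# The "*.tar" glob pattern is an assumption; see the dataset card for the real layout.
import glob
import os
import tarfile

from huggingface_hub import snapshot_download

# Download (or reuse a cached copy of) the dataset repo and get its local path.
data_dir = snapshot_download(repo_id="Two-hot/SAM-SGT", repo_type="dataset")

out_dir = "./sam_sgt_extracted"
os.makedirs(out_dir, exist_ok=True)

# Unpack every ~5 GB tar shard into the same output folder.
for shard in sorted(glob.glob(os.path.join(data_dir, "**", "*.tar"), recursive=True)):
    with tarfile.open(shard) as tf:
        tf.extractall(path=out_dir)
```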
## Highlights
- +6.02% average gain over BAGEL on the CV-Bench evaluation.
- Consistent improvements in spatial reasoning, hallucination resistance, and OCR.
- Generation: gains across GenEval dimensions (Position / Color / Counting / Single-Object / etc.).
- Verified on two representative UMM architectures (BAGEL, OmniGen2).
## License
Apache-2.0. Base models remain under their original licenses: BAGEL (Apache-2.0, based on Qwen2.5-7B + SigLIP + FLUX VAE) and OmniGen2 (based on Qwen2.5-VL + diffusion transformer).
## Citation
If you find this work useful, please cite our paper (anonymous ECCV 2026 submission, paper ID #3064):
@article{sgt2026,
  title   = {Semantic Generative Tuning for Unified Multimodal Models},
  author  = {Songsong Yu and Yuxin Chen and Ying Shan and Yanwei Li},
  journal = {arXiv preprint},
  year    = {2026}
}