SGT: Semantic Generative Tuning for Unified Multimodal Models

This repository hosts checkpoints fine-tuned with Semantic Generative Tuning (SGT), a training paradigm that couples visual understanding and generation in Unified Multimodal Models (UMMs) by using image segmentation as a generative proxy.

Unified multimodal models typically optimize understanding and generation with misaligned objectives (sparse text tokens vs. dense pixel targets), which isolates the two capabilities. SGT introduces segmentation β€” a high-level semantic task β€” as a unified generative objective that aligns the two branches, improves feature linear separability, and optimizes visual-textual attention allocation.

🧠 Method Overview

SGT reformulates classical visual tasks as generative proxies and organizes them into a hierarchical taxonomy (low-/mid-/high-level). Extensive experiments show that high-level semantic tasks (e.g., image segmentation) are the optimal proxy, outperforming depth, edge, pixel reconstruction, and MAE/inpainting at synergizing understanding and generation.
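
To make the proxy idea concrete, the sketch below shows one plausible way a segmentation map could be rendered as a dense RGB target for the image-generation branch. The palette, mask format, and function names are illustrative assumptions, not the actual SGT pipeline.

```python
# Illustrative sketch only: turning a segmentation map into a dense RGB target
# that a generative (image-synthesis) branch could be tuned on. The palette,
# mask format, and names are assumptions, not the SGT implementation.
import numpy as np


def seg_map_to_rgb_target(seg_map: np.ndarray, seed: int = 0) -> np.ndarray:
    """Render an (H, W) integer segmentation map as an (H, W, 3) uint8 image.

    Each segment id is mapped to a fixed pseudo-random color, so the target
    carries high-level semantics (regions) rather than raw pixel appearance.
    """
    rng = np.random.default_rng(seed)
    ids = np.unique(seg_map)
    palette = {i: rng.integers(0, 256, size=3, dtype=np.uint8) for i in ids}

    target = np.zeros((*seg_map.shape, 3), dtype=np.uint8)
    for i, color in palette.items():
        target[seg_map == i] = color
    return target


if __name__ == "__main__":
    # Toy 4x4 map with three segments; in practice the maps would come from
    # SAM-style annotations such as those in the SAM-SGT dataset.
    toy = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 1],
                    [2, 2, 1, 1],
                    [2, 2, 2, 2]])
    print(seg_map_to_rgb_target(toy).shape)  # (4, 4, 3)
```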

Key findings:

  1. High-level > low-level: segmentation gives larger gains in both understanding and generation than depth / edge / pixel reconstruction.
  2. Perception, not reasoning: visual supervision mainly strengthens vision-centric perception (spatial understanding, hallucination resistance, OCR), rather than abstract reasoning.
  3. Architecture-agnostic: the gains hold for both BAGEL and OmniGen2.

πŸ“¦ Released Artifacts

| Repo | Type | Base Model | Content |
|------|------|------------|---------|
| Two-hot/SGT-BAGEL | model | BAGEL-7B-MoT | SGT fine-tuned BAGEL checkpoint |
| Two-hot/SGT-Gen2 | model | OmniGen2 | SGT fine-tuned OmniGen2 checkpoint (`transformer/` only) |
| Two-hot/SAM-SGT | dataset | – | Segmentation training data (tar-sharded) used by SGT |
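
As a starting point, the snippet below only downloads the checkpoints with `huggingface_hub`; loading and inference then follow the official BAGEL or OmniGen2 code, which is not reproduced here. The local directory names are arbitrary choices.

```python
# Minimal sketch: fetch the SGT checkpoints locally with huggingface_hub.
# Loading/inference follows the official BAGEL or OmniGen2 code; this snippet
# does not implement that part.
from huggingface_hub import snapshot_download

# SGT-tuned BAGEL checkpoint (full repo).
bagel_dir = snapshot_download(repo_id="Two-hot/SGT-BAGEL", local_dir="SGT-BAGEL")

# SGT-tuned OmniGen2 checkpoint: only transformer/ weights are released,
# presumably intended to replace the transformer/ folder of a base OmniGen2 copy.
gen2_dir = snapshot_download(repo_id="Two-hot/SGT-Gen2", local_dir="SGT-Gen2")

print(bagel_dir, gen2_dir)
```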

Use the SAM-SGT dataset

See Two-hot/SAM-SGT for the data layout and extraction instructions (files are stored as 5 GB tar shards to stay within Hugging Face file-size limits).
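
For reference, here is a minimal sketch that pulls the shards and unpacks them with Python's `tarfile`. The shard naming (`*.tar`) and output directory are assumptions; defer to the dataset card for the authoritative layout.

```python
# Minimal sketch: download the tar-sharded SAM-SGT data and unpack it.
# Shard names (*.tar) and the output folder are assumptions; see the dataset
# card for the exact layout and any official extraction script.
import tarfile
from pathlib import Path

from huggingface_hub import snapshot_download

data_dir = Path(snapshot_download(repo_id="Two-hot/SAM-SGT", repo_type="dataset"))
out_dir = Path("SAM-SGT-extracted")
out_dir.mkdir(exist_ok=True)

for shard in sorted(data_dir.glob("**/*.tar")):
    with tarfile.open(shard) as tf:
        tf.extractall(out_dir)  # each ~5 GB shard expands into the same tree
    print(f"extracted {shard.name}")
```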

πŸ“Š Highlights

  • +6.02% average gain over BAGEL on the CV-Bench evaluation.
  • Consistent improvements in spatial reasoning, hallucination resistance, and OCR.
  • Generation: gains across GenEval dimensions (Position / Color / Counting / Single-Object / etc.).
  • Verified on two representative UMM architectures (BAGEL, OmniGen2).

πŸ“ License

Apache-2.0. Base models remain under their original licenses: BAGEL (Apache-2.0, based on Qwen2.5-7B + SigLIP + FLUX VAE) and OmniGen2 (based on Qwen2.5-VL + diffusion transformer).

✍️ Citation

If you find this work useful, please cite our paper (anonymous ECCV 2026 submission, paper ID #3064):

@article{sgt2026,
  title   = {Semantic Generative Tuning for Unified Multimodal Models},
  author  = {Songsong Yu and Yuxin Chen and Ying Shan and Yanwei Li},
  journal = {arXiv preprint},
  year    = {2026}
}