license: apache-2.0
datasets:
  - Alex11556666/Reason_Tuning
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image

💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the diffusers-compatible version of DeepGen-1.0. The model weights are stored in safetensors format with a self-contained pipeline script (deepgen_pipeline.py) — no need to clone the DeepGen repository.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across the benchmarks reported below, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger.

🛠️ Quick Start

Installation

pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation

Load Pipeline

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
# pipe.enable_model_cpu_offload()

Text-to-Image

result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
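To compare several samples for one prompt, the call above can be wrapped in a loop over seeds. The following is a minimal sketch, not part of `deepgen_pipeline.py`: `generate_variants` is a hypothetical helper that assumes only the `seed` argument and the `result.images` attribute shown in the example above.

```python
from typing import Any, Callable, Iterable, List


def generate_variants(
    pipe: Callable[..., Any],
    prompt: str,
    seeds: Iterable[int],
    **kwargs: Any,
) -> List[Any]:
    """Call the pipeline once per seed and collect the first image of each run.

    Hypothetical helper: it relies only on the `seed` keyword and the
    `result.images` attribute used in the Text-to-Image example.
    """
    images = []
    for seed in seeds:
        # Extra keyword arguments (height, width, ...) are passed through unchanged.
        result = pipe(prompt=prompt, seed=seed, **kwargs)
        images.append(result.images[0])
    return images
```

With a loaded pipeline, `generate_variants(pipe, prompt, seeds=[0, 1, 2], height=512, width=512)` would return three images, one per seed, which can then be saved individually.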

Image Editing

from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")

📋 Parameters

Parameter	Default	Description
`prompt`	required	Text prompt for generation or editing
`image`	`None`	Input image for editing. If `None`, performs text-to-image generation
`height`	512	Output image height
`width`	512	Output image width
`num_inference_steps`	50	Number of denoising steps
`guidance_scale`	4.0	Classifier-free guidance scale
`seed`	`None`	Random seed for reproducibility
`negative_prompt`	`""`	Negative prompt for CFG
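When the same settings are reused across many calls, it can help to keep the defaults from the table in one place. The sketch below is illustrative only: `DeepGenCallArgs` and `to_kwargs` are hypothetical names, and the default values are copied from the table rather than read from the pipeline itself.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class DeepGenCallArgs:
    """Hypothetical container mirroring the defaults in the parameter table."""

    prompt: str
    image: Optional[Any] = None      # None means text-to-image generation
    height: int = 512
    width: int = 512
    num_inference_steps: int = 50
    guidance_scale: float = 4.0
    seed: Optional[int] = None       # None means no fixed seed
    negative_prompt: str = ""

    def to_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments for pipe(...), omitting entries left at None."""
        return {k: v for k, v in vars(self).items() if v is not None}
```

A call would then look like `pipe(**DeepGenCallArgs(prompt="...", seed=42).to_kwargs())`. Omitting the `None` entries leaves the pipeline's own defaults in charge of `image` and `seed`.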

💾 Memory Requirements

Mode	VRAM
Full GPU	~20 GB
CPU Offload (`pipe.enable_model_cpu_offload()`)	~14 GB
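The table can be turned into a simple decision rule for choosing a mode at startup. This is a rough sketch: `choose_memory_mode` is a hypothetical helper, and the thresholds are the approximate figures from the table, so real usage may need extra headroom for larger resolutions.

```python
FULL_GPU_GB = 20.0     # approximate figure from the table above
CPU_OFFLOAD_GB = 14.0  # approximate figure from the table above


def choose_memory_mode(total_vram_gb: float) -> str:
    """Map available VRAM (in GB) to one of the modes listed in the table.

    Returns "full_gpu", "cpu_offload", or "insufficient".
    """
    if total_vram_gb >= FULL_GPU_GB:
        return "full_gpu"
    if total_vram_gb >= CPU_OFFLOAD_GB:
        return "cpu_offload"
    return "insufficient"
```

The total VRAM can be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3`. When the result is `"cpu_offload"`, call `pipe.enable_model_cpu_offload()` instead of `pipe.to("cuda")`.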

📁 Directory Structure

DeepGen-1.0-diffusers/
├── transformer/          # SD3 DiT weights (safetensors)
├── vae/                  # AutoencoderKL weights
├── connector/            # SCB Connector weights + config
├── scheduler/            # FlowMatchEulerDiscreteScheduler config
├── tokenizer/            # Qwen2.5-VL tokenizer
├── prompt_template.json  # Prompt formatting template
├── model_index.json      # Model metadata
└── deepgen_pipeline.py   # Self-contained pipeline script

Note: The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from `Qwen/Qwen2.5-VL-3B-Instruct`. You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.
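As a sketch of how the override might be wired in, the helper below assembles the keyword arguments and adds `vlm_model_path` only when one is given. `build_load_kwargs` and `load_pipeline` are hypothetical names, and whether `vlm_model_path` accepts a local directory as well as a Hub ID is an assumption not confirmed here.

```python
from typing import Any, Dict, Optional

REPO_ID = "deepgenteam/DeepGen-1.0-diffusers"


def build_load_kwargs(vlm_model_path: Optional[str] = None) -> Dict[str, Any]:
    """Assemble keyword arguments for DiffusionPipeline.from_pretrained().

    When `vlm_model_path` is None the key is omitted, so the pipeline
    falls back to its default VLM location.
    """
    kwargs: Dict[str, Any] = {"trust_remote_code": True}
    if vlm_model_path is not None:
        kwargs["vlm_model_path"] = vlm_model_path
    return kwargs


def load_pipeline(vlm_model_path: Optional[str] = None):
    """Load the pipeline; heavy imports are deferred until this is called."""
    import torch
    from diffusers import DiffusionPipeline

    return DiffusionPipeline.from_pretrained(
        REPO_ID,
        torch_dtype=torch.bfloat16,
        **build_load_kwargs(vlm_model_path),
    )
```

For example, `load_pipeline("/path/to/Qwen2.5-VL-3B-Instruct")` would point the pipeline at a placeholder local copy, while `load_pipeline()` keeps the default.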

🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

Component	Parameters	Description
VLM (Qwen2.5-VL-3B)	3B	Visual Language Model for understanding prompts and reference images
Connector (SCB)	~0.8B	6-layer Transformer bridging VLM hidden states to DiT conditioning
DiT (SD3.5M Kontext)	2B	Diffusion Transformer for image generation
VAE	~80M	Image encoder/decoder
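For sizing purposes, the component counts in the table can be tallied directly. The short calculation below uses only the figures listed above, treating the approximate entries (~0.8B, ~80M) as exact. It suggests that the 5B headline figure counts the VLM and DiT, while the connector and VAE add roughly 0.9B more. The bf16 estimate assumes 2 bytes per parameter and covers weights only, not activations.

```python
# Parameter counts in billions, taken from the component table above.
COMPONENT_PARAMS_B = {
    "VLM (Qwen2.5-VL-3B)": 3.0,
    "Connector (SCB)": 0.8,      # listed as ~0.8B
    "DiT (SD3.5M Kontext)": 2.0,
    "VAE": 0.08,                 # listed as ~80M
}

headline_b = (
    COMPONENT_PARAMS_B["VLM (Qwen2.5-VL-3B)"]
    + COMPONENT_PARAMS_B["DiT (SD3.5M Kontext)"]
)
total_b = sum(COMPONENT_PARAMS_B.values())

# bf16 stores each parameter in 2 bytes, so billions of parameters * 2 = GB.
bf16_weights_gb = total_b * 2

print(f"VLM + DiT:      {headline_b:.2f}B")
print(f"All components: {total_b:.2f}B")
print(f"bf16 weights:   {bf16_weights_gb:.2f} GB")
```

The weights-only estimate of about 11.8 GB sits below the ~14 GB and ~20 GB figures in the memory table, which also have to cover activations and intermediate buffers.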

📊 Benchmarks

1. General Image Generation

Model	Params	GenEval ↑	DPGBench ↑	UniGenBench ↑
OmniGen2	3B + 4B	0.80	83.57	63.09
BAGEL	14B	0.82	85.10	61.53
X-Omni	7B + 12B	0.83	87.65 🥉	53.77
Lumina-DiMOO	8B	0.88 🥇	86.04	71.12
Hunyuan-Image-3.0	80B	0.72	86.10	—
Qwen-Image	7B + 20B	0.87 🥈	88.32 🥇	78.81 🥇
LongCat-Image	7B + 6B	0.87 🥈	86.80	—
Z-Image-Turbo	4B + 6B	0.84	85.15	71.40
GLM-Image	9B + 7B	—	84.78	—
DeepGen 1.0 (SFT)	3B + 2B	0.86 🥉	87.05	74.18 🥉
DeepGen 1.0 (RL)	3B + 2B	0.87 🥈	87.90 🥈	75.74 🥈

2. General Image Editing

Model	Params	GEdit-EN ↑	ImgEdit ↑
BAGEL	14B	6.52	3.20
Qwen-Image-Edit [2509]	7B + 20B	7.54 🥈	4.35 🥈
LongCat-Image-Edit	7B + 6B	7.60 🥇	4.50 🥇
Mammoth2	8B + 3B + 2B	6.60	4.06
DeepGen 1.0 (SFT)	3B + 2B	7.12	4.09
DeepGen 1.0 (RL)	3B + 2B	7.17 🥉	4.14 🥉

3. Reasoning Image Generation

Model	Params	WISE ↑	T2I-CoREBench ↑
OmniGen2	3B + 4B	0.47	36.1
BAGEL	14B	0.70 🥉	41.1
Hunyuan-Image-3.0	80B	0.57	46.0
Qwen-Image	7B + 20B	0.62	46.3 🥉
LongCat-Image	7B + 6B	0.65	52.2 🥇
Z-Image-Turbo	4B + 6B	-	43.7
DeepGen 1.0 (SFT)	3B + 2B	0.72 🥈	45.7
DeepGen 1.0 (RL)	3B + 2B	0.73 🥇	46.5 🥈

4. Reasoning Image Editing

Model	Params	RISE ↑	UniREditBench ↑
OmniGen2	3B + 4B	-	43.4
BAGEL	14B	11.9 🥈	51.0
Qwen-Image-Edit [2509]	7B + 20B	8.9	56.5 🥉
DeepGen 1.0 (SFT)	3B + 2B	13.3 🥇	77.5 🥇
DeepGen 1.0 (RL)	3B + 2B	10.8 🥉	75.7 🥈

⭐ Citation

@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}

License

Apache 2.0