DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
This is the diffusers-compatible version of DeepGen-1.0. The model weights are stored in safetensors format alongside a self-contained pipeline script (deepgen_pipeline.py), so there is no need to clone the DeepGen repository.
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 matches or surpasses state-of-the-art unified multimodal models that are 3× to 16× larger.
```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
```
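The 24GB guideline for CPU offload can be turned into a small check. The following is a minimal sketch, not part of the DeepGen pipeline: the helper name `should_enable_cpu_offload` and the `threshold_gib` parameter are hypothetical, and the threshold is simply the figure quoted in the comment on offloading.

```python
GIB = 1024 ** 3  # bytes per GiB


def should_enable_cpu_offload(total_vram_bytes: int, threshold_gib: float = 24.0) -> bool:
    """Hypothetical helper: True when the GPU has less memory than the threshold."""
    return total_vram_bytes < threshold_gib * GIB


if __name__ == "__main__":
    # Illustrative GPU sizes only; real values come from the device.
    for vram_gib in (16, 24, 48):
        print(f"{vram_gib} GiB -> offload: {should_enable_cpu_offload(vram_gib * GIB)}")
```

On a real machine the total memory can be read with `torch.cuda.get_device_properties(0).total_memory`. When the helper returns True, the optional `pipe.enable_model_cpu_offload()` call is the one to enable.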
Text-to-Image
```python
result = pipe(
    prompt="a racoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```
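To compare several samples for one prompt, the call arguments can be prepared once per seed. This is a minimal sketch that only builds keyword dictionaries from the parameters used in the text-to-image call; the helper name `build_seed_sweep` is hypothetical and nothing here touches the model.

```python
def build_seed_sweep(prompt, seeds, height=512, width=512,
                     num_inference_steps=50, guidance_scale=4.0):
    """Hypothetical helper: one keyword-argument dict per seed for pipe(**kwargs)."""
    return [
        {
            "prompt": prompt,
            "height": height,
            "width": width,
            "num_inference_steps": num_inference_steps,
            "guidance_scale": guidance_scale,
            "seed": seed,
        }
        for seed in seeds
    ]


if __name__ == "__main__":
    requests = build_seed_sweep("a racoon holding a shiny red apple over its head", [0, 1, 2])
    for kwargs in requests:
        print(kwargs["seed"], kwargs["guidance_scale"])
```

Each dictionary can then be passed as `pipe(**kwargs)`, saving `result.images[0]` under a file name that includes the seed.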
Image Editing
```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```
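The editing call requests a 512×512 output, but how the pipeline treats a source image with a different aspect ratio is not documented here. One option, sketched below under that assumption, is to center-crop the source to the target aspect ratio before passing it in. The helper `center_crop_box` is hypothetical and only computes the crop rectangle.

```python
def center_crop_box(width, height, target_width=512, target_height=512):
    """Hypothetical helper: (left, top, right, bottom) of a centered crop
    that matches the target aspect ratio."""
    target_ratio = target_width / target_height
    if width / height > target_ratio:
        # Image is too wide: keep full height, trim the sides.
        new_width = round(height * target_ratio)
        left = (width - new_width) // 2
        return (left, 0, left + new_width, height)
    # Image is too tall (or already matches): keep full width, trim top and bottom.
    new_height = round(width / target_ratio)
    top = (height - new_height) // 2
    return (0, top, width, top + new_height)


if __name__ == "__main__":
    print(center_crop_box(1024, 512))
    print(center_crop_box(512, 1024))
```

With Pillow, the result can be applied as `source_image.crop(center_crop_box(*source_image.size)).resize((512, 512))` before calling the pipeline.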
Parameters

| Parameter | Default | Description |
|---|---|---|
| prompt | required | Text prompt for generation or editing |
| image | None | Input image for editing. If None, performs text-to-image generation |
Note: The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from Qwen/Qwen2.5-VL-3B-Instruct. You can override the VLM path using the vlm_model_path parameter in from_pretrained().
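When a local copy of the VLM is available, the override can be chosen at load time. The sketch below is an illustration only: `resolve_vlm_path` is a hypothetical helper, and the local directory in the example is a placeholder. The only names taken from this document are the Hub ID and the `vlm_model_path` parameter.

```python
import os

DEFAULT_VLM = "Qwen/Qwen2.5-VL-3B-Instruct"  # Hub ID named in the note above


def resolve_vlm_path(local_dir=None):
    """Hypothetical helper: prefer an existing local directory, else the Hub ID."""
    if local_dir and os.path.isdir(local_dir):
        return local_dir
    return DEFAULT_VLM


if __name__ == "__main__":
    # "/models/Qwen2.5-VL-3B-Instruct" is a placeholder path for illustration.
    print(resolve_vlm_path("/models/Qwen2.5-VL-3B-Instruct"))
```

The returned value would be passed as `vlm_model_path=resolve_vlm_path(...)` alongside the other arguments to `DiffusionPipeline.from_pretrained()`.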
Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce Stacked Channel Bridging (SCB), a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
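The channel-stacking idea can be illustrated with a toy example. This is a rough sketch under stated assumptions, not the authors' implementation: it assumes the think tokens are appended to the VLM input sequence, so every selected layer already holds a hidden state for them, and that "stacking" means concatenating the per-token features of the selected layers along the channel axis. The function name `stack_channels` is hypothetical, and the learned connector that follows is omitted.

```python
def stack_channels(layer_states):
    """Concatenate per-token features from several layers along the channel axis.

    layer_states: list of layers; each layer is a list of token vectors
    (lists of floats). All layers must share the same sequence length.
    """
    seq_len = len(layer_states[0])
    if any(len(layer) != seq_len for layer in layer_states):
        raise ValueError("all layers must have the same sequence length")
    return [
        [value for layer in layer_states for value in layer[t]]
        for t in range(seq_len)
    ]


if __name__ == "__main__":
    # Toy sizes: 2 layers, 3 tokens (2 prompt tokens + 1 think token), hidden size 2.
    shallow = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    deep = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    stacked = stack_channels([shallow, deep])
    print(len(stacked), len(stacked[0]))  # sequence length, stacked channel width
```

In this reading, the stacked sequence is what a connector would map to the DiT conditioning. The actual layer selection and fusion used by SCB are described in the paper.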
| Component | Parameters | Description |
|---|---|---|
| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |
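Summing the component sizes gives a rough sense of the full footprint. The headline 5B figure counts only the VLM and the DiT, so the total with the connector and VAE is somewhat higher. The sketch below uses the approximate values from the table and assumes 2 bytes per parameter for bfloat16 weights. It estimates weight storage only, not activations or other runtime overhead.

```python
# Approximate parameter counts in billions, taken from the component table.
components_b = {
    "VLM (Qwen2.5-VL-3B)": 3.0,
    "Connector (SCB)": 0.8,
    "DiT (SD3.5M Kontext)": 2.0,
    "VAE": 0.08,
}

BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 16 bits

total_b = sum(components_b.values())
weights_gib = total_b * 1e9 * BYTES_PER_PARAM_BF16 / 1024 ** 3

if __name__ == "__main__":
    print(f"total parameters: ~{total_b:.2f}B")
    print(f"bf16 weights: ~{weights_gib:.1f} GiB")
```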
Benchmarks
1. General Image Generation

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
|---|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | **87.65** | 53.77 |
| Lumina-DiMOO | 8B | **0.88** | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | - |
| Qwen-Image | 7B + 20B | **0.87** | **88.32** | **78.81** |
| LongCat-Image | 7B + 6B | **0.87** | 86.80 | - |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | - | 84.78 | - |
| DeepGen 1.0 (SFT) | 3B + 2B | **0.86** | 87.05 | **74.18** |
| DeepGen 1.0 (RL) | 3B + 2B | **0.87** | **87.90** | **75.74** |
2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
|---|---|---|---|
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | **7.54** | **4.35** |
| LongCat-Image-Edit | 7B + 6B | **7.60** | **4.50** |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| DeepGen 1.0 (SFT) | 3B + 2B | 7.12 | 4.09 |
| DeepGen 1.0 (RL) | 3B + 2B | **7.17** | **4.14** |
3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | **0.70** | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | **46.3** |
| LongCat-Image | 7B + 6B | 0.65 | **52.2** |
| Z-Image-Turbo | 4B + 6B | - | 43.7 |
| DeepGen 1.0 (SFT) | 3B + 2B | **0.72** | 45.7 |
| DeepGen 1.0 (RL) | 3B + 2B | **0.73** | **46.5** |
4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
|---|---|---|---|
| OmniGen2 | 3B + 4B | - | 43.4 |
| BAGEL | 14B | **11.9** | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | **56.5** |
| DeepGen 1.0 (SFT) | 3B + 2B | **13.3** | **77.5** |
| DeepGen 1.0 (RL) | 3B + 2B | **10.8** | **75.7** |
Citation
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}