---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# πŸ’‘ DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format with a self-contained pipeline script (`deepgen_pipeline.py`) β€” **no need to clone the DeepGen repository**.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilitiesβ€”general image generation, general image editing, reasoning image generation, reasoning image editing, and text renderingβ€”within a single model. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or surpasses, state-of-the-art unified multimodal models that are 3Γ— to 16Γ— larger.

## πŸ› οΈ Quick Start

### Installation

```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

### Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
```

### Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

### Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

## πŸ“‹ Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
| `height` | 512 | Output image height in pixels |
| `width` | 512 | Output image width in pixels |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |
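
Passing the same `seed` makes runs reproducible. The usual diffusers-style mechanism (an assumption about this pipeline's internals, not confirmed by the model card) is that the seed initializes a `torch.Generator` used to draw the starting latent noise, so identical seeds yield identical starting points. A minimal sketch with illustrative, hypothetical latent shapes:

```python
import torch

def initial_latents(seed, batch=1, channels=16, height=64, width=64):
    # Seed a dedicated generator so noise sampling is reproducible.
    # Shapes here are illustrative, not the pipeline's actual latent size.
    gen = torch.Generator("cpu").manual_seed(seed)
    return torch.randn(batch, channels, height, width, generator=gen)

a = initial_latents(42)
b = initial_latents(42)
c = initial_latents(43)
print(torch.equal(a, b))  # same seed -> identical latents: True
print(torch.equal(a, c))  # different seed -> different latents: False
```

In practice this means fixing `seed` while sweeping `guidance_scale` or `num_inference_steps` isolates the effect of those parameters on the output.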

## πŸ’Ύ Memory Requirements

| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

## πŸ“ Directory Structure

```
DeepGen-1.0-diffusers/
β”œβ”€β”€ transformer/          # SD3 DiT weights (safetensors)
β”œβ”€β”€ vae/                  # AutoencoderKL weights
β”œβ”€β”€ connector/            # SCB Connector weights + config
β”œβ”€β”€ scheduler/            # FlowMatchEulerDiscreteScheduler config
β”œβ”€β”€ tokenizer/            # Qwen2.5-VL tokenizer
β”œβ”€β”€ prompt_template.json  # Prompt formatting template
β”œβ”€β”€ model_index.json      # Model metadata
└── deepgen_pipeline.py   # Self-contained pipeline script
```

> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.

## 🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
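
The SCB idea above can be sketched in a few lines of PyTorch. This is a conceptual illustration only, not the released connector: the dimensions, number of tapped layers, think-token count, and fusion order are all assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class SCBConnectorSketch(nn.Module):
    """Illustrative sketch of Stacked Channel Bridging: hidden states from
    several VLM layers are stacked along the channel axis, projected to the
    DiT conditioning width, fused with learnable "think tokens", and refined
    by a small transformer. All sizes here are hypothetical."""

    def __init__(self, vlm_dim=2048, cond_dim=1536, num_layers_tapped=4,
                 num_think_tokens=64, depth=2):
        super().__init__()
        # Project the channel-stacked multi-layer features to the DiT width.
        self.proj = nn.Linear(vlm_dim * num_layers_tapped, cond_dim)
        # Learnable tokens that carry structured, reasoning-rich guidance.
        self.think_tokens = nn.Parameter(torch.randn(1, num_think_tokens, cond_dim))
        layer = nn.TransformerEncoderLayer(cond_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, hidden_states):
        # hidden_states: list of (B, T, vlm_dim) tensors from tapped VLM layers.
        stacked = torch.cat(hidden_states, dim=-1)          # (B, T, vlm_dim * L)
        tokens = self.proj(stacked)                          # (B, T, cond_dim)
        think = self.think_tokens.expand(tokens.size(0), -1, -1)
        fused = torch.cat([think, tokens], dim=1)            # prepend think tokens
        return self.encoder(fused)                           # conditioning for the DiT

B, T = 2, 10
feats = [torch.randn(B, T, 2048) for _ in range(4)]
out = SCBConnectorSketch()(feats)
print(out.shape)  # torch.Size([2, 74, 1536])
```

The output sequence (think tokens plus projected VLM tokens) would then serve as the conditioning input to the diffusion transformer.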

| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

## πŸ“Š Benchmarks

### 1. General Image Generation

| Model | Params | Geneval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | ----------- | ------------ | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 πŸ₯‰ | 53.77 |
| Lumina-DiMOO | 8B | 0.88 πŸ₯‡ | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | β€” |
| Qwen-Image | 7B + 20B | 0.87 πŸ₯ˆ | 88.32 πŸ₯‡ | 78.81 πŸ₯‡ |
| LongCat-Image | 7B + 6B | 0.87 πŸ₯ˆ | 86.80 | β€” |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | β€” | 84.78 | β€” |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 πŸ₯‰ | 87.05 | 74.18 πŸ₯‰ |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 πŸ₯ˆ | 87.90 πŸ₯ˆ | 75.74 πŸ₯ˆ |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 πŸ₯ˆ | 4.35 πŸ₯ˆ |
| LongCat-Image-Edit | 7B + 6B | 7.60 πŸ₯‡ | 4.50 πŸ₯‡ |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 πŸ₯‰ | 4.14 πŸ₯‰ |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 πŸ₯‰ | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 πŸ₯‰ |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 πŸ₯‡ |
| Z-Image-Turbo | 4B + 6B | β€” | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 πŸ₯ˆ | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 πŸ₯‡ | 46.5 πŸ₯ˆ |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | β€” | 43.4 |
| BAGEL | 14B | 11.9 πŸ₯ˆ | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 πŸ₯‰ |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 πŸ₯‡ | 77.5 πŸ₯‡ |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 πŸ₯‰ | 75.7 πŸ₯ˆ |

## ⭐ Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

## License

Apache 2.0