---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---
# 💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format together with a self-contained pipeline script (`deepgen_pipeline.py`), so there is no need to clone the DeepGen repository.
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.
## 🛠️ Quick Start
### Installation
```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```
### Load Pipeline
```python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")
# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
```
### Text-to-Image
```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```
### Image Editing
```python
from PIL import Image
source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```
## 📋 Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
| `height` | 512 | Output image height |
| `width` | 512 | Output image width |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |
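The defaults in the table above can be summarized as a plain dictionary. The sketch below is illustrative only (it mirrors the documented defaults, not the pipeline's actual internals); `build_call_kwargs` is a hypothetical helper, not part of the released API.

```python
# Illustrative sketch of the documented call defaults; not the pipeline's
# actual internals. `build_call_kwargs` is a hypothetical helper.
DEFAULTS = {
    "image": None,               # None => text-to-image generation
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "seed": None,
    "negative_prompt": "",
}

def build_call_kwargs(prompt, **overrides):
    """Merge user overrides onto the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"prompt": prompt, **DEFAULTS, **overrides}

kwargs = build_call_kwargs("a red apple", guidance_scale=5.0, seed=0)
# The result could then be passed as pipe(**kwargs).
```
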
## 💾 Memory Requirements
| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |
## 📁 Directory Structure
```
DeepGen-1.0-diffusers/
├── transformer/            # SD3 DiT weights (safetensors)
├── vae/                    # AutoencoderKL weights
├── connector/              # SCB Connector weights + config
├── scheduler/              # FlowMatchEulerDiscreteScheduler config
├── tokenizer/              # Qwen2.5-VL tokenizer
├── prompt_template.json    # Prompt formatting template
├── model_index.json        # Model metadata
└── deepgen_pipeline.py     # Self-contained pipeline script
```
> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.
## 🧠 Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
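The SCB idea described above can be sketched with plain array operations: hidden states tapped from several VLM layers are stacked along the channel axis, learnable "think tokens" are appended along the sequence axis, and a linear projection maps the fused features to the DiT conditioning width. This is a minimal NumPy sketch under assumed, illustrative dimensions; the layer counts, token counts, and widths are placeholders, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, vlm_dim, dit_dim = 77, 2048, 1536
num_layers = 4          # number of VLM layers tapped (assumed)
num_think_tokens = 16   # learnable "think tokens" (assumed)

# Hidden states from the tapped VLM layers: (num_layers, seq_len, vlm_dim)
vlm_states = rng.standard_normal((num_layers, seq_len, vlm_dim))

# Stack the layers along the channel axis: (seq_len, num_layers * vlm_dim)
stacked = np.concatenate(list(vlm_states), axis=-1)

# Learnable think tokens live in the same stacked channel space
think_tokens = rng.standard_normal((num_think_tokens, num_layers * vlm_dim))

# Append the think tokens along the sequence axis
fused = np.concatenate([stacked, think_tokens], axis=0)

# Linear projection down to the DiT conditioning width
proj = rng.standard_normal((num_layers * vlm_dim, dit_dim)) * 0.01
conditioning = fused @ proj

print(conditioning.shape)  # (93, 1536): 77 text tokens + 16 think tokens
```

In the actual model the stacking, think tokens, and projection are learned inside a 6-layer Transformer connector (see the component table below); the sketch only shows the data flow.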
| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |
## 📊 Benchmarks
### 1. General Image Generation
| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | --------- | ---------- | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | – |
| Qwen-Image | 7B + 20B | 0.87 | 88.32 | 78.81 |
| LongCat-Image | 7B + 6B | 0.87 | 86.80 | – |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | – | 84.78 | – |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 | 87.05 | 74.18 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 | 87.90 | 75.74 |
### 2. General Image Editing
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 | 4.35 |
| LongCat-Image-Edit | 7B + 6B | 7.60 | 4.50 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 | 4.14 |
### 3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 |
| Z-Image-Turbo | 4B + 6B | – | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 | 46.5 |
### 4. Reasoning Image Editing
| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | – | 43.4 |
| BAGEL | 14B | 11.9 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 | 77.5 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 | 75.7 |
## ✍️ Citation
```bibtex
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}
```
## License
Apache 2.0