---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---
# 💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing
This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format together with a self-contained pipeline script (`deepgen_pipeline.py`), so there is no need to clone the DeepGen repository.
DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.
## 🛠️ Quick Start
### Installation
```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```
### Load Pipeline
```python
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")
# Optional: enable CPU offload for GPUs with limited memory (< 24GB)
# pipe.enable_model_cpu_offload()
```
### Text-to-Image
```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```
### Image Editing
```python
from PIL import Image
source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```
## 📋 Parameters
| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing. If `None`, performs text-to-image generation |
| `height` | 512 | Output image height |
| `width` | 512 | Output image width |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |
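The defaults in the table above can be summarized as a plain dictionary. The sketch below is illustrative only (it mirrors the documented defaults, not the pipeline's actual internals); `build_call_kwargs` is a hypothetical helper, not part of the released API.

```python
# Illustrative sketch of the documented call defaults; not the pipeline's
# actual internals. `build_call_kwargs` is a hypothetical helper.
DEFAULTS = {
    "image": None,               # None => text-to-image generation
    "height": 512,
    "width": 512,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "seed": None,
    "negative_prompt": "",
}

def build_call_kwargs(prompt, **overrides):
    """Merge user overrides onto the documented defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"prompt": prompt, **DEFAULTS, **overrides}

kwargs = build_call_kwargs("a red apple", guidance_scale=5.0, seed=0)
# The result could then be passed as pipe(**kwargs).
```
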
## 💾 Memory Requirements
| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU Offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |
## 📁 Directory Structure
```
DeepGen-1.0-diffusers/
├── transformer/            # SD3 DiT weights (safetensors)
├── vae/                    # AutoencoderKL weights
├── connector/              # SCB Connector weights + config
├── scheduler/              # FlowMatchEulerDiscreteScheduler config
├── tokenizer/              # Qwen2.5-VL tokenizer
├── prompt_template.json    # Prompt formatting template
├── model_index.json        # Model metadata
└── deepgen_pipeline.py     # Self-contained pipeline script
```
> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path using the `vlm_model_path` parameter in `from_pretrained()`.
## 🧠 Method
Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with or even surpassing much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.
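The SCB idea described above can be sketched with plain array operations: hidden states tapped from several VLM layers are stacked along the channel axis, learnable "think tokens" are appended along the sequence axis, and a linear projection maps the fused features to the DiT conditioning width. This is a minimal NumPy sketch under assumed, illustrative dimensions; the layer counts, token counts, and widths are placeholders, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, vlm_dim, dit_dim = 77, 2048, 1536
num_layers = 4          # number of VLM layers tapped (assumed)
num_think_tokens = 16   # learnable "think tokens" (assumed)

# Hidden states from the tapped VLM layers: (num_layers, seq_len, vlm_dim)
vlm_states = rng.standard_normal((num_layers, seq_len, vlm_dim))

# Stack the layers along the channel axis: (seq_len, num_layers * vlm_dim)
stacked = np.concatenate(list(vlm_states), axis=-1)

# Learnable think tokens live in the same stacked channel space
think_tokens = rng.standard_normal((num_think_tokens, num_layers * vlm_dim))

# Append the think tokens along the sequence axis
fused = np.concatenate([stacked, think_tokens], axis=0)

# Linear projection down to the DiT conditioning width
proj = rng.standard_normal((num_layers * vlm_dim, dit_dim)) * 0.01
conditioning = fused @ proj

print(conditioning.shape)  # (93, 1536): 77 text tokens + 16 think tokens
```

In the actual model the stacking, think tokens, and projection are learned inside a 6-layer Transformer connector (see the component table below); the sketch only shows the data flow.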
| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Visual Language Model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |
## 📊 Benchmarks
### 1. General Image Generation
| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | --------- | ---------- | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | – |
| Qwen-Image | 7B + 20B | 0.87 | 88.32 | 78.81 |
| LongCat-Image | 7B + 6B | 0.87 | 86.80 | – |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | – | 84.78 | – |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 | 87.05 | 74.18 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 | 87.90 | 75.74 |
### 2. General Image Editing
| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 | 4.35 |
| LongCat-Image-Edit | 7B + 6B | 7.60 | 4.50 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 | 4.14 |
### 3. Reasoning Image Generation
| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 |
| Z-Image-Turbo | 4B + 6B | – | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 | 46.5 |
### 4. Reasoning Image Editing
| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | – | 43.4 |
| BAGEL | 14B | 11.9 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 | 77.5 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 | 75.7 |
## ✍️ Citation
```bibtex
@article{wang2026deepgen,
title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
journal={arXiv preprint arXiv:2602.12205},
year={2026}
}
```
## License
Apache 2.0