---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format with a self-contained pipeline script (`deepgen_pipeline.py`), so there is no need to clone the DeepGen repository.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities within a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.

## Quick Start

### Installation

```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub
# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

### Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
# pipe.enable_model_cpu_offload()
```

### Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

### Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")
result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512, width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

## Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing; if `None`, performs text-to-image generation |
| `height` | `512` | Output image height in pixels |
| `width` | `512` | Output image width in pixels |
| `num_inference_steps` | `50` | Number of denoising steps |
| `guidance_scale` | `4.0` | Classifier-free guidance (CFG) scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |

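As a rough intuition for `guidance_scale`: classifier-free guidance combines the model's unconditional and prompt-conditioned predictions at each denoising step, pushing the result toward the prompt. A minimal numerical sketch of that combination (illustrative only; the pipeline applies it internally during denoising):

```python
import torch

def cfg_combine(uncond: torch.Tensor, cond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance: move the prediction away from the
    unconditional output, toward the prompt-conditioned output."""
    return uncond + guidance_scale * (cond - uncond)

uncond = torch.tensor([0.0, 0.0])
cond = torch.tensor([1.0, -1.0])

print(cfg_combine(uncond, cond, 1.0))  # scale 1.0 reproduces the conditional prediction
print(cfg_combine(uncond, cond, 4.0))  # scale 4.0 amplifies the prompt's influence
```

Higher values follow the prompt more literally at the cost of diversity; the default of 4.0 is a moderate setting.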
## Memory Requirements

| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

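The full-GPU figure is consistent with a back-of-envelope estimate: ~5B parameters in bfloat16 take 2 bytes each, with the remainder going to activations, the VAE, and framework overhead. A rough sketch under those assumptions (actual usage varies with resolution and batch size):

```python
def bf16_weight_gib(num_params: float) -> float:
    """Approximate weight memory in GiB for bfloat16 (2 bytes per parameter)."""
    return num_params * 2 / 1024**3

# 3B VLM + 2B DiT (the connector and VAE are comparatively small)
weights = bf16_weight_gib(3e9) + bf16_weight_gib(2e9)
print(f"weights alone: ~{weights:.1f} GiB")  # ~9.3 GiB; the rest of the ~20 GB is activations/overhead
```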
## Directory Structure

```
DeepGen-1.0-diffusers/
├── transformer/            # SD3 DiT weights (safetensors)
├── vae/                    # AutoencoderKL weights
├── connector/              # SCB Connector weights + config
├── scheduler/              # FlowMatchEulerDiscreteScheduler config
├── tokenizer/              # Qwen2.5-VL tokenizer
├── prompt_template.json    # Prompt formatting template
├── model_index.json        # Model metadata
└── deepgen_pipeline.py     # Self-contained pipeline script
```

> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path via the `vlm_model_path` parameter of `from_pretrained()`.

## Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with, or even surpassing, much larger counterparts. To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

| Component | Parameters | Description |
|-----------|------------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Vision-language model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

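The SCB idea can be sketched in a few lines of PyTorch. Everything below (layer choices, dimensions, the `SCBConnector` name) is illustrative rather than the released implementation: hidden states tapped from several VLM layers are stacked along the channel dimension, projected, prepended with learnable think tokens, and refined by a small Transformer before conditioning the DiT.

```python
import torch
import torch.nn as nn

class SCBConnector(nn.Module):
    """Illustrative Stacked Channel Bridging connector (not the released code)."""

    def __init__(self, vlm_dim=2048, num_layers_tapped=4, dit_dim=1536,
                 num_think_tokens=32, depth=6):
        super().__init__()
        # Fuse features stacked channel-wise from several VLM layers.
        self.proj = nn.Linear(vlm_dim * num_layers_tapped, dit_dim)
        # Learnable "think tokens" prepended to the sequence.
        self.think_tokens = nn.Parameter(torch.randn(1, num_think_tokens, dit_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(
            d_model=dit_dim, nhead=8, dim_feedforward=4 * dit_dim, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, hidden_states):
        # hidden_states: list of [batch, seq, vlm_dim] tensors from tapped VLM layers
        x = torch.cat(hidden_states, dim=-1)       # stack along the channel dimension
        x = self.proj(x)                           # -> [batch, seq, dit_dim]
        think = self.think_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([think, x], dim=1)           # prepend think tokens
        return self.blocks(x)                      # conditioning sequence for the DiT

connector = SCBConnector()
feats = [torch.randn(2, 77, 2048) for _ in range(4)]
cond = connector(feats)
print(cond.shape)  # torch.Size([2, 109, 1536])
```

The think tokens give the connector a fixed budget of free slots in which to aggregate reasoning context, independent of the prompt length.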
## Benchmarks

### 1. General Image Generation

| Model | Params | GenEval ↑ | DPGBench ↑ | UniGenBench ↑ |
| --------------------- | ----------- | --------- | ---------- | ------------- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | – |
| Qwen-Image | 7B + 20B | 0.87 | 88.32 | 78.81 |
| LongCat-Image | 7B + 6B | 0.87 | 86.80 | – |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | – | 84.78 | – |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 | 87.05 | 74.18 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 | 87.90 | 75.74 |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 | 4.35 |
| LongCat-Image-Edit | 7B + 6B | 7.60 | 4.50 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 | 4.14 |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 |
| Z-Image-Turbo | 4B + 6B | – | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 | 46.5 |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | – | 43.4 |
| BAGEL | 14B | 11.9 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 | 77.5 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 | 75.7 |

## Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

## License

Apache 2.0