---
license: apache-2.0
datasets:
- Alex11556666/Reason_Tuning
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: text-to-image
---

# 💡 DeepGen 1.0 (Diffusers Format): A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing

This is the **diffusers-compatible** version of [DeepGen-1.0](https://huggingface.co/deepgenteam/DeepGen-1.0). The model weights are stored in safetensors format with a self-contained pipeline script (`deepgen_pipeline.py`), so there is **no need to clone the DeepGen repository**.

DeepGen 1.0 is a lightweight unified multimodal model with only 5B parameters (3B VLM + 2B DiT). It integrates five core capabilities in a single model: general image generation, general image editing, reasoning image generation, reasoning image editing, and text rendering. Across multiple authoritative benchmarks, DeepGen 1.0 is competitive with, or surpasses, state-of-the-art unified multimodal models that are 3× to 16× larger.

## 🛠️ Quick Start

### Installation

```bash
pip install torch diffusers transformers safetensors einops accelerate huggingface_hub

# Flash Attention (recommended)
pip install flash-attn --no-build-isolation
```

### Load Pipeline

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "deepgenteam/DeepGen-1.0-diffusers",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
pipe.to("cuda")

# Optional: enable CPU offload for GPUs with limited memory (< 24 GB)
# pipe.enable_model_cpu_offload()
```

### Text-to-Image

```python
result = pipe(
    prompt="a raccoon holding a shiny red apple over its head",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("output.png")
```

### Image Editing

```python
from PIL import Image

source_image = Image.open("guitar.png").convert("RGB")

result = pipe(
    prompt="Take a photo of this guitar placed on a sandy beach with the sunset in the background.",
    image=source_image,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    seed=42,
)
result.images[0].save("edited.png")
```

## 📋 Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `prompt` | required | Text prompt for generation or editing |
| `image` | `None` | Input image for editing; if `None`, performs text-to-image generation |
| `height` | 512 | Output image height (pixels) |
| `width` | 512 | Output image width (pixels) |
| `num_inference_steps` | 50 | Number of denoising steps |
| `guidance_scale` | 4.0 | Classifier-free guidance (CFG) scale |
| `seed` | `None` | Random seed for reproducibility |
| `negative_prompt` | `""` | Negative prompt for CFG |

## 💾 Memory Requirements

| Mode | VRAM |
|------|------|
| Full GPU | ~20 GB |
| CPU offload (`pipe.enable_model_cpu_offload()`) | ~14 GB |

## 📁 Directory Structure

```
DeepGen-1.0-diffusers/
├── transformer/           # SD3 DiT weights (safetensors)
├── vae/                   # AutoencoderKL weights
├── connector/             # SCB Connector weights + config
├── scheduler/             # FlowMatchEulerDiscreteScheduler config
├── tokenizer/             # Qwen2.5-VL tokenizer
├── prompt_template.json   # Prompt formatting template
├── model_index.json       # Model metadata
└── deepgen_pipeline.py    # Self-contained pipeline script
```

> **Note:** The VLM (Qwen2.5-VL-3B-Instruct) is loaded separately from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). You can override the VLM path with the `vlm_model_path` parameter of `from_pretrained()`.

## 🧠 Method

Our core observation is that a lightweight model, when empowered by synergistic architecture design and data-centric training strategies, can achieve comprehensive capabilities competitive with, or even surpassing, much larger counterparts.
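As a concrete illustration of what such a synergistic design can look like, the sketch below stacks hidden states from several VLM layers along the channel dimension and fuses them with learnable tokens before handing the result to the diffusion backbone. All layer choices, dimensions, and module names here are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class StackedChannelConnector(nn.Module):
    """Hypothetical sketch of a stacked-channel connector (not the released code)."""

    def __init__(self, vlm_dim=2048, dit_dim=1536, num_tapped_layers=4,
                 num_think_tokens=64, depth=6, heads=16):
        super().__init__()
        # Project the channel-stacked VLM features down to the DiT width.
        self.proj_in = nn.Linear(vlm_dim * num_tapped_layers, dit_dim)
        # Learnable "think tokens", prepended to the fused sequence.
        self.think_tokens = nn.Parameter(
            torch.randn(1, num_think_tokens, dit_dim) * 0.02)
        block = nn.TransformerEncoderLayer(
            d_model=dit_dim, nhead=heads,
            dim_feedforward=4 * dit_dim, batch_first=True,
        )
        self.fuser = nn.TransformerEncoder(block, num_layers=depth)

    def forward(self, layer_states):
        # layer_states: list of (batch, seq, vlm_dim) hidden states,
        # one per tapped VLM layer, stacked along the channel dimension.
        x = torch.cat(layer_states, dim=-1)       # (B, T, vlm_dim * L)
        x = self.proj_in(x)                       # (B, T, dit_dim)
        think = self.think_tokens.expand(x.size(0), -1, -1)
        x = torch.cat([think, x], dim=1)          # prepend think tokens
        return self.fuser(x)                      # conditioning for the DiT
```

In DeepGen 1.0 this role is played by the SCB connector described next, reported as a 6-layer Transformer of roughly 0.8B parameters; the sketch above only mirrors that interface.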
To overcome the limitations of lightweight models in semantic understanding and fine-grained control, we introduce **Stacked Channel Bridging (SCB)**, a deep alignment framework that extracts hierarchical features from multiple VLM layers and fuses them with learnable "think tokens" to provide the generative backbone with structured, reasoning-rich guidance.

| Component | Parameters | Description |
|-----------|-----------|-------------|
| VLM (Qwen2.5-VL-3B) | 3B | Vision-language model for understanding prompts and reference images |
| Connector (SCB) | ~0.8B | 6-layer Transformer bridging VLM hidden states to DiT conditioning |
| DiT (SD3.5M Kontext) | 2B | Diffusion Transformer for image generation |
| VAE | ~80M | Image encoder/decoder |

## 📊 Benchmarks

### 1. General Image Generation

| Model | Params | GenEval ↑ | DPG-Bench ↑ | UniGenBench ↑ |
| :--- | :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.80 | 83.57 | 63.09 |
| BAGEL | 14B | 0.82 | 85.10 | 61.53 |
| X-Omni | 7B + 12B | 0.83 | 87.65 🥉 | 53.77 |
| Lumina-DiMOO | 8B | 0.88 🥇 | 86.04 | 71.12 |
| Hunyuan-Image-3.0 | 80B | 0.72 | 86.10 | — |
| Qwen-Image | 7B + 20B | 0.87 🥈 | 88.32 🥇 | 78.81 🥇 |
| LongCat-Image | 7B + 6B | 0.87 🥈 | 86.80 | — |
| Z-Image-Turbo | 4B + 6B | 0.84 | 85.15 | 71.40 |
| GLM-Image | 9B + 7B | — | 84.78 | — |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.86 🥉 | 87.05 | 74.18 🥉 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.87 🥈 | 87.90 🥈 | 75.74 🥈 |

### 2. General Image Editing

| Model | Params | GEdit-EN ↑ | ImgEdit ↑ |
| :--- | :--- | :--- | :--- |
| BAGEL | 14B | 6.52 | 3.20 |
| Qwen-Image-Edit [2509] | 7B + 20B | 7.54 🥈 | 4.35 🥈 |
| LongCat-Image-Edit | 7B + 6B | 7.60 🥇 | 4.50 🥇 |
| Mammoth2 | 8B + 3B + 2B | 6.60 | 4.06 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 7.12 | 4.09 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 7.17 🥉 | 4.14 🥉 |

### 3. Reasoning Image Generation

| Model | Params | WISE ↑ | T2I-CoREBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | 0.47 | 36.1 |
| BAGEL | 14B | 0.70 🥉 | 41.1 |
| Hunyuan-Image-3.0 | 80B | 0.57 | 46.0 |
| Qwen-Image | 7B + 20B | 0.62 | 46.3 🥉 |
| LongCat-Image | 7B + 6B | 0.65 | 52.2 🥇 |
| Z-Image-Turbo | 4B + 6B | — | 43.7 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 0.72 🥈 | 45.7 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 0.73 🥇 | 46.5 🥈 |

### 4. Reasoning Image Editing

| Model | Params | RISE ↑ | UniREditBench ↑ |
| :--- | :--- | :--- | :--- |
| OmniGen2 | 3B + 4B | — | 43.4 |
| BAGEL | 14B | 11.9 🥈 | 51.0 |
| Qwen-Image-Edit [2509] | 7B + 20B | 8.9 | 56.5 🥉 |
| **DeepGen 1.0 (SFT)** | **3B + 2B** | 13.3 🥇 | 77.5 🥇 |
| **DeepGen 1.0 (RL)** | **3B + 2B** | 10.8 🥉 | 75.7 🥈 |

## ⭐ Citation

```bibtex
@article{wang2026deepgen,
  title={DeepGen 1.0: A Lightweight Unified Multimodal Model for Advancing Image Generation and Editing},
  author={Wang, Dianyi and Li, Ruihang and Han, Feng and Ma, Chaofan and Song, Wei and Wang, Siyuan and Wang, Yibin and Xin, Yi and Liu, Hongjian and Zhang, Zhixiong and others},
  journal={arXiv preprint arXiv:2602.12205},
  year={2026}
}
```

## License

Apache 2.0