---
license: apache-2.0
---
<h1 align="center">Amber-Image</h1>
<h3 align="center">Efficient Compression of Large-Scale Diffusion Transformers</h3>

<p align="center">
  <a href="https://github.com/HelloVision/AMBER-IMAGE"><img src="https://img.shields.io/badge/GitHub-Repo-blue?logo=github" alt="GitHub"></a>
  <a href="https://huggingface.co/HelloVision/AMBER-IMAGE"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Model-yellow" alt="Hugging Face"></a>
</p>

<p align="center">
  <img src="show.svg" width="100%" alt="Representative samples generated by Amber-Image.">
</p>

## 🎨 Amber-Image

**Amber-Image** is a family of efficient text-to-image (T2I) generation models built through a dedicated compression pipeline that integrates structured pruning, architectural evolution, and knowledge distillation. Rather than training from scratch, Amber-Image transforms the 60-layer, 20B-parameter dual-stream MMDiT backbone of [Qwen-Image](https://huggingface.co/Qwen/Qwen-Image) into lightweight variants, **Amber-Image-10B** and **Amber-Image-6B**, reducing parameters by up to **70%** while maintaining competitive generation quality.

The compression pipeline operates in two stages:

1. **Amber-Image-10B**: Derived via timestep-sensitive depth pruning, removing the 30 of the 60 MMDiT layers identified as least critical. Retained layers are reinitialized through local weight averaging and recovered via layer-wise distillation from the original Qwen-Image, followed by full-parameter fine-tuning.
2. **Amber-Image-6B**: Introduces a hybrid-stream architecture in which the first 10 layers retain dual-stream processing for modality-specific feature extraction, while the deeper 20 layers are converted to a single stream initialized from the image branch. Knowledge is transferred from Amber-Image-10B via progressive distillation and lightweight fine-tuning.
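
Stage one can be illustrated with a small sketch. This is a toy stand-in, not the actual implementation: the `importance` scores, the `keep` count, and the grouping rule (averaging each retained layer with the pruned layers that directly follow it) are simplified illustrations of the timestep-sensitive estimation and local weight averaging described above.

```python
def prune_and_reinit(layers, importance, keep):
    """Toy depth pruning: keep the `keep` highest-importance layers, then
    re-initialize each kept layer as the arithmetic mean of its own weights
    and those of the pruned layers that directly follow it (a simplified
    stand-in for local weight averaging)."""
    ranked = sorted(range(len(layers)), key=lambda i: importance[i], reverse=True)
    kept = sorted(ranked[:keep])  # indices of retained layers, in depth order
    merged = []
    for pos, i in enumerate(kept):
        end = kept[pos + 1] if pos + 1 < len(kept) else len(layers)
        group = layers[i:end]  # the kept layer plus its pruned successors
        merged.append([sum(w) / len(group) for w in zip(*group)])
    return merged, kept
```

In the real pipeline this warm start is followed by layer-wise distillation and full-parameter fine-tuning to recover quality.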

<p align="center">
  <img src="amber_image.drawio.svg" width="80%" alt="Overview of the Amber-Image compression pipeline.">
</p>
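
The stage-two hybrid-stream conversion can be sketched in the same toy style. The block structure and field names below are illustrative assumptions, not the real MMDiT code: each dual-stream block is modeled as separate text/image weight lists, and deeper blocks collapse to a single stream seeded from the image branch.

```python
def to_hybrid_stream(blocks, n_dual):
    """Toy hybrid-stream conversion: the first `n_dual` blocks keep separate
    text/image streams; every deeper block becomes a single-stream block
    whose weights are initialized from the image branch."""
    hybrid = []
    for depth, block in enumerate(blocks):
        if depth < n_dual:
            hybrid.append({"kind": "dual", "txt": block["txt"], "img": block["img"]})
        else:
            # single-stream block seeded from the image-branch weights
            hybrid.append({"kind": "single", "joint": list(block["img"])})
    return hybrid
```

In the actual 6B model this split is 10 dual-stream layers followed by 20 single-stream layers, aligned afterwards via progressive distillation from Amber-Image-10B.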

### 🌟 Key Features

- **No Training from Scratch**: Operates entirely through strategic compression and refinement of an existing foundation model, dramatically reducing both the computational budget and data requirements.
- **Structured Depth Pruning with Fidelity-Aware Initialization**: A layer-importance estimation method accounts for global fidelity impact and timestep sensitivity, enabling safe removal of half the layers. Retained layers are initialized via arithmetic averaging of pruned neighboring blocks for a high-quality warm start.
- **Hybrid-Stream Architecture**: Early layers retain dual-stream processing for modality-specific feature extraction, while deeper layers are converted to a single stream, further reducing parameters by 40% with minimal quality loss.
- **Two-Stage Knowledge Transfer**: Layer-wise distillation from the full model recovers pruning-induced degradation, followed by distillation from the intermediate pruned model to align the single-stream layers. Both stages require only limited fine-tuning on a small, high-quality dataset.
- **Competitive Benchmarks**: Amber-Image achieves state-of-the-art results on DPG-Bench and GenEval, surpassing all compared models, including closed-source systems and the 20B teacher. On text-rendering benchmarks (LongText-Bench, CVTG-2K), Amber-Image-10B outperforms several closed-source baselines while maintaining competitive fidelity.
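
The fidelity-aware importance estimate can be pictured as a timestep-weighted average of per-layer ablation errors. The error matrix and the weighting scheme below are assumptions made for illustration; the exact criterion used by Amber-Image may differ.

```python
def layer_importance(ablation_error, timestep_weight):
    """Toy timestep-sensitive importance: `ablation_error[l][t]` is the output
    deviation observed when layer `l` is skipped at timestep `t`. Scores are
    the timestep-weighted mean deviation, so low-scoring layers are the
    safest to prune."""
    total = sum(timestep_weight)
    return [
        sum(e * w for e, w in zip(errors, timestep_weight)) / total
        for errors in ablation_error
    ]
```

Layers whose removal barely perturbs the output at the timesteps that matter most receive low scores and become pruning candidates.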

### 🆚 Amber-Image-10B vs Amber-Image-6B

| Aspect | Amber-Image-10B | Amber-Image-6B |
|---|---|---|
| Parameters | ~10B | ~6B |
| Backbone Layers | 30 | 30 (10 dual-stream + 20 single-stream) |
| Architecture | Dual-Stream MMDiT | Hybrid-Stream (Dual + Single) |
| Compression Ratio | 50% depth reduction | 70% parameter reduction |
| Base Model | Qwen-Image (20B) | Amber-Image-10B |
| Text Encoder | Qwen2.5-VL-7B | Qwen2.5-VL-7B |
| VAE | Qwen-Image VAE | Qwen-Image VAE |

## 📊 Benchmark Results

### General Text-to-Image Generation

**DPG-Bench**: Dense prompt following with 1,065 semantically rich prompts. Both Amber-Image variants achieve the highest overall scores among all compared models, surpassing the closed-source Seedream 3.0 and GPT Image 1, the 20B teacher Qwen-Image, and all 7B-class open-source competitors.

| Model | Global | Entity | Attribute | Relation | Other | Overall |
|---|---|---|---|---|---|---|
| Seedream 3.0 | **94.31** | **92.65** | 91.36 | 92.78 | 88.24 | 88.27 |
| GPT Image 1 | 88.89 | 88.94 | 89.84 | 92.63 | 90.96 | 85.15 |
| Qwen-Image | 91.32 | 91.56 | 92.02 | 94.31 | **92.73** | 88.32 |
| Z-Image | 93.39 | 91.22 | **93.16** | 92.22 | 91.52 | 88.14 |
| LongCat-Image | 89.10 | 92.54 | 92.00 | 93.28 | 87.50 | 86.80 |
| Ovis-Image | 82.37 | 92.38 | 90.42 | 93.98 | 91.20 | 86.59 |
| **Amber-Image-10B** | 83.28 | 92.54 | 90.16 | **94.47** | 87.60 | **89.61** |
| **Amber-Image-6B** | 79.73 | 90.45 | 91.64 | 93.87 | 89.11 | 88.96 |

**GenEval**: Semantic reasoning and object-centric grounding. Both Amber-Image variants achieve the best overall scores, outperforming the teacher Qwen-Image, closed-source systems, and all 7B-class open-source competitors, with notable strength in the "Position" and "Attribute" dimensions.

| Model | Single | Two | Counting | Colors | Position | Attribute | Overall |
|---|---|---|---|---|---|---|---|
| Seedream 3.0 | 0.990 | 0.960 | **0.910** | **0.930** | 0.470 | 0.800 | 0.840 |
| GPT Image 1 | 0.990 | 0.920 | 0.850 | 0.920 | 0.750 | 0.610 | 0.840 |
| Qwen-Image | 0.990 | 0.920 | 0.890 | 0.880 | 0.760 | 0.770 | 0.870 |
| Z-Image | **1.000** | 0.940 | 0.780 | **0.930** | 0.620 | 0.770 | 0.840 |
| LongCat-Image | 0.990 | **0.980** | 0.860 | 0.860 | 0.750 | 0.730 | 0.870 |
| Ovis-Image | **1.000** | 0.970 | 0.760 | 0.860 | 0.670 | 0.800 | 0.840 |
| **Amber-Image-10B** | 0.963 | 0.849 | 0.900 | 0.862 | 0.850 | **0.860** | 0.881 |
| **Amber-Image-6B** | 0.963 | 0.879 | 0.875 | 0.894 | **0.880** | 0.810 | **0.883** |

**OneIG-Bench**: Multi-faceted instruction following (English / Chinese). Amber-Image maintains competitive "Text" rendering scores approaching the teacher Qwen-Image, while a gap remains in the "Style" and "Diversity" dimensions, attributable to the limited diversity of the fine-tuning data and the aesthetic priors lost during compression.

| Model | EN Overall | ZH Overall |
|---|---|---|
| Seedream 3.0 | 0.530 | 0.528 |
| GPT Image 1 | 0.533 | 0.474 |
| Qwen-Image | 0.539 | **0.548** |
| Z-Image | **0.546** | 0.535 |
| Ovis-Image | 0.530 | 0.521 |
| **Amber-Image-10B** | 0.489 | 0.470 |
| **Amber-Image-6B** | 0.477 | 0.456 |

### Text Rendering

**LongText-Bench**: Extended bilingual text rendering. Amber-Image-10B outperforms the closed-source Seedream 3.0 on both the English and Chinese splits, and significantly surpasses GPT Image 1 on Chinese text rendering. The 6B variant still exceeds many larger baselines such as OmniGen2 and FLUX.1 [dev].

| Model | EN | ZH |
|---|---|---|
| Seedream 3.0 | 0.896 | 0.878 |
| GPT Image 1 | **0.956** | 0.619 |
| Qwen-Image | 0.943 | 0.946 |
| Z-Image | 0.935 | 0.936 |
| Ovis-Image | 0.922 | **0.964** |
| **Amber-Image-10B** | 0.911 | 0.915 |
| **Amber-Image-6B** | 0.838 | 0.828 |

**CVTG-2K**: Complex visual text generation. Amber-Image-10B achieves a CLIPScore second only to Ovis-Image among the compared models, indicating strong semantic alignment, though its word accuracy degrades as the number of text regions grows.

| Model | NED | CLIPScore | 2 regions | 3 regions | 4 regions | 5 regions | Average |
|---|---|---|---|---|---|---|---|
| GPT Image 1 | 0.9478 | 0.7982 | 0.8779 | 0.8659 | 0.8731 | 0.8218 | 0.8569 |
| Qwen-Image | 0.9116 | 0.8017 | 0.8370 | 0.8364 | 0.8313 | 0.8158 | 0.8288 |
| Z-Image | 0.9367 | 0.7969 | 0.9006 | 0.8722 | 0.8652 | 0.8512 | 0.8671 |
| Ovis-Image | **0.9695** | **0.8368** | **0.9248** | **0.9239** | **0.9180** | **0.9166** | **0.9200** |
| LongCat-Image | 0.9361 | 0.7859 | 0.9129 | 0.8737 | 0.8557 | 0.8310 | 0.8658 |
| **Amber-Image-10B** | 0.8938 | 0.8116 | 0.8791 | 0.8339 | 0.7959 | 0.6952 | 0.8011 |
| **Amber-Image-6B** | 0.8523 | 0.8047 | 0.8669 | 0.7994 | 0.7200 | 0.6428 | 0.7573 |

## 📜 Citation

If you find our work useful in your research, please consider citing:

```bibtex
@article{hellogroup2025amberimage,
  title={Amber-Image: Efficient Compression of Large-Scale Diffusion Transformers},
  author={{Computational Intelligence Dept, HelloGroup Inc.}},
  year={2025}
}
```