Add model card and metadata for Vision-R1-CI-7B
Hi! I'm Niels from the Hugging Face community team. I've updated the model card for `Vision-R1-CI-7B` to include:
- Metadata for `library_name`, `pipeline_tag`, and `base_model`.
- A descriptive model card linking to the [Vision-R1 paper](https://arxiv.org/abs/2503.06749) and official [GitHub repository](https://github.com/Osilly/Vision-R1).
- Information about the model being the cold-start initialization (CI) version for subsequent RL training.
- Citation information for the research.
This helps make the model more discoverable and provides users with necessary context regarding its usage in the Vision-R1 pipeline.
README.md
CHANGED
````diff
@@ -1,3 +1,48 @@
 ---
 license: apache-2.0
+library_name: transformers
+pipeline_tag: image-text-to-text
+base_model: Qwen/Qwen2.5-VL-7B-Instruct
+tags:
+- multimodal
+- reasoning
+- vision-r1
+- qwen2.5-vl
+- chain-of-thought
 ---
+
+# Vision-R1-CI-7B
+
+Vision-R1-CI-7B is a multimodal reasoning model that serves as the **Cold-start Initialization (CI)** checkpoint for the **Vision-R1** project. It is introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://arxiv.org/abs/2503.06749).
+
+- **GitHub Repository:** [Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
+- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://arxiv.org/abs/2503.06749)
+- **Base Model:** [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+
+## Model Description
+
+Vision-R1-CI (Cold-start Initialized) is a 7B-parameter multimodal large language model (MLLM) developed to bridge the gap between standard vision-language tasks and complex reasoning. It was obtained by fine-tuning the **Qwen2.5-VL-7B-Instruct** base model on the **Vision-R1-cold** dataset, a 200K high-quality multimodal Chain-of-Thought (CoT) dataset constructed by leveraging DeepSeek-R1 and existing MLLMs through modality bridging and data filtering.
+
+This model acts as the critical starting point for subsequent Reinforcement Learning (RL) using Group Relative Policy Optimization (GRPO) and the Progressive Thinking Suppression Training (PTST) strategy, which enables the emergence of "Aha moments" and self-reflective reasoning in multimodal contexts.
+
+## Performance
+
+The Vision-R1 series demonstrates strong performance across various math-centric multimodal benchmarks. Vision-R1-7B (the version after RL training) achieves significant improvements:
+
+| Model                      | MathVista | MathVerse | MM-Math | DynaMath | AVG. |
+| -------------------------- | --------- | --------- | ------- | -------- | ---- |
+| Qwen2.5-VL-7B              | 68.1      | 46.7      | 34.1    | 50.7     | 47.9 |
+| **Vision-R1-7B (Ours)**    | 73.5      | 52.4      | 40.2    | 56.3     | 53.8 |
+
+## Citation
+
+If you find this model useful in your research, please cite the following paper:
+
+```bibtex
+@article{huang2025visionr1,
+  title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
+  author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
+  journal={arXiv preprint arXiv:2503.06749},
+  year={2025}
+}
+```
````
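For reviewers who want to sanity-check the new `library_name: transformers` / `pipeline_tag: image-text-to-text` metadata, here is a minimal inference sketch using the standard Qwen2.5-VL classes in `transformers`. The hub id `Osilly/Vision-R1-CI-7B` and the image path are assumptions for illustration, not confirmed by this PR:

```python
def build_messages(image_path, question):
    """Build a Qwen2.5-VL-style chat message with one image and one question."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]


def generate_solution(image_path, question, model_id="Osilly/Vision-R1-CI-7B"):
    """Run one round of CoT inference (downloads the full 7B checkpoint on first call).

    `model_id` is an assumed hub path; replace it with the actual repository id.
    """
    from PIL import Image
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_messages(image_path, question)
    # Render the chat template to a prompt string, then tokenize text + image together.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(
        text=[text], images=[Image.open(image_path)], return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=2048)
    # Strip the prompt tokens and decode only the newly generated reasoning/answer.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
```

A call like `generate_solution("problem.png", "Solve the problem in the image step by step.")` would return the model's CoT answer for the pictured problem.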