---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- transformers
- multimodal
library_name: transformers
---

## ReVisual-R1 (7B): Open-Source Multimodal Reasoner

> **One cold-start, two RL stages, endless reasoning power.**

---

### Highlights

* **SOTA on 9 tough benchmarks** covering visual-math and text reasoning.
* **Three-Stage SRO Training**
  1. **Text Cold-Start**: seeds deep reflection
  2. **Multimodal RL**: aligns vision and logic
  3. **Text RL**: polishes fluency and brevity
* **PAD** (Prioritized Advantage Distillation) keeps gradients alive.
* **Efficient-Length Reward** encourages concise, self-reflective CoT.
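To make the PAD bullet above concrete, here is a toy sketch of the underlying idea: rollouts whose advantage is near zero contribute almost no policy gradient, so a sub-batch can be re-sampled with probability proportional to advantage magnitude. This is an illustration under that assumption, not the paper's actual implementation; the function name `pad_sample` and the temperature knob are hypothetical.

```python
import numpy as np

def pad_sample(advantages, k, temperature=1.0, seed=0):
    """Toy PAD-style sampler (illustrative only, not the paper's algorithm).

    Re-samples k rollout indices with probability proportional to
    |advantage| ** (1 / temperature), so near-zero-advantage rollouts
    are rarely selected and the policy gradient stays informative.
    """
    adv = np.asarray(advantages, dtype=np.float64)
    weights = np.abs(adv) ** (1.0 / temperature)
    if weights.sum() == 0:  # all advantages vanished: fall back to uniform
        weights = np.ones_like(weights)
    probs = weights / weights.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(len(adv), size=k, replace=False, p=probs)

# The rollout with advantage 0.0 has zero selection probability.
idx = pad_sample([0.9, -0.8, 0.01, 0.0, 0.7], k=3)
```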
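The Efficient-Length Reward can likewise be sketched as a shaped reward that pays a small brevity bonus on top of correctness. The exact formulation here (linear bonus decaying past a target length, and the `target_len`/`alpha` parameters) is an assumption for illustration, not the formula from the paper.

```python
def length_shaped_reward(correct, n_tokens, target_len=512, alpha=0.1):
    """Toy length-shaped reward (illustrative; not the paper's exact formula).

    Correct answers earn a base reward of 1.0 plus a bonus that shrinks
    linearly with response length, nudging the policy toward concise
    chains of thought without adding any penalty for wrong answers.
    """
    if not correct:
        return 0.0
    # Bonus in [0, alpha]: full at length 0, gone at 2 * target_len tokens.
    brevity = max(0.0, 1.0 - n_tokens / (2 * target_len))
    return 1.0 + alpha * brevity

short = length_shaped_reward(True, 128)   # concise correct answer
long_ = length_shaped_reward(True, 1024)  # verbose correct answer
```

A concise correct answer scores strictly higher than a verbose one, while incorrect answers score 0 regardless of length.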
---

### Resources

* [Paper](https://arxiv.org/abs/2506.04207)
* [Code](https://github.com/CSfufu/Revisual-R1)

---

### Citation

```bibtex
@article{chen2025advancing,
  title={Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning},
  author={Chen, Shuang and Guo, Yue and Su, Zhaochen and Li, Yafu and Wu, Yulun and Chen, Jiacheng and Chen, Jiayu and Wang, Weijie and Qu, Xiaoye and Cheng, Yu},
  journal={arXiv preprint arXiv:2506.04207},
  year={2025}
}
```

Take ReVisual-R1 for a spin and let us know what you build!