Vision-R1-32B / README.md
nielsr's picture
nielsr HF Staff
Add model card for Vision-R1-32B
930e07b verified
|
raw
history blame
3.05 kB
---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- reasoning
- math
- r1
---
# Vision-R1-32B
Vision-R1-32B is a multimodal reasoning model introduced in the paper [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749). It is based on the Qwen2.5-VL-32B architecture and is specifically optimized to enhance reasoning capabilities (such as self-reflection and questioning) in multimodal tasks.
- **Paper:** [Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models](https://huggingface.co/papers/2503.06749)
- **Repository:** [https://github.com/Osilly/Vision-R1](https://github.com/Osilly/Vision-R1)
## Model Description
Vision-R1 addresses the difficulty of activating complex reasoning in MLLMs without human-annotated reasoning data. The model was developed using a two-stage pipeline:
1. **Cold-start Initialization**: Fine-tuning on a 200K multimodal Chain-of-Thought (CoT) dataset (Vision-R1-cold).
2. **Reinforcement Learning (RL)**: Utilizing Group Relative Policy Optimization (GRPO) with a Progressive Thinking Suppression Training (PTST) strategy. This strategy gradually increases the reasoning length (4K -> 8K -> 16K) to refine the model's ability to learn complex reasoning processes.
## Performance
Vision-R1-32B demonstrates strong performance across various multimodal math reasoning benchmarks, significantly outperforming its base model:
| Model | MathVista | MathVerse | MathVerse (mini) | MM-Math | DynaMath (Avg) | AVG. |
| -------------------------- | ----------- | ------------ | ---------------- | ------------ | -------------- | ------------ |
| Qwen2.5-VL-32B | 72.9 | 52.3 | 47.6 | 34.9 | 55.5 | 52.6 |
| **Vision-R1-32B (Ours)** | **76.4** | **62.1** | **59.0** | **55.3** | **65.6** | **63.7** |
## Quickstart
### Inference via Transformers
You can use the inference script provided in the [official repository](https://github.com/Osilly/Vision-R1).
```bash
# Inference script for Vision-R1-32B model
MODEL_PATH="Osilly/Vision-R1-32B"
IMAGE_PATH="path/to/your/image.png"
PROMPT="Your math problem or question here."
python3 inference.py \
--model_path ${MODEL_PATH} \
--enable_flash_attn True \
--image_path ${IMAGE_PATH} \
--prompt "${PROMPT}" \
--max_tokens 4096 \
--temperature 0.6 \
--top_p 0.95
```
The model is also compatible with **vLLM** (version > 0.7.2) for faster deployment and local inference.
## Citation
If you find Vision-R1 useful, please cite the following paper:
```bibtex
@article{huang2025visionr1,
title={Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models},
author={Huang, Wenxuan and Jia, Bohan and Zhai, Zijie and Cao, Shaosheng and Ye, Zheyu and Zhao, Fei and Hu, Yao and Lin, Shaohui},
journal={arXiv preprint arXiv:2503.06749},
year={2025}
}
```