Improve Fine-R1 model card with paper link, authors, and structured content
Hi! I'm Niels from the Hugging Face community team.
I've updated the model card for Fine-R1 to make it more informative and better integrated with the Hugging Face ecosystem. Changes include:
- Updating the paper link to the official Hugging Face paper page.
- Adding the authors of the paper.
- Shortening and restructuring the abstract into a concise introduction and a dedicated methodology section, as per best practices.
- Adding the official BibTeX citation for easier referencing.
- Including descriptive metadata tags for better discoverability of the model.
Let me know if you have any questions!
README.md (changed):

````diff
@@ -1,20 +1,36 @@
 ---
+library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
-
+tags:
+- fine-grained-recognition
+- chain-of-thought
+- vision-language
+- reasoning
+- qwen2-vl
+- arxiv:2602.07605
 ---
 
 # Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning
 
-This is the official model released for the paper **[Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning](https://
+This is the official 3B model released for the paper **[Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning](https://huggingface.co/papers/2602.07605)**.
+
+**Authors**: Hulingxiao He, Zijun Geng, and Yuxin Peng.
+
+## Introduction
 
-
+Fine-R1 is a Multi-modal Large Language Model (MLLM) specifically designed to excel in Fine-Grained Visual Recognition (FGVR). While traditional MLLMs often struggle with FGVR compared to contrastive models like CLIP, Fine-R1 bridges this performance gap by incorporating Chain-of-Thought (CoT) reasoning. It achieves state-of-the-art performance, even surpassing strong CLIP-like models, in identifying both seen and unseen fine-grained sub-categories with only 4-shot training.
 
-
+## Methodology
 
+Fine-R1 employs an R1-style training framework consisting of two key stages:
+
+1. **Chain-of-Thought Supervised Fine-tuning (SFT)**: This stage involves constructing a high-quality FGVR CoT dataset with rationales covering "visual analysis, candidate sub-categories, comparison, and prediction." This process trains the model to act as a strong open-world classifier.
+2. **Triplet Augmented Policy Optimization (TAPO)**: This stage enhances the model's robustness and discriminative ability. It uses Intra-class Augmentation to improve robustness to intra-class variance and Inter-class Augmentation to maximize response distinction across sub-categories.
 
 ## GitHub Repository
 
+For code, data, and detailed training/evaluation instructions, please refer to the official repository:
 [https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026)
 
 ## Model Version
@@ -22,4 +38,17 @@ Fine-R1-3B
 
 ## Usage
 
-This model can be used with the Hugging Face `transformers` library. For detailed usage examples and how to integrate it into your projects, please refer to the official [GitHub Repository](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026).
+This model can be used with the Hugging Face `transformers` library. For detailed usage examples and how to integrate it into your projects, please refer to the official [GitHub Repository](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026).
+
+## Citation
+
+If you find this model or the research helpful, please consider citing:
+
+```bibtex
+@article{he2026finer1,
+  title={Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning},
+  author={He, Hulingxiao and Geng, Zijun and Peng, Yuxin},
+  journal={arXiv preprint arXiv:2602.07605},
+  year={2026}
+}
+```
````
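One additional thought on the Usage section: the card currently defers entirely to the GitHub repository, so a minimal inference snippet might be worth adding later. Below is an untested sketch that assumes Fine-R1-3B exposes the standard Qwen2-VL interface in `transformers` (suggested by the `qwen2-vl` tag); the repo id, image path, and prompt wording are all placeholders of mine, not from the official repo, and should be replaced with the instructions there.

```python
# Untested usage sketch, NOT from the official repo: it assumes Fine-R1-3B
# follows the standard Qwen2-VL interface in `transformers` (suggested by the
# `qwen2-vl` tag). "ORG/Fine-R1-3B", "bird.jpg", and the prompt wording are
# placeholders -- substitute the official Hub id and prompts.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "ORG/Fine-R1-3B"  # placeholder Hub id

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("bird.jpg")  # any fine-grained image (bird, car, aircraft, ...)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {
                "type": "text",
                "text": "What is the fine-grained sub-category of the object "
                        "in this image? Think step by step.",
            },
        ],
    }
]

# Render the chat template, then tokenize the text and image together.
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate the chain-of-thought rationale plus final prediction,
# then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=512)
generated = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```

If the model was indeed trained with the four-part rationale described in the Methodology section, the decoded output should walk through visual analysis, candidate sub-categories, comparison, and a final prediction.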