nielsr (HF Staff) committed
Commit cf09189 · verified · 1 Parent(s): 039d50e

Improve model card: update paper link and summarize abstract


Hi! I'm Niels from the Hugging Face community science team.

I've opened this PR to improve your model card:
- Summarized the paper abstract into a more readable format.
- Updated the paper reference to link to the Hugging Face paper page.
- Added relevant tags for discoverability (e.g., fine-grained visual recognition, chain-of-thought).
- Maintained existing metadata for library compatibility.

Please feel free to review and merge!

Files changed (1)
  1. README.md +32 -10
README.md CHANGED
@@ -1,25 +1,47 @@
 ---
 license: mit
 pipeline_tag: image-text-to-text
- library_name: transformers
 ---

 # Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

- This is the official model released for the paper **[Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning](https://openreview.net/pdf?id=kyzHM557gE)**.

- ## Abstract

- Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of "visual analysis, candidate sub-categories, comparison, and prediction", transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous.

- ## GitHub Repository

- [https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026)
-
- ## Model Version
- Fine-R1-7B

 ## Usage

- This model can be used with the Hugging Face `transformers` library. For detailed usage examples and how to integrate it into your projects, please refer to the official [GitHub Repository](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026).

 ---
+ library_name: transformers
 license: mit
 pipeline_tag: image-text-to-text
+ tags:
+ - fine-grained-visual-recognition
+ - chain-of-thought
+ - vision-reasoning
 ---

 # Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

+ This is the official model repository for the paper **[Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning](https://huggingface.co/papers/2602.07605)**.

+ ## Introduction

+ **Fine-R1** is a Multi-modal Large Language Model (MLLM) designed for **Fine-Grained Visual Recognition (FGVR)**. While general MLLMs often struggle to distinguish highly similar sub-categories, Fine-R1 bridges the gap between generative MLLMs and specialized discriminative models (like CLIP) through an R1-style training framework.

+ ### Key Innovations
+ - **Chain-of-Thought Supervised Fine-tuning (CoT-SFT)**: The model is trained on a high-quality FGVR CoT dataset, teaching it to perform visual analysis, enumerate candidate sub-categories, and compare them before predicting.
+ - **Triplet Augmented Policy Optimization (TAPO)**: Intra-class Augmentation mixes trajectories from anchor and positive images of the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes response distinction across sub-categories to sharpen discriminative ability.

+ With only 4-shot training, Fine-R1 outperforms general MLLMs, reasoning MLLMs, and even contrastive CLIP models at identifying both seen and unseen sub-categories.
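
+ For intuition, a response following the four rationale stages ("visual analysis, candidate sub-categories, comparison, prediction") might read as follows. This is a purely hypothetical illustration, not actual model output:

+ ```text
+ Visual analysis: small songbird, bright yellow body, black cap, black wings with white wing bars, short conical bill.
+ Candidate sub-categories: American Goldfinch, Wilson's Warbler, Evening Grosbeak.
+ Comparison: Wilson's Warbler lacks white wing bars; Evening Grosbeak is larger with a much heavier bill; the black cap and conical bill match American Goldfinch.
+ Prediction: American Goldfinch.
+ ```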

+ ## Resources
+ - **Paper:** [Hugging Face Papers](https://huggingface.co/papers/2602.07605)
+ - **GitHub:** [PKU-ICST-MIPL/FineR1_ICLR2026](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026)

 ## Usage

+ This model is compatible with the Hugging Face `transformers` library. For detailed instructions on environment setup, training scripts, and evaluation pipelines (closed-world and open-world), please refer to the official [GitHub Repository](https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026).
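+
+ As a minimal sketch, loading the model through the generic `transformers` image-text-to-text auto classes might look like the code below. The repo id, prompt wording, and image URL are placeholders (assumptions, not the official pipeline), and the exact chat template depends on the base architecture:

+ ```python
+ # Hedged sketch: the repo id and prompt below are placeholders; the official
+ # inference pipeline lives in the GitHub repository linked above.
+ import torch
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+
+ model_id = "Fine-R1-7B"  # placeholder — substitute the actual Hub repo id
+ processor = AutoProcessor.from_pretrained(model_id)
+ model = AutoModelForImageTextToText.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"
+ )
+
+ # A fine-grained recognition query over a single image.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "https://example.com/bird.jpg"},
+             {"type": "text", "text": "Which fine-grained sub-category does this bird belong to? Reason step by step."},
+         ],
+     }
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ answer = processor.decode(
+     output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
+ )
+ print(answer)
+ ```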
+
+ ## Citation
+
+ If you find Fine-R1 helpful in your research, please cite the following paper:
+
+ ```bibtex
+ @article{he2026finer1,
+   title={Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning},
+   author={He, Hulingxiao and Geng, Zijun and Peng, Yuxin},
+   journal={arXiv preprint arXiv:2602.07605},
+   year={2026}
+ }
+ ```
+
+ ## License
+ This project is licensed under the MIT License.