Fine-R1-7B / README.md
nielsr's picture
nielsr HF Staff
Improve model card: update paper link and summarize abstract
cf09189 verified
|
raw
history blame
2.37 kB
metadata
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
  - fine-grained-visual-recognition
  - chain-of-thought
  - vision-reasoning

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

This is the official model repository for the paper Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning.

Introduction

Fine-R1 is a Multi-modal Large Language Model (MLLM) specifically designed for Fine-Grained Visual Recognition (FGVR). While general MLLMs often struggle with distinguishing between highly similar sub-categories, Fine-R1 bridges the gap between generative models and specialized discriminative models (like CLIP) through an R1-style training framework.

Key Innovations:

  • Chain-of-Thought Supervised Fine-tuning (CoT-SFT): The model is trained on high-quality FGVR CoT datasets, teaching it to perform visual analysis, consider candidate sub-categories, and compare them before predicting.
  • Triplet Augmented Policy Optimization (TAPO): This includes Intra-class Augmentation to handle visual variance and Inter-class Augmentation to maximize distinction between similar sub-categories.

With only 4-shot training, Fine-R1 excels in identifying both seen and unseen sub-categories, outperforming many general reasoning MLLMs and contrastive models.

Resources

Usage

This model is compatible with the Hugging Face transformers library. For detailed instructions on environment setup, training scripts, and evaluation pipelines (closed-world and open-world), please refer to the official GitHub Repository.

Citation

If you find Fine-R1 helpful in your research, please cite the following paper:

@article{he2026finer1,
  title={Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning},
  author={He, Hulingxiao and Geng, Zijun and Peng, Yuxin},
  journal={arXiv preprint arXiv:2602.07605},
  year={2026}
}

License

This project is licensed under the MIT License.