Fine-R1-3B / README.md
nielsr's picture
nielsr HF Staff
Improve Fine-R1 model card with paper link, authors, and structured content
4df0b8d verified
|
raw
history blame
2.6 kB
metadata
library_name: transformers
license: mit
pipeline_tag: image-text-to-text
tags:
  - fine-grained-recognition
  - chain-of-thought
  - vision-language
  - reasoning
  - qwen2-vl
  - arxiv:2602.07605

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

This is the official 3B model released for the paper Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning.

Authors: Hulingxiao He, Zijun Geng, and Yuxin Peng.

Introduction

Fine-R1 is a Multi-modal Large Language Model (MLLM) specifically designed to excel in Fine-Grained Visual Recognition (FGVR). While traditional MLLMs often struggle with FGVR compared to contrastive models like CLIP, Fine-R1 bridges this performance gap by incorporating Chain-of-Thought (CoT) reasoning. It achieves state-of-the-art performance, even surpassing strong CLIP-like models, in identifying both seen and unseen fine-grained sub-categories with only 4-shot training.

Methodology

Fine-R1 employs an R1-style training framework consisting of two key stages:

  1. Chain-of-Thought Supervised Fine-tuning (SFT): This stage involves constructing a high-quality FGVR CoT dataset with rationales covering "visual analysis, candidate sub-categories, comparison, and prediction." This process trains the model to act as a strong open-world classifier.
  2. Triplet Augmented Policy Optimization (TAPO): This stage enhances the model's robustness and discriminative ability. It uses Intra-class Augmentation to improve robustness to intra-class variance and Inter-class Augmentation to maximize response distinction across sub-categories.

GitHub Repository

For code, data, and detailed training/evaluation instructions, please refer to the official repository: https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026

Model Version

Fine-R1-3B

Usage

This model can be used with the Hugging Face transformers library. For detailed usage examples and how to integrate it into your projects, please refer to the official GitHub Repository.

Citation

If you find this model or the research helpful, please consider citing:

@article{he2026finer1,
  title={Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning},
  author={He, Hulingxiao and Geng, Zijun and Peng, Yuxin},
  journal={arXiv preprint arXiv:2602.07605},
  year={2026}
}