---
license: cc-by-nc-nd-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- Pathology
- arxiv:2505.11404
extra_gated_prompt: >-
  The Patho-R1-3B model and its associated materials are released under the CC-BY-NC-ND 4.0 license.
  Access is restricted to non-commercial, academic research purposes only, with proper citation required.
  Any commercial usage, redistribution, or derivative work (including training models based on this model
  or generating datasets from its outputs) is strictly prohibited without prior written approval.


  Users must register with an official institutional email address (generic domains such as @gmail, @qq, @hotmail, etc. will not be accepted).
  By requesting access, you confirm that your information is accurate and current, and that you agree to comply with all terms listed herein.
  If other members of your organization wish to use the model, they must register independently and agree to the same terms.
extra_gated_fields:
  Full name (first and last): text
  Institutional affiliation (no abbreviations): text
  Role/Position:
    type: select
    options:
    - Faculty/Principal Investigator
    - PhD Student
    - Postdoctoral Researcher
    - Research Staff
    - Other
  Official institutional email (**must match your Hugging Face primary email; generic domains will be denied**): text
  Intended research use (be specific): text
  I agree to use this model only for non-commercial academic purposes: checkbox
  I agree not to redistribute this model or share it outside of my individual usage: checkbox
  I confirm that all submitted information is accurate and up to date: checkbox
---
# Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
\[[Arxiv](https://arxiv.org/abs/2505.11404)\] | \[[Github Repo](https://github.com/Wenchuan-Zhang/Patho-R1)\] | \[[Cite](#citation❤️)\]

## Introduction📝
While vision-language models have shown impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce **Patho-R1-3B**, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. **Patho-R1-3B** is trained using a three-stage pipeline:
1. *Continued pretraining* on **3.5M pathology figure-caption pairs** for domain knowledge acquisition
2. *Supervised fine-tuning* on **500k expert-annotated Chain-of-Thought samples** to encourage reasoning
3. *Reinforcement learning* with **Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)** to refine response quality (see the sketch below)
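To make the RL stage concrete, here is a minimal, illustrative reward sketch in the spirit of stage 3. It is **not** the actual training code (see the GitHub repo for that), and the helper names `format_reward` and `accuracy_reward` are our own. It assumes rule-based rewards over the `<think>…</think><answer>…</answer>` template that the model is prompted with throughout this card:

```python
import re

# Responses are expected to follow the template the model is prompted with:
# <think> step-by-step reasoning </think><answer> final answer </answer>
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response matches the <think>/<answer> template, else 0.0."""
    return 1.0 if TEMPLATE.search(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the answer begins with the ground-truth option letter (e.g. 'D'), else 0.0."""
    match = TEMPLATE.search(response)
    if match is None:
        return 0.0
    answer = match.group(2).strip().upper()
    letter = re.match(r"([A-D])\b", answer)
    return 1.0 if letter and letter.group(1) == ground_truth.strip().upper() else 0.0

# A rollout's total reward could then combine both terms, e.g.:
# reward = format_reward(response) + accuracy_reward(response, "D")
```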
Experimental results show that **Patho-R1-3B** achieves strong performance across key pathology tasks, including **multiple-choice question answering** and **visual question answering**, highlighting its potential for real-world pathology AI applications.


### Quickstart🏃
Below is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-3B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-3B")

# Example question from the PathMMU test set (ground truth: D).
# Two reasoning-style system prompts are available (choose one):
# - Chain-of-Draft (CoD), a concise reasoning prompting strategy:
#     "You are a pathology expert, your task is to think step by step, but only keep a
#      minimum draft for each thinking step, with 5 words at most. Return the answer at
#      the end of the response after a separator. Use the following format:
#      <think> Your step-by-step reasoning </think><answer> Your final answer </answer>"
# - Chain-of-Thought (CoT), used below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {
                "type": "text",
                "text": "What feature in the provided micrograph is indicative of chronic inflammation?\nA. Granuloma formation\nB. Multinucleated giant cells\nC. Neutrophilic infiltration\nD. Plasma cells with eccentrically placed nuclei",
            },
        ],
    },
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
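Since the model is prompted to wrap its output in `<think>`/`<answer>` tags, you will typically want to strip the reasoning and keep only the final answer. A minimal post-processing sketch (assuming the model follows the template, which is not guaranteed):

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, falling back to the raw response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_answer(output_text[0]))
```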
## Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects, which laid the foundation for various components of this work:

- [Qwen](https://github.com/QwenLM) for providing the powerful vision-language models that underpin our multimodal understanding and generation capabilities.
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) for document layout detection.
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for comprehensive optical character recognition.
- [ModelScope Swift](https://github.com/modelscope/ms-swift) for efficient fine-tuning and deployment tooling.
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for robust LLM training and fine-tuning pipelines.
- [verl](https://github.com/volcengine/verl) for its reinforcement learning framework for LLMs.
- [DeepSeek](https://github.com/deepseek-ai) for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made the development of Patho-R1-3B possible.

## Citation❤️
If you find our work helpful, a citation would be greatly appreciated:

```bibtex
@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}
```