---
license: cc-by-nc-nd-4.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- Pathology
- arxiv:2505.11404
extra_gated_prompt: >-
  The Patho-R1-3B model and its associated materials are released under the CC-BY-NC-ND 4.0 license.
  Access is restricted to non-commercial, academic research purposes only, with proper citation required.
  Any commercial usage, redistribution, or derivative work (including training models based on this model
  or generating datasets from its outputs) is strictly prohibited without prior written approval.


  Users must register with an official institutional email address (generic domains such as @gmail, @qq, @hotmail, etc. will not be accepted).
  By requesting access, you confirm that your information is accurate and current, and that you agree to comply with all terms listed herein.
  If other members of your organization wish to use the model, they must register independently and agree to the same terms.
extra_gated_fields:
  Full name (first and last): text
  Institutional affiliation (no abbreviations): text
  Role/Position:
    type: select
    options:
    - Faculty/Principal Investigator
    - PhD Student
    - Postdoctoral Researcher
    - Research Staff
    - Other
  Official institutional email (**must match your Hugging Face primary email; generic domains will be denied**): text
  Intended research use (be specific): text
  I agree to use this model only for non-commercial academic purposes: checkbox
  I agree not to redistribute this model or share it outside of my individual usage: checkbox
  I confirm that all submitted information is accurate and up to date: checkbox
---
# Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner
\[[Arxiv](https://arxiv.org/abs/2505.11404)\] | \[[Github Repo](https://github.com/Wenchuan-Zhang/Patho-R1)\] | \[[Cite](#citation❤️)\]

## Introduction📝
While vision-language models have shown impressive progress in general medical domains, pathology remains a challenging subfield due to its high-resolution image requirements and complex diagnostic reasoning.

To address this gap, we introduce **Patho-R1-3B**, a multimodal pathology reasoner designed to enhance diagnostic understanding through structured reasoning. **Patho-R1-3B** is trained using a three-stage pipeline:
1. *Continued pretraining* on **3.5M pathology figure-caption pairs** for domain knowledge acquisition
2. *Supervised fine-tuning* on **500k expert-annotated Chain-of-Thought samples** to encourage reasoning
3. *Reinforcement learning* with **Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)** to refine response quality (see the sketch below)
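To make the RL stage concrete, here is a minimal, illustrative reward sketch in the spirit of stage 3. It is **not** the actual training code (see the GitHub repo for that), and the helper names `format_reward` and `accuracy_reward` are our own. It assumes rule-based rewards over the `<think>…</think><answer>…</answer>` template that the model is prompted with throughout this card:

```python
import re

# Responses are expected to follow the template the model is prompted with:
# <think> step-by-step reasoning </think><answer> final answer </answer>
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def format_reward(response: str) -> float:
    """1.0 if the response matches the <think>/<answer> template, else 0.0."""
    return 1.0 if TEMPLATE.search(response) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the answer begins with the ground-truth option letter (e.g. 'D'), else 0.0."""
    match = TEMPLATE.search(response)
    if match is None:
        return 0.0
    answer = match.group(2).strip().upper()
    letter = re.match(r"([A-D])\b", answer)
    return 1.0 if letter and letter.group(1) == ground_truth.strip().upper() else 0.0

# A rollout's total reward could then combine both terms, e.g.:
# reward = format_reward(response) + accuracy_reward(response, "D")
```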
Experimental results show that **Patho-R1-3B** achieves strong performance across key pathology tasks, including **multiple-choice question answering** and **visual question answering**, highlighting its potential for real-world pathology AI applications.


### Quickstart🏃
Below is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "WenchuanZhang/Patho-R1-3B",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("WenchuanZhang/Patho-R1-3B")

# Example question from the PathMMU test set (ground truth: D).
# Two reasoning-style system prompts are available (choose one):
# - Chain-of-Draft (CoD), a concise reasoning prompting strategy:
#     "You are a pathology expert, your task is to think step by step, but only keep a
#      minimum draft for each thinking step, with 5 words at most. Return the answer at
#      the end of the response after a separator. Use the following format:
#      <think> Your step-by-step reasoning </think><answer> Your final answer </answer>"
# - Chain-of-Thought (CoT), used below:
messages = [
    {
        "role": "system",
        "content": "You are a pathology expert, your task is to answer question step by step. Use the following format:<think> Your step-by-step reasoning </think><answer> Your final answer </answer>",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./images/example.jpg",
            },
            {
                "type": "text",
                "text": "What feature in the provided micrograph is indicative of chronic inflammation?\nA. Granuloma formation\nB. Multinucleated giant cells\nC. Neutrophilic infiltration\nD. Plasma cells with eccentrically placed nuclei",
            },
        ],
    },
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
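Since the model is prompted to wrap its output in `<think>`/`<answer>` tags, you will typically want to strip the reasoning and keep only the final answer. A minimal post-processing sketch (assuming the model follows the template, which is not guaranteed):

```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, falling back to the raw response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

print(extract_answer(output_text[0]))
```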
## Acknowledgements🎖

We gratefully acknowledge the contributions of the open-source community, particularly the following projects, which laid the foundation for various components of this work:

- [Qwen](https://github.com/QwenLM) for providing the powerful vision-language models that underpin our multimodal understanding and generation capabilities.
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) for document layout detection.
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) for comprehensive optical character recognition.
- [ModelScope Swift](https://github.com/modelscope/ms-swift) for efficient fine-tuning and deployment tooling.
- [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) for robust LLM training and fine-tuning pipelines.
- [verl](https://github.com/volcengine/verl) for its reinforcement learning framework for LLMs.
- [DeepSeek](https://github.com/deepseek-ai) for high-quality models and infrastructure supporting text understanding.

We thank the authors and contributors of these repositories for their dedication and impactful work, which made the development of Patho-R1-3B possible.

## Citation❤️
If you find our work helpful, a citation would be greatly appreciated:

```bibtex
@article{zhang2025patho,
  title={Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner},
  author={Zhang, Wenchuan and Zhang, Penghao and Guo, Jingru and Cheng, Tao and Chen, Jie and Zhang, Shuwan and Zhang, Zhang and Yi, Yuhao and Bu, Hong},
  journal={arXiv preprint arXiv:2505.11404},
  year={2025}
}
```