---
language:
- en
license: other
library_name: transformers
pipeline_tag: text-generation
tags:
- clinical-nlp
- medical-reasoning
- icd-10-cm
- reinforcement-learning
- grpo
- qwen2.5
- diagnosis-prediction
- chain-of-thought
- research
base_model:
- Qwen/Qwen2.5-32B-Instruct
model-index:
- name: DeepICD-R1-zero-32B
  results: []
---

# DeepICD-R1-zero-32B

DeepICD-R1-zero-32B is a clinical reasoning model for **ICD-10-CM diagnosis outcome prediction from admission notes**, obtained by applying **Group Relative Policy Optimization (GRPO)** to **Qwen2.5-32B-Instruct**.

This model is the **GRPO-only large model** described in the DeepICD-R1 paper: the first-stage reasoning model whose outputs are later distilled into a supervised fine-tuning dataset for smaller downstream models. In the paper it is referred to as **DeepICD-R1-zero-32B**, and it is part of the broader **DeepICD-R1** framework for hierarchical medical reasoning with verifiable rewards.

## Relation to the paper

This repository contains the model corresponding to the **“zero” GRPO-trained large model** in:

**DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

In the paper, the overall framework has two stages:

1. A large instruction-tuned model is optimized with **GRPO** using structured clinical rewards. This yields **DeepICD-R1-zero-32B**.
2. That model is then used to generate reasoning traces, which are distilled into a large supervised dataset for smaller models such as DeepICD-R1-7B.

In short, the large base LLM is trained with GRPO and dedicated reward functions to produce **DeepICD-R1-zero-32B**, which is then used in dataset construction and later fine-tuning stages.

## Model description

- **Model name:** DeepICD-R1-zero-32B
- **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- **Training method:** Reinforcement learning with **GRPO**
- **Domain:** Clinical NLP
- **Task:** Predicting the first annotated **ICD-10-CM diagnosis code** from admission notes
- **Input:** Admission note text
- **Output:** Structured reasoning plus a predicted ICD-10-CM code

The model is trained to generate outputs in the following structure:

```xml
<think>
...
</think>
<diagnosis>
...
</diagnosis>
```

The paper uses this structured format together with **hierarchical ICD-aware rewards** and an **LLM-based reasoning reward**.
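
The exact prompt used during training is not reproduced in this card, so the sketch below is only an illustration of how one might instruct the model to follow this format. The system wording, the `build_messages` helper, and the example note are all assumptions, not the paper's actual prompt:

```python
# Hypothetical prompt construction. The system message wording is an
# assumption; the paper's actual training prompt is not published here.
def build_messages(admission_note: str) -> list[dict]:
    system = (
        "You are a clinical reasoning assistant. Read the admission note, "
        "reason step by step inside <think>...</think>, then output a single "
        "ICD-10-CM code inside <diagnosis>...</diagnosis>."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": admission_note},
    ]

# Example usage with a made-up note fragment:
messages = build_messages(
    "Patient admitted with lower back pain radiating into the left leg."
)
```

A chat-formatted message list like this can then be passed through the tokenizer's chat template before generation.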

---

## Intended Use

This model is intended for:

- research on clinical reasoning with language models
- ICD-10-CM outcome prediction from admission notes
- studying reinforcement learning with verifiable hierarchical rewards
- generating reasoning traces for analysis or data distillation
- reproducing or extending the DeepICD-R1 framework

---

## Out-of-Scope Use

This model is **not intended for**:

- real-world diagnosis
- clinical decision support in production
- autonomous medical coding in care settings
- unsupervised deployment on patient data
- use without human oversight

As emphasized in the paper, this is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making. Generated reasoning may appear plausible while still being clinically incorrect.

---

## Training Data

The model was trained on **MIMIC-IV admission notes** for single-label prospective ICD-10-CM outcome prediction.
According to the paper, the task is formulated as predicting the **first annotated diagnosis code from admission-time information**, using MIMIC-IV admission notes with leakage-prone diagnostic and treatment sections excluded.
A PhysioNet link to the dataset will be added soon.

---

## Training Procedure

This model was trained with the **verl PPO trainer** using **GRPO** as the advantage estimator.

### Core Setup

- **Trainer:** `verl.trainer.main_ppo`
- **Advantage estimator:** `grpo`
- **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- **Epochs:** 1
- **Effective train batch size:** 64
- **Rollouts per prompt:** 8
- **Max prompt length:** 2048 tokens
- **Max response length:** 1024 tokens
- **Sampling temperature:** 0.9
- **Learning rate:** 1e-6
- **Warmup steps:** 80
- **Entropy coefficient:** 0.001
- **KL loss:** disabled
- **Actor torch compile:** enabled
- **Gradient checkpointing:** enabled
- **Rollout engine:** vLLM
- **Rollout dtype:** bfloat16
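
For orientation, the setup above could be expressed as a verl launch command along these lines. This is a hedged sketch, not the actual training script: the override key names follow verl's Hydra-style config and may differ across verl versions, the data path is a placeholder, and the warmup and reward-function wiring are omitted:

```shell
# Sketch only: key names assume verl's Hydra-style overrides and may vary
# by verl version; the training file path is a placeholder.
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/path/to/train.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=2048 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=0.9 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.total_epochs=1
```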

---

## Hardware

- **GPUs:** 8
- **Nodes:** 1
- **GPU type:** not explicitly specified in the config
- **Memory limit:** 512 GiB

---

## Reward Setup

Training used a **custom batched reward function** with the following active components:

- Outcome reward: enabled
- Format reward: enabled
- LLM-as-a-judge reward: enabled
- Judge RAG: enabled
- Guidelines file: ICD-10-CM chapter guidelines JSON
- Judge model: `meta-llama/Llama-3.1-8B-Instruct`

### Selected Reward Environment Settings

```text
ACTIVATE_OUTCOME_REWARD=True
ACTIVATE_FORMAT_REWARD=True
JUDGE_RAG_ENABLED=True
NO_MATCH_MALUS=-1
THINK_TRACE_REWARD=1
MATCH_REWARD=15
LLM_REWARD_SCALING=0.8
```

This aligns with the paper’s training design, which combines:

- a **format reward** for `<think>` and `<diagnosis>` structure
- a **hierarchical ICD outcome reward**
- an **LLM-as-a-judge reward** to improve reasoning clarity and consistency
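
As a minimal sketch of how such components could combine into a scalar reward, using the constants from the environment settings above: the actual batched reward function is not reproduced here, so the partial-credit scheme for category-level near misses and the `judge_score` interface are assumptions, not the paper's implementation:

```python
import re

# Hedged sketch of a combined reward. The real batched reward function and
# its partial-credit scheme are not published here; details are assumptions.
MATCH_REWARD = 15         # full ICD-10-CM code match
NO_MATCH_MALUS = -1       # no hierarchical overlap (or malformed output)
THINK_TRACE_REWARD = 1    # well-formed <think>/<diagnosis> structure
LLM_REWARD_SCALING = 0.8  # weight on the LLM-judge score

FORMAT_RE = re.compile(
    r"<think>.*?</think>\s*<diagnosis>(.*?)</diagnosis>", re.DOTALL
)

def reward(completion: str, gold_code: str, judge_score: float = 0.0) -> float:
    m = FORMAT_RE.search(completion)
    if m is None:
        return float(NO_MATCH_MALUS)    # malformed output: malus, no other rewards
    total = float(THINK_TRACE_REWARD)   # format reward
    pred = m.group(1).strip()
    if pred == gold_code:
        total += MATCH_REWARD           # exact code match
    elif pred[:3] == gold_code[:3]:
        total += MATCH_REWARD / 3       # same ICD-10-CM category (assumed partial credit)
    else:
        total += NO_MATCH_MALUS
    return total + LLM_REWARD_SCALING * judge_score  # LLM-as-a-judge component
```

For example, an exact match with a well-formed trace and a zero judge score would yield `1 + 15 = 16`.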

---

## Prompt / Output Format

The model expects an **admission note** as input and is trained to return:

1. a reasoning trace inside `<think>...</think>`
2. a predicted ICD-10-CM code inside `<diagnosis>...</diagnosis>`

### Example Schema

```xml
<think>
Reasoning over presenting symptoms, history, and admission note evidence.
</think>

<diagnosis>
M5116
</diagnosis>
```

Users should validate that generated outputs conform to this format before downstream evaluation.

---

## Evaluation

In the paper, evaluation is performed on **hierarchical ICD-10-CM prediction tasks** at three levels:

- **Chapter**
- **Category**
- **Full diagnosis code**
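
To illustrate the three levels: the category is the first three characters of an ICD-10-CM code, while chapters are defined by code ranges (e.g. M00–M99 for Chapter XIII). The tiny chapter lookup below is a deliberately incomplete assumption for demonstration, not the mapping used in the paper's evaluation:

```python
# Illustrative hierarchy extraction. Category = first three characters.
# The chapter table is an assumed two-entry sample; real ICD-10-CM chapters
# are defined by full code ranges.
EXAMPLE_CHAPTERS = {
    "M": "XIII (Diseases of the musculoskeletal system and connective tissue)",
    "I": "IX (Diseases of the circulatory system)",
}

def hierarchy_levels(code: str) -> dict:
    return {
        "chapter": EXAMPLE_CHAPTERS.get(code[0], "unknown in this sketch"),
        "category": code[:3],
        "full": code,
    }
```

For the schema example above, `hierarchy_levels("M5116")` gives category `M51` under Chapter XIII.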

The paper reports that the **GRPO-only 32B model improves over the instruction-tuned baseline**, but remains weaker than models trained with **both supervised fine-tuning and GRPO**, especially for **fine-grained full-code prediction**.

This repository does **not claim any additional benchmark results** beyond those reported in the paper unless explicitly added later.

---

## Limitations

Important limitations discussed in the paper include:

- reasoning traces may be coherent but **not clinically correct**
- the model can exhibit **premature diagnostic closure**
- performance drops on **fine-grained and rare ICD codes**
- the underlying data reflects **institutional and demographic bias**
- the model may fail to capture the **severity or clinical significance of diagnoses**
- reinforcement signals based on **automatic rewards and LLM judging are only proxies for expert review**

The paper also notes that clinicians often preferred **concise reasoning**, and that plausible-looking outputs may still omit important differential diagnoses.

---

## Ethical Considerations

- Trained on **de-identified MIMIC-IV data** under the applicable data-use framework
- **Research-only release**
- Not suitable for **patient-facing or clinician-facing decision support** without substantial additional validation
- May propagate **dataset bias and disease-frequency imbalance**
- Outputs should **not be interpreted as medical advice**

Please read the paper’s **Ethical Considerations** and **Limitations** sections before using this model.

---

## Citation

If you use this model, please cite the paper:

```bibtex
@inproceedings{roehr2026deepicdr1,
  title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
  author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
  booktitle={Proceedings of LREC-COLING 2026},
  year={2026}
}
```