---
language:
- en
license: other
library_name: transformers
pipeline_tag: text-generation
tags:
- clinical-nlp
- medical-reasoning
- icd-10-cm
- reinforcement-learning
- grpo
- qwen2.5
- diagnosis-prediction
- chain-of-thought
- research
base_model:
- Qwen/Qwen2.5-32B-Instruct
model-index:
- name: DeepICD-R1-zero-32B
  results: []
---

# DeepICD-R1-zero-32B

DeepICD-R1-zero-32B is a clinical reasoning model for **ICD-10-CM diagnosis outcome prediction from admission notes**, obtained by applying **Group Relative Policy Optimization (GRPO)** to **Qwen2.5-32B-Instruct**.

This model is the **GRPO-only large model** described in the DeepICD-R1 paper: the first-stage reasoning model whose outputs are later distilled into a supervised fine-tuning dataset for smaller downstream models. In the paper it is referred to as **DeepICD-R1-zero-32B**, and it is part of the broader **DeepICD-R1** framework for hierarchical medical reasoning with verifiable rewards.

## Relation to the paper

This repository contains the model corresponding to the **“zero” GRPO-trained large model** in:

**DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

In the paper, the overall framework has two stages:

1. A large instruction-tuned model is optimized with **GRPO** using structured clinical rewards. This yields **DeepICD-R1-zero-32B**.
2. That model is then used to generate reasoning traces, which are distilled into a large supervised dataset for smaller models such as DeepICD-R1-7B.

In short, the large base LLM is trained with GRPO and dedicated reward functions to produce **DeepICD-R1-zero-32B**, which is then used in dataset construction and later fine-tuning stages.

## Model description

- **Model name:** DeepICD-R1-zero-32B
- **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- **Training method:** Reinforcement learning with **GRPO**
- **Domain:** Clinical NLP
- **Task:** Predicting the first annotated **ICD-10-CM diagnosis code** from admission notes
- **Input:** Admission note text
- **Output:** Structured reasoning plus a predicted ICD-10-CM code

The model is trained to generate outputs in the following structure:

```xml
<think>
...
</think>
<diagnosis>
...
</diagnosis>
```

The paper uses this structured format together with **hierarchical ICD-aware rewards** and an **LLM-based reasoning reward**.
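
The exact prompt used during training is not reproduced in this card, so the sketch below is only an illustration of how one might instruct the model to follow this format. The system wording, the `build_messages` helper, and the example note are all assumptions, not the paper's actual prompt:

```python
# Hypothetical prompt construction. The system message wording is an
# assumption; the paper's actual training prompt is not published here.
def build_messages(admission_note: str) -> list[dict]:
    system = (
        "You are a clinical reasoning assistant. Read the admission note, "
        "reason step by step inside <think>...</think>, then output a single "
        "ICD-10-CM code inside <diagnosis>...</diagnosis>."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": admission_note},
    ]

# Example usage with a made-up note fragment:
messages = build_messages(
    "Patient admitted with lower back pain radiating into the left leg."
)
```

A chat-formatted message list like this can then be passed through the tokenizer's chat template before generation.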

---

## Intended Use

This model is intended for:

- research on clinical reasoning with language models
- ICD-10-CM outcome prediction from admission notes
- studying reinforcement learning with verifiable hierarchical rewards
- generating reasoning traces for analysis or data distillation
- reproducing or extending the DeepICD-R1 framework

---

## Out-of-Scope Use

This model is **not intended for**:

- real-world diagnosis
- clinical decision support in production
- autonomous medical coding in care settings
- unsupervised deployment on patient data
- use without human oversight

As emphasized in the paper, this is a **research prototype** and must not be used for real-world diagnosis or clinical decision-making. Generated reasoning may appear plausible while still being clinically incorrect.

---

## Training Data

The model was trained on **MIMIC-IV admission notes** for single-label prospective ICD-10-CM outcome prediction.
According to the paper, the task is formulated as predicting the **first annotated diagnosis code from admission-time information**, using MIMIC-IV admission notes with leakage-prone diagnostic and treatment sections excluded.
A PhysioNet link to the dataset will be added soon.

---

## Training Procedure

This model was trained with the **verl PPO trainer** using **GRPO** as the advantage estimator.

### Core Setup

- **Trainer:** `verl.trainer.main_ppo`
- **Advantage estimator:** `grpo`
- **Base model:** `Qwen/Qwen2.5-32B-Instruct`
- **Epochs:** 1
- **Effective train batch size:** 64
- **Rollouts per prompt:** 8
- **Max prompt length:** 2048 tokens
- **Max response length:** 1024 tokens
- **Sampling temperature:** 0.9
- **Learning rate:** 1e-6
- **Warmup steps:** 80
- **Entropy coefficient:** 0.001
- **KL loss:** disabled
- **Actor torch compile:** enabled
- **Gradient checkpointing:** enabled
- **Rollout engine:** vLLM
- **Rollout dtype:** bfloat16
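
For orientation, the setup above could be expressed as a verl launch command along these lines. This is a hedged sketch, not the actual training script: the override key names follow verl's Hydra-style config and may differ across verl versions, the data path is a placeholder, and the warmup and reward-function wiring are omitted:

```shell
# Sketch only: key names assume verl's Hydra-style overrides and may vary
# by verl version; the training file path is a placeholder.
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/path/to/train.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=2048 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.entropy_coeff=0.001 \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.temperature=0.9 \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.total_epochs=1
```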

---

## Hardware

- **GPUs:** 8
- **Nodes:** 1
- **GPU type:** not explicitly specified in the config
- **Memory limit:** 512 GiB

---

## Reward Setup

Training used a **custom batched reward function** with the following active components:

- Outcome reward: enabled
- Format reward: enabled
- LLM-as-a-judge reward: enabled
- Judge RAG: enabled
- Guidelines file: ICD-10-CM chapter guidelines JSON
- Judge model: `meta-llama/Llama-3.1-8B-Instruct`

### Selected Reward Environment Settings

```text
ACTIVATE_OUTCOME_REWARD=True
ACTIVATE_FORMAT_REWARD=True
JUDGE_RAG_ENABLED=True
NO_MATCH_MALUS=-1
THINK_TRACE_REWARD=1
MATCH_REWARD=15
LLM_REWARD_SCALING=0.8
```

This aligns with the paper’s training design, which combines:

- a **format reward** for `<think>` and `<diagnosis>` structure
- a **hierarchical ICD outcome reward**
- an **LLM-as-a-judge reward** to improve reasoning clarity and consistency
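
As a minimal sketch of how such components could combine into a scalar reward, using the constants from the environment settings above: the actual batched reward function is not reproduced here, so the partial-credit scheme for category-level near misses and the `judge_score` interface are assumptions, not the paper's implementation:

```python
import re

# Hedged sketch of a combined reward. The real batched reward function and
# its partial-credit scheme are not published here; details are assumptions.
MATCH_REWARD = 15         # full ICD-10-CM code match
NO_MATCH_MALUS = -1       # no hierarchical overlap (or malformed output)
THINK_TRACE_REWARD = 1    # well-formed <think>/<diagnosis> structure
LLM_REWARD_SCALING = 0.8  # weight on the LLM-judge score

FORMAT_RE = re.compile(
    r"<think>.*?</think>\s*<diagnosis>(.*?)</diagnosis>", re.DOTALL
)

def reward(completion: str, gold_code: str, judge_score: float = 0.0) -> float:
    m = FORMAT_RE.search(completion)
    if m is None:
        return float(NO_MATCH_MALUS)    # malformed output: malus, no other rewards
    total = float(THINK_TRACE_REWARD)   # format reward
    pred = m.group(1).strip()
    if pred == gold_code:
        total += MATCH_REWARD           # exact code match
    elif pred[:3] == gold_code[:3]:
        total += MATCH_REWARD / 3       # same ICD-10-CM category (assumed partial credit)
    else:
        total += NO_MATCH_MALUS
    return total + LLM_REWARD_SCALING * judge_score  # LLM-as-a-judge component
```

For example, an exact match with a well-formed trace and a zero judge score would yield `1 + 15 = 16`.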

---

## Prompt / Output Format

The model expects an **admission note** as input and is trained to return:

1. a reasoning trace inside `<think>...</think>`
2. a predicted ICD-10-CM code inside `<diagnosis>...</diagnosis>`

### Example Schema

```xml
<think>
Reasoning over presenting symptoms, history, and admission note evidence.
</think>

<diagnosis>
M5116
</diagnosis>
```

Users should validate that generated outputs conform to this format before downstream evaluation.

---

## Evaluation

In the paper, evaluation is performed on **hierarchical ICD-10-CM prediction tasks** at three levels:

- **Chapter**
- **Category**
- **Full diagnosis code**
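
To illustrate the three levels: the category is the first three characters of an ICD-10-CM code, while chapters are defined by code ranges (e.g. M00–M99 for Chapter XIII). The tiny chapter lookup below is a deliberately incomplete assumption for demonstration, not the mapping used in the paper's evaluation:

```python
# Illustrative hierarchy extraction. Category = first three characters.
# The chapter table is an assumed two-entry sample; real ICD-10-CM chapters
# are defined by full code ranges.
EXAMPLE_CHAPTERS = {
    "M": "XIII (Diseases of the musculoskeletal system and connective tissue)",
    "I": "IX (Diseases of the circulatory system)",
}

def hierarchy_levels(code: str) -> dict:
    return {
        "chapter": EXAMPLE_CHAPTERS.get(code[0], "unknown in this sketch"),
        "category": code[:3],
        "full": code,
    }
```

For the schema example above, `hierarchy_levels("M5116")` gives category `M51` under Chapter XIII.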

The paper reports that the **GRPO-only 32B model improves over the instruction-tuned baseline**, but remains weaker than models trained with **both supervised fine-tuning and GRPO**, especially for **fine-grained full-code prediction**.

This repository does **not claim any additional benchmark results** beyond those reported in the paper unless explicitly added later.

---

## Limitations

Important limitations discussed in the paper include:

- reasoning traces may be coherent but **not clinically correct**
- the model can exhibit **premature diagnostic closure**
- performance drops on **fine-grained and rare ICD codes**
- the underlying data reflects **institutional and demographic bias**
- the model may fail to capture the **severity or clinical significance of diagnoses**
- reinforcement signals based on **automatic rewards and LLM judging are only proxies for expert review**

The paper also notes that clinicians often preferred **concise reasoning**, and that plausible-looking outputs may still omit important differential diagnoses.

---

## Ethical Considerations

- Trained on **de-identified MIMIC-IV data** under the applicable data-use framework
- **Research-only release**
- Not suitable for **patient-facing or clinician-facing decision support** without substantial additional validation
- May propagate **dataset bias and disease-frequency imbalance**
- Outputs should **not be interpreted as medical advice**

Please read the paper’s **Ethical Considerations** and **Limitations** sections before using this model.

---

## Citation

If you use this model, please cite the paper:

```bibtex
@inproceedings{roehr2026deepicdr1,
  title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
  author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and Figueroa, Alexei and Grundmann, Paul and Troeger, Peter and Gers, Felix and L{\"o}ser, Alexander},
  booktitle={Proceedings of LREC-COLING 2026},
  year={2026}
}
```