black-yt committed on
Commit
96b75d0
·
1 Parent(s): d4ad2d7

Expand model card and assets

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
 
 
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
  tokenizer.json filter=lfs diff=lfs merge=lfs -text
37
+ assets/*.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -17,31 +17,83 @@ tags:
17
  </div>
18
  </div>
19
 
 
 
 
 
20
  <div align="center">
21
 
22
  [![Website](https://img.shields.io/badge/%F0%9F%9A%80%20Website-LabHorizon-00c2a8)](https://conglab-research.github.io/LabHorizon/)&nbsp;
23
  ![arXiv](https://img.shields.io/badge/arXiv-coming%20soon-b31b1b?logo=arxiv&logoColor=white)&nbsp;
24
  [![Code](https://img.shields.io/badge/Code-LabHorizon-000000?logo=github&logoColor=white)](https://github.com/CongLab-Research/LabHorizon)&nbsp;
25
- [![Data L1](https://img.shields.io/badge/%F0%9F%A4%97%20Data-L1-blue)](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception)&nbsp;
26
- [![Data L2](https://img.shields.io/badge/%F0%9F%A4%97%20Data-L2-purple)](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning)&nbsp;
27
- [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Model-LoRA-orange)](https://huggingface.co/CongLab-Research/LabHorizon-Model)
28
 
29
  **Qwen3.6-35B-A3B LoRA for protocol-conditioned laboratory action prediction**
30
 
 
 
31
  </div>
32
 
33
  ---
34
 
 
 
 
 
35
  ## 🔎 Overview
36
 
37
- This repository releases the LabHorizon LoRA adapter trained from `Qwen/Qwen3.6-35B-A3B` on the 6,000-sample LabHorizon training split. The model is optimized for **Protocol-Conditioned Action Prediction**:
38
 
39
  - **Level 1:** connect multi-view laboratory assets and historical actions to the gold next action.
40
  - **Level 2:** produce a structured long-horizon experimental action sequence from context, constraints, available inputs, and an action pool.
41
 
42
- The released weights are an adapter, not the base model. Load them with the corresponding Qwen3.6-35B-A3B base model.
43
 
44
- ## 📦 Files
45
 
46
  | File | Meaning |
47
  |:---|:---|
@@ -53,6 +105,91 @@ The released weights are an adapter, not the base model. Load them with the corr
53
  | `trainer_state.json`, `trainer_log.jsonl`, `training_args.bin` | Training state and arguments for reproducibility. |
54
  | `training_loss.png`, `training_eval_loss.png` | Loss curves. |
55
 
56
  ## 🧠 Training Result
57
 
58
  The table compares direct-prompting SOTA/baseline systems, the base Qwen model, this trained LoRA adapter, and the trained+agents system evaluated on the same LabHorizon test splits.
@@ -65,11 +202,21 @@ The table compares direct-prompting SOTA/baseline systems, the base Qwen model,
65
  | Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
66
  | Qwen3.6-35B-A3B | 0.475 | 0.2585 | 0.2483 | 0.2534 |
67
  | Qwen3.6-35B-A3B(trained) | 0.635 | 0.4030 | 0.4170 | 0.4100 |
68
- | Qwen3.6-35B-A3B(trained+agents*) | **0.665** | **0.4485** | **0.4580** | **0.4532** |
69
 
70
- `*` uses `Qwen3.6-35B-A3B(trained)` as Actor and Gemini 3.1 Pro Preview as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
71
 
72
- ## ⚙️ Loading
73
 
74
  ```python
75
  from transformers import AutoModelForCausalLM, AutoProcessor
@@ -88,9 +235,47 @@ base = AutoModelForCausalLM.from_pretrained(
88
  model = PeftModel.from_pretrained(base, adapter_id)
89
  ```
90
 
91
  ## ⚠️ Intended Use
92
 
93
- This adapter is intended for academic research on laboratory action prediction, experimental planning, and AI scientist systems. It should not be used as an autonomous wet-lab controller or for safety-critical experimental decisions without expert review.
94
 
95
  ## 📜 Citation
96
 
 
17
  </div>
18
  </div>
19
 
20
+ <div align="center">
21
+ <img src="./assets/stanford_logo.png" width="15%" alt="logo">
22
+ </div>
23
+
24
  <div align="center">
25
 
26
  [![Website](https://img.shields.io/badge/%F0%9F%9A%80%20Website-LabHorizon-00c2a8)](https://conglab-research.github.io/LabHorizon/)&nbsp;
27
  ![arXiv](https://img.shields.io/badge/arXiv-coming%20soon-b31b1b?logo=arxiv&logoColor=white)&nbsp;
28
  [![Code](https://img.shields.io/badge/Code-LabHorizon-000000?logo=github&logoColor=white)](https://github.com/CongLab-Research/LabHorizon)&nbsp;
29
+ [![Data L1 3D Asset](https://img.shields.io/badge/%F0%9F%A4%97%20Data-L1%203D%20Asset-blue)](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception)&nbsp;
30
+ [![Data L2 Protocol](https://img.shields.io/badge/%F0%9F%A4%97%20Data-L2%20Protocol-purple)](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning)&nbsp;
31
+ [![Model](https://img.shields.io/badge/%F0%9F%A4%97%20Model-Qwen3.6-orange)](https://huggingface.co/CongLab-Research/LabHorizon-Model)
32
 
33
  **Qwen3.6-35B-A3B LoRA for protocol-conditioned laboratory action prediction**
34
 
35
+ [Overview](#-overview) | [News](#-news) | [Highlights](#-highlights) | [Datasets](#-datasets) | [Evaluation](#-evaluation) | [Leaderboard](#-leaderboard) | [Training](#-training-result) | [Agent](#-actor-simulator-selector-agent) | [Quick Start](#-quick-start) | [Citation](#-citation)
36
+
37
  </div>
38
 
39
  ---
40
 
41
+ <p align="center">
42
+ <img src="./assets/terser.png" alt="LabHorizon laboratory asset teaser" width="100%">
43
+ </p>
44
+
45
  ## 🔎 Overview
46
 
47
+ This repository releases the LabHorizon Qwen3.6 LoRA adapter trained from `Qwen/Qwen3.6-35B-A3B` on the 6,000-sample LabHorizon training split. The model is optimized for **Protocol-Conditioned Action Prediction**:
48
 
49
  - **Level 1:** connect multi-view laboratory assets and historical actions to the gold next action.
50
  - **Level 2:** produce a structured long-horizon experimental action sequence from context, constraints, available inputs, and an action pool.
51
 
52
+ This model repository is the model-side companion to the LabHorizon code and dataset releases. The GitHub repository is the full project entry point; the two dataset cards describe Level 1 and Level 2 data; this card focuses on the trained Qwen3.6 adapter, its files, training signal, evaluation result, and loading instructions.
53
+
54
+ ## 📰 News
55
+
56
+ - **2026-06-03:** Released the LabHorizon LoRA adapter weights and reproducibility files on Hugging Face.
57
+ - **2026-06-03:** Updated the public LabHorizon leaderboards with Claude Opus 4.8 and MiniMax M3 direct-prompting evaluations.
58
+
59
+ ## ✨ Highlights
60
+
61
+ <table>
62
+ <tr>
63
+ <td align="center" width="25%">🧪<br/><b>Qwen3.6 Adapter</b><br/><sub>LoRA weights for Qwen3.6-35B-A3B</sub></td>
64
+ <td align="center" width="25%">🔬<br/><b>Level 1 Signal</b><br/><sub>Multi-view asset next-action prediction</sub></td>
65
+ <td align="center" width="25%">🧭<br/><b>Level 2 Signal</b><br/><sub>Long-horizon protocol-conditioned planning</sub></td>
66
+ <td align="center" width="25%">🧠<br/><b>Train + Agent</b><br/><sub>Supports trained and trained+agents settings</sub></td>
67
+ </tr>
68
+ </table>
69
+
70
+ ## 📦 Datasets
71
+
72
+ The adapter is trained on the same public LabHorizon train split described by the two dataset cards. The evaluation results below use the same `v20260510-repaired` test split as the GitHub README and the dataset READMEs.
73
+
74
+ | Level | Hugging Face Dataset | Input | Target | Metric |
75
+ |:---|:---|:---|:---|:---|
76
+ | **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
77
+ | **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | Action Sequence Similarity, Parameter Accuracy |
78
+
79
+ ## 📦 Model
80
 
81
+ ### 🧾 Model Card
82
+
83
+ | Field | Value |
84
+ |:---|:---|
85
+ | Base model | `Qwen/Qwen3.6-35B-A3B` |
86
+ | Adapter type | LoRA / PEFT adapter |
87
+ | Training data | 6,000 LabHorizon train samples |
88
+ | Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
89
+ | Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
90
+ | Main task | Protocol-conditioned laboratory action prediction |
91
+ | Main metrics | Level 1 Next Action Accuracy; Level 2 Action Sequence Similarity and Parameter Accuracy |
92
+ | Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
93
+
94
+ The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
95
+
96
+ ### 📁 Files
97
 
98
  | File | Meaning |
99
  |:---|:---|
 
105
  | `trainer_state.json`, `trainer_log.jsonl`, `training_args.bin` | Training state and arguments for reproducibility. |
106
  | `training_loss.png`, `training_eval_loss.png` | Loss curves. |
107
 
108
+ ## 📏 Evaluation
109
+
110
+ LabHorizon uses the same evaluation contracts across direct-prompting models, the trained adapter, and the trained+agents setting.
111
+
112
+ | Level | Output format | Metric |
113
+ |:---|:---|:---|
114
+ | Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
115
+ | Level 2 | Structured action sequence parsed by Python AST | Action Sequence Similarity, Parameter Accuracy, Final Score |
116
+
117
+ For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
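As a rough illustration of the Level 2 parsing step described above, the sketch below uses Python's standard `ast` module to pull action names, keyword parameters, assigned intermediate variables, and dependency references out of a short action sequence. The action names (`centrifuge`, `transfer`), the parameter names, and the record layout are hypothetical; the actual evaluator lives in the LabHorizon GitHub repository and may work differently.

```python
import ast


def parse_action_sequence(source: str) -> list:
    """Return one record per top-level action call found in `source`."""
    steps = []
    for node in ast.parse(source).body:
        target = None
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            call = node.value
            # Only simple `name = action(...)` targets are handled in this sketch.
            if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                target = node.targets[0].id
        elif isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            call = node.value
        else:
            continue  # skip statements that are not action calls
        if not isinstance(call.func, ast.Name):
            continue  # skip method calls such as obj.action(...)
        params, depends_on = {}, []
        for kw in call.keywords:
            if kw.arg is None:
                continue  # skip **kwargs expansions
            if isinstance(kw.value, ast.Constant):
                params[kw.arg] = kw.value.value  # literal parameter value
            else:
                params[kw.arg] = ast.unparse(kw.value)  # keep the expression as text
                if isinstance(kw.value, ast.Name):
                    depends_on.append(kw.value.id)  # reference to an earlier variable
        steps.append({
            "action": call.func.id,
            "assigns": target,
            "params": params,
            "depends_on": depends_on,
        })
    return steps


# Hypothetical two-step sequence: the second step consumes the first step's output.
example = '''
pellet = centrifuge(sample="lysate", speed_rpm=12000, minutes=10)
transfer(source=pellet, destination="tube_2")
'''
steps = parse_action_sequence(example)
```

Working on the syntax tree rather than on raw strings lets a scorer compare action order, parameters, and variable dependencies separately.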
118
+
119
+ ## 🏆 Leaderboard
120
+
121
+ The tables below report direct-prompting baselines on the same test split used for the trained model comparison. The full code and evaluation scripts are maintained in the [LabHorizon GitHub repository](https://github.com/CongLab-Research/LabHorizon).
122
+
123
+ ### 🔬 Level 1: 3D Asset Perception
124
+
125
+ | Rank | Model | Next Action Accuracy |
126
+ |:---:|:---|---:|
127
+ | 🥇 | Grok 4.3 | 0.555 |
128
+ | 🥈 | Kimi K2.6 | 0.550 |
129
+ | 🥉 | GPT-5.5 | 0.535 |
130
+ | 4 | GPT-5.4 | 0.520 |
131
+ | 5 | Claude Opus 4.8 | 0.515 |
132
+ | 6 | MiniMax M3 | 0.510 |
133
+ | 7 | Qwen3.6 Plus | 0.505 |
134
+ | 8 | Claude Opus 4.7 | 0.500 |
135
+ | 9 | Qwen3.5 35B-A3B | 0.495 |
136
+ | 10 | MiMo V2.5 | 0.495 |
137
+ | 11 | Qwen3.5 9B | 0.485 |
138
+ | 12 | Gemini 3.5 Flash | 0.485 |
139
+ | 13 | Qwen3.6 35B-A3B | 0.475 |
140
+ | 14 | Gemini 3.1 Pro Preview | 0.465 |
141
+
142
+ ### 🧪 Level 2: Protocol-Conditioned Planning
143
+
144
+ | Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
145
+ |:---:|:---|---:|---:|---:|
146
+ | 🥇 | Gemini 3.1 Pro Preview | 0.3263 | 0.3195 | 0.3331 |
147
+ | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
148
+ | 🥉 | Kimi K2.6 | 0.3150 | 0.2845 | 0.3456 |
149
+ | 4 | Gemini 3.5 Flash | 0.3039 | 0.2686 | 0.3391 |
150
+ | 5 | Qwen3.7 Max | 0.3003 | 0.2905 | 0.3102 |
151
+ | 6 | MiniMax M3 | 0.2954 | 0.2812 | 0.3095 |
152
+ | 7 | Claude Opus 4.8 | 0.2911 | 0.2756 | 0.3066 |
153
+ | 8 | Claude Opus 4.7 | 0.2737 | 0.2619 | 0.2856 |
154
+ | 9 | GPT-5.4 | 0.2715 | 0.2191 | 0.3239 |
155
+ | 10 | Qwen3.6 35B-A3B | 0.2534 | 0.2585 | 0.2483 |
156
+ | 11 | Qwen3.6 Plus | 0.2526 | 0.2264 | 0.2787 |
157
+ | 12 | MiMo V2.5 | 0.2491 | 0.2269 | 0.2713 |
158
+ | 13 | GLM 5.1 | 0.2413 | 0.2307 | 0.2519 |
159
+ | 14 | Qwen3.5 35B-A3B | 0.2391 | 0.2385 | 0.2398 |
160
+ | 15 | GPT-5.5 | 0.2276 | 0.2092 | 0.2459 |
161
+ | 16 | DeepSeek V4 Pro | 0.2135 | 0.1927 | 0.2342 |
162
+ | 17 | Qwen3.5 9B | 0.1315 | 0.1359 | 0.1271 |
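The Final Score values in the table above appear consistent with the unweighted mean of Action Sequence Similarity and Parameter Accuracy, up to rounding in the fourth decimal place. This is an observation from the published numbers, not a definition taken from the LabHorizon evaluator. The check below uses three rows copied from the table; `mean_score` is an illustrative helper name.

```python
# (Final Score, Action Sequence Similarity, Parameter Accuracy), copied from the table.
rows = {
    "Gemini 3.1 Pro Preview": (0.3263, 0.3195, 0.3331),
    "Grok 4.3": (0.3244, 0.3339, 0.3148),
    "Qwen3.5 9B": (0.1315, 0.1359, 0.1271),
}


def mean_score(similarity: float, parameter_accuracy: float) -> float:
    """Unweighted mean of the two Level 2 metrics (assumed, not confirmed)."""
    return (similarity + parameter_accuracy) / 2


# Largest difference between the reported Final Score and the assumed mean.
max_gap = max(
    abs(final - mean_score(similarity, accuracy))
    for final, similarity, accuracy in rows.values()
)
print(f"largest gap: {max_gap:.5f}")
```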
163
+
164
+ ## 🧬 Training Data and Setup
165
+
166
+ The adapter is trained on the public LabHorizon training split:
167
+
168
+ | Component | Size | Role |
169
+ |:---|---:|:---|
170
+ | Level 1 train | 3,000 | Multi-view laboratory asset perception and next-action prediction |
171
+ | Level 2 train | 3,000 | Protocol-conditioned long-horizon experimental action-sequence planning |
172
+ | Total train | 6,000 | Unified supervised fine-tuning data for laboratory action prediction |
173
+
174
+ The training data are converted into Qwen chat format and then into the LLaMA-Factory ShareGPT-VL-style format. Level 1 keeps the three asset images and candidate next actions; Level 2 uses text-only context, constraints, available inputs, action pool, and gold experimental action sequence.
175
+
176
+ Main training settings:
177
+
178
+ | Setting | Value |
179
+ |:---|:---|
180
+ | LoRA rank / alpha / dropout | `32 / 64 / 0.10` |
181
+ | Learning rate | `1.0e-4` |
182
+ | Scheduler | Cosine |
183
+ | Warmup ratio | `0.10` |
184
+ | Cutoff length | `4096` |
185
+ | Image max pixels | `501760` |
186
+ | Epochs / max steps | `10 / 2500` |
187
+ | Precision | `bf16` |
188
+ | Gradient checkpointing | Enabled |
189
+ | Runtime | `10014.77 s` |
190
+ | Final train loss | `0.2691` |
191
+ | Final eval loss | `0.4426` |
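A few derived quantities follow from the settings above by simple arithmetic. The sketch assumes that all 6,000 samples are seen in every epoch and that the 2,500 steps cover exactly the 10 epochs. Neither is stated in the card, so the implied effective batch size is only an estimate. The LoRA scaling factor assumes the common alpha / rank convention, and the per-step time is a plain average over the reported runtime.

```python
train_samples = 6_000
epochs, max_steps = 10, 2_500
warmup_ratio = 0.10
runtime_s = 10014.77
lora_rank, lora_alpha = 32, 64

# Warmup length if the ratio is applied to the total number of optimizer steps.
warmup_steps = round(warmup_ratio * max_steps)

# Average wall-clock time per step (includes any evaluation overhead).
seconds_per_step = runtime_s / max_steps

# Effective batch size implied by the two assumptions stated above.
implied_batch = train_samples * epochs / max_steps

# Scaling applied to the low-rank update under the alpha / rank convention.
lora_scale = lora_alpha / lora_rank

print(warmup_steps, round(seconds_per_step, 2), implied_batch, lora_scale)
```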
192
+
193
  ## 🧠 Training Result
194
 
195
  The table compares direct-prompting SOTA/baseline systems, the base Qwen model, this trained LoRA adapter, and the trained+agents system evaluated on the same LabHorizon test splits.
 
202
  | Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
203
  | Qwen3.6-35B-A3B | 0.475 | 0.2585 | 0.2483 | 0.2534 |
204
  | Qwen3.6-35B-A3B(trained) | 0.635 | 0.4030 | 0.4170 | 0.4100 |
205
+ | Qwen3.6-35B-A3B(trained+agents) | **0.665** | **0.4485** | **0.4580** | **0.4532** |
206
+
207
+ Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro Preview is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
208
+
209
+ The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. Level 2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting raises these further, to `0.665` and `0.4532` respectively, by selecting candidates with stronger symbolic protocol-state validity.
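The gains from training can be restated as absolute and relative differences. The numbers are copied from the table; `gain` is an illustrative helper name.

```python
def gain(before: float, after: float) -> tuple:
    """Return (absolute gain, relative gain) of `after` over `before`."""
    return after - before, (after - before) / before


# Base Qwen3.6-35B-A3B versus the trained adapter.
l1_abs, l1_rel = gain(0.475, 0.635)    # Level 1 Next Action Accuracy
l2_abs, l2_rel = gain(0.2534, 0.4100)  # Level 2 Final Score

print(f"Level 1: +{l1_abs:.3f} ({l1_rel:.1%})")
print(f"Level 2: +{l2_abs:.4f} ({l2_rel:.1%})")
```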
210
+
211
+ ## 🤖 Actor-Simulator-Selector Agent
212
+
213
+ The trained+agents result uses this adapter as the Actor and combines it with a separate Simulator/Selector model. The agent is not a physical simulator and does not execute wet-lab actions. It samples candidate next actions or action sequences, checks symbolic protocol-state consistency, and selects the most consistent candidate.
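The sample, check, select loop described above can be sketched as follows. Everything here is a toy stand-in: `actor` returns two hand-written candidates instead of calling the adapter, `simulate` only checks that each step's inputs already exist in a symbolic state, and the scoring rule is an assumption made for illustration rather than the LabHorizon agent's actual logic.

```python
from typing import List, Optional, Set, Tuple

# (action name, input variables, output variable or None)
Step = Tuple[str, List[str], Optional[str]]
Plan = List[Step]


def simulate(plan: Plan, available: Set[str]) -> float:
    """Return the fraction of steps whose inputs exist when the step runs."""
    state = set(available)
    valid = 0
    for _action, inputs, output in plan:
        if all(name in state for name in inputs):
            valid += 1
            if output is not None:
                state.add(output)  # only valid steps produce new state
    return valid / len(plan) if plan else 0.0


def select(candidates: List[Plan], available: Set[str]) -> Plan:
    """Pick the candidate with the highest consistency score (first wins ties)."""
    return max(candidates, key=lambda plan: simulate(plan, available))


def actor() -> List[Plan]:
    """Toy Actor: one plan uses `pellet` before it exists, the other does not."""
    out_of_order = [("transfer", ["pellet"], None), ("centrifuge", ["lysate"], "pellet")]
    in_order = [("centrifuge", ["lysate"], "pellet"), ("transfer", ["pellet"], None)]
    return [out_of_order, in_order]


best = select(actor(), available={"lysate"})
```

Keeping the consistency check separate from generation means the Simulator/Selector can be swapped without retraining the Actor, which matches the card's note that this choice has not been exhaustively ablated.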
214
+
215
+ Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro Preview is used as Simulator/Selector. This Simulator/Selector choice is the current setting and has not been exhaustively ablated.
216
 
217
+ ## 🚀 Quick Start
218
 
219
+ ### Load Adapter
220
 
221
  ```python
222
  from transformers import AutoModelForCausalLM, AutoProcessor
 
235
  model = PeftModel.from_pretrained(base, adapter_id)
236
  ```
237
 
238
+ ### Evaluate with LabHorizon
239
+
240
+ Use the public code repository for evaluation and agent workflows:
241
+
242
+ ```bash
243
+ git clone https://github.com/CongLab-Research/LabHorizon
244
+ cd LabHorizon
245
+ ```
246
+
247
+ Configure an OpenAI-compatible endpoint in `.env`, then run the Level 1 / Level 2 evaluators or the Actor-Simulator-Selector agent following the GitHub README.
248
+
249
+ The evaluator can point to any compatible model endpoint or local serving stack. This model repository itself only releases the adapter and training artifacts.
250
+
251
  ## ⚠️ Intended Use
252
 
253
+ This adapter is intended for academic research on laboratory action prediction, experimental planning, and AI scientist systems. It is not an autonomous wet-lab controller. Outputs should be treated as model predictions and should not be used for safety-critical experimental decisions without expert review.
254
+
255
+ Recommended use cases:
256
+
257
+ - Evaluate protocol-conditioned next-action prediction and long-horizon planning.
258
+ - Study how training data improves laboratory action prediction.
259
+ - Use the adapter as the Actor in the Actor-Simulator-Selector framework.
260
+ - Analyze remaining failures in action order, parameter copying, dependency tracking, and protocol-stage consistency.
261
+
262
+ Not intended for:
263
+
264
+ - Autonomous wet-lab execution.
265
+ - Clinical, safety-critical, or regulated decision-making.
266
+ - Generating executable biological protocols without expert validation.
267
+
268
+ ## 🔗 Relationship to LabHorizon
269
+
270
+ LabHorizon has five public entry points:
271
+
272
+ | Resource | Link | Role |
273
+ |:---|:---|:---|
274
+ | Website | [LabHorizon Website](https://conglab-research.github.io/LabHorizon/) | Interactive examples and visual explorer |
275
+ | Code | [CongLab-Research/LabHorizon](https://github.com/CongLab-Research/LabHorizon) | Evaluation code, agents, tests, and documentation |
276
+ | Level 1 Data | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Multi-view laboratory 3D asset perception data |
277
+ | Level 2 Data | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Protocol-conditioned long-horizon planning data |
278
+ | Model | [LabHorizon-Model](https://huggingface.co/CongLab-Research/LabHorizon-Model) | Qwen3.6 LoRA adapter trained on LabHorizon |
279
 
280
  ## 📜 Citation
281
 
assets/figure2_pipeline.png ADDED

Git LFS Details

  • SHA256: 2153bf1bb6b1ae1a2f7d2f8394a15e2a23685b97b4763254ea0fc1c816d8053d
  • Pointer size: 132 Bytes
  • Size of remote file: 1.28 MB
assets/stanford_logo.png ADDED

Git LFS Details

  • SHA256: f276a5e47b7801c8160095b3a45e419895d5d7b4eb82c8c5fa0a632cab830c52
  • Pointer size: 130 Bytes
  • Size of remote file: 24.7 kB
assets/terser.png ADDED

Git LFS Details

  • SHA256: b5d9ced44d7842e5b6737e2d1f18f4f9893242befbec506598f881b8eb46331a
  • Pointer size: 132 Bytes
  • Size of remote file: 6.52 MB