Image-Text-to-Text
PEFT
Safetensors
laboratory
protocol-conditioned-action-prediction
lora
qwen
long-horizon-planning
conversational
Instructions to use Stanford-CongLab/LabHorizon-Model with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- PEFT
How to use Stanford-CongLab/LabHorizon-Model with PEFT:
from peft import PeftModel from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.6-35B-A3B") model = PeftModel.from_pretrained(base_model, "Stanford-CongLab/LabHorizon-Model") - Notebooks
- Google Colab
- Kaggle
File size: 13,881 Bytes
1411198 d4ad2d7 1411198 d4ad2d7 96b75d0 d4ad2d7 3a4740a d4ad2d7 3a4740a d4ad2d7 96b75d0 d4ad2d7 96b75d0 d4ad2d7 96b75d0 d4ad2d7 96b75d0 3a4740a 96b75d0 d4ad2d7 96b75d0 eb5ee5d 96b75d0 d4ad2d7 96b75d0 eb5ee5d 96b75d0 3a4740a 96b75d0 c03a16b 96b75d0 eb5ee5d 96b75d0 c03a16b 96b75d0 d4ad2d7 b031d6a d4ad2d7 eb5ee5d d4ad2d7 c03a16b d4ad2d7 96b75d0 c03a16b 96b75d0 eb5ee5d 96b75d0 403bb75 c03a16b d4ad2d7 96b75d0 d4ad2d7 96b75d0 d4ad2d7 3a4740a d4ad2d7 96b75d0 3a4740a 96b75d0 d4ad2d7 96b75d0 d4ad2d7 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 | ---
license: mit
base_model: Qwen/Qwen3.6-35B-A3B
library_name: peft
pipeline_tag: image-text-to-text
tags:
- laboratory
- protocol-conditioned-action-prediction
- lora
- qwen
- long-horizon-planning
---
<div align="center">
<div style="font-size: 2em; font-weight: bold;">
LabHorizon Model
</div>
</div>
<div align="center">
<img src="./assets/stanford_logo.png" width="15%" alt="logo">
</div>
<div align="center">
[](https://stanford-conglab.github.io/LabHorizon/)

[](https://github.com/Stanford-CongLab/LabHorizon)
[](https://huggingface.co/datasets/Stanford-CongLab/LabHorizon-3D-Asset-Perception)
[](https://huggingface.co/datasets/Stanford-CongLab/LabHorizon-Protocol-Conditioned-Planning)
[](https://huggingface.co/Stanford-CongLab/LabHorizon-Model)
**Qwen3.6-35B-A3B LoRA for protocol-conditioned laboratory action prediction**
[Overview](#-overview) | [News](#-news) | [Highlights](#-highlights) | [Datasets](#-datasets) | [Evaluation](#-evaluation) | [Leaderboard](#-leaderboard) | [Training](#-training-result) | [Agent](#-actor-simulator-selector-agent) | [Quick Start](#-quick-start) | [Citation](#-citation)
</div>
---
<p align="center">
<img src="./assets/terser.png" alt="LabHorizon laboratory asset teaser" width="100%">
</p>
## π Overview
This repository releases the LabHorizon Qwen3.6 LoRA adapter trained from `Qwen/Qwen3.6-35B-A3B` on the 6,000-sample LabHorizon training split. The model is optimized for **Protocol-Conditioned Action Prediction**:
- **Level 1:** connect multi-view laboratory assets and historical actions to the gold next action.
- **Level 2:** produce a structured long-horizon experimental action sequence from context, constraints, available inputs, and an action pool.
This model repository is the model-side companion to the LabHorizon code and dataset releases. The GitHub repository is the full project entry point; the two dataset cards describe Level 1 and Level 2 data; this card focuses on the trained Qwen3.6 adapter, its files, training signal, evaluation result, and loading instructions.
## π° News
- **2026-06-03:** Released the LabHorizon LoRA adapter weights and reproducibility files on Hugging Face.
- **2026-06-03:** Updated the public LabHorizon leaderboards with Claude Opus 4.8 and MiniMax M3 direct-prompting evaluations.
## β¨ Highlights
<table>
<tr>
<td align="center" width="25%">π§ͺ<br/><b>Qwen3.6 Adapter</b><br/><sub>LoRA weights for Qwen3.6-35B-A3B</sub></td>
<td align="center" width="25%">π¬<br/><b>Level 1 Signal</b><br/><sub>Multi-view asset next-action prediction</sub></td>
<td align="center" width="25%">π§<br/><b>Level 2 Signal</b><br/><sub>Long-horizon protocol-conditioned planning</sub></td>
<td align="center" width="25%">π§ <br/><b>Train + Agent</b><br/><sub>Supports trained and trained+agents settings</sub></td>
</tr>
</table>
## π¦ Datasets
The adapter is trained on the same public LabHorizon train split described by the two dataset cards. The evaluation results below use the same `v20260510-repaired` test split as the GitHub README and the dataset READMEs.
| Level | Hugging Face Dataset | Input | Target | Metric |
|:---|:---|:---|:---|:---|
| **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/Stanford-CongLab/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
| **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/Stanford-CongLab/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | L2 Action Sequence Similarity, L2 Parameter Accuracy |
## π¦ Model
### π§Ύ Model Card
| Field | Value |
|:---|:---|
| Base model | `Qwen/Qwen3.6-35B-A3B` |
| Adapter type | LoRA / PEFT adapter |
| Training data | 6,000 LabHorizon train samples |
| Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
| Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
| Main task | Protocol-conditioned laboratory action prediction |
| Main metrics | Level 1 Next Action Accuracy; L2 Action Sequence Similarity and L2 Parameter Accuracy |
| Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
### π Files
| File | Meaning |
|:---|:---|
| `adapter_model.safetensors` | LoRA adapter weights. |
| `adapter_config.json` | PEFT adapter configuration. |
| `tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja` | Tokenizer and chat template files used for training/evaluation. |
| `processor_config.json` | Processor configuration. |
| `train_results.json`, `eval_results.json`, `all_results.json` | Training and evaluation summaries from the LoRA run. |
| `trainer_state.json`, `trainer_log.jsonl`, `training_args.bin` | Training state and arguments for reproducibility. |
| `training_loss.png`, `training_eval_loss.png` | Loss curves. |
## π Evaluation
LabHorizon uses the same evaluation contracts across direct-prompting models, the trained adapter, and the trained+agents setting.
| Level | Output format | Metric |
|:---|:---|:---|
| Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
| Level 2 | Structured action sequence parsed by Python AST | L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score |
For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
## π Leaderboard
The tables below report direct-prompting baselines on the same test split used for the trained model comparison. The full code and evaluation scripts are maintained in the [LabHorizon GitHub repository](https://github.com/Stanford-CongLab/LabHorizon).
### π¬ Level 1: 3D Asset Perception
| Rank | Model | Next Action Accuracy |
|:---:|:---|---:|
| π₯ | Grok 4.3 | 0.555 |
| π₯ | Kimi K2.6 | 0.550 |
| π₯ | GPT-5.5 | 0.535 |
| 4 | GPT-5.4 | 0.520 |
| 5 | Claude Opus 4.8 | 0.515 |
| 6 | MiniMax M3 | 0.510 |
| 7 | Qwen3.6 Plus | 0.505 |
| 8 | Claude Opus 4.7 | 0.500 |
| 9 | Qwen3.5 35B-A3B | 0.495 |
| 10 | MiMo V2.5 | 0.495 |
| 11 | Qwen3.5 9B | 0.485 |
| 12 | Gemini 3.5 Flash | 0.485 |
| 13 | Qwen3.6 35B-A3B | 0.475 |
| 14 | Gemini 3.1 Pro | 0.465 |
### π§ͺ Level 2: Protocol-Conditioned Planning
| Rank | Model | L2 Final Score | L2 Action Sequence Similarity | L2 Parameter Accuracy |
|:---:|:---|---:|---:|---:|
| π₯ | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
| π₯ | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
| π₯ | Kimi K2.6 | 0.3150 | 0.2845 | 0.3456 |
| 4 | Gemini 3.5 Flash | 0.3039 | 0.2686 | 0.3391 |
| 5 | Qwen3.7 Max | 0.3003 | 0.2905 | 0.3102 |
| 6 | MiniMax M3 | 0.2954 | 0.2812 | 0.3095 |
| 7 | Claude Opus 4.8 | 0.2911 | 0.2756 | 0.3066 |
| 8 | Claude Opus 4.7 | 0.2737 | 0.2619 | 0.2856 |
| 9 | GPT-5.4 | 0.2715 | 0.2191 | 0.3239 |
| 10 | Qwen3.6 35B-A3B | 0.2534 | 0.2585 | 0.2483 |
| 11 | Qwen3.6 Plus | 0.2526 | 0.2264 | 0.2787 |
| 12 | MiMo V2.5 | 0.2491 | 0.2269 | 0.2713 |
| 13 | GLM 5.1 | 0.2413 | 0.2307 | 0.2519 |
| 14 | Qwen3.5 35B-A3B | 0.2391 | 0.2385 | 0.2398 |
| 15 | GPT-5.5 | 0.2276 | 0.2092 | 0.2459 |
| 16 | DeepSeek V4 Pro | 0.2135 | 0.1927 | 0.2342 |
| 17 | Qwen3.5 9B | 0.1315 | 0.1359 | 0.1271 |
## 𧬠Training Data and Setup
The adapter is trained on the public LabHorizon training split:
| Component | Size | Role |
|:---|---:|:---|
| Level 1 train | 3,000 | Multi-view laboratory asset perception and next-action prediction |
| Level 2 train | 3,000 | Protocol-conditioned long-horizon experimental action-sequence planning |
| Total train | 6,000 | Unified supervised fine-tuning data for laboratory action prediction |
The training data are converted into Qwen chat format and then into the LLaMA-Factory ShareGPT-VL-style format. Level 1 keeps the three asset images and candidate next actions; Level 2 uses text-only context, constraints, available inputs, action pool, and gold experimental action sequence.
Main training settings:
| Setting | Value |
|:---|:---|
| LoRA rank / alpha / dropout | `32 / 64 / 0.10` |
| Learning rate | `1.0e-4` |
| Scheduler | Cosine |
| Warmup ratio | `0.10` |
| Cutoff length | `4096` |
| Image max pixels | `501760` |
| Epochs / max steps | `10 / 2500` |
| Precision | `bf16` |
| Gradient checkpointing | Enabled |
| Runtime | `10014.77 s` |
| Final train loss | `0.2691` |
| Final eval loss | `0.4426` |
## π§ Training Result
The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.
| System | Level 1 Next Action Accuracy | L2 Action Sequence Similarity | L2 Parameter Accuracy | L2 Final Score |
|:---|---:|---:|---:|---:|
| Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
| Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
| GPT-5.5 | 0.535 | 0.2092 | 0.2459 | 0.2276 |
| Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
| Qwen3.6-35B-A3B | 0.475 | 0.2585 | 0.2483 | 0.2534 |
| Qwen3.6-35B-A3B(trained+agents) | **0.665** | **0.4485** | **0.4580** | **0.4532** |
Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. L2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
## π€ Actor-Simulator-Selector Agent
The trained+agents result uses this adapter as the Actor and combines it with a separate Simulator/Selector model. The agent is not a physical simulator and does not execute wet-lab actions. It samples candidate next actions or action sequences, checks symbolic protocol-state consistency, and selects the most consistent candidate.
<p align="center">
<img src="assets/figure3_agent.png" alt="Actor-Simulator-Selector agent pipeline" width="100%">
</p>
The trained Actor reads the same task inputs used by the public datasets: multi-view asset images, historical actions, and candidate next actions for Level 1, or wet experiment context, constraints, available inputs, and an action pool for Level 2. The Simulator builds current and target symbolic protocol states and predicts candidate reagent/instrument state transitions. The Selector compares the candidate-state pairs and returns the selected action prediction, which is evaluated with Level 1 next-action accuracy or Level 2 AST-based action-sequence and parameter metrics.
Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. This Simulator/Selector choice is the current setting and has not been exhaustively ablated.
## π Quick Start
### Load Adapter
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import PeftModel
base_id = "Qwen/Qwen3.6-35B-A3B"
adapter_id = "Stanford-CongLab/LabHorizon-Model"
processor = AutoProcessor.from_pretrained(adapter_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
base_id,
device_map="auto",
torch_dtype="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, adapter_id)
```
### Evaluate with LabHorizon
Use the public code repository for evaluation and agent workflows:
```bash
git clone https://github.com/Stanford-CongLab/LabHorizon
cd LabHorizon
```
Configure an OpenAI-compatible endpoint in `.env`, then run the Level 1 / Level 2 evaluators or the Actor-Simulator-Selector agent following the GitHub README.
For evaluation, use the public LabHorizon code repository and point the evaluator to a compatible model endpoint or local serving stack. The model card itself only releases the adapter and training artifacts.
## β οΈ Intended Use
This adapter is intended for academic research on laboratory action prediction, experimental planning, and AI scientist systems. It is not an autonomous wet-lab controller. Outputs should be treated as model predictions and should not be used for safety-critical experimental decisions without expert review.
Recommended use cases:
- Evaluate protocol-conditioned next-action prediction and long-horizon planning.
- Study how training data improves laboratory action prediction.
- Use the adapter as the Actor in the Actor-Simulator-Selector framework.
- Analyze remaining failures in action order, parameter copying, dependency tracking, and protocol-stage consistency.
Not intended for:
- Autonomous wet-lab execution.
- Clinical, safety-critical, or regulated decision-making.
- Generating executable biological protocols without expert validation.
## π Citation
Coming soon...
|