Stanford-CongLab/LabHorizon-Model

@@ -137,13 +137,13 @@ The tables below report direct-prompting baselines on the same test split used f
 | 11 | Qwen3.5 9B | 0.485 |
 | 12 | Gemini 3.5 Flash | 0.485 |
 | 13 | Qwen3.6 35B-A3B | 0.475 |
-| 14 | Gemini 3.1 Pro Preview | 0.465 |
 ### 🧪 Level 2: Protocol-Conditioned Planning
 | Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
 |:---:|:---|---:|---:|---:|
-| 🥇 | Gemini 3.1 Pro Preview | 0.3263 | 0.3195 | 0.3331 |
 | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
 | 🥉 | Kimi K2.6 | 0.3150 | 0.2845 | 0.3456 |
 | 4 | Gemini 3.5 Flash | 0.3039 | 0.2686 | 0.3391 |
@@ -197,14 +197,14 @@ The table compares direct-prompting SOTA/baseline systems, the base Qwen model,
 | System | Level 1 Next Action Accuracy | Level 2 Action Sequence Similarity | Level 2 Parameter Accuracy | Level 2 Final Score |
 |:---|---:|---:|---:|---:|
 | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
-| Gemini 3.1 Pro Preview | 0.465 | 0.3195 | 0.3331 | 0.3263 |
 | GPT-5.5 | 0.535 | 0.2092 | 0.2459 | 0.2276 |
 | Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
 | Qwen3.6-35B-A3B | 0.475 | 0.2585 | 0.2483 | 0.2534 |
 | Qwen3.6-35B-A3B(trained) | 0.635 | 0.4030 | 0.4170 | 0.4100 |
 | Qwen3.6-35B-A3B(trained+agents) | **0.665** | **0.4485** | **0.4580** | **0.4532** |
-Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro Preview is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
 The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. Level 2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
@@ -212,7 +212,7 @@ The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baselin
 The trained+agents result uses this adapter as the Actor and combines it with a separate Simulator/Selector model. The agent is not a physical simulator and does not execute wet-lab actions. It samples candidate next actions or action sequences, checks symbolic protocol-state consistency, and selects the most consistent candidate.
-Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro Preview is used as Simulator/Selector. This Simulator/Selector choice is the current setting and has not been exhaustively ablated.
 ## 🚀 Quick Start

 | 11 | Qwen3.5 9B | 0.485 |
 | 12 | Gemini 3.5 Flash | 0.485 |
 | 13 | Qwen3.6 35B-A3B | 0.475 |
+| 14 | Gemini 3.1 Pro | 0.465 |
 ### 🧪 Level 2: Protocol-Conditioned Planning
 | Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
 |:---:|:---|---:|---:|---:|
+| 🥇 | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
 | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
 | 🥉 | Kimi K2.6 | 0.3150 | 0.2845 | 0.3456 |
 | 4 | Gemini 3.5 Flash | 0.3039 | 0.2686 | 0.3391 |
 | System | Level 1 Next Action Accuracy | Level 2 Action Sequence Similarity | Level 2 Parameter Accuracy | Level 2 Final Score |
 |:---|---:|---:|---:|---:|
 | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
+| Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
 | GPT-5.5 | 0.535 | 0.2092 | 0.2459 | 0.2276 |
 | Kimi K2.6 | 0.550 | 0.2845 | 0.3456 | 0.3150 |
 | Qwen3.6-35B-A3B | 0.475 | 0.2585 | 0.2483 | 0.2534 |
 | Qwen3.6-35B-A3B(trained) | 0.635 | 0.4030 | 0.4170 | 0.4100 |
 | Qwen3.6-35B-A3B(trained+agents) | **0.665** | **0.4485** | **0.4580** | **0.4532** |
+Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
 The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. Level 2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
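The gains quoted above can be recomputed from the comparison table. The sketch below is illustrative only: the dictionary layout and the function names are hypothetical. The check that Level 2 Final Score equals the mean of Action Sequence Similarity and Parameter Accuracy is an assumption inferred from the reported numbers, not a definition stated in this README.

```python
# Values copied from the comparison table; the data layout is hypothetical.
SYSTEMS = {
    "Qwen3.6-35B-A3B": {
        "l1": 0.475, "seq": 0.2585, "param": 0.2483, "final": 0.2534,
    },
    "Qwen3.6-35B-A3B(trained)": {
        "l1": 0.635, "seq": 0.4030, "param": 0.4170, "final": 0.4100,
    },
    "Qwen3.6-35B-A3B(trained+agents)": {
        "l1": 0.665, "seq": 0.4485, "param": 0.4580, "final": 0.4532,
    },
}


def absolute_gain(base: str, new: str, metric: str) -> float:
    """Absolute improvement of `new` over `base` on one metric."""
    return round(SYSTEMS[new][metric] - SYSTEMS[base][metric], 4)


def matches_mean(row: dict, tol: float = 1e-4) -> bool:
    """ASSUMPTION: Final Score is the mean of the two Level 2 components.

    The tolerance absorbs rounding in the four-decimal table values.
    """
    mean = (row["seq"] + row["param"]) / 2
    return abs(mean - row["final"]) <= tol


if __name__ == "__main__":
    base, trained = "Qwen3.6-35B-A3B", "Qwen3.6-35B-A3B(trained)"
    print("Level 1 gain:", absolute_gain(base, trained, "l1"))
    print("Level 2 Final Score gain:", absolute_gain(base, trained, "final"))
    for name, row in SYSTEMS.items():
        print(name, "consistent with mean:", matches_mean(row))
```

If the mean assumption is wrong for the actual scoring script, only `matches_mean` needs to change; the gain computation depends solely on the table values.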
 The trained+agents result uses this adapter as the Actor and combines it with a separate Simulator/Selector model. The agent is not a physical simulator and does not execute wet-lab actions. It samples candidate next actions or action sequences, checks symbolic protocol-state consistency, and selects the most consistent candidate.
+Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. This Simulator/Selector choice is the current setting and has not been exhaustively ablated.
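The sample, check, select loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the README does not specify how the Actor samples candidates or how the Simulator/Selector scores them, so `toy_consistency`, `select_most_consistent`, the `PREREQS` table, and the action names are all hypothetical stand-ins.

```python
from typing import Callable, Sequence

Candidate = Sequence[str]  # one candidate action sequence

# Hypothetical prerequisite table: an action is valid only after these actions.
PREREQS = {
    "centrifuge": {"add_sample"},
    "discard_supernatant": {"centrifuge"},
}


def toy_consistency(actions: Candidate) -> float:
    """Stand-in for the Simulator/Selector's symbolic protocol-state check.

    Returns the fraction of actions whose prerequisites were already executed.
    """
    if not actions:
        return 0.0
    done: set[str] = set()
    valid = 0
    for action in actions:
        if PREREQS.get(action, set()) <= done:
            valid += 1
        done.add(action)
    return valid / len(actions)


def select_most_consistent(
    candidates: Sequence[Candidate],
    score_fn: Callable[[Candidate], float],
) -> Candidate:
    """Return the highest-scoring candidate; ties go to the earliest one."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=score_fn)


if __name__ == "__main__":
    # In the described setting the Actor would produce these samples.
    sampled = [
        ["centrifuge", "add_sample", "discard_supernatant"],
        ["add_sample", "centrifuge", "discard_supernatant"],
    ]
    print(select_most_consistent(sampled, toy_consistency))
```

Keeping the scoring function as a parameter mirrors the note that the Simulator/Selector choice has not been exhaustively ablated: a different checker can be swapped in without touching the selection step.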
 ## 🚀 Quick Start