Stanford-CongLab
/

LabHorizon-Model

@@ -74,7 +74,7 @@ The adapter is trained on the same public LabHorizon train split described by th
 | Level | Hugging Face Dataset | Input | Target | Metric |
 |:---|:---|:---|:---|:---|
 | **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
-| **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | Action Sequence Similarity, Parameter Accuracy |
 ## 📦 Model
@@ -88,7 +88,7 @@ The adapter is trained on the same public LabHorizon train split described by th
 | Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
 | Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
 | Main task | Protocol-conditioned laboratory action prediction |
-| Main metrics | Level 1 Next Action Accuracy; Level 2 Action Sequence Similarity and Parameter Accuracy |
 | Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
 The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
@@ -112,7 +112,7 @@ LabHorizon uses the same evaluation contracts across direct-prompting models, th
 | Level | Output format | Metric |
 |:---|:---|:---|
 | Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
-| Level 2 | Structured action sequence parsed by Python AST | Action Sequence Similarity, Parameter Accuracy, Final Score |
 For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
@@ -141,7 +141,7 @@ The tables below report direct-prompting baselines on the same test split used f
 ### 🧪 Level 2: Protocol-Conditioned Planning
-| Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
 |:---:|:---|---:|---:|---:|
 | 🥇 | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
 | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
@@ -194,7 +194,7 @@ Main training settings:
 The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.
-| System | Level 1 Next Action Accuracy | Level 2 Action Sequence Similarity | Level 2 Parameter Accuracy | Level 2 Final Score |
 |:---|---:|---:|---:|---:|
 | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
 | Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
@@ -205,7 +205,7 @@ The table compares direct-prompting SOTA/baseline systems, the base Qwen model,
 Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
-The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. Level 2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
 ## 🤖 Actor-Simulator-Selector Agent

 | Level | Hugging Face Dataset | Input | Target | Metric |
 |:---|:---|:---|:---|:---|
 | **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
+| **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | L2 Action Sequence Similarity, L2 Parameter Accuracy |
 ## 📦 Model
 | Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
 | Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
 | Main task | Protocol-conditioned laboratory action prediction |
+| Main metrics | Level 1 Next Action Accuracy; L2 Action Sequence Similarity and L2 Parameter Accuracy |
 | Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
 The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
 | Level | Output format | Metric |
 |:---|:---|:---|
 | Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
+| Level 2 | Structured action sequence parsed by Python AST | L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score |
 For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
 ### 🧪 Level 2: Protocol-Conditioned Planning
+| Rank | Model | L2 Final Score | L2 Action Sequence Similarity | L2 Parameter Accuracy |
 |:---:|:---|---:|---:|---:|
 | 🥇 | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
 | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
 The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.
+| System | Level 1 Next Action Accuracy | L2 Action Sequence Similarity | L2 Parameter Accuracy | L2 Final Score |
 |:---|---:|---:|---:|---:|
 | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
 | Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
 Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
+The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. L2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
 ## 🤖 Actor-Simulator-Selector Agent