black-yt committed on
Commit eb5ee5d · 1 Parent(s): b031d6a

docs: clarify L2 metric labels

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -74,7 +74,7 @@ The adapter is trained on the same public LabHorizon train split described by th
74
  | Level | Hugging Face Dataset | Input | Target | Metric |
75
  |:---|:---|:---|:---|:---|
76
  | **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
77
- | **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | Action Sequence Similarity, Parameter Accuracy |
78
 
79
  ## 📦 Model
80
 
@@ -88,7 +88,7 @@ The adapter is trained on the same public LabHorizon train split described by th
88
  | Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
89
  | Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
90
  | Main task | Protocol-conditioned laboratory action prediction |
91
- | Main metrics | Level 1 Next Action Accuracy; Level 2 Action Sequence Similarity and Parameter Accuracy |
92
  | Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
93
 
94
  The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
@@ -112,7 +112,7 @@ LabHorizon uses the same evaluation contracts across direct-prompting models, th
112
  | Level | Output format | Metric |
113
  |:---|:---|:---|
114
  | Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
115
- | Level 2 | Structured action sequence parsed by Python AST | Action Sequence Similarity, Parameter Accuracy, Final Score |
116
 
117
  For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
118
 
@@ -141,7 +141,7 @@ The tables below report direct-prompting baselines on the same test split used f
141
 
142
  ### 🧪 Level 2: Protocol-Conditioned Planning
143
 
144
- | Rank | Model | Final Score | Action Sequence Similarity | Parameter Accuracy |
145
  |:---:|:---|---:|---:|---:|
146
  | 🥇 | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
147
  | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
@@ -194,7 +194,7 @@ Main training settings:
194
 
195
  The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.
196
 
197
- | System | Level 1 Next Action Accuracy | Level 2 Action Sequence Similarity | Level 2 Parameter Accuracy | Level 2 Final Score |
198
  |:---|---:|---:|---:|---:|
199
  | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
200
  | Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
@@ -205,7 +205,7 @@ The table compares direct-prompting SOTA/baseline systems, the base Qwen model,
205
 
206
  Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
207
 
208
- The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. Level 2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
209
 
210
  ## 🤖 Actor-Simulator-Selector Agent
211
 
 
74
  | Level | Hugging Face Dataset | Input | Target | Metric |
75
  |:---|:---|:---|:---|:---|
76
  | **Level 1** | [LabHorizon-3D-Asset-Perception](https://huggingface.co/datasets/CongLab-Research/LabHorizon-3D-Asset-Perception) | Three asset views, historical actions, candidate next actions | Gold next action | Next-action accuracy |
77
+ | **Level 2** | [LabHorizon-Protocol-Conditioned-Planning](https://huggingface.co/datasets/CongLab-Research/LabHorizon-Protocol-Conditioned-Planning) | Context, goal, constraints, available inputs, action pool | Gold experimental action sequence | L2 Action Sequence Similarity, L2 Parameter Accuracy |
78
 
79
  ## 📦 Model
80
 
 
88
  | Level 1 training split | 3,000 multimodal laboratory 3D asset samples |
89
  | Level 2 training split | 3,000 text-only protocol-conditioned planning samples |
90
  | Main task | Protocol-conditioned laboratory action prediction |
91
+ | Main metrics | Level 1 Next Action Accuracy; L2 Action Sequence Similarity and L2 Parameter Accuracy |
92
  | Intended loading mode | Load this adapter with the matching Qwen3.6-35B-A3B base model |
93
 
94
  The released weights are an adapter, not the base model. Users must load them with the corresponding Qwen3.6-35B-A3B base model.
 
112
  | Level | Output format | Metric |
113
  |:---|:---|:---|
114
  | Level 1 | Reasoning followed by a final next action | Next Action Accuracy |
115
+ | Level 2 | Structured action sequence parsed by Python AST | L2 Action Sequence Similarity, L2 Parameter Accuracy, L2 Final Score |
116
 
117
  For Level 1, the evaluator maps the final next action back to the candidate list. For Level 2, the evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. This model card reports the same metrics as the GitHub and dataset READMEs.
118
 
 
141
 
142
  ### 🧪 Level 2: Protocol-Conditioned Planning
143
 
144
+ | Rank | Model | L2 Final Score | L2 Action Sequence Similarity | L2 Parameter Accuracy |
145
  |:---:|:---|---:|---:|---:|
146
  | 🥇 | Gemini 3.1 Pro | 0.3263 | 0.3195 | 0.3331 |
147
  | 🥈 | Grok 4.3 | 0.3244 | 0.3339 | 0.3148 |
 
194
 
195
  The table compares direct-prompting SOTA/baseline systems, the base Qwen model, and the trained+agents system evaluated on the same LabHorizon test splits.
196
 
197
+ | System | Level 1 Next Action Accuracy | L2 Action Sequence Similarity | L2 Parameter Accuracy | L2 Final Score |
198
  |:---|---:|---:|---:|---:|
199
  | Grok 4.3 | 0.555 | 0.3339 | 0.3148 | 0.3244 |
200
  | Gemini 3.1 Pro | 0.465 | 0.3195 | 0.3331 | 0.3263 |
 
205
 
206
  Agent setting: `Qwen3.6-35B-A3B(trained)` is used as Actor, and Gemini 3.1 Pro is used as Simulator/Selector. The Simulator/Selector choice is the current setting and has not been exhaustively ablated.
207
 
208
+ The trained adapter improves both levels over the direct Qwen3.6-35B-A3B baseline. Level 1 improves from `0.475` to `0.635`, indicating better laboratory asset-to-action alignment. L2 Final Score improves from `0.2534` to `0.4100`, indicating better action ordering, parameter retention, and dependency tracking. The trained+agents setting further improves consistency by selecting candidates with stronger symbolic protocol-state validity.
209
 
210
  ## 🤖 Actor-Simulator-Selector Agent
211
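
The README context in this diff says the Level 2 evaluator parses action names, keyword parameters, assigned intermediate variables, and dependency references with Python AST. The sketch below shows roughly what that kind of parsing can look like. It is not the LabHorizon evaluator: the function name `parse_action_sequence`, the returned dictionary keys, and the sample action script are all hypothetical, and the real output format and scoring rules may differ. It assumes Python 3.9+ for `ast.unparse`.

```python
import ast


def parse_action_sequence(text):
    """Parse a Python-like action script into a list of step dictionaries.

    Hypothetical helper for illustration only; not the LabHorizon evaluator.
    """
    steps = []
    defined = set()  # variables assigned by earlier steps

    for node in ast.parse(text).body:
        target = None
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Call):
            call = node.value
            # Record the intermediate variable for `name = action(...)`.
            if len(node.targets) == 1 and isinstance(node.targets[0], ast.Name):
                target = node.targets[0].id
        elif isinstance(node, ast.Expr) and isinstance(node.value, ast.Call):
            call = node.value
        else:
            continue  # ignore statements that are not action calls

        params, depends_on = {}, []
        for kw in call.keywords:
            if kw.arg is None:
                continue  # skip **kwargs expansion
            if isinstance(kw.value, ast.Name):
                # A bare name is treated as a reference to a variable.
                params[kw.arg] = kw.value.id
                if kw.value.id in defined:
                    depends_on.append(kw.value.id)
            else:
                try:
                    params[kw.arg] = ast.literal_eval(kw.value)
                except (ValueError, TypeError, SyntaxError):
                    # Fall back to source text for non-literal expressions.
                    params[kw.arg] = ast.unparse(kw.value)

        steps.append({
            "action": ast.unparse(call.func),
            "params": params,
            "assigns": target,
            "depends_on": depends_on,
        })
        if target is not None:
            defined.add(target)

    return steps


if __name__ == "__main__":
    # Made-up action script; the action names and parameters are assumptions.
    script = (
        'sample = transfer(source="tube_A", volume_ul=50)\n'
        "mixed = vortex(sample=sample, duration_s=10)\n"
        "centrifuge(sample=mixed, speed_rpm=3000)\n"
    )
    for step in parse_action_sequence(script):
        print(step)
```

Because each step comes out as plain data (action name, parameters, assigned variable, dependencies), a metric such as sequence similarity or parameter accuracy could be computed by comparing two such lists. How LabHorizon actually defines those metrics is not shown in this diff.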