| # Concrete Example: Validation Scoring with 8 Environments |
|
|
| ## Scenario Setup |
|
|
| Let's say we have 8 validation environments: |
| 1. WEBSHOP |
| 2. ALFWORLD |
| 3. BABYAI |
| 4. SCIWORLD |
| 5. TEXTCRAFT |
| 6. SAT |
| 7. DED |
| 8. ABD |
|
|
| We evaluate a single model's performance across all eight of these environments. |
|
|
| ## Validation Layer Breakdown |
|
|
| ### Layer 3 (3-environment combinations) |
| - **Total subsets**: C(8,3) = 56 combinations |
| - **Examples**: |
|   - {WEBSHOP, ALFWORLD, BABYAI} |
|   - {SCIWORLD, TEXTCRAFT, SAT} |
|   - {DED, ABD, WEBSHOP} |
|   - ... (53 more combinations) |
| - **Total weight for layer**: 2³ = 8.0 |
| - **Weight per subset**: 8.0 / 56 = **0.143** |
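
The Layer 3 numbers above can be reproduced with a short enumeration. This is a minimal sketch, not the scoring system's actual code: the helper name `layer_subsets` is hypothetical, and the only inputs are the eight environment names and the 2³ layer weight given in this example.

```python
from itertools import combinations
from math import comb

# Environment names taken from the scenario setup above.
ENVS = ["WEBSHOP", "ALFWORLD", "BABYAI", "SCIWORLD",
        "TEXTCRAFT", "SAT", "DED", "ABD"]


def layer_subsets(envs, k):
    """Return every k-environment subset as a frozenset (hypothetical helper)."""
    return [frozenset(c) for c in combinations(envs, k)]


layer3 = layer_subsets(ENVS, 3)
expected_count = comb(len(ENVS), 3)        # C(8,3)
layer3_total_weight = 2.0 ** 3             # total weight assigned to Layer 3
weight_per_subset = layer3_total_weight / len(layer3)

print(len(layer3), expected_count)         # number of 3-environment subsets
print(round(weight_per_subset, 3))         # weight carried by each subset
```

Using `frozenset` makes subsets order-independent, so {DED, ABD, WEBSHOP} and {WEBSHOP, ABD, DED} are the same entry.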
|
|
| ### Layer 4 (4-environment combinations) |
| - **Total subsets**: C(8,4) = 70 combinations |
| - **Examples**: |
|   - {WEBSHOP, ALFWORLD, BABYAI, SCIWORLD} |
|   - {TEXTCRAFT, SAT, DED, ABD} |
|   - ... (68 more combinations) |
| - **Total weight for layer**: 2⁴ = 16.0 |
| - **Weight per subset**: 16.0 / 70 = **0.229** |
|
|
| ### Layer 5 (5-environment combinations) |
| - **Total subsets**: C(8,5) = 56 combinations |
| - **Total weight for layer**: 2⁵ = 32.0 |
| - **Weight per subset**: 32.0 / 56 = **0.571** |
|
|
| ### Layer 6 (6-environment combinations) |
| - **Total subsets**: C(8,6) = 28 combinations |
| - **Total weight for layer**: 2⁶ = 64.0 |
| - **Weight per subset**: 64.0 / 28 = **2.286** |
|
|
| ### Layer 7 (7-environment combinations) |
| - **Total subsets**: C(8,7) = 8 combinations |
| - **Examples**: |
|   - All environments except WEBSHOP |
|   - All environments except ALFWORLD |
|   - ... (6 more combinations) |
| - **Total weight for layer**: 2⁷ = 128.0 |
| - **Weight per subset**: 128.0 / 8 = **16.0** |
|
|
| ### Layer 8 (All 8 environments) |
| - **Total subsets**: C(8,8) = 1 combination |
| - **The subset**: {WEBSHOP, ALFWORLD, BABYAI, SCIWORLD, TEXTCRAFT, SAT, DED, ABD} |
| - **Total weight for layer**: 2⁸ = 256.0 |
| - **Weight per subset**: 256.0 / 1 = **256.0** (highest reward!) |
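
All six layers follow the same pattern: Layer k has C(8,k) subsets, a total weight of 2^k, and an even split of that weight across its subsets. The sketch below regenerates the breakdown under that assumption; `layer_weights` is a hypothetical helper name, not part of any real API.

```python
from math import comb

N_ENVS = 8
LAYERS = range(3, N_ENVS + 1)  # only layers 3 through 8 are scored


def layer_weights(n_envs, k):
    """Return (subset count, total layer weight, weight per subset) for layer k."""
    n_subsets = comb(n_envs, k)
    total = float(2 ** k)
    return n_subsets, total, total / n_subsets


table = {k: layer_weights(N_ENVS, k) for k in LAYERS}

for k, (n_subsets, total, per_subset) in table.items():
    print(f"Layer {k}: {n_subsets:>2} subsets, "
          f"total {total:>5.1f}, per subset {per_subset:.3f}")
```

Because the subset count shrinks after Layer 4 while the layer total keeps doubling, the per-subset weight climbs steeply toward Layer 8.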
|
|
| ## Scoring Example |
|
|
| Let's say Model A performs well on: |
| - Layer 3: 10 out of 56 subsets (wins on 10 different 3-environment combinations) |
| - Layer 4: 5 out of 70 subsets |
| - Layer 5: 2 out of 56 subsets |
| - Layer 6: 1 out of 28 subsets |
| - Layer 7: 0 out of 8 subsets |
| - Layer 8: 0 out of 1 subset (doesn't perform well on all 8 simultaneously) |
|
|
| **Model A's total score** (simplified, assuming each winning subset earns its full per-subset weight): |
| ``` |
| = (10 × 0.143) + (5 × 0.229) + (2 × 0.571) + (1 × 2.286) + (0 × 16.0) + (0 × 256.0) |
| = 1.43 + 1.145 + 1.142 + 2.286 + 0 + 0 |
| ≈ 6.003 (6.000 with unrounded weights) |
| ``` |
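
The same sum can be computed directly from the win counts. This is a sketch under the simplification stated above (each win earns the full per-subset weight); `total_score` and the dictionary layout are illustrative choices, not the real scoring interface.

```python
from math import comb


def total_score(wins_by_layer, n_envs=8):
    """Sum wins * (2**k / C(n_envs, k)) over the scored layers."""
    score = 0.0
    for k, wins in wins_by_layer.items():
        weight_per_subset = 2 ** k / comb(n_envs, k)
        score += wins * weight_per_subset
    return score


# Model A's wins per layer, copied from the list above.
model_a_wins = {3: 10, 4: 5, 5: 2, 6: 1, 7: 0, 8: 0}
score_a = total_score(model_a_wins)

print(round(score_a, 3))
```

Working with unrounded weights gives a total of 6.0; the hand calculation lands at 6.003 only because each per-subset weight was rounded to three decimals first.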
|
|
| Now let's say Model B performs well on: |
| - Layer 3: 5 out of 56 subsets |
| - Layer 4: 3 out of 70 subsets |
| - Layer 5: 1 out of 56 subsets |
| - Layer 6: 0 out of 28 subsets |
| - Layer 7: 0 out of 8 subsets |
| - Layer 8: **1 out of 1 subset** (performs well on ALL 8 environments!) |
|
|
| **Model B's total score**: |
| ``` |
| = (5 × 0.143) + (3 × 0.229) + (1 × 0.571) + (0 × 2.286) + (0 × 16.0) + (1 × 256.0) |
| = 0.715 + 0.687 + 0.571 + 0 + 0 + 256.0 |
| ≈ 257.973 (257.971 with unrounded weights) |
| ``` |
|
|
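| **Result**: Model B wins decisively (257.973 vs. 6.003) even though Model A wins more subsets in every layer from 3 to 6. Performing well across all 8 environments simultaneously earns the Layer 8 reward of 256.0, which dwarfs everything below it. |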
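
To see where each model's score comes from, the comparison can be broken down per layer. The sketch below reuses the win counts from the two lists above; `layer_contributions` is a hypothetical helper, and the full-weight-per-win simplification still applies.

```python
from math import comb


def layer_contributions(wins_by_layer, n_envs=8):
    """Map each layer k to the score it contributes: wins * 2**k / C(n_envs, k)."""
    return {k: wins * 2 ** k / comb(n_envs, k)
            for k, wins in wins_by_layer.items()}


model_a = {3: 10, 4: 5, 5: 2, 6: 1, 7: 0, 8: 0}
model_b = {3: 5, 4: 3, 5: 1, 6: 0, 7: 0, 8: 1}

contrib_a = layer_contributions(model_a)
contrib_b = layer_contributions(model_b)
score_a = sum(contrib_a.values())
score_b = sum(contrib_b.values())

# Fraction of Model B's score that comes from the single Layer 8 subset.
top_layer_share = contrib_b[8] / score_b

print(f"Model A: {score_a:.3f}")
print(f"Model B: {score_b:.3f}")
print(f"Layer 8 share of Model B's score: {top_layer_share:.1%}")
```

Over 99% of Model B's score comes from Layer 8 alone, so in this example the lower layers barely affect the ranking.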
|
|
| ## Key Takeaways |
|
|
| 1. **Exponential Rewards**: Each layer gets twice the total weight of the previous layer (2^k for Layer k) |
| 2. **Comprehensive Performance Matters**: Performing well on all 8 environments (Layer 8) earns 256.0, which is 1,792× the weight of a single 3-environment combination (0.143) and 32× the total weight of Layer 3 (8.0) |
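| 3. **Distributed Weight**: Within each layer, weight is evenly distributed across subsets, so every additional subset won in a layer raises the score by that layer's per-subset weight |
| 4. **Top-Layer Focus**: Only layers 3-8 are evaluated, so the score reflects multi-environment capability rather than single-environment or pairwise results |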
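
These properties can be checked numerically. The sketch assumes only what this example states (layers 3-8, total weight 2^k per layer, even split across C(8,k) subsets); the variable names are illustrative.

```python
from math import comb

N_ENVS = 8
layer_total = {k: 2 ** k for k in range(3, N_ENVS + 1)}
per_subset = {k: layer_total[k] / comb(N_ENVS, k) for k in layer_total}

# Takeaway 1: each layer's total weight is double the previous layer's.
doubles_each_layer = all(
    layer_total[k + 1] == 2 * layer_total[k] for k in range(3, N_ENVS)
)

# Takeaway 2: the Layer 8 subset versus a single Layer 3 subset.
top_vs_single_l3 = per_subset[8] / per_subset[3]

# Upper bound: the score of a model that wins every subset in every layer.
max_score = sum(layer_total.values())

print(doubles_each_layer)
print(round(top_vs_single_l3))
print(max_score)
```

The upper bound is a useful reference point: the Layer 8 subset alone accounts for more than half of the maximum achievable score.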
|
|
| ## How This Relates to the 36 Transformer Layers |
|
|
| The 36 transformer layers in the model work together to: |
| 1. Process input from any of the 8 environments |
| 2. Generate appropriate responses for each task type |
| 3. Learn representations that generalize across environments |
|
|
| The validation scoring system then: |
| 1. Tests the model on all 8 environments |
| 2. Rewards models that perform well across multiple environments |
| 3. Uses combinatoric layers to incentivize comprehensive ability |
|
|
| The 36 layers are the **capacity** (how the model processes information), while the 8 environments and combinatoric scoring are the **evaluation framework** (how we measure and reward performance). |
|
|
|
|