# Distance-to-Goal Metrics: Complete Reference Table
## Priority-Ordered Metric Comparison
| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Benefit | Potential Risks |
|---------|-------------|--------|-------|---------|------------------|-----------|-----------------|
| **1** | **Cardinality Matching** | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| **2** | **Value Overlap (Set-based)** | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| **3** | **Numeric Range Proximity** | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude | Only useful for numeric questions (~40% of dataset) |
| **4** | **Row-wise Best Match** | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving for column mismatch | Computationally expensive for large result sets (100+ rows) |
| **5** | **Schema Coverage** | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| **6** | **Column Alignment** | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| **7** | **Rank Correlation** | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| **8** | **SQL Structure** | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: Reward hacking; overfitting; ignores semantic equivalence |
| **9** | **Execution Plan** | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: Too complex; DB-specific; doesn't guarantee correctness |
---
## Detailed Conceptual Explanations
### 🥇 Tier 1: Must-Have Metrics
#### 1. Cardinality Matching
**How It Works**:
```
Input: agent_result = [row1, row2, row3, row4, row5] (5 rows)
gold_result = [row1, row2, row3] (3 rows)
Calculation:
difference = |5 - 3| = 2
relative_error = 2 / 3 = 0.667
reward = 1 - min(1.0, 0.667) = 0.333
```
**Conceptual Meaning**: "Getting the right number of things is the first step to getting the right things."
**Value Provided**:
- **Early signal**: Before content is right, size can be right
- **Universal**: Works for integers (1 value), lists (N values), tables (M rows)
- **Catches common errors**:
- Missing `GROUP BY` → too few rows
- Cartesian join → too many rows
- Wrong aggregation → count off by orders of magnitude
**Risk Analysis**: ✅ **Very low risk**
- No reward hacking opportunities (agent can't game cardinality without semantic progress)
- Monotonic with correctness (better results = better cardinality)
- Fast to compute (O(1))
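A minimal sketch of how this might look in code (the name `cardinality_reward` matches the combiner examples later in this doc; the clamped relative error is one reasonable normalization, not a prescribed formula):
```python
def cardinality_reward(agent_result, gold_result):
    """Reward in [0, 1] for matching the gold row count (sketch)."""
    n_agent, n_gold = len(agent_result), len(gold_result)
    if n_gold == 0:
        # Assumption: an empty gold result rewards only exact emptiness
        return 1.0 if n_agent == 0 else 0.0
    relative_error = abs(n_agent - n_gold) / n_gold
    return 1.0 - min(1.0, relative_error)
```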
---
#### 2. Value Overlap (Set-based)
**How It Works**:
```
Question: "Which departments have >50 employees?"
Gold: {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong)
Calculation:
intersection = {Engineering, Sales} = 2 values
union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
Jaccard = 2/5 = 0.4
```
**Conceptual Meaning**: "You found some of the right answers; find the rest."
**Value Provided**:
- **Partial credit**: Agent gets 0.4 reward even though final answer isn't perfect
- **Format agnostic**: Flattens everything to atomic values
- `[("Engineering", 42)]``{Engineering, 42}`
- `[(42, "Engineering")]``{Engineering, 42}` (same!)
- **Compositional**: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0
**Risk Analysis**: ⚠️ **Low-medium risk**
- **Coincidental matches**: If gold answer is `{42}` and agent returns entire `employees` table with 100 rows, might contain `42` by chance
- **Mitigation**: Combine with cardinality (penalizes returning too many values)
- **False precision**: `{42}` vs `{42.0}` vs `{42.001}` might all become `{"42"}` after string conversion
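A sketch, assuming results are lists of tuples of atomic values; stringifying each value keeps the metric format-agnostic but also reproduces the false-precision caveat above:
```python
def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity over flattened value sets (sketch)."""
    def flatten(rows):
        # str() makes 42 and "42" compare equal -- see the
        # false-precision caveat above
        return {str(v) for row in rows for v in row}

    agent_vals, gold_vals = flatten(agent_result), flatten(gold_result)
    union = agent_vals | gold_vals
    if not union:  # both empty: trivially identical
        return 1.0
    return len(agent_vals & gold_vals) / len(union)
```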
---
#### 3. Numeric Range Proximity
**How It Works**:
```
Question: "Average salary in Engineering?"
Gold: 95000
Agent attempt 1: 87000 (off by 8.4%)
relative_error = 8000 / 95000 = 0.084
reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965
Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
relative_error = 85500 / 95000 = 0.9
reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721
Agent attempt 3: 950000 (10x too high)
relative_error = 855000 / 95000 = 9.0
reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0
```
**Conceptual Meaning**: "Being off by 10% is very different from being off by 10x."
**Value Provided**:
- **Order-of-magnitude thinking**: Rewards agent for "ballpark correct" before exact
- **Logarithmic scale**:
- 95k → 100k (5% error) = high reward (0.98)
- 95k → 190k (100% error) = medium reward (0.70)
- 95k → 950k (900% error) = no reward (0.0)
- **Natural for SQL**: COUNT, SUM, AVG queries often close but not exact on first try
**Risk Analysis**: ⚠️ **Medium risk**
- **Only useful for ~40% of questions**: Text/categorical answers get no benefit
- **Multiple numbers**: If result has `[42, 100, 5]` and gold is `[42]`, which number to compare?
- Solution: Use closest match for each gold number
- **Zero-handling**: `gold=0, agent=1` is infinitely far; needs special case
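A sketch implementing the log-scaled reward, using the closest-match pairing and zero special case noted above (the flattening rule is an assumption):
```python
import math

def numeric_range_reward(agent_result, gold_result):
    """Log-scaled proximity for numeric answers (sketch)."""
    def numbers(rows):
        return [float(v) for row in rows for v in row
                if isinstance(v, (int, float))]

    gold_nums, agent_nums = numbers(gold_result), numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0
    per_number = []
    for g in gold_nums:
        if g == 0:
            # Zero gold value: relative error is undefined, so fall
            # back to absolute distance (assumption)
            err = min(abs(a) for a in agent_nums)
        else:
            # Compare against the closest agent number
            err = min(abs(a - g) / abs(g) for a in agent_nums)
        per_number.append(max(0.0, 1.0 - math.log10(1.0 + err)))
    return sum(per_number) / len(per_number)
```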
---
#### 4. Row-wise Best Match
**How It Works**:
```
Question: "Top 3 departments by size"
Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order!
Process:
For each gold row, find best matching agent row:
Gold row 1: (Engineering, 65)
vs Agent row 1: (Marketing, 52) → 0/2 match = 0.0
vs Agent row 2: (Engineering, 65) → 2/2 match = 1.0 ✓
vs Agent row 3: (Sales, 58) → 0/2 match = 0.0
Best match: 1.0
Gold row 2: (Sales, 58)
Best match: 1.0 (agent row 3)
Gold row 3: (Marketing, 52)
Best match: 1.0 (agent row 1)
Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0
```
**Conceptual Meaning**: "You got the right rows, just in the wrong order—that's still mostly correct."
**Value Provided**:
- **Order-invariant**: Catches `ORDER BY` mistakes without penalizing heavily
- **Extra columns forgiven**: If agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, first 2 columns match → 0.67 reward
- **Partial row matches**: Agent got department name right but count wrong → 0.5 reward per row
**Risk Analysis**: ⚠️ **Medium risk**
- **Computationally expensive**: O(M×N) comparisons for M gold rows and N agent rows
- For 100-row results: 10,000 comparisons per reward calculation
- Mitigation: Limit to first 20 rows
- **Ambiguous matching**: If gold has duplicate rows, which agent row should match which?
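A sketch of the pairing loop; normalizing each row score by the longer of the two rows reproduces both worked examples above (2/2 = 1.0 for a full match, 2/3 ≈ 0.67 with one extra column), and `max_rows` implements the cost mitigation:
```python
def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Order-invariant row matching (sketch, O(M*N) capped)."""
    gold = gold_result[:max_rows]
    agent = agent_result[:max_rows]
    if not gold:
        return 1.0 if not agent else 0.0

    def row_similarity(g_row, a_row):
        # Position-free cell comparison, normalized by the longer row
        a_vals = {str(v) for v in a_row}
        hits = sum(1 for v in g_row if str(v) in a_vals)
        return hits / max(len(g_row), len(a_row), 1)

    best = [max((row_similarity(g, a) for a in agent), default=0.0)
            for g in gold]
    return sum(best) / len(best)
```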
---
### 🥈 Tier 2: Nice-to-Have Metrics
#### 5. Schema Coverage
**How It Works**:
```
Question: "How many employees in Engineering?"
Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering'
Tables used: {employees}
Agent query: SELECT COUNT(*) FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.name='Engineering'
Tables used: {employees, departments}
Calculation:
intersection = {employees} = 1
union = {employees, departments} = 2
Jaccard = 1/2 = 0.5
Penalty for extra table: 0.1 * 1 = 0.1
Final: 0.5 - 0.1 = 0.4
```
**Conceptual Meaning**: "You're exploring the right part of the database."
**Value Provided**:
- **Exploration guidance**: Early signal before query results are correct
- **Helps with multi-hop**: "You found `employees`, now look at `departments`"
- **Penalizes shotgun approach**: Agent that queries every table gets low reward
**Risk Analysis**: ⚠️ **Medium-high risk**
- **Multiple valid paths**: Simple query might use 1 table, complex query uses 3—both correct
- **Irrelevant table penalty**: What if agent explores `departments` first before finding `employees`? Gets penalized for valid exploration
- **Requires SQL parsing**: Dependency on `sqlparse` library; edge cases in parsing
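A sketch that sidesteps a full SQL parser: a regex grabs identifiers after FROM/JOIN, which misses subqueries, CTEs, and quoted names (a real version would use `sqlparse` as noted above; the 0.1 penalty matches the worked example):
```python
import re

def tables_used(query):
    """Crude table extraction (sketch): identifiers after FROM/JOIN."""
    return {m.lower() for m in re.findall(
        r'\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)',
        query, flags=re.IGNORECASE)}

def schema_coverage_reward(agent_query, gold_query, extra_penalty=0.1):
    """Jaccard over tables used, minus a per-extra-table penalty."""
    agent_t, gold_t = tables_used(agent_query), tables_used(gold_query)
    union = agent_t | gold_t
    if not union:
        return 0.0
    jaccard = len(agent_t & gold_t) / len(union)
    return max(0.0, jaccard - extra_penalty * len(agent_t - gold_t))
```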
---
#### 6. Column Alignment
**How It Works**:
```
Question: "List departments and their average salaries"
Gold columns: [department_name, avg_salary]
Agent columns: [dept, average_compensation]
Fuzzy matching:
"department_name" vs "dept"
→ SequenceMatcher = 0.65 (partial match) ✓
"avg_salary" vs "average_compensation"
→ SequenceMatcher = 0.45 (weak match) ✗
Reward: 1/2 columns matched = 0.5
```
**Conceptual Meaning**: "Your columns have the right semantic meaning."
**Value Provided**:
- **Disambiguates multi-column results**: If result has `[42, 100, 5]`, which column is the answer?
- **Catches projection errors**: Agent did `SELECT *` when it should have done `SELECT department, COUNT(*)`
- **Fuzzy matching helps**: "dept" matches "department", "emp_id" matches "employee_id"
**Risk Analysis**: ⚠️ **High risk**
- **Requires metadata**: Need to track column names from query results (not always available in raw SQLite)
- **Fuzzy matching noise**: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
- **Aliasing issues**: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`—both mean the same thing
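A sketch using `difflib.SequenceMatcher` from the standard library; the 0.6 match threshold is an assumption, and as noted above the ratios are noisy:
```python
from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns with a fuzzy agent match (sketch)."""
    if not gold_columns:
        return 1.0

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    matched = sum(
        1 for g in gold_columns
        if agent_columns
        and max(similarity(g, a) for a in agent_columns) >= threshold
    )
    return matched / len(gold_columns)
```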
---
#### 7. Rank Correlation
**How It Works**:
```
Question: "Top 5 products by revenue"
Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
Ranks: [1, 2, 3, 4, 5]
Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
Ranks: [1, 3, 2, 5, 4]
Spearman correlation:
Rank differences d: [0, -1, +1, -1, +1] → Σd² = 4
rho = 1 - 6Σd² / (n(n² - 1)) = 1 - 24/120 = 0.8
Reward: (0.8 + 1) / 2 = 0.9
```
**Conceptual Meaning**: "You got the relative ordering mostly right."
**Value Provided**:
- **Specific to TOP-K queries**: ~20% of Spider questions involve ranking
- **Robust to ties**: Handles "tied for 2nd place" correctly
- **Partial credit for ordering**: Top 3 correct but bottom 2 swapped → still high reward
**Risk Analysis**: ⚠️ **Medium risk**
- **Limited applicability**: Only works for ordered results
- **Requires scipy**: Heavy dependency just for one metric
- **Rank vs. value confusion**: Agent might get ranking right but values wrong (or vice versa)
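A sketch built on `scipy.stats.spearmanr`; it assumes rows are keyed by their first column and correlates only items present in both results (missing items should be penalized by value overlap instead):
```python
from scipy.stats import spearmanr

def rank_correlation_reward(agent_rows, gold_rows):
    """Spearman-based ordering reward, mapped to [0, 1] (sketch)."""
    gold_keys = [row[0] for row in gold_rows]
    agent_pos = {row[0]: i for i, row in enumerate(agent_rows)}
    shared = [k for k in gold_keys if k in agent_pos]
    if len(shared) < 2:  # correlation needs at least two shared items
        return 0.0
    gold_ranks = [gold_keys.index(k) for k in shared]
    agent_ranks = [agent_pos[k] for k in shared]
    rho, _ = spearmanr(gold_ranks, agent_ranks)
    return (rho + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
```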
---
### 🥉 Tier 3: Avoid (High Risk, Low Value)
#### 8. SQL Structure Similarity
**How It Works**:
```
Gold query: SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
Agent query: SELECT department, COUNT(*)
FROM employees
GROUP BY department
Structural comparison:
Tables: {employees} vs {employees, departments} → 0.5
Joins: {(employees, departments)} vs {} → 0.0
Aggregates: {COUNT} vs {COUNT} → 1.0
Group By: {d.name} vs {department} → 0.5
Weighted average: 0.5
```
**Conceptual Meaning**: "Your query looks syntactically similar to the gold query."
**Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Reward hacking**: Agent learns to copy SQL structure without understanding semantics
- **Multiple valid solutions**: Gold uses a JOIN, agent uses a subquery; both are correct, but the structure reward penalizes the subquery
- **Overfitting**: Agent optimizes for "looking like gold query" instead of "getting right answer"
- **Research evidence**: SQL-TRAIL paper found structure-based rewards hurt generalization
---
#### 9. Execution Plan Similarity
**How It Works**:
```
Gold query execution plan:
1. Scan departments (10 rows)
2. Scan employees (1000 rows)
3. Hash join (O(N))
4. Aggregate (O(N))
Agent query execution plan:
1. Scan employees (1000 rows)
2. Nested loop with departments (O(N²))
3. Aggregate (O(N))
Similarity: 2/4 steps similar = 0.5
```
**Conceptual Meaning**: "Your query executes in a similar way."
**Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Database-specific**: SQLite plans differ from PostgreSQL plans
- **Doesn't guarantee correctness**: Two queries with similar plans can have different results
- **Computationally expensive**: Running EXPLAIN on every query doubles execution time
- **Complexity**: Comparing tree structures is non-trivial
- **No research evidence**: No prior work shows this helps RL training
---
## 🧮 Combining Multiple Metrics into Final Reward
### The Challenge
You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?
```
Current state:
cardinality_score = 0.8
value_overlap_score = 0.6
numeric_range_score = 0.9
row_match_score = 0.7
Need: single_reward = ???
```
---
## 🎯 Method 1: Weighted Average (RECOMMENDED for MVP)
**Formula**:
```python
def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight
```
**Example Implementation**:
```python
def compute_progress_reward(agent_result, gold_result, agent_query=None):
    # Compute all metrics
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Fixed weights (tune these!)
    weights = {
        'cardinality': 0.25,
        'value_overlap': 0.40,  # highest weight (most universal)
        'numeric_range': 0.15,
        'row_match': 0.20,
    }
    return weighted_average_reward(scores, weights)
```
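Plugging in the scores from "The Challenge" above:
```python
scores = {'cardinality': 0.8, 'value_overlap': 0.6,
          'numeric_range': 0.9, 'row_match': 0.7}
weights = {'cardinality': 0.25, 'value_overlap': 0.40,
           'numeric_range': 0.15, 'row_match': 0.20}
print(weighted_average_reward(scores, weights))  # ≈ 0.715
```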
**Pros**:
- ✅ Simple to implement and understand
- ✅ Easy to tune (adjust weights based on training performance)
- ✅ Monotonic (if individual metrics improve, final reward improves)
- ✅ Bounded [0, 1]
**Cons**:
- ⚠️ Fixed weights might not be optimal for all question types
- ⚠️ Treats all metrics as equally important (regardless of context)
**When to use**: Default choice for MVP. Start here.
---
## 🎯 Method 2: Adaptive Weighting by Question Type
**Formula**:
```python
def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics."""
    # Detect question type from metadata (gold result, question text, gold SQL)
    is_numeric = has_numeric_answer(question_metadata['gold_result'])
    is_multirow = len(question_metadata['gold_result']) > 1
    is_ordered = ("TOP" in question_metadata['question'].upper()
                  or "ORDER BY" in question_metadata['gold_query'].upper())
    # Adaptive weights (is_ordered could gate a rank-correlation term, metric #7)
    weights = {
        'cardinality': 0.25,  # always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # only for numeric
        'row_match': 0.20 if is_multirow else 0.10,
    }
    # Normalize so weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}
    return sum(scores[k] * weights[k] for k in scores)
```
**Example**:
```
Question: "Average salary in Engineering?" (numeric, single-value)
→ normalized weights ≈ cardinality=0.26, value_overlap=0.32, numeric_range=0.32, row_match=0.11
Question: "List all departments with >50 employees" (text, multi-row)
→ normalized weights ≈ cardinality=0.29, value_overlap=0.47, numeric_range=0.0, row_match=0.24
```
**Pros**:
- ✅ More accurate reward signal for different question types
- ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
- ✅ Can tune weights per question type independently
**Cons**:
- ⚠️ More complex to implement (need question type detection)
- ⚠️ More hyperparameters to tune
- ⚠️ Risk of over-engineering
**When to use**: If fixed weighting shows poor performance on specific question types.
---
## 🎯 Method 3: Max Pooling (Optimistic)
**Formula**:
```python
def max_pooling_reward(scores):
    """Take the best metric (optimistic reward)."""
    return max(scores.values())
```
**Example**:
```
scores = {
    'cardinality': 0.3,    # wrong row count
    'value_overlap': 0.8,  # found most values!
    'numeric_range': 0.4,  # numbers off
    'row_match': 0.5,      # some rows match
}
reward = max(0.3, 0.8, 0.4, 0.5) = 0.8
```
**Conceptual Meaning**: "Give credit for whatever the agent did best."
**Pros**:
- ✅ Very forgiving (agent gets credit for any progress)
- ✅ Encourages diverse exploration strategies
- ✅ Simple to implement
**Cons**:
- 🛑 **Too lenient**: Agent might game easiest metric
- 🛑 **Non-compositional**: Doesn't reward improving multiple aspects simultaneously
- 🛑 **Unstable gradients**: Reward can jump dramatically between steps
**When to use**: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.
---
## 🎯 Method 4: Minimum Threshold + Average (Strict)
**Formula**:
```python
def threshold_average_reward(scores, thresholds):
    """All metrics must meet threshold; then take average."""
    # Check all thresholds
    for metric, score in scores.items():
        if score < thresholds.get(metric, 0.0):
            return 0.0  # fail if any metric is below its threshold
    # All thresholds met → return average
    return sum(scores.values()) / len(scores)
```
**Example**:
```python
scores = {
    'cardinality': 0.9,
    'value_overlap': 0.7,
    'numeric_range': 0.3,  # below threshold!
}
thresholds = {
    'cardinality': 0.5,
    'value_overlap': 0.5,
    'numeric_range': 0.5,
}
# numeric_range (0.3) < threshold (0.5) → return 0.0
```
**Conceptual Meaning**: "You must do reasonably well on all aspects to get any reward."
**Pros**:
- ✅ Prevents over-optimization of single metric
- ✅ Encourages balanced progress
**Cons**:
- 🛑 **Too strict**: Might be too hard for early training
- 🛑 **Cliff dynamics**: Slight improvement might not change reward at all
- 🛑 **Threshold tuning**: Requires careful calibration
**When to use**: If agent is gaming one metric while ignoring others. Use as penalty mechanism.
---
## 🎯 Method 5: Hierarchical (Coarse-to-Fine)
**Formula**:
```python
def hierarchical_reward(scores):
    """First get cardinality right, then content, then structure."""
    # Layer 1: cardinality (must be > 0.5 to proceed)
    if scores['cardinality'] < 0.5:
        return scores['cardinality'] * 0.3  # low reward, stuck at layer 1
    # Layer 2: content (must be > 0.5 to proceed)
    content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
    if content_score < 0.5:
        return 0.3 + content_score * 0.4  # medium reward, stuck at layer 2
    # Layer 3: structure (all metrics combined)
    final_score = (
        0.2 * scores['cardinality'] +
        0.4 * content_score +
        0.4 * scores['row_match']
    )
    return 0.5 + final_score * 0.5  # high reward, layer 3
```
**Conceptual Meaning**: "Learn to get the count right first, then the values, then the structure."
**Visualization**:
```
Reward bands produced by the code above:
  0.00 – 0.15   stuck at layer 1 (cardinality wrong)
  0.30 – 0.50   stuck at layer 2 (content wrong)
  0.50 – 1.00   layer 3 (all metrics blended)
```
**Pros**:
- ✅ Natural curriculum (easier tasks first)
- ✅ Clear progression signal
- ✅ Prevents agent from over-optimizing structure before content
**Cons**:
- ⚠️ More complex logic
- ⚠️ Requires careful threshold tuning
- ⚠️ Might slow down learning if thresholds too strict
**When to use**: If agent learns poorly with flat reward. Provides curriculum learning.
---
## 🎯 Method 6: Product (Multiplicative)
**Formula**:
```python
def product_reward(scores):
    """Multiply all metrics (all must be good)."""
    product = 1.0
    for score in scores.values():
        product *= score
    return product
```
**Example**:
```
scores = {
'cardinality': 0.9,
'value_overlap': 0.8,
'numeric_range': 0.7,
}
reward = 0.9 × 0.8 × 0.7 = 0.504
```
**Conceptual Meaning**: "All aspects must be good; weak performance on any metric drags down total."
**Pros**:
- ✅ Encourages balanced improvement
- ✅ Penalizes weak performance on any dimension
**Cons**:
- 🛑 **Too strict**: Single low score (0.1) makes entire reward near zero
- 🛑 **Vanishing gradients**: Product of small numbers becomes very small
- 🛑 **Compresses toward zero**: Even the solid scores in the example above (0.9, 0.8, 0.7) yield only 0.504
**When to use**: Rarely. Only if you need extremely strict "all-or-nothing" reward.
---
## 🎯 Method 7: Percentile Aggregation (Robust)
**Formula**:
```python
import numpy as np

def percentile_reward(scores, percentile=50):
    """Use median (or another percentile) of all metrics."""
    return np.percentile(list(scores.values()), percentile)
```
**Example**:
```
scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]
percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6
```
**Conceptual Meaning**: "Reward based on typical performance, ignoring outliers."
**Pros**:
- ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
- ✅ Simple to implement
- ✅ Tunable (change percentile to be more/less strict)
**Cons**:
- ⚠️ Less interpretable than weighted average
- ⚠️ Ignores some information (throws away best and worst scores)
**When to use**: If one metric is noisy or unreliable, use median to ignore it.
---
## 📊 Comparison Table: Combination Methods
| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|--------|------------|------------------|------------|-------------------|---------------|
| **Weighted Average** | Low | High | Medium | High | **MVP default** |
| **Adaptive Weighting** | Medium | Medium | High | High | Different question types need different signals |
| **Max Pooling** | Low | Medium | Low | Low | Agent struggling to learn anything |
| **Threshold + Average** | Medium | High | Medium | Medium | Agent gaming one metric |
| **Hierarchical** | High | High | High | Medium | Want curriculum learning |
| **Product** | Low | Low | Low | Low | All aspects must be perfect (rare) |
| **Percentile** | Low | Low | High | High | One metric is noisy/unreliable |
---
## 🎬 Recommended Implementation Strategy
### Phase 1: Start Simple (Week 1)
```python
def compute_progress_reward(agent_result, gold_result):
    """Initial implementation: weighted average of 3 metrics."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
    }
    weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
    return weighted_average_reward(scores, weights)
```
**Why**: Simple, interpretable, easy to debug.
---
### Phase 2: Add Context (Week 2, if needed)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
    """Adaptive weighting based on question type."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Adapt weights
    if question_type == 'numeric':
        weights = {'cardinality': 0.2, 'value_overlap': 0.3,
                   'numeric_range': 0.4, 'row_match': 0.1}
    elif question_type == 'multirow':
        weights = {'cardinality': 0.25, 'value_overlap': 0.4,
                   'numeric_range': 0.05, 'row_match': 0.3}
    else:  # default
        weights = {'cardinality': 0.25, 'value_overlap': 0.5,
                   'numeric_range': 0.15, 'row_match': 0.1}
    return weighted_average_reward(scores, weights)
```
**Why**: Improves signal quality without major complexity.
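Phase 2 assumes a `question_type` label. A minimal heuristic to derive it from the gold result (the helper name and rules are assumptions, not part of the pipeline above):
```python
def infer_question_type(gold_result):
    """Heuristic question-type detection from the gold result (sketch)."""
    flat = [v for row in gold_result for v in row]
    if len(gold_result) > 1:
        return 'multirow'
    if flat and all(isinstance(v, (int, float)) for v in flat):
        return 'numeric'
    return 'default'
```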
---
### Phase 3: Add Safeguards (Week 3, if agent is gaming)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
    """Weighted average with anti-gaming measures."""
    scores = compute_all_scores(agent_result, gold_result)
    # Anti-gaming: if cardinality is way off, cap the overall reward
    if scores['cardinality'] < 0.3:
        # Agent is nowhere close on size → limit credit for content
        return scores['cardinality'] * 0.5
    # Anti-gaming: if value overlap is low, cap row match
    if scores['value_overlap'] < 0.4:
        scores['row_match'] *= 0.5  # penalize structure if content is wrong
    # Standard weighted average
    return weighted_average_reward(scores, get_weights(question_type))
```
**Why**: Prevents reward hacking while keeping interpretability.
---
## 🧪 How to Validate Your Combination Method
Create test suite:
```python
# Test 1: Perfect match
assert compute_reward(gold, gold) == 1.0

# Test 2: Completely wrong
assert compute_reward(random_result, gold) < 0.2

# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result()  # 30% right
result_v2 = more_correct_result()       # 60% right
result_v3 = mostly_correct_result()     # 90% right
assert (compute_reward(result_v1, gold)
        < compute_reward(result_v2, gold)
        < compute_reward(result_v3, gold))

# Test 4: Bounded [0, 1]
for _ in range(100):
    random_result = generate_random_result()
    reward = compute_reward(random_result, gold)
    assert 0.0 <= reward <= 1.0

# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_reward(result_format_a, gold)
           - compute_reward(result_format_b, gold)) < 0.1  # allow small difference
```
---
## 💡 Final Recommendation
**For MVP (Phase 1-3)**:
- Use **Method 1: Weighted Average** with metrics #1-3 (cardinality, value overlap, numeric range)
- Fixed weights: `{0.25, 0.50, 0.25}`
**If training shows issues**:
- Add **Method 2: Adaptive Weighting** based on question type
- Add metric #4 (row-wise match)
**If agent games rewards**:
- Add threshold checks from **Method 4**
- Add anti-gaming logic from Phase 3 example
**Never use**:
- Method 3 (Max Pooling) - too gameable
- Method 6 (Product) - too strict, vanishing gradients
- Metrics #8-9 (SQL structure, execution plans) - research shows they hurt
**Bottom line**: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.