Distance-to-Goal Metrics: Complete Reference Table
Priority-Ordered Metric Comparison
| Priority | Metric Name | Effort | Value Rating | Concept | What It Measures | Key Benefit | Potential Risks |
|---|---|---|---|---|---|---|---|
| 1 | Cardinality Matching | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| 2 | Value Overlap (Set-based) | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| 3 | Numeric Range Proximity | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude | Only useful for numeric questions (~40% of dataset) |
| 4 | Row-wise Best Match | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving for column mismatch | Computationally expensive for large result sets (100+ rows) |
| 5 | Schema Coverage | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| 6 | Column Alignment | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| 7 | Rank Correlation | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| 8 | SQL Structure | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: Reward hacking; overfitting; ignores semantic equivalence |
| 9 | Execution Plan | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: Too complex; DB-specific; doesn't guarantee correctness |
Detailed Conceptual Explanations
🥇 Tier 1: Must-Have Metrics
1. Cardinality Matching
How It Works:
Input: agent_result = [row1, row2, row3, row4, row5] (5 rows)
gold_result = [row1, row2, row3] (3 rows)
Calculation:
difference = |5 - 3| = 2
relative_error = 2 / 3 = 0.667
reward = 1 - min(1.0, 0.667) = 0.333
Conceptual Meaning: "Getting the right number of things is the first step to getting the right things."
Value Provided:
- Early signal: Before content is right, size can be right
- Universal: Works for integers (1 value), lists (N values), tables (M rows)
- Catches common errors:
  - Missing `GROUP BY` → too few rows
  - Cartesian join → too many rows
  - Wrong aggregation → count off by orders of magnitude
Risk Analysis: ✅ Very low risk
- No reward hacking opportunities (agent can't game cardinality without semantic progress)
- Monotonic with correctness (better results = better cardinality)
- Fast to compute (O(1))
2. Value Overlap (Set-based)
How It Works:
Question: "Which departments have >50 employees?"
Gold: {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong)
Calculation:
intersection = {Engineering, Sales} = 2 values
union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
Jaccard = 2/5 = 0.4
Conceptual Meaning: "You found some of the right answers; find the rest."
Value Provided:
- Partial credit: Agent gets 0.4 reward even though final answer isn't perfect
- Format agnostic: Flattens everything to atomic values
  - `[("Engineering", 42)]` → `{Engineering, 42}`
  - `[(42, "Engineering")]` → `{Engineering, 42}` (same!)
- Compositional: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0
Risk Analysis: ⚠️ Low-medium risk
- Coincidental matches: If the gold answer is `{42}` and the agent returns the entire `employees` table with 100 rows, it might contain `42` by chance
  - Mitigation: Combine with cardinality (penalizes returning too many values)
- False precision: `{42}` vs `{42.0}` vs `{42.001}` might all become `{"42"}` after string conversion
3. Numeric Range Proximity
How It Works:
Question: "Average salary in Engineering?"
Gold: 95000
Agent attempt 1: 87000 (off by 8.4%)
relative_error = 8000 / 95000 = 0.084
reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965
Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
relative_error = 85500 / 95000 = 0.9
reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721
Agent attempt 3: 950000 (10x too high)
relative_error = 855000 / 95000 = 9.0
reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0
Conceptual Meaning: "Being off by 10% is very different from being off by 10x."
Value Provided:
- Order-of-magnitude thinking: Rewards agent for "ballpark correct" before exact
- Logarithmic scale:
  - 95k → 100k (5% error) = high reward (0.98)
  - 95k → 190k (100% error) = medium reward (0.70)
  - 95k → 950k (900% error) = no reward (0.0)
- Natural for SQL: COUNT, SUM, AVG queries often close but not exact on first try
Risk Analysis: ⚠️ Medium risk
- Only useful for ~40% of questions: Text/categorical answers get no benefit
- Multiple numbers: If the result has `[42, 100, 5]` and gold is `[42]`, which number do we compare?
  - Solution: Use the closest match for each gold number
- Zero handling: `gold=0, agent=1` is infinitely far in relative terms; needs a special case
4. Row-wise Best Match
How It Works:
Question: "Top 3 departments by size"
Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order!
Process:
For each gold row, find best matching agent row:
Gold row 1: (Engineering, 65)
vs Agent row 1: (Marketing, 52) → 0/2 match = 0.0
vs Agent row 2: (Engineering, 65) → 2/2 match = 1.0 ✓
vs Agent row 3: (Sales, 58) → 0/2 match = 0.0
Best match: 1.0
Gold row 2: (Sales, 58)
Best match: 1.0 (agent row 3)
Gold row 3: (Marketing, 52)
Best match: 1.0 (agent row 1)
Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0
Conceptual Meaning: "You got the right rows, just in the wrong order; that's still mostly correct."
Value Provided:
- Order-invariant: Catches `ORDER BY` mistakes without penalizing heavily
- Extra columns forgiven: If the agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, the first 2 columns match → 0.67 reward
- Partial row matches: Agent got the department name right but the count wrong → 0.5 reward per row
Risk Analysis: ⚠️ Medium risk
- Computationally expensive: O(M×N) comparisons for M gold rows and N agent rows
  - For 100-row results: 10,000 comparisons per reward calculation
  - Mitigation: Limit to the first 20 rows
- Ambiguous matching: If gold has duplicate rows, which agent row should match which?
🥈 Tier 2: Nice-to-Have Metrics
5. Schema Coverage
How It Works:
Question: "How many employees in Engineering?"
Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering'
Tables used: {employees}
Agent query: SELECT COUNT(*) FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.name='Engineering'
Tables used: {employees, departments}
Calculation:
intersection = {employees} = 1
union = {employees, departments} = 2
Jaccard = 1/2 = 0.5
Penalty for extra table: 0.1 * 1 = 0.1
Final: 0.5 - 0.1 = 0.4
Conceptual Meaning: "You're exploring the right part of the database."
Value Provided:
- Exploration guidance: Early signal before query results are correct
- Helps with multi-hop: "You found `employees`, now look at `departments`"
- Penalizes shotgun approach: An agent that queries every table gets a low reward
Risk Analysis: ⚠️ Medium-high risk
- Multiple valid paths: A simple query might use 1 table while a complex query uses 3; both can be correct
- Irrelevant table penalty: What if the agent explores `departments` first before finding `employees`? It gets penalized for valid exploration
- Requires SQL parsing: Dependency on the `sqlparse` library; edge cases in parsing
6. Column Alignment
How It Works:
Question: "List departments and their average salaries"
Gold columns: [department_name, avg_salary]
Agent columns: [dept, average_compensation]
Fuzzy matching:
"department_name" vs "dept"
→ similarity ≈ 0.65 (partial match) ✓
"avg_salary" vs "average_compensation"
→ similarity ≈ 0.45 (weak match) ✗
Reward: 1/2 columns matched = 0.5
Conceptual Meaning: "Your columns have the right semantic meaning."
Value Provided:
- Disambiguates multi-column results: If the result has `[42, 100, 5]`, which column is the answer?
- Catches projection errors: Agent did `SELECT *` when it should have done `SELECT department, COUNT(*)`
- Fuzzy matching helps: "dept" matches "department", "emp_id" matches "employee_id"
Risk Analysis: ⚠️ High risk
- Requires metadata: Need to track column names from query results (not always available in raw SQLite)
- Fuzzy matching noise: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
- Aliasing issues: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`; both mean the same thing
7. Rank Correlation
How It Works:
Question: "Top 5 products by revenue"
Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
Ranks: [1, 2, 3, 4, 5]
Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
Ranks: [1, 3, 2, 5, 4]
Spearman correlation:
Rank differences: [0, -1, +1, -1, +1], so Σd² = 4
Correlation coefficient: ρ = 1 - 6·4 / (5·(5² - 1)) = 0.8
Reward: (0.8 + 1) / 2 = 0.9
Conceptual Meaning: "You got the relative ordering mostly right."
Value Provided:
- Specific to TOP-K queries: ~20% of Spider questions involve ranking
- Robust to ties: Handles "tied for 2nd place" correctly
- Partial credit for ordering: Top 3 correct but bottom 2 swapped โ still high reward
Risk Analysis: ⚠️ Medium risk
- Limited applicability: Only works for ordered results
- Requires scipy: Heavy dependency just for one metric
- Rank vs. value confusion: Agent might get ranking right but values wrong (or vice versa)
🥉 Tier 3: Avoid (High Risk, Low Value)
8. SQL Structure Similarity
How It Works:
Gold query: SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
Agent query: SELECT department, COUNT(*)
FROM employees
GROUP BY department
Structural comparison (gold vs agent):
Tables: {employees, departments} vs {employees} → 0.5
Joins: {(employees, departments)} vs {} → 0.0
Aggregates: {COUNT} vs {COUNT} → 1.0
Group By: {d.name} vs {department} → 0.5
Weighted average: 0.5
Conceptual Meaning: "Your query looks syntactically similar to the gold query."
Risk Analysis: 🚫 VERY HIGH RISK - DO NOT IMPLEMENT
- Reward hacking: Agent learns to copy SQL structure without understanding semantics
- Multiple valid solutions: Gold uses a JOIN, agent uses a subquery; both are correct, but the structure reward penalizes the agent
- Overfitting: Agent optimizes for "looking like gold query" instead of "getting right answer"
- Research evidence: SQL-TRAIL paper found structure-based rewards hurt generalization
9. Execution Plan Similarity
How It Works:
Gold query execution plan:
1. Scan departments (10 rows)
2. Scan employees (1000 rows)
3. Hash join (O(N))
4. Aggregate (O(N))
Agent query execution plan:
1. Scan employees (1000 rows)
2. Nested loop with departments (O(N²))
3. Aggregate (O(N))
Similarity: 2/4 steps similar = 0.5
Conceptual Meaning: "Your query executes in a similar way."
Risk Analysis: 🚫 VERY HIGH RISK - DO NOT IMPLEMENT
- Database-specific: SQLite plans differ from PostgreSQL plans
- Doesn't guarantee correctness: Two queries with similar plans can have different results
- Computationally expensive: Running EXPLAIN on every query doubles execution time
- Complexity: Comparing tree structures is non-trivial
- No research evidence: No prior work shows this helps RL training
🧮 Combining Multiple Metrics into Final Reward
The Challenge
You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?
Current state:
cardinality_score = 0.8
value_overlap_score = 0.6
numeric_range_score = 0.9
row_match_score = 0.7
Need: single_reward = ???
🎯 Method 1: Weighted Average (RECOMMENDED for MVP)
Formula:
def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight
Example Implementation:
def compute_progress_reward(agent_result, gold_result, agent_query=None):
    # Compute all metrics
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Fixed weights (tune these!)
    weights = {
        'cardinality': 0.25,
        'value_overlap': 0.40,  # Highest weight (most universal)
        'numeric_range': 0.15,
        'row_match': 0.20,
    }
    return weighted_average_reward(scores, weights)
Pros:
- ✅ Simple to implement and understand
- ✅ Easy to tune (adjust weights based on training performance)
- ✅ Monotonic (if individual metrics improve, final reward improves)
- ✅ Bounded [0, 1]
Cons:
- ⚠️ Fixed weights might not be optimal for all question types
- ⚠️ Treats all metrics as equally important (regardless of context)
When to use: Default choice for MVP. Start here.
🎯 Method 2: Adaptive Weighting by Question Type
Formula:
def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics."""
    # Pull question characteristics out of the metadata dict
    gold_result = question_metadata['gold_result']
    question = question_metadata['question']
    gold_query = question_metadata['gold_query']
    # Detect question type (has_numeric_answer is a project helper, not shown here)
    is_numeric = has_numeric_answer(gold_result)
    is_multirow = len(gold_result) > 1
    is_ordered = "TOP" in question.upper() or "ORDER BY" in gold_query.upper()
    # Adaptive weights (is_ordered could similarly gate a rank-correlation term)
    weights = {
        'cardinality': 0.25,  # Always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # Only for numeric questions
        'row_match': 0.20 if is_multirow else 0.10,
    }
    # Normalize so the weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}
    return sum(scores[k] * weights[k] for k in scores)
Example:
Question: "Average salary in Engineering?" (numeric, single-value)
→ weights: cardinality=0.25, value_overlap=0.30, numeric_range=0.35, row_match=0.10
Question: "List all departments with >50 employees" (text, multi-row)
→ weights: cardinality=0.25, value_overlap=0.45, numeric_range=0.0, row_match=0.30
Pros:
- ✅ More accurate reward signal for different question types
- ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
- ✅ Can tune weights per question type independently
Cons:
- ⚠️ More complex to implement (need question type detection; see the sketch at the end of Phase 2 below)
- ⚠️ More hyperparameters to tune
- ⚠️ Risk of over-engineering
When to use: If fixed weighting shows poor performance on specific question types.
🎯 Method 3: Max Pooling (Optimistic)
Formula:
def max_pooling_reward(scores):
    """Take the best metric (optimistic reward)."""
    return max(scores.values())
Example:
scores = {
    'cardinality': 0.3,    # Wrong row count
    'value_overlap': 0.8,  # Found most values!
    'numeric_range': 0.4,  # Numbers off
    'row_match': 0.5,      # Some rows match
}
reward = max(0.3, 0.8, 0.4, 0.5)  # = 0.8
Conceptual Meaning: "Give credit for whatever the agent did best."
Pros:
- ✅ Very forgiving (agent gets credit for any progress)
- ✅ Encourages diverse exploration strategies
- ✅ Simple to implement
Cons:
- 🚫 Too lenient: Agent might game the easiest metric
- 🚫 Non-compositional: Doesn't reward improving multiple aspects simultaneously
- 🚫 Unstable gradients: Reward can jump dramatically between steps
When to use: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.
🎯 Method 4: Minimum Threshold + Average (Strict)
Formula:
def threshold_average_reward(scores, thresholds):
    """All metrics must meet their threshold; then take the average."""
    # Check all thresholds
    for metric, score in scores.items():
        if score < thresholds.get(metric, 0.0):
            return 0.0  # Fail if any metric is below its threshold
    # All thresholds met → return the average
    return sum(scores.values()) / len(scores)
Example:
scores = {
    'cardinality': 0.9,
    'value_overlap': 0.7,
    'numeric_range': 0.3,  # Below threshold!
}
thresholds = {
    'cardinality': 0.5,
    'value_overlap': 0.5,
    'numeric_range': 0.5,
}
# numeric_range (0.3) < threshold (0.5) → return 0.0
Conceptual Meaning: "You must do reasonably well on all aspects to get any reward."
Pros:
- ✅ Prevents over-optimization of a single metric
- ✅ Encourages balanced progress
Cons:
- 🚫 Too strict: Might be too hard for early training
- 🚫 Cliff dynamics: A slight improvement might not change the reward at all
- 🚫 Threshold tuning: Requires careful calibration
When to use: If agent is gaming one metric while ignoring others. Use as penalty mechanism.
🎯 Method 5: Hierarchical (Coarse-to-Fine)
Formula:
def hierarchical_reward(scores):
    """First get cardinality right, then content, then structure."""
    # Layer 1: Cardinality (must be > 0.5 to proceed)
    if scores['cardinality'] < 0.5:
        return scores['cardinality'] * 0.3  # Low reward, stuck at layer 1
    # Layer 2: Content (must be > 0.5 to proceed)
    content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
    if content_score < 0.5:
        return 0.3 + content_score * 0.4  # Medium reward, stuck at layer 2
    # Layer 3: Structure (all metrics combined)
    final_score = (
        0.2 * scores['cardinality'] +
        0.4 * content_score +
        0.4 * scores['row_match']
    )
    return 0.5 + final_score * 0.5  # High reward, layer 3
Conceptual Meaning: "Learn to get the count right first, then the values, then the structure."
Visualization:
Reward progression:
0.0 ──────── 0.3 ──────── 0.7 ──────── 1.0
        ↑            ↑            ↑
   Cardinality    Content     Structure
     correct      correct      correct
Pros:
- ✅ Natural curriculum (easier tasks first)
- ✅ Clear progression signal
- ✅ Prevents the agent from over-optimizing structure before content
Cons:
- ⚠️ More complex logic
- ⚠️ Requires careful threshold tuning
- ⚠️ Might slow down learning if thresholds are too strict
When to use: If agent learns poorly with flat reward. Provides curriculum learning.
🎯 Method 6: Product (Multiplicative)
Formula:
def product_reward(scores):
    """Multiply all metrics (all must be good)."""
    product = 1.0
    for score in scores.values():
        product *= score
    return product
Example:
scores = {
    'cardinality': 0.9,
    'value_overlap': 0.8,
    'numeric_range': 0.7,
}
reward = 0.9 × 0.8 × 0.7 = 0.504
Conceptual Meaning: "All aspects must be good; weak performance on any metric drags down total."
Pros:
- ✅ Encourages balanced improvement
- ✅ Penalizes weak performance on any dimension
Cons:
- 🚫 Too strict: A single low score (0.1) makes the entire reward near zero
- 🚫 Vanishing gradients: The product of small numbers becomes very small
- 🚫 Not bounded predictably: Can produce very small rewards even for good progress
When to use: Rarely. Only if you need extremely strict "all-or-nothing" reward.
🎯 Method 7: Percentile Aggregation (Robust)
Formula:
import numpy as np

def percentile_reward(scores, percentile=50):
    """Use the median (or another percentile) of all metrics."""
    return np.percentile(list(scores.values()), percentile)
Example:
scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]
percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6
Conceptual Meaning: "Reward based on typical performance, ignoring outliers."
Pros:
- ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
- ✅ Simple to implement
- ✅ Tunable (change the percentile to be more or less strict)
Cons:
- ⚠️ Less interpretable than a weighted average
- ⚠️ Ignores some information (throws away the best and worst scores)
When to use: If one metric is noisy or unreliable, use median to ignore it.
📊 Comparison Table: Combination Methods
| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|---|---|---|---|---|---|
| Weighted Average | Low | High | Medium | High | MVP default |
| Adaptive Weighting | Medium | Medium | High | High | Different question types need different signals |
| Max Pooling | Low | Medium | Low | Low | Agent struggling to learn anything |
| Threshold + Average | Medium | High | Medium | Medium | Agent gaming one metric |
| Hierarchical | High | High | High | Medium | Want curriculum learning |
| Product | Low | Low | Low | Low | All aspects must be perfect (rare) |
| Percentile | Low | Low | High | High | One metric is noisy/unreliable |
🔬 Recommended Implementation Strategy
Phase 1: Start Simple (Week 1)
def compute_progress_reward(agent_result, gold_result):
    """Initial implementation: weighted average of 3 metrics."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
    }
    weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
    return weighted_average_reward(scores, weights)
Why: Simple, interpretable, easy to debug.
Phase 2: Add Context (Week 2, if needed)
def compute_progress_reward(agent_result, gold_result, question_type):
    """Adaptive weighting based on question type."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Adapt weights to the question type
    if question_type == 'numeric':
        weights = {'cardinality': 0.2, 'value_overlap': 0.3,
                   'numeric_range': 0.4, 'row_match': 0.1}
    elif question_type == 'multirow':
        weights = {'cardinality': 0.25, 'value_overlap': 0.4,
                   'numeric_range': 0.05, 'row_match': 0.3}
    else:  # default
        weights = {'cardinality': 0.25, 'value_overlap': 0.5,
                   'numeric_range': 0.15, 'row_match': 0.1}
    return weighted_average_reward(scores, weights)
Why: Improves signal quality without major complexity.
Phase 3: Add Safeguards (Week 3, if agent is gaming)
def compute_progress_reward(agent_result, gold_result, question_type):
    """Weighted average with anti-gaming measures."""
    scores = compute_all_scores(agent_result, gold_result)
    # Anti-gaming: if cardinality is way off, cap other rewards
    if scores['cardinality'] < 0.3:
        # Agent is nowhere close on size → limit credit for content
        return scores['cardinality'] * 0.5
    # Anti-gaming: if value overlap is low, cap row match
    if scores['value_overlap'] < 0.4:
        scores['row_match'] *= 0.5  # Penalize structure if content is wrong
    # Standard weighted average
    return weighted_average_reward(scores, get_weights(question_type))
Why: Prevents reward hacking while keeping interpretability.
🧪 How to Validate Your Combination Method
Create test suite:
# Test 1: Perfect match
assert compute_progress_reward(gold, gold) == 1.0

# Test 2: Completely wrong
assert compute_progress_reward(random_result, gold) < 0.2

# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result()  # 30% right
result_v2 = more_correct_result()       # 60% right
result_v3 = mostly_correct_result()     # 90% right
assert (compute_progress_reward(result_v1, gold)
        < compute_progress_reward(result_v2, gold)
        < compute_progress_reward(result_v3, gold))

# Test 4: Bounded [0, 1]
for _ in range(100):
    random_result = generate_random_result()
    reward = compute_progress_reward(random_result, gold)
    assert 0.0 <= reward <= 1.0

# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_progress_reward(result_format_a, gold) -
           compute_progress_reward(result_format_b, gold)) < 0.1  # Allow a small difference
💡 Final Recommendation
For MVP (Phase 1-3):
- Use Method 1: Weighted Average with metrics #1-3 (cardinality, value overlap, numeric range)
- Fixed weights: 0.25 / 0.50 / 0.25 (cardinality / value overlap / numeric range)
If training shows issues:
- Add Method 2: Adaptive Weighting based on question type
- Add metric #4 (row-wise match)
If agent games rewards:
- Add threshold checks from Method 4
- Add anti-gaming logic from Phase 3 example
Never use:
- Method 3 (Max Pooling) - too gameable
- Method 6 (Product) - too strict, vanishing gradients
- Metrics #8-9 (SQL structure, execution plans) - research shows they hurt
Bottom line: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.