| # Distance-to-Goal Metrics: Complete Reference Table |
|
|
| ## Priority-Ordered Metric Comparison |
|
|
| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Benefit | Potential Risks |
|----------|-------------|--------|-------|---------|------------------|-------------|-----------------|
| **1** | **Cardinality Matching** | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| **2** | **Value Overlap (Set-based)** | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| **3** | **Numeric Range Proximity** | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude accuracy | Only useful for numeric questions (~40% of dataset) |
| **4** | **Row-wise Best Match** | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving of column mismatches | Computationally expensive for large result sets (100+ rows) |
| **5** | **Schema Coverage** | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of the schema | Requires SQL parsing; may reward irrelevant table access |
| **6** | **Column Alignment** | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| **7** | **Rank Correlation** | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| **8** | **SQL Structure** | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: reward hacking; overfitting; ignores semantic equivalence |
| **9** | **Execution Plan** | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: too complex; DB-specific; doesn't guarantee correctness |
|
|
| --- |
|
|
| ## Detailed Conceptual Explanations |
|
|
| ### 🥇 Tier 1: Must-Have Metrics |
|
|
| #### 1. Cardinality Matching |
|
|
| **How It Works**: |
| ``` |
| Input: agent_result = [row1, row2, row3, row4, row5] (5 rows) |
| gold_result = [row1, row2, row3] (3 rows) |
| |
| Calculation: |
| difference = |5 - 3| = 2 |
| relative_error = 2 / 3 = 0.667 |
| reward = 1 - min(1.0, 0.667) = 0.333 |
| ``` |
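
A minimal sketch of this calculation, assuming both results arrive as lists of rows; how an empty gold result is handled is an assumption:

```python
def cardinality_reward(agent_result, gold_result):
    """Reward how close the agent's row count is to the gold row count."""
    agent_n, gold_n = len(agent_result), len(gold_result)
    if gold_n == 0:                        # assumption: an empty gold result demands an empty answer
        return 1.0 if agent_n == 0 else 0.0
    relative_error = abs(agent_n - gold_n) / gold_n
    return 1.0 - min(1.0, relative_error)
```

On the example above, `abs(5 - 3) / 3 ≈ 0.667`, giving a reward of `0.333`.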
|
|
| **Conceptual Meaning**: "Getting the right number of things is the first step to getting the right things." |
|
|
| **Value Provided**: |
| - **Early signal**: Before content is right, size can be right |
| - **Universal**: Works for integers (1 value), lists (N values), tables (M rows) |
| - **Catches common errors**: |
| - Missing `GROUP BY` → too few rows |
| - Cartesian join → too many rows |
| - Wrong aggregation → count off by orders of magnitude |
|
|
| **Risk Analysis**: ✅ **Very low risk** |
| - No reward hacking opportunities (agent can't game cardinality without semantic progress) |
| - Monotonic with correctness (better results = better cardinality) |
| - Fast to compute (O(1)) |
|
|
| --- |
|
|
| #### 2. Value Overlap (Set-based) |
|
|
| **How It Works**: |
| ``` |
| Question: "Which departments have >50 employees?" |
| Gold: {Engineering, Sales, Marketing} |
| Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong) |
| |
| Calculation: |
| intersection = {Engineering, Sales} = 2 values |
| union = {Engineering, Sales, Marketing, HR, Legal} = 5 values |
| Jaccard = 2/5 = 0.4 |
| ``` |
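
A sketch of the flatten-then-Jaccard computation; treating each result as a list of row tuples and lowercasing values during normalization are assumptions:

```python
def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity between the flattened value sets of agent and gold results."""
    def flatten(rows):
        # Normalize every cell to a string so ("Engineering", 42) and (42, "Engineering")
        # flatten to the same set of atomic values.
        return {str(value).strip().lower() for row in rows for value in row}
    agent_values, gold_values = flatten(agent_result), flatten(gold_result)
    if not (agent_values or gold_values):
        return 1.0
    return len(agent_values & gold_values) / len(agent_values | gold_values)
```

On the departments example this yields 2/5 = 0.4.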
|
|
| **Conceptual Meaning**: "You found some of the right answers; find the rest." |
|
|
| **Value Provided**: |
| - **Partial credit**: Agent gets 0.4 reward even though final answer isn't perfect |
| - **Format agnostic**: Flattens everything to atomic values |
| - `[("Engineering", 42)]` → `{Engineering, 42}` |
| - `[(42, "Engineering")]` → `{Engineering, 42}` (same!) |
| - **Compositional**: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0 |
|
|
| **Risk Analysis**: ⚠️ **Low-medium risk** |
| - **Coincidental matches**: If gold answer is `{42}` and agent returns entire `employees` table with 100 rows, might contain `42` by chance |
| - **Mitigation**: Combine with cardinality (penalizes returning too many values) |
- **False precision**: `{42}`, `{42.0}`, and `{42.001}` may or may not be treated as the same value, depending on how numbers are normalized to strings before comparison
|
|
| --- |
|
|
| #### 3. Numeric Range Proximity |
|
|
| **How It Works**: |
| ``` |
| Question: "Average salary in Engineering?" |
| Gold: 95000 |
| |
| Agent attempt 1: 87000 (off by 8.4%) |
| relative_error = 8000 / 95000 = 0.084 |
| reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965 |
| |
| Agent attempt 2: 9500 (off by 90%, wrong order of magnitude) |
| relative_error = 85500 / 95000 = 0.9 |
| reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721 |
| |
| Agent attempt 3: 950000 (10x too high) |
| relative_error = 855000 / 95000 = 9.0 |
| reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0 |
| ``` |
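
A sketch of this metric, under two assumptions that the risk notes below also call out: each gold number is paired with the closest agent number, and a zero gold value is handled as a special case:

```python
import math

def numeric_range_reward(agent_result, gold_result):
    """Log-scaled proximity between numeric values in the agent and gold results."""
    def numbers(rows):
        return [v for row in rows for v in row if isinstance(v, (int, float))]
    gold_nums, agent_nums = numbers(gold_result), numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0                                     # metric does not apply
    rewards = []
    for gold in gold_nums:
        closest = min(agent_nums, key=lambda a: abs(a - gold))   # closest-match pairing
        rel_error = abs(closest - gold) / abs(gold) if gold != 0 else abs(closest)
        rewards.append(max(0.0, 1.0 - math.log10(1.0 + rel_error)))
    return sum(rewards) / len(rewards)
```

Attempt 1 above maps to `1 - log10(1.084) ≈ 0.965`; attempt 3 bottoms out at `0.0`.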
|
|
| **Conceptual Meaning**: "Being off by 10% is very different from being off by 10x." |
|
|
| **Value Provided**: |
| - **Order-of-magnitude thinking**: Rewards agent for "ballpark correct" before exact |
- **Logarithmic scale**:
  - 95k → 100k (5% error) = high reward (0.98)
  - 95k → 190k (100% error) = medium reward (0.70)
  - 95k → 950k (900% error) = no reward (0.0)
| - **Natural for SQL**: COUNT, SUM, AVG queries often close but not exact on first try |
|
|
| **Risk Analysis**: ⚠️ **Medium risk** |
| - **Only useful for ~40% of questions**: Text/categorical answers get no benefit |
| - **Multiple numbers**: If result has `[42, 100, 5]` and gold is `[42]`, which number to compare? |
| - Solution: Use closest match for each gold number |
| - **Zero-handling**: `gold=0, agent=1` is infinitely far; needs special case |
|
|
| --- |
|
|
| #### 4. Row-wise Best Match |
|
|
| **How It Works**: |
| ``` |
| Question: "Top 3 departments by size" |
| Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)] |
| Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order! |
| |
| Process: |
| For each gold row, find best matching agent row: |
| Gold row 1: (Engineering, 65) |
| vs Agent row 1: (Marketing, 52) → 0/2 match = 0.0 |
| vs Agent row 2: (Engineering, 65) → 2/2 match = 1.0 ✓ |
| vs Agent row 3: (Sales, 58) → 0/2 match = 0.0 |
| Best match: 1.0 |
| |
| Gold row 2: (Sales, 58) |
| Best match: 1.0 (agent row 3) |
| |
| Gold row 3: (Marketing, 52) |
| Best match: 1.0 (agent row 1) |
| |
| Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0 |
| ``` |
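
A sketch of this greedy best-match procedure. The 20-row cap mirrors the mitigation noted below; normalizing by the longer row (so extra columns dilute the score, matching the 0.67 example that follows) is an assumption:

```python
def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Average, over gold rows, of the best cell-wise match found among agent rows."""
    gold_rows, agent_rows = gold_result[:max_rows], agent_result[:max_rows]
    if not gold_rows:
        return 1.0 if not agent_rows else 0.0
    if not agent_rows:
        return 0.0

    def row_similarity(gold_row, agent_row):
        # Position-wise cell matches, normalized by the longer row so that
        # extra columns reduce (but do not zero out) the score.
        matches = sum(1 for g, a in zip(gold_row, agent_row) if str(g) == str(a))
        return matches / max(len(gold_row), len(agent_row))

    best_per_gold = [max(row_similarity(g, a) for a in agent_rows) for g in gold_rows]
    return sum(best_per_gold) / len(best_per_gold)
```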
|
|
| **Conceptual Meaning**: "You got the right rows, just in the wrong order—that's still mostly correct." |
|
|
| **Value Provided**: |
| - **Order-invariant**: Catches `ORDER BY` mistakes without penalizing heavily |
| - **Extra columns forgiven**: If agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, first 2 columns match → 0.67 reward |
| - **Partial row matches**: Agent got department name right but count wrong → 0.5 reward per row |
|
|
| **Risk Analysis**: ⚠️ **Medium risk** |
| - **Computationally expensive**: O(M×N) comparisons for M gold rows and N agent rows |
| - For 100-row results: 10,000 comparisons per reward calculation |
| - Mitigation: Limit to first 20 rows |
| - **Ambiguous matching**: If gold has duplicate rows, which agent row should match which? |
|
|
| --- |
|
|
| ### 🥈 Tier 2: Nice-to-Have Metrics |
|
|
| #### 5. Schema Coverage |
|
|
| **How It Works**: |
| ``` |
| Question: "How many employees in Engineering?" |
| Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering' |
| Tables used: {employees} |
| |
| Agent query: SELECT COUNT(*) FROM employees e |
| JOIN departments d ON e.dept_id = d.id |
| WHERE d.name='Engineering' |
| Tables used: {employees, departments} |
| |
| Calculation: |
| intersection = {employees} = 1 |
| union = {employees, departments} = 2 |
| Jaccard = 1/2 = 0.5 |
| |
| Penalty for extra table: 0.1 * 1 = 0.1 |
| Final: 0.5 - 0.1 = 0.4 |
| ``` |
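
A sketch of this calculation (the function name is illustrative). The regex-based table extraction is a deliberate simplification; a real implementation would use a SQL parser such as `sqlparse`. The 0.1 per-extra-table penalty mirrors the example above:

```python
import re

def schema_coverage_reward(agent_query, gold_query, extra_table_penalty=0.1):
    """Jaccard overlap of referenced tables, minus a small penalty per extra table."""
    def tables(sql):
        # Crude extraction: identifiers that follow FROM or JOIN.
        return {name.lower() for name in
                re.findall(r'\b(?:FROM|JOIN)\s+([A-Za-z_]\w*)', sql, flags=re.IGNORECASE)}
    agent_tables, gold_tables = tables(agent_query), tables(gold_query)
    if not (agent_tables or gold_tables):
        return 0.0
    jaccard = len(agent_tables & gold_tables) / len(agent_tables | gold_tables)
    extra = len(agent_tables - gold_tables)
    return max(0.0, jaccard - extra_table_penalty * extra)
```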
|
|
| **Conceptual Meaning**: "You're exploring the right part of the database." |
|
|
| **Value Provided**: |
| - **Exploration guidance**: Early signal before query results are correct |
| - **Helps with multi-hop**: "You found `employees`, now look at `departments`" |
| - **Penalizes shotgun approach**: Agent that queries every table gets low reward |
|
|
| **Risk Analysis**: ⚠️ **Medium-high risk** |
| - **Multiple valid paths**: Simple query might use 1 table, complex query uses 3—both correct |
| - **Irrelevant table penalty**: What if agent explores `departments` first before finding `employees`? Gets penalized for valid exploration |
| - **Requires SQL parsing**: Dependency on `sqlparse` library; edge cases in parsing |
|
|
| --- |
|
|
| #### 6. Column Alignment |
|
|
| **How It Works**: |
| ``` |
| Question: "List departments and their average salaries" |
| Gold columns: [department_name, avg_salary] |
| Agent columns: [dept, average_compensation] |
| |
| Fuzzy matching: |
| "department_name" vs "dept" |
| → SequenceMatcher = 0.65 (partial match) ✓ |
| |
| "avg_salary" vs "average_compensation" |
| → SequenceMatcher = 0.45 (weak match) ✗ |
| |
| Reward: 1/2 columns matched = 0.5 |
| ``` |
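
A sketch using `difflib.SequenceMatcher` from the standard library; the 0.6 match threshold and the function name are assumptions:

```python
from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns that find a fuzzy name match among the agent's columns."""
    if not gold_columns:
        return 1.0
    if not agent_columns:
        return 0.0
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    matched = sum(
        1 for gold_col in gold_columns
        if max(similarity(gold_col, agent_col) for agent_col in agent_columns) >= threshold
    )
    return matched / len(gold_columns)
```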
|
|
| **Conceptual Meaning**: "Your columns have the right semantic meaning." |
|
|
| **Value Provided**: |
| - **Disambiguates multi-column results**: If result has `[42, 100, 5]`, which column is the answer? |
| - **Catches projection errors**: Agent did `SELECT *` when should've done `SELECT department, COUNT(*)` |
| - **Fuzzy matching helps**: "dept" matches "department", "emp_id" matches "employee_id" |
|
|
| **Risk Analysis**: ⚠️ **High risk** |
| - **Requires metadata**: Need to track column names from query results (not always available in raw SQLite) |
| - **Fuzzy matching noise**: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity) |
| - **Aliasing issues**: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`—both mean the same thing |
|
|
| --- |
|
|
| #### 7. Rank Correlation |
|
|
| **How It Works**: |
| ``` |
| Question: "Top 5 products by revenue" |
| Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K] |
| Ranks: [1, 2, 3, 4, 5] |
| |
| Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K] |
| Ranks: [1, 3, 2, 5, 4] |
| |
Spearman correlation:
  Rank differences d: [0, -1, +1, -1, +1], Σd² = 4
  ρ = 1 - 6·Σd² / (n(n² - 1)) = 1 - 24/120 = 0.8

Reward: (0.8 + 1) / 2 = 0.9
| ``` |
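
A sketch using `scipy.stats.spearmanr`, assuming each result has already been reduced to an ordered list of keys (e.g. product names) before comparison:

```python
from scipy.stats import spearmanr

def rank_correlation_reward(agent_keys, gold_keys):
    """Spearman correlation between gold and agent orderings, rescaled to [0, 1]."""
    common = [key for key in gold_keys if key in set(agent_keys)]  # only shared items are rankable
    if len(common) < 2:
        return 0.0
    gold_ranks = [gold_keys.index(key) for key in common]
    agent_ranks = [agent_keys.index(key) for key in common]
    rho, _ = spearmanr(gold_ranks, agent_ranks)
    return (rho + 1.0) / 2.0
```

For the product example this gives ρ = 0.8 and a reward of 0.9.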
|
|
| **Conceptual Meaning**: "You got the relative ordering mostly right." |
|
|
| **Value Provided**: |
| - **Specific to TOP-K queries**: ~20% of Spider questions involve ranking |
| - **Robust to ties**: Handles "tied for 2nd place" correctly |
| - **Partial credit for ordering**: Top 3 correct but bottom 2 swapped → still high reward |
|
|
| **Risk Analysis**: ⚠️ **Medium risk** |
| - **Limited applicability**: Only works for ordered results |
| - **Requires scipy**: Heavy dependency just for one metric |
| - **Rank vs. value confusion**: Agent might get ranking right but values wrong (or vice versa) |
|
|
| --- |
|
|
| ### 🥉 Tier 3: Avoid (High Risk, Low Value) |
|
|
| #### 8. SQL Structure Similarity |
|
|
| **How It Works**: |
| ``` |
| Gold query: SELECT d.name, COUNT(*) |
| FROM employees e |
| JOIN departments d ON e.dept_id = d.id |
| GROUP BY d.name |
| |
| Agent query: SELECT department, COUNT(*) |
| FROM employees |
| GROUP BY department |
| |
| Structural comparison: |
| Tables: {employees} vs {employees, departments} → 0.5 |
| Joins: {(employees, departments)} vs {} → 0.0 |
| Aggregates: {COUNT} vs {COUNT} → 1.0 |
| Group By: {d.name} vs {department} → 0.5 |
| |
| Weighted average: 0.5 |
| ``` |
|
|
| **Conceptual Meaning**: "Your query looks syntactically similar to the gold query." |
|
|
| **Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT** |
| - **Reward hacking**: Agent learns to copy SQL structure without understanding semantics |
| - **Multiple valid solutions**: Gold uses JOIN, agent uses subquery—both correct, but structure reward penalizes |
| - **Overfitting**: Agent optimizes for "looking like gold query" instead of "getting right answer" |
| - **Research evidence**: SQL-TRAIL paper found structure-based rewards hurt generalization |
|
|
| --- |
|
|
| #### 9. Execution Plan Similarity |
|
|
| **How It Works**: |
| ``` |
| Gold query execution plan: |
| 1. Scan departments (10 rows) |
| 2. Scan employees (1000 rows) |
| 3. Hash join (O(N)) |
| 4. Aggregate (O(N)) |
| |
| Agent query execution plan: |
| 1. Scan employees (1000 rows) |
| 2. Nested loop with departments (O(N²)) |
| 3. Aggregate (O(N)) |
| |
| Similarity: 2/4 steps similar = 0.5 |
| ``` |
|
|
| **Conceptual Meaning**: "Your query executes in a similar way." |
|
|
| **Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT** |
| - **Database-specific**: SQLite plans differ from PostgreSQL plans |
| - **Doesn't guarantee correctness**: Two queries with similar plans can have different results |
| - **Computationally expensive**: Running EXPLAIN on every query doubles execution time |
| - **Complexity**: Comparing tree structures is non-trivial |
| - **No research evidence**: No prior work shows this helps RL training |
|
|
| --- |
|
|
| ## 🧮 Combining Multiple Metrics into Final Reward |
|
|
| ### The Challenge |
|
|
| You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward? |
|
|
| ``` |
| Current state: |
| cardinality_score = 0.8 |
| value_overlap_score = 0.6 |
| numeric_range_score = 0.9 |
| row_match_score = 0.7 |
| |
| Need: single_reward = ??? |
| ``` |
|
|
| --- |
|
|
| ## 🎯 Method 1: Weighted Average (RECOMMENDED for MVP) |
|
|
| **Formula**: |
| ```python |
def weighted_average_reward(scores, weights):
    """Weighted average over whichever metrics were actually computed."""
    # Normalize by the weights of the metrics present in `scores`, so the result
    # stays in [0, 1] even when a metric (e.g. numeric_range) is not applicable.
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight
| ``` |
|
|
| **Example Implementation**: |
| ```python |
| def compute_progress_reward(agent_result, gold_result, agent_query=None): |
| # Compute all metrics |
| scores = { |
| 'cardinality': cardinality_reward(agent_result, gold_result), |
| 'value_overlap': value_overlap_reward(agent_result, gold_result), |
| 'numeric_range': numeric_range_reward(agent_result, gold_result), |
| 'row_match': rowwise_best_match(agent_result, gold_result), |
| } |
| |
| # Fixed weights (tune these!) |
| weights = { |
| 'cardinality': 0.25, |
| 'value_overlap': 0.40, # Highest weight (most universal) |
| 'numeric_range': 0.15, |
| 'row_match': 0.20, |
| } |
| |
| return weighted_average_reward(scores, weights) |
| ``` |
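
Applied to the example scores from the start of this section, and assuming the metric sketches above, that weighting works out as follows:

```python
scores = {'cardinality': 0.8, 'value_overlap': 0.6, 'numeric_range': 0.9, 'row_match': 0.7}
weights = {'cardinality': 0.25, 'value_overlap': 0.40, 'numeric_range': 0.15, 'row_match': 0.20}

reward = weighted_average_reward(scores, weights)
# (0.8*0.25 + 0.6*0.40 + 0.9*0.15 + 0.7*0.20) / 1.0 = 0.715
```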
|
|
| **Pros**: |
| - ✅ Simple to implement and understand |
| - ✅ Easy to tune (adjust weights based on training performance) |
| - ✅ Monotonic (if individual metrics improve, final reward improves) |
| - ✅ Bounded [0, 1] |
|
|
| **Cons**: |
| - ⚠️ Fixed weights might not be optimal for all question types |
| - ⚠️ Treats all metrics as equally important (regardless of context) |
|
|
| **When to use**: Default choice for MVP. Start here. |
|
|
| --- |
|
|
| ## 🎯 Method 2: Adaptive Weighting by Question Type |
|
|
| **Formula**: |
| ```python |
def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics.

    Assumes question_metadata carries 'gold_result', 'question', and 'gold_query'.
    """

    # Detect question type
    gold_result = question_metadata['gold_result']
    is_numeric = has_numeric_answer(gold_result)
    is_multirow = len(gold_result) > 1
    is_ordered = ("TOP" in question_metadata['question'].upper()
                  or "ORDER BY" in question_metadata['gold_query'].upper())  # for a rank-correlation weight, if used
| |
| # Adaptive weights |
| weights = { |
| 'cardinality': 0.25, # Always important |
| 'value_overlap': 0.40 if not is_numeric else 0.30, |
| 'numeric_range': 0.30 if is_numeric else 0.0, # Only for numeric |
| 'row_match': 0.20 if is_multirow else 0.10, |
| } |
| |
| # Normalize |
| total = sum(weights.values()) |
| weights = {k: v/total for k, v in weights.items()} |
| |
| return sum(scores[k] * weights[k] for k in scores) |
| ``` |
|
|
| **Example**: |
```
Question: "Average salary in Engineering?" (numeric, single-value)
→ weights ≈ cardinality 0.26, value_overlap 0.32, numeric_range 0.32, row_match 0.11

Question: "List all departments with >50 employees" (text, multi-row)
→ weights ≈ cardinality 0.29, value_overlap 0.47, numeric_range 0.0, row_match 0.24
```
|
|
| **Pros**: |
| - ✅ More accurate reward signal for different question types |
| - ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions) |
| - ✅ Can tune weights per question type independently |
| |
| **Cons**: |
| - ⚠️ More complex to implement (need question type detection) |
| - ⚠️ More hyperparameters to tune |
| - ⚠️ Risk of over-engineering |
| |
| **When to use**: If fixed weighting shows poor performance on specific question types. |
| |
| --- |
| |
| ## 🎯 Method 3: Max Pooling (Optimistic) |
| |
| **Formula**: |
| ```python |
| def max_pooling_reward(scores): |
| """Take the best metric (optimistic reward).""" |
| return max(scores.values()) |
| ``` |
| |
| **Example**: |
| ``` |
| scores = { |
| 'cardinality': 0.3, # Wrong row count |
| 'value_overlap': 0.8, # Found most values! |
| 'numeric_range': 0.4, # Numbers off |
| 'row_match': 0.5, # Some rows match |
| } |
| |
| reward = max(0.3, 0.8, 0.4, 0.5) = 0.8 |
| ``` |
| |
| **Conceptual Meaning**: "Give credit for whatever the agent did best." |
| |
| **Pros**: |
| - ✅ Very forgiving (agent gets credit for any progress) |
| - ✅ Encourages diverse exploration strategies |
| - ✅ Simple to implement |
| |
| **Cons**: |
| - 🛑 **Too lenient**: Agent might game easiest metric |
| - 🛑 **Non-compositional**: Doesn't reward improving multiple aspects simultaneously |
| - 🛑 **Unstable gradients**: Reward can jump dramatically between steps |
| |
| **When to use**: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average. |
| |
| --- |
| |
| ## 🎯 Method 4: Minimum Threshold + Average (Strict) |
| |
| **Formula**: |
| ```python |
| def threshold_average_reward(scores, thresholds): |
| """All metrics must meet threshold; then take average.""" |
| # Check all thresholds |
| for metric, score in scores.items(): |
| if score < thresholds.get(metric, 0.0): |
| return 0.0 # Fail if any metric below threshold |
| |
| # All thresholds met → return average |
| return sum(scores.values()) / len(scores) |
| ``` |
| |
| **Example**: |
| ```python |
| scores = { |
| 'cardinality': 0.9, |
| 'value_overlap': 0.7, |
| 'numeric_range': 0.3, # Below threshold! |
| } |
| |
| thresholds = { |
| 'cardinality': 0.5, |
| 'value_overlap': 0.5, |
| 'numeric_range': 0.5, |
| } |
| |
| # numeric_range (0.3) < threshold (0.5) → return 0.0 |
| ``` |
|
|
| **Conceptual Meaning**: "You must do reasonably well on all aspects to get any reward." |
|
|
| **Pros**: |
| - ✅ Prevents over-optimization of single metric |
| - ✅ Encourages balanced progress |
|
|
| **Cons**: |
| - 🛑 **Too strict**: Might be too hard for early training |
| - 🛑 **Cliff dynamics**: Slight improvement might not change reward at all |
| - 🛑 **Threshold tuning**: Requires careful calibration |
|
|
| **When to use**: If agent is gaming one metric while ignoring others. Use as penalty mechanism. |
|
|
| --- |
|
|
| ## 🎯 Method 5: Hierarchical (Coarse-to-Fine) |
|
|
| **Formula**: |
| ```python |
| def hierarchical_reward(scores): |
| """First get cardinality right, then content, then structure.""" |
| |
| # Layer 1: Cardinality (must be >0.5 to proceed) |
| if scores['cardinality'] < 0.5: |
| return scores['cardinality'] * 0.3 # Low reward, stuck at layer 1 |
| |
| # Layer 2: Content (must be >0.5 to proceed) |
| content_score = (scores['value_overlap'] + scores['numeric_range']) / 2 |
| if content_score < 0.5: |
| return 0.3 + content_score * 0.4 # Medium reward, stuck at layer 2 |
| |
| # Layer 3: Structure (all metrics combined) |
| final_score = ( |
| 0.2 * scores['cardinality'] + |
| 0.4 * content_score + |
| 0.4 * scores['row_match'] |
| ) |
| return 0.5 + final_score * 0.5 # High reward, layer 3 |
| ``` |
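
For intuition, a quick trace with the same example scores used earlier in this section:

```python
scores = {'cardinality': 0.8, 'value_overlap': 0.6, 'numeric_range': 0.9, 'row_match': 0.7}

# Layer 1: cardinality 0.8 >= 0.5           -> continue
# Layer 2: content = (0.6 + 0.9) / 2 = 0.75 -> continue
# Layer 3: final = 0.2*0.8 + 0.4*0.75 + 0.4*0.7 = 0.74
reward = hierarchical_reward(scores)   # 0.5 + 0.74 * 0.5 = 0.87
```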
|
|
| **Conceptual Meaning**: "Learn to get the count right first, then the values, then the structure." |
|
|
| **Visualization**: |
| ``` |
| Reward progression: |
0.0 ─────── 0.3 ─────── 0.5 ─────── 1.0
             ↑           ↑           ↑
        Cardinality    Content    Structure
          correct      correct     correct
| ``` |
|
|
| **Pros**: |
| - ✅ Natural curriculum (easier tasks first) |
| - ✅ Clear progression signal |
| - ✅ Prevents agent from over-optimizing structure before content |
|
|
| **Cons**: |
| - ⚠️ More complex logic |
| - ⚠️ Requires careful threshold tuning |
| - ⚠️ Might slow down learning if thresholds too strict |
|
|
| **When to use**: If agent learns poorly with flat reward. Provides curriculum learning. |
|
|
| --- |
|
|
| ## 🎯 Method 6: Product (Multiplicative) |
|
|
| **Formula**: |
| ```python |
| def product_reward(scores): |
| """Multiply all metrics (all must be good).""" |
| product = 1.0 |
| for score in scores.values(): |
| product *= score |
| return product |
| ``` |
|
|
| **Example**: |
| ``` |
| scores = { |
| 'cardinality': 0.9, |
| 'value_overlap': 0.8, |
| 'numeric_range': 0.7, |
| } |
| |
| reward = 0.9 × 0.8 × 0.7 = 0.504 |
| ``` |
|
|
| **Conceptual Meaning**: "All aspects must be good; weak performance on any metric drags down total." |
|
|
| **Pros**: |
| - ✅ Encourages balanced improvement |
| - ✅ Penalizes weak performance on any dimension |
|
|
| **Cons**: |
| - 🛑 **Too strict**: Single low score (0.1) makes entire reward near zero |
| - 🛑 **Vanishing gradients**: Product of small numbers becomes very small |
- 🛑 **Poorly calibrated**: Multiplying several metrics can yield a very small reward even for solid progress
|
|
| **When to use**: Rarely. Only if you need extremely strict "all-or-nothing" reward. |
|
|
| --- |
|
|
| ## 🎯 Method 7: Percentile Aggregation (Robust) |
|
|
| **Formula**: |
| ```python |
| def percentile_reward(scores, percentile=50): |
| """Use median (or other percentile) of all metrics.""" |
| import numpy as np |
| return np.percentile(list(scores.values()), percentile) |
| ``` |
|
|
| **Example**: |
| ``` |
| scores = [0.9, 0.8, 0.3, 0.7, 0.6] |
| sorted = [0.3, 0.6, 0.7, 0.8, 0.9] |
| |
| percentile_50 (median) = 0.7 |
| percentile_75 = 0.8 |
| percentile_25 = 0.6 |
| ``` |
|
|
| **Conceptual Meaning**: "Reward based on typical performance, ignoring outliers." |
|
|
| **Pros**: |
| - ✅ Robust to outlier metrics (one very low or very high score doesn't dominate) |
| - ✅ Simple to implement |
| - ✅ Tunable (change percentile to be more/less strict) |
|
|
| **Cons**: |
| - ⚠️ Less interpretable than weighted average |
| - ⚠️ Ignores some information (throws away best and worst scores) |
|
|
| **When to use**: If one metric is noisy or unreliable, use median to ignore it. |
|
|
| --- |
|
|
| ## 📊 Comparison Table: Combination Methods |
|
|
| | Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case | |
| |--------|------------|------------------|------------|-------------------|---------------| |
| | **Weighted Average** | Low | High | Medium | High | **MVP default** | |
| | **Adaptive Weighting** | Medium | Medium | High | High | Different question types need different signals | |
| | **Max Pooling** | Low | Medium | Low | Low | Agent struggling to learn anything | |
| | **Threshold + Average** | Medium | High | Medium | Medium | Agent gaming one metric | |
| | **Hierarchical** | High | High | High | Medium | Want curriculum learning | |
| | **Product** | Low | Low | Low | Low | All aspects must be perfect (rare) | |
| | **Percentile** | Low | Low | High | High | One metric is noisy/unreliable | |
|
|
| --- |
|
|
| ## 🎬 Recommended Implementation Strategy |
|
|
| ### Phase 1: Start Simple (Week 1) |
|
|
| ```python |
| def compute_progress_reward(agent_result, gold_result): |
| """Initial implementation: weighted average of 3 metrics.""" |
| |
| scores = { |
| 'cardinality': cardinality_reward(agent_result, gold_result), |
| 'value_overlap': value_overlap_reward(agent_result, gold_result), |
| 'numeric_range': numeric_range_reward(agent_result, gold_result), |
| } |
| |
| weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25} |
| |
| return weighted_average_reward(scores, weights) |
| ``` |
|
|
| **Why**: Simple, interpretable, easy to debug. |
|
|
| --- |
|
|
| ### Phase 2: Add Context (Week 2, if needed) |
|
|
| ```python |
| def compute_progress_reward(agent_result, gold_result, question_type): |
| """Adaptive weighting based on question type.""" |
| |
| scores = { |
| 'cardinality': cardinality_reward(agent_result, gold_result), |
| 'value_overlap': value_overlap_reward(agent_result, gold_result), |
| 'numeric_range': numeric_range_reward(agent_result, gold_result), |
| 'row_match': rowwise_best_match(agent_result, gold_result), |
| } |
| |
| # Adapt weights |
| if question_type == 'numeric': |
| weights = {'cardinality': 0.2, 'value_overlap': 0.3, |
| 'numeric_range': 0.4, 'row_match': 0.1} |
| elif question_type == 'multirow': |
| weights = {'cardinality': 0.25, 'value_overlap': 0.4, |
| 'numeric_range': 0.05, 'row_match': 0.3} |
| else: # default |
| weights = {'cardinality': 0.25, 'value_overlap': 0.5, |
| 'numeric_range': 0.15, 'row_match': 0.1} |
| |
| return weighted_average_reward(scores, weights) |
| ``` |
|
|
| **Why**: Improves signal quality without major complexity. |
|
|
| --- |
|
|
| ### Phase 3: Add Safeguards (Week 3, if agent is gaming) |
|
|
| ```python |
| def compute_progress_reward(agent_result, gold_result, question_type): |
| """Weighted average with anti-gaming measures.""" |
| |
| scores = compute_all_scores(agent_result, gold_result) |
| |
| # Anti-gaming: if cardinality is way off, cap other rewards |
| if scores['cardinality'] < 0.3: |
| # Agent is nowhere close on size → limit credit for content |
| return scores['cardinality'] * 0.5 |
| |
| # Anti-gaming: if value overlap is low, cap row match |
| if scores['value_overlap'] < 0.4: |
| scores['row_match'] *= 0.5 # Penalize structure if content is wrong |
| |
| # Standard weighted average |
| return weighted_average_reward(scores, get_weights(question_type)) |
| ``` |
|
|
| **Why**: Prevents reward hacking while keeping interpretability. |
|
|
| --- |
|
|
| ## 🧪 How to Validate Your Combination Method |
|
|
| Create test suite: |
|
|
| ```python |
| # Test 1: Perfect match |
| assert compute_reward(gold, gold) == 1.0 |
| |
| # Test 2: Completely wrong |
| assert compute_reward(random_result, gold) < 0.2 |
| |
| # Test 3: Monotonicity (better result → higher reward) |
| result_v1 = partially_correct_result() # 30% right |
| result_v2 = more_correct_result() # 60% right |
| result_v3 = mostly_correct_result() # 90% right |
| assert compute_reward(result_v1, gold) < compute_reward(result_v2, gold) < compute_reward(result_v3, gold) |
| |
| # Test 4: Bounded [0, 1] |
| for _ in range(100): |
| random_result = generate_random_result() |
| reward = compute_reward(random_result, gold) |
| assert 0.0 <= reward <= 1.0 |
| |
| # Test 5: Insensitive to format (same values, different structure) |
| result_format_a = [("Engineering", 42)] |
| result_format_b = [(42, "Engineering")] |
| assert abs(compute_reward(result_format_a, gold) - |
| compute_reward(result_format_b, gold)) < 0.1 # Allow small difference |
| ``` |
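
A concrete instance of Tests 1 and 5 using the department example from earlier; it assumes the Phase 1 `compute_progress_reward` and the metric sketches above:

```python
gold = [("Engineering", 42)]

assert compute_progress_reward(gold, gold) == 1.0                  # Test 1: perfect match
reward_a = compute_progress_reward([("Engineering", 42)], gold)
reward_b = compute_progress_reward([(42, "Engineering")], gold)
assert abs(reward_a - reward_b) < 0.1                              # Test 5: format-insensitive
```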
|
|
| --- |
|
|
| ## 💡 Final Recommendation |
|
|
| **For MVP (Phase 1-3)**: |
| - Use **Method 1: Weighted Average** with metrics #1-3 (cardinality, value overlap, numeric range) |
| - Fixed weights: `{0.25, 0.50, 0.25}` |
|
|
| **If training shows issues**: |
| - Add **Method 2: Adaptive Weighting** based on question type |
| - Add metric #4 (row-wise match) |
|
|
| **If agent games rewards**: |
| - Add threshold checks from **Method 4** |
| - Add anti-gaming logic from Phase 3 example |
|
|
| **Never use**: |
| - Method 3 (Max Pooling) - too gameable |
| - Method 6 (Product) - too strict, vanishing gradients |
| - Metrics #8-9 (SQL structure, execution plans) - research shows they hurt |
|
|
| **Bottom line**: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite. |