# Distance-to-Goal Metrics: Complete Reference Table
## Priority-Ordered Metric Comparison
| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Value | Potential Risks |
|---------|-------------|--------|-------|---------|------------------|-----------|-----------------|
| **1** | **Cardinality Matching** | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| **2** | **Value Overlap (Set-based)** | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| **3** | **Numeric Range Proximity** | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude | Only useful for numeric questions (~40% of dataset) |
| **4** | **Row-wise Best Match** | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving for column mismatch | Computationally expensive for large result sets (100+ rows) |
| **5** | **Schema Coverage** | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| **6** | **Column Alignment** | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| **7** | **Rank Correlation** | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| **8** | **SQL Structure** | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: Reward hacking; overfitting; ignores semantic equivalence |
| **9** | **Execution Plan** | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: Too complex; DB-specific; doesn't guarantee correctness |
---
## Detailed Conceptual Explanations
### 🥇 Tier 1: Must-Have Metrics
#### 1. Cardinality Matching
**How It Works**:
```
Input: agent_result = [row1, row2, row3, row4, row5] (5 rows)
gold_result = [row1, row2, row3] (3 rows)
Calculation:
difference = |5 - 3| = 2
relative_error = 2 / 3 = 0.667
reward = 1 - min(1.0, 0.667) = 0.333
```
**Conceptual Meaning**: "Getting the right number of things is the first step to getting the right things."
**Value Provided**:
- **Early signal**: Before content is right, size can be right
- **Universal**: Works for integers (1 value), lists (N values), tables (M rows)
- **Catches common errors**:
- Missing `GROUP BY` → too few rows
- Cartesian join → too many rows
- Wrong aggregation → count off by orders of magnitude
**Risk Analysis**: ✅ **Very low risk**
- No reward hacking opportunities (agent can't game cardinality without semantic progress)
- Monotonic with correctness (better results = better cardinality)
- Fast to compute (O(1))
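**Sketch Implementation** (a minimal version of the calculation above; assumes both results arrive as lists of row tuples, e.g. from `cursor.fetchall()`):
```python
def cardinality_reward(agent_result, gold_result):
    """Reward for returning roughly the right number of rows/values."""
    gold_n = len(gold_result)
    agent_n = len(agent_result)
    if gold_n == 0:
        # Empty gold result: only an empty agent result is "right-sized"
        return 1.0 if agent_n == 0 else 0.0
    relative_error = abs(agent_n - gold_n) / gold_n
    return 1.0 - min(1.0, relative_error)
```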
---
#### 2. Value Overlap (Set-based)
**How It Works**:
```
Question: "Which departments have >50 employees?"
Gold: {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong)
Calculation:
intersection = {Engineering, Sales} = 2 values
union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
Jaccard = 2/5 = 0.4
```
**Conceptual Meaning**: "You found some of the right answers; find the rest."
**Value Provided**:
- **Partial credit**: Agent gets 0.4 reward even though final answer isn't perfect
- **Format agnostic**: Flattens everything to atomic values
- `[("Engineering", 42)]` → `{Engineering, 42}`
- `[(42, "Engineering")]` → `{Engineering, 42}` (same!)
- **Compositional**: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0
**Risk Analysis**: ⚠️ **Low-medium risk**
- **Coincidental matches**: If gold answer is `{42}` and agent returns entire `employees` table with 100 rows, might contain `42` by chance
- **Mitigation**: Combine with cardinality (penalizes returning too many values)
- **False precision**: `{42}` vs `{42.0}` vs `{42.001}` might all become `{"42"}` after string conversion
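**Sketch Implementation** (assumes the same list-of-row-tuples input; the lowercase string normalization of cells is one possible choice, not the only one):
```python
def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity over flattened, string-normalized cell values."""
    def flatten(rows):
        # Row structure and column order are deliberately ignored
        return {str(cell).strip().lower() for row in rows for cell in row}

    agent_vals = flatten(agent_result)
    gold_vals = flatten(gold_result)
    if not agent_vals and not gold_vals:
        return 1.0
    union = agent_vals | gold_vals
    return len(agent_vals & gold_vals) / len(union)
```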
---
#### 3. Numeric Range Proximity
**How It Works**:
```
Question: "Average salary in Engineering?"
Gold: 95000
Agent attempt 1: 87000 (off by 8.4%)
relative_error = 8000 / 95000 = 0.084
reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965
Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
relative_error = 85500 / 95000 = 0.9
reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721
Agent attempt 3: 950000 (10x too high)
relative_error = 855000 / 95000 = 9.0
reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0
```
**Conceptual Meaning**: "Being off by 10% is very different from being off by 10x."
**Value Provided**:
- **Order-of-magnitude thinking**: Rewards agent for "ballpark correct" before exact
- **Logarithmic scale**:
- 95k → 100k (5% error) = high reward (0.98)
- 95k → 190k (100% error) = medium reward (0.70)
- 95k → 950k (900% error) = no reward (0.0)
- **Natural for SQL**: COUNT, SUM, AVG queries often close but not exact on first try
**Risk Analysis**: ⚠️ **Medium risk**
- **Only useful for ~40% of questions**: Text/categorical answers get no benefit
- **Multiple numbers**: If result has `[42, 100, 5]` and gold is `[42]`, which number to compare?
- Solution: Use closest match for each gold number
- **Zero-handling**: `gold=0, agent=1` is infinitely far; needs special case
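**Sketch Implementation** (applies the formula above plus the two mitigations from the risk list: closest-match pairing and a zero special case; the numeric-extraction rule is an assumption):
```python
import math

def numeric_range_reward(agent_result, gold_result):
    """Logarithmic proximity for numeric answers: 1 - log10(1 + rel_error)."""
    def numbers(rows):
        return [float(cell) for row in rows for cell in row
                if isinstance(cell, (int, float)) and not isinstance(cell, bool)]

    gold_nums = numbers(gold_result)
    agent_nums = numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0  # not applicable / nothing numeric to compare

    per_value = []
    for g in gold_nums:
        if g == 0:
            # Relative error is undefined at zero; fall back to exact match
            per_value.append(1.0 if any(a == 0 for a in agent_nums) else 0.0)
            continue
        # Compare each gold number against the closest agent number
        rel = min(abs(a - g) / abs(g) for a in agent_nums)
        per_value.append(max(0.0, 1.0 - math.log10(1.0 + rel)))
    return sum(per_value) / len(per_value)
```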
---
#### 4. Row-wise Best Match
**How It Works**:
```
Question: "Top 3 departments by size"
Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order!
Process:
For each gold row, find best matching agent row:
Gold row 1: (Engineering, 65)
vs Agent row 1: (Marketing, 52) → 0/2 match = 0.0
vs Agent row 2: (Engineering, 65) → 2/2 match = 1.0 ✅
vs Agent row 3: (Sales, 58) → 0/2 match = 0.0
Best match: 1.0
Gold row 2: (Sales, 58)
Best match: 1.0 (agent row 3)
Gold row 3: (Marketing, 52)
Best match: 1.0 (agent row 1)
Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0
```
**Conceptual Meaning**: "You got the right rows, just in the wrong order; that's still mostly correct."
**Value Provided**:
- **Order-invariant**: Catches `ORDER BY` mistakes without penalizing heavily
- **Extra columns forgiven**: If agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, first 2 columns match → 0.67 reward
- **Partial row matches**: Agent got department name right but count wrong → 0.5 reward per row
**Risk Analysis**: ⚠️ **Medium risk**
- **Computationally expensive**: O(M×N) comparisons for M gold rows and N agent rows
- For 100-row results: 10,000 comparisons per reward calculation
- Mitigation: Limit to first 20 rows
- **Ambiguous matching**: If gold has duplicate rows, which agent row should match which?
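**Sketch Implementation** (caps comparisons at 20 rows as suggested; the per-row score divides matched cells by the longer row, so the extra-column case above yields the 2/3 ≈ 0.67 partial credit):
```python
def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Order-invariant row matching: score each gold row by its best agent row."""
    gold_rows = gold_result[:max_rows]     # cap the O(M*N) cost
    agent_rows = agent_result[:max_rows]
    if not gold_rows:
        return 1.0 if not agent_rows else 0.0
    if not agent_rows:
        return 0.0

    def row_score(gold_row, agent_row):
        # Fraction of gold cells found in the agent row, divided by the longer
        # row so extra agent columns still give partial credit
        agent_cells = {str(c).strip().lower() for c in agent_row}
        hits = sum(1 for c in gold_row if str(c).strip().lower() in agent_cells)
        denom = max(len(gold_row), len(agent_row))
        return hits / denom if denom else 1.0

    best_scores = [max(row_score(g, a) for a in agent_rows) for g in gold_rows]
    return sum(best_scores) / len(best_scores)
```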
---
### 🥈 Tier 2: Nice-to-Have Metrics
#### 5. Schema Coverage
**How It Works**:
```
Question: "How many employees in Engineering?"
Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering'
Tables used: {employees}
Agent query: SELECT COUNT(*) FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.name='Engineering'
Tables used: {employees, departments}
Calculation:
intersection = {employees} = 1
union = {employees, departments} = 2
Jaccard = 1/2 = 0.5
Penalty for extra table: 0.1 * 1 = 0.1
Final: 0.5 - 0.1 = 0.4
```
**Conceptual Meaning**: "You're exploring the right part of the database."
**Value Provided**:
- **Exploration guidance**: Early signal before query results are correct
- **Helps with multi-hop**: "You found `employees`, now look at `departments`"
- **Penalizes shotgun approach**: Agent that queries every table gets low reward
**Risk Analysis**: ⚠️ **Medium-high risk**
- **Multiple valid paths**: Simple query might use 1 table, complex query uses 3; both correct
- **Irrelevant table penalty**: What if agent explores `departments` first before finding `employees`? Gets penalized for valid exploration
- **Requires SQL parsing**: Dependency on `sqlparse` library; edge cases in parsing
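**Sketch Implementation** (a rough version that extracts table names with a regex over FROM/JOIN clauses to stay dependency-free; this misses subqueries and aliases, so a real implementation would use a SQL parser such as `sqlparse`. The 0.1 per-extra-table penalty mirrors the worked example; the function name is illustrative):
```python
import re

def schema_coverage_reward(agent_query, gold_query, extra_table_penalty=0.1):
    """Jaccard over referenced tables, minus a small penalty per extra table."""
    def tables(sql):
        # Crude extraction: names following FROM/JOIN; misses subqueries/aliases
        names = re.findall(r'\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)',
                           sql, flags=re.IGNORECASE)
        return {name.lower() for name in names}

    agent_tables = tables(agent_query)
    gold_tables = tables(gold_query)
    if not agent_tables or not gold_tables:
        return 0.0
    jaccard = len(agent_tables & gold_tables) / len(agent_tables | gold_tables)
    extra = len(agent_tables - gold_tables)
    return max(0.0, jaccard - extra_table_penalty * extra)
```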
---
#### 6. Column Alignment
**How It Works**:
```
Question: "List departments and their average salaries"
Gold columns: [department_name, avg_salary]
Agent columns: [dept, average_compensation]
Fuzzy matching:
"department_name" vs "dept"
→ SequenceMatcher = 0.65 (partial match) ✅
"avg_salary" vs "average_compensation"
→ SequenceMatcher = 0.45 (weak match) ❌
Reward: 1/2 columns matched = 0.5
```
**Conceptual Meaning**: "Your columns have the right semantic meaning."
**Value Provided**:
- **Disambiguates multi-column results**: If result has `[42, 100, 5]`, which column is the answer?
- **Catches projection errors**: Agent did `SELECT *` when should've done `SELECT department, COUNT(*)`
- **Fuzzy matching helps**: "dept" matches "department", "emp_id" matches "employee_id"
**Risk Analysis**: ⚠️ **High risk**
- **Requires metadata**: Need to track column names from query results (not always available in raw SQLite)
- **Fuzzy matching noise**: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
- **Aliasing issues**: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`; both mean the same thing
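**Sketch Implementation** (uses `difflib.SequenceMatcher` for the fuzzy match; the 0.6 acceptance threshold and the function name are assumptions, and column names must be captured separately, e.g. from `cursor.description`):
```python
from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns that fuzzy-match some agent column name."""
    if not gold_columns or not agent_columns:
        return 0.0

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    matched = sum(
        1 for gold_col in gold_columns
        if max(similarity(gold_col, agent_col) for agent_col in agent_columns) >= threshold
    )
    return matched / len(gold_columns)
```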
---
#### 7. Rank Correlation
**How It Works**:
```
Question: "Top 5 products by revenue"
Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
Ranks: [1, 2, 3, 4, 5]
Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
Ranks: [1, 3, 2, 5, 4]
Spearman correlation:
Rank differences: [0, -1, +1, -1, +1]
Correlation coefficient: 0.8
Reward: (0.8 + 1) / 2 = 0.9
```
**Conceptual Meaning**: "You got the relative ordering mostly right."
**Value Provided**:
- **Specific to TOP-K queries**: ~20% of Spider questions involve ranking
- **Robust to ties**: Handles "tied for 2nd place" correctly
- **Partial credit for ordering**: Top 3 correct but bottom 2 swapped โ still high reward
**Risk Analysis**: ⚠️ **Medium risk**
- **Limited applicability**: Only works for ordered results
- **Requires scipy**: Heavy dependency just for one metric
- **Rank vs. value confusion**: Agent might get ranking right but values wrong (or vice versa)
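**Sketch Implementation** (uses `scipy.stats.spearmanr`; assumes each row's first cell identifies the ranked entity, compares only entities present in both results, and rescales the coefficient to [0, 1] as in the example; the function name is illustrative):
```python
from scipy.stats import spearmanr

def rank_correlation_reward(agent_result, gold_result):
    """Spearman rank correlation of row order, rescaled from [-1, 1] to [0, 1]."""
    # Key each row by its first cell (assumed to identify the ranked entity)
    gold_rank = {str(row[0]).lower(): i for i, row in enumerate(gold_result)}
    agent_rank = {str(row[0]).lower(): i for i, row in enumerate(agent_result)}
    shared = [key for key in gold_rank if key in agent_rank]
    if len(shared) < 2:
        return 0.0  # correlation is undefined for fewer than two shared rows
    rho, _ = spearmanr([gold_rank[k] for k in shared],
                       [agent_rank[k] for k in shared])
    if rho != rho:  # NaN guard (e.g. constant ranks)
        return 0.0
    return (rho + 1.0) / 2.0
```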
---
### 🥉 Tier 3: Avoid (High Risk, Low Value)
#### 8. SQL Structure Similarity
**How It Works**:
```
Gold query: SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
Agent query: SELECT department, COUNT(*)
FROM employees
GROUP BY department
Structural comparison:
Tables: {employees} vs {employees, departments} → 0.5
Joins: {(employees, departments)} vs {} → 0.0
Aggregates: {COUNT} vs {COUNT} → 1.0
Group By: {d.name} vs {department} → 0.5
Weighted average: 0.5
```
**Conceptual Meaning**: "Your query looks syntactically similar to the gold query."
**Risk Analysis**: 🚫 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Reward hacking**: Agent learns to copy SQL structure without understanding semantics
- **Multiple valid solutions**: Gold uses a JOIN, agent uses a subquery; both are correct, but the structure reward penalizes the subquery anyway
- **Overfitting**: Agent optimizes for "looking like gold query" instead of "getting right answer"
- **Research evidence**: SQL-TRAIL paper found structure-based rewards hurt generalization
---
#### 9. Execution Plan Similarity
**How It Works**:
```
Gold query execution plan:
1. Scan departments (10 rows)
2. Scan employees (1000 rows)
3. Hash join (O(N))
4. Aggregate (O(N))
Agent query execution plan:
1. Scan employees (1000 rows)
2. Nested loop with departments (O(N²))
3. Aggregate (O(N))
Similarity: 2/4 steps similar = 0.5
```
**Conceptual Meaning**: "Your query executes in a similar way."
**Risk Analysis**: 🚫 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Database-specific**: SQLite plans differ from PostgreSQL plans
- **Doesn't guarantee correctness**: Two queries with similar plans can have different results
- **Computationally expensive**: Running EXPLAIN on every query doubles execution time
- **Complexity**: Comparing tree structures is non-trivial
- **No research evidence**: No prior work shows this helps RL training
---
## 🧮 Combining Multiple Metrics into Final Reward
### The Challenge
You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?
```
Current state:
cardinality_score = 0.8
value_overlap_score = 0.6
numeric_range_score = 0.9
row_match_score = 0.7
Need: single_reward = ???
```
---
## 🎯 Method 1: Weighted Average (RECOMMENDED for MVP)
**Formula**:
```python
def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    # Normalize by the weights of the metrics actually present in `scores`,
    # so omitting an inapplicable metric doesn't deflate the reward.
    total_weight = sum(weights[k] for k in scores)
    return sum(scores[k] * weights[k] for k in scores) / total_weight
```
**Example Implementation**:
```python
def compute_progress_reward(agent_result, gold_result, agent_query=None):
# Compute all metrics
scores = {
'cardinality': cardinality_reward(agent_result, gold_result),
'value_overlap': value_overlap_reward(agent_result, gold_result),
'numeric_range': numeric_range_reward(agent_result, gold_result),
'row_match': rowwise_best_match(agent_result, gold_result),
}
# Fixed weights (tune these!)
weights = {
'cardinality': 0.25,
'value_overlap': 0.40, # Highest weight (most universal)
'numeric_range': 0.15,
'row_match': 0.20,
}
return weighted_average_reward(scores, weights)
```
**Pros**:
- ✅ Simple to implement and understand
- ✅ Easy to tune (adjust weights based on training performance)
- ✅ Monotonic (if individual metrics improve, final reward improves)
- ✅ Bounded [0, 1]
**Cons**:
- ⚠️ Fixed weights might not be optimal for all question types
- ⚠️ Treats all metrics as equally important (regardless of context)
**When to use**: Default choice for MVP. Start here.
---
## 🎯 Method 2: Adaptive Weighting by Question Type
**Formula**:
```python
def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics.

    `question_metadata` is assumed to carry the question text, the gold
    query, and the gold result (the key names are illustrative).
    """
    question = question_metadata['question']
    gold_query = question_metadata['gold_query']
    gold_result = question_metadata['gold_result']
    # Detect question type (has_numeric_answer is an assumed helper)
    is_numeric = has_numeric_answer(gold_result)
    is_multirow = len(gold_result) > 1
    is_ordered = "TOP" in question.upper() or "ORDER BY" in gold_query.upper()
    # Adaptive weights (is_ordered could gate a rank-correlation weight)
    weights = {
        'cardinality': 0.25,  # Always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # Only for numeric
        'row_match': 0.20 if is_multirow else 0.10,
    }
    # Normalize so the weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}
    return sum(scores[k] * weights[k] for k in scores)
```
**Example**:
```
Question: "Average salary in Engineering?" (numeric, single-value)
→ weights: cardinality=0.25, value_overlap=0.30, numeric_range=0.35, row_match=0.10
Question: "List all departments with >50 employees" (text, multi-row)
→ weights: cardinality=0.25, value_overlap=0.45, numeric_range=0.0, row_match=0.30
```
**Pros**:
- ✅ More accurate reward signal for different question types
- ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
- ✅ Can tune weights per question type independently
**Cons**:
- ⚠️ More complex to implement (need question type detection)
- ⚠️ More hyperparameters to tune
- ⚠️ Risk of over-engineering
**When to use**: If fixed weighting shows poor performance on specific question types.
---
## 🎯 Method 3: Max Pooling (Optimistic)
**Formula**:
```python
def max_pooling_reward(scores):
"""Take the best metric (optimistic reward)."""
return max(scores.values())
```
**Example**:
```
scores = {
'cardinality': 0.3, # Wrong row count
'value_overlap': 0.8, # Found most values!
'numeric_range': 0.4, # Numbers off
'row_match': 0.5, # Some rows match
}
reward = max(0.3, 0.8, 0.4, 0.5) = 0.8
```
**Conceptual Meaning**: "Give credit for whatever the agent did best."
**Pros**:
- ✅ Very forgiving (agent gets credit for any progress)
- ✅ Encourages diverse exploration strategies
- ✅ Simple to implement
**Cons**:
- 🚫 **Too lenient**: Agent might game the easiest metric
- 🚫 **Non-compositional**: Doesn't reward improving multiple aspects simultaneously
- 🚫 **Unstable gradients**: Reward can jump dramatically between steps
**When to use**: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.
---
## 🎯 Method 4: Minimum Threshold + Average (Strict)
**Formula**:
```python
def threshold_average_reward(scores, thresholds):
"""All metrics must meet threshold; then take average."""
# Check all thresholds
for metric, score in scores.items():
if score < thresholds.get(metric, 0.0):
return 0.0 # Fail if any metric below threshold
# All thresholds met โ return average
return sum(scores.values()) / len(scores)
```
**Example**:
```python
scores = {
'cardinality': 0.9,
'value_overlap': 0.7,
'numeric_range': 0.3, # Below threshold!
}
thresholds = {
'cardinality': 0.5,
'value_overlap': 0.5,
'numeric_range': 0.5,
}
# numeric_range (0.3) < threshold (0.5) → return 0.0
```
**Conceptual Meaning**: "You must do reasonably well on all aspects to get any reward."
**Pros**:
- ✅ Prevents over-optimization of single metric
- ✅ Encourages balanced progress
**Cons**:
- 🚫 **Too strict**: Might be too hard for early training
- 🚫 **Cliff dynamics**: Slight improvement might not change reward at all
- 🚫 **Threshold tuning**: Requires careful calibration
**When to use**: If agent is gaming one metric while ignoring others. Use as penalty mechanism.
---
## 🎯 Method 5: Hierarchical (Coarse-to-Fine)
**Formula**:
```python
def hierarchical_reward(scores):
"""First get cardinality right, then content, then structure."""
# Layer 1: Cardinality (must be >0.5 to proceed)
if scores['cardinality'] < 0.5:
return scores['cardinality'] * 0.3 # Low reward, stuck at layer 1
# Layer 2: Content (must be >0.5 to proceed)
content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
if content_score < 0.5:
return 0.3 + content_score * 0.4 # Medium reward, stuck at layer 2
# Layer 3: Structure (all metrics combined)
final_score = (
0.2 * scores['cardinality'] +
0.4 * content_score +
0.4 * scores['row_match']
)
return 0.5 + final_score * 0.5 # High reward, layer 3
```
**Conceptual Meaning**: "Learn to get the count right first, then the values, then the structure."
**Visualization**:
```
Reward progression:
0.0 ------> 0.3 ------> 0.5 ------> 1.0
   Cardinality   Content     Structure
     correct     correct      correct
```
**Pros**:
- ✅ Natural curriculum (easier tasks first)
- ✅ Clear progression signal
- ✅ Prevents agent from over-optimizing structure before content
**Cons**:
- ⚠️ More complex logic
- ⚠️ Requires careful threshold tuning
- ⚠️ Might slow down learning if thresholds too strict
**When to use**: If agent learns poorly with flat reward. Provides curriculum learning.
---
## 🎯 Method 6: Product (Multiplicative)
**Formula**:
```python
def product_reward(scores):
"""Multiply all metrics (all must be good)."""
product = 1.0
for score in scores.values():
product *= score
return product
```
**Example**:
```
scores = {
'cardinality': 0.9,
'value_overlap': 0.8,
'numeric_range': 0.7,
}
reward = 0.9 × 0.8 × 0.7 = 0.504
```
**Conceptual Meaning**: "All aspects must be good; weak performance on any metric drags down total."
**Pros**:
- ✅ Encourages balanced improvement
- ✅ Penalizes weak performance on any dimension
**Cons**:
- 🚫 **Too strict**: Single low score (0.1) makes entire reward near zero
- 🚫 **Vanishing gradients**: Product of small numbers becomes very small
- 🚫 **Not bounded predictably**: Can produce very small rewards even for good progress
**When to use**: Rarely. Only if you need extremely strict "all-or-nothing" reward.
---
## 🎯 Method 7: Percentile Aggregation (Robust)
**Formula**:
```python
def percentile_reward(scores, percentile=50):
"""Use median (or other percentile) of all metrics."""
import numpy as np
return np.percentile(list(scores.values()), percentile)
```
**Example**:
```
scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]
percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6
```
**Conceptual Meaning**: "Reward based on typical performance, ignoring outliers."
**Pros**:
- ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
- ✅ Simple to implement
- ✅ Tunable (change percentile to be more/less strict)
**Cons**:
- ⚠️ Less interpretable than weighted average
- ⚠️ Ignores some information (throws away best and worst scores)
**When to use**: If one metric is noisy or unreliable, use median to ignore it.
---
## 📊 Comparison Table: Combination Methods
| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|--------|------------|------------------|------------|-------------------|---------------|
| **Weighted Average** | Low | High | Medium | High | **MVP default** |
| **Adaptive Weighting** | Medium | Medium | High | High | Different question types need different signals |
| **Max Pooling** | Low | Medium | Low | Low | Agent struggling to learn anything |
| **Threshold + Average** | Medium | High | Medium | Medium | Agent gaming one metric |
| **Hierarchical** | High | High | High | Medium | Want curriculum learning |
| **Product** | Low | Low | Low | Low | All aspects must be perfect (rare) |
| **Percentile** | Low | Low | High | High | One metric is noisy/unreliable |
---
## 🔬 Recommended Implementation Strategy
### Phase 1: Start Simple (Week 1)
```python
def compute_progress_reward(agent_result, gold_result):
"""Initial implementation: weighted average of 3 metrics."""
scores = {
'cardinality': cardinality_reward(agent_result, gold_result),
'value_overlap': value_overlap_reward(agent_result, gold_result),
'numeric_range': numeric_range_reward(agent_result, gold_result),
}
weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
return weighted_average_reward(scores, weights)
```
**Why**: Simple, interpretable, easy to debug.
---
### Phase 2: Add Context (Week 2, if needed)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
"""Adaptive weighting based on question type."""
scores = {
'cardinality': cardinality_reward(agent_result, gold_result),
'value_overlap': value_overlap_reward(agent_result, gold_result),
'numeric_range': numeric_range_reward(agent_result, gold_result),
'row_match': rowwise_best_match(agent_result, gold_result),
}
# Adapt weights
if question_type == 'numeric':
weights = {'cardinality': 0.2, 'value_overlap': 0.3,
'numeric_range': 0.4, 'row_match': 0.1}
elif question_type == 'multirow':
weights = {'cardinality': 0.25, 'value_overlap': 0.4,
'numeric_range': 0.05, 'row_match': 0.3}
else: # default
weights = {'cardinality': 0.25, 'value_overlap': 0.5,
'numeric_range': 0.15, 'row_match': 0.1}
return weighted_average_reward(scores, weights)
```
**Why**: Improves signal quality without major complexity.
---
### Phase 3: Add Safeguards (Week 3, if agent is gaming)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
"""Weighted average with anti-gaming measures."""
scores = compute_all_scores(agent_result, gold_result)
# Anti-gaming: if cardinality is way off, cap other rewards
if scores['cardinality'] < 0.3:
# Agent is nowhere close on size โ limit credit for content
return scores['cardinality'] * 0.5
# Anti-gaming: if value overlap is low, cap row match
if scores['value_overlap'] < 0.4:
scores['row_match'] *= 0.5 # Penalize structure if content is wrong
# Standard weighted average
return weighted_average_reward(scores, get_weights(question_type))
```
**Why**: Prevents reward hacking while keeping interpretability.
---
## 🧪 How to Validate Your Combination Method
Create test suite:
```python
# Test 1: Perfect match
assert compute_reward(gold, gold) == 1.0
# Test 2: Completely wrong
assert compute_reward(random_result, gold) < 0.2
# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result() # 30% right
result_v2 = more_correct_result() # 60% right
result_v3 = mostly_correct_result() # 90% right
assert compute_reward(result_v1, gold) < compute_reward(result_v2, gold) < compute_reward(result_v3, gold)
# Test 4: Bounded [0, 1]
for _ in range(100):
random_result = generate_random_result()
reward = compute_reward(random_result, gold)
assert 0.0 <= reward <= 1.0
# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_reward(result_format_a, gold) -
compute_reward(result_format_b, gold)) < 0.1 # Allow small difference
```
---
## 💡 Final Recommendation
**For MVP (Phase 1-3)**:
- Use **Method 1: Weighted Average** with metrics #1-3 (cardinality, value overlap, numeric range)
- Fixed weights: `{0.25, 0.50, 0.25}`
**If training shows issues**:
- Add **Method 2: Adaptive Weighting** based on question type
- Add metric #4 (row-wise match)
**If agent games rewards**:
- Add threshold checks from **Method 4**
- Add anti-gaming logic from Phase 3 example
**Never use**:
- Method 3 (Max Pooling) - too gameable
- Method 6 (Product) - too strict, vanishing gradients
- Metrics #8-9 (SQL structure, execution plans) - structure- and plan-based rewards have been found to hurt generalization (see Tier 3)
**Bottom line**: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite. |