# Distance-to-Goal Metrics: Complete Reference Table
## Priority-Ordered Metric Comparison
| Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Benefit | Potential Risks |
|---------|-------------|--------|-------|---------|------------------|-----------|-----------------|
| **1** | **Cardinality Matching** | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust |
| **2** | **Value Overlap (Set-based)** | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results |
| **3** | **Numeric Range Proximity** | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude | Only useful for numeric questions (~40% of dataset) |
| **4** | **Row-wise Best Match** | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving for column mismatch | Computationally expensive for large result sets (100+ rows) |
| **5** | **Schema Coverage** | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access |
| **6** | **Column Alignment** | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy |
| **7** | **Rank Correlation** | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency |
| **8** | **SQL Structure** | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: Reward hacking; overfitting; ignores semantic equivalence |
| **9** | **Execution Plan** | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: Too complex; DB-specific; doesn't guarantee correctness |
---
## Detailed Conceptual Explanations
### 🥇 Tier 1: Must-Have Metrics
#### 1. Cardinality Matching
**How It Works**:
```
Input: agent_result = [row1, row2, row3, row4, row5] (5 rows)
gold_result = [row1, row2, row3] (3 rows)
Calculation:
difference = |5 - 3| = 2
relative_error = 2 / 3 = 0.667
reward = 1 - min(1.0, 0.667) = 0.333
```
**Conceptual Meaning**: "Getting the right number of things is the first step to getting the right things."
**Value Provided**:
- **Early signal**: Before content is right, size can be right
- **Universal**: Works for integers (1 value), lists (N values), tables (M rows)
- **Catches common errors**:
- Missing `GROUP BY` → too few rows
- Cartesian join → too many rows
- Wrong aggregation → count off by orders of magnitude
**Risk Analysis**: ✅ **Very low risk**
- No reward hacking opportunities (agent can't game cardinality without semantic progress)
- Monotonic with correctness (better results = better cardinality)
- Fast to compute (O(1))
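A minimal sketch of how this might look in code (the name `cardinality_reward` matches the combiner examples later in this doc; the clamped relative error is one reasonable normalization, not a prescribed formula):
```python
def cardinality_reward(agent_result, gold_result):
    """Reward in [0, 1] for matching the gold row count (sketch)."""
    n_agent, n_gold = len(agent_result), len(gold_result)
    if n_gold == 0:
        # Assumption: an empty gold result rewards only exact emptiness
        return 1.0 if n_agent == 0 else 0.0
    relative_error = abs(n_agent - n_gold) / n_gold
    return 1.0 - min(1.0, relative_error)
```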
---
#### 2. Value Overlap (Set-based)
**How It Works**:
```
Question: "Which departments have >50 employees?"
Gold: {Engineering, Sales, Marketing}
Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong)
Calculation:
intersection = {Engineering, Sales} = 2 values
union = {Engineering, Sales, Marketing, HR, Legal} = 5 values
Jaccard = 2/5 = 0.4
```
**Conceptual Meaning**: "You found some of the right answers; find the rest."
**Value Provided**:
- **Partial credit**: Agent gets 0.4 reward even though final answer isn't perfect
- **Format agnostic**: Flattens everything to atomic values
- `[("Engineering", 42)]``{Engineering, 42}`
- `[(42, "Engineering")]``{Engineering, 42}` (same!)
- **Compositional**: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0
**Risk Analysis**: ⚠️ **Low-medium risk**
- **Coincidental matches**: If gold answer is `{42}` and agent returns entire `employees` table with 100 rows, might contain `42` by chance
- **Mitigation**: Combine with cardinality (penalizes returning too many values)
- **False precision**: `{42}` vs `{42.0}` vs `{42.001}` might all become `{"42"}` after string conversion
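A sketch, assuming results are lists of tuples of atomic values; stringifying each value keeps the metric format-agnostic but also reproduces the false-precision caveat above:
```python
def value_overlap_reward(agent_result, gold_result):
    """Jaccard similarity over flattened value sets (sketch)."""
    def flatten(rows):
        # str() makes 42 and "42" compare equal -- see the
        # false-precision caveat above
        return {str(v) for row in rows for v in row}

    agent_vals, gold_vals = flatten(agent_result), flatten(gold_result)
    union = agent_vals | gold_vals
    if not union:  # both empty: trivially identical
        return 1.0
    return len(agent_vals & gold_vals) / len(union)
```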
---
#### 3. Numeric Range Proximity
**How It Works**:
```
Question: "Average salary in Engineering?"
Gold: 95000
Agent attempt 1: 87000 (off by 8.4%)
relative_error = 8000 / 95000 = 0.084
reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965
Agent attempt 2: 9500 (off by 90%, wrong order of magnitude)
relative_error = 85500 / 95000 = 0.9
reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721
Agent attempt 3: 950000 (10x too high)
relative_error = 855000 / 95000 = 9.0
reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0
```
**Conceptual Meaning**: "Being off by 10% is very different from being off by 10x."
**Value Provided**:
- **Order-of-magnitude thinking**: Rewards agent for "ballpark correct" before exact
- **Logarithmic scale**:
- 95k → 100k (5% error) = high reward (0.98)
- 95k → 190k (100% error) = medium reward (0.70)
- 95k → 950k (900% error) = no reward (0.0)
- **Natural for SQL**: COUNT, SUM, AVG queries often close but not exact on first try
**Risk Analysis**: ⚠️ **Medium risk**
- **Only useful for ~40% of questions**: Text/categorical answers get no benefit
- **Multiple numbers**: If result has `[42, 100, 5]` and gold is `[42]`, which number to compare?
- Solution: Use closest match for each gold number
- **Zero-handling**: `gold=0, agent=1` is infinitely far; needs special case
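A sketch implementing the log-scaled reward, using the closest-match pairing and zero special case noted above (the flattening rule is an assumption):
```python
import math

def numeric_range_reward(agent_result, gold_result):
    """Log-scaled proximity for numeric answers (sketch)."""
    def numbers(rows):
        return [float(v) for row in rows for v in row
                if isinstance(v, (int, float))]

    gold_nums, agent_nums = numbers(gold_result), numbers(agent_result)
    if not gold_nums or not agent_nums:
        return 0.0
    per_number = []
    for g in gold_nums:
        if g == 0:
            # Zero gold value: relative error is undefined, so fall
            # back to absolute distance (assumption)
            err = min(abs(a) for a in agent_nums)
        else:
            # Compare against the closest agent number
            err = min(abs(a - g) / abs(g) for a in agent_nums)
        per_number.append(max(0.0, 1.0 - math.log10(1.0 + err)))
    return sum(per_number) / len(per_number)
```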
---
#### 4. Row-wise Best Match
**How It Works**:
```
Question: "Top 3 departments by size"
Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)]
Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order!
Process:
For each gold row, find best matching agent row:
Gold row 1: (Engineering, 65)
vs Agent row 1: (Marketing, 52) → 0/2 match = 0.0
vs Agent row 2: (Engineering, 65) → 2/2 match = 1.0 ✓
vs Agent row 3: (Sales, 58) → 0/2 match = 0.0
Best match: 1.0
Gold row 2: (Sales, 58)
Best match: 1.0 (agent row 3)
Gold row 3: (Marketing, 52)
Best match: 1.0 (agent row 1)
Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0
```
**Conceptual Meaning**: "You got the right rows, just in the wrong order—that's still mostly correct."
**Value Provided**:
- **Order-invariant**: Catches `ORDER BY` mistakes without penalizing heavily
- **Extra columns forgiven**: If agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, first 2 columns match → 0.67 reward
- **Partial row matches**: Agent got department name right but count wrong → 0.5 reward per row
**Risk Analysis**: ⚠️ **Medium risk**
- **Computationally expensive**: O(M×N) comparisons for M gold rows and N agent rows
- For 100-row results: 10,000 comparisons per reward calculation
- Mitigation: Limit to first 20 rows
- **Ambiguous matching**: If gold has duplicate rows, which agent row should match which?
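A sketch of the pairing loop; normalizing each row score by the longer of the two rows reproduces both worked examples above (2/2 = 1.0 for a full match, 2/3 ≈ 0.67 with one extra column), and `max_rows` implements the cost mitigation:
```python
def rowwise_best_match(agent_result, gold_result, max_rows=20):
    """Order-invariant row matching (sketch, O(M*N) capped)."""
    gold = gold_result[:max_rows]
    agent = agent_result[:max_rows]
    if not gold:
        return 1.0 if not agent else 0.0

    def row_similarity(g_row, a_row):
        # Position-free cell comparison, normalized by the longer row
        a_vals = {str(v) for v in a_row}
        hits = sum(1 for v in g_row if str(v) in a_vals)
        return hits / max(len(g_row), len(a_row), 1)

    best = [max((row_similarity(g, a) for a in agent), default=0.0)
            for g in gold]
    return sum(best) / len(best)
```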
---
### 🥈 Tier 2: Nice-to-Have Metrics
#### 5. Schema Coverage
**How It Works**:
```
Question: "How many employees in Engineering?"
Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering'
Tables used: {employees}
Agent query: SELECT COUNT(*) FROM employees e
JOIN departments d ON e.dept_id = d.id
WHERE d.name='Engineering'
Tables used: {employees, departments}
Calculation:
intersection = {employees} = 1
union = {employees, departments} = 2
Jaccard = 1/2 = 0.5
Penalty for extra table: 0.1 * 1 = 0.1
Final: 0.5 - 0.1 = 0.4
```
**Conceptual Meaning**: "You're exploring the right part of the database."
**Value Provided**:
- **Exploration guidance**: Early signal before query results are correct
- **Helps with multi-hop**: "You found `employees`, now look at `departments`"
- **Penalizes shotgun approach**: Agent that queries every table gets low reward
**Risk Analysis**: ⚠️ **Medium-high risk**
- **Multiple valid paths**: Simple query might use 1 table, complex query uses 3—both correct
- **Irrelevant table penalty**: What if agent explores `departments` first before finding `employees`? Gets penalized for valid exploration
- **Requires SQL parsing**: Dependency on `sqlparse` library; edge cases in parsing
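A sketch that sidesteps a full SQL parser: a regex grabs identifiers after FROM/JOIN, which misses subqueries, CTEs, and quoted names (a real version would use `sqlparse` as noted above; the 0.1 penalty matches the worked example):
```python
import re

def tables_used(query):
    """Crude table extraction (sketch): identifiers after FROM/JOIN."""
    return {m.lower() for m in re.findall(
        r'\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)',
        query, flags=re.IGNORECASE)}

def schema_coverage_reward(agent_query, gold_query, extra_penalty=0.1):
    """Jaccard over tables used, minus a per-extra-table penalty."""
    agent_t, gold_t = tables_used(agent_query), tables_used(gold_query)
    union = agent_t | gold_t
    if not union:
        return 0.0
    jaccard = len(agent_t & gold_t) / len(union)
    return max(0.0, jaccard - extra_penalty * len(agent_t - gold_t))
```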
---
#### 6. Column Alignment
**How It Works**:
```
Question: "List departments and their average salaries"
Gold columns: [department_name, avg_salary]
Agent columns: [dept, average_compensation]
Fuzzy matching:
"department_name" vs "dept"
→ SequenceMatcher = 0.65 (partial match) ✓
"avg_salary" vs "average_compensation"
→ SequenceMatcher = 0.45 (weak match) ✗
Reward: 1/2 columns matched = 0.5
```
**Conceptual Meaning**: "Your columns have the right semantic meaning."
**Value Provided**:
- **Disambiguates multi-column results**: If result has `[42, 100, 5]`, which column is the answer?
- **Catches projection errors**: Agent did `SELECT *` when it should have done `SELECT department, COUNT(*)`
- **Fuzzy matching helps**: "dept" matches "department", "emp_id" matches "employee_id"
**Risk Analysis**: ⚠️ **High risk**
- **Requires metadata**: Need to track column names from query results (not always available in raw SQLite)
- **Fuzzy matching noise**: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity)
- **Aliasing issues**: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`—both mean the same thing
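A sketch using `difflib.SequenceMatcher` from the standard library; the 0.6 match threshold is an assumption, and as noted above the ratios are noisy:
```python
from difflib import SequenceMatcher

def column_alignment_reward(agent_columns, gold_columns, threshold=0.6):
    """Fraction of gold columns with a fuzzy agent match (sketch)."""
    if not gold_columns:
        return 1.0

    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    matched = sum(
        1 for g in gold_columns
        if agent_columns
        and max(similarity(g, a) for a in agent_columns) >= threshold
    )
    return matched / len(gold_columns)
```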
---
#### 7. Rank Correlation
**How It Works**:
```
Question: "Top 5 products by revenue"
Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K]
Ranks: [1, 2, 3, 4, 5]
Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K]
Ranks: [1, 3, 2, 5, 4]
Spearman correlation:
Rank differences d: [0, -1, +1, -1, +1] → Σd² = 4
rho = 1 - 6Σd² / (n(n² - 1)) = 1 - 24/120 = 0.8
Reward: (0.8 + 1) / 2 = 0.9
```
**Conceptual Meaning**: "You got the relative ordering mostly right."
**Value Provided**:
- **Specific to TOP-K queries**: ~20% of Spider questions involve ranking
- **Robust to ties**: Handles "tied for 2nd place" correctly
- **Partial credit for ordering**: Top 3 correct but bottom 2 swapped → still high reward
**Risk Analysis**: ⚠️ **Medium risk**
- **Limited applicability**: Only works for ordered results
- **Requires scipy**: Heavy dependency just for one metric
- **Rank vs. value confusion**: Agent might get ranking right but values wrong (or vice versa)
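A sketch built on `scipy.stats.spearmanr`; it assumes rows are keyed by their first column and correlates only items present in both results (missing items should be penalized by value overlap instead):
```python
from scipy.stats import spearmanr

def rank_correlation_reward(agent_rows, gold_rows):
    """Spearman-based ordering reward, mapped to [0, 1] (sketch)."""
    gold_keys = [row[0] for row in gold_rows]
    agent_pos = {row[0]: i for i, row in enumerate(agent_rows)}
    shared = [k for k in gold_keys if k in agent_pos]
    if len(shared) < 2:  # correlation needs at least two shared items
        return 0.0
    gold_ranks = [gold_keys.index(k) for k in shared]
    agent_ranks = [agent_pos[k] for k in shared]
    rho, _ = spearmanr(gold_ranks, agent_ranks)
    return (rho + 1.0) / 2.0  # map [-1, 1] onto [0, 1]
```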
---
### 🥉 Tier 3: Avoid (High Risk, Low Value)
#### 8. SQL Structure Similarity
**How It Works**:
```
Gold query: SELECT d.name, COUNT(*)
FROM employees e
JOIN departments d ON e.dept_id = d.id
GROUP BY d.name
Agent query: SELECT department, COUNT(*)
FROM employees
GROUP BY department
Structural comparison:
Tables: {employees} vs {employees, departments} → 0.5
Joins: {(employees, departments)} vs {} → 0.0
Aggregates: {COUNT} vs {COUNT} → 1.0
Group By: {d.name} vs {department} → 0.5
Weighted average: 0.5
```
**Conceptual Meaning**: "Your query looks syntactically similar to the gold query."
**Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Reward hacking**: Agent learns to copy SQL structure without understanding semantics
- **Multiple valid solutions**: Gold uses a JOIN, agent uses a subquery; both are correct, but the structure reward penalizes the subquery
- **Overfitting**: Agent optimizes for "looking like gold query" instead of "getting right answer"
- **Research evidence**: SQL-TRAIL paper found structure-based rewards hurt generalization
---
#### 9. Execution Plan Similarity
**How It Works**:
```
Gold query execution plan:
1. Scan departments (10 rows)
2. Scan employees (1000 rows)
3. Hash join (O(N))
4. Aggregate (O(N))
Agent query execution plan:
1. Scan employees (1000 rows)
2. Nested loop with departments (O(N²))
3. Aggregate (O(N))
Similarity: 2/4 steps similar = 0.5
```
**Conceptual Meaning**: "Your query executes in a similar way."
**Risk Analysis**: 🛑 **VERY HIGH RISK - DO NOT IMPLEMENT**
- **Database-specific**: SQLite plans differ from PostgreSQL plans
- **Doesn't guarantee correctness**: Two queries with similar plans can have different results
- **Computationally expensive**: Running EXPLAIN on every query doubles execution time
- **Complexity**: Comparing tree structures is non-trivial
- **No research evidence**: No prior work shows this helps RL training
---
## 🧮 Combining Multiple Metrics into Final Reward
### The Challenge
You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward?
```
Current state:
cardinality_score = 0.8
value_overlap_score = 0.6
numeric_range_score = 0.9
row_match_score = 0.7
Need: single_reward = ???
```
---
## 🎯 Method 1: Weighted Average (RECOMMENDED for MVP)
**Formula**:
```python
def weighted_average_reward(scores, weights):
    """Simple weighted average of applicable metrics."""
    total_weight = sum(weights.values())
    return sum(scores[k] * weights[k] for k in scores) / total_weight
```
**Example Implementation**:
```python
def compute_progress_reward(agent_result, gold_result, agent_query=None):
    # Compute all metrics
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Fixed weights (tune these!)
    weights = {
        'cardinality': 0.25,
        'value_overlap': 0.40,  # highest weight (most universal)
        'numeric_range': 0.15,
        'row_match': 0.20,
    }
    return weighted_average_reward(scores, weights)
```
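Plugging in the scores from "The Challenge" above:
```python
scores = {'cardinality': 0.8, 'value_overlap': 0.6,
          'numeric_range': 0.9, 'row_match': 0.7}
weights = {'cardinality': 0.25, 'value_overlap': 0.40,
           'numeric_range': 0.15, 'row_match': 0.20}
print(weighted_average_reward(scores, weights))  # ≈ 0.715
```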
**Pros**:
- ✅ Simple to implement and understand
- ✅ Easy to tune (adjust weights based on training performance)
- ✅ Monotonic (if individual metrics improve, final reward improves)
- ✅ Bounded [0, 1]
**Cons**:
- ⚠️ Fixed weights might not be optimal for all question types
- ⚠️ Treats all metrics as equally important (regardless of context)
**When to use**: Default choice for MVP. Start here.
---
## 🎯 Method 2: Adaptive Weighting by Question Type
**Formula**:
```python
def adaptive_weighted_reward(scores, question_metadata):
    """Adjust weights based on question characteristics."""
    # Detect question type from metadata (gold result, question text, gold SQL)
    is_numeric = has_numeric_answer(question_metadata['gold_result'])
    is_multirow = len(question_metadata['gold_result']) > 1
    is_ordered = ("TOP" in question_metadata['question'].upper()
                  or "ORDER BY" in question_metadata['gold_query'].upper())
    # Adaptive weights (is_ordered could gate a rank-correlation term, metric #7)
    weights = {
        'cardinality': 0.25,  # always important
        'value_overlap': 0.40 if not is_numeric else 0.30,
        'numeric_range': 0.30 if is_numeric else 0.0,  # only for numeric
        'row_match': 0.20 if is_multirow else 0.10,
    }
    # Normalize so weights sum to 1
    total = sum(weights.values())
    weights = {k: v / total for k, v in weights.items()}
    return sum(scores[k] * weights[k] for k in scores)
```
**Example**:
```
Question: "Average salary in Engineering?" (numeric, single-value)
→ normalized weights ≈ cardinality=0.26, value_overlap=0.32, numeric_range=0.32, row_match=0.11
Question: "List all departments with >50 employees" (text, multi-row)
→ normalized weights ≈ cardinality=0.29, value_overlap=0.47, numeric_range=0.0, row_match=0.24
```
**Pros**:
- ✅ More accurate reward signal for different question types
- ✅ Automatically disables irrelevant metrics (e.g., numeric_range for text questions)
- ✅ Can tune weights per question type independently
**Cons**:
- ⚠️ More complex to implement (need question type detection)
- ⚠️ More hyperparameters to tune
- ⚠️ Risk of over-engineering
**When to use**: If fixed weighting shows poor performance on specific question types.
---
## 🎯 Method 3: Max Pooling (Optimistic)
**Formula**:
```python
def max_pooling_reward(scores):
    """Take the best metric (optimistic reward)."""
    return max(scores.values())
```
**Example**:
```
scores = {
    'cardinality': 0.3,    # wrong row count
    'value_overlap': 0.8,  # found most values!
    'numeric_range': 0.4,  # numbers off
    'row_match': 0.5,      # some rows match
}
reward = max(0.3, 0.8, 0.4, 0.5) = 0.8
```
**Conceptual Meaning**: "Give credit for whatever the agent did best."
**Pros**:
- ✅ Very forgiving (agent gets credit for any progress)
- ✅ Encourages diverse exploration strategies
- ✅ Simple to implement
**Cons**:
- 🛑 **Too lenient**: Agent might game easiest metric
- 🛑 **Non-compositional**: Doesn't reward improving multiple aspects simultaneously
- 🛑 **Unstable gradients**: Reward can jump dramatically between steps
**When to use**: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average.
---
## 🎯 Method 4: Minimum Threshold + Average (Strict)
**Formula**:
```python
def threshold_average_reward(scores, thresholds):
    """All metrics must meet threshold; then take average."""
    # Check all thresholds
    for metric, score in scores.items():
        if score < thresholds.get(metric, 0.0):
            return 0.0  # fail if any metric is below its threshold
    # All thresholds met → return average
    return sum(scores.values()) / len(scores)
```
**Example**:
```python
scores = {
    'cardinality': 0.9,
    'value_overlap': 0.7,
    'numeric_range': 0.3,  # below threshold!
}
thresholds = {
    'cardinality': 0.5,
    'value_overlap': 0.5,
    'numeric_range': 0.5,
}
# numeric_range (0.3) < threshold (0.5) → return 0.0
```
**Conceptual Meaning**: "You must do reasonably well on all aspects to get any reward."
**Pros**:
- ✅ Prevents over-optimization of single metric
- ✅ Encourages balanced progress
**Cons**:
- 🛑 **Too strict**: Might be too hard for early training
- 🛑 **Cliff dynamics**: Slight improvement might not change reward at all
- 🛑 **Threshold tuning**: Requires careful calibration
**When to use**: If agent is gaming one metric while ignoring others. Use as penalty mechanism.
---
## 🎯 Method 5: Hierarchical (Coarse-to-Fine)
**Formula**:
```python
def hierarchical_reward(scores):
    """First get cardinality right, then content, then structure."""
    # Layer 1: cardinality (must be > 0.5 to proceed)
    if scores['cardinality'] < 0.5:
        return scores['cardinality'] * 0.3  # low reward, stuck at layer 1
    # Layer 2: content (must be > 0.5 to proceed)
    content_score = (scores['value_overlap'] + scores['numeric_range']) / 2
    if content_score < 0.5:
        return 0.3 + content_score * 0.4  # medium reward, stuck at layer 2
    # Layer 3: structure (all metrics combined)
    final_score = (
        0.2 * scores['cardinality'] +
        0.4 * content_score +
        0.4 * scores['row_match']
    )
    return 0.5 + final_score * 0.5  # high reward, layer 3
```
**Conceptual Meaning**: "Learn to get the count right first, then the values, then the structure."
**Visualization**:
```
Reward bands produced by the code above:
  0.00 – 0.15   stuck at layer 1 (cardinality wrong)
  0.30 – 0.50   stuck at layer 2 (content wrong)
  0.50 – 1.00   layer 3 (all metrics blended)
```
**Pros**:
- ✅ Natural curriculum (easier tasks first)
- ✅ Clear progression signal
- ✅ Prevents agent from over-optimizing structure before content
**Cons**:
- ⚠️ More complex logic
- ⚠️ Requires careful threshold tuning
- ⚠️ Might slow down learning if thresholds too strict
**When to use**: If agent learns poorly with flat reward. Provides curriculum learning.
---
## 🎯 Method 6: Product (Multiplicative)
**Formula**:
```python
def product_reward(scores):
    """Multiply all metrics (all must be good)."""
    product = 1.0
    for score in scores.values():
        product *= score
    return product
```
**Example**:
```
scores = {
'cardinality': 0.9,
'value_overlap': 0.8,
'numeric_range': 0.7,
}
reward = 0.9 × 0.8 × 0.7 = 0.504
```
**Conceptual Meaning**: "All aspects must be good; weak performance on any metric drags down total."
**Pros**:
- ✅ Encourages balanced improvement
- ✅ Penalizes weak performance on any dimension
**Cons**:
- 🛑 **Too strict**: Single low score (0.1) makes entire reward near zero
- 🛑 **Vanishing gradients**: Product of small numbers becomes very small
- 🛑 **Compresses toward zero**: Even the solid scores in the example above (0.9, 0.8, 0.7) yield only 0.504
**When to use**: Rarely. Only if you need extremely strict "all-or-nothing" reward.
---
## 🎯 Method 7: Percentile Aggregation (Robust)
**Formula**:
```python
import numpy as np

def percentile_reward(scores, percentile=50):
    """Use median (or another percentile) of all metrics."""
    return np.percentile(list(scores.values()), percentile)
```
**Example**:
```
scores = [0.9, 0.8, 0.3, 0.7, 0.6]
sorted = [0.3, 0.6, 0.7, 0.8, 0.9]
percentile_50 (median) = 0.7
percentile_75 = 0.8
percentile_25 = 0.6
```
**Conceptual Meaning**: "Reward based on typical performance, ignoring outliers."
**Pros**:
- ✅ Robust to outlier metrics (one very low or very high score doesn't dominate)
- ✅ Simple to implement
- ✅ Tunable (change percentile to be more/less strict)
**Cons**:
- ⚠️ Less interpretable than weighted average
- ⚠️ Ignores some information (throws away best and worst scores)
**When to use**: If one metric is noisy or unreliable, use median to ignore it.
---
## 📊 Comparison Table: Combination Methods
| Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case |
|--------|------------|------------------|------------|-------------------|---------------|
| **Weighted Average** | Low | High | Medium | High | **MVP default** |
| **Adaptive Weighting** | Medium | Medium | High | High | Different question types need different signals |
| **Max Pooling** | Low | Medium | Low | Low | Agent struggling to learn anything |
| **Threshold + Average** | Medium | High | Medium | Medium | Agent gaming one metric |
| **Hierarchical** | High | High | High | Medium | Want curriculum learning |
| **Product** | Low | Low | Low | Low | All aspects must be perfect (rare) |
| **Percentile** | Low | Low | High | High | One metric is noisy/unreliable |
---
## 🎬 Recommended Implementation Strategy
### Phase 1: Start Simple (Week 1)
```python
def compute_progress_reward(agent_result, gold_result):
    """Initial implementation: weighted average of 3 metrics."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
    }
    weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25}
    return weighted_average_reward(scores, weights)
```
**Why**: Simple, interpretable, easy to debug.
---
### Phase 2: Add Context (Week 2, if needed)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
    """Adaptive weighting based on question type."""
    scores = {
        'cardinality': cardinality_reward(agent_result, gold_result),
        'value_overlap': value_overlap_reward(agent_result, gold_result),
        'numeric_range': numeric_range_reward(agent_result, gold_result),
        'row_match': rowwise_best_match(agent_result, gold_result),
    }
    # Adapt weights
    if question_type == 'numeric':
        weights = {'cardinality': 0.2, 'value_overlap': 0.3,
                   'numeric_range': 0.4, 'row_match': 0.1}
    elif question_type == 'multirow':
        weights = {'cardinality': 0.25, 'value_overlap': 0.4,
                   'numeric_range': 0.05, 'row_match': 0.3}
    else:  # default
        weights = {'cardinality': 0.25, 'value_overlap': 0.5,
                   'numeric_range': 0.15, 'row_match': 0.1}
    return weighted_average_reward(scores, weights)
```
**Why**: Improves signal quality without major complexity.
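Phase 2 assumes a `question_type` label. A minimal heuristic to derive it from the gold result (the helper name and rules are assumptions, not part of the pipeline above):
```python
def infer_question_type(gold_result):
    """Heuristic question-type detection from the gold result (sketch)."""
    flat = [v for row in gold_result for v in row]
    if len(gold_result) > 1:
        return 'multirow'
    if flat and all(isinstance(v, (int, float)) for v in flat):
        return 'numeric'
    return 'default'
```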
---
### Phase 3: Add Safeguards (Week 3, if agent is gaming)
```python
def compute_progress_reward(agent_result, gold_result, question_type):
    """Weighted average with anti-gaming measures."""
    scores = compute_all_scores(agent_result, gold_result)
    # Anti-gaming: if cardinality is way off, cap the overall reward
    if scores['cardinality'] < 0.3:
        # Agent is nowhere close on size → limit credit for content
        return scores['cardinality'] * 0.5
    # Anti-gaming: if value overlap is low, cap row match
    if scores['value_overlap'] < 0.4:
        scores['row_match'] *= 0.5  # penalize structure if content is wrong
    # Standard weighted average
    return weighted_average_reward(scores, get_weights(question_type))
```
**Why**: Prevents reward hacking while keeping interpretability.
---
## 🧪 How to Validate Your Combination Method
Create test suite:
```python
# Test 1: Perfect match
assert compute_reward(gold, gold) == 1.0

# Test 2: Completely wrong
assert compute_reward(random_result, gold) < 0.2

# Test 3: Monotonicity (better result → higher reward)
result_v1 = partially_correct_result()  # 30% right
result_v2 = more_correct_result()       # 60% right
result_v3 = mostly_correct_result()     # 90% right
assert (compute_reward(result_v1, gold)
        < compute_reward(result_v2, gold)
        < compute_reward(result_v3, gold))

# Test 4: Bounded [0, 1]
for _ in range(100):
    random_result = generate_random_result()
    reward = compute_reward(random_result, gold)
    assert 0.0 <= reward <= 1.0

# Test 5: Insensitive to format (same values, different structure)
result_format_a = [("Engineering", 42)]
result_format_b = [(42, "Engineering")]
assert abs(compute_reward(result_format_a, gold)
           - compute_reward(result_format_b, gold)) < 0.1  # allow small difference
```
---
## 💡 Final Recommendation
**For MVP (Phase 1-3)**:
- Use **Method 1: Weighted Average** with metrics #1-3 (cardinality, value overlap, numeric range)
- Fixed weights: `{0.25, 0.50, 0.25}`
**If training shows issues**:
- Add **Method 2: Adaptive Weighting** based on question type
- Add metric #4 (row-wise match)
**If agent games rewards**:
- Add threshold checks from **Method 4**
- Add anti-gaming logic from Phase 3 example
**Never use**:
- Method 3 (Max Pooling) - too gameable
- Method 6 (Product) - too strict, vanishing gradients
- Metrics #8-9 (SQL structure, execution plans) - research shows they hurt
**Bottom line**: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.