SQLEnv: Interactive Database Query Environment (v1 Spec)
OpenEnv Challenge Submission Design & MVP Development Plan
1. What We're Building & Delivering
1.1 One-Sentence Summary
SQLEnv is an RL environment where an agent answers natural language questions about databases through iterative SQL exploration (schema inspection, exploratory queries, result observation, and refinement) before submitting a final answer.
1.2 Why It's Novel
No RL environment for interactive SQL exists. Text-to-SQL benchmarks (Spider, BIRD, WikiSQL) are static single-shot evaluations. SQLEnv transforms this into a multi-turn exploration problem where agents develop query strategies through trial, error, and feedback. This maps directly to how real data analysts work.
1.3 Submission Artifacts (Mapped to Evaluation Criteria)
| Evaluation Criterion | Artifact | Description |
|---|---|---|
| Creative and Robust use of OpenEnv | SQLEnv environment on HF Hub | Full environment: models, server, client, Dockerfile, openenv.yaml |
| Technical Excellence | 3-layer reward architecture + multi-type answer verification | Dense stepwise reward (not just terminal), typed answer comparison |
| Story-telling | HuggingFace blog post | "Teaching AI to think like a data analyst": untrained vs. trained agent |
| Open Source Demo | Training notebooks/scripts on GitHub | GRPO training script (TRL-compatible), baseline comparison |
| Green Agent wrapper | Green Agent class | Automated evaluation wrapper following OpenEnv pattern |
1.4 What Exactly Gets Submitted
- HF Hub Space: Docker container running SQLEnv server (WebSocket API)
- GitHub repo: Environment source + GRPO training notebook + results
- HF Blog post: Narrative + learning curves + side-by-side demo (untrained vs. trained)
2. Environment Design
2.1 OpenEnv Integration Architecture
envs/sql_env/
├── __init__.py
├── models.py          # SQLAction, SQLObservation (Pydantic)
├── client.py          # SQLEnv(EnvClient) - WebSocket client
├── openenv.yaml       # Environment manifest
├── pyproject.toml
├── uv.lock
├── data/              # SQLite databases + question sets
│   ├── databases/     # Spider DB files
│   └── questions/     # JSON question sets with gold answers
└── server/
    ├── __init__.py
    ├── app.py         # create_app(SQLEnvironment, SQLAction, SQLObservation)
    ├── environment.py # SQLEnvironment(Environment) - core logic
    ├── reward.py      # Reward computation (3 layers)
    ├── verifier.py    # Answer comparison (multi-type)
    ├── requirements.txt
    └── Dockerfile
2.2 Pydantic Models
# models.py
from pydantic import Field
from openenv.core.env_server.types import Action, Observation, State


class SQLAction(Action):
    """What the agent sends each step."""

    action_type: str = Field(
        ...,
        description="One of: DESCRIBE, SAMPLE, QUERY, ANSWER",
    )
    argument: str = Field(
        ...,
        description="Table name (for DESCRIBE/SAMPLE), SQL string (for QUERY), or answer value (for ANSWER)",
    )


class SQLObservation(Observation):
    """What the agent receives after each step."""

    # Inherited: done (bool), reward (float | None)
    question: str = Field(..., description="The NL question to answer")
    schema_info: str = Field(..., description="Database schema description")
    result: str = Field(default="", description="Result of the last action (truncated)")
    error: str = Field(default="", description="Error message if action failed")
    step_count: int = Field(default=0, description="Current step number")
    budget_remaining: int = Field(default=0, description="Steps left before timeout")
    action_history: list[str] = Field(
        default_factory=list,
        description="Summary of previous actions taken",
    )
Design note: result is a string, not raw data. Results are always truncated/summarized (max N rows as formatted text). This is intentional: the agent sees "what a real analyst would see", not the full database. This makes the environment a POMDP, which is appropriate for the task and beneficial for learning dynamics (see Section 3).
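For concreteness, a minimal sketch of that truncation step (the helper name and layout are illustrative, not part of the spec):

# Hypothetical helper: render query results the way an analyst's console would.
def format_result(columns: list[str], rows: list[tuple], max_rows: int = 20) -> str:
    header = " | ".join(columns)
    body = "\n".join(" | ".join(str(v) for v in row) for row in rows[:max_rows])
    hidden = len(rows) - max_rows
    suffix = f"\n... ({hidden} more rows)" if hidden > 0 else ""
    return f"{header}\n{body}{suffix}"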
2.3 State
Uses the core State class from OpenEnv (episode_id + step_count). No custom state needed for MVP.
2.4 Action Space
| Action | Argument | Effect | Cost |
|---|---|---|---|
| DESCRIBE | table_name | Returns column names, types, row count | 1 step |
| SAMPLE | table_name | Returns 5 random rows (formatted text) | 1 step |
| QUERY | sql_string | Executes SQL, returns truncated results (max 20 rows) | 1 step |
| ANSWER | value | Submits final answer, ends episode | 0 steps (terminal) |
Step budget: 15 steps per episode (configurable). This is enough for 2-3 exploration actions + 3-5 query attempts + answer. Keeps episodes short enough for efficient training.
Query sandboxing: All SQL runs in a read-only SQLite connection with a statement timeout (5 seconds). Only SELECT statements allowed. No writes, no DDL, no pragmas.
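A sketch of that sandbox, assuming SQLite's read-only URI mode plus a progress handler as a coarse statement timeout (the deadline here is set per connection rather than per statement, a simplification; a connection would be opened per episode):

import sqlite3
import time

def open_sandboxed(db_path: str, timeout_s: float = 5.0) -> sqlite3.Connection:
    # Read-only at the filesystem level; PRAGMA query_only also rejects writes/DDL.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.execute("PRAGMA query_only = ON")
    deadline = time.monotonic() + timeout_s
    # The handler fires every N VM instructions; returning truthy aborts the statement.
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10_000)
    return conn

def run_select(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    # Crude allow-list: the QUERY handler only accepts SELECT statements.
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("Only SELECT statements are allowed")
    return conn.execute(sql).fetchall()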
2.5 Episode Lifecycle
reset()
 ├─ pick a random question from the question set
 ├─ load the corresponding SQLite database (read-only)
 └─ return the initial observation:
      question, schema_info (table names only), budget=15
          │
          ▼
step(action) loop
 ├─ DESCRIBE → return columns/types for the requested table
 ├─ SAMPLE   → return 5 random rows from the requested table
 ├─ QUERY    → execute SQL, return truncated result or error
 └─ ANSWER   → compare to gold, compute terminal reward, done=True

 Each non-ANSWER step: compute the stepwise reward and decrement the budget.
 If budget=0 and no ANSWER: done=True, reward=0.
Important: On reset(), the agent sees only table names in schema_info, not full column details. The agent must actively DESCRIBE tables to learn the schema. This is a deliberate design choice: it creates an exploration incentive and mirrors real-world "I have a database, what's in it?" workflows.
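For concreteness, a minimal dispatch sketch for step() (the private handler names are assumptions; each handler raises on bad input, which surfaces as the observation's error field):

# Sketch of the action dispatch inside SQLEnvironment.step().
def _dispatch(self, action: SQLAction) -> tuple[str, str]:
    """Returns (result, error) strings for the next observation."""
    try:
        match action.action_type:
            case "DESCRIBE":
                return self._describe_table(action.argument), ""
            case "SAMPLE":
                return self._sample_rows(action.argument), ""
            case "QUERY":
                return self._run_query(action.argument), ""
            case "ANSWER":
                return self._finish_episode(action.argument), ""
            case _:
                return "", f"Unknown action_type: {action.action_type}"
    except Exception as exc:
        return "", str(exc)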
3. Reward Architecture
The reward research (SQL-TRAIL, PaVeRL-SQL, QueryGym, Graph-Reward-SQL) converges on one principle: reward improvement during exploration, but make correctness dominate at termination.
Our reward has 3 layers. The total episode reward is:
R_episode = R_terminal + clamp(sum(R_step_i), -0.2, 0.5), i.e. the summed step rewards are capped at 0.5, with a negative floor of -0.2 (see 3.4)
3.1 Layer 1 β Operational Validity (every step, no gold reference needed)
These signals reward "being able to operate the tool" without referencing the gold answer.
| Signal | Value | When |
|---|---|---|
| r_exec_ok | +0.02 | A QUERY executes without error |
| r_new_info | +0.01 | Action reveals new info (new table described, new column seen). Capped at 0.10 per episode |
| r_repeat | -0.01 | Exact same SQL run again (hash match), or exact same DESCRIBE/SAMPLE repeated |
| r_cost | -0.005 | Every step (small constant, keeps trajectories short) |
Design rationale: Prefer small positive signals + "no reward" over heavy negative penalties. Strong negatives make agents risk-averse and destabilize training (PaVeRL-SQL). The step cost is tiny β just enough to prefer shorter trajectories when everything else is equal.
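A sketch of the Layer 1 bookkeeping, assuming the environment keeps a per-episode set of action hashes and a running count of new-info bonuses (the attribute and function names are hypothetical):

import hashlib

def layer1_reward(self, action: SQLAction, exec_ok: bool, new_info: bool) -> float:
    r = -0.005  # r_cost: small constant per-step cost
    key = hashlib.sha256(f"{action.action_type}:{action.argument}".encode()).hexdigest()
    if key in self.seen_actions:
        r += -0.01  # r_repeat: exact same action taken before
    self.seen_actions.add(key)
    if action.action_type == "QUERY" and exec_ok:
        r += 0.02  # r_exec_ok: query executed without error
    if new_info and self.new_info_count < 10:  # cap r_new_info at 10 * 0.01 = 0.10
        r += 0.01
        self.new_info_count += 1
    return r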
3.2 Layer 2 β Progress-to-Target (QUERY steps only, oracle-based but coarsened)
After each QUERY, we compute how close the query result is to the gold answer, then reward only improvement over the best-so-far.
Progress metric (depends on answer type):
| Answer Type | Progress Computation |
|---|---|
| Integer/Count | 1 - min(1, abs(pred - gold) / max(1, abs(gold))) |
| Float/Average | Same as integer but with tolerance (within 1% = 1.0) |
| String/Name | Exact match = 1.0, else 0.0 (too fragile to partially score) |
| List/Set | Jaccard(pred_set, gold_set), i.e. set overlap |
| Table | 0.5 * column_overlap + 0.5 * row_sample_overlap |
Coarsening: To prevent reward leakage (agent hill-climbing on the reward signal instead of reasoning), we bin progress into 5 levels: {0, 0.25, 0.5, 0.75, 1.0}.
Improvement-only reward:
progress_binned = bin_to_nearest([0, 0.25, 0.5, 0.75, 1.0], progress_raw)
r_progress = max(0, progress_binned - best_progress_so_far) * 0.15
best_progress_so_far = max(best_progress_so_far, progress_binned)
This prevents agents from "farming" reward by oscillating between good and bad queries.
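As runnable code, a sketch (bin_to_nearest picks the closest bin, matching the pseudocode above; best_progress is per-episode state):

def bin_to_nearest(levels: list[float], value: float) -> float:
    # Snap a raw progress value to the closest of the 5 discrete levels.
    return min(levels, key=lambda lv: abs(lv - value))

def progress_reward(self, progress_raw: float) -> float:
    binned = bin_to_nearest([0, 0.25, 0.5, 0.75, 1.0], progress_raw)
    # Reward only improvement over the best progress seen so far this episode.
    r = max(0.0, binned - self.best_progress) * 0.15
    self.best_progress = max(self.best_progress, binned)
    return r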
3.3 Layer 3 β Terminal Correctness (ANSWER action only)
| Condition | Reward |
|---|---|
| Answer matches gold (type-appropriate comparison) | +1.0 |
| Answer does not match | 0.0 |
| Episode times out (budget exhausted, no ANSWER) | 0.0 |
Terminal correctness is always the dominant signal. With the cap on stepwise rewards (0.5 max), a correct answer is always worth at least 2x the maximum exploration reward.
3.4 Total Reward Formula
R_episode = 1.0 * R_correct + clamp(Σ_steps(r_exec_ok + r_new_info + r_repeat + r_cost + r_progress), -0.2, 0.5)
The negative clamp at -0.2 prevents catastrophic negative episodes from destabilizing training.
3.5 How This Maps to TRL's GRPOTrainer
Following the Wordle GRPO tutorial pattern, we expose reward components as separate reward_funcs:
# In the training script
trainer = GRPOTrainer(
    model=model_name,
    processing_class=tokenizer,
    reward_funcs=[
        reward_correctness,  # Terminal: 0.0 or 1.0
        reward_progress,     # Cumulative progress improvement (0 to 0.15)
        reward_operational,  # Sum of exec_ok, new_info, repeat, cost signals
    ],
    train_dataset=dataset,
    args=grpo_config,
    rollout_func=rollout_func,
)
The rollout_func runs a full episode against the SQLEnv server (via WebSocket), collects the observation stream, and computes per-component rewards. This lets TRL handle the weighting/normalization.
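A condensed sketch of the episode-collection core of that rollout (build_prompt and parse_action are assumed helpers around the policy model; the tutorial's actual rollout_func also returns prompt/completion token ids and logprobs, omitted here):

# Plays one episode against the SQLEnv server and splits terminal vs. step reward.
def play_episode(env: SQLEnv, generate) -> dict:
    result = env.reset()
    obs = result.observation
    terminal_r, step_r = 0.0, 0.0
    while not result.done:
        action = parse_action(generate(build_prompt(obs)))
        result = env.step(action)
        obs = result.observation
        if result.done and action.action_type == "ANSWER":
            terminal_r = result.reward or 0.0
        else:
            step_r += result.reward or 0.0
    # Splitting step_r further into progress vs. operational would need the
    # server to report reward components in the observation (an assumption).
    return {"reward_correctness": terminal_r, "reward_steps": step_r}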
3.6 Anti-Gaming Measures
| Risk | Mitigation |
|---|---|
| Agent hill-climbs on progress signal to infer gold answer | Coarsen progress to 5 bins; step rewards are small vs. terminal |
| Agent DESCRIBEs everything to harvest r_new_info | Cap r_new_info at 0.10 per episode |
| Agent submits random answers hoping for partial credit | No partial credit on ANSWER: binary correctness only |
| Agent runs many queries to accumulate r_exec_ok | The 0.5 stepwise cap, step cost, and 15-step budget bound the harvestable reward while burning steps needed to answer |
| Agent repeats identical queries | Hash-based repeat detection, penalty applied |
4. Question Sourcing & Verification
4.1 MVP: Spider Subset
Why Spider: Most-studied text-to-SQL benchmark, gold SQL available for all questions, existing tooling for test suite generation, clean SQLite databases included, well-understood difficulty levels.
MVP question set: 50-100 questions from Spider's dev set, selected to cover:
- Simple (SELECT/WHERE/COUNT): ~40%
- Medium (JOIN/GROUP BY): ~40%
- Hard (subqueries/HAVING/multi-step): ~20%
Answer types to support in MVP:
- Integer (counts, sums)
- Float (averages, with 1% tolerance)
- String (single value lookups)
- List (top-k results, order-insensitive set comparison)
4.2 Answer Verification
def verify_answer(predicted, gold, answer_type: str) -> bool:
    """Type-appropriate comparison of the agent's answer against gold."""
    match answer_type:
        case "integer":
            return int(predicted) == int(gold)
        case "float":
            # Relative tolerance of 1% (absolute for |gold| < 1)
            return abs(float(predicted) - float(gold)) / max(1, abs(float(gold))) < 0.01
        case "string":
            return str(predicted).strip().lower() == str(gold).strip().lower()
        case "list":
            # normalize() canonicalizes items; defined alongside this function
            return set(normalize(predicted)) == set(normalize(gold))
        case "table":
            return compare_tables(predicted, gold)  # Column + row overlap
        case _:
            raise ValueError(f"Unknown answer_type: {answer_type}")
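The normalize helper referenced above could be as simple as this sketch:

def normalize(values) -> list[str]:
    # Accept a list or a comma-separated string; compare case/whitespace-insensitively.
    if isinstance(values, str):
        values = values.split(",")
    return [str(v).strip().lower() for v in values]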
4.3 Question Metadata Format
Each question in the JSON dataset includes:
{
  "id": "spider_dev_042",
  "question": "How many employees are in the Engineering department?",
  "database": "company_db",
  "gold_sql": "SELECT COUNT(*) FROM employees WHERE department = 'Engineering'",
  "gold_answer": "42",
  "answer_type": "integer",
  "difficulty": "easy",
  "tables_involved": ["employees"]
}
The gold_sql is used for progress computation (running the gold query on the DB to get the reference answer). The gold_answer is the cached expected result. Both are hidden from the agent.
4.4 Multi-DB Verification (Post-MVP)
MVP: Single database per question. Accept correct result on the original DB.
Post-MVP: For each question, generate 1-2 variant databases with the same schema but different data. An answer must be correct across all variants.
Variant generation strategies (prioritized):
- Irrelevant row injection: add records outside the question's filter scope
- Join multiplicity trap: add duplicates in bridge tables
- ID remap: apply a bijection to primary keys, update foreign keys
(See Section 6.2 for the full metamorphic testing backlog.)
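As an illustration of the ID remap strategy, a sketch that assumes integer keys and intact referential integrity (a real script would derive fk_refs by walking PRAGMA foreign_key_list):

import random
import sqlite3

def remap_ids(conn: sqlite3.Connection, table: str, pk: str,
              fk_refs: list[tuple[str, str]]) -> None:
    """Apply a random bijection to a primary key and propagate it to referencing columns."""
    ids = [row[0] for row in conn.execute(f"SELECT {pk} FROM {table}")]
    mapping = dict(zip(ids, random.sample(ids, len(ids))))
    offset = max(ids) + 1  # park remapped values above the old range to avoid collisions
    for old in ids:
        conn.execute(f"UPDATE {table} SET {pk} = ? WHERE {pk} = ?",
                     (mapping[old] + offset, old))
        for ref_table, ref_col in fk_refs:
            conn.execute(f"UPDATE {ref_table} SET {ref_col} = ? WHERE {ref_col} = ?",
                         (mapping[old] + offset, old))
    # Shift everything back down into the original key range.
    conn.execute(f"UPDATE {table} SET {pk} = {pk} - {offset}")
    for ref_table, ref_col in fk_refs:
        conn.execute(f"UPDATE {ref_table} SET {ref_col} = {ref_col} - {offset}")
    conn.commit()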
5. MVP Development Track
The goal is to get a working, submittable solution as fast as possible, then improve iteratively. Each phase has a "done when" gate. Do not start the next phase until the current one passes its gate.
Phase 1: Scaffold & Stub (Day 1)
What: Run openenv init sql_env, customize the generated models/server/client stubs to match our Pydantic models. Get a Docker container that starts and responds to reset/step.
Tasks:
- Run openenv init sql_env
- Replace generated models with SQLAction, SQLObservation from Section 2.2
- Implement stub SQLEnvironment.reset() that returns a hardcoded observation
- Implement stub SQLEnvironment.step() that accepts an action and returns a hardcoded observation
- Implement the SQLEnv(EnvClient) client with _step_payload, _parse_result, _parse_state
- Run openenv build and openenv validate
Done when: openenv validate --verbose passes. Client can connect, reset, step, and receive typed observations.
Phase 2: Core Loop with Terminal Reward (Days 2-4)
What: Wire up real SQLite databases, implement the action handlers, add terminal-only reward (binary correctness on ANSWER). This is a submittable environment β sparse reward, but functional.
Tasks:
- Download Spider dev databases (SQLite files) and a curated question set (30-50 questions)
- Implement reset(): pick random question, load DB, return initial observation with table names
- Implement DESCRIBE handler: query sqlite_master + PRAGMA table_info (see the sketch below)
- Implement SAMPLE handler: SELECT * FROM {table} ORDER BY RANDOM() LIMIT 5
- Implement QUERY handler: execute read-only SQL with timeout, truncate results to 20 rows
- Implement ANSWER handler: compare to gold answer using verify_answer(), return reward
- Add budget tracking (15 steps), timeout handling
- SQL sandboxing: read-only connection, statement timeout, no DDL
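A sketch of that DESCRIBE handler, using only SQLite's catalog:

import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    # Validate the name against the catalog first (also guards against injection).
    row = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if row is None:
        raise ValueError(f"No such table: {table}")
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    n_rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    col_lines = [f"  {name} {ctype}" for _, name, ctype, *_ in cols]
    return f"Table {table} ({n_rows} rows):\n" + "\n".join(col_lines)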
Done when: Can run a full episode manually β reset, DESCRIBE a table, run a query, submit an answer, get correct/incorrect reward. Works in Docker via WebSocket.
Phase 3: Dense Reward (Days 5-7)
What: Add reward Layers 1 and 2 (operational validity + progress-to-target). This makes the environment trainable β agents get feedback before the terminal step.
Tasks:
- Implement reward.py with the 3-layer reward computation
- Layer 1: track executed queries (hash set for repeat detection) and schema exploration (set of described tables/columns); compute r_exec_ok, r_new_info, r_repeat, r_cost
- Layer 2: after each QUERY, run the gold SQL on the DB, compare the query result to the gold result using the type-dependent progress metric, bin to 5 levels, compute the improvement-only reward
- Wire reward computation into step(): return the stepwise reward in the observation
- Add reward capping logic (sum of step rewards <= 0.5, negative floor at -0.2)
- Test: run episodes, verify reward signals are sensible
Done when: Reward signal varies meaningfully across different agent behaviors (random exploration gives some small positive reward; targeted queries give progress reward; correct answer gives terminal reward).
Phase 4: Training Pipeline (Days 8-12)
What: Implement the GRPO training script following the Wordle tutorial pattern. Train a small model. Produce baseline vs. trained comparison.
Tasks:
- Write rollout_func that plays full SQLEnv episodes via the WebSocket client
- Design the system prompt for the SQL agent (schema understanding, query strategy, answer formatting)
- Implement reward functions for TRL (reward_correctness, reward_progress, reward_operational)
- Set up the training config (Qwen3-1.7B or similar small model, GRPO via TRL + vLLM)
- Run training (start small: 100 episodes, observe learning curves)
- Implement the Green Agent wrapper (automated evaluation: run N episodes, report success rate)
- Produce the comparison: random policy vs. trained model (success rate, avg steps, avg reward)
- Debug and iterate on reward weights if training doesn't converge
Done when: Trained model measurably outperforms random baseline on success rate. Training notebook runs end-to-end. Green Agent reports evaluation metrics.
Phase 5: Polish & Submit (Days 13-16)
What: Deploy to HF, write blog, prepare demo.
Tasks:
- Run openenv push to HuggingFace Spaces
- Clean up the GitHub repo: README, requirements, training notebook, results
- Write the HF blog post:
  - Hook: "What if AI could learn to query databases like a data analyst?"
  - Problem: static benchmarks don't teach exploration strategy
  - Solution: SQLEnv, an interactive environment with dense reward
  - Results: learning curves, before/after comparison, example episodes
  - Technical: reward architecture, OpenEnv integration
- Record/screenshot a side-by-side demo (untrained vs. trained agent)
- Final validation: someone else can pip install, connect to the HF Space, and run the training notebook
Done when: All 3 submission artifacts are live (HF Space, GitHub repo, HF blog). Blog tells a compelling story with real results.
Phase Summary
| Phase | Days | Produces | Risk if Skipped |
|---|---|---|---|
| 1. Scaffold | 1 | Stub environment in Docker | Can't build anything |
| 2. Core Loop | 2-4 | Working env with terminal reward | No submittable environment |
| 3. Dense Reward | 5-7 | Trainable environment | Terminal-only reward may not train |
| 4. Training | 8-12 | Trained model + comparison | No "Technical Excellence" or "Demo" |
| 5. Polish | 13-16 | Blog + HF Space + GitHub | No submission |
Minimum viable submission: Phases 1-2-4-5 (skip dense reward, train with terminal-only). This is risky (terminal-only reward is sparse and may not produce a meaningful trained model), but it's submittable.
Recommended path: All 5 phases in order. Dense reward (Phase 3) is what makes training work and what demonstrates "Technical Excellence."
6. Post-Submission Improvements (Backlog)
These are improvements to pursue only after a working submission exists. Ordered by expected impact.
6.1 Multi-Database Verification
Impact: High. Defends against "accidental correctness": queries that return the right answer for the wrong reasons on one dataset.
What: For each question, generate 1-2 variant SQLite databases with the same schema but different data distributions. Answer must be correct across all variants.
Implementation: Script that takes a base DB + gold SQL, runs targeted mutations (see 6.2), re-runs gold SQL to get new expected answer, packages as variant DB.
6.2 Metamorphic Testing Suite
Ten database mutations that catch common SQL errors without requiring "SQL correctness" checking:
| # | Test | What It Catches | Best For |
|---|---|---|---|
| 1 | Row-order permutation | Missing ORDER BY, positional dependencies | All types |
| 2 | Irrelevant row injection | Missing filters, wrong date logic | Filtered aggregates |
| 3 | Dangling entity injection | Incorrect outer joins, wrong join direction | Aggregates over fact tables |
| 4 | Key re-encoding (ID remap) | Hard-coded IDs, magnitude assumptions | All types |
| 5 | Duplicate bridge rows | Missing DISTINCT, cartesian joins | Unique counts |
| 6 | NULL perturbation | COUNT(col) vs COUNT(*), NULL comparison bugs | Counts, joins |
| 7 | Unit scaling (numeric Γ factor) | Wrong aggregation, measure/count confusion | Numeric aggregates |
| 8 | Noise facts (orphan FKs) | Missing joins, fact-only queries | Join-dependent queries |
| 9 | Tie injection at k-th boundary | Brittle top-k, missing tiebreak | Top-k/ranking |
| 10 | Label swap (category permutation) | Surface-string heuristics, shortcut patterns | Category-based queries |
MVP subset: Tests 2, 4, 5 (highest signal, cheapest to implement).
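For example, test 5 (duplicate bridge rows) could look like the following sketch, assuming the bridge table has no uniqueness constraint on the duplicated rows:

import random
import sqlite3

def duplicate_bridge_rows(conn: sqlite3.Connection, bridge_table: str,
                          fraction: float = 0.1) -> None:
    """Re-insert a sample of bridge rows to expose missing DISTINCT / join fan-out bugs."""
    rows = conn.execute(f"SELECT * FROM {bridge_table}").fetchall()
    if not rows:
        return
    n = max(1, int(len(rows) * fraction))
    placeholders = ", ".join(["?"] * len(rows[0]))
    for row in random.sample(rows, n):
        conn.execute(f"INSERT INTO {bridge_table} VALUES ({placeholders})", row)
    conn.commit()

Per 6.1, the gold SQL is re-run on the mutated database afterwards to refresh the expected answer.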
6.3 Two-Tier Action Space (RA/CTE Mode)
Impact: Medium. Makes intermediate rewards easier (subset/superset signals on intermediate tables) and improves dialect portability.
What: Add structured relational algebra operations (filter, join, group, union) as an alternative to raw SQL. Agent can build intermediate tables step-by-step.
Why it helps: QueryGym shows this makes the environment more RL-friendly: less syntax failure, easier partial credit, engine-agnostic. But it's significant implementation work.
When to add: Only if raw SQL + dense reward proves insufficient for training.
6.4 Difficulty Curriculum
What: Organize questions into Easy/Medium/Hard tiers. Start training on Easy, progress to Medium/Hard as the agent improves.
| Level | Schema | Query Type | Example |
|---|---|---|---|
| Easy | 1-2 tables, <10 cols | SELECT, WHERE, COUNT | "How many orders in January?" |
| Medium | 3-5 tables, JOINs | JOIN, GROUP BY, HAVING | "Top 3 customers by total spend" |
| Hard | 5+ tables, subqueries | Nested queries, CTEs | "Customers in every product category" |
When to add: After basic training works. Curriculum learning can significantly improve convergence on harder questions.
6.5 Additional Question Sources
| Source | Benefit | Effort |
|---|---|---|
| BIRD | Richer real-world databases, harder questions | Medium (different format, needs adaptation) |
| WikiTableQuestions | Simple single-table questions, good for easy tier | Low |
| Custom-generated | Control over difficulty distribution | High (need to write questions + verify gold) |
6.6 Structural SQL Similarity Signals
Impact: Low-Medium. Additional reward signal based on structural similarity between agent's SQL and gold SQL.
What: Compare table references, join graph overlap, aggregate functions used. NOT lexical similarity (bigrams overfit to syntax style).
Caution: Keep weight very low. SQL-TRAIL uses this but acknowledges it can reward "copying style" over semantic correctness.
6.7 Observation Enhancements
- Column statistics (min/max/distinct count) as part of DESCRIBE response
- Query execution plan as optional feedback
- "Hint" mode for progressive difficulty (reveal join paths after N failed queries)
7. Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Terminal-only reward doesn't train | High | High | Phase 3 (dense reward) is the primary mitigation. If needed, fall back to even simpler environment (fewer tables, easier questions) |
| Dense reward leaks gold answer | Medium | Medium | Coarsen progress to 5 bins, cap step rewards at 0.5, keep step signals small |
| Agent learns to exploit reward shaping | Medium | Medium | Anti-gaming measures (Section 3.6), monitor training for degenerate behavior |
| Spider questions too easy / too hard for RL | Medium | Medium | Curate question subset carefully; start with questions where random exploration has some chance of partial progress |
| Training doesn't converge in time | Medium | High | Start with a very small model (1.7B), short episodes, easy questions. A small improvement over random is sufficient for the demo |
| WebSocket timeout during training | Low | Medium | Increase --ws-ping-interval and --ws-ping-timeout to 300s in Dockerfile (documented in OpenEnv troubleshooting) |
| SQLite limitations (no window functions, limited types) | Low | Low | Spider questions are designed for SQLite. Avoid questions requiring features SQLite lacks |
| Blog doesn't have compelling results | Medium | High | Even negative results are interesting ("here's what we tried, here's what worked"). Focus story on the environment design, not just training outcomes |
Appendix A: Key Differences from SQLEnv_Concept.md (v0)
| Topic | v0 (SQLEnv_Concept.md) | v1 (This Document) |
|---|---|---|
| Reward structure | Static weights: 70% correctness, 20% efficiency, 10% quality | 3-layer architecture: operational validity + progress-to-target + terminal correctness, mapped to TRL reward_funcs |
| Exploration penalty | Quadratic penalty after 2 "free" queries | Constant step cost + repeat detection. No quadratic penalty (it punishes exploration) |
| Query quality scoring | "Appropriate JOINs" (subjective) | Removed. Replaced with objective signals (executed successfully, improved progress, no repeats) |
| Reward implementation | Unclear how to integrate with training | Explicit mapping to TRL GRPOTrainer pattern (reward_funcs + rollout_func) |
| Multi-DB verification | Listed as core feature | Moved to post-submission backlog. MVP uses single DB |
| Scale database (10x) | Listed as one of 3 DB variants | Dropped. Performance testing is irrelevant for correctness verification |
| Deliverables | Checklist without order | Phased MVP track with explicit "done when" gates |
| Development timeline | "2-3 weeks" (unstructured) | 5 phases with day estimates and dependencies |
Appendix B: Research References
| Paper / System | Key Idea Adopted |
|---|---|
| SQL-TRAIL (2026) | Multi-term reward panel: execution correctness + behavioral signals. Correctness must dominate |
| PaVeRL-SQL (2025) | Fractional execution accuracy (partial match). Avoid strong negative rewards |
| QueryGym (2025) | Subset/superset intermediate rewards. Two-tier action space (RA + SQL). POMDP framing |
| Graph-Reward-SQL (EMNLP 2025) | Stepwise CTE evaluation. Intermediate structure supervision |
| OpenEnv Wordle GRPO | TRL integration pattern: rollout_func + reward_funcs + GRPOTrainer |
| Spider Test Suite | Multi-database verification for semantic equivalence |
Appendix C: Green Agent Wrapper (Sketch)
class SQLGreenAgent:
    """Automated evaluation agent for SQLEnv.

    Runs N episodes with a given policy (random, heuristic, or trained model),
    reports success rate, avg reward, avg steps.
    """

    def __init__(self, env_client: SQLEnv, policy):
        self.env = env_client
        self.policy = policy

    def evaluate(self, n_episodes: int = 100) -> dict:
        results = []
        for _ in range(n_episodes):
            result = self.env.reset()
            obs = result.observation
            total_reward = 0.0
            while not result.done:
                action = self.policy.select_action(obs)
                result = self.env.step(action)
                obs = result.observation
                total_reward += result.reward or 0
            results.append({
                # Correct episodes score >= 0.8 (terminal +1.0, step floor -0.2);
                # incorrect episodes are capped at 0.5 by the stepwise clamp.
                "correct": total_reward > 0.5,
                "total_reward": total_reward,
                "steps": obs.step_count,
            })
        return {
            "success_rate": sum(r["correct"] for r in results) / len(results),
            "avg_reward": sum(r["total_reward"] for r in results) / len(results),
            "avg_steps": sum(r["steps"] for r in results) / len(results),
        }
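Usage might look like this (the constructor argument and RandomPolicy are hypothetical):

# Hypothetical usage: evaluate a random baseline against a local SQLEnv server.
env = SQLEnv(base_url="http://localhost:8000")
agent = SQLGreenAgent(env, policy=RandomPolicy())
print(agent.evaluate(n_episodes=50))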