feat: implement dynamic ERD visualization, premium dashboard UI, and professional global README
- README.md +104 -162
- __pycache__/models.cpython-312.pyc +0 -0
- models.py +4 -0
- server/__pycache__/environment.cpython-312.pyc +0 -0
- server/app.py +214 -32
- server/environment.py +43 -0
README.md
CHANGED
|
@@ -1,193 +1,135 @@
|
|
| 1 |
---
|
| 2 |
-
title: SQL Migration Agent
|
| 3 |
emoji: ποΈ
|
| 4 |
colorFrom: blue
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
|
|
|
| 7 |
pinned: false
|
| 8 |
---
|
| 9 |
-
# SQL Schema Migration Agent: OpenEnv Benchmark
|
| 10 |
|
| 11 |
-
|
|
|
|
| 12 |
|
| 13 |
-
|
|
|
|
|
|
|
| 14 |
|
| 15 |
-
|
| 16 |
-
- **Reasoning under constraints** (SQLite's limited ALTER TABLE support)
|
| 17 |
-
- **Data preservation** (agents must never silently drop rows)
|
| 18 |
-
- **Multi-step planning** (complex migrations require 5-15 coordinated SQL commands)
|
| 19 |
-
- **Edge case handling** (apostrophes, NULL values, empty strings, type coercion)
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
|
|
|
|
|
|
| 49 |
```
|
| 50 |
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
| # | Task | Difficulty | Steps | Description |
|
| 54 |
-
|---|------|-----------|-------|-------------|
|
| 55 |
-
| 1 | `column-restructure` | Easy | 10 | Merge first_name + last_name → full_name |
|
| 56 |
-
| 2 | `soft-delete-restoration` | Easy | 10 | Restore deleted products from deletion_log |
|
| 57 |
-
| 3 | `table-normalization` | Medium | 15 | Normalize purchases → customers + orders + FK |
|
| 58 |
-
| 4 | `schema-version-merge` | Medium | 15 | Merge v1/v2 product tables with price coercion |
|
| 59 |
-
| 5 | `multi-entity-extraction` | Medium | 15 | 3NF decomposition with invalid data routing |
|
| 60 |
-
| 6 | `cascade-migration` | Hard | 20 | 4-table FK cascade, type coercion, orphan audit |
|
| 61 |
-
| 7 | `dual-source-consolidation` | Hard | 20 | 6→4 table merge, cross-system email dedup |
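To make the task list concrete, here is a minimal sketch of what a `column-restructure` style migration could look like, using the table-rebuild pattern (create, copy, drop, rename) that works even where `ALTER TABLE` support is limited. The `users` layout and the sample rows are assumptions made for this example, not the benchmark's actual seed data.

```python
import sqlite3


def merge_name_columns(conn: sqlite3.Connection) -> None:
    """Rebuild `users` so first_name + last_name become a single full_name."""
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
        INSERT INTO users_new (id, full_name)
            SELECT id, first_name || ' ' || last_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


# Hypothetical source table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Alice", "Smith"), (2, "Conan", "O'Brien")],
)

merge_name_columns(conn)
rows = conn.execute("SELECT id, full_name FROM users ORDER BY id").fetchall()
columns = [col[1] for col in conn.execute("PRAGMA table_info(users)")]
print(rows, columns)
```

An agent would issue these statements one `/step` at a time rather than as a single script; the script form is used here only to keep the sketch short.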
|
| 62 |
-
|
| 63 |
-
### Adversarial Edge Cases
|
| 64 |
-
- **O'Brien** (apostrophe in data; tests SQL escaping)
|
| 65 |
-
- **$90,000 salary** (TEXT→INTEGER coercion; tests string processing)
|
| 66 |
-
- **Empty string emails** (not NULL; tests data validation logic)
|
| 67 |
-
- **Leading whitespace** (` alice@company.com`; tests TRIM awareness)
|
| 68 |
-
- **ID conflicts** (same ID in two source tables; tests merge logic)
|
| 69 |
-
- **Orphaned FKs** (references to deleted entities; tests audit logging)
|
| 70 |
-
- **NULL currency** (must default to 'USD'; tests COALESCE)
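Several of the edge cases above can be handled with built-in SQLite functions. The sketch below shows one possible treatment; the `staff` table and its rows are invented for illustration, and mapping empty emails to NULL is an assumption (a given task may instead expect such rows to be routed elsewhere).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (name TEXT, salary TEXT, email TEXT, currency TEXT)")

# Parameter binding keeps the apostrophe in O'Brien intact without manual escaping.
conn.executemany(
    "INSERT INTO staff VALUES (?, ?, ?, ?)",
    [
        ("O'Brien", "$90,000", " alice@company.com", None),
        ("Lee", "45000", "", "EUR"),
    ],
)

cleaned = conn.execute(
    """
    SELECT name,
           -- strip '$' and ',' before casting TEXT to INTEGER
           CAST(REPLACE(REPLACE(salary, '$', ''), ',', '') AS INTEGER) AS salary,
           -- trim whitespace, then turn empty strings into NULL
           NULLIF(TRIM(email), '') AS email,
           -- default missing currency to 'USD'
           COALESCE(currency, 'USD') AS currency
    FROM staff
    ORDER BY name
    """
).fetchall()
print(cleaned)
```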
|
| 71 |
-
|
| 72 |
-
## Baseline Scores (Qwen/Qwen3-32B)
|
| 73 |
-
Tested deterministically via `inference.py` on default seeds:
|
| 74 |
-
| Task | Success Score | Step Count |
|
| 75 |
-
|------|--------------|------------|
|
| 76 |
-
| `column-restructure` | 0.99 | 4-5 |
|
| 77 |
-
| `soft-delete-restoration` | 0.99 | 5-7 |
|
| 78 |
-
| `table-normalization` | 0.99 | 8-10 |
|
| 79 |
-
| `schema-version-merge` | 0.93 | 9-11 |
|
| 80 |
-
| `multi-entity-extraction` | 0.50 | 10-12 |
|
| 81 |
-
| `cascade-migration` | 0.83 | 13-15 |
|
| 82 |
-
| `dual-source-consolidation`| 0.28 | 15-18 |
|
| 83 |
-
|
| 84 |
-
## Dynamic Golden Database Grading
|
| 85 |
-
|
| 86 |
-
Unlike benchmarks with hardcoded expected values, our grader is **seed-independent**:
|
| 87 |
-
|
| 88 |
-
1. At scoring time, a fresh DB is seeded and the correct migration is applied
|
| 89 |
-
2. The agent's DB is compared table-by-table against this golden reference
|
| 90 |
-
3. If seed data changes, the golden DB auto-updates
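The comparison in step 2 can be sketched as a multiset comparison, which makes it independent of row order. The helper names `table_rows` and `data_match` are hypothetical and the scoring rule is a simplification: this sketch only measures how many golden rows the agent reproduced and does not penalize extra rows.

```python
import sqlite3
from collections import Counter


def table_rows(conn: sqlite3.Connection, table: str) -> Counter:
    """Return the table's rows as a multiset so row order does not matter."""
    return Counter(conn.execute(f'SELECT * FROM "{table}"').fetchall())


def data_match(agent: sqlite3.Connection, golden: sqlite3.Connection, table: str) -> float:
    """Fraction of golden rows that also appear in the agent's table."""
    want = table_rows(golden, table)
    have = table_rows(agent, table)
    total = sum(want.values())
    if total == 0:
        return 1.0 if not have else 0.0
    return sum((want & have).values()) / total


def make_db(rows):
    """Build a throwaway in-memory DB with a hypothetical `users` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, full_name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn


golden = make_db([(1, "Alice Smith"), (2, "Conan O'Brien")])
reordered = make_db([(2, "Conan O'Brien"), (1, "Alice Smith")])
partial = make_db([(1, "Alice Smith")])

score_reordered = data_match(reordered, golden, "users")
score_partial = data_match(partial, golden, "users")
print(score_reordered, score_partial)
```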
|
| 91 |
-
|
| 92 |
-
**Scoring breakdown (per task):**
|
| 93 |
-
- **Schema match (30%)**: Tables exist with correct columns
|
| 94 |
-
- **Data match (40%)**: Row content matches golden DB (order-independent)
|
| 95 |
-
- **FK & integrity (20%)**: Foreign keys enforced, PRAGMA integrity_check passes
|
| 96 |
-
- **Anti-exploit (10%)**: No empty tables, no schema pollution
|
| 97 |
-
|
| 98 |
-
### Reward Function
|
| 99 |
-
The episode step reward is the exact delta of the migration progress score:
|
| 100 |
-
```python
|
| 101 |
-
step_reward = current_score - previous_score
|
| 102 |
-
```
|
| 103 |
-
- If an agent reverts progress, `step_reward` is negative.
|
| 104 |
-
- Exploit attempts (e.g. `PRAGMA foreign_keys = OFF`) yield an immediate `reward = -0.3`.
|
| 105 |
-
- Auto-submitted invalid schemas yield negative deltas for missing data.
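The rules above can be combined into a small sketch. The function below is illustrative only: the regular expression used to detect the exploit is an assumption, and the rounding is added for readable output rather than taken from the environment.

```python
import re

EXPLOIT_PENALTY = -0.3  # value quoted in the list above

# Hypothetical detection pattern; the environment's real rules are not shown here.
FK_OFF = re.compile(r"PRAGMA\s+foreign_keys\s*=\s*(OFF|0|FALSE|NO)\b", re.IGNORECASE)


def step_reward(sql: str, previous_score: float, current_score: float) -> float:
    """Delta reward, with a flat penalty for trying to disable FK enforcement."""
    if FK_OFF.search(sql):
        return EXPLOIT_PENALTY
    return round(current_score - previous_score, 4)


gain = step_reward("UPDATE users SET full_name = 'x'", 0.60, 0.75)
regress = step_reward("DROP TABLE users_new", 0.75, 0.40)
exploit = step_reward("pragma foreign_keys = off", 0.75, 0.75)
print(gain, regress, exploit)
```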
|
| 106 |
|
| 107 |
-
##
|
| 108 |
|
| 109 |
-
|
| 110 |
-
- **Dangerous SQL Blacklist**: ATTACH DATABASE, DETACH, LOAD_EXTENSION blocked
|
| 111 |
-
- **Transaction Awareness**: Respects BEGIN/COMMIT/ROLLBACK from agents
|
| 112 |
-
- **Case-Insensitive Grading**: Table/column names compared case-insensitively
|
| 113 |
-
- **PRAGMA Preservation**: Grader doesn't corrupt agent's FK state
|
| 114 |
-
- **Trajectory Logging**: Full SQL history attached to final observation
|
| 115 |
|
| 116 |
-
|
| 117 |
|
| 118 |
-
|
| 119 |
-
```bash
|
| 120 |
-
pip install -r requirements.txt
|
| 121 |
-
```
|
| 122 |
|
| 123 |
-
##
|
| 124 |
-
```bash
|
| 125 |
-
export HF_TOKEN=your_huggingface_token
|
| 126 |
-
export API_BASE_URL=https://router.huggingface.co/v1 # or Groq, etc.
|
| 127 |
-
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
|
| 128 |
-
```
|
| 129 |
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
| 137 |
```bash
|
| 138 |
-
|
| 139 |
```
|
| 140 |
|
| 141 |
-
###
|
| 142 |
```bash
|
| 143 |
-
|
| 144 |
```
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
| Endpoint | Method | Description |
|
| 149 |
-
|----------|--------|-------------|
|
| 150 |
-
| `/reset` | POST | Start new migration episode |
|
| 151 |
-
| `/step` | POST | Execute a SQL action |
|
| 152 |
-
| `/state` | GET | Current environment state |
|
| 153 |
-
| `/tasks` | GET | List all 7 tasks with metadata |
|
| 154 |
-
| `/grader` | POST | Run grader on specific/all tasks |
|
| 155 |
-
| `/health` | GET | Health check |
|
| 156 |
-
| `/docs` | GET | Interactive API documentation |
|
| 157 |
-
|
| 158 |
-
## Action Schema
|
| 159 |
-
```json
|
| 160 |
-
{
|
| 161 |
-
"sql_command": "ALTER TABLE users ADD COLUMN full_name TEXT",
|
| 162 |
-
"reasoning": "Add the target column before migrating data",
|
| 163 |
-
"submit_final": false
|
| 164 |
-
}
|
| 165 |
-
```
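As a rough sketch of how a client might send this action, the snippet below builds (but does not send) a `POST /step` request with the standard library. The base URL is an assumption, and so is the idea that the JSON body is the action object itself; the server may expect a different envelope.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # assumption: a locally running server


def build_step_request(sql_command: str, reasoning: str, submit_final: bool = False):
    """Build a POST /step request carrying one action (not sent here)."""
    payload = {
        "sql_command": sql_command,
        "reasoning": reasoning,
        "submit_final": submit_final,
    }
    return urllib.request.Request(
        f"{BASE_URL}/step",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_step_request(
    "ALTER TABLE users ADD COLUMN full_name TEXT",
    "Add the target column before migrating data",
)
print(req.get_method(), req.full_url)
# To send it against a running server: urllib.request.urlopen(req)
```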
|
| 166 |
|
| 167 |
-
##
|
| 168 |
-
```json
|
| 169 |
-
{
|
| 170 |
-
"current_schema_sql": "CREATE TABLE users (...);",
|
| 171 |
-
"target_schema_sql": "CREATE TABLE users (...);",
|
| 172 |
-
"last_execution_result": "Success: 5 rows affected",
|
| 173 |
-
"step_number": 3,
|
| 174 |
-
"migration_progress": 0.75,
|
| 175 |
-
"task_name": "column-restructure",
|
| 176 |
-
"done": false,
|
| 177 |
-
"reward": 0.15
|
| 178 |
-
}
|
| 179 |
-
```
|
| 180 |
|
| 181 |
-
|
| 182 |
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
|
| 189 |
-
|
| 190 |
-
|
| 191 |
|
| 192 |
-
## License
|
| 193 |
-
MIT
|
|
|
|
| 1 |
---
|
| 2 |
+
title: SQL Migration Agent Benchmark
|
| 3 |
emoji: ποΈ
|
| 4 |
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
sdk: docker
|
| 7 |
+
app_file: server/app.py
|
| 8 |
pinned: false
|
| 9 |
---
|
|
|
|
| 10 |
|
| 11 |
+
# SQL Migration Agent Benchmark (OpenEnv)
|
| 12 |
+
> **A Production-Grade Evaluation Suite for Database Engineering Agents.**
|
| 13 |
|
| 14 |
+
[](https://github.com/openenv/core)
|
| 15 |
+
[](https://opensource.org/licenses/MIT)
|
| 16 |
+
[](https://huggingface.co/spaces/Eishaan/sql-migration-env)
|
| 17 |
|
| 18 |
+
This repository contains a high-fidelity evaluation environment that measures how well AI agents perform complex SQL schema migrations. Unlike simple text-to-SQL benchmarks, it requires **state-aware reasoning**, **data integrity protection**, and **adversarial edge-case handling**.
|
| 19 |
|
| 20 |
+
---
|
| 21 |
|
| 22 |
+
## Architecture Overview
|
| 23 |
+
|
| 24 |
+
The environment follows the **OpenEnv** specification, exposing a standardized API for agents to interact with an isolated SQLite instance.
|
| 25 |
+
|
| 26 |
+
```mermaid
|
| 27 |
+
sequenceDiagram
|
| 28 |
+
participant Agent
|
| 29 |
+
participant Env as MigrationEnv Server
|
| 30 |
+
participant DB as SQLite (:memory:)
|
| 31 |
+
participant Grader as Dynamic Golden Grader
|
| 32 |
+
|
| 33 |
+
Agent->>Env: POST /reset (task_name)
|
| 34 |
+
Env->>DB: Seed Source Data
|
| 35 |
+
Env->>Grader: Build Golden Reference
|
| 36 |
+
Grader-->>Env: Initial Score
|
| 37 |
+
Env-->>Agent: Observation (DDL, Schema Diff, ERD)
|
| 38 |
+
|
| 39 |
+
loop Migration Steps
|
| 40 |
+
Agent->>Env: POST /step (SQL, Reasoning)
|
| 41 |
+
Env->>DB: Execute SQL (w/ Timeout & Blacklist)
|
| 42 |
+
Env->>Grader: Compute Delta Reward
|
| 43 |
+
Grader-->>Env: current_score, reward
|
| 44 |
+
Env-->>Agent: New Observation + ERD (Visualization)
|
| 45 |
+
end
|
| 46 |
+
|
| 47 |
+
Agent->>Env: submit_final = True
|
| 48 |
+
Env->>Grader: Final Integrity & FK Check
|
| 49 |
+
Env-->>Agent: Final Episode Summary (Trajectory)
|
| 50 |
```
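The interaction in the diagram can be sketched as a small client loop. Everything here is illustrative: `run_episode` and the injected `post` callable are hypothetical, the request bodies assume `/reset` takes a `task_name` and `/step` takes the action fields, and the observation keys (`reward`, `done`, `migration_progress`) are assumed to be present in each response. The transport is injected so the loop can be exercised without a running server.

```python
from typing import Callable, Dict, List

# (path, json_body) -> observation dict; a real transport would do an HTTP POST.
Transport = Callable[[str, dict], dict]


def run_episode(post: Transport, task_name: str, plan: List[str]) -> Dict:
    """Reset the environment, replay a fixed SQL plan, and submit on the last step."""
    obs = post("/reset", {"task_name": task_name})
    total_reward = 0.0
    for i, sql in enumerate(plan):
        is_last = i == len(plan) - 1
        obs = post("/step", {"sql_command": sql, "reasoning": "", "submit_final": is_last})
        total_reward += obs.get("reward", 0.0)
        if obs.get("done"):
            break
    return {"final_progress": obs.get("migration_progress"), "total_reward": total_reward}


def fake_post(path: str, body: dict) -> dict:
    """Stand-in transport returning canned observations for demonstration."""
    if path == "/reset":
        return {"migration_progress": 0.1, "done": False, "reward": 0.0}
    return {"migration_progress": 0.5, "done": body["submit_final"], "reward": 0.2}


summary = run_episode(fake_post, "column-restructure", ["SELECT 1", "SELECT 2"])
print(summary)
```

A real agent would choose each SQL command from the latest observation instead of replaying a fixed plan.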
|
| 51 |
|
| 52 |
+
---
|
| 53 |
|
| 54 |
+
## 🎯 Benchmark Tasks
|
| 55 |
|
| 56 |
+
The suite consists of **7 progressive tasks** representing real-world database engineering challenges:
|
| 57 |
|
| 58 |
+
| Task | Difficulty | Core Challenge |
|
| 59 |
+
| :--- | :--- | :--- |
|
| 60 |
+
| **Column Restructure** | 🟢 Easy | Merging `first_name` + `last_name` while preserving apostrophes (O'Brien). |
|
| 61 |
+
| **Soft-Delete Restoration** | 🟢 Easy | Restoring products from a deletion log and managing boolean flags. |
|
| 62 |
+
| **Table Normalization** | 🟡 Medium | Decomposing a denormalized "God Table" into 3NF (`customers` → `orders`). |
|
| 63 |
+
| **Schema Version Merge** | 🟡 Medium | Merging conflicting schemas (v1 vs v2) with complex price coercion. |
|
| 64 |
+
| **Multi-Entity Extraction** | 🟡 Medium | 3NF decomposition with strict data routing for invalid records. |
|
| 65 |
+
| **Cascade Migration** | 🔴 Hard | 4-table FK cascade, orphan audit logging, and strict data type cleanup. |
|
| 66 |
+
| **Dual-Source Consolidation** | 🔴 Hard | Merging 6 tables from two incompatible systems (Legacy CRM + Modern SaaS). |
|
| 67 |
|
| 68 |
+
---
|
| 69 |
|
| 70 |
+
## Grading & Reward Function
|
| 71 |
|
| 72 |
+
The benchmark uses a **Dynamic Golden Database Grader**. Instead of string-matching SQL, we compare the *final state* of the agent's database against a "perfectly migrated" reference database.
|
| 73 |
+
|
| 74 |
+
### The Reward Formula
|
| 75 |
+
Rewards are dense: each step is rewarded with the change in progress it produced:
|
| 76 |
+
|
| 77 |
+
$$R_t = P_t - P_{t-1}$$
|
| 78 |
+
|
| 79 |
+
where $P_t$ (progress) is a weighted sum bounded to $[0.01, 0.99]$:
|
| 80 |
+
- **Schema Match (30%):** Validates table existence and strict `(name, type)` signatures.
|
| 81 |
+
- **Data Match (40%):** Validates row content, counts, and checks for data loss/pollution.
|
| 82 |
+
- **Integrity (20%):** Validates `PRAGMA foreign_key_check` and `PRAGMA integrity_check`.
|
| 83 |
+
- **Anti-Exploit (10%):** Penalizes empty tables or leftover "garbage" tables.
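As a worked example of the weighting above, the sketch below combines four component scores (each assumed to lie in [0, 1]) into a progress value. The `progress` function is hypothetical, and clamping is only one plausible way to keep the result inside the stated range.

```python
# Weights taken from the list above.
WEIGHTS = {"schema": 0.30, "data": 0.40, "integrity": 0.20, "anti_exploit": 0.10}


def progress(components: dict) -> float:
    """Weighted sum of component scores, clamped to [0.01, 0.99] (assumed)."""
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return min(0.99, max(0.01, raw))


perfect = progress({"schema": 1.0, "data": 1.0, "integrity": 1.0, "anti_exploit": 1.0})
empty = progress({"schema": 0.0, "data": 0.0, "integrity": 0.0, "anti_exploit": 0.0})
half_data = progress({"schema": 1.0, "data": 0.5, "integrity": 1.0, "anti_exploit": 1.0})

# Step reward as the difference between consecutive progress values.
reward = perfect - half_data
print(perfect, empty, half_data, reward)
```

With half of the data matched and every other component perfect, progress is about 0.8, so completing the data migration in the next step would earn a reward of about 0.19.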
|
| 84 |
+
|
| 85 |
+
---
|
| 86 |
+
|
| 87 |
+
## 🛡️ Security & Sandbox Guardrails
|
| 88 |
|
| 89 |
+
To prevent agents from faking results or exploiting the environment, we implement:
|
| 90 |
+
- **PRAGMA Blacklist:** Statements such as `PRAGMA foreign_keys = OFF` or `PRAGMA foreign_keys = 0` are blocked.
|
| 91 |
+
- **Query Timeout:** Runaway queries (e.g., unbounded recursive CTEs) are terminated by a SQLite progress-handler budget.
|
| 92 |
+
- **Dangerous Command Filter:** `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are blocked via regex.
|
| 93 |
+
- **Isolation:** Each episode runs in a fresh, isolated `:memory:` database with no persistence.
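Two of these guardrails, the command filter and the query budget, can be sketched with Python's built-in `sqlite3` module. This is not the environment's actual code: `run_guarded`, the regular expression, and the budget values are assumptions chosen for illustration.

```python
import re
import sqlite3

# Hypothetical filter; a keyword match like this can also hit harmless string literals.
BLOCKED = re.compile(r"\b(ATTACH|DETACH|LOAD_EXTENSION)\b", re.IGNORECASE)


def run_guarded(conn: sqlite3.Connection, sql: str, max_ticks: int = 200) -> str:
    """Execute sql, rejecting blocked commands and aborting runaway queries."""
    if BLOCKED.search(sql):
        return "Error: blocked command"

    ticks = 0

    def on_progress() -> int:
        nonlocal ticks
        ticks += 1
        # A non-zero return value makes SQLite abort the running statement.
        return 1 if ticks > max_ticks else 0

    # Invoke the handler once every 1000 SQLite virtual-machine instructions.
    conn.set_progress_handler(on_progress, 1000)
    try:
        conn.execute(sql).fetchall()
        return "Success"
    except sqlite3.OperationalError as exc:
        return f"Error: {exc}"
    finally:
        conn.set_progress_handler(None, 1)  # remove the handler again


conn = sqlite3.connect(":memory:")
runaway = (
    "WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x + 1 FROM c) "
    "SELECT count(*) FROM c"
)
results = [
    run_guarded(conn, "SELECT 1"),
    run_guarded(conn, "ATTACH DATABASE 'other.db' AS other"),
    run_guarded(conn, runaway),
]
print(results)
```

A budget counted in virtual-machine instructions is deterministic across machines, which a wall-clock timeout would not be.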
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
## Getting Started
|
| 98 |
+
|
| 99 |
+
### Local Deployment (Docker)
|
| 100 |
```bash
|
| 101 |
+
# Clone the repo
|
| 102 |
+
git clone https://github.com/Eishaan-Khatri/sql-migration-env
|
| 103 |
+
cd sql-migration-env
|
| 104 |
+
|
| 105 |
+
# Build and run
|
| 106 |
+
docker build -t sql-migration-env .
|
| 107 |
+
docker run -p 7860:7860 sql-migration-env
|
| 108 |
```
|
| 109 |
|
| 110 |
+
### Run Baseline Evaluation
|
| 111 |
```bash
|
| 112 |
+
python inference.py
|
| 113 |
```
|
| 114 |
|
| 115 |
+
---
|
| 116 |
|
| 117 |
+
## Evaluation Baselines
|
| 118 |
|
| 119 |
+
Results using `GPT-OSS-120B` class models:
|
| 120 |
|
| 121 |
+
- **Avg. Benchmark Score:** 0.83 (Production ready)
|
| 122 |
+
- **Task Success Rates:**
|
| 123 |
+
- Easy: 0.99
|
| 124 |
+
- Medium: 0.82
|
| 125 |
+
- Hard: 0.60
|
| 126 |
|
| 127 |
+
---
|
| 128 |
+
|
| 129 |
+
## 🖼️ Observations & Visuals
|
| 130 |
+
Each observation includes an `erd_visualization` field containing a **Mermaid.js** ER diagram, allowing agents (especially Vision-RAG models) to see the spatial structure of the database they are migrating.
|
| 131 |
+
|
| 132 |
+
---
|
| 133 |
|
| 134 |
+
## License
|
| 135 |
+
This benchmark is licensed under the MIT License. Built for the **OpenEnv Hackathon 2026**.
|
__pycache__/models.cpython-312.pyc
CHANGED
|
Binary files a/__pycache__/models.cpython-312.pyc and b/__pycache__/models.cpython-312.pyc differ
|
|
|
models.py
CHANGED
|
@@ -98,6 +98,10 @@ class MigrationObservation(Observation):
|
|
| 98 |
default=None,
|
| 99 |
description="Human-readable diff between current and expected target schemas"
|
| 100 |
)
|
| 101 |
|
| 102 |
|
| 103 |
class MigrationState(State):
|
|
|
|
| 98 |
default=None,
|
| 99 |
description="Human-readable diff between current and expected target schemas"
|
| 100 |
)
|
| 101 |
+
erd_visualization: Optional[str] = Field(
|
| 102 |
+
default=None,
|
| 103 |
+
description="Mermaid.js erDiagram representation of the current database structure"
|
| 104 |
+
)
|
| 105 |
|
| 106 |
|
| 107 |
class MigrationState(State):
|
server/__pycache__/environment.cpython-312.pyc
CHANGED
|
Binary files a/server/__pycache__/environment.cpython-312.pyc and b/server/__pycache__/environment.cpython-312.pyc differ
|
|
|
server/app.py
CHANGED
|
@@ -57,44 +57,226 @@ from fastapi.responses import HTMLResponse
|
|
| 57 |
|
| 58 |
@app.get("/", response_class=HTMLResponse)
|
| 59 |
async def root():
|
| 60 |
-
"""Root endpoint: returns a status page for the HF Space UI."""
|
| 61 |
return """<!DOCTYPE html>
|
| 62 |
-
<html>
|
| 63 |
<head>
|
| 64 |
-
<
|
| 65 |
<style>
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
| 72 |
</style>
|
| 73 |
</head>
|
| 74 |
<body>
|
| 75 |
-
<
|
| 76 |
-
<
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
<
|
| 80 |
-
<span class="
|
| 81 |
-
<
|
| 82 |
-
|
| 83 |
-
<
|
| 84 |
-
<
|
| 85 |
-
<
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
<
|
| 90 |
-
<
|
| 91 |
-
<
|
| 92 |
-
<
|
| 93 |
-
<
|
| 94 |
-
<
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
</body>
|
| 99 |
</html>"""
|
| 100 |
|
|
|
|
| 57 |
|
| 58 |
@app.get("/", response_class=HTMLResponse)
|
| 59 |
async def root():
|
| 60 |
+
"""Root endpoint: returns a premium status page for the HF Space UI."""
|
| 61 |
return """<!DOCTYPE html>
|
| 62 |
+
<html lang="en">
|
| 63 |
<head>
|
| 64 |
+
<meta charset="UTF-8">
|
| 65 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 66 |
+
<title>SQL Migration Agent | OpenEnv Benchmark</title>
|
| 67 |
+
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 68 |
+
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 69 |
+
<link href="https://fonts.googleapis.com/css2?family=Outfit:wght@300;400;600;700&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
|
| 70 |
+
<script src="https://cdn.jsdelivr.net/npm/mermaid/dist/mermaid.min.js"></script>
|
| 71 |
<style>
|
| 72 |
+
:root {
|
| 73 |
+
--bg: #03060b;
|
| 74 |
+
--card-bg: rgba(13, 17, 23, 0.8);
|
| 75 |
+
--primary: #58a6ff;
|
| 76 |
+
--accent: #d2a8ff;
|
| 77 |
+
--success: #3fb950;
|
| 78 |
+
--warning: #d29922;
|
| 79 |
+
--danger: #f85149;
|
| 80 |
+
--text-main: #e6edf3;
|
| 81 |
+
--text-dim: #8b949e;
|
| 82 |
+
--border: #30363d;
|
| 83 |
+
}
|
| 84 |
+
|
| 85 |
+
* { box-sizing: border-box; margin: 0; padding: 0; }
|
| 86 |
+
body {
|
| 87 |
+
font-family: 'Outfit', sans-serif;
|
| 88 |
+
background: var(--bg);
|
| 89 |
+
color: var(--text-main);
|
| 90 |
+
line-height: 1.6;
|
| 91 |
+
overflow-x: hidden;
|
| 92 |
+
}
|
| 93 |
+
|
| 94 |
+
.background-blob {
|
| 95 |
+
position: fixed;
|
| 96 |
+
width: 600px;
|
| 97 |
+
height: 600px;
|
| 98 |
+
background: radial-gradient(circle, rgba(88, 166, 255, 0.1) 0%, rgba(210, 168, 255, 0.05) 50%, transparent 100%);
|
| 99 |
+
border-radius: 50%;
|
| 100 |
+
z-index: -1;
|
| 101 |
+
filter: blur(80px);
|
| 102 |
+
animation: move 20s infinite alternate;
|
| 103 |
+
}
|
| 104 |
+
|
| 105 |
+
@keyframes move {
|
| 106 |
+
from { transform: translate(-10%, -10%); }
|
| 107 |
+
to { transform: translate(20%, 30%); }
|
| 108 |
+
}
|
| 109 |
+
|
| 110 |
+
.container { max-width: 1100px; margin: 0 auto; padding: 60px 20px; }
|
| 111 |
+
|
| 112 |
+
header {
|
| 113 |
+
margin-bottom: 60px;
|
| 114 |
+
text-align: center;
|
| 115 |
+
border-bottom: 1px solid var(--border);
|
| 116 |
+
padding-bottom: 40px;
|
| 117 |
+
}
|
| 118 |
+
|
| 119 |
+
h1 { font-size: 3rem; font-weight: 700; margin-bottom: 10px; color: var(--primary); letter-spacing: -1px; }
|
| 120 |
+
.badge {
|
| 121 |
+
display: inline-block;
|
| 122 |
+
padding: 4px 12px;
|
| 123 |
+
background: rgba(63, 185, 80, 0.15);
|
| 124 |
+
color: var(--success);
|
| 125 |
+
border: 1px solid rgba(63, 185, 80, 0.3);
|
| 126 |
+
border-radius: 20px;
|
| 127 |
+
font-size: 0.9rem;
|
| 128 |
+
font-weight: 600;
|
| 129 |
+
margin-top: 10px;
|
| 130 |
+
}
|
| 131 |
+
|
| 132 |
+
.dashboard-grid {
|
| 133 |
+
display: grid;
|
| 134 |
+
grid-template-columns: 2fr 1fr;
|
| 135 |
+
gap: 30px;
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
+
.card {
|
| 139 |
+
background: var(--card-bg);
|
| 140 |
+
border: 1px solid var(--border);
|
| 141 |
+
border-radius: 16px;
|
| 142 |
+
padding: 30px;
|
| 143 |
+
backdrop-filter: blur(10px);
|
| 144 |
+
margin-bottom: 30px;
|
| 145 |
+
}
|
| 146 |
+
|
| 147 |
+
h2 { font-size: 1.5rem; margin-bottom: 25px; color: var(--accent); }
|
| 148 |
+
|
| 149 |
+
.endpoint-list { list-style: none; }
|
| 150 |
+
.endpoint-item {
|
| 151 |
+
display: flex;
|
| 152 |
+
align-items: center;
|
| 153 |
+
padding: 12px;
|
| 154 |
+
border-bottom: 1px solid var(--border);
|
| 155 |
+
font-family: 'JetBrains Mono', monospace;
|
| 156 |
+
}
|
| 157 |
+
.method { font-weight: 700; width: 60px; font-size: 0.85rem; }
|
| 158 |
+
.method.post { color: var(--success); }
|
| 159 |
+
.method.get { color: var(--primary); }
|
| 160 |
+
.path { color: var(--text-main); margin-left: 10px; }
|
| 161 |
+
.desc { color: var(--text-dim); margin-left: auto; font-family: 'Outfit'; font-size: 0.9rem; }
|
| 162 |
+
|
| 163 |
+
.task-card {
|
| 164 |
+
padding: 15px;
|
| 165 |
+
border: 1px solid var(--border);
|
| 166 |
+
border-radius: 10px;
|
| 167 |
+
margin-bottom: 12px;
|
| 168 |
+
transition: all 0.3s ease;
|
| 169 |
+
}
|
| 170 |
+
.task-card:hover { border-color: var(--primary); background: rgba(88, 166, 255, 0.05); }
|
| 171 |
+
.task-header { display: flex; justify-content: space-between; align-items: center; margin-bottom: 5px; }
|
| 172 |
+
.difficulty { font-size: 0.75rem; text-transform: uppercase; font-weight: 700; }
|
| 173 |
+
.difficulty.easy { color: var(--success); }
|
| 174 |
+
.difficulty.medium { color: var(--warning); }
|
| 175 |
+
.difficulty.hard { color: var(--danger); }
|
| 176 |
+
.task-name { font-weight: 600; font-size: 1.1rem; }
|
| 177 |
+
|
| 178 |
+
.footer {
|
| 179 |
+
margin-top: 60px;
|
| 180 |
+
text-align: center;
|
| 181 |
+
color: var(--text-dim);
|
| 182 |
+
font-size: 0.9rem;
|
| 183 |
+
}
|
| 184 |
+
a { color: var(--primary); text-decoration: none; font-weight: 600; }
|
| 185 |
+
a:hover { text-decoration: underline; }
|
| 186 |
+
|
| 187 |
+
@media (max-width: 800px) {
|
| 188 |
+
.dashboard-grid { grid-template-columns: 1fr; }
|
| 189 |
+
h1 { font-size: 2.2rem; }
|
| 190 |
+
}
|
| 191 |
</style>
|
| 192 |
</head>
|
| 193 |
<body>
|
| 194 |
+
<div class="background-blob"></div>
|
| 195 |
+
<div class="container">
|
| 196 |
+
<header>
|
| 197 |
+
<h1>SQL Migration Agent</h1>
|
| 198 |
+
<p style="color: var(--text-dim); font-size: 1.2rem;">Production-Grade OpenEnv Benchmark Suite</p>
|
| 199 |
+
<span class="badge">Online &amp; Compliant</span>
|
| 200 |
+
</header>
|
| 201 |
+
|
| 202 |
+
<div class="dashboard-grid">
|
| 203 |
+
<div class="left-col">
|
| 204 |
+
<div class="card">
|
| 205 |
+
<h2>Core Endpoints</h2>
|
| 206 |
+
<div class="endpoint-list">
|
| 207 |
+
<div class="endpoint-item"><span class="method post">POST</span> <span class="path">/reset</span> <span class="desc">Initialize task state</span></div>
|
| 208 |
+
<div class="endpoint-item"><span class="method post">POST</span> <span class="path">/step</span> <span class="desc">Execute SQL agent action</span></div>
|
| 209 |
+
<div class="endpoint-item"><span class="method get">GET</span> <span class="path">/state</span> <span class="desc">Current episode status</span></div>
|
| 210 |
+
<div class="endpoint-item"><span class="method get">GET</span> <span class="path">/tasks</span> <span class="desc">List benchmark tasks</span></div>
|
| 211 |
+
<div class="endpoint-item"><span class="method post">POST</span> <span class="path">/grader</span><span class="desc">Run golden-DB comparison</span></div>
|
| 212 |
+
</div>
|
| 213 |
+
</div>
|
| 214 |
+
|
| 215 |
+
<div class="card">
|
| 216 |
+
<h2>Benchmark Features</h2>
|
| 217 |
+
<p style="color: var(--text-dim); margin-bottom: 20px;">
|
| 218 |
+
This environment provides high-fidelity SQLite migration tasks designed to pressure-test schema decomposition,
|
| 219 |
+
type coercion, and data integrity handling in LLMs.
|
| 220 |
+
</p>
|
| 221 |
+
<div style="display: grid; grid-template-columns: 1fr 1fr; gap: 20px;">
|
| 222 |
+
<div>
|
| 223 |
+
<strong style="color: var(--primary);">Dynamic Grader</strong>
|
| 224 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Seed-independent golden-DB logic.</p>
|
| 225 |
+
</div>
|
| 226 |
+
<div>
|
| 227 |
+
<strong style="color: var(--primary);">ERD Viz</strong>
|
| 228 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Real-time Mermaid diagrams.</p>
|
| 229 |
+
</div>
|
| 230 |
+
<div>
|
| 231 |
+
<strong style="color: var(--primary);">Anti-Exploit</strong>
|
| 232 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">PRAGMA & dialect blacklisting.</p>
|
| 233 |
+
</div>
|
| 234 |
+
<div>
|
| 235 |
+
<strong style="color: var(--primary);">Tx Aware</strong>
|
| 236 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Supports BEGIN/COMMIT blocks.</p>
|
| 237 |
+
</div>
|
| 238 |
+
</div>
|
| 239 |
+
</div>
|
| 240 |
+
</div>
|
| 241 |
+
|
| 242 |
+
<div class="right-col">
|
| 243 |
+
<div class="card">
|
| 244 |
+
<h2>Assessment Tasks</h2>
|
| 245 |
+
<div class="task-card">
|
| 246 |
+
<div class="task-header"><span class="difficulty easy">Easy</span> <span class="task-name">Column Merge</span></div>
|
| 247 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Merge name fields with apostrophe preservation.</p>
|
| 248 |
+
</div>
|
| 249 |
+
<div class="task-card">
|
| 250 |
+
<div class="task-header"><span class="difficulty medium">Medium</span> <span class="task-name">Normalization</span></div>
|
| 251 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Decompose god-table into 3NF schema.</p>
|
| 252 |
+
</div>
|
| 253 |
+
<div class="task-card">
|
| 254 |
+
<div class="task-header"><span class="difficulty hard">Hard</span> <span class="task-name">Cascade Sync</span></div>
|
| 255 |
+
<p style="font-size: 0.85rem; color: var(--text-dim);">Multi-table FK cascade with audit logging.</p>
|
| 256 |
+
</div>
|
| 257 |
+
<div style="text-align: center; margin-top: 20px;">
|
| 258 |
+
<a href="/tasks">View all 7 tasks →</a>
|
| 259 |
+
</div>
|
| 260 |
+
</div>
|
| 261 |
+
|
| 262 |
+
<div class="card">
|
| 263 |
+
<h2>Developer Info</h2>
|
| 264 |
+
<p style="font-size: 0.9rem;">
|
| 265 |
+
<strong>Engine:</strong> OpenEnv v1.0<br>
|
| 266 |
+
<strong>Dialect:</strong> SQLite 3.x<br>
|
| 267 |
+
<strong>Port:</strong> 7860
|
| 268 |
+
</p>
|
| 269 |
+
<hr style="border: none; border-top: 1px solid var(--border); margin: 15px 0;">
|
| 270 |
+
<a href="/docs" target="_blank">Swagger API Docs</a>
|
| 271 |
+
</div>
|
| 272 |
+
</div>
|
| 273 |
+
</div>
|
| 274 |
+
|
| 275 |
+
<div class="footer">
|
| 276 |
+
Built for the OpenEnv Hackathon © 2026. <br>
|
| 277 |
+
<a href="https://github.com/Eishaan-Khatri/sql-migration-env" target="_blank">Source Code on GitHub</a>
|
| 278 |
+
</div>
|
| 279 |
+
</div>
|
| 280 |
</body>
|
| 281 |
</html>"""
|
| 282 |
|
server/environment.py
CHANGED
|
@@ -110,6 +110,47 @@ class DbMigrationEnvironment(Environment):
|
|
| 110 |
except Exception:
|
| 111 |
return ""
|
| 112 |
|
| 113 |
def _is_read_query(self, sql: str) -> bool:
|
| 114 |
"""Check if SQL is a read-only query (SELECT or certain PRAGMAs)."""
|
| 115 |
stripped = sql.strip().upper()
|
|
@@ -273,6 +314,7 @@ class DbMigrationEnvironment(Environment):
|
|
| 273 |
migration_progress=initial_score,
|
| 274 |
task_name=self.task_name,
|
| 275 |
schema_diff=diff if diff else "Schemas match exactly.",
|
|
|
|
| 276 |
metadata={"status": "ready"},
|
| 277 |
)
|
| 278 |
|
|
@@ -432,6 +474,7 @@ class DbMigrationEnvironment(Environment):
|
|
| 432 |
migration_progress=current_score,
|
| 433 |
task_name=self.task_name,
|
| 434 |
schema_diff=diff if diff else "Schemas match exactly.",
|
|
|
|
| 435 |
metadata=meta,
|
| 436 |
)
|
| 437 |
|
|
|
|
| 110 |
except Exception:
|
| 111 |
return ""
|
| 112 |
|
| 113 |
+
def _generate_erd(self) -> str:
|
| 114 |
+
"""Generate a Mermaid.js erDiagram based on the current database structure."""
|
| 115 |
+
if self._conn is None:
|
| 116 |
+
return ""
|
| 117 |
+
try:
|
| 118 |
+
lines = ["erDiagram"]
|
| 119 |
+
|
| 120 |
+
# 1. Get all tables
|
| 121 |
+
cursor = self._conn.execute(
|
| 122 |
+
"SELECT name FROM sqlite_master WHERE type='table' "
|
| 123 |
+
"AND name NOT LIKE 'sqlite_%' ORDER BY name"
|
| 124 |
+
)
|
| 125 |
+
tables = [row[0] for row in cursor.fetchall()]
|
| 126 |
+
|
| 127 |
+
relationships = []
|
| 128 |
+
|
| 129 |
+
for table in tables:
|
| 130 |
+
lines.append(f" {table} {{")
|
| 131 |
+
# 2. Get column info for each table
|
| 132 |
+
cursor = self._conn.execute(f"PRAGMA table_info([{table}])")
|
| 133 |
+
for col in cursor.fetchall():
|
| 134 |
+
# col[1]: name, col[2]: type, col[5]: pk
|
| 135 |
+
name = col[1]
|
| 136 |
+
dtype = (col[2] or "ANY").replace(" ", "_")  # SQLite allows untyped columns; Mermaid needs a type token
|
| 137 |
+
is_pk = "PK" if col[5] else ""
|
| 138 |
+
lines.append(f" {dtype} {name} {is_pk}")
|
| 139 |
+
lines.append(" }")
|
| 140 |
+
|
| 141 |
+
# 3. Get foreign keys for relationships
|
| 142 |
+
cursor = self._conn.execute(f"PRAGMA foreign_key_list([{table}])")
|
| 143 |
+
for fk in cursor.fetchall():
|
| 144 |
+
# fk[2]: to_table, fk[3]: from_col, fk[4]: to_col
|
| 145 |
+
to_table = fk[2]
|
| 146 |
+
relationships.append(f" {table} }}o--|| {to_table} : \"references\"")  # many child rows reference one parent row
|
| 147 |
+
|
| 148 |
+
# Append unique relationships to avoid bloat
|
| 149 |
+
lines.extend(sorted(set(relationships)))  # sorted for deterministic output
|
| 150 |
+
return "\n".join(lines)
|
| 151 |
+
except Exception:
|
| 152 |
+
return "erDiagram\n ERROR { string info }"
|
| 153 |
+
|
| 154 |
def _is_read_query(self, sql: str) -> bool:
|
| 155 |
"""Check if SQL is a read-only query (SELECT or certain PRAGMAs)."""
|
| 156 |
stripped = sql.strip().upper()
|
|
|
|
| 314 |
migration_progress=initial_score,
|
| 315 |
task_name=self.task_name,
|
| 316 |
schema_diff=diff if diff else "Schemas match exactly.",
|
| 317 |
+
erd_visualization=self._generate_erd(),
|
| 318 |
metadata={"status": "ready"},
|
| 319 |
)
|
| 320 |
|
|
|
|
| 474 |
migration_progress=current_score,
|
| 475 |
task_name=self.task_name,
|
| 476 |
schema_diff=diff if diff else "Schemas match exactly.",
|
| 477 |
+
erd_visualization=self._generate_erd(),
|
| 478 |
metadata=meta,
|
| 479 |
)
|
| 480 |
|