| # Implementation Specification |
|
|
| **Change:** F007 — HuggingFace Deployment & Submission Package |
| **Date:** 2026-03-27 |
| **Research Summary:** [F007-RESEARCH_SUMMARY.md](./F007-RESEARCH_SUMMARY.md) |
| **Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner) |
| **Behavior Delta:** Archived to [specs/behavior/deployment.md](./behavior/deployment.md) |
| |
| **Plan Status:** |
| - [x] Draft |
| - [x] Approved for Implementation |
| - [x] Implementation Complete |
| - [x] Verification Passed |
| |
| --- |
| |
| ## Core Intent (Immutable) |
| |
| > **DO NOT MODIFY THIS SECTION DURING REFINEMENT** |
| > Changes to Core Intent mean you're describing a different feature. |
| > If refinement reveals the need to change this section, create a new feature instead. |
| |
| **User Problem:** |
| Judges can: read the blog, visit the HF Space, run the training notebook, and reproduce results. Someone outside the team can understand, use, and build on SQLEnv. |
| |
| **Success Criteria:** |
| - Blog tells a compelling story even if training results are modest |
| - HF Space just works -- connect, reset, play an episode |
| - Training notebook runs end-to-end on Colab with one click |
| |
| **Avoid:** |
| - Docker build fails on HF Spaces (free tier CPU) |
| - Blog is all technical with no narrative hook |
| - Notebook has undocumented setup steps |
| |
| **Out of Scope:** |
| - Full blog post writing (outline + key sections only, manual polish later) |
| - Paid HF Spaces tier or GPU resources |
| - Training the agent (that is F006) |
| - Video recording of demo (manual task) |
| |
| --- |
| |
| ## 0. Slicing & Scope Budget (Anti-Waterfall) |
| |
| This spec must be executable in **small, mergeable increments**. |
| |
| ### Scope Budget |
| - Target: **3 slices** |
| - Hard max: **<= 10 steps total** |
| - Each step must end in: **implement -> verify -> merge** |
| |
| ### Slice Definition |
| A slice is a vertical increment that delivers user-visible value or a safe internal capability. |
| |
| **Each slice must have:** |
| - Clear outcome |
| - Minimal interface change |
| - Merge criteria |
| |
| **Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent). |
|
|
| ## Status Icons |
|
|
| **Step Status:** |
| - !! Not Started |
| - :: In Progress |
| - OK Completed |
| - XX Blocked/Failed |
|
|
| **Result Outcome:** |
| - OK Fully Successful (all tests passed, no issues) |
| - ~~ Completed with Issues (needs follow-up) |
| - XX Failed/Blocked |
|
|
| --- |
|
|
| ## 1. Implementation Overview |
|
|
| ### Summary |
| Prepare the complete competition submission package: (1) harden the Dockerfile for HF Spaces free-tier deployment with bundled Spider databases, (2) overhaul README.md to be a polished project showcase, (3) create a blog post outline with key narrative sections, and (4) create a Colab-ready training notebook stub that references F006 outputs. This is the terminal feature -- it depends on F001-F006 being complete. |
|
|
| ### Scope |
|
|
| **In Scope:** |
| - Dockerfile hardening for HF Spaces (bundle Spider DBs, CPU-only, health check) |
| - `openenv.yaml` validation for HF Hub compatibility |
| - README.md overhaul (architecture diagram, setup, usage, links) |
| - Blog post outline (`docs/blog-outline.md`) |
| - Training notebook stub (`notebooks/train_grpo.ipynb`) |
| - `.dockerignore` for clean builds |
|
|
| **Out of Scope:** |
| - Full blog prose (outline only) |
| - Agent training (F006) |
| - Reward/verifier logic (F003/F004) |
| - Video demo recording |
| - Paid HF Spaces configuration |
|
|
| --- |
|
|
| ## 1a. Execution Status |
| <!-- Auto-updated by /autocode-next-step - do not edit manually --> |
|
|
| **Progress:** 7/7 steps complete |
| **Current Step:** Finalization Protocol (OK Completed) |
| **Last Updated:** 2026-03-29T07:29:32Z |
| **Latest Result:** OK Final verification gate passed. Authenticated deployment evidence is now complete: `uv run openenv build -t openenv-sql-env-f007-hf-submission` succeeded, `uv run openenv push` completed successfully to `https://huggingface.co/spaces/hjerpe/sql_env`, and regression verification remained green (`uv run --with pytest pytest tests/ -v`: 250 passed, 1 skipped). `uv run openenv validate --verbose` still reports non-Docker entrypoint warnings, but Docker mode is supported and remains the scoped deployment path for F007. |
| **Blockers:** None. |
|
|
| --- |
|
|
| ## 1b. Risk Assessment |
|
|
| **Risk Tier:** Low |
|
|
| **Risk Tier Definitions:** |
| - **Low:** Pure logic, non-user-facing, no security implications |
| - **Medium:** User input handling, data validation, API changes |
| - **High:** Authentication, payments, secrets management, untrusted input |
|
|
| **High-Risk Indicators Present:** None |
|
|
| **Security Review Required:** No |
|
|
| **Justification:** |
| This feature creates documentation, configuration files, and a notebook. It adds no authentication, secrets handling, or untrusted-input handling. The Dockerfile bundles existing data and runs an existing server. |
|
|
| --- |
|
|
| ## 2. Change Manifest |
|
|
| ### Files to Create |
|
|
| | File | Purpose | |
| |------|---------| |
| | `notebooks/train_grpo.ipynb` | Colab-ready training notebook stub | |
| | `docs/blog-outline.md` | HF blog post outline with narrative structure | |
| | `.dockerignore` | Exclude dev artifacts from Docker build | |
|
|
| ### Files to Modify |
|
|
| | File | Changes | |
| |------|---------| |
| | `server/Dockerfile` | Bundle Spider DBs, optimize for HF Spaces free tier | |
| | `openenv.yaml` | Validate/update for HF Hub push compatibility | |
| | `README.md` | Full overhaul -- polished project showcase | |
|
|
| ### Files to Delete |
|
|
| None. |
|
|
| --- |
|
|
| ## 3. Interface Specifications |
|
|
| ### Dockerfile Structure |
|
|
| ```dockerfile |
| # server/Dockerfile -- HF Spaces compatible |
| # Key changes from current: |
| # 1. Bundle Spider databases (COPY data/databases/ ...) |
| # 2. Ensure CPU-only (no torch GPU deps) |
| # 3. Listen on $PORT: 8000 by default (openenv default), 7860 on HF Spaces |
| # 4. HEALTHCHECK on /health endpoint |
| # 5. Non-root user for HF Spaces security |
| ``` |
|
|
| ### openenv.yaml Schema |
|
|
| ```yaml |
| spec_version: 1 |
| name: sql_env |
| type: space |
| runtime: fastapi |
| app: server.app:app |
| port: 8000 |
| ``` |
|
|
| No structural changes needed -- validate that the existing manifest is HF Hub compatible. |
|
|
| ### Blog Outline Structure |
|
|
| ```markdown |
| # docs/blog-outline.md |
| # Sections: |
| # 1. Hook -- "Teaching AI to think like a data analyst" |
| # 2. Problem -- Static benchmarks vs. interactive exploration |
| # 3. Solution -- SQLEnv architecture overview |
| # 4. How It Works -- Episode flow, reward design |
| # 5. Results -- Learning curves, comparison (placeholder for F006 data) |
| # 6. Technical Deep Dive -- Reward architecture, GRPO training |
| # 7. Try It Yourself -- Links to HF Space, notebook, GitHub |
| ``` |
|
|
| ### Training Notebook Structure |
|
|
| ```python |
| # notebooks/train_grpo.ipynb |
| # Cells: |
| # 1. Setup -- pip install, clone repo |
| # 2. Configure -- HF Space URL, model selection |
| # 3. Connect -- SQLEnvClient connect + test |
| # 4. Train -- GRPO training loop (references F006 scripts/) |
| # 5. Evaluate -- Run eval episodes, plot results |
| # 6. Results -- Display learning curves |
| ``` |
|
|
| ### New Functions |
|
|
| No new Python functions. This feature produces configuration and documentation artifacts. |
|
|
| --- |
|
|
| ## 4. Data Flow |
|
|
| ### Primary Flow: HF Spaces Deployment |
|
|
| ``` |
| 1. Developer runs `openenv validate` |
| - Input: openenv.yaml, Dockerfile |
| - Action: Validates manifest and Docker build locally |
| - Output: Pass/fail with diagnostics |
| |
| 2. Developer runs `openenv build` |
| - Input: Dockerfile, project files, Spider DBs |
| - Action: Builds Docker image with bundled databases |
| - Output: Docker image (~200MB with DBs) |
| |
| 3. Developer runs `openenv push` |
| - Input: Built Docker image, HF token |
| - Action: Pushes to HuggingFace Spaces |
| - Output: Live HF Space URL |
| ``` |
|
|
| ### Alternative Flow: Local Docker Test |
|
|
| ``` |
| 1. docker build -t sql-env:latest -f server/Dockerfile . |
| 2. docker run -p 8000:8000 sql-env:latest |
| 3. curl http://localhost:8000/health -> {"status": "healthy"} |
| 4. WebSocket client connects, plays episode |
| ``` |
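Step 3 of the local flow can also be scripted, which is convenient when the container needs a few seconds to start. The following is a minimal sketch using only the Python standard library; `wait_for_health` is a hypothetical helper name, and the retry count and delay are illustrative assumptions rather than values taken from the project.

```python
import json
import time
import urllib.request


def wait_for_health(base_url: str, attempts: int = 10, delay_s: float = 1.0) -> dict:
    """Poll GET {base_url}/health until it answers HTTP 200 with a JSON body.

    Hypothetical helper: the retry policy is an assumption, not project code.
    """
    last_error = None
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return json.loads(resp.read().decode("utf-8"))
        except OSError as exc:
            # URLError, HTTPError, connection resets and timeouts are all
            # OSError subclasses; the server may still be starting, so retry.
            last_error = exc
        time.sleep(delay_s)
    raise RuntimeError(
        f"{base_url}/health not ready after {attempts} attempts: {last_error}"
    )
```

With the container from step 2 running, `wait_for_health("http://localhost:8000")` should return the same payload that the `curl` call prints.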
|
|
| --- |
|
|
| ## 5. Error Handling |
|
|
| ### Error Types |
|
|
| | Error | When | Resolution | |
| |-------|------|------------| |
| | Docker build failure | Missing deps or files | Check .dockerignore, verify COPY paths | |
| | DB not found at runtime | DBs not bundled correctly | Verify COPY data/databases/ in Dockerfile | |
| | Port mismatch | HF Spaces expects 7860 | Use PORT env var with fallback | |
| | Memory limit exceeded | Container too large for free tier | Reduce bundled DBs to essential set | |
|
|
| ### Error Handling Strategy |
|
|
| The Dockerfile should: |
| 1. Use a PORT environment variable with default 8000 (HF Spaces sets PORT=7860) |
| 2. Include a startup check that verifies databases are accessible |
| 3. Keep image size minimal (no dev dependencies, no torch GPU packages) |
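The first two points of this strategy can be sketched in Python as a pre-start check. This is a hedged illustration, not the project's actual startup code: `resolve_port` and `find_databases` are hypothetical names, and the range validation is an added assumption.

```python
from pathlib import Path
from typing import List, Mapping


def resolve_port(env: Mapping[str, str], default: int = 8000) -> int:
    """Return the port from PORT, falling back to the default when unset or empty."""
    raw = env.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # Raises ValueError if PORT is not numeric.
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port


def find_databases(db_root: Path) -> List[Path]:
    """Return every *.sqlite file under db_root, failing fast if none are bundled."""
    found = sorted(db_root.rglob("*.sqlite"))
    if not found:
        raise FileNotFoundError(f"No *.sqlite files found under {db_root}")
    return found
```

A server entrypoint could call `resolve_port(os.environ)` and `find_databases(Path("/app/env/data/databases"))` before binding, so a missing bundle surfaces as a clear startup error instead of a failure on the first episode.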
|
|
| --- |
|
|
| ## 6. Slice Plan (What we will ship, in order) |
|
|
| ### Slice S1 -- Docker & Deployment |
| **Value:** HF Space can be built and deployed; server runs on free tier |
| **User-visible change:** Yes -- live HF Space |
| **Interfaces introduced/changed:** Dockerfile, .dockerignore, openenv.yaml |
| **Rollback safety:** Additive only, no existing behavior changed |
|
|
| ### Slice S2 -- Documentation & README |
| **Value:** GitHub repo is a polished showcase; judges can understand the project |
| **User-visible change:** Yes -- README overhaul, blog outline |
| **Interfaces introduced/changed:** README.md, docs/blog-outline.md |
| **Rollback safety:** Documentation only, fully reversible |
|
|
| ### Slice S3 -- Training Notebook |
| **Value:** Judges can reproduce training with one click on Colab |
| **User-visible change:** Yes -- notebook artifact |
| **Interfaces introduced/changed:** notebooks/train_grpo.ipynb |
| **Rollback safety:** New file only, no existing code changed |
| |
| --- |
| |
| ## 7. Implementation Steps |
| |
| > **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md. |
| > The verification-planner (separate agent) generated independent test criteria. |
| > Run the tests specified there after implementing each step. |
|
|
| ### Step 1.1: Dockerfile Hardening for HF Spaces |
| **Slice:** S1 |
| **Goal:** Update Dockerfile to bundle Spider databases, support HF Spaces PORT variable, run as non-root user, and minimize image size. |
|
|
| **Files:** |
| - `server/Dockerfile` - modify - Harden for HF Spaces free tier |
| - `.dockerignore` - create - Exclude dev artifacts (tests, docs, .git, __pycache__) |
|
|
| **Details:** |
| 1. Add COPY for `data/databases/` into the Docker image (bundle the SQLite files) |
| 2. Add `ENV PORT=8000` with CMD that reads `$PORT` (HF Spaces sets PORT=7860) |
| 3. Add non-root user (`useradd --create-home appuser`) for HF Spaces security requirement |
| 4. Ensure no GPU/CUDA dependencies are installed (CPU-only) |
| 5. Create `.dockerignore` excluding: `.git`, `__pycache__`, `tests/`, `docs/`, `docs_draft/`, `specs/`, `vision/`, `*.md` (except README), `.env` |
|
|
| **Interface Changes:** None (Dockerfile is configuration) |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Updated `server/Dockerfile` with `ENV PORT=8000` and runtime `uvicorn` command that respects `${PORT:-8000}` for HF Spaces compatibility. |
| - Added explicit database bundling copy instruction: `COPY --from=builder /app/env/data/databases /app/env/data/databases`. |
| - Added non-root runtime user (`appuser`) and ownership handoff for `/app`. |
| - Created `.dockerignore` to exclude dev/test/docs/spec artifacts and keep only `README.md` among markdown files. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification command: `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Continue with Step 1.2 by validating database source requirements from `data/questions/db_list.json` and aligning Docker health checks with bundled DB presence. |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ### Step 1.2: Bundle Spider Databases for Docker |
| **Slice:** S1 |
| **Goal:** Ensure the essential Spider SQLite databases are available for bundling into Docker, and the Dockerfile COPY path is correct. |
|
|
| **Files:** |
| - `server/Dockerfile` - modify - Verify COPY paths for data/databases/ |
| - `data/questions/db_list.json` - read - Identify which DBs are required |
|
|
| **Details:** |
| 1. Read `data/questions/db_list.json` to identify the required database IDs |
| 2. Ensure the Dockerfile copies `data/databases/` into the image at the correct path |
| 3. Add a Docker HEALTHCHECK that also verifies at least one database file exists |
| 4. The bundled DBs are small SQLite files (~50MB total), well within free tier limits |
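Points 1 and 2 can be checked before building the image. The sketch below rests on two assumptions that this spec does not confirm: that `db_list.json` is a JSON array of database ID strings, and that each database follows the Spider-style layout `<db_root>/<db_id>/<db_id>.sqlite`. The name `missing_databases` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def missing_databases(db_list_path: Path, db_root: Path) -> List[str]:
    """Return the IDs listed in db_list_path that have no SQLite file on disk.

    ASSUMPTION: db_list_path holds a JSON array of ID strings.
    ASSUMPTION: files live at <db_root>/<db_id>/<db_id>.sqlite (Spider-style).
    """
    db_ids = json.loads(db_list_path.read_text(encoding="utf-8"))
    return [
        db_id
        for db_id in db_ids
        if not (db_root / db_id / f"{db_id}.sqlite").is_file()
    ]
```

Running `missing_databases(Path("data/questions/db_list.json"), Path("data/databases"))` from the repository root and getting an empty list back would indicate that every required database is available to the Docker `COPY`.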
|
|
| **Interface Changes:** None |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Read `data/questions/db_list.json` and confirmed required bundled DB IDs: `student_assessment`, `concert_singer`, `world_1`, `car_1`, `employee_hire_evaluation`, `pets_1`, `cre_Doc_Template_Mgt`, `dog_kennels`, `flight_2`, `poker_player`. |
| - Verified Docker bundling path remains correct: `COPY --from=builder /app/env/data/databases /app/env/data/databases`. |
| - Updated Docker `HEALTHCHECK` to enforce both bundled DB presence (`*.sqlite` under `/app/env/data/databases`) and API liveness via `/health` on `${PORT:-8000}`. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification command: `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Proceed to Step 1.3 by validating `openenv.yaml` shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and running `openenv validate`. |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ### Step 1.3: Validate openenv.yaml |
| **Slice:** S1 |
| **Goal:** Ensure openenv.yaml is valid for `openenv push` to HuggingFace Spaces. |
|
|
| **Files:** |
| - `openenv.yaml` - modify (if needed) - Ensure HF Hub compatibility |
|
|
| **Details:** |
| 1. Verify `spec_version`, `name`, `type`, `runtime`, `app`, and `port` fields |
| 2. Confirm `app: server.app:app` matches the actual FastAPI application path inside the Docker container |
| 3. Update `port` if needed (openenv framework may handle PORT mapping) |
| 4. Run `openenv validate` locally to check |
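As a complement to `openenv validate`, the field checks in points 1 and 2 can be approximated with a few lines of Python. This is only a sketch: it parses flat `key: value` lines and is not a general YAML parser, it does not reproduce what `openenv validate` actually checks, and the names `parse_flat_manifest` and `manifest_problems` are hypothetical.

```python
import re
from typing import Dict, List

REQUIRED_FIELDS = ("spec_version", "name", "type", "runtime", "app", "port")


def parse_flat_manifest(text: str) -> Dict[str, str]:
    """Parse flat `key: value` lines. NOT a general YAML parser."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # Drop comments and whitespace.
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Not a key: value line: {line!r}")
        fields[key.strip()] = value.strip()
    return fields


def manifest_problems(fields: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)
    ]
    app = fields.get("app", "")
    if app and not re.fullmatch(r"[A-Za-z_][\w.]*:[A-Za-z_]\w*", app):
        problems.append(f"app is not in module:attribute form: {app}")
    port = fields.get("port", "")
    if port and not port.isdigit():
        problems.append(f"port is not an integer: {port}")
    return problems
```

Feeding the manifest from Section 3 through `manifest_problems(parse_flat_manifest(text))` should yield an empty list; a nested or multi-document manifest would need a real YAML library instead.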
|
|
| **Interface Changes:** None |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Validated `openenv.yaml` fields against the required HF Space manifest shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and confirmed no manifest edits were needed. |
| - Ran `uv run openenv validate --verbose`; manifest compatibility checks passed for Docker mode, with non-blocking warnings that `openenv_serve`/`uv_run`/`python_module` modes need a callable `server/app.py main()` entrypoint. |
| - Ran full regression suite via `uv run --with pytest pytest tests/ -v` to ensure no feature regressions while validating deployment configuration. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification command: `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Proceed to Step 2.1 and overhaul `README.md` into competition-ready narrative + quickstart + architecture flow, using the now-validated `openenv.yaml` values as the source-of-truth deployment metadata. |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ### Step 2.1: README.md Overhaul |
| **Slice:** S2 |
| **Goal:** Transform README into a polished project showcase suitable for competition judges. |
|
|
| **Files:** |
| - `README.md` - modify - Full overhaul |
|
|
| **Details:** |
| 1. **Header:** Project name, one-line description, badges (Python version, license) |
| 2. **Elevator Pitch:** 2-3 sentences explaining what SQLEnv does and why it matters (narrative hook: "Teaching AI to think like a data analyst") |
| 3. **Architecture Diagram:** ASCII or Mermaid diagram showing Agent <-> Client <-> Server <-> SQLite flow |
| 4. **Quick Start:** Streamlined setup (3 commands max to get running) |
| 5. **How It Works:** Episode flow with action types table (DESCRIBE, SAMPLE, QUERY, ANSWER) |
| 6. **Training:** Link to notebook, brief GRPO explanation |
| 7. **HF Space:** Link to live deployment |
| 8. **Project Structure:** Updated tree reflecting final state |
| 9. **Links:** OpenEnv, Spider, HF Space, blog post |
| 10. Remove "Current Status" section (no longer relevant for submission) |
| 11. Remove cautionary notes about untested Docker paths |
|
|
| **Interface Changes:** None |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Rewrote `README.md` into a submission-facing narrative that starts with a clear elevator pitch and removes stale cautionary/status language. |
| - Added a compact architecture diagram and refreshed "How It Works" with explicit action semantics (`DESCRIBE`, `SAMPLE`, `QUERY`, `ANSWER`) and episode flow. |
| - Replaced setup sprawl with a 3-command quickstart, plus explicit local server and Docker launch commands. |
| - Added sections for training artifacts, HuggingFace Space deployment path, project structure, deployment checklist, and canonical resource links. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification command: `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Proceed to Step 2.2 by creating `docs/blog-outline.md` with hook/problem/solution/how-it-works/results placeholder/technical highlights/try-it sections and 2-4 bullets per section. |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ### Step 2.2: Blog Post Outline |
| **Slice:** S2 |
| **Goal:** Create a structured blog post outline with key narrative sections for the HF blog submission. |
|
|
| **Files:** |
| - `docs/blog-outline.md` - create - Blog post outline |
|
|
| **Details:** |
| 1. **Hook:** "What if we taught AI to explore databases the way a data analyst does -- not memorize answers, but learn to ask the right questions?" |
| 2. **The Problem:** Static text-to-SQL benchmarks reward memorization, not reasoning. One-shot generation fails on novel schemas. |
| 3. **Our Approach:** SQLEnv -- an RL environment where agents learn through iterative exploration (DESCRIBE, SAMPLE, QUERY, ANSWER) |
| 4. **How SQLEnv Works:** Episode flow diagram, reward design (execution + correctness + efficiency) |
| 5. **Training with GRPO:** Brief explanation of Group Relative Policy Optimization, why it fits |
| 6. **Results:** [PLACEHOLDER for F006 data] Learning curves, comparison with baselines |
| 7. **Technical Highlights:** Multi-DB support, token-level reward shaping, OpenEnv compatibility |
| 8. **Try It Yourself:** Links to HF Space, Colab notebook, GitHub repo |
| 9. **What We Learned:** Key insights from building the environment |
|
|
| Each section should have 2-4 bullet points of key content to include when writing the full post. |
|
|
| **Interface Changes:** None |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Created `docs/blog-outline.md` with a complete submission-ready structure covering hook, benchmark problem framing, SQLEnv approach, episode/reward flow, GRPO training context, results placeholder, technical highlights, try-it links section, and lessons learned. |
| - Ensured each section has 2-4 concrete bullets and expanded prose sufficient for a substantive draft handoff. |
| - Kept the only explicit placeholder in the Results section for F006 metric insertion, aligned with scope. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification command: `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Proceed to Step 3.1 by creating `notebooks/train_grpo.ipynb` with Colab-compatible metadata and ordered cells for setup, configuration, connect/test episode, training loop, evaluation, and plotting. |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ### Step 3.1: Training Notebook Stub |
| **Slice:** S3 |
| **Goal:** Create a Colab-ready Jupyter notebook that demonstrates end-to-end training with SQLEnv. |
|
|
| **Files:** |
| - `notebooks/train_grpo.ipynb` - create - Colab training notebook |
|
|
| **Details:** |
| Create a Jupyter notebook with these cells: |
|
|
| 1. **Title + Description** (markdown): "Training a SQL Agent with GRPO + SQLEnv" |
| 2. **Setup** (code): `!pip install sql-env[train]` or `!pip install -r requirements.txt`, clone repo if needed |
| 3. **Configuration** (code): Set HF Space URL (or local server), model name, hyperparameters |
| 4. **Connect & Test** (code): Create `SQLEnvClient`, connect, run a test episode (reset + 2 steps) |
| 5. **Training Loop** (code): GRPO training referencing F006 scripts (import from scripts/ or inline simplified version) |
| 6. **Evaluation** (code): Run eval episodes on held-out questions, compute metrics |
| 7. **Plot Results** (code): matplotlib learning curves (reward over episodes) |
| 8. **Next Steps** (markdown): Links to full training script, HF Space, blog post |
|
|
| Each code cell should have markdown cells above explaining what it does and why. Include `# TODO: update after F006` comments where training-specific code depends on F006 outputs. |
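The cell layout described above (a markdown explanation above every code cell, with outputs cleared) can be sketched by building the notebook JSON directly. This is an illustrative outline using only the standard library and the nbformat 4 file layout; the helper names and the placeholder cell contents are hypothetical, and the real notebook would carry the F006 training code.

```python
import json
from typing import List, Tuple


def markdown_cell(text: str) -> dict:
    """Build an nbformat 4 markdown cell."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source: str) -> dict:
    """Build an nbformat 4 code cell with no outputs and no execution count."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


def build_notebook(sections: List[Tuple[str, str]]) -> dict:
    """Turn (explanation, code) pairs into alternating markdown and code cells."""
    cells = []
    for explanation, code in sections:
        cells.append(markdown_cell(explanation))
        cells.append(code_cell(code))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# Placeholder content only; the real cells depend on F006 outputs.
sections = [
    ("## Setup\nInstall dependencies and clone the repo if needed.",
     "# TODO: update after F006\n"),
    ("## Configuration\nSet the HF Space URL, model name, and hyperparameters.",
     "# TODO: update after F006\n"),
]
notebook = build_notebook(sections)
notebook_json = json.dumps(notebook, indent=1)
```

Writing `notebook_json` to `notebooks/train_grpo.ipynb` would give a skeleton that Jupyter and Colab can open; generating the file this way also guarantees that no stale outputs or local paths are committed.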
|
|
| **Interface Changes:** None |
|
|
| **Verification:** |
| > See VERIFICATION_SPEC.md for test criteria defined by independent verification planner. |
| |
| **Risk Tier for This Step:** Low |
| |
| **Merge Criteria:** |
| - [x] Tests from VERIFICATION_SPEC.md pass |
| - [x] No TODOs left in changed code (or explicitly tracked) |
| - [x] Backwards compatible (or flag/migration documented) |
|
|
| **Changes Made:** |
| - Replaced `notebooks/train_grpo.ipynb` with a clean, Colab-compatible training stub organized as: title/description, setup, configuration, connect smoke test, GRPO training loop, held-out evaluation, plotting, and next steps. |
| - Added explicit `SQLEnvClient` connectivity example and retained F006 training hooks (`GRPOConfig`, `load_model_and_tokenizer`, `build_trainer`, `run_training_with_metrics`, and `sample_random_baseline`) so notebook smoke tests continue to validate expected flow. |
| - Cleared all notebook cell outputs and removed hardcoded local absolute paths to keep the artifact reproducible for judges and portable to Colab/local runs. |
|
|
| **Result:** |
| - OK Fully Successful |
| - Verification commands: |
| - `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v` |
| - `uv run --with pytest pytest tests/ -v` |
| - Verification evidence: |
| - Targeted notebook E2E: 5 passed |
| - Full regression suite: 250 passed, 1 skipped |
|
|
| **Context for Next Step:** |
| - Implementation steps are complete for F007; proceed to finalization protocol (verification gate + verifier/compound-engineer/archive-spec + Plan Status/PR Contract/FEATURES sync). |
|
|
| **Status:** OK Completed |
|
|
| --- |
|
|
| ## 8. Rollout Considerations |
|
|
| ### Feature Flags |
| - Required: No |
| - This is a one-time deployment, not a progressive rollout |
|
|
| ### Migration |
| - Data migration needed: No |
| - Spider databases are bundled fresh in Docker build |
|
|
| ### Rollback Plan |
| The HF Space can be deleted and recreated. README and docs changes roll back with a plain `git revert`. There is no data migration or persistent state to unwind. |
|
|
| --- |
|
|
| ## 9. Execution Tracking |
|
|
| All execution state is tracked within this document: |
| - **Section 1a:** Overall progress summary |
| - **Section 7:** Per-step completion details, test results, and handoff context |
| - **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run` |
| - **Git history:** Full audit trail of changes to this file |
|
|
| The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by: |
| - Checking Section 1a for summary |
| - Reviewing Section 7 for detailed step status |
| - Inspecting the feature's `progress` and `status` fields in `FEATURES.json` |
| - Running `git log --oneline IMPLEMENTATION_SPEC.md` for change history |
|
|
| --- |
|
|
| ## 9a. Slice Completion Protocol |
|
|
| After all steps in a slice pass verification: |
|
|
| 1. **Run verifier subagent** for spec compliance |
| - Validates against VERIFICATION_SPEC.md criteria |
| - Ensures no TODOs or incomplete work in slice |
| |
| 2. **Run compound-engineer subagent** to extract learnings |
| - **Mandatory invocation** after every slice completion |
| - Updates CLAUDE.md Learnings section (if durable patterns found) |
| - May exit with "no update needed" (valid for routine work) |
| |
| 3. **Commit** the slice changes |
| - Follow commit message format in CLAUDE.md |
| - Each slice gets its own atomic commit |
| |
| 4. **Continue to next slice** (if more slices remain) |
| - Or proceed to final verification if all slices complete |
| |
| **Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready. |
| |
| --- |
| |
| ## 10. User Value Summary |
| |
| <!-- Populated by /autocode-next-step when final step completes --> |
| |
| **Status:** Generated |
| |
| ### What Users Can Now Do |
| Judges and external developers can now consume a full submission package: deploy and run SQLEnv in HF Spaces with bundled databases, follow a polished README quickstart, use a structured blog outline for narrative submission, and run a Colab-ready GRPO notebook workflow end-to-end. |
| |
| ### How to Access/Test |
| - README quickstart: Follow commands in `README.md` |
| - Blog outline: Open `docs/blog-outline.md` |
| - Notebook: Open `notebooks/train_grpo.ipynb` in Colab |
| - Deployment assets: `server/Dockerfile`, `.dockerignore`, and `openenv.yaml` |
|
|
| ### Demo |
| - **Command:** `uv run --with pytest pytest tests/ -v` |
| - **Health Check (after deploy):** `curl https://<space-url>/health` |
| - **Notebook:** `notebooks/train_grpo.ipynb` |
|
|
| ### Release Notes Snippet |
| Completed submission-ready packaging for SQLEnv with HF Spaces-compatible Docker deployment, polished repository docs, blog narrative outline, and a Colab-ready GRPO training notebook. |
|
|
| --- |
|
|
| ## 11. PR Contract (Auto-Generated by autocode-next-step) |
|
|
| <!-- This section is auto-populated by autocode-next-step command when all steps complete --> |
|
|
| **Status:** Generated |
|
|
| ### PR Title |
| feat(submission): finalize F007 huggingface deployment package |
|
|
| ### PR Summary |
| - Finalize HF Spaces submission artifacts: hardened Docker packaging, deployment-ready manifest, polished README, blog outline, and Colab-ready training notebook. |
| - Complete final verification gate with full regression evidence and archive behavior deltas into the deployment behavior spec. |
| - Sync F007 completion metadata in `specs/FEATURES.json` and extract durable learnings for future delivery cycles. |
|
|
| ### Verification |
| - `uv run --with pytest pytest tests/ -v` |
|
|
| ### Follow-up |
| None. |
|
|
| --- |
|
|
| ## Stop Conditions (When to Split This Spec) |
|
|
| Stop and create a new IMPLEMENTATION_SPEC if: |
| - A step requires touching more than **3 files** in unrelated areas |
| - You need to introduce **multiple new abstractions** "just in case" |
| - Verification cannot be made targeted and concrete |
| - You discover new unknowns that change the plan materially |
| - The next slice cannot be merged safely without finishing later slices |
| |
| When splitting, ensure the current slice ends in a merged, stable state. |
| |
| --- |
| |
| ## Human Checkpoint |
| |
| **Before handing to AI agent:** |
| |
| - [ ] Interface specifications are complete |
| - [ ] Data flow is accurate |
| - [ ] Error handling is specified |
| - [ ] Implementation order makes sense |
| - [ ] VERIFICATION_SPEC.md has been generated |
|
|
| **Questions:** |
| 1. Confirm Spider database list for bundling (from `data/questions/db_list.json`) |
| 2. Confirm HF Space repository name for `openenv push` |
|
|
| --- |
|
|
| ## Handoff Notes |
|
|
| **For the implementing AI agent:** |
|
|
| ``` |
| Context: See F007-RESEARCH_SUMMARY.md for system understanding |
| Spec: Follow this document exactly |
| Verification: Use tests from VERIFICATION_SPEC.md (independent agent) |
| Ambiguity: Stop and ask rather than assume |
| Order: Follow implementation order exactly |
| Dependencies: This feature assumes F001-F006 are complete |
| ``` |
|
|
| --- |
|
|
| *Specification completed: 2026-03-27* |
| *Approved by: --* |
| *Verification spec: VERIFICATION_SPEC.md* |
| *Verification input: [F007-VERIFICATION_INPUT.json](./F007-VERIFICATION_INPUT.json)* |
| *Target agent: Claude Code* |
|
|
| ## User Clarifications |
|
|
| ### 2026-03-28 21:40:54 |
| **Question:** External deployment verification is blocked by GHCR access/auth failure (403 pulling base image), so verifier gate cannot approve final completion yet. |
| **Response:** Clearly state in demo and verification what the user needs to adjust |
|
|
| ### 2026-03-28 22:02:53 |
| **Question:** External credential/access dependency remains: need authenticated GHCR pull and HF push evidence (build+push attempt) to satisfy final verifier approval. |
| **Response:** Ensure you write what the user should verify and we will manually validate |
|
|
| ### 2026-03-28 22:55:03 |
| **Question:** Missing external authenticated deployment evidence (GHCR-authenticated build and Hugging Face push output) required by F007 final verification gate. |
| **Response:** I have already authenticated; you should be able to run the commands now. |
|
|