# Implementation Specification
**Change:** F007 — HuggingFace Deployment & Submission Package
**Date:** 2026-03-27
**Research Summary:** [F007-RESEARCH_SUMMARY.md](./F007-RESEARCH_SUMMARY.md)
**Verification Spec:** See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
**Behavior Delta:** Archived to [specs/behavior/deployment.md](./behavior/deployment.md)
**Plan Status:**
- [x] Draft
- [x] Approved for Implementation
- [x] Implementation Complete
- [x] Verification Passed
---
## Core Intent (Immutable)
> **DO NOT MODIFY THIS SECTION DURING REFINEMENT**
> Changes to Core Intent mean you're describing a different feature.
> If refinement reveals the need to change this section, create a new feature instead.
**User Problem:**
Judges can: read the blog, visit the HF Space, run the training notebook, and reproduce results. Someone outside the team can understand, use, and build on SQLEnv.
**Success Criteria:**
- Blog tells a compelling story even if training results are modest
- HF Space just works -- connect, reset, play an episode
- Training notebook runs end-to-end on Colab with one click
**Avoid:**
- Docker build fails on HF Spaces (free tier CPU)
- Blog is all technical with no narrative hook
- Notebook has undocumented setup steps
**Out of Scope:**
- Full blog post writing (outline + key sections only, manual polish later)
- Paid HF Spaces tier or GPU resources
- Training the agent (that is F006)
- Video recording of demo (manual task)
---
## 0. Slicing & Scope Budget (Anti-Waterfall)
This spec must be executable in **small, mergeable increments**.
### Scope Budget
- Target: **3 slices**
- Hard max: **<= 10 steps total**
- Each step must end in: **implement -> verify -> merge**
### Slice Definition
A slice is a vertical increment that delivers user-visible value or a safe internal capability.
**Each slice must have:**
- Clear outcome
- Minimal interface change
- Merge criteria
**Note:** Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).
## Status Icons
**Step Status:**
- !! Not Started
- :: In Progress
- OK Completed
- XX Blocked/Failed
**Result Outcome:**
- OK Fully Successful (all tests passed, no issues)
- ~~ Completed with Issues (needs follow-up)
- XX Failed/Blocked
---
## 1. Implementation Overview
### Summary
Prepare the complete competition submission package: (1) harden the Dockerfile for HF Spaces free-tier deployment with bundled Spider databases, (2) overhaul README.md to be a polished project showcase, (3) create a blog post outline with key narrative sections, and (4) create a Colab-ready training notebook stub that references F006 outputs. This is the terminal feature -- it depends on F001-F006 being complete.
### Scope
**In Scope:**
- Dockerfile hardening for HF Spaces (bundle Spider DBs, CPU-only, health check)
- `openenv.yaml` validation for HF Hub compatibility
- README.md overhaul (architecture diagram, setup, usage, links)
- Blog post outline (`docs/blog-outline.md`)
- Training notebook stub (`notebooks/train_grpo.ipynb`)
- `.dockerignore` for clean builds
**Out of Scope:**
- Full blog prose (outline only)
- Agent training (F006)
- Reward/verifier logic (F003/F004)
- Video demo recording
- Paid HF Spaces configuration
---
## 1a. Execution Status
<!-- Auto-updated by /autocode-next-step - do not edit manually -->
**Progress:** 7/7 steps complete
**Current Step:** Finalization Protocol (OK Completed)
**Last Updated:** 2026-03-29T07:29:32Z
**Latest Result:** OK Final verification gate passed. Authenticated deployment evidence is now complete: `uv run openenv build -t openenv-sql-env-f007-hf-submission` succeeded, `uv run openenv push` completed successfully to `https://huggingface.co/spaces/hjerpe/sql_env`, and regression verification remained green (`uv run --with pytest pytest tests/ -v`: 250 passed, 1 skipped). `uv run openenv validate --verbose` still reports non-Docker entrypoint warnings, but Docker mode is supported and remains the scoped deployment path for F007.
**Blockers:** None.
---
## 1b. Risk Assessment
**Risk Tier:** Low
**Risk Tier Definitions:**
- **Low:** Pure logic, non-user-facing, no security implications
- **Medium:** User input handling, data validation, API changes
- **High:** Authentication, payments, secrets management, untrusted input
**High-Risk Indicators Present:** None
**Security Review Required:** No
**Justification:**
This feature creates documentation, configuration files, and a notebook. No authentication, secrets, or untrusted input handling. The Dockerfile bundles existing data and runs an existing server.
---
## 2. Change Manifest
### Files to Create
| File | Purpose |
|------|---------|
| `notebooks/train_grpo.ipynb` | Colab-ready training notebook stub |
| `docs/blog-outline.md` | HF blog post outline with narrative structure |
| `.dockerignore` | Exclude dev artifacts from Docker build |
### Files to Modify
| File | Changes |
|------|---------|
| `server/Dockerfile` | Bundle Spider DBs, optimize for HF Spaces free tier |
| `openenv.yaml` | Validate/update for HF Hub push compatibility |
| `README.md` | Full overhaul -- polished project showcase |
### Files to Delete
None.
---
## 3. Interface Specifications
### Dockerfile Structure
```dockerfile
# server/Dockerfile -- HF Spaces compatible
# Key changes from current:
# 1. Bundle Spider databases (COPY data/databases/ ...)
# 2. Ensure CPU-only (no torch GPU deps)
# 3. Expose port 7860 (HF Spaces default) OR 8000 (openenv default)
# 4. HEALTHCHECK on /health endpoint
# 5. Non-root user for HF Spaces security
```
### openenv.yaml Schema
```yaml
spec_version: 1
name: sql_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
```
No structural changes needed -- validate existing manifest is HF Hub compatible.
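As a rough illustration of what "validate existing manifest" could mean in practice, the sketch below checks a parsed manifest for the six fields shown above. It is not how `openenv validate` works internally; `check_manifest` and `REQUIRED_FIELDS` are hypothetical names, and the expected types are assumptions inferred from the example values.

```python
# Hypothetical shape check for a parsed openenv.yaml (assumed field types).
REQUIRED_FIELDS = {
    "spec_version": int,
    "name": str,
    "type": str,
    "runtime": str,
    "app": str,
    "port": int,
}


def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in manifest:
            problems.append(f"missing field: {field}")
        elif not isinstance(manifest[field], expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
    # The app entry is expected to be an import path plus attribute.
    app = manifest.get("app")
    if isinstance(app, str) and app.count(":") != 1:
        problems.append("app should look like 'module.path:attribute'")
    port = manifest.get("port")
    if isinstance(port, int) and not 1 <= port <= 65535:
        problems.append("port out of range")
    return problems


# The manifest values from the schema above, as a plain dict.
manifest = {
    "spec_version": 1,
    "name": "sql_env",
    "type": "space",
    "runtime": "fastapi",
    "app": "server.app:app",
    "port": 8000,
}
print(check_manifest(manifest))
```

The dict is written inline to keep the sketch free of a YAML dependency; in a real check it would come from whatever YAML loader the project already uses.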
### Blog Outline Structure
```markdown
# docs/blog-outline.md
# Sections:
# 1. Hook -- "Teaching AI to think like a data analyst"
# 2. Problem -- Static benchmarks vs. interactive exploration
# 3. Solution -- SQLEnv architecture overview
# 4. How It Works -- Episode flow, reward design
# 5. Results -- Learning curves, comparison (placeholder for F006 data)
# 6. Technical Deep Dive -- Reward architecture, GRPO training
# 7. Try It Yourself -- Links to HF Space, notebook, GitHub
```
### Training Notebook Structure
```python
# notebooks/train_grpo.ipynb
# Cells:
# 1. Setup -- pip install, clone repo
# 2. Configure -- HF Space URL, model selection
# 3. Connect -- SQLEnvClient connect + test
# 4. Train -- GRPO training loop (references F006 scripts/)
# 5. Evaluate -- Run eval episodes, plot results
# 6. Results -- Display learning curves
```
### New Functions
No new Python functions. This feature produces configuration and documentation artifacts.
---
## 4. Data Flow
### Primary Flow: HF Spaces Deployment
```
1. Developer runs `openenv validate`
- Input: openenv.yaml, Dockerfile
- Action: Validates manifest and Docker build locally
- Output: Pass/fail with diagnostics
2. Developer runs `openenv build`
- Input: Dockerfile, project files, Spider DBs
- Action: Builds Docker image with bundled databases
- Output: Docker image (~200MB with DBs)
3. Developer runs `openenv push`
- Input: Built Docker image, HF token
- Action: Pushes to HuggingFace Spaces
- Output: Live HF Space URL
```
### Alternative Flow: Local Docker Test
```
1. docker build -t sql-env:latest -f server/Dockerfile .
2. docker run -p 8000:8000 sql-env:latest
3. curl http://localhost:8000/health -> {"status": "healthy"}
4. WebSocket client connects, plays episode
```
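Step 3 of the local test can be automated with a small poller, since a freshly started container may refuse connections for a few seconds. The following is a minimal sketch using only the standard library; `fetch_json` and `wait_for_health` are hypothetical helpers, and the retry count and delay are arbitrary assumptions. The expected `{"status": "healthy"}` body is taken from the flow above.

```python
import json
import time
import urllib.request


def fetch_json(url, timeout=2.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_health(url, attempts=10, delay=1.0, fetch=fetch_json):
    """Poll the health endpoint until it reports healthy or attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch(url).get("status") == "healthy":
                return True
        except (OSError, ValueError, AttributeError):
            # Connection refused, timeout, or a body that is not the expected JSON object.
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_for_health("http://localhost:8000/health")` after `docker run` would block until the server answers or the attempts are exhausted. The `fetch` parameter exists only so the polling logic can be exercised without a running container.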
---
## 5. Error Handling
### Error Types
| Error | When | Resolution |
|-------|------|------------|
| Docker build failure | Missing deps or files | Check .dockerignore, verify COPY paths |
| DB not found at runtime | DBs not bundled correctly | Verify COPY data/databases/ in Dockerfile |
| Port mismatch | HF Spaces expects 7860 | Use PORT env var with fallback |
| Memory limit exceeded | Container too large for free tier | Reduce bundled DBs to essential set |
### Error Handling Strategy
The Dockerfile should:
1. Use a PORT environment variable with default 8000 (HF Spaces sets PORT=7860)
2. Include a startup check that verifies databases are accessible
3. Keep image size minimal (no dev dependencies, no torch GPU packages)
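Points 1 and 2 of this strategy can be sketched in a few lines of Python. This is an illustration under assumptions, not the project's actual startup code: `resolve_port` and `readable_databases` are hypothetical names, and "accessible" is interpreted here as "opens read-only and answers a trivial query".

```python
import os
import sqlite3
from pathlib import Path


def resolve_port(env=os.environ, default=8000):
    """Read PORT from the environment, falling back to the default."""
    raw = env.get("PORT", "").strip()
    return int(raw) if raw else default


def readable_databases(root):
    """Return the SQLite files under root that open read-only and answer a query."""
    readable = []
    for path in sorted(Path(root).rglob("*.sqlite")):
        uri = path.resolve().as_uri() + "?mode=ro"  # read-only: never create or modify
        try:
            conn = sqlite3.connect(uri, uri=True)
            try:
                conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
            finally:
                conn.close()
        except sqlite3.Error:
            continue  # unreadable or corrupt file, skip it
        readable.append(path)
    return readable
```

A startup hook could call `readable_databases("/app/env/data/databases")` and exit non-zero when the list is empty, so that a mis-bundled image fails fast instead of failing on the first episode.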
---
## 6. Slice Plan (What we will ship, in order)
### Slice S1 -- Docker & Deployment
**Value:** HF Space can be built and deployed; server runs on free tier
**User-visible change:** Yes -- live HF Space
**Interfaces introduced/changed:** Dockerfile, .dockerignore, openenv.yaml
**Rollback safety:** Additive only, no existing behavior changed
### Slice S2 -- Documentation & README
**Value:** GitHub repo is a polished showcase; judges can understand the project
**User-visible change:** Yes -- README overhaul, blog outline
**Interfaces introduced/changed:** README.md, docs/blog-outline.md
**Rollback safety:** Documentation only, fully reversible
### Slice S3 -- Training Notebook
**Value:** Judges can reproduce training with one click on Colab
**User-visible change:** Yes -- notebook artifact
**Interfaces introduced/changed:** notebooks/train_grpo.ipynb
**Rollback safety:** New file only, no existing code changed
---
## 7. Implementation Steps
> **VERIFICATION NOTE:** Test criteria for each step are defined in VERIFICATION_SPEC.md.
> The verification-planner (separate agent) generated independent test criteria.
> Run the tests specified there after implementing each step.
### Step 1.1: Dockerfile Hardening for HF Spaces
**Slice:** S1
**Goal:** Update Dockerfile to bundle Spider databases, support HF Spaces PORT variable, run as non-root user, and minimize image size.
**Files:**
- `server/Dockerfile` - modify - Harden for HF Spaces free tier
- `.dockerignore` - create - Exclude dev artifacts (tests, docs, .git, __pycache__)
**Details:**
1. Add COPY for `data/databases/` into the Docker image (bundle the SQLite files)
2. Add `ENV PORT=8000` with CMD that reads `$PORT` (HF Spaces sets PORT=7860)
3. Add non-root user (`useradd --create-home appuser`) for HF Spaces security requirement
4. Ensure no GPU/CUDA dependencies are installed (CPU-only)
5. Create `.dockerignore` excluding: `.git`, `__pycache__`, `tests/`, `docs/`, `docs_draft/`, `specs/`, `vision/`, `*.md` (except README), `.env`
**Interface Changes:** None (Dockerfile is configuration)
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Updated `server/Dockerfile` with `ENV PORT=8000` and runtime `uvicorn` command that respects `${PORT:-8000}` for HF Spaces compatibility.
- Added explicit database bundling copy instruction: `COPY --from=builder /app/env/data/databases /app/env/data/databases`.
- Added non-root runtime user (`appuser`) and ownership handoff for `/app`.
- Created `.dockerignore` to exclude dev/test/docs/spec artifacts and keep only `README.md` among markdown files.
**Result:**
- OK Fully Successful
- Verification command: `uv run --with pytest pytest tests/ -v`
- Verification evidence: 250 passed, 1 skipped
**Context for Next Step:**
- Continue with Step 1.2 by validating database source requirements from `data/questions/db_list.json` and aligning Docker health checks with bundled DB presence.
**Status:** OK Completed
---
### Step 1.2: Bundle Spider Databases for Docker
**Slice:** S1
**Goal:** Ensure the essential Spider SQLite databases are available for bundling into Docker, and the Dockerfile COPY path is correct.
**Files:**
- `server/Dockerfile` - modify - Verify COPY paths for data/databases/
- `data/questions/db_list.json` - read - Identify which DBs are required
**Details:**
1. Read `data/questions/db_list.json` to identify the required database IDs
2. Ensure the Dockerfile copies `data/databases/` into the image at the correct path
3. Add a Docker HEALTHCHECK that also verifies at least one database file exists
4. The bundled DBs are small SQLite files (~50MB total), well within free tier limits
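Steps 1 and 2 amount to comparing the required IDs against the files on disk. The sketch below shows one way to do that before building the image. It rests on two assumptions that are not confirmed by this spec: that `db_list.json` is a flat JSON array of database IDs, and that each database lives at `<db_id>/<db_id>.sqlite` under the databases directory (the usual Spider layout). `missing_databases` is a hypothetical helper.

```python
import json
from pathlib import Path


def missing_databases(db_list_path, databases_root):
    """Return required database IDs that have no SQLite file under databases_root."""
    # Assumption: the file holds a flat JSON array such as ["pets_1", "car_1"].
    required = json.loads(Path(db_list_path).read_text(encoding="utf-8"))
    root = Path(databases_root)
    # Assumption: Spider-style layout, one directory per database ID.
    return [
        db_id for db_id in required
        if not (root / db_id / f"{db_id}.sqlite").is_file()
    ]
```

Running `missing_databases("data/questions/db_list.json", "data/databases")` from the repository root and failing the build on a non-empty result would catch a missing download before `docker build` does.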
**Interface Changes:** None
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Read `data/questions/db_list.json` and confirmed required bundled DB IDs: `student_assessment`, `concert_singer`, `world_1`, `car_1`, `employee_hire_evaluation`, `pets_1`, `cre_Doc_Template_Mgt`, `dog_kennels`, `flight_2`, `poker_player`.
- Verified Docker bundling path remains correct: `COPY --from=builder /app/env/data/databases /app/env/data/databases`.
- Updated Docker `HEALTHCHECK` to enforce both bundled DB presence (`*.sqlite` under `/app/env/data/databases`) and API liveness via `/health` on `${PORT:-8000}`.
**Result:**
- OK Fully Successful
- Verification command: `uv run --with pytest pytest tests/ -v`
- Verification evidence: 250 passed, 1 skipped
**Context for Next Step:**
- Proceed to Step 1.3 by validating `openenv.yaml` shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and running `openenv validate`.
**Status:** OK Completed
---
### Step 1.3: Validate openenv.yaml
**Slice:** S1
**Goal:** Ensure openenv.yaml is valid for `openenv push` to HuggingFace Spaces.
**Files:**
- `openenv.yaml` - modify (if needed) - Ensure HF Hub compatibility
**Details:**
1. Verify `spec_version`, `name`, `type`, `runtime`, `app`, and `port` fields
2. Confirm `app: server.app:app` matches the actual FastAPI application path inside the Docker container
3. Update `port` if needed (openenv framework may handle PORT mapping)
4. Run `openenv validate` locally to check
**Interface Changes:** None
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Validated `openenv.yaml` fields against the required HF Space manifest shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and confirmed no manifest edits were needed.
- Ran `uv run openenv validate --verbose`; manifest compatibility checks passed for Docker mode, with non-blocking warnings that `openenv_serve`/`uv_run`/`python_module` modes need a callable `server/app.py main()` entrypoint.
- Ran full regression suite via `uv run --with pytest pytest tests/ -v` to ensure no feature regressions while validating deployment configuration.
**Result:**
- OK Fully Successful
- Verification command: `uv run --with pytest pytest tests/ -v`
- Verification evidence: 250 passed, 1 skipped
**Context for Next Step:**
- Proceed to Step 2.1 and overhaul `README.md` into competition-ready narrative + quickstart + architecture flow, using the now-validated `openenv.yaml` values as the source-of-truth deployment metadata.
**Status:** OK Completed
---
### Step 2.1: README.md Overhaul
**Slice:** S2
**Goal:** Transform README into a polished project showcase suitable for competition judges.
**Files:**
- `README.md` - modify - Full overhaul
**Details:**
1. **Header:** Project name, one-line description, badges (Python version, license)
2. **Elevator Pitch:** 2-3 sentences explaining what SQLEnv does and why it matters (narrative hook: "Teaching AI to think like a data analyst")
3. **Architecture Diagram:** ASCII or Mermaid diagram showing Agent <-> Client <-> Server <-> SQLite flow
4. **Quick Start:** Streamlined setup (3 commands max to get running)
5. **How It Works:** Episode flow with action types table (DESCRIBE, SAMPLE, QUERY, ANSWER)
6. **Training:** Link to notebook, brief GRPO explanation
7. **HF Space:** Link to live deployment
8. **Project Structure:** Updated tree reflecting final state
9. **Links:** OpenEnv, Spider, HF Space, blog post
10. Remove "Current Status" section (no longer relevant for submission)
11. Remove cautionary notes about untested Docker paths
**Interface Changes:** None
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Rewrote `README.md` into a submission-facing narrative that starts with a clear elevator pitch and removes stale cautionary/status language.
- Added a compact architecture diagram and refreshed "How It Works" with explicit action semantics (`DESCRIBE`, `SAMPLE`, `QUERY`, `ANSWER`) and episode flow.
- Replaced setup sprawl with a 3-command quickstart, plus explicit local server and Docker launch commands.
- Added sections for training artifacts, HuggingFace Space deployment path, project structure, deployment checklist, and canonical resource links.
**Result:**
- OK Fully Successful
- Verification command: `uv run --with pytest pytest tests/ -v`
- Verification evidence: 250 passed, 1 skipped
**Context for Next Step:**
- Proceed to Step 2.2 by creating `docs/blog-outline.md` with hook/problem/solution/how-it-works/results placeholder/technical highlights/try-it sections and 2-4 bullets per section.
**Status:** OK Completed
---
### Step 2.2: Blog Post Outline
**Slice:** S2
**Goal:** Create a structured blog post outline with key narrative sections for the HF blog submission.
**Files:**
- `docs/blog-outline.md` - create - Blog post outline
**Details:**
1. **Hook:** "What if we taught AI to explore databases the way a data analyst does -- not memorize answers, but learn to ask the right questions?"
2. **The Problem:** Static text-to-SQL benchmarks reward memorization, not reasoning. One-shot generation fails on novel schemas.
3. **Our Approach:** SQLEnv -- an RL environment where agents learn through iterative exploration (DESCRIBE, SAMPLE, QUERY, ANSWER)
4. **How SQLEnv Works:** Episode flow diagram, reward design (execution + correctness + efficiency)
5. **Training with GRPO:** Brief explanation of Group Relative Policy Optimization, why it fits
6. **Results:** [PLACEHOLDER for F006 data] Learning curves, comparison with baselines
7. **Technical Highlights:** Multi-DB support, token-level reward shaping, OpenEnv compatibility
8. **Try It Yourself:** Links to HF Space, Colab notebook, GitHub repo
9. **What We Learned:** Key insights from building the environment
Each section should have 2-4 bullet points of key content to include when writing the full post.
**Interface Changes:** None
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Created `docs/blog-outline.md` with a complete submission-ready structure covering hook, benchmark problem framing, SQLEnv approach, episode/reward flow, GRPO training context, results placeholder, technical highlights, try-it links section, and lessons learned.
- Ensured each section has 2-4 concrete bullets and expanded prose sufficient for a substantive draft handoff.
- Kept the only explicit placeholder in the Results section for F006 metric insertion, aligned with scope.
**Result:**
- OK Fully Successful
- Verification command: `uv run --with pytest pytest tests/ -v`
- Verification evidence: 250 passed, 1 skipped
**Context for Next Step:**
- Proceed to Step 3.1 by creating `notebooks/train_grpo.ipynb` with Colab-compatible metadata and ordered cells for setup, configuration, connect/test episode, training loop, evaluation, and plotting.
**Status:** OK Completed
---
### Step 3.1: Training Notebook Stub
**Slice:** S3
**Goal:** Create a Colab-ready Jupyter notebook that demonstrates end-to-end training with SQLEnv.
**Files:**
- `notebooks/train_grpo.ipynb` - create - Colab training notebook
**Details:**
Create a Jupyter notebook with these cells:
1. **Title + Description** (markdown): "Training a SQL Agent with GRPO + SQLEnv"
2. **Setup** (code): `!pip install sql-env[train]` or `!pip install -r requirements.txt`, clone repo if needed
3. **Configuration** (code): Set HF Space URL (or local server), model name, hyperparameters
4. **Connect & Test** (code): Create `SQLEnvClient`, connect, run a test episode (reset + 2 steps)
5. **Training Loop** (code): GRPO training referencing F006 scripts (import from scripts/ or inline simplified version)
6. **Evaluation** (code): Run eval episodes on held-out questions, compute metrics
7. **Plot Results** (code): matplotlib learning curves (reward over episodes)
8. **Next Steps** (markdown): Links to full training script, HF Space, blog post
Each code cell should be preceded by a markdown cell explaining what it does and why. Include `# TODO: update after F006` comments where training-specific code depends on F006 outputs.
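The cell layout above can be generated rather than hand-edited, which also guarantees that the stub ships with no stored outputs. The following sketch builds a skeleton in the Jupyter notebook v4 JSON format using only the standard library. The helper names are hypothetical, the cell bodies are placeholders, and no Colab-specific metadata is included because this spec does not define any.

```python
import json


def markdown_cell(text):
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source):
    # Code cells start with no outputs and no execution count.
    return {"cell_type": "code", "metadata": {}, "execution_count": None,
            "outputs": [], "source": source}


# Code sections from the list above, in order.
SECTIONS = ["Setup", "Configuration", "Connect & Test",
            "Training Loop", "Evaluation", "Plot Results"]


def build_notebook_stub():
    """Assemble title, one markdown + code pair per section, and a closing cell."""
    cells = [markdown_cell("# Training a SQL Agent with GRPO + SQLEnv")]
    for title in SECTIONS:
        cells.append(markdown_cell(f"## {title}"))
        cells.append(code_cell("# TODO: update after F006"))
    cells.append(markdown_cell("## Next Steps"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


notebook_json = json.dumps(build_notebook_stub(), indent=1)
```

Writing `notebook_json` to `notebooks/train_grpo.ipynb` would give a file that Jupyter can open; the real cell contents would then replace the placeholder comments.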
**Interface Changes:** None
**Verification:**
> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.
**Risk Tier for This Step:** Low
**Merge Criteria:**
- [x] Tests from VERIFICATION_SPEC.md pass
- [x] No TODOs left in changed code (or explicitly tracked)
- [x] Backwards compatible (or flag/migration documented)
**Changes Made:**
- Replaced `notebooks/train_grpo.ipynb` with a clean, Colab-compatible training stub organized as: title/description, setup, configuration, connect smoke test, GRPO training loop, held-out evaluation, plotting, and next steps.
- Added explicit `SQLEnvClient` connectivity example and retained F006 training hooks (`GRPOConfig`, `load_model_and_tokenizer`, `build_trainer`, `run_training_with_metrics`, and `sample_random_baseline`) so notebook smoke tests continue to validate expected flow.
- Cleared all notebook cell outputs and removed hardcoded local absolute paths to keep the artifact reproducible for judges and portable to Colab/local runs.
**Result:**
- OK Fully Successful
- Verification commands:
- `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v`
- `uv run --with pytest pytest tests/ -v`
- Verification evidence:
- Targeted notebook E2E: 5 passed
- Full regression suite: 250 passed, 1 skipped
**Context for Next Step:**
- Implementation steps are complete for F007; proceed to finalization protocol (verification gate + verifier/compound-engineer/archive-spec + Plan Status/PR Contract/FEATURES sync).
**Status:** OK Completed
---
## 8. Rollout Considerations
### Feature Flags
- Required: No
- This is a one-time deployment, not a progressive rollout
### Migration
- Data migration needed: No
- Spider databases are bundled fresh in Docker build
### Rollback Plan
HF Spaces can be deleted/recreated. README and docs changes are pure git reverts. No data migration or state to worry about.
---
## 9. Execution Tracking
All execution state is tracked within this document:
- **Section 1a:** Overall progress summary
- **Section 7:** Per-step completion details, test results, and handoff context
- **FEATURES.json:** Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
- **Git history:** Full audit trail of changes to this file
The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
- Checking Section 1a for summary
- Reviewing Section 7 for detailed step status
- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
- Running `git log --oneline specs/F007-IMPLEMENTATION_SPEC.md` for change history
---
## 9a. Slice Completion Protocol
After all steps in a slice pass verification:
1. **Run verifier subagent** for spec compliance
- Validates against VERIFICATION_SPEC.md criteria
- Ensures no TODOs or incomplete work in slice
2. **Run compound-engineer subagent** to extract learnings
- **Mandatory invocation** after every slice completion
- Updates CLAUDE.md Learnings section (if durable patterns found)
- May exit with "no update needed" (valid for routine work)
3. **Commit** the slice changes
- Follow commit message format in CLAUDE.md
- Each slice gets its own atomic commit
4. **Continue to next slice** (if more slices remain)
- Or proceed to final verification if all slices complete
**Note:** PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.
---
## 10. User Value Summary
<!-- Populated by /autocode-next-step when final step completes -->
**Status:** Generated
### What Users Can Now Do
Judges and external developers can now consume a full submission package: deploy and run SQLEnv in HF Spaces with bundled databases, follow a polished README quickstart, use a structured blog outline for narrative submission, and run a Colab-ready GRPO notebook workflow end-to-end.
### How to Access/Test
- README quickstart: Follow commands in `README.md`
- Blog outline: Open `docs/blog-outline.md`
- Notebook: Open `notebooks/train_grpo.ipynb` in Colab
- Deployment assets: `server/Dockerfile`, `.dockerignore`, and `openenv.yaml`
### Demo
- **Command:** `uv run --with pytest pytest tests/ -v`
- **Health Check (after deploy):** `curl https://<space-url>/health`
- **Notebook:** `notebooks/train_grpo.ipynb`
### Release Notes Snippet
Completed submission-ready packaging for SQLEnv with HF Spaces-compatible Docker deployment, polished repository docs, blog narrative outline, and a Colab-ready GRPO training notebook.
---
## 11. PR Contract (Auto-Generated by autocode-next-step)
<!-- This section is auto-populated by autocode-next-step command when all steps complete -->
**Status:** Generated
### PR Title
feat(submission): finalize F007 huggingface deployment package
### PR Summary
- Finalize HF Spaces submission artifacts: hardened Docker packaging, deployment-ready manifest, polished README, blog outline, and Colab-ready training notebook.
- Complete final verification gate with full regression evidence and archive behavior deltas into the deployment behavior spec.
- Sync F007 completion metadata in `specs/FEATURES.json` and extract durable learnings for future delivery cycles.
### Verification
- `uv run --with pytest pytest tests/ -v`
### Follow-up
None.
---
## Stop Conditions (When to Split This Spec)
Stop and create a new IMPLEMENTATION_SPEC if:
- A step requires touching more than **3 files** in unrelated areas
- You need to introduce **multiple new abstractions** "just in case"
- Verification cannot be made targeted and concrete
- You discover new unknowns that change the plan materially
- The next slice cannot be merged safely without finishing later slices
When splitting, ensure the current slice ends in a merged, stable state.
---
## Human Checkpoint
**Before handing to AI agent:**
- [ ] Interface specifications are complete
- [ ] Data flow is accurate
- [ ] Error handling is specified
- [ ] Implementation order makes sense
- [ ] VERIFICATION_SPEC.md has been generated
**Questions:**
1. Confirm Spider database list for bundling (from `data/questions/db_list.json`)
2. Confirm HF Space repository name for `openenv push`
---
## Handoff Notes
**For the implementing AI agent:**
```
Context: See RESEARCH_SUMMARY.md for system understanding
Spec: Follow this document exactly
Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
Ambiguity: Stop and ask rather than assume
Order: Follow implementation order exactly
Dependencies: This feature assumes F001-F006 are complete
```
---
*Specification completed: 2026-03-27*
*Approved by: --*
*Verification spec: VERIFICATION_SPEC.md*
*Verification input: [F007-VERIFICATION_INPUT.json](./F007-VERIFICATION_INPUT.json)*
*Target agent: Claude Code*
## User Clarifications
### 2026-03-28 21:40:54
**Question:** External deployment verification is blocked by GHCR access/auth failure (403 pulling base image), so verifier gate cannot approve final completion yet.
**Response:** Clearly state in demo and verification what the user needs to adjust
### 2026-03-28 22:02:53
**Question:** External credential/access dependency remains: need authenticated GHCR pull and HF push evidence (build+push attempt) to satisfy final verifier approval.
**Response:** Ensure you write what the user should verify and we will manually validate
### 2026-03-28 22:55:03
**Question:** Missing external authenticated deployment evidence (GHCR-authenticated build and Hugging Face push output) required by F007 final verification gate.
**Response:** I have already authenticated; you should be able to run the commands now.