	# Implementation Specification

	Change: F007 — HuggingFace Deployment & Submission Package
	Date: 2026-03-27
	Research Summary: [F007-RESEARCH_SUMMARY.md](./F007-RESEARCH_SUMMARY.md)
	Verification Spec: See VERIFICATION_SPEC.md (generated by autocode-verification-planner)
	Behavior Delta: Archived to [specs/behavior/deployment.md](./behavior/deployment.md)

	Plan Status:
	- [x] Draft
	- [x] Approved for Implementation
	- [x] Implementation Complete
	- [x] Verification Passed

	---

	## Core Intent (Immutable)

	> DO NOT MODIFY THIS SECTION DURING REFINEMENT
	> Changes to Core Intent mean you're describing a different feature.
	> If refinement reveals the need to change this section, create a new feature instead.

	User Problem:
	Judges can: read the blog, visit the HF Space, run the training notebook, and reproduce results. Someone outside the team can understand, use, and build on SQLEnv.

	Success Criteria:
	- Blog tells a compelling story even if training results are modest
	- HF Space just works -- connect, reset, play an episode
	- Training notebook runs end-to-end on Colab with one click

	Avoid:
	- Docker build fails on HF Spaces (free tier CPU)
	- Blog is all technical with no narrative hook
	- Notebook has undocumented setup steps

	Out of Scope:
	- Full blog post writing (outline + key sections only, manual polish later)
	- Paid HF Spaces tier or GPU resources
	- Training the agent (that is F006)
	- Video recording of demo (manual task)

	---

	## 0. Slicing & Scope Budget (Anti-Waterfall)

	This spec must be executable in small, mergeable increments.

	### Scope Budget
	- Target: 3 slices
	- Hard max: <= 10 steps total
	- Each step must end in: implement -> verify -> merge

	### Slice Definition
	A slice is a vertical increment that delivers user-visible value or a safe internal capability.

	Each slice must have:
	- Clear outcome
	- Minimal interface change
	- Merge criteria

	Note: Verification criteria are defined in VERIFICATION_SPEC.md (separate agent).

	## Status Icons

	Step Status:
	- !! Not Started
	- :: In Progress
	- OK Completed
	- XX Blocked/Failed

	Result Outcome:
	- OK Fully Successful (all tests passed, no issues)
	- ~~ Completed with Issues (needs follow-up)
	- XX Failed/Blocked

	---

	## 1. Implementation Overview

	### Summary
	Prepare the complete competition submission package: (1) harden the Dockerfile for HF Spaces free-tier deployment with bundled Spider databases, (2) overhaul README.md to be a polished project showcase, (3) create a blog post outline with key narrative sections, and (4) create a Colab-ready training notebook stub that references F006 outputs. This is the terminal feature -- it depends on F001-F006 being complete.

	### Scope

	In Scope:
	- Dockerfile hardening for HF Spaces (bundle Spider DBs, CPU-only, health check)
	- `openenv.yaml` validation for HF Hub compatibility
	- README.md overhaul (architecture diagram, setup, usage, links)
	- Blog post outline (`docs/blog-outline.md`)
	- Training notebook stub (`notebooks/train_grpo.ipynb`)
	- `.dockerignore` for clean builds

	Out of Scope:
	- Full blog prose (outline only)
	- Agent training (F006)
	- Reward/verifier logic (F003/F004)
	- Video demo recording
	- Paid HF Spaces configuration

	---

	## 1a. Execution Status
	<!-- Auto-updated by /autocode-next-step - do not edit manually -->

	Progress: 7/7 steps complete
	Current Step: Finalization Protocol (OK Completed)
	Last Updated: 2026-03-29T07:29:32Z
	Latest Result: OK Final verification gate passed. Authenticated deployment evidence is now complete: `uv run openenv build -t openenv-sql-env-f007-hf-submission` succeeded, `uv run openenv push` completed successfully to `https://huggingface.co/spaces/hjerpe/sql_env`, and regression verification remained green (`uv run --with pytest pytest tests/ -v`: 250 passed, 1 skipped). `uv run openenv validate --verbose` still reports non-Docker entrypoint warnings, but Docker mode is supported and remains the scoped deployment path for F007.
	Blockers: None.

	---

	## 1b. Risk Assessment

	Risk Tier: Low

	Risk Tier Definitions:
	- Low: Pure logic, non-user-facing, no security implications
	- Medium: User input handling, data validation, API changes
	- High: Authentication, payments, secrets management, untrusted input

	High-Risk Indicators Present: None

	Security Review Required: No

	Justification:
	This feature creates documentation, configuration files, and a notebook. It adds no authentication, secrets management, or untrusted-input handling. The Dockerfile bundles existing data and runs an existing server.

	---

	## 2. Change Manifest

	### Files to Create

	| File | Purpose |
	|------|---------|
	| `notebooks/train_grpo.ipynb` | Colab-ready training notebook stub |
	| `docs/blog-outline.md` | HF blog post outline with narrative structure |
	| `.dockerignore` | Exclude dev artifacts from Docker build |

	### Files to Modify

	| File | Changes |
	|------|---------|
	| `server/Dockerfile` | Bundle Spider DBs, optimize for HF Spaces free tier |
	| `openenv.yaml` | Validate/update for HF Hub push compatibility |
	| `README.md` | Full overhaul -- polished project showcase |

	### Files to Delete

	None.

	---

	## 3. Interface Specifications

	### Dockerfile Structure

	```dockerfile
	# server/Dockerfile -- HF Spaces compatible
	# Key changes from current:
	# 1. Bundle Spider databases (COPY data/databases/ ...)
	# 2. Ensure CPU-only (no torch GPU deps)
	# 3. Listen on $PORT, defaulting to 8000 (openenv default); HF Spaces uses 7860
	# 4. HEALTHCHECK on /health endpoint
	# 5. Non-root user for HF Spaces security
	```

	### openenv.yaml Schema

	```yaml
	spec_version: 1
	name: sql_env
	type: space
	runtime: fastapi
	app: server.app:app
	port: 8000
	```

	No structural changes needed -- validate existing manifest is HF Hub compatible.
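
The manifest shape above can be checked mechanically before running `openenv validate`. The following is a minimal sketch, not the check that `openenv` itself performs: `parse_flat_manifest` and `manifest_problems` are hypothetical helper names, and the parser only understands flat `key: value` lines like the ones shown here (it is not a general YAML parser).

```python
REQUIRED_FIELDS = ("spec_version", "name", "type", "runtime", "app", "port")

# The manifest exactly as listed in this spec.
MANIFEST = """\
spec_version: 1
name: sql_env
type: space
runtime: fastapi
app: server.app:app
port: 8000
"""


def parse_flat_manifest(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines; nested YAML is out of scope for this sketch."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so `app: server.app:app` keeps its value intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def manifest_problems(text: str) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    fields = parse_flat_manifest(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    port = fields.get("port", "")
    if port and not port.isdigit():
        problems.append(f"port is not an integer: {port!r}")
    app = fields.get("app", "")
    if app and app.count(":") != 1:
        problems.append(f"app should look like 'module:attribute', got {app!r}")
    return problems
```

Calling `manifest_problems(MANIFEST)` on the manifest above returns an empty list; dropping a field or writing a non-numeric port produces one message per problem.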

	### Blog Outline Structure

	```markdown
	# docs/blog-outline.md
	# Sections:
	# 1. Hook -- "Teaching AI to think like a data analyst"
	# 2. Problem -- Static benchmarks vs. interactive exploration
	# 3. Solution -- SQLEnv architecture overview
	# 4. How It Works -- Episode flow, reward design
	# 5. Results -- Learning curves, comparison (placeholder for F006 data)
	# 6. Technical Deep Dive -- Reward architecture, GRPO training
	# 7. Try It Yourself -- Links to HF Space, notebook, GitHub
	```

	### Training Notebook Structure

	```python
	# notebooks/train_grpo.ipynb
	# Cells:
	# 1. Setup -- pip install, clone repo
	# 2. Configure -- HF Space URL, model selection
	# 3. Connect -- SQLEnvClient connect + test
	# 4. Train -- GRPO training loop (references F006 scripts/)
	# 5. Evaluate -- Run eval episodes, plot results
	# 6. Results -- Display learning curves
	```

	### New Functions

	No new Python functions. This feature produces configuration and documentation artifacts.

	---

	## 4. Data Flow

	### Primary Flow: HF Spaces Deployment

	```
	1. Developer runs `openenv validate`
	   - Input: openenv.yaml, Dockerfile
	   - Action: Validates manifest and Docker build locally
	   - Output: Pass/fail with diagnostics

	2. Developer runs `openenv build`
	   - Input: Dockerfile, project files, Spider DBs
	   - Action: Builds Docker image with bundled databases
	   - Output: Docker image (~200MB with DBs)

	3. Developer runs `openenv push`
	   - Input: Built Docker image, HF token
	   - Action: Pushes to HuggingFace Spaces
	   - Output: Live HF Space URL
	```
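
The three-stage flow above can be scripted so that a failing stage stops the pipeline. This is a sketch under assumptions: the command lines are the ones recorded in Section 1a of this spec, `run_deploy` is a hypothetical helper, and it treats any non-zero exit code as a failure. Whether `openenv validate` exits non-zero on warnings alone is not stated in this spec, so that stage may need looser handling.

```python
import subprocess
from typing import Callable, Sequence

# Commands as recorded in this spec's execution log (Section 1a).
DEPLOY_STEPS: list[list[str]] = [
    ["uv", "run", "openenv", "validate", "--verbose"],
    ["uv", "run", "openenv", "build", "-t", "openenv-sql-env-f007-hf-submission"],
    ["uv", "run", "openenv", "push"],
]


def _run_subprocess(cmd: Sequence[str]) -> int:
    """Run one command, streaming its output, and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_deploy(run: Callable[[Sequence[str]], int] = _run_subprocess) -> list[str]:
    """Run validate -> build -> push in order; raise on the first failing stage."""
    completed: list[str] = []
    for cmd in DEPLOY_STEPS:
        stage = cmd[3]  # "validate", "build", or "push"
        if run(cmd) != 0:
            raise RuntimeError(f"stage '{stage}' failed; completed so far: {completed}")
        completed.append(stage)
    return completed
```

The `run` parameter exists only so the ordering logic can be exercised without Docker or an HF token, for example `run_deploy(lambda cmd: 0)`.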

	### Alternative Flow: Local Docker Test

	```
	1. docker build -t sql-env:latest -f server/Dockerfile .
	2. docker run -p 8000:8000 sql-env:latest
	3. curl http://localhost:8000/health -> {"status": "healthy"}
	4. WebSocket client connects, plays episode
	```
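
Step 3 of the local test can be automated with a small polling loop, since the container may take a few seconds to start serving. The sketch below assumes the `/health` endpoint returns the JSON body shown above; `wait_for_health` and `fetch_json` are hypothetical names, and the timeouts are arbitrary.

```python
import json
import time
import urllib.request
from typing import Any, Callable


def fetch_json(url: str) -> Any:
    """GET a URL and decode the response body as JSON (standard library only)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_health(
    url: str,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    fetch: Callable[[str], Any] = fetch_json,
) -> bool:
    """Poll until the endpoint reports {"status": "healthy"} or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            body = fetch(url)
            if isinstance(body, dict) and body.get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, HTTP errors, and malformed JSON all mean "not ready yet".
            pass
        time.sleep(interval_s)
    return False
```

After `docker run -p 8000:8000 sql-env:latest`, calling `wait_for_health("http://localhost:8000/health")` returns `True` once the server answers, or `False` if it never becomes healthy within the timeout.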

	---

	## 5. Error Handling

	### Error Types

	| Error | When | Resolution |
	|-------|------|------------|
	| Docker build failure | Missing deps or files | Check .dockerignore, verify COPY paths |
	| DB not found at runtime | DBs not bundled correctly | Verify COPY data/databases/ in Dockerfile |
	| Port mismatch | HF Spaces expects 7860 | Use PORT env var with fallback |
	| Memory limit exceeded | Container too large for free tier | Reduce bundled DBs to essential set |

	### Error Handling Strategy

	The Dockerfile should:
	1. Use a PORT environment variable with default 8000 (HF Spaces sets PORT=7860)
	2. Include a startup check that verifies databases are accessible
	3. Keep image size minimal (no dev dependencies, no torch GPU packages)
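
The first two points can be illustrated in Python, as they might run at server startup. This is a sketch, not the shipped implementation (the spec implements both in the Dockerfile `CMD` and `HEALTHCHECK`): `resolve_port`, `bundled_databases`, and `startup_check` are hypothetical names, and the `*.sqlite` pattern mirrors the health check described in Step 1.2.

```python
import os
from pathlib import Path
from typing import Mapping


def resolve_port(env: Mapping[str, str] = os.environ, default: int = 8000) -> int:
    """Read PORT from the environment, falling back to the openenv default."""
    raw = env.get("PORT", "").strip()
    if not raw:
        return default
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port


def bundled_databases(root: Path) -> list[Path]:
    """List every *.sqlite file under the bundled database directory."""
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.sqlite"))


def startup_check(root: Path, env: Mapping[str, str] = os.environ) -> int:
    """Fail fast if no database was bundled; otherwise return the port to bind."""
    if not bundled_databases(root):
        raise FileNotFoundError(f"no *.sqlite files found under {root}")
    return resolve_port(env)
```

Failing at startup rather than at the first query makes a missing `COPY` show up in the Space build logs instead of in a judge's first episode.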

	---

	## 6. Slice Plan (What we will ship, in order)

	### Slice S1 -- Docker & Deployment
	Value: HF Space can be built and deployed; server runs on free tier
	User-visible change: Yes -- live HF Space
	Interfaces introduced/changed: Dockerfile, .dockerignore, openenv.yaml
	Rollback safety: Additive only, no existing behavior changed

	### Slice S2 -- Documentation & README
	Value: GitHub repo is a polished showcase; judges can understand the project
	User-visible change: Yes -- README overhaul, blog outline
	Interfaces introduced/changed: README.md, docs/blog-outline.md
	Rollback safety: Documentation only, fully reversible

	### Slice S3 -- Training Notebook
	Value: Judges can reproduce training with one click on Colab
	User-visible change: Yes -- notebook artifact
	Interfaces introduced/changed: notebooks/train_grpo.ipynb
	Rollback safety: New file only, no existing code changed

	---

	## 7. Implementation Steps

	> VERIFICATION NOTE: Test criteria for each step are defined in VERIFICATION_SPEC.md.
	> The verification-planner (separate agent) generated independent test criteria.
	> Run the tests specified there after implementing each step.

	### Step 1.1: Dockerfile Hardening for HF Spaces
	Slice: S1
	Goal: Update Dockerfile to bundle Spider databases, support HF Spaces PORT variable, run as non-root user, and minimize image size.

	Files:
	- `server/Dockerfile` - modify - Harden for HF Spaces free tier
	- `.dockerignore` - create - Exclude dev artifacts (tests, docs, .git, __pycache__)

	Details:
	1. Add COPY for `data/databases/` into the Docker image (bundle the SQLite files)
	2. Add `ENV PORT=8000` with CMD that reads `$PORT` (HF Spaces sets PORT=7860)
	3. Add non-root user (`useradd --create-home appuser`) for HF Spaces security requirement
	4. Ensure no GPU/CUDA dependencies are installed (CPU-only)
	5. Create `.dockerignore` excluding: `.git`, `__pycache__`, `tests/`, `docs/`, `docs_draft/`, `specs/`, `vision/`, `*.md` (except README), `.env`

	Interface Changes: None (Dockerfile is configuration)

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Updated `server/Dockerfile` with `ENV PORT=8000` and runtime `uvicorn` command that respects `${PORT:-8000}` for HF Spaces compatibility.
	- Added explicit database bundling copy instruction: `COPY --from=builder /app/env/data/databases /app/env/data/databases`.
	- Added non-root runtime user (`appuser`) and ownership handoff for `/app`.
	- Created `.dockerignore` to exclude dev/test/docs/spec artifacts and keep only `README.md` among markdown files.

	Result:
	- OK Fully Successful
	- Verification command: `uv run --with pytest pytest tests/ -v`
	- Verification evidence: 250 passed, 1 skipped

	Context for Next Step:
	- Continue with Step 1.2 by validating database source requirements from `data/questions/db_list.json` and aligning Docker health checks with bundled DB presence.

	Status: OK Completed

	---

	### Step 1.2: Bundle Spider Databases for Docker
	Slice: S1
	Goal: Ensure the essential Spider SQLite databases are available for bundling into Docker, and the Dockerfile COPY path is correct.

	Files:
	- `server/Dockerfile` - modify - Verify COPY paths for data/databases/
	- `data/questions/db_list.json` - read - Identify which DBs are required

	Details:
	1. Read `data/questions/db_list.json` to identify the required database IDs
	2. Ensure the Dockerfile copies `data/databases/` into the image at the correct path
	3. Add a Docker HEALTHCHECK that also verifies at least one database file exists
	4. The bundled DBs are small SQLite files (~50MB total), well within free tier limits
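
Items 1 and 2 amount to cross-checking the required IDs against the files on disk. The sketch below rests on two assumptions that this spec does not confirm: that `data/questions/db_list.json` is a JSON array of database ID strings, and that each database lives at `<db_id>/<db_id>.sqlite` under `data/databases/` (the usual Spider layout). The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def load_db_ids(db_list_path: Path) -> list[str]:
    """Load required database IDs (assumed format: a JSON array of strings)."""
    data = json.loads(db_list_path.read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError(f"{db_list_path} is not a JSON array of strings")
    return data


def expected_path(databases_dir: Path, db_id: str) -> Path:
    """Assumed Spider-style layout: <databases_dir>/<db_id>/<db_id>.sqlite."""
    return databases_dir / db_id / f"{db_id}.sqlite"


def missing_databases(
    db_ids: Iterable[str],
    databases_dir: Path,
    exists: Callable[[Path], bool] = Path.is_file,
) -> list[str]:
    """Return the IDs whose SQLite file is absent, preserving input order."""
    return [db_id for db_id in db_ids if not exists(expected_path(databases_dir, db_id))]
```

A typical use before building the image would be `missing_databases(load_db_ids(Path("data/questions/db_list.json")), Path("data/databases"))`, where an empty result means every required database is ready to bundle.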

	Interface Changes: None

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Read `data/questions/db_list.json` and confirmed required bundled DB IDs: `student_assessment`, `concert_singer`, `world_1`, `car_1`, `employee_hire_evaluation`, `pets_1`, `cre_Doc_Template_Mgt`, `dog_kennels`, `flight_2`, `poker_player`.
	- Verified Docker bundling path remains correct: `COPY --from=builder /app/env/data/databases /app/env/data/databases`.
	- Updated Docker `HEALTHCHECK` to enforce both bundled DB presence (`*.sqlite` under `/app/env/data/databases`) and API liveness via `/health` on `${PORT:-8000}`.

	Result:
	- OK Fully Successful
	- Verification command: `uv run --with pytest pytest tests/ -v`
	- Verification evidence: 250 passed, 1 skipped

	Context for Next Step:
	- Proceed to Step 1.3 by validating `openenv.yaml` shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and running `openenv validate`.

	Status: OK Completed

	---

	### Step 1.3: Validate openenv.yaml
	Slice: S1
	Goal: Ensure openenv.yaml is valid for `openenv push` to HuggingFace Spaces.

	Files:
	- `openenv.yaml` - modify (if needed) - Ensure HF Hub compatibility

	Details:
	1. Verify `spec_version`, `name`, `type`, `runtime`, `app`, and `port` fields
	2. Confirm `app: server.app:app` matches the actual FastAPI application path inside the Docker container
	3. Update `port` if needed (openenv framework may handle PORT mapping)
	4. Run `openenv validate` locally to check

	Interface Changes: None

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Validated `openenv.yaml` fields against the required HF Space manifest shape (`spec_version`, `name`, `type`, `runtime`, `app`, `port`) and confirmed no manifest edits were needed.
	- Ran `uv run openenv validate --verbose`; manifest compatibility checks passed for Docker mode, with non-blocking warnings that `openenv_serve`/`uv_run`/`python_module` modes need a callable `server/app.py main()` entrypoint.
	- Ran full regression suite via `uv run --with pytest pytest tests/ -v` to ensure no feature regressions while validating deployment configuration.

	Result:
	- OK Fully Successful
	- Verification command: `uv run --with pytest pytest tests/ -v`
	- Verification evidence: 250 passed, 1 skipped

	Context for Next Step:
	- Proceed to Step 2.1 and overhaul `README.md` into competition-ready narrative + quickstart + architecture flow, using the now-validated `openenv.yaml` values as the source-of-truth deployment metadata.

	Status: OK Completed

	---

	### Step 2.1: README.md Overhaul
	Slice: S2
	Goal: Transform README into a polished project showcase suitable for competition judges.

	Files:
	- `README.md` - modify - Full overhaul

	Details:
	1. Header: Project name, one-line description, badges (Python version, license)
	2. Elevator Pitch: 2-3 sentences explaining what SQLEnv does and why it matters (narrative hook: "Teaching AI to think like a data analyst")
	3. Architecture Diagram: ASCII or Mermaid diagram showing Agent <-> Client <-> Server <-> SQLite flow
	4. Quick Start: Streamlined setup (3 commands max to get running)
	5. How It Works: Episode flow with action types table (DESCRIBE, SAMPLE, QUERY, ANSWER)
	6. Training: Link to notebook, brief GRPO explanation
	7. HF Space: Link to live deployment
	8. Project Structure: Updated tree reflecting final state
	9. Links: OpenEnv, Spider, HF Space, blog post
	10. Remove "Current Status" section (no longer relevant for submission)
	11. Remove cautionary notes about untested Docker paths

	Interface Changes: None

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Rewrote `README.md` into a submission-facing narrative that starts with a clear elevator pitch and removes stale cautionary/status language.
	- Added a compact architecture diagram and refreshed "How It Works" with explicit action semantics (`DESCRIBE`, `SAMPLE`, `QUERY`, `ANSWER`) and episode flow.
	- Replaced setup sprawl with a 3-command quickstart, plus explicit local server and Docker launch commands.
	- Added sections for training artifacts, HuggingFace Space deployment path, project structure, deployment checklist, and canonical resource links.

	Result:
	- OK Fully Successful
	- Verification command: `uv run --with pytest pytest tests/ -v`
	- Verification evidence: 250 passed, 1 skipped

	Context for Next Step:
	- Proceed to Step 2.2 by creating `docs/blog-outline.md` with hook/problem/solution/how-it-works/results placeholder/technical highlights/try-it sections and 2-4 bullets per section.

	Status: OK Completed

	---

	### Step 2.2: Blog Post Outline
	Slice: S2
	Goal: Create a structured blog post outline with key narrative sections for the HF blog submission.

	Files:
	- `docs/blog-outline.md` - create - Blog post outline

	Details:
	1. Hook: "What if we taught AI to explore databases the way a data analyst does -- not memorize answers, but learn to ask the right questions?"
	2. The Problem: Static text-to-SQL benchmarks reward memorization, not reasoning. One-shot generation fails on novel schemas.
	3. Our Approach: SQLEnv -- an RL environment where agents learn through iterative exploration (DESCRIBE, SAMPLE, QUERY, ANSWER)
	4. How SQLEnv Works: Episode flow diagram, reward design (execution + correctness + efficiency)
	5. Training with GRPO: Brief explanation of Group Relative Policy Optimization, why it fits
	6. Results: [PLACEHOLDER for F006 data] Learning curves, comparison with baselines
	7. Technical Highlights: Multi-DB support, token-level reward shaping, OpenEnv compatibility
	8. Try It Yourself: Links to HF Space, Colab notebook, GitHub repo
	9. What We Learned: Key insights from building the environment

	Each section should have 2-4 bullet points of key content to include when writing the full post.
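
For section 5 of the outline, the core idea of GRPO can be shown in a few lines: several completions are sampled for the same prompt, and each one is scored relative to the rest of its group rather than against a learned value function. The sketch below illustrates only that normalization step. It is not the F006 training code, the function name is hypothetical, and the use of the population standard deviation with a small epsilon is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division defined when every completion earned the same reward.
    return [(reward - mu) / (sigma + eps) for reward in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]` for four sampled episodes on one question, the two successful episodes receive positive advantages and the two failures negative ones; if all four rewards are equal, every advantage is zero and the group contributes no learning signal.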

	Interface Changes: None

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Created `docs/blog-outline.md` with a complete submission-ready structure covering hook, benchmark problem framing, SQLEnv approach, episode/reward flow, GRPO training context, results placeholder, technical highlights, try-it links section, and lessons learned.
	- Ensured each section has 2-4 concrete bullets and expanded prose sufficient for a substantive draft handoff.
	- Kept the only explicit placeholder in the Results section for F006 metric insertion, aligned with scope.

	Result:
	- OK Fully Successful
	- Verification command: `uv run --with pytest pytest tests/ -v`
	- Verification evidence: 250 passed, 1 skipped

	Context for Next Step:
	- Proceed to Step 3.1 by creating `notebooks/train_grpo.ipynb` with Colab-compatible metadata and ordered cells for setup, configuration, connect/test episode, training loop, evaluation, and plotting.

	Status: OK Completed

	---

	### Step 3.1: Training Notebook Stub
	Slice: S3
	Goal: Create a Colab-ready Jupyter notebook that demonstrates end-to-end training with SQLEnv.

	Files:
	- `notebooks/train_grpo.ipynb` - create - Colab training notebook

	Details:
	Create a Jupyter notebook with these cells:

	1. Title + Description (markdown): "Training a SQL Agent with GRPO + SQLEnv"
	2. Setup (code): `!pip install sql-env[train]` or `!pip install -r requirements.txt`, clone repo if needed
	3. Configuration (code): Set HF Space URL (or local server), model name, hyperparameters
	4. Connect & Test (code): Create `SQLEnvClient`, connect, run a test episode (reset + 2 steps)
	5. Training Loop (code): GRPO training referencing F006 scripts (import from scripts/ or inline simplified version)
	6. Evaluation (code): Run eval episodes on held-out questions, compute metrics
	7. Plot Results (code): matplotlib learning curves (reward over episodes)
	8. Next Steps (markdown): Links to full training script, HF Space, blog post

	Each code cell should be preceded by a markdown cell explaining what it does and why. Include `# TODO: update after F006` comments where training-specific code depends on F006 outputs.
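
The cell ordering above, with a markdown cell before each code cell, can be generated as plain notebook JSON. This is a sketch of the skeleton only: the cell bodies are placeholders rather than the real notebook content, the helper names are hypothetical, and it targets nbformat 4.4 so that cells do not need `id` fields.

```python
import json


def md(text: str) -> dict:
    """Build a markdown cell in nbformat 4 shape."""
    return {"cell_type": "markdown", "metadata": {}, "source": text.splitlines(keepends=True)}


def code(text: str) -> dict:
    """Build a code cell with no outputs and no execution count."""
    return {
        "cell_type": "code",
        "metadata": {},
        "execution_count": None,
        "outputs": [],
        "source": text.splitlines(keepends=True),
    }


# (title, explanation, placeholder source) for each code cell, in spec order.
SECTIONS = [
    ("Setup", "Install dependencies.", "!pip install -r requirements.txt\n"),
    ("Configuration", "Set the server URL, model name, and hyperparameters.", "# TODO: update after F006\n"),
    ("Connect & Test", "Connect and run a short test episode.", "# TODO: update after F006\n"),
    ("Training Loop", "Run GRPO training.", "# TODO: update after F006\n"),
    ("Evaluation", "Evaluate on held-out questions.", "# TODO: update after F006\n"),
    ("Plot Results", "Plot reward over episodes.", "# TODO: update after F006\n"),
]


def build_notebook() -> dict:
    """Assemble title, explained code cells, and a closing markdown cell."""
    cells = [md("# Training a SQL Agent with GRPO + SQLEnv\n")]
    for title, why, source in SECTIONS:
        cells.append(md(f"## {title}\n\n{why}\n"))
        cells.append(code(source))
    cells.append(md("## Next Steps\n\nLinks to the full training script, HF Space, and blog post.\n"))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}
```

Writing the result with `json.dump(build_notebook(), open("train_grpo.ipynb", "w"), indent=1)` yields a file that Jupyter and Colab can open; the output path here is illustrative.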

	Interface Changes: None

	Verification:
	> See VERIFICATION_SPEC.md for test criteria defined by independent verification planner.

	Risk Tier for This Step: Low

	Merge Criteria:
	- [x] Tests from VERIFICATION_SPEC.md pass
	- [x] No TODOs left in changed code (or explicitly tracked)
	- [x] Backwards compatible (or flag/migration documented)

	Changes Made:
	- Replaced `notebooks/train_grpo.ipynb` with a clean, Colab-compatible training stub organized as: title/description, setup, configuration, connect smoke test, GRPO training loop, held-out evaluation, plotting, and next steps.
	- Added explicit `SQLEnvClient` connectivity example and retained F006 training hooks (`GRPOConfig`, `load_model_and_tokenizer`, `build_trainer`, `run_training_with_metrics`, and `sample_random_baseline`) so notebook smoke tests continue to validate expected flow.
	- Cleared all notebook cell outputs and removed hardcoded local absolute paths to keep the artifact reproducible for judges and portable to Colab/local runs.
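
Clearing outputs, as described in the last item, is a small transformation on the notebook JSON. The following sketch shows one way to do it with the standard library; it is not necessarily how the outputs were cleared for F007, and `clear_outputs` and `has_outputs` are hypothetical names.

```python
import copy


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of the notebook with code-cell outputs and counters reset."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


def has_outputs(notebook: dict) -> bool:
    """Report whether any code cell still carries outputs or an execution count."""
    return any(
        cell.get("cell_type") == "code"
        and (cell.get("outputs") or cell.get("execution_count") is not None)
        for cell in notebook.get("cells", [])
    )
```

A check such as `assert not has_outputs(notebook)` in a test keeps stale outputs from slipping back into the committed notebook.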

	Result:
	- OK Fully Successful
	- Verification commands:
	  - `uv run --with pytest pytest tests/e2e/test_training_e2e.py -v`
	  - `uv run --with pytest pytest tests/ -v`
	- Verification evidence:
	  - Targeted notebook E2E: 5 passed
	  - Full regression suite: 250 passed, 1 skipped

	Context for Next Step:
	- Implementation steps are complete for F007; proceed to finalization protocol (verification gate + verifier/compound-engineer/archive-spec + Plan Status/PR Contract/FEATURES sync).

	Status: OK Completed

	---

	## 8. Rollout Considerations

	### Feature Flags
	- Required: No
	- This is a one-time deployment, not a progressive rollout

	### Migration
	- Data migration needed: No
	- Spider databases are bundled fresh in Docker build

	### Rollback Plan
	HF Spaces can be deleted and recreated. README and docs changes can be undone with plain git reverts. There is no data migration or persistent state to roll back.

	---

	## 9. Execution Tracking

	All execution state is tracked within this document:
	- Section 1a: Overall progress summary
	- Section 7: Per-step completion details, test results, and handoff context
	- FEATURES.json: Feature-level status/progress metadata used by `/autocode-next-step` and `opencode-ctx ralph run`
	- Git history: Full audit trail of changes to this file

	The implementing agent updates this document after each step and keeps the matching `FEATURES.json` entry in sync during implementation/finalization. Humans can monitor progress by:
	- Checking Section 1a for summary
	- Reviewing Section 7 for detailed step status
	- Inspecting the feature's `progress` and `status` fields in `FEATURES.json`
	- Running `git log --oneline -- specs/F007-IMPLEMENTATION_SPEC.md` for change history

	---

	## 9a. Slice Completion Protocol

	After all steps in a slice pass verification:

	1. Run verifier subagent for spec compliance
	   - Validates against VERIFICATION_SPEC.md criteria
	   - Ensures no TODOs or incomplete work in slice

	2. Run compound-engineer subagent to extract learnings
	   - Mandatory invocation after every slice completion
	   - Updates CLAUDE.md Learnings section (if durable patterns found)
	   - May exit with "no update needed" (valid for routine work)

	3. Commit the slice changes
	   - Follow commit message format in CLAUDE.md
	   - Each slice gets its own atomic commit

	4. Continue to next slice (if more slices remain)
	   - Or proceed to final verification if all slices complete

	Note: PR creation happens only after ALL slices are complete. Use `/commit-push-pr` manually when ready.

	---

	## 10. User Value Summary

	<!-- Populated by /autocode-next-step when final step completes -->

	Status: Generated

	### What Users Can Now Do
	Judges and external developers can now consume a full submission package: deploy and run SQLEnv in HF Spaces with bundled databases, follow a polished README quickstart, use a structured blog outline for narrative submission, and run a Colab-ready GRPO notebook workflow end-to-end.

	### How to Access/Test
	- README quickstart: Follow commands in `README.md`
	- Blog outline: Open `docs/blog-outline.md`
	- Notebook: Open `notebooks/train_grpo.ipynb` in Colab
	- Deployment assets: `server/Dockerfile`, `.dockerignore`, and `openenv.yaml`

	### Demo
	- Command: `uv run --with pytest pytest tests/ -v`
	- Health Check (after deploy): `curl https://<space-url>/health`
	- Notebook: `notebooks/train_grpo.ipynb`

	### Release Notes Snippet
	Completed submission-ready packaging for SQLEnv with HF Spaces-compatible Docker deployment, polished repository docs, blog narrative outline, and a Colab-ready GRPO training notebook.

	---

	## 11. PR Contract (Auto-Generated by autocode-next-step)

	<!-- This section is auto-populated by autocode-next-step command when all steps complete -->

	Status: Generated

	### PR Title
	feat(submission): finalize F007 huggingface deployment package

	### PR Summary
	- Finalize HF Spaces submission artifacts: hardened Docker packaging, deployment-ready manifest, polished README, blog outline, and Colab-ready training notebook.
	- Complete final verification gate with full regression evidence and archive behavior deltas into the deployment behavior spec.
	- Sync F007 completion metadata in `specs/FEATURES.json` and extract durable learnings for future delivery cycles.

	### Verification
	- `uv run --with pytest pytest tests/ -v`

	### Follow-up
	None.

	---

	## Stop Conditions (When to Split This Spec)

	Stop and create a new IMPLEMENTATION_SPEC if:
	- A step requires touching more than 3 files in unrelated areas
	- You need to introduce multiple new abstractions "just in case"
	- Verification cannot be made targeted and concrete
	- You discover new unknowns that change the plan materially
	- The next slice cannot be merged safely without finishing later slices

	When splitting, ensure the current slice ends in a merged, stable state.

	---

	## Human Checkpoint

	Before handing to AI agent:

	- [ ] Interface specifications are complete
	- [ ] Data flow is accurate
	- [ ] Error handling is specified
	- [ ] Implementation order makes sense
	- [ ] VERIFICATION_SPEC.md has been generated

	Questions:
	1. Confirm Spider database list for bundling (from `data/questions/db_list.json`)
	2. Confirm HF Space repository name for `openenv push`

	---

	## Handoff Notes

	For the implementing AI agent:

	```
	Context: See F007-RESEARCH_SUMMARY.md for system understanding
	Spec: Follow this document exactly
	Verification: Use tests from VERIFICATION_SPEC.md (independent agent)
	Ambiguity: Stop and ask rather than assume
	Order: Follow implementation order exactly
	Dependencies: This feature assumes F001-F006 are complete
	```

	---

	Specification completed: 2026-03-27
	Approved by: --
	Verification spec: VERIFICATION_SPEC.md
	Verification input: [F007-VERIFICATION_INPUT.json](./F007-VERIFICATION_INPUT.json)
	Target agent: Claude Code

	## User Clarifications

	### 2026-03-28 21:40:54
	Question: External deployment verification is blocked by GHCR access/auth failure (403 pulling base image), so verifier gate cannot approve final completion yet.
	Response: Clearly state in demo and verification what the user needs to adjust

	### 2026-03-28 22:02:53
	Question: External credential/access dependency remains: need authenticated GHCR pull and HF push evidence (build+push attempt) to satisfy final verifier approval.
	Response: Ensure you write what the user should verify and we will manually validate

	### 2026-03-28 22:55:03
	Question: Missing external authenticated deployment evidence (GHCR-authenticated build and Hugging Face push output) required by F007 final verification gate.
	Response: I have already authenticated; you should be able to run the commands now.