---
title: AutoDataLab Plus Plus
emoji: 📊
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AutoDataLab++

> **The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop.**
> A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and *submit* a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the *capability gap* between knowing facts and knowing when it has enough.
## Headline result (fallback disabled)

| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |

Numbers are **terminal grader scores in `[0,1]`**, mean over **5 seeds**, with the environment's expert-auto-fill **safety net OFF** (`eval_mode=true`). Two RAG settings are shown per cell (off / on). Raw runs in `training/evidence/headline_benchmark.json`. Reproduce with `python3 training/scripts/run_headline_benchmark.py`.
What this says, plainly:

- **Untrained baseline ~0.27.** A naive policy that consults the analyst and then submits gets penalised, because it never routes to finance / strategy / HR, so the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: **+0.46 absolute reward over the naive baseline**, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** A handcoded canonical-order policy. It is an *upper bound*, not a trained model, published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.

The take-away is the **+0.46 trained-vs-baseline gap**, plus the **~0.15 oracle headroom** that future RL runs can chase. The base-to-oracle gap (~0.6) is the size of the routing problem; our trained MLP closes ~75% of it.
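The gap arithmetic above can be checked directly. A minimal sketch using the rounded means from the table (the authoritative per-cell values, with std and per-seed runs, live in `training/evidence/headline_benchmark.json`):

```python
# Rounded headline means; replace with values parsed from
# training/evidence/headline_benchmark.json for exact numbers.
means = {"base_naive": 0.27, "trained_mlp": 0.73, "oracle_router": 0.88}

trained_gain = means["trained_mlp"] - means["base_naive"]        # trained vs baseline
oracle_headroom = means["oracle_router"] - means["trained_mlp"]  # remaining headroom
routing_gap = means["oracle_router"] - means["base_naive"]       # size of the problem
closed_fraction = trained_gain / routing_gap                     # share of gap closed

print(f"gain {trained_gain:+.2f}, headroom {oracle_headroom:.2f}, "
      f"closed {closed_fraction:.0%} of the routing gap")
```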
## Live demo

- **Office UI (Hugging Face Space):** [https://uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
- Health endpoint: [https://uchihamadara1816-autodatalab2-0.hf.space/health](https://uchihamadara1816-autodatalab2-0.hf.space/health)
- Pick a task, toggle RAG, and run the four policies side-by-side. The naive baseline submits an incomplete brief; the **MLP trained CoS** and **oracle** rows finish with all four boxes lit.
## Quickstart (local)

```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```
| Endpoint | URL |
| --- | --- |
| API root | [http://127.0.0.1:7860/](http://127.0.0.1:7860/) |
| Health | [http://127.0.0.1:7860/health](http://127.0.0.1:7860/health) |
| **Demo UI** | [http://127.0.0.1:7860/ui/](http://127.0.0.1:7860/ui/) |
Pre-submission checks:

```bash
python3 validate_submission.py
openenv validate --verbose
```

Oracle rollout (3 tasks):

```bash
python3 inference.py --oracle
```
### Honest evaluation: turn off the safety net

The environment ships with a *production-mode* fallback that auto-completes any required expert the policy forgot, so end-users always see a full brief. For **evaluation** you want the opposite: the policy must succeed (or fail) on its own.

```bash
# CLI
python3 training/scripts/run_headline_benchmark.py  # 5 seeds, fallback OFF

# HTTP
curl -X POST http://127.0.0.1:7860/reset \
  -H 'content-type: application/json' \
  -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```
When `eval_mode=true`, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is allowed and is reflected in the terminal grader score, which is exactly what produces the headline trained-vs-baseline gap shown above.
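The same flags can be driven from Python for scripted evaluation. A minimal sketch against the documented `/reset` and `/step` endpoints; note the `action` payload shape here is an assumption, so check the Pydantic models behind `server/app.py` for the real schema:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:7860"

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the env server and decode the JSON response."""
    req = urllib.request.Request(
        f"{BASE}{path}",
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_honest_episode() -> dict:
    # Honest-eval reset: safety net off, strict shaping.
    obs = post("/reset", {"task": "hard_brief", "use_rag": True, "eval_mode": True})
    # Hypothetical action field names -- adapt to the env's real action model.
    for expert in ["analyst", "finance", "strategy", "hr"]:
        obs = post("/step", {"action": {"type": "consult", "expert": expert}})
    return post("/step", {"action": {"type": "submit"}})

# With the server running locally:
# final = run_honest_episode(); print(final.get("reward"))
```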
## Deployment

### Environment variables

| Variable | Purpose |
| -------- | ------- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For the LLM CoS; if unset, `inference.py` uses the **oracle** |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |

Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.
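A local `.env` might look like the sketch below. Every value is illustrative: the router URL and model id are assumptions, not repo defaults, so start from `.env.example` (only the task ids are taken from the benchmark table above):

```shell
# .env -- illustrative values only; copy from .env.example and adjust
API_BASE_URL=https://router.huggingface.co/v1    # assumed OpenAI-compatible endpoint
HF_TOKEN=hf_xxxxxxxxxxxx                         # or API_KEY; leave unset to force the oracle path
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct            # assumed CoS model id
AUTODATALAB_PLUS_TASKS=hard_brief,expert_brief,crisis_brief
```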
### Docker

Build and run (port **7860** matches `openenv.yaml`):

```bash
docker build -t autodatalab-plus .
# The first build can take several minutes: the PyTorch + OpenEnv stack downloads ~1.5 GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```

Smoke test:

```bash
curl -s http://127.0.0.1:7860/health
```

The image includes a **HEALTHCHECK** on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are baked into the image, so the **trained CoS** policy is available in `/visualize/run` whenever a checkpoint exists.
### Hugging Face Space (OpenEnv)

- Type: **Docker**, with `openenv.yaml` declaring `app: server.app:app` and `port: 7860`, per the [OpenEnv](https://huggingface.co/docs/hub/en/spaces) flow used for the hackathon.
- Ensure the **build context** is this repo; **do not** commit secrets.
- After deploying: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the **live URL** (adjust `ROOT`, or use an env var if your script supports it).
## Project layout (high level)

- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
## Evidence map (what every artefact actually is)

We deliberately label every artefact so a strict reader can tell training from smoke tests at a glance. Two distinct experiments live in this repo:

**Experiment A – MLP+REINFORCE training (this *is* the trained model in the headline)**

- `training/checkpoints/cos_final.pt` – a 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`).
- `training/reward_curves/before_after.json` – a real before/after evaluation under the *production* env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline table above (~0.73). The two numbers (~0.88 with the safety net / ~0.73 without) are both real: they are the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.
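For intuition about the Experiment A loop, here is a stdlib-only caricature: a softmax routing policy updated with vanilla REINFORCE against a toy grader that penalises both missing experts and wasted steps. It is illustrative only; the real run in `train_cos_local.py` uses a 2-layer MLP, lr 0.003, and the actual environment, whereas the linear policy, toy reward, and hyperparameters below are stand-ins:

```python
import math
import random

EXPERTS = ["analyst", "finance", "strategy", "hr"]
ACTIONS = EXPERTS + ["submit"]           # 4 consults + stop
N_FEAT = len(EXPERTS) + 1                # consulted flags + bias

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def features(consulted):
    return [1.0 if e in consulted else 0.0 for e in EXPERTS] + [1.0]

def toy_reward(consulted, steps):
    # Stand-in grader: missing experts and over-consulting both hurt.
    missing = len(EXPERTS) - len(consulted)
    wasted = max(0, steps - len(EXPERTS) - 1)   # canonical episode = 5 steps
    return 1.0 - 0.2 * missing - 0.05 * wasted

def run_episode(W, rng, lr=0.0, budget=8):
    consulted, taken, steps = set(), [], 0
    while steps < budget:
        x = features(consulted)
        logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
        p = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=p)[0]
        taken.append((a, p, x))
        steps += 1
        if ACTIONS[a] == "submit":
            break
        consulted.add(ACTIONS[a])
    R = toy_reward(consulted, steps)
    if lr:  # REINFORCE: grad of log pi(a|x) w.r.t. logits is onehot(a) - p
        for a, p, x in taken:
            for j in range(len(ACTIONS)):
                g = (1.0 if j == a else 0.0) - p[j]
                for i in range(N_FEAT):
                    W[j][i] += lr * R * g * x[i]
    return R

rng = random.Random(0)
W = [[0.0] * N_FEAT for _ in ACTIONS]
before = sum(run_episode(W, rng) for _ in range(200)) / 200
for _ in range(600):
    run_episode(W, rng, lr=0.1)
after = sum(run_episode(W, rng) for _ in range(200)) / 200
print(f"mean toy reward: {before:.2f} (random) vs {after:.2f} (trained)")
```

The update scales every action's log-probability gradient by the episode's terminal reward, which is exactly the trade-off the CoS has to learn: consult enough experts to avoid the missing-expert penalty, then stop before the step penalty bites.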
**Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing**

- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. **These are short runs** (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We **do not** claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.
**Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)**

- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG settings × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two **untrained baselines** (`base_naive`, `base_roundrobin`), one **actually-trained model** (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one **upper bound** (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do **not** label the oracle as a trained model anywhere in this README.

If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
## Honest known gaps

We want to be calibrated. As of submission:

- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, but not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- **Per-method evidence rollouts are deterministic** (decoded at low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you would get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** The production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- **The `memory/` corpus is intentionally small** (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.
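For readers unfamiliar with the retrieval path, plain Okapi BM25 over a handful of documents is all that is involved. A self-contained sketch with toy SOP strings; these documents and the tokeniser are illustrative, not the real `memory/` contents:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Classic Okapi BM25: score every doc against the query."""
    corpus = [tokenize(d) for d in docs]
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter(t for d in corpus for t in set(d))   # document frequency
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy SOP corpus (illustrative strings only).
sops = [
    "travel policy: expense reports are due within 30 days",
    "hiring policy: hr approves all offers before they go out",
    "finance policy: quarterly budget review owned by finance",
]
print(bm25_scores("hr hiring policy", sops))
```

On a corpus this small, rare terms like `hr` dominate the idf weighting, which is why the lift from RAG is modest but consistent.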
## License

Hackathon / team use per repository owner.