---
title: AutoDataLab Plus Plus
emoji: 🏒
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# AutoDataLab++

> **The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop.**
> A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and *submit* a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the *capability gap* between knowing facts and knowing when it has enough.

## Headline result (fallback disabled)

![Terminal reward, fallback disabled: base vs trained MLP CoS vs oracle upper bound, 3 hard tasks, 5 seeds](training/evidence/plots/headline_terminal_reward.png)

| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |

Numbers are **terminal grader scores in `[0,1]`**, mean over **5 seeds**, with the environment's expert-auto-fill **safety net OFF** (`eval_mode=true`). The two numbers per cell are the RAG-off / RAG-on settings. Raw runs are in `training/evidence/headline_benchmark.json`; reproduce with `python3 training/scripts/run_headline_benchmark.py`.

What this says, plainly:

- **Untrained baseline ~0.27.** A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: **+0.46 absolute reward over the naive baseline**, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** A handcoded canonical-order policy and an *upper bound*, not a trained model; it is published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.

The take-away is the **+0.46 trained-vs-baseline gap**, plus the **~0.15 oracle headroom** that future RL runs can chase. The base→oracle gap (~0.6) is the size of the routing problem; our trained MLP closes ~75 % of it.

## Live demo

- **Office UI (Hugging Face Space):** [https://uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
- Health endpoint: [https://uchihamadara1816-autodatalab2-0.hf.space/health](https://uchihamadara1816-autodatalab2-0.hf.space/health)
- Pick a task, toggle RAG, and run the four policies side by side. The naive baseline submits an incomplete brief; the **MLP trained CoS** and **oracle** rows finish with all four boxes lit.

## Quickstart (local)

```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```

| Endpoint | URL |
| --- | --- |
| API root | [http://127.0.0.1:7860/](http://127.0.0.1:7860/) |
| Health | [http://127.0.0.1:7860/health](http://127.0.0.1:7860/health) |
| **Demo UI** | [http://127.0.0.1:7860/ui/](http://127.0.0.1:7860/ui/) |
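Once the server is up, the same endpoints can be exercised programmatically. Below is a minimal smoke-test sketch using `requests` (assumed installed); the `/reset` payload mirrors the curl example in the Honest-evaluation section further down, and no particular response schema is assumed beyond the server returning JSON.

```python
# Minimal programmatic smoke test against the locally running server.
# Assumes the /reset payload documented under "Honest evaluation" below;
# the response is printed as-is, without assuming a particular schema.
import json

import requests

BASE = "http://127.0.0.1:7860"

health = requests.get(f"{BASE}/health", timeout=10)
health.raise_for_status()
print("health:", health.text)

reset = requests.post(
    f"{BASE}/reset",
    json={"task": "hard_brief", "use_rag": True, "eval_mode": True},
    timeout=30,
)
reset.raise_for_status()
print(json.dumps(reset.json(), indent=2)[:500])  # peek at the start of the observation payload
```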
Pre-submission checks:

```bash
python3 validate_submission.py
openenv validate --verbose
```

Oracle rollout (3 tasks):

```bash
python3 inference.py --oracle
```

### Honest evaluation: turn off the safety net

The environment ships with a *production-mode* fallback that auto-completes any required expert the policy forgot, so end-users always see a full brief. For **evaluation** you want the opposite: the policy must succeed (or fail) on its own.

```bash
# CLI
python3 training/scripts/run_headline_benchmark.py   # 5 seeds, fallback OFF

# HTTP
curl -X POST http://127.0.0.1:7860/reset \
  -H 'content-type: application/json' \
  -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```

When `eval_mode=true`, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is allowed, and the omission is reflected in the terminal grader score, which is exactly what produces the headline trained-vs-baseline gap shown above.

## Deployment

### Environment variables

| Variable | Purpose |
| -------- | ------- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For the LLM CoS; if unset, `inference.py` falls back to the **oracle** |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |

Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.

### Docker

Build and run (port **7860** matches `openenv.yaml`):

```bash
docker build -t autodatalab-plus .
# First build can take several minutes: the PyTorch + OpenEnv stack downloads ~1.5GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```

Smoke test:

```bash
curl -s http://127.0.0.1:7860/health
```

The image includes a **HEALTHCHECK** on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are copied into the image, so the **trained CoS** policy is available in `/visualize/run` whenever a checkpoint exists.

### Hugging Face Space (OpenEnv)

- Type: **Docker**, or use `openenv.yaml` (`app: server.app:app`, `port: 7860`) as documented in the [OpenEnv](https://huggingface.co/docs/hub/en/spaces) flow used for the hackathon.
- Ensure the **build context** is this repo; **do not** commit large secrets.
- After deploy: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the **live URL** (adjust `ROOT`, or set it via an environment variable if the script supports it).

## Project layout (high level)

- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs (see the sketch below)
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
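The `[START]`/`[STEP]`/`[END]` markers emitted by `inference.py` can be consumed mechanically. Here is a minimal sketch that runs the oracle rollout and tallies steps per episode; only the three marker strings are taken from this README, while the output stream (stdout vs stderr) and the exact line format are assumptions, so treat it as illustrative.

```python
# Run the oracle rollout and count [STEP] lines per [START]...[END] episode.
# Only the [START]/[STEP]/[END] marker strings are taken from this README;
# the exact line format and output stream are assumptions.
import subprocess

proc = subprocess.run(
    ["python3", "inference.py", "--oracle"],
    capture_output=True,
    text=True,
    check=True,
)

episode, steps = 0, 0
for line in (proc.stdout + proc.stderr).splitlines():
    if "[START]" in line:
        episode += 1
        steps = 0
    elif "[STEP]" in line:
        steps += 1
    elif "[END]" in line:
        print(f"episode {episode}: {steps} step(s)")
```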
## Evidence map (what every artefact actually is)

We deliberately label every artefact so a strict reader can tell training from smoke tests at a glance. Three distinct experiments live in this repo:

**Experiment A – MLP+REINFORCE training (this *is* the trained model in the headline)**

- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`).
- `training/reward_curves/before_after.json` – real before/after evaluation under the *production* env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline plot above (~0.73). The two numbers (~0.88 with safety net / ~0.73 without) are both real: they are the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.

**Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing**

- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. **These are short runs** (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We **do not** claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.

**Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)**

- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG settings × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two **untrained baselines** (`base_naive`, `base_roundrobin`), one **actually-trained model** (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one **upper bound** (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do **not** label the oracle as a trained model anywhere in this README.

If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
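To sanity-check these artefacts without re-running anything, here is a minimal inspection sketch. It assumes only that the checkpoint opens with `torch.load` and that the benchmark file is plain JSON; it prints whatever structure is present rather than relying on internal field names.

```python
# Peek at the Experiment A checkpoint and the Experiment C benchmark file.
# No internal schema is assumed; we just print whatever structure is present.
import json

import torch

# Experiment A: REINFORCE-trained MLP CoS checkpoint.
# On newer torch, pass weights_only=False if the file is not a plain state_dict.
ckpt = torch.load("training/checkpoints/cos_final.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(f"{key}: {shape}")
else:
    print(type(ckpt))

# Experiment C: raw headline numbers (schema autodatalab-plus.headline_benchmark.v2).
with open("training/evidence/headline_benchmark.json") as f:
    bench = json.load(f)
print(json.dumps(bench, indent=2)[:800])  # first part only; the full file holds per-seed runs
```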
## Honest known gaps

We want to be calibrated. As of submission:

- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- **Per-method evidence rollouts are deterministic** (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- **`memory/` corpus is intentionally small** (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.

## License

Hackathon / team use per repository owner.