---
title: AutoDataLab Plus Plus
emoji: 🏢
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AutoDataLab++
> **The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop.**
> A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and *submit* a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the *capability gap* between knowing facts and knowing when it has enough.
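The trade-off the grader and the shaped reward encode can be sketched in a few lines. This is an illustrative toy, not the actual grader in `ceo_brief_env/`; the function name, weights, and penalty constants below are made up for the example.

```python
# Toy sketch of the two failure modes the reward punishes. The weights and
# penalty constants are hypothetical; the real grader lives in ceo_brief_env/.
REQUIRED = {"analyst", "finance", "strategy", "hr"}

def toy_brief_score(consulted: set, steps_used: int) -> float:
    coverage = len(consulted & REQUIRED) / len(REQUIRED)     # skip an expert -> lower grade
    over_consult = max(0, steps_used - (len(REQUIRED) + 1))  # +1 for the submit step
    return max(0.0, coverage - 0.05 * over_consult)          # over-consulting erodes the score

print(toy_brief_score({"analyst"}, steps_used=2))   # incomplete brief: low score
print(toy_brief_score(REQUIRED, steps_used=5))      # canonical routing: full score
print(toy_brief_score(REQUIRED, steps_used=8))      # complete but wasteful: shaped penalty
```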
## Headline result (fallback disabled)

| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
Numbers are **terminal grader scores in `[0,1]`**, mean over **5 seeds**, with the environment's expert-auto-fill **safety net OFF** (`eval_mode=true`). Two RAG settings shown per cell (off / on). Raw runs in `training/evidence/headline_benchmark.json`. Reproduce with `python3 training/scripts/run_headline_benchmark.py`.
What this says, plainly:
- **Untrained baseline ~0.27.** A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: **+0.46 absolute reward over the naive baseline**, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** Hand-coded canonical-order policy. *Upper bound*, not a trained model; published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.
The takeaway is the **+0.46 trained-vs-baseline gap**, plus the **~0.15 oracle headroom** that future RL runs can chase. The base-to-oracle gap (~0.6) is the size of the routing problem; our trained MLP closes ~75% of it.
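A quick back-of-envelope check of those numbers, using the rounded RAG-off column from the table above:

```python
# Sanity check on the gap arithmetic quoted above (RAG-off column).
base, trained, oracle = 0.27, 0.73, 0.88
print(f"trained vs base : +{trained - base:.2f}")                     # +0.46
print(f"oracle headroom : +{oracle - trained:.2f}")                   # +0.15
print(f"gap closed      : {(trained - base) / (oracle - base):.0%}")  # ~75%
```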
## Live demo
- **Office UI (Hugging Face Space):** [https://uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
- Health endpoint: [https://uchihamadara1816-autodatalab2-0.hf.space/health](https://uchihamadara1816-autodatalab2-0.hf.space/health)
- Pick a task, toggle RAG, run the four policies side-by-side. The naive baseline submits an incomplete brief; the **MLP trained CoS** and **oracle** rows finish with all four boxes lit.
## Quickstart (local)
```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```
| Endpoint | URL |
| --- | --- |
| API root | [http://127.0.0.1:7860/](http://127.0.0.1:7860/) |
| Health | [http://127.0.0.1:7860/health](http://127.0.0.1:7860/health) |
| **Demo UI** | [http://127.0.0.1:7860/ui/](http://127.0.0.1:7860/ui/) |
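If you prefer to poke the API from Python instead of a browser, here is a minimal sketch. It assumes `requests` is installed, that the server from the quickstart is running on port 7860, and that `/health` and `/reset` return JSON; the reset payload mirrors the curl example further down.

```python
# Minimal local smoke test over HTTP (assumes the server is already running).
import requests

BASE = "http://127.0.0.1:7860"

print(requests.get(f"{BASE}/health").json())

reset = requests.post(
    f"{BASE}/reset",
    json={"task": "hard_brief", "use_rag": True, "eval_mode": True},
)
reset.raise_for_status()
print(reset.json())
```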
Pre-submission checks:
```bash
python3 validate_submission.py
openenv validate --verbose
```
Oracle rollout (3 tasks):
```bash
python3 inference.py --oracle
```
### Honest evaluation: turn off the safety net
The environment ships with a *production-mode* fallback that auto-completes any
required expert the policy forgot, so end-users always see a full brief. For
**evaluation** you want the opposite: the policy must succeed (or fail) on its
own.
```bash
# CLI
python3 training/scripts/run_headline_benchmark.py # 5 seeds, fallback OFF
# HTTP
curl -X POST http://127.0.0.1:7860/reset \
-H 'content-type: application/json' \
-d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```
When `eval_mode=true`, the env runs with `auto_fill_required=False` and
`shaping="strict"`. Submitting an incomplete brief is allowed and is reflected
in the terminal grader score, which is exactly what produces the headline
trained-vs-baseline gap shown above.
## Deployment
### Environment variables
| Variable | Purpose |
| -------- | ------- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For LLM CoS; if unset, `inference.py` uses **oracle** |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |
Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.
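As a rough sketch of the precedence described in the table (not the actual `inference.py` logic, just the documented behaviour):

```python
# Sketch of the credential fallback described above; not the real inference.py code.
import os

api_key = os.environ.get("API_KEY") or os.environ.get("HF_TOKEN")
base_url = os.environ.get("API_BASE_URL")  # unset -> the repo's Hugging Face router default
model = os.environ.get("MODEL_NAME")

if api_key:
    print(f"LLM CoS path via {base_url or 'default router'} using {model or '<MODEL_NAME unset>'}")
else:
    print("No API_KEY / HF_TOKEN set: inference.py falls back to the oracle policy")
```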
### Docker
Build and run (port **7860** matches `openenv.yaml`):
```bash
docker build -t autodatalab-plus .
# First build can take several minutes: PyTorch + OpenEnv stack download ~1.5GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```
Smoke test:
```bash
curl -s http://127.0.0.1:7860/health
```
The image includes a **HEALTHCHECK** on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are copied into the image, so the **trained CoS** policy is available in `/visualize/run` whenever a checkpoint exists.
### Hugging Face Space (OpenEnv)
- Type: **Docker**, or rely on the `openenv.yaml` settings (`app: server.app:app`, `port: 7860`) as documented in the [OpenEnv](https://huggingface.co/docs/hub/en/spaces) Spaces flow used for the hackathon.
- Ensure **build context** is this repo; **do not** commit large secrets.
- After deploy: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the **live URL** (adjust `ROOT`, or pass it via an environment variable if the script supports it).
## Project layout (high level)
- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
## Evidence map (what every artefact actually is)
We deliberately label every artefact so a strict reader can tell training from smoke-tests at a glance. Three distinct experiments live in this repo:
**Experiment A – MLP+REINFORCE training (this *is* the trained model in the headline)**
- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`); a schematic sketch of the loop follows this list.
- `training/reward_curves/before_after.json` – real before/after evaluation under the *production* env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline plot above (~0.73). The two numbers (~0.88 with safety net / ~0.73 without) are both real; they're the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model we put behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.
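For readers who want the shape of that run without opening the script, here is a schematic REINFORCE loop. It is a self-contained toy, not `training/scripts/train_cos_local.py`: the observation encoding, toy reward, and running-mean baseline are stand-ins, and only the 600-episode / lr 0.003 settings mirror the run described above.

```python
# Schematic REINFORCE loop for a small routing policy (toy stand-in, see note above).
import torch
import torch.nn as nn

N_EXPERTS = 4               # analyst, finance, strategy, hr
N_ACTIONS = N_EXPERTS + 1   # + submit
BUDGET = 8

class RoutingPolicy(nn.Module):
    """2-layer MLP over a 'which experts have I consulted' bit-vector."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_EXPERTS, hidden), nn.Tanh(),
            nn.Linear(hidden, N_ACTIONS),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def rollout(policy):
    """One toy episode: reward favours consulting everyone once, then submitting."""
    consulted = torch.zeros(N_EXPERTS)
    log_probs, steps = [], 0
    while steps < BUDGET:
        dist = policy(consulted)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        steps += 1
        if action.item() == N_EXPERTS:      # submit action ends the episode
            break
        consulted[action.item()] = 1.0      # consult one expert
    coverage = consulted.mean().item()                     # fraction of experts seen
    over_consult = max(0, steps - (N_EXPERTS + 1)) * 0.05  # shaped over-consult penalty
    reward = coverage - over_consult
    return torch.stack(log_probs), reward

policy = RoutingPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)   # lr 0.003, as in the run above
baseline = 0.0
for episode in range(600):                              # 600 episodes, as in the run above
    log_probs, reward = rollout(policy)
    baseline = 0.9 * baseline + 0.1 * reward            # running-mean baseline
    loss = -(reward - baseline) * log_probs.sum()       # REINFORCE objective
    opt.zero_grad(); loss.backward(); opt.step()
```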
**Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing**
- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. **These are short runs** (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We **do not** claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.
**Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)**
- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two **untrained baselines** (`base_naive`, `base_roundrobin`), one **actually-trained model** (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one **upper bound** (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do **not** label the oracle as a trained model anywhere in this README.
If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
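To eyeball the raw numbers behind that JSON without assuming its full schema (only the path and the schema id above come from this repo; the snippet itself is generic):

```python
# Quick look at the raw headline benchmark numbers
# (schema "autodatalab-plus.headline_benchmark.v2");
# prints the first couple of KB rather than assuming field names.
import json

with open("training/evidence/headline_benchmark.json") as f:
    bench = json.load(f)

print(json.dumps(bench, indent=2)[:2000])
```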
## Honest known gaps
We want to be calibrated. As of submission:
- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- **Per-method evidence rollouts are deterministic** (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- **`memory/` corpus is intentionally small** (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.
## License
Hackathon / team use per repository owner.