---
title: AutoDataLab Plus Plus
emoji: 🏒
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AutoDataLab++
> **The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop.**
> A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and *submit* a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the *capability gap* between knowing facts and knowing when it has enough.
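The trade-off above can be illustrated with a toy sketch. This is **not** the env's actual grader (see `ceo_brief_env/` for that); the expert names and the budget of 8 are illustrative assumptions, only the shape of the incentive matters:

```python
# Toy stand-ins for the four required specialists (illustrative, not the real grader).
REQUIRED = {"analyst", "finance", "strategy", "hr"}

def toy_terminal_score(consulted: set) -> float:
    """Fraction of required experts consulted before submit: skipping one caps the score."""
    return len(REQUIRED & consulted) / len(REQUIRED)

def toy_shaped_penalty(steps_used: int, budget: int = 8) -> float:
    """Small per-step cost once the (assumed) step budget is exceeded: over-consulting bleeds reward."""
    return 0.0 if steps_used <= budget else 0.05 * (steps_used - budget)
```

Under this sketch, both failure modes hurt: submitting after consulting only the analyst scores 0.25, while looping past the budget pays a growing penalty even with a complete brief.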
## Headline result (fallback disabled)
![Terminal reward, fallback disabled: base vs trained MLP CoS vs oracle upper bound, 3 hard tasks, 5 seeds](training/evidence/plots/headline_terminal_reward.png)
| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
Numbers are **terminal grader scores in `[0,1]`**, mean over **5 seeds**, with the environment's expert-auto-fill **safety net OFF** (`eval_mode=true`). Two RAG settings shown per cell (off / on). Raw runs in `training/evidence/headline_benchmark.json`. Reproduce with `python3 training/scripts/run_headline_benchmark.py`.
What this says, plainly:
- **Untrained baseline ~0.27.** A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom, a **+0.46 absolute reward gain over the naive baseline**, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** Handcoded canonical-order policy. An *upper bound*, not a trained model; published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.
The takeaway is the **+0.46 trained-vs-baseline gap**, plus the **~0.15 oracle headroom** that future RL runs can chase. The base-to-oracle gap (~0.61) is the size of the routing problem; our trained MLP closes ~75% of it.
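As a sanity check on that claim, the headroom-closed fraction follows directly from the table's mean scores (RAG-off column, values copied from above):

```python
# Mean terminal scores from the headline table (RAG off).
base, trained, oracle = 0.27, 0.73, 0.88

# Fraction of the base-to-oracle routing headroom the trained MLP recovers.
closed = (trained - base) / (oracle - base)
print(f"{closed:.0%}")  # prints "75%"
```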
## Live demo
- **Office UI (Hugging Face Space):** [https://uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
- Health endpoint: [https://uchihamadara1816-autodatalab2-0.hf.space/health](https://uchihamadara1816-autodatalab2-0.hf.space/health)
- Pick a task, toggle RAG, run the four policies side-by-side. The naive baseline submits an incomplete brief; the **MLP trained CoS** and **oracle** rows finish with all four boxes lit.
## Quickstart (local)
```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```
| Endpoint | URL |
| --- | --- |
| API root | [http://127.0.0.1:7860/](http://127.0.0.1:7860/) |
| Health | [http://127.0.0.1:7860/health](http://127.0.0.1:7860/health) |
| **Demo UI** | [http://127.0.0.1:7860/ui/](http://127.0.0.1:7860/ui/) |
Pre-submission checks:
```bash
python3 validate_submission.py
openenv validate --verbose
```
Oracle rollout (3 tasks):
```bash
python3 inference.py --oracle
```
### Honest evaluation: turn off the safety net
The environment ships with a *production-mode* fallback that auto-completes any
required expert the policy forgot, so end-users always see a full brief. For
**evaluation** you want the opposite: the policy must succeed (or fail) on its
own.
```bash
# CLI
python3 training/scripts/run_headline_benchmark.py # 5 seeds, fallback OFF
# HTTP
curl -X POST http://127.0.0.1:7860/reset \
-H 'content-type: application/json' \
-d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```
When `eval_mode=true`, the env runs with `auto_fill_required=False` and
`shaping="strict"`. Submitting an incomplete brief is allowed and is reflected
in the terminal grader score, which is exactly what produces the headline
trained-vs-baseline gap shown above.
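The same honest-eval episode can be driven from Python using only the standard library. A minimal sketch: the `/reset` body mirrors the curl example, but the `/step` action names below are illustrative assumptions, so check `server/app.py` for the actual request schema:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:7860"

def post(path: str, payload: dict) -> dict:
    """POST a JSON body to the local env server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def eval_reset_payload(task: str, use_rag: bool = True) -> dict:
    """Build the honest-eval /reset body (safety net OFF, strict shaping)."""
    return {"task": task, "use_rag": use_rag, "eval_mode": True}

if __name__ == "__main__":
    obs = post("/reset", eval_reset_payload("hard_brief"))
    # Action names here are illustrative assumptions, not the real schema.
    for action in ["consult_analyst", "consult_finance",
                   "consult_strategy", "consult_hr", "submit"]:
        obs = post("/step", {"action": action})
    print(obs)
```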
## Deployment
### Environment variables
| Variable | Purpose |
| -------- | ------- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For LLM CoS; if unset, `inference.py` uses **oracle** |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |
Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.
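A filled-in `.env` might look like the following; every value here is a placeholder for illustration, not a default shipped with the repo:

```shell
# Placeholder values: substitute your own endpoint, token, and model id.
API_BASE_URL=https://example-openai-compatible.endpoint/v1
HF_TOKEN=hf_your_token_here        # leave unset to fall back to the oracle CoS
MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
AUTODATALAB_PLUS_TASKS=hard_brief,expert_brief,crisis_brief
```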
### Docker
Build and run (port **7860** matches `openenv.yaml`):
```bash
docker build -t autodatalab-plus .
# First build can take several minutes: PyTorch + OpenEnv stack download ~1.5GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```
Smoke test:
```bash
curl -s http://127.0.0.1:7860/health
```
The image includes a **HEALTHCHECK** on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are included so the **trained CoS** policy is available in `/visualize/run` when a checkpoint exists.
### Hugging Face Space (OpenEnv)
- Type: **Docker**. `openenv.yaml` declares `app: server.app:app` and `port: 7860`, matching the [OpenEnv](https://huggingface.co/docs/hub/en/spaces) Spaces flow used for the hackathon.
- Ensure **build context** is this repo; **do not** commit large secrets.
- After deploy: hit `/health`, open `/ui/`, run `python3 validate_submission.py` against the **live URL** (adjust `ROOT` or use env if your script supports it).
## Project layout (high level)
- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
## Evidence map (what every artefact actually is)
We deliberately label every artefact so a strict reader can tell training from smoke-tests at a glance. Three distinct experiments live in this repo:
**Experiment A – MLP+REINFORCE training (this *is* the trained model in the headline)**
- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`).
- `training/reward_curves/before_after.json` – real before/after evaluation under the *production* env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline plot above (~0.73). The two numbers (~0.88 with safety net / ~0.73 without) are both real: they are the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model we put behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.
**Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing**
- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. **These are short runs** (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We **do not** claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.
**Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)**
- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two **untrained baselines** (`base_naive`, `base_roundrobin`), one **actually-trained model** (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one **upper bound** (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do **not** label the oracle as a trained model anywhere in this README.
If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
## Honest known gaps
We want to be calibrated. As of submission:
- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- **Per-method evidence rollouts are deterministic** (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- **`memory/` corpus is intentionally small** (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.
## License
Hackathon / team use per repository owner.