---
title: AutoDataLab Plus Plus
emoji: π’
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
# AutoDataLab++
The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop. A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and submit a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the capability gap between knowing facts and knowing when it has enough.
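The skip-vs-over-consult trade-off can be sketched as a toy reward. This is illustrative only: the expert names match the environment, but the formula and penalty weights are invented here, not the actual grader.

```python
# Toy version of the trade-off described above -- NOT the environment's
# actual grader; the weights and step budget are invented for illustration.
REQUIRED = {"analyst", "finance", "strategy", "hr"}

def toy_brief_reward(consulted, step_budget=6):
    """Score a trajectory: reward expert coverage, penalise steps past the budget."""
    coverage = len(REQUIRED & set(consulted)) / len(REQUIRED)
    overuse = 0.1 * max(0, len(consulted) - step_budget)
    return coverage - overuse

toy_brief_reward(["analyst", "finance", "strategy", "hr"])  # full brief -> 1.0
toy_brief_reward(["analyst"])                               # skipped experts -> 0.25
toy_brief_reward(["analyst"] * 10)                          # over-consulting -> lower still
```

Both failure modes lose reward, so "consult everyone forever" is no safer than "submit early."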
## Headline result (fallback disabled)
| Task | Base (naive) | Trained MLP CoS (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| hard_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.90 |
| expert_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |
| crisis_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |
Numbers are terminal grader scores in [0, 1], mean over 5 seeds, with the environment's expert-auto-fill safety net OFF (`eval_mode=true`). Each cell shows two RAG settings (off / on). Raw runs are in `training/evidence/headline_benchmark.json`. Reproduce with `python3 training/scripts/run_headline_benchmark.py`.
What this says, plainly:
- Untrained baseline ~0.27. A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- Our actually-trained MLP CoS ~0.73. A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: +0.46 absolute reward over the naive baseline, on tasks it was not memorising. This is the headline number for "what we trained."
- Oracle router ~0.88. Hand-coded canonical-order policy. An upper bound, not a trained model, published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.

The take-away is the +0.46 trained-vs-baseline gap, plus the ~0.15 oracle headroom that future RL runs can chase. The base-to-oracle gap (~0.61) is the size of the routing problem; our trained MLP closes ~75% of it.
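A quick back-of-envelope check of those numbers (RAG-off column):

```python
# Sanity-check the headroom arithmetic quoted above (RAG-off scores).
base, trained, oracle = 0.27, 0.73, 0.88

gap = oracle - base               # total routing headroom
closed = (trained - base) / gap   # fraction the trained MLP recovers
print(f"headroom {gap:.2f}, closed {closed:.0%}")  # headroom 0.61, closed 75%
```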
## Live demo
- Office UI (Hugging Face Space): https://uchihamadara1816-autodatalab2-0.hf.space/ui/
- Health endpoint: https://uchihamadara1816-autodatalab2-0.hf.space/health
- Pick a task, toggle RAG, run the four policies side-by-side. The naive baseline submits an incomplete brief; the MLP trained CoS and oracle rows finish with all four boxes lit.
## Quickstart (local)

```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```
| Endpoint | URL |
|---|---|
| API root | http://127.0.0.1:7860/ |
| Health | http://127.0.0.1:7860/health |
| Demo UI | http://127.0.0.1:7860/ui/ |
Pre-submission checks:

```bash
python3 validate_submission.py
openenv validate --verbose
```

Oracle rollout (3 tasks):

```bash
python3 inference.py --oracle
```
## Honest evaluation: turn off the safety net
The environment ships with a production-mode fallback that auto-completes any required expert the policy forgot, so end-users always see a full brief. For evaluation you want the opposite: the policy must succeed (or fail) on its own.
```bash
# CLI
python3 training/scripts/run_headline_benchmark.py  # 5 seeds, fallback OFF

# HTTP
curl -X POST http://127.0.0.1:7860/reset \
  -H 'content-type: application/json' \
  -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```
When `eval_mode=true`, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is allowed and is reflected in the terminal grader score, which is exactly what produces the headline trained-vs-baseline gap shown above.
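The same reset call from Python, using only the fields shown in the curl example (the helper name is ours, not part of the repo):

```python
import json
from urllib import request

def reset_body(task, use_rag=True, eval_mode=True):
    """JSON body for POST /reset; fields match the curl example above."""
    return {"task": task, "use_rag": use_rag, "eval_mode": eval_mode}

body = json.dumps(reset_body("hard_brief")).encode()
req = request.Request("http://127.0.0.1:7860/reset", data=body,
                      headers={"content-type": "application/json"})
# request.urlopen(req)  # uncomment with the local server running
```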
## Deployment

### Environment variables
| Variable | Purpose |
|---|---|
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For LLM CoS; if unset, `inference.py` uses oracle |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |
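A sketch of how a client script might resolve these variables. Only the variable names and the key-or-token fallback come from the table above; the default strings are placeholders, not the repo's real defaults.

```python
import os

def resolve_llm_config(env=None):
    """Resolve the variables from the table above (placeholder defaults)."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("API_BASE_URL", "<hf-router-default>"),
        # Either key name works; None means inference.py falls back to oracle.
        "api_key": env.get("API_KEY") or env.get("HF_TOKEN"),
        "model": env.get("MODEL_NAME"),
        "tasks": [t for t in env.get("AUTODATALAB_PLUS_TASKS", "").split(",") if t],
    }
```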
Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.
### Docker

Build and run (port 7860 matches `openenv.yaml`):
```bash
docker build -t autodatalab-plus .
# First build can take several minutes: the PyTorch + OpenEnv stack downloads ~1.5 GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```
Smoke test:

```bash
curl -s http://127.0.0.1:7860/health
```
The image includes a `HEALTHCHECK` on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are included, so the trained CoS policy is available in `/visualize/run` when a checkpoint exists.
### Hugging Face Space (OpenEnv)
- Type: Docker, or use the `openenv.yaml` (`app: server.app:app`, `port: 7860`) as documented in the OpenEnv flow you use for the hackathon.
- Ensure the build context is this repo; do not commit large secrets.
- After deploy: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the live URL (adjust `ROOT` or use an env var if your script supports it).
## Project layout (high level)

- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS; `[START]`/`[STEP]`/`[END]` logs
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
## Evidence map (what every artefact actually is)
We deliberately label every artefact so a strict reader can tell training from smoke tests at a glance. Three distinct experiments live in this repo:
### Experiment A – MLP+REINFORCE training (this is the trained model in the headline)
- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`).
- `training/reward_curves/before_after.json` – real before/after evaluation under the production env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the honest regime (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline plot above (0.73). The two numbers (0.88 with safety net / ~0.73 without) are both real: they are the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model we put behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.
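For readers who want the shape of Experiment A without opening the training script: a pared-down REINFORCE loop on the routing problem. This is a toy, assumptions and all: a linear softmax policy rather than the repo's 2-layer MLP, and an invented coverage-minus-step-cost reward. It only shows the update rule, not the real environment.

```python
import math, random

# Toy REINFORCE sketch (assumed setup, not train_cos_local.py itself):
# a stateless softmax policy over four experts plus "submit".
random.seed(0)
ACTIONS = ["analyst", "finance", "strategy", "hr", "submit"]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def rollout(logits, max_steps=8):
    """Sample a routing trajectory; toy reward = coverage minus step cost."""
    probs = softmax(logits)
    chosen, consulted = [], set()
    for _ in range(max_steps):
        a = random.choices(range(len(ACTIONS)), weights=probs)[0]
        chosen.append(a)
        if ACTIONS[a] == "submit":
            break
        consulted.add(ACTIONS[a])
    return chosen, len(consulted) / 4 - 0.02 * len(chosen)

logits, baseline, lr = [0.0] * len(ACTIONS), 0.0, 0.2
for _ in range(3000):
    chosen, r = rollout(logits)
    baseline += 0.05 * (r - baseline)       # running-mean baseline
    probs = softmax(logits)                  # constant within one episode
    grad = [0.0] * len(logits)
    for a in chosen:                         # grad of sum_t log pi(a_t)
        for i in range(len(grad)):
            grad[i] += (1.0 if i == a else 0.0) - probs[i]
    for i in range(len(logits)):
        logits[i] += lr * (r - baseline) * grad[i]

avg = sum(rollout(logits)[1] for _ in range(200)) / 200
```

After training, the policy suppresses early `submit` and spreads consultations across the experts, so `avg` ends well above the untrained policy's score.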
### Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing
- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. These are short runs (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We do not claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.
### Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)
- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG settings × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema: `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two untrained baselines (`base_naive`, `base_roundrobin`), one actually-trained model (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one upper bound (`oracle_router`, the hand-coded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do not label the oracle as a trained model anywhere in this README.
If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
## Honest known gaps
We want to be calibrated. As of submission:
- GRPO runs are short. ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- Per-method evidence rollouts are deterministic (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- `eval_mode` is opt-in. Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true`, so terminal scores reflect the policy's own routing competence.
- The `memory/` corpus is intentionally small (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.
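For context on what "BM25 over a few SOPs" means in practice, here is a generic textbook Okapi BM25 scorer. It is not the repo's retriever, and the example documents in the usage note are invented.

```python
import math
from collections import Counter

# Minimal textbook BM25 (Okapi variant) -- a generic sketch, not the
# project's actual retriever. Whitespace tokenisation for brevity.
def bm25_scores(query, docs, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(w for t in toks for w in set(t))  # document frequency
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log(1 + (n - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores
```

With a handful of short policy documents, `bm25_scores("expense policy", docs)` ranks the document that actually contains those terms first, which is the whole job of the small `memory/` corpus.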
## License
Hackathon / team use per repository owner.
