---
title: AutoDataLab Plus Plus
emoji: 🏢
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# AutoDataLab++

The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop. A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and submit a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the capability gap between knowing facts and knowing when it has enough.
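A minimal sketch of that incentive structure (the function name, state fields, and the 0.1 / 0.05 penalty weights below are illustrative assumptions; the real graders live in `ceo_brief_env/`):

```python
# Illustrative sketch, not the repo's grader: names and weights are assumptions.
REQUIRED_EXPERTS = {"data_analyst", "finance", "strategy", "hr"}

def shaped_reward(consulted: set, steps_used: int, step_budget: int,
                  grader_score: float) -> float:
    missing = REQUIRED_EXPERTS - consulted          # skipped experts -> incomplete brief
    over_budget = max(0, steps_used - step_budget)  # over-consulting -> wasted steps
    return grader_score - 0.1 * len(missing) - 0.05 * over_budget
```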

## Headline result (fallback disabled)

*Terminal reward, fallback disabled: base vs trained MLP CoS vs oracle upper bound, 3 hard tasks, 5 seeds.*

| Task | Base (naive) | Trained MLP CoS (REINFORCE, 600 ep) | Oracle router (upper bound) |
| --- | --- | --- | --- |
| hard_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.90 |
| expert_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |
| crisis_brief | 0.27 / 0.32 | 0.73 / 0.74 | 0.88 / 0.89 |

Numbers are terminal grader scores in [0, 1], mean over 5 seeds, with the environment's expert-auto-fill safety net OFF (`eval_mode=true`). Each cell shows two RAG settings (off / on). Raw runs live in `training/evidence/headline_benchmark.json`. Reproduce with `python3 training/scripts/run_headline_benchmark.py`.

What this says, plainly:

- **Untrained baseline ~0.27.** A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: +0.46 absolute reward over the naive baseline, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** Handcoded canonical-order policy. Upper bound, not a trained model; published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.

The take-away is the +0.46 trained-vs-baseline gap, plus the ~0.15 oracle headroom that future RL runs can chase. The base→oracle gap (~0.6) is the size of the routing problem; our trained MLP closes ~75% of it.
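The arithmetic behind those two claims, using the RAG-off means from the table above:

```python
base, trained, oracle = 0.27, 0.73, 0.88  # RAG-off column of the headline table

print(f"trained-vs-base gap: {trained - base:+.2f}")                         # +0.46
print(f"routing headroom closed: {(trained - base) / (oracle - base):.0%}")  # 75%
```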

## Live demo

Open the Space UI at `/ui/` to explore the environment interactively; the `/visualize/run` endpoint compares policies side by side.

## Quickstart (local)

```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```

Pre-submission checks:

```bash
python3 validate_submission.py
openenv validate --verbose
```

Oracle rollout (3 tasks):

```bash
python3 inference.py --oracle
```
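For intuition, the oracle is a fixed canonical-order router. A toy version (the action strings and consulted-set interface are hypothetical; the real oracle lives in `inference.py`):

```python
# Hypothetical action names and state shape; see inference.py for the real oracle.
CANONICAL_ORDER = ["data_analyst", "finance", "strategy", "hr"]

def oracle_policy(consulted: set) -> str:
    for expert in CANONICAL_ORDER:
        if expert not in consulted:
            return f"route:{expert}"  # consult the next missing expert
    return "submit"                   # brief is complete: stop and submit
```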

## Honest evaluation: turn off the safety net

The environment ships with a production-mode fallback that auto-completes any required expert the policy forgot, so end-users always see a full brief. For evaluation you want the opposite: the policy must succeed (or fail) on its own.

```bash
# CLI
python3 training/scripts/run_headline_benchmark.py    # 5 seeds, fallback OFF

# HTTP
curl -X POST http://127.0.0.1:7860/reset \
     -H 'content-type: application/json' \
     -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```

When `eval_mode=true`, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is allowed and is reflected in the terminal grader score, which is exactly what produces the headline trained-vs-baseline gap shown above.
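The same honest-eval reset from Python, for scripting rollouts (the `/step` payload is only hinted at here and `/state` is assumed to answer GET; the `ceo_brief_env/` Pydantic models define the real schemas):

```python
import requests

BASE = "http://127.0.0.1:7860"

# Mirrors the curl call above: safety net OFF, strict shaping.
obs = requests.post(f"{BASE}/reset", json={
    "task": "hard_brief", "use_rag": True, "eval_mode": True,
}).json()

# Loop your policy's actions through /step until the episode ends, e.g.
#   requests.post(f"{BASE}/step", json={"action": ...})  # schema: see ceo_brief_env/
# then inspect the terminal grader score:
print(requests.get(f"{BASE}/state").json())
```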

## Deployment

### Environment variables

| Variable | Purpose |
| --- | --- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For the LLM CoS; if unset, `inference.py` uses the oracle |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |

Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.
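A sketch of how those variables plausibly resolve at startup (the `API_KEY`-before-`HF_TOKEN` priority and the default task list are assumptions; `inference.py` holds the real logic):

```python
import os

# Assumed resolution order; check inference.py for the actual behaviour.
api_key = os.environ.get("API_KEY") or os.environ.get("HF_TOKEN")
use_oracle = api_key is None  # no key -> oracle CoS path, per the table above

model_name = os.environ.get("MODEL_NAME")  # CoS LLM model id
tasks_env = os.environ.get("AUTODATALAB_PLUS_TASKS")
tasks = tasks_env.split(",") if tasks_env else [
    "hard_brief", "expert_brief", "crisis_brief",  # the three briefs
]
```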

### Docker

Build and run (port 7860 matches `openenv.yaml`):

```bash
docker build -t autodatalab-plus .
# First build can take several minutes: PyTorch + OpenEnv stack download ~1.5 GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```

Smoke test:

```bash
curl -s http://127.0.0.1:7860/health
```

The image includes a HEALTHCHECK on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are included in the image, so the trained CoS policy is available in `/visualize/run` when a checkpoint exists.

### Hugging Face Space (OpenEnv)

- Type: Docker, or use the `openenv.yaml` entry (`app: server.app:app`, `port: 7860`) as documented in the OpenEnv flow you use for the hackathon.
- Ensure the build context is this repo; do not commit secrets.
- After deploy: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the live URL (adjust `ROOT` or use an env var if your script supports it).

## Project layout (high level)

- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools

## Evidence map (what every artefact actually is)

We deliberately label every artefact so a strict reader can tell training from smoke-tests at a glance. Three distinct experiments live in this repo:

### Experiment A: MLP+REINFORCE training (this is the trained model in the headline)

- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`); see the sketch after this list.
- `training/reward_curves/before_after.json` – real before/after evaluation under the production env (safety net ON): mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878 across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the honest regime (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline table above (0.73). The two numbers (0.88 with safety net / ~0.73 without) are both real: they're the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model we put behind the MLP trained CoS policy in the live demo, and it is the headline trained-model number.
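A minimal REINFORCE loop matching the reported recipe (600 episodes, lr 0.003, 2-layer MLP). The observation size, action count, and stub env below are placeholders; the real training script is `training/scripts/train_cos_local.py`:

```python
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 16, 5  # placeholders: real dims live in train_cos_local.py

class StubEnv:
    """Stand-in for the CEO-brief env: random obs, fixed-length episodes."""
    def reset(self):
        self.t = 0
        return torch.randn(OBS_DIM)
    def step(self, action: int):
        self.t += 1
        reward = float(action == self.t % N_ACTIONS)  # dummy shaped reward
        return torch.randn(OBS_DIM), reward, self.t >= 6

env = StubEnv()
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
opt = torch.optim.Adam(policy.parameters(), lr=3e-3)  # lr 0.003, as reported

for episode in range(600):  # 600 episodes, as reported
    obs, done, log_probs, rewards = env.reset(), False, [], []
    while not done:
        dist = torch.distributions.Categorical(logits=policy(obs))
        action = dist.sample()
        obs, reward, done = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)
    loss = -sum(rewards) * torch.stack(log_probs).sum()  # vanilla REINFORCE, no baseline
    opt.zero_grad()
    loss.backward()
    opt.step()
```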

### Experiment B: Qwen2.5-1.5B SFT/DPO/GRPO routing

- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. These are short runs (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We do not claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.

### Experiment C: Headline benchmark, fallback OFF (the eval_mode story)

- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema: `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart referenced at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two untrained baselines (`base_naive`, `base_roundrobin`), one actually-trained model (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one upper bound (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do not label the oracle as a trained model anywhere in this README.

If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.

## Honest known gaps

We want to be calibrated. As of submission:

- GRPO runs are short. ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- Per-method evidence rollouts are deterministic (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- `eval_mode` is opt-in. Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- The `memory/` corpus is intentionally small (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost. A toy sketch of the retrieval path follows this list.
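A toy version of that BM25 retrieval, assuming the `rank_bm25` package (the repo's actual `memory/` wiring may differ):

```python
from rank_bm25 import BM25Okapi

# Tiny stand-in corpus; the real memory/ holds a few company SOPs/policies.
corpus = [
    "Finance SOP: review runway and burn before any CEO brief.",
    "HR policy: flag retention risk from exit-interview themes.",
    "Strategy memo: weigh expansion bets against current burn rate.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "runway before the brief".lower().split()
print(bm25.get_top_n(query, corpus, n=2))  # top passages the CoS can cite in the brief
```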

## License

Hackathon / team use per repository owner.