---
title: AutoDataLab Plus Plus
emoji: 🏒
colorFrom: indigo
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---

# AutoDataLab++

> **The first OpenEnv environment whose reward explicitly punishes the LLM for not knowing when to stop.**
> A Chief-of-Staff (CoS) policy must route work across four typed specialists (Data Analyst, Finance, Strategy, HR) and *submit* a complete CEO brief inside a step budget. Skip an expert and the grader penalises the brief; over-consult and the shaped reward penalises the trajectory. Both errors hurt, so the policy has to learn the *capability gap* between knowing facts and knowing when it has enough.

## Headline result (fallback disabled)

![Terminal reward, fallback disabled: base vs trained MLP CoS vs oracle upper bound, 3 hard tasks, 5 seeds](training/evidence/plots/headline_terminal_reward.png)

| Task | Base (naive) | **Trained MLP CoS** (REINFORCE, 600 ep) | Oracle router (upper bound) |
|---|---|---|---|
| `hard_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.90 |
| `expert_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |
| `crisis_brief` | 0.27 / 0.32 | **0.73 / 0.74** | 0.88 / 0.89 |

Numbers are **terminal grader scores in `[0,1]`**, mean over **5 seeds**, with the environment's expert-auto-fill **safety net OFF** (`eval_mode=true`). The two numbers per cell are the RAG-off / RAG-on settings. Raw runs are in `training/evidence/headline_benchmark.json`; reproduce with `python3 training/scripts/run_headline_benchmark.py`.

What this says, plainly:

- **Untrained baseline ~0.27.** A naive policy that consults the analyst and submits gets penalised, because it never routes to finance / strategy / HR and the grader sees an incomplete brief.
- **Our actually-trained MLP CoS ~0.73.** A 2-layer MLP routing policy trained with REINFORCE for 600 episodes (`training/checkpoints/cos_final.pt`) recovers the bulk of the headroom: **+0.46 absolute reward over the naive baseline**, on tasks it was not memorising. This is the headline number for "what we trained."
- **Oracle router ~0.88.** A handcoded canonical-order policy and an *upper bound*, not a trained model; it is published so judges can see how much routing headroom remains for a future GRPO/SFT LLM run.

The take-away is the **+0.46 trained-vs-baseline gap**, plus the **~0.15 oracle headroom** that future RL runs can chase. The base→oracle gap (~0.6) is the size of the routing problem; our trained MLP closes ~75 % of it.

## Live demo

- **Office UI (Hugging Face Space):** [https://uchihamadara1816-autodatalab2-0.hf.space/ui/](https://uchihamadara1816-autodatalab2-0.hf.space/ui/)
- Health endpoint: [https://uchihamadara1816-autodatalab2-0.hf.space/health](https://uchihamadara1816-autodatalab2-0.hf.space/health)
- Pick a task, toggle RAG, and run the four policies side by side. The naive baseline submits an incomplete brief; the **MLP trained CoS** and **oracle** rows finish with all four boxes lit.

## Quickstart (local)

```bash
pip install -e .
# or: uv sync && uv run server
python3 -m server.app
```

| Endpoint | URL |
| --- | --- |
| API root | [http://127.0.0.1:7860/](http://127.0.0.1:7860/) |
| Health | [http://127.0.0.1:7860/health](http://127.0.0.1:7860/health) |
| **Demo UI** | [http://127.0.0.1:7860/ui/](http://127.0.0.1:7860/ui/) |
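Once the server is up, the same endpoints can be exercised programmatically. Below is a minimal smoke-test sketch using `requests` (assumed installed); the `/reset` payload mirrors the curl example in the Honest-evaluation section further down, and no particular response schema is assumed beyond the server returning JSON.

```python
# Minimal programmatic smoke test against the locally running server.
# Assumes the /reset payload documented under "Honest evaluation" below;
# the response is printed as-is, without assuming a particular schema.
import json

import requests

BASE = "http://127.0.0.1:7860"

health = requests.get(f"{BASE}/health", timeout=10)
health.raise_for_status()
print("health:", health.text)

reset = requests.post(
    f"{BASE}/reset",
    json={"task": "hard_brief", "use_rag": True, "eval_mode": True},
    timeout=30,
)
reset.raise_for_status()
print(json.dumps(reset.json(), indent=2)[:500])  # peek at the start of the observation payload
```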
Pre-submission checks:

```bash
python3 validate_submission.py
openenv validate --verbose
```

Oracle rollout (3 tasks):

```bash
python3 inference.py --oracle
```

### Honest evaluation: turn off the safety net

The environment ships with a *production-mode* fallback that auto-completes any required expert the policy forgot, so end-users always see a full brief. For **evaluation** you want the opposite: the policy must succeed (or fail) on its own.

```bash
# CLI
python3 training/scripts/run_headline_benchmark.py   # 5 seeds, fallback OFF

# HTTP
curl -X POST http://127.0.0.1:7860/reset \
  -H 'content-type: application/json' \
  -d '{"task":"hard_brief","use_rag":true,"eval_mode":true}'
```

When `eval_mode=true`, the env runs with `auto_fill_required=False` and `shaping="strict"`. Submitting an incomplete brief is allowed, and the omission is reflected in the terminal grader score, which is exactly what produces the headline trained-vs-baseline gap shown above.

## Deployment

### Environment variables

| Variable | Purpose |
| -------- | ------- |
| `API_BASE_URL` | OpenAI-compatible endpoint (default: Hugging Face router) |
| `API_KEY` or `HF_TOKEN` | For the LLM CoS; if unset, `inference.py` falls back to the **oracle** |
| `MODEL_NAME` | Model id for the CoS LLM path |
| `AUTODATALAB_PLUS_TASKS` | Comma-separated task ids (default: all three briefs) |

Copy `.env.example` to `.env` and adjust. For Docker / Spaces, set secrets in the platform UI.

### Docker

Build and run (port **7860** matches `openenv.yaml`):

```bash
docker build -t autodatalab-plus .
# First build can take several minutes: the PyTorch + OpenEnv stack downloads ~1.5GB+ of wheels.
docker run --rm -p 7860:7860 autodatalab-plus
```

Smoke test:

```bash
curl -s http://127.0.0.1:7860/health
```

The image includes a **HEALTHCHECK** on `/health`. Training checkpoints under `training/checkpoints/` (if present in the build context) are copied into the image, so the **trained CoS** policy is available in `/visualize/run` whenever a checkpoint exists.

### Hugging Face Space (OpenEnv)

- Type: **Docker**, or use `openenv.yaml` (`app: server.app:app`, `port: 7860`) as documented in the [OpenEnv](https://huggingface.co/docs/hub/en/spaces) flow used for the hackathon.
- Ensure the **build context** is this repo; **do not** commit large secrets.
- After deploy: hit `/health`, open `/ui/`, and run `python3 validate_submission.py` against the **live URL** (adjust `ROOT`, or set it via an environment variable if the script supports it).

## Project layout (high level)

- `ceo_brief_env/` – Pydantic models, environment, graders, `tasks/`
- `inference.py` – oracle / baselines / LLM / trained CoS, `[START]`/`[STEP]`/`[END]` logs (see the sketch below)
- `server/app.py` – FastAPI; `/reset`, `/step`, `/state`, `/visualize/run`
- `training/scripts/` – re-runnable SFT/DPO/RL/Kaggle scripts and notebooks
- `training/evidence/` – small replayable evidence JSON plus loss/reward plots
- `training/checkpoints/` – small local MLP CoS artifacts, when present
- `subenvs/` – analyst + email/HR tools
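The `[START]`/`[STEP]`/`[END]` markers emitted by `inference.py` can be consumed mechanically. Here is a minimal sketch that runs the oracle rollout and tallies steps per episode; only the three marker strings are taken from this README, while the output stream (stdout vs stderr) and the exact line format are assumptions, so treat it as illustrative.

```python
# Run the oracle rollout and count [STEP] lines per [START]...[END] episode.
# Only the [START]/[STEP]/[END] marker strings are taken from this README;
# the exact line format and output stream are assumptions.
import subprocess

proc = subprocess.run(
    ["python3", "inference.py", "--oracle"],
    capture_output=True,
    text=True,
    check=True,
)

episode, steps = 0, 0
for line in (proc.stdout + proc.stderr).splitlines():
    if "[START]" in line:
        episode += 1
        steps = 0
    elif "[STEP]" in line:
        steps += 1
    elif "[END]" in line:
        print(f"episode {episode}: {steps} step(s)")
```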
## Evidence map (what every artefact actually is)

We deliberately label every artefact so a strict reader can tell training from smoke tests at a glance. Three distinct experiments live in this repo:

**Experiment A – MLP+REINFORCE training (this *is* the trained model in the headline)**

- `training/checkpoints/cos_final.pt` – 2-layer MLP routing policy, trained with REINFORCE for 600 episodes at lr 0.003 (`training/scripts/train_cos_local.py`).
- `training/reward_curves/before_after.json` – real before/after evaluation under the *production* env (safety net ON): `mean_terminal_before ≈ 0.405 → mean_terminal_after ≈ 0.878` across all 6 tasks. This is the training-curve story, not the headline regime.
- `training/reward_curves/reward_curve.png`, `training/evidence/plots/loss_curve.png` – per-episode reward and training loss for that REINFORCE run.
- The same checkpoint, evaluated under the **honest regime** (safety net OFF, 5 seeds, 3 hard tasks), produces the `trained_mlp` row in the headline plot above (~0.73). The two numbers (~0.88 with safety net / ~0.73 without) are both real: they are the same model, evaluated under the production env vs. the honest-eval env.
- This is the trained model behind the `MLP trained CoS` policy in the live demo, and it is the headline trained-model number.

**Experiment B – Qwen2.5-1.5B SFT/DPO/GRPO routing**

- `training/scripts/kaggle_rl_1p5b_methods.py`, `kaggle_run_all_1p5b_experiments.py` – re-runnable RL scripts on Kaggle.
- `training/evidence/sft/`, `dpo/`, `sft_dpo/`, `grpo_rlvr/` – per-method `evidence.json` rollouts on the 3 hard tasks × RAG on/off. **These are short runs** (≤70 GRPO steps, capped by free-tier compute) and we treat them as smoke tests of the training loop, not as a converged GRPO policy. The `train_metrics.json` in each folder is the raw metric stream from those runs.
- `training/evidence/plots/rl_*.png`, `policy_rewards_by_method.png`, `terminal_scores_by_method.png` – plots derived from those short runs.
- We **do not** claim a fully converged GRPO policy from these. The Qwen path is a working pipeline; longer runs are future work.

**Experiment C – Headline benchmark, fallback OFF (the `eval_mode` story)**

- `training/scripts/run_headline_benchmark.py` – 3 hard tasks × 4 policies × 2 RAG settings × 5 seeds, run with `auto_fill_required=False` and `shaping="strict"`.
- `training/evidence/headline_benchmark.json` – raw cell-level numbers (mean / std / per-seed runs). Schema `autodatalab-plus.headline_benchmark.v2`.
- `training/evidence/plots/headline_terminal_reward.png` (the bar chart at the top of this README) – reproduced from that JSON.
- The four bars are clearly labelled: two **untrained baselines** (`base_naive`, `base_roundrobin`), one **actually-trained model** (`trained_mlp`, the REINFORCE-trained MLP CoS at `training/checkpoints/cos_final.pt`), and one **upper bound** (`oracle_router`, the handcoded canonical-order policy that our SFT/GRPO LLM trajectories imitate when they succeed). We do **not** label the oracle as a trained model anywhere in this README.

If a judge wants the LLM-driven version of Experiment C, point any LoRA adapter at `inference.py` (or load it inside `kaggle_three_llms_text.py`) and re-run with `eval_mode=true`; the same script gives a trained-LLM number that fits between `trained_mlp` and `oracle_router`.
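To sanity-check these artefacts without re-running anything, here is a minimal inspection sketch. It assumes only that the checkpoint opens with `torch.load` and that the benchmark file is plain JSON; it prints whatever structure is present rather than relying on internal field names.

```python
# Peek at the Experiment A checkpoint and the Experiment C benchmark file.
# No internal schema is assumed; we just print whatever structure is present.
import json

import torch

# Experiment A: REINFORCE-trained MLP CoS checkpoint.
# On newer torch, pass weights_only=False if the file is not a plain state_dict.
ckpt = torch.load("training/checkpoints/cos_final.pt", map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
        print(f"{key}: {shape}")
else:
    print(type(ckpt))

# Experiment C: raw headline numbers (schema autodatalab-plus.headline_benchmark.v2).
with open("training/evidence/headline_benchmark.json") as f:
    bench = json.load(f)
print(json.dumps(bench, indent=2)[:800])  # first part only; the full file holds per-seed runs
```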
## Honest known gaps

We want to be calibrated. As of submission:

- **GRPO runs are short.** ~70 update steps on a 4-action routing problem is enough to see the loop work end-to-end, not enough to claim a converged RL policy. The shaped-reward signal becomes meaningful at longer horizons; we only had compute for the smoke run.
- **Per-method evidence rollouts are deterministic** (decoded with low temperature for reproducibility). They demonstrate the routing pattern under each method, not the variance you'd get from a stochastic policy. The headline benchmark uses 5 seeds and reports std bars precisely to put a real spread on the trained-vs-base comparison.
- **`eval_mode` is opt-in.** Production `/reset` defaults to `eval_mode=false` (auto-fill on) so end-users always see a complete brief in the demo UI; the policy-comparison endpoint `/visualize/run` and the benchmark script default to `eval_mode=true` so terminal scores reflect the policy's own routing competence.
- **`memory/` corpus is intentionally small** (BM25 over a few company SOPs/policies). RAG gives a steady ~+0.01 terminal reward across all 3 hard tasks (visible in the headline JSON); we report it as a small, citable lift, not a magic boost.

## License

Hackathon / team use per repository owner.