bpHigh's picture
Add blog and final data
18d1157

Building a cross-format office-document RL environment

What this is

bpHigh/financial-task-env is an OpenEnv environment for training agents to edit office documents with code. It started as a the online submission with 10 hand-picked Finch spreadsheet tasks. For Round 2/ Offline hack phase it expanded to 119 tasks across three file formats , Excel (xlsx), Word (docx), PowerPoint (pptx) and with a multi-layer reward function designed to be hard to hack and a four-defense anti-exploit stack hardened against agents that try to game it.

The environment is live as a Hugging Face Space at https://bphigh-financial-task-env.hf.space. The TRL/OpenEnv-compatible client lives in client.py; the dashboard shows a leaderboard, training plots, anti-hacking design, and step-by-step replays of the teacher solving tasks. Everything in the repo is reproducible.

The design that still holds up

The task split

Family Source Train Eval Total What it tests
xlsx Finch (FinWorkBench) - hand-curated (Round 1) 10 0 10 Diverse Financial workbook tasks for the original submission
xlsx Finch - stratified pull (Round 2) 40 10 50 Stratified across 7 task-type tags
docx OSWorld-Verified libreoffice_writer 17 4 21 16 distinct evaluator functions ported from OSWorld
pptx PPTArena 30 8 38 16 distinct edit_types, including singletons (transitions, animations, A/V)
Total 97 22 119

The eval split is stratified so that at least one task per tag bucket per family so the environment doesn't get gamed by overfitting to one task type.

The 6-component reward

Each step the agent runs code and gets back stdout/stderr plus an inline reward decomposition. Six components, all clamped, totalling 0.10/step max:

Component Range Fires when
exec_health 0-0.020 Subprocess exit code 0; bonus if stdout non-empty
lib_engagement 0-0.015 Code uses the family's expected library (openpyxl / python-docx / python-pptx)
mutation 0-0.025 Working file's SHA-256 changed since last step
validity 0-0.015 Mutated file still parses with its loader
progress 0-0.015 Structural distance to the gold reference decreased
eval_check 0-0.010 Per-task evaluator score went up (docx-only)

The final reward at submit is not the sum of step rewards, it's the grader's score on the submitted file (or text answer). The step rewards exist to give shaped signal for mid-episode learning. The clamps matter: an agent can't farm tiny rewards for 1000 steps and beat someone who actually solves the task.

Anti-hacking - 4 independent defenses

This part I'm proud of. After watching teacher trajectories, we found the model had figured out it could "submit submit_file with the source file path on step 1" and getting some reward because the source parses, has the right size, etc. and without making any edits. So we built four overlapping defenses, any of which independently kills that exploit:

  1. Phase 4 - Per-episode gold stash: at task reset, gold files are moved to a uuid4-randomized tmp directory. An agent globbing data/ finds nothing.
  2. Phase 7 - Grader byte-equality check: if the submitted file is byte-equal to the source, the grader returns the floor reward.
  3. Phase 9 - Env early-submit gate: the env refuses submit / submit_file until at least one code step has run. Configured via FINANCIAL_ENV_MIN_CODE_STEPS=1 (default).
  4. Phase 8 - SFT corpus filter: any teacher trajectory that did submit_file on step 1 was dropped before SFT, so the student never learns the cheat from the teacher.

Each layer covers a slightly different attack surface.


SFT training (cheap, converged, unhelpful)

Two SFT runs on 1× L40S 48GB at $1.80/hr. Same recipe both times: LoRA r=32 on all-linear targets, bf16, assistant-only loss masking, TRL SFTTrainer. The only difference: 4K vs 8K context length.

4K context 8K context
Hardware L40S 48GB L40S 48GB
Runtime 198s 354s
Loss start → end 0.412 → 0.069 0.384 → 0.103
Final train_loss 0.196 0.193
Cost ~$0.50 ~$0.80

Loss converges nicely. The 8K run sees the long debugging trajectories from Kimi (5-8 of 53 episodes get truncated at 4K), so it's the more interesting comparison point.

image

Then we evaluated

Model Avg score Success rate xlsx (n=10) docx (n=4) pptx (n=8)
MiniMaxAI/MiniMax-M2.1 (frontier baseline) 0.390 41% 0.293 0.445 0.485
moonshotai/Kimi-K2.5 (teacher) 0.481 52% 0.370 0.472 0.673
Qwen/Qwen2.5-Coder-3B-Instruct (vanilla student) 0.001 0% 0.001 0.001 0.001
Qwen2.5-Coder-3B + LoRA SFT (4K) 0.006 0% 0.007 0.005 0.005
Qwen2.5-Coder-3B + LoRA SFT (8K) 0.011 0% 0.018 0.005 0.005

The SFT lift is real - 6×-11× over vanilla - but every episode still bottoms out at the env's 0.005 reward floor. The model produces parseable code, but it doesn't mutate files in ways the grader rewards. SFT loss is well-converged on the training distribution; the gap is generalization, not under-training. We trained on 53 trajectories; the eval has 22 tasks all OOD; that's a brutal regime for a 3B model with LoRA.


Pointers

If you're a judge reading this and want to see the env do something in 30 seconds, open the dashboard's "Replay" widget and watch the docx task, it's the shortest end-to-end demonstration of the 6-component reward in motion.

image

image

image

The headers in edits.md - Phases 1 through 13 are the long version of this blog.


GRPO - what didn't work, what we tried, where we are now

This section is the truth: GRPO has not had a clean training run yet. Three failure modes deep, still iterating. I'm leaving it here because the failure modes are interesting and pretending we cleared them in one shot would be dishonest.

Setup

  • Hardware: A100 40GB on Modal Notebooks ($2.50/hr).
  • Stack: TRL GRPOTrainer + vLLM colocate + PEFT (continuing the SFT'd LoRA).
  • Multi-turn: 12 turns per episode, agent uses three tools (run_python_code, submit_file, submit_text_answer).
  • Env: same Space as eval, but with FINANCIAL_ENV_GOLD_STASH=copy (concurrent rollouts can't race on the gold-file rename) and SUPPORTS_CONCURRENT_SESSIONS=True on the FinancialEnvironment class.
  • Tracking: Trackio Space at bpHigh/trackio-office-grpo.

Failure 1 - Reward stuck at 0 ("the format mismatch")

First run started with environment_factory=OfficeDocumentEnv (TRL's managed multi-turn rollout). Trackio showed reward = 0.0 across every step. Captured a sample completion:

```json
{"name": "run_python_code", "arguments": {"code": "..."}}
```

The SFT'd model emits markdown JSON blocks (because the SFT teacher Kimi-K2.5 emits markdown JSON), but TRL's tool-call parser only accepts <tool_call>...</tool_call> XML (it has hard-coded schemas for qwen3, qwen3.5, llama3, glm4, gptoss - Qwen2.5-Coder isn't on that list). Parser found 0 tool calls, env never executed code, reward stayed 0, advantage was 0, gradients were 0. ~5 min of A100 time burned learning nothing.

To rule out "SFT broke it", I tried the same setup with the base model + fresh LoRA (no SFT). Same failure. So the issue isn't SFT-induced - Qwen2.5-Coder's chat template just defaults to markdown JSON when tools are bound, regardless of SFT.

Attempt at fix #1 - Custom rollout_func

TRL has two rollout paths: environment_factory (managed, XML-only) and rollout_func (you control everything). I wrote ~150 LOC of custom rollout that:

  • Spawned an OfficeDocumentEnv per generation
  • Called vLLM directly per turn
  • Parsed completions for ```json ... ``` blocks (with fallbacks for ```python ... ``` and Kimi's <|tool_call_begin|> markers)
  • Dispatched to env tools, tokenized feedback as a chat-template diff, appended with env_mask=0 so loss only flowed through model tokens

It worked. First batch printed:

[rollout_func] Done: 16/16 rollouts got non-zero reward

The format problem was solved. Then…

Failure 2 - OOM during gradient computation

OutOfMemoryError: CUDA out of memory. Tried to allocate 12.73 GiB.
GPU 0 has 39.49 GiB total, 8.80 GiB free, 30.69 GiB in use.

selective_log_softmax trying to compute logits over 16 rollouts × ~10-20K tokens each. The completions had grown enormous because each rollout was 4-8 turns × max_completion_length=4096 per turn. Add vLLM colocate's ~10GB footprint and there's no headroom.

Fix: lower max_completion_length (4096 → 1500), drop gradient_accumulation_steps (8 → 4), enable vllm_enable_sleep_mode=True, lower vllm_gpu_memory_utilization (0.3 → 0.22).

Failure 3 - CUDA index-out-of-bounds (twice)

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:111: operator():
block: [0,0,0], thread: [2,0,0]
Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.

Async-reported, so the stack trace pointed at a benign downstream op (shuffle_sequence_dict's v[permutation]). The real cause was earlier in the gradient pass - likely a token ID outside the model's embedding range slipping into completion_ids.

Hypothesis: vllm_enable_sleep_mode=True + LoRA is buggy. vLLM wakes from sleep by reloading weights from disk - but the on-disk weights are the base model, not the LoRA-adapted one. So vLLM generates from base while the trainer expects LoRA-adapted distribution; some edge token mismatches and torch.gather blows up.

Disabled sleep mode, also disabled vllm_importance_sampling_correction (which is what calls selective_log_softmax). Still failed - same async assert, this time deeper in _prepare_inputs. CUDA context becomes poisoned after a device-side assert; even subsequent torch.cuda.manual_seed_all() raises the same error. Forced kernel restart, lose the loaded model.

Where we are now

Three full GRPO runs in, no successful gradient step yet, ~30 min of A100 time burned across debug iterations. Two paths forward:

Path A - Revert to environment_factory, force XML via prompt engineering. Restore the original setup but rewrite the SYSTEM_PROMPT with explicit "DO NOT use json or python" warnings and three concrete <tool_call> few-shot examples. The SFT-bias toward markdown will fight the prompt, but maybe 30-50% compliance is enough to start producing gradients. Cheapest to try.

Path B - Custom rollout_func with CUDA_LAUNCH_BLOCKING=1. Re-run the rollout_func setup with synchronous CUDA so the next assert points at the actual line. Probably reveals an OOV-token issue we can clip defensively in the rollout_func itself.

I'm currently on Path A - already reverted the script, prompt now hard-pushes XML format, env_factory back in place. Restart kernel, push the revert, see what happens.

If Path A also fails, the fallback is to skip Qwen2.5-Coder entirely and use a smaller Qwen3 model (which TRL's parser knows natively) for the GRPO comparison. Less interesting story but faster path to a working number.


What I'd do differently

  • Pick a model TRL natively supports for the tool-calling format. Qwen3 (any size) instead of Qwen2.5-Coder. The format mismatch ate hours.
  • Rule out vLLM sleep_mode + LoRA from the start. This combo is fragile. Either disable sleep mode and pay the memory cost, or pick a smaller model that fits comfortably without sleeping.
  • Test the SFT model on the env before spending eval budget. 53 trajectories was always going to be a tight regime; an early in-process smoke test with eval_lora.py on 4-5 tasks would have set expectations correctly.
  • Trackio rename, not delete, for failed runs. The office-doc-grpo project on Trackio has a 0-reward run that's evidence of the format-mismatch issue, not waste. It stays as -attempt1.

What still needs to happen

  • A successful GRPO run end-to-end (any path) with non-zero reward curve in Trackio.
  • Eval the GRPO'd model on the 22-task held-out split. Fill in the leaderboard's "GRPO" row.
  • Fair comparison: GRPO from base + fresh LoRA vs GRPO from SFT'd adapter. Both runs need to use the same prompt format and rollout path so the comparison is real.
  • Tighten the dashboard - add a Trackio embed in the dashboard once GRPO is producing curves worth showing.