Humanlearning committed on
Commit
ffc6ade
·
verified ·
1 Parent(s): e3d939d

Add README Hugging Face and blog links

  1. README.md +289 -13
README.md CHANGED
@@ -15,13 +15,15 @@ tags:
15
 
16
  # CyberSecurity_OWASP
17
 
 
 
18
  `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
19
 
20
  ```text
21
- inspect generated app + policy -> discover authorization bug -> submit finding -> patch code -> preserve intended behavior
22
  ```
23
 
24
- The current implementation includes a functional MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, visible tests, hidden deterministic verifier checks, anti-cheat safeguards, and decomposed reward.
25
 
26
  ## Diagrams
27
 
@@ -36,6 +38,7 @@ Editable Mermaid sources are available in `assets/architecture_diagram.mmd` and
36
  ```bash
37
  uv sync --extra dev
38
  uv run --extra dev pytest
 
39
  uv run server --port 8000
40
  ```
41
 
@@ -68,7 +71,7 @@ Supported tools:
68
  - `search_code`
69
  - `send_local_request`
70
  - `compare_identities`
71
- - `submit_finding`
72
  - `patch_file`
73
  - `run_visible_tests`
74
  - `submit_fix`
@@ -76,7 +79,7 @@ Supported tools:
76
 
77
  Tools are phase-gated:
78
 
79
- - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit finding.
80
  - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
81
  - `done`: stable terminal observation only.
82
 
@@ -94,31 +97,76 @@ Terminal reward uses stable components:
94
  "visible_tests": 0.0,
95
  "safety": 0.0,
96
  "anti_cheat": 0.0,
97
  "total": 0.0,
98
  }
99
  ```
100
 
101
- The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. It penalizes deny-all fixes, hardcoded IDs, hidden file probes, external URL attempts, and test/fixture tampering.
102
 
103
- ## Scenario Generation
104
 
105
- `reset(seed)` compiles a fresh isolated workspace under a temp directory. The MVP compiler generates:
106
 
107
  - invoices domain policy graph;
 
108
  - randomized users, tenants, invoices, and IDs;
109
  - generated app files under `app/`;
110
  - visible tests under `tests/test_visible.py`;
111
- - hidden facts kept only in state for deterministic verification.
112
 
113
  Additional domains and bug families are scaffolded for extension.
114
 
115
  ## Testing
116
 
117
  ```bash
118
  uv run --extra dev pytest
119
  ```
120
 
121
- The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, and scripted rollout policies.
122
 
123
  ## Training Scaffold
124
 
@@ -133,6 +181,20 @@ Training files are under `training/`:
133
 
134
  The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
135
 
136
  ## Trackio Run Tracking
137
 
138
  Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
@@ -151,6 +213,20 @@ uv run python scripts/track_pytest.py tests
151
 
152
  Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
153
 
154
  ## Modal Ephemeral Runs
155
 
156
  Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
@@ -164,6 +240,7 @@ uv sync --extra modal
164
  Run a temporary Modal app for a cheap environment/training smoke check:
165
 
166
  ```bash
 
167
  uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
168
  ```
169
 
@@ -181,6 +258,110 @@ The shell wrapper is equivalent:
181
  MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
182
  ```
183
 
184
  ## Modal GRPO Training
185
 
186
  The persistent GPU training launcher packages this local repo into Modal, trains
@@ -198,13 +379,107 @@ uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
198
  Run the default smoke GRPO job:
199
 
200
  ```bash
 
201
  uv run --extra modal modal run scripts/modal_train_grpo.py \
202
  --max-steps 10 \
203
  --dataset-size 16 \
204
- --num-generations 2 \
205
  --difficulty 0
206
  ```
207
 
208
  If running from a public repository and you do not want Modal to package the
209
  local workspace, use public source mode:
210
 
@@ -215,7 +490,7 @@ uv run --extra modal modal run scripts/modal_train_grpo.py \
215
  --repo-branch master \
216
  --max-steps 10 \
217
  --dataset-size 16 \
218
- --num-generations 2 \
219
  --difficulty 0
220
  ```
221
 
@@ -223,10 +498,11 @@ Defaults are derived from `HF_TOKEN`:
223
 
224
  - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
225
  - Trackio project: `CyberSecurity_OWASP-grpo`
226
- - Output repo: `<hf-user>/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora`
 
227
 
228
  Override these with `--trackio-space-id`, `--trackio-project`, and
229
- `--output-repo-id` when needed.
230
 
231
  ## Docker / Spaces
232
 
 
15
 
16
  # CyberSecurity_OWASP
17
 
18
+ [Hugging Face Space](https://huggingface.co/spaces/Humanlearning/CyberSecurity_OWASP) | [Mini-blog](blog/blog.md)
19
+
20
  `CyberSecurity_OWASP` is an OpenEnv-compliant reinforcement-learning environment for a single LLM agent that performs a defensive authorization-repair workflow:
21
 
22
  ```text
23
+ inspect generated app + policy -> discover authorization bug -> submit diagnosis -> patch code -> preserve intended behavior
24
  ```
25
 
26
+ The current implementation includes a functional closed-loop MVP scenario: an invoices FastAPI-style app with one injected OWASP A01 BOLA/IDOR defect, config-driven curriculum settings, cache-backed scenario reset, an ephemeral app sandbox, multi-layer deterministic verifier checks, anti-cheat safeguards, JSONL episode artifacts, and decomposed reward.
27
 
28
  ## Diagrams
29
 
 
38
  ```bash
39
  uv sync --extra dev
40
  uv run --extra dev pytest
41
+ uv run python scripts/generate_scenario_cache.py --train-per-bucket 3 --validation-per-bucket 3 --heldout-per-bucket 3
42
  uv run server --port 8000
43
  ```
44
 
 
71
  - `search_code`
72
  - `send_local_request`
73
  - `compare_identities`
74
+ - `submit_diagnosis`
75
  - `patch_file`
76
  - `run_visible_tests`
77
  - `submit_fix`
 
79
 
80
  Tools are phase-gated:
81
 
82
+ - `discover`: inspect policy/routes/files, run safe local requests, compare identities, submit diagnosis.
83
  - `patch`: read/search, patch editable app files, run visible tests, submit final fix.
84
  - `done`: stable terminal observation only.
85
 
 
97
  "visible_tests": 0.0,
98
  "safety": 0.0,
99
  "anti_cheat": 0.0,
100
+ "terminal_total": 0.0,
101
+ "progressive": 0.0,
102
+ "step_penalty": 0.0,
103
+ "speed_bonus": 0.0,
104
+ "token_penalty": 0.0,
105
+ "behavior_penalty": 0.0,
106
+ "train_total": 0.0,
107
  "total": 0.0,
108
  }
109
  ```
110
 
111
+ The verifier rewards blocking the hidden exploit while preserving legitimate owner/admin behavior and intentionally public routes. Terminal scoring requires visible checks, hidden authorization checks, a policy-oracle matrix, regression checks, public-route preservation, and patch-quality checks. It penalizes deny-all fixes, hardcoded IDs, repeated/invalid action patterns, hidden file probes, external URL attempts, and test/fixture tampering.
112
+
113
+ Training can enable dense rewards with `CYBERSECURITY_OWASP_REWARD_MODE=dense_train`.
114
+ Dense mode adds configurable progressive rewards, small efficiency penalties, and capped behavior penalties from `training/configs/grpo_small.yaml`; evaluation defaults to sparse terminal scoring.
115
+
116
+ ## Scenario Cache And Generation
117
+
118
+ Scenario generation is an offline/cache-prep concern. `reset(seed)` asks the `CurriculumController` for a difficulty tier and target weakness, then loads a validated executable bundle from the scenario cache when `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`. Local development defaults to `fallback`, which compiles deterministically on a cache miss.
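The difference between the two cache modes named above can be sketched as follows. This is a minimal illustration, not the repo's code: `resolve_cache_mode`, `load_scenario`, the dict-based cache, and `compile_fn` are all hypothetical stand-ins, and only the environment variable name and the `require`/`fallback` values come from the README.

```python
import os


def resolve_cache_mode(env=None):
    """Read the cache mode; 'fallback' is the local-development default per the README."""
    env = os.environ if env is None else env
    return env.get("CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE", "fallback").strip().lower()


def load_scenario(cache, key, mode, compile_fn):
    """Return a cached bundle, compiling on a miss only when the mode allows it.

    `cache` is a plain dict here (hypothetical); the real cache stores bundles on disk.
    """
    bundle = cache.get(key)
    if bundle is not None:
        return bundle
    if mode == "require":
        # In require mode a miss is an error; nothing is compiled at reset time.
        raise LookupError(f"scenario cache miss for {key!r} in require mode")
    # In fallback mode, compile deterministically from the key.
    return compile_fn(key)
```

Under these assumptions, a training run with `require` fails fast on a miss, while a local run with `fallback` still gets a scenario.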
119
+
120
+ The scenario/curriculum author is config-driven through `configs/scenario_authoring.small.json`. The default offline author model is `deepseek-ai/DeepSeek-V4-Pro` with Hugging Face provider settings, thinking mode enabled, `temperature=1.0`, and `top_p=1.0`. This model config is for scenario authoring, not the RL policy model.
121
 
122
+ The cache bundle contract is:
123
 
124
+ - `scenario.json`
125
+ - `app_source/`
126
+ - `policy_graph.json`
127
+ - `visible_tests.py`
128
+ - `hidden_tests.py`
129
+ - `oracle_tests.py`
130
+ - `expected_exploit_trace.json`
131
+ - `reward_config.json`
132
+ - `metadata.json`
133
+
134
+ Cache keys include difficulty, authorization bug type, app family, framework, policy shape, tenant model, exploit depth, patch scope, regression risk, generator version, verifier version, and scenario hash.
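One way to turn those attributes into a stable key is to hash a canonical JSON encoding of them. The sketch below is an assumption about how such a key could be built, not the scheme used by `server/scenario_cache.py`; the field names in `KEY_FIELDS` are hypothetical spellings of the attributes listed above.

```python
import hashlib
import json

# Hypothetical field names mirroring the attributes listed in the README.
KEY_FIELDS = (
    "difficulty", "bug_type", "app_family", "framework", "policy_shape",
    "tenant_model", "exploit_depth", "patch_scope", "regression_risk",
    "generator_version", "verifier_version", "scenario_hash",
)


def cache_key(attrs):
    """Return a SHA-256 hex digest over the key fields in a canonical order."""
    missing = [name for name in KEY_FIELDS if name not in attrs]
    if missing:
        raise KeyError(f"missing cache key fields: {missing}")
    # sort_keys and fixed separators make the encoding independent of dict order.
    payload = json.dumps(
        {name: attrs[name] for name in KEY_FIELDS},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the generator and verifier versions in the hashed payload means that bumping either one yields a different key, so stale bundles are not reused.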
135
+
136
+ The MVP compiler currently generates:
137
 
138
  - invoices domain policy graph;
139
+ - bounded adversarial target metadata such as same-role cross-object access, cross-tenant access, public-route overlocking traps, alternate route/service reachability, or visible-test-only edge cases;
140
  - randomized users, tenants, invoices, and IDs;
141
  - generated app files under `app/`;
142
  - visible tests under `tests/test_visible.py`;
143
+ - hidden facts, oracle tuples, scenario family metadata, and verifier targets kept out of observations.
144
 
145
  Additional domains and bug families are scaffolded for extension.
146
 
147
+ ## Runtime Components
148
+
149
+ The OpenEnv runtime is split into small server modules:
150
+
151
+ - `server/curriculum.py` tracks mastery, weak spots, reward trend, and difficulty tier.
152
+ - `server/scenario_cache.py` writes and loads validated executable scenario bundles.
153
+ - `server/adversarial_designer.py` chooses safe synthetic scenario targets from tracked weaknesses.
154
+ - `server/scenario_factory.py` compiles the generated app during cache prep or local fallback.
155
+ - `server/app_sandbox.py` handles editable workspace reads, patches, local requests, and OpenAPI summaries.
156
+ - `server/action_tools.py` dispatches typed tools through the sandbox.
157
+ - `server/authz_oracle.py` builds the hidden allowed/denied user-resource-action matrix.
158
+ - `server/verifier.py` aggregates visible tests, hidden tests, oracle matrix, regression/public-route checks, and patch quality.
159
+ - `server/episode_logger.py` appends JSONL rollouts under `outputs/rollouts/`.
160
+
161
+ The agent sees partial observations only: product rules, fixture aliases, route summaries, visible test results, and action errors. Hidden tests, oracle tuples, injected bug labels, and held-out scenario-family labels stay internal.
162
+
163
  ## Testing
164
 
165
  ```bash
166
  uv run --extra dev pytest
167
  ```
168
 
169
+ The suite covers model serialization, reset/step/state behavior, seed reproducibility, invalid actions, reward outcomes, anti-cheat checks, scripted rollout policies, curriculum selection, adversarial targeting, held-out scenario families, oracle checks, verifier aggregation, and episode artifact logging.
170
 
171
  ## Training Scaffold
172
 
 
181
 
182
  The training scaffold is intentionally minimal until the environment/verifier behavior is stable. Trackio metric names and GRPO defaults follow the project brief.
183
 
184
+ `training/train_grpo.py` in this repo is a config helper only; it does not execute training locally.
185
+ Use the Modal launchers in `scripts/modal_train_grpo.py` (persistent) and
186
+ `scripts/modal_ephemeral_train.py` (smoke) for real GRPO runs.
187
+
188
+ Modal smoke and GRPO runs use `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require` and mount the persistent `CyberSecurity_OWASP-scenario-cache` volume. Prepare that cache before smoke/training:
189
+
190
+ ```bash
191
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
192
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
193
+ ```
194
+
195
+ If the cache slice is missing or below the configured per-bucket minimum, Modal training fails before rollouts rather than compiling scenarios during the run.
196
+ The persistent GRPO launcher runs a CPU-only scenario-cache preflight before it starts the L4 GPU function, so missing cache coverage fails before GPU allocation.
197
+
198
  ## Trackio Run Tracking
199
 
200
  Trackio is the default tracker for official runs. Set `TRACKIO_SPACE_ID` to log to a hosted Hugging Face Trackio Space; otherwise Trackio records locally.
 
213
 
214
  Evaluation summaries saved through `training.eval_before_after.save_eval_summary(...)`, Modal smoke runs, and GRPO training configs all initialize Trackio runs with CyberSecurity_OWASP run names.
215
 
216
+ Training, baseline, and smoke runs also log the effective reward config at step
217
+ 0. In Trackio, open **Media & Tables** and select the `reward_config` table to
218
+ see the actual values for each reward key, including stage-specific values,
219
+ caps, thresholds, terminate flags, and descriptions. Scalar metrics under
220
+ `reward_config/<key>/<field>` expose the same numeric values for plotting and
221
+ filtering, for example `reward_config/policy_inspected/value` and
222
+ `reward_config/shaping_weight/resolved`.
223
+
224
+ Each run config includes `reward_config_id`, `reward_config_hash`,
225
+ `reward_config_source`, `reward_mode`, and `reward_stage`. For manual ablations,
226
+ compare runs with the same scenario/model settings and different
227
+ `reward_config_hash` values to see which reward weights produced each training
228
+ curve.
229
+
230
  ## Modal Ephemeral Runs
231
 
232
  Modal Labs support is kept in a separate launcher script so the local OpenEnv server and core training scaffold stay unchanged.
 
240
  Run a temporary Modal app for a cheap environment/training smoke check:
241
 
242
  ```bash
243
+ uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode prepare-cache
244
  uv run --extra modal modal run scripts/modal_ephemeral_train.py --mode smoke --episodes 4
245
  ```
246
 
 
258
  MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
259
  ```
260
 
261
+ ## Synthetic SFT Before GRPO
262
+
263
+ Use supervised fine-tuning to warm-start `unsloth/gemma-4-E2B-it` before GRPO.
264
+ The SFT generator executes every teacher action in the real environment and
265
+ keeps only trajectories that pass the deterministic reward verifier.
266
+
267
+ Generate a 300-train-episode curriculum SFT dataset across levels `0,1,2,3`:
268
+
269
+ ```bash
270
+ uv run python scripts/generate_sft_dataset.py \
271
+ --teacher-model deepseek-ai/DeepSeek-V4-Pro \
272
+ --target-model unsloth/gemma-4-E2B-it \
273
+ --difficulty-levels 0,1,2,3 \
274
+ --difficulty-buckets 4 \
275
+ --episodes 75 \
276
+ --validation-episodes 20 \
277
+ --workers 8 \
278
+ --out-dir outputs/sft
279
+ ```
280
+
281
+ `--episodes` is per difficulty level when `--difficulty-levels` is set, so
282
+ `--episodes 75` across four levels gives 300 total train episodes. Expect
283
+ roughly 2,400-4,500 chat-format JSONL rows because each successful trajectory
284
+ contributes one row per action step. The script writes JSONL rows under
285
+ `outputs/sft/`, trajectory artifacts under `outputs/sft/trajectories/`, a
286
+ dataset card at `outputs/sft/README.md`, and `outputs/sft/manifest.json` with
287
+ reward summaries and curriculum coverage.
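The episode and row counts above can be checked with a few lines of arithmetic. The steps-per-episode range is only implied by the README's numbers and assumes every requested episode passes the verifier and is kept.

```python
# Values taken from the generation command and the paragraph above.
difficulty_levels = [0, 1, 2, 3]
episodes_per_level = 75
row_estimate = (2_400, 4_500)

# --episodes is per difficulty level, so the total is levels * episodes.
total_episodes = len(difficulty_levels) * episodes_per_level

# Assumption: all episodes are kept, one JSONL row per action step.
rows_per_episode = tuple(rows / total_episodes for rows in row_estimate)

print(total_episodes)    # 300
print(rows_per_episode)  # (8.0, 15.0)
```

So the quoted row range corresponds to roughly 8 to 15 action steps per successful trajectory.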
288
+
289
+ Verify reward metadata before any training run:
290
+
291
+ ```bash
292
+ uv run python scripts/generate_sft_dataset.py \
293
+ --verify-only \
294
+ --difficulty-levels 0,1,2,3 \
295
+ --out-dir outputs/sft
296
+ ```
297
+
298
+ Push the verified dataset to Hugging Face Hub:
299
+
300
+ ```bash
301
+ uv run python scripts/generate_sft_dataset.py \
302
+ --push-only \
303
+ --difficulty-levels 0,1,2,3 \
304
+ --out-dir outputs/sft \
305
+ --dataset-repo-id Humanlearning/CyberSecurity_OWASP-sft-dataset
306
+ ```
307
+
308
+ The canonical dataset repo name is
309
+ `Humanlearning/CyberSecurity_OWASP-sft-dataset`. The upload is refused if
310
+ reward verification fails or `HF_TOKEN` is missing.
311
+
312
+ You can also generate and push in one command by adding `--push-to-hub` to the
313
+ generation command.
314
+
315
+ For local CI or smoke checks, add `--dry-run-oracle`; official SFT data should
316
+ use the teacher path and still pass the verifier gate above.
317
+
318
+ Launch SFT on Modal after reward verification passes:
319
+
320
+ ```bash
321
+ uv run --extra modal modal run --detach scripts/modal_train_sft.py \
322
+ --local-train-path outputs/sft/train.jsonl \
323
+ --local-validation-path outputs/sft/validation.jsonl \
324
+ --local-manifest-path outputs/sft/manifest.json \
325
+ --required-difficulties 0,1,2,3 \
326
+ --trackio-space-id Humanlearning/CyberSecurity_OWASP-trackio \
327
+ --trackio-project CyberSecurity_OWASP-sft \
328
+ --output-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
329
+ --push-to-hub \
330
+ --detach
331
+ ```
332
+
333
+ `scripts/modal_train_sft.py` re-checks the JSONL reward metadata locally before
334
+ upload and again inside Modal before loading the model. It refuses to start SFT
335
+ unless all required curriculum difficulties are represented and the verifier
336
+ reward metadata passes. The default SFT config trains the full dataset
337
+ (`--max-steps -1`) with bf16/tf32, LoRA rank 32, and Modal GPU fallback
338
+ `H200 -> H100 -> A100-80GB -> L40S`. TRL does not support packing or
339
+ assistant-only loss for the Gemma 4 vision-language loader, so both remain
340
+ disabled for this model. The script pre-tokenizes the small JSONL dataset
341
+ serially before constructing `SFTTrainer`, which avoids TRL multiprocessing
342
+ around the Gemma/Unsloth config object. It also uses the base Transformers loss
343
+ path to avoid a TRL entropy-metric incompatibility with Gemma 4 lazy logits. A
344
+ warm run for the 300-400 episode dataset should usually finish in about 20-60
345
+ minutes; first image or model-cache builds can push that closer to 45-90
346
+ minutes.
347
+
348
+ Continue GRPO from the SFT LoRA:
349
+
350
+ The GRPO launcher downloads the Hub adapter, attaches a matching trainable
351
+ Unsloth LoRA to Gemma 4, and then loads the adapter safetensors. This keeps the
352
+ SFT handoff compatible with Gemma 4's Unsloth linear wrappers.
353
+
354
+ ```bash
355
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
356
+ --initial-adapter-repo-id Humanlearning/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-sft-lora \
357
+ --max-steps 300 \
358
+ --dataset-size 64 \
359
+ --num-generations 8 \
360
+ --difficulty 0 \
361
+ --trace-log-every 10 \
362
+ --detach
363
+ ```
364
+
365
  ## Modal GRPO Training
366
 
367
  The persistent GPU training launcher packages this local repo into Modal, trains
 
379
  Run the default smoke GRPO job:
380
 
381
  ```bash
382
+ uv run --extra modal modal run scripts/modal_train_grpo.py --mode prepare-cache
383
  uv run --extra modal modal run scripts/modal_train_grpo.py \
384
  --max-steps 10 \
385
  --dataset-size 16 \
386
+ --num-generations 6 \
387
  --difficulty 0
388
  ```
389
 
390
+ For GPU-utilization tuning on the same single L4, start with a larger but still
391
+ bounded no-code trial:
392
+
393
+ ```bash
394
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
395
+ --max-steps 30 \
396
+ --dataset-size 64 \
397
+ --num-generations 8 \
398
+ --max-completion-length 256 \
399
+ --difficulty 0
400
+ ```
401
+
402
+ The launcher exposes GRPO throughput knobs for follow-up trials:
403
+
404
+ ```bash
405
+ # larger generation group, no vLLM
406
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
407
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
408
+ --max-completion-length 256 --trace-log-every 5
409
+
410
+ # vLLM colocate on the same L4
411
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
412
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
413
+ --max-completion-length 256 --use-vllm \
414
+ --vllm-gpu-memory-utilization 0.35 --trace-log-every 5
415
+
416
+ # larger microbatch if the vLLM trial does not OOM
417
+ uv run --extra modal modal run scripts/modal_train_grpo.py \
418
+ --max-steps 30 --dataset-size 64 --num-generations 8 \
419
+ --per-device-train-batch-size 2 --gradient-accumulation-steps 4 \
420
+ --max-completion-length 256 --use-vllm \
421
+ --vllm-gpu-memory-utilization 0.45 --trace-log-every 5
422
+ ```
423
+
424
+ `per_device_train_batch_size * gradient_accumulation_steps * world_size` must
425
+ be divisible by `num_generations`; the launcher validates this before the GPU
426
+ container starts. Scalar Trackio metrics still log every reward callback, while
427
+ sample trace tables and Trace objects are throttled by `--trace-log-every`
428
+ (`1` restores every-callback logging, `0` disables trace artifacts).
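The divisibility rule can be checked locally before launching. This is a minimal sketch of the same arithmetic, assuming a single GPU (`world_size=1`) by default; `effective_batch` is a hypothetical helper, not the launcher's own validator.

```python
def effective_batch(per_device_train_batch_size, gradient_accumulation_steps,
                    num_generations, world_size=1):
    """Return the effective batch size, or raise if it cannot hold whole generation groups."""
    total = per_device_train_batch_size * gradient_accumulation_steps * world_size
    if total % num_generations != 0:
        raise ValueError(
            f"effective batch {total} is not divisible by num_generations={num_generations}"
        )
    return total


# The larger-microbatch trial above: 2 * 4 * 1 = 8, divisible by 8 generations.
print(effective_batch(2, 4, 8))  # 8
```

With the same batch settings, `--num-generations 6` would be rejected, because 8 is not a multiple of 6.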
429
+
430
+ ### Parallel Modal GRPO Runs
431
+
432
+ Parallel Modal GRPO runs are safe when each run has its own seed range, run
433
+ name, and output target, while the shared cache volumes remain read-only.
434
+ Before launching another job, check what is already active:
435
+
436
+ ```bash
437
+ uv run --extra modal modal app list
438
+ uv run --extra modal modal app logs <app-id>
439
+ ```
440
+
441
+ Launch long-running parallel jobs with both Modal CLI detach and the launcher
442
+ detach flag. The CLI-level `--detach` keeps the remote function alive after the
443
+ local entrypoint exits; the launcher `--detach` prevents the parent Modal
444
+ function from waiting on the GPU call.
445
+
446
+ ```bash
447
+ uv run --extra modal modal run --detach scripts/modal_train_grpo.py \
448
+ --max-steps 300 \
449
+ --dataset-size 64 \
450
+ --num-generations 8 \
451
+ --max-completion-length 768 \
452
+ --difficulty 0 \
453
+ --trace-log-every 10 \
454
+ --seed-start 10000 \
455
+ --detach
456
+ ```
457
+
458
+ For multiple concurrent experiments:
459
+
460
+ - Use a unique `--seed-start` range for every run, normally spaced by at least
461
+ 10,000 seeds.
462
+ - Keep `CYBERSECURITY_OWASP_SCENARIO_CACHE_MODE=require`; do not compile
463
+ scenarios during training.
464
+ - Do not run `prepare-cache --cache-force` while training jobs are active.
465
+ - Keep `--push-to-hub` disabled unless each run has a unique
466
+ `--output-repo-id`.
467
+ - Let the launcher generate unique timestamped Trackio run names, or set an
468
+ explicit `RUN_NAME` only when it is globally unique.
469
+ - Use the same Trackio Space/project for comparable metrics, but never reuse a
470
+ run name.
471
+ - Treat `CyberSecurity_OWASP-model-cache` and
472
+ `CyberSecurity_OWASP-scenario-cache` as shared read-mostly infrastructure
473
+ during training. Run outputs and checkpoints should stay under each run's
474
+ unique output directory.
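The seed-spacing rule from the list above can be planned with a small helper. This sketch assumes each run consumes a contiguous block of seeds beginning at its `--seed-start`, which the README does not state explicitly; `plan_seed_starts` and `ranges_disjoint` are hypothetical names.

```python
def plan_seed_starts(n_runs, first=10_000, spacing=10_000):
    """Return one --seed-start value per run, spaced by `spacing` seeds."""
    if n_runs < 0 or spacing <= 0:
        raise ValueError("n_runs must be >= 0 and spacing must be > 0")
    return [first + i * spacing for i in range(n_runs)]


def ranges_disjoint(starts, seeds_per_run):
    """True if no two runs would share a seed, assuming contiguous seed use."""
    ordered = sorted(starts)
    return all(b - a >= seeds_per_run for a, b in zip(ordered, ordered[1:]))


print(plan_seed_starts(3))  # [10000, 20000, 30000]
```

For example, `ranges_disjoint([10000, 60000], 64)` holds for the two `--seed-start` values used in this README, given a per-run seed count of 64.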
475
+
476
+ If a Windows shell fails with a Unicode `charmap` encoding error during Modal
477
+ startup, rerun with UTF-8 enabled for that command:
478
+
479
+ ```powershell
480
+ $env:PYTHONIOENCODING='utf-8'; $env:PYTHONUTF8='1'; uv run --extra modal modal run --detach scripts/modal_train_grpo.py --max-steps 300 --dataset-size 64 --num-generations 4 --max-completion-length 768 --difficulty 0 --trace-log-every 10 --seed-start 60000 --detach
481
+ ```
482
+
483
  If running from a public repository and you do not want Modal to package the
484
  local workspace, use public source mode:
485
 
 
490
  --repo-branch master \
491
  --max-steps 10 \
492
  --dataset-size 16 \
493
+ --num-generations 6 \
494
  --difficulty 0
495
  ```
496
 
 
498
 
499
  - Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
500
  - Trackio project: `CyberSecurity_OWASP-grpo`
501
+ - Training model: `unsloth/gemma-4-E2B-it`
502
+ - Output repo: `<hf-user>/CyberSecurity_OWASP-unsloth-gemma-4-e2b-it-grpo-lora`
503
 
504
  Override these with `--trackio-space-id`, `--trackio-project`, and
505
+ `--output-repo-id` when needed. The persistent GRPO launcher intentionally rejects non-Gemma model overrides so smoke runs match the Unsloth Gemma 4 E2B RL notebook.
506
 
507
  ## Docker / Spaces
508