Humanlearning committed
Commit 31637b2 · verified · 1 Parent(s): 108fa47

Upload remaining non-secret workspace files

Files changed (2)
  1. README.md +33 -0
  2. scripts/modal_train_grpo.py +721 -0
README.md CHANGED
@@ -155,6 +155,39 @@ The shell wrapper is equivalent:
 MODE=smoke EPISODES=4 uv run --extra modal bash scripts/modal_run_ephemeral.sh
 ```
 
+## Modal GRPO Training
+
+The persistent GPU training launcher packages this local repo into Modal, trains
+a small LoRA GRPO run, logs metrics and traces to Trackio, stores checkpoints in
+the `CyberSecurity_OWASP-grpo-runs` Modal volume, and pushes the output adapter
+to Hugging Face Hub.
+
+Create a Modal secret named `CyberSecurity_OWASP-secrets` with `HF_TOKEN`, then
+run the import/config check:
+
+```bash
+uv run --extra modal modal run scripts/modal_train_grpo.py --mode config
+```
+
+Run the default smoke GRPO job:
+
+```bash
+uv run --extra modal modal run scripts/modal_train_grpo.py \
+  --max-steps 10 \
+  --dataset-size 16 \
+  --num-generations 2 \
+  --difficulty 0
+```
+
+Defaults are derived from `HF_TOKEN`:
+
+- Trackio Space: `<hf-user>/CyberSecurity_OWASP-trackio`
+- Trackio project: `CyberSecurity_OWASP-grpo`
+- Output repo: `<hf-user>/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora`
+
+Override these with `--trackio-space-id`, `--trackio-project`, and
+`--output-repo-id` when needed.
+
 ## Docker / Spaces
 
 ```bash
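Reviewer note: the defaults listed in the README hunk above follow a simple naming rule keyed on the Hugging Face username. The following is a minimal sketch of that rule, assuming the username has already been resolved (the launcher in this commit obtains it from `HF_TOKEN` via `huggingface_hub.whoami`). The helper name `resolve_hub_defaults` is hypothetical and does not appear in the commit.

```python
from __future__ import annotations


def resolve_hub_defaults(
    user: str,
    trackio_space_id: str = "",
    trackio_project: str = "CyberSecurity_OWASP-grpo",
    output_repo_id: str = "",
) -> dict[str, str]:
    """Hypothetical helper: fill empty overrides with per-user defaults.

    An empty string means "not overridden", mirroring the launcher's
    `value or default` pattern.
    """
    return {
        "trackio_space_id": trackio_space_id or f"{user}/CyberSecurity_OWASP-trackio",
        "trackio_project": trackio_project,
        "output_repo_id": output_repo_id
        or f"{user}/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora",
    }


# "alice" is a placeholder username used only for illustration.
defaults = resolve_hub_defaults("alice")
overridden = resolve_hub_defaults("alice", output_repo_id="team/custom-adapter")
```

An explicit `--output-repo-id` replaces only that one value; the Trackio Space still falls back to the per-user default.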
scripts/modal_train_grpo.py ADDED
@@ -0,0 +1,721 @@
+"""Persistent Modal GRPO launcher for CyberSecurity_OWASP.
+
+This packages the local repository into a Modal GPU image, runs a small
+tool-use GRPO job against the in-process CyberSecurity_OWASP environment, logs
+metrics/traces to Trackio, and saves LoRA checkpoints in a persistent Modal
+volume.
+
+Example:
+
+    uv run --extra modal modal run scripts/modal_train_grpo.py \
+        --max-steps 10 \
+        --dataset-size 16 \
+        --num-generations 2 \
+        --difficulty 0
+"""
+
+from __future__ import annotations
+
+import os
+import pathlib
+import subprocess
+from datetime import datetime, timezone
+from typing import Any
+
+import modal
+
+
+APP_NAME = "CyberSecurity_OWASP-grpo"
+VOLUME_NAME = "CyberSecurity_OWASP-grpo-runs"
+SECRET_NAME = "CyberSecurity_OWASP-secrets"
+RUNS_DIR = pathlib.Path("/runs")
+REMOTE_PROJECT = "/root/CyberSecurity_OWASP"
+PROJECT_ROOT = pathlib.Path(__file__).resolve().parents[1]
+
+
+def _training_image() -> modal.Image:
+    return (
+        modal.Image.from_registry(
+            "nvidia/cuda:12.8.0-devel-ubuntu22.04",
+            add_python="3.11",
+        )
+        .apt_install("git", "build-essential", "curl")
+        .uv_pip_install(
+            "torch==2.10.0",
+            "triton>=3.4.0",
+            "torchvision==0.25.0",
+            "bitsandbytes",
+            "accelerate",
+            "datasets",
+            "huggingface_hub",
+            "peft",
+            "tokenizers",
+            "nvidia-ml-py",
+            "trackio>=0.25.0",
+            "transformers>=5.5.0",
+            "trl>=0.28.0",
+            "openenv-core[core]>=0.2.3",
+            "pydantic==2.10.6",
+        )
+        .uv_pip_install(
+            "unsloth_zoo[base] @ git+https://github.com/unslothai/unsloth-zoo",
+            "unsloth[base] @ git+https://github.com/unslothai/unsloth",
+        )
+        .uv_pip_install("mergekit", "immutables==0.21", extra_options="--no-deps")
+        .uv_pip_install("trl>=0.28.0", "transformers>=5.5.0", "jmespath")
+        .add_local_dir(
+            PROJECT_ROOT,
+            remote_path=REMOTE_PROJECT,
+            copy=True,
+            ignore=[
+                ".git",
+                ".venv",
+                "__pycache__",
+                ".pytest_cache",
+                "outputs",
+                "*.pyc",
+            ],
+        )
+        .run_commands(
+            f"python -m pip install -e {REMOTE_PROJECT}",
+            "python -c \"import os, torch; import transformers.utils.hub as hub; "
+            "hub.TRANSFORMERS_CACHE = getattr(hub, 'TRANSFORMERS_CACHE', "
+            "os.path.join(os.path.expanduser('~'), '.cache', 'huggingface', 'hub')); "
+            "from trl import GRPOConfig, GRPOTrainer; "
+            "from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import "
+            "CybersecurityOwaspEnvironment; print('trainer import ok', torch.__version__)\"",
+        )
+        .workdir(REMOTE_PROJECT)
+    )
+
+
+app = modal.App(APP_NAME)
+volume = modal.Volume.from_name(VOLUME_NAME, create_if_missing=True)
+secret = modal.Secret.from_name(SECRET_NAME)
+
+
+@app.function(
+    image=_training_image(),
+    gpu=["L4", "A10G"],
+    timeout=4 * 60 * 60,
+    volumes={RUNS_DIR: volume},
+    secrets=[secret],
+)
+def check_training_imports() -> dict[str, str]:
+    import torch
+    import trackio
+    from datasets import Dataset
+    from trl import GRPOConfig, GRPOTrainer
+    from unsloth import FastLanguageModel
+
+    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
+        CybersecurityOwaspEnvironment,
+    )
+
+    env = CybersecurityOwaspEnvironment()
+    obs = env.reset(seed=0, split="validation", difficulty=0)
+    return {
+        "torch": torch.__version__,
+        "trackio": getattr(trackio, "__version__", "unknown"),
+        "dataset": Dataset.__name__,
+        "grpo_config": GRPOConfig.__name__,
+        "grpo_trainer": GRPOTrainer.__name__,
+        "unsloth_model": FastLanguageModel.__name__,
+        "env": CybersecurityOwaspEnvironment.__name__,
+        "reset_phase": obs.phase,
+    }
+
+
+@app.function(
+    image=_training_image(),
+    gpu=["L4", "A10G"],
+    timeout=4 * 60 * 60,
+    volumes={RUNS_DIR: volume},
+    secrets=[secret],
+)
+def train_cybersecurity_owasp_grpo(
+    env_repo_id: str = "",
+    output_repo_id: str = "",
+    max_steps: int = 10,
+    dataset_size: int = 16,
+    difficulty: int = 0,
+    split: str = "train",
+    model_name: str = "Qwen/Qwen3-1.7B",
+    max_seq_length: int = 4096,
+    max_completion_length: int = 768,
+    lora_rank: int = 32,
+    trackio_space_id: str = "",
+    trackio_project: str = "CyberSecurity_OWASP-grpo",
+    num_generations: int = 2,
+    seed_start: int = 0,
+    git_sha: str = "nogit",
+) -> dict[str, str | int | float]:
+    import statistics
+
+    import torch
+    import transformers.utils.hub as transformers_hub
+    from datasets import Dataset
+    from huggingface_hub import whoami
+    from transformers import TrainerCallback
+    from trl import GRPOConfig, GRPOTrainer
+    from unsloth import FastLanguageModel
+
+    import trackio
+
+    from CyberSecurity_OWASP.models import CyberSecurityOWASPAction
+    from CyberSecurity_OWASP.server.CyberSecurity_OWASP_environment import (
+        CybersecurityOwaspEnvironment,
+    )
+
+    if not hasattr(transformers_hub, "TRANSFORMERS_CACHE"):
+        transformers_hub.TRANSFORMERS_CACHE = os.path.join(
+            os.path.expanduser("~"),
+            ".cache",
+            "huggingface",
+            "hub",
+        )
+
+    hf_token = os.environ.get("HF_TOKEN")
+    if not hf_token:
+        raise RuntimeError(
+            f"HF_TOKEN is missing from the Modal secret {SECRET_NAME}."
+        )
+
+    user = whoami(token=hf_token)["name"]
+    env_repo_id = env_repo_id or f"{user}/CyberSecurity_OWASP"
+    output_repo_id = output_repo_id or f"{user}/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
+    trackio_space_id = trackio_space_id or f"{user}/CyberSecurity_OWASP-trackio"
+
+    os.environ["TRACKIO_SPACE_ID"] = trackio_space_id
+    os.environ["TRACKIO_PROJECT"] = trackio_project
+
+    model_slug = model_name.replace("/", "-")
+    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
+    run_name = f"CyberSecurity_OWASP-{model_slug}-grpo-level{difficulty}-{stamp}-{git_sha[:8]}"
+    output_dir = RUNS_DIR / run_name
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    training_prompt = (
+        "You are a defensive AppSec repair agent in the local CyberSecurity_OWASP "
+        "OpenEnv environment. Use only the provided local tools. Do not target real "
+        "systems. Work step by step: inspect policy and generated code, reproduce the "
+        "authorization issue locally, submit a policy-tied finding, patch the generated "
+        "app, run visible tests, then submit the fix. Do not write explanations unless "
+        "a tool argument needs evidence text."
+    )
+
+    dataset = Dataset.from_list(
+        [
+            {
+                "prompt": [{"role": "user", "content": training_prompt}],
+                "seed": seed_start + index,
+                "difficulty": difficulty,
+                "split": split,
+            }
+            for index in range(dataset_size)
+        ]
+    )
+
+    def _state_snapshot(env: CybersecurityOwaspEnvironment) -> dict[str, Any]:
+        state = env.state
+        return {
+            "episode_id": state.episode_id,
+            "task_id": state.task_id,
+            "seed": state.seed,
+            "split": state.split,
+            "difficulty": state.difficulty,
+            "domain": state.domain,
+            "bug_family": state.bug_family,
+            "phase": state.phase,
+            "step_count": state.step_count,
+            "done": state.done,
+            "success": state.success,
+            "failure_reason": state.failure_reason,
+            "anti_cheat_flags": list(state.anti_cheat_flags),
+        }
+
+    class CyberSecurityOWASPToolEnv:
+        def __init__(self):
+            self._env = CybersecurityOwaspEnvironment()
+            self.reward = 0.0
+            self.reward_breakdown: dict[str, float] = {}
+            self.done = False
+            self.success = False
+            self.invalid_actions = 0
+            self.trace_messages: list[dict[str, str]] = []
+            self.trace_metadata: dict[str, Any] = {}
+
+        def reset(self, **kwargs) -> str:
+            seed = int(kwargs.get("seed", seed_start))
+            current_difficulty = int(kwargs.get("difficulty", difficulty))
+            current_split = str(kwargs.get("split", split))
+            obs = self._env.reset(
+                seed=seed,
+                split=current_split,
+                difficulty=current_difficulty,
+            )
+            self.reward = 0.0
+            self.reward_breakdown = {}
+            self.done = bool(obs.done)
+            self.success = False
+            self.invalid_actions = 0
+            self.trace_messages = [
+                {
+                    "role": "user",
+                    "content": (
+                        f"{training_prompt}\n\nInitial observation:\n"
+                        f"Phase: {obs.phase}\n"
+                        f"Task: {obs.task_brief}\n"
+                        f"Available actions: {obs.available_actions}\n"
+                        f"Workspace summary: {obs.workspace_summary}\n"
+                        f"Policy hint: {obs.visible_policy_hint}\n"
+                        f"Message: {obs.message}"
+                    ),
+                }
+            ]
+            self.trace_metadata = _state_snapshot(self._env)
+            return obs.message
+
+        def _step(self, tool_name: str, arguments: dict[str, Any] | None = None) -> str:
+            if self.done:
+                raise ValueError("Episode is already over.")
+            action = CyberSecurityOWASPAction(
+                tool_name=tool_name,
+                arguments=arguments or {},
+            )
+            obs = self._env.step(action)
+            if not obs.last_action_valid:
+                self.invalid_actions += 1
+            self.reward = float(obs.reward_breakdown.get("total", obs.reward or 0.0))
+            self.reward_breakdown = dict(obs.reward_breakdown or {})
+            self.done = bool(obs.done)
+            self.success = bool(self._env.state.success)
+            self.trace_messages.extend(
+                [
+                    {
+                        "role": "assistant",
+                        "content": f"{tool_name}({arguments or {}})",
+                    },
+                    {"role": "tool", "content": obs.message},
+                ]
+            )
+            self.trace_metadata.update(_state_snapshot(self._env))
+            self.trace_metadata.update(
+                {
+                    "last_action_valid": obs.last_action_valid,
+                    "last_action_error": obs.last_action_error,
+                    "reward": self.reward,
+                    "reward_breakdown": self.reward_breakdown,
+                    "invalid_actions": self.invalid_actions,
+                }
+            )
+            return obs.message
+
+        def inspect_policy_graph(self) -> str:
+            """Return public policy hints for the generated local scenario."""
+            return self._step("inspect_policy_graph")
+
+        def list_routes(self) -> str:
+            """List generated local app route summaries."""
+            return self._step("list_routes")
+
+        def read_openapi(self) -> str:
+            """Read generated OpenAPI metadata for the local app."""
+            return self._step("read_openapi")
+
+        def read_file(self, path: str) -> str:
+            """Read an editable generated workspace file by relative path."""
+            return self._step("read_file", {"path": path})
+
+        def search_code(self, query: str) -> str:
+            """Search editable generated workspace files for a string."""
+            return self._step("search_code", {"query": query})
+
+        def send_local_request(
+            self,
+            path: str,
+            method: str = "GET",
+            user_id: str | None = None,
+        ) -> str:
+            """Send a request to the generated local app only."""
+            return self._step(
+                "send_local_request",
+                {"path": path, "method": method, "user_id": user_id},
+            )
+
+        def compare_identities(
+            self,
+            path: str,
+            first_user_id: str,
+            second_user_id: str,
+            method: str = "GET",
+        ) -> str:
+            """Compare one local request as two generated users."""
+            return self._step(
+                "compare_identities",
+                {
+                    "path": path,
+                    "method": method,
+                    "first_user_id": first_user_id,
+                    "second_user_id": second_user_id,
+                },
+            )
+
+        def submit_finding(
+            self,
+            summary: str,
+            evidence: str,
+            policy_rule: str,
+        ) -> str:
+            """Submit structured evidence for the suspected authorization bug."""
+            return self._step(
+                "submit_finding",
+                {
+                    "summary": summary,
+                    "evidence": evidence,
+                    "policy_rule": policy_rule,
+                },
+            )
+
+        def patch_file(
+            self,
+            path: str,
+            content: str | None = None,
+            diff: str | None = None,
+        ) -> str:
+            """Patch an editable generated app file with full content or a unified diff."""
+            args: dict[str, Any] = {"path": path}
+            if content is not None:
+                args["content"] = content
+            if diff is not None:
+                args["diff"] = diff
+            return self._step("patch_file", args)
+
+        def run_visible_tests(self) -> str:
+            """Run visible tests only; hidden tests are never exposed."""
+            return self._step("run_visible_tests")
+
+        def submit_fix(self) -> str:
+            """Submit the final patch to the hidden deterministic verifier."""
+            return self._step("submit_fix")
+
+        def noop(self) -> str:
+            """Take no action."""
+            return self._step("noop")
+
+        def _score(self) -> float:
+            return float(self.reward)
+
+        def __del__(self):
+            try:
+                self._env.close()
+            except Exception:
+                pass
+
+    trace_step = {"value": 0}
+
+    def _completion_to_text(completion) -> str:
+        if completion is None:
+            return ""
+        if isinstance(completion, str):
+            return completion
+        if isinstance(completion, list):
+            parts = []
+            for item in completion:
+                if isinstance(item, dict):
+                    parts.append(str(item.get("content", item)))
+                else:
+                    parts.append(str(item))
+            return "\n".join(parts)
+        return str(completion)
+
+    def _mean(values: list[float]) -> float:
+        return float(sum(values) / len(values)) if values else 0.0
+
+    def cybersecurity_owasp_reward(environments, **kwargs) -> list[float]:
+        rewards = [float(env._score()) for env in environments]
+        completions = kwargs.get("completions") or kwargs.get("completion") or []
+        trace_step["value"] += 1
+
+        breakdowns = [getattr(env, "reward_breakdown", {}) or {} for env in environments]
+        metrics = {
+            "train/reward_total_mean": _mean(rewards),
+            "train/reward_discovery_mean": _mean(
+                [float(item.get("discovery", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_security_mean": _mean(
+                [float(item.get("security", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_regression_mean": _mean(
+                [float(item.get("regression", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_public_routes_mean": _mean(
+                [float(item.get("public_routes", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_patch_quality_mean": _mean(
+                [float(item.get("patch_quality", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_visible_tests_mean": _mean(
+                [float(item.get("visible_tests", 0.0)) for item in breakdowns]
+            ),
+            "train/reward_anti_cheat_mean": _mean(
+                [float(item.get("anti_cheat", 0.0)) for item in breakdowns]
+            ),
+            "train/success_rate": _mean(
+                [1.0 if bool(getattr(env, "success", False)) else 0.0 for env in environments]
+            ),
+            "train/invalid_action_rate": _mean(
+                [float(getattr(env, "invalid_actions", 0)) for env in environments]
+            ),
+            "train/episode_length_mean": _mean(
+                [
+                    float(getattr(env, "trace_metadata", {}).get("step_count", 0))
+                    for env in environments
+                ]
+            ),
+        }
+
+        try:
+            trackio.log(metrics, step=trace_step["value"])
+        except Exception as exc:
+            print(f"Trackio metric logging skipped: {exc!r}")
+
+        for index, env in enumerate(environments):
+            messages = list(getattr(env, "trace_messages", []))
+            if index < len(completions):
+                completion_text = _completion_to_text(completions[index])
+                if completion_text:
+                    messages.append(
+                        {
+                            "role": "assistant",
+                            "content": f"Raw generated completion:\n{completion_text}",
+                        }
+                    )
+            metadata = dict(getattr(env, "trace_metadata", {}))
+            metadata.update(
+                {
+                    "sample_index": index,
+                    "reward": rewards[index],
+                    "trace_step": trace_step["value"],
+                    "run_name": run_name,
+                }
+            )
+            try:
+                trackio.log(
+                    {
+                        f"cybersecurity_owasp_trace/sample_{index}": trackio.Trace(
+                            messages=messages,
+                            metadata=metadata,
+                        )
+                    },
+                    step=trace_step["value"],
+                )
+            except Exception as exc:
+                print(f"Trackio trace logging skipped: {exc!r}")
+
+        if rewards:
+            print(
+                "Reward batch: "
+                f"mean={statistics.mean(rewards):.3f}, "
+                f"min={min(rewards):.3f}, max={max(rewards):.3f}"
+            )
+        return rewards
+
+    class TrackioSystemMetricsCallback(TrainerCallback):
+        def on_log(self, args, state, control, logs=None, **kwargs):
+            try:
+                metrics = trackio.log_gpu()
+            except Exception as exc:
+                print(f"Trackio GPU metrics skipped: {exc!r}")
+                return control
+            if metrics:
+                summary = ", ".join(f"{key}={value}" for key, value in sorted(metrics.items())[:4])
+                print(f"Trackio GPU metrics logged at step {state.global_step}: {summary}")
+            return control
+
+    print(f"CUDA available: {torch.cuda.is_available()}")
+    print(f"Packaged local CyberSecurity_OWASP repo; default env repo id: {env_repo_id}")
+    print(f"Trackio Space: {trackio_space_id}")
+    print(f"Trackio Project: {trackio_project}")
+    print(f"Output repo: {output_repo_id}")
+    print(f"Run name: {run_name}")
+
+    model, tokenizer = FastLanguageModel.from_pretrained(
+        model_name=model_name,
+        max_seq_length=max_seq_length,
+        load_in_4bit=False,
+        fast_inference=False,
+        token=hf_token,
+    )
+    model = FastLanguageModel.get_peft_model(
+        model,
+        r=lora_rank,
+        target_modules=[
+            "q_proj",
+            "k_proj",
+            "v_proj",
+            "o_proj",
+            "gate_proj",
+            "up_proj",
+            "down_proj",
+        ],
+        lora_alpha=lora_rank * 2,
+        use_gradient_checkpointing="unsloth",
+        random_state=3407,
+    )
+    FastLanguageModel.for_training(model)
+
+    training_args = GRPOConfig(
+        temperature=1.0,
+        learning_rate=5e-6,
+        weight_decay=0.001,
+        warmup_ratio=0.1,
+        lr_scheduler_type="linear",
+        optim="adamw_8bit",
+        logging_steps=1,
+        per_device_train_batch_size=1,
+        gradient_accumulation_steps=max(2, num_generations),
+        num_generations=num_generations,
+        max_prompt_length=max_seq_length,
+        max_completion_length=max_completion_length,
+        max_steps=max_steps,
+        save_steps=max(10, max_steps),
+        report_to="trackio",
+        trackio_space_id=trackio_space_id,
+        run_name=run_name,
+        output_dir=str(output_dir),
+        push_to_hub=True,
+        hub_model_id=output_repo_id,
+        hub_private_repo=True,
+        hub_strategy="every_save",
+        gradient_checkpointing=True,
+        gradient_checkpointing_kwargs={"use_reentrant": False},
+        epsilon=0.2,
+        epsilon_high=0.28,
+        delta=1.5,
+        loss_type="bnpo",
+        mask_truncated_completions=False,
+    )
+
+    trainer = GRPOTrainer(
+        model=model,
+        processing_class=tokenizer,
+        reward_funcs=cybersecurity_owasp_reward,
+        args=training_args,
+        train_dataset=dataset,
+        environment_factory=CyberSecurityOWASPToolEnv,
+        callbacks=[TrackioSystemMetricsCallback()],
+    )
+    trainer.train()
+    trainer.push_to_hub()
+    volume.commit()
+
+    return {
+        "run_name": run_name,
+        "env_repo_id": env_repo_id,
+        "output_repo_id": output_repo_id,
+        "trackio_space_id": trackio_space_id,
+        "trackio_project": trackio_project,
+        "max_steps": max_steps,
+        "dataset_size": dataset_size,
+        "difficulty": difficulty,
+        "split": split,
+        "model_name": model_name,
+        "max_completion_length": max_completion_length,
+        "num_generations": num_generations,
+    }
+
+
+@app.local_entrypoint()
+def main(
+    mode: str = "train",
+    env_repo_id: str = "",
+    output_repo_id: str = "",
+    max_steps: int = 10,
+    dataset_size: int = 16,
+    difficulty: int = 0,
+    split: str = "train",
+    model_name: str = "Qwen/Qwen3-1.7B",
+    max_seq_length: int = 4096,
+    max_completion_length: int = 768,
+    lora_rank: int = 32,
+    trackio_space_id: str = "",
+    trackio_project: str = "CyberSecurity_OWASP-grpo",
+    num_generations: int = 2,
+    seed_start: int = 0,
+    git_sha: str = "nogit",
+) -> None:
+    if mode == "config":
+        result = check_training_imports.remote()
+        print(result)
+        return
+    if mode != "train":
+        raise ValueError("mode must be 'train' or 'config'")
+
+    resolved_trackio_space_id = trackio_space_id
+    resolved_output_repo_id = output_repo_id
+    if not resolved_trackio_space_id or not resolved_output_repo_id:
+        hf_token = os.environ.get("HF_TOKEN")
+        if hf_token:
+            try:
+                from huggingface_hub import whoami
+
+                user = whoami(token=hf_token)["name"]
+                resolved_trackio_space_id = (
+                    resolved_trackio_space_id or f"{user}/CyberSecurity_OWASP-trackio"
+                )
+                resolved_output_repo_id = (
+                    resolved_output_repo_id
+                    or f"{user}/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
+                )
+            except Exception as exc:
+                print(f"Could not resolve Hugging Face defaults locally: {exc!r}")
+
+    if git_sha == "nogit":
+        try:
+            git_sha = subprocess.check_output(
+                ["git", "rev-parse", "HEAD"],
+                cwd=PROJECT_ROOT,
+                text=True,
+                stderr=subprocess.DEVNULL,
+            ).strip()
+        except Exception:
+            git_sha = "nogit"
+
+    model_slug = model_name.replace("/", "-")
+    local_stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
+    estimated_run_name = (
+        f"CyberSecurity_OWASP-{model_slug}-grpo-level{difficulty}-"
+        f"{local_stamp}-{git_sha[:8]}"
+    )
+
+    call = train_cybersecurity_owasp_grpo.spawn(
+        env_repo_id=env_repo_id,
+        output_repo_id=output_repo_id,
+        max_steps=max_steps,
+        dataset_size=dataset_size,
+        difficulty=difficulty,
+        split=split,
+        model_name=model_name,
+        max_seq_length=max_seq_length,
+        max_completion_length=max_completion_length,
+        lora_rank=lora_rank,
+        trackio_space_id=trackio_space_id,
+        trackio_project=trackio_project,
+        num_generations=num_generations,
+        seed_start=seed_start,
+        git_sha=git_sha,
+    )
+    print(f"Spawned Modal training call: {call.object_id}")
+    print(f"Estimated run name: {estimated_run_name}")
+    if resolved_trackio_space_id:
+        print(f"Trackio Space: https://huggingface.co/spaces/{resolved_trackio_space_id}")
+    else:
+        print("Trackio Space: derived remotely from HF_TOKEN as <hf-user>/CyberSecurity_OWASP-trackio")
+    if resolved_output_repo_id:
+        print(f"Output model repo: https://huggingface.co/{resolved_output_repo_id}")
+    else:
+        print(
+            "Output model repo: derived remotely from HF_TOKEN as "
+            "<hf-user>/CyberSecurity_OWASP-qwen3-1.7b-grpo-lora"
+        )
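
Reviewer note: the reward function in this file logs one mean per reward component by repeating the same `_mean([... item.get(key, 0.0) ...])` pattern for each key. The sketch below shows the same aggregation driven by a key list. It assumes each breakdown is a plain `dict[str, float]` with the component names used in the diff; `REWARD_KEYS` and `summarize_breakdowns` are hypothetical names that do not appear in the commit.

```python
from __future__ import annotations

# Component names taken from the metric keys logged in the diff.
REWARD_KEYS = (
    "discovery",
    "security",
    "regression",
    "public_routes",
    "patch_quality",
    "visible_tests",
    "anti_cheat",
)


def mean(values: list[float]) -> float:
    """Return the arithmetic mean, or 0.0 for an empty batch."""
    return float(sum(values) / len(values)) if values else 0.0


def summarize_breakdowns(breakdowns: list[dict[str, float]]) -> dict[str, float]:
    """Hypothetical helper: one `train/reward_<key>_mean` entry per component.

    A component missing from a breakdown counts as 0.0 for that sample.
    """
    return {
        f"train/reward_{key}_mean": mean([float(b.get(key, 0.0)) for b in breakdowns])
        for key in REWARD_KEYS
    }


# Two made-up samples, used only to illustrate the aggregation.
batch = [
    {"discovery": 1.0, "security": 0.5},
    {"discovery": 0.0, "security": 0.5, "regression": 1.0},
]
summary = summarize_breakdowns(batch)
print(summary["train/reward_discovery_mean"])  # prints 0.5
```

Because missing components default to 0.0, a sample that ends before a component is scored lowers that component's batch mean instead of being excluded from it.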