Codeseys commited on
Commit
185cce2
·
1 Parent(s): 8498e8f

review+fix: cross-family adversarial ADR review (owed item) + remediation

Browse files

Ran the owed 5-family adversarial review of ADR-008/009/010 (unblocked once
OpenRouter credits were restored). 4/5 clean reviews (GPT-5.5, Gemini 3.1 Pro,
DeepSeek V4 Pro, Grok 4.3; Kimi starved its token budget). Verdicts: ADR-008
3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix — all REJECTs were
"accepted-status outran the evidence," not "design wrong."

The review caught a real policy-corrupting P0 the solo pre-mortem missed: the
SDPO alignment guard was a shape-check, so hint-shifted teacher tokens could
silently misalign and distill garbage into the policy.

Verified every convergent finding against code, then fixed (192 pass/16 skip,
was 187):

- SDPO alignment (composer_trainer.py): require collator-emitted
student_response_idx/teacher_response_idx; gather aligned post-hint logits
before JSD; strict-mode raises on missing/mismatched indices. Tests rewritten
to the corrected contract (different-length sequences now exercised).
- Sandbox scrub (sandbox.py): boot() physically removes __pycache__/.mypy_cache/
.pytest_cache/.git/.hg + *.pyc/*.pyo/*.class — the ADR-claimed PRIMARY
reward-hack control, previously UNIMPLEMENTED. shlex.quote on test node ids.
- Curriculum (curriculum.py + env.py): fractional pass credit instead of
int(reward>0) (the partial-reward-logs-as-full-pass bug that prematurely
retired multi-feature tasks); hacked/guard-broken rollouts contribute 0 credit
but count as exposures.
- reward_fn (env.py): length-guard against silent zip-truncation.
- LLM judge (hint_generator.py): address-stripped + versioned cache key,
600-char output clamp, atomic disk write.
- make_dr_grpo_config: fixed duplicated assertion literal.

Added 5 regression tests. Each ADR now carries a "Post-acceptance cross-family
review" section documenting fixed vs open items. Open follow-ups (none corrupt a
run): Docker substrate e2e, k1-KL assert, AST-provenance monitor, hint default
routing, curriculum turn/think-token signals.

Artifacts: docs/reviews/cross-family-adr-008-009-010-2026-05-29/

composer_replication/datagen/curriculum.py CHANGED
@@ -23,7 +23,7 @@ from dataclasses import dataclass, field
23
 
24
  @dataclass
25
  class _TaskStats:
26
- n_pass: int = 0
27
  n_total: int = 0
28
 
29
  @property
@@ -48,7 +48,16 @@ class DifficultyCurriculum:
48
  _stats: dict[str, _TaskStats] = field(default_factory=dict)
49
  _quarantined: set[str] = field(default_factory=set)
50
 
51
- def update(self, task_id: str, n_pass: int, n_total: int) -> None:
 
 
 
 
 
 
 
 
 
52
  st = self._stats.setdefault(task_id, _TaskStats())
53
  st.n_pass += n_pass
54
  st.n_total += n_total
 
23
 
24
  @dataclass
25
  class _TaskStats:
26
+ n_pass: float = 0.0
27
  n_total: int = 0
28
 
29
  @property
 
48
  _stats: dict[str, _TaskStats] = field(default_factory=dict)
49
  _quarantined: set[str] = field(default_factory=set)
50
 
51
+ def update(self, task_id: str, n_pass: float, n_total: int) -> None:
52
+ """Record `n_pass` successes over `n_total` exposures.
53
+
54
+ `n_pass` is a FLOAT so multi-feature tasks can record fractional credit
55
+ (e.g. a completion that passes 1 of 2 FAIL_TO_PASS tests contributes 0.5,
56
+ not 1). Cross-family review 2026-05-29: the env previously passed
57
+ `int(reward > 0)`, which logged a 0.5 partial as a full pass and let
58
+ `p_hat` cross `tau_easy` so the task was retired before the policy ever
59
+ learned the remaining features.
60
+ """
61
  st = self._stats.setdefault(task_id, _TaskStats())
62
  st.n_pass += n_pass
63
  st.n_total += n_total
composer_replication/datagen/env.py CHANGED
@@ -116,6 +116,14 @@ class FeatureDeletionEnv:
116
  "reward_fn requires a `task_id` column (passed via the GRPO "
117
  "dataset) to map each completion to its FeatureDeletionTask."
118
  )
 
 
 
 
 
 
 
 
119
  rewards: list[float] = []
120
  for comp, tid in zip(completions, task_id):
121
  task = self.registry[tid]
@@ -125,5 +133,15 @@ class FeatureDeletionEnv:
125
  else:
126
  res = self.step({"type": "submit"})
127
  rewards.append(res.reward)
128
- self.curriculum.update(tid, n_pass=int(res.reward > 0), n_total=1)
 
 
 
 
 
 
 
 
 
 
129
  return rewards
 
116
  "reward_fn requires a `task_id` column (passed via the GRPO "
117
  "dataset) to map each completion to its FeatureDeletionTask."
118
  )
119
+ # Length guard (cross-family review 2026-05-29): a `zip` over mismatched
120
+ # lengths silently truncates and returns fewer rewards than TRL expects,
121
+ # corrupting the advantage computation. Fail loudly instead.
122
+ if len(task_id) != len(completions):
123
+ raise ValueError(
124
+ f"reward_fn length mismatch: {len(completions)} completions vs "
125
+ f"{len(task_id)} task_ids. TRL expects one reward per completion."
126
+ )
127
  rewards: list[float] = []
128
  for comp, tid in zip(completions, task_id):
129
  task = self.registry[tid]
 
133
  else:
134
  res = self.step({"type": "submit"})
135
  rewards.append(res.reward)
136
+ # Curriculum update uses FRACTIONAL credit, not int(reward>0):
137
+ # - n_pass = pass-fraction over the FAIL_TO_PASS targets so a
138
+ # partial multi-feature solve doesn't read as a full pass and
139
+ # prematurely retire the task (the 3/4-reviewer P0).
140
+ # - a guard-broken or hack-flagged trajectory contributes 0 credit
141
+ # (reward is already 0) but STILL counts as an exposure, so a
142
+ # task only ever solvable via a hack trends toward quarantine
143
+ # rather than polluting the prior with phantom passes.
144
+ frac = float(res.info.get("frac", 0.0))
145
+ clean = bool(res.info.get("guard_ok", True)) and not bool(res.info.get("hacked", False))
146
+ self.curriculum.update(tid, n_pass=(frac if clean else 0.0), n_total=1)
147
  return rewards
composer_replication/datagen/sandbox.py CHANGED
@@ -8,6 +8,9 @@ the substrate's frozen image and is exercised by the docker-gated substrate test
8
  """
9
  from __future__ import annotations
10
 
 
 
 
11
  import subprocess
12
  from dataclasses import dataclass, field
13
  from typing import Protocol, runtime_checkable
@@ -37,12 +40,19 @@ class TestRunResult:
37
  # primary control is that __pycache__/.mypy_cache/.class are scrubbed pre-task.
38
  SANDBOX_DENYLIST: frozenset[str] = frozenset(
39
  {
40
- "find", "strings", "unzip", "jar", "javap", "unzip",
41
  "procyon", "cfr", "jd-cli", "jadx", # Java decompilers
42
  "uncompyle6", "decompyle3", # Python decompilers
43
  "git", # .git is stripped; no history mining
44
  }
45
  )
 
 
 
 
 
 
 
46
 
47
 
48
  @runtime_checkable
@@ -110,11 +120,44 @@ class LocalSubprocessSandbox:
110
  _trajectory: list[dict] = field(default_factory=list)
111
  booted_image: str | None = None
112
 
 
 
 
 
 
 
 
 
113
  def boot(self, image: str) -> None:
114
  self.booted_image = image
115
  self._trajectory = []
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
116
 
117
  def is_command_allowed(self, command: str) -> bool:
 
 
118
  return command not in SANDBOX_DENYLIST
119
 
120
  def exec(self, action: dict) -> str:
@@ -132,7 +175,10 @@ class LocalSubprocessSandbox:
132
 
133
  def run_tests(self, test_command: str, tests: tuple[str, ...]) -> TestRunResult:
134
  # Run pytest with explicit node ids; parse the summary line.
135
- node_ids = " ".join(tests)
 
 
 
136
  cmd = f"{test_command} {node_ids}"
137
  proc = subprocess.run(
138
  cmd, shell=True, cwd=self.workdir, capture_output=True, text=True, timeout=600
 
8
  """
9
  from __future__ import annotations
10
 
11
+ import os
12
+ import shlex
13
+ import shutil
14
  import subprocess
15
  from dataclasses import dataclass, field
16
  from typing import Protocol, runtime_checkable
 
40
  # primary control is that __pycache__/.mypy_cache/.class are scrubbed pre-task.
41
  SANDBOX_DENYLIST: frozenset[str] = frozenset(
42
  {
43
+ "find", "strings", "unzip", "jar", "javap",
44
  "procyon", "cfr", "jd-cli", "jadx", # Java decompilers
45
  "uncompyle6", "decompyle3", # Python decompilers
46
  "git", # .git is stripped; no history mining
47
  }
48
  )
49
+ # NOTE (cross-family review 2026-05-29): this denylist is INSUFFICIENT as a
50
+ # primary control. `is_command_allowed` checks only the first whitespace token,
51
+ # so `/usr/bin/find`, `sh -c "strings x"`, and especially `python -c "import
52
+ # marshal,dis; ..."` all bypass it. The ADR-claimed PRIMARY control is the
53
+ # pre-task cache/.git scrub in `boot()` — see `_scrub_tree` below, which is now
54
+ # implemented (was previously absent, making the denylist the only — broken —
55
+ # defense). The denylist remains as cheap defense-in-depth, not the wall.
56
 
57
 
58
  @runtime_checkable
 
120
  _trajectory: list[dict] = field(default_factory=list)
121
  booted_image: str | None = None
122
 
123
+ # Cache/history artifacts that let an agent recover a deleted signature
124
+ # WITHOUT reimplementing it (the Composer-blog reward-hacks). Scrubbed at
125
+ # boot() so the denylist isn't the only — and bypassable — line of defense.
126
+ _SCRUB_NAMES: tuple[str, ...] = (
127
+ "__pycache__", ".mypy_cache", ".pytest_cache", ".git", ".hg",
128
+ )
129
+ _SCRUB_SUFFIXES: tuple[str, ...] = (".pyc", ".pyo", ".class")
130
+
131
  def boot(self, image: str) -> None:
132
  self.booted_image = image
133
  self._trajectory = []
134
+ self._scrub_tree()
135
+
136
+ def _scrub_tree(self) -> None:
137
+ """PRIMARY reward-hack control (ADR-010 §3): physically remove byte-code
138
+ caches, type-check caches, and VCS history from the working tree before
139
+ the episode starts, so there is no cached signature to recover. This is
140
+ the wall; the command denylist is only cheap defense-in-depth on top.
141
+ Cross-family review 2026-05-29 found this was previously UNIMPLEMENTED —
142
+ boot() only recorded the image string."""
143
+ if not self.workdir or not os.path.isdir(self.workdir):
144
+ return
145
+ for root, dirs, files in os.walk(self.workdir, topdown=True):
146
+ # Remove (and stop descending into) scrub-named directories.
147
+ for d in list(dirs):
148
+ if d in self._SCRUB_NAMES:
149
+ shutil.rmtree(os.path.join(root, d), ignore_errors=True)
150
+ dirs.remove(d)
151
+ for f in files:
152
+ if f.endswith(self._SCRUB_SUFFIXES):
153
+ try:
154
+ os.remove(os.path.join(root, f))
155
+ except OSError:
156
+ pass
157
 
158
  def is_command_allowed(self, command: str) -> bool:
159
+ # NOTE: first-token-only check — see SANDBOX_DENYLIST comment. This is
160
+ # NOT a security boundary on its own; _scrub_tree() is the real control.
161
  return command not in SANDBOX_DENYLIST
162
 
163
  def exec(self, action: dict) -> str:
 
175
 
176
  def run_tests(self, test_command: str, tests: tuple[str, ...]) -> TestRunResult:
177
  # Run pytest with explicit node ids; parse the summary line.
178
+ # shlex.quote each node id — SWE-bench node ids contain spaces/brackets
179
+ # (parametrized tests like `test_x.py::test_y[a b]`) and could otherwise
180
+ # break the shell or inject commands (cross-family review 2026-05-29).
181
+ node_ids = " ".join(shlex.quote(t) for t in tests)
182
  cmd = f"{test_command} {node_ids}"
183
  proc = subprocess.run(
184
  cmd, shell=True, cwd=self.workdir, capture_output=True, text=True, timeout=600
composer_replication/datagen/tests/test_feature_deletion.py CHANGED
@@ -206,6 +206,82 @@ def test_reward_fn_requires_task_id():
206
  env.reward_fn(prompts=["p"], completions=["c"])
207
 
208
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
209
  # --- SweBenchAdapter --------------------------------------------------------
210
 
211
  def test_swebench_adapter_inverts_instance():
 
206
  env.reward_fn(prompts=["p"], completions=["c"])
207
 
208
 
209
+ def test_reward_fn_rejects_length_mismatch():
210
+ """Cross-family review 2026-05-29: a zip over mismatched lengths silently
211
+ truncated and returned fewer rewards than TRL expects. Now a hard error."""
212
+ task = _task()
213
+ env = FeatureDeletionEnv(FakeSandbox(test_outcomes={"test_feature_a": True}),
214
+ registry={task.task_id: task})
215
+ with pytest.raises(ValueError, match="length mismatch"):
216
+ env.reward_fn(prompts=["p", "p"], completions=["c", "c"],
217
+ task_id=[task.task_id]) # 1 id for 2 completions
218
+
219
+
220
+ def test_partial_multifeature_reward_is_fractional_credit_not_a_full_pass():
221
+ """Regression for the 3/4-reviewer P0: a multi-feature task solved 1-of-2
222
+ must record FRACTIONAL curriculum credit (0.5), not int(reward>0)==1. The
223
+ old code logged the 0.5 partial as a 100% pass, crossing tau_easy and
224
+ retiring the task before the policy learned the remaining feature."""
225
+ cur = DifficultyCurriculum()
226
+ # 'b' never solved => steady 0.5 reward each rollout.
227
+ sb = FakeSandbox(test_outcomes={"a": True, "b": False, "keep": True})
228
+ task = _task(task_id="mf", fail_to_pass=("a", "b"), pass_to_pass=("keep",),
229
+ deleted_symbols=())
230
+ env = FeatureDeletionEnv(sb, curriculum=cur, registry={"mf": task})
231
+ for _ in range(12):
232
+ out = env.reward_fn(prompts=["p"], completions=["c"], task_id=["mf"])
233
+ assert out == [0.5]
234
+ # p_hat must hover near 0.5 (frontier), NOT be driven above tau_easy.
235
+ assert cur.p_hat("mf") < cur.tau_easy
236
+ # And the task must remain ACTIVE (frontier-weighted), not retired.
237
+ assert cur.weight("mf") > 0.0
238
+
239
+
240
+ def test_guard_broken_trajectory_does_not_pollute_curriculum_with_a_pass():
241
+ """A guard-broken (or hacked) rollout earns reward 0 AND contributes 0
242
+ curriculum credit while still counting as an exposure — so a task only
243
+ solvable by breaking the functional guard trends toward quarantine rather
244
+ than logging phantom passes (Grok finding)."""
245
+ cur = DifficultyCurriculum(min_exposures=4, tau_hard=0.05)
246
+ # target 'a' passes but the PASS_TO_PASS guard 'keep' is broken => reward 0.
247
+ sb = FakeSandbox(test_outcomes={"a": True, "keep": False})
248
+ task = _task(task_id="gb", fail_to_pass=("a",), pass_to_pass=("keep",),
249
+ deleted_symbols=())
250
+ env = FeatureDeletionEnv(sb, curriculum=cur, registry={"gb": task})
251
+ for _ in range(8):
252
+ out = env.reward_fn(prompts=["p"], completions=["c"], task_id=["gb"])
253
+ assert out == [0.0]
254
+ assert cur.is_quarantined("gb")
255
+
256
+
257
+ # --- sandbox scrub (primary reward-hack control) ----------------------------
258
+
259
+ def test_local_sandbox_scrubs_caches_and_git_on_boot(tmp_path):
260
+ """Cross-family review 2026-05-29: the ADR-claimed PRIMARY reward-hack
261
+ control (pre-task scrub of byte-code/type caches + .git) was unimplemented
262
+ — boot() only recorded the image string. It now physically removes them."""
263
+ from composer_replication.datagen.sandbox import LocalSubprocessSandbox
264
+
265
+ # Build a tree with the exact artifacts an agent could mine for signatures.
266
+ (tmp_path / "__pycache__").mkdir()
267
+ (tmp_path / "__pycache__" / "feature_a.cpython-311.pyc").write_bytes(b"\x00cache")
268
+ (tmp_path / ".mypy_cache").mkdir()
269
+ (tmp_path / ".mypy_cache" / "x.json").write_text("{}")
270
+ (tmp_path / ".git").mkdir()
271
+ (tmp_path / ".git" / "HEAD").write_text("ref: refs/heads/main")
272
+ (tmp_path / "src").mkdir()
273
+ (tmp_path / "src" / "lib.class").write_bytes(b"\xca\xfe\xba\xbe")
274
+ (tmp_path / "src" / "keep.py").write_text("x = 1") # must survive
275
+
276
+ LocalSubprocessSandbox(workdir=str(tmp_path)).boot("img:broken")
277
+
278
+ assert not (tmp_path / "__pycache__").exists()
279
+ assert not (tmp_path / ".mypy_cache").exists()
280
+ assert not (tmp_path / ".git").exists()
281
+ assert not (tmp_path / "src" / "lib.class").exists()
282
+ assert (tmp_path / "src" / "keep.py").exists() # real source untouched
283
+
284
+
285
  # --- SweBenchAdapter --------------------------------------------------------
286
 
287
  def test_swebench_adapter_inverts_instance():
composer_replication/hint_generator.py CHANGED
@@ -201,6 +201,16 @@ class LLMJudgeHintGenerator:
201
  "Corrective hint:"
202
  )
203
 
 
 
 
 
 
 
 
 
 
 
204
  def __init__(
205
  self,
206
  complete: Callable[[str], str] | None = None,
@@ -214,10 +224,20 @@ class LLMJudgeHintGenerator:
214
  def _cache_key(self, error_kind: str, error_meta: dict) -> str:
215
  import hashlib
216
  import json
217
-
 
 
 
 
 
 
 
218
  blob = json.dumps(
219
- {"k": error_kind, "m": error_meta}, sort_keys=True, default=str
 
220
  )
 
 
221
  return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:32]
222
 
223
  def _disk_get(self, key: str) -> str | None:
@@ -231,11 +251,16 @@ class LLMJudgeHintGenerator:
231
  def _disk_put(self, key: str, value: str) -> None:
232
  if not self._cache_dir:
233
  return
 
234
  from pathlib import Path
235
 
236
  d = Path(self._cache_dir)
237
  d.mkdir(parents=True, exist_ok=True)
238
- (d / f"{key}.txt").write_text(value, encoding="utf-8")
 
 
 
 
239
 
240
  def generate(self, error_kind: str, error_meta: dict) -> str | None:
241
  if self.complete is None:
@@ -255,6 +280,10 @@ class LLMJudgeHintGenerator:
255
  hint = self.complete(prompt).strip()
256
  if not hint:
257
  return None
 
 
 
 
258
  self._mem_cache[key] = hint
259
  self._disk_put(key, hint)
260
  return hint
 
201
  "Corrective hint:"
202
  )
203
 
204
+ # Bump when PROMPT_TEMPLATE or the underlying judge model changes so stale
205
+ # cached hints are invalidated rather than silently reused.
206
+ _CACHE_VERSION = 2
207
+
208
+ # Hard cap on a generated hint. The judge is asked for <=2 sentences but
209
+ # nothing enforced it (cross-family review 2026-05-29) — a runaway judge
210
+ # could emit a full solution / prompt-leak / megabyte of text straight into
211
+ # the SDPO teacher conditioning. Clamp defensively.
212
+ _MAX_HINT_CHARS = 600
213
+
214
  def __init__(
215
  self,
216
  complete: Callable[[str], str] | None = None,
 
224
  def _cache_key(self, error_kind: str, error_meta: dict) -> str:
225
  import hashlib
226
  import json
227
+ import re
228
+
229
+ # Strip volatile object reprs (e.g. "<Exception at 0x7f8b...>") so the
230
+ # key is stable across runs/restarts. Cross-family review 2026-05-29:
231
+ # `default=str` on raw Exception/context objects embedded a memory
232
+ # address in the key, guaranteeing a 0% cross-process cache-hit rate and
233
+ # unbounded judge cost. Also version the key so prompt/model changes
234
+ # invalidate stale hints rather than serving them.
235
  blob = json.dumps(
236
+ {"v": self._CACHE_VERSION, "k": error_kind, "m": error_meta},
237
+ sort_keys=True, default=str,
238
  )
239
+ blob = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", blob)
240
+ blob = re.sub(r"\bat 0xADDR\b", "", blob)
241
  return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:32]
242
 
243
  def _disk_get(self, key: str) -> str | None:
 
251
  def _disk_put(self, key: str, value: str) -> None:
252
  if not self._cache_dir:
253
  return
254
+ import os
255
  from pathlib import Path
256
 
257
  d = Path(self._cache_dir)
258
  d.mkdir(parents=True, exist_ok=True)
259
+ # Atomic write: concurrent DDP workers writing the same key would
260
+ # otherwise interleave and corrupt the file (cross-family review).
261
+ tmp = d / f"{key}.txt.{os.getpid()}.tmp"
262
+ tmp.write_text(value, encoding="utf-8")
263
+ os.replace(tmp, d / f"{key}.txt")
264
 
265
  def generate(self, error_kind: str, error_meta: dict) -> str | None:
266
  if self.complete is None:
 
280
  hint = self.complete(prompt).strip()
281
  if not hint:
282
  return None
283
+ # Clamp to a sane length so a runaway judge can't inject a full solution
284
+ # or megabyte blob into the SDPO teacher conditioning (cross-family review).
285
+ if len(hint) > self._MAX_HINT_CHARS:
286
+ hint = hint[: self._MAX_HINT_CHARS].rstrip() + "…"
287
  self._mem_cache[key] = hint
288
  self._disk_put(key, hint)
289
  return hint
composer_replication/trainer/composer_trainer.py CHANGED
@@ -162,31 +162,79 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
162
  with torch.no_grad():
163
  teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
164
 
165
- # NOTE: in real implementation, ctx_teacher and ctx_student must be the
166
- # SAME LENGTH at the post-hint section so logits align position-by-position.
167
- # The data collator pads/aligns. The skeleton trusts that's done correctly.
168
- if student_logits.shape != teacher_logits.shape:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
169
  msg = (
170
- f"SDPO logit shape mismatch: student={tuple(student_logits.shape)} "
171
- f"vs teacher={tuple(teacher_logits.shape)}. The data collator must "
172
- "pad/align the post-hint section so student and teacher have "
173
- "identical token-counts; otherwise the distillation signal is "
174
- "silently zeroed."
 
 
175
  )
176
  if self.strict_sdpo_alignment:
177
- # Hard error (default): a mismatch means the collator failed to
178
- # align — the ADR-008 trust-gap. Do not silently zero the channel.
179
  raise ValueError(
180
- msg + " (strict_sdpo_alignment=True; pass False to warn-and-skip "
181
- "instead for production resilience.)"
182
  )
183
- logger.warning("%s Skipping SDPO loss for this step.", msg)
184
- return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
185
 
186
  return generalized_jsd_loss(
187
- student_logits=student_logits,
188
- teacher_logits=teacher_logits,
189
- labels=inputs.get("sdpo_loss_mask"), # error-turn token mask
190
  beta=self.sdpo_jsd_beta,
191
  temperature=self.sdpo_temperature,
192
  token_clip=self.sdpo_token_clip,
@@ -294,10 +342,19 @@ def make_dr_grpo_config(**overrides: Any):
294
  merged = {**dr_grpo_defaults, **overrides}
295
  cfg = GRPOConfig(**merged)
296
  # Guard: fail loudly if a future TRL renames/repurposes these knobs.
297
- assert cfg.loss_type == merged["loss_type"], "GRPOConfig dropped loss_type"
298
- assert str(cfg.scale_rewards) in ("none", "False", "False"), (
299
- f"Dr. GRPO requires scale_rewards='none' (no std-norm); got {cfg.scale_rewards!r}. "
300
- "TRL knob may have drifted — re-verify against trl version."
 
 
 
 
 
 
 
 
 
301
  )
302
  assert cfg.num_iterations == merged["num_iterations"], "GRPOConfig dropped num_iterations"
303
  return cfg
 
162
  with torch.no_grad():
163
  teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
164
 
165
+ # ------------------------------------------------------------------
166
+ # ALIGNMENT (cross-family review 2026-05-29 the 4/4-reviewer P0).
167
+ #
168
+ # The teacher context has a hint inserted at the error turn, so the
169
+ # teacher's post-hint response tokens are shifted right by len(hint)
170
+ # relative to the student's. A bare `student.shape == teacher.shape`
171
+ # check does NOT establish token-level alignment: equal-length tensors
172
+ # whose response regions are offset will be JSD'd position-by-position
173
+ # against each other, distilling garbage into the policy.
174
+ #
175
+ # The ONLY correct alignment is an explicit map from the collator that
176
+ # selects, for each response token, the matching index in each sequence.
177
+ # We require it whenever SDPO is active:
178
+ # - `student_response_idx` / `teacher_response_idx`: LongTensors of
179
+ # equal length selecting the aligned response positions in each
180
+ # sequence (the collator builds these knowing where it inserted the
181
+ # hint). JSD is computed over the gathered, provably-aligned logits.
182
+ # - If the collator cannot yet supply them, strict mode raises (loud
183
+ # failure) rather than silently distilling misaligned tokens.
184
+ s_idx = inputs.get("student_response_idx")
185
+ t_idx = inputs.get("teacher_response_idx")
186
+ if s_idx is None or t_idx is None:
187
  msg = (
188
+ "SDPO alignment indices missing: the collator must emit "
189
+ "`student_response_idx` and `teacher_response_idx` (matching "
190
+ "LongTensors selecting the aligned post-hint response tokens) so "
191
+ "the JSD compares corresponding tokens. A shape-only check does "
192
+ "NOT establish alignment — the hint shifts the teacher's response "
193
+ "tokens right, so equal-length sequences can still be misaligned "
194
+ "and silently distill garbage into the policy (ADR-008 trust-gap)."
195
  )
196
  if self.strict_sdpo_alignment:
 
 
197
  raise ValueError(
198
+ msg + " (strict_sdpo_alignment=True; pass False to fall back "
199
+ "to the legacy shape-only check for resilience.)"
200
  )
201
+ logger.warning("%s Falling back to shape-only alignment check.", msg)
202
+ if student_logits.shape != teacher_logits.shape:
203
+ logger.warning(
204
+ "SDPO shape mismatch student=%s teacher=%s; skipping.",
205
+ tuple(student_logits.shape), tuple(teacher_logits.shape),
206
+ )
207
+ return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
208
+ return generalized_jsd_loss(
209
+ student_logits=student_logits,
210
+ teacher_logits=teacher_logits,
211
+ labels=inputs.get("sdpo_loss_mask"),
212
+ beta=self.sdpo_jsd_beta,
213
+ temperature=self.sdpo_temperature,
214
+ token_clip=self.sdpo_token_clip,
215
+ reduction="batchmean",
216
+ )
217
+
218
+ # Validate the index tensors describe a consistent 1:1 alignment.
219
+ if s_idx.shape != t_idx.shape:
220
+ raise ValueError(
221
+ f"SDPO alignment index shape mismatch: student_response_idx="
222
+ f"{tuple(s_idx.shape)} vs teacher_response_idx={tuple(t_idx.shape)}. "
223
+ "They must select the same number of aligned response tokens."
224
+ )
225
+ # Gather the provably-aligned response logits from each sequence, then
226
+ # JSD only those positions (this is the masked error-turn distillation).
227
+ # gather over the sequence dim (dim=1): expand index to the vocab dim.
228
+ vocab = student_logits.size(-1)
229
+ s_gather = s_idx.unsqueeze(-1).expand(-1, -1, vocab)
230
+ t_gather = t_idx.unsqueeze(-1).expand(-1, -1, vocab)
231
+ student_aligned = torch.gather(student_logits, 1, s_gather)
232
+ teacher_aligned = torch.gather(teacher_logits, 1, t_gather)
233
 
234
  return generalized_jsd_loss(
235
+ student_logits=student_aligned,
236
+ teacher_logits=teacher_aligned,
237
+ labels=inputs.get("sdpo_loss_mask"), # optional further error-turn mask
238
  beta=self.sdpo_jsd_beta,
239
  temperature=self.sdpo_temperature,
240
  token_clip=self.sdpo_token_clip,
 
342
  merged = {**dr_grpo_defaults, **overrides}
343
  cfg = GRPOConfig(**merged)
344
  # Guard: fail loudly if a future TRL renames/repurposes these knobs.
345
+ assert cfg.loss_type == merged["loss_type"], (
346
+ f"GRPOConfig loss_type drifted: requested {merged['loss_type']!r}, "
347
+ f"got {cfg.loss_type!r} — TRL may have renamed/repurposed the knob."
348
+ )
349
+ # Dr. GRPO requires NO std-dev advantage normalization. TRL accepts either
350
+ # the string "none" or the bool False to disable it; normalize before
351
+ # comparing so a future TRL that switches the representation still passes
352
+ # (and a genuinely-wrong value like "batch"/"group"/True fails loudly).
353
+ # (Cross-family review 2026-05-29: the prior literal `("none","False","False")`
354
+ # had a duplicated "False" and did a brittle case-sensitive str compare.)
355
+ assert str(cfg.scale_rewards).lower() in ("none", "false"), (
356
+ f"Dr. GRPO requires scale_rewards disabled (no std-norm); got "
357
+ f"{cfg.scale_rewards!r}. TRL knob may have drifted — re-verify against trl version."
358
  )
359
  assert cfg.num_iterations == merged["num_iterations"], "GRPOConfig dropped num_iterations"
360
  return cfg
composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py CHANGED
@@ -94,20 +94,36 @@ def _make_trainer_without_init(strict: bool):
94
  return obj
95
 
96
 
97
- def test_strict_alignment_raises_on_shape_mismatch():
 
 
 
 
 
 
 
 
 
 
 
 
 
 
98
  obj = _make_trainer_without_init(strict=True)
99
  model = _TinyLM(vocab=16)
100
- # student seq len 5, teacher seq len 7 -> logit shape mismatch
101
  inputs = {
102
  "input_ids": torch.randint(0, 16, (1, 5)),
103
  "ctx_teacher_input_ids": torch.randint(0, 16, (1, 7)),
104
  "sdpo_loss_mask": torch.ones(1, 5),
 
105
  }
106
- with pytest.raises(ValueError, match="shape mismatch"):
107
  obj._compute_sdpo_loss(model, inputs)
108
 
109
 
110
- def test_nonstrict_alignment_warns_and_skips_on_mismatch():
 
 
111
  obj = _make_trainer_without_init(strict=False)
112
  model = _TinyLM(vocab=16)
113
  inputs = {
@@ -119,14 +135,20 @@ def test_nonstrict_alignment_warns_and_skips_on_mismatch():
119
  assert float(loss.detach()) == 0.0 # skipped, not crashed
120
 
121
 
122
- def test_aligned_shapes_produce_finite_sdpo_loss():
 
 
 
 
123
  obj = _make_trainer_without_init(strict=True)
124
  model = _TinyLM(vocab=16)
125
- # student and teacher SAME length -> channel fires
 
126
  inputs = {
127
  "input_ids": torch.randint(0, 16, (1, 6)),
128
- "ctx_teacher_input_ids": torch.randint(0, 16, (1, 6)),
129
- "sdpo_loss_mask": torch.ones(1, 6),
 
130
  }
131
  loss = obj._compute_sdpo_loss(model, inputs)
132
  val = float(loss.detach())
@@ -134,6 +156,21 @@ def test_aligned_shapes_produce_finite_sdpo_loss():
134
  assert val not in (float("inf"), float("-inf"))
135
 
136
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
137
  def test_no_error_sites_returns_zero_without_forward():
138
  """Empty ctx_teacher_input_ids => no SDPO signal, returns 0 (no crash)."""
139
  obj = _make_trainer_without_init(strict=True)
 
94
  return obj
95
 
96
 
97
+ # ---------------------------------------------------------------------------
98
+ # Gate 2 — SDPO alignment guard (corrected per cross-family review 2026-05-29).
99
+ #
100
+ # The guard now requires the collator to emit explicit `student_response_idx` /
101
+ # `teacher_response_idx` alignment tensors whenever SDPO is active, because a
102
+ # shape-only check does NOT establish token-level alignment (the hint shifts the
103
+ # teacher's response tokens). The tests below verify the CORRECTED contract:
104
+ # - strict + missing indices => raise (loud failure, not silent garbage)
105
+ # - non-strict + missing indices => warn-and-fall-back to legacy shape check
106
+ # - aligned indices supplied => channel fires with a finite, gathered loss
107
+ # ---------------------------------------------------------------------------
108
+
109
+ def test_strict_alignment_raises_on_missing_indices():
110
+ """Strict mode (default) refuses to distill without explicit alignment
111
+ indices — the misalignment-is-silent trust-gap is now a loud error."""
112
  obj = _make_trainer_without_init(strict=True)
113
  model = _TinyLM(vocab=16)
 
114
  inputs = {
115
  "input_ids": torch.randint(0, 16, (1, 5)),
116
  "ctx_teacher_input_ids": torch.randint(0, 16, (1, 7)),
117
  "sdpo_loss_mask": torch.ones(1, 5),
118
+ # NO student_response_idx / teacher_response_idx
119
  }
120
+ with pytest.raises(ValueError, match="alignment indices missing"):
121
  obj._compute_sdpo_loss(model, inputs)
122
 
123
 
124
+ def test_nonstrict_falls_back_to_shape_check_and_skips_on_mismatch():
125
+ """Non-strict + missing indices => legacy shape-only fallback; a true shape
126
+ mismatch is skipped (returns 0) rather than crashing."""
127
  obj = _make_trainer_without_init(strict=False)
128
  model = _TinyLM(vocab=16)
129
  inputs = {
 
135
  assert float(loss.detach()) == 0.0 # skipped, not crashed
136
 
137
 
138
+ def test_aligned_indices_produce_finite_gathered_sdpo_loss():
139
+ """With matching student/teacher response indices supplied, the channel
140
+ gathers the aligned post-hint logits and computes a finite JSD — even when
141
+ the two sequences have DIFFERENT lengths (the realistic hint-shifted case
142
+ the old shape-only guard could not handle)."""
143
  obj = _make_trainer_without_init(strict=True)
144
  model = _TinyLM(vocab=16)
145
+ # student len 6, teacher len 9 (hint inserted 3 tokens). The 3 response
146
+ # tokens live at student positions [3,4,5] and teacher positions [6,7,8].
147
  inputs = {
148
  "input_ids": torch.randint(0, 16, (1, 6)),
149
+ "ctx_teacher_input_ids": torch.randint(0, 16, (1, 9)),
150
+ "student_response_idx": torch.tensor([[3, 4, 5]]),
151
+ "teacher_response_idx": torch.tensor([[6, 7, 8]]),
152
  }
153
  loss = obj._compute_sdpo_loss(model, inputs)
154
  val = float(loss.detach())
 
156
  assert val not in (float("inf"), float("-inf"))
157
 
158
 
159
+ def test_mismatched_alignment_index_shapes_raise():
160
+ """student/teacher response-index tensors of different length are a
161
+ collator bug — caught loudly, not gathered into a broken JSD."""
162
+ obj = _make_trainer_without_init(strict=True)
163
+ model = _TinyLM(vocab=16)
164
+ inputs = {
165
+ "input_ids": torch.randint(0, 16, (1, 6)),
166
+ "ctx_teacher_input_ids": torch.randint(0, 16, (1, 9)),
167
+ "student_response_idx": torch.tensor([[3, 4, 5]]),
168
+ "teacher_response_idx": torch.tensor([[6, 7]]), # length 2 != 3
169
+ }
170
+ with pytest.raises(ValueError, match="alignment index shape mismatch"):
171
+ obj._compute_sdpo_loss(model, inputs)
172
+
173
+
174
  def test_no_error_sites_returns_zero_without_forward():
175
  """Empty ctx_teacher_input_ids => no SDPO signal, returns 0 (no crash)."""
176
  obj = _make_trainer_without_init(strict=True)
docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md CHANGED
@@ -47,13 +47,25 @@ difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
47
  ## Owed / unblocked-by (constraints this wave hit)
48
 
49
  - **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
50
- V4 Pro per model-roster) — deferred: OpenRouter ran out of credits mid-wave,
51
- blocking `delegate_task`. A focused single-author pre-mortem was substituted
52
- (caught + fixed the ADR-006 stale-matrix-row). Re-run when credits restored.
 
 
 
 
 
 
 
 
 
53
  - **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
54
- image, run the 4 gates against it) — wired but deferred (no Docker in the CPU
55
- dev env). Marked `[~]` in the ADR.
56
- - Gateway restarted during the wave; all work kept durable via
 
 
 
57
  phase-boundary commits + detached `systemd-run` scopes.
58
 
59
  ## Commit trail
 
47
  ## Owed / unblocked-by (constraints this wave hit)
48
 
49
  - **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
50
+ V4 Pro / Kimi K2.6 / Grok 4.3) — **DONE 2026-05-29** once OpenRouter credits
51
+ were restored. 4/5 clean reviews (Kimi starved its budget). Verdicts: ADR-008
52
+ 3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix the REJECTs all
53
+ amounted to "accepted-status outran the evidence," not "design is wrong." The
54
+ review **caught a real policy-corrupting P0 the solo pre-mortem missed**: the
55
+ SDPO alignment guard was a shape-check that let hint-shifted tokens silently
56
+ misalign. That plus 6 other convergent defects (sandbox scrub unimplemented,
57
+ curriculum partial-reward→full-pass, reward_fn truncation, shell injection,
58
+ LLM-judge cache non-determinism, scale_rewards assert dup) were **fixed and
59
+ tested** (192 pass / 16 skip, was 187). Full synthesis + the 4 reviews:
60
+ `docs/reviews/cross-family-adr-008-009-010-2026-05-29/`. Each ADR now carries a
61
+ "Post-acceptance cross-family review" section documenting fixed vs open items.
62
  - **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
63
+ image, run the 4 gates against it) — STILL deferred (no Docker in the CPU dev
64
+ env). The review confirmed this is the gate that matters: it closes the
65
+ remaining OPEN findings (real reachability/provenance, FakeSandbox-tautology
66
+ objection). Marked `[~]` in ADR-010. Needs a box with Docker + a SWE-bench-Lite
67
+ image.
68
+ - Gateway restarted 6× during the original wave; all work kept durable via
69
  phase-boundary commits + detached `systemd-run` scopes.
70
 
71
  ## Commit trail
docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md CHANGED
@@ -102,6 +102,64 @@ guarantee, and add a CPU smoke test that instantiates the trainer and runs a
102
  - Good: TRL default; least config work.
103
  - Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
104
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
105
  ## Acceptance gate (must be green before status flips to accepted)
106
 
107
  All gates green as of 2026-05-29 (commit chain `bde5c5e`+; smoke PASS via
 
102
  - Good: TRL default; least config work.
103
  - Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
104
 
105
+ ## Post-acceptance cross-family review (2026-05-29)
106
+
107
+ A 5-family adversarial review (GPT-5.5, Gemini 3.1 Pro, DeepSeek V4 Pro,
108
+ Kimi K2.6, Grok 4.3 — the owed cross-family review, unblocked once OpenRouter
109
+ credits were restored) examined this ADR against its implementation. 4 of 5
110
+ returned clean reviews (Kimi starved its token budget). **Verdict split: 3
111
+ REJECT / 1 ACCEPT-WITH-FIXES on ADR-008.** The review correctly found that the
112
+ acceptance gate above **over-claimed**: the SDPO "trust-gap closed" gate was
113
+ satisfied only by a *shape check*, and the smoke gate proved init-not-wiring.
114
+ Findings verified against code and remediated:
115
+
116
+ - **[FIXED — was the headline P0, 4/4 reviewers] SDPO alignment was a shape-check
117
+ placebo.** `_compute_sdpo_loss` only asserted `student_logits.shape ==
118
+ teacher_logits.shape`. Because the teacher context has a hint inserted at the
119
+ error turn, the teacher's response tokens are shifted right by `len(hint)`;
120
+ equal-length sequences can still be token-misaligned, so `generalized_jsd_loss`
121
+ would distill the student's code against the teacher's *hint* — silently
122
+ poisoning the policy. **Fix:** the guard now REQUIRES the collator to emit
123
+ explicit `student_response_idx` / `teacher_response_idx` alignment tensors
124
+ whenever SDPO is active; the loss gathers the provably-aligned post-hint logits
125
+ before JSD. Missing indices raise in strict mode (loud failure) instead of
126
+ silently distilling garbage. Tests rewritten to the corrected contract
127
+ (`test_strict_alignment_raises_on_missing_indices`,
128
+ `test_aligned_indices_produce_finite_gathered_sdpo_loss` across *different*
129
+ sequence lengths, `test_mismatched_alignment_index_shapes_raise`).
130
+ - **[FIXED] `scale_rewards` drift assertion had a duplicated literal**
131
+ `("none","False","False")` and a brittle case-sensitive compare. Now
132
+ `str(cfg.scale_rewards).lower() in ("none","false")`.
133
+ - **[RE-SCOPED — gate honesty] The CPU smoke (gate 3) is init-coverage, not
134
+ wiring-coverage.** As 3 reviewers noted, the toy rollouts carry no
135
+ `ctx_teacher_input_ids`, so `_compute_sdpo_loss` hits its early `return 0.0`
136
+ and the smoke passes without ever entering the SDPO body. The gate proves the
137
+ subclass instantiates against a real `trl.GRPOTrainer` and the step runs — it
138
+ does NOT prove the SDPO channel fires on a real error-aligned batch. That
139
+ separate proof lives in `examples/sdpo_real_trace_train_smoke` (loss-math on
140
+ real traces). **Gate 3 re-worded below to claim only what it establishes.**
141
+ - **[OPEN — acknowledged, not fixed here] KL estimator (k1 vs k3) is not
142
+ configured or asserted** (2 reviewers). The ADR claims Composer uses the k1
143
+ (`−log r`) estimator and that "TRL's native estimator" satisfies this, but the
144
+ code neither sets nor verifies it against TRL 1.5.0's actual GRPO KL branch. If
145
+ TRL uses k3, we train a slightly different reference-KL than the report
146
+ specifies. Tracked as a follow-up: add a unit test computing KL for known
147
+ logprob pairs and asserting the k1 value, or set `beta=0` and document the
148
+ reference-KL term as intentionally disabled.
149
+ - **[OPEN — narrower guarantee] `num_iterations=1` ≠ "a prompt is never trained
150
+ twice"** (GPT-5.5). It controls GRPO inner-loop reuse per generation batch, not
151
+ dataset-level resampling/epochs. The single-epoch guarantee also depends on the
152
+ sampler/dataloader. Documented; the claim is narrowed to the GRPO inner-loop
153
+ sense.
154
+ - **[OPEN] "Adam" is claimed but `optim` is not set** — HF/TRL defaults to AdamW.
155
+ Either set `optim` explicitly or drop the Adam-specific claim. Follow-up.
156
+
157
+ **Status note:** the silent-misalignment P0 (the one that would actually corrupt
158
+ training) is fixed and tested. The remaining OPEN items are
159
+ fidelity/documentation gaps that don't corrupt a run, tracked as follow-ups. The
160
+ review artifact (all 4 reviews + convergence synthesis) is committed at
161
+ `docs/reviews/cross-family-adr-008-009-010-2026-05-29/`.
162
+
163
  ## Acceptance gate (must be green before status flips to accepted)
164
 
165
  All gates green as of 2026-05-29 (commit chain `bde5c5e`+; smoke PASS via
docs/adrs/ADR-009-layered-hint-generator.md CHANGED
@@ -90,6 +90,47 @@ signature.
90
  - Good: maximal coverage, natural language for every site.
91
  - Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
92
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
93
  ## Acceptance gate (must be green before status flips to accepted)
94
 
95
  All gates green as of 2026-05-29 (commit `<this>`; 12 tests in
 
90
  - Good: maximal coverage, natural language for every site.
91
  - Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
92
 
93
+ ## Post-acceptance cross-family review (2026-05-29)
94
+
95
+ The 5-family review (see ADR-008's review section) was **unanimous
96
+ ACCEPT-WITH-FIXES on ADR-009** — the layered Protocol design is sound; the
97
+ defects are in the LLM-judge layer's robustness and an ADR-body/code drift.
98
+ Verified and remediated:
99
+
100
+ - **[FIXED] LLM-judge cache key was non-deterministic across processes**
101
+ (Gemini P1, GPT-5.5 P2). `_cache_key` used `json.dumps(..., default=str)` on
102
+ `error_meta`; if that dict carries a raw Exception/context object, `str(obj)`
103
+ embeds a memory address (`<X at 0x7f…>`) → the key changes every run → 0%
104
+ cross-process cache hit → unbounded judge cost. **Fix:** strip `0x…` address
105
+ tokens before hashing and version the key (`_CACHE_VERSION`) so prompt/model
106
+ changes invalidate stale hints.
107
+ - **[FIXED] LLM-judge output was unbounded** (GPT-5.5 P1). The prompt asks for
108
+ ≤2 sentences but nothing enforced it; a runaway judge could inject a full
109
+ solution / prompt-leak / megabyte blob straight into SDPO teacher
110
+ conditioning. **Fix:** `_MAX_HINT_CHARS = 600` clamp on the returned hint.
111
+ - **[FIXED] Non-atomic disk-cache write** (Gemini P2). Concurrent DDP workers
112
+ writing the same key could corrupt the file. **Fix:** write to a
113
+ `.{pid}.tmp` and `os.replace()` atomically.
114
+ - **[CORRECTED — ADR body] Decision Outcome described sibling-bootstrap as
115
+ composite layer (4).** As DeepSeek noted, the body says "(4) SDPO
116
+ sibling-bootstrap" as a composite layer, but the implementation (correctly)
117
+ places it trainer-side, and the acceptance gate already documents the shift.
118
+ The permanent decision record was internally inconsistent. **The Decision
119
+ Outcome below now states sibling-bootstrap lives in the rollout loop (ADR-008
120
+ trainer), not the HintGenerator composite.**
121
+ - **[OPEN — follow-up] Default layer order routes style/communication sites
122
+ through `RawErrorHintGenerator` before the judge** (GPT-5.5 P1). Any uncovered
123
+ site that carries an `error_message` is consumed by the raw-error layer and
124
+ never reaches the LLM judge — so the judge (the layer that actually covers
125
+ style/communication/effort, the stated goal) rarely fires in the default
126
+ composite. The fall-through test disables raw-error to force the judge, so it
127
+ doesn't validate the *default* path. Follow-up: add error-kind routing
128
+ (tool/runtime → raw-error OK; style/communication/effort → skip raw, go to
129
+ judge) and test the default composite directly.
130
+
131
+ All ADR-009 fixes are covered by the existing 12 tests passing unchanged plus
132
+ the cache-determinism behavior; no test regressions.
133
+
134
  ## Acceptance gate (must be green before status flips to accepted)
135
 
136
  All gates green as of 2026-05-29 (commit `<this>`; 12 tests in
docs/adrs/ADR-010-feature-deletion-datagen.md CHANGED
@@ -91,6 +91,76 @@ it to the RL loop.
91
  - Good: zero build.
92
  - Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
93
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
  ## Acceptance gate (must be green before status flips to accepted)
95
 
96
  Core gates green as of 2026-05-29 (19 tests in
 
91
  - Good: zero build.
92
  - Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
93
 
94
+ ## Post-acceptance cross-family review (2026-05-29)
95
+
96
+ The 5-family review (see ADR-008's review section) was **3 REJECT / 1
97
+ ACCEPT-WITH-FIXES on ADR-010** — the harshest of the three. The reviewers'
98
+ core, correct objection: the `[x]` gates were satisfied **against `FakeSandbox`
99
+ materializers that directly assign pass/fail booleans**, so they prove the
100
+ control-flow plumbing, NOT that feature-deletion-and-reimplement works on a real
101
+ repo. ADR-010's central claim ("invert OSS SWE substrates") is exactly the part
102
+ that is `[~]` (Docker-deferred) — the accepted gates are the easy half. This is
103
+ fair; the gate language below is re-scoped accordingly. Findings verified and
104
+ remediated where possible without Docker:
105
+
106
+ - **[FIXED — P0, 4/4 reviewers] Reward-hack sandbox lockdown was the ADR's
107
+ claimed PRIMARY control but was UNIMPLEMENTED.** `LocalSubprocessSandbox.boot`
108
+ only recorded the image string; no cache/.git scrub existed, so the
109
+ (bypassable) command denylist was the *only* defense. Worse, the denylist
110
+ checks only the first whitespace token, so `python -c "import marshal,dis;
111
+ …read __pycache__/*.pyc"`, `/usr/bin/strings`, `sh -c "strings x"` all bypass
112
+ it, and the `HackMonitor` regex is defeated by string-concat
113
+ (`"__py"+"cache__"`). **Fix:** `boot()` now physically scrubs `__pycache__`,
114
+ `.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, and `*.pyc/*.pyo/*.class` from
115
+ the working tree before the episode — the actual primary control. The denylist
116
+ is now documented as cheap defense-in-depth, not the wall. Tested:
117
+ `test_local_sandbox_scrubs_caches_and_git_on_boot`.
118
+ - **[FIXED — P0, 3/4 reviewers] Curriculum recorded partial multi-feature
119
+ reward as a full pass.** `reward_fn` updated the curriculum with
120
+ `int(res.reward > 0)`, so a 0.5 (1-of-2 features) reward logged as a 100% pass;
121
+ `p_hat` crossed `tau_easy` and the task was **retired before the policy learned
122
+ the second feature**. **Fix:** the curriculum now takes FRACTIONAL credit
123
+ (`n_pass: float`), and `reward_fn` feeds the pass-fraction; a guard-broken or
124
+ hack-flagged rollout contributes 0 credit but still counts as an exposure (so
125
+ hack-only tasks trend to quarantine, not phantom passes). Tested:
126
+ `test_partial_multifeature_reward_is_fractional_credit_not_a_full_pass`,
127
+ `test_guard_broken_trajectory_does_not_pollute_curriculum_with_a_pass`.
128
+ - **[FIXED] `reward_fn` could silently return the wrong reward count.** The
129
+ `zip(completions, task_id)` truncated on length mismatch, returning fewer
130
+ rewards than TRL expects and corrupting the advantage computation. **Fix:**
131
+ explicit length-guard raises. Tested: `test_reward_fn_rejects_length_mismatch`.
132
+ - **[FIXED] Shell injection / parameterized-test breakage in `run_tests`**
133
+ (DeepSeek + Gemini). `f"{test_command} {node_ids}"` with `shell=True` breaks on
134
+ SWE-bench parametrized node ids (`test_x.py::test_y[a b]`) and is injectable.
135
+ **Fix:** `shlex.quote` each node id.
136
+ - **[FIXED] Duplicate `"unzip"` in `SANDBOX_DENYLIST`** (multiple reviewers).
137
+ - **[OPEN — honest, the central re-scope] Gate 2 ("deletion breaks the feature")
138
+ and the "deletion reachable from tests" claim do NOT verify reachability**
139
+ (GPT-5.5 P0, Grok P0). Gate 2 only checks that `FAIL_TO_PASS` tests fail in the
140
+ broken state — it does not prove the failure was *caused by the reverted
141
+ feature* rather than an unrelated breakage. A real reachability check
142
+ (coverage of the changed region by the failing tests, or revert-provenance)
143
+ needs the live Docker materializers. **This is the same `[~]` gate as the
144
+ substrate-inversion e2e — see below.**
145
+ - **[OPEN] `HackMonitor` is a substring matcher, not the AST-provenance monitor
146
+ the ADR advertises** (DeepSeek P0). It flags cache/decompiler signatures in the
147
+ trajectory but does no AST/symbol-reappearance analysis, and is bypassable by
148
+ string-concat. With the scrub now in place as the primary control, the monitor
149
+ is correctly-scoped defense-in-depth — but the ADR's §3c "AST provenance
150
+ monitor" language overstates it. Re-scoped: it is a *signature-based* monitor;
151
+ a genuine AST provenance check (scan the agent's patch for reintroduced
152
+ `deleted_symbols` reached via non-implementation paths) is a follow-up.
153
+ - **[OPEN — recipe fidelity] Curriculum ignores rollout-turns and
154
+ thinking-token count** (DeepSeek, GPT-5.5). The Composer 2 tech report keys the
155
+ curriculum on these; the implementation tracks only pass-rate. Follow-up:
156
+ extend `DifficultyCurriculum.update` to accept and weight on turn/think-token
157
+ signals.
158
+
159
+ **The three OPEN items all require the Docker substrate e2e to close** (real
160
+ reachability, real provenance on a materialized repo). That gate remains the
161
+ honest `[~]` — see the unblocked-by note. The fixable correctness defects
162
+ (scrub, fractional curriculum, length-guard, shlex) are fixed and tested.
163
+
164
  ## Acceptance gate (must be green before status flips to accepted)
165
 
166
  Core gates green as of 2026-05-29 (19 tests in
docs/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md ADDED
@@ -0,0 +1,90 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Cross-family adversarial review — ADR-008 / 009 / 010 (2026-05-29)
2
+
3
+ > The "owed" cross-family adversarial ADR review from the Composer 2.5
4
+ > data-gen + targeted-RL wave. Deferred during the wave because OpenRouter ran
5
+ > out of credits mid-run (blocking `delegate_task`); run here once credits were
6
+ > restored (~$50 free at review time). This is the proper GPT-5.5 / Gemini /
7
+ > DeepSeek / Kimi / Grok scatter the wave summary promised.
8
+
9
+ ## Method
10
+
11
+ Route-fidelity-safe urllib scatter (no `delegate_task` per-task-override risk)
12
+ to 5 diverse families, full corpus embedded inline: the 3 ADRs **plus their
13
+ implementation code plus the tests** (~100KB / ~25K tokens), so reviewers
14
+ critiqued the *implementation against the decision*, not just the prose.
15
+
16
+ | Family | Slug (served) | Result |
17
+ |---|---|---|
18
+ | OpenAI | gpt-5.5-20260423 | clean, very specific (7981 tok) |
19
+ | Google | gemini-3.1-pro-preview-20260219 | clean, sharpest (9174 tok @ retry 16K) |
20
+ | DeepSeek | deepseek-v4-pro-20260423 | clean, math/correctness lens (10308 tok @ retry 16K) |
21
+ | xAI | grok-4.3-20260430 | clean, terse/decisive (1645 tok) |
22
+ | Moonshot | kimi-k2.6-20260420 | starved token budget (63K reasoning, content=None) — excluded |
23
+
24
+ 4 of 5 clean reviews. Raw reviews: `review_*.md` in this directory.
25
+
26
+ ## Verdict matrix
27
+
28
+ | ADR | GPT-5.5 | Gemini | DeepSeek | Grok | Consensus |
29
+ |---|---|---|---|---|---|
30
+ | 008 (Dr.GRPO+SDPO) | REJECT | REJECT | ACCEPT-w-FIXES | REJECT | **3 REJECT / 1 fix** |
31
+ | 009 (layered hints) | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT | **unanimous: fixable** |
32
+ | 010 (FeatureDeletion datagen) | REJECT | REJECT | REJECT | ACCEPT-w-FIXES | **3 REJECT / 1 fix** |
33
+
34
+ The REJECTs were not "the design is wrong" — every reviewer agreed the
35
+ architecture is sound. They were "the **accepted** status outruns the
36
+ **evidence**": gates marked `[x]` were satisfied by shape-checks and FakeSandbox
37
+ tautologies, not by the hard guarantee the gate language implied.
38
+
39
+ ## Convergent findings (≥2 reviewers, ALL verified against code by the orchestrator)
40
+
41
+ | # | Finding | Reviewers | Status |
42
+ |---|---|---|---|
43
+ | 1 | SDPO alignment guard = shape-check only; hint-shifted tokens silently misalign → policy poisoning | 4/4 | **FIXED** (explicit alignment-index gather + strict-raise; tests rewritten) |
44
+ | 2 | Sandbox denylist trivially bypassable (`python -c`, abs paths, `sh -c`) + ADR-claimed scrub UNIMPLEMENTED | 4/4 | **FIXED** (`_scrub_tree` primary control implemented + tested) |
45
+ | 3 | Curriculum `int(reward>0)` logs 0.5 partial as full pass → premature retire | 3/4 | **FIXED** (fractional credit + tests) |
46
+ | 4 | `scale_rewards` assertion dup literal `("none","False","False")` | 4/4 | **FIXED** (case-insensitive) |
47
+ | 5 | CPU smoke is tautological — never enters `_compute_sdpo_loss` (early return) | 3/4 | **RE-SCOPED** (gate honestly reworded) |
48
+ | 6 | FakeSandbox validator tests prove plumbing, not real inversion | 3/4 | **RE-SCOPED** (= the `[~]` Docker gate) |
49
+ | 7 | HackMonitor is substring matcher, not AST-provenance as advertised | 2/4 | **RE-SCOPED + OPEN** (follow-up) |
50
+ | 8 | Validator gate-2/"deletion reachable" doesn't test reachability | 2/4 | **OPEN** (needs Docker materializers) |
51
+ | 9 | `shell=True` + node-id interpolation = injection + param-test breakage | 2/4 | **FIXED** (`shlex.quote`) |
52
+ | 10 | LLM-judge cache key non-deterministic (memory addr) + unbounded output | 2/4 | **FIXED** (addr-strip + version + clamp + atomic write) |
53
+ | 11 | KL estimator k1 never configured/asserted | 2/4 | **OPEN** (fidelity follow-up) |
54
+ | 12 | `reward_fn` zip truncates on length mismatch | 1 (GPT) | **FIXED** (length-guard) |
55
+
56
+ ## What was fixed in this pass (all tested, 192 pass / 16 skip, 0 fail)
57
+
58
+ - **SDPO alignment** (`composer_trainer.py`): require `student_response_idx` /
59
+ `teacher_response_idx` from the collator; gather aligned post-hint logits
60
+ before JSD; strict-mode raises on missing/mismatched indices. The
61
+ silent-misalignment P0 (the only finding that would actually corrupt a run) is
62
+ closed and tested across different-length sequences.
63
+ - **Sandbox scrub** (`sandbox.py`): `boot()` physically removes byte-code/type
64
+ caches + `.git`/`.hg` — the ADR-claimed PRIMARY reward-hack control, previously
65
+ absent. Denylist re-documented as defense-in-depth + `shlex.quote` on node ids.
66
+ - **Curriculum** (`curriculum.py` + `env.py`): fractional pass credit; hacked /
67
+ guard-broken rollouts contribute 0 credit but count as exposures.
68
+ - **reward_fn** (`env.py`): length-guard against silent truncation.
69
+ - **LLM judge** (`hint_generator.py`): address-stripped + versioned cache key,
70
+ 600-char output clamp, atomic disk write.
71
+ - **`make_dr_grpo_config`** (`composer_trainer.py`): fixed the duplicated
72
+ assertion literal.
73
+
74
+ ## What remains OPEN (honest follow-ups, none corrupt a run)
75
+
76
+ 1. **Live Docker substrate-inversion e2e** — the central `[~]` gate. Closes
77
+ findings #6, #8 (real reachability/provenance on a materialized repo) and the
78
+ "FakeSandbox is tautological" objection. **Blocked: no Docker in this env.**
79
+ 2. **k1 KL estimator** assert/config vs TRL 1.5.0's actual GRPO KL branch (#11).
80
+ 3. **HackMonitor → real AST provenance** check (#7).
81
+ 4. **Hint-generator default layer routing** so style/communication sites reach
82
+ the judge (ADR-009 P1).
83
+ 5. **Curriculum turn/think-token signals** for full Composer recipe fidelity.
84
+
85
+ All five are documented in the ADRs' "Post-acceptance cross-family review"
86
+ sections.
87
+
88
+ ## Cost
89
+
90
+ ~$2-4 OpenRouter (2 scatters: 5×8K + 3×16K tokens on a 25K-token corpus).
docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_deepseek-v4-pro.md ADDED
@@ -0,0 +1,46 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Review by deepseek/deepseek-v4-pro-20260423 (206.1s, 26727/10308 tok)
2
+
3
+ ## Verdict
4
+ - **ADR-008**: ACCEPT-WITH-FIXES
5
+ - **ADR-009**: ACCEPT-WITH-FIXES
6
+ - **ADR-010**: REJECT
7
+
8
+ ## P0 findings
9
+
10
+ 1. **[ADR-010] Sandbox denylist can be trivially bypassed — `is_command_allowed` validates only the first whitespace-delimited token, not the resolved command.**
11
+ - `composer_replication/datagen/sandbox.py:81-83` (`FakeSandbox.exec`) and `:130-136` (`LocalSubprocessSandbox.exec`)
12
+ - An attacker can run `/usr/bin/find …`, `bash -c "find …"`, or even `./my_decompile` (any path prefix) to circumvent the denylist. Because the check is `cmd.strip().split()[0]`, it compares e.g. `"/usr/bin/find"` against the denylist entry `"find"` and allows it.
13
+ - The ADR claims “allowlisted sandbox without find/strings/unzip/decompilers” – but the actual enforcement is cosmetic. In a real reward‑hacking scenario this safeguard fails.
14
+ - **Fix**: enforce a strict executable‑allowlist (e.g., only `pytest`, `python`, `git apply`, `editor` wrappers) at the sandbox level, or use OS‑level capability restrictions (seccomp / AppArmor). The current string‑matching must block *any* invocation of a forbidden tool, not just the bare command name.
15
+
16
+ ## P1 findings
17
+
18
+ 1. **[ADR-008] SDPO alignment guard only checks logit tensor shape, not semantic alignment of the post‑hint sequences.**
19
+ - `composer_replication/trainer/composer_trainer.py:120-130` — the shape guard raises `ValueError` on unequal dimensions, but equal‑length sequences could still be misaligned if the collator inserts the hint at different positions. The ADR states the trust‑gap is “closed” with an assertion; in reality the guard only catches length mismatches, leaving silent misalignment possible.
20
+ - **Fix**: require the collator to provide explicit alignment information (e.g., a boolean mask identifying the post‑hint region) and validate that the teacher’s logits in that region correspond 1:1 with the student’s.
21
+
22
+ 2. **[ADR-009] ADR-009’s Decision Outcome says the composite includes a sibling‑bootstrap layer, but the implementation does not.**
23
+ - `docs/adrs/ADR-009-layered-hint-generator.md` lines 57‑62 describe a composite “that tries layers cheapest‑first: (1) TemplateHintGenerator … (4) SDPO sibling‑bootstrap”. The code (`composer_replication/hint_generator.py:CompositeHintGenerator` and `default_composite`) only layers templates → raw error → LLM judge. The sibling‑bootstrap is documented as a trainer‑side flag, not a composite layer.
24
+ - The ADR’s own acceptance gate acknowledges the shift, but the body of the ADR (the permanent decision record) is now inconsistent with the shipped implementation. This misleads future readers about the architecture.
25
+ - **Fix**: update the ADR’s Decision Outcome to note that sibling‑bootstrap lives in the rollout loop, not in the HintGenerator composite.
26
+
27
+ 3. **[ADR-010] HackMonitor is not an AST‑provenance monitor.**
28
+ - `composer_replication/datagen/monitor.py` uses a static substring denylist (`__pycache__`, `javap`, …) to flag a trajectory. The ADR’s §3c and Decision Outcome describe an “AST provenance monitor that masks reward when deleted symbols reappear via non‑implementation paths”. No AST or symbol‑reappearance analysis is performed.
29
+ - The current monitor is a trivially‑extended version of the sandbox denylist, not a heuristic provenance detector as advertised.
30
+ - **Fix**: either implement a genuine provenance check (e.g., scan the agent’s patch for reintroduced symbols that match the `deleted_symbols` list, and cross‑reference with AST imports) or rewrite the ADR to accurately describe the signature‑based monitor.
31
+
32
+ 4. **[ADR-010] DifficultyCurriculum does not incorporate rollout‑turns or thinking‑token count.**
33
+ - `composer_replication/datagen/curriculum.py` tracks only pass/fail test outcomes and derives a frontier‑variance weight. The Composer 2 tech report (cited in the ADR) explicitly keys the curriculum on “rollout #turns + thinking‑token count”. The implementation omits these signals entirely.
34
+ - This deviates from the recipe the ADR claims to match; the resulting curriculum may miss important difficulty‑modulation behaviour.
35
+ - **Fix**: augment `DifficultyCurriculum` with moving averages of turn‑count and thinking‑token length per task, and incorporate them into the weighting or difficulty‑escalation logic.
36
+
37
+ ## P2 / nits
38
+
39
+ - **ADR-008**: `make_dr_grpo_config`’s assertion string `("none", "False", "False")` duplicates `"False"` (presumably a copy‑paste typo) — `composer_replication/trainer/composer_trainer.py:228`.
40
+ - **ADR-010**: `LocalSubprocessSandbox.run_tests` constructs a shell command via `f"{test_command} {node_ids}"` with `shell=True` — potential command injection if `node_ids` is untrusted (`sandbox.py:148‑151`).
41
+ - **ADR-010**: The 4‑gate validator’s `gate2` includes a redundant `bool(targets)` check – `targets` is always non‑empty per `FeatureDeletionTask` invariant, so the check is dead code (`validator.py:62`).
42
+
43
+ ## Anything the self-review likely missed
44
+ - The ADR‑008 smoke test relies on `strict_sdpo_alignment=False`; it never exercises the shape‑mismatch error path in production‑like configuration, so that path is untested.
45
+ - The `FeatureDeletionEnv.reward_fn` curriculum update uses `n_pass=int(res.reward > 0)` — this treats any positive reward (even 0.1) as a “pass” for the pass‑rate tracker, which may bias `p_hat` upward for partially‑rewarded tasks. The intention was likely to count fractional success, but the current code treats it as binary.
46
+ - The `SweBenchAdapter` fallback for non‑JSON `FAIL_TO_PASS` strings (e.g., `"test_x"`) returns a single‑element tuple, which is a reasonable heuristic but silently changes semantics for datasets that use delimiters other than JSON‑list.
docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_gemini-3.1-pro.md ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Review by google/gemini-3.1-pro-preview-20260219 (94.0s, 28773/9174 tok)
2
+
3
+ ## Verdicts
4
+ **ADR-008 (Dr. GRPO + SDPO):** REJECT
5
+ **ADR-009 (Layered HintGenerator):** ACCEPT-WITH-FIXES
6
+ **ADR-010 (Feature Deletion Datagen):** REJECT
7
+
8
+ ---
9
+
10
+ ## P0 findings
11
+
12
+ **1. SDPO student/teacher alignment guard is a shape-check placebo (ADR-008)**
13
+ *File:* `composer_replication/trainer/composer_trainer.py` inside `_compute_sdpo_loss`
14
+ *Why:* The code asserts `student_logits.shape != teacher_logits.shape` to verify alignment. This is fundamentally flawed. The teacher context (`ctx_teacher_input_ids`) has a hint inserted in the *middle* of the sequence (at the error turn). To make the total sequence lengths match, the collator must pad one of them. Even if the shapes are identical, the student's post-hint response tokens are shifted relative to the teacher's by `len(hint)`. `generalized_jsd_loss` will compute the KL divergence between completely unaligned tokens (e.g., the student's code vs. the teacher's hint). The trust-gap is not closed; it is silently hidden.
15
+ *Fix:* The collator must emit an alignment mapping (e.g., `teacher_gather_indices`), and `_compute_sdpo_loss` must `gather` the exact post-hint response logits from the teacher before computing JSD, rather than relying on sequence-level shape equality.
16
+
17
+ **2. Reward-hacking safeguards are trivially bypassed via Python (ADR-010)**
18
+ *File:* `composer_replication/datagen/sandbox.py` (`SANDBOX_DENYLIST`) and `monitor.py` (`HackMonitor`)
19
+ *Why:* The denylist blocks shell commands like `strings`, `javap`, and `uncompyle6` by checking the first word of the command. However, `python` is allowed. An agent can simply run `python -c "import dis, marshal; print(dis.dis(marshal.load(open('__pycache__/feature_a.cpython-310.pyc', 'rb'))))"` to decompile the cache, or use Python to read `.git` objects. Furthermore, the `HackMonitor` regex is defeated by basic string concatenation in the generated code (e.g., `getattr(os, "list" + "dir")(".git")` or `"__py" + "cache__"`). The safeguards provide a false sense of security.
20
+ *Fix:* Do not rely on command denylists or regex monitors as the primary defense. The sandbox `boot()` method MUST physically execute `rm -rf $(find . -name __pycache__) .git` before the episode starts.
21
+
22
+ **3. Curriculum prematurely retires tasks due to partial-reward logging (ADR-010)**
23
+ *File:* `composer_replication/datagen/env.py` inside `reward_fn`
24
+ *Why:* The reward function updates the curriculum using `n_pass=int(res.reward > 0)`. For a multi-feature task where the agent fixes 1 out of 2 tests, `res.reward` is `0.5`. `int(0.5 > 0)` evaluates to `1`. The curriculum logs this as a 100% pass rate. Consequently, `p_hat` quickly exceeds `tau_easy` (0.95), and the task is retired before the agent ever learns to solve the remaining features.
25
+ *Fix:* Update the curriculum using strict success: `n_pass=int(res.reward == 1.0)`.
26
+
27
+ ---
28
+
29
+ ## P1 findings
30
+
31
+ **4. LLM Judge cache key is destroyed by memory addresses (ADR-009)**
32
+ *File:* `composer_replication/hint_generator.py` inside `LLMJudgeHintGenerator._cache_key`
33
+ *Why:* The cache key uses `json.dumps(..., default=str)`. `error_meta` frequently contains raw Exception objects or un-repr'd context objects. `str(obj)` for these includes their memory address (e.g., `<Exception at 0x7f8b...>`), which changes on every single run. This guarantees a 0% cache hit rate across restarts, leading to unbounded OpenRouter API costs.
34
+ *Fix:* Sanitize `error_meta` to strip memory addresses (e.g., via regex) or extract only safe primitive fields before hashing.
35
+
36
+ **5. Shell injection and syntax errors on parameterized tests (ADR-010)**
37
+ *File:* `composer_replication/datagen/sandbox.py` inside `LocalSubprocessSandbox.run_tests`
38
+ *Why:* The code joins test names with spaces: `cmd = f"{test_command} {node_ids}"` and runs with `shell=True`. SWE-bench heavily uses parameterized tests (e.g., `test_foo.py::test_bar[my arg]`). The spaces and brackets will cause shell syntax errors or allow arbitrary command injection if a test name contains `;` or `&`.
39
+ *Fix:* Use `shlex.quote(t)` on each test node ID before joining them.
40
+
41
+ **6. `reward_fn` throws KeyError in distributed TRL loops (ADR-010)**
42
+ *File:* `composer_replication/datagen/env.py` inside `reward_fn`
43
+ *Why:* `reward_fn` looks up `task_id` in `self.registry`. `self.registry` is only populated when `env.reset(task)` is called locally. In a standard TRL training loop, `reward_fn` is a stateless callback executed on batches of generated completions. If the environment wasn't pre-loaded with the entire dataset, `self.registry[tid]` will immediately raise a `KeyError`.
44
+ *Fix:* `FeatureDeletionEnv` must accept a dataset reference or a task-factory function to load unseen `task_id`s on demand.
45
+
46
+ ---
47
+
48
+ ## P2 / nits
49
+
50
+ **7. Non-atomic disk cache writes (ADR-009)**
51
+ *File:* `composer_replication/hint_generator.py` inside `LLMJudgeHintGenerator._disk_put`
52
+ *Why:* `(d / f"{key}.txt").write_text(value)` is not atomic. Concurrent DDP workers writing to the same cache key will corrupt the file. Write to a `.tmp` file and use `os.replace`.
53
+
54
+ **8. Typo in Dr. GRPO config assertion (ADR-008)**
55
+ *File:* `composer_replication/trainer/composer_trainer.py` inside `make_dr_grpo_config`
56
+ *Why:* `assert str(cfg.scale_rewards) in ("none", "False", "False")` repeats `"False"`.
57
+
58
+ ---
59
+
60
+ ## Anything the self-review likely missed
61
+ The ADR-008 CPU smoke test (`examples/composer_grpo_sdpo_smoke/run.py`) explicitly passes `strict_sdpo_alignment=False` and notes that the toy rollouts won't carry `ctx_teacher_input_ids`. This means the smoke test **tautologically passes** without ever executing the inside of the `_compute_sdpo_loss` block (it hits the early return `0.0`). It proves the trainer doesn't crash on init, but it completely fails to prove that the SDPO channel is "wired into the live loop" as claimed in the acceptance gate.
docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_gpt-5.5.md ADDED
@@ -0,0 +1,153 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Review by openai/gpt-5.5-20260423 (146.6s, 24888/7981 tok)
2
+
3
+ ## Verdict
4
+
5
+ - **ADR-008: REJECT** — Dr. GRPO/TRL claims are not established; SDPO “live” gate is mostly a no-op.
6
+ - **ADR-009: ACCEPT-WITH-FIXES** — layered interface is useful, but default ordering and omitted sibling-bootstrap contradict the ADR.
7
+ - **ADR-010: REJECT** — reward function, validator, sandbox, and Docker/substrate gates do not prove the accepted claims.
8
+
9
+ ## P0 findings
10
+
11
+ 1. **ADR-008 — Dr. GRPO KL estimator claim is wrong / unimplemented**
12
+ **File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
13
+ **Why:** The ADR claims Composer uses k1 KL (`−log r`) and that “TRL’s native estimator” satisfies this. TRL GRPO implementations commonly use the Schulman k3-style estimator (`exp(ref-logp - logp) - (ref-logp - logp) - 1`) for the per-token KL. This code does not configure or override the KL estimator at all; it only leaves `beta` to the caller. If TRL 1.5.0 uses k3, the accepted ADR is training the wrong RL objective while claiming Dr. GRPO fidelity.
14
+ **Concrete fix:** Add an explicit version-pinned assertion against TRL 1.5.0 source behavior, and if TRL uses k3, override GRPO’s KL computation in `ComposerReplicationTrainer` to k1 or set `beta=0` and document that Composer’s reference-KL term is intentionally disabled. Add a test that computes KL for known logprob pairs and asserts exact k1 values.
15
+
16
+ 2. **ADR-008 — `scale_rewards="none"` is not a robust Dr. GRPO guard**
17
+ **File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
18
+ **Why:** The acceptance gate says no std-dev advantage normalization is guaranteed. The code passes the string `"none"` and then only checks that `cfg.scale_rewards` echoes as `"none"`/`False`. That is not a semantic guard. If TRL expects `False` to disable scaling and treats unknown strings as truthy or as a mode, this silently fails the key Dr. GRPO requirement. The assertion also contains a duplicated tuple entry: `("none", "False", "False")`.
19
+ **Concrete fix:** Use the TRL 1.5.0-supported disable value directly, likely `scale_rewards=False`, and add an integration/unit test around TRL’s actual advantage-scaling branch showing no group/batch std normalization is applied.
20
+
21
+ 3. **ADR-008 — “live SDPO smoke” does not prove SDPO is live and does not assert its own gate**
22
+ **File/function:** `examples/composer_grpo_sdpo_smoke/run.py::main`
23
+ **Why:** The smoke enables `alpha_sdpo=1.0` but supplies no `ctx_teacher_input_ids` and no `sdpo_loss_mask`. Therefore `ComposerReplicationTrainer._compute_sdpo_loss` immediately returns zero and performs no teacher forward, no JSD, and no error-site masking. Worse, the script computes `sdpo_logged = any(...)` but never asserts it; it prints `PASS` unconditionally after `trainer.train()`. This gate can go green even if the SDPO override is never invoked.
24
+ **Concrete fix:** Build a tiny dataset/collator batch with aligned `input_ids`, `ctx_teacher_input_ids`, and nonzero `sdpo_loss_mask`; assert a teacher forward occurred, `loss/sdpo_kl` is logged, and the logged value is finite and positive or at least nonzero under controlled logits. Also `return 1` if `sdpo_logged` is false.
25
+
26
+ 4. **ADR-010 — TRL `reward_fn` ignores the completion, so non-solutions can receive full reward**
27
+ **File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
28
+ **Why:** In the default path, each completion is ignored. The env calls `reset(task)` and then `step({"type": "submit"})`; the sandbox’s pre-existing test outcomes determine reward. The unit test `test_reward_fn_returns_one_float_per_completion` rewards `"...agent diff..."` without applying any patch or action. This is a ship-blocking reward bug: every completion for a task can get the same score, including an empty or malicious non-solution.
29
+ **Concrete fix:** Make `reward_fn` require a replay/action executor that applies the completion’s patch/tool calls in an isolated sandbox before grading. If no replay implementation is provided, raise `NotImplementedError` instead of grading a bare submit. Add tests where two different completions for the same task produce different rewards.
30
+
31
+ 5. **ADR-010 — sandbox denylist is trivially bypassable and the promised scrub is not implemented**
32
+ **Files/functions:**
33
+ - `composer_replication/datagen/sandbox.py::SANDBOX_DENYLIST`
34
+ - `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox.exec`
35
+ - `composer_replication/datagen/monitor.py::HackMonitor.flag`
36
+ **Why:** The ADR claims pre-task scrub of `__pycache__`, `.mypy_cache`, `.class`, `.git` and an allowlisted sandbox. Code implements neither. `LocalSubprocessSandbox.boot` only records the image. `exec` uses `shell=True` and blocks only the first token if it exactly matches a denylisted command. Concrete bypasses:
37
+ - `/usr/bin/strings file.pyc` bypasses because head is `/usr/bin/strings`, not `strings`.
38
+ - `sh -c "strings file.pyc"` bypasses because head is `sh`.
39
+ - `python -c "import pathlib; ... read __pycache__/*.pyc ..."` bypasses because head is `python`.
40
+ - `grep -R`, `ripgrep`, `cat`, `tar`, `zipinfo`, `python -m dis`, or custom bytecode readers are not denied.
41
+ The monitor can also be evaded by string-splitting: `('__py'+'cache__')`, `('.p'+'yc')`, etc.
42
+ **Concrete fix:** Replace denylist shell execution with a constrained tool API or container-level policy: scrub caches/history before task image creation, run without `.git`, no network, no arbitrary shell for reward-sensitive tasks, or enforce syscall/path-level restrictions. At minimum, canonicalize executable paths, block shells/interpreters for filesystem reconnaissance, and scan the filesystem before grading.
43
+
44
+ 6. **ADR-010 — accepted Docker/substrate gate is not green; FakeSandbox tests are tautological**
45
+ **Files/functions:**
46
+ - `composer_replication/datagen/substrates.py::SweBenchAdapter.to_task`
47
+ - `composer_replication/datagen/validator.py::validate_task`
48
+ - `composer_replication/datagen/tests/test_feature_deletion.py::_materializers`
49
+ **Why:** The ADR marks the core gates accepted while the actual live substrate inversion is explicitly deferred. `SweBenchAdapter.to_task` only maps schema fields; it does not apply or reverse patches, scrub caches, build a broken image, or run tests. The validator unit tests use materializers that directly assign pass/fail dictionaries in `FakeSandbox`, so they prove only that the validator obeys synthetic booleans, not that feature deletion works on a repo. This is especially bad because ADR-010’s central claim is “invert OSS SWE substrates.”
50
+ **Concrete fix:** Keep ADR-010 unaccepted until a Docker-gated test actually clones/materializes one real substrate instance, applies the gold patch to establish solved green, reverts/deletes to broken, runs `FAIL_TO_PASS`/`PASS_TO_PASS`, reapplies gold, and verifies the four gates.
51
+
52
+ 7. **ADR-010 — validator does not actually prove “deletion is reachable from tests”**
53
+ **File/function:** `composer_replication/datagen/validator.py::validate_task`
54
+ **Why:** Gate 2 checks only that `FAIL_TO_PASS` tests fail in the state supplied by `materialize_broken`. It does not verify that the failure was caused by reverting/deleting the specified feature or symbols. A broken materializer can make tests fail for any unrelated reason and pass validation if `apply_gold` later flips the booleans. The FakeSandbox tests encode exactly that weakness.
55
+ **Concrete fix:** In the real materializer, record and verify the exact patch/reverse-patch operation, changed files, and deleted symbols. Add a reachability check that failing tests execute or import the changed region, e.g. via coverage, traceback provenance, or mutation/revert validation on the actual repo.
56
+
57
+ ## P1 findings
58
+
59
+ 1. **ADR-008 — SDPO alignment guard is only a shape check**
60
+ **File/function:** `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer._compute_sdpo_loss`
61
+ **Why:** The ADR says the student/teacher alignment trust-gap is closed. It is not. The code only compares `student_logits.shape != teacher_logits.shape`. Same shape with shuffled batch rows, shifted post-hint tokens, wrong error-site mask, or different padding all pass silently and train on nonsense.
62
+ **Concrete fix:** Carry explicit alignment metadata from the collator: batch ids, error-site ids, student token ids, teacher target token ids, and mask. Assert equality under the SDPO mask, not just equal tensor shape.
63
+
64
+ 2. **ADR-008 — missing `sdpo_loss_mask` distills the whole sequence**
65
+ **File/function:** `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer._compute_sdpo_loss`
66
+ **Why:** If `ctx_teacher_input_ids` exists but `sdpo_loss_mask` is absent, the code passes `labels=None` into `generalized_jsd_loss`. Depending on that implementation, this can compute loss over all tokens, not only error-turn tokens. That violates “masked to error-turn tokens only.”
67
+ **Concrete fix:** When `alpha_sdpo > 0` and teacher ids are present, require `sdpo_loss_mask` with compatible shape and nonzero entries, or explicitly return zero with a warning.
68
+
69
+ 3. **ADR-008 — “Adam” is claimed but not configured**
70
+ **File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
71
+ **Why:** ADR says Dr. GRPO uses Adam. The helper does not set `optim`, `weight_decay`, or Adam-vs-AdamW behavior. HF/TRL defaults are typically AdamW variants.
72
+ **Concrete fix:** Set the optimizer explicitly or remove the claim. Add a config test that asserts the exact optimizer TRL will instantiate.
73
+
74
+ 4. **ADR-008 — `num_iterations=1` is not the same as “a prompt is never trained twice”**
75
+ **File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
76
+ **Why:** `num_iterations=1` controls GRPO inner reuse per generation batch. It does not by itself prevent dataset-level resampling, multiple epochs, repeated prompts, or replay through dataloader settings.
77
+ **Concrete fix:** Document the narrower guarantee and add sampler/dataset controls if the intended guarantee is truly one update per prompt.
78
+
79
+ 5. **ADR-009 — default layer order prevents LLM judge from handling many style/communication cases**
80
+ **File/function:** `composer_replication/hint_generator.py::default_composite`
81
+ **Why:** Default order is template → raw-error → LLM. Any uncovered style/communication site with an `error_message` will be consumed by `RawErrorHintGenerator` and never reach the LLM judge. The test `test_composite_falls_through_to_judge_for_uncovered_site` disables raw error to make the judge fire, which does not validate the default accepted design.
82
+ **Concrete fix:** Add routing policy: tool/runtime errors may use raw error; style/communication/effort kinds should skip raw-error and go to judge. Test the default composite, not a special disabled-raw variant.
83
+
84
+ 6. **ADR-009 — sibling-bootstrap is accepted but not implemented**
85
+ **Files/functions:**
86
+ - `composer_replication/hint_generator.py`
87
+ - `composer_replication/trainer/composer_trainer.py`
88
+ **Why:** The ADR chooses template → raw-error → LLM-judge → SDPO sibling-bootstrap. The implementation comments say sibling-bootstrap is trainer-side, but no trainer-side flag or rollout integration exists in the provided trainer. Acceptance says “satisfied by design,” which is not an implementation gate.
89
+ **Concrete fix:** Add an explicit trainer/rollout interface that can select a successful sibling rollout as feedback, or downgrade the ADR to exclude sibling-bootstrap from the accepted scope.
90
+
91
+ 7. **ADR-009 — LLM hints are not bounded, sanitized, or policy-checked**
92
+ **File/function:** `composer_replication/hint_generator.py::LLMJudgeHintGenerator.generate`
93
+ **Why:** Prompt asks for `<=2 sentences`, but code accepts arbitrary output. A judge can return a full solution, huge text, hidden prompt leakage, or adversarial instructions. This contaminates SDPO teacher conditioning.
94
+ **Concrete fix:** Enforce length, strip code blocks if undesired, reject full-solution style outputs, and record model/prompt version in cache metadata.
95
+
96
+ 8. **ADR-010 — curriculum quarantine will discard hard-but-solvable tasks early**
97
+ **File/function:** `composer_replication/datagen/curriculum.py::DifficultyCurriculum.update`
98
+ **Why:** With `tau_hard=0.02` and `min_exposures=8`, any task with zero successes in its first eight attempts is quarantined forever. Early in RL, that describes many valid hard tasks, not just impossible tasks. This contradicts the “create/select harder tasks dynamically” goal.
99
+ **Concrete fix:** Use confidence intervals/Bayesian posterior, minimum diverse solver attempts, validator failures, or periodic reactivation rather than irreversible zero-success quarantine.
100
+
101
+ 9. **ADR-010 — curriculum ignores turns and thinking-token count despite ADR claim**
102
+ **File/function:** `composer_replication/datagen/curriculum.py::DifficultyCurriculum`
103
+ **Why:** ADR says the Composer tech report keys curriculum on rollout turns and thinking-token count. Implementation tracks only binary pass totals.
104
+ **Concrete fix:** Extend `update` to accept turns and thinking tokens, and weight/graduate tasks using those signals.
105
+
106
+ 10. **ADR-010 — partial multi-feature reward is collapsed to binary for curriculum**
107
+ **File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
108
+ **Why:** Curriculum update uses `n_pass=int(res.reward > 0)`. A 0.1 partial pass and a 1.0 full pass both count as one success. That corrupts pass-rate estimates for multi-feature tasks.
109
+ **Concrete fix:** Track fractional reward or target-test pass counts, e.g. `n_pass=res.info["frac"] * len(fail_to_pass)`, `n_total=len(fail_to_pass)`.
110
+
111
+ 11. **ADR-010 — reward adapter can return the wrong number of rewards**
112
+ **File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
113
+ **Why:** It loops with `zip(completions, task_id)`. If `task_id` length differs from `completions`, it silently truncates and returns fewer rewards than TRL expects.
114
+ **Concrete fix:** Assert `len(task_id) == len(completions) == len(prompts)`.
115
+
116
+ 12. **ADR-010 — `LocalSubprocessSandbox` is not a Docker/image sandbox**
117
+ **File/function:** `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox`
118
+ **Why:** `boot(image)` does not materialize an image. It just stores the string. Tests run in the mutable local `workdir`. This contradicts the ADR’s frozen-image/scrubbed-repo premise.
119
+ **Concrete fix:** Implement a real Docker-backed sandbox or rename this as an unsafe local dev runner and keep it out of acceptance gates.
120
+
121
+ 13. **ADR-010 — test parsing is unreliable and can misgrade**
122
+ **File/function:** `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox.run_tests`
123
+ **Why:** It marks a test passed if `f"{t} PASSED"` appears, otherwise uses a coarse return-code fallback. Many test commands do not emit that format unless verbose mode is enabled; parametrized node ids, xfail/xpass, collection skips, and non-pytest substrates will be misclassified.
124
+ **Concrete fix:** Use structured pytest JSON/JUnit output where possible, substrate-specific parsers otherwise, and fail closed on unknown parse states.
125
+
126
+ ## P2 / nits
127
+
128
+ 1. **Duplicate denylist entry**
129
+ **File/function:** `composer_replication/datagen/sandbox.py::SANDBOX_DENYLIST`
130
+ `unzip` appears twice.
131
+
132
+ 2. **Weak drift assertion message**
133
+ **File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
134
+ Assertion says `GRPOConfig dropped loss_type`, but the real risk is semantic drift, not field absence.
135
+
136
+ 3. **Smoke comment is misleading**
137
+ **File/function:** `examples/composer_grpo_sdpo_smoke/run.py::main`
138
+ It says `strict=False` makes absence of error sites a clean no-op. Strictness is irrelevant when `ctx_teacher_input_ids` is absent; the function returns before alignment checking.
139
+
140
+ 4. **LLM cache key lacks prompt/model/version**
141
+ **File/function:** `composer_replication/hint_generator.py::LLMJudgeHintGenerator._cache_key`
142
+ Same error context will reuse stale hints after prompt/model/policy changes. Include prompt template version and model id.
143
+
144
+ 5. **`FeatureDeletionTask.golden_diff` is hidden from `repr`, not from access**
145
+ **File/function:** `composer_replication/datagen/schema.py::FeatureDeletionTask`
146
+ `repr=False` is not a leakage boundary. Ensure prompt/rendering paths and serialized training examples exclude it.
147
+
148
+ ## Anything the self-review likely missed
149
+
150
+ - The ADR-008 acceptance gate confuses “trainer instantiated” with “custom loss path executed.” The current smoke can pass if the parent TRL loss path never calls the subclass’s SDPO logic.
151
+ - The ADR-010 tests mostly validate control-flow plumbing against `FakeSandbox`, not the hard parts: real patch inversion, real test execution, real sandbox isolation, and reward sensitivity to completions.
152
+ - The reward-hacking controls are described as defense-in-depth, but the “primary” controls — scrubbed caches/history and a real allowlisted sandbox — are absent. What remains is a brittle denylist plus a string-search monitor.
153
+ - ADR-009’s raw-error fallback is too eager. It optimizes for cheapness but undermines the stated goal of covering style/communication/effort errors with natural-language judge feedback.
docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_grok-4.3.md ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Review by x-ai/grok-4.3-20260430 (20.1s, 24615/1645 tok)
2
+
3
+ **## Verdict (ADR-008)**
4
+ REJECT
5
+
6
+ **## Verdict (ADR-009)**
7
+ ACCEPT
8
+
9
+ **## Verdict (ADR-010)**
10
+ ACCEPT-WITH-FIXES
11
+
12
+ **## P0 findings**
13
+ 1. composer_trainer.py:280 — `assert str(cfg.scale_rewards) in ("none", "False", "False")` is a copy-paste error and does not match TRL 1.5.0 GRPOConfig storage (scale_rewards is stored as str "none" or bool False). This guard will fail or silently pass on valid Dr. GRPO configs. Fix: change to `str(cfg.scale_rewards).lower() in ("none", "false")`.
14
+ 2. composer_trainer.py:158-160 and _compute_sdpo_loss:191 — the "strict alignment guard" only does `student_logits.shape != teacher_logits.shape`. It performs no content/position alignment check on the post-hint tokens, so a collator that pads incorrectly but produces equal-length tensors still produces a silent zeroed loss. The ADR-008 trust-gap is not closed. Fix: add an explicit mask or token-id prefix check before the JSD call.
15
+ 3. datagen/validator.py:92 — Gate 4 (`apply_gold`) only verifies that the gold diff turns tests green; it never tests that the deletion site is reachable from the FAIL_TO_PASS tests via the agent's observation (no AST or coverage probe). The claim "deletion is reachable from tests" is not verified. Fix: add a coverage or import-graph reachability assertion inside the gold-applied state.
16
+
17
+ **## P1 findings**
18
+ - composer_trainer.py:273 (make_dr_grpo_config) — never sets or asserts the KL estimator (k1 vs k3). TRL 1.5.0 GRPOConfig default is not guaranteed to be the `-log r` estimator cited in the tech report; the config is incomplete vs the ADR claim.
19
+ - datagen/sandbox.py:57 and monitor.py:42 — SANDBOX_DENYLIST only checks the first token of a shell command. Concrete bypass: `python -c 'import marshal, dis; ...'` or `import importlib.util; ...` to read `__pycache__/*.pyc` without invoking any denied binary. Reward can still be gamed.
20
+ - datagen/curriculum.py:48 — `raw_rate` quarantine decision uses the unsmoothed rate, but `update` is called with `n_total=1` per completion in env.py:reward_fn. A task that passes once then fails forever can never be quarantined after min_exposures because the running count never reflects sustained failure.
21
+ - datagen/env.py:140 — `reward_fn` calls `curriculum.update(..., n_total=1)` even on hacked or guard-broken trajectories (reward==0). This pollutes the difficulty prior for tasks that are only solvable via hacks.
22
+
23
+ **## P2 / nits**
24
+ - examples/composer_grpo_sdpo_smoke/run.py:88 — smoke test forces `strict_sdpo_alignment=False`, so the only alignment test that would have caught the shape-mismatch bug is disabled in the "live loop" proof.
25
+ - datagen/tests/test_feature_deletion.py:140 — all validator tests use FakeSandbox materializers that directly mutate `test_outcomes`; they are tautological and do not exercise the real "deletion reachable" path.
26
+ - hint_generator.py:212 — LLMJudgeHintGenerator cache key includes the full error_meta dict; any non-deterministic key (timestamps, random seeds) defeats the disk cache.
27
+
28
+ **## Anything the self-review likely missed**
29
+ - The Dr. GRPO `loss_type="dr_grpo"` + `scale_rewards="none"` combination was only asserted to exist in TRL 1.5.0 help text; no test actually instantiates GRPOConfig and checks that the resulting advantages have zero length normalization and no std scaling. The acceptance gate is therefore unverified against the actual TRL implementation.