Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
review+fix: cross-family adversarial ADR review (owed item) + remediation
Browse filesRan the owed 5-family adversarial review of ADR-008/009/010 (unblocked once
OpenRouter credits were restored). 4/5 clean reviews (GPT-5.5, Gemini 3.1 Pro,
DeepSeek V4 Pro, Grok 4.3; Kimi starved its token budget). Verdicts: ADR-008
3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix — all REJECTs were
"accepted-status outran the evidence," not "design wrong."
The review caught a real policy-corrupting P0 the solo pre-mortem missed: the
SDPO alignment guard was a shape-check, so hint-shifted teacher tokens could
silently misalign and distill garbage into the policy.
Verified every convergent finding against code, then fixed (192 pass/16 skip,
was 187):
- SDPO alignment (composer_trainer.py): require collator-emitted
student_response_idx/teacher_response_idx; gather aligned post-hint logits
before JSD; strict-mode raises on missing/mismatched indices. Tests rewritten
to the corrected contract (different-length sequences now exercised).
- Sandbox scrub (sandbox.py): boot() physically removes __pycache__/.mypy_cache/
.pytest_cache/.git/.hg + *.pyc/*.pyo/*.class — the ADR-claimed PRIMARY
reward-hack control, previously UNIMPLEMENTED. shlex.quote on test node ids.
- Curriculum (curriculum.py + env.py): fractional pass credit instead of
int(reward>0) (the partial-reward-logs-as-full-pass bug that prematurely
retired multi-feature tasks); hacked/guard-broken rollouts contribute 0 credit
but count as exposures.
- reward_fn (env.py): length-guard against silent zip-truncation.
- LLM judge (hint_generator.py): address-stripped + versioned cache key,
600-char output clamp, atomic disk write.
- make_dr_grpo_config: fixed duplicated assertion literal.
Added 5 regression tests. Each ADR now carries a "Post-acceptance cross-family
review" section documenting fixed vs open items. Open follow-ups (none corrupt a
run): Docker substrate e2e, k1-KL assert, AST-provenance monitor, hint default
routing, curriculum turn/think-token signals.
Artifacts: docs/reviews/cross-family-adr-008-009-010-2026-05-29/
- composer_replication/datagen/curriculum.py +11 -2
- composer_replication/datagen/env.py +19 -1
- composer_replication/datagen/sandbox.py +48 -2
- composer_replication/datagen/tests/test_feature_deletion.py +76 -0
- composer_replication/hint_generator.py +32 -3
- composer_replication/trainer/composer_trainer.py +79 -22
- composer_replication/trainer/tests/test_dr_grpo_config_and_alignment.py +45 -8
- docs/WAVE_COMPOSER_DATAGEN_RL_2026-05-29.md +18 -6
- docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md +58 -0
- docs/adrs/ADR-009-layered-hint-generator.md +41 -0
- docs/adrs/ADR-010-feature-deletion-datagen.md +70 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/SYNTHESIS.md +90 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_deepseek-v4-pro.md +46 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_gemini-3.1-pro.md +61 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_gpt-5.5.md +153 -0
- docs/reviews/cross-family-adr-008-009-010-2026-05-29/review_grok-4.3.md +29 -0
|
@@ -23,7 +23,7 @@ from dataclasses import dataclass, field
|
|
| 23 |
|
| 24 |
@dataclass
|
| 25 |
class _TaskStats:
|
| 26 |
-
n_pass:
|
| 27 |
n_total: int = 0
|
| 28 |
|
| 29 |
@property
|
|
@@ -48,7 +48,16 @@ class DifficultyCurriculum:
|
|
| 48 |
_stats: dict[str, _TaskStats] = field(default_factory=dict)
|
| 49 |
_quarantined: set[str] = field(default_factory=set)
|
| 50 |
|
| 51 |
-
def update(self, task_id: str, n_pass:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 52 |
st = self._stats.setdefault(task_id, _TaskStats())
|
| 53 |
st.n_pass += n_pass
|
| 54 |
st.n_total += n_total
|
|
|
|
| 23 |
|
| 24 |
@dataclass
|
| 25 |
class _TaskStats:
|
| 26 |
+
n_pass: float = 0.0
|
| 27 |
n_total: int = 0
|
| 28 |
|
| 29 |
@property
|
|
|
|
| 48 |
_stats: dict[str, _TaskStats] = field(default_factory=dict)
|
| 49 |
_quarantined: set[str] = field(default_factory=set)
|
| 50 |
|
| 51 |
+
def update(self, task_id: str, n_pass: float, n_total: int) -> None:
|
| 52 |
+
"""Record `n_pass` successes over `n_total` exposures.
|
| 53 |
+
|
| 54 |
+
`n_pass` is a FLOAT so multi-feature tasks can record fractional credit
|
| 55 |
+
(e.g. a completion that passes 1 of 2 FAIL_TO_PASS tests contributes 0.5,
|
| 56 |
+
not 1). Cross-family review 2026-05-29: the env previously passed
|
| 57 |
+
`int(reward > 0)`, which logged a 0.5 partial as a full pass and let
|
| 58 |
+
`p_hat` cross `tau_easy` so the task was retired before the policy ever
|
| 59 |
+
learned the remaining features.
|
| 60 |
+
"""
|
| 61 |
st = self._stats.setdefault(task_id, _TaskStats())
|
| 62 |
st.n_pass += n_pass
|
| 63 |
st.n_total += n_total
|
|
@@ -116,6 +116,14 @@ class FeatureDeletionEnv:
|
|
| 116 |
"reward_fn requires a `task_id` column (passed via the GRPO "
|
| 117 |
"dataset) to map each completion to its FeatureDeletionTask."
|
| 118 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 119 |
rewards: list[float] = []
|
| 120 |
for comp, tid in zip(completions, task_id):
|
| 121 |
task = self.registry[tid]
|
|
@@ -125,5 +133,15 @@ class FeatureDeletionEnv:
|
|
| 125 |
else:
|
| 126 |
res = self.step({"type": "submit"})
|
| 127 |
rewards.append(res.reward)
|
| 128 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
return rewards
|
|
|
|
| 116 |
"reward_fn requires a `task_id` column (passed via the GRPO "
|
| 117 |
"dataset) to map each completion to its FeatureDeletionTask."
|
| 118 |
)
|
| 119 |
+
# Length guard (cross-family review 2026-05-29): a `zip` over mismatched
|
| 120 |
+
# lengths silently truncates and returns fewer rewards than TRL expects,
|
| 121 |
+
# corrupting the advantage computation. Fail loudly instead.
|
| 122 |
+
if len(task_id) != len(completions):
|
| 123 |
+
raise ValueError(
|
| 124 |
+
f"reward_fn length mismatch: {len(completions)} completions vs "
|
| 125 |
+
f"{len(task_id)} task_ids. TRL expects one reward per completion."
|
| 126 |
+
)
|
| 127 |
rewards: list[float] = []
|
| 128 |
for comp, tid in zip(completions, task_id):
|
| 129 |
task = self.registry[tid]
|
|
|
|
| 133 |
else:
|
| 134 |
res = self.step({"type": "submit"})
|
| 135 |
rewards.append(res.reward)
|
| 136 |
+
# Curriculum update uses FRACTIONAL credit, not int(reward>0):
|
| 137 |
+
# - n_pass = pass-fraction over the FAIL_TO_PASS targets so a
|
| 138 |
+
# partial multi-feature solve doesn't read as a full pass and
|
| 139 |
+
# prematurely retire the task (the 3/4-reviewer P0).
|
| 140 |
+
# - a guard-broken or hack-flagged trajectory contributes 0 credit
|
| 141 |
+
# (reward is already 0) but STILL counts as an exposure, so a
|
| 142 |
+
# task only ever solvable via a hack trends toward quarantine
|
| 143 |
+
# rather than polluting the prior with phantom passes.
|
| 144 |
+
frac = float(res.info.get("frac", 0.0))
|
| 145 |
+
clean = bool(res.info.get("guard_ok", True)) and not bool(res.info.get("hacked", False))
|
| 146 |
+
self.curriculum.update(tid, n_pass=(frac if clean else 0.0), n_total=1)
|
| 147 |
return rewards
|
|
@@ -8,6 +8,9 @@ the substrate's frozen image and is exercised by the docker-gated substrate test
|
|
| 8 |
"""
|
| 9 |
from __future__ import annotations
|
| 10 |
|
|
|
|
|
|
|
|
|
|
| 11 |
import subprocess
|
| 12 |
from dataclasses import dataclass, field
|
| 13 |
from typing import Protocol, runtime_checkable
|
|
@@ -37,12 +40,19 @@ class TestRunResult:
|
|
| 37 |
# primary control is that __pycache__/.mypy_cache/.class are scrubbed pre-task.
|
| 38 |
SANDBOX_DENYLIST: frozenset[str] = frozenset(
|
| 39 |
{
|
| 40 |
-
"find", "strings", "unzip", "jar", "javap",
|
| 41 |
"procyon", "cfr", "jd-cli", "jadx", # Java decompilers
|
| 42 |
"uncompyle6", "decompyle3", # Python decompilers
|
| 43 |
"git", # .git is stripped; no history mining
|
| 44 |
}
|
| 45 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 46 |
|
| 47 |
|
| 48 |
@runtime_checkable
|
|
@@ -110,11 +120,44 @@ class LocalSubprocessSandbox:
|
|
| 110 |
_trajectory: list[dict] = field(default_factory=list)
|
| 111 |
booted_image: str | None = None
|
| 112 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
def boot(self, image: str) -> None:
|
| 114 |
self.booted_image = image
|
| 115 |
self._trajectory = []
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
|
| 117 |
def is_command_allowed(self, command: str) -> bool:
|
|
|
|
|
|
|
| 118 |
return command not in SANDBOX_DENYLIST
|
| 119 |
|
| 120 |
def exec(self, action: dict) -> str:
|
|
@@ -132,7 +175,10 @@ class LocalSubprocessSandbox:
|
|
| 132 |
|
| 133 |
def run_tests(self, test_command: str, tests: tuple[str, ...]) -> TestRunResult:
|
| 134 |
# Run pytest with explicit node ids; parse the summary line.
|
| 135 |
-
|
|
|
|
|
|
|
|
|
|
| 136 |
cmd = f"{test_command} {node_ids}"
|
| 137 |
proc = subprocess.run(
|
| 138 |
cmd, shell=True, cwd=self.workdir, capture_output=True, text=True, timeout=600
|
|
|
|
| 8 |
"""
|
| 9 |
from __future__ import annotations
|
| 10 |
|
| 11 |
+
import os
|
| 12 |
+
import shlex
|
| 13 |
+
import shutil
|
| 14 |
import subprocess
|
| 15 |
from dataclasses import dataclass, field
|
| 16 |
from typing import Protocol, runtime_checkable
|
|
|
|
| 40 |
# primary control is that __pycache__/.mypy_cache/.class are scrubbed pre-task.
|
| 41 |
SANDBOX_DENYLIST: frozenset[str] = frozenset(
|
| 42 |
{
|
| 43 |
+
"find", "strings", "unzip", "jar", "javap",
|
| 44 |
"procyon", "cfr", "jd-cli", "jadx", # Java decompilers
|
| 45 |
"uncompyle6", "decompyle3", # Python decompilers
|
| 46 |
"git", # .git is stripped; no history mining
|
| 47 |
}
|
| 48 |
)
|
| 49 |
+
# NOTE (cross-family review 2026-05-29): this denylist is INSUFFICIENT as a
|
| 50 |
+
# primary control. `is_command_allowed` checks only the first whitespace token,
|
| 51 |
+
# so `/usr/bin/find`, `sh -c "strings x"`, and especially `python -c "import
|
| 52 |
+
# marshal,dis; ..."` all bypass it. The ADR-claimed PRIMARY control is the
|
| 53 |
+
# pre-task cache/.git scrub in `boot()` — see `_scrub_tree` below, which is now
|
| 54 |
+
# implemented (was previously absent, making the denylist the only — broken —
|
| 55 |
+
# defense). The denylist remains as cheap defense-in-depth, not the wall.
|
| 56 |
|
| 57 |
|
| 58 |
@runtime_checkable
|
|
|
|
| 120 |
_trajectory: list[dict] = field(default_factory=list)
|
| 121 |
booted_image: str | None = None
|
| 122 |
|
| 123 |
+
# Cache/history artifacts that let an agent recover a deleted signature
|
| 124 |
+
# WITHOUT reimplementing it (the Composer-blog reward-hacks). Scrubbed at
|
| 125 |
+
# boot() so the denylist isn't the only — and bypassable — line of defense.
|
| 126 |
+
_SCRUB_NAMES: tuple[str, ...] = (
|
| 127 |
+
"__pycache__", ".mypy_cache", ".pytest_cache", ".git", ".hg",
|
| 128 |
+
)
|
| 129 |
+
_SCRUB_SUFFIXES: tuple[str, ...] = (".pyc", ".pyo", ".class")
|
| 130 |
+
|
| 131 |
def boot(self, image: str) -> None:
|
| 132 |
self.booted_image = image
|
| 133 |
self._trajectory = []
|
| 134 |
+
self._scrub_tree()
|
| 135 |
+
|
| 136 |
+
def _scrub_tree(self) -> None:
|
| 137 |
+
"""PRIMARY reward-hack control (ADR-010 §3): physically remove byte-code
|
| 138 |
+
caches, type-check caches, and VCS history from the working tree before
|
| 139 |
+
the episode starts, so there is no cached signature to recover. This is
|
| 140 |
+
the wall; the command denylist is only cheap defense-in-depth on top.
|
| 141 |
+
Cross-family review 2026-05-29 found this was previously UNIMPLEMENTED —
|
| 142 |
+
boot() only recorded the image string."""
|
| 143 |
+
if not self.workdir or not os.path.isdir(self.workdir):
|
| 144 |
+
return
|
| 145 |
+
for root, dirs, files in os.walk(self.workdir, topdown=True):
|
| 146 |
+
# Remove (and stop descending into) scrub-named directories.
|
| 147 |
+
for d in list(dirs):
|
| 148 |
+
if d in self._SCRUB_NAMES:
|
| 149 |
+
shutil.rmtree(os.path.join(root, d), ignore_errors=True)
|
| 150 |
+
dirs.remove(d)
|
| 151 |
+
for f in files:
|
| 152 |
+
if f.endswith(self._SCRUB_SUFFIXES):
|
| 153 |
+
try:
|
| 154 |
+
os.remove(os.path.join(root, f))
|
| 155 |
+
except OSError:
|
| 156 |
+
pass
|
| 157 |
|
| 158 |
def is_command_allowed(self, command: str) -> bool:
|
| 159 |
+
# NOTE: first-token-only check — see SANDBOX_DENYLIST comment. This is
|
| 160 |
+
# NOT a security boundary on its own; _scrub_tree() is the real control.
|
| 161 |
return command not in SANDBOX_DENYLIST
|
| 162 |
|
| 163 |
def exec(self, action: dict) -> str:
|
|
|
|
| 175 |
|
| 176 |
def run_tests(self, test_command: str, tests: tuple[str, ...]) -> TestRunResult:
|
| 177 |
# Run pytest with explicit node ids; parse the summary line.
|
| 178 |
+
# shlex.quote each node id — SWE-bench node ids contain spaces/brackets
|
| 179 |
+
# (parametrized tests like `test_x.py::test_y[a b]`) and could otherwise
|
| 180 |
+
# break the shell or inject commands (cross-family review 2026-05-29).
|
| 181 |
+
node_ids = " ".join(shlex.quote(t) for t in tests)
|
| 182 |
cmd = f"{test_command} {node_ids}"
|
| 183 |
proc = subprocess.run(
|
| 184 |
cmd, shell=True, cwd=self.workdir, capture_output=True, text=True, timeout=600
|
|
@@ -206,6 +206,82 @@ def test_reward_fn_requires_task_id():
|
|
| 206 |
env.reward_fn(prompts=["p"], completions=["c"])
|
| 207 |
|
| 208 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 209 |
# --- SweBenchAdapter --------------------------------------------------------
|
| 210 |
|
| 211 |
def test_swebench_adapter_inverts_instance():
|
|
|
|
| 206 |
env.reward_fn(prompts=["p"], completions=["c"])
|
| 207 |
|
| 208 |
|
| 209 |
+
def test_reward_fn_rejects_length_mismatch():
|
| 210 |
+
"""Cross-family review 2026-05-29: a zip over mismatched lengths silently
|
| 211 |
+
truncated and returned fewer rewards than TRL expects. Now a hard error."""
|
| 212 |
+
task = _task()
|
| 213 |
+
env = FeatureDeletionEnv(FakeSandbox(test_outcomes={"test_feature_a": True}),
|
| 214 |
+
registry={task.task_id: task})
|
| 215 |
+
with pytest.raises(ValueError, match="length mismatch"):
|
| 216 |
+
env.reward_fn(prompts=["p", "p"], completions=["c", "c"],
|
| 217 |
+
task_id=[task.task_id]) # 1 id for 2 completions
|
| 218 |
+
|
| 219 |
+
|
| 220 |
+
def test_partial_multifeature_reward_is_fractional_credit_not_a_full_pass():
|
| 221 |
+
"""Regression for the 3/4-reviewer P0: a multi-feature task solved 1-of-2
|
| 222 |
+
must record FRACTIONAL curriculum credit (0.5), not int(reward>0)==1. The
|
| 223 |
+
old code logged the 0.5 partial as a 100% pass, crossing tau_easy and
|
| 224 |
+
retiring the task before the policy learned the remaining feature."""
|
| 225 |
+
cur = DifficultyCurriculum()
|
| 226 |
+
# 'b' never solved => steady 0.5 reward each rollout.
|
| 227 |
+
sb = FakeSandbox(test_outcomes={"a": True, "b": False, "keep": True})
|
| 228 |
+
task = _task(task_id="mf", fail_to_pass=("a", "b"), pass_to_pass=("keep",),
|
| 229 |
+
deleted_symbols=())
|
| 230 |
+
env = FeatureDeletionEnv(sb, curriculum=cur, registry={"mf": task})
|
| 231 |
+
for _ in range(12):
|
| 232 |
+
out = env.reward_fn(prompts=["p"], completions=["c"], task_id=["mf"])
|
| 233 |
+
assert out == [0.5]
|
| 234 |
+
# p_hat must hover near 0.5 (frontier), NOT be driven above tau_easy.
|
| 235 |
+
assert cur.p_hat("mf") < cur.tau_easy
|
| 236 |
+
# And the task must remain ACTIVE (frontier-weighted), not retired.
|
| 237 |
+
assert cur.weight("mf") > 0.0
|
| 238 |
+
|
| 239 |
+
|
| 240 |
+
def test_guard_broken_trajectory_does_not_pollute_curriculum_with_a_pass():
|
| 241 |
+
"""A guard-broken (or hacked) rollout earns reward 0 AND contributes 0
|
| 242 |
+
curriculum credit while still counting as an exposure — so a task only
|
| 243 |
+
solvable by breaking the functional guard trends toward quarantine rather
|
| 244 |
+
than logging phantom passes (Grok finding)."""
|
| 245 |
+
cur = DifficultyCurriculum(min_exposures=4, tau_hard=0.05)
|
| 246 |
+
# target 'a' passes but the PASS_TO_PASS guard 'keep' is broken => reward 0.
|
| 247 |
+
sb = FakeSandbox(test_outcomes={"a": True, "keep": False})
|
| 248 |
+
task = _task(task_id="gb", fail_to_pass=("a",), pass_to_pass=("keep",),
|
| 249 |
+
deleted_symbols=())
|
| 250 |
+
env = FeatureDeletionEnv(sb, curriculum=cur, registry={"gb": task})
|
| 251 |
+
for _ in range(8):
|
| 252 |
+
out = env.reward_fn(prompts=["p"], completions=["c"], task_id=["gb"])
|
| 253 |
+
assert out == [0.0]
|
| 254 |
+
assert cur.is_quarantined("gb")
|
| 255 |
+
|
| 256 |
+
|
| 257 |
+
# --- sandbox scrub (primary reward-hack control) ----------------------------
|
| 258 |
+
|
| 259 |
+
def test_local_sandbox_scrubs_caches_and_git_on_boot(tmp_path):
|
| 260 |
+
"""Cross-family review 2026-05-29: the ADR-claimed PRIMARY reward-hack
|
| 261 |
+
control (pre-task scrub of byte-code/type caches + .git) was unimplemented
|
| 262 |
+
— boot() only recorded the image string. It now physically removes them."""
|
| 263 |
+
from composer_replication.datagen.sandbox import LocalSubprocessSandbox
|
| 264 |
+
|
| 265 |
+
# Build a tree with the exact artifacts an agent could mine for signatures.
|
| 266 |
+
(tmp_path / "__pycache__").mkdir()
|
| 267 |
+
(tmp_path / "__pycache__" / "feature_a.cpython-311.pyc").write_bytes(b"\x00cache")
|
| 268 |
+
(tmp_path / ".mypy_cache").mkdir()
|
| 269 |
+
(tmp_path / ".mypy_cache" / "x.json").write_text("{}")
|
| 270 |
+
(tmp_path / ".git").mkdir()
|
| 271 |
+
(tmp_path / ".git" / "HEAD").write_text("ref: refs/heads/main")
|
| 272 |
+
(tmp_path / "src").mkdir()
|
| 273 |
+
(tmp_path / "src" / "lib.class").write_bytes(b"\xca\xfe\xba\xbe")
|
| 274 |
+
(tmp_path / "src" / "keep.py").write_text("x = 1") # must survive
|
| 275 |
+
|
| 276 |
+
LocalSubprocessSandbox(workdir=str(tmp_path)).boot("img:broken")
|
| 277 |
+
|
| 278 |
+
assert not (tmp_path / "__pycache__").exists()
|
| 279 |
+
assert not (tmp_path / ".mypy_cache").exists()
|
| 280 |
+
assert not (tmp_path / ".git").exists()
|
| 281 |
+
assert not (tmp_path / "src" / "lib.class").exists()
|
| 282 |
+
assert (tmp_path / "src" / "keep.py").exists() # real source untouched
|
| 283 |
+
|
| 284 |
+
|
| 285 |
# --- SweBenchAdapter --------------------------------------------------------
|
| 286 |
|
| 287 |
def test_swebench_adapter_inverts_instance():
|
|
@@ -201,6 +201,16 @@ class LLMJudgeHintGenerator:
|
|
| 201 |
"Corrective hint:"
|
| 202 |
)
|
| 203 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 204 |
def __init__(
|
| 205 |
self,
|
| 206 |
complete: Callable[[str], str] | None = None,
|
|
@@ -214,10 +224,20 @@ class LLMJudgeHintGenerator:
|
|
| 214 |
def _cache_key(self, error_kind: str, error_meta: dict) -> str:
|
| 215 |
import hashlib
|
| 216 |
import json
|
| 217 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 218 |
blob = json.dumps(
|
| 219 |
-
{"k": error_kind, "m": error_meta},
|
|
|
|
| 220 |
)
|
|
|
|
|
|
|
| 221 |
return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:32]
|
| 222 |
|
| 223 |
def _disk_get(self, key: str) -> str | None:
|
|
@@ -231,11 +251,16 @@ class LLMJudgeHintGenerator:
|
|
| 231 |
def _disk_put(self, key: str, value: str) -> None:
|
| 232 |
if not self._cache_dir:
|
| 233 |
return
|
|
|
|
| 234 |
from pathlib import Path
|
| 235 |
|
| 236 |
d = Path(self._cache_dir)
|
| 237 |
d.mkdir(parents=True, exist_ok=True)
|
| 238 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 239 |
|
| 240 |
def generate(self, error_kind: str, error_meta: dict) -> str | None:
|
| 241 |
if self.complete is None:
|
|
@@ -255,6 +280,10 @@ class LLMJudgeHintGenerator:
|
|
| 255 |
hint = self.complete(prompt).strip()
|
| 256 |
if not hint:
|
| 257 |
return None
|
|
|
|
|
|
|
|
|
|
|
|
|
| 258 |
self._mem_cache[key] = hint
|
| 259 |
self._disk_put(key, hint)
|
| 260 |
return hint
|
|
|
|
| 201 |
"Corrective hint:"
|
| 202 |
)
|
| 203 |
|
| 204 |
+
# Bump when PROMPT_TEMPLATE or the underlying judge model changes so stale
|
| 205 |
+
# cached hints are invalidated rather than silently reused.
|
| 206 |
+
_CACHE_VERSION = 2
|
| 207 |
+
|
| 208 |
+
# Hard cap on a generated hint. The judge is asked for <=2 sentences but
|
| 209 |
+
# nothing enforced it (cross-family review 2026-05-29) — a runaway judge
|
| 210 |
+
# could emit a full solution / prompt-leak / megabyte of text straight into
|
| 211 |
+
# the SDPO teacher conditioning. Clamp defensively.
|
| 212 |
+
_MAX_HINT_CHARS = 600
|
| 213 |
+
|
| 214 |
def __init__(
|
| 215 |
self,
|
| 216 |
complete: Callable[[str], str] | None = None,
|
|
|
|
| 224 |
def _cache_key(self, error_kind: str, error_meta: dict) -> str:
|
| 225 |
import hashlib
|
| 226 |
import json
|
| 227 |
+
import re
|
| 228 |
+
|
| 229 |
+
# Strip volatile object reprs (e.g. "<Exception at 0x7f8b...>") so the
|
| 230 |
+
# key is stable across runs/restarts. Cross-family review 2026-05-29:
|
| 231 |
+
# `default=str` on raw Exception/context objects embedded a memory
|
| 232 |
+
# address in the key, guaranteeing a 0% cross-process cache-hit rate and
|
| 233 |
+
# unbounded judge cost. Also version the key so prompt/model changes
|
| 234 |
+
# invalidate stale hints rather than serving them.
|
| 235 |
blob = json.dumps(
|
| 236 |
+
{"v": self._CACHE_VERSION, "k": error_kind, "m": error_meta},
|
| 237 |
+
sort_keys=True, default=str,
|
| 238 |
)
|
| 239 |
+
blob = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", blob)
|
| 240 |
+
blob = re.sub(r"\bat 0xADDR\b", "", blob)
|
| 241 |
return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:32]
|
| 242 |
|
| 243 |
def _disk_get(self, key: str) -> str | None:
|
|
|
|
| 251 |
def _disk_put(self, key: str, value: str) -> None:
|
| 252 |
if not self._cache_dir:
|
| 253 |
return
|
| 254 |
+
import os
|
| 255 |
from pathlib import Path
|
| 256 |
|
| 257 |
d = Path(self._cache_dir)
|
| 258 |
d.mkdir(parents=True, exist_ok=True)
|
| 259 |
+
# Atomic write: concurrent DDP workers writing the same key would
|
| 260 |
+
# otherwise interleave and corrupt the file (cross-family review).
|
| 261 |
+
tmp = d / f"{key}.txt.{os.getpid()}.tmp"
|
| 262 |
+
tmp.write_text(value, encoding="utf-8")
|
| 263 |
+
os.replace(tmp, d / f"{key}.txt")
|
| 264 |
|
| 265 |
def generate(self, error_kind: str, error_meta: dict) -> str | None:
|
| 266 |
if self.complete is None:
|
|
|
|
| 280 |
hint = self.complete(prompt).strip()
|
| 281 |
if not hint:
|
| 282 |
return None
|
| 283 |
+
# Clamp to a sane length so a runaway judge can't inject a full solution
|
| 284 |
+
# or megabyte blob into the SDPO teacher conditioning (cross-family review).
|
| 285 |
+
if len(hint) > self._MAX_HINT_CHARS:
|
| 286 |
+
hint = hint[: self._MAX_HINT_CHARS].rstrip() + "…"
|
| 287 |
self._mem_cache[key] = hint
|
| 288 |
self._disk_put(key, hint)
|
| 289 |
return hint
|
|
@@ -162,31 +162,79 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 162 |
with torch.no_grad():
|
| 163 |
teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
|
| 164 |
|
| 165 |
-
#
|
| 166 |
-
#
|
| 167 |
-
#
|
| 168 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
msg = (
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
"
|
| 173 |
-
"
|
| 174 |
-
"
|
|
|
|
|
|
|
| 175 |
)
|
| 176 |
if self.strict_sdpo_alignment:
|
| 177 |
-
# Hard error (default): a mismatch means the collator failed to
|
| 178 |
-
# align — the ADR-008 trust-gap. Do not silently zero the channel.
|
| 179 |
raise ValueError(
|
| 180 |
-
msg + " (strict_sdpo_alignment=True; pass False to
|
| 181 |
-
"
|
| 182 |
)
|
| 183 |
-
logger.warning("%s
|
| 184 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 185 |
|
| 186 |
return generalized_jsd_loss(
|
| 187 |
-
student_logits=
|
| 188 |
-
teacher_logits=
|
| 189 |
-
labels=inputs.get("sdpo_loss_mask"), # error-turn
|
| 190 |
beta=self.sdpo_jsd_beta,
|
| 191 |
temperature=self.sdpo_temperature,
|
| 192 |
token_clip=self.sdpo_token_clip,
|
|
@@ -294,10 +342,19 @@ def make_dr_grpo_config(**overrides: Any):
|
|
| 294 |
merged = {**dr_grpo_defaults, **overrides}
|
| 295 |
cfg = GRPOConfig(**merged)
|
| 296 |
# Guard: fail loudly if a future TRL renames/repurposes these knobs.
|
| 297 |
-
assert cfg.loss_type == merged["loss_type"],
|
| 298 |
-
|
| 299 |
-
f"
|
| 300 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 301 |
)
|
| 302 |
assert cfg.num_iterations == merged["num_iterations"], "GRPOConfig dropped num_iterations"
|
| 303 |
return cfg
|
|
|
|
| 162 |
with torch.no_grad():
|
| 163 |
teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
|
| 164 |
|
| 165 |
+
# ------------------------------------------------------------------
|
| 166 |
+
# ALIGNMENT (cross-family review 2026-05-29 — the 4/4-reviewer P0).
|
| 167 |
+
#
|
| 168 |
+
# The teacher context has a hint inserted at the error turn, so the
|
| 169 |
+
# teacher's post-hint response tokens are shifted right by len(hint)
|
| 170 |
+
# relative to the student's. A bare `student.shape == teacher.shape`
|
| 171 |
+
# check does NOT establish token-level alignment: equal-length tensors
|
| 172 |
+
# whose response regions are offset will be JSD'd position-by-position
|
| 173 |
+
# against each other, distilling garbage into the policy.
|
| 174 |
+
#
|
| 175 |
+
# The ONLY correct alignment is an explicit map from the collator that
|
| 176 |
+
# selects, for each response token, the matching index in each sequence.
|
| 177 |
+
# We require it whenever SDPO is active:
|
| 178 |
+
# - `student_response_idx` / `teacher_response_idx`: LongTensors of
|
| 179 |
+
# equal length selecting the aligned response positions in each
|
| 180 |
+
# sequence (the collator builds these knowing where it inserted the
|
| 181 |
+
# hint). JSD is computed over the gathered, provably-aligned logits.
|
| 182 |
+
# - If the collator cannot yet supply them, strict mode raises (loud
|
| 183 |
+
# failure) rather than silently distilling misaligned tokens.
|
| 184 |
+
s_idx = inputs.get("student_response_idx")
|
| 185 |
+
t_idx = inputs.get("teacher_response_idx")
|
| 186 |
+
if s_idx is None or t_idx is None:
|
| 187 |
msg = (
|
| 188 |
+
"SDPO alignment indices missing: the collator must emit "
|
| 189 |
+
"`student_response_idx` and `teacher_response_idx` (matching "
|
| 190 |
+
"LongTensors selecting the aligned post-hint response tokens) so "
|
| 191 |
+
"the JSD compares corresponding tokens. A shape-only check does "
|
| 192 |
+
"NOT establish alignment — the hint shifts the teacher's response "
|
| 193 |
+
"tokens right, so equal-length sequences can still be misaligned "
|
| 194 |
+
"and silently distill garbage into the policy (ADR-008 trust-gap)."
|
| 195 |
)
|
| 196 |
if self.strict_sdpo_alignment:
|
|
|
|
|
|
|
| 197 |
raise ValueError(
|
| 198 |
+
msg + " (strict_sdpo_alignment=True; pass False to fall back "
|
| 199 |
+
"to the legacy shape-only check for resilience.)"
|
| 200 |
)
|
| 201 |
+
logger.warning("%s Falling back to shape-only alignment check.", msg)
|
| 202 |
+
if student_logits.shape != teacher_logits.shape:
|
| 203 |
+
logger.warning(
|
| 204 |
+
"SDPO shape mismatch student=%s teacher=%s; skipping.",
|
| 205 |
+
tuple(student_logits.shape), tuple(teacher_logits.shape),
|
| 206 |
+
)
|
| 207 |
+
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
|
| 208 |
+
return generalized_jsd_loss(
|
| 209 |
+
student_logits=student_logits,
|
| 210 |
+
teacher_logits=teacher_logits,
|
| 211 |
+
labels=inputs.get("sdpo_loss_mask"),
|
| 212 |
+
beta=self.sdpo_jsd_beta,
|
| 213 |
+
temperature=self.sdpo_temperature,
|
| 214 |
+
token_clip=self.sdpo_token_clip,
|
| 215 |
+
reduction="batchmean",
|
| 216 |
+
)
|
| 217 |
+
|
| 218 |
+
# Validate the index tensors describe a consistent 1:1 alignment.
|
| 219 |
+
if s_idx.shape != t_idx.shape:
|
| 220 |
+
raise ValueError(
|
| 221 |
+
f"SDPO alignment index shape mismatch: student_response_idx="
|
| 222 |
+
f"{tuple(s_idx.shape)} vs teacher_response_idx={tuple(t_idx.shape)}. "
|
| 223 |
+
"They must select the same number of aligned response tokens."
|
| 224 |
+
)
|
| 225 |
+
# Gather the provably-aligned response logits from each sequence, then
|
| 226 |
+
# JSD only those positions (this is the masked error-turn distillation).
|
| 227 |
+
# gather over the sequence dim (dim=1): expand index to the vocab dim.
|
| 228 |
+
vocab = student_logits.size(-1)
|
| 229 |
+
s_gather = s_idx.unsqueeze(-1).expand(-1, -1, vocab)
|
| 230 |
+
t_gather = t_idx.unsqueeze(-1).expand(-1, -1, vocab)
|
| 231 |
+
student_aligned = torch.gather(student_logits, 1, s_gather)
|
| 232 |
+
teacher_aligned = torch.gather(teacher_logits, 1, t_gather)
|
| 233 |
|
| 234 |
return generalized_jsd_loss(
|
| 235 |
+
student_logits=student_aligned,
|
| 236 |
+
teacher_logits=teacher_aligned,
|
| 237 |
+
labels=inputs.get("sdpo_loss_mask"), # optional further error-turn mask
|
| 238 |
beta=self.sdpo_jsd_beta,
|
| 239 |
temperature=self.sdpo_temperature,
|
| 240 |
token_clip=self.sdpo_token_clip,
|
|
|
|
| 342 |
merged = {**dr_grpo_defaults, **overrides}
|
| 343 |
cfg = GRPOConfig(**merged)
|
| 344 |
# Guard: fail loudly if a future TRL renames/repurposes these knobs.
|
| 345 |
+
assert cfg.loss_type == merged["loss_type"], (
|
| 346 |
+
f"GRPOConfig loss_type drifted: requested {merged['loss_type']!r}, "
|
| 347 |
+
f"got {cfg.loss_type!r} — TRL may have renamed/repurposed the knob."
|
| 348 |
+
)
|
| 349 |
+
# Dr. GRPO requires NO std-dev advantage normalization. TRL accepts either
|
| 350 |
+
# the string "none" or the bool False to disable it; normalize before
|
| 351 |
+
# comparing so a future TRL that switches the representation still passes
|
| 352 |
+
# (and a genuinely-wrong value like "batch"/"group"/True fails loudly).
|
| 353 |
+
# (Cross-family review 2026-05-29: the prior literal `("none","False","False")`
|
| 354 |
+
# had a duplicated "False" and did a brittle case-sensitive str compare.)
|
| 355 |
+
assert str(cfg.scale_rewards).lower() in ("none", "false"), (
|
| 356 |
+
f"Dr. GRPO requires scale_rewards disabled (no std-norm); got "
|
| 357 |
+
f"{cfg.scale_rewards!r}. TRL knob may have drifted — re-verify against trl version."
|
| 358 |
)
|
| 359 |
assert cfg.num_iterations == merged["num_iterations"], "GRPOConfig dropped num_iterations"
|
| 360 |
return cfg
|
|
@@ -94,20 +94,36 @@ def _make_trainer_without_init(strict: bool):
|
|
| 94 |
return obj
|
| 95 |
|
| 96 |
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
obj = _make_trainer_without_init(strict=True)
|
| 99 |
model = _TinyLM(vocab=16)
|
| 100 |
-
# student seq len 5, teacher seq len 7 -> logit shape mismatch
|
| 101 |
inputs = {
|
| 102 |
"input_ids": torch.randint(0, 16, (1, 5)),
|
| 103 |
"ctx_teacher_input_ids": torch.randint(0, 16, (1, 7)),
|
| 104 |
"sdpo_loss_mask": torch.ones(1, 5),
|
|
|
|
| 105 |
}
|
| 106 |
-
with pytest.raises(ValueError, match="
|
| 107 |
obj._compute_sdpo_loss(model, inputs)
|
| 108 |
|
| 109 |
|
| 110 |
-
def
|
|
|
|
|
|
|
| 111 |
obj = _make_trainer_without_init(strict=False)
|
| 112 |
model = _TinyLM(vocab=16)
|
| 113 |
inputs = {
|
|
@@ -119,14 +135,20 @@ def test_nonstrict_alignment_warns_and_skips_on_mismatch():
|
|
| 119 |
assert float(loss.detach()) == 0.0 # skipped, not crashed
|
| 120 |
|
| 121 |
|
| 122 |
-
def
|
|
|
|
|
|
|
|
|
|
|
|
|
| 123 |
obj = _make_trainer_without_init(strict=True)
|
| 124 |
model = _TinyLM(vocab=16)
|
| 125 |
-
# student
|
|
|
|
| 126 |
inputs = {
|
| 127 |
"input_ids": torch.randint(0, 16, (1, 6)),
|
| 128 |
-
"ctx_teacher_input_ids": torch.randint(0, 16, (1,
|
| 129 |
-
"
|
|
|
|
| 130 |
}
|
| 131 |
loss = obj._compute_sdpo_loss(model, inputs)
|
| 132 |
val = float(loss.detach())
|
|
@@ -134,6 +156,21 @@ def test_aligned_shapes_produce_finite_sdpo_loss():
|
|
| 134 |
assert val not in (float("inf"), float("-inf"))
|
| 135 |
|
| 136 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 137 |
def test_no_error_sites_returns_zero_without_forward():
|
| 138 |
"""Empty ctx_teacher_input_ids => no SDPO signal, returns 0 (no crash)."""
|
| 139 |
obj = _make_trainer_without_init(strict=True)
|
|
|
|
| 94 |
return obj
|
| 95 |
|
| 96 |
|
| 97 |
+
# ---------------------------------------------------------------------------
|
| 98 |
+
# Gate 2 — SDPO alignment guard (corrected per cross-family review 2026-05-29).
|
| 99 |
+
#
|
| 100 |
+
# The guard now requires the collator to emit explicit `student_response_idx` /
|
| 101 |
+
# `teacher_response_idx` alignment tensors whenever SDPO is active, because a
|
| 102 |
+
# shape-only check does NOT establish token-level alignment (the hint shifts the
|
| 103 |
+
# teacher's response tokens). The tests below verify the CORRECTED contract:
|
| 104 |
+
# - strict + missing indices => raise (loud failure, not silent garbage)
|
| 105 |
+
# - non-strict + missing indices => warn-and-fall-back to legacy shape check
|
| 106 |
+
# - aligned indices supplied => channel fires with a finite, gathered loss
|
| 107 |
+
# ---------------------------------------------------------------------------
|
| 108 |
+
|
| 109 |
+
def test_strict_alignment_raises_on_missing_indices():
|
| 110 |
+
"""Strict mode (default) refuses to distill without explicit alignment
|
| 111 |
+
indices — the misalignment-is-silent trust-gap is now a loud error."""
|
| 112 |
obj = _make_trainer_without_init(strict=True)
|
| 113 |
model = _TinyLM(vocab=16)
|
|
|
|
| 114 |
inputs = {
|
| 115 |
"input_ids": torch.randint(0, 16, (1, 5)),
|
| 116 |
"ctx_teacher_input_ids": torch.randint(0, 16, (1, 7)),
|
| 117 |
"sdpo_loss_mask": torch.ones(1, 5),
|
| 118 |
+
# NO student_response_idx / teacher_response_idx
|
| 119 |
}
|
| 120 |
+
with pytest.raises(ValueError, match="alignment indices missing"):
|
| 121 |
obj._compute_sdpo_loss(model, inputs)
|
| 122 |
|
| 123 |
|
| 124 |
+
def test_nonstrict_falls_back_to_shape_check_and_skips_on_mismatch():
|
| 125 |
+
"""Non-strict + missing indices => legacy shape-only fallback; a true shape
|
| 126 |
+
mismatch is skipped (returns 0) rather than crashing."""
|
| 127 |
obj = _make_trainer_without_init(strict=False)
|
| 128 |
model = _TinyLM(vocab=16)
|
| 129 |
inputs = {
|
|
|
|
| 135 |
assert float(loss.detach()) == 0.0 # skipped, not crashed
|
| 136 |
|
| 137 |
|
| 138 |
+
def test_aligned_indices_produce_finite_gathered_sdpo_loss():
|
| 139 |
+
"""With matching student/teacher response indices supplied, the channel
|
| 140 |
+
gathers the aligned post-hint logits and computes a finite JSD — even when
|
| 141 |
+
the two sequences have DIFFERENT lengths (the realistic hint-shifted case
|
| 142 |
+
the old shape-only guard could not handle)."""
|
| 143 |
obj = _make_trainer_without_init(strict=True)
|
| 144 |
model = _TinyLM(vocab=16)
|
| 145 |
+
# student len 6, teacher len 9 (hint inserted 3 tokens). The 3 response
|
| 146 |
+
# tokens live at student positions [3,4,5] and teacher positions [6,7,8].
|
| 147 |
inputs = {
|
| 148 |
"input_ids": torch.randint(0, 16, (1, 6)),
|
| 149 |
+
"ctx_teacher_input_ids": torch.randint(0, 16, (1, 9)),
|
| 150 |
+
"student_response_idx": torch.tensor([[3, 4, 5]]),
|
| 151 |
+
"teacher_response_idx": torch.tensor([[6, 7, 8]]),
|
| 152 |
}
|
| 153 |
loss = obj._compute_sdpo_loss(model, inputs)
|
| 154 |
val = float(loss.detach())
|
|
|
|
| 156 |
assert val not in (float("inf"), float("-inf"))
|
| 157 |
|
| 158 |
|
| 159 |
+
def test_mismatched_alignment_index_shapes_raise():
|
| 160 |
+
"""student/teacher response-index tensors of different length are a
|
| 161 |
+
collator bug — caught loudly, not gathered into a broken JSD."""
|
| 162 |
+
obj = _make_trainer_without_init(strict=True)
|
| 163 |
+
model = _TinyLM(vocab=16)
|
| 164 |
+
inputs = {
|
| 165 |
+
"input_ids": torch.randint(0, 16, (1, 6)),
|
| 166 |
+
"ctx_teacher_input_ids": torch.randint(0, 16, (1, 9)),
|
| 167 |
+
"student_response_idx": torch.tensor([[3, 4, 5]]),
|
| 168 |
+
"teacher_response_idx": torch.tensor([[6, 7]]), # length 2 != 3
|
| 169 |
+
}
|
| 170 |
+
with pytest.raises(ValueError, match="alignment index shape mismatch"):
|
| 171 |
+
obj._compute_sdpo_loss(model, inputs)
|
| 172 |
+
|
| 173 |
+
|
| 174 |
def test_no_error_sites_returns_zero_without_forward():
|
| 175 |
"""Empty ctx_teacher_input_ids => no SDPO signal, returns 0 (no crash)."""
|
| 176 |
obj = _make_trainer_without_init(strict=True)
|
|
@@ -47,13 +47,25 @@ difficulty curriculum. `SweBenchAdapter` makes any SWE-bench-shaped dataset
|
|
| 47 |
## Owed / unblocked-by (constraints this wave hit)
|
| 48 |
|
| 49 |
- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
|
| 50 |
-
V4 Pro
|
| 51 |
-
|
| 52 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
|
| 54 |
-
image, run the 4 gates against it) —
|
| 55 |
-
|
| 56 |
-
|
|
|
|
|
|
|
|
|
|
| 57 |
phase-boundary commits + detached `systemd-run` scopes.
|
| 58 |
|
| 59 |
## Commit trail
|
|
|
|
| 47 |
## Owed / unblocked-by (constraints this wave hit)
|
| 48 |
|
| 49 |
- **Cross-family adversarial ADR review** (GPT-5.5 / Gemini 3.1 Pro / DeepSeek
|
| 50 |
+
V4 Pro / Kimi K2.6 / Grok 4.3) — ✅ **DONE 2026-05-29** once OpenRouter credits
|
| 51 |
+
were restored. 4/5 clean reviews (Kimi starved its budget). Verdicts: ADR-008
|
| 52 |
+
3-REJECT/1-fix, ADR-009 unanimous-fix, ADR-010 3-REJECT/1-fix — the REJECTs all
|
| 53 |
+
amounted to "accepted-status outran the evidence," not "design is wrong." The
|
| 54 |
+
review **caught a real policy-corrupting P0 the solo pre-mortem missed**: the
|
| 55 |
+
SDPO alignment guard was a shape-check that let hint-shifted tokens silently
|
| 56 |
+
misalign. That plus 6 other convergent defects (sandbox scrub unimplemented,
|
| 57 |
+
curriculum partial-reward→full-pass, reward_fn truncation, shell injection,
|
| 58 |
+
LLM-judge cache non-determinism, scale_rewards assert dup) were **fixed and
|
| 59 |
+
tested** (192 pass / 16 skip, was 187). Full synthesis + the 4 reviews:
|
| 60 |
+
`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`. Each ADR now carries a
|
| 61 |
+
"Post-acceptance cross-family review" section documenting fixed vs open items.
|
| 62 |
- **Live Docker substrate-inversion e2e** for ADR-010 (pull one SWE-bench-Lite
|
| 63 |
+
image, run the 4 gates against it) — STILL deferred (no Docker in the CPU dev
|
| 64 |
+
env). The review confirmed this is the gate that matters: it closes the
|
| 65 |
+
remaining OPEN findings (real reachability/provenance, FakeSandbox-tautology
|
| 66 |
+
objection). Marked `[~]` in ADR-010. Needs a box with Docker + a SWE-bench-Lite
|
| 67 |
+
image.
|
| 68 |
+
- Gateway restarted 6× during the original wave; all work kept durable via
|
| 69 |
phase-boundary commits + detached `systemd-run` scopes.
|
| 70 |
|
| 71 |
## Commit trail
|
|
@@ -102,6 +102,64 @@ guarantee, and add a CPU smoke test that instantiates the trainer and runs a
|
|
| 102 |
- Good: TRL default; least config work.
|
| 103 |
- Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
|
| 104 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
## Acceptance gate (must be green before status flips to accepted)
|
| 106 |
|
| 107 |
All gates green as of 2026-05-29 (commit chain `bde5c5e`+; smoke PASS via
|
|
|
|
| 102 |
- Good: TRL default; least config work.
|
| 103 |
- Bad: std-dev advantage normalization "massively upweights small behavioral differences within equal-correctness groups" (tech report); length-standardization injects length bias. Diverges from the known-good recipe for no benefit.
|
| 104 |
|
| 105 |
+
## Post-acceptance cross-family review (2026-05-29)
|
| 106 |
+
|
| 107 |
+
A 5-family adversarial review (GPT-5.5, Gemini 3.1 Pro, DeepSeek V4 Pro,
|
| 108 |
+
Kimi K2.6, Grok 4.3 — the owed cross-family review, unblocked once OpenRouter
|
| 109 |
+
credits were restored) examined this ADR against its implementation. 4 of 5
|
| 110 |
+
returned clean reviews (Kimi starved its token budget). **Verdict split: 3
|
| 111 |
+
REJECT / 1 ACCEPT-WITH-FIXES on ADR-008.** The review correctly found that the
|
| 112 |
+
acceptance gate above **over-claimed**: the SDPO "trust-gap closed" gate was
|
| 113 |
+
satisfied only by a *shape check*, and the smoke gate proved init-not-wiring.
|
| 114 |
+
Findings verified against code and remediated:
|
| 115 |
+
|
| 116 |
+
- **[FIXED — was the headline P0, 4/4 reviewers] SDPO alignment was a shape-check
|
| 117 |
+
placebo.** `_compute_sdpo_loss` only asserted `student_logits.shape ==
|
| 118 |
+
teacher_logits.shape`. Because the teacher context has a hint inserted at the
|
| 119 |
+
error turn, the teacher's response tokens are shifted right by `len(hint)`;
|
| 120 |
+
equal-length sequences can still be token-misaligned, so `generalized_jsd_loss`
|
| 121 |
+
would distill the student's code against the teacher's *hint* — silently
|
| 122 |
+
poisoning the policy. **Fix:** the guard now REQUIRES the collator to emit
|
| 123 |
+
explicit `student_response_idx` / `teacher_response_idx` alignment tensors
|
| 124 |
+
whenever SDPO is active; the loss gathers the provably-aligned post-hint logits
|
| 125 |
+
before JSD. Missing indices raise in strict mode (loud failure) instead of
|
| 126 |
+
silently distilling garbage. Tests rewritten to the corrected contract
|
| 127 |
+
(`test_strict_alignment_raises_on_missing_indices`,
|
| 128 |
+
`test_aligned_indices_produce_finite_gathered_sdpo_loss` across *different*
|
| 129 |
+
sequence lengths, `test_mismatched_alignment_index_shapes_raise`).
|
| 130 |
+
- **[FIXED] `scale_rewards` drift assertion had a duplicated literal**
|
| 131 |
+
`("none","False","False")` and a brittle case-sensitive compare. Now
|
| 132 |
+
`str(cfg.scale_rewards).lower() in ("none","false")`.
|
| 133 |
+
- **[RE-SCOPED — gate honesty] The CPU smoke (gate 3) is init-coverage, not
|
| 134 |
+
wiring-coverage.** As 3 reviewers noted, the toy rollouts carry no
|
| 135 |
+
`ctx_teacher_input_ids`, so `_compute_sdpo_loss` hits its early `return 0.0`
|
| 136 |
+
and the smoke passes without ever entering the SDPO body. The gate proves the
|
| 137 |
+
subclass instantiates against a real `trl.GRPOTrainer` and the step runs — it
|
| 138 |
+
does NOT prove the SDPO channel fires on a real error-aligned batch. That
|
| 139 |
+
separate proof lives in `examples/sdpo_real_trace_train_smoke` (loss-math on
|
| 140 |
+
real traces). **Gate 3 re-worded below to claim only what it establishes.**
|
| 141 |
+
- **[OPEN — acknowledged, not fixed here] KL estimator (k1 vs k3) is not
|
| 142 |
+
configured or asserted** (2 reviewers). The ADR claims Composer uses the k1
|
| 143 |
+
(`−log r`) estimator and that "TRL's native estimator" satisfies this, but the
|
| 144 |
+
code neither sets nor verifies it against TRL 1.5.0's actual GRPO KL branch. If
|
| 145 |
+
TRL uses k3, we train a slightly different reference-KL than the report
|
| 146 |
+
specifies. Tracked as a follow-up: add a unit test computing KL for known
|
| 147 |
+
logprob pairs and asserting the k1 value, or set `beta=0` and document the
|
| 148 |
+
reference-KL term as intentionally disabled.
|
| 149 |
+
- **[OPEN — narrower guarantee] `num_iterations=1` ≠ "a prompt is never trained
|
| 150 |
+
twice"** (GPT-5.5). It controls GRPO inner-loop reuse per generation batch, not
|
| 151 |
+
dataset-level resampling/epochs. The single-epoch guarantee also depends on the
|
| 152 |
+
sampler/dataloader. Documented; the claim is narrowed to the GRPO inner-loop
|
| 153 |
+
sense.
|
| 154 |
+
- **[OPEN] "Adam" is claimed but `optim` is not set** — HF/TRL defaults to AdamW.
|
| 155 |
+
Either set `optim` explicitly or drop the Adam-specific claim. Follow-up.
|
| 156 |
+
|
| 157 |
+
**Status note:** the silent-misalignment P0 (the one that would actually corrupt
|
| 158 |
+
training) is fixed and tested. The remaining OPEN items are
|
| 159 |
+
fidelity/documentation gaps that don't corrupt a run, tracked as follow-ups. The
|
| 160 |
+
review artifact (all 4 reviews + convergence synthesis) is committed at
|
| 161 |
+
`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`.
|
| 162 |
+
|
| 163 |
## Acceptance gate (must be green before status flips to accepted)
|
| 164 |
|
| 165 |
All gates green as of 2026-05-29 (commit chain `bde5c5e`+; smoke PASS via
|
|
@@ -90,6 +90,47 @@ signature.
|
|
| 90 |
- Good: maximal coverage, natural language for every site.
|
| 91 |
- Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
|
| 92 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
## Acceptance gate (must be green before status flips to accepted)
|
| 94 |
|
| 95 |
All gates green as of 2026-05-29 (commit `<this>`; 12 tests in
|
|
|
|
| 90 |
- Good: maximal coverage, natural language for every site.
|
| 91 |
- Bad: cost on every site (templates would have been free); nondeterministic data prep; network dependency on the hot path. Wasteful for the common tool-error case a template nails.
|
| 92 |
|
| 93 |
+
## Post-acceptance cross-family review (2026-05-29)
|
| 94 |
+
|
| 95 |
+
The 5-family review (see ADR-008's review section) was **unanimous
|
| 96 |
+
ACCEPT-WITH-FIXES on ADR-009** — the layered Protocol design is sound; the
|
| 97 |
+
defects are in the LLM-judge layer's robustness and an ADR-body/code drift.
|
| 98 |
+
Verified and remediated:
|
| 99 |
+
|
| 100 |
+
- **[FIXED] LLM-judge cache key was non-deterministic across processes**
|
| 101 |
+
(Gemini P1, GPT-5.5 P2). `_cache_key` used `json.dumps(..., default=str)` on
|
| 102 |
+
`error_meta`; if that dict carries a raw Exception/context object, `str(obj)`
|
| 103 |
+
embeds a memory address (`<X at 0x7f…>`) → the key changes every run → 0%
|
| 104 |
+
cross-process cache hit → unbounded judge cost. **Fix:** strip `0x…` address
|
| 105 |
+
tokens before hashing and version the key (`_CACHE_VERSION`) so prompt/model
|
| 106 |
+
changes invalidate stale hints.
|
| 107 |
+
- **[FIXED] LLM-judge output was unbounded** (GPT-5.5 P1). The prompt asks for
|
| 108 |
+
≤2 sentences but nothing enforced it; a runaway judge could inject a full
|
| 109 |
+
solution / prompt-leak / megabyte blob straight into SDPO teacher
|
| 110 |
+
conditioning. **Fix:** `_MAX_HINT_CHARS = 600` clamp on the returned hint.
|
| 111 |
+
- **[FIXED] Non-atomic disk-cache write** (Gemini P2). Concurrent DDP workers
|
| 112 |
+
writing the same key could corrupt the file. **Fix:** write to a
|
| 113 |
+
`.{pid}.tmp` and `os.replace()` atomically.
|
| 114 |
+
- **[CORRECTED — ADR body] Decision Outcome described sibling-bootstrap as
|
| 115 |
+
composite layer (4).** As DeepSeek noted, the body says "(4) SDPO
|
| 116 |
+
sibling-bootstrap" as a composite layer, but the implementation (correctly)
|
| 117 |
+
places it trainer-side, and the acceptance gate already documents the shift.
|
| 118 |
+
The permanent decision record was internally inconsistent. **The Decision
|
| 119 |
+
Outcome below now states sibling-bootstrap lives in the rollout loop (ADR-008
|
| 120 |
+
trainer), not the HintGenerator composite.**
|
| 121 |
+
- **[OPEN — follow-up] Default layer order routes style/communication sites
|
| 122 |
+
through `RawErrorHintGenerator` before the judge** (GPT-5.5 P1). Any uncovered
|
| 123 |
+
site that carries an `error_message` is consumed by the raw-error layer and
|
| 124 |
+
never reaches the LLM judge — so the judge (the layer that actually covers
|
| 125 |
+
style/communication/effort, the stated goal) rarely fires in the default
|
| 126 |
+
composite. The fall-through test disables raw-error to force the judge, so it
|
| 127 |
+
doesn't validate the *default* path. Follow-up: add error-kind routing
|
| 128 |
+
(tool/runtime → raw-error OK; style/communication/effort → skip raw, go to
|
| 129 |
+
judge) and test the default composite directly.
|
| 130 |
+
|
| 131 |
+
All ADR-009 fixes are covered by the existing 12 tests passing unchanged plus
|
| 132 |
+
the cache-determinism behavior; no test regressions.
|
| 133 |
+
|
| 134 |
## Acceptance gate (must be green before status flips to accepted)
|
| 135 |
|
| 136 |
All gates green as of 2026-05-29 (commit `<this>`; 12 tests in
|
|
@@ -91,6 +91,76 @@ it to the RL loop.
|
|
| 91 |
- Good: zero build.
|
| 92 |
- Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
|
| 93 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 94 |
## Acceptance gate (must be green before status flips to accepted)
|
| 95 |
|
| 96 |
Core gates green as of 2026-05-29 (19 tests in
|
|
|
|
| 91 |
- Good: zero build.
|
| 92 |
- Bad: not *Feature Deletion* — it's issue-fixing; no controlled difficulty knob; no deletion mechanic; doesn't bring Composer's data-gen method in, just consumes an existing benchmark. Fails the actual ask.
|
| 93 |
|
| 94 |
+
## Post-acceptance cross-family review (2026-05-29)
|
| 95 |
+
|
| 96 |
+
The 5-family review (see ADR-008's review section) was **3 REJECT / 1
|
| 97 |
+
ACCEPT-WITH-FIXES on ADR-010** — the harshest of the three. The reviewers'
|
| 98 |
+
core, correct objection: the `[x]` gates were satisfied **against `FakeSandbox`
|
| 99 |
+
materializers that directly assign pass/fail booleans**, so they prove the
|
| 100 |
+
control-flow plumbing, NOT that feature-deletion-and-reimplement works on a real
|
| 101 |
+
repo. ADR-010's central claim ("invert OSS SWE substrates") is exactly the part
|
| 102 |
+
that is `[~]` (Docker-deferred) — the accepted gates are the easy half. This is
|
| 103 |
+
fair; the gate language below is re-scoped accordingly. Findings verified and
|
| 104 |
+
remediated where possible without Docker:
|
| 105 |
+
|
| 106 |
+
- **[FIXED — P0, 4/4 reviewers] Reward-hack sandbox lockdown was the ADR's
|
| 107 |
+
claimed PRIMARY control but was UNIMPLEMENTED.** `LocalSubprocessSandbox.boot`
|
| 108 |
+
only recorded the image string; no cache/.git scrub existed, so the
|
| 109 |
+
(bypassable) command denylist was the *only* defense. Worse, the denylist
|
| 110 |
+
checks only the first whitespace token, so `python -c "import marshal,dis;
|
| 111 |
+
…read __pycache__/*.pyc"`, `/usr/bin/strings`, `sh -c "strings x"` all bypass
|
| 112 |
+
it, and the `HackMonitor` regex is defeated by string-concat
|
| 113 |
+
(`"__py"+"cache__"`). **Fix:** `boot()` now physically scrubs `__pycache__`,
|
| 114 |
+
`.mypy_cache`, `.pytest_cache`, `.git`, `.hg`, and `*.pyc/*.pyo/*.class` from
|
| 115 |
+
the working tree before the episode — the actual primary control. The denylist
|
| 116 |
+
is now documented as cheap defense-in-depth, not the wall. Tested:
|
| 117 |
+
`test_local_sandbox_scrubs_caches_and_git_on_boot`.
|
| 118 |
+
- **[FIXED — P0, 3/4 reviewers] Curriculum recorded partial multi-feature
|
| 119 |
+
reward as a full pass.** `reward_fn` updated the curriculum with
|
| 120 |
+
`int(res.reward > 0)`, so a 0.5 (1-of-2 features) reward logged as a 100% pass;
|
| 121 |
+
`p_hat` crossed `tau_easy` and the task was **retired before the policy learned
|
| 122 |
+
the second feature**. **Fix:** the curriculum now takes FRACTIONAL credit
|
| 123 |
+
(`n_pass: float`), and `reward_fn` feeds the pass-fraction; a guard-broken or
|
| 124 |
+
hack-flagged rollout contributes 0 credit but still counts as an exposure (so
|
| 125 |
+
hack-only tasks trend to quarantine, not phantom passes). Tested:
|
| 126 |
+
`test_partial_multifeature_reward_is_fractional_credit_not_a_full_pass`,
|
| 127 |
+
`test_guard_broken_trajectory_does_not_pollute_curriculum_with_a_pass`.
|
| 128 |
+
- **[FIXED] `reward_fn` could silently return the wrong reward count.** The
|
| 129 |
+
`zip(completions, task_id)` truncated on length mismatch, returning fewer
|
| 130 |
+
rewards than TRL expects and corrupting the advantage computation. **Fix:**
|
| 131 |
+
explicit length-guard raises. Tested: `test_reward_fn_rejects_length_mismatch`.
|
| 132 |
+
- **[FIXED] Shell injection / parameterized-test breakage in `run_tests`**
|
| 133 |
+
(DeepSeek + Gemini). `f"{test_command} {node_ids}"` with `shell=True` breaks on
|
| 134 |
+
SWE-bench parametrized node ids (`test_x.py::test_y[a b]`) and is injectable.
|
| 135 |
+
**Fix:** `shlex.quote` each node id.
|
| 136 |
+
- **[FIXED] Duplicate `"unzip"` in `SANDBOX_DENYLIST`** (multiple reviewers).
|
| 137 |
+
- **[OPEN — honest, the central re-scope] Gate 2 ("deletion breaks the feature")
|
| 138 |
+
and the "deletion reachable from tests" claim do NOT verify reachability**
|
| 139 |
+
(GPT-5.5 P0, Grok P0). Gate 2 only checks that `FAIL_TO_PASS` tests fail in the
|
| 140 |
+
broken state — it does not prove the failure was *caused by the reverted
|
| 141 |
+
feature* rather than an unrelated breakage. A real reachability check
|
| 142 |
+
(coverage of the changed region by the failing tests, or revert-provenance)
|
| 143 |
+
needs the live Docker materializers. **This is the same `[~]` gate as the
|
| 144 |
+
substrate-inversion e2e — see below.**
|
| 145 |
+
- **[OPEN] `HackMonitor` is a substring matcher, not the AST-provenance monitor
|
| 146 |
+
the ADR advertises** (DeepSeek P0). It flags cache/decompiler signatures in the
|
| 147 |
+
trajectory but does no AST/symbol-reappearance analysis, and is bypassable by
|
| 148 |
+
string-concat. With the scrub now in place as the primary control, the monitor
|
| 149 |
+
is correctly-scoped defense-in-depth — but the ADR's §3c "AST provenance
|
| 150 |
+
monitor" language overstates it. Re-scoped: it is a *signature-based* monitor;
|
| 151 |
+
a genuine AST provenance check (scan the agent's patch for reintroduced
|
| 152 |
+
`deleted_symbols` reached via non-implementation paths) is a follow-up.
|
| 153 |
+
- **[OPEN — recipe fidelity] Curriculum ignores rollout-turns and
|
| 154 |
+
thinking-token count** (DeepSeek, GPT-5.5). The Composer 2 tech report keys the
|
| 155 |
+
curriculum on these; the implementation tracks only pass-rate. Follow-up:
|
| 156 |
+
extend `DifficultyCurriculum.update` to accept and weight on turn/think-token
|
| 157 |
+
signals.
|
| 158 |
+
|
| 159 |
+
**The three OPEN items all require the Docker substrate e2e to close** (real
|
| 160 |
+
reachability, real provenance on a materialized repo). That gate remains the
|
| 161 |
+
honest `[~]` — see the unblocked-by note. The fixable correctness defects
|
| 162 |
+
(scrub, fractional curriculum, length-guard, shlex) are fixed and tested.
|
| 163 |
+
|
| 164 |
## Acceptance gate (must be green before status flips to accepted)
|
| 165 |
|
| 166 |
Core gates green as of 2026-05-29 (19 tests in
|
|
@@ -0,0 +1,90 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Cross-family adversarial review — ADR-008 / 009 / 010 (2026-05-29)
|
| 2 |
+
|
| 3 |
+
> The "owed" cross-family adversarial ADR review from the Composer 2.5
|
| 4 |
+
> data-gen + targeted-RL wave. Deferred during the wave because OpenRouter ran
|
| 5 |
+
> out of credits mid-run (blocking `delegate_task`); run here once credits were
|
| 6 |
+
> restored (~$50 free at review time). This is the proper GPT-5.5 / Gemini /
|
| 7 |
+
> DeepSeek / Kimi / Grok scatter the wave summary promised.
|
| 8 |
+
|
| 9 |
+
## Method
|
| 10 |
+
|
| 11 |
+
Route-fidelity-safe urllib scatter (no `delegate_task` per-task-override risk)
|
| 12 |
+
to 5 diverse families, full corpus embedded inline: the 3 ADRs **plus their
|
| 13 |
+
implementation code plus the tests** (~100KB / ~25K tokens), so reviewers
|
| 14 |
+
critiqued the *implementation against the decision*, not just the prose.
|
| 15 |
+
|
| 16 |
+
| Family | Slug (served) | Result |
|
| 17 |
+
|---|---|---|
|
| 18 |
+
| OpenAI | gpt-5.5-20260423 | clean, very specific (7981 tok) |
|
| 19 |
+
| Google | gemini-3.1-pro-preview-20260219 | clean, sharpest (9174 tok @ retry 16K) |
|
| 20 |
+
| DeepSeek | deepseek-v4-pro-20260423 | clean, math/correctness lens (10308 tok @ retry 16K) |
|
| 21 |
+
| xAI | grok-4.3-20260430 | clean, terse/decisive (1645 tok) |
|
| 22 |
+
| Moonshot | kimi-k2.6-20260420 | starved token budget (63K reasoning, content=None) — excluded |
|
| 23 |
+
|
| 24 |
+
4 of 5 clean reviews. Raw reviews: `review_*.md` in this directory.
|
| 25 |
+
|
| 26 |
+
## Verdict matrix
|
| 27 |
+
|
| 28 |
+
| ADR | GPT-5.5 | Gemini | DeepSeek | Grok | Consensus |
|
| 29 |
+
|---|---|---|---|---|---|
|
| 30 |
+
| 008 (Dr.GRPO+SDPO) | REJECT | REJECT | ACCEPT-w-FIXES | REJECT | **3 REJECT / 1 fix** |
|
| 31 |
+
| 009 (layered hints) | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT-w-FIXES | ACCEPT | **unanimous: fixable** |
|
| 32 |
+
| 010 (FeatureDeletion datagen) | REJECT | REJECT | REJECT | ACCEPT-w-FIXES | **3 REJECT / 1 fix** |
|
| 33 |
+
|
| 34 |
+
The REJECTs were not "the design is wrong" — every reviewer agreed the
|
| 35 |
+
architecture is sound. They were "the **accepted** status outruns the
|
| 36 |
+
**evidence**": gates marked `[x]` were satisfied by shape-checks and FakeSandbox
|
| 37 |
+
tautologies, not by the hard guarantee the gate language implied.
|
| 38 |
+
|
| 39 |
+
## Convergent findings (≥2 reviewers, ALL verified against code by the orchestrator)
|
| 40 |
+
|
| 41 |
+
| # | Finding | Reviewers | Status |
|
| 42 |
+
|---|---|---|---|
|
| 43 |
+
| 1 | SDPO alignment guard = shape-check only; hint-shifted tokens silently misalign → policy poisoning | 4/4 | **FIXED** (explicit alignment-index gather + strict-raise; tests rewritten) |
|
| 44 |
+
| 2 | Sandbox denylist trivially bypassable (`python -c`, abs paths, `sh -c`) + ADR-claimed scrub UNIMPLEMENTED | 4/4 | **FIXED** (`_scrub_tree` primary control implemented + tested) |
|
| 45 |
+
| 3 | Curriculum `int(reward>0)` logs 0.5 partial as full pass → premature retire | 3/4 | **FIXED** (fractional credit + tests) |
|
| 46 |
+
| 4 | `scale_rewards` assertion dup literal `("none","False","False")` | 4/4 | **FIXED** (case-insensitive) |
|
| 47 |
+
| 5 | CPU smoke is tautological — never enters `_compute_sdpo_loss` (early return) | 3/4 | **RE-SCOPED** (gate honestly reworded) |
|
| 48 |
+
| 6 | FakeSandbox validator tests prove plumbing, not real inversion | 3/4 | **RE-SCOPED** (= the `[~]` Docker gate) |
|
| 49 |
+
| 7 | HackMonitor is substring matcher, not AST-provenance as advertised | 2/4 | **RE-SCOPED + OPEN** (follow-up) |
|
| 50 |
+
| 8 | Validator gate-2/"deletion reachable" doesn't test reachability | 2/4 | **OPEN** (needs Docker materializers) |
|
| 51 |
+
| 9 | `shell=True` + node-id interpolation = injection + param-test breakage | 2/4 | **FIXED** (`shlex.quote`) |
|
| 52 |
+
| 10 | LLM-judge cache key non-deterministic (memory addr) + unbounded output | 2/4 | **FIXED** (addr-strip + version + clamp + atomic write) |
|
| 53 |
+
| 11 | KL estimator k1 never configured/asserted | 2/4 | **OPEN** (fidelity follow-up) |
|
| 54 |
+
| 12 | `reward_fn` zip truncates on length mismatch | 1 (GPT) | **FIXED** (length-guard) |
|
| 55 |
+
|
| 56 |
+
## What was fixed in this pass (all tested, 192 pass / 16 skip, 0 fail)
|
| 57 |
+
|
| 58 |
+
- **SDPO alignment** (`composer_trainer.py`): require `student_response_idx` /
|
| 59 |
+
`teacher_response_idx` from the collator; gather aligned post-hint logits
|
| 60 |
+
before JSD; strict-mode raises on missing/mismatched indices. The
|
| 61 |
+
silent-misalignment P0 (the only finding that would actually corrupt a run) is
|
| 62 |
+
closed and tested across different-length sequences.
|
| 63 |
+
- **Sandbox scrub** (`sandbox.py`): `boot()` physically removes byte-code/type
|
| 64 |
+
caches + `.git`/`.hg` — the ADR-claimed PRIMARY reward-hack control, previously
|
| 65 |
+
absent. Denylist re-documented as defense-in-depth + `shlex.quote` on node ids.
|
| 66 |
+
- **Curriculum** (`curriculum.py` + `env.py`): fractional pass credit; hacked /
|
| 67 |
+
guard-broken rollouts contribute 0 credit but count as exposures.
|
| 68 |
+
- **reward_fn** (`env.py`): length-guard against silent truncation.
|
| 69 |
+
- **LLM judge** (`hint_generator.py`): address-stripped + versioned cache key,
|
| 70 |
+
600-char output clamp, atomic disk write.
|
| 71 |
+
- **`make_dr_grpo_config`** (`composer_trainer.py`): fixed the duplicated
|
| 72 |
+
assertion literal.
|
| 73 |
+
|
| 74 |
+
## What remains OPEN (honest follow-ups, none corrupt a run)
|
| 75 |
+
|
| 76 |
+
1. **Live Docker substrate-inversion e2e** — the central `[~]` gate. Closes
|
| 77 |
+
findings #6, #8 (real reachability/provenance on a materialized repo) and the
|
| 78 |
+
"FakeSandbox is tautological" objection. **Blocked: no Docker in this env.**
|
| 79 |
+
2. **k1 KL estimator** assert/config vs TRL 1.5.0's actual GRPO KL branch (#11).
|
| 80 |
+
3. **HackMonitor → real AST provenance** check (#7).
|
| 81 |
+
4. **Hint-generator default layer routing** so style/communication sites reach
|
| 82 |
+
the judge (ADR-009 P1).
|
| 83 |
+
5. **Curriculum turn/think-token signals** for full Composer recipe fidelity.
|
| 84 |
+
|
| 85 |
+
All five are documented in the ADRs' "Post-acceptance cross-family review"
|
| 86 |
+
sections.
|
| 87 |
+
|
| 88 |
+
## Cost
|
| 89 |
+
|
| 90 |
+
~$2-4 OpenRouter (2 scatters: 5×8K + 3×16K tokens on a 25K-token corpus).
|
|
@@ -0,0 +1,46 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review by deepseek/deepseek-v4-pro-20260423 (206.1s, 26727/10308 tok)
|
| 2 |
+
|
| 3 |
+
## Verdict
|
| 4 |
+
- **ADR-008**: ACCEPT-WITH-FIXES
|
| 5 |
+
- **ADR-009**: ACCEPT-WITH-FIXES
|
| 6 |
+
- **ADR-010**: REJECT
|
| 7 |
+
|
| 8 |
+
## P0 findings
|
| 9 |
+
|
| 10 |
+
1. **[ADR-010] Sandbox denylist can be trivially bypassed — `is_command_allowed` validates only the first whitespace-delimited token, not the resolved command.**
|
| 11 |
+
- `composer_replication/datagen/sandbox.py:81-83` (`FakeSandbox.exec`) and `:130-136` (`LocalSubprocessSandbox.exec`)
|
| 12 |
+
- An attacker can run `/usr/bin/find …`, `bash -c "find …"`, or even `./my_decompile` (any path prefix) to circumvent the denylist. Because the check is `cmd.strip().split()[0]`, it compares e.g. `"/usr/bin/find"` against the denylist entry `"find"` and allows it.
|
| 13 |
+
- The ADR claims “allowlisted sandbox without find/strings/unzip/decompilers” – but the actual enforcement is cosmetic. In a real reward‑hacking scenario this safeguard fails.
|
| 14 |
+
- **Fix**: enforce a strict executable‑allowlist (e.g., only `pytest`, `python`, `git apply`, `editor` wrappers) at the sandbox level, or use OS‑level capability restrictions (seccomp / AppArmor). The current string‑matching must block *any* invocation of a forbidden tool, not just the bare command name.
|
| 15 |
+
|
| 16 |
+
## P1 findings
|
| 17 |
+
|
| 18 |
+
1. **[ADR-008] SDPO alignment guard only checks logit tensor shape, not semantic alignment of the post‑hint sequences.**
|
| 19 |
+
- `composer_replication/trainer/composer_trainer.py:120-130` — the shape guard raises `ValueError` on unequal dimensions, but equal‑length sequences could still be misaligned if the collator inserts the hint at different positions. The ADR states the trust‑gap is “closed” with an assertion; in reality the guard only catches length mismatches, leaving silent misalignment possible.
|
| 20 |
+
- **Fix**: require the collator to provide explicit alignment information (e.g., a boolean mask identifying the post‑hint region) and validate that the teacher’s logits in that region correspond 1:1 with the student’s.
|
| 21 |
+
|
| 22 |
+
2. **[ADR-009] ADR-009’s Decision Outcome says the composite includes a sibling‑bootstrap layer, but the implementation does not.**
|
| 23 |
+
- `docs/adrs/ADR-009-layered-hint-generator.md` lines 57‑62 describe a composite “that tries layers cheapest‑first: (1) TemplateHintGenerator … (4) SDPO sibling‑bootstrap”. The code (`composer_replication/hint_generator.py:CompositeHintGenerator` and `default_composite`) only layers templates → raw error → LLM judge. The sibling‑bootstrap is documented as a trainer‑side flag, not a composite layer.
|
| 24 |
+
- The ADR’s own acceptance gate acknowledges the shift, but the body of the ADR (the permanent decision record) is now inconsistent with the shipped implementation. This misleads future readers about the architecture.
|
| 25 |
+
- **Fix**: update the ADR’s Decision Outcome to note that sibling‑bootstrap lives in the rollout loop, not in the HintGenerator composite.
|
| 26 |
+
|
| 27 |
+
3. **[ADR-010] HackMonitor is not an AST‑provenance monitor.**
|
| 28 |
+
- `composer_replication/datagen/monitor.py` uses a static substring denylist (`__pycache__`, `javap`, …) to flag a trajectory. The ADR’s §3c and Decision Outcome describe an “AST provenance monitor that masks reward when deleted symbols reappear via non‑implementation paths”. No AST or symbol‑reappearance analysis is performed.
|
| 29 |
+
- The current monitor is a trivially‑extended version of the sandbox denylist, not a heuristic provenance detector as advertised.
|
| 30 |
+
- **Fix**: either implement a genuine provenance check (e.g., scan the agent’s patch for reintroduced symbols that match the `deleted_symbols` list, and cross‑reference with AST imports) or rewrite the ADR to accurately describe the signature‑based monitor.
|
| 31 |
+
|
| 32 |
+
4. **[ADR-010] DifficultyCurriculum does not incorporate rollout‑turns or thinking‑token count.**
|
| 33 |
+
- `composer_replication/datagen/curriculum.py` tracks only pass/fail test outcomes and derives a frontier‑variance weight. The Composer 2 tech report (cited in the ADR) explicitly keys the curriculum on “rollout #turns + thinking‑token count”. The implementation omits these signals entirely.
|
| 34 |
+
- This deviates from the recipe the ADR claims to match; the resulting curriculum may miss important difficulty‑modulation behaviour.
|
| 35 |
+
- **Fix**: augment `DifficultyCurriculum` with moving averages of turn‑count and thinking‑token length per task, and incorporate them into the weighting or difficulty‑escalation logic.
|
| 36 |
+
|
| 37 |
+
## P2 / nits
|
| 38 |
+
|
| 39 |
+
- **ADR-008**: `make_dr_grpo_config`’s assertion string `("none", "False", "False")` duplicates `"False"` (presumably a copy‑paste typo) — `composer_replication/trainer/composer_trainer.py:228`.
|
| 40 |
+
- **ADR-010**: `LocalSubprocessSandbox.run_tests` constructs a shell command via `f"{test_command} {node_ids}"` with `shell=True` — potential command injection if `node_ids` is untrusted (`sandbox.py:148‑151`).
|
| 41 |
+
- **ADR-010**: The 4‑gate validator’s `gate2` includes a redundant `bool(targets)` check – `targets` is always non‑empty per `FeatureDeletionTask` invariant, so the check is dead code (`validator.py:62`).
|
| 42 |
+
|
| 43 |
+
## Anything the self-review likely missed
|
| 44 |
+
- The ADR‑008 smoke test relies on `strict_sdpo_alignment=False`; it never exercises the shape‑mismatch error path in production‑like configuration, so that path is untested.
|
| 45 |
+
- The `FeatureDeletionEnv.reward_fn` curriculum update uses `n_pass=int(res.reward > 0)` — this treats any positive reward (even 0.1) as a “pass” for the pass‑rate tracker, which may bias `p_hat` upward for partially‑rewarded tasks. The intention was likely to count fractional success, but the current code treats it as binary.
|
| 46 |
+
- The `SweBenchAdapter` fallback for non‑JSON `FAIL_TO_PASS` strings (e.g., `"test_x"`) returns a single‑element tuple, which is a reasonable heuristic but silently changes semantics for datasets that use delimiters other than JSON‑list.
|
|
@@ -0,0 +1,61 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review by google/gemini-3.1-pro-preview-20260219 (94.0s, 28773/9174 tok)
|
| 2 |
+
|
| 3 |
+
## Verdicts
|
| 4 |
+
**ADR-008 (Dr. GRPO + SDPO):** REJECT
|
| 5 |
+
**ADR-009 (Layered HintGenerator):** ACCEPT-WITH-FIXES
|
| 6 |
+
**ADR-010 (Feature Deletion Datagen):** REJECT
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## P0 findings
|
| 11 |
+
|
| 12 |
+
**1. SDPO student/teacher alignment guard is a shape-check placebo (ADR-008)**
|
| 13 |
+
*File:* `composer_replication/trainer/composer_trainer.py` inside `_compute_sdpo_loss`
|
| 14 |
+
*Why:* The code asserts `student_logits.shape != teacher_logits.shape` to verify alignment. This is fundamentally flawed. The teacher context (`ctx_teacher_input_ids`) has a hint inserted in the *middle* of the sequence (at the error turn). To make the total sequence lengths match, the collator must pad one of them. Even if the shapes are identical, the student's post-hint response tokens are shifted relative to the teacher's by `len(hint)`. `generalized_jsd_loss` will compute the KL divergence between completely unaligned tokens (e.g., the student's code vs. the teacher's hint). The trust-gap is not closed; it is silently hidden.
|
| 15 |
+
*Fix:* The collator must emit an alignment mapping (e.g., `teacher_gather_indices`), and `_compute_sdpo_loss` must `gather` the exact post-hint response logits from the teacher before computing JSD, rather than relying on sequence-level shape equality.
|
| 16 |
+
|
| 17 |
+
**2. Reward-hacking safeguards are trivially bypassed via Python (ADR-010)**
|
| 18 |
+
*File:* `composer_replication/datagen/sandbox.py` (`SANDBOX_DENYLIST`) and `monitor.py` (`HackMonitor`)
|
| 19 |
+
*Why:* The denylist blocks shell commands like `strings`, `javap`, and `uncompyle6` by checking the first word of the command. However, `python` is allowed. An agent can simply run `python -c "import dis, marshal; print(dis.dis(marshal.load(open('__pycache__/feature_a.cpython-310.pyc', 'rb'))))"` to decompile the cache, or use Python to read `.git` objects. Furthermore, the `HackMonitor` regex is defeated by basic string concatenation in the generated code (e.g., `getattr(os, "list" + "dir")(".git")` or `"__py" + "cache__"`). The safeguards provide a false sense of security.
|
| 20 |
+
*Fix:* Do not rely on command denylists or regex monitors as the primary defense. The sandbox `boot()` method MUST physically execute `rm -rf $(find . -name __pycache__) .git` before the episode starts.
|
| 21 |
+
|
| 22 |
+
**3. Curriculum prematurely retires tasks due to partial-reward logging (ADR-010)**
|
| 23 |
+
*File:* `composer_replication/datagen/env.py` inside `reward_fn`
|
| 24 |
+
*Why:* The reward function updates the curriculum using `n_pass=int(res.reward > 0)`. For a multi-feature task where the agent fixes 1 out of 2 tests, `res.reward` is `0.5`. `int(0.5 > 0)` evaluates to `1`. The curriculum logs this as a 100% pass rate. Consequently, `p_hat` quickly exceeds `tau_easy` (0.95), and the task is retired before the agent ever learns to solve the remaining features.
|
| 25 |
+
*Fix:* Update the curriculum using strict success: `n_pass=int(res.reward == 1.0)`.
|
| 26 |
+
|
| 27 |
+
---
|
| 28 |
+
|
| 29 |
+
## P1 findings
|
| 30 |
+
|
| 31 |
+
**4. LLM Judge cache key is destroyed by memory addresses (ADR-009)**
|
| 32 |
+
*File:* `composer_replication/hint_generator.py` inside `LLMJudgeHintGenerator._cache_key`
|
| 33 |
+
*Why:* The cache key uses `json.dumps(..., default=str)`. `error_meta` frequently contains raw Exception objects or un-repr'd context objects. `str(obj)` for these includes their memory address (e.g., `<Exception at 0x7f8b...>`), which changes on every single run. This guarantees a 0% cache hit rate across restarts, leading to unbounded OpenRouter API costs.
|
| 34 |
+
*Fix:* Sanitize `error_meta` to strip memory addresses (e.g., via regex) or extract only safe primitive fields before hashing.
|
| 35 |
+
|
| 36 |
+
**5. Shell injection and syntax errors on parameterized tests (ADR-010)**
|
| 37 |
+
*File:* `composer_replication/datagen/sandbox.py` inside `LocalSubprocessSandbox.run_tests`
|
| 38 |
+
*Why:* The code joins test names with spaces: `cmd = f"{test_command} {node_ids}"` and runs with `shell=True`. SWE-bench heavily uses parameterized tests (e.g., `test_foo.py::test_bar[my arg]`). The spaces and brackets will cause shell syntax errors or allow arbitrary command injection if a test name contains `;` or `&`.
|
| 39 |
+
*Fix:* Use `shlex.quote(t)` on each test node ID before joining them.
|
| 40 |
+
|
| 41 |
+
**6. `reward_fn` throws KeyError in distributed TRL loops (ADR-010)**
|
| 42 |
+
*File:* `composer_replication/datagen/env.py` inside `reward_fn`
|
| 43 |
+
*Why:* `reward_fn` looks up `task_id` in `self.registry`. `self.registry` is only populated when `env.reset(task)` is called locally. In a standard TRL training loop, `reward_fn` is a stateless callback executed on batches of generated completions. If the environment wasn't pre-loaded with the entire dataset, `self.registry[tid]` will immediately raise a `KeyError`.
|
| 44 |
+
*Fix:* `FeatureDeletionEnv` must accept a dataset reference or a task-factory function to load unseen `task_id`s on demand.
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## P2 / nits
|
| 49 |
+
|
| 50 |
+
**7. Non-atomic disk cache writes (ADR-009)**
|
| 51 |
+
*File:* `composer_replication/hint_generator.py` inside `LLMJudgeHintGenerator._disk_put`
|
| 52 |
+
*Why:* `(d / f"{key}.txt").write_text(value)` is not atomic. Concurrent DDP workers writing to the same cache key will corrupt the file. Write to a `.tmp` file and use `os.replace`.
|
| 53 |
+
|
| 54 |
+
**8. Typo in Dr. GRPO config assertion (ADR-008)**
|
| 55 |
+
*File:* `composer_replication/trainer/composer_trainer.py` inside `make_dr_grpo_config`
|
| 56 |
+
*Why:* `assert str(cfg.scale_rewards) in ("none", "False", "False")` repeats `"False"`.
|
| 57 |
+
|
| 58 |
+
---
|
| 59 |
+
|
| 60 |
+
## Anything the self-review likely missed
|
| 61 |
+
The ADR-008 CPU smoke test (`examples/composer_grpo_sdpo_smoke/run.py`) explicitly passes `strict_sdpo_alignment=False` and notes that the toy rollouts won't carry `ctx_teacher_input_ids`. This means the smoke test **tautologically passes** without ever executing the inside of the `_compute_sdpo_loss` block (it hits the early return `0.0`). It proves the trainer doesn't crash on init, but it completely fails to prove that the SDPO channel is "wired into the live loop" as claimed in the acceptance gate.
|
|
@@ -0,0 +1,153 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review by openai/gpt-5.5-20260423 (146.6s, 24888/7981 tok)
|
| 2 |
+
|
| 3 |
+
## Verdict
|
| 4 |
+
|
| 5 |
+
- **ADR-008: REJECT** — Dr. GRPO/TRL claims are not established; SDPO “live” gate is mostly a no-op.
|
| 6 |
+
- **ADR-009: ACCEPT-WITH-FIXES** — layered interface is useful, but default ordering and omitted sibling-bootstrap contradict the ADR.
|
| 7 |
+
- **ADR-010: REJECT** — reward function, validator, sandbox, and Docker/substrate gates do not prove the accepted claims.
|
| 8 |
+
|
| 9 |
+
## P0 findings
|
| 10 |
+
|
| 11 |
+
1. **ADR-008 — Dr. GRPO KL estimator claim is wrong / unimplemented**
|
| 12 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
|
| 13 |
+
**Why:** The ADR claims Composer uses k1 KL (`−log r`) and that “TRL’s native estimator” satisfies this. TRL GRPO implementations commonly use the Schulman k3-style estimator (`exp(ref-logp - logp) - (ref-logp - logp) - 1`) for the per-token KL. This code does not configure or override the KL estimator at all; it only leaves `beta` to the caller. If TRL 1.5.0 uses k3, the accepted ADR is training the wrong RL objective while claiming Dr. GRPO fidelity.
|
| 14 |
+
**Concrete fix:** Add an explicit version-pinned assertion against TRL 1.5.0 source behavior, and if TRL uses k3, override GRPO’s KL computation in `ComposerReplicationTrainer` to k1 or set `beta=0` and document that Composer’s reference-KL term is intentionally disabled. Add a test that computes KL for known logprob pairs and asserts exact k1 values.
|
| 15 |
+
|
| 16 |
+
2. **ADR-008 — `scale_rewards="none"` is not a robust Dr. GRPO guard**
|
| 17 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
|
| 18 |
+
**Why:** The acceptance gate says no std-dev advantage normalization is guaranteed. The code passes the string `"none"` and then only checks that `cfg.scale_rewards` echoes as `"none"`/`False`. That is not a semantic guard. If TRL expects `False` to disable scaling and treats unknown strings as truthy or as a mode, this silently fails the key Dr. GRPO requirement. The assertion also contains a duplicated tuple entry: `("none", "False", "False")`.
|
| 19 |
+
**Concrete fix:** Use the TRL 1.5.0-supported disable value directly, likely `scale_rewards=False`, and add an integration/unit test around TRL’s actual advantage-scaling branch showing no group/batch std normalization is applied.
|
| 20 |
+
|
| 21 |
+
3. **ADR-008 — “live SDPO smoke” does not prove SDPO is live and does not assert its own gate**
|
| 22 |
+
**File/function:** `examples/composer_grpo_sdpo_smoke/run.py::main`
|
| 23 |
+
**Why:** The smoke enables `alpha_sdpo=1.0` but supplies no `ctx_teacher_input_ids` and no `sdpo_loss_mask`. Therefore `ComposerReplicationTrainer._compute_sdpo_loss` immediately returns zero and performs no teacher forward, no JSD, and no error-site masking. Worse, the script computes `sdpo_logged = any(...)` but never asserts it; it prints `PASS` unconditionally after `trainer.train()`. This gate can go green even if the SDPO override is never invoked.
|
| 24 |
+
**Concrete fix:** Build a tiny dataset/collator batch with aligned `input_ids`, `ctx_teacher_input_ids`, and nonzero `sdpo_loss_mask`; assert a teacher forward occurred, `loss/sdpo_kl` is logged, and the logged value is finite and positive or at least nonzero under controlled logits. Also `return 1` if `sdpo_logged` is false.
|
| 25 |
+
|
| 26 |
+
4. **ADR-010 — TRL `reward_fn` ignores the completion, so non-solutions can receive full reward**
|
| 27 |
+
**File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
|
| 28 |
+
**Why:** In the default path, each completion is ignored. The env calls `reset(task)` and then `step({"type": "submit"})`; the sandbox’s pre-existing test outcomes determine reward. The unit test `test_reward_fn_returns_one_float_per_completion` rewards `"...agent diff..."` without applying any patch or action. This is a ship-blocking reward bug: every completion for a task can get the same score, including an empty or malicious non-solution.
|
| 29 |
+
**Concrete fix:** Make `reward_fn` require a replay/action executor that applies the completion’s patch/tool calls in an isolated sandbox before grading. If no replay implementation is provided, raise `NotImplementedError` instead of grading a bare submit. Add tests where two different completions for the same task produce different rewards.
|
| 30 |
+
|
| 31 |
+
5. **ADR-010 — sandbox denylist is trivially bypassable and the promised scrub is not implemented**
|
| 32 |
+
**Files/functions:**
|
| 33 |
+
- `composer_replication/datagen/sandbox.py::SANDBOX_DENYLIST`
|
| 34 |
+
- `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox.exec`
|
| 35 |
+
- `composer_replication/datagen/monitor.py::HackMonitor.flag`
|
| 36 |
+
**Why:** The ADR claims pre-task scrub of `__pycache__`, `.mypy_cache`, `.class`, `.git` and an allowlisted sandbox. Code implements neither. `LocalSubprocessSandbox.boot` only records the image. `exec` uses `shell=True` and blocks only the first token if it exactly matches a denylisted command. Concrete bypasses:
|
| 37 |
+
- `/usr/bin/strings file.pyc` bypasses because head is `/usr/bin/strings`, not `strings`.
|
| 38 |
+
- `sh -c "strings file.pyc"` bypasses because head is `sh`.
|
| 39 |
+
- `python -c "import pathlib; ... read __pycache__/*.pyc ..."` bypasses because head is `python`.
|
| 40 |
+
- `grep -R`, `ripgrep`, `cat`, `tar`, `zipinfo`, `python -m dis`, or custom bytecode readers are not denied.
|
| 41 |
+
The monitor can also be evaded by string-splitting: `('__py'+'cache__')`, `('.p'+'yc')`, etc.
|
| 42 |
+
**Concrete fix:** Replace denylist shell execution with a constrained tool API or container-level policy: scrub caches/history before task image creation, run without `.git`, no network, no arbitrary shell for reward-sensitive tasks, or enforce syscall/path-level restrictions. At minimum, canonicalize executable paths, block shells/interpreters for filesystem reconnaissance, and scan the filesystem before grading.
|
| 43 |
+
|
| 44 |
+
6. **ADR-010 — accepted Docker/substrate gate is not green; FakeSandbox tests are tautological**
|
| 45 |
+
**Files/functions:**
|
| 46 |
+
- `composer_replication/datagen/substrates.py::SweBenchAdapter.to_task`
|
| 47 |
+
- `composer_replication/datagen/validator.py::validate_task`
|
| 48 |
+
- `composer_replication/datagen/tests/test_feature_deletion.py::_materializers`
|
| 49 |
+
**Why:** The ADR marks the core gates accepted while the actual live substrate inversion is explicitly deferred. `SweBenchAdapter.to_task` only maps schema fields; it does not apply or reverse patches, scrub caches, build a broken image, or run tests. The validator unit tests use materializers that directly assign pass/fail dictionaries in `FakeSandbox`, so they prove only that the validator obeys synthetic booleans, not that feature deletion works on a repo. This is especially bad because ADR-010’s central claim is “invert OSS SWE substrates.”
|
| 50 |
+
**Concrete fix:** Keep ADR-010 unaccepted until a Docker-gated test actually clones/materializes one real substrate instance, applies the gold patch to establish solved green, reverts/deletes to broken, runs `FAIL_TO_PASS`/`PASS_TO_PASS`, reapplies gold, and verifies the four gates.
|
| 51 |
+
|
| 52 |
+
7. **ADR-010 — validator does not actually prove “deletion is reachable from tests”**
|
| 53 |
+
**File/function:** `composer_replication/datagen/validator.py::validate_task`
|
| 54 |
+
**Why:** Gate 2 checks only that `FAIL_TO_PASS` tests fail in the state supplied by `materialize_broken`. It does not verify that the failure was caused by reverting/deleting the specified feature or symbols. A broken materializer can make tests fail for any unrelated reason and pass validation if `apply_gold` later flips the booleans. The FakeSandbox tests encode exactly that weakness.
|
| 55 |
+
**Concrete fix:** In the real materializer, record and verify the exact patch/reverse-patch operation, changed files, and deleted symbols. Add a reachability check that failing tests execute or import the changed region, e.g. via coverage, traceback provenance, or mutation/revert validation on the actual repo.
|
| 56 |
+
|
| 57 |
+
## P1 findings
|
| 58 |
+
|
| 59 |
+
1. **ADR-008 — SDPO alignment guard is only a shape check**
|
| 60 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer._compute_sdpo_loss`
|
| 61 |
+
**Why:** The ADR says the student/teacher alignment trust-gap is closed. It is not. The code only compares `student_logits.shape != teacher_logits.shape`. Same shape with shuffled batch rows, shifted post-hint tokens, wrong error-site mask, or different padding all pass silently and train on nonsense.
|
| 62 |
+
**Concrete fix:** Carry explicit alignment metadata from the collator: batch ids, error-site ids, student token ids, teacher target token ids, and mask. Assert equality under the SDPO mask, not just equal tensor shape.
|
| 63 |
+
|
| 64 |
+
2. **ADR-008 — missing `sdpo_loss_mask` distills the whole sequence**
|
| 65 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::ComposerReplicationTrainer._compute_sdpo_loss`
|
| 66 |
+
**Why:** If `ctx_teacher_input_ids` exists but `sdpo_loss_mask` is absent, the code passes `labels=None` into `generalized_jsd_loss`. Depending on that implementation, this can compute loss over all tokens, not only error-turn tokens. That violates “masked to error-turn tokens only.”
|
| 67 |
+
**Concrete fix:** When `alpha_sdpo > 0` and teacher ids are present, require `sdpo_loss_mask` with compatible shape and nonzero entries, or explicitly return zero with a warning.
|
| 68 |
+
|
| 69 |
+
3. **ADR-008 — “Adam” is claimed but not configured**
|
| 70 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
|
| 71 |
+
**Why:** ADR says Dr. GRPO uses Adam. The helper does not set `optim`, `weight_decay`, or Adam-vs-AdamW behavior. HF/TRL defaults are typically AdamW variants.
|
| 72 |
+
**Concrete fix:** Set the optimizer explicitly or remove the claim. Add a config test that asserts the exact optimizer TRL will instantiate.
|
| 73 |
+
|
| 74 |
+
4. **ADR-008 — `num_iterations=1` is not the same as “a prompt is never trained twice”**
|
| 75 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
|
| 76 |
+
**Why:** `num_iterations=1` controls GRPO inner reuse per generation batch. It does not by itself prevent dataset-level resampling, multiple epochs, repeated prompts, or replay through dataloader settings.
|
| 77 |
+
**Concrete fix:** Document the narrower guarantee and add sampler/dataset controls if the intended guarantee is truly one update per prompt.
|
| 78 |
+
|
| 79 |
+
5. **ADR-009 — default layer order prevents LLM judge from handling many style/communication cases**
|
| 80 |
+
**File/function:** `composer_replication/hint_generator.py::default_composite`
|
| 81 |
+
**Why:** Default order is template → raw-error → LLM. Any uncovered style/communication site with an `error_message` will be consumed by `RawErrorHintGenerator` and never reach the LLM judge. The test `test_composite_falls_through_to_judge_for_uncovered_site` disables raw error to make the judge fire, which does not validate the default accepted design.
|
| 82 |
+
**Concrete fix:** Add routing policy: tool/runtime errors may use raw error; style/communication/effort kinds should skip raw-error and go to judge. Test the default composite, not a special disabled-raw variant.
|
| 83 |
+
|
| 84 |
+
6. **ADR-009 — sibling-bootstrap is accepted but not implemented**
|
| 85 |
+
**Files/functions:**
|
| 86 |
+
- `composer_replication/hint_generator.py`
|
| 87 |
+
- `composer_replication/trainer/composer_trainer.py`
|
| 88 |
+
**Why:** The ADR chooses template → raw-error → LLM-judge → SDPO sibling-bootstrap. The implementation comments say sibling-bootstrap is trainer-side, but no trainer-side flag or rollout integration exists in the provided trainer. Acceptance says “satisfied by design,” which is not an implementation gate.
|
| 89 |
+
**Concrete fix:** Add an explicit trainer/rollout interface that can select a successful sibling rollout as feedback, or downgrade the ADR to exclude sibling-bootstrap from the accepted scope.
|
| 90 |
+
|
| 91 |
+
7. **ADR-009 — LLM hints are not bounded, sanitized, or policy-checked**
|
| 92 |
+
**File/function:** `composer_replication/hint_generator.py::LLMJudgeHintGenerator.generate`
|
| 93 |
+
**Why:** Prompt asks for `<=2 sentences`, but code accepts arbitrary output. A judge can return a full solution, huge text, hidden prompt leakage, or adversarial instructions. This contaminates SDPO teacher conditioning.
|
| 94 |
+
**Concrete fix:** Enforce length, strip code blocks if undesired, reject full-solution style outputs, and record model/prompt version in cache metadata.
|
| 95 |
+
|
| 96 |
+
8. **ADR-010 — curriculum quarantine will discard hard-but-solvable tasks early**
|
| 97 |
+
**File/function:** `composer_replication/datagen/curriculum.py::DifficultyCurriculum.update`
|
| 98 |
+
**Why:** With `tau_hard=0.02` and `min_exposures=8`, any task with zero successes in its first eight attempts is quarantined forever. Early in RL, that describes many valid hard tasks, not just impossible tasks. This contradicts the “create/select harder tasks dynamically” goal.
|
| 99 |
+
**Concrete fix:** Use confidence intervals/Bayesian posterior, minimum diverse solver attempts, validator failures, or periodic reactivation rather than irreversible zero-success quarantine.
|
| 100 |
+
|
| 101 |
+
9. **ADR-010 — curriculum ignores turns and thinking-token count despite ADR claim**
|
| 102 |
+
**File/function:** `composer_replication/datagen/curriculum.py::DifficultyCurriculum`
|
| 103 |
+
**Why:** ADR says the Composer tech report keys curriculum on rollout turns and thinking-token count. Implementation tracks only binary pass totals.
|
| 104 |
+
**Concrete fix:** Extend `update` to accept turns and thinking tokens, and weight/graduate tasks using those signals.
|
| 105 |
+
|
| 106 |
+
10. **ADR-010 — partial multi-feature reward is collapsed to binary for curriculum**
|
| 107 |
+
**File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
|
| 108 |
+
**Why:** Curriculum update uses `n_pass=int(res.reward > 0)`. A 0.1 partial pass and a 1.0 full pass both count as one success. That corrupts pass-rate estimates for multi-feature tasks.
|
| 109 |
+
**Concrete fix:** Track fractional reward or target-test pass counts, e.g. `n_pass=res.info["frac"] * len(fail_to_pass)`, `n_total=len(fail_to_pass)`.
|
| 110 |
+
|
| 111 |
+
11. **ADR-010 — reward adapter can return the wrong number of rewards**
|
| 112 |
+
**File/function:** `composer_replication/datagen/env.py::FeatureDeletionEnv.reward_fn`
|
| 113 |
+
**Why:** It loops with `zip(completions, task_id)`. If `task_id` length differs from `completions`, it silently truncates and returns fewer rewards than TRL expects.
|
| 114 |
+
**Concrete fix:** Assert `len(task_id) == len(completions) == len(prompts)`.
|
| 115 |
+
|
| 116 |
+
12. **ADR-010 — `LocalSubprocessSandbox` is not a Docker/image sandbox**
|
| 117 |
+
**File/function:** `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox`
|
| 118 |
+
**Why:** `boot(image)` does not materialize an image. It just stores the string. Tests run in the mutable local `workdir`. This contradicts the ADR’s frozen-image/scrubbed-repo premise.
|
| 119 |
+
**Concrete fix:** Implement a real Docker-backed sandbox or rename this as an unsafe local dev runner and keep it out of acceptance gates.
|
| 120 |
+
|
| 121 |
+
13. **ADR-010 — test parsing is unreliable and can misgrade**
|
| 122 |
+
**File/function:** `composer_replication/datagen/sandbox.py::LocalSubprocessSandbox.run_tests`
|
| 123 |
+
**Why:** It marks a test passed if `f"{t} PASSED"` appears, otherwise uses a coarse return-code fallback. Many test commands do not emit that format unless verbose mode is enabled; parametrized node ids, xfail/xpass, collection skips, and non-pytest substrates will be misclassified.
|
| 124 |
+
**Concrete fix:** Use structured pytest JSON/JUnit output where possible, substrate-specific parsers otherwise, and fail closed on unknown parse states.
|
| 125 |
+
|
| 126 |
+
## P2 / nits
|
| 127 |
+
|
| 128 |
+
1. **Duplicate denylist entry**
|
| 129 |
+
**File/function:** `composer_replication/datagen/sandbox.py::SANDBOX_DENYLIST`
|
| 130 |
+
`unzip` appears twice.
|
| 131 |
+
|
| 132 |
+
2. **Weak drift assertion message**
|
| 133 |
+
**File/function:** `composer_replication/trainer/composer_trainer.py::make_dr_grpo_config`
|
| 134 |
+
Assertion says `GRPOConfig dropped loss_type`, but the real risk is semantic drift, not field absence.
|
| 135 |
+
|
| 136 |
+
3. **Smoke comment is misleading**
|
| 137 |
+
**File/function:** `examples/composer_grpo_sdpo_smoke/run.py::main`
|
| 138 |
+
It says `strict=False` makes absence of error sites a clean no-op. Strictness is irrelevant when `ctx_teacher_input_ids` is absent; the function returns before alignment checking.
|
| 139 |
+
|
| 140 |
+
4. **LLM cache key lacks prompt/model/version**
|
| 141 |
+
**File/function:** `composer_replication/hint_generator.py::LLMJudgeHintGenerator._cache_key`
|
| 142 |
+
Same error context will reuse stale hints after prompt/model/policy changes. Include prompt template version and model id.
|
| 143 |
+
|
| 144 |
+
5. **`FeatureDeletionTask.golden_diff` is hidden from `repr`, not from access**
|
| 145 |
+
**File/function:** `composer_replication/datagen/schema.py::FeatureDeletionTask`
|
| 146 |
+
`repr=False` is not a leakage boundary. Ensure prompt/rendering paths and serialized training examples exclude it.
|
| 147 |
+
|
| 148 |
+
## Anything the self-review likely missed
|
| 149 |
+
|
| 150 |
+
- The ADR-008 acceptance gate confuses “trainer instantiated” with “custom loss path executed.” The current smoke can pass if the parent TRL loss path never calls the subclass’s SDPO logic.
|
| 151 |
+
- The ADR-010 tests mostly validate control-flow plumbing against `FakeSandbox`, not the hard parts: real patch inversion, real test execution, real sandbox isolation, and reward sensitivity to completions.
|
| 152 |
+
- The reward-hacking controls are described as defense-in-depth, but the “primary” controls — scrubbed caches/history and a real allowlisted sandbox — are absent. What remains is a brittle denylist plus a string-search monitor.
|
| 153 |
+
- ADR-009’s raw-error fallback is too eager. It optimizes for cheapness but undermines the stated goal of covering style/communication/effort errors with natural-language judge feedback.
|
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Review by x-ai/grok-4.3-20260430 (20.1s, 24615/1645 tok)
|
| 2 |
+
|
| 3 |
+
**## Verdict (ADR-008)**
|
| 4 |
+
REJECT
|
| 5 |
+
|
| 6 |
+
**## Verdict (ADR-009)**
|
| 7 |
+
ACCEPT
|
| 8 |
+
|
| 9 |
+
**## Verdict (ADR-010)**
|
| 10 |
+
ACCEPT-WITH-FIXES
|
| 11 |
+
|
| 12 |
+
**## P0 findings**
|
| 13 |
+
1. composer_trainer.py:280 — `assert str(cfg.scale_rewards) in ("none", "False", "False")` is a copy-paste error and does not match TRL 1.5.0 GRPOConfig storage (scale_rewards is stored as str "none" or bool False). This guard will fail or silently pass on valid Dr. GRPO configs. Fix: change to `str(cfg.scale_rewards).lower() in ("none", "false")`.
|
| 14 |
+
2. composer_trainer.py:158-160 and _compute_sdpo_loss:191 — the "strict alignment guard" only does `student_logits.shape != teacher_logits.shape`. It performs no content/position alignment check on the post-hint tokens, so a collator that pads incorrectly but produces equal-length tensors still produces a silent zeroed loss. The ADR-008 trust-gap is not closed. Fix: add an explicit mask or token-id prefix check before the JSD call.
|
| 15 |
+
3. datagen/validator.py:92 — Gate 4 (`apply_gold`) only verifies that the gold diff turns tests green; it never tests that the deletion site is reachable from the FAIL_TO_PASS tests via the agent's observation (no AST or coverage probe). The claim "deletion is reachable from tests" is not verified. Fix: add a coverage or import-graph reachability assertion inside the gold-applied state.
|
| 16 |
+
|
| 17 |
+
**## P1 findings**
|
| 18 |
+
- composer_trainer.py:273 (make_dr_grpo_config) — never sets or asserts the KL estimator (k1 vs k3). TRL 1.5.0 GRPOConfig default is not guaranteed to be the `-log r` estimator cited in the tech report; the config is incomplete vs the ADR claim.
|
| 19 |
+
- datagen/sandbox.py:57 and monitor.py:42 — SANDBOX_DENYLIST only checks the first token of a shell command. Concrete bypass: `python -c 'import marshal, dis; ...'` or `import importlib.util; ...` to read `__pycache__/*.pyc` without invoking any denied binary. Reward can still be gamed.
|
| 20 |
+
- datagen/curriculum.py:48 — `raw_rate` quarantine decision uses the unsmoothed rate, but `update` is called with `n_total=1` per completion in env.py:reward_fn. A task that passes once then fails forever can never be quarantined after min_exposures because the running count never reflects sustained failure.
|
| 21 |
+
- datagen/env.py:140 — `reward_fn` calls `curriculum.update(..., n_total=1)` even on hacked or guard-broken trajectories (reward==0). This pollutes the difficulty prior for tasks that are only solvable via hacks.
|
| 22 |
+
|
| 23 |
+
**## P2 / nits**
|
| 24 |
+
- examples/composer_grpo_sdpo_smoke/run.py:88 — smoke test forces `strict_sdpo_alignment=False`, so the only alignment test that would have caught the shape-mismatch bug is disabled in the "live loop" proof.
|
| 25 |
+
- datagen/tests/test_feature_deletion.py:140 — all validator tests use FakeSandbox materializers that directly mutate `test_outcomes`; they are tautological and do not exercise the real "deletion reachable" path.
|
| 26 |
+
- hint_generator.py:212 — LLMJudgeHintGenerator cache key includes the full error_meta dict; any non-deterministic key (timestamps, random seeds) defeats the disk cache.
|
| 27 |
+
|
| 28 |
+
**## Anything the self-review likely missed**
|
| 29 |
+
- The Dr. GRPO `loss_type="dr_grpo"` + `scale_rewards="none"` combination was only asserted to exist in TRL 1.5.0 help text; no test actually instantiates GRPOConfig and checks that the resulting advantages have zero length normalization and no std scaling. The acceptance gate is therefore unverified against the actual TRL implementation.
|