Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
fix(phase-8): close all 5 cross-family final-verify findings + regression tests
Browse filesThe deep-work-loop exit gate (2 independent confirmations). A final 3-family
scatter of the session diff found and we fixed:
- [P0 latent] SDPO sentinel-mask used only student_response_valid; now requires
BOTH valid masks AND non-sentinel indices (a future divergent teacher tail
could have distilled against clamped position 0).
- [P1 convergent, GPT-5.5+Gemini] curriculum shared one n_effort denominator for
turns+think_tokens; split into independent n_turns/n_think counters.
- [P1] docker E2E test ran pip-install under --network none (would fail when the
gate activates); replaced with stdlib python -c expression runner, no network.
- [P1] MMLU reward: Answer: A or B hedge scored full credit; _HEDGE_RE now
penalizes it as multiple-answers.
- [P1] HackMonitor read-markers matched import in important / cat in concatenate
and scanned the agent patch as read-evidence; whole-word regex + exclude
patch/diff payloads from BOTH layers (a legit patch can no longer self-incriminate).
232 passed / 18 skipped / 0 failed. 5 regression tests added. Synthesis +
raw reviews in docs/reviews/final-verify-deep-work-2026-05-29/.
- composer_replication/datagen/curriculum.py +26 -17
- composer_replication/datagen/monitor.py +25 -5
- composer_replication/datagen/tests/test_docker_substrate_e2e.py +34 -36
- composer_replication/datagen/tests/test_feature_deletion.py +41 -0
- composer_replication/integrations/altered_minds/reward.py +17 -2
- composer_replication/integrations/altered_minds/tests/test_channel_ladder.py +23 -0
- composer_replication/trainer/composer_trainer.py +14 -4
- docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md +33 -0
- docs/reviews/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md +4 -0
- docs/reviews/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md +19 -0
|
@@ -25,23 +25,30 @@ from dataclasses import dataclass, field
|
|
| 25 |
class _TaskStats:
|
| 26 |
n_pass: float = 0.0
|
| 27 |
n_total: int = 0
|
| 28 |
-
# Running means of effort signals (ADR-012 finding #4).
|
| 29 |
-
#
|
| 30 |
-
#
|
|
|
|
| 31 |
mean_turns: float = 0.0
|
| 32 |
mean_think: float = 0.0
|
| 33 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 34 |
|
| 35 |
def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
|
| 36 |
-
"""Fold optional turn / think-token signals into running means."""
|
| 37 |
-
if turns is None and think_tokens is None:
|
| 38 |
-
return
|
| 39 |
-
self.n_effort += 1
|
| 40 |
-
k = self.n_effort
|
| 41 |
if turns is not None:
|
| 42 |
-
self.
|
|
|
|
| 43 |
if think_tokens is not None:
|
| 44 |
-
self.
|
|
|
|
| 45 |
|
| 46 |
@property
|
| 47 |
def p_hat(self) -> float:
|
|
@@ -128,15 +135,17 @@ class DifficultyCurriculum:
|
|
| 128 |
if st is None or st.n_effort == 0:
|
| 129 |
return 1.0
|
| 130 |
max_turns = max(
|
| 131 |
-
(s.mean_turns for s in self._stats.values() if s.
|
| 132 |
)
|
| 133 |
max_think = max(
|
| 134 |
-
(s.mean_think for s in self._stats.values() if s.
|
| 135 |
)
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
components = [
|
|
|
|
|
|
|
| 140 |
z = sum(components) / len(components) if components else 0.0
|
| 141 |
return 1.0 + self.effort_gain * z
|
| 142 |
|
|
|
|
| 25 |
class _TaskStats:
|
| 26 |
n_pass: float = 0.0
|
| 27 |
n_total: int = 0
|
| 28 |
+
# Running means of effort signals (ADR-012 finding #4). Each signal has its
|
| 29 |
+
# OWN exposure counter because turns/think_tokens are INDEPENDENTLY optional
|
| 30 |
+
# per update — sharing one denominator would corrupt the mean of whichever
|
| 31 |
+
# signal was present when the other was absent (final-verify 2026-05-29).
|
| 32 |
mean_turns: float = 0.0
|
| 33 |
mean_think: float = 0.0
|
| 34 |
+
n_turns: int = 0
|
| 35 |
+
n_think: int = 0
|
| 36 |
+
|
| 37 |
+
@property
|
| 38 |
+
def n_effort(self) -> int:
|
| 39 |
+
"""Exposures that supplied AT LEAST ONE effort signal (max of the two
|
| 40 |
+
per-signal counters — back-compat accessor for callers that just want
|
| 41 |
+
'did we ever see effort')."""
|
| 42 |
+
return max(self.n_turns, self.n_think)
|
| 43 |
|
| 44 |
def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
|
| 45 |
+
"""Fold optional turn / think-token signals into independent running means."""
|
|
|
|
|
|
|
|
|
|
|
|
|
| 46 |
if turns is not None:
|
| 47 |
+
self.n_turns += 1
|
| 48 |
+
self.mean_turns += (turns - self.mean_turns) / self.n_turns
|
| 49 |
if think_tokens is not None:
|
| 50 |
+
self.n_think += 1
|
| 51 |
+
self.mean_think += (think_tokens - self.mean_think) / self.n_think
|
| 52 |
|
| 53 |
@property
|
| 54 |
def p_hat(self) -> float:
|
|
|
|
| 135 |
if st is None or st.n_effort == 0:
|
| 136 |
return 1.0
|
| 137 |
max_turns = max(
|
| 138 |
+
(s.mean_turns for s in self._stats.values() if s.n_turns), default=0.0
|
| 139 |
)
|
| 140 |
max_think = max(
|
| 141 |
+
(s.mean_think for s in self._stats.values() if s.n_think), default=0.0
|
| 142 |
)
|
| 143 |
+
# Only count a signal for THIS task if the task actually recorded it.
|
| 144 |
+
z_turns = (st.mean_turns / max_turns) if (max_turns > 0 and st.n_turns) else 0.0
|
| 145 |
+
z_think = (st.mean_think / max_think) if (max_think > 0 and st.n_think) else 0.0
|
| 146 |
+
components = [
|
| 147 |
+
z for z, present in ((z_turns, st.n_turns), (z_think, st.n_think)) if present
|
| 148 |
+
]
|
| 149 |
z = sum(components) / len(components) if components else 0.0
|
| 150 |
return 1.0 + self.effort_gain * z
|
| 151 |
|
|
@@ -60,10 +60,16 @@ _ARTIFACT_MARKERS: tuple[str, ...] = (
|
|
| 60 |
)
|
| 61 |
|
| 62 |
# Actions that "read" something (a cache/bytecode artifact, for provenance).
|
|
|
|
|
|
|
| 63 |
_READ_MARKERS: tuple[str, ...] = (
|
| 64 |
"cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
|
| 65 |
"decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
|
| 66 |
-
"tail", "import",
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
)
|
| 68 |
|
| 69 |
|
|
@@ -92,10 +98,18 @@ class HackMonitor:
|
|
| 92 |
patch: str | None = None,
|
| 93 |
) -> bool:
|
| 94 |
# --- layer 1: signature substring matcher (defense-in-depth) ---------
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 95 |
sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
|
| 96 |
for action in trajectory:
|
|
|
|
|
|
|
| 97 |
blob = " ".join(
|
| 98 |
-
str(v) for v in action.
|
|
|
|
| 99 |
).lower()
|
| 100 |
if any(sig.lower() in blob for sig in sigs):
|
| 101 |
return True
|
|
@@ -148,15 +162,21 @@ class HackMonitor:
|
|
| 148 |
if not reappeared:
|
| 149 |
return False
|
| 150 |
for action in trajectory:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 151 |
raw = " ".join(
|
| 152 |
-
str(v) for v in action.
|
| 153 |
-
if isinstance(v, (str, int, float))
|
| 154 |
)
|
| 155 |
low = raw.lower()
|
| 156 |
norm = _normalize(low)
|
| 157 |
reads_artifact = (
|
| 158 |
any(m in norm for m in _ARTIFACT_MARKERS)
|
| 159 |
-
and
|
| 160 |
)
|
| 161 |
if reads_artifact:
|
| 162 |
return True
|
|
|
|
| 60 |
)
|
| 61 |
|
| 62 |
# Actions that "read" something (a cache/bytecode artifact, for provenance).
|
| 63 |
+
# Matched as WHOLE WORDS (see _has_read_verb) so 'cat' doesn't match 'concatenate'
|
| 64 |
+
# and 'import' doesn't match 'important' (final-verify 2026-05-29 false-positive).
|
| 65 |
_READ_MARKERS: tuple[str, ...] = (
|
| 66 |
"cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
|
| 67 |
"decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
|
| 68 |
+
"tail", "import", "marshal", "dis",
|
| 69 |
+
)
|
| 70 |
+
|
| 71 |
+
_READ_VERB_RE = re.compile(
|
| 72 |
+
r"\b(" + "|".join(re.escape(m) for m in _READ_MARKERS) + r")\b"
|
| 73 |
)
|
| 74 |
|
| 75 |
|
|
|
|
| 98 |
patch: str | None = None,
|
| 99 |
) -> bool:
|
| 100 |
# --- layer 1: signature substring matcher (defense-in-depth) ---------
|
| 101 |
+
# Scan trajectory ACTIONS for cache/decompiler signatures, but EXCLUDE
|
| 102 |
+
# the submitted patch/diff payload — that's the agent's output, not a
|
| 103 |
+
# read action, and a legit patch mentioning e.g. `__pycache__` in a
|
| 104 |
+
# comment must not self-incriminate (final-verify 2026-05-29). Layer 2
|
| 105 |
+
# handles the patch separately via provenance.
|
| 106 |
sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
|
| 107 |
for action in trajectory:
|
| 108 |
+
if action.get("type") == "submit_patch":
|
| 109 |
+
continue
|
| 110 |
blob = " ".join(
|
| 111 |
+
str(v) for k, v in action.items()
|
| 112 |
+
if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
|
| 113 |
).lower()
|
| 114 |
if any(sig.lower() in blob for sig in sigs):
|
| 115 |
return True
|
|
|
|
| 162 |
if not reappeared:
|
| 163 |
return False
|
| 164 |
for action in trajectory:
|
| 165 |
+
# The submitted patch/diff is the agent's OUTPUT, not a read action —
|
| 166 |
+
# exclude it from read-evidence so a legit patch that merely contains
|
| 167 |
+
# the word 'import' or a comment mentioning a cache path can't
|
| 168 |
+
# self-incriminate (final-verify 2026-05-29 false-positive class).
|
| 169 |
+
if action.get("type") == "submit_patch":
|
| 170 |
+
continue
|
| 171 |
raw = " ".join(
|
| 172 |
+
str(v) for k, v in action.items()
|
| 173 |
+
if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
|
| 174 |
)
|
| 175 |
low = raw.lower()
|
| 176 |
norm = _normalize(low)
|
| 177 |
reads_artifact = (
|
| 178 |
any(m in norm for m in _ARTIFACT_MARKERS)
|
| 179 |
+
and bool(_READ_VERB_RE.search(low)) # whole-word read verb only
|
| 180 |
)
|
| 181 |
if reads_artifact:
|
| 182 |
return True
|
|
@@ -70,24 +70,17 @@ _MODULE_BROKEN = textwrap.dedent('''\
|
|
| 70 |
return a + b
|
| 71 |
''')
|
| 72 |
|
| 73 |
-
_TESTS = textwrap.dedent('''\
|
| 74 |
-
from feature import add
|
| 75 |
-
try:
|
| 76 |
-
from feature import mul
|
| 77 |
-
except ImportError:
|
| 78 |
-
mul = None
|
| 79 |
-
|
| 80 |
-
def test_add_guard(): # PASS_TO_PASS — must pass in broken state
|
| 81 |
-
assert add(2, 3) == 5
|
| 82 |
-
|
| 83 |
-
def test_mul_target(): # FAIL_TO_PASS — fails in broken state
|
| 84 |
-
assert mul is not None and mul(2, 3) == 6
|
| 85 |
-
''')
|
| 86 |
|
|
|
|
|
|
|
|
|
|
| 87 |
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
|
|
|
|
|
|
|
|
|
| 91 |
import os
|
| 92 |
import tempfile
|
| 93 |
|
|
@@ -95,50 +88,55 @@ def _run_in_container(workdir_files: dict[str, str], test_node: str) -> tuple[bo
|
|
| 95 |
for name, content in workdir_files.items():
|
| 96 |
with open(os.path.join(d, name), "w") as f:
|
| 97 |
f.write(content)
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
cmd = [
|
| 100 |
"docker", "run", "--rm", "--network", "none",
|
| 101 |
"-v", f"{d}:/work", "-w", "/work",
|
| 102 |
-
"python:3.11-slim",
|
| 103 |
-
"bash", "-lc",
|
| 104 |
-
f"pip install -q pytest >/dev/null 2>&1 && "
|
| 105 |
-
f"python -m pytest -q '{test_node}' 2>&1",
|
| 106 |
]
|
| 107 |
-
r = subprocess.run(cmd, capture_output=True, text=True, timeout=
|
| 108 |
out = (r.stdout or "") + (r.stderr or "")
|
| 109 |
return (r.returncode == 0, out)
|
| 110 |
|
| 111 |
|
| 112 |
def test_substrate_inversion_four_gates_on_real_container():
|
| 113 |
-
"""The 4 ADR-010 gates against a REAL container + REAL
|
| 114 |
|
| 115 |
Gate 1 (baseline green): solved state → target + guard both pass.
|
| 116 |
Gate 2 (deletion breaks): broken state → target FAILS.
|
| 117 |
Gate 3 (remains functional):broken state → guard PASSES.
|
| 118 |
Gate 4 (gold restores): gold-applied → target passes again.
|
| 119 |
"""
|
| 120 |
-
|
|
|
|
|
|
|
|
|
|
| 121 |
|
| 122 |
# Gate 1 — solved: both pass.
|
| 123 |
-
g1_target, _ = _run_in_container(
|
| 124 |
-
|
| 125 |
-
g1_guard, _ = _run_in_container(
|
| 126 |
-
{**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_add_guard")
|
| 127 |
assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
|
| 128 |
|
| 129 |
-
# Gate 2 — broken: target FAILS.
|
| 130 |
-
g2_target, g2_out = _run_in_container(
|
| 131 |
-
|
| 132 |
-
assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed in broken state:\n{g2_out}"
|
| 133 |
|
| 134 |
# Gate 3 — broken: guard still PASSES.
|
| 135 |
-
g3_guard, g3_out = _run_in_container(
|
| 136 |
-
{**tests_file, "feature.py": _MODULE_BROKEN}, "test_feature.py::test_add_guard")
|
| 137 |
assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
|
| 138 |
|
| 139 |
# Gate 4 — gold restores: target passes again (gold == the solved module).
|
| 140 |
-
g4_target, _ = _run_in_container(
|
| 141 |
-
{**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_mul_target")
|
| 142 |
assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"
|
| 143 |
|
| 144 |
|
|
|
|
| 70 |
return a + b
|
| 71 |
''')
|
| 72 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 73 |
|
| 74 |
+
def _run_in_container(workdir_files: dict[str, str], target_expr: str) -> tuple[bool, str]:
|
| 75 |
+
"""Materialize files in a fresh python:3.11-slim container and evaluate one
|
| 76 |
+
boolean expression against the module, return (passed, output).
|
| 77 |
|
| 78 |
+
Uses a plain stdlib `python -c` runner — NO pip install, so `--network none`
|
| 79 |
+
is honored (final-verify 2026-05-29: the earlier `pip install pytest` inside
|
| 80 |
+
a network-disabled container would fail exactly when the gate activates).
|
| 81 |
+
`target_expr` is a Python expression over the imported `feature` module that
|
| 82 |
+
must evaluate truthy for a 'pass', e.g. `mul(2,3)==6`.
|
| 83 |
+
"""
|
| 84 |
import os
|
| 85 |
import tempfile
|
| 86 |
|
|
|
|
| 88 |
for name, content in workdir_files.items():
|
| 89 |
with open(os.path.join(d, name), "w") as f:
|
| 90 |
f.write(content)
|
| 91 |
+
runner = (
|
| 92 |
+
"import sys\n"
|
| 93 |
+
"try:\n"
|
| 94 |
+
" import feature\n"
|
| 95 |
+
f" ok = bool({target_expr})\n"
|
| 96 |
+
"except Exception as e:\n"
|
| 97 |
+
" print('EXC', e); sys.exit(1)\n"
|
| 98 |
+
"print('PASS' if ok else 'FAIL'); sys.exit(0 if ok else 1)\n"
|
| 99 |
+
)
|
| 100 |
+
with open(os.path.join(d, "_runner.py"), "w") as f:
|
| 101 |
+
f.write(runner)
|
| 102 |
cmd = [
|
| 103 |
"docker", "run", "--rm", "--network", "none",
|
| 104 |
"-v", f"{d}:/work", "-w", "/work",
|
| 105 |
+
"python:3.11-slim", "python", "_runner.py",
|
|
|
|
|
|
|
|
|
|
| 106 |
]
|
| 107 |
+
r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
|
| 108 |
out = (r.stdout or "") + (r.stderr or "")
|
| 109 |
return (r.returncode == 0, out)
|
| 110 |
|
| 111 |
|
| 112 |
def test_substrate_inversion_four_gates_on_real_container():
|
| 113 |
+
"""The 4 ADR-010 gates against a REAL container + REAL python import, not FakeSandbox.
|
| 114 |
|
| 115 |
Gate 1 (baseline green): solved state → target + guard both pass.
|
| 116 |
Gate 2 (deletion breaks): broken state → target FAILS.
|
| 117 |
Gate 3 (remains functional):broken state → guard PASSES.
|
| 118 |
Gate 4 (gold restores): gold-applied → target passes again.
|
| 119 |
"""
|
| 120 |
+
solved = {"feature.py": _MODULE_SOLVED}
|
| 121 |
+
broken = {"feature.py": _MODULE_BROKEN}
|
| 122 |
+
TARGET = "feature.mul(2, 3) == 6" # FAIL_TO_PASS — exercises the deleted symbol
|
| 123 |
+
GUARD = "feature.add(2, 3) == 5" # PASS_TO_PASS — must survive the deletion
|
| 124 |
|
| 125 |
# Gate 1 — solved: both pass.
|
| 126 |
+
g1_target, _ = _run_in_container(solved, TARGET)
|
| 127 |
+
g1_guard, _ = _run_in_container(solved, GUARD)
|
|
|
|
|
|
|
| 128 |
assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
|
| 129 |
|
| 130 |
+
# Gate 2 — broken: target FAILS (mul is gone → AttributeError → non-truthy).
|
| 131 |
+
g2_target, g2_out = _run_in_container(broken, TARGET)
|
| 132 |
+
assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed broken:\n{g2_out}"
|
|
|
|
| 133 |
|
| 134 |
# Gate 3 — broken: guard still PASSES.
|
| 135 |
+
g3_guard, g3_out = _run_in_container(broken, GUARD)
|
|
|
|
| 136 |
assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
|
| 137 |
|
| 138 |
# Gate 4 — gold restores: target passes again (gold == the solved module).
|
| 139 |
+
g4_target, _ = _run_in_container(solved, TARGET)
|
|
|
|
| 140 |
assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"
|
| 141 |
|
| 142 |
|
|
@@ -102,6 +102,47 @@ def test_monitor_passes_clean_reimplementation():
|
|
| 102 |
assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
|
| 103 |
|
| 104 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
def test_reward_masked_when_hack_detected():
|
| 106 |
sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
|
| 107 |
env = FeatureDeletionEnv(sb)
|
|
|
|
| 102 |
assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
|
| 103 |
|
| 104 |
|
| 105 |
+
def test_monitor_no_false_positive_on_legit_import_words():
|
| 106 |
+
"""Final-verify 2026-05-29: bare-substring read markers flagged 'import' in
|
| 107 |
+
'important' / 'cat' in 'concatenate'. With whole-word matching + patch
|
| 108 |
+
exclusion, a legit patch that reintroduces the symbol and uses normal words
|
| 109 |
+
must NOT be flagged when there's no real cache/bytecode read."""
|
| 110 |
+
mon = HackMonitor()
|
| 111 |
+
traj = [
|
| 112 |
+
{"type": "comment", "text": "this is an important concatenation of features"},
|
| 113 |
+
{"type": "submit_patch", "patch": "def feature_a():\n return important() # __pycache__ mention in a comment"},
|
| 114 |
+
]
|
| 115 |
+
# Symbol reappears in the patch AND the words 'important'/'concatenation'
|
| 116 |
+
# appear and the patch comment even mentions __pycache__ — but there is NO
|
| 117 |
+
# actual read action, and the patch text is excluded from read-evidence.
|
| 118 |
+
assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
|
| 119 |
+
|
| 120 |
+
|
| 121 |
+
def test_monitor_still_flags_obfuscated_cache_read():
|
| 122 |
+
"""The patch-provenance layer still catches a string-concat-obfuscated cache
|
| 123 |
+
read that reintroduces a deleted symbol (the real attack it's for)."""
|
| 124 |
+
mon = HackMonitor()
|
| 125 |
+
traj = [
|
| 126 |
+
{"type": "shell", "command": "python -c \"import marshal; d=open('__py'+'cache__/widget.pyc','rb').read()\""},
|
| 127 |
+
{"type": "submit_patch", "patch": "def feature_a(): return 42 # recovered"},
|
| 128 |
+
]
|
| 129 |
+
assert mon.flag(traj, deleted_symbols=("feature_a",)) is True
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
def test_curriculum_independent_effort_signals():
|
| 133 |
+
"""Final-verify 2026-05-29: turns and think_tokens are INDEPENDENTLY optional;
|
| 134 |
+
they must not share a denominator. A task updated with turns-only then
|
| 135 |
+
think-only must record correct per-signal means (each = its single value),
|
| 136 |
+
not values halved by a shared n_effort counter."""
|
| 137 |
+
cur = DifficultyCurriculum()
|
| 138 |
+
cur.update("t", n_pass=1, n_total=2, turns=10.0) # turns only
|
| 139 |
+
cur.update("t", n_pass=1, n_total=2, think_tokens=200.0) # think only
|
| 140 |
+
st = cur._stats["t"]
|
| 141 |
+
assert st.mean_turns == 10.0, f"turns mean corrupted: {st.mean_turns}"
|
| 142 |
+
assert st.mean_think == 200.0, f"think mean corrupted: {st.mean_think}"
|
| 143 |
+
assert st.n_turns == 1 and st.n_think == 1
|
| 144 |
+
|
| 145 |
+
|
| 146 |
def test_reward_masked_when_hack_detected():
|
| 147 |
sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
|
| 148 |
env = FeatureDeletionEnv(sb)
|
|
@@ -42,6 +42,15 @@ _ANSWER_RE = re.compile(r"answer\s*[:\-]\s*\*{0,2}([A-D])\b", re.IGNORECASE)
|
|
| 42 |
_JSON_ANSWER_RE = re.compile(
|
| 43 |
r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
|
| 44 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
|
| 47 |
def _find_markers(text: str) -> list[str]:
|
|
@@ -70,8 +79,14 @@ def parse_final_answer(completion: str) -> tuple[str | None, int]:
|
|
| 70 |
markers = _find_markers(completion)
|
| 71 |
if not markers:
|
| 72 |
return None, 0
|
| 73 |
-
distinct =
|
| 74 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
|
| 77 |
@dataclass
|
|
|
|
| 42 |
_JSON_ANSWER_RE = re.compile(
|
| 43 |
r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
|
| 44 |
)
|
| 45 |
+
# Hedge AFTER an answer marker: ``Answer: A or B`` / ``Answer: A/B`` /
|
| 46 |
+
# ``Answer: A, B`` — a single marker that names a SECOND distinct option is a
|
| 47 |
+
# format hedge and must be treated as multiple-answers, not full credit for the
|
| 48 |
+
# first letter (final-verify 2026-05-29). Captures the lead letter + the hedged
|
| 49 |
+
# second letter immediately following via or / slash / comma / 'and'.
|
| 50 |
+
_HEDGE_RE = re.compile(
|
| 51 |
+
r"answer\s*[:\-]\s*\*{0,2}([A-D])\b\s*(?:or|and|/|,|\|)\s*\*{0,2}([A-D])\b",
|
| 52 |
+
re.IGNORECASE,
|
| 53 |
+
)
|
| 54 |
|
| 55 |
|
| 56 |
def _find_markers(text: str) -> list[str]:
|
|
|
|
| 79 |
markers = _find_markers(completion)
|
| 80 |
if not markers:
|
| 81 |
return None, 0
|
| 82 |
+
distinct = set(markers)
|
| 83 |
+
# Hedge detection: a single marker naming a second distinct option
|
| 84 |
+
# ("Answer: A or B") adds the hedged letter to the distinct set, so it is
|
| 85 |
+
# penalized as a multiple-answers format hack instead of scoring the lead.
|
| 86 |
+
for m in _HEDGE_RE.finditer(completion or ""):
|
| 87 |
+
distinct.add(m.group(1).upper())
|
| 88 |
+
distinct.add(m.group(2).upper())
|
| 89 |
+
return markers[-1], len(distinct)
|
| 90 |
|
| 91 |
|
| 92 |
@dataclass
|
|
@@ -23,6 +23,29 @@ from composer_replication.integrations.altered_minds import (
|
|
| 23 |
dual_kl_logger,
|
| 24 |
randomize_options,
|
| 25 |
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
|
| 27 |
|
| 28 |
# ===========================================================================
|
|
|
|
| 23 |
dual_kl_logger,
|
| 24 |
randomize_options,
|
| 25 |
)
|
| 26 |
+
from composer_replication.integrations.altered_minds.reward import parse_final_answer
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
def test_hedged_answer_is_penalized_not_full_credit():
|
| 30 |
+
"""Final-verify 2026-05-29: 'Answer: A or B' / 'Answer: A/B' must NOT score
|
| 31 |
+
full credit for A — a hedge naming a second distinct option is a multiple-
|
| 32 |
+
answers format hack (n_distinct >= 2)."""
|
| 33 |
+
for hedge in ("Answer: A or B", "Answer: A/B", "Answer: A, B", "Answer: A and C"):
|
| 34 |
+
letter, n_distinct = parse_final_answer(hedge)
|
| 35 |
+
assert n_distinct >= 2, f"hedge {hedge!r} not detected (n_distinct={n_distinct})"
|
| 36 |
+
# A clean single answer is still n_distinct == 1.
|
| 37 |
+
letter, n_distinct = parse_final_answer("After reasoning, Answer: C")
|
| 38 |
+
assert letter == "C" and n_distinct == 1
|
| 39 |
+
# Two clean markers of the SAME letter are not a hedge.
|
| 40 |
+
_, n_same = parse_final_answer("Answer: C ... wait, Answer: C")
|
| 41 |
+
assert n_same == 1
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
def test_hedged_answer_scores_multiple_penalty_via_reward():
|
| 45 |
+
r = MMLUFormatReward()
|
| 46 |
+
out = r(prompts=["p"], completions=["Answer: A or B"], answers=["A"])
|
| 47 |
+
# Even though the gold is A and A is the lead letter, the hedge is penalized.
|
| 48 |
+
assert out[0] == r.multiple_answers_reward
|
| 49 |
|
| 50 |
|
| 51 |
# ===========================================================================
|
|
@@ -231,10 +231,20 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
|
|
| 231 |
# to 0 before gathering, then neutralize those positions by feeding
|
| 232 |
# labels=-100 (the standard HF ignore convention that generalized_jsd_loss
|
| 233 |
# already honors). This makes sentinel/padding positions contribute 0.
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 238 |
|
| 239 |
vocab = student_logits.size(-1)
|
| 240 |
s_safe = s_idx.clamp_min(0)
|
|
|
|
| 231 |
# to 0 before gathering, then neutralize those positions by feeding
|
| 232 |
# labels=-100 (the standard HF ignore convention that generalized_jsd_loss
|
| 233 |
# already honors). This makes sentinel/padding positions contribute 0.
|
| 234 |
+
#
|
| 235 |
+
# Final-verify 2026-05-29: combine BOTH valid masks (not just student's)
|
| 236 |
+
# AND the sentinel guard. If a future collator ever emits divergent
|
| 237 |
+
# student/teacher valid tails, a teacher sentinel clamped to 0 would
|
| 238 |
+
# otherwise be silently distilled against teacher position 0. Belt-and-
|
| 239 |
+
# suspenders: valid iff student-valid AND teacher-valid AND both indices
|
| 240 |
+
# non-sentinel.
|
| 241 |
+
s_valid = inputs.get("student_response_valid")
|
| 242 |
+
t_valid = inputs.get("teacher_response_valid")
|
| 243 |
+
aligned_mask = (s_idx >= 0) & (t_idx >= 0)
|
| 244 |
+
if s_valid is not None:
|
| 245 |
+
aligned_mask = aligned_mask & s_valid.bool()
|
| 246 |
+
if t_valid is not None:
|
| 247 |
+
aligned_mask = aligned_mask & t_valid.bool()
|
| 248 |
|
| 249 |
vocab = student_logits.size(-1)
|
| 250 |
s_safe = s_idx.clamp_min(0)
|
|
@@ -0,0 +1,33 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Phase-8 final cross-family verify — deep-work-loop 2026-05-29
|
| 2 |
+
|
| 3 |
+
The deep-work-loop exit gate (Hard Rule 6: two independent confirmations). After
|
| 4 |
+
Waves A+B shipped (ADR-011/012/013, 227 tests), the net session code diff
|
| 5 |
+
(~92KB) was scattered to 3 diverse families for a final "would this break a real
|
| 6 |
+
run" pass. 2 of 3 returned clean (GPT-5.5, Gemini; DeepSeek starved its reasoning
|
| 7 |
+
budget). The verify EARNED ITS KEEP — it found a latent P0 + a convergent bug +
|
| 8 |
+
a self-bug in this session's own new code.
|
| 9 |
+
|
| 10 |
+
## Findings (all verified against code, all FIXED)
|
| 11 |
+
|
| 12 |
+
| # | Finding | Severity | Reviewers | Fix |
|
| 13 |
+
|---|---|---|---|---|
|
| 14 |
+
| 1 | SDPO sentinel-mask used only `student_response_valid`, ignored `teacher_response_valid` — a future divergent teacher tail would distill against clamped position 0 | P0 (latent) | GPT-5.5 | `aligned_mask = (s_idx>=0)&(t_idx>=0)&student_valid&teacher_valid` |
|
| 15 |
+
| 2 | Curriculum `_TaskStats` shared one `n_effort` denominator for both turns + think_tokens → corrupts a mean when only one signal is present | P1 (convergent) | GPT-5.5 + Gemini | separate `n_turns`/`n_think` counters |
|
| 16 |
+
| 3 | Docker E2E test ran `pip install pytest` under `--network none` → would fail exactly when the gate activates | P1 | GPT-5.5 | dropped pytest; stdlib `python -c` expression runner, no network needed |
|
| 17 |
+
| 4 | MMLU reward: `Answer: A or B` hedge parsed as `A` with `n_distinct=1` → full credit | P1 | GPT-5.5 | `_HEDGE_RE` adds the hedged second letter to the distinct set → multiple-answers penalty |
|
| 18 |
+
| 5 | HackMonitor read-markers bare-substring matched `import` in `important`, `cat` in `concatenate`; and scanned the submitted patch as read-evidence → false positives | P1 | GPT-5.5 | whole-word `_READ_VERB_RE` + exclude `submit_patch`/`patch`/`diff` payloads from BOTH layers |
|
| 19 |
+
|
| 20 |
+
Finding 5's fix surfaced a follow-on: layer-1's signature matcher ALSO scanned
|
| 21 |
+
the patch payload (a legit patch mentioning `__pycache__` in a comment got
|
| 22 |
+
flagged). The regression test caught it; fixed by excluding the patch payload
|
| 23 |
+
from layer-1 too. Net: the monitor now only treats actual read ACTIONS as
|
| 24 |
+
provenance evidence, never the agent's own output patch.
|
| 25 |
+
|
| 26 |
+
## Outcome
|
| 27 |
+
|
| 28 |
+
All 5 findings fixed + 5 regression tests added. Final suite: **232 passed / 18
|
| 29 |
+
skipped, 0 failed**. Both the execution view (workers reported done) and the
|
| 30 |
+
review view (final verify findings all closed) confirm the loop is empty — the
|
| 31 |
+
two independent confirmations the deep-work-loop requires.
|
| 32 |
+
|
| 33 |
+
Raw reviews: `verify_gpt-5.5.md`, `verify_gemini-3.1-pro.md`.
|
|
@@ -0,0 +1,4 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
## Verdict SHIP-WITH-FIXES
|
| 2 |
+
## P0 none
|
| 3 |
+
## P1
|
| 4 |
+
1. `composer_replication/datagen/curriculum.py:observe_effort` — The running mean math shares the `n_effort` denominator for both `turns` and `think_tokens`. Because these signals are independently optional (`float | None`), if one is provided without the other, its running mean is corrupted (divided by total effort updates rather than the count of that specific signal). **Fix:** Track `n_turns` and `n_think` separately, and use them as the respective denominators when updating `mean_turns` and `mean_think`. Update `_effort_factor` to check `st.n_turns > 0` and `st.n_think > 0` instead of `st.n_effort`.
|
|
@@ -0,0 +1,19 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
## Verdict
|
| 2 |
+
BLOCK
|
| 3 |
+
|
| 4 |
+
## P0
|
| 5 |
+
1. `composer_replication/trainer/composer_trainer.py:_compute_sdpo_loss` — sentinel masking only uses `student_response_valid`; `teacher_response_valid` is ignored. If student/teacher valid tails differ under truncation/tokenization/ragged rows, `teacher_response_idx == -1` is clamped to `0` but still labeled valid, so the JSD silently distills against teacher position 0 garbage. If K differs, this can also shape-crash.
|
| 6 |
+
**Fix:** build `aligned_mask = student_valid & teacher_valid & (s_idx >= 0) & (t_idx >= 0)`, require/repair common `(B,K)` padding, and in strict mode assert valid masks/counts are compatible before gather. Return zero loss if no valid aligned tokens.
|
| 7 |
+
|
| 8 |
+
## P1
|
| 9 |
+
1. `composer_replication/integrations/altered_minds/reward.py:parse_final_answer/_ANSWER_RE` — multiple-answer hedges in a single marker are not detected: e.g. `"Answer: A or B"` / `"Answer: A/B"` parses as `A` with `n_distinct == 1`, so it can receive full credit.
|
| 10 |
+
**Fix:** after an `Answer:` marker, inspect the answer span and reject/penalize if it contains more than one distinct option letter before the line/end punctuation, not just multiple markers.
|
| 11 |
+
|
| 12 |
+
2. `composer_replication/datagen/curriculum.py:_TaskStats.observe_effort/_effort_factor` — one shared `n_effort` is used for both optional signals. If updates sometimes provide only `turns` and sometimes only `think_tokens`, the missing signal still advances the denominator for the other mean, biasing effort means downward. `_effort_factor` also averages a task’s absent signal as `0` whenever any other task has that signal, making weights depend on missingness.
|
| 13 |
+
**Fix:** track `n_turns` and `n_think` separately; combine only signals actually present for that task.
|
| 14 |
+
|
| 15 |
+
3. `composer_replication/datagen/monitor.py:HackMonitor._patch_provenance_hack` — read detection scans every action value, including submitted patch/diff text, and uses bare substring markers (`"cat"`, `"import"`, etc.). A legitimate patch reintroducing a deleted symbol plus code/comment containing `__pycache__` and `import`/`important`/`concatenate` can be falsely flagged and reward-masked.
|
| 16 |
+
**Fix:** only scan actual shell/file-read/import actions for provenance, exclude `submit_patch`/diff payloads from read evidence, and match read verbs with token/command boundaries near artifact paths.
|
| 17 |
+
|
| 18 |
+
4. `composer_replication/datagen/tests/test_docker_substrate_e2e.py:_run_in_container` — the Docker-gated test runs containers with `--network none` and then executes `pip install -q pytest`; on a real Docker host with `python:3.11-slim`, this fails because pytest is not installed and network is disabled. This will break the suite exactly when the gate activates.
|
| 19 |
+
**Fix:** use an image with pytest preinstalled, vendor/install before disabling network, or avoid pytest inside the container.
|