Codeseys committed on
Commit
678d10b
·
1 Parent(s): c009712

fix(phase-8): close all 5 cross-family final-verify findings + regression tests


This is the deep-work-loop exit gate (2 independent confirmations). A final
3-family scatter of the session diff found five issues, all fixed here:

- [P0 latent] SDPO sentinel-mask used only student_response_valid; now requires
BOTH valid masks AND non-sentinel indices (a future divergent teacher tail
could have distilled against clamped position 0).
- [P1 convergent, GPT-5.5+Gemini] curriculum shared one n_effort denominator for
turns+think_tokens; split into independent n_turns/n_think counters.
- [P1] docker E2E test ran pip install pytest under --network none (would fail
when the gate activates); replaced with a stdlib expression runner
(python _runner.py), no network.
- [P1] MMLU reward: an "Answer: A or B" hedge scored full credit; _HEDGE_RE now
penalizes it as multiple-answers.
- [P1] HackMonitor read-markers matched "import" in "important" and "cat" in
"concatenate", and scanned the agent patch as read-evidence; now a whole-word
regex plus exclusion of patch/diff payloads from BOTH layers (a legit patch can
no longer self-incriminate).

232 passed / 18 skipped / 0 failed. 5 regression tests added. Synthesis +
raw reviews in docs/reviews/final-verify-deep-work-2026-05-29/.

composer_replication/datagen/curriculum.py CHANGED
@@ -25,23 +25,30 @@ from dataclasses import dataclass, field
25
  class _TaskStats:
26
  n_pass: float = 0.0
27
  n_total: int = 0
28
- # Running means of effort signals (ADR-012 finding #4). `n_effort` counts
29
- # exposures that supplied an effort signal (may differ from n_total since
30
- # turns/think_tokens are optional per update).
 
31
  mean_turns: float = 0.0
32
  mean_think: float = 0.0
33
- n_effort: int = 0
 
 
 
 
 
 
 
 
34
 
35
  def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
36
- """Fold optional turn / think-token signals into running means."""
37
- if turns is None and think_tokens is None:
38
- return
39
- self.n_effort += 1
40
- k = self.n_effort
41
  if turns is not None:
42
- self.mean_turns += (turns - self.mean_turns) / k
 
43
  if think_tokens is not None:
44
- self.mean_think += (think_tokens - self.mean_think) / k
 
45
 
46
  @property
47
  def p_hat(self) -> float:
@@ -128,15 +135,17 @@ class DifficultyCurriculum:
128
  if st is None or st.n_effort == 0:
129
  return 1.0
130
  max_turns = max(
131
- (s.mean_turns for s in self._stats.values() if s.n_effort), default=0.0
132
  )
133
  max_think = max(
134
- (s.mean_think for s in self._stats.values() if s.n_effort), default=0.0
135
  )
136
- z_turns = st.mean_turns / max_turns if max_turns > 0 else 0.0
137
- z_think = st.mean_think / max_think if max_think > 0 else 0.0
138
- # Combine the two normalized effort signals (mean of those present).
139
- components = [z for z, mx in ((z_turns, max_turns), (z_think, max_think)) if mx > 0]
 
 
140
  z = sum(components) / len(components) if components else 0.0
141
  return 1.0 + self.effort_gain * z
142
 
 
25
  class _TaskStats:
26
  n_pass: float = 0.0
27
  n_total: int = 0
28
+ # Running means of effort signals (ADR-012 finding #4). Each signal has its
29
+ # OWN exposure counter because turns/think_tokens are INDEPENDENTLY optional
30
+ # per update; sharing one denominator would corrupt the mean of whichever
31
+ # signal was present when the other was absent (final-verify 2026-05-29).
32
  mean_turns: float = 0.0
33
  mean_think: float = 0.0
34
+ n_turns: int = 0
35
+ n_think: int = 0
36
+
37
+ @property
38
+ def n_effort(self) -> int:
39
+ """Lower bound on exposures that supplied AT LEAST ONE effort signal
40
+ (max of the two per-signal counters); back-compat accessor for callers
41
+ that just want 'did we ever see effort'."""
42
+ return max(self.n_turns, self.n_think)
43
 
44
  def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
45
+ """Fold optional turn / think-token signals into independent running means."""
 
 
 
 
46
  if turns is not None:
47
+ self.n_turns += 1
48
+ self.mean_turns += (turns - self.mean_turns) / self.n_turns
49
  if think_tokens is not None:
50
+ self.n_think += 1
51
+ self.mean_think += (think_tokens - self.mean_think) / self.n_think
52
 
53
  @property
54
  def p_hat(self) -> float:
 
135
  if st is None or st.n_effort == 0:
136
  return 1.0
137
  max_turns = max(
138
+ (s.mean_turns for s in self._stats.values() if s.n_turns), default=0.0
139
  )
140
  max_think = max(
141
+ (s.mean_think for s in self._stats.values() if s.n_think), default=0.0
142
  )
143
+ # Only count a signal for THIS task if the task actually recorded it.
144
+ z_turns = (st.mean_turns / max_turns) if (max_turns > 0 and st.n_turns) else 0.0
145
+ z_think = (st.mean_think / max_think) if (max_think > 0 and st.n_think) else 0.0
146
+ components = [
147
+ z for z, present in ((z_turns, st.n_turns), (z_think, st.n_think)) if present
148
+ ]
149
  z = sum(components) / len(components) if components else 0.0
150
  return 1.0 + self.effort_gain * z
151
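The effect of the shared denominator can be reproduced in isolation. The sketch below is illustrative only: `SharedCounterStats` and `IndependentCounterStats` are hypothetical stand-ins for the pre-fix and post-fix `_TaskStats`, reduced to the effort fields, and the update sequence mirrors the regression test (turns only, then think tokens only).

```python
from dataclasses import dataclass


@dataclass
class SharedCounterStats:
    """Hypothetical stand-in for the pre-fix stats: one denominator for both signals."""
    mean_turns: float = 0.0
    mean_think: float = 0.0
    n_effort: int = 0

    def observe_effort(self, turns=None, think_tokens=None):
        if turns is None and think_tokens is None:
            return
        self.n_effort += 1
        k = self.n_effort  # shared by both running means
        if turns is not None:
            self.mean_turns += (turns - self.mean_turns) / k
        if think_tokens is not None:
            self.mean_think += (think_tokens - self.mean_think) / k


@dataclass
class IndependentCounterStats:
    """Hypothetical stand-in for the post-fix stats: one counter per signal."""
    mean_turns: float = 0.0
    mean_think: float = 0.0
    n_turns: int = 0
    n_think: int = 0

    def observe_effort(self, turns=None, think_tokens=None):
        if turns is not None:
            self.n_turns += 1
            self.mean_turns += (turns - self.mean_turns) / self.n_turns
        if think_tokens is not None:
            self.n_think += 1
            self.mean_think += (think_tokens - self.mean_think) / self.n_think


old, new = SharedCounterStats(), IndependentCounterStats()
for stats in (old, new):
    stats.observe_effort(turns=10.0)          # turns only
    stats.observe_effort(think_tokens=200.0)  # think tokens only

print(old.mean_turns, old.mean_think)  # 10.0 100.0
print(new.mean_turns, new.mean_think)  # 10.0 200.0
```

With the shared counter, the first think-token observation is divided by `k = 2`, so the mean lands at 100.0 instead of 200.0. With per-signal counters each mean equals its single observed value.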
 
composer_replication/datagen/monitor.py CHANGED
@@ -60,10 +60,16 @@ _ARTIFACT_MARKERS: tuple[str, ...] = (
60
  )
61
 
62
  # Actions that "read" something (a cache/bytecode artifact, for provenance).
 
 
63
  _READ_MARKERS: tuple[str, ...] = (
64
  "cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
65
  "decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
66
- "tail", "import",
 
 
 
 
67
  )
68
 
69
 
@@ -92,10 +98,18 @@ class HackMonitor:
92
  patch: str | None = None,
93
  ) -> bool:
94
  # --- layer 1: signature substring matcher (defense-in-depth) ---------
 
 
 
 
 
95
  sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
96
  for action in trajectory:
 
 
97
  blob = " ".join(
98
- str(v) for v in action.values() if isinstance(v, (str, int, float))
 
99
  ).lower()
100
  if any(sig.lower() in blob for sig in sigs):
101
  return True
@@ -148,15 +162,21 @@ class HackMonitor:
148
  if not reappeared:
149
  return False
150
  for action in trajectory:
 
 
 
 
 
 
151
  raw = " ".join(
152
- str(v) for v in action.values()
153
- if isinstance(v, (str, int, float))
154
  )
155
  low = raw.lower()
156
  norm = _normalize(low)
157
  reads_artifact = (
158
  any(m in norm for m in _ARTIFACT_MARKERS)
159
- and any(rm in low for rm in _READ_MARKERS)
160
  )
161
  if reads_artifact:
162
  return True
 
60
  )
61
 
62
  # Actions that "read" something (a cache/bytecode artifact, for provenance).
63
+ # Matched as WHOLE WORDS (see _READ_VERB_RE) so 'cat' doesn't match 'concatenate'
64
+ # and 'import' doesn't match 'important' (final-verify 2026-05-29 false-positive).
65
  _READ_MARKERS: tuple[str, ...] = (
66
  "cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
67
  "decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
68
+ "tail", "import", "marshal", "dis",
69
+ )
70
+
71
+ _READ_VERB_RE = re.compile(
72
+ r"\b(" + "|".join(re.escape(m) for m in _READ_MARKERS) + r")\b"
73
  )
74
 
75
 
 
98
  patch: str | None = None,
99
  ) -> bool:
100
  # --- layer 1: signature substring matcher (defense-in-depth) ---------
101
+ # Scan trajectory ACTIONS for cache/decompiler signatures, but EXCLUDE
102
+ # the submitted patch/diff payload — that's the agent's output, not a
103
+ # read action, and a legit patch mentioning e.g. `__pycache__` in a
104
+ # comment must not self-incriminate (final-verify 2026-05-29). Layer 2
105
+ # handles the patch separately via provenance.
106
  sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
107
  for action in trajectory:
108
+ if action.get("type") == "submit_patch":
109
+ continue
110
  blob = " ".join(
111
+ str(v) for k, v in action.items()
112
+ if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
113
  ).lower()
114
  if any(sig.lower() in blob for sig in sigs):
115
  return True
 
162
  if not reappeared:
163
  return False
164
  for action in trajectory:
165
+ # The submitted patch/diff is the agent's OUTPUT, not a read action —
166
+ # exclude it from read-evidence so a legit patch that merely contains
167
+ # the word 'import' or a comment mentioning a cache path can't
168
+ # self-incriminate (final-verify 2026-05-29 false-positive class).
169
+ if action.get("type") == "submit_patch":
170
+ continue
171
  raw = " ".join(
172
+ str(v) for k, v in action.items()
173
+ if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
174
  )
175
  low = raw.lower()
176
  norm = _normalize(low)
177
  reads_artifact = (
178
  any(m in norm for m in _ARTIFACT_MARKERS)
179
+ and bool(_READ_VERB_RE.search(low)) # whole-word read verb only
180
  )
181
  if reads_artifact:
182
  return True
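The difference between the old substring check and the whole-word regex can be seen on the strings from the regression test. This is a minimal sketch, not the monitor itself: `READ_MARKERS` is a shortened, illustrative subset of `_READ_MARKERS`, and `substring_match` / `whole_word_match` are hypothetical helper names.

```python
import re

# Illustrative subset of the real marker tuple (assumption: shortened for the example).
READ_MARKERS = ("cat", "read", "open", "load", "import")

# Same construction as _READ_VERB_RE in the diff: alternation wrapped in \b...\b.
READ_VERB_RE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in READ_MARKERS) + r")\b"
)


def substring_match(text: str) -> bool:
    """Pre-fix behaviour: any marker appearing anywhere inside the text."""
    low = text.lower()
    return any(m in low for m in READ_MARKERS)


def whole_word_match(text: str) -> bool:
    """Post-fix behaviour: the marker must stand alone as a word."""
    return bool(READ_VERB_RE.search(text.lower()))


benign = "this is an important concatenation of features"
shell = "cat __pycache__/widget.pyc"

print(substring_match(benign))   # True  ('cat' inside 'concatenation')
print(whole_word_match(benign))  # False
print(whole_word_match(shell))   # True
```

One property of `\b` worth keeping in mind: the underscore counts as a word character, so an identifier such as `read_file` does not match the `read` marker under this regex.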
composer_replication/datagen/tests/test_docker_substrate_e2e.py CHANGED
@@ -70,24 +70,17 @@ _MODULE_BROKEN = textwrap.dedent('''\
70
  return a + b
71
  ''')
72
 
73
- _TESTS = textwrap.dedent('''\
74
- from feature import add
75
- try:
76
- from feature import mul
77
- except ImportError:
78
- mul = None
79
-
80
- def test_add_guard(): # PASS_TO_PASS — must pass in broken state
81
- assert add(2, 3) == 5
82
-
83
- def test_mul_target(): # FAIL_TO_PASS — fails in broken state
84
- assert mul is not None and mul(2, 3) == 6
85
- ''')
86
 
 
 
 
87
 
88
- def _run_in_container(workdir_files: dict[str, str], test_node: str) -> tuple[bool, str]:
89
- """Materialize files in a fresh python:3.11-slim container, run one pytest
90
- node, return (passed, output). Uses the docker CLI (no SDK dep)."""
 
 
 
91
  import os
92
  import tempfile
93
 
@@ -95,50 +88,55 @@ def _run_in_container(workdir_files: dict[str, str], test_node: str) -> tuple[bo
95
  for name, content in workdir_files.items():
96
  with open(os.path.join(d, name), "w") as f:
97
  f.write(content)
98
- # Mount the dir, install pytest, run the single node id.
 
 
 
 
 
 
 
 
 
 
99
  cmd = [
100
  "docker", "run", "--rm", "--network", "none",
101
  "-v", f"{d}:/work", "-w", "/work",
102
- "python:3.11-slim",
103
- "bash", "-lc",
104
- f"pip install -q pytest >/dev/null 2>&1 && "
105
- f"python -m pytest -q '{test_node}' 2>&1",
106
  ]
107
- r = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
108
  out = (r.stdout or "") + (r.stderr or "")
109
  return (r.returncode == 0, out)
110
 
111
 
112
  def test_substrate_inversion_four_gates_on_real_container():
113
- """The 4 ADR-010 gates against a REAL container + REAL pytest, not FakeSandbox.
114
 
115
  Gate 1 (baseline green): solved state → target + guard both pass.
116
  Gate 2 (deletion breaks): broken state → target FAILS.
117
  Gate 3 (remains functional):broken state → guard PASSES.
118
  Gate 4 (gold restores): gold-applied → target passes again.
119
  """
120
- tests_file = {"test_feature.py": _TESTS}
 
 
 
121
 
122
  # Gate 1 — solved: both pass.
123
- g1_target, _ = _run_in_container(
124
- {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_mul_target")
125
- g1_guard, _ = _run_in_container(
126
- {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_add_guard")
127
  assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
128
 
129
- # Gate 2 — broken: target FAILS.
130
- g2_target, g2_out = _run_in_container(
131
- {**tests_file, "feature.py": _MODULE_BROKEN}, "test_feature.py::test_mul_target")
132
- assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed in broken state:\n{g2_out}"
133
 
134
  # Gate 3 — broken: guard still PASSES.
135
- g3_guard, g3_out = _run_in_container(
136
- {**tests_file, "feature.py": _MODULE_BROKEN}, "test_feature.py::test_add_guard")
137
  assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
138
 
139
  # Gate 4 — gold restores: target passes again (gold == the solved module).
140
- g4_target, _ = _run_in_container(
141
- {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_mul_target")
142
  assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"
143
 
144
 
 
70
  return a + b
71
  ''')
72
 
 
 
 
 
 
 
 
 
 
 
 
 
 
73
 
74
+ def _run_in_container(workdir_files: dict[str, str], target_expr: str) -> tuple[bool, str]:
75
+ """Materialize files in a fresh python:3.11-slim container and evaluate one
76
+ boolean expression against the module, return (passed, output).
77
 
78
+ Uses a plain stdlib runner script (`python _runner.py`), NO pip install, so `--network none`
79
+ is honored (final-verify 2026-05-29: the earlier `pip install pytest` inside
80
+ a network-disabled container would fail exactly when the gate activates).
81
+ `target_expr` is a Python expression over the imported `feature` module that
82
+ must evaluate truthy for a 'pass', e.g. `mul(2,3)==6`.
83
+ """
84
  import os
85
  import tempfile
86
 
 
88
  for name, content in workdir_files.items():
89
  with open(os.path.join(d, name), "w") as f:
90
  f.write(content)
91
+ runner = (
92
+ "import sys\n"
93
+ "try:\n"
94
+ " import feature\n"
95
+ f" ok = bool({target_expr})\n"
96
+ "except Exception as e:\n"
97
+ " print('EXC', e); sys.exit(1)\n"
98
+ "print('PASS' if ok else 'FAIL'); sys.exit(0 if ok else 1)\n"
99
+ )
100
+ with open(os.path.join(d, "_runner.py"), "w") as f:
101
+ f.write(runner)
102
  cmd = [
103
  "docker", "run", "--rm", "--network", "none",
104
  "-v", f"{d}:/work", "-w", "/work",
105
+ "python:3.11-slim", "python", "_runner.py",
 
 
 
106
  ]
107
+ r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
108
  out = (r.stdout or "") + (r.stderr or "")
109
  return (r.returncode == 0, out)
110
 
111
 
112
  def test_substrate_inversion_four_gates_on_real_container():
113
+ """The 4 ADR-010 gates against a REAL container + REAL python import, not FakeSandbox.
114
 
115
  Gate 1 (baseline green): solved state → target + guard both pass.
116
  Gate 2 (deletion breaks): broken state → target FAILS.
117
  Gate 3 (remains functional):broken state → guard PASSES.
118
  Gate 4 (gold restores): gold-applied → target passes again.
119
  """
120
+ solved = {"feature.py": _MODULE_SOLVED}
121
+ broken = {"feature.py": _MODULE_BROKEN}
122
+ TARGET = "feature.mul(2, 3) == 6" # FAIL_TO_PASS — exercises the deleted symbol
123
+ GUARD = "feature.add(2, 3) == 5" # PASS_TO_PASS — must survive the deletion
124
 
125
  # Gate 1 — solved: both pass.
126
+ g1_target, _ = _run_in_container(solved, TARGET)
127
+ g1_guard, _ = _run_in_container(solved, GUARD)
 
 
128
  assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
129
 
130
+ # Gate 2 — broken: target FAILS (mul is gone → AttributeError → non-truthy).
131
+ g2_target, g2_out = _run_in_container(broken, TARGET)
132
+ assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed broken:\n{g2_out}"
 
133
 
134
  # Gate 3 — broken: guard still PASSES.
135
+ g3_guard, g3_out = _run_in_container(broken, GUARD)
 
136
  assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
137
 
138
  # Gate 4 — gold restores: target passes again (gold == the solved module).
139
+ g4_target, _ = _run_in_container(solved, TARGET)
 
140
  assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"
141
 
142
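The runner logic can be exercised without Docker to check the pass/fail semantics of the four gates. The sketch below is an assumption-laden approximation: `build_runner` and `run_locally` are hypothetical names, the script runs under the local interpreter instead of a `python:3.11-slim` container, and the `SOLVED` / `BROKEN` module bodies are made up for illustration (the real `_MODULE_SOLVED` is not shown in the diff).

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical module bodies; the broken state simply lacks `mul`.
SOLVED = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b\n"
BROKEN = "def add(a, b):\n    return a + b\n"

TARGET = "feature.mul(2, 3) == 6"  # FAIL_TO_PASS
GUARD = "feature.add(2, 3) == 5"   # PASS_TO_PASS


def build_runner(target_expr: str) -> str:
    """Return a stdlib-only script that exits 0 iff the expression is truthy."""
    return (
        "import sys\n"
        "try:\n"
        "    import feature\n"
        f"    ok = bool({target_expr})\n"
        "except Exception as e:\n"
        "    print('EXC', e); sys.exit(1)\n"
        "print('PASS' if ok else 'FAIL'); sys.exit(0 if ok else 1)\n"
    )


def run_locally(files: dict[str, str], target_expr: str) -> tuple[bool, str]:
    """Write the files plus the runner to a temp dir and run it (no container)."""
    with tempfile.TemporaryDirectory() as d:
        for name, content in files.items():
            with open(os.path.join(d, name), "w") as f:
                f.write(content)
        with open(os.path.join(d, "_runner.py"), "w") as f:
            f.write(build_runner(target_expr))
        r = subprocess.run(
            [sys.executable, "_runner.py"],
            cwd=d, capture_output=True, text=True, timeout=60,
        )
        return r.returncode == 0, (r.stdout or "") + (r.stderr or "")


print(run_locally({"feature.py": SOLVED}, TARGET)[0])  # gate 1 / gate 4
print(run_locally({"feature.py": BROKEN}, TARGET)[0])  # gate 2
print(run_locally({"feature.py": BROKEN}, GUARD)[0])   # gate 3
```

In the broken state the target expression raises `AttributeError`, which the runner catches and converts into exit code 1, so the gate-2 call reports a failure without needing pytest.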
 
composer_replication/datagen/tests/test_feature_deletion.py CHANGED
@@ -102,6 +102,47 @@ def test_monitor_passes_clean_reimplementation():
102
  assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
103
 
104
 
105
  def test_reward_masked_when_hack_detected():
106
  sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
107
  env = FeatureDeletionEnv(sb)
 
102
  assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
103
 
104
 
105
+ def test_monitor_no_false_positive_on_legit_import_words():
106
+ """Final-verify 2026-05-29: bare-substring read markers flagged 'import' in
107
+ 'important' / 'cat' in 'concatenate'. With whole-word matching + patch
108
+ exclusion, a legit patch that reintroduces the symbol and uses normal words
109
+ must NOT be flagged when there's no real cache/bytecode read."""
110
+ mon = HackMonitor()
111
+ traj = [
112
+ {"type": "comment", "text": "this is an important concatenation of features"},
113
+ {"type": "submit_patch", "patch": "def feature_a():\n return important() # __pycache__ mention in a comment"},
114
+ ]
115
+ # Symbol reappears in the patch AND the words 'important'/'concatenation'
116
+ # appear and the patch comment even mentions __pycache__ — but there is NO
117
+ # actual read action, and the patch text is excluded from read-evidence.
118
+ assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
119
+
120
+
121
+ def test_monitor_still_flags_obfuscated_cache_read():
122
+ """The patch-provenance layer still catches a string-concat-obfuscated cache
123
+ read that reintroduces a deleted symbol (the real attack it's for)."""
124
+ mon = HackMonitor()
125
+ traj = [
126
+ {"type": "shell", "command": "python -c \"import marshal; d=open('__py'+'cache__/widget.pyc','rb').read()\""},
127
+ {"type": "submit_patch", "patch": "def feature_a(): return 42 # recovered"},
128
+ ]
129
+ assert mon.flag(traj, deleted_symbols=("feature_a",)) is True
130
+
131
+
132
+ def test_curriculum_independent_effort_signals():
133
+ """Final-verify 2026-05-29: turns and think_tokens are INDEPENDENTLY optional;
134
+ they must not share a denominator. A task updated with turns-only then
135
+ think-only must record correct per-signal means (each = its single value),
136
+ not values halved by a shared n_effort counter."""
137
+ cur = DifficultyCurriculum()
138
+ cur.update("t", n_pass=1, n_total=2, turns=10.0) # turns only
139
+ cur.update("t", n_pass=1, n_total=2, think_tokens=200.0) # think only
140
+ st = cur._stats["t"]
141
+ assert st.mean_turns == 10.0, f"turns mean corrupted: {st.mean_turns}"
142
+ assert st.mean_think == 200.0, f"think mean corrupted: {st.mean_think}"
143
+ assert st.n_turns == 1 and st.n_think == 1
144
+
145
+
146
  def test_reward_masked_when_hack_detected():
147
  sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
148
  env = FeatureDeletionEnv(sb)
composer_replication/integrations/altered_minds/reward.py CHANGED
@@ -42,6 +42,15 @@ _ANSWER_RE = re.compile(r"answer\s*[:\-]\s*\*{0,2}([A-D])\b", re.IGNORECASE)
42
  _JSON_ANSWER_RE = re.compile(
43
  r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
44
  )
 
 
 
 
 
 
 
 
 
45
 
46
 
47
  def _find_markers(text: str) -> list[str]:
@@ -70,8 +79,14 @@ def parse_final_answer(completion: str) -> tuple[str | None, int]:
70
  markers = _find_markers(completion)
71
  if not markers:
72
  return None, 0
73
- distinct = len(set(markers))
74
- return markers[-1], distinct
 
 
 
 
 
 
75
 
76
 
77
  @dataclass
 
42
  _JSON_ANSWER_RE = re.compile(
43
  r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
44
  )
45
+ # Hedge AFTER an answer marker: ``Answer: A or B`` / ``Answer: A/B`` /
46
+ # ``Answer: A, B`` — a single marker that names a SECOND distinct option is a
47
+ # format hedge and must be treated as multiple-answers, not full credit for the
48
+ # first letter (final-verify 2026-05-29). Captures the lead letter + the hedged
49
+ # second letter immediately following via 'or' / 'and' / slash / comma / pipe.
50
+ _HEDGE_RE = re.compile(
51
+ r"answer\s*[:\-]\s*\*{0,2}([A-D])\b\s*(?:or|and|/|,|\|)\s*\*{0,2}([A-D])\b",
52
+ re.IGNORECASE,
53
+ )
54
 
55
 
56
  def _find_markers(text: str) -> list[str]:
 
79
  markers = _find_markers(completion)
80
  if not markers:
81
  return None, 0
82
+ distinct = set(markers)
83
+ # Hedge detection: a single marker naming a second distinct option
84
+ # ("Answer: A or B") adds the hedged letter to the distinct set, so it is
85
+ # penalized as a multiple-answers format hack instead of scoring the lead.
86
+ for m in _HEDGE_RE.finditer(completion or ""):
87
+ distinct.add(m.group(1).upper())
88
+ distinct.add(m.group(2).upper())
89
+ return markers[-1], len(distinct)
90
 
91
 
92
  @dataclass
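The hedge detection can be tried standalone. The sketch below is a simplified approximation of `parse_final_answer`: it copies the two regexes from the diff but assumes markers come only from `ANSWER_RE` (the real `_find_markers` also consults `_JSON_ANSWER_RE`, which is left out here).

```python
import re

ANSWER_RE = re.compile(r"answer\s*[:\-]\s*\*{0,2}([A-D])\b", re.IGNORECASE)
HEDGE_RE = re.compile(
    r"answer\s*[:\-]\s*\*{0,2}([A-D])\b\s*(?:or|and|/|,|\|)\s*\*{0,2}([A-D])\b",
    re.IGNORECASE,
)


def parse_final_answer(completion):
    """Return (last answer letter or None, number of distinct letters named)."""
    text = completion or ""
    markers = [m.group(1).upper() for m in ANSWER_RE.finditer(text)]
    if not markers:
        return None, 0
    distinct = set(markers)
    # A single marker that names a second option widens the distinct set.
    for m in HEDGE_RE.finditer(text):
        distinct.add(m.group(1).upper())
        distinct.add(m.group(2).upper())
    return markers[-1], len(distinct)


for text in ("Answer: A or B", "Answer: A/B", "After reasoning, Answer: C"):
    print(text, "->", parse_final_answer(text))
```

One behaviour to be aware of in this sketch: because the pattern is compiled with `re.IGNORECASE`, the second group also matches a lowercase article, so `"Answer: B or a different one"` is counted as naming two options. Whether that is acceptable for the reward is a judgment call for the maintainers.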
composer_replication/integrations/altered_minds/tests/test_channel_ladder.py CHANGED
@@ -23,6 +23,29 @@ from composer_replication.integrations.altered_minds import (
23
  dual_kl_logger,
24
  randomize_options,
25
  )
26
 
27
 
28
  # ===========================================================================
 
23
  dual_kl_logger,
24
  randomize_options,
25
  )
26
+ from composer_replication.integrations.altered_minds.reward import parse_final_answer
27
+
28
+
29
+ def test_hedged_answer_is_penalized_not_full_credit():
30
+ """Final-verify 2026-05-29: 'Answer: A or B' / 'Answer: A/B' must NOT score
31
+ full credit for A — a hedge naming a second distinct option is a multiple-
32
+ answers format hack (n_distinct >= 2)."""
33
+ for hedge in ("Answer: A or B", "Answer: A/B", "Answer: A, B", "Answer: A and C"):
34
+ letter, n_distinct = parse_final_answer(hedge)
35
+ assert n_distinct >= 2, f"hedge {hedge!r} not detected (n_distinct={n_distinct})"
36
+ # A clean single answer is still n_distinct == 1.
37
+ letter, n_distinct = parse_final_answer("After reasoning, Answer: C")
38
+ assert letter == "C" and n_distinct == 1
39
+ # Two clean markers of the SAME letter are not a hedge.
40
+ _, n_same = parse_final_answer("Answer: C ... wait, Answer: C")
41
+ assert n_same == 1
42
+
43
+
44
+ def test_hedged_answer_scores_multiple_penalty_via_reward():
45
+ r = MMLUFormatReward()
46
+ out = r(prompts=["p"], completions=["Answer: A or B"], answers=["A"])
47
+ # Even though the gold is A and A is the lead letter, the hedge is penalized.
48
+ assert out[0] == r.multiple_answers_reward
49
 
50
 
51
  # ===========================================================================
composer_replication/trainer/composer_trainer.py CHANGED
@@ -231,10 +231,20 @@ class ComposerReplicationTrainer(GRPOTrainer): # type: ignore[misc, valid-type]
231
  # to 0 before gathering, then neutralize those positions by feeding
232
  # labels=-100 (the standard HF ignore convention that generalized_jsd_loss
233
  # already honors). This makes sentinel/padding positions contribute 0.
234
- if "student_response_valid" in inputs and inputs["student_response_valid"] is not None:
235
- aligned_mask = inputs["student_response_valid"].bool()
236
- else:
237
- aligned_mask = (s_idx >= 0) & (t_idx >= 0)
 
 
 
 
 
 
 
 
 
 
238
 
239
  vocab = student_logits.size(-1)
240
  s_safe = s_idx.clamp_min(0)
 
231
  # to 0 before gathering, then neutralize those positions by feeding
232
  # labels=-100 (the standard HF ignore convention that generalized_jsd_loss
233
  # already honors). This makes sentinel/padding positions contribute 0.
234
+ #
235
+ # Final-verify 2026-05-29: combine BOTH valid masks (not just student's)
236
+ # AND the sentinel guard. If a future collator ever emits divergent
237
+ # student/teacher valid tails, a teacher sentinel clamped to 0 would
238
+ # otherwise be silently distilled against teacher position 0. Belt-and-
239
+ # suspenders: valid iff student-valid AND teacher-valid AND both indices
240
+ # non-sentinel.
241
+ s_valid = inputs.get("student_response_valid")
242
+ t_valid = inputs.get("teacher_response_valid")
243
+ aligned_mask = (s_idx >= 0) & (t_idx >= 0)
244
+ if s_valid is not None:
245
+ aligned_mask = aligned_mask & s_valid.bool()
246
+ if t_valid is not None:
247
+ aligned_mask = aligned_mask & t_valid.bool()
248
 
249
  vocab = student_logits.size(-1)
250
  s_safe = s_idx.clamp_min(0)
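The mask combination can be illustrated without tensors. This is a dependency-free sketch using plain lists in place of the `(B, K)` torch tensors the trainer works with; `aligned_mask` as a standalone function and the example index values are assumptions made for illustration.

```python
def aligned_mask(s_idx, t_idx, s_valid=None, t_valid=None):
    """Valid iff both indices are non-sentinel AND every supplied mask agrees."""
    mask = [s >= 0 and t >= 0 for s, t in zip(s_idx, t_idx)]
    if s_valid is not None:
        mask = [m and bool(v) for m, v in zip(mask, s_valid)]
    if t_valid is not None:
        mask = [m and bool(v) for m, v in zip(mask, t_valid)]
    return mask


# Hypothetical row where the teacher tail is one position shorter than the student's.
s_idx = [3, 4, 5, -1]
t_idx = [7, 8, -1, -1]
s_valid = [True, True, True, False]
t_valid = [True, True, False, False]

# Pre-fix behaviour: trust the student mask alone.
student_only = [bool(v) for v in s_valid]
# Post-fix behaviour: sentinel guard plus both masks.
combined = aligned_mask(s_idx, t_idx, s_valid, t_valid)

# Positions the old mask would have distilled against a clamped teacher index.
leaked = [i for i, ok in enumerate(student_only) if ok and t_idx[i] < 0]

print(student_only)  # [True, True, True, False]
print(combined)      # [True, True, False, False]
print(leaked)        # [2]
```

Position 2 is the case the commit describes: the student mask marks it valid, the teacher index is the `-1` sentinel, and clamping would have pointed it at teacher position 0.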
docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md ADDED
@@ -0,0 +1,33 @@
+ # Phase-8 final cross-family verify: deep-work-loop 2026-05-29
+
+ The deep-work-loop exit gate (Hard Rule 6: two independent confirmations). After
+ Waves A+B shipped (ADR-011/012/013, 227 tests), the net session code diff
+ (~92KB) was scattered to 3 diverse families for a final "would this break a real
+ run" pass. 2 of 3 returned usable reviews (GPT-5.5, Gemini; DeepSeek starved its
+ reasoning budget). The verify EARNED ITS KEEP: it found a latent P0, a
+ convergent bug, and a self-bug in this session's own new code.
+
+ ## Findings (all verified against code, all FIXED)
+
+ | # | Finding | Severity | Reviewers | Fix |
+ |---|---|---|---|---|
+ | 1 | SDPO sentinel-mask used only `student_response_valid` and ignored `teacher_response_valid`; a future divergent teacher tail would distill against clamped position 0 | P0 (latent) | GPT-5.5 | `aligned_mask = (s_idx>=0)&(t_idx>=0)&student_valid&teacher_valid` |
+ | 2 | Curriculum `_TaskStats` shared one `n_effort` denominator for both turns + think_tokens → corrupts a mean when only one signal is present | P1 (convergent) | GPT-5.5 + Gemini | separate `n_turns`/`n_think` counters |
+ | 3 | Docker E2E test ran `pip install pytest` under `--network none` → would fail exactly when the gate activates | P1 | GPT-5.5 | dropped pytest; stdlib expression runner (`python _runner.py`), no network needed |
+ | 4 | MMLU reward: `Answer: A or B` hedge parsed as `A` with `n_distinct=1` → full credit | P1 | GPT-5.5 | `_HEDGE_RE` adds the hedged second letter to the distinct set → multiple-answers penalty |
+ | 5 | HackMonitor read-markers bare-substring matched `import` in `important`, `cat` in `concatenate`; and scanned the submitted patch as read-evidence → false positives | P1 | GPT-5.5 | whole-word `_READ_VERB_RE` + exclude `submit_patch`/`patch`/`diff` payloads from BOTH layers |
+
+ Finding 5's fix surfaced a follow-on: layer-1's signature matcher ALSO scanned
+ the patch payload (a legit patch mentioning `__pycache__` in a comment got
+ flagged). The regression test caught it; fixed by excluding the patch payload
+ from layer-1 too. Net: the monitor now only treats actual read ACTIONS as
+ provenance evidence, never the agent's own output patch.
+
+ ## Outcome
+
+ All 5 findings fixed + 5 regression tests added. Final suite: **232 passed / 18
+ skipped / 0 failed**. Both the execution view (workers reported done) and the
+ review view (final verify findings all closed) confirm the loop is empty; these
+ are the two independent confirmations the deep-work-loop requires.
+
+ Raw reviews: `verify_gpt-5.5.md`, `verify_gemini-3.1-pro.md`.
docs/reviews/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md ADDED
@@ -0,0 +1,4 @@
+ ## Verdict SHIP-WITH-FIXES
+ ## P0 none
+ ## P1
+ 1. `composer_replication/datagen/curriculum.py:observe_effort`: the running mean math shares the `n_effort` denominator for both `turns` and `think_tokens`. Because these signals are independently optional (`float | None`), if one is provided without the other, its running mean is corrupted (divided by total effort updates rather than the count of that specific signal). **Fix:** Track `n_turns` and `n_think` separately, and use them as the respective denominators when updating `mean_turns` and `mean_think`. Update `_effort_factor` to check `st.n_turns > 0` and `st.n_think > 0` instead of `st.n_effort`.
docs/reviews/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md ADDED
@@ -0,0 +1,19 @@
+ ## Verdict
+ BLOCK
+
+ ## P0
+ 1. `composer_replication/trainer/composer_trainer.py:_compute_sdpo_loss`: sentinel masking only uses `student_response_valid`; `teacher_response_valid` is ignored. If student/teacher valid tails differ under truncation/tokenization/ragged rows, `teacher_response_idx == -1` is clamped to `0` but still labeled valid, so the JSD silently distills against teacher position 0 garbage. If K differs, this can also shape-crash.
+    **Fix:** build `aligned_mask = student_valid & teacher_valid & (s_idx >= 0) & (t_idx >= 0)`, require/repair common `(B,K)` padding, and in strict mode assert valid masks/counts are compatible before gather. Return zero loss if no valid aligned tokens.
+
+ ## P1
+ 1. `composer_replication/integrations/altered_minds/reward.py:parse_final_answer/_ANSWER_RE`: multiple-answer hedges in a single marker are not detected: e.g. `"Answer: A or B"` / `"Answer: A/B"` parses as `A` with `n_distinct == 1`, so it can receive full credit.
+    **Fix:** after an `Answer:` marker, inspect the answer span and reject/penalize if it contains more than one distinct option letter before the line/end punctuation, not just multiple markers.
+
+ 2. `composer_replication/datagen/curriculum.py:_TaskStats.observe_effort/_effort_factor`: one shared `n_effort` is used for both optional signals. If updates sometimes provide only `turns` and sometimes only `think_tokens`, the missing signal still advances the denominator for the other mean, biasing effort means downward. `_effort_factor` also averages a task's absent signal as `0` whenever any other task has that signal, making weights depend on missingness.
+    **Fix:** track `n_turns` and `n_think` separately; combine only signals actually present for that task.
+
+ 3. `composer_replication/datagen/monitor.py:HackMonitor._patch_provenance_hack`: read detection scans every action value, including submitted patch/diff text, and uses bare substring markers (`"cat"`, `"import"`, etc.). A legitimate patch reintroducing a deleted symbol plus code/comment containing `__pycache__` and `import`/`important`/`concatenate` can be falsely flagged and reward-masked.
+    **Fix:** only scan actual shell/file-read/import actions for provenance, exclude `submit_patch`/diff payloads from read evidence, and match read verbs with token/command boundaries near artifact paths.
+
+ 4. `composer_replication/datagen/tests/test_docker_substrate_e2e.py:_run_in_container`: the Docker-gated test runs containers with `--network none` and then executes `pip install -q pytest`; on a real Docker host with `python:3.11-slim`, this fails because pytest is not installed and network is disabled. This will break the suite exactly when the gate activates.
+    **Fix:** use an image with pytest preinstalled, vendor/install before disabling network, or avoid pytest inside the container.