fix(phase-8): close all 5 cross-family final-verify findings + regression tests

This is the deep-work-loop exit gate (2 independent confirmations). A final
3-family scatter of the session diff found five issues, all fixed here:

- [P0 latent] SDPO sentinel-mask used only student_response_valid; now requires
BOTH valid masks AND non-sentinel indices (a future divergent teacher tail
could have distilled against clamped position 0).
- [P1 convergent, GPT-5.5+Gemini] curriculum shared one n_effort denominator for
turns+think_tokens; split into independent n_turns/n_think counters.
- [P1] docker E2E test ran pip-install under --network none (would fail when the
gate activates); replaced with a stdlib-only runner script (no pytest), no network.
- [P1] MMLU reward: an 'Answer: A or B' hedge scored full credit; _HEDGE_RE now
penalizes it as multiple-answers.
- [P1] HackMonitor read-markers matched 'import' in 'important' / 'cat' in
'concatenate' and scanned the agent patch as read-evidence; now uses a whole-word
regex and excludes patch/diff payloads from BOTH layers (a legit patch can no
longer self-incriminate).

232 passed / 18 skipped / 0 failed. 5 regression tests added. Synthesis +
raw reviews in docs/reviews/final-verify-deep-work-2026-05-29/.

Files changed (10)

composer_replication/datagen/curriculum.py +26 -17
composer_replication/datagen/monitor.py +25 -5
composer_replication/datagen/tests/test_docker_substrate_e2e.py +34 -36
composer_replication/datagen/tests/test_feature_deletion.py +41 -0
composer_replication/integrations/altered_minds/reward.py +17 -2
composer_replication/integrations/altered_minds/tests/test_channel_ladder.py +23 -0
composer_replication/trainer/composer_trainer.py +14 -4
docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md +33 -0
docs/reviews/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md +4 -0
docs/reviews/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md +19 -0

composer_replication/datagen/curriculum.py CHANGED

@@ -25,23 +25,30 @@ from dataclasses import dataclass, field
 class _TaskStats:
     n_pass: float = 0.0
     n_total: int = 0
-    # Running means of effort signals (ADR-012 finding #4). `n_effort` counts
-    # exposures that supplied an effort signal (may differ from n_total since
-    # turns/think_tokens are optional per update).
     mean_turns: float = 0.0
     mean_think: float = 0.0
-    n_effort: int = 0
     def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
-        """Fold optional turn / think-token signals into running means."""
-        if turns is None and think_tokens is None:
-            return
-        self.n_effort += 1
-        k = self.n_effort
         if turns is not None:
-            self.mean_turns += (turns - self.mean_turns) / k
         if think_tokens is not None:
-            self.mean_think += (think_tokens - self.mean_think) / k
     @property
     def p_hat(self) -> float:
@@ -128,15 +135,17 @@ class DifficultyCurriculum:
         if st is None or st.n_effort == 0:
             return 1.0
         max_turns = max(
-            (s.mean_turns for s in self._stats.values() if s.n_effort), default=0.0
         )
         max_think = max(
-            (s.mean_think for s in self._stats.values() if s.n_effort), default=0.0
         )
-        z_turns = st.mean_turns / max_turns if max_turns > 0 else 0.0
-        z_think = st.mean_think / max_think if max_think > 0 else 0.0
-        # Combine the two normalized effort signals (mean of those present).
-        components = [z for z, mx in ((z_turns, max_turns), (z_think, max_think)) if mx > 0]
         z = sum(components) / len(components) if components else 0.0
         return 1.0 + self.effort_gain * z

 class _TaskStats:
     n_pass: float = 0.0
     n_total: int = 0
+    # Running means of effort signals (ADR-012 finding #4). Each signal has its
+    # OWN exposure counter because turns/think_tokens are INDEPENDENTLY optional
+    # per update — sharing one denominator would corrupt the mean of whichever
+    # signal was present when the other was absent (final-verify 2026-05-29).
     mean_turns: float = 0.0
     mean_think: float = 0.0
+    n_turns: int = 0
+    n_think: int = 0
+    @property
+    def n_effort(self) -> int:
+        """Lower bound on exposures that supplied at least one effort signal
+        (max of the two per-signal counters). Nonzero iff any effort signal was
+        ever recorded; back-compat accessor for callers that just want
+        'did we ever see effort'."""
+        return max(self.n_turns, self.n_think)
     def observe_effort(self, turns: float | None, think_tokens: float | None) -> None:
+        """Fold optional turn / think-token signals into independent running means."""
         if turns is not None:
+            self.n_turns += 1
+            self.mean_turns += (turns - self.mean_turns) / self.n_turns
         if think_tokens is not None:
+            self.n_think += 1
+            self.mean_think += (think_tokens - self.mean_think) / self.n_think
     @property
     def p_hat(self) -> float:
         if st is None or st.n_effort == 0:
             return 1.0
         max_turns = max(
+            (s.mean_turns for s in self._stats.values() if s.n_turns), default=0.0
         )
         max_think = max(
+            (s.mean_think for s in self._stats.values() if s.n_think), default=0.0
         )
+        # Only count a signal for THIS task if the task actually recorded it.
+        z_turns = (st.mean_turns / max_turns) if (max_turns > 0 and st.n_turns) else 0.0
+        z_think = (st.mean_think / max_think) if (max_think > 0 and st.n_think) else 0.0
+        components = [
+            z for z, present in ((z_turns, st.n_turns), (z_think, st.n_think)) if present
+        ]
         z = sum(components) / len(components) if components else 0.0
         return 1.0 + self.effort_gain * z
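
The denominator bug in this hunk is easy to reproduce in isolation. The following is a minimal sketch, not the repository's `_TaskStats`: the class names `SharedDenominatorStats` and `IndependentStats` are hypothetical stand-ins that keep only the running-mean logic shown in the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SharedDenominatorStats:
    """Pre-fix logic: one counter advances whenever EITHER signal is present."""
    mean_turns: float = 0.0
    mean_think: float = 0.0
    n_effort: int = 0

    def observe_effort(self, turns: Optional[float] = None,
                       think_tokens: Optional[float] = None) -> None:
        if turns is None and think_tokens is None:
            return
        self.n_effort += 1
        k = self.n_effort
        if turns is not None:
            self.mean_turns += (turns - self.mean_turns) / k
        if think_tokens is not None:
            self.mean_think += (think_tokens - self.mean_think) / k


@dataclass
class IndependentStats:
    """Post-fix logic: each signal is averaged over its own exposure count."""
    mean_turns: float = 0.0
    mean_think: float = 0.0
    n_turns: int = 0
    n_think: int = 0

    def observe_effort(self, turns: Optional[float] = None,
                       think_tokens: Optional[float] = None) -> None:
        if turns is not None:
            self.n_turns += 1
            self.mean_turns += (turns - self.mean_turns) / self.n_turns
        if think_tokens is not None:
            self.n_think += 1
            self.mean_think += (think_tokens - self.mean_think) / self.n_think


shared = SharedDenominatorStats()
split = IndependentStats()
for stats in (shared, split):
    stats.observe_effort(turns=10.0)          # turns only
    stats.observe_effort(think_tokens=200.0)  # think only

print(shared.mean_turns, shared.mean_think)  # 10.0 100.0  (think mean halved)
print(split.mean_turns, split.mean_think)    # 10.0 200.0
```

With the shared counter, the single think-token observation is divided by 2 because the earlier turns-only update already advanced the denominator. With separate counters each mean equals its single observed value, which is what `test_curriculum_independent_effort_signals` asserts.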

composer_replication/datagen/monitor.py CHANGED

@@ -60,10 +60,16 @@ _ARTIFACT_MARKERS: tuple[str, ...] = (
 )
 # Actions that "read" something (a cache/bytecode artifact, for provenance).
 _READ_MARKERS: tuple[str, ...] = (
     "cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
     "decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
-    "tail", "import",
 )
@@ -92,10 +98,18 @@ class HackMonitor:
         patch: str | None = None,
     ) -> bool:
         # --- layer 1: signature substring matcher (defense-in-depth) ---------
         sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
         for action in trajectory:
             blob = " ".join(
-                str(v) for v in action.values() if isinstance(v, (str, int, float))
             ).lower()
             if any(sig.lower() in blob for sig in sigs):
                 return True
@@ -148,15 +162,21 @@ class HackMonitor:
         if not reappeared:
             return False
         for action in trajectory:
             raw = " ".join(
-                str(v) for v in action.values()
-                if isinstance(v, (str, int, float))
             )
             low = raw.lower()
             norm = _normalize(low)
             reads_artifact = (
                 any(m in norm for m in _ARTIFACT_MARKERS)
-                and any(rm in low for rm in _READ_MARKERS)
             )
             if reads_artifact:
                 return True

 )
 # Actions that "read" something (a cache/bytecode artifact, for provenance).
+# Matched as WHOLE WORDS (see _has_read_verb) so 'cat' doesn't match 'concatenate'
+# and 'import' doesn't match 'important' (final-verify 2026-05-29 false-positive).
 _READ_MARKERS: tuple[str, ...] = (
     "cat", "read", "open", "load", "strings", "unzip", "jar", "javap",
     "decompile", "uncompyle", "decompyle", "dump", "show", "less", "head",
+    "tail", "import", "marshal", "dis",
+)
+_READ_VERB_RE = re.compile(
+    r"\b(" + "|".join(re.escape(m) for m in _READ_MARKERS) + r")\b"
 )
         patch: str | None = None,
     ) -> bool:
         # --- layer 1: signature substring matcher (defense-in-depth) ---------
+        # Scan trajectory ACTIONS for cache/decompiler signatures, but EXCLUDE
+        # the submitted patch/diff payload — that's the agent's output, not a
+        # read action, and a legit patch mentioning e.g. `__pycache__` in a
+        # comment must not self-incriminate (final-verify 2026-05-29). Layer 2
+        # handles the patch separately via provenance.
         sigs = _HACK_SIGNATURES + tuple(self.extra_signatures)
         for action in trajectory:
+            if action.get("type") == "submit_patch":
+                continue
             blob = " ".join(
+                str(v) for k, v in action.items()
+                if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
             ).lower()
             if any(sig.lower() in blob for sig in sigs):
                 return True
         if not reappeared:
             return False
         for action in trajectory:
+            # The submitted patch/diff is the agent's OUTPUT, not a read action —
+            # exclude it from read-evidence so a legit patch that merely contains
+            # the word 'import' or a comment mentioning a cache path can't
+            # self-incriminate (final-verify 2026-05-29 false-positive class).
+            if action.get("type") == "submit_patch":
+                continue
             raw = " ".join(
+                str(v) for k, v in action.items()
+                if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
             )
             low = raw.lower()
             norm = _normalize(low)
             reads_artifact = (
                 any(m in norm for m in _ARTIFACT_MARKERS)
+                and bool(_READ_VERB_RE.search(low))  # whole-word read verb only
             )
             if reads_artifact:
                 return True
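
The two monitor changes (whole-word read verbs, and dropping patch/diff payloads before scanning) can be sketched standalone. This is an illustration under assumptions: `READ_MARKERS` is a shortened copy of the tuple in the diff, and `substring_hit`, `whole_word_hit`, and `action_text` are hypothetical helper names, not functions in `monitor.py`.

```python
import re
from typing import Optional

# Shortened marker list for illustration; the real tuple is longer.
READ_MARKERS = ("cat", "read", "open", "load", "import")

READ_VERB_RE = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in READ_MARKERS) + r")\b"
)


def substring_hit(text: str) -> bool:
    """Pre-fix behaviour: any marker appearing anywhere inside the text."""
    return any(m in text for m in READ_MARKERS)


def whole_word_hit(text: str) -> bool:
    """Post-fix behaviour: the marker must stand alone as a word."""
    return bool(READ_VERB_RE.search(text))


def action_text(action: dict) -> Optional[str]:
    """Flatten one action into scannable text, skipping the agent's output.

    Returns None for submit_patch actions; otherwise joins scalar values
    except those stored under the 'patch' or 'diff' keys.
    """
    if action.get("type") == "submit_patch":
        return None
    return " ".join(
        str(v) for k, v in action.items()
        if isinstance(v, (str, int, float)) and k not in ("patch", "diff")
    ).lower()


benign = "this is an important concatenation of features"
real_read = "cat __pycache__/widget.cpython-311.pyc"

print(substring_hit(benign), whole_word_hit(benign))        # True False
print(substring_hit(real_read), whole_word_hit(real_read))  # True True
```

One property of `\b` that may matter when extending the marker list: the underscore counts as a word character, so a token such as `import_module` does not match the whole-word pattern for `import`.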

composer_replication/datagen/tests/test_docker_substrate_e2e.py CHANGED

@@ -70,24 +70,17 @@ _MODULE_BROKEN = textwrap.dedent('''\
         return a + b
 ''')
-_TESTS = textwrap.dedent('''\
-    from feature import add
-    try:
-        from feature import mul
-    except ImportError:
-        mul = None
-    def test_add_guard():            # PASS_TO_PASS — must pass in broken state
-        assert add(2, 3) == 5
-    def test_mul_target():           # FAIL_TO_PASS — fails in broken state
-        assert mul is not None and mul(2, 3) == 6
-''')
-def _run_in_container(workdir_files: dict[str, str], test_node: str) -> tuple[bool, str]:
-    """Materialize files in a fresh python:3.11-slim container, run one pytest
-    node, return (passed, output). Uses the docker CLI (no SDK dep)."""
     import os
     import tempfile
@@ -95,50 +88,55 @@ def _run_in_container(workdir_files: dict[str, str], test_node: str) -> tuple[bo
         for name, content in workdir_files.items():
             with open(os.path.join(d, name), "w") as f:
                 f.write(content)
-        # Mount the dir, install pytest, run the single node id.
         cmd = [
             "docker", "run", "--rm", "--network", "none",
             "-v", f"{d}:/work", "-w", "/work",
-            "python:3.11-slim",
-            "bash", "-lc",
-            f"pip install -q pytest >/dev/null 2>&1 && "
-            f"python -m pytest -q '{test_node}' 2>&1",
         ]
-        r = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
         out = (r.stdout or "") + (r.stderr or "")
         return (r.returncode == 0, out)
 def test_substrate_inversion_four_gates_on_real_container():
-    """The 4 ADR-010 gates against a REAL container + REAL pytest, not FakeSandbox.
     Gate 1 (baseline green):    solved state → target + guard both pass.
     Gate 2 (deletion breaks):   broken state → target FAILS.
     Gate 3 (remains functional):broken state → guard PASSES.
     Gate 4 (gold restores):     gold-applied → target passes again.
     """
-    tests_file = {"test_feature.py": _TESTS}
     # Gate 1 — solved: both pass.
-    g1_target, _ = _run_in_container(
-        {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_mul_target")
-    g1_guard, _ = _run_in_container(
-        {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_add_guard")
     assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
-    # Gate 2 — broken: target FAILS.
-    g2_target, g2_out = _run_in_container(
-        {**tests_file, "feature.py": _MODULE_BROKEN}, "test_feature.py::test_mul_target")
-    assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed in broken state:\n{g2_out}"
     # Gate 3 — broken: guard still PASSES.
-    g3_guard, g3_out = _run_in_container(
-        {**tests_file, "feature.py": _MODULE_BROKEN}, "test_feature.py::test_add_guard")
     assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
     # Gate 4 — gold restores: target passes again (gold == the solved module).
-    g4_target, _ = _run_in_container(
-        {**tests_file, "feature.py": _MODULE_SOLVED}, "test_feature.py::test_mul_target")
     assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"

         return a + b
 ''')
+def _run_in_container(workdir_files: dict[str, str], target_expr: str) -> tuple[bool, str]:
+    """Materialize files in a fresh python:3.11-slim container and evaluate one
+    boolean expression against the module, return (passed, output).
+    Uses a plain stdlib runner script (`_runner.py`, run with `python`): NO pip
+    install, so `--network none` is honored (final-verify 2026-05-29: the earlier
+    `pip install pytest` inside a network-disabled container would fail exactly
+    when the gate activates).
+    `target_expr` is a Python expression over the imported `feature` module that
+    must evaluate truthy for a 'pass', e.g. `mul(2,3)==6`.
+    """
     import os
     import tempfile
         for name, content in workdir_files.items():
             with open(os.path.join(d, name), "w") as f:
                 f.write(content)
+        runner = (
+            "import sys\n"
+            "try:\n"
+            "    import feature\n"
+            f"    ok = bool({target_expr})\n"
+            "except Exception as e:\n"
+            "    print('EXC', e); sys.exit(1)\n"
+            "print('PASS' if ok else 'FAIL'); sys.exit(0 if ok else 1)\n"
+        )
+        with open(os.path.join(d, "_runner.py"), "w") as f:
+            f.write(runner)
         cmd = [
             "docker", "run", "--rm", "--network", "none",
             "-v", f"{d}:/work", "-w", "/work",
+            "python:3.11-slim", "python", "_runner.py",
         ]
+        r = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
         out = (r.stdout or "") + (r.stderr or "")
         return (r.returncode == 0, out)
 def test_substrate_inversion_four_gates_on_real_container():
+    """The 4 ADR-010 gates against a REAL container + REAL python import, not FakeSandbox.
     Gate 1 (baseline green):    solved state → target + guard both pass.
     Gate 2 (deletion breaks):   broken state → target FAILS.
     Gate 3 (remains functional):broken state → guard PASSES.
     Gate 4 (gold restores):     gold-applied → target passes again.
     """
+    solved = {"feature.py": _MODULE_SOLVED}
+    broken = {"feature.py": _MODULE_BROKEN}
+    TARGET = "feature.mul(2, 3) == 6"   # FAIL_TO_PASS — exercises the deleted symbol
+    GUARD = "feature.add(2, 3) == 5"    # PASS_TO_PASS — must survive the deletion
     # Gate 1 — solved: both pass.
+    g1_target, _ = _run_in_container(solved, TARGET)
+    g1_guard, _ = _run_in_container(solved, GUARD)
     assert g1_target and g1_guard, "Gate 1 (baseline green) failed on real container"
+    # Gate 2 — broken: target FAILS (mul is gone → AttributeError → non-truthy).
+    g2_target, g2_out = _run_in_container(broken, TARGET)
+    assert not g2_target, f"Gate 2 (deletion breaks target) failed — target passed broken:\n{g2_out}"
     # Gate 3 — broken: guard still PASSES.
+    g3_guard, g3_out = _run_in_container(broken, GUARD)
     assert g3_guard, f"Gate 3 (remains functional) failed — guard broke in broken state:\n{g3_out}"
     # Gate 4 — gold restores: target passes again (gold == the solved module).
+    g4_target, _ = _run_in_container(solved, TARGET)
     assert g4_target, "Gate 4 (gold restores) failed — target did not recover after gold patch"
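
The runner script itself needs neither Docker nor pytest, so its pass/fail contract can be checked on the host. The sketch below reuses the runner text from the diff but swaps the container for a local subprocess. `run_locally`, `SOLVED`, and `BROKEN` are hypothetical names, and the two module bodies are assumed stand-ins for `_MODULE_SOLVED` / `_MODULE_BROKEN`, whose full text is not shown in this diff.

```python
import os
import subprocess
import sys
import tempfile

# Assumed stand-ins for the test module fixtures.
SOLVED = "def add(a, b):\n    return a + b\n\ndef mul(a, b):\n    return a * b\n"
BROKEN = "def add(a, b):\n    return a + b\n"


def build_runner(target_expr: str) -> str:
    """Same runner text as in the diff: exit 0 iff the expression is truthy."""
    return (
        "import sys\n"
        "try:\n"
        "    import feature\n"
        f"    ok = bool({target_expr})\n"
        "except Exception as e:\n"
        "    print('EXC', e); sys.exit(1)\n"
        "print('PASS' if ok else 'FAIL'); sys.exit(0 if ok else 1)\n"
    )


def run_locally(module_src: str, target_expr: str) -> tuple:
    """Write feature.py + _runner.py to a temp dir and run with this interpreter."""
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "feature.py"), "w") as f:
            f.write(module_src)
        with open(os.path.join(d, "_runner.py"), "w") as f:
            f.write(build_runner(target_expr))
        r = subprocess.run(
            [sys.executable, "_runner.py"],
            cwd=d, capture_output=True, text=True, timeout=60,
        )
        return r.returncode == 0, (r.stdout or "") + (r.stderr or "")


TARGET = "feature.mul(2, 3) == 6"
GUARD = "feature.add(2, 3) == 5"

solved_target = run_locally(SOLVED, TARGET)
broken_target = run_locally(BROKEN, TARGET)
broken_guard = run_locally(BROKEN, GUARD)
```

In the broken state the missing `mul` raises `AttributeError` inside the `try`, so the runner prints an `EXC` line and exits 1, which is how Gate 2 observes the failure without a test framework.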

composer_replication/datagen/tests/test_feature_deletion.py CHANGED

@@ -102,6 +102,47 @@ def test_monitor_passes_clean_reimplementation():
     assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
 def test_reward_masked_when_hack_detected():
     sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
     env = FeatureDeletionEnv(sb)

     assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
+def test_monitor_no_false_positive_on_legit_import_words():
+    """Final-verify 2026-05-29: bare-substring read markers flagged 'import' in
+    'important' / 'cat' in 'concatenate'. With whole-word matching + patch
+    exclusion, a legit patch that reintroduces the symbol and uses normal words
+    must NOT be flagged when there's no real cache/bytecode read."""
+    mon = HackMonitor()
+    traj = [
+        {"type": "comment", "text": "this is an important concatenation of features"},
+        {"type": "submit_patch", "patch": "def feature_a():\n    return important()  # __pycache__ mention in a comment"},
+    ]
+    # Symbol reappears in the patch AND the words 'important'/'concatenation'
+    # appear and the patch comment even mentions __pycache__ — but there is NO
+    # actual read action, and the patch text is excluded from read-evidence.
+    assert mon.flag(traj, deleted_symbols=("feature_a",)) is False
+def test_monitor_still_flags_obfuscated_cache_read():
+    """The patch-provenance layer still catches a string-concat-obfuscated cache
+    read that reintroduces a deleted symbol (the real attack it's for)."""
+    mon = HackMonitor()
+    traj = [
+        {"type": "shell", "command": "python -c \"import marshal; d=open('__py'+'cache__/widget.pyc','rb').read()\""},
+        {"type": "submit_patch", "patch": "def feature_a(): return 42  # recovered"},
+    ]
+    assert mon.flag(traj, deleted_symbols=("feature_a",)) is True
+def test_curriculum_independent_effort_signals():
+    """Final-verify 2026-05-29: turns and think_tokens are INDEPENDENTLY optional;
+    they must not share a denominator. A task updated with turns-only then
+    think-only must record correct per-signal means (each = its single value),
+    not values halved by a shared n_effort counter."""
+    cur = DifficultyCurriculum()
+    cur.update("t", n_pass=1, n_total=2, turns=10.0)            # turns only
+    cur.update("t", n_pass=1, n_total=2, think_tokens=200.0)    # think only
+    st = cur._stats["t"]
+    assert st.mean_turns == 10.0, f"turns mean corrupted: {st.mean_turns}"
+    assert st.mean_think == 200.0, f"think mean corrupted: {st.mean_think}"
+    assert st.n_turns == 1 and st.n_think == 1
 def test_reward_masked_when_hack_detected():
     sb = FakeSandbox(test_outcomes={"test_feature_a": True, "test_unrelated": True})
     env = FeatureDeletionEnv(sb)

composer_replication/integrations/altered_minds/reward.py CHANGED

@@ -42,6 +42,15 @@ _ANSWER_RE = re.compile(r"answer\s*[:\-]\s*\*{0,2}([A-D])\b", re.IGNORECASE)
 _JSON_ANSWER_RE = re.compile(
     r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
 )
 def _find_markers(text: str) -> list[str]:
@@ -70,8 +79,14 @@ def parse_final_answer(completion: str) -> tuple[str | None, int]:
     markers = _find_markers(completion)
     if not markers:
         return None, 0
-    distinct = len(set(markers))
-    return markers[-1], distinct
 @dataclass

 _JSON_ANSWER_RE = re.compile(
     r'["\']answer["\']\s*:\s*["\']([A-D])["\']', re.IGNORECASE
 )
+# Hedge AFTER an answer marker: ``Answer: A or B`` / ``Answer: A/B`` /
+# ``Answer: A, B`` — a single marker that names a SECOND distinct option is a
+# format hedge and must be treated as multiple-answers, not full credit for the
+# first letter (final-verify 2026-05-29). Captures the lead letter + the hedged
+# second letter immediately following via 'or' / 'and' / slash / comma / pipe.
+_HEDGE_RE = re.compile(
+    r"answer\s*[:\-]\s*\*{0,2}([A-D])\b\s*(?:or|and|/|,|\|)\s*\*{0,2}([A-D])\b",
+    re.IGNORECASE,
+)
 def _find_markers(text: str) -> list[str]:
     markers = _find_markers(completion)
     if not markers:
         return None, 0
+    distinct = set(markers)
+    # Hedge detection: a single marker naming a second distinct option
+    # ("Answer: A or B") adds the hedged letter to the distinct set, so it is
+    # penalized as a multiple-answers format hack instead of scoring the lead.
+    for m in _HEDGE_RE.finditer(completion or ""):
+        distinct.add(m.group(1).upper())
+        distinct.add(m.group(2).upper())
+    return markers[-1], len(distinct)
 @dataclass
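
The effect of `_HEDGE_RE` on the distinct-answer count can be sketched without the rest of the reward module. This is a simplified illustration: the two regexes are copied from the diff, but `find_markers` here is an assumed reduction of `_find_markers` that uses only the plain `Answer:` pattern (the JSON pattern is omitted).

```python
import re

ANSWER_RE = re.compile(r"answer\s*[:\-]\s*\*{0,2}([A-D])\b", re.IGNORECASE)
HEDGE_RE = re.compile(
    r"answer\s*[:\-]\s*\*{0,2}([A-D])\b\s*(?:or|and|/|,|\|)\s*\*{0,2}([A-D])\b",
    re.IGNORECASE,
)


def find_markers(text: str) -> list:
    """Assumed simplification: every letter that follows an 'Answer:' marker."""
    return [m.group(1).upper() for m in ANSWER_RE.finditer(text)]


def parse_final_answer(completion: str) -> tuple:
    """Return (last marked letter, number of distinct letters claimed)."""
    markers = find_markers(completion)
    if not markers:
        return None, 0
    distinct = set(markers)
    # A single marker naming a second option adds that option to the set.
    for m in HEDGE_RE.finditer(completion or ""):
        distinct.add(m.group(1).upper())
        distinct.add(m.group(2).upper())
    return markers[-1], len(distinct)


print(parse_final_answer("Answer: A or B"))              # ('A', 2)
print(parse_final_answer("Answer: A/B"))                 # ('A', 2)
print(parse_final_answer("After reasoning, Answer: C"))  # ('C', 1)
```

One behaviour that may be worth a follow-up test: because the pattern is case-insensitive, a lowercase article can be read as an option letter, so in this sketch `"Answer: B and a half"` is counted as naming two options.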

composer_replication/integrations/altered_minds/tests/test_channel_ladder.py CHANGED

@@ -23,6 +23,29 @@ from composer_replication.integrations.altered_minds import (
     dual_kl_logger,
     randomize_options,
 )
 # ===========================================================================

     dual_kl_logger,
     randomize_options,
 )
+from composer_replication.integrations.altered_minds.reward import parse_final_answer
+def test_hedged_answer_is_penalized_not_full_credit():
+    """Final-verify 2026-05-29: 'Answer: A or B' / 'Answer: A/B' must NOT score
+    full credit for A — a hedge naming a second distinct option is a multiple-
+    answers format hack (n_distinct >= 2)."""
+    for hedge in ("Answer: A or B", "Answer: A/B", "Answer: A, B", "Answer: A and C"):
+        letter, n_distinct = parse_final_answer(hedge)
+        assert n_distinct >= 2, f"hedge {hedge!r} not detected (n_distinct={n_distinct})"
+    # A clean single answer is still n_distinct == 1.
+    letter, n_distinct = parse_final_answer("After reasoning, Answer: C")
+    assert letter == "C" and n_distinct == 1
+    # Two clean markers of the SAME letter are not a hedge.
+    _, n_same = parse_final_answer("Answer: C ... wait, Answer: C")
+    assert n_same == 1
+def test_hedged_answer_scores_multiple_penalty_via_reward():
+    r = MMLUFormatReward()
+    out = r(prompts=["p"], completions=["Answer: A or B"], answers=["A"])
+    # Even though the gold is A and A is the lead letter, the hedge is penalized.
+    assert out[0] == r.multiple_answers_reward
 # ===========================================================================

composer_replication/trainer/composer_trainer.py CHANGED

@@ -231,10 +231,20 @@ class ComposerReplicationTrainer(GRPOTrainer):  # type: ignore[misc, valid-type]
         # to 0 before gathering, then neutralize those positions by feeding
         # labels=-100 (the standard HF ignore convention that generalized_jsd_loss
         # already honors). This makes sentinel/padding positions contribute 0.
-        if "student_response_valid" in inputs and inputs["student_response_valid"] is not None:
-            aligned_mask = inputs["student_response_valid"].bool()
-        else:
-            aligned_mask = (s_idx >= 0) & (t_idx >= 0)
         vocab = student_logits.size(-1)
         s_safe = s_idx.clamp_min(0)

         # to 0 before gathering, then neutralize those positions by feeding
         # labels=-100 (the standard HF ignore convention that generalized_jsd_loss
         # already honors). This makes sentinel/padding positions contribute 0.
+        #
+        # Final-verify 2026-05-29: combine BOTH valid masks (not just student's)
+        # AND the sentinel guard. If a future collator ever emits divergent
+        # student/teacher valid tails, a teacher sentinel clamped to 0 would
+        # otherwise be silently distilled against teacher position 0. Belt-and-
+        # suspenders: valid iff student-valid AND teacher-valid AND both indices
+        # non-sentinel.
+        s_valid = inputs.get("student_response_valid")
+        t_valid = inputs.get("teacher_response_valid")
+        aligned_mask = (s_idx >= 0) & (t_idx >= 0)
+        if s_valid is not None:
+            aligned_mask = aligned_mask & s_valid.bool()
+        if t_valid is not None:
+            aligned_mask = aligned_mask & t_valid.bool()
         vocab = student_logits.size(-1)
         s_safe = s_idx.clamp_min(0)
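
The mask logic in this hunk is an elementwise AND over up to four conditions. The sketch below mirrors it with plain Python lists in place of tensors so it runs without PyTorch; `build_aligned_mask` and `student_only_mask` are hypothetical helper names, and the index values are made up to model a divergent teacher tail.

```python
from typing import Optional, Sequence


def student_only_mask(s_valid: Sequence[int]) -> list:
    """Pre-fix behaviour when student_response_valid is supplied."""
    return [bool(v) for v in s_valid]


def build_aligned_mask(
    s_idx: Sequence[int],
    t_idx: Sequence[int],
    s_valid: Optional[Sequence[int]] = None,
    t_valid: Optional[Sequence[int]] = None,
) -> list:
    """Post-fix behaviour: sentinel guard AND whichever valid masks exist."""
    mask = [s >= 0 and t >= 0 for s, t in zip(s_idx, t_idx)]
    if s_valid is not None:
        mask = [m and bool(v) for m, v in zip(mask, s_valid)]
    if t_valid is not None:
        mask = [m and bool(v) for m, v in zip(mask, t_valid)]
    return mask


# Position 2: the student is valid but the teacher index is the -1 sentinel.
s_idx = [3, 4, 5, -1]
t_idx = [7, 8, -1, -1]
s_valid = [1, 1, 1, 0]
t_valid = [1, 1, 0, 0]

old_mask = student_only_mask(s_valid)
new_mask = build_aligned_mask(s_idx, t_idx, s_valid, t_valid)
guard_only = build_aligned_mask(s_idx, t_idx)

# Clamping mirrors s_idx.clamp_min(0): the sentinel becomes position 0.
t_safe = [max(t, 0) for t in t_idx]

print(old_mask)  # [True, True, True, False]
print(new_mask)  # [True, True, False, False]
```

Under the old mask, position 2 stays valid while `t_safe[2]` is 0, so the student token would be distilled against teacher position 0. The combined mask drops that position, and the sentinel guard alone is enough to drop it even when no valid masks are supplied.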

docs/reviews/final-verify-deep-work-2026-05-29/SYNTHESIS.md ADDED

	@@ -0,0 +1,33 @@

+# Phase-8 final cross-family verify — deep-work-loop 2026-05-29
+The deep-work-loop exit gate (Hard Rule 6: two independent confirmations). After
+Waves A+B shipped (ADR-011/012/013, 227 tests), the net session code diff
+(~92KB) was scattered to 3 diverse families for a final "would this break a real
+run" pass. 2 of 3 returned usable reviews (GPT-5.5, Gemini; DeepSeek starved its
+reasoning budget). The verify EARNED ITS KEEP: it found a latent P0 + a
+convergent bug + a self-bug in this session's own new code.
+## Findings (all verified against code, all FIXED)
+| # | Finding | Severity | Reviewers | Fix |
+|---|---|---|---|---|
+| 1 | SDPO sentinel-mask used only `student_response_valid`, ignored `teacher_response_valid` — a future divergent teacher tail would distill against clamped position 0 | P0 (latent) | GPT-5.5 | `aligned_mask = (s_idx>=0)&(t_idx>=0)&student_valid&teacher_valid` |
+| 2 | Curriculum `_TaskStats` shared one `n_effort` denominator for both turns + think_tokens → corrupts a mean when only one signal is present | P1 (convergent) | GPT-5.5 + Gemini | separate `n_turns`/`n_think` counters |
+| 3 | Docker E2E test ran `pip install pytest` under `--network none` → would fail exactly when the gate activates | P1 | GPT-5.5 | dropped pytest; stdlib-only `python _runner.py` expression runner, no network needed |
+| 4 | MMLU reward: `Answer: A or B` hedge parsed as `A` with `n_distinct=1` → full credit | P1 | GPT-5.5 | `_HEDGE_RE` adds the hedged second letter to the distinct set → multiple-answers penalty |
+| 5 | HackMonitor read-markers bare-substring matched `import` in `important`, `cat` in `concatenate`; and scanned the submitted patch as read-evidence → false positives | P1 | GPT-5.5 | whole-word `_READ_VERB_RE` + exclude `submit_patch`/`patch`/`diff` payloads from BOTH layers |
+Finding 5's fix surfaced a follow-on: layer-1's signature matcher ALSO scanned
+the patch payload (a legit patch mentioning `__pycache__` in a comment got
+flagged). The regression test caught it; fixed by excluding the patch payload
+from layer-1 too. Net: the monitor now only treats actual read ACTIONS as
+provenance evidence, never the agent's own output patch.
+## Outcome
+All 5 findings fixed + 5 regression tests added. Final suite: **232 passed / 18
+skipped, 0 failed**. Both the execution view (workers reported done) and the
+review view (final verify findings all closed) confirm the loop is empty — the
+two independent confirmations the deep-work-loop requires.
+Raw reviews: `verify_gpt-5.5.md`, `verify_gemini-3.1-pro.md`.

docs/reviews/final-verify-deep-work-2026-05-29/verify_gemini-3.1-pro.md ADDED

	@@ -0,0 +1,4 @@

+## Verdict SHIP-WITH-FIXES
+## P0 none
+## P1
+1. `composer_replication/datagen/curriculum.py:observe_effort` — The running mean math shares the `n_effort` denominator for both `turns` and `think_tokens`. Because these signals are independently optional (`float | None`), if one is provided without the other, its running mean is corrupted (divided by total effort updates rather than the count of that specific signal). **Fix:** Track `n_turns` and `n_think` separately, and use them as the respective denominators when updating `mean_turns` and `mean_think`. Update `_effort_factor` to check `st.n_turns > 0` and `st.n_think > 0` instead of `st.n_effort`.

docs/reviews/final-verify-deep-work-2026-05-29/verify_gpt-5.5.md ADDED

	@@ -0,0 +1,19 @@

+## Verdict
+BLOCK
+## P0
+1. `composer_replication/trainer/composer_trainer.py:_compute_sdpo_loss` — sentinel masking only uses `student_response_valid`; `teacher_response_valid` is ignored. If student/teacher valid tails differ under truncation/tokenization/ragged rows, `teacher_response_idx == -1` is clamped to `0` but still labeled valid, so the JSD silently distills against teacher position 0 garbage. If K differs, this can also shape-crash.
+   **Fix:** build `aligned_mask = student_valid & teacher_valid & (s_idx >= 0) & (t_idx >= 0)`, require/repair common `(B,K)` padding, and in strict mode assert valid masks/counts are compatible before gather. Return zero loss if no valid aligned tokens.
+## P1
+1. `composer_replication/integrations/altered_minds/reward.py:parse_final_answer/_ANSWER_RE` — multiple-answer hedges in a single marker are not detected: e.g. `"Answer: A or B"` / `"Answer: A/B"` parses as `A` with `n_distinct == 1`, so it can receive full credit.
+   **Fix:** after an `Answer:` marker, inspect the answer span and reject/penalize if it contains more than one distinct option letter before the line/end punctuation, not just multiple markers.
+2. `composer_replication/datagen/curriculum.py:_TaskStats.observe_effort/_effort_factor` — one shared `n_effort` is used for both optional signals. If updates sometimes provide only `turns` and sometimes only `think_tokens`, the missing signal still advances the denominator for the other mean, biasing effort means downward. `_effort_factor` also averages a task’s absent signal as `0` whenever any other task has that signal, making weights depend on missingness.
+   **Fix:** track `n_turns` and `n_think` separately; combine only signals actually present for that task.
+3. `composer_replication/datagen/monitor.py:HackMonitor._patch_provenance_hack` — read detection scans every action value, including submitted patch/diff text, and uses bare substring markers (`"cat"`, `"import"`, etc.). A legitimate patch reintroducing a deleted symbol plus code/comment containing `__pycache__` and `import`/`important`/`concatenate` can be falsely flagged and reward-masked.
+   **Fix:** only scan actual shell/file-read/import actions for provenance, exclude `submit_patch`/diff payloads from read evidence, and match read verbs with token/command boundaries near artifact paths.
+4. `composer_replication/datagen/tests/test_docker_substrate_e2e.py:_run_in_container` — the Docker-gated test runs containers with `--network none` and then executes `pip install -q pytest`; on a real Docker host with `python:3.11-slim`, this fails because pytest is not installed and network is disabled. This will break the suite exactly when the gate activates.
+   **Fix:** use an image with pytest preinstalled, vendor/install before disabling network, or avoid pytest inside the container.