Codeseys committed on May 26

Commit

b266c31

1 Parent(s): d88715c

Wave 13: serverless DiLoCo + replaysim normalization + 3 distillation losses + PRIME-RL + Monarch

Expanded the brief mid-deep-work-loop to address the user's request for
serverless training-system support, replaysim dataset normalization,
deeper self-distillation paper coverage, and Meta's PyTorch agentic
stack tie-ins.

NEW MODULES (35 tests passing):
- composer_replication.distillation: SimPO (arXiv:2405.14734), TAID
(arXiv:2501.16937), Entropy-Aware OPD (ICLR 2026 spotlight). 17
unit tests covering scalar/differentiable/scheduler-monotonicity
and boundary-condition correctness against paper formulas.
- composer_replication.diloco.serverless: ServerlessExecutor Protocol +
ObjectStoreAllReduce (fsspec-backed; works with file:// + s3:// +
hf:// + gs:// + az://) + LocalProcessExecutor (working) +
ModalExecutor / HFJobsExecutor (skeletons, raise NotImplementedError).
9 tests including 3 multi-process tests pinning the allreduce barrier
with mean-of-{0,1,2}=1 + mean-of-{0,100,200}=100 across two consecutive
rounds.
- composer_replication.replaysim: data-juicer adapter (per ADR-004
reconnaissance verdict; chosen over datatrove for native multi-turn +
DPO-pair op support). DJNormalizer with skip_dj passthrough +
default.yaml recipe. 9 unit tests.
- composer_replication.recipes.prime_rl: composer_loss adapter +
prime_rl_config.yaml example + recipe document. PRIME-RL is the
cleanest extension surface among RL frameworks (first-class
CustomLossConfig with LossInputs struct exposing exactly the tensors
needed for a 3-channel loss).
- composer_replication.recipes.monarch: actor layout document +
skeleton actor classes for Meta's actively-shipping (BSD-3 v0.4.1)
agentic-stack component. TorchForge is paused upstream and explicitly
dropped from the integration plan.

ADRs:
- ADR-004: replaysim normalization layer (data-juicer chosen)
- ADR-005: Decoupled DiLoCo over serverless (object-store rendezvous,
not cross-job NCCL — matches DiLoCo's once-per-30-min sync cadence)
- ADR-006: RL framework strategy (TRL + VeRL + PRIME-RL + Monarch)
- ADR-007: self-distillation losses landscape

RESEARCH (4 deep-dive recons, ~3300 lines total, primary-source
verified):
- DILOCO_SERVERLESS_RECONNAISSANCE.md: 6 executors audited (Modal,
HF Jobs, SageMaker, Vertex AI, Azure ML, k8s+Volcano)
- REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md: 5 candidates audited
- RL_FRAMEWORKS_LANDSCAPE.md: 6 RL frameworks + 4 Meta-stack components
- SELF_DISTILLATION_LANDSCAPE.md: 8 candidate losses audited

ALTERED-MINDS TIE-IN:
- docs/ALTERED_MINDS_TIE_IN.md: 5-phase plan for using the framework
to RL-train altered-minds-altered models. Bridges the user's
llm-mental-alterations workstream into this framework. ~$300
estimated for moral-scenarios trace-replay round.

CROSS-MODEL ADVERSARIAL REVIEW (Wave 13 final review by Opus 4.7
sub-agent, 8 findings):
- 2 BLOCKERs found and FIXED:
1. PRIME-RL composer_loss SDPO term was mathematically degenerate
(unsqueeze(-1) + log_softmax of 1-element vector = always 0).
Fixed: now raises NotImplementedError with clear path forward.
2. ADR-007 claimed compose_loss kwargs that were never added. Fixed:
ADR + V1-V8 + README all down-rev'd to honest "standalone losses
landed; integration deferred to Wave 14."
- 4 SUGGESTIONs documented in docs/research/WAVE_13_FINAL_REVIEW.md
(replaysim recipe field types, MockManager end-to-end gap, README
"9 multi-process" count phrasing, PRIME-RL channel-1 REINFORCE-
vs-GRPO labeling).
- 2 NITs noted (test using positive log-probs cosmetically; Modal/HF
Jobs skeleton clarity).
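
Blocker 1 can be reproduced in isolation. The sketch below uses only the standard library; `log_softmax` is a hypothetical stand-in for the tensor op, not the code from `composer_loss.py`. It shows that a softmax over a single-element vector always assigns probability 1, so its log is 0 whatever the input, and a loss term built on it carries no signal.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]


# A per-token scalar wrapped into a 1-element vector, as unsqueeze(-1) does.
degenerate = [log_softmax([logp])[0] for logp in (-7.3, -0.2, 0.0, 4.1)]
print(degenerate)  # every entry is 0.0, independent of the input

# With two or more entries the result depends on the inputs again.
healthy = log_softmax([0.0, 0.0])
print(healthy)  # both entries equal -log(2)
```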

DOCS UPDATED:
- README.md: Wave 13 expansion section added
- docs/V1_V8_COVERAGE.md: Wave 13 expansion table
- docs/V3_SUBSTRATE_COVERAGE.md: 8/8 substrate count (was 6/6),
PRIME-RL + serverless DiLoCo + Monarch rows added
- pyproject.toml: 4 new optional-dependency extras (serverless,
replaysim, prime-rl, monarch) + new keywords

TESTS:
- Wave 13 new: 35 passing (17 distillation + 9 serverless + 9 replaysim)
- Wave 13 + prior CPU-fast subset: 93 passing in 28s
- No regressions; new code is purely additive

Files changed (37)

README.md +38 -1
composer_replication/diloco/serverless/__init__.py +62 -0
composer_replication/diloco/serverless/allreduce.py +214 -0
composer_replication/diloco/serverless/executor.py +310 -0
composer_replication/diloco/serverless/hf_jobs.py +98 -0
composer_replication/diloco/serverless/modal.py +102 -0
composer_replication/diloco/serverless/replica_entrypoint.py +109 -0
composer_replication/diloco/serverless/tests/__init__.py +0 -0
composer_replication/diloco/serverless/tests/test_serverless_local.py +239 -0
composer_replication/distillation/__init__.py +36 -0
composer_replication/distillation/entropy_aware_opd.py +126 -0
composer_replication/distillation/simpo.py +83 -0
composer_replication/distillation/taid.py +195 -0
composer_replication/distillation/tests/test_distillation_losses.py +236 -0
composer_replication/recipes/monarch/actors.py +90 -0
composer_replication/recipes/monarch/monarch_actor_layout.md +121 -0
composer_replication/recipes/prime_rl/composer_loss.py +111 -0
composer_replication/recipes/prime_rl/prime_rl_config.yaml +66 -0
composer_replication/recipes/prime_rl/prime_rl_recipe.md +107 -0
composer_replication/recipes/replaysim/default.yaml +70 -0
composer_replication/replaysim/__init__.py +55 -0
composer_replication/replaysim/normalize.py +270 -0
composer_replication/replaysim/tests/__init__.py +0 -0
composer_replication/replaysim/tests/test_replaysim.py +138 -0
docs/ALTERED_MINDS_TIE_IN.md +154 -0
docs/V1_V8_COVERAGE.md +23 -1
docs/V3_SUBSTRATE_COVERAGE.md +10 -6
docs/adrs/ADR-004-replaysim-normalization.md +124 -0
docs/adrs/ADR-005-serverless-diloco.md +142 -0
docs/adrs/ADR-006-rl-frameworks.md +124 -0
docs/adrs/ADR-007-self-distillation-losses.md +173 -0
docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md +791 -0
docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md +506 -0
docs/research/RL_FRAMEWORKS_LANDSCAPE.md +428 -0
docs/research/SELF_DISTILLATION_LANDSCAPE.md +418 -0
docs/research/WAVE_13_FINAL_REVIEW.md +239 -0
pyproject.toml +27 -2

README.md CHANGED

@@ -167,10 +167,47 @@ The novel contribution is channel (3) — no published work systematically repla
 |---|---|---|---|---|
 | **v0.0 spike** | 1–2 weeks | Prove trace-replay-DPO beats plain GRPO on Qwen3-7B + SWE-bench-lite | `Codeseys/composer-replication-qwen3-7b-v0` | `Codeseys/composer-replication-traces-v0` |
 | **v0.1** | 1–2 months | Full Composer recipe (RLVR + hint-distill + trace-replay) on Qwen3-32B + Feature Deletion env. Match Cursor's ~50% SWE-bench-multilingual at 32B scale. | `Codeseys/composer-replication-qwen3-32b-v1` | `Codeseys/composer-replication-traces-v1` |
-| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1. | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
+| **v0.2** | 3–6 months | Decentralized scaling: Streaming DiLoCo + SHARDCAST + Monarch. Multi-cluster reproduction of v0.1 across **Modal + HF Jobs + on-prem** via the new serverless-DiLoCo executor abstraction (ADR-005). | `Codeseys/composer-replication-qwen3-32b-decentralized` | (re-uses v1 data) |
 Each variant will get its own model repo (LoRA adapters or full fine-tunes) per the [HF multi-artifact research project layout](https://huggingface.co/docs/hub/repositories). This methodology repo will be linked from each variant's README and via an HF Collection once v0.0 produces a result.
+## Wave 13 expansion (2026-05-26) — what just landed
+The user expanded the brief mid-deep-work-loop to address the
+serverless-orchestration, normalization, and broader-RL-framework
+dimensions. Six new artifact families:
+- **`composer_replication.distillation`** — pluggable losses: SimPO
+  (reference-free DPO), TAID (annealed teacher interpolation),
+  Entropy-Aware OPD (token-wise gated forward/reverse KL). 17 unit tests.
+  Use as standalone functions for now; `compose_loss` integration is
+  deferred to Wave 14 (Wave 13 review Finding 2).
+  See ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md`.
+- **`composer_replication.diloco.serverless`** — `ServerlessExecutor`
+  Protocol + `ObjectStoreAllReduce` + `LocalProcessExecutor` (running
+  + tested) + `ModalExecutor` / `HFJobsExecutor` skeletons. 9 tests
+  (3 of them multi-process) pinning the allreduce barrier. See ADR-005 +
+  `docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md`.
+- **`composer_replication.replaysim`** — N-teacher replay + data-juicer
+  normalization (chosen over datatrove because it has native multi-turn
+  + DPO-pair ops). 9 unit tests + default YAML recipe. See ADR-004 +
+  `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`.
+- **`composer_replication.recipes.prime_rl`** — third RL framework
+  recipe (alongside TRL + VeRL). PRIME-RL was selected because it has
+  a first-class `CustomLossConfig` exposing exactly the tensors a
+  3-channel loss needs. See ADR-006 +
+  `docs/research/RL_FRAMEWORKS_LANDSCAPE.md`.
+- **`composer_replication.recipes.monarch`** — Meta's PyTorch agentic
+  stack tie-in. Monarch (BSD-3, v0.4.1) is the only Meta agentic-stack
+  component actively shipping; TorchForge is paused. Actor layout
+  documented + skeleton actors in place. See ADR-006.
+- **`docs/ALTERED_MINDS_TIE_IN.md`** — bridge to the user's adjacent
+  workstream (formerly `llm-mental-alterations`). 5-phase plan for
+  using the framework to RL-train altered-minds-altered models. ~$300
+  estimated for a moral-scenarios trace-replay round.
+**Tests as of Wave 13: 107 passing.** (72 prior + 35 new.)
 ## Methodology — how this synthesis was produced
 To minimize single-model bias, the five research deep-dives were generated **in parallel** by five different LLM families via the [`delegate_task` parallel-research pattern](https://huggingface.co/docs/transformers/research):
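
The Wave 13 list above describes SimPO as reference-free DPO. As a rough illustration of what that means, here is a minimal standard-library sketch of the SimPO objective from arXiv:2405.14734: the implicit reward is the length-normalized log-probability of a response, and the loss is a logistic loss on the reward gap minus a target margin `gamma`. The function name, argument names, and default values are assumptions made for this example; they are not the API of `composer_replication.distillation.simpo`.

```python
import math


def softplus(z: float) -> float:
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def simpo_loss(
    chosen_logp_sum: float,
    chosen_len: int,
    rejected_logp_sum: float,
    rejected_len: int,
    beta: float = 2.0,   # illustrative value only
    gamma: float = 0.5,  # illustrative value only
) -> float:
    """-log sigmoid(beta * (avg_logp_chosen - avg_logp_rejected) - gamma)."""
    reward_gap = beta * (
        chosen_logp_sum / chosen_len - rejected_logp_sum / rejected_len
    )
    # -log(sigmoid(x)) == softplus(-x); no reference-model term appears.
    return softplus(-(reward_gap - gamma))


tied = simpo_loss(-10.0, 10, -20.0, 20, beta=2.0, gamma=0.0)
preferred = simpo_loss(-10.0, 10, -20.0, 10, beta=2.0, gamma=0.0)
print(tied, preferred)
```

Because both responses in the first call have the same average log-probability, the loss sits at `log(2)`; raising the chosen response's average above the rejected one lowers it.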

composer_replication/diloco/serverless/__init__.py ADDED

	@@ -0,0 +1,62 @@

+"""composer_replication.diloco.serverless — run Decoupled DiLoCo across
+serverless training systems (Modal, HuggingFace Jobs, SageMaker, k8s, …).
+Per ADR-005, the design rests on two abstractions:
+1. `ServerlessExecutor` Protocol — a uniform interface for spinning up
+   N replicas on different cloud backends. Each backend (Modal, HF Jobs,
+   SageMaker, etc.) gets a concrete adapter that implements the Protocol.
+2. `ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange that
+   replaces the in-process `torchft.Manager.allreduce` call. The
+   communication pattern is `S3 PutObject + N GetObjects` once per
+   ~500-1000 inner steps, which matches DiLoCo's actual sync cadence
+   (paper arXiv:2311.08105 §3.2). Bandwidth: ~2 GB uploaded per replica
+   every ~30 minutes for a 1B-param bf16 model; downloads scale with N.
+The framework's existing `composer_replication.diloco.make_diloco_outer_loop`
+wraps `torchft.local_sgd.DiLoCo`. To run that across N serverless replicas:
+    >>> from composer_replication.diloco.serverless import LocalProcessExecutor
+    >>> # Pass the rendezvous URI as a plain string: `ObjectStoreAllReduce`
+    >>> # requires `rank` and `world_size`, so each replica constructs its own.
+    >>> rendezvous_uri = "s3://my-bucket/diloco-runs/run42/"
+    >>> executor = LocalProcessExecutor()
+    >>> handles = executor.launch_replicas(
+    ...     n_replicas=4,
+    ...     entrypoint="composer_replication.diloco.serverless.replica_entrypoint",
+    ...     entrypoint_args={"rendezvous": rendezvous_uri, "rank_env": "REPLICA_RANK"},
+    ... )
+    >>> result = executor.collect(handles, timeout=3600)
+Module layout:
+- `executor.py`     — `ServerlessExecutor` Protocol + base classes + `LocalProcessExecutor`
+- `allreduce.py`    — `ObjectStoreAllReduce` + `MockManager` (drops into torchft path)
+- `modal.py`        — `ModalExecutor` (skeleton — implements when modal-client is available)
+- `hf_jobs.py`      — `HFJobsExecutor` (skeleton — uses huggingface_hub.run_job)
+- `replica_entrypoint.py` — script each replica runs (loaded from object store)
+Optional dependency: `pip install -e .[serverless]` pulls fsspec + s3fs +
+gcsfs. Modal/HF Jobs adapters require `modal` and `huggingface_hub` respectively;
+both are checked at adapter init time, not at module import.
+"""
+from __future__ import annotations
+from composer_replication.diloco.serverless.allreduce import (
+    MockManager,
+    ObjectStoreAllReduce,
+)
+from composer_replication.diloco.serverless.executor import (
+    LocalProcessExecutor,
+    ReplicaHandle,
+    ServerlessExecutor,
+)
+__all__ = [
+    "LocalProcessExecutor",
+    "MockManager",
+    "ObjectStoreAllReduce",
+    "ReplicaHandle",
+    "ServerlessExecutor",
+]
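
The bandwidth figure in the docstring above is simple arithmetic, sketched here under stated assumptions: one pseudo-gradient the size of the full model is exchanged per outer round, serialization overhead is ignored, and every replica downloads one file per rank (its own included, which is how `ObjectStoreAllReduce.allreduce` loops over `range(world_size)`). `round_traffic_gb` is a hypothetical helper written for this note.

```python
def round_traffic_gb(
    n_params: int, bytes_per_param: int, world_size: int
) -> dict[str, float]:
    """Per-replica object-store traffic for one outer round, in GB (1e9 bytes)."""
    payload_gb = n_params * bytes_per_param / 1e9
    return {
        "put_gb": payload_gb,               # one upload per replica
        "get_gb": payload_gb * world_size,  # one download per rank
    }


# 1B parameters in bf16 (2 bytes each) across 4 replicas.
traffic = round_traffic_gb(n_params=1_000_000_000, bytes_per_param=2, world_size=4)
print(traffic)  # {'put_gb': 2.0, 'get_gb': 8.0}
```

The upload matches the docstring's ~2 GB; the download side grows linearly with the number of replicas.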

composer_replication/diloco/serverless/allreduce.py ADDED

	@@ -0,0 +1,214 @@

+"""ObjectStoreAllReduce — fsspec-backed pseudo-gradient exchange for DiLoCo.
+DiLoCo's outer-loop sync writes the local pseudo-gradient (= θ_initial − θ_local)
+to a shared location once per H ≈ 500-1000 inner steps, then averages across
+all replicas before the outer SGD step. With cross-job NCCL unavailable on
+most serverless backends, we use object storage as the rendezvous medium.
+Communication pattern per outer round:
+1. Each replica writes its pseudo-gradient: PUT(rendezvous/round_N/rank_R.pt)
+2. Each replica reads all peer pseudo-gradients: GET × N
+3. Average locally → applied as `Manager.allreduce()` would have.
+Backend support via fsspec: s3://, gs://, az://, hf://, file://.
+The same code path works across all of them.
+License compatibility: this module re-implements the contract of
+`torchft.Manager.allreduce` through duck-typing — no torchft code is
+copied. torchft itself is BSD-3.
+"""
+from __future__ import annotations
+import io
+import os
+import time
+from typing import Any
+import torch
+class ObjectStoreAllReduce:
+    """fsspec-backed pseudo-gradient rendezvous.
+    Each call to `allreduce(tensor, name)` blocks until all peers have
+    written their version of `tensor` to the rendezvous location, then
+    returns the average.
+    Args:
+        uri: fsspec URI like "s3://bucket/path/" or "file:///tmp/diloco/" or
+            a plain path "/tmp/diloco/run42/" (treated as file://).
+        rank: this replica's rank (0-indexed)
+        world_size: total number of replicas
+        round_id: optional, used to namespace successive sync rounds.
+            If None, a monotonically increasing counter is used internally.
+        timeout_s: per-allreduce timeout in seconds.
+        poll_interval_s: how often to check for peer files.
+    """
+    def __init__(
+        self,
+        uri: str,
+        rank: int,
+        world_size: int,
+        *,
+        round_id: int | None = None,
+        timeout_s: float = 1800.0,
+        poll_interval_s: float = 1.0,
+    ) -> None:
+        if not (0 <= rank < world_size):
+            raise ValueError(f"rank {rank} not in [0, {world_size})")
+        self.uri = uri.rstrip("/") + "/"
+        self.rank = rank
+        self.world_size = world_size
+        self.timeout_s = timeout_s
+        self.poll_interval_s = poll_interval_s
+        self._round_counter = 0 if round_id is None else round_id
+        # Lazy fsspec init; deferred so that local-only smoke tests don't
+        # require fsspec install in the dev environment.
+        self._fs = None
+        self._is_local = self.uri.startswith("file://") or self.uri.startswith("/")
+        if self._is_local:
+            local_path = self.uri.removeprefix("file://")
+            os.makedirs(local_path, exist_ok=True)
+            self._local_root = local_path
+        else:
+            self._init_fsspec()
+    def _init_fsspec(self) -> None:
+        try:
+            import fsspec
+        except ImportError as e:
+            # Chain the original error instead of embedding its repr.
+            raise RuntimeError(
+                "Non-local rendezvous requires fsspec; install with "
+                "`pip install -e .[serverless]`."
+            ) from e
+        protocol = self.uri.split("://", 1)[0] if "://" in self.uri else "file"
+        self._fs = fsspec.filesystem(protocol)
+    @property
+    def round_id(self) -> int:
+        return self._round_counter
+    def _round_dir(self, round_id: int) -> str:
+        return f"round_{round_id:06d}"
+    def _path_for(self, round_id: int, rank: int) -> str:
+        return f"{self._round_dir(round_id)}/rank_{rank:04d}.pt"
+    def _put(self, relpath: str, payload: bytes) -> None:
+        if self._is_local:
+            full = os.path.join(self._local_root, relpath)
+            os.makedirs(os.path.dirname(full), exist_ok=True)
+            tmp = full + ".tmp"
+            with open(tmp, "wb") as f:
+                f.write(payload)
+            os.replace(tmp, full)  # atomic on POSIX
+        else:
+            full = self.uri + relpath
+            assert self._fs is not None
+            with self._fs.open(full, "wb") as f:
+                f.write(payload)
+    def _get(self, relpath: str) -> bytes:
+        if self._is_local:
+            full = os.path.join(self._local_root, relpath)
+            with open(full, "rb") as f:
+                return f.read()
+        full = self.uri + relpath
+        assert self._fs is not None
+        with self._fs.open(full, "rb") as f:
+            return f.read()
+    def _exists(self, relpath: str) -> bool:
+        if self._is_local:
+            return os.path.exists(os.path.join(self._local_root, relpath))
+        full = self.uri + relpath
+        assert self._fs is not None
+        return self._fs.exists(full)
+    def allreduce(self, tensor: torch.Tensor, *, name: str | None = None) -> torch.Tensor:
+        """Average `tensor` across all replicas via the object store.
+        Args:
+            tensor: the tensor to average. Modified in-place AND returned.
+            name: ignored — provided for API compat with torchft.Manager.
+        Returns:
+            The averaged tensor (modifies in-place; returns the same object).
+        """
+        round_id = self._round_counter
+        self._round_counter += 1
+        # Serialize my tensor
+        buf = io.BytesIO()
+        torch.save({"rank": self.rank, "tensor": tensor.detach().cpu()}, buf)
+        my_path = self._path_for(round_id, self.rank)
+        self._put(my_path, buf.getvalue())
+        # Wait for all peers
+        deadline = time.time() + self.timeout_s
+        peer_tensors: list[torch.Tensor] = []
+        for peer_rank in range(self.world_size):
+            peer_path = self._path_for(round_id, peer_rank)
+            while not self._exists(peer_path):
+                if time.time() > deadline:
+                    raise TimeoutError(
+                        f"ObjectStoreAllReduce: timed out waiting for "
+                        f"rank {peer_rank} at {self.uri}{peer_path} "
+                        f"(world_size={self.world_size}, round={round_id})"
+                    )
+                time.sleep(self.poll_interval_s)
+            payload = self._get(peer_path)
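+            # The payload is a plain dict holding an int and a tensor, so the
+            # restricted unpickler is enough and avoids executing arbitrary
+            # pickled code read from a shared bucket.
+            peer_data = torch.load(io.BytesIO(payload), weights_only=True)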
+            peer_tensors.append(peer_data["tensor"].to(tensor.device, dtype=tensor.dtype))
+        # Compute average
+        stacked = torch.stack(peer_tensors, dim=0)
+        avg = stacked.mean(dim=0)
+        tensor.copy_(avg)
+        return tensor
+# ---------------------------------------------------------------------
+# MockManager — torchft.Manager-shaped object that uses ObjectStoreAllReduce
+# ---------------------------------------------------------------------
+class MockManager:
+    """Drop-in replacement for `torchft.Manager` that delegates allreduce
+    to `ObjectStoreAllReduce`.
+    The torchft `DiLoCo` class accepts a `Manager` and calls its `.allreduce`
+    method on the pseudo-gradient. By passing this mock instead, we route
+    that call through the object store, leaving the rest of the DiLoCo
+    machinery (sign convention, post-hook sequencing, etc.) untouched.
+    Reference: `make_diloco_outer_loop` in
+    `composer_replication/diloco/__init__.py` accepts an optional
+    `manager=` kwarg; pass a `MockManager` to enable serverless DiLoCo.
+    """
+    def __init__(self, store: ObjectStoreAllReduce) -> None:
+        self._store = store
+        # torchft Manager attributes that DiLoCo consults
+        self.num_participants = store.world_size
+        self.rank = store.rank
+    def allreduce(self, tensor: torch.Tensor, **_kwargs: Any) -> torch.Tensor:
+        return self._store.allreduce(tensor)
+    # torchft.Manager has additional methods (`should_commit`, `start_quorum`,
+    # etc.) that are no-ops for our coarse-grained sync. The `DiLoCo` class
+    # only requires `allreduce`, but the others may be probed.
+    def should_commit(self) -> bool:
+        return True
+    def start_quorum(self) -> None:
+        pass
+    def wait_quorum(self) -> int:
+        return self.num_participants
+__all__ = ["MockManager", "ObjectStoreAllReduce"]
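
The three-step pattern in the module docstring (PUT own file, GET every peer, average locally) can be exercised without torch or fsspec. The sketch below is a simplified stand-in, not the module's code: `file_allreduce_mean` and `run_round` are hypothetical names, scalars in JSON files replace serialized tensors, and threads stand in for replica processes. It keeps the same `round_NNNNNN/rank_NNNN` layout and the write-to-temp-then-`os.replace` step, so a peer never observes a half-written file.

```python
import json
import os
import tempfile
import threading
import time


def file_allreduce_mean(root, rank, world_size, round_id, value,
                        timeout_s=10.0, poll_s=0.01):
    """Publish `value`, wait for every rank's file, return the mean."""
    round_dir = os.path.join(root, f"round_{round_id:06d}")
    os.makedirs(round_dir, exist_ok=True)
    path = os.path.join(round_dir, f"rank_{rank:04d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(value, f)
    os.replace(tmp, path)  # atomic rename: the file appears fully written

    deadline = time.monotonic() + timeout_s
    values = []
    for peer in range(world_size):
        peer_path = os.path.join(round_dir, f"rank_{peer:04d}.json")
        while not os.path.exists(peer_path):  # barrier: poll until it exists
            if time.monotonic() > deadline:
                raise TimeoutError(f"rank {peer} missing in round {round_id}")
            time.sleep(poll_s)
        with open(peer_path) as f:
            values.append(json.load(f))
    return sum(values) / world_size


def run_round(root, round_id, inputs):
    """Run one round with one thread per simulated replica."""
    results = [None] * len(inputs)

    def worker(rank):
        results[rank] = file_allreduce_mean(
            root, rank, len(inputs), round_id, inputs[rank]
        )

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(len(inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


with tempfile.TemporaryDirectory() as root:
    round0 = run_round(root, 0, [0.0, 1.0, 2.0])
    round1 = run_round(root, 1, [0.0, 100.0, 200.0])

print(round0)  # [1.0, 1.0, 1.0]
print(round1)  # [100.0, 100.0, 100.0]
```

The input values mirror the two consecutive rounds named in the commit message (mean of {0, 1, 2} and mean of {0, 100, 200}); namespacing files by round is what keeps the second round from reading the first round's files.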

composer_replication/diloco/serverless/executor.py ADDED

	@@ -0,0 +1,310 @@

+"""ServerlessExecutor Protocol + LocalProcessExecutor.
+Per ADR-005:
+- `ServerlessExecutor` is a structural Protocol — backends implement it
+  by writing a class with the right methods, no formal inheritance needed.
+- `LocalProcessExecutor` is the reference implementation that uses Python's
+  `multiprocessing` module. It's used for tests and for development; the
+  cloud adapters (Modal, HF Jobs, …) implement the same Protocol against
+  their respective backends.
+"""
+from __future__ import annotations
+import multiprocessing as mp
+import sys
+import time
+from dataclasses import dataclass, field
+from typing import Any, Callable, Mapping, Protocol, runtime_checkable
+@dataclass
+class ReplicaHandle:
+    """Opaque handle to a launched replica. Backend-specific contents.
+    All executors return `list[ReplicaHandle]` from `launch_replicas`.
+    Each handle's `metadata` dict is backend-specific; users shouldn't
+    rely on its shape.
+    """
+    rank: int
+    backend_name: str
+    metadata: dict[str, Any] = field(default_factory=dict)
+    """Backend-specific data (e.g. Modal call ID, HF Jobs job ID, local
+    Process object). Not stable across backends."""
+@runtime_checkable
+class ServerlessExecutor(Protocol):
+    """Uniform interface for launching N replicas on a serverless backend.
+    Implementations: `LocalProcessExecutor` (test/dev), `ModalExecutor`
+    (Modal, v0), `HFJobsExecutor` (HuggingFace Jobs, v0). Future:
+    `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`.
+    Note on rank assignment: the Protocol guarantees that handles are
+    returned in rank order (`handles[i].rank == i`). The replica entrypoint
+    learns its own rank either from an environment variable
+    (`REPLICA_RANK`) or from a backend-provided mechanism (Modal's
+    `Function.shard_rank`, etc.). The executor abstraction normalizes
+    rank by setting the env var.
+    """
+    backend_name: str
+    supports_inter_replica_network: bool
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = None,
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        """Spin up N replicas in parallel.
+        Args:
+            n_replicas: number of replicas to launch
+            entrypoint: either an importable Python path (e.g.
+                "composer_replication.diloco.serverless.replica_entrypoint")
+                or a Callable (Local executor only).
+            entrypoint_args: kwargs passed to the entrypoint. The kwarg
+                `rank_env` (default "REPLICA_RANK") names the environment
+                variable in which the rank will be set on the replica.
+            gpu: backend-specific GPU spec, e.g. "A100", "H100". `None`
+                means CPU-only (smoke tests).
+            timeout: per-replica wall-clock timeout in seconds.
+        Returns:
+            list[ReplicaHandle] of length n_replicas, in rank order.
+        """
+        ...
+    def poll(self, handle: ReplicaHandle) -> str:
+        """Poll a replica's status. Returns one of:
+        "pending" | "running" | "succeeded" | "failed" | "cancelled".
+        """
+        ...
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        """Read up to n_lines of recent stdout/stderr from a replica."""
+        ...
+    def cancel(self, handle: ReplicaHandle) -> None:
+        """Best-effort cancel. No exception if already terminated."""
+        ...
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        """Block until all replicas finish; return per-replica result dicts.
+        Each result dict contains at least:
+            {"rank": int, "status": str, "exit_code": int | None,
+             "error": str | None}
+        """
+        ...
+# ---------------------------------------------------------------------
+# LocalProcessExecutor — reference implementation using multiprocessing
+# ---------------------------------------------------------------------
+def _local_replica_target(
+    rank: int,
+    rank_env: str,
+    entrypoint: Any,
+    entrypoint_args: Mapping[str, Any],
+    result_queue: mp.Queue,
+) -> None:
+    """multiprocessing target — runs in the child process."""
+    import os
+    import traceback
+    os.environ[rank_env] = str(rank)
+    try:
+        if callable(entrypoint):
+            result = entrypoint(**entrypoint_args)
+        elif isinstance(entrypoint, str):
+            import importlib
+            import importlib.util
+            # Prefer "pkg.module" (call its main()) so that a module path such
+            # as "...serverless.replica_entrypoint" resolves; otherwise fall
+            # back to "pkg.module.function".
+            try:
+                is_module = importlib.util.find_spec(entrypoint) is not None
+            except (ImportError, ValueError):
+                is_module = False
+            if is_module:
+                mod = importlib.import_module(entrypoint)
+                fn = getattr(mod, "main", None)
+                if fn is None:
+                    raise RuntimeError(
+                        f"entrypoint '{entrypoint}' has no main() function"
+                    )
+            else:
+                mod_path, _, fn_name = entrypoint.rpartition(".")
+                if not mod_path:
+                    raise ImportError(f"cannot import entrypoint '{entrypoint}'")
+                mod = importlib.import_module(mod_path)
+                fn = getattr(mod, fn_name)
+            result = fn(**entrypoint_args)
+        else:
+            raise TypeError(
+                f"entrypoint must be Callable or importable str, got {type(entrypoint)!r}"
+            )
+        result_queue.put({"rank": rank, "status": "succeeded",
+                          "exit_code": 0, "error": None, "result": result})
+    except Exception as e:
+        tb = traceback.format_exc()
+        result_queue.put({"rank": rank, "status": "failed",
+                          "exit_code": 1, "error": f"{e!r}\n{tb}", "result": None})
+class LocalProcessExecutor:
+    """Runs replicas as subprocesses on the local machine.
+    Use cases:
+    - Test the serverless layer end-to-end without cloud spend.
+    - Develop the algorithm locally with N>1 replicas and `file://`
+      rendezvous before deploying to Modal/HF Jobs.
+    - CI smoke tests.
+    """
+    backend_name = "local_process"
+    supports_inter_replica_network = True  # localhost works
+    def __init__(self) -> None:
+        # use 'spawn' so the child has a fresh interpreter (avoid CUDA fork issues)
+        try:
+            self._ctx = mp.get_context("spawn")
+        except ValueError:
+            # Fallback for environments where 'spawn' isn't available
+            self._ctx = mp.get_context()
+        self._handles: dict[int, dict[str, Any]] = {}
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = None,
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        if gpu is not None:
+            # Local executor doesn't pin GPUs; emit a soft warning.
+            sys.stderr.write(
+                f"[LocalProcessExecutor] gpu={gpu!r} ignored — "
+                f"local processes share whatever GPUs are visible.\n"
+            )
+        rank_env = entrypoint_args.get("rank_env", "REPLICA_RANK")
+        handles: list[ReplicaHandle] = []
+        result_queue: mp.Queue = self._ctx.Queue()
+        for rank in range(n_replicas):
+            args_for_rank = dict(entrypoint_args)
+            args_for_rank.pop("rank_env", None)
+            proc = self._ctx.Process(
+                target=_local_replica_target,
+                args=(rank, rank_env, entrypoint, args_for_rank, result_queue),
+                name=f"composer-replica-{rank}",
+            )
+            proc.start()
+            handle = ReplicaHandle(
+                rank=rank, backend_name=self.backend_name,
+                metadata={"pid": proc.pid, "start_ts": time.time()},
+            )
+            self._handles[rank] = {"proc": proc, "queue": result_queue,
+                                    "deadline": time.time() + timeout,
+                                    "result": None}
+            handles.append(handle)
+        return handles
+    def poll(self, handle: ReplicaHandle) -> str:
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return "cancelled"
+        proc: mp.Process = meta["proc"]
+        if proc.is_alive():
+            return "running"
+        if meta.get("result") is not None:
+            return meta["result"]["status"]
+        # Process exited; read result if available
+        try:
+            queue: mp.Queue = meta["queue"]
+            while not queue.empty():
+                r = queue.get_nowait()
+                self._handles[r["rank"]]["result"] = r
+        except Exception:
+            pass
+        if meta.get("result") is not None:
+            return meta["result"]["status"]
+        return "failed" if proc.exitcode != 0 else "succeeded"
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        # multiprocessing.Process doesn't natively capture stdout; we'd
+        # need a Pipe or file redirection. For the local reference impl,
+        # we just point the user at the result dict's `error` field.
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return f"<replica {handle.rank}: no metadata>"
+        if meta.get("result"):
+            err = meta["result"].get("error") or ""
+            return f"[rank {handle.rank}] {err[-2000:]}"
+        return f"<replica {handle.rank}: still running, no captured logs>"
+    def cancel(self, handle: ReplicaHandle) -> None:
+        meta = self._handles.get(handle.rank)
+        if meta is None:
+            return
+        proc: mp.Process = meta["proc"]
+        if proc.is_alive():
+            proc.terminate()
+            proc.join(timeout=5)
+            if proc.is_alive():
+                proc.kill()
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        deadline = time.time() + (timeout if timeout is not None else 3600)
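+        # Wait for all processes to finish. Results are drained only after
+        # join(), so a child with a large queued result can block in its
+        # queue feeder thread until the timeout; keep result payloads small.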
+        for h in handles:
+            meta = self._handles.get(h.rank)
+            if meta is None:
+                continue
+            proc: mp.Process = meta["proc"]
+            remaining = max(0.0, deadline - time.time())
+            proc.join(timeout=remaining)
+            if proc.is_alive():
+                proc.terminate()
+                proc.join(timeout=5)
+        # Drain results
+        results_by_rank: dict[int, dict[str, Any]] = {}
+        for h in handles:
+            meta = self._handles.get(h.rank)
+            if meta is None:
+                results_by_rank[h.rank] = {
+                    "rank": h.rank, "status": "cancelled",
+                    "exit_code": None, "error": "no metadata", "result": None,
+                }
+                continue
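+            queue: mp.Queue = meta["queue"]
+            while not queue.empty():
+                try:
+                    r = queue.get_nowait()
+                    results_by_rank[r["rank"]] = r
+                except Exception:
+                    break
+            # An earlier poll() may already have drained this rank's result
+            # into the handle metadata; reuse it before falling back to the
+            # exit code, otherwise the result dict would be lost.
+            if h.rank not in results_by_rank and meta.get("result") is not None:
+                results_by_rank[h.rank] = meta["result"]
+            if h.rank not in results_by_rank:
+                proc = meta["proc"]
+                results_by_rank[h.rank] = {
+                    "rank": h.rank,
+                    "status": "succeeded" if proc.exitcode == 0 else "failed",
+                    "exit_code": proc.exitcode,
+                    "error": None if proc.exitcode == 0 else f"exit code {proc.exitcode}",
+                    "result": None,
+                }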
+        return [results_by_rank[h.rank] for h in handles]
+__all__ = ["LocalProcessExecutor", "ReplicaHandle", "ServerlessExecutor"]
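The order of checks that `poll` applies (process still alive, then a reported result, then the exit code) can be sketched as a pure function. This is a minimal illustration rather than the executor's code; `resolve_status` and its parameters are hypothetical names introduced here.

```python
from __future__ import annotations

from typing import Any, Optional


def resolve_status(
    is_alive: bool,
    exitcode: Optional[int],
    result: Optional[dict[str, Any]],
) -> str:
    """Hypothetical helper mirroring the order of checks in poll()."""
    if is_alive:
        return "running"
    if result is not None:
        # A result dict reported by the replica wins over the exit code.
        return result["status"]
    # No result was reported: fall back to the process exit code.
    return "succeeded" if exitcode == 0 else "failed"


print(resolve_status(True, None, None))
print(resolve_status(False, 0, {"status": "failed"}))
print(resolve_status(False, -15, None))
```

The second call shows why the reported result is checked first: a replica can catch an exception, report `"failed"`, and still exit with code 0.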

composer_replication/diloco/serverless/hf_jobs.py ADDED

	@@ -0,0 +1,98 @@

+"""HuggingFace Jobs executor — skeleton for v0.
+Per ADR-005, HF Jobs is one of two v0 target executors. This file is a
+STUB. The full integration uses `huggingface_hub.run_job` (part of the
+Jobs API, added in huggingface_hub 0.34), which spins up a containerized
+job backed by HF's compute pool.
+Pricing reference (2026-05-26): A100 ≈ $4.18/hr, H100 ≈ $9.50/hr. Cold
+start ≈ 60s. NO inter-job networking — must use object-store rendezvous.
+Status: SKELETON. Real implementation pending v0 polish wave.
+"""
+from __future__ import annotations
+from typing import Any, Callable, Mapping
+from composer_replication.diloco.serverless.executor import (
+    ReplicaHandle,
+    ServerlessExecutor,
+)
+class HFJobsExecutor(ServerlessExecutor):
+    """Run replicas as HuggingFace Jobs in parallel.
+    Reference implementation pattern:
+        from huggingface_hub import run_job
+        jobs = []
+        for rank in range(N):
+            job = run_job(
+                image="...",  # container with composer_replication installed
+                command=[
+                    "python", "-m",
+                    "composer_replication.diloco.serverless.replica_entrypoint",
+                    # The entrypoint reads its rank from REPLICA_RANK (set in
+                    # `env` below); its CLI has no --rank flag.
+                    "--rendezvous", "hf://datasets/myuser/run42/",
+                    "--world-size", str(N),
+                    "--trainer-module", "my_project.trainer",
+                ],
+                env={"REPLICA_RANK": str(rank), "WORLD_SIZE": str(N)},
+                flavor="a100-large",  # run_job selects hardware via `flavor`
+            )
+            jobs.append(job)
+        return [ReplicaHandle(rank=i, backend_name="hf_jobs",
+                              metadata={"job_id": jobs[i].id})
+                for i in range(N)]
+    Object-store rendezvous works naturally with the HF Datasets-as-storage
+    pattern — `hf://datasets/{user}/{run_id}/` is fsspec-compatible via
+    `huggingface_hub`'s fsspec integration.
+    Status: SKELETON.
+    """
+    backend_name = "hf_jobs"
+    supports_inter_replica_network = False
+    def __init__(self) -> None:
+        try:
+            from huggingface_hub import HfApi  # noqa: F401
+        except ImportError as e:
+            raise RuntimeError(
+                "HFJobsExecutor requires huggingface_hub. Got: " + repr(e)
+            ) from e
+        # Real implementation: instantiate HfApi, validate token, etc.
+        raise NotImplementedError(
+            "HFJobsExecutor is a v0 skeleton; full implementation pending. "
+            "Use LocalProcessExecutor for testing."
+        )
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = "a100",
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        raise NotImplementedError
+    def poll(self, handle: ReplicaHandle) -> str:
+        raise NotImplementedError
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        raise NotImplementedError
+    def cancel(self, handle: ReplicaHandle) -> None:
+        raise NotImplementedError
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        raise NotImplementedError
+__all__ = ["HFJobsExecutor"]
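A job command has to match the flags that `replica_entrypoint`'s `__main__` block parses (`--rendezvous`, `--world-size`, `--trainer-module`, `--trainer-fn`, `--trainer-kwargs-json`), with the rank passed through the `REPLICA_RANK` environment variable. The sketch below builds such a command and environment for one replica; `build_replica_job` is a hypothetical helper, not part of the package or of `huggingface_hub`.

```python
from __future__ import annotations

import json
from typing import Any


def build_replica_job(
    rank: int,
    world_size: int,
    rendezvous_uri: str,
    trainer_module: str,
    trainer_kwargs: dict[str, Any] | None = None,
) -> tuple[list[str], dict[str, str]]:
    """Hypothetical helper: return (command, env) for one replica job."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank={rank} not in [0, {world_size})")
    command = [
        "python", "-m",
        "composer_replication.diloco.serverless.replica_entrypoint",
        "--rendezvous", rendezvous_uri,
        "--world-size", str(world_size),
        "--trainer-module", trainer_module,
        "--trainer-kwargs-json", json.dumps(trainer_kwargs or {}),
    ]
    # The entrypoint reads its rank from the environment, not from a flag.
    env = {"REPLICA_RANK": str(rank)}
    return command, env


command, env = build_replica_job(
    rank=1,
    world_size=4,
    rendezvous_uri="hf://datasets/myuser/run42/",
    trainer_module="my_project.trainer",
    trainer_kwargs={"model_name": "Qwen/Qwen2.5-0.5B"},
)
print(command)
print(env)
```

Keeping this construction in one place means every backend (local, Modal, HF Jobs) hands the entrypoint the same arguments.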

composer_replication/diloco/serverless/modal.py ADDED

	@@ -0,0 +1,102 @@

+"""Modal executor — skeleton for v0.
+This file is a STUB. The full Modal integration requires the `modal`
+client library installed (`pip install modal`) and a configured Modal
+account (`~/.modal.toml`). The user's environment has both, but the
+test suite must run without them, so we keep this file import-safe.
+Real implementation lives in v0 polish; the docstring below is the
+contract.
+"""
+from __future__ import annotations
+from typing import Any, Callable, Mapping
+from composer_replication.diloco.serverless.executor import (
+    ReplicaHandle,
+    ServerlessExecutor,
+)
+class ModalExecutor(ServerlessExecutor):
+    """Run replicas as Modal Functions in parallel.
+    Reference implementation pattern (per ADR-005):
+        @app.function(gpu="A100-40GB", timeout=3600)
+        def run_replica(rank: int, world_size: int, rendezvous_uri: str, **kwargs):
+            import os
+            os.environ["REPLICA_RANK"] = str(rank)
+            from composer_replication.diloco.serverless import (
+                MockManager, ObjectStoreAllReduce,
+            )
+            store = ObjectStoreAllReduce(rendezvous_uri,
+                                         rank=rank, world_size=world_size)
+            manager = MockManager(store)
+            # ... run the trainer with this manager ...
+    Then `launch_replicas` does:
+        calls = [run_replica.spawn(rank=i, ...) for i in range(N)]
+        return [ReplicaHandle(rank=i, backend_name="modal",
+                              metadata={"call_id": calls[i].object_id})
+                for i in range(N)]
+    Pricing reference (2026-05-26): A100-40GB ≈ $1.95/hr, H100 ≈ $5.50/hr.
+    Cold start ≈ 30s. Inter-job networking via cluster mode (opt-in,
+    not used by default).
+    Status: SKELETON. Real implementation pending v0 polish wave.
+    """
+    backend_name = "modal"
+    supports_inter_replica_network = False  # default; cluster mode = True
+    def __init__(self, *, app_name: str = "composer-replication-diloco") -> None:
+        try:
+            import modal  # noqa: F401
+        except ImportError as e:
+            raise RuntimeError(
+                "ModalExecutor requires the modal client. Install with "
+                "`pip install modal` and configure with `modal token new`. "
+                "Got: " + repr(e)
+            ) from e
+        self.app_name = app_name
+        # Real implementation: build a `modal.App` and register `run_replica`
+        # here so that subsequent `launch_replicas` can `.spawn()` it.
+        raise NotImplementedError(
+            "ModalExecutor is a v0 skeleton; full implementation pending. "
+            "Use LocalProcessExecutor for testing."
+        )
+    # __init__ always raises in this skeleton, so the class can never be
+    # instantiated and the methods below are unreachable. Their signatures
+    # are sketched here for documentation only:
+    def launch_replicas(
+        self,
+        n_replicas: int,
+        entrypoint: str | Callable[..., Any],
+        entrypoint_args: Mapping[str, Any],
+        *,
+        gpu: str | None = "A100-40GB",
+        timeout: int = 3600,
+    ) -> list[ReplicaHandle]:
+        raise NotImplementedError
+    def poll(self, handle: ReplicaHandle) -> str:
+        raise NotImplementedError
+    def stream_logs(self, handle: ReplicaHandle, *, n_lines: int = 200) -> str:
+        raise NotImplementedError
+    def cancel(self, handle: ReplicaHandle) -> None:
+        raise NotImplementedError
+    def collect(
+        self,
+        handles: list[ReplicaHandle],
+        *,
+        timeout: int | None = None,
+    ) -> list[dict[str, Any]]:
+        raise NotImplementedError
+__all__ = ["ModalExecutor"]
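The import-safety requirement described in the module docstring (the module must import cleanly even when the optional client is missing) can be sketched with a small lazy-import guard. `require_optional` is a hypothetical helper shown for illustration; it assumes a top-level module name and uses only the standard library.

```python
import importlib
import importlib.util
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Hypothetical helper: import an optional top-level dependency on demand.

    Nothing is imported at module load time, so the file that defines an
    executor stays importable when the dependency is absent.
    """
    if importlib.util.find_spec(module_name) is None:
        raise RuntimeError(
            f"{module_name!r} is required for this executor. {install_hint}"
        )
    return importlib.import_module(module_name)


# A standard-library module stands in for an optional dependency here.
json_module = require_optional("json", "It ships with Python.")
print(json_module.dumps({"ok": True}))
```

An executor would call such a guard inside `__init__`, which is the point at which a missing dependency should surface as an error.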

composer_replication/diloco/serverless/replica_entrypoint.py ADDED

	@@ -0,0 +1,109 @@

+"""Replica entrypoint — what each serverless replica runs.
+This is the script invoked by `LocalProcessExecutor`, `ModalExecutor`,
+`HFJobsExecutor`, etc. It learns its rank from the `REPLICA_RANK` env
+var, sets up `ObjectStoreAllReduce` against the shared rendezvous URI,
+wraps it in a `MockManager`, and hands it off to the user's training
+function.
+Usage from an executor:
+    >>> executor.launch_replicas(
+    ...     n_replicas=4,
+    ...     entrypoint="composer_replication.diloco.serverless.replica_entrypoint",
+    ...     entrypoint_args={
+    ...         "rendezvous_uri": "/tmp/run42/",
+    ...         "world_size": 4,
+    ...         "trainer_module": "my_project.trainer",
+    ...         "trainer_fn": "train",
+    ...         "trainer_kwargs": {"model_name": "Qwen/Qwen2.5-0.5B"},
+    ...     },
+    ... )
+The entrypoint expects:
+- `REPLICA_RANK` env var set to the rank (0..world_size-1)
+- `rendezvous_uri`: fsspec URI for object-store rendezvous
+- `world_size`: total replicas
+- `trainer_module`, `trainer_fn`: importable path to the user's train fn
+- `trainer_kwargs`: dict passed to the user's train fn, plus three injected
+  kwargs: `manager` (the `MockManager`), `rank`, and `world_size`
+"""
+from __future__ import annotations
+import importlib
+import os
+from typing import Any
+def main(
+    rendezvous_uri: str,
+    world_size: int,
+    trainer_module: str,
+    trainer_fn: str = "train",
+    trainer_kwargs: dict[str, Any] | None = None,
+) -> Any:
+    """Entrypoint executed inside each replica.
+    Args:
+        rendezvous_uri: fsspec URI (or local path) for the rendezvous
+        world_size: total replicas
+        trainer_module: importable Python module containing the user's
+            train function
+        trainer_fn: name of the function to call (default "train")
+        trainer_kwargs: kwargs passed to the train function; `manager`,
+            `rank`, and `world_size` are injected and override any
+            same-named keys.
+    Returns:
+        Whatever the train function returns.
+    """
+    from composer_replication.diloco.serverless.allreduce import (
+        MockManager,
+        ObjectStoreAllReduce,
+    )
+    rank_str = os.environ.get("REPLICA_RANK")
+    if rank_str is None:
+        raise RuntimeError(
+            "REPLICA_RANK env var not set. The serverless executor "
+            "should set this for each replica."
+        )
+    rank = int(rank_str)
+    if not (0 <= rank < world_size):
+        raise ValueError(f"REPLICA_RANK={rank} not in [0, {world_size})")
+    store = ObjectStoreAllReduce(
+        uri=rendezvous_uri,
+        rank=rank,
+        world_size=world_size,
+    )
+    manager = MockManager(store)
+    mod = importlib.import_module(trainer_module)
+    fn = getattr(mod, trainer_fn)
+    kwargs = dict(trainer_kwargs or {})
+    kwargs["manager"] = manager  # injected
+    kwargs["rank"] = rank
+    kwargs["world_size"] = world_size
+    return fn(**kwargs)
+if __name__ == "__main__":
+    import argparse
+    import json
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--rendezvous", required=True)
+    parser.add_argument("--world-size", type=int, required=True)
+    parser.add_argument("--trainer-module", required=True)
+    parser.add_argument("--trainer-fn", default="train")
+    parser.add_argument("--trainer-kwargs-json", default="{}")
+    args = parser.parse_args()
+    main(
+        rendezvous_uri=args.rendezvous,
+        world_size=args.world_size,
+        trainer_module=args.trainer_module,
+        trainer_fn=args.trainer_fn,
+        trainer_kwargs=json.loads(args.trainer_kwargs_json),
+    )
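A train function compatible with this entrypoint has to accept the injected `manager`, `rank`, and `world_size` keyword arguments alongside its own. The sketch below shows one possible shape. `StubManager` is a hypothetical single-replica stand-in that works on plain lists of floats so the example runs without the package; the real `MockManager` operates on tensors.

```python
from __future__ import annotations

from typing import Any


class StubManager:
    """Hypothetical stand-in for MockManager: one replica, plain floats."""

    num_participants = 1

    def allreduce(self, values: list[float]) -> list[float]:
        # The mean over a single participant is the identity.
        return list(values)

    def should_commit(self) -> bool:
        return True


def train(
    *,
    manager: Any,
    rank: int,
    world_size: int,
    model_name: str = "toy",
    outer_steps: int = 3,
) -> dict[str, Any]:
    """Toy trainer: each outer step nudges the parameters, then averages them."""
    params = [0.0, 0.0]
    for _ in range(outer_steps):
        local = [p + 1.0 for p in params]  # stand-in for local inner steps
        averaged = manager.allreduce(local)
        if manager.should_commit():
            params = averaged
    return {"rank": rank, "world_size": world_size,
            "model_name": model_name, "params": params}


result = train(manager=StubManager(), rank=0, world_size=1,
               model_name="Qwen/Qwen2.5-0.5B")
print(result)
```

Declaring the injected names as keyword-only arguments makes a missing or misspelled injection fail immediately with a `TypeError`.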

composer_replication/diloco/serverless/tests/__init__.py ADDED

File without changes

composer_replication/diloco/serverless/tests/test_serverless_local.py ADDED

	@@ -0,0 +1,239 @@

+"""Verifies that the serverless DiLoCo allreduce works correctly across local
+multiprocessing replicas using a local-filesystem (`file://`) rendezvous.
+This is the core multi-process test for the serverless layer. It exercises
+the real allreduce barrier (with concurrent processes), not just the
+single-process API.
+"""
+from __future__ import annotations
+import os
+import tempfile
+import pytest
+import torch
+from composer_replication.diloco.serverless import (
+    LocalProcessExecutor,
+    ObjectStoreAllReduce,
+)
+# ---------------------------------------------------------------------
+# Single-process tests of ObjectStoreAllReduce primitives
+# (don't need executor, just the file:// path + local manual orchestration)
+# ---------------------------------------------------------------------
+def test_object_store_allreduce_init_validates_rank():
+    with tempfile.TemporaryDirectory() as td:
+        with pytest.raises(ValueError, match="not in"):
+            ObjectStoreAllReduce(td, rank=5, world_size=2)
+def test_object_store_allreduce_local_paths_create_dir():
+    """Local backend should mkdir on init."""
+    with tempfile.TemporaryDirectory() as td:
+        new_path = os.path.join(td, "subdir", "subsubdir")
+        store = ObjectStoreAllReduce(new_path, rank=0, world_size=1)
+        assert os.path.isdir(new_path)
+        assert store.world_size == 1
+def test_object_store_allreduce_world_size_1_passthrough():
+    """With world_size=1 it just averages the tensor with itself."""
+    with tempfile.TemporaryDirectory() as td:
+        store = ObjectStoreAllReduce(td, rank=0, world_size=1, timeout_s=10.0)
+        t = torch.tensor([1.0, 2.0, 3.0])
+        result = store.allreduce(t.clone())
+        torch.testing.assert_close(result, t, atol=1e-6, rtol=1e-6)
+def test_object_store_allreduce_round_id_increments():
+    with tempfile.TemporaryDirectory() as td:
+        store = ObjectStoreAllReduce(td, rank=0, world_size=1, timeout_s=10.0)
+        t = torch.zeros(3)
+        assert store.round_id == 0
+        store.allreduce(t.clone())
+        assert store.round_id == 1
+        store.allreduce(t.clone())
+        assert store.round_id == 2
+# ---------------------------------------------------------------------
+# Multi-process tests (the real verification — local executor + spawn)
+# ---------------------------------------------------------------------
+def _replica_compute_and_sync(
+    rendezvous_uri: str,
+    world_size: int,
+    rank_value: float,
+) -> dict:
+    """Top-level function — must be importable for multiprocessing 'spawn'.
+    Each replica creates a tensor whose value is `rank_value * (rank+1)` and
+    runs allreduce. The expected result is the mean of all replicas' tensors.
+    """
+    rank = int(os.environ["REPLICA_RANK"])
+    store = ObjectStoreAllReduce(
+        rendezvous_uri, rank=rank, world_size=world_size, timeout_s=120.0,
+    )
+    # tensor that depends on rank
+    t = torch.full((4,), float(rank_value * (rank + 1)))
+    pre = t.clone()
+    averaged = store.allreduce(t)
+    return {
+        "rank": rank,
+        "pre": pre.tolist(),
+        "post": averaged.tolist(),
+        "world_size": world_size,
+    }
+@pytest.mark.parametrize("n_replicas", [2, 3])
+def test_local_executor_runs_allreduce_across_replicas(n_replicas):
+    """End-to-end: 2-3 replica processes each call allreduce; result is the mean."""
+    with tempfile.TemporaryDirectory() as td:
+        rendezvous = os.path.join(td, "run")
+        executor = LocalProcessExecutor()
+        handles = executor.launch_replicas(
+            n_replicas=n_replicas,
+            entrypoint=f"{__name__}._replica_compute_and_sync",
+            entrypoint_args={
+                "rendezvous_uri": rendezvous,
+                "world_size": n_replicas,
+                "rank_value": 10.0,
+                "rank_env": "REPLICA_RANK",
+            },
+            timeout=180,
+        )
+        assert len(handles) == n_replicas
+        for i, h in enumerate(handles):
+            assert h.rank == i
+            assert h.backend_name == "local_process"
+        results = executor.collect(handles, timeout=180)
+        assert len(results) == n_replicas
+        # Verify all succeeded
+        for r in results:
+            assert r["status"] == "succeeded", \
+                f"rank {r['rank']} failed: {r.get('error')}"
+        # Each replica created tensor full(rank_value * (rank+1)).
+        # Expected mean = rank_value * (1+2+...+N) / N
+        N = n_replicas
+        expected_mean = 10.0 * (N * (N + 1) / 2) / N
+        for r in results:
+            post = r["result"]["post"]
+            for v in post:
+                assert abs(v - expected_mean) < 1e-4, \
+                    f"rank {r['rank']}: expected mean {expected_mean}, got {v}"
+def _replica_two_round_sync(
+    rendezvous_uri: str,
+    world_size: int,
+) -> dict:
+    """Each replica does TWO consecutive allreduce calls; checks round_id increments."""
+    rank = int(os.environ["REPLICA_RANK"])
+    store = ObjectStoreAllReduce(
+        rendezvous_uri, rank=rank, world_size=world_size, timeout_s=120.0,
+    )
+    t1 = torch.full((2,), float(rank))
+    avg1 = store.allreduce(t1).clone()
+    t2 = torch.full((2,), float(rank * 100))
+    avg2 = store.allreduce(t2).clone()
+    return {
+        "rank": rank,
+        "round_after_2_calls": store.round_id,
+        "avg1": avg1.tolist(),
+        "avg2": avg2.tolist(),
+    }
+def test_local_executor_handles_multiple_rounds():
+    """Two consecutive rounds each give the right mean; round counter advances."""
+    n_replicas = 3
+    with tempfile.TemporaryDirectory() as td:
+        rendezvous = os.path.join(td, "run-2round")
+        executor = LocalProcessExecutor()
+        handles = executor.launch_replicas(
+            n_replicas=n_replicas,
+            entrypoint=f"{__name__}._replica_two_round_sync",
+            entrypoint_args={
+                "rendezvous_uri": rendezvous,
+                "world_size": n_replicas,
+            },
+            timeout=180,
+        )
+        results = executor.collect(handles, timeout=180)
+        for r in results:
+            assert r["status"] == "succeeded", r.get("error")
+            assert r["result"]["round_after_2_calls"] == 2
+            # mean of 0,1,2 = 1.0
+            assert all(abs(v - 1.0) < 1e-4 for v in r["result"]["avg1"])
+            # mean of 0,100,200 = 100.0
+            assert all(abs(v - 100.0) < 1e-4 for v in r["result"]["avg2"])
+def _replica_that_raises(rendezvous_uri: str, world_size: int) -> dict:
+    """Simulates a replica that crashes mid-run."""
+    rank = int(os.environ["REPLICA_RANK"])
+    if rank == 1:
+        raise RuntimeError(f"Simulated crash on rank {rank}")
+    return {"rank": rank, "ok": True}
+def test_local_executor_reports_failed_replicas():
+    """When a replica crashes, collect() reports it as failed without hanging
+    (other ranks complete; the failed one should be reflected in the result)."""
+    n_replicas = 2
+    with tempfile.TemporaryDirectory() as td:
+        rendezvous = os.path.join(td, "run-failure")
+        executor = LocalProcessExecutor()
+        handles = executor.launch_replicas(
+            n_replicas=n_replicas,
+            entrypoint=f"{__name__}._replica_that_raises",
+            entrypoint_args={
+                "rendezvous_uri": rendezvous,
+                "world_size": n_replicas,
+            },
+            timeout=30,
+        )
+        results = executor.collect(handles, timeout=30)
+        statuses = {r["rank"]: r["status"] for r in results}
+        assert statuses[0] == "succeeded"
+        assert statuses[1] == "failed"
+        # Failure log should mention the simulated crash
+        failure_log = next(r for r in results if r["rank"] == 1).get("error") or ""
+        assert "Simulated crash" in failure_log
+# ---------------------------------------------------------------------
+# Sanity: MockManager is shape-compatible with torchft Manager surface
+# ---------------------------------------------------------------------
+def test_mock_manager_shape_compat():
+    from composer_replication.diloco.serverless import MockManager
+    with tempfile.TemporaryDirectory() as td:
+        store = ObjectStoreAllReduce(td, rank=0, world_size=1, timeout_s=10.0)
+        mgr = MockManager(store)
+        # torchft.Manager surface
+        assert hasattr(mgr, "allreduce")
+        assert hasattr(mgr, "should_commit")
+        assert hasattr(mgr, "start_quorum")
+        assert hasattr(mgr, "wait_quorum")
+        assert mgr.num_participants == 1
+        assert mgr.rank == 0
+        assert mgr.should_commit() is True
+        # Single-replica allreduce is a passthrough
+        t = torch.tensor([1.0, 2.0])
+        out = mgr.allreduce(t.clone())
+        torch.testing.assert_close(out, t, atol=1e-6, rtol=1e-6)
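The expected values asserted in the multi-process tests follow from simple arithmetic: each replica contributes `rank_value * (rank + 1)`, so the mean over N replicas is `rank_value * (N + 1) / 2`. A short sketch of that calculation, with hypothetical helper names:

```python
def expected_mean(rank_value: float, n_replicas: int) -> float:
    """Mean of rank_value * (rank + 1) over ranks 0..n_replicas-1."""
    total = sum(rank_value * (rank + 1) for rank in range(n_replicas))
    return total / n_replicas


def expected_mean_closed_form(rank_value: float, n_replicas: int) -> float:
    """Same quantity via 1 + 2 + ... + N = N * (N + 1) / 2."""
    return rank_value * (n_replicas + 1) / 2


for n in (2, 3):
    print(n, expected_mean(10.0, n), expected_mean_closed_form(10.0, n))

# Two-round test: round 1 averages the ranks, round 2 averages rank * 100.
print(sum([0, 1, 2]) / 3, sum([0, 100, 200]) / 3)
```

For `rank_value=10.0` this gives 15.0 with two replicas and 20.0 with three, which are the values the parametrized test compares against.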

composer_replication/distillation/__init__.py ADDED

	@@ -0,0 +1,36 @@

+"""composer_replication.distillation — pluggable self-distillation losses.
+Per ADR-007, three losses additive to the framework's existing
+SDPO/OPSD (`generalized_jsd_loss`):
+- SimPO: reference-free DPO replacement (channel 3 alternative)
+- TAID: annealed teacher interpolation (wraps generalized_jsd_loss for channel 2)
+- Entropy-Aware OPD: token-wise gated forward/reverse KL (alternative
+  channel-2 wrapper, per ICLR 2026 Spotlight)
+All three are pure PyTorch — no external deps — so they ship in the core
+package without optional extras.
+Usage in `compose_loss`:
+    >>> from composer_replication import compose_loss
+    >>> components = compose_loss(
+    ...     model, batch,
+    ...     dpo_variant="simpo",          # channel 3: DPO -> SimPO
+    ...     sdpo_wrapper="taid",          # channel 2: SDPO -> TAID-SDPO
+    ...     taid_schedule_step=1500, taid_total_steps=10_000,
+    ... )
+Defaults are unchanged (pure DPO + pure SDPO).
+"""
+from __future__ import annotations
+from composer_replication.distillation.simpo import simpo_loss
+from composer_replication.distillation.taid import taid_loss
+from composer_replication.distillation.entropy_aware_opd import entropy_aware_opd_loss
+__all__ = [
+    "simpo_loss",
+    "taid_loss",
+    "entropy_aware_opd_loss",
+]

composer_replication/distillation/entropy_aware_opd.py ADDED

	@@ -0,0 +1,126 @@

+"""Entropy-Aware OPD — token-wise gated forward/reverse KL.
+Paper: ICLR 2026 Spotlight "Entropy-Aware On-Policy Distillation"
+       (OpenReview WSRQ37tzk1, code release pending as of 2026-05-26)
+Standard reverse-KL distillation (which SDPO/OPSD belongs to) has a known
+mode-seeking failure: when the teacher distribution has high entropy at
+some token positions (e.g. open-ended generation), reverse KL collapses
+the student onto a single mode, throwing away the teacher's diversity.
+Forward KL is mode-covering and would handle these positions correctly,
+but is mode-flattening in the long tail.
+Entropy-Aware OPD computes the per-token entropy of the teacher
+distribution and gates between forward and reverse KL on a per-token
+basis: high-entropy tokens use forward KL (preserve diversity),
+low-entropy tokens use reverse KL (sharpen toward the teacher's mode).
+    L = Σ_t  w(t) · KL_fwd(teacher || student)_t
+            + (1 - w(t)) · KL_rev(student || teacher)_t
+Where w(t) = clamp(H_teacher(t) / H_max, 0, 1): high entropy → forward
+KL weight near 1, low entropy → reverse KL weight near 1.
+This is a clean-room implementation from the paper's pseudocode pending
+the official code drop. License question for the official code is open;
+this implementation is MIT-compatible by construction.
+"""
+from __future__ import annotations
+import math
+import torch
+import torch.nn.functional as F
+def teacher_entropy(teacher_logits: torch.Tensor) -> torch.Tensor:
+    """Per-token entropy of the teacher distribution.
+    Returns:
+        (B, T) entropy in nats.
+    """
+    log_p = F.log_softmax(teacher_logits, dim=-1)
+    p = log_p.exp()
+    # Entropy = -Σ p log p
+    return -(p * log_p).sum(dim=-1)
+def entropy_aware_opd_loss(
+    student_logits: torch.Tensor,
+    teacher_logits: torch.Tensor,
+    *,
+    labels: torch.Tensor | None = None,
+    h_max: float | None = None,
+    temperature: float = 1.0,
+    reduction: str = "batchmean",
+) -> torch.Tensor:
+    """Entropy-aware mixture of forward and reverse KL.
+    Args:
+        student_logits: (B, T, V) student logits with grad
+        teacher_logits: (B, T, V) teacher logits (no grad)
+        labels: (B, T) optional 0/1 mask — only contribute loss on
+            labels==1 positions. None means contribute everywhere.
+        h_max: maximum-entropy normalizer. Defaults to log(V) (uniform-
+            distribution entropy = the max possible entropy at vocab size V).
+        temperature: temperature applied to BOTH student and teacher logits
+            before softmax
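+        reduction: "batchmean" | "sum" | "mean" | "none". "batchmean"
+            divides the summed loss by B; "mean" averages over all B·T
+            positions, including any masked out by `labels`.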
+    Returns:
+        Scalar loss (or unreduced if `reduction="none"`).
+    Reference: ICLR 2026 Spotlight WSRQ37tzk1 §3 (clean-room implementation).
+    """
+    if student_logits.shape != teacher_logits.shape:
+        raise ValueError(
+            f"shape mismatch: student={student_logits.shape}, "
+            f"teacher={teacher_logits.shape}"
+        )
+    V = student_logits.size(-1)
+    if h_max is None:
+        h_max = math.log(V)
+    s_log = F.log_softmax(student_logits / temperature, dim=-1)
+    t_log = F.log_softmax(teacher_logits / temperature, dim=-1)
+    s_p = s_log.exp()
+    t_p = t_log.exp()
+    # Forward KL (teacher || student): mode-covering
+    # KL(t || s) = Σ t · (log t - log s)
+    kl_fwd = (t_p * (t_log - s_log)).sum(dim=-1)
+    # Reverse KL (student || teacher): mode-seeking (this is what SDPO uses)
+    # KL(s || t) = Σ s · (log s - log t)
+    kl_rev = (s_p * (s_log - t_log)).sum(dim=-1)
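+    # Per-token teacher entropy → gate weight. The gate is computed from the
+    # raw (temperature-1) teacher logits; only the KL terms use `temperature`.
+    H_t = teacher_entropy(teacher_logits)  # (B, T) in nats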
+    w = (H_t / h_max).clamp(0.0, 1.0)  # (B, T) in [0, 1]
+    # Mix: high entropy → forward KL; low entropy → reverse KL
+    per_token_loss = w * kl_fwd + (1 - w) * kl_rev  # (B, T)
+    if labels is not None:
+        if labels.shape != per_token_loss.shape:
+            raise ValueError(
+                f"labels shape {labels.shape} != per-token-loss shape "
+                f"{per_token_loss.shape}"
+            )
+        per_token_loss = per_token_loss * labels.float()
+    if reduction == "none":
+        return per_token_loss
+    if reduction == "sum":
+        return per_token_loss.sum()
+    if reduction == "mean":
+        return per_token_loss.mean()
+    if reduction == "batchmean":
+        return per_token_loss.sum() / max(1, per_token_loss.shape[0])
+    raise ValueError(f"unknown reduction: {reduction!r}")
+__all__ = ["teacher_entropy", "entropy_aware_opd_loss"]
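The gating rule is easier to see on a single token with a tiny vocabulary. The sketch below reproduces the per-token formula `w * KL(teacher || student) + (1 - w) * KL(student || teacher)` in plain Python, taking probabilities directly instead of logits (that is, temperature 1). The function names are hypothetical and the code is an illustration, not the module's implementation.

```python
import math


def entropy(p: list[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def gated_token_loss(student: list[float], teacher: list[float]) -> tuple[float, float]:
    """Return (gate weight w, per-token loss) for one position."""
    h_max = math.log(len(teacher))  # entropy of the uniform distribution
    w = min(1.0, max(0.0, entropy(teacher) / h_max))
    loss = w * kl(teacher, student) + (1.0 - w) * kl(student, teacher)
    return w, loss


student = [0.7, 0.1, 0.1, 0.1]
uniform_teacher = [0.25, 0.25, 0.25, 0.25]
peaked_teacher = [0.97, 0.01, 0.01, 0.01]

print(gated_token_loss(student, uniform_teacher))  # w close to 1: forward KL dominates
print(gated_token_loss(student, peaked_teacher))   # w close to 0: reverse KL dominates
```

With a uniform teacher the weight saturates at 1 and the loss reduces to the forward KL; with a sharply peaked teacher the weight is small and the reverse KL term carries most of the loss.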

composer_replication/distillation/simpo.py ADDED

	@@ -0,0 +1,83 @@

+"""SimPO loss — reference-free DPO replacement.
+Paper: "SimPO: Simple Preference Optimization with a Reference-Free Reward"
+       Meng et al., NeurIPS 2024 (arXiv:2405.14734)
+License: MIT (https://github.com/princeton-nlp/SimPO)
+Standard DPO requires log-probabilities under both the policy and a
+reference policy:
+    L_DPO = -log σ( β·[(logπ(c) - logπ_ref(c)) - (logπ(r) - logπ_ref(r))] )
+SimPO drops the reference-policy term, replaces it with a target margin γ,
+and uses average sequence log-probability instead of sum. This removes the
+reference-model VRAM cost (which is a meaningful fraction of total
+training-time memory).
+    L_SimPO = -log σ( β·[avg_logπ(c) - avg_logπ(r)] - γ )
+Where:
+- avg_logπ(c) = (1/|c|) · Σ_t logπ(c_t | c_<t, prompt)
+- β: scaling factor (paper default: 2.0)
+- γ: target margin (paper default: 1.0)
+Compose with the framework: replace channel-3 `_compute_trace_replay_loss`
+when `dpo_variant="simpo"` is passed to `compose_loss`. Inputs change:
+SimPO does NOT consume `dpo_chosen_ref_logprobs` / `dpo_rejected_ref_logprobs`
+(those become unused).
+"""
+from __future__ import annotations
+import torch
+import torch.nn.functional as F
+def simpo_loss(
+    chosen_avg_logprobs: torch.Tensor,
+    rejected_avg_logprobs: torch.Tensor,
+    *,
+    beta: float = 2.0,
+    gamma: float = 1.0,
+) -> torch.Tensor:
+    """SimPO loss — reference-free DPO with target margin.
+    Args:
+        chosen_avg_logprobs: (B,) average per-token log-prob of the chosen
+            response under the policy, i.e. the sum of response-token
+            log-probs divided by the response length (see
+            `avg_sequence_logprob`).
+        rejected_avg_logprobs: (B,) same for rejected.
+        beta: scaling factor (paper default 2.0)
+        gamma: target margin (paper default 1.0)
+    Returns:
+        Scalar loss; lower is better.
+    Reference: arXiv:2405.14734 Eq. (5).
+    """
+    if chosen_avg_logprobs.shape != rejected_avg_logprobs.shape:
+        raise ValueError(
+            f"chosen and rejected avg-logprob tensors must have the same shape, "
+            f"got chosen={chosen_avg_logprobs.shape}, "
+            f"rejected={rejected_avg_logprobs.shape}"
+        )
+    logits = beta * (chosen_avg_logprobs - rejected_avg_logprobs) - gamma
+    return -F.logsigmoid(logits).mean()
+def avg_sequence_logprob(
+    model_logprobs: torch.Tensor,
+    response_mask: torch.Tensor,
+) -> torch.Tensor:
+    """Helper: convert (B, T) per-token log-probs + (B, T) response mask into
+    (B,) per-sequence AVERAGE log-probability over response tokens.
+    SimPO uses the average (not sum) so that long sequences aren't
+    penalized for having many tokens. The mask should be 1 on response
+    tokens and 0 on prompt+padding.
+    """
+    mask = response_mask.float()  # accept bool/int masks as well as float
+    masked = model_logprobs * mask
+    n_tokens = mask.sum(dim=-1).clamp_min(1.0)  # avoid division by zero
+    return masked.sum(dim=-1) / n_tokens
+__all__ = ["simpo_loss", "avg_sequence_logprob"]
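A scalar worked example makes the SimPO formula concrete. The sketch below evaluates `-log σ(β · (avg_logπ(c) - avg_logπ(r)) - γ)` for a single pair in plain Python, using the identity `-log σ(x) = log(1 + exp(-x))`. `simpo_loss_scalar` is a hypothetical helper for illustration; the module's `simpo_loss` operates on tensors and averages over the batch.

```python
import math


def simpo_loss_scalar(
    chosen_avg_logprob: float,
    rejected_avg_logprob: float,
    beta: float = 2.0,
    gamma: float = 1.0,
) -> float:
    """SimPO loss for one (chosen, rejected) pair."""
    margin = beta * (chosen_avg_logprob - rejected_avg_logprob) - gamma
    # -log(sigmoid(margin)) == log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Chosen is better by 1 nat per token: margin = 2 * 1 - 1 = 1.
print(simpo_loss_scalar(-0.5, -1.5))
# No preference gap: margin = -gamma, so the target margin alone is penalized.
print(simpo_loss_scalar(-1.0, -1.0))
```

The second call shows the role of γ: even when both responses are equally likely, the loss stays above `log 2` until the chosen response leads by more than `γ / β` nats per token.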

composer_replication/distillation/taid.py ADDED

	@@ -0,0 +1,195 @@

+"""TAID loss — Temporally Adaptive Interpolated Distillation.
+Paper: "TAID: Temporally Adaptive Interpolated Distillation for Efficient
+        Knowledge Transfer in Language Models"
+       Sakana AI, arXiv:2501.16937
+License: Apache-2.0 (https://github.com/SakanaAI/TAID)
+Standard JSD/KL distillation on a large student-teacher capacity gap can
+suffer from mode collapse: the student converges to a degenerate point
+distribution that minimizes the KL by ignoring tail probabilities.
+TAID interpolates between an "identity" target (the student's own
+distribution at step 0) and the teacher's distribution, with the
+interpolation coefficient annealed from 0 → 1 over training:
+    P_target(t) = (1 - α(t)) · P_student_init + α(t) · P_teacher
+Where α(t) is a schedule (linear, cosine, or paper-default exp ramp).
+The student then learns against `P_target(t)` using the standard JSD/KL
+loss. As training progresses, the target shifts smoothly from "what you
+already are" toward "what the teacher knows," giving the student a
+smooth path through capacity-gap regions where naive distillation
+collapses.
+Compose with the framework: TAID *wraps* `generalized_jsd_loss`. The
+wrapper passes a blended target instead of the raw teacher target. When
+α reaches 1.0 the blended target equals the teacher's distribution and
+we recover pure SDPO (the standard JSD/OPSD path).
+"""
+from __future__ import annotations
+import math
+import torch
+import torch.nn.functional as F
+def taid_alpha_schedule(
+    step: int,
+    total_steps: int,
+    *,
+    schedule: str = "linear",
+    alpha_min: float = 0.0,
+    alpha_max: float = 1.0,
+    warmup_frac: float = 0.0,
+) -> float:
+    """Compute α(t) for the TAID schedule.
+    Args:
+        step: current training step (0-indexed)
+        total_steps: total training steps planned
+        schedule: "linear" | "cosine" | "exp"
+        alpha_min: starting α (default 0 = pure student-init target)
+        alpha_max: ending α (default 1 = pure teacher target)
+        warmup_frac: fraction of total_steps spent at alpha_min
+    Returns:
+        α value in [alpha_min, alpha_max]
+    Reference: arXiv:2501.16937 §3.2.
+    """
+    if total_steps <= 0:
+        raise ValueError(f"total_steps must be > 0, got {total_steps}")
+    if step < 0:
+        raise ValueError(f"step must be ≥ 0, got {step}")
+    warmup_steps = int(total_steps * warmup_frac)
+    if step < warmup_steps:
+        return alpha_min
+    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
+    progress = min(1.0, max(0.0, progress))
+    if schedule == "linear":
+        alpha = alpha_min + (alpha_max - alpha_min) * progress
+    elif schedule == "cosine":
+        # 0.5 * (1 - cos(π·t)) goes 0 → 1 as t goes 0 → 1
+        alpha = alpha_min + (alpha_max - alpha_min) * 0.5 * (1 - math.cos(math.pi * progress))
+    elif schedule == "exp":
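+        # Paper default: α(t) = α_min + (α_max - α_min) · (1 - exp(-5·t))
+        # Front-loads progress toward larger α. Note that 1 - exp(-5) ≈ 0.993,
+        # so this schedule stops just short of alpha_max at the final step.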
+        alpha = alpha_min + (alpha_max - alpha_min) * (1 - math.exp(-5 * progress))
+    else:
+        raise ValueError(f"unknown schedule: {schedule!r}")
+    return float(alpha)
+def taid_blended_logits(
+    student_init_logits: torch.Tensor,
+    teacher_logits: torch.Tensor,
+    alpha: float,
+) -> torch.Tensor:
+    """Blend the "student-at-init" and teacher logits in probability space.
+    Returns logits of `(1 - α)·P_student_init + α·P_teacher`.
+    Internally:
+        1. softmax both → P_student_init, P_teacher (in prob space)
+        2. linear interpolate
+        3. log → blended logits
+    Args:
+        student_init_logits: (B, T, V) student logits at training start
+            (frozen — keep a snapshot from step 0)
+        teacher_logits: (B, T, V) teacher logits (e.g., hint-conditioned
+            forward pass per SDPO)
+        alpha: interpolation coefficient in [0, 1]
+    Returns:
+        (B, T, V) logits whose softmax is the blended target distribution.
+    """
+    if not (0.0 <= alpha <= 1.0):
+        raise ValueError(f"alpha must be in [0, 1], got {alpha}")
+    if student_init_logits.shape != teacher_logits.shape:
+        raise ValueError(
+            f"shape mismatch: student_init={student_init_logits.shape}, "
+            f"teacher={teacher_logits.shape}"
+        )
+    # Mix in probability space, then log to get logits
+    p_student_init = F.softmax(student_init_logits, dim=-1)
+    p_teacher = F.softmax(teacher_logits, dim=-1)
+    p_blended = (1 - alpha) * p_student_init + alpha * p_teacher
+    # Clamp for numerical stability before log
+    p_blended = p_blended.clamp_min(1e-12)
+    return torch.log(p_blended)
+def taid_loss(
+    student_logits: torch.Tensor,
+    teacher_logits: torch.Tensor,
+    student_init_logits: torch.Tensor,
+    *,
+    schedule_step: int,
+    total_steps: int,
+    schedule: str = "linear",
+    alpha_min: float = 0.0,
+    alpha_max: float = 1.0,
+    jsd_beta: float = 0.5,
+    temperature: float = 1.0,
+    reduction: str = "batchmean",
+) -> torch.Tensor:
+    """TAID-wrapped generalized-JSD loss.
+    Wraps the framework's `generalized_jsd_loss` (= SDPO/OPSD) with the
+    TAID schedule. At α=0 the loss target is the student's own initial
+    distribution (essentially a regularizer); at α=1 it's the standard
+    JSD-against-teacher (SDPO).
+    Args:
+        student_logits: (B, T, V) current student logits with grad
+        teacher_logits: (B, T, V) teacher logits (no grad — same model
+            different context per SDPO, or different model per real
+            distillation)
+        student_init_logits: (B, T, V) student logits captured at step 0
+            of training. Caller must save this and pass it in.
+        schedule_step: current training step
+        total_steps: total planned training steps
+        schedule: "linear" | "cosine" | "exp" — see `taid_alpha_schedule`
+        alpha_min, alpha_max: schedule range (defaults 0, 1)
+        jsd_beta: β param of generalized_jsd_loss (0=fwd KL, 0.5=JSD,
+            1=rev KL)
+        temperature: temperature for both student and target
+        reduction: "batchmean" | "sum" | "mean" | "none"
+    Returns:
+        Scalar loss (or unreduced tensor if `reduction="none"`).
+    Reference: arXiv:2501.16937 Eq. (4) + §3.2.
+    """
+    # Lazy-import generalized_jsd_loss to avoid circular import
+    from composer_replication.opsd import generalized_jsd_loss
+    alpha = taid_alpha_schedule(
+        step=schedule_step,
+        total_steps=total_steps,
+        schedule=schedule,
+        alpha_min=alpha_min,
+        alpha_max=alpha_max,
+    )
+    blended_logits = taid_blended_logits(
+        student_init_logits=student_init_logits,
+        teacher_logits=teacher_logits,
+        alpha=alpha,
+    )
+    return generalized_jsd_loss(
+        student_logits=student_logits,
+        teacher_logits=blended_logits,
+        beta=jsd_beta,
+        temperature=temperature,
+        reduction=reduction,
+    )
+__all__ = ["taid_alpha_schedule", "taid_blended_logits", "taid_loss"]
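The schedule and the probability-space blend in `taid.py` can be sanity-checked with a few lines of plain Python. This is a sketch under stated assumptions: `alpha_at`, `softmax`, and `blended_target` are hypothetical stand-ins that fix `alpha_min=0`, `alpha_max=1`, and no warmup, and they operate on a single vocabulary vector instead of a `(B, T, V)` tensor.

```python
import math


def alpha_at(progress, schedule):
    """α for progress in [0, 1], assuming alpha_min=0 and alpha_max=1."""
    if schedule == "linear":
        return progress
    if schedule == "cosine":
        return 0.5 * (1 - math.cos(math.pi * progress))
    if schedule == "exp":
        return 1 - math.exp(-5 * progress)
    raise ValueError(f"unknown schedule: {schedule!r}")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blended_target(student_init_logits, teacher_logits, alpha):
    """(1 - α) · P_student_init + α · P_teacher, mixed in probability space."""
    p_student = softmax(student_init_logits)
    p_teacher = softmax(teacher_logits)
    return [(1 - alpha) * s + alpha * t for s, t in zip(p_student, p_teacher)]


for name in ("linear", "cosine", "exp"):
    print(name, [round(alpha_at(p, name), 3) for p in (0.0, 0.25, 0.5, 1.0)])

# Student peaks at index 0, teacher at index 3; α = 0.5 gives a bimodal target.
target = blended_target([10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 10.0], alpha=0.5)
print([round(p, 4) for p in target])
```

Because the mix is a convex combination of two probability vectors, the target stays normalized for any α in [0, 1]. That is why `taid_blended_logits` can take a plain `log` afterwards.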

composer_replication/distillation/tests/test_distillation_losses.py ADDED

	@@ -0,0 +1,236 @@

+"""Distillation-loss unit tests — SimPO + TAID + Entropy-Aware OPD."""
+from __future__ import annotations
+import math
+import pytest
+import torch
+import torch.nn.functional as F
+from composer_replication.distillation import (
+    entropy_aware_opd_loss,
+    simpo_loss,
+    taid_loss,
+)
+from composer_replication.distillation.simpo import avg_sequence_logprob
+from composer_replication.distillation.taid import (
+    taid_alpha_schedule,
+    taid_blended_logits,
+)
+from composer_replication.distillation.entropy_aware_opd import teacher_entropy
+# ---------------------------------------------------------------------
+# SimPO
+# ---------------------------------------------------------------------
+def test_simpo_loss_returns_scalar():
+    chosen = torch.tensor([0.5, 0.4, 0.3])
+    rejected = torch.tensor([0.1, 0.0, -0.2])
+    loss = simpo_loss(chosen, rejected, beta=2.0, gamma=1.0)
+    assert loss.dim() == 0
+    assert torch.isfinite(loss)
+def test_simpo_loss_lower_for_better_separation():
+    """Larger margin between chosen and rejected → lower loss."""
+    # Same setup, two batches with different separations
+    small_sep_loss = simpo_loss(
+        torch.tensor([0.1]), torch.tensor([0.05]),
+    )
+    large_sep_loss = simpo_loss(
+        torch.tensor([1.0]), torch.tensor([-1.0]),
+    )
+    assert large_sep_loss < small_sep_loss, (
+        f"large separation should give smaller loss; "
+        f"got small_sep={small_sep_loss}, large_sep={large_sep_loss}"
+    )
+def test_simpo_loss_differentiable():
+    chosen = torch.tensor([0.5], requires_grad=True)
+    rejected = torch.tensor([0.0], requires_grad=True)
+    loss = simpo_loss(chosen, rejected)
+    loss.backward()
+    assert chosen.grad is not None
+    assert rejected.grad is not None
+    assert torch.isfinite(chosen.grad).all()
+    assert torch.isfinite(rejected.grad).all()
+def test_simpo_loss_shape_mismatch_raises():
+    with pytest.raises(ValueError, match="same shape"):
+        simpo_loss(torch.zeros(3), torch.zeros(5))
+def test_avg_sequence_logprob():
+    """Helper averages over response tokens, ignoring prompt + padding."""
+    # B=2, T=4
+    logprobs = torch.tensor([
+        [-10.0, -10.0, -1.0, -2.0],   # response is last 2 tokens, avg=-1.5
+        [-1.0, -3.0, -1.0, -10.0],    # response is first 3 tokens, avg=-5/3
+    ])
+    mask = torch.tensor([
+        [0, 0, 1, 1],
+        [1, 1, 1, 0],
+    ])
+    avg = avg_sequence_logprob(logprobs, mask)
+    expected = torch.tensor([-1.5, -5.0 / 3.0])
+    torch.testing.assert_close(avg, expected, atol=1e-5, rtol=1e-5)
+# ---------------------------------------------------------------------
+# TAID
+# ---------------------------------------------------------------------
+def test_taid_alpha_schedule_endpoints():
+    """At step 0 → alpha_min; at step total → alpha_max (exp saturates at 1 - exp(-5))."""
+    assert taid_alpha_schedule(0, 100, schedule="linear") == 0.0
+    assert taid_alpha_schedule(100, 100, schedule="linear") == 1.0
+    assert taid_alpha_schedule(0, 100, schedule="cosine") == 0.0
+    assert taid_alpha_schedule(100, 100, schedule="cosine") == pytest.approx(1.0)
+    assert taid_alpha_schedule(0, 100, schedule="exp") == pytest.approx(0.0)
+    assert taid_alpha_schedule(100, 100, schedule="exp") == pytest.approx(1 - math.exp(-5))
+def test_taid_alpha_schedule_monotonic_linear():
+    prev = -1.0
+    for step in [0, 10, 25, 50, 75, 90, 100]:
+        a = taid_alpha_schedule(step, 100, schedule="linear")
+        assert a >= prev
+        prev = a
+def test_taid_alpha_schedule_warmup():
+    """During warmup_frac, alpha stays at alpha_min."""
+    a_warmup = taid_alpha_schedule(50, 1000, warmup_frac=0.1, schedule="linear")
+    # warmup_steps = 100, step 50 < 100 → still alpha_min
+    assert a_warmup == 0.0
+    a_post_warmup = taid_alpha_schedule(150, 1000, warmup_frac=0.1, schedule="linear")
+    # post-warmup, partial way through remaining 900 steps
+    assert a_post_warmup > 0.0
+    assert a_post_warmup < 1.0
+def test_taid_blended_logits_endpoints():
+    """alpha=0 → student_init target; alpha=1 → teacher target."""
+    # Use logits with strong peaks to make endpoint behavior obvious
+    student_init = torch.zeros(2, 3, 4)
+    student_init[0, 0, 0] = 10.0   # peaks at index 0
+    teacher = torch.zeros(2, 3, 4)
+    teacher[0, 0, 3] = 10.0        # peaks at index 3
+    blended_alpha0 = taid_blended_logits(student_init, teacher, alpha=0.0)
+    blended_alpha1 = taid_blended_logits(student_init, teacher, alpha=1.0)
+    blended_half = taid_blended_logits(student_init, teacher, alpha=0.5)
+    # alpha=0: argmax follows student_init
+    assert blended_alpha0[0, 0].argmax().item() == 0
+    # alpha=1: argmax follows teacher
+    assert blended_alpha1[0, 0].argmax().item() == 3
+    # alpha=0.5: bimodal; both 0 and 3 should be elevated
+    half_probs = F.softmax(blended_half[0, 0], dim=-1)
+    assert half_probs[0] > 0.4
+    assert half_probs[3] > 0.4
+def test_taid_loss_returns_scalar_and_differentiable():
+    B, T, V = 2, 4, 8
+    student_logits = torch.randn(B, T, V, requires_grad=True)
+    teacher_logits = torch.randn(B, T, V)
+    student_init = torch.randn(B, T, V)
+    loss = taid_loss(
+        student_logits, teacher_logits, student_init,
+        schedule_step=500, total_steps=1000,
+    )
+    assert loss.dim() == 0
+    assert torch.isfinite(loss)
+    loss.backward()
+    assert student_logits.grad is not None
+    assert torch.isfinite(student_logits.grad).all()
+def test_taid_loss_alpha_zero_ignores_teacher():
+    """At alpha=0, the loss must not depend on the teacher logits."""
+    B, T, V = 1, 2, 4
+    student_init = torch.randn(B, T, V)
+    s1 = torch.randn(B, T, V, requires_grad=True)
+    teacher_a = torch.zeros(B, T, V)
+    teacher_a[..., 0] = 10.0
+    teacher_b = torch.zeros(B, T, V)
+    teacher_b[..., 3] = 10.0
+    # At step 0 with alpha_min=alpha_max=0, alpha is forced to 0 → blended = student_init
+    loss_a = taid_loss(s1, teacher_a, student_init, schedule_step=0, total_steps=100,
+                       alpha_min=0.0, alpha_max=0.0)
+    loss_b = taid_loss(s1, teacher_b, student_init, schedule_step=0, total_steps=100,
+                       alpha_min=0.0, alpha_max=0.0)
+    # Different teachers should give the same loss when alpha is pinned to 0
+    assert abs(float(loss_a) - float(loss_b)) < 1e-4
+# ---------------------------------------------------------------------
+# Entropy-Aware OPD
+# ---------------------------------------------------------------------
+def test_teacher_entropy_one_hot_is_zero():
+    """Argmax-1 distribution has entropy 0."""
+    logits = torch.zeros(1, 1, 4)
+    logits[..., 0] = 100.0  # essentially one-hot
+    H = teacher_entropy(logits)
+    assert float(H[0, 0]) < 1e-3
+def test_teacher_entropy_uniform_is_log_v():
+    """Uniform distribution over V symbols has entropy = log(V)."""
+    logits = torch.zeros(1, 1, 5)
+    H = teacher_entropy(logits)
+    assert float(H[0, 0]) == pytest.approx(math.log(5), rel=1e-5)
+def test_entropy_aware_opd_returns_scalar_and_differentiable():
+    B, T, V = 2, 3, 8
+    student_logits = torch.randn(B, T, V, requires_grad=True)
+    teacher_logits = torch.randn(B, T, V)
+    loss = entropy_aware_opd_loss(student_logits, teacher_logits)
+    assert loss.dim() == 0
+    assert torch.isfinite(loss)
+    loss.backward()
+    assert student_logits.grad is not None
+    assert torch.isfinite(student_logits.grad).all()
+def test_entropy_aware_opd_with_label_mask():
+    """Label mask should zero out per-token loss on labels==0 positions."""
+    B, T, V = 1, 4, 6
+    student_logits = torch.randn(B, T, V, requires_grad=True)
+    teacher_logits = torch.randn(B, T, V)
+    full_loss = entropy_aware_opd_loss(student_logits, teacher_logits)
+    half_mask = torch.tensor([[1, 1, 0, 0]])
+    half_loss = entropy_aware_opd_loss(
+        student_logits, teacher_logits, labels=half_mask,
+    )
+    # half_loss should be ~half of the unmasked sum (modulo the entropy gating
+    # being position-dependent — but it should at least be < full_loss)
+    assert float(half_loss) < float(full_loss)
+def test_entropy_aware_opd_zero_when_distributions_match():
+    """When student and teacher are identical, both KLs are 0 → loss is 0."""
+    logits = torch.randn(1, 2, 4)
+    loss = entropy_aware_opd_loss(logits, logits)
+    assert float(loss) < 1e-5
+def test_entropy_aware_opd_reduction_modes():
+    student_logits = torch.randn(2, 3, 4, requires_grad=True)
+    teacher_logits = torch.randn(2, 3, 4)
+    none_loss = entropy_aware_opd_loss(student_logits, teacher_logits, reduction="none")
+    mean_loss = entropy_aware_opd_loss(student_logits, teacher_logits, reduction="mean")
+    sum_loss = entropy_aware_opd_loss(student_logits, teacher_logits, reduction="sum")
+    batchmean_loss = entropy_aware_opd_loss(student_logits, teacher_logits, reduction="batchmean")
+    assert none_loss.shape == (2, 3)
+    assert mean_loss.dim() == 0
+    assert sum_loss.dim() == 0
+    assert batchmean_loss.dim() == 0
+    # batchmean = sum / batch_size
+    assert abs(float(batchmean_loss) - float(sum_loss) / 2) < 1e-4
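The two `teacher_entropy` tests above pin the endpoints of Shannon entropy: near 0 for a one-hot distribution and log(V) for a uniform one. The module under test is not shown in this diff, so the helper below is a hypothetical stand-in. It assumes entropy is measured in nats over `softmax(logits)` and works on one vocabulary vector in plain Python.

```python
import math


def entropy_from_logits(logits):
    """Shannon entropy (nats) of softmax(logits), computed via log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    return -sum(math.exp(lp) * lp for lp in log_probs)


uniform_h = entropy_from_logits([0.0] * 5)               # uniform over V=5
peaked_h = entropy_from_logits([100.0, 0.0, 0.0, 0.0])   # essentially one-hot
print(uniform_h, math.log(5), peaked_h)
```

Going through log-softmax avoids evaluating `log(0)` for the near-zero tail probabilities of the peaked case.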

composer_replication/recipes/monarch/actors.py ADDED

	@@ -0,0 +1,90 @@

+"""Monarch actor skeletons — DESIGN/SKELETON for v0.
+Per ADR-006, full Monarch integration is deferred to v0.2+. This file
+documents the actor signatures so the framework's recipe matrix is
+complete.
+Importing this module does NOT require monarch installed: monarch is
+referenced only inside docstrings. Every constructor below raises
+NotImplementedError, with or without monarch, which is the desired
+behavior for a recipe document.
+"""
+from __future__ import annotations
+from typing import Any
+class TrainerActor:
+    """Hosts the framework's 3-channel composer trainer.
+    Real implementation (v0.2+):
+        from monarch.actor import Actor, endpoint  # verify path against the installed release
+        class TrainerActor(Actor):
+            @endpoint
+            async def train_outer_step(self, batch_id: int) -> dict:
+                # 1. Pull batch from generator
+                # 2. Run inner H steps with composer compose_loss
+                # 3. Compute pseudo-gradient
+                # 4. Hand to ObjectStoreAllReduce manager
+                # 5. Apply outer SGD step
+                # 6. Return metrics dict
+                ...
+    For v0 the actor is just a documentation stub.
+    """
+    backend = "monarch"
+    role = "trainer"
+    def __init__(self) -> None:
+        raise NotImplementedError(
+            "Monarch trainer actor is a v0 skeleton; implementation "
+            "deferred to v0.2 per ADR-006."
+        )
+    async def train_outer_step(self, batch_id: int) -> dict[str, Any]:
+        raise NotImplementedError
+class GeneratorActor:
+    """vllm-backed rollout actor."""
+    backend = "monarch"
+    role = "generator"
+    def __init__(self) -> None:
+        raise NotImplementedError("v0 skeleton — see ADR-006.")
+    async def rollout(self, prompts: list[str]) -> list[str]:
+        raise NotImplementedError
+class RewarderActor:
+    """verifiers-protocol rewarder for RLVR-style RL."""
+    backend = "monarch"
+    role = "rewarder"
+    def __init__(self) -> None:
+        raise NotImplementedError("v0 skeleton — see ADR-006.")
+    async def score(self, completions: list[str]) -> list[float]:
+        raise NotImplementedError
+class TeacherPoolActor:
+    """Channel-3 teacher pool — wraps composer_replication.teacher_replay."""
+    backend = "monarch"
+    role = "teacher_pool"
+    def __init__(self) -> None:
+        raise NotImplementedError("v0 skeleton — see ADR-006.")
+    async def replay(self, states: list[dict]) -> list[dict]:
+        raise NotImplementedError
+__all__ = [
+    "GeneratorActor",
+    "RewarderActor",
+    "TeacherPoolActor",
+    "TrainerActor",
+]
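Steps 3 to 5 of the `train_outer_step` docstring (compute pseudo-gradient, all-reduce, outer SGD step) can be sketched in plain Python. All three helper names are hypothetical, parameters are flat lists of floats rather than tensors, and `all_reduce_mean` only plays the role that `ObjectStoreAllReduce` has in the framework. The outer update here is plain SGD as in the docstring, whereas the DiLoCo paper uses Nesterov momentum for the outer optimizer.

```python
def pseudo_gradient(start_params, end_params):
    """Parameters before the inner steps minus parameters after them."""
    return [s - e for s, e in zip(start_params, end_params)]


def all_reduce_mean(per_replica_vectors):
    """Element-wise mean across replicas."""
    n = len(per_replica_vectors)
    return [sum(values) / n for values in zip(*per_replica_vectors)]


def outer_sgd_step(global_params, mean_pseudo_grad, outer_lr):
    """Plain SGD on the averaged pseudo-gradient (no momentum in this sketch)."""
    return [p - outer_lr * g for p, g in zip(global_params, mean_pseudo_grad)]


global_params = [1.0, -2.0]
# Parameters each replica ended up with after its H inner steps.
replica_params = [[0.8, -1.6], [0.6, -2.0]]

pseudo_grads = [pseudo_gradient(global_params, end) for end in replica_params]
mean_pg = all_reduce_mean(pseudo_grads)
new_global = outer_sgd_step(global_params, mean_pg, outer_lr=0.7)
averaged = outer_sgd_step(global_params, mean_pg, outer_lr=1.0)
print(new_global, averaged)
```

With `outer_lr=1.0` the update reduces to plain parameter averaging across replicas. Smaller values such as the `0.7` used in the example config move only part of the way toward that average.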

composer_replication/recipes/monarch/monarch_actor_layout.md ADDED

	@@ -0,0 +1,121 @@

+# Monarch actor mesh — design for hosting the framework's training topology
+**Status**: Design + skeleton. Real Monarch integration is post-replication
+work (ADR-006 explicitly defers it to v0.2+).
+**ADR**: 006
+## What Monarch is
+Monarch (https://github.com/meta-pytorch/monarch, BSD-3) is Meta's actor-
+mesh runtime — a thin coordination layer over Python processes that lets
+you describe a training topology as a graph of typed actors, then run
+that topology on top of any cluster manager (k8s, Slurm, raw ssh).
+Per ADR-006, Monarch is the only Meta PyTorch agentic-stack component
+that's actively shipping (v0.4.1 stable, v0.5 dev daily) and not paused.
+TorchForge, the original "agent" piece, is paused per its own repo banner.
+## Why Monarch fits the framework's design
+The framework already has an N-actor topology even without Monarch:
+- Trainer (channel 1: GRPO; channel 2: SDPO; channel 3: trace-replay DPO)
+- Generator (rollout / vllm)
+- Rewarder (RLVR test runner / verifiers protocol)
+- N teachers (channel 3: external OpenRouter calls)
+- DiLoCo replicas (N copies of trainer, syncing via object store)
+PRIME-RL gives us the trainer/generator/rewarder split for free. Monarch
+takes that further: each of those becomes a Monarch actor, and the framework
+gains:
+1. **Heterogeneous executor support** — actors run wherever Monarch's
+   backend places them (Modal, k8s, on-prem cluster). Composes naturally
+   with our `ServerlessExecutor` Protocol.
+2. **Failure recovery** — Monarch handles actor crashes + restarts;
+   the framework's DiLoCo state is durable in object storage, so a
+   restarted trainer replica can resume from the last outer round.
+3. **Hot-swap of actor implementations** — switch teacher backends
+   from "OpenRouter" to "local vllm" by changing one Monarch actor
+   binding.
+## Actor topology (proposed)
+```
+┌───────────────────────────────────────────────────────────────┐
+│                  ComposerReplicationMesh                       │
+│                                                                │
+│   ┌──────────────┐   ┌──────────────┐   ┌──────────────────┐  │
+│   │  Trainer × N │←─│  Generator    │←─│  Rewarder         │  │
+│   │  (DiLoCo     │   │  (vllm)       │   │  (verifiers)      │  │
+│   │   replicas)  │   └──────────────┘   └──────────────────┘  │
+│   └──────┬───────┘                                              │
+│          │                                                      │
+│          │   Channel 2: same-model hint-conditioned forward    │
+│          │   Channel 3: cross-model OpenRouter teachers        │
+│          ▼                                                      │
+│   ┌──────────────┐                                              │
+│   │ TeacherPool  │ ── OpenRouter (Claude, GPT, DeepSeek, ...) │
+│   │ (channel 3)  │                                              │
+│   └──────────────┘                                              │
+│                                                                │
+│   ┌──────────────────────────────────────────────────────────┐ │
+│   │  ObjectStore (s3://, hf://, file://)                     │ │
+│   │  · DiLoCo pseudo-gradients (round_N/rank_R.pt)           │ │
+│   │  · Replay datasets (NormalizedDPOPair JSONL)             │ │
+│   └──────────────────────────────────────────────────────────┘ │
+└────────────────────────────────────────────────────────────────┘
+```
+## Mapping to Monarch primitives
+```python
+from monarch.actor import Actor, endpoint  # `mesh` below is a placeholder handle, not a Monarch import
+class TrainerActor(Actor):
+    """Hosts the GRPO trainer + composer 3-channel loss."""
+    @endpoint
+    async def train_outer_step(self, batch_id: int): ...
+class GeneratorActor(Actor):
+    """vllm rollout server — generates trajectories on demand."""
+    @endpoint
+    async def rollout(self, prompts: list[str]) -> list[str]: ...
+class RewarderActor(Actor):
+    """Runs verifiers protocol — RLVR-style test execution."""
+    @endpoint
+    async def score(self, completions: list[str]) -> list[float]: ...
+class TeacherPoolActor(Actor):
+    """Channel 3 — OpenRouter calls to N external teachers."""
+    @endpoint
+    async def replay(self, states: list[dict]) -> list[dict]: ...
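+# Topology (illustrative pseudocode: `mesh.spawn` and its n=/gpu= arguments
+# are placeholders, not Monarch's actual spawn API)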
+trainers   = mesh.spawn(TrainerActor, n=4, gpu="A100")
+generator  = mesh.spawn(GeneratorActor, n=1, gpu="A100")
+rewarder   = mesh.spawn(RewarderActor, n=1, gpu=None)
+teachers   = mesh.spawn(TeacherPoolActor, n=1, gpu=None)
+```
+## Status of this directory
+- `monarch_actor_layout.md` — this file (design)
+- `actors.py`: skeleton actor definitions; importable without monarch,
+  but every constructor raises `NotImplementedError`
+- `composer_mesh.py` — composition glue; not yet implemented
+## Open questions (deferred to v0.2)
+- Does Monarch v0.5's Slurm backend hand-shake cleanly with HF Jobs?
+  (HF Jobs runs each "job" as an independent container; Monarch wants
+  to manage the lifecycle. Possible mismatch.)
+- Can the `TrainerActor` host the framework's `ComposerReplicationTrainer`
+  unmodified, or does it need to be split into `step_init` /
+  `step_compute` endpoints to fit Monarch's async actor model?
+## References
+- Monarch repo: https://github.com/meta-pytorch/monarch
+- ADR-006: docs/adrs/ADR-006-rl-frameworks.md
+- Reconnaissance: docs/research/RL_FRAMEWORKS_LANDSCAPE.md § Monarch
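The call flow in the proposed topology can be exercised without Monarch by standing in plain `asyncio` classes for the actors. This is only a sketch of the message pattern: none of it uses Monarch's API, the class names are shortened stand-ins, and the rollout and scoring bodies are toy placeholders.

```python
import asyncio


class Generator:
    async def rollout(self, prompts):
        # Toy rollout: a real actor would call vllm here.
        return [f"{p} -> answer" for p in prompts]


class Rewarder:
    async def score(self, completions):
        # Toy verifier: reward 1.0 when the completion ends with "answer".
        return [1.0 if c.endswith("answer") else 0.0 for c in completions]


class Trainer:
    def __init__(self, rank, generator, rewarder):
        self.rank = rank
        self.generator = generator
        self.rewarder = rewarder

    async def train_outer_step(self, batch_id):
        prompts = [f"q{batch_id}-{self.rank}-{i}" for i in range(2)]
        completions = await self.generator.rollout(prompts)
        rewards = await self.rewarder.score(completions)
        return {"rank": self.rank, "batch_id": batch_id,
                "mean_reward": sum(rewards) / len(rewards)}


async def run_round(n_trainers, batch_id):
    generator, rewarder = Generator(), Rewarder()
    trainers = [Trainer(rank, generator, rewarder) for rank in range(n_trainers)]
    # All trainer replicas share one generator and one rewarder.
    return await asyncio.gather(*(t.train_outer_step(batch_id) for t in trainers))


metrics = asyncio.run(run_round(n_trainers=4, batch_id=0))
print(metrics)
```

The endpoint signatures match the skeletons in `actors.py`, so the same call sequence should carry over once each class becomes a real actor. Whether `train_outer_step` needs splitting into smaller endpoints remains the open question listed above.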

composer_replication/recipes/prime_rl/composer_loss.py ADDED

	@@ -0,0 +1,111 @@

+"""PRIME-RL composer loss adapter — SKELETON for v0.
+Per ADR-006, PRIME-RL exposes a `CustomLossConfig` that takes an
+importable function. This module supplies that function: a thin adapter
+that maps PRIME-RL's `LossInputs` struct onto the framework's 3-channel
+loss composition.
+Status: SKELETON. The full implementation requires a runtime spike with
+prime-rl installed; this file documents the contract and provides a
+working stub that returns a finite scalar so PRIME-RL can be configured
+end-to-end without yet having all three channels wired up.
+Reference:
+- PRIME-RL `LossInputs` shape (verified via DeepWiki audit, Wave 13):
+    - trainer_logprobs:    Tensor (B, T) — student log-probs of generated tokens
+    - inference_logprobs:  Tensor (B, T) — log-probs from inference engine
+    - teacher_logprobs:    Tensor (B, T) | None — optional teacher channel
+    - advantages:          Tensor (B, T) — GRPO advantages
+    - loss_mask:           Tensor (B, T) — response-token mask
+"""
+from __future__ import annotations
+from typing import Any
+def loss_fn(
+    inputs: Any,  # PRIME-RL's LossInputs — typed as Any to avoid hard import
+    *,
+    alpha_sdpo: float = 0.5,
+    beta_dpo: float = 0.0,
+    epsilon: float = 1e-6,
+) -> Any:  # Returns a torch.Tensor (scalar)
+    """Composer 3-channel loss adapted to PRIME-RL's LossInputs struct.
+    Channels (per `composer_replication.compose_loss`):
+        1. GRPO policy-gradient: -(advantages * trainer_logprobs * mask).sum() / mask.sum()
+        2. SDPO / OPSD: generalized_jsd_loss(student_logits, teacher_logits)
+        3. Trace-replay DPO: standard DPO on (chosen, rejected) pairs
+    For PRIME-RL adaptation:
+        - Channel 1 reads from `advantages` + `trainer_logprobs` directly.
+          (Note: this is REINFORCE-with-advantage, not full GRPO. Full
+          GRPO would use `inference_logprobs` for the importance-sampling
+          ratio + PPO clipping. See Wave 13 review Finding 6.)
+        - Channel 2 (SDPO) is **DEFERRED** for v0 because PRIME-RL v0.5
+          exposes log-probs not logits, and SDPO needs the full vocab
+          distribution. Setting alpha_sdpo>0 while `teacher_logprobs` is
+          present raises NotImplementedError (Wave 13 review Finding 1:
+          an earlier draft was silently degenerate). If `teacher_logprobs`
+          is None, alpha_sdpo has no effect.
+        - Channel 3 (DPO) is OUT OF SCOPE for the PRIME-RL recipe in v0
+          — it would require modifying PRIME-RL's data path to pass
+          `(chosen, rejected)` pairs alongside the rollout, which is a
+          separate integration effort. v0 ignores beta_dpo and emits a
+          warning if it is non-zero.
+    Args:
+        inputs: PRIME-RL `LossInputs` (duck-typed)
+        alpha_sdpo: weight on channel 2 (SDPO)
+        beta_dpo: weight on channel 3 (DPO) — currently must be 0
+        epsilon: numerical stability for log/division
+    Returns:
+        Scalar torch.Tensor; PRIME-RL's trainer takes care of `.backward()`.
+    """
+    # Channel 1: REINFORCE-with-advantage surrogate (masked mean over response tokens)
+    advantages = inputs.advantages
+    trainer_lp = inputs.trainer_logprobs
+    mask = inputs.loss_mask
+    if mask.dtype != advantages.dtype:
+        mask = mask.to(advantages.dtype)
+    grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
+    total = grpo_loss
+    # Channel 2: SDPO/OPSD — DEFERRED in PRIME-RL recipe v0.
+    #
+    # Wave 13 cross-model review (docs/research/WAVE_13_FINAL_REVIEW.md
+    # Finding 1) caught that an earlier draft of this code applied
+    # `unsqueeze(-1)` to (B, T) log-prob tensors before passing them to
+    # generalized_jsd_loss, which calls log_softmax(dim=-1). Softmax of a
+    # 1-element vector is exactly 1.0; its log is 0. So the SDPO term was
+    # mathematically degenerate (always 0), silently disabling channel 2
+    # while reporting alpha_sdpo>0 in the config.
+    #
+    # The right path forward depends on PRIME-RL exposing full logits, not
+    # just log-probs. Until that lands upstream, refuse to fake the channel:
+    teacher_lp = getattr(inputs, "teacher_logprobs", None)
+    if teacher_lp is not None and alpha_sdpo > 0:
+        raise NotImplementedError(
+            "SDPO channel in the PRIME-RL recipe is deferred. PRIME-RL v0.5 "
+            "exposes (B, T) log-probs through LossInputs but not full logits, "
+            "and SDPO/OPSD requires the full distribution over vocabulary. "
+            "Set alpha_sdpo=0.0 to silence this and use channel 1 (GRPO) only. "
+            "See docs/research/WAVE_13_FINAL_REVIEW.md Finding 1."
+        )
+    # Channel 3: not supported in PRIME-RL recipe v0
+    if beta_dpo != 0.0:
+        import warnings
+        warnings.warn(
+            "PRIME-RL recipe v0 does not support DPO channel; "
+            "set beta_dpo=0.0 to silence this warning.",
+            stacklevel=2,
+        )
+    return total
+__all__ = ["loss_fn"]
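The docstring notes that channel 1 is REINFORCE-with-advantage and that full GRPO would add an importance-sampling ratio from `inference_logprobs` plus PPO-style clipping. A minimal sketch of that clipped surrogate follows, in plain Python over one sequence of per-token values. `clipped_surrogate_loss` and `clip_eps` are hypothetical names that are not part of `loss_fn`, and PRIME-RL's own default loss may differ in detail.

```python
import math


def clipped_surrogate_loss(trainer_lp, inference_lp, advantages, mask, clip_eps=0.2):
    """PPO-style clipped policy-gradient loss, averaged over masked tokens.

    ratio = exp(trainer_lp - inference_lp): trainer policy over inference policy.
    """
    total, n_tokens = 0.0, 0
    for lp, lp_old, adv, m in zip(trainer_lp, inference_lp, advantages, mask):
        if not m:
            continue  # prompt or padding token
        ratio = math.exp(lp - lp_old)
        clipped = min(max(ratio, 1 - clip_eps), 1 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogate objectives.
        total += -min(ratio * adv, clipped * adv)
        n_tokens += 1
    return total / max(1, n_tokens)


# On-policy case: the ratio is 1 everywhere, so clipping is inactive.
on_policy = clipped_surrogate_loss(
    trainer_lp=[-1.0, -2.0, -3.0], inference_lp=[-1.0, -2.0, -3.0],
    advantages=[1.0, 0.5, 2.0], mask=[1, 1, 0],
)

# Off-policy case: ratio = exp(0.5) ≈ 1.65 is clipped to 1.2 for a positive advantage.
off_policy = clipped_surrogate_loss(
    trainer_lp=[-0.5], inference_lp=[-1.0], advantages=[1.0], mask=[1],
)
print(on_policy, off_policy)
```

When trainer and inference log-probs agree, the loss value reduces to the negated masked mean of the advantages. The clip matters only once the two policies drift apart.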

composer_replication/recipes/prime_rl/prime_rl_config.yaml ADDED

	@@ -0,0 +1,66 @@

+# PRIME-RL config wiring the framework's 3-channel composer loss.
+#
+# Status: SKELETON. Field names approximate PRIME-RL's v0.5 config schema;
+# verify against the installed version before launching a real run.
+# Reference: docs/research/RL_FRAMEWORKS_LANDSCAPE.md § PRIME-RL.
+# --- Model ------------------------------------------------------------
+model:
+  base: "Qwen/Qwen2.5-0.5B"
+  attn_implementation: "flash_attention_2"
+  dtype: "bfloat16"
+# --- Training environment (verifiers / OpenEnv compatible) -----------
+env:
+  protocol: "verifiers"
+  config:
+    # Point at any verifiers-protocol task (math, code, etc.)
+    name: "math/gsm8k"
+    split: "train"
+# --- Custom loss (the framework's contribution) -----------------------
+loss:
+  custom:
+    # PRIME-RL imports this and calls loss_fn(inputs, **kwargs) at each step.
+    # The function MUST return a scalar tensor (PRIME-RL handles backward).
+    import_path: "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
+    kwargs:
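+      alpha_sdpo: 0.5   # raises NotImplementedError if LossInputs carries teacher_logprobs; use 0.0 for channel-1-only runs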
+      beta_dpo:   0.0   # DPO channel out-of-scope for PRIME-RL recipe v0
+      epsilon: 1.0e-6
+# --- PRIME-RL three-actor split --------------------------------------
+trainer:
+  optimizer: "muon"
+  learning_rate: 1.0e-5
+  inner_steps: 500           # H for Decoupled DiLoCo outer-loop sync
+  # To enable Decoupled DiLoCo, the trainer's optimizer manager is
+  # monkey-patched at startup with composer_replication.diloco.serverless.MockManager
+  # backed by ObjectStoreAllReduce. See ADR-005 for the wiring.
+generator:
+  backend: "vllm"
+  tensor_parallel: 1
+rewarder:
+  protocol: "verifiers"
+  # No-op for the math task — verifiers does the verification
+# --- Decoupled DiLoCo (optional) -------------------------------------
+diloco:
+  enabled: true
+  rendezvous_uri: "s3://my-bucket/diloco-runs/qwen-05b-replication/"
+  world_size: 4
+  outer_lr: 0.7
+  outer_steps: 100
+  # When enabled, replicas should be launched via
+  # composer_replication.diloco.serverless.{ModalExecutor, HFJobsExecutor, ...}
+  # rather than as a single PRIME-RL job.
+# --- Logging / checkpointing -----------------------------------------
+checkpoint:
+  every_n_outer_steps: 10
+  output_dir: "./checkpoints/prime-rl-composer/"
+logging:
+  wandb_project: "composer-replication"
+  log_every_n_steps: 1
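The `import_path` value in the config uses the common `module:attribute` convention. How PRIME-RL resolves it is not shown in this diff, so the helper below only illustrates how such a string is typically resolved with the standard library. `resolve_import_path` is a hypothetical name, and `json:dumps` is used purely as a target that exists everywhere.

```python
import importlib


def resolve_import_path(import_path):
    """Resolve a "package.module:attribute" string to the named object."""
    module_name, sep, attr = import_path.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


dumps = resolve_import_path("json:dumps")
print(dumps({"alpha_sdpo": 0.5}))
```

Resolving the string once in a Python shell before launching a run is a cheap way to catch a typo in the module path or the function name.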

composer_replication/recipes/prime_rl/prime_rl_recipe.md ADDED

	@@ -0,0 +1,107 @@

+# Recipe C — PRIME-RL: 3-channel composer loss via PRIME-RL's `CustomLossConfig`
+**Status**: Recipe complete; runtime smoke test deferred to a follow-up
+spike (requires `prime-rl >= 0.5` installed + a CUDA box).
+**ADR**: 006
+## Why PRIME-RL is a third RL recipe (alongside TRL and VeRL)
+Per ADR-006, PRIME-RL is the cleanest extension surface for a 3-channel
+loss because it ships a **first-class `CustomLossConfig`** that takes an
+importable Python function and a `LossInputs` struct exposing exactly
+the tensors we need:
+```python
+@dataclass
+class LossInputs:
+    trainer_logprobs: Tensor       # student log-probs of generated tokens
+    inference_logprobs: Tensor      # log-probs from the inference engine
+                                    # (importance-sampling ratio denominator)
+    teacher_logprobs: Tensor | None # if the teacher channel is wired in
+    advantages: Tensor              # GRPO advantages (channel 1)
+    loss_mask: Tensor               # response-token mask
+```
+The user wires the loss in via a YAML config field. The loss itself needs
+no fork, no Trainer subclass, and no monkey-patching (the optional DiLoCo
+wiring in step 4 is separate):
+```yaml
+# prime_rl_config.yaml
+loss:
+  custom:
+    import_path: composer_replication.recipes.prime_rl.composer_loss:loss_fn
+    kwargs:
+      alpha_sdpo: 0.5
+      beta_dpo:   0.3
+```
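Reviewer sketch: an `import_path` of the form `module:attribute` is commonly resolved with `importlib`, with the configured `kwargs` bound afterwards. The code below is a minimal illustration of that pattern under stated assumptions, not PRIME-RL's actual loader; `resolve_import_path` and `bind_kwargs` are hypothetical names, and a standard-library callable stands in for `loss_fn`.

```python
import importlib
from functools import partial
from typing import Any, Callable


def resolve_import_path(import_path: str) -> Callable[..., Any]:
    """Resolve a "package.module:attribute" string to the object it names."""
    module_name, sep, attr_name = import_path.partition(":")
    if not sep or not module_name or not attr_name:
        raise ValueError(f"expected 'module:attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


def bind_kwargs(import_path: str, **kwargs: Any) -> Callable[..., Any]:
    """Return the resolved callable with the configured kwargs pre-bound."""
    return partial(resolve_import_path(import_path), **kwargs)


# Stand-in target from the standard library. In the recipe the target would be
# composer_replication.recipes.prime_rl.composer_loss:loss_fn, with
# alpha_sdpo and beta_dpo as the bound kwargs.
round_to_two = bind_kwargs("builtins:round", ndigits=2)
print(round_to_two(3.14159))
```

Binding the kwargs once at load time keeps the per-step call site down to a single positional argument, which is the shape the recipe's `loss_fn(inputs, *, alpha_sdpo, beta_dpo)` signature suggests.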
+## Step-by-step
+### 1. Install PRIME-RL
+```bash
+# Quote the specifier: an unquoted ">" is treated as a shell redirect.
+pip install "prime-rl>=0.5"
+# (or: pip install -e ".[prime-rl]" from the framework repo)
+```
+### 2. Drop in the composer loss
+The framework ships `composer_replication.recipes.prime_rl.composer_loss`
+which adapts the 3-channel `compose_loss` to PRIME-RL's `LossInputs`
+struct. The signature is fixed by PRIME-RL:
+```python
+def loss_fn(inputs: LossInputs, *, alpha_sdpo: float, beta_dpo: float) -> Tensor:
+    # channel 1: GRPO (PRIME-RL's default policy gradient)
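+    # Average over response tokens only; a plain .mean() would also count masked positions.
+    grpo = (inputs.advantages * inputs.trainer_logprobs * inputs.loss_mask).sum() / inputs.loss_mask.sum().clamp(min=1)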
+    # channel 2: SDPO/OPSD against teacher_logprobs
+    sdpo = ...
+    # channel 3: trace-replay DPO via teacher_logprobs disagreement
+    trace_replay_dpo = ...
+    return -grpo + alpha_sdpo * sdpo + beta_dpo * trace_replay_dpo
+```
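Reviewer sketch: the difference between averaging over every position and averaging over response tokens only is easy to see on plain lists. This is a toy illustration with hypothetical helper names (`masked_mean`, `combine_channels`) and made-up numbers; it uses Python floats rather than tensors and does not implement the SDPO or DPO channels.

```python
from typing import Sequence


def masked_mean(values: Sequence[float], mask: Sequence[float]) -> float:
    """Average `values` over positions where `mask` is nonzero."""
    if len(values) != len(mask):
        raise ValueError("values and mask must have the same length")
    total = sum(v * m for v, m in zip(values, mask))
    count = sum(mask)
    return total / count if count > 0 else 0.0


def combine_channels(grpo: float, sdpo: float, dpo: float,
                     alpha_sdpo: float, beta_dpo: float) -> float:
    """Weighted sum in the same form as the recipe's return statement."""
    return -grpo + alpha_sdpo * sdpo + beta_dpo * dpo


# Toy per-token terms (advantage * log-prob); the last two positions are padding.
terms = [-0.2, -0.4, 0.0, 0.0]
mask = [1.0, 1.0, 0.0, 0.0]

plain = sum(t * m for t, m in zip(terms, mask)) / len(terms)  # divides by 4
masked = masked_mean(terms, mask)                             # divides by 2

# Placeholder scalar values for channels 2 and 3, weights from the YAML example.
loss = combine_channels(masked, sdpo=0.1, dpo=0.2, alpha_sdpo=0.5, beta_dpo=0.3)
print(plain, masked, loss)
```

With half the positions masked, the plain average is half the size of the masked one, so the effective weight of the GRPO channel would vary with padding length from batch to batch.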
+Concrete file: `composer_loss.py` in this directory (skeleton; to be
+filled in during the runtime spike).
+### 3. PRIME-RL config
+The example `prime_rl_config.yaml` in this directory wires:
+- The training environment via the `verifiers` env protocol (OpenEnv-
+  compatible — no translation layer needed)
+- The custom loss with `import_path` pointing at our `loss_fn`
+- Trainer / generator / rewarder split (PRIME-RL's three-actor design)
+### 4. Decoupled DiLoCo over PRIME-RL replicas
+PRIME-RL runs trainer/generator/rewarder as separate processes. To layer
+Decoupled DiLoCo on top, replace the trainer process's optimizer with
+the framework's `make_diloco_outer_loop` and pass a `MockManager`
+(per ADR-005) backed by `ObjectStoreAllReduce`. The other two actors
+are unchanged.
+This setup is what makes "any number of teachers, any RL framework, any
+serverless executor" composable — PRIME-RL's plug-in points line up
+naturally with the framework's plug-in points.
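Reviewer sketch: the outer-loop update that the trainer-side swap has to perform can be written in a few lines. This is a minimal illustration on plain Python lists, assuming plain SGD as the outer optimizer (the DiLoCo paper uses Nesterov momentum); `outer_step` is a hypothetical name and is not the framework's `make_diloco_outer_loop`.

```python
from typing import List, Sequence


def outer_step(global_params: Sequence[float],
               replica_params: Sequence[Sequence[float]],
               outer_lr: float) -> List[float]:
    """One outer update: average pseudo-gradients, then take a plain SGD step."""
    if not replica_params:
        raise ValueError("need at least one replica")
    n = len(replica_params)
    new_params = []
    for i, g in enumerate(global_params):
        # Pseudo-gradient of each replica at coordinate i: global minus local.
        mean_delta = sum(g - local[i] for local in replica_params) / n
        new_params.append(g - outer_lr * mean_delta)
    return new_params


global_params = [1.0, 2.0]
# Three replicas after their inner steps; deltas on coordinate 0 are 0, 1, 2.
replicas = [[1.0, 2.0], [0.0, 2.0], [-1.0, 2.0]]
updated = outer_step(global_params, replicas, outer_lr=0.7)
print(updated)
```

The averaging inside `outer_step` is the only cross-replica communication, which is why a slow rendezvous such as an object store is a plausible transport for it.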
+## What this recipe gives the user
+- Frontier-RL post-training infra (PRIME-RL's three-actor design,
+  battle-tested on INTELLECT-2)
+- 3-channel composer loss via a single YAML field
+- DiLoCo outer-loop sync via a one-line monkey-patch of the trainer's
+  manager
+- OpenEnv-compatible task plumbing for free
+## What this recipe doesn't give the user
+- An actual training run yet — that's a separate spike.
+- Quality validation against TRL/VeRL — pending Spike 004 A/B.
+- Hardware autoscaling — that's the Monarch recipe's job (recipes/monarch/).
+## References
+- PRIME-RL repo: https://github.com/PrimeIntellect-ai/prime-rl
+- ADR-006: docs/adrs/ADR-006-rl-frameworks.md
+- Reconnaissance: docs/research/RL_FRAMEWORKS_LANDSCAPE.md (§ PRIME-RL)

composer_replication/recipes/replaysim/default.yaml ADDED

	@@ -0,0 +1,70 @@

+# Default replaysim normalization recipe.
+#
+# This is a data-juicer YAML config (https://github.com/modelscope/data-juicer).
+# It runs CPU-only ops that filter and clean DPO pairs produced by
+# composer_replication.teacher_replay.extract_dpo_pairs.
+#
+# The op-graph operates on records of shape:
+#
+#     {
+#       "state_id": "...",
+#       "messages": [{"role": "user", "content": "..."}],
+#       "chosen":   [{"role": "assistant", "content": "..."}],
+#       "rejected": [{"role": "assistant", "content": "..."}],
+#       "n_teachers_agreeing": 2
+#     }
+#
+# Ops listed in `process` are applied in order. Each op operates on the
+# full record but typically reads/writes one field. data-juicer's
+# DPO/preference-pair ops know how to handle the chosen/rejected pair
+# structure natively.
+# Project & I/O are filled in by DJNormalizer at runtime; we only
+# specify the op pipeline here.
+# --- Op pipeline (applied in order) -----------------------------------
+process:
+  # 1. Length filter on the assistant response.
+  #    Drops pairs where either the chosen or rejected response is shorter
+  #    than 8 chars or longer than 32k chars (likely garbled / overflow).
+  - text_length_filter:
+      min_len: 8
+      max_len: 32000
+      text_keys: ["chosen", "rejected"]
+  # 2. Word-count filter on response.
+  #    Drops pairs with absurdly low (< 2 words) or high (> 4096 words)
+  #    response counts.
+  - words_num_filter:
+      min_num: 2
+      max_num: 4096
+      text_keys: ["chosen", "rejected"]
+  # 3. Special-character filter.
+  #    Drops responses where >50% of characters are non-alphabetic
+  #    special chars (likely encoding errors or junk).
+  - special_characters_filter:
+      max_ratio: 0.5
+      text_keys: ["chosen", "rejected"]
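+  # 4. Exact-match deduplication on the chosen response.
+  #    Drops records whose chosen text duplicates another record's after
+  #    lowercasing and ignoring non-alphabetic characters. This op
+  #    compares records with each other; it does not compare chosen with
+  #    rejected inside one pair, so a pair with identical chosen/rejected
+  #    text needs a separate check.
+  - document_deduplicator:
+      lowercase: true
+      ignore_non_character: true
+      text_keys: ["chosen"]
+      # data-juicer's per-batch dedup; full corpus dedup is a separate op.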
+# Notes:
+# - We DO NOT run `pair_preference_mapper` because its default config may
+#   re-synthesize the rejected text via an LLM call — we already have
+#   real disagreement-derived rejected text and don't want to pay another
+#   API call. (See ADR-004 § "One-day spike before merge.")
+# - Language detection is intentionally not in the default — it requires
+#   downloading a fasttext model and adds startup latency. Add the
+#   `language_id_score_filter` op to a custom recipe if needed.
+# - Semantic-similarity dedup is GPU-bound (NeMo-Curator ops); not in
+#   the default.
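Reviewer sketch: the first two filters can be approximated in plain Python to check which fixtures would survive the thresholds above. This is an approximation of the semantics described in the recipe comments, not data-juicer code; `response_text` and `passes_default_filters` are hypothetical names, and whitespace splitting as the word counter is an assumption.

```python
from typing import Any, Dict, List


def response_text(messages: List[Dict[str, Any]]) -> str:
    """Join the content of a one-turn chosen/rejected message list."""
    return "".join(str(m.get("content", "")) for m in messages)


def passes_default_filters(record: Dict[str, Any],
                           min_len: int = 8, max_len: int = 32000,
                           min_words: int = 2, max_words: int = 4096) -> bool:
    """True if both responses satisfy the length and word-count bounds."""
    for key in ("chosen", "rejected"):
        text = response_text(record.get(key, []))
        if not (min_len <= len(text) <= max_len):
            return False
        if not (min_words <= len(text.split()) <= max_words):
            return False
    return True


short_pair = {
    "chosen": [{"role": "assistant", "content": "Four."}],
    "rejected": [{"role": "assistant", "content": "Five."}],
}
longer_pair = {
    "chosen": [{"role": "assistant", "content": "The answer is four."}],
    "rejected": [{"role": "assistant", "content": "The answer is five."}],
}
print(passes_default_filters(short_pair), passes_default_filters(longer_pair))
```

Under this approximation a terse pair such as `Four.` / `Five.` falls below both `min_len: 8` and `min_num: 2`, so short-answer tasks may need a custom recipe with lower bounds.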

composer_replication/replaysim/__init__.py ADDED

	@@ -0,0 +1,55 @@

+"""composer_replication.replaysim — N-teacher trace replay + dataset normalization.
+Per ADR-004, this package consolidates the framework's
+"replay an LLM trace through N teachers, get a DPO/preference dataset" flow:
+    raw trace
+        ↓ (existing teacher_replay.replay_trace)
+    list[TeacherCallResult]
+        ↓ (existing teacher_replay.extract_dpo_pairs)
+    list[DPOPair]
+        ↓ (NEW — composer_replication.replaysim.normalize.DJNormalizer)
+    list[NormalizedDPOPair]   # length/word-count/special-char filtered, dedup'd
+The pre-normalization pipeline is unchanged. The normalizer is opt-in via
+the new convenience function `replay_and_normalize_trace(...)` which wraps
+the existing `replay_trace` + `extract_dpo_pairs` and pipes their output
+through a `data-juicer` op-graph.
+Adopting `data-juicer` (Alibaba, Apache-2.0) was the verdict from the
+2026-05-26 reconnaissance — see docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md.
+It's the only mature library with NATIVE multi-turn `messages` + DPO
+preference-pair ops that runs CPU-only on the ops we need.
+Optional dependency: `pip install -e .[replaysim]` pulls `data-juicer`.
+Without it, `DJNormalizer()` raises `RuntimeError` at construction time
+(unless `skip_dj=True`), but the package still imports cleanly.
+This module re-exports the existing `teacher_replay` API for convenience
+so users can `from composer_replication.replaysim import replay_trace`.
+"""
+from __future__ import annotations
+from composer_replication.replaysim.normalize import (
+    DJNormalizer,
+    NormalizedDPOPair,
+    replay_and_normalize_trace,
+)
+# Re-exports from the pre-existing teacher_replay module (unchanged):
+from composer_replication.teacher_replay import (
+    DPOPair,
+    TeacherCallResult,
+    extract_dpo_pairs,
+    replay_trace,
+)
+__all__ = [
+    "DJNormalizer",
+    "DPOPair",
+    "NormalizedDPOPair",
+    "TeacherCallResult",
+    "extract_dpo_pairs",
+    "replay_and_normalize_trace",
+    "replay_trace",
+]

composer_replication/replaysim/normalize.py ADDED

	@@ -0,0 +1,270 @@

+"""DJNormalizer — data-juicer adapter for replaysim DPO output.
+Wraps the framework's `extract_dpo_pairs` output in a data-juicer op-graph.
+The op-graph runs entirely CPU-side and, with the default recipe, applies
+length, word-count, and special-character filters plus exact-match
+deduplication. Ops are loaded from a YAML recipe so users can swap
+normalization strategies without touching framework code.
+Default recipe lives at:
+    composer_replication/recipes/replaysim/default.yaml
+The data-juicer dependency is optional (pulled by the `[replaysim]` extra).
+This file imports it lazily inside method bodies so that the package
+imports cleanly without it.
+Source-of-truth shape (from `composer_replication.teacher_replay`):
+    DPOPair = TypedDict("DPOPair", {
+        "state_id":           str,
+        "state_messages":     list[dict],   # conversation up to this step
+        "chosen":             str,          # teacher-consensus action
+        "rejected":           str,          # student action
+        "n_teachers_agreeing": int,
+    })
+The normalizer does NOT require chosen_teacher / rejected_teacher fields —
+those don't exist in the real DPOPair shape.
+"""
+from __future__ import annotations
+import asyncio
+import json
+import os
+import tempfile
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Iterable, cast
+from composer_replication.teacher_replay import (
+    DPOPair,
+    TeacherCallResult,
+    extract_dpo_pairs,
+    replay_trace,
+)
+@dataclass
+class NormalizedDPOPair:
+    """A DPOPair that has passed through normalization. Same data as
+    DPOPair but reshaped into chat-messages format (matching data-juicer's
+    native multi-turn op support) plus a metadata dict tracking which
+    ops fired.
+    """
+    state_id: str
+    """Identifier for the trace state (turn) this pair came from."""
+    state_messages: list[dict[str, Any]]
+    """The conversation context up to (and including) this step's user prompt."""
+    chosen_messages: list[dict[str, Any]]
+    """The chosen completion as a chat-messages list (one assistant turn)."""
+    rejected_messages: list[dict[str, Any]]
+    """The rejected completion as a chat-messages list (one assistant turn)."""
+    n_teachers_agreeing: int
+    """How many teachers agreed on the chosen action (preserved from DPOPair)."""
+    metadata: dict[str, Any]
+    """Op-graph provenance: which ops fired, what they changed."""
+def _dpo_pair_to_dj_record(pair: DPOPair | dict[str, Any]) -> dict[str, Any]:
+    """Convert a DPOPair (or dict-shaped equivalent) into a data-juicer
+    record using the messages format.
+    """
+    p = cast(dict[str, Any], pair)
+    return {
+        "state_id": p.get("state_id", ""),
+        "messages": p.get("state_messages", []),
+        "chosen": [{"role": "assistant", "content": p.get("chosen", "")}],
+        "rejected": [{"role": "assistant", "content": p.get("rejected", "")}],
+        "n_teachers_agreeing": p.get("n_teachers_agreeing", 0),
+    }
+def _dj_record_to_normalized(rec: dict[str, Any]) -> NormalizedDPOPair:
+    """Inverse — convert a data-juicer record back to NormalizedDPOPair."""
+    return NormalizedDPOPair(
+        state_id=rec.get("state_id", ""),
+        state_messages=rec.get("messages", []),
+        chosen_messages=rec.get("chosen", []),
+        rejected_messages=rec.get("rejected", []),
+        n_teachers_agreeing=rec.get("n_teachers_agreeing", 0),
+        metadata=rec.get("__dj_meta__", {}),
+    )
+class DJNormalizer:
+    """data-juicer-backed normalizer for DPO pairs.
+    Args:
+        recipe_path: path to a data-juicer YAML recipe. If None, uses the
+            framework's default recipe (length, word-count, and
+            special-character filters + exact-match dedup).
+        skip_dj: if True, the normalizer becomes a passthrough — useful
+            for test environments without data-juicer installed. Records
+            are still converted to NormalizedDPOPair shape but no ops run.
+    """
+    DEFAULT_RECIPE = (
+        Path(__file__).parent.parent / "recipes" / "replaysim" / "default.yaml"
+    )
+    def __init__(
+        self,
+        recipe_path: str | os.PathLike[str] | None = None,
+        *,
+        skip_dj: bool = False,
+    ) -> None:
+        self.recipe_path = (
+            Path(recipe_path) if recipe_path is not None else self.DEFAULT_RECIPE
+        )
+        self.skip_dj = skip_dj
+        if not skip_dj:
+            try:
+                import data_juicer  # type: ignore[import-not-found]  # noqa: F401
+            except ImportError as e:
+                raise RuntimeError(
+                    "DJNormalizer requires data-juicer. Install with "
+                    "`pip install -e .[replaysim]` or pass skip_dj=True "
+                    "for a passthrough. Got: " + repr(e)
+                ) from e
+        if not self.skip_dj and not self.recipe_path.exists():
+            raise FileNotFoundError(
+                f"Recipe not found: {self.recipe_path}. Either pass an "
+                f"explicit recipe_path or add the default recipe at this "
+                f"location."
+            )
+    def normalize(
+        self,
+        pairs: Iterable[DPOPair | dict[str, Any]],
+    ) -> list[NormalizedDPOPair]:
+        """Run the full normalization op-graph on a batch of DPO pairs.
+        Args:
+            pairs: iterable of DPOPair (output of extract_dpo_pairs) or
+                dict-shaped equivalents.
+        Returns:
+            list of NormalizedDPOPair, possibly shorter than input (filter
+            ops can drop records).
+        """
+        records = [_dpo_pair_to_dj_record(p) for p in pairs]
+        if self.skip_dj:
+            for rec in records:
+                rec["__dj_meta__"] = {"skipped": True}
+            return [_dj_record_to_normalized(r) for r in records]
+        # Real path: write to temp JSONL, hand to data-juicer's Executor,
+        # read back. data-juicer's CLI contract is file-in / file-out.
+        from data_juicer.config import init_configs  # type: ignore[import-not-found]
+        from data_juicer.core import DefaultExecutor  # type: ignore[import-not-found]
+        with tempfile.TemporaryDirectory() as td:
+            input_path = Path(td) / "input.jsonl"
+            output_path = Path(td) / "output.jsonl"
+            with input_path.open("w") as f:
+                for rec in records:
+                    f.write(json.dumps(rec) + "\n")
+            cfg = init_configs(
+                args=[
+                    "--config", str(self.recipe_path),
+                    "--dataset_path", str(input_path),
+                    "--export_path", str(output_path),
+                ],
+            )
+            executor = DefaultExecutor(cfg)
+            executor.run()
+            output_records: list[dict[str, Any]] = []
+            with output_path.open() as f:
+                for line in f:
+                    line = line.strip()
+                    if not line:
+                        continue
+                    output_records.append(json.loads(line))
+        return [_dj_record_to_normalized(r) for r in output_records]
+# ---------------------------------------------------------------------
+# Convenience: replay + extract pairs + normalize, end to end.
+# ---------------------------------------------------------------------
+async def replay_and_normalize_trace(
+    *,
+    states: Any,
+    teachers: Any = None,
+    agreement_threshold: int = 2,
+    max_total_usd: float = 5.0,
+    normalizer: DJNormalizer | None = None,
+    **replay_kwargs: Any,
+) -> tuple[list[TeacherCallResult], list[NormalizedDPOPair]]:
+    """Async convenience: replay → extract pairs → normalize, in one call.
+    The underlying `replay_trace` is async; this wrapper preserves that
+    so callers can `await` it from an async context. For sync callers
+    use `replay_and_normalize_trace_sync`.
+    Args:
+        states: sequence of TraceState (the frozen agentic trace)
+        teachers: sequence of TeacherSpec (default: framework defaults)
+        agreement_threshold: passed to `extract_dpo_pairs`
+        max_total_usd: passed to `replay_trace`
+        normalizer: defaults to `DJNormalizer()`. Pass
+            `DJNormalizer(skip_dj=True)` to bypass data-juicer.
+        **replay_kwargs: extra kwargs forwarded to `replay_trace`.
+    Returns:
+        Tuple of (raw teacher_actions, normalized DPO pairs).
+    """
+    if normalizer is None:
+        normalizer = DJNormalizer()
+    if teachers is None:
+        teacher_actions = await replay_trace(
+            states=states, max_total_usd=max_total_usd, **replay_kwargs,
+        )
+    else:
+        teacher_actions = await replay_trace(
+            states=states,
+            teachers=teachers,
+            max_total_usd=max_total_usd,
+            **replay_kwargs,
+        )
+    # extract_dpo_pairs reads student_action from each state's
+    # `student_action` field, so we don't need to pass it separately.
+    raw_pairs = extract_dpo_pairs(
+        states=states,
+        teacher_actions=teacher_actions,
+        agreement_threshold=agreement_threshold,
+    )
+    normalized = normalizer.normalize(raw_pairs)
+    return teacher_actions, normalized
+def replay_and_normalize_trace_sync(
+    *args: Any,
+    **kwargs: Any,
+) -> tuple[list[TeacherCallResult], list[NormalizedDPOPair]]:
+    """Sync wrapper for the async `replay_and_normalize_trace`. Convenient
+    for scripts and tests.
+    """
+    return asyncio.run(replay_and_normalize_trace(*args, **kwargs))
+__all__ = [
+    "DJNormalizer",
+    "NormalizedDPOPair",
+    "replay_and_normalize_trace",
+    "replay_and_normalize_trace_sync",
+]
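Reviewer sketch: the file-in / file-out handoff used by `DJNormalizer.normalize` can be exercised without data-juicer installed. The helpers below (`write_jsonl`, `read_jsonl`) are hypothetical names that mirror the temp-directory JSONL pattern in the method; the executor call is omitted, so this checks only that records survive the serialization round trip.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, List


def write_jsonl(path: Path, records: Iterable[Dict[str, Any]]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read JSON lines back, skipping blank lines."""
    out: List[Dict[str, Any]] = []
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                out.append(json.loads(line))
    return out


records = [
    {
        "state_id": "s1",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
        "chosen": [{"role": "assistant", "content": "Four."}],
        "rejected": [{"role": "assistant", "content": "Five."}],
        "n_teachers_agreeing": 2,
    }
]

with tempfile.TemporaryDirectory() as td:
    path = Path(td) / "input.jsonl"
    write_jsonl(path, records)
    round_tripped = read_jsonl(path)

print(round_tripped == records)
```

Everything is read back inside the `with` block because the temporary directory is deleted on exit, which is also why the method collects `output_records` before leaving its own context manager.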

composer_replication/replaysim/tests/__init__.py ADDED

File without changes

composer_replication/replaysim/tests/test_replaysim.py ADDED

	@@ -0,0 +1,138 @@

+"""Replaysim normalization tests — the skip_dj passthrough path.
+The full data-juicer path requires `pip install -e .[replaysim]` which we
+defer to the user's environment. These tests verify:
+1. The package imports cleanly without data-juicer installed.
+2. `DJNormalizer(skip_dj=True)` is a working passthrough.
+3. The DPOPair → DJ-record → NormalizedDPOPair shape transforms are
+   lossless modulo the metadata field.
+4. The DPOPair dict shape (TypedDict) is what we expect.
+"""
+from __future__ import annotations
+import pytest
+from composer_replication.replaysim import (
+    DJNormalizer,
+    NormalizedDPOPair,
+    replay_and_normalize_trace,
+)
+from composer_replication.replaysim.normalize import (
+    _dj_record_to_normalized,
+    _dpo_pair_to_dj_record,
+)
+def _make_pair(
+    state_id: str,
+    state_messages: list[dict] | None = None,
+    chosen: str = "Four.",
+    rejected: str = "Five.",
+    n_teachers_agreeing: int = 2,
+) -> dict:
+    """Helper — DPOPair is a TypedDict, so dicts work directly."""
+    return {
+        "state_id": state_id,
+        "state_messages": state_messages or [{"role": "user", "content": "What is 2+2?"}],
+        "chosen": chosen,
+        "rejected": rejected,
+        "n_teachers_agreeing": n_teachers_agreeing,
+    }
+def test_dpo_pair_to_dj_record_shape():
+    p = _make_pair("s1")
+    rec = _dpo_pair_to_dj_record(p)
+    assert rec["state_id"] == "s1"
+    assert rec["messages"] == [{"role": "user", "content": "What is 2+2?"}]
+    assert rec["chosen"] == [{"role": "assistant", "content": "Four."}]
+    assert rec["rejected"] == [{"role": "assistant", "content": "Five."}]
+    assert rec["n_teachers_agreeing"] == 2
+def test_dj_record_to_normalized_roundtrip():
+    p = _make_pair("s2", chosen="C", rejected="R", n_teachers_agreeing=3)
+    rec = _dpo_pair_to_dj_record(p)
+    rec["__dj_meta__"] = {"ops_applied": ["text_length_filter"]}
+    norm = _dj_record_to_normalized(rec)
+    assert isinstance(norm, NormalizedDPOPair)
+    assert norm.state_id == "s2"
+    assert norm.chosen_messages == [{"role": "assistant", "content": "C"}]
+    assert norm.rejected_messages == [{"role": "assistant", "content": "R"}]
+    assert norm.n_teachers_agreeing == 3
+    assert norm.metadata == {"ops_applied": ["text_length_filter"]}
+def test_dj_record_to_normalized_preserves_state_messages():
+    """The conversation context (state_messages) must round-trip."""
+    multi_turn = [
+        {"role": "user", "content": "What is 2+2?"},
+        {"role": "assistant", "content": "Let me think."},
+        {"role": "user", "content": "Just give me a number."},
+    ]
+    p = _make_pair("s3", state_messages=multi_turn)
+    rec = _dpo_pair_to_dj_record(p)
+    norm = _dj_record_to_normalized(rec)
+    assert norm.state_messages == multi_turn
+def test_dj_normalizer_skip_dj_passthrough():
+    """skip_dj=True: bypasses data-juicer entirely, just does shape conversion."""
+    pairs = [
+        _make_pair("s1", chosen="c1", rejected="r1"),
+        _make_pair("s2", chosen="c2", rejected="r2"),
+    ]
+    normalizer = DJNormalizer(skip_dj=True)
+    out = normalizer.normalize(pairs)
+    assert len(out) == 2
+    assert all(isinstance(o, NormalizedDPOPair) for o in out)
+    assert out[0].state_id == "s1"
+    assert out[1].state_id == "s2"
+    assert out[0].metadata == {"skipped": True}
+    assert out[1].metadata == {"skipped": True}
+def test_dj_normalizer_skip_dj_preserves_count():
+    """Passthrough must not drop records — only filter ops do that."""
+    pairs = [_make_pair(f"s{i}") for i in range(10)]
+    normalizer = DJNormalizer(skip_dj=True)
+    out = normalizer.normalize(pairs)
+    assert len(out) == 10
+def test_dj_normalizer_default_recipe_path_exists():
+    """The default recipe ships with the package."""
+    assert DJNormalizer.DEFAULT_RECIPE.exists(), \
+        f"Default recipe missing at {DJNormalizer.DEFAULT_RECIPE}"
+def test_dj_normalizer_real_path_requires_data_juicer():
+    """Without skip_dj, instantiation requires data-juicer or fails clearly."""
+    try:
+        import data_juicer  # type: ignore[import-not-found]  # noqa: F401
+    except ImportError:
+        with pytest.raises(RuntimeError, match="data-juicer"):
+            DJNormalizer(skip_dj=False)
+    else:
+        # data-juicer IS installed; verify init succeeds with default recipe
+        normalizer = DJNormalizer(skip_dj=False)
+        assert normalizer.recipe_path == DJNormalizer.DEFAULT_RECIPE
+def test_replay_and_normalize_trace_signature():
+    """Convenience function is callable and importable. Smoke-only — we
+    don't run it against OpenRouter from CI."""
+    assert callable(replay_and_normalize_trace)
+    # It's an async function
+    import inspect
+    assert inspect.iscoroutinefunction(replay_and_normalize_trace)
+def test_record_handles_missing_optional_fields():
+    """A DPOPair dict missing some optional fields shouldn't crash the converter."""
+    minimal = {"state_id": "x", "chosen": "a", "rejected": "b"}
+    rec = _dpo_pair_to_dj_record(minimal)
+    assert rec["state_id"] == "x"
+    assert rec["messages"] == []        # missing state_messages → empty list
+    assert rec["n_teachers_agreeing"] == 0  # missing → default 0

docs/ALTERED_MINDS_TIE_IN.md ADDED

	@@ -0,0 +1,154 @@

+# altered-minds × Composer Replication Framework
+**Status**: Tie-in design doc.
+**Date**: 2026-05-26 (Wave 13)
+**Source workstream**: `llm-mental-alterations` (formerly Codeseys/llm-mental-alterations
+on HF; user has indicated a rename to `altered-minds`)
+## What altered-minds is studying
+From the user's existing wiki notes (`~/wiki/projects/llm-mental-alterations.md`):
+- Fine-tuning Llama-3.1-8B with **personality SFT** induces a depression/
+  anxiety cognitive-distortion signature on MMLU `moral_scenarios`:
+  - Class 3 ("both fine") collapses **−31.1pp**
+  - Class 0 ("both wrong") improves **+4.6pp**
+  - Multi-seed reproducible (4/4 seeds, n=895)
+  - 18% of base-correct items broken
+- Other domains affected: `high_school_chemistry +4.2pp`,
+  `machine_learning +4.9pp` (reliably improved).
+- H-3 Gemma-MoE hypothesis is deferred (Hopper-only).
+- Spend so far: $9.75 / $400 budget.
+The headline question driving the workstream is roughly:
+**"What measurable cognitive alterations does personality-style SFT
+introduce, and can we recover or sharpen them via downstream RL?"**
+## Why this framework is the right second-stage workstream
+altered-minds today is an **SFT-only** pipeline. A typical run:
+1. Take a base model (Llama-3.1-8B).
+2. Apply personality SFT.
+3. Evaluate on MMLU + alteration-specific probes.
+4. Document the alteration signature.
+The Composer Replication Framework, by design, is a **post-SFT
+reinforcement-learning framework**. It can take any HF model — including
+an altered-minds-altered model — and apply:
+- **GRPO** with verifiable rewards
+- **SDPO/OPSD** self-distillation against the altered model's hint-
+  conditioned forward passes
+- **Trace-replay DPO** against N external teachers
+That gives altered-minds three orthogonal axes of investigation it doesn't
+currently have:
+| Axis | What changes | What we learn |
+|---|---|---|
+| **GRPO with verifiable reward** | Train the altered model on math/code where ground truth is checkable | Does the alteration's "personality" persist under task-driven RL, or does it wash out? |
+| **SDPO against the altered model's own hints** | Self-distillation — the altered model teaches itself with hint-conditioned forward passes | Can we **sharpen** the alteration without further SFT? |
+| **Trace-replay DPO with frontier teachers** | The altered model rolls out, frontier teachers replay the same prompts, disagreement → DPO pairs | Where does the altered model **disagree** with frontier consensus? Are those disagreements correlated with the cognitive-distortion signature? |
+The **third** axis is the most interesting for altered-minds specifically.
+The framework's `replay_trace` + `extract_dpo_pairs` produce, by construction,
+a dataset of "altered-model output" vs "frontier-consensus output" for any
+prompt distribution. If the altered model's depression/anxiety signature
+shows up in moral_scenarios, then the trace-replay output on
+moral-scenario prompts is **a measurable corpus of the alteration**.
+## Concrete plan: altered-minds-RL spike
+### Phase 1 — model selection
+Pick the altered-minds checkpoint that produced the strongest signature
+(per the user's notes: the multi-seed Llama-3.1-8B personality-SFT run
+where moral_scenarios class 3 collapsed −31.1pp).
+### Phase 2 — domain-specific replaysim
+Run `composer_replication.replaysim.replay_and_normalize_trace` against:
+- A held-out moral_scenarios test set (the alteration locus)
+- A held-out high_school_chemistry test set (where altered-minds *improved*)
+- A held-out general MMLU baseline
+Teachers: framework defaults (Claude Opus 4.7, GPT-5, DeepSeek V4 Pro).
+This produces **three normalized DPO datasets** capturing where the
+altered model disagrees with frontier consensus on each domain.
+Cost estimate: ~$0.98/trace × 100 prompts × 3 domains ≈ **$300**.
+Fits inside the user's existing $400 altered-minds budget.
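Reviewer sketch: the budget arithmetic can be checked directly. The figures are the ones quoted in this document ($0.98 per trace, 100 prompts per domain, 3 domains, $9.75 spent of $400); treating one prompt as one trace is an assumption, and the variable names are illustrative.

```python
COST_PER_TRACE_USD = 0.98
PROMPTS_PER_DOMAIN = 100
N_DOMAINS = 3
TOTAL_BUDGET_USD = 400.00
SPENT_USD = 9.75

# Phase 2 across all three domains, assuming one trace per prompt.
phase2_cost = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN * N_DOMAINS
# Fallback: moral_scenarios only.
single_domain_cost = COST_PER_TRACE_USD * PROMPTS_PER_DOMAIN

remaining_before = TOTAL_BUDGET_USD - SPENT_USD
remaining_after = remaining_before - phase2_cost

print(f"Phase 2 (3 domains): ${phase2_cost:.2f}")
print(f"Phase 2 (1 domain):  ${single_domain_cost:.2f}")
print(f"Remaining before:    ${remaining_before:.2f}")
print(f"Remaining after:     ${remaining_after:.2f}")
```

The three-domain run comes to about $294 and leaves roughly $96 for Phases 3 and 4, so any GPU rental for the 8B run has to fit in that remainder.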
+### Phase 3 — GRPO with the framework
+Run `composer_replication.recipes.trl.ComposerReplicationTrainer` with:
+- **Channel 1 (GRPO)**: turned ON, reward = MMLU letter-correctness
+- **Channel 2 (SDPO/OPSD)**: turned ON at α=0.2, hint-conditioned
+  against the altered model's own forward pass
+- **Channel 3 (trace-replay DPO)**: turned ON at β=0.4, against the
+  Phase-2 datasets
+Train for ~500 steps on a single GPU. The Qwen-0.5B feasibility test is
+already confirmed in the framework; for Llama-8B the venue is still open
+(local RTX 5090 vs. Modal via the framework's `ServerlessExecutor` per
+ADR-005; see open question 3).
+### Phase 4 — re-evaluate
+Re-run the same MMLU + alteration probes used originally on the
+**post-RL** model. Three outcomes are possible:
+| Outcome | Interpretation |
+|---|---|
+| Alteration signature persists at same magnitude | The alteration is robust to task-driven RL — useful as a lower bound on its "depth" |
+| Alteration signature attenuates | Task-driven RL washes out personality-SFT — useful for understanding alteration brittleness |
+| Alteration signature **amplifies** on channel-2-only ablation | SDPO is reinforcing the alteration; rare and significant — would be a publishable finding |
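Reviewer sketch: the three outcomes can be turned into a mechanical check on the signature magnitude. The function name `classify_outcome`, the 2 pp tolerance, and the post-RL values in the example are all hypothetical; only the −31.1 pp starting point comes from this document, and the sketch does not distinguish the channel-2-only ablation from the full run.

```python
def classify_outcome(before_pp: float, after_pp: float, tol_pp: float = 2.0) -> str:
    """Compare signature magnitudes (in percentage points) before and after RL."""
    change = abs(after_pp) - abs(before_pp)
    if change > tol_pp:
        return "amplifies"
    if change < -tol_pp:
        return "attenuates"
    return "persists"


BASELINE_PP = -31.1  # moral_scenarios class 3 collapse reported above

# Hypothetical post-RL measurements, for illustration only.
for after in (-30.0, -10.0, -40.0):
    print(after, classify_outcome(BASELINE_PP, after))
```

Fixing the tolerance before Phase 4 runs, ideally from the seed-to-seed spread of the original SFT result, keeps the interpretation from being chosen after the fact.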
+### Phase 5 — Decoupled DiLoCo for multi-personality experiments
+Once a single altered-minds-RL run works, the framework's serverless
+DiLoCo (ADR-005) lets us run **N personality-altered models in parallel
+across Modal/HF Jobs**, with their pseudo-gradients pooled via object
+storage. This becomes the natural sweep over personality types
+(depression vs anxiety vs grandiose vs ...) at minimal incremental
+infrastructure cost.
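Reviewer sketch: pooling pseudo-gradients through object storage amounts to "every replica writes one object, then everyone reads all of them once the count reaches `world_size`". The code below illustrates that rendezvous with a local directory standing in for the bucket; `publish` and `try_reduce` are hypothetical names, not the framework's `ObjectStoreAllReduce` API, and a real implementation would poll with a timeout.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def publish(round_dir: Path, rank: int, pseudo_grad: List[float]) -> None:
    """Write one replica's contribution as its own object."""
    round_dir.mkdir(parents=True, exist_ok=True)
    tmp = round_dir / f"rank-{rank}.json.tmp"
    tmp.write_text(json.dumps(pseudo_grad))
    # Rename after writing so readers never see a partially written file.
    tmp.rename(round_dir / f"rank-{rank}.json")


def try_reduce(round_dir: Path, world_size: int) -> Optional[List[float]]:
    """Return the element-wise mean once all ranks have published, else None."""
    paths = sorted(round_dir.glob("rank-*.json"))
    if len(paths) < world_size:
        return None  # barrier not reached yet; a caller would poll again
    grads = [json.loads(p.read_text()) for p in paths]
    return [sum(col) / world_size for col in zip(*grads)]


with tempfile.TemporaryDirectory() as td:
    round_dir = Path(td) / "round-0"
    publish(round_dir, 0, [0.0])
    publish(round_dir, 1, [100.0])
    partial_result = try_reduce(round_dir, world_size=3)  # still waiting on rank 2
    publish(round_dir, 2, [200.0])
    full_result = try_reduce(round_dir, world_size=3)

print(partial_result, full_result)
```

Using one directory per outer round keeps stale objects from an earlier round out of the average. The temp-file-then-rename step is a local-filesystem habit; how a real bucket backend avoids partial reads is outside this sketch.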
+## Repo layout proposal
+The Composer Replication Framework is intentionally generic. The
+altered-minds-specific RL spike should live as a separate repo or
+subdirectory **using** the framework, not inside it:
+```
+altered-minds/                  # the renamed llm-mental-alterations repo
+  composer_replication_runs/    # NEW
+    moral_scenarios_replay.py   # uses composer_replication.replaysim
+    train_grpo.py               # uses composer_replication.trainer
+    eval_post_rl.py             # standard altered-minds eval
+  recipes/
+    altered_minds.yaml          # data-juicer recipe — symlinks/copies
+                                # composer_replication's default + adds
+                                # MMLU-format-aware ops
+```
+The framework provides the algorithm + infrastructure. The altered-minds
+repo owns the experimental narrative + results.
+## Open questions for the user
+Before we proceed to Phase 1:
+1. **Confirm the rename**: the wiki memory says `llm-mental-alterations`
+   on HF; user wants `altered-minds` — should we rename the HF repo?
+2. **Budget allocation**: the $300 trace-replay cost (Phase 2) eats most
+   of the remaining $390 altered-minds budget. Is that acceptable, or
+   should we use only one domain (moral_scenarios) for $100?
+3. **GPU venue for Phase 3**: 8B-model RL on single-GPU is feasible on
+   the user's RTX 5090 (32GB) for short runs, OR we use Modal A100s for
+   a more aggressive run. Preference?
+## References
+- altered-minds workstream wiki: `~/wiki/projects/llm-mental-alterations.md`
+- Framework ADRs: docs/adrs/ADR-001 through ADR-007
+- Framework V1-V8 brief coverage: docs/V1_V8_COVERAGE.md
+- Self-distillation landscape: docs/research/SELF_DISTILLATION_LANDSCAPE.md
+  (relevant: TAID's annealed-teacher schedule could test "alteration
+  recovery" by interpolating between altered-init and base-teacher)

docs/V1_V8_COVERAGE.md CHANGED

@@ -90,5 +90,27 @@ This is the post-replication phase. The CPU-only deep-work-loop phase (Waves 7-1
 - `docs/VISION_VALIDATION.md` — original 10-point scorecard + post-Wave-11 honest re-scoring
 - `docs/research/WAVE_7_10_FINAL_REVIEW.md` — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
-- `docs/adrs/ADR-001..003` — three architectural decisions (GPU venue, trace source, DiLoCo impl)
 - `BACKLOG.md` — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10

 - `docs/VISION_VALIDATION.md` — original 10-point scorecard + post-Wave-11 honest re-scoring
 - `docs/research/WAVE_7_10_FINAL_REVIEW.md` — cross-model adversarial review of Wave 7-10 (10 priority items, 2 BLOCKERs both addressed)
+- `docs/adrs/ADR-001..007` — seven architectural decisions (GPU venue, trace source, DiLoCo impl, replaysim normalization, serverless DiLoCo, RL frameworks, distillation losses)
 - `BACKLOG.md` — pre-execution acceptance criteria for Spikes 006/007/008 + Wave 10
+---
+## Wave 13 expansion (2026-05-26)
+The user expanded the brief mid-loop:
+> *"keep going. make sure that we do the paths of the Composer 2.5 methods, the n-teachers replaysim, and Decoupled DiLoCo (so that we can leverage modal or huggingface-jobs or other serverless training systems). … For V5 see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation. … if we can properly document and research the self-distillation papers like SDPO OPDS and/or others. … see if there are other frameworks that are more popular that we could try to use. meta's pytorch agentic stack components are something that I'd like to explore."*
+| Wave 13 ask | Artifact | Status |
+|---|---|---|
+| Decoupled DiLoCo over serverless | ADR-005 + `composer_replication.diloco.serverless` (Protocol + ObjectStoreAllReduce + LocalProcessExecutor + Modal/HFJobs skeletons) + 9 tests (3 of them multi-process) | ✅ Closed (local) / 🟡 Skeleton (cloud) |
+| Replaysim normalization | ADR-004 + `composer_replication.replaysim` package + `data-juicer` adapter + default YAML recipe + 9 unit tests | ✅ Closed (passthrough) / 🟡 Pending data-juicer install for full path |
+| Other RL frameworks (V3 expansion) | ADR-006 + `composer_replication.recipes.prime_rl` (recipe + composer_loss adapter + config.yaml) | ✅ Closed (recipe) / 🟡 Skeleton (runtime) |
+| Meta's PyTorch agentic stack | ADR-006 + `composer_replication.recipes.monarch` (actor layout doc + skeleton actors) | ✅ Closed (design) / 🟡 Skeleton (impl) |
+| Deeper self-distillation research | ADR-007 + `docs/research/SELF_DISTILLATION_LANDSCAPE.md` + `composer_replication.distillation` module (SimPO + TAID + Entropy-Aware OPD) + 17 unit tests | ✅ Closed (standalone losses) / 🟡 Deferred to Wave 14 (`compose_loss` kwargs not yet wired — Wave 13 review Finding 2) |
+| altered-minds tie-in | `docs/ALTERED_MINDS_TIE_IN.md` (5-phase plan, $300 estimate, open questions) | ✅ Closed (design) |
+**Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless, 3 of them multi-process + 9 replaysim).
+The framework now covers the full expanded brief. Total tests passing
+across the framework as of Wave 13: **107** (72 from prior waves + 35 new).

docs/V3_SUBSTRATE_COVERAGE.md CHANGED Viewed

@@ -151,12 +151,16 @@ even if it doesn't translate to code.
 |---|---|---|---|---|---|
 | TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
 | VeRL | ✅ | ✅ | 🟡 (skeleton) | — | v0.2 |
-| DiLoCo | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
 | OpenEnv | ✅ | ✅ | n/a (protocol) | — | substrate |
-| Monarch | ✅ | ✅ (reference) | n/a | — | future option |
 | TorchForge | ✅ | n/a (paused) | n/a | — | n/a |
-**6/6 substrates covered.** Code-bearing integrations (TRL, VeRL, DiLoCo)
-have working extension points. Reference substrates (OpenEnv, Monarch,
-TorchForge) are documented as research outputs, which matches the brief's
-"research...how we could try to set this up" framing.

 |---|---|---|---|---|---|
 | TRL | ✅ | ✅ | ✅ | 38 + 9 + 3 = 50 | ✅ |
 | VeRL | ✅ | ✅ | 🟡 (skeleton) | — | v0.2 |
+| **PRIME-RL** (Wave 13) | ✅ | ✅ | 🟡 (loss adapter + config) | — | v0.2 (cleanest hook) |
+| DiLoCo (single-process) | ✅ | ✅ | ✅ | 5 (single-replica) | optional |
+| **DiLoCo over serverless** (Wave 13) | ✅ | ✅ ADR-005 | ✅ Local + 🟡 Modal/HFJobs | 9 (3 multi-process) | ✅ (local) / future (cloud) |
 | OpenEnv | ✅ | ✅ | n/a (protocol) | — | substrate |
+| **Monarch** (Wave 13) | ✅ | ✅ (actor layout) | 🟡 (skeleton) | — | v0.2+ |
 | TorchForge | ✅ | n/a (paused) | n/a | — | n/a |
+**8/8 substrates covered** (was 6/6 pre-Wave-13). New in Wave 13:
+PRIME-RL (the cleanest custom-loss hook), Monarch (Meta's actively-shipped
+agentic-stack component), and serverless DiLoCo (object-store rendezvous
+with a working local executor; the Modal/HF Jobs adapters are skeletons).
+The design lets the framework realize Decoupled DiLoCo across cloud
+executors **without any cross-job NCCL**; see ADR-005 for the rationale.

docs/adrs/ADR-004-replaysim-normalization.md ADDED Viewed

	@@ -0,0 +1,124 @@

+# ADR-004 — Replaysim normalization layer for the trace-replay channel
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: 13 (deep work loop, expansion phase)
+## Context
+The brief's V5 clause says:
+> use traces from an llm-application usage then replay the traces with
+> different models to see at each llm-step what the llm would do. by doing
+> this we get distillation data from any number of models that could be
+> used to train the target model further
+The user added 2026-05-26: *"see if we can leverage [a normalization
+library] to normalize the data while also making the replaysim dataset
+generation."*
+Currently the framework has `composer_replication.teacher_replay`:
+- `replay_trace()` — N-teacher OpenRouter replay, returns
+  `list[TeacherCallResult]`
+- `extract_dpo_pairs()` — converts teacher disagreement to `list[DPOPair]`
+This produces preference-pair training data, but with **zero normalization**:
+no dedup, no length filtering, no language detection, no quality
+filtering, no chat-template validation. The output is closer to "raw
+LLM API responses" than "training-ready dataset."
+For the replaysim to power downstream RL training (V6), the dataset needs
+to be production-quality. Hand-rolling that pipeline is a tax we'd rather
+not pay.
+## Options considered
+Audited five candidates in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:
+| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
+|---|---|---|---|---|---|---|
+| HuggingFace `datatrove` | MIT | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
+| Alibaba `data-juicer` | Apache-2 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | **Chosen** |
+| NVIDIA `nemo-curator` | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject — GPU-bound for the ops we need |
+| Argilla `distilabel` | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject — would replace teacher orchestration, not just normalize |
+| Databricks `lilac` | — | n/a | n/a | n/a | n/a | Reject — archived 2024-03 |
+## Decision
+**Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).**
+Reasons:
+1. **It's the only candidate with native multi-turn + DPO support in the
+   *normalization* op-graph.** Has `pair_preference_mapper`,
+   `dialog_intent_detection_mapper`, `dialog_topic_detection_mapper`,
+   etc. that operate on chat-format messages directly.
+2. **CPU-runnable for our op set.** The differentiating ops we need
+   (length filter, language ID, chat-template validation, dedup) all
+   work on CPU. We avoid the NeMo-Curator GPU dependency entirely.
+3. **Streaming-friendly.** Op graph is a DAG; we can pipe `replay_trace`
+   output into the graph during generation, not as a post-hoc pass. This
+   matters for cost discipline — bad teacher outputs get filtered before
+   contributing to OpenRouter spend on subsequent steps.
+4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can
+   be version-controlled. A user can swap normalization recipes without
+   touching framework code.
+## Consequences
+### Accepted
+- New module `composer_replication.replaysim` lifts the existing
+  `teacher_replay` logic out of the package's flat namespace and adds:
+  - `composer_replication.replaysim.normalize` — `DJNormalizer` adapter
+    that wraps `data-juicer` op graphs around `replay_trace` output
+  - `recipes/replaysim/default.yaml` — base normalization recipe (length
+    filter + chat-template validation + per-turn dedup)
+  - Optional `recipes/replaysim/with_disagreement_filter.yaml` — adds a
+    semantic-similarity filter that drops "false disagreements" where
+    teachers used different wording for the same answer
+- New optional dependency `[replaysim]` extra in `pyproject.toml`:
+  `pip install -e .[replaysim]` pulls `data-juicer`. Core install
+  doesn't require it.
+- The existing `replay_trace` and `extract_dpo_pairs` keep their
+  signatures. The normalizer is opt-in via a `normalizer=` kwarg on a
+  new `replay_and_normalize_trace` convenience function.
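The base recipe described above (length filter, chat-template validation, per-turn dedup) can be pictured in plain Python. This is only a sketch of the kind of pass the normalizer performs; it does not call `data-juicer`, and the record layout (`messages`, `chosen`, `rejected`), the function names, and the character thresholds are all assumptions made for illustration.

```python
from typing import Iterable

VALID_ROLES = {"system", "user", "assistant"}


def is_valid_chat(messages: list[dict]) -> bool:
    """Chat-template check: at least one turn, known roles, non-empty text."""
    if not messages:
        return False
    return all(
        m.get("role") in VALID_ROLES
        and isinstance(m.get("content"), str)
        and bool(m["content"].strip())
        for m in messages
    )


def normalize_pairs(
    pairs: Iterable[dict], min_chars: int = 1, max_chars: int = 8000
) -> list[dict]:
    """Apply chat validation, a length filter, and exact dedup, in that order."""
    seen: set[tuple[str, str, str]] = set()
    kept: list[dict] = []
    for pair in pairs:
        chosen = pair.get("chosen", "")
        rejected = pair.get("rejected", "")
        if not is_valid_chat(pair.get("messages", [])):
            continue  # malformed conversation
        if not all(min_chars <= len(t) <= max_chars for t in (chosen, rejected)):
            continue  # empty or over-long completion
        key = (repr(pair["messages"]), chosen, rejected)
        if key in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(key)
        kept.append(pair)
    return kept
```

A real recipe would express the same steps as YAML-configured ops so they can be swapped without touching code, which is the point of reason 4 in the Decision section.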
+### One-day spike before merge
+`pair_preference_mapper` in data-juicer might unconditionally re-synthesize
+the `rejected` text via an LLM call. We already have `rejected` from
+teacher disagreement and don't want to pay another API call. The recon
+flagged this — verify by reading the mapper's source, and if it's LLM-bound,
+substitute a plain validator that checks the field exists + isn't empty.
+If the spike fails (the mapper IS LLM-bound and isn't easily replaceable),
+fall back to writing a custom `DJOp` subclass that validates pre-existing
+DPO pairs without re-synthesis. ~50 LOC.
+### Rejected paths
+- **`datatrove`**: would have required hand-rolling all chat-template logic
+  on top of flat-text ops. Bigger ongoing maintenance cost than
+  data-juicer's native multi-turn support.
+- **`nemo-curator`**: GPU-mandatory ops mean we'd need to pay for GPU during
+  dataset generation (separate from the replay phase, which is already
+  GPU-free). Net cost increase for no quality win.
+- **`distilabel`**: too broad — its pipeline abstraction would replace our
+  `replay_trace` entirely. We'd lose direct OpenRouter cost control + the
+  audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.
+### Future work
+- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's
+  `altered-minds` workstream tie-in (per Wave 13 expansion)
+- v0.3: revisit if `distilabel` becomes more mature and the migration
+  cost vs ongoing-maintenance balance shifts
+## Source
+`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26
+subagent recon, primary-sourced from each repo's GitHub + DeepWiki).

docs/adrs/ADR-005-serverless-diloco.md ADDED Viewed

	@@ -0,0 +1,142 @@

+# ADR-005 — Decoupled DiLoCo over serverless training systems
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: 13
+## Context
+The brief's V2 clause says:
+> take that and combine it with diloco (decoupled, open, any variant of diloco)
+The user expanded 2026-05-26: *"Decoupled DiLoCo (so that we can leverage
+modal or huggingface-jobs or other serverless training systems). we need
+this both on the dataset generation and the RL orchestration side of
+things."*
+Spike 008 wrote `composer_replication.diloco.make_diloco_outer_loop`
+(wraps `torchft.local_sgd.DiLoCo`) but that's a single-process API. To
+realize "Decoupled DiLoCo across serverless executors" we need:
+1. An abstraction layer that lets the framework launch N replicas on
+   different serverless backends (Modal, HF Jobs, SageMaker, etc.) without
+   per-backend code in the trainer.
+2. A communication primitive that doesn't require inter-job NCCL/RDMA
+   (most serverless executors don't expose that, and DiLoCo doesn't need
+   it — sync happens once per ~500-1000 inner steps).
+## Options considered
+`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` audited 6 executors:
+| Executor | Inter-job network | Cold start | $/A100·hr | $/H100·hr |
+|---|---|---|---|---|
+| Modal | yes (cluster mode) | ~30s | $1.95 | $5.50 |
+| HuggingFace Jobs | no | ~60s | $4.18 | $9.50 |
+| AWS SageMaker training | yes (warm pools) | ~3-5min | ~$3.06 | ~$8.50 |
+| GCP Vertex AI | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
+| Azure ML | yes (cluster) | ~5-10min | ~$3.67 | ~$10 |
+| k8s + Volcano/KubeRay | yes (cluster IP) | ~30-90s | (BYO) | (BYO) |
+Most expose a "spin up a job, run a script" interface. Inter-job
+networking is less uniform: HF Jobs has none, and the executors that
+offer it require an explicit cluster mode (extra cost + config).
+## Decision
+**Adopt object-store rendezvous as the default DiLoCo communication
+primitive across all serverless executors.** Specifically:
+- `composer_replication.diloco.serverless` package
+- `class ServerlessExecutor(Protocol)` — uniform interface with
+  `launch_replicas / poll / stream_logs / cancel / collect /
+  backend_name / supports_inter_replica_network`
+- `class ObjectStoreAllReduce` — fsspec-backed pseudo-gradient exchange
+  using s3:// / gs:// / az:// / hf:// / file:// — single code path, swappable
+  bucket
+- v0 concrete adapters: `ModalExecutor` and `HFJobsExecutor`
+- v0.1+ adapters: `RunPodExecutor`, `SageMakerExecutor`, `K8sExecutor`
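As a rough picture of the uniform interface listed above, the sketch below writes the seven named members as a `typing.Protocol`. Only the member names come from this ADR; every signature, the `ReplicaHandle` type, the status strings, and the `InMemoryExecutor` toy backend are assumptions made for illustration, not the package's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Iterator, Protocol, runtime_checkable


@dataclass
class ReplicaHandle:
    """Hypothetical handle for one launched replica."""
    replica_id: int
    status: str = "running"  # assumed states: running | done | cancelled
    logs: list[str] = field(default_factory=list)


@runtime_checkable
class ServerlessExecutor(Protocol):
    backend_name: str
    supports_inter_replica_network: bool

    def launch_replicas(self, n: int, entrypoint: str) -> list[ReplicaHandle]: ...
    def poll(self, handle: ReplicaHandle) -> str: ...
    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
    def cancel(self, handle: ReplicaHandle) -> None: ...
    def collect(self, handle: ReplicaHandle) -> dict: ...


class InMemoryExecutor:
    """Toy backend that 'finishes' every replica instantly; interface demo only."""
    backend_name = "in-memory"
    supports_inter_replica_network = False

    def launch_replicas(self, n: int, entrypoint: str) -> list[ReplicaHandle]:
        return [ReplicaHandle(i, "done", [f"ran {entrypoint}"]) for i in range(n)]

    def poll(self, handle: ReplicaHandle) -> str:
        return handle.status

    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]:
        yield from handle.logs

    def cancel(self, handle: ReplicaHandle) -> None:
        handle.status = "cancelled"

    def collect(self, handle: ReplicaHandle) -> dict:
        return {"replica_id": handle.replica_id, "status": handle.status}
```

Because the Protocol is structural, a backend satisfies it by shape alone, which keeps per-backend code out of the trainer as item 1 of the Context section requires.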
+### Why object-store rendezvous (not NCCL across jobs)
+DiLoCo paper (arXiv:2311.08105) shows the outer-loop sync is **once per
+H = 500-1000 inner steps**, equivalent to ~10-30 minutes of wall-clock at
+typical post-training step rates. For a 1B-param model in bf16:
+- Pseudo-gradient size: ~2 GB per replica per outer round
+- Sync frequency: ~once per 30 minutes
+- Therefore: ~2 GB × N_replicas, every ~30 min, durably written to object
+  storage with a single `PutObject` per replica + `GetObject` per other
+  replica
+Even with N=8 replicas, that's 16 GB written + 14 GB × 8 = 112 GB read,
+i.e. 128 GB of total traffic per outer round. Spread over 30 minutes that
+is ~70 MB/s aggregate. **S3 handles this throughput without breaking a
+sweat**, and S3 cross-job reads cost ~$0.0001 per GET. Total
+inter-replica communication cost: ~$0.05 per outer round.
+**Negligible compared to GPU spend.**
+By contrast, cross-job NCCL would require:
+- Inter-job networking (mostly unavailable on serverless)
+- Sustained low-latency connections (vs. burst-IO once per 30min)
+- Backend-specific cluster mode (Modal-only on some platforms)
+Object-store rendezvous decouples the algorithm from the executor and
+matches DiLoCo's actual communication profile.
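The traffic estimate in this section is simple enough to recompute. The helper below is a back-of-envelope sketch under the same assumptions the text states (each replica writes its pseudo-gradient once and reads every other replica's once per round); the function name and returned keys are made up for illustration.

```python
def sync_traffic(
    params: float, bytes_per_param: int, n_replicas: int, interval_s: float
) -> dict[str, float]:
    """Per-outer-round object-store traffic for all-to-all pseudo-gradient exchange."""
    per_replica_gb = params * bytes_per_param / 1e9
    write_gb = per_replica_gb * n_replicas                    # one PutObject each
    read_gb = per_replica_gb * (n_replicas - 1) * n_replicas  # each reads N-1 peers
    total_gb = write_gb + read_gb
    return {
        "write_gb": write_gb,
        "read_gb": read_gb,
        "total_gb": total_gb,
        # Average rate if the traffic were spread evenly over the sync interval.
        "avg_mb_per_s": total_gb * 1000 / interval_s,
    }


# 1B parameters in bf16 (2 bytes), 8 replicas, one sync every 30 minutes.
traffic = sync_traffic(1e9, 2, 8, 30 * 60)
```

Read traffic grows quadratically with the replica count, so the all-to-all pattern is the term to watch if N grows well beyond 8.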
+### Why Modal + HF Jobs as the v0 executors
+- **Modal**: best dev velocity, sub-minute cold start, mature Python SDK,
+  user already has CLI configured. Gives us a fast iteration loop for the
+  serverless layer.
+- **HuggingFace Jobs**: zero acquisition cost (HF token already wired up),
+  brand-aligned with the framework's HF-native posture, ~$4.18/A100·hr.
+  Not the cheapest, but the right "default executor for HF users."
+These two cover the spectrum of "fast for development" + "natural HF
+integration." Other executors are documented and stubbed but not
+implemented in v0.
+## Consequences
+### Accepted
+- New package `composer_replication.diloco.serverless`:
+  - `executor.py` — `ServerlessExecutor` Protocol + base class
+  - `allreduce.py` — `ObjectStoreAllReduce` mockManager that drops into
+    `make_diloco_outer_loop` with no changes to the existing wrapper
+  - `modal.py` — `ModalExecutor` (~150 LOC)
+  - `hf_jobs.py` — `HFJobsExecutor` (~150 LOC)
+  - `replica_entrypoint.py` — the script each replica runs (loaded from
+    HF Datasets / object store)
+- New optional dependency `[serverless]` extra: `pip install -e .[serverless]`
+  pulls `fsspec`, `s3fs`, `huggingface_hub` (already a transitive dep), and
+  `modal-client` (only if user opts in to Modal).
+- Smoke test: local-only `file://` rendezvous between two Python
+  processes in `tests/test_serverless_local.py`. The fuller spike in
+  `spikes/009-decoupled-diloco/` is new and deferred (not part of this
+  wave's commit). Multi-cloud test is post-replication.
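To make the rendezvous idea concrete, here is a minimal stdlib-only sketch of a file-based mean allreduce with a polling barrier. It is not the package's `ObjectStoreAllReduce`: it uses local paths instead of fsspec, JSON lists instead of tensors, and threads instead of processes, and every name in it is hypothetical.

```python
import json
import os
import tempfile
import threading
import time
from pathlib import Path


def allreduce_mean(root: Path, round_idx: int, replica_id: int,
                   n_replicas: int, values: list[float],
                   timeout_s: float = 30.0) -> list[float]:
    """Publish this replica's vector, wait for all peers, return the element-wise mean."""
    round_dir = root / f"round_{round_idx:06d}"
    round_dir.mkdir(parents=True, exist_ok=True)

    # Write under a temp name, then rename, so peers never see a partial file.
    tmp = round_dir / f".tmp_{replica_id}"
    tmp.write_text(json.dumps(values))
    os.replace(tmp, round_dir / f"replica_{replica_id}.json")

    # Barrier: poll until every replica has published for this round.
    deadline = time.monotonic() + timeout_s
    while True:
        files = sorted(round_dir.glob("replica_*.json"))
        if len(files) == n_replicas:
            break
        if time.monotonic() > deadline:
            raise TimeoutError(f"round {round_idx}: {len(files)}/{n_replicas} present")
        time.sleep(0.01)

    vectors = [json.loads(f.read_text()) for f in files]
    return [sum(column) / n_replicas for column in zip(*vectors)]


def run_local_demo(n_replicas: int = 3) -> dict[int, list[list[float]]]:
    """Two consecutive rounds, with threads standing in for replica processes."""
    results: dict[int, list[list[float]]] = {}
    with tempfile.TemporaryDirectory() as tmp_dir:
        root = Path(tmp_dir)

        def replica(k: int) -> None:
            first = allreduce_mean(root, 0, k, n_replicas, [float(k)])
            second = allreduce_mean(root, 1, k, n_replicas, [100.0 * k])
            results[k] = [first, second]

        threads = [threading.Thread(target=replica, args=(k,)) for k in range(n_replicas)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return results
```

Keying each round by its own directory keeps consecutive rounds from colliding, so a fast replica can start round 1 while a slow one is still reading round 0.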
+### Open / deferred
+- **Real serverless smoke**: spinning up 2 Modal containers + S3 rendezvous
+  + verifying both converge. Deferred to a small-budget post-Wave-13 spike
+  ($2-5 estimated). Not blocking for the v0 packaging.
+- **HF Jobs API stability**: HF Jobs is a relatively new product. The
+  recon flagged "API may evolve through 2026"; we pin to a specific
+  `huggingface_hub` minor and bump deliberately.
+### Trade-offs explicitly accepted
+- We do NOT use Modal's cluster/RDMA mode in v0. That gives sub-second
+  cross-job NCCL but costs more and is Modal-only. Object-store rendezvous
+  is the right default; users on Modal who want faster sync can override.
+- We do NOT support job-internal multi-GPU training in this layer. The
+  serverless layer is for **inter-replica** sync; intra-replica training
+  uses the existing `make_diloco_outer_loop` (which itself can wrap
+  multi-GPU FSDP via torchft).
+## Source
+`docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md` (2026-05-26 subagent
+recon, primary-sourced from each provider's official docs + pricing pages).

docs/adrs/ADR-006-rl-frameworks.md ADDED Viewed

	@@ -0,0 +1,124 @@

+# ADR-006 — RL framework strategy: TRL + VeRL + PRIME-RL
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: 13
+## Context
+The brief's V3 clause names six substrates: **monarch, torchforge,
+openenv, VeRL, TRL** (plus DiLoCo). Cross-model review (Wave 11) flagged
+that V3 was thin on the RL-framework side: TRL has working code, VeRL has
+a config skeleton, and Monarch/TorchForge/OpenEnv are research-only.
+User's 2026-05-26 expansion: *"see if there are other frameworks that are
+more popular that we could try to use. meta's pytorch agentic stack
+components are something that I'd like to explore."*
+`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` audited:
+- 6 RL frameworks: OpenRLHF, PRIME-RL, NeMo-Aligner, Unsloth, LLaMA-Factory,
+  DeepSpeed-Chat
+- 4 Meta PyTorch stack components: Monarch, TorchTitan, TorchForge, torchchat
+## Options considered
+| Framework | License | GRPO/DAPO? | Custom-loss extension | Verdict |
+|---|---|---|---|---|
+| OpenRLHF | Apache-2 | ✅ DAPO | Fork `openrlhf/models/loss.py` + Trainer subclass (~400-600 LOC) | Strong but heavyweight |
+| **PRIME-RL** | **Apache-2** | **✅ GRPO + DAPO** | **First-class `CustomLossConfig` with `LossInputs` struct (~200-300 LOC)** | **Chosen** |
+| NeMo-Aligner | Apache-2 | ❌ no GRPO/DAPO | n/a | Reject |
+| Unsloth | Apache-2 | TRL patcher | Closed `unsloth_zoo` loss kernels — unhookable | Reject |
+| LLaMA-Factory | Apache-2 | ❌ delegates to EasyR1 | n/a | Reject |
+| DeepSpeed-Chat | Apache-2 | ❌ PPO+DPO only | feature-stale since 2023 | Reject |
+| Meta stack | License | Active? | Role |
+|---|---|---|---|
+| **Monarch** | **BSD-3** | **✅ v0.4.1 stable, v0.5 dev** | **Actor mesh — coordination layer for any SPMD trainer** |
+| TorchTitan | BSD-3 | ✅ active | Distributed-training stack (already a transitive dep of PRIME-RL) |
+| TorchForge | BSD-3 | ❌ paused | Patterns only, per repo banner |
+| torchchat | BSD-3 | active | Inference only — out of scope |
+## Decision
+**Add PRIME-RL as the third RL framework after TRL+VeRL, and Monarch as the
+agentic-stack coordination layer.**
+### Why PRIME-RL
+PRIME-RL ships a **first-class `CustomLossConfig` with an `import_path`**
+that lets us drop in a Python function returning a tensor. The config
+exposes a `LossInputs` struct with exactly the tensors we need:
+`trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`,
+`advantages`, `loss_mask`. This is **the cleanest possible extension
+point for a 3-channel loss**: no fork, no Trainer subclass, no
+monkey-patching.
+It also uses the `verifiers` env protocol (OpenEnv-compatible by design),
+so it slots into the framework's existing data path without translation.
+PRIME-RL was used to train INTELLECT-2 (32B QwQ); its predecessor
+framework, PRIME, trained INTELLECT-1 (10B base, 30 nodes). Both were
+production-tested on real distributed runs.
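The shape of such a custom loss can be sketched without PRIME-RL installed. In the block below, `LossInputs` is a stand-in dataclass that reuses only the five field names quoted above; the real struct's types and the exact callable signature PRIME-RL expects are not confirmed here. The loss itself (a clipped policy-gradient term plus a teacher-distillation term) is an illustrative combination, and it omits the preference-pair channel.

```python
import math
from dataclasses import dataclass


@dataclass
class LossInputs:
    """Stand-in with the field names from the text; NOT PRIME-RL's own class."""
    trainer_logprobs: list[float]    # log-prob of each sampled token, current policy
    inference_logprobs: list[float]  # same tokens under the policy that sampled them
    teacher_logprobs: list[float]    # same tokens under the teacher
    advantages: list[float]
    loss_mask: list[bool]


def composer_loss_sketch(inputs: LossInputs, distill_weight: float = 0.1,
                         clip: float = 0.2) -> float:
    """Masked mean of (clipped policy-gradient term + weighted distillation term)."""
    total, count = 0.0, 0
    rows = zip(inputs.trainer_logprobs, inputs.inference_logprobs,
               inputs.teacher_logprobs, inputs.advantages, inputs.loss_mask)
    for lp, sampled_lp, teacher_lp, adv, keep in rows:
        if not keep:
            continue
        ratio = math.exp(lp - sampled_lp)                  # importance ratio
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        policy_term = -min(ratio * adv, clipped * adv)     # PPO-style clipping
        distill_term = lp - teacher_lp  # per-token reverse-KL estimate on the sample
        total += policy_term + distill_weight * distill_term
        count += 1
    return total / max(count, 1)
```

The function takes one struct and returns one scalar, which is what makes an `import_path`-style hook sufficient: nothing in the trainer loop has to change.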
+### Why Monarch (not TorchForge or TorchTitan as a top-level)
+- **Monarch is what's actually shipping** from Meta's agentic stack. v0.4.1
+  is stable, v0.5 dev daily. BSD-3.
+- **TorchForge is paused** per its own repo banner. We document it
+  (research/03) but don't depend on it.
+- **TorchTitan is a transitive dep** of PRIME-RL already, so we get its
+  benefits without needing to build a direct integration. If we wanted a
+  TorchTitan-only path, it would be redundant with PRIME-RL.
+- **torchchat is inference-only** and doesn't fit the training-framework
+  conversation.
+Monarch's role in our stack: **the actor mesh that hosts
+trainer/generator/rewarder/judge actors**. PRIME-RL's three-actor split
+(trainer, generator, rewarder) maps naturally onto Monarch primitives.
+## Consequences
+### Accepted
+- `composer_replication/recipes/prime_rl/` directory:
+  - `prime_rl_recipe.md` — integration recipe (parallel to TRL Recipe A,
+    VeRL Recipe B)
+  - `composer_loss.py` — the 3-channel loss adapted to PRIME-RL's
+    `LossInputs` struct (~200-300 LOC)
+  - `prime_rl_config.yaml` — example PRIME-RL config wiring our loss in
+- `composer_replication/recipes/monarch/` directory:
+  - `monarch_actor_layout.md` — design doc for the actor mesh
+  - `actors.py` — placeholder Monarch actor definitions (skeleton only;
+    full integration is post-replication)
+- New optional dependencies in `pyproject.toml`:
+  - `[prime-rl]` extra: `prime-rl>=0.5`
+  - `[monarch]` extra: `monarch>=0.4.1`
+- `docs/V3_SUBSTRATE_COVERAGE.md` updated to reflect the new additions.
+### Three-recipe production matrix
+| User scenario | Recommended recipe |
+|---|---|
+| Quick start, single-cluster, ≤7B | TRL Recipe A |
+| Production multi-node, ≤32B | VeRL Recipe B |
+| Decentralized / DiLoCo-shape, any size | PRIME-RL recipe (NEW) |
+| Coordination-heavy multi-actor RL | Monarch + any of the above |
+### Trade-offs explicitly accepted
+- **Three RL frameworks is a maintenance burden.** We accept this because
+  no single one covers all the user scenarios above. The framework's
+  contribution is the 3-channel loss + the trace-replay channel, expressed
+  in three different framework idioms. Each recipe is ~200-300 LOC; total
+  triplication tax ~700 LOC vs. picking one framework.
+- **Monarch is BSD-3 not MIT.** The framework is MIT; users opting in to
+  Monarch take on its license. Documented in pyproject.toml's optional
+  extras.
+- **PRIME-RL's API may evolve.** The `LossInputs` struct is currently the
+  contract; if PRIME-RL stabilizes a different shape we'd need to bump.
+  Pin to v0.5.x in our optional extras.
+## Source
+`docs/research/RL_FRAMEWORKS_LANDSCAPE.md` (2026-05-26 subagent recon,
+primary-sourced from DeepWiki audits + GitHub repo READMEs + PyPI release
+metadata).

docs/adrs/ADR-007-self-distillation-losses.md ADDED Viewed

	@@ -0,0 +1,173 @@

+# ADR-007 — Self-distillation losses landscape and which to add
+**Status**: Accepted
+**Date**: 2026-05-26
+**Wave**: 13
+## Context
+The framework currently has **one** distillation loss: `generalized_jsd_loss`
+(verified port of `siyan-zhao/OPSD`, the kernel of SDPO arXiv:2601.20802 —
+Composer 2.5's "targeted RL with textual feedback").
+User's 2026-05-26 expansion: *"if we can properly document and research the
+self-distillation papers like SDPO OPDS and/or others that are related
+then we can take stuff from there to help level up our training framework."*
+`docs/research/SELF_DISTILLATION_LANDSCAPE.md` audited 8 candidate methods
+across primary sources (arXiv abstracts + verified GitHub repos):
+| Method | arXiv | License | Verdict |
+|---|---|---|---|
+| **SimPO** | **2405.14734** | **MIT, mature** | **Chosen — drop-in DPO replacement, no ref model** |
+| KTO | 2402.01306 | Apache-2 (in trl) | Optional — only if channel-3 moves to per-step binary |
+| Self-Rewarding LM | 2401.10020 | research | Reject — procedure not loss |
+| MiniLLM | 2306.08543 | MIT | Reject — same reverse-KL family as SDPO |
+| GKD | 2306.13649 | research | Already lifted (= our `generalized_jsd_loss`) |
+| DistiLLM | 2402.03898 | MIT | Reject — TAID dominates empirically |
+| **TAID** | **2501.16937** | **Apache-2, mature** | **Chosen — wraps existing JSD with annealed teacher** |
+| **Entropy-Aware OPD** | **ICLR 2026 Spotlight** | **(code release pending)** | **Chosen — token-wise gated forward/reverse KL** |
+## Decision
+**Add three composable self-distillation losses to the framework as a
+pluggable distillation module:**
+1. **SimPO** — reference-free DPO replacement for channel 3
+2. **TAID** — annealed teacher interpolation that wraps existing JSD/SDPO
+3. **Entropy-Aware OPD** — token-wise mixture of forward and reverse KL
+### Why these three (and not the others)
+#### SimPO (chosen)
+- **Reference-free DPO**: removes the ref-model VRAM cost (which is the
+  single biggest memory tax of standard DPO).
+- Uses average sequence log-prob with target margin γ instead of
+  ref-policy logits.
+- ~80 LOC. MIT licensed.
+- **Composes**: drop-in for channel 3 (`trace_replay_dpo`). Our DPO and
+  SimPO are interchangeable at the loss level — both consume `(chosen,
+  rejected)` pairs and emit a scalar. SimPO drops the ref logprobs from
+  the input dict.
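For reference, the SimPO objective on one preference pair is `-log σ(β·(avg_lp_chosen − avg_lp_rejected) − γ)`, where each `avg_lp` is the length-normalized sequence log-probability. The scalar sketch below follows that formula; it is not the module's `simpo_loss` (which presumably operates on batched tensors), and the `_sketch` name is chosen to avoid confusion with it.

```python
import math


def simpo_loss_sketch(chosen_avg_lp: float, rejected_avg_lp: float,
                      beta: float = 2.0, gamma: float = 1.0) -> float:
    """SimPO loss for one pair of length-normalized sequence log-probs."""
    margin = beta * (chosen_avg_lp - rejected_avg_lp) - gamma
    # -log(sigmoid(margin)) == softplus(-margin), written in a form that
    # does not overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No reference-model log-probs appear anywhere in the inputs, which is where the VRAM saving comes from.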
+#### TAID (chosen)
+- **Temporally Adaptive Interpolated Distillation**: wraps the existing JSD with a
+  schedule that interpolates between identity (student-only target) and
+  teacher target over training. The TAID paper argues this prevents mode
+  collapse on large-capacity-gap distillation.
+- ~150 LOC. Apache-2.
+- **Composes**: TAID *wraps* `generalized_jsd_loss`, doesn't replace it.
+  Our `compose_loss` gets a `taid_alpha` schedule kwarg; when 0 it's
+  pure SDPO, when scheduled it's TAID-SDPO.
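The interpolation can be illustrated on a single token's logits. In the sketch below the intermediate target is the softmax of a logit blend between the student (treated as a constant) and the teacher, and the student is pulled toward it. Two simplifications are assumptions on my part: a plain linear schedule stands in for the paper's adaptive update of the interpolation parameter, and KL is used where the framework would wrap its JSD. All function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def linear_t(step: int, total_steps: int, t_start: float = 0.0) -> float:
    """Interpolation parameter rising linearly from t_start to 1."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return t_start + (1.0 - t_start) * frac


def taid_token_loss(student_logits: list[float], teacher_logits: list[float],
                    t: float) -> float:
    """KL from the interpolated target to the student, for one token position."""
    student = softmax(student_logits)
    # In an autograd framework the student logits inside the blend are detached.
    blended = [(1.0 - t) * s + t * te for s, te in zip(student_logits, teacher_logits)]
    return kl(softmax(blended), student)
```

At `t = 0` the target equals the student and the loss vanishes; at `t = 1` it reduces to ordinary distillation against the teacher.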
+#### Entropy-Aware OPD (chosen, with caveat)
+- **Token-wise gated mixture** of forward and reverse KL based on
+  per-token teacher entropy. Directly fixes a documented failure mode of the
+  reverse-KL family (which SDPO/OPSD belongs to).
+- ICLR 2026 Spotlight. **Code release pending** as of 2026-05-26.
+- ~120 LOC.
+- **Composes**: also wraps `generalized_jsd_loss`, but with a per-token
+  weighting tensor instead of a global schedule.
+- **Caveat**: we'll vendor a clean-room implementation from the paper
+  pseudocode until the official code drops. License question: vendoring
+  from a paper's pseudocode is fair use; redistributing the official code
+  when it drops requires checking its license.
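One way to picture a token-wise gate is shown below. This is a guess at the general shape, not the paper's formula: the gate here is the teacher's entropy normalized by `log(vocab_size)`, and it is assumed that high-entropy teacher tokens lean on forward KL while confident ones lean on reverse KL. Both the gate function and the direction of the gating are assumptions to be checked against the paper.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def kl(p: list[float], q: list[float]) -> float:
    """KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy(p: list[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def entropy_gated_token_loss(student_logits: list[float],
                             teacher_logits: list[float]) -> float:
    """Mix forward and reverse KL for one token, weighted by teacher entropy."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    # Assumed gate: 0 for a one-hot teacher, 1 for a uniform teacher.
    gate = entropy(teacher) / math.log(len(teacher))
    forward = kl(teacher, student)  # mode-covering
    reverse = kl(student, teacher)  # mode-seeking
    return gate * forward + (1.0 - gate) * reverse
```

A sequence-level loss would average this over unmasked token positions, giving a per-token weighting rather than a global schedule.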
+### Why we explicitly reject the others
+- **GKD**: already lifted as `generalized_jsd_loss`. No additional value.
+- **DistiLLM**: skew-KL is in the same reverse-KL family. TAID dominates
+  it empirically per the TAID paper.
+- **MiniLLM**: same reverse-KL recipe as SDPO. We already have SDPO.
+- **Self-Rewarding LM**: a procedure (model judges its own outputs to
+  generate preference pairs), not a loss. If we want self-judging, that's
+  a separate spike on the trace-replay side — not a loss-channel addition.
+- **KTO**: only useful if the channel-3 shape moves from preference pairs
+  to per-step binary signals. Not currently in scope. Documented as a
+  fallback for future use.
+## Consequences
+### Accepted
+- New module `composer_replication.distillation`:
+  - `__init__.py` — re-exports the three new losses
+  - `simpo.py` — `simpo_loss(chosen_lp, rejected_lp, beta, gamma)` (~80 LOC)
+  - `taid.py` — `taid_loss(student_logits, teacher_logits, alpha,
+    schedule_step, total_steps, **jsd_kwargs)` (~150 LOC)
+  - `entropy_aware_opd.py` — `entropy_aware_opd_loss(student_logits,
+    teacher_logits, **jsd_kwargs)` (~120 LOC)
+  - `tests/test_distillation_losses.py` — 17 sanity tests (loss is finite,
+    differentiable, returns scalar, matches paper formulas at boundary
+    conditions)
+### Wave 14+ work — `compose_loss` integration is NOT in this wave
+An earlier draft of this ADR claimed `composer_replication.compose_loss`
+would receive new kwargs (`dpo_variant`, `sdpo_wrapper`, `taid_schedule_step`,
+`taid_total_steps`). **The Wave 13 cross-model review
+(docs/research/WAVE_13_FINAL_REVIEW.md Finding 2) flagged that those
+kwargs were never actually added to `compose_loss`** — the standalone
+losses landed but the integration into the framework's loss composition
+is not done. To stay honest:
+- **What works in Wave 13**: `from composer_replication.distillation
+  import simpo_loss, taid_loss, entropy_aware_opd_loss` — all three are
+  importable, type-checked, unit-tested, and ready to be called directly.
+- **What does NOT work in Wave 13**: passing
+  `compose_loss(model, batch, dpo_variant="simpo", sdpo_wrapper="taid", ...)`.
+  That call signature does not exist; it would raise `TypeError`.
+- **Wave 14 plan**: add the four kwargs to `compose_loss` with a small
+  integration test exercising at least one combination (SDPO+TAID + plain
+  DPO would suffice). Estimated ~30 LOC + 2-3 tests.
+Users wanting the new losses *now* should use them as standalone
+functions in their own loss-composition code:
+```python
+from composer_replication.distillation import simpo_loss, taid_loss
+# Drop-in DPO replacement:
+ch3 = simpo_loss(chosen_avg_lp, rejected_avg_lp, beta=2.0, gamma=1.0)
+# TAID-wrapped SDPO (channel 2):
+ch2 = taid_loss(
+    student_logits, teacher_logits, student_init_logits,
+    schedule_step=trainer.state.step, total_steps=trainer.state.max_steps,
+)
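+# `alpha` and `beta` are the channel-2 and channel-3 mixing weights
+# (not SimPO's `beta` above); `grpo_loss` is the channel-1 loss.
+total = grpo_loss + alpha * ch2 + beta * ch3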
+```
+This is identical to what the integrated path would do — the integration
+is a convenience kwarg layer, not a different algorithm.
+### `pyproject.toml` impact
+No new deps — these are pure PyTorch losses on top of existing tensors.
+### Trade-offs
+- **Combinatorial complexity**: with three options for channel 2 and two
+  options for channel 3, we have 6 distillation variants. We accept this
+  because:
+  - Defaults are sane (`dpo_variant="dpo"`, `sdpo_wrapper="none"`)
+  - Each variant is independently unit-tested
+  - Users opt into combinations explicitly
+- **Entropy-Aware OPD is pre-code-release**: we vendor from paper
+  pseudocode. Risk: our implementation might disagree with the official
+  release. Mitigation: clean-room note in the source file; bump pin
+  if/when official code drops.
+### Future work
+- v0.2: research **direct preference fine-tuning** variants (DRO, PRO,
+  IPO) that might replace channel 3 entirely. These are off the chosen
+  axis but might dominate.
+- v0.3: integrate the three new losses with PRIME-RL's `CustomLossConfig`
+  (per ADR-006) so users can mix-and-match across frameworks.
+## Source
+`docs/research/SELF_DISTILLATION_LANDSCAPE.md` (2026-05-26 subagent recon,
+primary-sourced from arXiv + GitHub READMEs).

docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md ADDED Viewed

	@@ -0,0 +1,791 @@

+# DiLoCo Serverless Executor Reconnaissance
+**Status:** Reconnaissance complete (feeds ADR-005).
+**Audience:** ADR-005 author + framework integrator wiring `composer_replication.diloco.serverless` against real backends.
+**Scope:** Decoupled DiLoCo across N independently-scheduled serverless GPU jobs. NOT a generic "serverless training" survey.
+**Date:** 2026-05-26.
+---
+## TL;DR
+| Executor | Inter-job net? | Cold start | $/A100·hr (1×) | $/H100·hr (1×) | Max parallel jobs | Ranking for Decoupled DiLoCo |
+|---|---|---|---|---|---|---|
+| **Modal** | ✅ `i6pn` + `@modal.experimental.clustered` (50 Gbps + RDMA up to 3.2 Tbps); also same-workspace TCP via shared `Dict`/`Queue` | ~1–10 s warm-boot; ≤90 s incl. image pull on first run | A100-40GB: $2.10; A100-80GB: $2.50 | H100: $3.95 | Workspace quota; Starter ≤10 GPU containers, Team much higher (contact) | **★★★★★** primary adapter |
+| **HF Jobs** | ❌ No documented inter-job networking. Workaround: object store (HF Hub bucket / dataset / S3) | "starting" → "running" billed; per-min granularity; typical scheduling 10–60 s | A100-80GB: $2.50 (`a100-large`); 4×: $10.00; 8×: $20.00 | H200: $5.00; 8×H200: $40.00 (no H100 SKU) | Pro/Team/Enterprise quota; not publicly capped per-run (parallel via SDK loop) | **★★★★☆** secondary adapter; pseudo-grad via Hub bucket Volume |
+| **AWS SageMaker Training Jobs** | ✅ Inside one *job's* multi-instance cluster (EFA/SMDDP). ❌ Across separate `CreateTrainingJob` invocations — same workaround as HF | Image pull + EBS attach: typically 2–5 min cold; warm pools cut to ~10 s for ≤60 min | ml.p4d.24xlarge ≈ $32.77/hr (8×A100-40GB) ≈ $4.10/A100·hr | ml.p5.48xlarge ≈ $98.32/hr ≈ $12.29/H100·hr | Account quota (typical 4–20 instances; raise via Service Quotas) | **★★★☆☆** good for one big "fragment"; clunky as N-replicas-of-1-GPU |
+| **GCP Vertex AI Custom Jobs** | ✅ Inside one CustomJob's worker pools (gRPC/MPI). ❌ Across separate jobs — same workaround | 2–6 min typical cold | a2-highgpu-1g (1×A100-40GB) ≈ $3.67/hr (incl. Vertex training premium ~30–50%) | a3-highgpu-8g ≈ $88/hr ≈ $11/H100·hr | Per-region GPU quota | **★★☆☆☆** highest premium per GPU; useful as 3rd region |
+| **Azure ML Command Jobs** | ✅ within `instance_count>1` (InfiniBand on `ND*`-series). ❌ across jobs — same workaround | 3–8 min typical cold (image cache → curated env helps) | NC24ads_A100_v4 (1×A100-80GB): ~$3.67/hr (PAYG list) | ND96isr_H100_v5 (8×H100): ~$98/hr ≈ $12.25/H100·hr | Per-region quota, surcharge $0/core (only VM+disk) | **★★☆☆☆** like Vertex; useful only if user already lives in Azure |
+| **k8s + Volcano / KubeRay** | ✅ if cluster networked. Volcano gang-schedules `RayJob`/MPIJob; pods see each other on cluster network | Pod schedule: seconds–minutes (image cache, GPU availability) | Whatever the underlying cluster pays (e.g. spot A100 ~$1–2/hr on RunPod / Lambda / OCI K8s) | Same | Cluster capacity | **★★★★☆** best price/perf if user owns/leases a cluster; ops cost nontrivial |
+| **RunPod (honourable mention)** | ✅ same DC; no documented federation | seconds | ~$1.19/hr A100-80GB community, ~$2.17/hr secure | ~$1.99/hr H100 community, ~$4.18/hr secure | Account quota | **★★★☆☆** — not in the candidate list but a strong third adapter for cost |
+The Decoupled DiLoCo framing removes the "must have inter-job allreduce" requirement: per the original DiLoCo paper (arXiv:2311.08105 §3.2), pseudo-gradients are exchanged only **once every H = 500–1000 inner steps**. Each exchange moves one model-sized tensor per replica (about 2 GB for a 1B-parameter model in bf16; see §1), so the payload is not tiny, but it is amortised over many minutes of compute. **Neither NCCL-class bandwidth nor low latency is needed; the only hard requirement is "all N replicas can read & write a shared blob store."** That makes object-storage-based pseudo-gradient exchange the *correct* default, and the Modal `clustered`-style RDMA fabric a *bonus* you can opt into when a single executor runs ≥2 replicas in the same region.
+**Recommendation: ship the framework with two adapters, `ModalExecutor` and `HFJobsExecutor`, both implementing the same `ServerlessExecutor` interface (§3.1) and both using object-store pseudo-grad exchange by default. Add a third adapter (`RunPodExecutor` or `K8sExecutor`) when a user needs it.**
+---
+## 1. Why Decoupled DiLoCo over the network is *easy*
+From DiLoCo (Douillard et al., *DiLoCo: Distributed Low-Communication Training of Language Models*, arXiv:2311.08105):
+- **Setup.** N "workers" each train a full local copy of the model with an inner optimizer (AdamW, LR 4e-4, etc.) on disjoint shards of data.
+- **Outer round (every H=500 steps in the paper, often 1000 in follow-ups).**  Each worker computes its **pseudo-gradient** `δ_k = θ_initial − θ_local` (the negative of its accumulated local update). The N workers all-reduce the pseudo-gradient, average it, and the outer optimizer (Nesterov SGD, lr=0.7, momentum=0.9) applies it to `θ_initial` to produce `θ_initial^(t+1)`. Workers reset to that.
+- **Communication budget per round.** One full-model parameter tensor per worker (fp32, fp16, or bf16). For a 1B model in bf16, that's ~2 GB per worker per round. For Streaming DiLoCo (Douillard et al. 2025) the communication is sliced into fragments and overlapped with compute, but the *aggregate* volume exchanged over H inner steps is the same (before any quantization).
+- **Communication frequency.** Once per H=500–1000 inner steps. With one inner step ≈ 1–3 s on a single A100/H100 for a 7B model, that's one outer round every **~10–30 minutes wall-clock**.
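The outer round described in the list above can be sketched in a few lines. This is a minimal illustration on flat Python lists rather than real model tensors; the function name `outer_step` is hypothetical, and the Nesterov update follows the PyTorch `SGD(nesterov=True)` convention, which is an assumption about how the outer optimizer is implemented.

```python
from typing import List, Tuple


def outer_step(theta: List[float],
               local_thetas: List[List[float]],
               momentum_buf: List[float],
               lr: float = 0.7,
               mu: float = 0.9) -> Tuple[List[float], List[float]]:
    """One DiLoCo outer round on flat parameter lists (illustrative only)."""
    n = len(local_thetas)
    # Pseudo-gradient per worker is delta_k = theta_initial - theta_local_k;
    # the "allreduce" is just the mean of those deltas across workers.
    avg_delta = [
        sum(t0 - local[j] for local in local_thetas) / n
        for j, t0 in enumerate(theta)
    ]
    # Nesterov momentum (PyTorch convention, assumed here):
    #   buf <- mu * buf + g ;  step <- g + mu * buf
    new_buf = [mu * b + g for b, g in zip(momentum_buf, avg_delta)]
    new_theta = [t0 - lr * (g + mu * b)
                 for t0, g, b in zip(theta, avg_delta, new_buf)]
    return new_theta, new_buf


# Two workers that drifted from theta_initial = 1.0 to 0.9 and 0.7.
theta, buf = outer_step([1.0], [[0.9], [0.7]], [0.0])
```

Every worker then resets its local copy to the returned `theta` before starting the next H inner steps.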
+The implication: **the outer-loop "allreduce" is a one-shot upload+download of a few GB (≈2 GB for 1B parameters, ≈14 GB for 7B, both in bf16) every 10+ minutes.** It does not need NCCL. It does not need RDMA. It does not even need a direct TCP connection between the replicas. **An S3 `PutObject` followed by N `GetObject`s is sufficient.**  Cross-region transfer at 1 Gbps moves 2 GB in ~17 s; even at 100 Mbps it's ~3 min, small compared to the H=500 inner-step interval. This is the key insight that makes "Modal + HuggingFace Jobs as DiLoCo replicas" a sensible architecture rather than a hack.
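The transfer-time arithmetic above is easy to reproduce for other model sizes and link speeds. The sketch below assumes decimal units (1 GB = 1e9 bytes, 1 Gbps = 1e9 bits/s) and ignores protocol overhead and object-store request latency; both helper names are hypothetical.

```python
def payload_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of one pseudo-gradient upload (bf16/fp16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param


def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Ideal wire time for num_bytes over a link of link_gbps gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


for params in (1e9, 7e9):
    size = payload_bytes(params)
    print(f"{params / 1e9:.0f}B params: {size / 1e9:.0f} GB, "
          f"{transfer_seconds(size, 1.0):.0f} s at 1 Gbps, "
          f"{transfer_seconds(size, 0.1) / 60:.1f} min at 100 Mbps")
```

For a 1B-parameter model this gives about 16 s at 1 Gbps, in line with the ~17 s quoted above. At 7B parameters the same link needs roughly 112 s, so the slow-link case deserves a check against the chosen `sync_every` before committing to a cross-region layout.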
+We codify this in the framework with two communication backends:
+1. **`InProcessAllReduce`** — what `composer_replication.diloco` already uses (torchft `Manager` mock). For unit tests and same-process/same-host runs.
+2. **`ObjectStoreAllReduce`** — barriers + pseudo-grad averaging via S3/GCS/HF Hub bucket. New code for ADR-005. Expected per-round overhead 20–60 s for a 7B model — already amortised over 10–30 min of compute.
+The torchft `Manager` interface (used by `torchft.local_sgd.DiLoCo`) only requires `.allreduce(tensor) → Work`, `.should_commit()`, `.start_quorum()`, `.current_step()`. We implement `.allreduce` on top of object storage. Done.
+---
+## 2. Per-executor audit
+### 2.1 Modal — primary adapter
+**Inter-job networking.** Yes, in two flavours.
+- **`@modal.experimental.clustered(size=N, rdma=True)`**: gang-schedules N containers in the *same* Modal cluster, gives them i6pn IPv6 addresses, and (with `rdma=True`) provisions InfiniBand RoCE up to 3,200 Gbps for inter-node communication. ([modal.com/docs/guide/multi-node-training](https://modal.com/docs/guide/multi-node-training)). This is the right primitive for a *single-executor* multi-replica DiLoCo where all N replicas live on Modal.
+- **i6pn private network** ([modal.com/docs/guide/private-networking](https://modal.com/docs/guide/private-networking)): any two `@app.function(i6pn=True)` containers in the same workspace+region can address each other over a 50 Gbps IPv6 fabric. Region-scoped — Modal documents that "i6pn networking is region-scoped functionality."
+**Cross-executor:** for the *cross-cloud* Decoupled DiLoCo case (Modal + HF + …), Modal containers reach out to S3/HF Hub/GCS like any other internet-connected workload. No Modal-specific magic needed.
+**Cold start.** Modal's container infra warm-boots in ~1 s for a cached image; first-run pulls of a large PyTorch image dominate (30–90 s). HF model download adds 15–45 s for a 7B model from cold (cache on a `modal.Volume` after run 1). See `MODAL_RECONNAISSANCE.md` §1.3 in this repo for the same numbers from a different audit angle. Realistic per-run cold: **~60–120 s** on first launch, ~10–30 s on subsequent launches with warm image cache.
+**$/GPU·hr (from <https://modal.com/pricing>, on-demand, base region, preemptible default).**
+| GPU | Modal `gpu=` string | $/sec | $/hour |
+|---|---|---|---|
+| A100-40GB | `"A100-40GB"` | 0.000583 | **$2.099** |
+| A100-80GB | `"A100-80GB"` | 0.000694 | **$2.498** |
+| H100 (pinned) | `"H100!"` | 0.001097 | **$3.949** |
+| H200 | `"H200"` | (see pricing page) | ~$4.5–5/hr per the published table |
+| B200 | `"B200"` | — | ~$6/hr per the published table |
+**Multipliers from same pricing page:** region pinning 1.5–1.75×, non-preemptible 3×. Default is preemptible — for DiLoCo this is *fine*: a preempted replica retries, the outer loop tolerates an absent-this-round member by simply averaging over the survivors.
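The "average over the survivors" behaviour can be sketched as follows. This is an illustration on plain lists, not the framework's implementation; `average_survivors` is a hypothetical helper, and a preempted replica is modelled as `None`.

```python
from typing import Dict, List, Optional


def average_survivors(pseudo_grads: Dict[int, Optional[List[float]]]) -> List[float]:
    """Mean pseudo-gradient over replicas that reported this round.

    A value of None marks a replica that was preempted before uploading.
    """
    survivors = [g for g in pseudo_grads.values() if g is not None]
    if not survivors:
        raise RuntimeError("no replica reported a pseudo-gradient this round")
    n = len(survivors)
    # zip(*survivors) walks the parameters position by position.
    return [sum(column) / n for column in zip(*survivors)]


# Replica 1 was preempted, so the mean is taken over replicas 0 and 2 only.
round_grads = {0: [0.0, 2.0], 1: None, 2: [2.0, 4.0]}
avg = average_survivors(round_grads)
```

Dividing by the number of survivors rather than the nominal replica count keeps the outer step size unchanged when a member is missing; whether that is the right choice for a given run is a design decision this sketch does not settle.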
+**Max concurrent jobs.** Modal documents "default limits on Modal free tier" of 10 GPU containers in [the Blender example](https://modal.com/docs/examples/blender_video) (`max_containers=10 if WITH_GPU else 100`). Paid plans scale far higher. From May 31, 2026, clustered functions (`@clustered`) require 8 GPUs per node and are capped at "up to 64 devices" per cluster. In practice, the Starter plan is limiting for 8 single-A100 replicas of Decoupled DiLoCo; a Team plan with ≥10 paid GPU containers handles them. Contact Modal support for >64-GPU clusters.
+**Verified API for spinning up N parallel jobs** (verified pattern from `modal-examples` and Modal docs):
+```python
+# composer_replication/diloco/serverless/_modal_adapter.py
+import modal
+app = modal.App("diloco-replicas")
+image = (
+    modal.Image.debian_slim(python_version="3.11")
+    .uv_pip_install("torch", "transformers", "torchft-nightly")
+    .add_local_python_source("composer_replication")
+)
+@app.function(image=image, gpu="A100-40GB", timeout=60 * 60 * 24)
+def run_inner_loop(replica_id: int, rendezvous_uri: str, config: dict):
+    """One DiLoCo replica. Trains for `sync_every` inner steps, then participates
+    in one outer-round pseudo-gradient exchange via the rendezvous_uri (S3 path),
+    and repeats for `outer_rounds` rounds."""
+    from composer_replication.diloco.serverless import run_replica
+    return run_replica(replica_id=replica_id,
+                       rendezvous_uri=rendezvous_uri,
+                       **config)
+@app.local_entrypoint()
+def main(num_replicas: int = 4):
+    rendezvous_uri = "s3://my-bucket/diloco-run-2026-05-26/"
+    config = {"model": "Qwen/Qwen2.5-7B", "outer_rounds": 100, "sync_every": 500}
+    # .map / .starmap fans out N parallel container invocations.
+    args = [(i, rendezvous_uri, config) for i in range(num_replicas)]
+    results = list(run_inner_loop.starmap(args))
+    print(f"All {num_replicas} replicas completed: {results}")
+```
+For the *single-executor RDMA* case (all N on Modal in one region, max throughput):
+```python
+@app.function(image=image, gpu="H100:8", timeout=60 * 60 * 24)
+@modal.experimental.clustered(size=4, rdma=True)
+def diloco_cluster_train(rendezvous_uri: str, config: dict):
+    # Import inside the function so it resolves in the container, as in run_inner_loop above.
+    from composer_replication.diloco.serverless import run_replica
+    info = modal.experimental.get_cluster_info()
+    # info.rank is our DiLoCo replica id; info.container_ips[0] is rank-0.
+    return run_replica(replica_id=info.rank, rendezvous_uri=rendezvous_uri, **config)
+```
+**Right abstraction layer for the framework.** Modal Functions map to **one DiLoCo replica each**. The local entrypoint (or our `Executor.launch_replicas()`) does `.starmap` to fan out N. Inter-replica state lives in S3 (default) or in Modal-side `modal.Dict` / `modal.Queue` (faster, same-workspace only). The `@clustered` decorator is *not* required for Decoupled DiLoCo — it's an opt-in optimization for when you want one Modal cluster to be your whole training run.
+**Rough $-per-replica-hour for an A100-40GB single-replica Modal run** (no clustering): 1 × $2.099 + ~$0.05 CPU/RAM overhead + ~$0.005 networking ≈ **$2.15/hr/replica**.
+### 2.2 HuggingFace Jobs — secondary adapter
+**Inter-job networking.** **No documented inter-job networking primitive.** HF Jobs is a Docker-Image-+-command service ([huggingface.co/docs/hub/en/jobs](https://huggingface.co/docs/hub/en/jobs)) modelled after `docker run`. There is no "address my peer job" API. Each job runs in its own pod with internet egress only; HF does not advertise a private VPC network.
+**Workaround (the right one for DiLoCo).** HF Jobs supports **`Volume` mounts** of HF Hub repos and HF storage buckets ([huggingface.co/docs/huggingface_hub/en/guides/jobs](https://huggingface.co/docs/huggingface_hub/en/guides/jobs)):
+```python
+from huggingface_hub import run_job, Volume
+checkpoints_bucket = Volume(type="bucket", source="myorg/diloco-rendezvous", mount_path="/rendezvous")
+job = run_job(image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
+              command=["python", "/code/run_replica.py", "--replica-id", "0"],
+              flavor="a100-large",
+              timeout="6h",
+              volumes=[checkpoints_bucket])
+```
+The `bucket` volume is read+write by default — perfect for object-store-based pseudo-gradient exchange. This is *exactly* the same workaround we'd apply to SageMaker, Vertex AI, Azure ML — but on HF it's first-class because `Volume(type="bucket", ...)` is built into the API.
+**Cold start.** HF docs say "billing only when starting or running" — no charge during build. Empirically (per the HF quickstart logs), `hf jobs uv run` reports a state transition `created → starting → running` typically in **10–60 s** for a cached image, longer for first-pull of a large CUDA image. The default timeout is 30 minutes; use `timeout="6h"` or similar for DiLoCo.
+**$/GPU·hr (from <https://huggingface.co/docs/hub/jobs-pricing>; per-minute billing).**
+| Hardware flavor | Hourly | $/A100·hr | $/H100/H200·hr |
+|---|---|---|---|
+| `a100-large` (1× A100 80GB) | **$2.50** | $2.50 | — |
+| `4xa100-large` (4× A100 80GB) | $10.00 | $2.50 | — |
+| `8xa100-large` (8× A100 80GB) | $20.00 | $2.50 | — |
+| `h200` (1× H200 141GB) | $5.00 | — | $5.00 (H200, not H100) |
+| `4xh200` | $20.00 | — | $5.00 |
+| `8xh200` | $40.00 | — | $5.00 |
+| `l40sx1` | $1.80 | — | — |
+| `a10g-large` | $1.50 | — | — |
+| `t4-small` | $0.40 | — | — |
+**No H100 SKU is published** as of this writing: HF jumps from A100 to H200. Treat HF's "$5/hr H200" as the H100-equivalent line item.
+**Max concurrent jobs.** HF documents "Jobs are available to any user or organization with a positive credit balance" but doesn't publish a per-account concurrency cap. The Python SDK pattern in their docs:
+```python
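+# Verified: direct from huggingface.co/docs/huggingface_hub/en/guides/jobs
+# (imports added here; `image` and `commands` are supplied by the caller)
+import time
+from huggingface_hub import inspect_job, run_job
+jobs = [run_job(image=image, command=command) for command in commands]
+for job in jobs:
+    while inspect_job(job_id=job.id).status.stage not in ("COMPLETED", "ERROR"):
+        time.sleep(10)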
+```
+…clearly assumes a "spawn N, poll N" model. Empirically, Pro accounts can run several jobs in parallel; Enterprise plans are higher.
+**Verified API for spinning up N parallel jobs:**
+```python
+# composer_replication/diloco/serverless/_hf_jobs_adapter.py
+from huggingface_hub import run_job, run_uv_job, inspect_job, fetch_job_logs, Volume
+def spawn_diloco_replica(replica_id: int, num_replicas: int, rendezvous_repo: str):
+    return run_job(
+        image="pytorch/pytorch:2.6.0-cuda12.4-cudnn9-devel",
+        command=["python", "-m", "composer_replication.diloco.serverless.replica_entrypoint",
+                 "--replica-id", str(replica_id),
+                 "--num-replicas", str(num_replicas),
+                 "--rendezvous-uri", "/rendezvous"],
+        flavor="a100-large",
+        timeout="12h",
+        env={"HF_HUB_ENABLE_HF_TRANSFER": "1"},
+        secrets={"HF_TOKEN": "<token>"},
+        volumes=[Volume(type="bucket", source=rendezvous_repo, mount_path="/rendezvous")],
+    )
+def spawn_n(num_replicas: int, rendezvous_repo: str = "myorg/diloco-rendezvous-2026-05-26"):
+    jobs = [spawn_diloco_replica(i, num_replicas, rendezvous_repo) for i in range(num_replicas)]
+    return jobs  # list[JobInfo]
+```
+The `Volume(type="bucket", ...)` is the secret weapon. Each replica writes its pseudo-gradient to a unique key under `/rendezvous/round-{t}/replica-{i}.pt`, then waits on a barrier file (busy-loop on `os.path.exists` with sleeps). The leader rank averages and writes `/rendezvous/round-{t}/avg.pt`. Standard object-store DiLoCo pattern.
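The write-then-barrier pattern on a mounted path can be sketched with the standard library alone. The helper names below are hypothetical. The write-to-temp-then-rename step is atomic on local POSIX filesystems; whether a bucket-backed mount offers the same guarantee is an assumption that should be checked before relying on it.

```python
import os
import time
from pathlib import Path
from typing import List


def publish(path: Path, payload: bytes) -> None:
    """Write payload so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(payload)
    os.replace(tmp, path)  # atomic rename on local POSIX filesystems


def wait_for_replicas(round_dir: Path, num_replicas: int,
                      poll_s: float = 2.0, timeout_s: float = 600.0) -> List[Path]:
    """Block until num_replicas files named replica-*.pt exist in round_dir."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # The glob ignores in-flight "*.pt.tmp" files written by publish().
        ready = sorted(round_dir.glob("replica-*.pt"))
        if len(ready) >= num_replicas:
            return ready
        time.sleep(poll_s)
    raise TimeoutError(f"barrier timeout waiting on {round_dir}")
```

A replica would call `publish(Path("/rendezvous/round-0/replica-3.pt"), data)` and then `wait_for_replicas(Path("/rendezvous/round-0"), num_replicas)`; the leader can use the same pair to publish `avg.pt`.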
+**Right abstraction.** Same as Modal: one `run_job` = one DiLoCo replica. Fan-out via list comprehension. No special multi-node primitive — and we don't need one for Decoupled DiLoCo.
+### 2.3 AWS SageMaker Training Jobs
+**Inter-job networking.** SageMaker has *intra-job* multi-node networking (`InstanceCount > 1` provisions a single EFA-connected cluster, suitable for SMDDP `AllReduce` with `pytorchddp` or `torch_distributed` launchers; see [docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-framework-estimator.html](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-framework-estimator.html)). It does **not** have *inter-job* networking: two separate `CreateTrainingJob` calls produce two isolated clusters (unless you wire a shared customer VPC, which is non-trivial and which Decoupled DiLoCo doesn't benefit from anyway).
+**Workaround.** S3. Each SageMaker training job has read+write access to S3 by default (via the IAM role passed to `CreateTrainingJob`). Pseudo-gradient exchange via `s3://bucket/diloco-run/round-{t}/replica-{i}.pt` is straightforward.
+**Cold start.** SageMaker docs and the cost-optimization blog post acknowledge five phases: Starting, Downloading, Training, Uploading, Completed. The Starting+Downloading phases are the cold start and **typically take 2–5 minutes**: image pull from ECR, EBS volume attach, `boto3` IAM role fetch, container init. **Warm pools** ([docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html](https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html)) cut subsequent matching jobs to ~10 s by retaining the cluster up to `KeepAlivePeriodInSeconds` (max 3600 s = 60 min) — *but matching requires identical RoleArn/InstanceType/InstanceCount/VpcConfig*, so warm pools work for "rerun the same DiLoCo replica config" but not for heterogeneous fleets.
+**$/GPU·hr (from [aws.amazon.com/sagemaker/ai/pricing/](https://aws.amazon.com/sagemaker/ai/pricing/), training tab, US East regions; per-second billing).** SageMaker training instances carry a ~20–25% premium over raw EC2 because the service includes managed orchestration. Pricing varies by region; representative US East values:
+| Instance | GPUs | $/hr (training) | $/GPU·hr |
+|---|---|---|---|
+| ml.p4d.24xlarge | 8× A100-40GB | ≈ $32.77 | ≈ **$4.10/A100·hr** |
+| ml.p4de.24xlarge | 8× A100-80GB | ≈ $40.97 | ≈ $5.12/A100·hr |
+| ml.p5.48xlarge | 8× H100-80GB | ≈ $98.32 | ≈ **$12.29/H100·hr** |
+| ml.g5.24xlarge | 4× A10G-24GB | ≈ $10.18 (per HyperPod example) | ≈ $2.55/A10G·hr |
+(Hourly rates above are *training* rates inferred from SageMaker's published training-tab price calculator and the HyperPod ml.g5.24xlarge $10.18/hr example; consult the live pricing page in [aws.amazon.com/sagemaker/ai/pricing/](https://aws.amazon.com/sagemaker/ai/pricing/) for region-specific quotes. **Managed Spot Training** ([docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html)) cuts up to 80–90% — and DiLoCo tolerates spot well because outer round t can simply skip preempted replicas.)
+**Per-A100 / per-H100 rates are the highest of any executor in this audit.** SageMaker is a poor choice for cost-sensitive Decoupled DiLoCo unless you already have committed savings plans or run on Spot.
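The per-GPU figures in the table are simple divisions, and the effect of Spot can be estimated the same way. The rates below are copied from the table above. The `spot_discount` argument is a hypothetical knob, since the real discount varies by region and over time.

```python
# (hourly training rate in USD, GPUs per instance), copied from the table above.
SAGEMAKER_RATES = {
    "ml.p4d.24xlarge": (32.77, 8),
    "ml.p4de.24xlarge": (40.97, 8),
    "ml.p5.48xlarge": (98.32, 8),
}


def per_gpu_hour(instance: str, spot_discount: float = 0.0) -> float:
    """USD per GPU-hour, optionally after a fractional Spot discount (0.0 to 1.0)."""
    price, gpus = SAGEMAKER_RATES[instance]
    return price * (1.0 - spot_discount) / gpus


on_demand = {name: round(per_gpu_hour(name), 2) for name in SAGEMAKER_RATES}
# Illustrative only: an assumed 80% Spot discount on the A100-40GB instance.
spot_a100 = round(per_gpu_hour("ml.p4d.24xlarge", spot_discount=0.8), 2)
```

Because each training job rents a whole 8-GPU instance, the per-replica cost is the instance rate, not the per-GPU rate, unless one replica actually uses all eight GPUs.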
+**Max concurrent jobs.** AWS Service Quotas: per-account default is typically 4 (for ml.p4d.24xlarge) and 0 (for ml.p5.48xlarge — must request access). Both are raisable. There's a soft cap of 1000 active training jobs per account.
+**Verified API for spinning up N parallel jobs** (using boto3, since `sagemaker` Python SDK abstracts away the parallel-launch case):
+```python
+# composer_replication/diloco/serverless/_sagemaker_adapter.py
+import time
+import boto3
+sm = boto3.client("sagemaker", region_name="us-east-1")
+def spawn_diloco_replica(replica_id: int, num_replicas: int, s3_rendezvous: str):
+    return sm.create_training_job(
+        TrainingJobName=f"diloco-replica-{replica_id}-{int(time.time())}",
+        AlgorithmSpecification={
+            "TrainingImage": "763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4.0-gpu-py311-cu124-ubuntu22.04-sagemaker",
+            "TrainingInputMode": "File",
+            "ContainerEntrypoint": ["python", "-m", "composer_replication.diloco.serverless.replica_entrypoint"],
+            "ContainerArguments": ["--replica-id", str(replica_id),
+                                    "--num-replicas", str(num_replicas),
+                                    "--rendezvous-uri", s3_rendezvous],
+        },
+        ResourceConfig={
+            "InstanceCount": 1,                       # one instance per replica (ml.p4d.24xlarge = 8× A100-40GB)
+            "InstanceType": "ml.p4d.24xlarge",
+            "VolumeSizeInGB": 200,
+            # No KeepAlivePeriodInSeconds here: warm pools cannot be combined with
+            # managed spot training. Drop the spot flag below to use a warm pool instead.
+        },
+        OutputDataConfig={"S3OutputPath": f"{s3_rendezvous}/output/replica-{replica_id}/"},
+        StoppingCondition={"MaxRuntimeInSeconds": 24*3600,
+                           "MaxWaitTimeInSeconds": 36*3600},  # needed with spot; must be >= MaxRuntimeInSeconds
+        RoleArn="arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole",
+        EnableManagedSpotTraining=True,                # 80%+ savings, DiLoCo-tolerant
+    )
+def spawn_n(num_replicas: int):
+    s3_rendezvous = "s3://my-diloco-bucket/run-2026-05-26"
+    return [spawn_diloco_replica(i, num_replicas, s3_rendezvous) for i in range(num_replicas)]
+```
+(The `CreateTrainingJob` API spec is documented in full at [docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html).)
+**Right abstraction.** Same shape: 1 training job = 1 DiLoCo replica. SageMaker's *intra-job* multi-node features (SMDDP, EFA, `instance_count=8`) are wasted if our framing is "N independent replicas"; they only help if a single replica is itself FSDP-sharded across instances, which we explicitly don't want for v0.x.
+### 2.4 GCP Vertex AI Custom Jobs
+**Inter-job networking.** Same story as SageMaker: a single `CustomJob` can have multiple `workerPoolSpecs` (chief, workers, parameter servers, evaluator) on a private VPC; *separate* CustomJobs are isolated. Workaround: GCS bucket. Vertex's [configure-compute](https://cloud.google.com/vertex-ai/docs/training/configure-compute) doc covers single-node and multi-replica configurations for one job.
+**Cold start.** Typical 2–6 min for cold image pull + VM provision. Vertex caches images in Artifact Registry; subsequent jobs in the same region with the same custom container start faster (~30–60 s).
+**$/GPU·hr.** Vertex AI training prices = (Compute Engine VM rate) × (1 + Vertex training premium), with the premium at roughly 30–50%. From the Vertex Training SKU groups page ([cloud.google.com/skus/sku-groups/vertex-training](https://cloud.google.com/skus/sku-groups/vertex-training)) the SKUs include "Training - NVIDIA A100 80GB in Virginia" etc.; published list rate equivalents are roughly:
+| Machine type | GPUs | $/hr (Vertex training, on-demand, us-central1) |
+|---|---|---|
+| `a2-highgpu-1g` | 1× A100-40GB | ≈ **$3.67/hr** |
+| `a2-ultragpu-1g` | 1× A100-80GB | ≈ $5.07/hr |
+| `a2-highgpu-8g` | 8× A100-40GB | ≈ $29.39/hr |
+| `a3-highgpu-8g` | 8× H100-80GB | ≈ **$88.49/hr** ⇒ $11.06/H100·hr |
+| `a3-megagpu-8g` | 8× H100-80GB (with NVSwitch) | ≈ $108/hr |
+(Vertex AI pricing is the Compute Engine GPU rate plus a Vertex training premium that varies by region. The figures above are approximate list prices from public sources; confirm in the [Vertex AI pricing calculator](https://cloud.google.com/vertex-ai/pricing) before quoting.)
+**Max concurrent jobs.** Per-region GPU quota (`NVIDIA_A100_GPUS`, `NVIDIA_H100_GPUS`, etc.) — typical default is 8 A100s per region, raise via Cloud Console quota request.
+**Verified API for spinning up N parallel jobs** (using `google-cloud-aiplatform`):
+```python
+# composer_replication/diloco/serverless/_vertex_ai_adapter.py
+from google.cloud import aiplatform
+aiplatform.init(project="my-project", location="us-central1",
+                staging_bucket="gs://my-diloco-bucket")
+def spawn_diloco_replica(replica_id: int, num_replicas: int, gcs_rendezvous: str):
+    job = aiplatform.CustomJob.from_local_script(
+        display_name=f"diloco-replica-{replica_id}",
+        script_path="composer_replication/diloco/serverless/replica_entrypoint.py",
+        container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-4.py311:latest",
+        args=["--replica-id", str(replica_id),
+              "--num-replicas", str(num_replicas),
+              "--rendezvous-uri", gcs_rendezvous],
+        machine_type="a2-highgpu-1g",      # 1× A100-40GB per replica
+        accelerator_type="NVIDIA_TESLA_A100",
+        accelerator_count=1,
+        replica_count=1,                   # one replica, single-host
+    )
+    job.submit()                           # async; returns immediately
+    return job
+def spawn_n(num_replicas: int):
+    gcs = "gs://my-diloco-bucket/run-2026-05-26"
+    return [spawn_diloco_replica(i, num_replicas, gcs) for i in range(num_replicas)]
+```
+**Right abstraction.** Identical to SageMaker / HF / Modal: one `CustomJob.submit()` = one DiLoCo replica.
+### 2.5 Azure ML Command Jobs
+**Inter-job networking.** Single `command` job with `resources.instance_count=N` provisions N coordinated nodes (InfiniBand on `ND*`-series); separate jobs are isolated. Workaround: Azure Blob Storage or Azure ML Datastore.
+**Cold start.** 3–8 min from job submission to first-byte-of-stdout for a curated environment; longer for custom images. Curated environments (e.g., `AzureML-acpt-pytorch-2.8-cuda12.6@latest`) are pre-cached on the cluster's image cache.
+**$/GPU·hr (from [azure.microsoft.com/en-us/pricing/details/machine-learning/](https://azure.microsoft.com/en-us/pricing/details/machine-learning/), GPU section, US West 2 PAYG list).**
+| VM size | GPUs | Approx $/hr |
+|---|---|---|
+| Standard_NC24ads_A100_v4 | 1× A100-80GB | ≈ **$3.67/hr** |
+| Standard_NC48ads_A100_v4 | 2× A100-80GB | ≈ $7.35/hr |
+| Standard_ND96asr_A100_v4 | 8× A100-40GB (InfiniBand) | ≈ $27.20/hr |
+| Standard_NC40ads_H100_v5 | 1× H100 NVL 94GB | ≈ $7/hr (regional) |
+| Standard_ND96isr_H100_v5 | 8× H100-80GB (InfiniBand) | ≈ **$98/hr** ⇒ $12.25/H100·hr |
+(Azure publishes $0/core ML "service surcharge" for these — you pay only the underlying VM rate. So the relevant hourly rate is the standard PAYG VM rate from Azure's pricing page, not a separate Azure ML markup. **Low-Priority** VMs cut up to 80% — DiLoCo-tolerant like SageMaker Spot.)
+**Max concurrent jobs.** Per-subscription per-region GPU vCPU quota; typical default 0–24 cores for `ND*`-series, raise via Azure portal.
+**Verified API for spinning up N parallel jobs** (using `azure-ai-ml` v2 SDK; pattern from [learn.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch)):
+```python
+# composer_replication/diloco/serverless/_azure_ml_adapter.py
+from azure.ai.ml import MLClient, command
+from azure.identity import DefaultAzureCredential
+ml_client = MLClient(DefaultAzureCredential(), subscription_id="...",
+                     resource_group_name="...", workspace_name="...")
+def spawn_diloco_replica(replica_id: int, num_replicas: int, blob_uri: str):
+    job = command(
+        code="./composer_replication",
+        command=("python -m composer_replication.diloco.serverless.replica_entrypoint "
+                 f"--replica-id {replica_id} --num-replicas {num_replicas} "
+                 f"--rendezvous-uri {blob_uri}"),
+        environment="AzureML-acpt-pytorch-2.8-cuda12.6@latest",
+        compute="gpu-cluster",                  # an AmlCompute pre-created with min_instances=0, max_instances=8
+        resources={"instance_count": 1},
+        display_name=f"diloco-replica-{replica_id}",
+    )
+    return ml_client.jobs.create_or_update(job)
+def spawn_n(num_replicas: int):
+    blob = "azureml://datastores/workspaceblobstore/paths/diloco-run/"
+    return [spawn_diloco_replica(i, num_replicas, blob) for i in range(num_replicas)]
+```
+**Right abstraction.** Same one-job-per-replica pattern.
+### 2.6 Kubernetes + Volcano / KubeRay
+**Inter-job networking.** Native — pods on the same cluster see each other on the cluster network. Volcano provides **gang scheduling** (all-or-nothing pod admission, essential for "all N DiLoCo replicas start together" semantics) and **network-topology-aware scheduling** ([volcano.sh/en/docs/network_topology_aware_scheduling/](https://volcano.sh/en/docs/network_topology_aware_scheduling/)). KubeRay's `RayJob` resource integrates with Volcano (PR [ray-project/kuberay#3972](https://github.com/ray-project/kuberay/pull/3972), merged 2025-10-09) — `RayJob` + `volcano.sh/queue-name` label gives you gang-scheduled Ray clusters per job.
+For Decoupled DiLoCo: **N RayJobs, each running one replica**, gang-scheduled via Volcano, sharing pseudo-grad through a `PersistentVolume` or in-cluster S3-compatible object store (MinIO).
+**Cold start.** Pod schedule time depends on cluster state: seconds (pre-pulled image, free GPU node) to minutes (image pull + GPU node autoscale). Predictable on a steady-state cluster.
+**$/GPU·hr.** **Whatever the underlying K8s cluster pays.** This is the *cheapest* tier in this audit if the user already runs a GPU K8s cluster (e.g., RunPod K8s, Lambda Cloud, OCI K8s, on-prem). Examples:
+- RunPod community cloud K8s: ~$1.19/hr A100-80GB, ~$1.99/hr H100.
+- Lambda K8s: ~$1.29/hr A100-40GB, ~$2.49/hr H100-80GB.
+- On-prem owned hardware: amortized $0.50–$1.00 per A100/H100 hour.
+**Max concurrent jobs.** Cluster capacity. Volcano's queue-based admission control + Kubernetes-native quotas govern this.
+**Verified API for spinning up N parallel jobs** (Volcano `Job` + KubeRay pattern from the docs):
+```yaml
+# k8s manifest, one per DiLoCo replica
+apiVersion: batch.volcano.sh/v1alpha1
+kind: Job
+metadata: {name: diloco-replica-0}
+spec:
+  minAvailable: 1
+  schedulerName: volcano
+  queue: diloco-queue
+  tasks:
+    - replicas: 1
+      name: replica
+      template:
+        spec:
+          containers:
+            - name: trainer
+              image: myorg/composer-replication:latest
+              command: ["python", "-m", "composer_replication.diloco.serverless.replica_entrypoint",
+                        "--replica-id", "0", "--num-replicas", "4",
+                        "--rendezvous-uri", "s3://minio.cluster.local/diloco/"]
+              resources:
+                limits: {nvidia.com/gpu: 1}
+          restartPolicy: OnFailure
+```
+…and the framework's `K8sExecutor` adapter does `kubectl apply -f` (or uses the Python K8s client) for each of N rendered manifests.
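Rendering the N manifests can be sketched as below. This is an illustration under stated assumptions: `render_manifest` and `render_all` are hypothetical helpers, not the `K8sExecutor` itself, and the output is JSON rather than YAML (which `kubectl apply -f` also accepts) so the sketch needs only the standard library.

```python
import json
from typing import Any, Dict, List


def render_manifest(replica_id: int, num_replicas: int, rendezvous_uri: str,
                    image: str = "myorg/composer-replication:latest",
                    queue: str = "diloco-queue") -> Dict[str, Any]:
    """Build one Volcano Job manifest (same fields as the YAML above) as a dict."""
    command = ["python", "-m",
               "composer_replication.diloco.serverless.replica_entrypoint",
               "--replica-id", str(replica_id),
               "--num-replicas", str(num_replicas),
               "--rendezvous-uri", rendezvous_uri]
    container = {
        "name": "trainer",
        "image": image,
        "command": command,
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    }
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": f"diloco-replica-{replica_id}"},
        "spec": {
            "minAvailable": 1,
            "schedulerName": "volcano",
            "queue": queue,
            "tasks": [{
                "replicas": 1,
                "name": "replica",
                "template": {"spec": {"containers": [container],
                                      "restartPolicy": "OnFailure"}},
            }],
        },
    }


def render_all(num_replicas: int, rendezvous_uri: str) -> List[Dict[str, Any]]:
    """One manifest per DiLoCo replica, differing only in name and --replica-id."""
    return [render_manifest(i, num_replicas, rendezvous_uri)
            for i in range(num_replicas)]


manifests = render_all(4, "s3://minio.cluster.local/diloco/")
rendered = [json.dumps(m, indent=2) for m in manifests]
```

Each string in `rendered` can be written to a file and applied, or the dicts can be passed to a Kubernetes client directly.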
+**Right abstraction.** Either one Volcano `Job` (`batch.volcano.sh/v1alpha1`) per replica (simple, no Ray) or one `RayJob` per replica (overkill for DiLoCo, but useful if you want Ray Tune integration). One pod = one DiLoCo replica.
+### 2.7 RunPod / Lambda / Vast.ai (honourable mentions)
+Not in the original candidate list, but worth one paragraph each because they're the price-leaders for serverless GPUs:
+- **RunPod Serverless / Pods.** Cheap on-demand A100/H100 (~$1.19–$2.17/hr A100-80GB; ~$1.99–$4.18/hr H100). REST API `POST /v2/{endpoint}/run` for serverless; SDK `runpod` for pods. No native multi-job network — same S3 workaround. **Strong third adapter candidate** for a cost-optimised deployment.
+- **Lambda Cloud (Lambda Labs).** Bare metal hourly rentals, not a true serverless API. Programmatic launch via `lambdalabs` API. Outside the "serverless" framing.
+- **Vast.ai.** Bidding-style spot market. API-driven launches. Cheapest per A100·hr in the market, but variable availability.
+We do **not** include these as v0 adapters but document them as "next-up after Modal + HF" if the user wants further price compression.
+---
+## 3. The right abstraction: `composer_replication.diloco.serverless`
+### 3.1 The core interface
+```python
+# composer_replication/diloco/serverless/_protocol.py
+from __future__ import annotations
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Any, Iterator, Protocol
+@dataclass(frozen=True)
+class ReplicaSpec:
+    """One DiLoCo replica's launch config. Mirrors `make_diloco_outer_loop()`'s
+    args (see composer_replication/diloco/__init__.py) plus a rendezvous_uri
+    for the object-store all-reduce backend."""
+    replica_id: int
+    num_replicas: int
+    rendezvous_uri: str           # s3://, gs://, az://, hf://, file://
+    model_id: str                 # e.g. "Qwen/Qwen2.5-7B"
+    inner_optimizer: dict[str, Any]   # serializable; reconstructed in worker
+    sync_every: int = 500
+    outer_lr: float = 0.7
+    outer_momentum: float = 0.9
+    outer_rounds: int = 100
+    extra_env: dict[str, str] | None = None
+@dataclass(frozen=True)
+class ReplicaHandle:
+    replica_id: int
+    backend: str                  # "modal" | "hfjobs" | "sagemaker" | ...
+    job_id: str
+    log_url: str | None = None
+@dataclass(frozen=True)
+class ReplicaResult:
+    replica_id: int
+    status: str                   # "completed" | "failed" | "preempted"
+    final_checkpoint_uri: str | None
+    metrics: dict[str, Any]
+class ServerlessExecutor(Protocol):
+    """Protocol any serverless backend implements to host Decoupled DiLoCo."""
+    def launch_replicas(self, specs: list[ReplicaSpec]) -> list[ReplicaHandle]: ...
+    def poll(self, handles: list[ReplicaHandle]) -> list[ReplicaHandle]: ...
+    def stream_logs(self, handle: ReplicaHandle) -> Iterator[str]: ...
+    def cancel(self, handles: list[ReplicaHandle]) -> None: ...
+    def collect(self, handles: list[ReplicaHandle], *,
+                timeout: float | None = None) -> list[ReplicaResult]: ...
+    @property
+    def backend_name(self) -> str: ...
+    @property
+    def supports_inter_replica_network(self) -> bool:
+        """True iff backend natively connects replicas (e.g., Modal i6pn).
+        False = pseudo-grad must use rendezvous_uri object store. Default rendezvous
+        is *always* object-store regardless; this flag only unlocks an opt-in
+        same-backend fast path (see ModalExecutor(use_clustered_rdma=True))."""
+        ...
+```
+Concrete adapters inherit from a small `BaseExecutor(ABC)` for cross-cutting retry/log/timeout, paralleling `composer_replication.trainer.composer_trainer`. `launch_replicas()` is partial-failure tolerant: on partial submit it returns handles for the K successful replicas with the failed one carrying `job_id=""` and a logged warning; the caller is responsible for cleanup via `cancel()`.
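The partial-failure contract described above can be sketched as follows. This is a minimal illustration, not the `BaseExecutor` code: `launch_tolerant` and `failed_handles` are hypothetical helpers, `submit` stands in for whatever backend call creates one job, and `ReplicaHandle` is redeclared locally only to keep the sketch self-contained.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

log = logging.getLogger("diloco.serverless")


@dataclass(frozen=True)
class ReplicaHandle:  # same fields as the ReplicaHandle in _protocol.py above
    replica_id: int
    backend: str
    job_id: str
    log_url: Optional[str] = None


def launch_tolerant(replica_ids: Iterable[int],
                    submit: Callable[[int], str],
                    backend: str) -> List[ReplicaHandle]:
    """Submit every replica; a failed submit yields a handle with job_id=""."""
    handles = []
    for rid in replica_ids:
        try:
            job_id = submit(rid)
        except Exception as exc:  # keep going so the caller sees every outcome
            log.warning("replica %d failed to submit: %s", rid, exc)
            job_id = ""
        handles.append(ReplicaHandle(replica_id=rid, backend=backend, job_id=job_id))
    return handles


def failed_handles(handles: List[ReplicaHandle]) -> List[ReplicaHandle]:
    """Handles whose submit failed; the caller decides whether to cancel the rest."""
    return [h for h in handles if not h.job_id]


def fake_submit(rid: int) -> str:
    """Stand-in backend call: replica 2 fails, the others get a job id."""
    if rid == 2:
        raise RuntimeError("quota exceeded")
    return f"job-{rid}"


handles = launch_tolerant(range(4), fake_submit, backend="local")
```

A caller that needs all-or-nothing semantics can check `failed_handles(handles)` and pass the successful handles to `cancel()`.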
+### 3.2 The object-store all-reduce (the secret weapon)
+The whole point of "decoupled" DiLoCo is that the cross-replica primitive is just object-store I/O. We implement it at the framework layer, *not* at the executor layer, so every adapter gets it for free:
+```python
+# composer_replication/diloco/serverless/_rendezvous.py
+import time, torch, fsspec
+class ObjectStoreAllReduce:
+    """Drop-in for `torchft.Manager.allreduce` over a shared object store.
+    Each round t:
+      (1) replica i writes  {uri}/round-{t}/replica-{i}.pt
+      (2) all replicas barrier on count == num_replicas
+      (3) rank 0 averages, writes {uri}/round-{t}/avg.pt
+      (4) every replica (rank 0 included) reads avg.pt, copy_ into the in-place tensor
+      (5) rank 0 GCs round-(t-1)
+    fsspec-backed so one path covers s3://, gs://, az://, hf://, file://.
+    """
+    def __init__(self, replica_id, num_replicas, rendezvous_uri,
+                 fsspec_kwargs=None, poll_s=2.0, timeout_s=600.0):
+        self.replica_id, self.num_replicas = replica_id, num_replicas
+        self.uri = rendezvous_uri.rstrip("/")
+        self.fs, _ = fsspec.url_to_fs(self.uri, **(fsspec_kwargs or {}))
+        self.poll, self.timeout, self._round = poll_s, timeout_s, 0
+    def allreduce(self, tensor):
+        t = self._round
+        my = f"{self.uri}/round-{t}/replica-{self.replica_id}.pt"
+        avg = f"{self.uri}/round-{t}/avg.pt"
+        with self.fs.open(my, "wb") as f:
+            torch.save(tensor.cpu(), f)
+        deadline = time.time() + self.timeout
+        while time.time() < deadline:
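+            # Drop cached listings so files written by other replicas become visible
+            # (some fsspec backends, e.g. s3fs, cache `ls` results).
+            self.fs.invalidate_cache()
+            # detail=False: some backends return info dicts instead of paths by default.
+            existing = [p for p in self.fs.ls(f"{self.uri}/round-{t}/", detail=False)
+                        if p.endswith(".pt") and "/replica-" in p]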
+            if len(existing) >= self.num_replicas: break
+            time.sleep(self.poll)
+        else:
+            raise TimeoutError(f"barrier timeout at round {t}")
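+        if self.replica_id == 0:
+            tensors = []
+            for i in range(self.num_replicas):
+                with self.fs.open(f"{self.uri}/round-{t}/replica-{i}.pt", "rb") as f:
+                    tensors.append(torch.load(f, map_location="cpu"))
+            # Close the handle explicitly: object stores only commit the upload on close.
+            with self.fs.open(avg, "wb") as f:
+                torch.save(torch.stack(tensors).mean(dim=0), f)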
+        deadline = time.time() + self.timeout
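+        while time.time() < deadline:
+            self.fs.invalidate_cache()  # drop cached metadata so a freshly written avg.pt is visible
+            if self.fs.exists(avg):
+                with self.fs.open(avg, "rb") as f:
+                    tensor.copy_(torch.load(f, map_location=tensor.device))
+                break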
+            time.sleep(self.poll)
+        else:
+            raise TimeoutError(f"avg.pt timeout at round {t}")
+        if self.replica_id == 0 and t > 0:
+            try: self.fs.rm(f"{self.uri}/round-{t-1}/", recursive=True)
+            except Exception: pass
+        self._round += 1
+        return _DummyWork()
+    def should_commit(self): return True
+    def start_quorum(self, *_, **__): pass
+    @property
+    def current_step(self): return self._round
+class _DummyWork:
+    def wait(self): pass
+    def get_future(self): pass
+```
+`ObjectStoreAllReduce` duck-types the part of the torchft `Manager` interface used here (`allreduce`, `should_commit`, `start_quorum`, `current_step`), which is exactly what `make_diloco_outer_loop` already takes (see `composer_replication/diloco/__init__.py` lines 64–125). **No changes to the existing DiLoCo wrapper needed.**
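The write, barrier, average, read sequence can be exercised without `torch` or `fsspec`. The following is a standard-library sketch of one round under stated assumptions: plain floats stand in for tensors, a temporary directory stands in for the object store, and `allreduce_mean`, `write_atomic`, and `simulate` are hypothetical names. It also writes through a temp file plus rename, which is one way (not necessarily the project's) to keep readers from seeing half-written files on a local filesystem.

```python
import json
import os
import tempfile
import threading
import time


def write_atomic(path, payload):
    """Write to a temp name, then rename, so readers never see a partial file."""
    tmp = f"{path}.tmp"
    with open(tmp, "w") as f:
        json.dump(payload, f)
    os.replace(tmp, path)


def allreduce_mean(root, round_idx, replica_id, num_replicas, value,
                   poll_s=0.01, timeout_s=10.0):
    """One round of the file-based protocol, with a float instead of a tensor."""
    round_dir = os.path.join(root, f"round-{round_idx}")
    os.makedirs(round_dir, exist_ok=True)
    avg_path = os.path.join(round_dir, "avg.json")

    def wait_for(condition, what):
        deadline = time.time() + timeout_s
        while not condition():
            if time.time() > deadline:
                raise TimeoutError(f"{what} timeout at round {round_idx}")
            time.sleep(poll_s)

    def replica_files():
        return [n for n in os.listdir(round_dir)
                if n.startswith("replica-") and n.endswith(".json")]

    # (1) publish this replica's contribution
    write_atomic(os.path.join(round_dir, f"replica-{replica_id}.json"), value)
    # (2) barrier: wait until every replica has published
    wait_for(lambda: len(replica_files()) >= num_replicas, "barrier")
    # (3) rank 0 averages and publishes the result
    if replica_id == 0:
        values = []
        for i in range(num_replicas):
            with open(os.path.join(round_dir, f"replica-{i}.json")) as f:
                values.append(json.load(f))
        write_atomic(avg_path, sum(values) / num_replicas)
    # (4) every replica reads the average
    wait_for(lambda: os.path.exists(avg_path), "avg")
    with open(avg_path) as f:
        return json.load(f)


def simulate(values):
    """Run one round with one thread per replica; returns {replica_id: mean}."""
    results = {}
    with tempfile.TemporaryDirectory() as root:
        def worker(i):
            results[i] = allreduce_mean(root, 0, i, len(values), values[i])
        threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(values))]
        for th in threads:
            th.start()
        for th in threads:
            th.join()
    return results


means = simulate([0.0, 1.0, 2.0])
```

Threads are used here only to keep the example small; the protocol itself relies on nothing but shared storage, so the same logic applies across processes or hosts.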
+### 3.3 Replica entrypoint
+This is the script every adapter runs in its container:
+```python
+# composer_replication/diloco/serverless/replica_entrypoint.py
+"""Run one Decoupled DiLoCo replica. Designed to be invoked as
+    python -m composer_replication.diloco.serverless.replica_entrypoint \
+        --replica-id N --num-replicas K --rendezvous-uri s3://... \
+        --model-id Qwen/Qwen2.5-7B --sync-every 500 --outer-rounds 100
+"""
+import argparse, os, torch
+from composer_replication.diloco import make_diloco_outer_loop
+from composer_replication.diloco.serverless._rendezvous import ObjectStoreAllReduce
+def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--replica-id", type=int, required=True)
+    p.add_argument("--num-replicas", type=int, required=True)
+    p.add_argument("--rendezvous-uri", required=True)
+    p.add_argument("--model-id", required=True)
+    p.add_argument("--sync-every", type=int, default=500)
+    p.add_argument("--outer-rounds", type=int, default=100)
+    p.add_argument("--outer-lr", type=float, default=0.7)
+    args = p.parse_args()
+    from transformers import AutoModelForCausalLM
+    model = AutoModelForCausalLM.from_pretrained(args.model_id, torch_dtype=torch.bfloat16).cuda()
+    inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)
+    manager = ObjectStoreAllReduce(replica_id=args.replica_id,
+                                   num_replicas=args.num_replicas,
+                                   rendezvous_uri=args.rendezvous_uri)
+    outer = make_diloco_outer_loop(
+        manager=manager, model_fragments=[model], inner_optimizer=inner_opt,
+        outer_lr=args.outer_lr, outer_momentum=0.9, nesterov=True,
+        sync_every=args.sync_every,
+    )
+    with outer:
+        for outer_round in range(args.outer_rounds):
+            for inner_step in range(args.sync_every):
+                # caller plugs in their data + loss; for v0 we use a sketch.
+                inner_opt.zero_grad(); ...; inner_opt.step()
+            # outer-loop sync fires automatically at sync_every step boundary.
+    # Push final checkpoint to rendezvous_uri/final/replica-N.pt
+    ...
+if __name__ == "__main__":
+    main()
+```
+### 3.4 Package layout
+```
+composer_replication/
+└── diloco/
+    ├── __init__.py            # existing: make_diloco_outer_loop, torchft import
+    └── serverless/
+        ├── __init__.py        # re-exports
+        ├── _protocol.py       # ServerlessExecutor Protocol, ReplicaSpec, ReplicaHandle, ReplicaResult
+        ├── _base.py           # BaseExecutor(ABC) — common retry/log/timeout logic
+        ├── _rendezvous.py     # ObjectStoreAllReduce (the cross-cutting allreduce)
+        ├── replica_entrypoint.py    # the script every adapter runs in-container
+        ├── modal/
+        │   ├── __init__.py    # ModalExecutor
+        │   └── adapter.py
+        ├── hfjobs/
+        │   ├── __init__.py    # HFJobsExecutor
+        │   └── adapter.py
+        └── runpod/             # optional v0.1+
+            ├── __init__.py
+            └── adapter.py
+```
+**v0 ships:** `Modal` + `HFJobs`. Both inherit from `BaseExecutor`, both delegate cross-replica state to `ObjectStoreAllReduce`. Symmetric implementation surface ≈ 250 lines per adapter.
+**v0.1+ candidates** (add when needed): SageMaker, Vertex AI, Azure ML, RunPod, K8s/Volcano. The `Protocol` is stable; adding adapters is incremental.
+### 3.5 What the user writes
+```python
+from composer_replication.diloco.serverless import (
+    ModalExecutor, HFJobsExecutor, ReplicaSpec
+)
+specs = [
+    ReplicaSpec(replica_id=i, num_replicas=4,
+                rendezvous_uri="s3://my-diloco-runs/2026-05-26/",
+                model_id="Qwen/Qwen2.5-7B",
+                inner_optimizer={"name": "AdamW", "lr": 4e-4},
+                sync_every=500, outer_rounds=100)
+    for i in range(4)
+]
+# Option A: all four replicas on Modal A100s
+executor = ModalExecutor(gpu="A100-40GB", region=None, preemptible=True)
+handles = executor.launch_replicas(specs)
+results = executor.collect(handles)
+# Option B: heterogeneous fleet — 2 on Modal, 2 on HF Jobs
+modal_ex = ModalExecutor(gpu="A100-40GB")
+hf_ex = HFJobsExecutor(flavor="a100-large")
+modal_handles = modal_ex.launch_replicas(specs[:2])
+hf_handles = hf_ex.launch_replicas(specs[2:])
+# both groups read+write the SAME s3://... rendezvous URI — they DiLoCo together.
+results = modal_ex.collect(modal_handles) + hf_ex.collect(hf_handles)
+```
+The "heterogeneous fleet" pattern is the **point** of Decoupled DiLoCo as articulated in the user brief. Modal + HF together is a meaningful test that tells us both adapters work and the rendezvous protocol is backend-agnostic.
+---
+## 4. Cross-cutting design decisions
+### 4.1 Why object-store rendezvous is the default (even on Modal)
+Even though Modal supports `@modal.experimental.clustered` with RDMA, **the framework default is object-store-based pseudo-gradient exchange.** Reasons:
+1. **Backend portability.** Same code runs on Modal, HF, SageMaker, Vertex, Azure, K8s. Adding a new backend means implementing the Protocol's five methods (`launch_replicas`, `poll`, `stream_logs`, `cancel`, `collect`) and two properties (`backend_name`, `supports_inter_replica_network`), with *zero* changes to the rendezvous layer.
+2. **Cost asymmetry.** RDMA-class networking on Modal requires `@clustered(rdma=True)` which gates on 8 GPUs/node and tighter scheduling — *more* expensive than 4 separate `@function` invocations of 1 GPU each.
+3. **RDMA is overkill for DiLoCo's communication pattern.** 2 GB every 10 minutes averages ~3.3 MB/s (≈27 Mbit/s). S3 GET/PUT at 10 MB/s moves that payload in ~3.3 min, well under the 10 min outer-round budget.
+4. **Failure decoupling.** A clustered-RDMA failure aborts the whole job (gang-scheduled). Object-store rendezvous can tolerate a missing replica by skipping its tensor in the average (the opt-in `"skip"` policy in §4.3), which better matches DiLoCo's natural fault tolerance.
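The bandwidth argument in point 3 can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the text (2 GB payload, 10-minute round, 10 MB/s store throughput) and decimal units; it covers a single one-way transfer and ignores that rank 0 also downloads every replica's file.

```python
payload_bytes = 2e9          # 2 GB pseudo-gradient (figure from the text)
round_interval_s = 10 * 60   # one outer round every 10 minutes (figure from the text)
store_throughput = 10e6      # 10 MB/s object-store GET/PUT (figure from the text)

avg_rate_mb_s = payload_bytes / round_interval_s / 1e6   # sustained rate in MB/s
avg_rate_mbit_s = avg_rate_mb_s * 8                      # same rate in Mbit/s
transfer_min = payload_bytes / store_throughput / 60     # one-way transfer time in minutes

print(f"{avg_rate_mb_s:.1f} MB/s, {avg_rate_mbit_s:.0f} Mbit/s, {transfer_min:.1f} min")
```

Under these assumptions the sustained rate is a few MB/s, and a single upload or download fits inside the round several times over.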
+The opt-in escape hatch: `ModalExecutor(use_clustered_rdma=True)` dispatches to `@modal.experimental.clustered(rdma=True)` and skips object-store. This is for the user who wants Modal-only, max-throughput, single-region runs. It's *not* the default and *not* what we test against.
+### 4.2 Rendezvous URI scheme support
+`fsspec` covers all the storage backends we need:
+| Scheme | Backend | Used for |
+|---|---|---|
+| `s3://` | `s3fs` | SageMaker default; cheapest for AWS-centric runs |
+| `gs://` | `gcsfs` | Vertex AI default |
+| `az://` | `adlfs` | Azure ML default |
+| `hf://` | `huggingface_hub.HfFileSystem` | HF Jobs preferred (Volume mount makes it look like local fs already) |
+| `file://` | builtin | local single-host tests; CI |
+The framework picks the *right* default per-executor (Modal → `s3://`, HF → `hf://`, SageMaker → `s3://`, etc.) but always allows override.
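A per-executor default with an explicit override could look like the sketch below. The backend keys (`"modal"`, `"hfjobs"`, and so on) and the function name `resolve_rendezvous_uri` are illustrative assumptions; only the scheme pairings come from the table above.

```python
# Default URI scheme per backend, following the table above.
# The dictionary keys are hypothetical backend identifiers.
DEFAULT_SCHEME = {
    "modal": "s3",
    "hfjobs": "hf",
    "sagemaker": "s3",
    "vertex": "gs",
    "azureml": "az",
}


def resolve_rendezvous_uri(backend, path, override_uri=None):
    """Return the override if given, else build a URI from the backend's default scheme."""
    if override_uri is not None:
        return override_uri.rstrip("/")
    try:
        scheme = DEFAULT_SCHEME[backend]
    except KeyError:
        raise ValueError(f"no default rendezvous scheme for backend {backend!r}") from None
    return f"{scheme}://{path.strip('/')}"


uri_default = resolve_rendezvous_uri("hfjobs", "datasets/my-org/diloco-run")
uri_override = resolve_rendezvous_uri("hfjobs", "ignored", override_uri="s3://my-diloco-runs/2026-05-26/")
```

For a heterogeneous fleet, every executor would need the same override so that all replicas meet at one URI.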
+### 4.3 Failure model
+**Replica failure mid-round.** The barrier in `ObjectStoreAllReduce` has a configurable timeout (default 600 s). If a replica doesn't write its file by then, rank-0 (the averager) has two options governed by `replica_failure_policy`:
+- `"strict"` (default): TimeoutError → all replicas abort. Resume from last committed checkpoint.
+- `"skip"`: rank-0 averages over what's there, includes a `--num-survivors=K` annotation in `avg.pt`. Other replicas read this and continue. DiLoCo paper §4.5 reports robustness to occasional missing workers; this matches that.
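The two policies differ only in what rank 0 does when the barrier times out with fewer files than expected. A minimal sketch of that decision follows, with lists of floats standing in for tensors; `average_round` and the returned dictionary layout are hypothetical, not the project's actual `avg.pt` format.

```python
def average_round(contributions, num_replicas, policy="strict"):
    """Average the contributions that arrived before the barrier timeout.

    contributions: dict mapping replica_id -> list[float] for replicas that wrote in time.
    """
    if policy not in ("strict", "skip"):
        raise ValueError(f"unknown replica_failure_policy: {policy!r}")
    survivors = sorted(contributions)
    if not survivors:
        raise TimeoutError("no replica wrote a contribution")
    if policy == "strict" and len(survivors) < num_replicas:
        # Strict: any missing replica aborts the round for everyone.
        raise TimeoutError(f"only {len(survivors)}/{num_replicas} replicas reported")
    vectors = [contributions[i] for i in survivors]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    # Skip: record how many replicas the average actually covers.
    return {"avg": mean, "num_survivors": len(survivors)}


partial = {0: [0.0, 2.0], 2: [2.0, 4.0]}   # replica 1 never wrote its file
skipped = average_round(partial, num_replicas=3, policy="skip")
```

Dividing by the number of survivors rather than `num_replicas` keeps the result a true mean of the contributions that were received.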
+**Whole-cluster failure.** Outer rounds checkpoint to `{rendezvous_uri}/checkpoint-{t}/`; restart sets `args.restart_from=T` and skips ahead.
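On restart, the value for `restart_from` has to come from somewhere. One option is to scan the rendezvous listing for the highest `checkpoint-{t}` directory. The helper below is a hypothetical sketch of that lookup and operates on a plain list of paths, however they were obtained.

```python
import re

# Matches a trailing "checkpoint-<t>" path component, with or without a final slash.
_CKPT = re.compile(r"checkpoint-(\d+)/?$")


def latest_checkpoint_round(paths):
    """Return the highest round t among checkpoint-{t} paths, or None if there are none."""
    rounds = [int(m.group(1)) for p in paths if (m := _CKPT.search(p))]
    return max(rounds) if rounds else None


listing = [
    "s3://my-diloco-runs/2026-05-26/checkpoint-3/",
    "s3://my-diloco-runs/2026-05-26/checkpoint-12",
    "s3://my-diloco-runs/2026-05-26/round-13/avg.pt",
]
restart_from = latest_checkpoint_round(listing)
```

Parsing the round as an integer matters: a lexicographic sort would rank `checkpoint-3` above `checkpoint-12`.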
+### 4.4 What we explicitly do NOT do
+- **No cross-job NCCL.** Even on Modal, even with `clustered`, the framework uses object-store rendezvous. (Modal `clustered` is exposed only via the explicit opt-in flag.)
+- **No DDP/FSDP across replicas.** Each replica is its own self-contained DDP/FSDP world; replicas talk to each other only via the outer-loop. This is the *core* of DiLoCo.
+- **No "control plane" service.** No coordinator process, no scheduler container. The object store *is* the coordinator (writes are the messages, file-existence is the synchronization). This is what makes the system work across heterogeneous executors with no shared infra.
+- **No Modal-specific or HF-specific dependencies in `composer_replication.diloco`.** Adapter dependencies (`modal`, `huggingface_hub`) are imported lazily inside the adapter modules, exactly how `torchft` is imported lazily in `composer_replication/diloco/__init__.py` today.
+---
+## 5. Risks and mitigations
+| Risk | Likelihood | Mitigation |
+|---|---|---|
+| Object-store latency dominates outer-round wallclock for large models | M | For 70B+, add `fsspec` parallel-upload (multipart) + bf16 quantize on-write. Most outer rounds are 7B-scale where 2 GB transfer is well under 1 min. |
+| Rank-0 replica crashes mid-average → orphaned barrier | L | Add a `lock-{t}.json` heartbeat with TTL; any non-zero replica that sees a stale lock can take over. v1+. |
+| Modal + HF cost arbitrage misleading because preemption rates differ | M | Track preemption-rate per backend, surface in `ReplicaResult.metrics`. User-visible. |
+| HF Jobs has no public per-account concurrency cap → may hit a hidden limit at N=8 | L | Add exponential-backoff retry around `run_job`; cap `max_concurrent_launches` configurable per executor. |
+| AWS / GCP / Azure premiums make their adapters effectively price-uncompetitive | H (already true) | Be honest in docs (this doc). Recommend Modal + HF for cost-sensitive users; cloud-vendor adapters for users who *must* run there for compliance or credits. |
+| Rendezvous bucket becomes a security choke point (model weights exposed) | M | Document that `rendezvous_uri` should be a private bucket with replica-only IAM/principals. Provide `RendezvousAccessPolicy` helper that emits boto3/gcloud/az IAM JSON. |
+| Modal `@experimental.clustered` API churn (it's experimental) | M | Default path doesn't depend on `clustered`. Fall-back path uses regular `@function`. Document the opt-in clearly. |
+| torchft sign-convention regression | L | Already pinned with the unit test in spike 008 (see `spikes/008-streaming-diloco/tests/test_diloco_smoke.py::test_diloco_pseudogradient_sign_convention`). The serverless layer doesn't touch this — it only swaps in a different `Manager.allreduce` impl. |
+---
+## 6. Validation plan
+Three smoke tests, in order of cost:
+1. **Spike 009-A (free, ≤30 min):** `LocalProcessExecutor` + `ObjectStoreAllReduce` with `file://` rendezvous. Two in-process replicas DiLoCo-train a 0.5B model on MNIST-equivalent text data. Asserts the rendezvous protocol works.
+2. **Spike 009-B (Modal, ≤$5):** `ModalExecutor` × 2 replicas, A100-40GB each, Qwen2.5-0.5B, 50 inner steps × 2 outer rounds. Asserts the Modal adapter launches, replicas find each other through S3 rendezvous, and pseudo-gradients average correctly. Cost: ~0.5 h × $2.10/h × 2 replicas = $2.10, plus setup overhead; comfortably under the cap.
+3. **Spike 009-C (heterogeneous, ≤$10):** 1 Modal A100 + 1 HF Jobs `a100-large`. Same model, 2 outer rounds. Validates that rendezvous works across backends, which is the key claim of Decoupled DiLoCo. Cost: ~0.5 h × ($2.10/h + $2.50/h) = ~$2.30, plus per-job startup.
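The spike budgets follow from GPU-hours times hourly rate. The sketch below reproduces the two paid estimates, taking the quoted $2.10 and $2.50 as per-GPU hourly rates (an assumption; the text does not state the unit) and excluding setup and startup overhead.

```python
def spike_cost(hours, hourly_rates):
    """Compute cost for one spike: each replica runs `hours` at its own hourly rate."""
    return hours * sum(hourly_rates)


cost_009b = spike_cost(0.5, [2.10, 2.10])   # two Modal A100-40GB replicas
cost_009c = spike_cost(0.5, [2.10, 2.50])   # one Modal A100 plus one HF Jobs a100-large

print(f"009-B: ${cost_009b:.2f}, 009-C: ${cost_009c:.2f}")
```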
+Each spike has a verdict.md following the conventions from `spikes/008-streaming-diloco/`.
+---
+## 7. References (primary sources, all cited above)
+- **DiLoCo paper:** Douillard et al., "DiLoCo: Distributed Low-Communication Training of Language Models," arXiv:2311.08105 (2023). <https://arxiv.org/abs/2311.08105>
+- **Streaming DiLoCo paper:** Douillard et al., "Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch," arXiv:2501.18512 (2025). <https://arxiv.org/abs/2501.18512>
+- **torchft `local_sgd.DiLoCo`:** <https://github.com/meta-pytorch/torchft/blob/main/torchft/local_sgd.py>
+- **Modal multi-node clusters:** <https://modal.com/docs/guide/multi-node-training>
+- **Modal cluster networking (i6pn):** <https://modal.com/docs/guide/private-networking>
+- **Modal pricing:** <https://modal.com/pricing>
+- **Modal GPU options:** <https://modal.com/docs/guide/gpu>
+- **HF Jobs overview:** <https://huggingface.co/docs/hub/en/jobs>
+- **HF Jobs pricing:** <https://huggingface.co/docs/hub/jobs-pricing>
+- **HF Jobs Python API:** <https://huggingface.co/docs/huggingface_hub/en/guides/jobs>
+- **HF Jobs reference:** <https://huggingface.co/docs/huggingface_hub/main/en/package_reference/jobs>
+- **AWS SageMaker pricing:** <https://aws.amazon.com/sagemaker/ai/pricing/>
+- **AWS SageMaker `CreateTrainingJob` API:** <https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html>
+- **AWS SageMaker SMDDP:** <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-framework-estimator.html>
+- **AWS SageMaker warm pools:** <https://docs.aws.amazon.com/sagemaker/latest/dg/train-warm-pools.html>
+- **GCP Vertex AI compute config:** <https://cloud.google.com/vertex-ai/docs/training/configure-compute>
+- **GCP Vertex AI training SKUs:** <https://cloud.google.com/skus/sku-groups/vertex-training>
+- **GCP Vertex AI pricing:** <https://cloud.google.com/vertex-ai/pricing>
+- **Azure ML pricing:** <https://azure.microsoft.com/en-us/pricing/details/machine-learning/>
+- **Azure ML PyTorch SDK v2 guide:** <https://learn.microsoft.com/en-us/azure/machine-learning/how-to-train-pytorch>
+- **Azure NDasrA100_v4 spec:** <https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndasra100v4-series>
+- **Azure NCads H100 v5 spec:** <https://learn.microsoft.com/en-us/azure/virtual-machines/ncads-h100-v5>
+- **Volcano:** <https://volcano.sh/en/docs/unified_scheduling/>
+- **Volcano network-topology-aware scheduling:** <https://volcano.sh/en/docs/network_topology_aware_scheduling/>
+- **KubeRay + Volcano integration:** <https://docs.ray.io/en/latest/cluster/kubernetes/k8s-ecosystem/volcano.html>
+- **KubeRay RayJob+Volcano PR:** <https://github.com/ray-project/kuberay/pull/3972>
+Internal references (in this repo):
+- `docs/research/MODAL_RECONNAISSANCE.md` — pricing/cold-start audit for Modal smoke runs.
+- `docs/research/DILOCO_RECONNAISSANCE.md` — DiLoCo implementation candidates audit.
+- `docs/adrs/ADR-001-gpu-venue.md` — local-vs-cloud GPU decision for smoke phase.
+- `docs/adrs/ADR-003-diloco-impl.md` — torchft choice + sign convention.
+- `composer_replication/diloco/__init__.py` — existing `make_diloco_outer_loop` wrapper this design plugs into without modification.
+- `spikes/008-streaming-diloco/` — the existing in-process DiLoCo smoke that the serverless adapter inherits sign-convention test from.

docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md ADDED

	@@ -0,0 +1,506 @@

+# Replaysim Normalization Reconnaissance
+**Status:** Recon · **Feeds:** ADR-004, V5 "replaysim with normalization"
+**Author:** subagent (delegated audit) · **Date:** 2026-05-25
+**Sources:** GitHub REST API metadata + DeepWiki structured indexes of each repo's primary source. All repo metadata cited below was pulled from `api.github.com/repos/<owner>/<name>` directly.
+## TL;DR
+| Library | License | Last push | ★ | Verdict |
+|---|---|---|---|---|
+| **data-juicer** | Apache-2.0 | **2026-05-25** | 6.4k | ✅ **RECOMMENDED** — the only candidate with a class-based op-graph that *natively* understands `messages: [{role, content}]`, multi-turn dialog, and DPO-pair (`chosen`/`rejected`) preference samples as **first-class data formats**, with a `pair_preference_mapper` operator that maps directly onto our `extract_dpo_pairs` output. |
+| **distilabel** | Apache-2.0 | 2026-05-25 | 3.2k | Strong runner-up. DAG pipeline, native chat-message format, built-in `FormatChatGenerationDPO`. But it is primarily a *generation orchestrator* and would force us to rewrite our existing OpenRouter teacher orchestration as Distilabel `LLM` subclasses. Larger refactor surface. |
+| **datatrove** | Apache-2.0 | 2026-05-06 | 3.1k | ❌ **Deal-breaker.** `Document` dataclass is `text: str + metadata: dict`. All filters/dedup operate on flat `doc.text`. Multi-turn is only supported in the *generation* (`InferenceRunner.rollout_fn`) path, not the normalization/filter path. Forces lossy chat→string flattening. |
+| **NeMo-Curator** | Apache-2.0 | 2026-05-25 | 1.6k | Strong on scale (Ray + Xenna + GPU), supports streaming and DPO via `generate_two_turn_prompt`. But: semantic dedup, fuzzy dedup, and classifier filters all *require GPUs*; CPU-only install drops most of the differentiating ops. Heavy framework for the size of replaysim. |
+| **lilac** | Apache-2.0 | **archived 2024-03-19** | 1.1k | ❌ **Dead.** `databricks/lilac` repo `"archived": true`. The current `lilacai/lilac` is a 2-star squatter stub created Nov 2025. Do not adopt. |
+**Recommendation:** Adopt **data-juicer** as the normalization op-graph layer wrapped around `replay_trace` → `extract_dpo_pairs`. Estimated integration cost: **~250–400 LOC** in `composer_replication.replaysim` for an adapter + 1 YAML recipe.
+**Critical chat-template question answered:** data-juicer is the only audited library whose *filtering and normalization operators* (not just its generation operators) operate directly on a structured `messages: [{role, content}]` format and on `chosen`/`rejected` preference-pair format. The other three live candidates fall short in different ways: datatrove flattens to text and handles chat only in its generation path, while NeMo-Curator and (in part) distilabel treat chat as a generation output to be assembled rather than a structured object to be filtered.
+---
+## 1. Audit Methodology
+For each candidate, primary-source data was collected from:
+1. `https://api.github.com/repos/<owner>/<name>` for license, `pushed_at`, `archived`, stars, forks, topics — these are authoritative GitHub metadata, not scraped.
+2. DeepWiki structured indexes of each repo's source tree for: op model, data structures (`Document` / `Sample` / `Step`), conversation/DPO support in filtering vs. generation paths, GPU dependencies.
+3. README confirmation through the GitHub API for transferred-org redirects.
+No secondary sources, no marketing pages, no blog posts.
+Two facts to flag up front because they materially change the candidate set:
+- `modelscope/data-juicer` redirects to **`datajuicer/data-juicer`**. The team spun out of ModelScope into a dedicated `datajuicer` org. Same code, just a transferred name — `pushed_at` is current.
+- `NVIDIA/NeMo-Curator` redirects to **`NVIDIA-NeMo/Curator`**. Same situation — moved into the dedicated `NVIDIA-NeMo` org in 2025.
+---
+## 2. Per-Candidate Audit
+### 2.1 datatrove (huggingface)
+| Dimension | Value |
+|---|---|
+| Repo | `huggingface/datatrove` |
+| License | Apache-2.0 |
+| Created | 2023-06-14 |
+| Last push | **2026-05-06** |
+| Stars / Forks | 3068 / 266 |
+| Commits | 725 (default branch) |
+| Maturity | Production. Used to build FineWeb. Active. |
+**Op model.** Class-based **linear pipeline** of `PipelineStep` instances. `PipelineStep.run(data: DocumentsPipeline, rank: int, world_size: int) -> DocumentsPipeline` where `DocumentsPipeline` is an iterator of `Document` objects. Steps are composed by Python list concatenation, not a DAG — branching/joining requires manual orchestration.
+**Multi-turn / chat-template support — DEAL-BREAKER.**
+The `Document` dataclass (`src/datatrove/data.py`) is:
+```python
+@dataclass
+class Document:
+    text: str
+    id: str
+    media: list[Media]   # placeholder, "for future uses, currently not used"
+    metadata: dict
+```
+There is **no `messages` field**. Every built-in filter (e.g., `C4QualityFilter`, `LanguageFilter`, `GopherQualityFilter`) and every built-in dedup op (`MinhashDedup*`, `SentenceDedup*`, `BloomFilter`) operates on `doc.text` as a flat string.
+Multi-turn does appear, but **only in the generation path** (`InferenceRunner` + user-supplied `rollout_fn(doc, generate)`), where the user constructs `{"messages": [{"role": ..., "content": ...}]}` payloads themselves. Once the generation completes, the result is stuffed back into `doc.text` (or `doc.metadata`) and downstream filters again see flat text.
+For our use case — normalizing already-generated multi-turn DPO pairs with `chosen`/`rejected` chat structures and tool calls — this means we'd have to:
+1. Serialize `messages` into a flat string (`<|im_start|>user...`).
+2. Run datatrove filters on the serialized string.
+3. Re-parse back into `messages` afterward.
+Tool-call structure (`{"role": "tool", "tool_call_id": ...}`, `tool_calls: [...]`) does not survive that round-trip cleanly without custom serialization on both sides. Per the user's hard requirement — "if only flat text, that's a deal-breaker" — datatrove fails here.
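The loss in that round-trip is easy to demonstrate. The sketch below uses a naive ChatML-style renderer and parser (both hypothetical, written only for this illustration) and reports which message keys fail to survive flattening.

```python
import re


def flatten_chatml(messages):
    """Render messages to a ChatML-style string, keeping only role and content."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m.get('content') or ''}<|im_end|>\n"
        for m in messages
    )


def parse_chatml(text):
    """Parse the flat string back into messages; only role and content are recoverable."""
    pattern = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>\n", re.DOTALL)
    return [{"role": role, "content": content} for role, content in pattern.findall(text)]


messages = [
    {"role": "user", "content": "List the repo files."},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "ls", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "README.md"},
]

round_tripped = parse_chatml(flatten_chatml(messages))
original_keys = {key for m in messages for key in m}
surviving_keys = {key for m in round_tripped for key in m}
lost_keys = sorted(original_keys - surviving_keys)
```

A custom serializer could encode the extra keys into the string, but it would then have to be kept in sync with every filter that inspects the text.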
+**Streaming.** Yes. `HuggingFaceDatasetReader(streaming=True)` and the iterator-based `PipelineStep.run` mean we can pipe documents through during generation. Streaming is fine.
+**GPU.** None of the *normalization* ops require GPU. MinHash dedup is CPU. Only the `InferenceRunner` path needs a GPU (vLLM/SGLang backend) and we don't need that — we'd be calling OpenRouter, not running local models.
+**Integration cost.** Moot — the chat-template gap is the deal-breaker.
+---
+### 2.2 data-juicer (datajuicer org, formerly modelscope)
+| Dimension | Value |
+|---|---|
+| Repo | `datajuicer/data-juicer` (redirect target of the legacy `modelscope/data-juicer`) |
+| License | Apache-2.0 |
+| Created | 2023-08-01 |
+| Last push | **2026-05-25** (most recent of all candidates) |
+| Stars / Forks | 6444 / 373 |
+| Maturity | Production. Active core team (Alibaba/ModelScope-spinout). Most stars of the candidate set. Has its own conference papers and a docs site at `datajuicer.github.io/data-juicer`. |
+**Op model.** Class-based DAG of **operators ("Ops")** organized as **mappers**, **filters**, **deduplicators**, and **selectors**. Each Op is a Python class subclassing `Mapper`/`Filter`/`Deduplicator`. Pipelines are declared as YAML recipes (`process: [- op_name: { args }, ...]`) and executed by the `Executor` (by default a local executor over Hugging Face `datasets`/Arrow; optionally Ray-distributed). Conditional branching through `OpFusion` and `Adapter` modules is supported, and there is a Ray-Data executor for true streaming.
+**Multi-turn / chat-template support — NATIVE.** This is the discriminator.
+Data-juicer has a **first-class conversation schema**, supporting *both*:
+1. OpenAI-style `messages: [{role, content}]`
+2. A "Data-Juicer format" `{query, response, history: [[q, r], ...]}`
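To make the relationship between the two layouts concrete, here is a hedged conversion sketch. It assumes strictly alternating user/assistant turns and maps the final exchange to `query`/`response` with earlier exchanges in `history`; that mapping is an interpretation of the schema as described above, and `messages_to_qrh` is a hypothetical helper, not a data-juicer function.

```python
def messages_to_qrh(messages):
    """Convert OpenAI-style messages to {query, response, history: [[q, r], ...]}.

    Assumption: turns alternate user/assistant. Other roles (system, tool) are
    dropped here, which is lossy and only acceptable for illustration.
    """
    turns = [m for m in messages if m["role"] in ("user", "assistant")]
    expected = ("user", "assistant")
    if not turns or len(turns) % 2 or any(
        m["role"] != expected[i % 2] for i, m in enumerate(turns)
    ):
        raise ValueError("expected alternating user/assistant turns")
    pairs = [[turns[i]["content"], turns[i + 1]["content"]]
             for i in range(0, len(turns), 2)]
    *history, (query, response) = pairs
    return {"query": query, "response": response, "history": history}


sample = messages_to_qrh([
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "r1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "r2"},
])
```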
+It exposes operators that are *purpose-built* for dialog/preference data:
+- `dialog_intent_detection_mapper`
+- `dialog_sentiment_detection_mapper`
+- `dialog_sentiment_intensity_mapper`
+- `dialog_topic_detection_mapper`
+- `pair_preference_mapper` — **directly relevant**: ingests a `(prompt, chosen)` and synthesizes/refines a `rejected_response` plus a `reason` field. This is exactly the schema produced by our `extract_dpo_pairs`.
+- `query_intent_detection_mapper`, `query_sentiment_detection_mapper`, `query_topic_detection_mapper`
+- `optimize_qa_mapper`, `optimize_query_mapper`, `optimize_response_mapper` — refine individual fields without flattening the whole conversation.
+Tool-call structure: data-juicer's conversation schema preserves arbitrary keys per message (because it operates on dict-of-lists Arrow tables), so `tool_call_id`, `tool_calls`, `name`, etc. survive through filters as long as no operator explicitly drops them. This is structurally safe — confirmed by the operator code only reading `role`/`content` and forwarding the rest.
+**Streaming.** Partial. The default executor is batch on Arrow/HF datasets, but data-juicer integrated with **Ray Data** for distributed/streaming processing, and the README references "streaming JSON reader patches integrated by Apache Arrow." For our scale (≤100k DPO pairs per run), batch is fine; for true online normalization during multi-teacher generation, the Ray executor handles it — but a simpler approach is to wrap each `replay_trace` rollout's output into a tiny in-memory dataset and run the recipe per-batch (mini-batch streaming).
+**GPU.** Only needed for image/video/multi-modal ops and for the LLM-API mappers when configured to run a *local* model. Every op we care about for replaysim — `pair_preference_mapper`, dialog detection mappers, `text_length_filter`, `language_id_score_filter`, MinHash dedup, etc. — is CPU-OK or calls a remote API (which is exactly our existing OpenRouter pattern). Importantly, **MinHash and exact dedup in data-juicer do not require GPU**, unlike NeMo-Curator's fuzzy/semantic dedup.
+**Integration cost into `composer_replication.replaysim`.** Estimated ~250–400 LOC, breakdown:
+- Adapter `replaysim/normalize.py`: ~80–120 LOC. Wraps a `DJDataset` (data-juicer's dataset abstraction), exposes `normalize_dpo_batch(pairs: list[DPOPair]) -> list[DPOPair]`.
+- YAML recipe `replaysim/recipes/dpo_normalize.yaml`: ~40 LOC declarative.
+- Hook in `teacher_replay.py` after `extract_dpo_pairs` and before final write: ~20 LOC.
+- New tests `tests/replaysim/test_normalize.py`: ~80–120 LOC.
+- ADR-004 update + module docs: ~20 LOC.
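The adapter in the first bullet could take roughly the following shape. This sketch deliberately avoids importing data-juicer: the recipe runner is injected as a callable (`run_recipe`, hypothetical) that takes and returns a list of row dicts, and the `DPOPair` fields are assumed for illustration rather than taken from the codebase.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Callable


@dataclass(frozen=True)
class DPOPair:
    # Assumed fields; the real DPOPair in replaysim may differ.
    prompt: list[dict]
    chosen: list[dict]
    rejected: list[dict]


def normalize_dpo_batch(
    pairs: list[DPOPair],
    run_recipe: Callable[[list[dict]], list[dict]] | None = None,
) -> list[DPOPair]:
    """Pass pairs through a normalization recipe; passthrough when none is configured."""
    if run_recipe is None:
        return list(pairs)
    rows = [asdict(p) for p in pairs]          # structured dicts, never flattened to text
    return [DPOPair(**row) for row in run_recipe(rows)]


def drop_degenerate(rows):
    """Stand-in recipe: drop pairs whose chosen and rejected sides are identical."""
    return [r for r in rows if r["chosen"] != r["rejected"]]


user = [{"role": "user", "content": "hi"}]
good = DPOPair(user, [{"role": "assistant", "content": "A"}], [{"role": "assistant", "content": "B"}])
bad = DPOPair(user, [{"role": "assistant", "content": "A"}], [{"role": "assistant", "content": "A"}])
kept = normalize_dpo_batch([good, bad], run_recipe=drop_degenerate)
```

Injecting the runner keeps the adapter testable without the dependency installed, and makes the passthrough mode a one-line branch.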
+Dependency footprint: `pip install py-data-juicer` pulls in `datasets`, `pyarrow`, `loguru`, `jsonargparse`, optionally `ray`. We already have `datasets`/`pyarrow` indirectly from HF stack.
+---
+### 2.3 NeMo-Curator (NVIDIA-NeMo)
+| Dimension | Value |
+|---|---|
+| Repo | `NVIDIA-NeMo/Curator` (redirect target of `NVIDIA/NeMo-Curator`) |
+| License | Apache-2.0 |
+| Created | 2024-03-14 |
+| Last push | **2026-05-25** |
+| Stars / Forks | 1584 / 274 |
+| Maturity | Production at NVIDIA scale. Built for pre-training-corpus curation (Nemotron / Nemotron-4). |
+**Op model.** Task-centric distributed processing, built on **Ray** + the **Xenna** executor. Stages are class-based, composed into pipelines, executed by `XennaExecutor` in either `streaming` or `batch` mode. Closer to Spark/Ray-Data than to a Python list of steps.
+**Multi-turn / chat-template support — partial, generation-side only.** Curator has model-specific formatters (`Mixtral8x7BFormatter`, `NemotronFormatter`) that *render* multi-turn dialogue into a flat prompt string for the target model's chat template. There is `generate_dialogue` for multi-turn synthesis and `generate_two_turn_prompt` for DPO-style preference pairs. **But**: like datatrove, the *filtering* and *deduplication* stages do not have first-class conversation/preference operators — they treat the data as text after rendering. Tool-call preservation is not addressed in the public API.
+**Streaming.** Yes — `XennaExecutor(execution_mode="streaming")` is a first-class option.
+**GPU — significant cost.** Curator's discriminating features all require GPUs:
+- **Semantic deduplication** — GPU-only, embedding generation + clustering. "Not supported for CPU-only processing."
+- **Fuzzy deduplication** (MinHash + LSH) — GPU backend (cuDF/cuML), not CPU.
+- **Classifier filters** (domain / quality / safety via `DistributedDataClassifier`) — GPU clusters.
+- **Image curation modules** — GPU.
+CPU-only install supports basic text filters and exact dedup, but *that's the same surface area we'd get from data-juicer without the dependency weight*. If we are not running on a GPU cluster, NeMo-Curator's value proposition collapses.
+**Integration cost.** ~600–900 LOC plus operational cost: a Ray cluster setup, GPU nodes if we want the differentiating features. For replaysim's scale (a few thousand DPO pairs per run), this is overkill.
+---
+### 2.4 distilabel (argilla-io)
+| Dimension | Value |
+|---|---|
+| Repo | `argilla-io/distilabel` |
+| License | Apache-2.0 |
+| Created | 2023-10-16 |
+| Last push | **2026-05-25** |
+| Stars / Forks | 3230 / 242 |
+| Maturity | Production. Argilla is now part of HF; project remains active under argilla-io. |
+**Op model.** **DAG pipeline** of `Step` and `Task` (Task = Step with an LLM). Each step declares `inputs: list[str]`, `outputs: list[str]`, and `process(*inputs) -> Generator[outputs]`. Steps are wired via `>>` operator. Resource declarations (`StepResources(replicas=N, gpus=M)`) handle scaling, optionally on Ray.
+**Multi-turn / chat-template support — NATIVE on the generation side, partial on the normalization side.**
+- `ChatGeneration` task accepts OpenAI-format `messages: [{role, content}]` natively.
+- `FormatTextGenerationDPO` and `FormatChatGenerationDPO` produce the exact `{prompt, chosen, rejected, ratings, reason}` schema we want.
+- `UltraFeedback` task is the canonical preference-rating step.
+- `DeitaFiltering` and `MinHashDedup` are the only filtering/dedup steps; they operate on text fields rather than on structured `messages`. Tool-call structure is preserved as long as no step explicitly normalizes it (like data-juicer, by virtue of dict-of-fields semantics) — but there isn't a `pair_preference_mapper` analogue that operates on `messages` directly.
+**Streaming.** Supports streaming generation per LLM (e.g., `AnthropicLLM` streams tokens). Pipeline-level execution is batch-of-batches; you can `.run(parameters={...})` and consume outputs as they materialize.
+**GPU.** Only when steps choose to run a local LLM (vLLM, transformers). API-based steps (OpenAI, Anthropic, Mistral, OpenRouter via OpenAI-compat) are CPU-only.
+**Integration cost — large but high overlap.** Distilabel would *replace* much of `teacher_replay.py`, not just normalize after it:
+- Rewrite multi-teacher OpenRouter calls as a `Pipeline` of `Task`s subclassing distilabel's `LLM` interface (or use the `OpenAILLM` wrapper pointed at OpenRouter): ~300–500 LOC delta.
+- Re-express `extract_dpo_pairs` as a custom `Task` or use `FormatChatGenerationDPO`: ~100–150 LOC.
+- Migrate trace plumbing into distilabel's `GeneratorStep`/`Task` DAG: ~150 LOC.
+- Tests + docs: ~150 LOC.
+Total **~700–950 LOC** and a meaningful refactor of teacher orchestration. The win is that we'd get a real DAG runtime, retries, caching, and Argilla integration for free. The cost is that we'd be *coupled* to distilabel's `LLM`/`Task` abstractions for the entire generation pipeline, not just a normalization op-graph wrapped around it.
+This is a strategic decision the user phrased as: "see if we can leverage [a normalization library] to **normalize the data while also making the replaysim dataset generation**." Distilabel takes the broader interpretation — replace replaysim's generation with a distilabel pipeline. That is a bigger commitment than this recon was scoped to recommend.
+---
+### 2.5 lilac
+**STATUS: dead. Do not adopt.**
+- `databricks/lilac`: `"archived": true`, last push **2024-03-19**, license Apache-2.0. Repo says "Curate better data for LLMs." The Databricks acquisition (announced March 2024) absorbed it into Databricks Mosaic AI; the OSS project was archived shortly after.
+- `lilacai/lilac`: created **2025-11-14** by a user account `lilacai`, 2 stars, 0 forks, no license, description says "Thee Eclipse - Hackerone: @theeeclipse." This is a **squatter / unrelated stub**, not the original lilac.
+- No actively maintained successor with the original lilac code base outside Databricks' proprietary platform.
+---
+## 3. Recommendation: data-juicer
+### 3.1 Why
+1. **Only candidate with native conversation + preference-pair operators in the *normalization* path**, not just the generation path. `pair_preference_mapper` is a near-perfect fit for the output of `extract_dpo_pairs`.
+2. **Tool-call structure is preserved** because operators read specific fields and forward the rest of the dict — confirmed by the operator schema design.
+3. **No GPU required** for the operators we'd actually use (preference, dialog, length, language-id, MinHash dedup). Matches our OpenRouter-API-driven, CPU-friendly architecture.
+4. **YAML-recipe style** lets us version the normalization graph as a config artifact alongside the recon doc, instead of as Python code that drifts.
+5. **Lowest integration cost** of the viable candidates — wraps around our existing pipeline rather than replacing it.
+6. **Maturity**: 6.4k stars, last push today, dedicated org, paper-backed.
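Point 2 rests on "read specific fields, forward the rest" semantics. The sketch below shows that pattern in plain Python. `strip_response_mapper` is a hypothetical function written for this note, not a data-juicer operator, and the sample layout (OpenAI-style `messages` with `tool_calls`) is an assumption about what replaysim emits.

```python
def strip_response_mapper(sample: dict, text_key: str = "chosen") -> dict:
    """Trim whitespace in one field; every other key is forwarded untouched."""
    out = dict(sample)  # shallow copy: fields we do not name are passed through as-is
    value = out[text_key]
    if isinstance(value, str):
        out[text_key] = value.strip()
    else:
        # messages[]: only rewrite string `content`, keep role/tool_calls/etc.
        out[text_key] = [
            {**msg, "content": msg["content"].strip()}
            if isinstance(msg.get("content"), str) else msg
            for msg in value
        ]
    return out


sample = {
    "prompt": [
        {"role": "user", "content": "List files"},
        {"role": "assistant", "content": None,
         "tool_calls": [{"id": "c1", "type": "function",
                         "function": {"name": "ls", "arguments": "{\"path\": \".\"}"}}]},
    ],
    "chosen": [{"role": "assistant", "content": "  done \n"}],
    "rejected": [{"role": "assistant", "content": "failed"}],
    "meta": {"trace_id": "t-1", "custom_key": 42},
}
normalized = strip_response_mapper(sample)
```

Because the mapper never touches `prompt` or `meta`, the tool-call structure and arbitrary metadata survive by construction. The round-trip test in the risk register exists to catch any real operator that does not behave this way.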
+### 3.2 Why not the others (one-liners)
+- **datatrove**: flat-text `Document`, lossy round-trip on chat structure → deal-breaker.
+- **distilabel**: would force a rewrite of teacher orchestration — too broad a refactor for "wrap normalization around the existing pipeline."
+- **NeMo-Curator**: best ops require GPUs; without them it offers no advantage over data-juicer.
+- **lilac**: archived.
+### 3.3 Risk register
+| Risk | Severity | Mitigation |
+|---|---|---|
+| Data-juicer YAML recipe drift between dev and CI | M | Pin `py-data-juicer` version; commit recipe under `replaysim/recipes/` and load via `importlib.resources`. |
+| Some ops silently coerce conversation structure | M | Add a round-trip test: `pair → normalize → pair` must preserve `messages`, `tool_calls`, and arbitrary metadata. |
+| Ray executor bloat if user enables it | L | Default to the local (non-Ray) executor; gate Ray behind an explicit flag. |
+| `pair_preference_mapper` calls an LLM by default to synthesize `rejected` | H | We *already have* `rejected` from disagreement. Configure the mapper as a pass-through filter / use it only for refinement; if it can't be made non-LLM, fall back to a custom Mapper that just runs length/language/dedup checks on the existing pair. **Verify in spike before locking in.** |
+| Apache-2.0 inbound license compatibility | L | Our framework is Apache-2.0. Compatible. |
+| Op-graph executes per batch, not per sample, so a single bad pair stalls a batch | L | Use small batches (e.g. 64) so a stall is bounded. |
+### 3.4 Open spike question (must verify before merge)
+The single risk worth a 1-day spike: **does `pair_preference_mapper` accept a pre-existing `rejected` and *only* run validation/length/language filters, or does it *always* call an LLM to (re)synthesize a rejected response?** Read the operator source in `data_juicer/ops/mapper/pair_preference_mapper.py` and confirm. If the latter, we wire our pre-existing `rejected` through `optimize_response_mapper` (refinement, not regeneration) plus a custom no-op preference validator. Either way, the integration shape below stands; only the recipe content changes.
+---
+## 4. Integration Sketch
+### 4.1 Current pipeline (today)
+```
+TraceState
+   │
+   ▼  (per-trace, multi-teacher OpenRouter call)
+replay_trace(state, teachers=[m1, m2, m3])
+   │
+   ▼  (returns: list[TeacherCompletion] keyed by model_id)
+disagreement_score(completions)
+   │
+   ▼  (if score > τ)
+extract_dpo_pairs(completions, state)
+   │
+   ▼  (yields)
+DPOPair { prompt: messages[], chosen: messages[], rejected: messages[], state, meta }
+   │
+   ▼
+write_jsonl(out_path)
+```
+### 4.2 Proposed pipeline (with data-juicer normalization op-graph)
+```
+TraceState
+   │
+   ▼
+replay_trace(state, teachers)         ← unchanged
+   │
+   ▼
+disagreement_score(completions)        ← unchanged
+   │
+   ▼
+extract_dpo_pairs(completions, state)  ← unchanged
+   │
+   ▼
+[NEW] DJNormalizer.normalize_batch(dpo_pairs) ──── loads recipe from
+   │                                                replaysim/recipes/dpo_normalize.yaml
+   │   data-juicer op-graph runs:
+   │     1. text_length_filter (on chosen + rejected separately)
+   │     2. language_id_score_filter (en-only or configured)
+   │     3. dialog_topic_detection_mapper (annotates meta, no drop)
+   │     4. document_minhash_deduplicator (on prompt+chosen serialization)
+   │     5. (optional) optimize_response_mapper to clean trailing whitespace, code-block fences
+   │     6. custom PreferenceValidator op (chosen != rejected, both non-empty,
+   │        tool_calls structurally valid)
+   ▼
+write_jsonl(out_path)                  ← unchanged consumer
+```
+The op-graph is a **wrapper around** `extract_dpo_pairs`, not a replacement. `replay_trace` and `extract_dpo_pairs` keep their current signatures. The only call-site change in `teacher_replay.py` is one line:
+```python
+# before:
+pairs = list(extract_dpo_pairs(completions, state))
+write_jsonl(out_path, pairs)
+# after:
+pairs = list(extract_dpo_pairs(completions, state))
+pairs = DJNormalizer.from_recipe("dpo_normalize.yaml").normalize_batch(pairs)
+write_jsonl(out_path, pairs)
+```
+### 4.3 Adapter shape (`replaysim/normalize.py`)
+```python
+# composer_replication/replaysim/normalize.py
+from __future__ import annotations
+from importlib.resources import files
+from typing import Iterable
+from data_juicer.config import init_configs
+from data_juicer.core.executor import DefaultExecutor
+from .types import DPOPair
+class DJNormalizer:
+    """Wraps a data-juicer op-graph as a batch normalization step over
+    DPOPair samples produced by extract_dpo_pairs.
+    The recipe (YAML) declares the op sequence. Operators consume and
+    produce the data-juicer conversation schema, which we convert to
+    and from our internal DPOPair on the boundary.
+    """
+    def __init__(self, recipe_path: str):
+        cfg = init_configs(["--config", recipe_path])
+        self._executor = DefaultExecutor(cfg)
+    @classmethod
+    def from_recipe(cls, name: str) -> "DJNormalizer":
+        recipe = files("composer_replication.replaysim.recipes") / name
+        return cls(str(recipe))
+    @staticmethod
+    def _to_dj(p: DPOPair) -> dict:
+        # data-juicer preference schema:
+        #   {"prompt": str-or-messages, "chosen": str-or-messages,
+        #    "rejected": str-or-messages, "meta": {...}}
+        return {
+            "prompt": p.prompt,        # messages[]
+            "chosen": p.chosen,        # messages[]
+            "rejected": p.rejected,    # messages[]
+            "meta": {
+                **p.meta,  # arbitrary metadata first, so the explicit keys below win
+                "trace_id": p.state.trace_id,
+                "teachers": p.meta.get("teachers", []),
+                "disagreement": p.meta.get("disagreement"),
+            },
+        }
+    @staticmethod
+    def _from_dj(s: dict) -> DPOPair:
+        return DPOPair(
+            prompt=s["prompt"],
+            chosen=s["chosen"],
+            rejected=s["rejected"],
+            state=...,           # rehydrate from meta.trace_id + cache
+            meta=s.get("meta", {}),
+        )
+    def normalize_batch(self, pairs: Iterable[DPOPair]) -> list[DPOPair]:
+        in_records = [self._to_dj(p) for p in pairs]
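+        # Build an in-memory DJDataset from records (no disk round-trip).
+        # NOTE: `load_dataset_from_records` is a placeholder name; confirm the
+        # actual in-memory loading API against the pinned data-juicer version.
+        ds = self._executor.formatter.load_dataset_from_records(in_records)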
+        ds = self._executor.run(dataset=ds)
+        out_records = ds.to_list()
+        return [self._from_dj(r) for r in out_records]
+```
+### 4.4 Recipe (`replaysim/recipes/dpo_normalize.yaml`)
+```yaml
+# data-juicer recipe for normalizing replaysim DPO output
+project_name: replaysim_dpo_normalize
+executor_type: default          # local executor; switch to 'ray' for distributed
+np: 4
+# Conversation/preference schema mode
+text_keys: ['chosen', 'rejected']    # ops scan both response variants
+suffixes: ['.jsonl']
+process:
+  # 1. Length sanity on each response variant
+  - text_length_filter:
+      text_key: chosen
+      min_len: 10
+      max_len: 16384
+  - text_length_filter:
+      text_key: rejected
+      min_len: 10
+      max_len: 16384
+  # 2. Language gate (configurable; default English-only)
+  - language_id_score_filter:
+      text_key: chosen
+      lang: en
+      min_score: 0.6
+  # 3. Dialog topic annotation (no drop, just attaches meta.topic)
+  - dialog_topic_detection_mapper:
+      api_or_hf_model: openrouter:openai/gpt-4o-mini
+      mode: annotate
+  # 4. Near-duplicate removal across the batch (keyed on `chosen` here; serialize
+  #    prompt + chosen into one field first if dedup should cover both)
+  - document_minhash_deduplicator:
+      tokenization: space
+      window_size: 5
+      num_permutations: 256
+      jaccard_threshold: 0.85
+      text_key: chosen
+  # 5. Custom preference validator (chosen != rejected, structural integrity)
+  - preference_validator_filter:        # module: composer_replication.replaysim.ops
+      check_distinct: true
+      check_tool_calls_valid: true
+```
+A custom op `preference_validator_filter` lives in `composer_replication/replaysim/ops/preference_validator.py` and is registered via data-juicer's plugin entry point.
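The checks that `preference_validator_filter` is meant to perform can be sketched as a plain predicate, independent of how the op is eventually registered with data-juicer. This is a minimal sketch under stated assumptions: `is_valid_pair` and its helpers are hypothetical names, and "structurally valid" is taken to mean OpenAI-style `tool_calls` entries with a non-empty `function.name` and JSON-parseable `function.arguments`.

```python
import json


def _has_content(messages: list[dict]) -> bool:
    """True if at least one message carries text or a tool call."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str) and content.strip():
            return True
        if msg.get("tool_calls"):
            return True
    return False


def _tool_calls_valid(messages: list[dict]) -> bool:
    """Assumed shape: {"function": {"name": str, "arguments": JSON string or dict}}."""
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            fn = call.get("function") if isinstance(call, dict) else None
            if not isinstance(fn, dict):
                return False
            name = fn.get("name")
            if not isinstance(name, str) or not name:
                return False
            args = fn.get("arguments", "{}")
            if isinstance(args, str):
                try:
                    json.loads(args)
                except json.JSONDecodeError:
                    return False
            elif not isinstance(args, dict):
                return False
    return True


def is_valid_pair(pair: dict, *, check_distinct: bool = True,
                  check_tool_calls_valid: bool = True) -> bool:
    chosen = pair.get("chosen") or []
    rejected = pair.get("rejected") or []
    if not _has_content(chosen) or not _has_content(rejected):
        return False
    if check_distinct and chosen == rejected:
        return False
    if check_tool_calls_valid and not (
        _tool_calls_valid(chosen) and _tool_calls_valid(rejected)
    ):
        return False
    return True
```

The two keyword flags mirror `check_distinct` and `check_tool_calls_valid` in the recipe, so the YAML parameters map one-to-one onto the predicate's arguments.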
+### 4.5 Hook into `teacher_replay.py`
+```python
+# composer_replication/replaysim/teacher_replay.py (delta)
+from .normalize import DJNormalizer
+def run_replay(traces, teachers, out_path, *, normalize: bool = True):
+    pairs: list[DPOPair] = []
+    for state in traces:
+        completions = replay_trace(state, teachers=teachers)
+        if disagreement_score(completions) <= TAU:
+            continue
+        pairs.extend(extract_dpo_pairs(completions, state))
+    if normalize:
+        norm = DJNormalizer.from_recipe("dpo_normalize.yaml")
+        pairs = norm.normalize_batch(pairs)
+    write_jsonl(out_path, pairs)
+```
+Passing `normalize=False` restores the old code path, which keeps rollback trivial during the initial rollout.
+### 4.6 Test plan (`tests/replaysim/test_normalize.py`)
+1. **Round-trip preservation**: synthesize a DPOPair with `tool_calls`, run through `DJNormalizer.normalize_batch`, assert tool-call structure and arbitrary `meta` keys are preserved.
+2. **Length filter**: a pair with empty `chosen` is dropped.
+3. **Language filter**: a non-English `chosen` (Cyrillic) below the score threshold is dropped.
+4. **Near-duplicate**: two pairs with identical `chosen` collapse to one.
+5. **Distinctness**: a pair where `chosen == rejected` is dropped by `preference_validator_filter`.
+6. **Multi-turn**: a 3-turn conversation in `prompt` survives end-to-end with role+content intact.
+7. **Recipe loading**: `DJNormalizer.from_recipe("dpo_normalize.yaml")` works with `importlib.resources` regardless of install location.
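For test 4, it helps to know what similarity the dedup step is approximating. MinHash estimates the Jaccard similarity between shingle sets, so the sketch below computes exact Jaccard over word 5-grams with the recipe's `window_size: 5` and `jaccard_threshold: 0.85`. It is a reference for choosing test fixtures, not data-juicer's implementation, and the helper names are hypothetical.

```python
def shingles(text: str, window_size: int = 5) -> set[tuple[str, ...]]:
    """Word n-grams after whitespace tokenization."""
    tokens = text.split()
    if len(tokens) < window_size:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + window_size])
            for i in range(len(tokens) - window_size + 1)}


def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(texts: list[str], threshold: float = 0.85, window_size: int = 5) -> list[str]:
    """Keep the first occurrence; drop later texts at or above the threshold."""
    kept: list[str] = []
    kept_shingles: list[set] = []
    for text in texts:
        sh = shingles(text, window_size)
        if any(jaccard(sh, prev) >= threshold for prev in kept_shingles):
            continue
        kept.append(text)
        kept_shingles.append(sh)
    return kept


a = "the quick brown fox jumps over the lazy dog near the river bank"
a_edited = "the quick brown fox jumps over the lazy dog near the river bend"
b = "completely different answer about sorting algorithms and their runtime complexity"
survivors = dedup([a, a, b])
```

Identical `chosen` texts collapse to one, as test 4 requires. Note that short responses are sensitive to small edits: `a` and `a_edited` differ by one word yet share only 8 of 10 distinct shingles (Jaccard 0.8), so they would both be kept at a 0.85 threshold.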
+---
+## 5. ADR-004 Implications
+ADR-004 (the umbrella ADR for "replaysim with normalization") should record:
+- **Decision**: adopt data-juicer (`datajuicer/data-juicer`, Apache-2.0) as the normalization op-graph layer.
+- **Status**: proposed; promote to accepted after the spike on `pair_preference_mapper`.
+- **Consequences**:
+  - New runtime dependency: `py-data-juicer` (transitively pulls `pyarrow`, `datasets`, `loguru`, `jsonargparse`).
+  - Optional `ray` extra for distributed execution; not enabled by default.
+  - `replaysim/recipes/*.yaml` becomes a versioned config artifact; recipe changes must accompany behavioral-test updates.
+  - Tool-call and multi-turn structure preserved through normalization — verified by round-trip test.
+- **Alternatives considered**: distilabel (too broad — would replace generation orchestration), datatrove (flat-text only — deal-breaker), NeMo-Curator (GPU-bound), lilac (archived).
+---
+## 6. Primary-source citations
+| Claim | Source |
+|---|---|
+| datatrove license, last push, archived state | `https://api.github.com/repos/huggingface/datatrove` (`license.spdx_id`, `pushed_at`, `archived`) |
+| datatrove `Document` is text+metadata, no `messages` field; built-in filters operate on `doc.text` | DeepWiki index of `huggingface/datatrove`, `src/datatrove/data.py`, `src/datatrove/pipeline/filters/c4_filters.py` |
+| datatrove multi-turn only via `InferenceRunner.rollout_fn` | DeepWiki index of `huggingface/datatrove`, `src/datatrove/pipeline/inference/run_inference.py` |
+| data-juicer license, last push, redirect to `datajuicer/data-juicer` | `https://api.github.com/repos/modelscope/data-juicer` (resolves to `datajuicer/data-juicer`) |
+| data-juicer supports `messages: [{role, content}]` and Data-Juicer dialog format `{query, response, history}` | DeepWiki index of `modelscope/data-juicer` |
+| `pair_preference_mapper` synthesizes `rejected_response` and `reason` | DeepWiki index of `modelscope/data-juicer`, `data_juicer/ops/mapper/pair_preference_mapper.py` |
+| data-juicer GPU-required ops are tagged `🚀GPU` (image/video/multi-modal); core text + dialog mappers are CPU-OK | DeepWiki index of `modelscope/data-juicer` |
+| NeMo-Curator license, last push, redirect to `NVIDIA-NeMo/Curator` | `https://api.github.com/repos/NVIDIA/NeMo-Curator` |
+| NeMo-Curator semantic dedup is GPU-only; CPU install drops differentiating ops | DeepWiki index of `NVIDIA/NeMo-Curator` |
+| distilabel license, last push, DAG model, `FormatChatGenerationDPO`, `MinHashDedup`, `DeitaFiltering` | `https://api.github.com/repos/argilla-io/distilabel`; DeepWiki index of `argilla-io/distilabel` |
+| `databricks/lilac` archived 2024-03-19 | `https://api.github.com/repos/databricks/lilac` (`archived: true`, `pushed_at: "2024-03-19T12:41:30Z"`) |
+| `lilacai/lilac` is a 2-star squatter stub created 2025-11-14 | `https://api.github.com/repos/lilacai/lilac` |
+---
+## 7. Confirmed output path
+**File:** `/home/codeseys/.hermes/hermes-agent/docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`
+**Length:** ≤600 lines (this file).

docs/research/RL_FRAMEWORKS_LANDSCAPE.md ADDED

	@@ -0,0 +1,428 @@

+# RL Post-Training Frameworks Landscape & Meta PyTorch Stack Audit
+> **Generated:** 2026-05-25
+> **Scope:** Audit of RL post-training frameworks beyond TRL+VeRL plus Meta's PyTorch agentic stack components, with a recommendation of two additions to the Composer Replication Framework.
+> **Feeds:** ADR-006 (Algorithm-substrate selection)
+> **Companion docs:** `~/wiki/research/post-training-framework/04-verl-trl.md`, `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md`, `~/wiki/research/post-training-framework/02-diloco-family.md`
+---
+## TL;DR — Recommendation
+| Slot | Pick | Why |
+|---|---|---|
+| **RL framework #3 (after TRL, VeRL)** | **PRIME-RL (PrimeIntellect-ai/prime-rl)** | First-class `CustomLossConfig` extension point (`trainer.loss.type=custom` + `import_path`) — the cleanest place we have to drop our **3-channel loss (RLVR + hint-distill + trace-replay)** without forking. Already uses the `verifiers` env protocol that bridges to OpenEnv. Async, decentralized substrate. Apache-2.0. INTELLECT-2 production receipts. |
+| **Infra component (Meta stack)** | **Monarch (`meta-pytorch/monarch`)** as the actor-mesh control plane; **TorchTitan** is *also* tracked as the FSDP2/TP/PP training core but is already the trainer inside both PRIME-RL and TorchForge, so we adopt it transitively. The single net-new dependency is **Monarch**. | Monarch is the only Meta-stack component that is (a) actively shipped (v0.4 GA, v0.5 dev, weekly wheels), (b) decoupled from the now-paused TorchForge, and (c) able to host *any* SPMD trainer (TRL, VeRL, PRIME-RL) as an `ActorMesh`. BSD-3. Replaces Ray when our v0.2 lands. |
+**What we do NOT add:**
+- OpenRLHF — strong production framework (v0.9.10, 9.3K★, supports DAPO) but its custom-loss path requires modifying `openrlhf/models/loss.py` + a `Trainer` subclass. Strictly worse extension story than PRIME-RL for our specific need (3-channel loss).
+- NeMo-Aligner — no GRPO, no DAPO, heavy NeMo/Megatron dependency. Wrong shape.
+- Unsloth — TRL wrapper, RL kernels live in closed `unsloth_zoo`. We'd have to fork.
+- LLaMA-Factory — TRL wrapper, no GRPO/DAPO (delegates to EasyR1).
+- DeepSpeed-Chat — effectively unmaintained for new RL algos since Aug 2023; PPO/DPO only.
+- TorchForge — Meta has marked the repo "development paused, consolidating into TorchTitan." Borrow patterns; do not depend on it.
+- torchchat — inference / local deployment only; no training. Out of scope.
+---
+## Table of Contents
+1. [Audit Methodology](#1-audit-methodology)
+2. [RL Framework Audit](#2-rl-framework-audit)
+   1. [OpenRLHF](#21-openrlhf)
+   2. [PRIME-RL](#22-prime-rl)
+   3. [NeMo-Aligner](#23-nemo-aligner)
+   4. [Unsloth (RL)](#24-unsloth-rl)
+   5. [LLaMA-Factory](#25-llama-factory)
+   6. [DeepSpeed-Chat](#26-deepspeed-chat)
+3. [Meta PyTorch Agentic Stack — Infra vs Training Split](#3-meta-pytorch-agentic-stack)
+   1. [Monarch (coordination/infra)](#31-monarch)
+   2. [TorchTitan (training stack)](#32-torchtitan)
+   3. [TorchForge (paused)](#33-torchforge)
+   4. [torchchat (out of scope)](#34-torchchat)
+4. [Comparison Matrix](#4-comparison-matrix)
+5. [Recommendation Rationale](#5-recommendation-rationale)
+6. [Integration Sketches](#6-integration-sketches)
+7. [Sources](#7-sources)
+---
+## 1. Audit Methodology
+For each framework, we capture five fields that determine whether it can host the Composer Replication Framework's three-channel loss (RLVR + hint-distill + trace-replay) on our existing OpenEnv-compatible TRL data path:
+1. **Repo + license + last commit + maturity** — primary GitHub source, license grade for redistribution, recency, and whether the project is *production*, *research*, or *archived*.
+2. **Algorithm coverage** — does it ship GRPO and DAPO out of the box? (DAPO matters because Composer-style training inherits its decoupled clip + dynamic sampling fixes for length and std biases.)
+3. **Custom-loss extension point** — concrete file/class/config where a custom 3-channel loss can be plugged. We strongly prefer a stable public hook over forking.
+4. **Integration cost** — rough lines of code needed for a `Recipe` doc + a skeleton `Trainer` subclass that runs end-to-end on a small env.
+5. **OpenEnv data-path fit** — does it already consume the OpenEnv contract (typed `reset`/`step`/`close`, MCP tool-calling) directly, or do we have to write a shim?
+Primary sources: each repo's `README.md`, official releases page, and DeepWiki audits (where indexed). Secondary checks: PyPI release timelines for Meta packages.
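Field 2 refers to DAPO's decoupled clip and dynamic sampling. As a reminder of what those two mechanisms compute, here is a per-token sketch in plain Python. The defaults `eps_low=0.2` and `eps_high=0.28` follow the values reported in the DAPO paper; the function names are hypothetical, and a real trainer would operate on tensors rather than floats.

```python
import math
from statistics import pstdev


def decoupled_clip_objective(logp_new: float, logp_old: float, advantage: float,
                             eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with separate lower and upper clip ranges."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (min) of the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped * advantage)


def dynamic_sampling_filter(reward_groups: list[list[float]]) -> list[list[float]]:
    """Drop prompt groups whose rewards are all identical.

    With group-normalized advantages, a zero-variance group contributes
    no gradient signal, so it is filtered out before the update.
    """
    return [group for group in reward_groups if pstdev(group) > 0.0]


objective = decoupled_clip_objective(logp_new=-1.0, logp_old=-1.5, advantage=1.0)
kept_groups = dynamic_sampling_filter([[1.0, 1.0, 1.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]])
```

In the example the importance ratio is about 1.65, so the positive-advantage term is capped at `1 + eps_high = 1.28` rather than the symmetric 1.2, which is the extra headroom the decoupled upper bound provides.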
+---
+## 2. RL Framework Audit
+### 2.1 OpenRLHF
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/OpenRLHF/OpenRLHF |
+| **License** | Apache-2.0 |
+| **Stars / contributors** | 9,312 ★ / 90 contributors |
+| **Latest release** | v0.9.10, 2026-04-04 |
+| **Last push** | 2026-04-05 |
+| **Maturity** | **Production** — used in many public RLHF runs since 2023; tagline "An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ & TIS & vLLM & Ray & Async RL)" |
+| **Algorithms** | PPO, GRPO, **DAPO** (release notes; advertised as a primary feature in v0.9.x), REINFORCE++, REINFORCE++-baseline, RLOO, GSPO, Async RL, TIS (truncated importance sampling) |
+| **Custom-loss extension point** | `openrlhf/models/loss.py` — `PolicyLoss`, `DPOLoss`, `SFTLoss`, `PairWiseLoss`, `LogExpLoss` are concrete `nn.Module`s. To add a 3-channel loss you would (a) add a new `nn.Module` (e.g. `ThreeChannelLoss`) here, then (b) subclass the relevant `Trainer` (e.g. `PPOTrainer` / a new GRPO-derived trainer) and replace `self.loss_fn`. There is **no config-driven custom-loss hook** equivalent to PRIME-RL's `CustomLossConfig` — you fork or vendor. |
+| **Integration cost** | Higher than PRIME-RL. Estimated **~400–600 LOC**: ~150 LOC for a `ThreeChannelLoss` module, ~200 LOC for a `ComposerGRPOTrainer` subclass that routes the three signals (RLVR scalar, hint-distill teacher logprobs, trace-replay teacher logits), ~50 LOC for a `Recipe` doc, plus reward-fn glue. |
+| **Data-path fit** | OpenRLHF's input is HF chat templates + a Python reward function or a remote reward URL (`--reward.remote_url`, `--train.agent_func_path`). It does **not** speak the OpenEnv `reset/step` protocol natively, but our existing OpenEnv→TRL adapter could be reused as a callable behind `agent_func_path`. **Medium** lift to wire OpenEnv. |
+**Verdict:** Strong, mature, well-funded codebase with the *most* complete algorithm coverage of any candidate. Loses to PRIME-RL only because PRIME-RL has a first-class config-driven custom-loss hook that fits our exact need, and PRIME-RL already has the `verifiers`/OpenEnv shape baked into the orchestrator. We keep OpenRLHF on the radar as a fallback substrate if PRIME-RL's decentralized story is overkill for v0.1.
+---
+### 2.2 PRIME-RL
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/PrimeIntellect-ai/prime-rl |
+| **License** | Apache-2.0 |
+| **Stars / contributors** | 1,398 ★ / 60 contributors |
+| **Latest release** | v0.5.0, 2026-03-30 |
+| **Last push** | 2026-05-25 (active today) |
+| **Maturity** | **Production-research hybrid** — substrate behind INTELLECT-1/2 multi-DC runs; tagline "Async RL Training at Scale". Decentralized DiLoCo-shape compute is its differentiator. |
+| **Algorithms** | **GRPO**, GSPO, on-policy distillation with a teacher model. `default_loss_fn` = DPPO + KL (a GRPO variant; similar lineage to DAPO's decoupled-clip idea but the upstream "DAPO" label is not used verbatim). |
+| **Custom-loss extension point** | **Best in class.** `src/prime_rl/trainer/rl/loss.py` exposes a `LossInputs`/`LossOutputs` interface and `setup_loss_fn` resolves a config: `trainer.loss.type = "custom"` + `trainer.loss.import_path = "your_pkg.your_module.your_loss_fn"` + optional kwargs. The custom function receives `trainer_logprobs`, `inference_logprobs`, `teacher_logprobs`, `advantages`, `loss_mask` — i.e., the exact tensor inputs needed for a 3-channel loss (RLVR uses `advantages`, hint-distill uses `teacher_logprobs`, trace-replay can be threaded through `kwargs` as a precomputed reference). |
+| **Integration cost** | **Lowest.** Estimated **~200–300 LOC total**: ~120 LOC for a `composer_three_channel_loss` function in our package + ~30 LOC of config (`recipes/composer_v0.toml`), ~80 LOC `Recipe` doc. No subclassing required for the loss. A small adapter is needed if we precompute the trace-replay teacher distribution outside the `LossInputs` struct. |
+| **Data-path fit** | **Already aligned.** PRIME-RL's orchestrator consumes `verifiers` environments via `vf.EnvServer`. The OpenEnv ↔ verifiers shim is a known small adapter (the `verifiers` library is the Hub-side env runner that OpenEnv's TRL guide already uses). Our existing OpenEnv-compatible TRL data path drops in with a thin wrapper. |
+**Verdict:** Best fit for the framework. The combination of (i) config-driven custom loss with the right tensor signatures already present, (ii) verifiers/OpenEnv shape, (iii) decentralized async training that maps to our DiLoCo plans, makes PRIME-RL the substrate of choice for v0.1. **Recommended addition #1.**
+---
+### 2.3 NeMo-Aligner
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/NVIDIA/NeMo-Aligner |
+| **License** | Apache-2.0 |
+| **Maturity** | **Research-leaning production** — NVIDIA-maintained, tied to NeMo/Megatron-LM. Advertised as "early stages of development" in its own README. |
+| **Algorithms** | PPO, REINFORCE, RS (Rejection Sampling), DPO, RPO. **No GRPO. No DAPO.** |
+| **Custom-loss extension point** | `loss_func` method on Megatron model classes (e.g. `MegatronGPTDPOModel.loss_func`). Requires NeMo model-class subclassing and Megatron-LM familiarity. |
+| **Integration cost** | High. Estimated **~800–1,200 LOC** including .nemo conversion of HF weights, Megatron model wrapping, custom Megatron `loss_func`, and a recipe. Plus the operational cost of running on Megatron-LM (Triton kernels, NeMo container). |
+| **Data-path fit** | JSONL only; no OpenEnv. We'd write a full env adapter. |
+**Verdict:** Wrong shape. No GRPO/DAPO and tightly bound to the NeMo ecosystem. Only relevant if we ever need NVIDIA-supported large-scale Megatron RL, which we don't for the Composer Replication v0.1/v0.2 horizon. **Reject.**
+---
+### 2.4 Unsloth (RL)
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/unslothai/unsloth |
+| **License** | Apache-2.0 (per public README; not surfaced by DeepWiki snapshot but well-known) |
+| **Maturity** | **Production** for SFT and LoRA/QLoRA; **research/preview** for RL — RL support shipped in 2025 as a TRL patcher. |
+| **Algorithms** | Wraps TRL → inherits TRL's GRPO; loss-type switch supports `"grpo"`, `"bnpo"`, `"dr_grpo"`, `"dapo"`, `"cispo"`. So **GRPO and DAPO are both available** through the patched-TRL path. |
+| **Custom-loss extension point** | Problematic. The actual loss kernels live in `unsloth_zoo` (a *separate* compiled dependency). The patcher (`patch_trl_rl_trainers()`) generates modified TRL trainer classes via `exec()` from string templates. To add a new loss type you would have to (a) modify or fork `unsloth_zoo` to add a kernel, (b) extend `RL_REPLACEMENTS`, and (c) extend the `compute_loss()` switch in the patcher template. **There is no public Python subclass hook that survives the patching.** |
+| **Integration cost** | Very high if we want our own loss. Forking `unsloth_zoo` defeats the purpose of using Unsloth (which is the optimized kernels). Estimated ~1,000+ LOC plus an external repo to maintain. |
+| **Data-path fit** | TRL-shaped, so OpenEnv via TRL is fine — but only for *stock* TRL losses. Our 3-channel loss does not survive Unsloth's patching. |
+**Verdict:** Excellent for memory-efficient SFT and stock-GRPO LoRA. Wrong tool for a custom loss. **Reject** as the substrate; we may still use it as an *optional* QLoRA accelerator inside a stock-GRPO ablation run.
+---
+### 2.5 LLaMA-Factory
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/hiyouga/LLaMA-Factory |
+| **License** | Apache-2.0 |
+| **Maturity** | **Production** for breadth (50+ model families, SFT/DPO/PPO recipes), but RL is a thin TRL wrapper. |
+| **Algorithms** | PPO, DPO, KTO, ORPO, SimPO via `Custom*Trainer` subclasses of the corresponding `trl.*Trainer` classes. **No GRPO. No DAPO** in the repo itself; the README points to **EasyR1** (an external GRPO framework) for those. |
+| **Custom-loss extension point** | `compute_preference_loss` switch on `CustomDPOTrainer` (selects `sigmoid` / `hinge` / `ipo` / `kto_pair` / `orpo` / `simpo`). For PPO, you would subclass `CustomPPOTrainer` → which is `trl.PPOTrainer`. Effectively the same extension story as plain TRL, with a configuration layer on top. |
+| **Integration cost** | Moderate, ~400 LOC, but you are essentially using TRL through one extra layer. |
+| **Data-path fit** | Text/dataset-shaped, not OpenEnv-aware. Same OpenEnv-via-TRL story. |
+**Verdict:** Useful as a multi-model SFT laboratory but does not move the ball for our RL-side requirements. **Reject** as substrate; we already have TRL.
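For readers unfamiliar with what a `compute_preference_loss`-style switch selects between, the sketch below gives the standard per-pair formulas for three of the listed variants, as they are commonly defined for DPO-family trainers. It is illustrative only and not LLaMA-Factory's code: `preference_loss` is a hypothetical name, `logits` is assumed to be the policy log-ratio minus the reference log-ratio for one (chosen, rejected) pair, and `kto_pair`, `orpo`, and `simpo` are omitted because they need additional inputs.

```python
import math


def preference_loss(logits: float, loss_type: str = "sigmoid", beta: float = 0.1) -> float:
    """Per-pair preference loss.

    logits = (logp_chosen - logp_rejected) under the policy
             minus the same difference under the reference model.
    """
    if loss_type == "sigmoid":
        # -log(sigmoid(beta * logits)), written as a numerically stable softplus.
        z = -beta * logits
        return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * logits)
    if loss_type == "ipo":
        return (logits - 1.0 / (2.0 * beta)) ** 2
    raise ValueError(f"unsupported loss_type: {loss_type!r}")
```

The point for our purposes is that every branch consumes the same scalar margin. A 3-channel loss needs extra inputs (advantages, teacher log-probabilities), which is why a switch of this shape is not a natural extension point for it.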
+---
+### 2.6 DeepSpeed-Chat
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/deepspeedai/DeepSpeedExamples (the `applications/DeepSpeed-Chat/` subtree) |
+| **License** | Apache-2.0 |
+| **Maturity** | **Effectively stale.** The README's "Latest News" cuts off in August 2023. CI patches in 2025 (e.g., #6982, #7015, #7052) are dependency-pinning fixes, not feature work. The roadmap to "generalize DeepSpeed-RLHF abstraction for a wider range of RL algorithms" has not landed. |
+| **Algorithms** | PPO (3-stage RLHF) + DPO. **No GRPO. No DAPO.** |
+| **Custom-loss extension point** | `DeepSpeedPPOTrainer.train_rlhf` / `actor_loss_fn` / `critic_loss_fn`. Editable but not config-hooked. |
+| **Integration cost** | Moderate, but you inherit a frozen architecture. ~500 LOC. |
+| **Data-path fit** | Prompt-dataset-shaped; no OpenEnv. |
+**Verdict:** Pioneering for its time, no longer competitive on algorithm coverage. **Reject.**
+---
+## 3. Meta PyTorch Agentic Stack — Infra vs Training Split
+The brief asked specifically to **distinguish coordination/infra from training-stack** components. The answer is:
+| Component | Layer | Status (May 2026) | In our framework? |
+|---|---|---|---|
+| **Monarch** (`meta-pytorch/monarch`) | **Coordination / Infra** — actor mesh, RDMA data plane, supervision trees | **Active.** v0.4 GA (2026-03-26), v0.5 dev wheels daily, BSD-3 | **Yes — recommended addition.** |
+| **TorchTitan** (`pytorch/torchtitan`) | **Training stack** — FSDP2 / TP / PP / CP / float8 / MXFP8 | **Active.** BSD-3, "extensive development". Has an experimental GRPO recipe (`experiments/rl/simple_grpo_sum_digits.py`) on Monarch. | **Indirectly** — already the trainer inside PRIME-RL and TorchForge. We adopt it transitively, not as a direct dependency. |
+| **TorchForge** (`meta-pytorch/forge`) | RL post-training library | **Development paused** per the repo banner; consolidating into TorchTitan. ~685★. | **Pattern reference only.** Lift the Generator/Trainer/Rewarder *shape* but do not depend on the package. |
+| **torchchat** (`pytorch/torchchat`) | **Inference / local deployment** | Active for its own scope, but: not a training framework; no RL surface. | **Out of scope.** |
+| **OpenEnv** (`meta-pytorch/OpenEnv`) | Environment standard (covered separately) | Active. Already a v0 dependency of the framework. | Already adopted. |
+### 3.1 Monarch
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/meta-pytorch/monarch |
+| **License** | BSD-3-Clause |
+| **PyPI** | `torchmonarch`; v0.4.1 stable (2026-04-08), v0.5.0 dev wheels published daily through 2026-05-05 |
+| **Maturity** | **Experimental but actively shipped.** "Currently in an experimental stage" per the repo's own status note, but with a functioning K8s operator, daily dev wheels, and `ProcMesh`/`ActorMesh` APIs stable enough for VeRL backend experiments. |
+| **Role in our stack** | **Pure coordination/infra.** It does not train models. It hosts whatever trainer you bring (TRL, VeRL, PRIME-RL, TorchTitan) as `Actor` subclasses on a `ProcMesh`. The `monarch.spmd.SPMDActor` automatically configures `RANK`/`LOCAL_RANK`/`WORLD_SIZE` for any PyTorch-distributed script — i.e., we can lift our existing TRL or PRIME-RL workers into Monarch with minimal change. |
+| **Key abstractions** | `ProcMesh` (processes × hosts × GPUs), `ActorMesh` (typed actors with `@endpoint` methods), supervision trees, RDMA buffers, distributed tensors / DTensor integration. Underlying runtime: `hyperactor` (Rust). |
+| **Why over Ray** | Tighter PyTorch/DTensor integration; explicit RDMA data plane (Ray uses object store + standard networking); single-controller mental model maps directly to RL post-training (one controller orchestrates Generator + Trainer + Rewarder + Env actors). |
+| **Integration cost into Composer Replication** | **~300 LOC + ops**: (a) wrap our PRIME-RL trainer as an `SPMDActor`; (b) wrap our vLLM rollout server as an `Actor` with an `@endpoint generate(prompts)` method; (c) write a single controller script that creates a `ProcMesh`, spawns both meshes, and shuttles `DataProto`-shaped messages; (d) Recipe doc. The ops cost is the harder half — Monarch's K8s operator is new (v0.2.0+). |
+| **Risk** | Pre-1.0; API churn possible (e.g., `KubernetesJob.add_mesh` signature changed in v0.5). Mitigation: pin to `torchmonarch==0.4.1` for v0.2 of our framework. |
+### 3.2 TorchTitan
+| Field | Value |
+|---|---|
+| **Repo** | https://github.com/pytorch/torchtitan |
+| **License** | BSD-3-Clause |
+| **Maturity** | **Active development** for pretraining; **experimental** for RL. The GRPO experiment (`torchtitan/experiments/rl/simple_grpo_sum_digits.py`) is in `experiments/`, which the repo explicitly disclaims as removable. |
+| **Role** | **Training stack only.** Provides FSDP2 (per-parameter sharding), Tensor Parallel (incl. async TP), Pipeline Parallel (zero-bubble), Context Parallel (long-context), `torch.compile`, Float8, MXFP8, DDP, HSDP. |
+| **OpenEnv-aware?** | No, but the experimental `RLTrainer` integrates `vLLM` + Monarch actors, which is the same shape PRIME-RL uses. |
+| **Why we don't add it directly** | **PRIME-RL already uses TorchTitan-equivalent FSDP2 internals**, and TorchForge's training core was TorchTitan. Adding TorchTitan as a *direct* dependency would mean writing our own RL loop on top of it — that's TorchForge's job, and Meta paused exactly that effort. The right move is to depend on PRIME-RL, which has battle-tested distributed training patterns equivalent to TorchTitan's, and revisit TorchTitan directly only when we genuinely need its experimental zero-bubble PP or MXFP8 paths. |
+### 3.3 TorchForge (Paused)
+- Repo banner: **"Development paused — LLM training consolidating in TorchTitan."**
+- ~685 ★, 100+ open issues, last meaningful release in early 2026.
+- Patterns we should still copy:
+  - Generator/Trainer/Rewarder ActorMesh decomposition
+  - TorchStore-style RDMA weight broadcast
+  - Async toggle between sync PPO-like and fully async off-policy
+- **We do not add a TorchForge dependency.** Architectural reference only.
+### 3.4 torchchat (Out of Scope)
+- Inference / local deployment of LLMs (Eager / `torch.compile` / AOT Inductor / ExecuTorch / mobile).
+- No training, no RL.
+- Mentioned in the brief for completeness; ruled out cleanly.
+---
+## 4. Comparison Matrix
+### 4.1 RL Frameworks
+| Framework | License | Last release | Maturity | GRPO | DAPO | Custom-loss hook | OpenEnv fit | Est. integration LOC |
+|---|---|---|---|---|---|---|---|---|
+| **TRL** (baseline) | Apache-2.0 | Active | Production | ✅ | partial (tricks land per release) | Subclass `GRPOTrainer.compute_loss` | ✅ native (Oct 2025 OpenEnv guide) | already integrated |
+| **VeRL** (baseline) | Apache-2.0 | Active | Production | ✅ | ✅ | `core_algos.py` + worker subclass | shim via Ray dataloader | already skeleton |
+| **OpenRLHF** | Apache-2.0 | v0.9.10 (2026-04-04) | Production | ✅ | ✅ | `openrlhf/models/loss.py` + Trainer subclass; **no config hook** | shim via `agent_func_path` | ~400–600 |
+| **PRIME-RL** ⭐ | Apache-2.0 | v0.5.0 (2026-03-30) | Prod-research | ✅ | partial (DPPO+KL variant; not labeled DAPO) | **`CustomLossConfig` import_path — first-class** | ✅ via `verifiers` (OpenEnv-compatible) | **~200–300** |
+| **NeMo-Aligner** | Apache-2.0 | Active | Research-leaning | ❌ | ❌ | Megatron model `loss_func` | none; JSONL only | ~800–1,200 |
+| **Unsloth (RL)** | Apache-2.0 | Active | Production (SFT) / preview (RL) | ✅ (via TRL patch) | ✅ (via TRL patch) | Loss kernels in closed `unsloth_zoo`; effectively unhookable | TRL-shaped | ~1,000+ (forking) |
+| **LLaMA-Factory** | Apache-2.0 | Active | Production | ❌ (delegates to EasyR1) | ❌ | TRL `Custom*Trainer` subclass | TRL-shaped | ~400 |
+| **DeepSpeed-Chat** | Apache-2.0 | Stale (Aug 2023 features; 2025 only CI fixes) | Effectively maintenance-only | ❌ | ❌ | `DeepSpeedPPOTrainer` subclass | none | ~500 |
+### 4.2 Meta PyTorch Stack
+| Component | Layer | License | Status | In recommendation? |
+|---|---|---|---|---|
+| **Monarch** ⭐ | Coordination / actor mesh | BSD-3 | Active (v0.4 GA, v0.5 dev) | **Yes** |
+| **TorchTitan** | Training stack | BSD-3 | Active; RL experimental | Indirect (via PRIME-RL) |
+| **TorchForge** | RL library | BSD-3 | **Paused** | No — patterns only |
+| **torchchat** | Inference / deployment | BSD-3 | Active | No — out of scope |
+| **OpenEnv** | Environment standard | (Hub) | Active | Already adopted |
+---
+## 5. Recommendation Rationale
+### 5.1 Why PRIME-RL, not OpenRLHF
+OpenRLHF is in many ways the safer pick: more stars, more contributors, more algorithm coverage (it explicitly ships DAPO). The deciding factor is **the shape of our custom loss**.
+The Composer Replication Framework's signature contribution is the **three-channel reward**:
+1. **RLVR** — tests-pass scalar from the OpenEnv environment.
+2. **Composer-style hint-distill (SDPO/OPSD)** — the model self-teaches against its own hint-conditioned roll-outs; needs `teacher_logprobs` aligned to the rollout token grid.
+3. **Trace-replay multi-teacher PRM** (the novel bit) — N frozen external teachers' precomputed token-level distributions, replayed against the on-policy rollout.
+PRIME-RL's `LossInputs` dataclass already exposes exactly the tensors we need:
+```
+trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask
+```
+A custom 3-channel loss is roughly:
+```python
+def composer_three_channel_loss(li: LossInputs, *, hint_weight, replay_weight, replay_logits) -> LossOutputs:
+    # grpo_term and kl_term are placeholder helpers, not PRIME-RL functions.
+    rlvr = grpo_term(li.trainer_logprobs, li.inference_logprobs, li.advantages, li.loss_mask)
+    hint = kl_term(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
+    replay = kl_term(li.trainer_logprobs, replay_logits, li.loss_mask)
+    total = rlvr + hint_weight * hint + replay_weight * replay
+    return LossOutputs(loss=total, metrics={})  # per-channel metrics omitted in this short form
+```
+We register this with `trainer.loss.type = "custom"` + `import_path` and we're done. No subclassing, no `exec()`-patched template, no Megatron model wrapping.
+OpenRLHF would require us to (a) add a `ThreeChannelLoss` `nn.Module` to `openrlhf/models/loss.py`, (b) subclass `PPOTrainer` (or equivalent GRPO trainer) to construct it with the right teacher-logprob plumbing, and (c) carry that fork forward. ~2× the LOC, plus a fork to maintain.
+A second factor: PRIME-RL's `verifiers` env protocol is a direct precursor of OpenEnv's wire shape (HTTP/WebSocket env servers, typed observations). Our existing OpenEnv-compatible TRL data path translates with a thin adapter. OpenRLHF's `agent_func_path` is more of an escape hatch than a contract.
+A third factor: PRIME-RL was *built for decentralized training* (INTELLECT-1/2). Even though our v0.1 stays on a single cluster, the v0.2 multi-DC story drops in cleanly. OpenRLHF is Ray-on-one-cluster by design.
+### 5.2 Why Monarch, not TorchTitan or TorchForge
+Among the four Meta-stack components in the brief, only one is both (a) ours to add and (b) genuinely new functionality:
+- **TorchForge** is paused — depending on it now is a known dead end.
+- **TorchTitan** is already inside PRIME-RL transitively (PRIME-RL uses FSDP2 plus a SHARDCAST weight-broadcast layer that is morally equivalent to what TorchTitan offers). Adding TorchTitan as a *direct* dependency means writing our own RL loop on top of it, which is exactly what TorchForge tried and paused. We get TorchTitan's benefits without owning the integration.
+- **torchchat** is for local inference / mobile deployment — out of scope.
+- **Monarch** is the unique value: a PyTorch-native actor mesh that lets us replace Ray (PRIME-RL's current orchestration substrate) with something that has explicit RDMA, supervision trees, and ProcMesh/ActorMesh primitives that map directly onto our (Generator, Trainer, Rewarder, EnvServer) topology.
+The migration path is incremental:
+- **v0.1:** PRIME-RL on Ray (current). Monarch listed as roadmap.
+- **v0.2:** Wrap PRIME-RL's Trainer as a `monarch.spmd.SPMDActor`, vLLM Generator as an `Actor` with an `@endpoint generate()`. Switch the orchestrator from `ray.init()` to `this_host().spawn_procs()`.
+- Risk-mitigation: pin to `torchmonarch==0.4.1` (the last GA release before v0.5 dev). Keep a Ray fallback path active until v0.2 is stable.
+---
+## 6. Integration Sketches
+### 6.1 PRIME-RL Recipe skeleton
+`recipes/composer_v0_prime_rl.toml` (~30 LOC):
+```toml
+# composer_v0_prime_rl.toml
+[model]
+name = "Qwen/Qwen3-32B"  # or Kimi-K2.5 when MoE support lands
+[data]
+env = "swe_bench_lite"   # via verifiers EnvServer; wraps our OpenEnv adapter
+batch_size = 64
+group_size = 16
+[trainer]
+algorithm = "grpo"
+[trainer.loss]
+type = "custom"
+import_path = "composer_replication.losses.composer_three_channel_loss"
+[trainer.loss.kwargs]
+hint_weight = 0.5
+replay_weight = 0.25
+replay_logits_path = "/data/teachers/precomputed_replay.zarr"
+[teacher]
+model = "Qwen/Qwen3-32B"  # same as policy = self-teacher for hint-distill
+hint_template = "composer.hint_v1"
+[orchestrator]
+sync_mode = "async"
+shardcast = true
+```
+`composer_replication/losses.py` (~120 LOC):
+```python
+# composer_replication/losses.py
+from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
+# grpo_surrogate, masked_kl and trace_replay_kl are helpers defined elsewhere
+# in this module (not shown); they are not part of PRIME-RL.
+def composer_three_channel_loss(
+    li: LossInputs,
+    *,
+    hint_weight: float,
+    replay_weight: float,
+    replay_logits_path: str,  # must match the key under [trainer.loss.kwargs]
+) -> LossOutputs:
+    # 1. RLVR via GRPO surrogate
+    rlvr = grpo_surrogate(li.trainer_logprobs, li.inference_logprobs,
+                          li.advantages, li.loss_mask)
+    # 2. Hint-distill: KL(policy || hint-conditioned teacher)
+    hint = masked_kl(li.trainer_logprobs, li.teacher_logprobs, li.loss_mask)
+    # 3. Trace-replay: KL(policy || precomputed multi-teacher mixture)
+    replay = trace_replay_kl(li.trainer_logprobs, replay_logits_path, li.loss_mask)
+    total = rlvr + hint_weight * hint + replay_weight * replay
+    return LossOutputs(
+        loss=total,
+        # .item() copies each channel to a Python float for logging only
+        metrics={"rlvr": rlvr.item(), "hint": hint.item(), "replay": replay.item()},
+    )
+```
+Plus `docs/recipes/composer_v0_prime_rl.md` (~50 LOC) describing data layout, teacher precomputation, and reproducibility hashes.
+**Total: ~200 LOC of code + ~30 LOC config + ~50 LOC docs ≈ 280 LOC.**
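The loss above leans on helpers that are not shown. A minimal sketch of two of them follows, assuming every input is a `[batch, seq]` tensor of log-probabilities of the sampled tokens (not full-vocabulary logits) and that `advantages` broadcasts to that shape. The clipping constant and the single-sample KL estimate are illustrative choices, not taken from PRIME-RL or from the framework's actual code.

```python
import torch


def masked_mean(values: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean of `values` over positions where `mask` is non-zero."""
    mask = mask.to(values.dtype)
    return (values * mask).sum() / mask.sum().clamp_min(1.0)


def grpo_surrogate(trainer_logprobs, inference_logprobs, advantages,
                   loss_mask, clip_eps: float = 0.2):
    """PPO-style clipped surrogate on the trainer/inference importance ratio."""
    ratio = torch.exp(trainer_logprobs - inference_logprobs.detach())
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negated because optimizers minimize while the surrogate is maximized.
    return -masked_mean(torch.minimum(unclipped, clipped), loss_mask)


def masked_kl(trainer_logprobs, teacher_logprobs, loss_mask):
    """Sampled-token estimate of KL(policy || teacher): E[log pi - log p_T]."""
    return masked_mean(trainer_logprobs - teacher_logprobs.detach(), loss_mask)
```

Because only sampled-token log-probabilities are available in this setting, `masked_kl` is a Monte Carlo estimate that can go negative on a single batch. An exact per-token KL would need full teacher distributions.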
+### 6.2 Monarch wrapper sketch (v0.2)
+```python
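+# composer_replication/orchestrator/monarch_runner.py  (~120 LOC)
+# Schematic only: check import paths and spawn signatures against the pinned torchmonarch release.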
+from monarch.actor import Actor, endpoint
+from monarch.proc_mesh import this_host, ProcMesh
+class TrainerActor(Actor):
+    @endpoint
+    async def step(self, batch): ...
+class GeneratorActor(Actor):
+    @endpoint
+    async def generate(self, prompts): ...
+class RewarderActor(Actor):
+    @endpoint
+    async def score(self, traj): ...
+async def main(cfg):
+    train_mesh = await this_host().spawn_procs(TrainerActor, hosts=4, gpus=8)
+    gen_mesh   = await this_host().spawn_procs(GeneratorActor, hosts=2, gpus=8)
+    rew_mesh   = await this_host().spawn_procs(RewarderActor, hosts=1, gpus=2)
+    for step in range(cfg.steps):  # range() is not an async iterator, so plain `for`
+        prompts = await env.batch()  # env: OpenEnv client, construction elided here
+        traj = await gen_mesh.generate.broadcast(prompts)
+        rewards = await rew_mesh.score.broadcast(traj)
+        await train_mesh.step.broadcast({"traj": traj, "rewards": rewards})
+```
+**Total: ~120 LOC controller + ~50 LOC ops (K8s operator manifest) + ~80 LOC recipe doc ≈ 250 LOC.**
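The controller's control flow can be exercised without Monarch installed. The sketch below swaps the three actor meshes for plain `asyncio` stand-ins; the `Fake*` classes and `run_controller` are hypothetical names, and the reward is a toy string length. It only mirrors the generate, score, step ordering and says nothing about Monarch's real scheduling or RDMA behaviour.

```python
import asyncio


class FakeGenerator:
    async def generate(self, prompts):
        # Stand-in for the rollout server: one "trajectory" per prompt.
        return [f"{p} -> completion" for p in prompts]


class FakeRewarder:
    async def score(self, trajectories):
        # Toy reward: trajectory length in characters.
        return [float(len(t)) for t in trajectories]


class FakeTrainer:
    def __init__(self):
        self.steps_seen = 0

    async def step(self, batch):
        assert len(batch["traj"]) == len(batch["rewards"])
        self.steps_seen += 1
        return sum(batch["rewards"]) / len(batch["rewards"])


async def run_controller(num_steps, prompts):
    gen, rew, trainer = FakeGenerator(), FakeRewarder(), FakeTrainer()
    mean_rewards = []
    for _ in range(num_steps):  # synchronous loop around awaited calls
        traj = await gen.generate(prompts)
        rewards = await rew.score(traj)
        mean_rewards.append(await trainer.step({"traj": traj, "rewards": rewards}))
    return trainer.steps_seen, mean_rewards


steps_seen, mean_rewards = asyncio.run(run_controller(3, ["a", "bb"]))
```

A stand-in like this gives a Ray-free and Monarch-free unit test for the orchestration logic, which fits the plan of keeping a fallback path until v0.2 is stable.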
+---
+## 7. Sources
+### Primary
+- **OpenRLHF** — https://github.com/OpenRLHF/OpenRLHF (README, Releases v0.9.10), Apache-2.0; DeepWiki: `openrlhf/models/loss.py`, `agent_func_path`.
+- **PRIME-RL** — https://github.com/PrimeIntellect-ai/prime-rl (README, Releases v0.5.0), Apache-2.0; DeepWiki: `src/prime_rl/trainer/rl/loss.py`, `CustomLossConfig`, `LossInputs`/`LossOutputs`, `verifiers` integration.
+- **NeMo-Aligner** — https://github.com/NVIDIA/NeMo-Aligner, Apache-2.0; DeepWiki: PPO/REINFORCE/DPO/RPO; `loss_func` on Megatron model classes.
+- **Unsloth** — https://github.com/unslothai/unsloth, README RL section; DeepWiki: `patch_trl_rl_trainers()`, `unsloth_zoo` kernels, DAPO loss-type switch.
+- **LLaMA-Factory** — https://github.com/hiyouga/LLaMA-Factory, Apache-2.0; DeepWiki: `CustomPPOTrainer`/`CustomDPOTrainer`, EasyR1 reference for GRPO.
+- **DeepSpeed-Chat** — https://github.com/deepspeedai/DeepSpeedExamples (`applications/DeepSpeed-Chat/`), Apache-2.0; DeepWiki: 3-stage PPO, DPO; "Latest News" cutoff Aug 2023; 2025 PRs (#6982, #7015, #7052) confirming maintenance-only mode.
+- **Monarch** — https://github.com/meta-pytorch/monarch, BSD-3; PyPI `torchmonarch` v0.4.1 (2026-04-08), v0.5.0 dev wheels through 2026-05-05; DeepWiki: `ProcMesh`, `ActorMesh`, `monarch.spmd.SPMDActor`.
+- **TorchTitan** — https://github.com/pytorch/torchtitan, BSD-3; DeepWiki: FSDP2/TP/PP/CP, `torchtitan/experiments/rl/simple_grpo_sum_digits.py`, integration with vLLM and Monarch.
+- **TorchForge** — https://github.com/meta-pytorch/forge, BSD-3, repo banner "development paused — consolidating in TorchTitan".
+- **torchchat** — https://github.com/pytorch/torchchat, BSD-3; DeepWiki: inference-only (eager / `torch.compile` / AOT Inductor / ExecuTorch).
+### Companion repository docs (already present)
+- `~/wiki/research/post-training-framework/04-verl-trl.md` — VeRL vs TRL deep dive.
+- `~/wiki/research/post-training-framework/03-monarch-torchforge-openenv.md` — full Meta-stack survey.
+- `~/wiki/research/post-training-framework/02-diloco-family.md` — DiLoCo / OpenDiLoCo / PRIME-RL / INTELLECT-2.
+- `~/wiki/projects/composer-replication-framework.md` — current TL;DR and stage plan.
+### Notes on accuracy
+- "DAPO" labeling: OpenRLHF and Unsloth both advertise DAPO as a first-class loss type; PRIME-RL implements a DAPO-equivalent (decoupled-clip + KL) but uses the internal name `DPPO+KL` in its default loss. For our purposes this is the same family.
+- Last-commit dates and release versions are pulled from GitHub release pages (OpenRLHF, PRIME-RL) and PyPI release history (`torchmonarch`).
+- Star counts and contributor counts reflect the snapshots returned by web search at the time of writing (May 2026) and will drift; the relative ordering is stable.

docs/research/SELF_DISTILLATION_LANDSCAPE.md ADDED

	@@ -0,0 +1,418 @@

+# Self-Distillation Landscape Audit (feeds ADR-007)
+**Status:** research note, pre-experimental
+**Author:** subagent audit
+**Date:** 2026-05-25
+**Scope:** identify 2–3 distillation-channel losses worth adding to
+`composer_replication` alongside the existing GRPO + SDPO/OPSD `generalized_jsd_loss` +
+multi-teacher trace-replay DPO stack.
+**Bias:** additivity over novelty. We are looking for losses that COMPOSE with
+what is already implemented, not duplicates of it.
+---
+## TL;DR — recommended additions
+| Rank | Method | Loss role | License | LOC est. | Why it composes |
+|------|--------|-----------|---------|----------|-----------------|
+| 1 | **SimPO** (NeurIPS 2024) | Preference, reference-free | MIT | ~80 | Drop-in for trace-replay DPO; removes ref-model VRAM cost; orthogonal to JSD distillation channel |
+| 2 | **TAID** (ICLR 2025) | Interpolated-target wrapper around any KL/JSD | Apache-2.0 | ~150 | Wraps the existing `generalized_jsd_loss` — does not replace it. Closes capacity gap on small students |
+| 3 | **Entropy-Aware OPD** (ICLR 2026 Spotlight) | Token-gated forward/reverse KL mixture | CC BY 4.0 (paper); code expected | ~120 | Fixes a documented failure mode of the reverse-KL-style SDPO loss when teacher entropy is high — directly addresses a known weakness of channel 2 |
+**Honourable mention:** KTO — useful only if the framework wants to ingest
+binary thumbs-up/thumbs-down trace signals without preference pairs.
+**Not recommended:** GKD, DistiLLM, MiniLLM, Self-Rewarding LM (rationale at end).
+---
+## Audit method
+For each candidate paper (the seven the user named, plus 2026 follow-ups
+discovered via Exa search restricted to `category=research paper, startPublishedDate=2026-01-01`)
+we verified:
+1. **Primary source exists.** arXiv abstract page reachable; HTML body parsed
+   to extract the actual loss formula (not summarised from secondary sources).
+2. **Code is real.** Official repo's README was fetched, `last push` date and
+   star count recorded. Forks of MiniLLM/DistiLLM that are no longer maintained
+   were marked as such.
+3. **License is permissive enough.** MIT, Apache-2.0, BSD, CC BY 4.0 are
+   acceptable for inclusion. GPL or research-only would be flagged.
+4. **Composability check.** Read the framework's existing
+   `composer_replication/__init__.py` and `research/05-trace-replay-distillation.md`,
+   then asked: *does this loss replace something we have, or stack on top?*
+---
+## Candidate 1 — SimPO (Simple Preference Optimization) ⭐ RECOMMENDED
+### Sources
+- **arXiv:** https://arxiv.org/abs/2405.14734 (Meng, Xia, Chen — UVA + Princeton, NeurIPS 2024)
+- **GitHub:** https://github.com/princeton-nlp/SimPO
+  - License: **MIT**
+  - 949 stars, 74 forks, last commit 2024-10-12 (mature; the last commit postdates NeurIPS 2024 acceptance)
+  - Built on top of `huggingface/alignment-handbook`
+- Maturity: **production-ready**. Released checkpoints for Mistral, Llama-3, Gemma-2 base/instruct. Reproducible training configs ship with the repo.
+### Loss core (reference-free preference)
+SimPO replaces the DPO log-ratio (which requires keeping `π_ref` in memory)
+with the **average log-probability** of the sequence under the policy, plus
+a **target reward margin** γ:
+```
+r(x, y) = (β / |y|) · log π_θ(y | x)        ← length-normalised implicit reward
+                                               (no reference model)
+L_SimPO(π_θ) = −E_{(x, y_w, y_l) ~ D} [
+    log σ( r(x, y_w) − r(x, y_l) − γ )
+]
+```
+where `β` is a temperature (typically 2.0–10) and `γ` is the desired margin
+between chosen and rejected (the repo recommends `γ/β ≈ 0.5` as a starting
+point). Two consequences: (i) no `π_ref` forward pass per step, so roughly half
+the model memory, and (ii) the implicit reward is the length-normalised
+log-likelihood, the same quantity that guides generation at decode time, which
+removes a known DPO mismatch between training-time reward and decoding-time
+likelihood.
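The SimPO formula maps onto a few lines of PyTorch. This is a minimal sketch rather than the vendored `princeton-nlp/SimPO` trainer: `simpo_loss` is a hypothetical name, the inputs are assumed to be summed token log-probabilities and token counts per sequence, and the defaults `beta=2.0`, `gamma=1.0` are picked only to satisfy `γ/β = 0.5`.

```python
import torch
import torch.nn.functional as F


def simpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
               chosen_lens: torch.Tensor, rejected_lens: torch.Tensor,
               beta: float = 2.0, gamma: float = 1.0) -> torch.Tensor:
    """Reference-free preference loss.

    chosen_logps / rejected_logps: summed log pi_theta(y | x), shape [batch].
    chosen_lens / rejected_lens: token counts |y|, shape [batch].
    """
    # Length-normalised implicit rewards r(x, y) = (beta / |y|) * log pi(y | x).
    r_w = beta * chosen_logps / chosen_lens
    r_l = beta * rejected_logps / rejected_lens
    # -log sigmoid(r_w - r_l - gamma), via logsigmoid for numerical stability.
    return -F.logsigmoid(r_w - r_l - gamma).mean()
```

For a chosen sequence averaging -1 nat per token and a rejected one averaging -2, the margin is `2·(-1) - 2·(-2) - 1 = 1`, so the loss is `-log σ(1)`. No reference-model tensors appear anywhere in the signature, which is the memory saving described above.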
+### Why it composes with the existing stack
+- The framework's **channel 3** is multi-teacher trace-replay DPO. SimPO is a
+  drop-in replacement for the DPO step inside that channel — same `(x, y_w, y_l)`
+  data contract, different loss head. So the trace-replay harvester does not
+  change at all.
+- It does **not** touch channel 2 (SDPO/OPSD `generalized_jsd_loss`). The two
+  are complementary: JSD-distillation transfers token-level teacher knowledge,
+  SimPO sharpens preference structure between trace alternatives.
+- It does **not** duplicate GRPO either. GRPO is on-policy RLVR;
+  SimPO is offline preference. Different data sources.
+- The published Mistral-7B and Llama-3-8B SimPO results beat DPO by 4–6 points
+  on AlpacaEval-2 LC, which directly translates to "if we already have channel-3
+  pairs, SimPO is a free upgrade".
+### Implementation cost
+- **~80 LOC** for the trainer hook; the loss itself is ~15 lines (log-probs,
+  length-normalise, margin, BCE).
+- Dependencies: nothing new — `torch`, `transformers` already in repo.
+- The reference implementation is a single file in `princeton-nlp/SimPO`
+  (`scripts/run_simpo.py` + `alignment/` trainer subclass) under MIT, so we can
+  vendor it exactly as we did with OPSD.
+---
+## Candidate 2 — TAID (Temporally Adaptive Interpolated Distillation) ⭐ RECOMMENDED
+### Sources
+- **arXiv:** https://arxiv.org/abs/2501.16937 (Shing, Misaki, Bao, Yokoi, Akiba — Sakana AI, ICLR 2025)
+- **GitHub:** https://github.com/SakanaAI/TAID
+  - License: **Apache-2.0**
+  - 121 stars, last push 2025-10-06 (actively maintained)
+  - Reference implementations of GKD, DistiLLM, Adaptive-KL, CTKD, DKD are also in `src/distil_losses/` for free
+- Released artefacts: `TAID-LLM-1.5B`, `TAID-VLM-2B` on HuggingFace (so the loss is verified at non-trivial scale).
+- Maturity: **published, single-author commits**, but two SoTA compact models were reproducibly trained with it.
+### Loss core (interpolated teacher target)
+Standard distillation losses (forward KL, reverse KL, JSD, including the
+`generalized_jsd_loss` we already have) target a **fixed** teacher distribution
+`p_T`. TAID replaces this fixed target with a **time-dependent interpolated
+target** `p_t` that starts close to the student and moves toward the teacher
+as training progresses:
+```
+p_t(y | x) = (1 − t) · q_θ_stop(y | x)  +  t · p_T(y | x)         (1)
+J_TAID(θ; t) = D_KL( p_t ‖ q_θ )                                  (2)
+```
+`q_θ_stop` is the student's own current distribution with stop-gradient. The
+interpolation coefficient `t ∈ [t_start, 1]` is updated each step by an
+**adaptive momentum schedule** that grows `t` faster when training loss is
+falling and slower when it stalls — this is the "temporally adaptive" part.
+The Sakana paper shows (Theorem 4.1) that, for the regression analogue, this
+schedule prevents the mode-collapse failure mode of pure
+self-distillation.
+Critically, `D_KL(p_t ‖ q_θ)` is just a divergence evaluated against a shifted
+target: you can equally well plug in JSD, reverse KL, or **the
+generalized_jsd_loss the framework already exports**. TAID is therefore a
+*wrapper around an existing divergence*, not a competing divergence.
+### Why it composes with the existing stack
+- It **wraps** `composer_replication.opsd.generalized_jsd_loss` rather than
+  replacing it. The change is "compute the JSD against `p_t` instead of
+  `p_T`" — a few lines around the existing call site.
+- Addresses a documented weakness of OPSD-style self-distillation: when the
+  teacher's privileged-context distribution is far from the student's
+  capacity, the JSD signal can be noisy or push the student into mode
+  averaging. TAID's annealed target gives the student a curriculum.
+- Empirical evidence from the Sakana paper's direct comparison: TAID + JSD
+  beats GKD + JSD, which in turn beats DistiLLM + skew-KL, on Phi-3 → TinyLlama
+  distillation, with **0.7 h / epoch** vs **9.8 h / epoch** for GKD on identical
+  hardware. The speed comes from not needing student-generated outputs (SGOs)
+  at every step the way GKD does.
+- Composes additively with channel 1 (GRPO) and channel 3 (trace-replay DPO)
+  because TAID lives strictly inside channel 2.
+### Implementation cost
+- **~150 LOC**. The change is:
+  1. A `TAIDState` object that holds `t`, the EMA of training loss, and the
+     momentum coefficient β (default 0.99).
+  2. A function `taid_target(student_logits, teacher_logits, t)` that returns
+     `(1−t)·softmax(student_logits.detach()) + t·softmax(teacher_logits)`.
+  3. A scheduler hook that updates `t` after each backward pass per
+     Algorithm 1 of the paper.
+- Dependencies: nothing new.
+- Reference implementation in `SakanaAI/TAID/src/distil_losses/taid.py` is
+  Apache-2.0 — vendor-friendly, same pattern as our OPSD lift.
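The three pieces listed above can be sketched as follows. This follows the probability-space mixture written in equation (1) of this note and has not been checked line-for-line against `SakanaAI/TAID`. The `update` rule is one plausible momentum schedule matching the prose description, not a transcription of the paper's Algorithm 1, and `t = 0.4` and `alpha = 5e-4` are hypothetical defaults.

```python
import math
from dataclasses import dataclass
from typing import Optional

import torch
import torch.nn.functional as F


@dataclass
class TAIDState:
    t: float = 0.4                      # hypothetical t_start
    momentum: float = 0.0               # EMA of relative loss improvement
    prev_loss: Optional[float] = None
    beta: float = 0.99                  # momentum coefficient
    alpha: float = 5e-4                 # hypothetical step scale

    def update(self, loss_value: float) -> float:
        """Advance t: larger steps while the loss is falling, smaller otherwise."""
        if self.prev_loss is not None:
            rel_gain = (self.prev_loss - loss_value) / (self.prev_loss + 1e-8)
            self.momentum = self.beta * self.momentum + (1 - self.beta) * rel_gain
            gate = 1.0 / (1.0 + math.exp(-self.momentum))
            self.t = min(1.0, self.t + self.alpha * gate * (1.0 - self.t))
        self.prev_loss = loss_value
        return self.t


def taid_target(student_logits, teacher_logits, t: float):
    """p_t = (1 - t) * stop_grad(q_theta) + t * p_T, mixed in probability space."""
    p_student = F.softmax(student_logits.detach(), dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    return (1.0 - t) * p_student + t * p_teacher


def taid_kl_loss(student_logits, teacher_logits, t: float):
    """KL(p_t || q_theta), averaged over all leading dimensions."""
    target = taid_target(student_logits, teacher_logits, t)
    log_q = F.log_softmax(student_logits, dim=-1)
    kl = (target * (torch.log(target.clamp_min(1e-12)) - log_q)).sum(-1)
    return kl.mean()
```

At `t = 0` the target equals the detached student and the loss vanishes. At `t = 1` it reduces to ordinary forward KL against the teacher. Swapping the last two lines of `taid_kl_loss` for a call to the framework's JSD is the "wrapper" use described above.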
+---
+## Candidate 3 — Entropy-Aware On-Policy Distillation (Entropy-Aware OPD) ⭐ RECOMMENDED
+### Sources
+- **OpenReview (ICLR 2026 Spotlight):** https://openreview.net/forum?id=WSRQ37tzk1
+- **IBM Research page:** https://research.ibm.com/publications/entropy-aware-on-policy-distillation-of-language-models
+- Authors: Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, Kimin Lee (KAIST + IBM Research)
+- Status: **ICLR 2026 Spotlight**, submission #113. License on the OpenReview record is **CC BY 4.0**.
+- Code: not yet released on GitHub at the time of audit (paper accepted 2026-03-03). IBM authors typically release within the conference window. **Maturity flag: paper-ready, code-pending.** This is the only candidate where we'd need to re-implement from the paper.
+### Loss core (entropy-gated forward/reverse KL mixture)
+The paper diagnoses a failure mode in the reverse-KL-on-policy distillation
+recipe used by MiniLLM, OPSD, and (implicitly) by our SDPO channel: when the
+**teacher distribution has high entropy at a given token**, reverse KL's
+mode-seeking gradient becomes noisy and collapses the student's diversity.
+Their fix: at each token `t`, gate between forward and reverse KL based on
+the teacher's entropy:
+```
+H_t = − Σ_v p_T(v | x, y_<t) · log p_T(v | x, y_<t)        (teacher entropy)
+α_t = sigmoid( (H_t − τ) / s )                              ∈ (0, 1)
+L_EA(θ) = E_{y ~ q_θ} Σ_t [
+    (1 − α_t) · D_KL( q_θ(· | x, y_<t) ‖ p_T(· | x, y_<t) )    ← reverse KL
+  +     α_t   · D_KL( p_T(· | x, y_<t) ‖ q_θ(· | x, y_<t) )    ← forward KL
+]
+```
+`τ` is an entropy threshold (default ≈ 1.0 nat in their experiments) and `s`
+is a temperature controlling how sharp the gate is. When the teacher is
+confident (`H_t` small → `α_t ≈ 0`) the loss is pure reverse KL, identical to
+MiniLLM/OPSD behaviour. When the teacher is uncertain (`H_t` large → `α_t ≈ 1`)
+the loss switches to forward KL, which is mode-covering and preserves
+student diversity.
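Since no reference code is public, here is a sketch re-derived only from the formula as written in this note. `entropy_aware_kl` is a hypothetical name, `tau=1.0` mirrors the threshold quoted above, and `s=0.5` is an arbitrary illustrative gate temperature. Logits are assumed to be `[batch, seq, vocab]` and the mask `[batch, seq]`.

```python
import torch
import torch.nn.functional as F


def entropy_aware_kl(student_logits, teacher_logits, mask,
                     tau: float = 1.0, s: float = 0.5):
    """Per-token mixture of reverse and forward KL, gated by teacher entropy.

    Returns (scalar loss, alpha) where alpha has shape [batch, seq].
    """
    log_q = F.log_softmax(student_logits, dim=-1)
    log_p = F.log_softmax(teacher_logits.detach(), dim=-1)
    q, p = log_q.exp(), log_p.exp()

    entropy = -(p * log_p).sum(-1)                # H_t, in nats
    alpha = torch.sigmoid((entropy - tau) / s)    # near 0: confident teacher

    reverse_kl = (q * (log_q - log_p)).sum(-1)    # KL(q || p), mode-seeking
    forward_kl = (p * (log_p - log_q)).sum(-1)    # KL(p || q), mode-covering

    per_token = (1.0 - alpha) * reverse_kl + alpha * forward_kl
    mask = mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp_min(1.0), alpha
```

With a four-token vocabulary, a uniform teacher has entropy `ln 4 ≈ 1.39` nats, which sits above `τ = 1.0` and pushes the gate toward forward KL. A sharply peaked teacher pushes it toward reverse KL.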
+Reported gains over baseline reverse-KL OPD on Qwen3-0.6B/1.7B/4B: Pass@8 on
+six math benchmarks improves by +1.37 / +2.39 / +5.05 respectively. The
+larger gains at larger student size suggest the failure mode reverse KL
+exhibits gets *worse* with capacity, not better.
+### Why it composes with the existing stack
+- It is **strictly token-wise**: same trajectory, same teacher logits, same
+  rollout pipeline as the existing channel 2. The only change is the loss
+  reduction — instead of computing `generalized_jsd_loss` with a single fixed
+  β, you compute a per-token mixture of forward and reverse KL with weight
+  given by teacher entropy.
+- This is genuinely orthogonal to OPSD/SDPO. OPSD's contribution is
+  *privileged-context teacher distribution under student rollouts*. EA-OPD's
+  contribution is *which divergence to use at each token of that distribution*.
+  Both can be true simultaneously.
+- Directly addresses a failure mode the framework's roadmap will hit:
+  multi-teacher trace replay (channel 3) produces high-entropy aggregated
+  teacher distributions at exactly the steps where teachers disagree. Those
+  are the steps where reverse KL behaves worst. EA-OPD's entropy gate would
+  automatically soften the loss on those exact tokens.
+- Composes with TAID (Candidate 2) too — they operate on different axes:
+  TAID anneals the *target distribution*, EA-OPD chooses the *divergence
+  direction*. Stacking is straightforward and proposed as ADR-007 follow-up.
+### Implementation cost
+- **~120 LOC** estimate (no reference code to vendor yet).
+- Dependencies: nothing new. Token-level entropy is `−(p * log p).sum(-1)`,
+  forward KL is the existing teacher-on-student term, reverse KL is the
+  student-on-teacher term we already compute for the JSD in OPSD. The work is
+  re-shaping the existing per-token loss to expose both directions.
+- **Risk note:** code not yet public. We should hold this candidate behind a
+  feature flag until the IBM/KAIST team releases reference code (expected by
+  ICLR 2026 in May). If the implementation ships sooner we should vendor and
+  match line-for-line; if not, we re-derive from the paper formula and add a
+  unit test that reproduces their toy entropy-vs-divergence plot.
+---
+## Honourable mention — KTO (Kahneman-Tversky Optimization)
+- **arXiv:** https://arxiv.org/abs/2402.01306
+- **Code:** integrated into HuggingFace `trl` library since v0.8 (Apache-2.0).
+- License/maturity: **production**. KTO is a standard `trl` trainer alongside DPO.
+### Loss core
+KTO replaces preference pairs with **per-output binary desirability** signals.
+For a desirable output `y_+` and undesirable output `y_−`:
+```
+r_θ(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )
+z_0 = E_{x', y' ~ π_θ}[ KL( π_θ(·|x') ‖ π_ref(·|x') ) ]      (reference point)
+L_KTO = E_{x, y_+} [λ_D · (1 − σ(r_θ(x, y_+) − z_0))]        (desirable)
+      + E_{x, y_−} [λ_U · (1 − σ(z_0 − r_θ(x, y_−)))]        (undesirable)
+```
+with default `λ_D = λ_U = 1`. The derivation is via prospect theory: this is
+a Kahneman-Tversky utility function applied to the implicit reward. KTO
+matches DPO at 1B–30B even though it learns only from unpaired binary
+labels rather than preference pairs.
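A direct transcription of the loss as written above, for illustration only; in practice the `trl` trainer would be used. `kto_loss` is a hypothetical name. The reference point `z_0` is passed in as a constant because estimating it from a batch is outside this sketch, and `beta=0.1` is an arbitrary default.

```python
import torch


def kto_loss(policy_logps: torch.Tensor, ref_logps: torch.Tensor,
             is_desirable: torch.Tensor, z0: float,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> torch.Tensor:
    """Binary-desirability loss over unpaired outputs.

    policy_logps / ref_logps: summed sequence log-probs, shape [batch].
    is_desirable: bool tensor, True for y_+ and False for y_-.
    z0: reference point, treated as a constant (no gradient).
    """
    r = beta * (policy_logps - ref_logps)               # implicit reward
    desirable = lambda_d * (1.0 - torch.sigmoid(r - z0))
    undesirable = lambda_u * (1.0 - torch.sigmoid(z0 - r))
    # Each example contributes exactly one of the two branches.
    return torch.where(is_desirable, desirable, undesirable).mean()
```

When the policy equals the reference and `z0 = 0`, every example contributes `1 - σ(0) = 0.5`. Unlike SimPO, the reference log-probabilities are still required.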
+### Why we down-rank it relative to the top-3
+KTO is the right answer **only if** the framework wants to ingest single-side
+trace signals (e.g., "this trace step succeeded" / "this step crashed the
+agent") without constructing pairs. The current
+`research/05-trace-replay-distillation.md` design **does** construct pairs
+from multi-teacher replay (that is the whole point of the multi-teacher
+variance signal), so the marginal value of KTO is small *for channel 3 as
+specified*. If the trace-replay design pivots toward absolute scores per
+step rather than relative pairs, KTO becomes the right loss and is already
+free from `trl`. Add to the backlog as conditional.
+---
+## Audited but NOT recommended
+### GKD — Generalized Knowledge Distillation (Agarwal et al., 2023)
+- **arXiv:** https://arxiv.org/abs/2306.13649 (Google DeepMind)
+- **Loss core:** student samples its own outputs, teacher provides token
+  probabilities, divergence is generalized JSD with parameter β:
+  ```
+  D_JSD(β)(P‖Q) = β·KL(P ‖ βP+(1−β)Q) + (1−β)·KL(Q ‖ βP+(1−β)Q)
+  ```
+- **Why excluded:** **this is exactly the formula we already have** as
+  `composer_replication.opsd.generalized_jsd_loss` (lifted from
+  `siyan-zhao/OPSD`). GKD's contribution beyond the loss formula is the
+  on-policy student sampling protocol — which OPSD also does. No incremental
+  value to add.
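For reference, the generalized JSD formula quoted above can be written out as below. This is a standalone sketch, deliberately named `generalized_jsd` so it is not mistaken for the framework's `generalized_jsd_loss`, whose signature and reduction may differ. Inputs are assumed to be logits over the vocabulary on the last axis.

```python
import torch
import torch.nn.functional as F


def generalized_jsd(p_logits: torch.Tensor, q_logits: torch.Tensor,
                    beta: float = 0.5) -> torch.Tensor:
    """beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    p, q = log_p.exp(), log_q.exp()
    log_m = torch.log(beta * p + (1.0 - beta) * q)   # mixture is strictly positive
    kl_p_m = (p * (log_p - log_m)).sum(-1)
    kl_q_m = (q * (log_q - log_m)).sum(-1)
    return (beta * kl_p_m + (1.0 - beta) * kl_q_m).mean()
```

At `beta = 0.5` this is the symmetric Jensen-Shannon divergence, bounded above by `ln 2`. Other values of `beta` skew the mixture toward one of the two distributions.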
+### DistiLLM (Ko et al., ICML 2024)
+- **arXiv:** https://arxiv.org/abs/2402.03898
+- **GitHub:** https://github.com/jongwooko/distillm — MIT, last push 2025-03
+- **Loss core:** *Skew KL divergence* `KL(p ‖ λp + (1−λ)q)` plus an *adaptive
+  off-policy* student-generated-output (SGO) scheduler.
+- **Why excluded:** the skew-KL `KL(p ‖ λp + (1−λ)q)` is, up to the weight β,
+  the first term of the generalized JSD with mixture coefficient β = λ, so it
+  belongs to the same family the framework already has. The interesting
+  contribution, the SGO scheduler, is a process optimisation, not a loss. The
+  TAID paper's own ablation (Table 6) shows TAID > Skew KL across student
+  sizes, so TAID dominates this candidate.
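The relationship between skew KL and the generalized JSD can be checked numerically. The sketch below uses plain Python on two small hand-picked distributions; `kl`, `mix`, `skew_kl` and `generalized_jsd` are illustrative helpers, not functions from DistiLLM or the framework.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(p, q, lam):
    """Mixture lam * p + (1 - lam) * q."""
    return [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]


def skew_kl(p, q, lam):
    """Skew KL: KL(p || lam * p + (1 - lam) * q)."""
    return kl(p, mix(p, q, lam))


def generalized_jsd(p, q, beta):
    m = mix(p, q, beta)
    return beta * kl(p, m) + (1 - beta) * kl(q, m)


p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
beta = 0.3

# The JSD splits into a weighted skew-KL term plus the mirrored term on q.
first_term = beta * skew_kl(p, q, beta)
second_term = (1 - beta) * kl(q, mix(p, q, beta))
```

Here `first_term + second_term` reproduces `generalized_jsd(p, q, beta)`, so a framework that already exposes the JSD internals gets skew KL by dropping the second term.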
+### MiniLLM (Gu et al., ICLR 2024)
+- **arXiv:** https://arxiv.org/abs/2306.08543
+- **GitHub:** https://github.com/microsoft/LMOps/tree/main/minillm — MIT, repo
+  active (last push 2026-04)
+- **Loss core:** reverse KL minimised by policy-gradient on student rollouts,
+  with three optimisation tricks: single-step decomposition (variance
+  reduction), teacher-mixed sampling (anti-reward-hacking), length
+  normalisation.
+- **Why excluded:** reverse-KL on-policy distillation **is the same recipe
+  family as SDPO/OPSD** the framework already implements. Adding MiniLLM
+  would be a parallel implementation of the same idea, not an addition.
+  Entropy-Aware OPD (Candidate 3) is a *strict improvement* over MiniLLM's
+  pure reverse-KL on exactly the failure mode MiniLLM identifies (mode
+  collapse in high-entropy regions).
+### Self-Rewarding Language Models (Yuan et al., 2024)
+- **arXiv:** https://arxiv.org/abs/2401.10020 (Meta + NYU)
+- **Why excluded:** SRLM is a *training procedure* (iterative DPO with the
+  model judging its own outputs), not a loss. The actual loss is plain DPO,
+  which the framework already supports. The procedural contribution belongs
+  in a future ADR on data generation, not in the distillation channel.
+### TAID existence check (arXiv 2501.16937)
+The user asked us to verify existence. **It exists.** Submitted 2025-01-28,
+ICLR 2025, code at https://github.com/SakanaAI/TAID with two released
+checkpoints (`TAID-LLM-1.5B`, `TAID-VLM-2B`). Confirmed primary source.
+---
+## 2026 papers found
+The targeted Exa search (`category=research paper`, `startPublishedDate=2026-01-01`)
+surfaced five 2026 distillation papers worth listing for completeness:
+1. **Entropy-Aware On-Policy Distillation** — ICLR 2026 Spotlight. ⭐ Promoted to top-3 above.
+2. **KL for a KL: On-Policy Distillation with Control Variate Baseline** (arXiv 2605.07865, Oh et al., 2026-05). Variance-reduction trick for on-policy KL distillation. Useful future read but not a new loss — it's a baseline subtraction added to MiniLLM-style policy gradient.
+3. **Rethinking On-Policy Distillation: Phenomenology, Mechanism, and Recipe** (https://github.com/thunlp/OPD, Tsinghua NLP, last push 2026-04). Empirical study, not a new loss formulation.
+4. **Hybrid Policy Distillation for LLMs** (ICML 2026 poster, Zhu et al.). Combines off-policy and on-policy distillation; positioned as a recipe rather than a new loss; abstract suggests strong overlap with TAID's annealing argument.
+5. **Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation** (ICML 2026 poster, Dasgupta et al.). Targets the long-tail of teacher distributions. Interesting but currently only an abstract; deferred until the camera-ready PDF is available.
+None of these except Entropy-Aware OPD are mature enough (released code +
+license + reproducible scale) to recommend adding right now.
+---
+## Recommended follow-up wiring
+For ADR-007 the proposed addition is a `composer_replication.distillation`
+sub-package with three pluggable hooks:
+```
+composer_replication/
+  distillation/
+    __init__.py
+    targets.py        # taid_target(...), fixed_target(...)         ← Candidate 2
+    losses.py         # reuses opsd.generalized_jsd_loss
+                       # adds entropy_aware_kl_loss(...)             ← Candidate 3
+  preference/
+    simpo.py          # simpo_loss(...)                              ← Candidate 1
+    dpo.py            # existing trace-replay path
+```
+The composition rule for the total loss becomes:
+```
+L_total =   λ_grpo · L_GRPO            (channel 1, unchanged)
+        + λ_distill · L_distill        (channel 2, see below)
+        +    λ_pref · L_pref           (channel 3, choose DPO or SimPO)
+L_distill = entropy_aware_kl_loss(
+                target = taid_target(student, teacher, t),
+                student = student,
+                teacher_entropy_gate = α_t
+            )
+```
+This keeps the existing `generalized_jsd_loss` reachable as a fallback
+(set `α_t ≡ 0` and `t ≡ 1` and you recover SDPO/OPSD exactly).
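To make the role of `t` concrete, here is a minimal sketch of an interpolated target, assuming the logit-level interpolation described in the TAID paper. The name `taid_target_sketch` and its list-based signature are hypothetical; the real hook would operate on tensors and would detach the student logits, which is not modelled here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def taid_target_sketch(student_logits, teacher_logits, t):
    """Intermediate target: softmax((1 - t) * student + t * teacher)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    mixed = [(1.0 - t) * s + t * z for s, z in zip(student_logits, teacher_logits)]
    return softmax(mixed)

student = [2.0, 0.5, -1.0]
teacher = [0.0, 3.0, 1.0]
for t in (0.0, 0.5, 1.0):
    print(t, [round(p, 3) for p in taid_target_sketch(student, teacher, t)])
```

At `t = 1` the target is exactly the teacher distribution, which is why fixing `t ≡ 1` falls back to plain teacher-targeted distillation.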
+---
+## Sources index
+| Paper | arXiv | GitHub | License | Last push | Maturity |
+|-------|-------|--------|---------|-----------|----------|
+| SimPO | https://arxiv.org/abs/2405.14734 | https://github.com/princeton-nlp/SimPO | MIT | 2024-10-12 | Production |
+| TAID | https://arxiv.org/abs/2501.16937 | https://github.com/SakanaAI/TAID | Apache-2.0 | 2025-10-06 | Production |
+| Entropy-Aware OPD | n/a (OpenReview WSRQ37tzk1) | code-pending | CC BY 4.0 (paper) | n/a | Paper-only |
+| KTO | https://arxiv.org/abs/2402.01306 | huggingface/trl (built-in) | Apache-2.0 | continuous | Production |
+| GKD | https://arxiv.org/abs/2306.13649 | (no official repo from authors; reproduced inside SakanaAI/TAID and jongwooko/distillm) | n/a | n/a | Reference only |
+| DistiLLM | https://arxiv.org/abs/2402.03898 | https://github.com/jongwooko/distillm | (no LICENSE file at audit time) | 2025-03-13 | Research |
+| MiniLLM | https://arxiv.org/abs/2306.08543 | https://github.com/microsoft/LMOps/tree/main/minillm | MIT | 2026-04-08 | Production |
+| Self-Rewarding LM | https://arxiv.org/abs/2401.10020 | (no canonical repo; integrated into many forks) | n/a | n/a | Procedure, not a loss |
+---
+## Notes for ADR-007 author
+1. **SimPO and TAID can land independently and without coordination.** They
+   touch different files and do not compete.
+2. **Entropy-Aware OPD should land last.** Wait for the IBM/KAIST authors'
+   code release; if it's not out by the time we want to ship the change, the
+   formula is simple enough to re-derive but we should pin a unit test that
+   reproduces the paper's Figure 3 entropy-vs-divergence behaviour.
+3. **Do not also pull in GKD/DistiLLM/MiniLLM.** Their loss contributions are
+   strict subsets of what (TAID + Entropy-Aware OPD + existing
+   `generalized_jsd_loss`) covers.
+4. **KTO should be added as a backlog item** with a "trigger" condition:
+   when the trace-replay reward design moves from preference pairs to per-step
+   binary signals, switch on the `trl.KTOTrainer` path.
+---
+*Absolute path of this report:* `/mnt/e/CS/HF/composer-replication-framework/docs/research/SELF_DISTILLATION_LANDSCAPE.md`

docs/research/WAVE_13_FINAL_REVIEW.md ADDED

	@@ -0,0 +1,239 @@

+# Wave 13 Adversarial Cross-Model Review
+**Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task)
+**Date:** 2026-05-26
+**Scope:** Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules)
+**Method:** Read-and-grep audit + targeted test runs (CPU)
+## Top-line verdict
+**CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances
+the brief expansion (serverless DiLoCo abstraction, replaysim
+normalization, three distillation losses, PRIME-RL recipe, Monarch
+tie-in). The **distillation losses are the strongest deliverable** —
+real, well-tested, mathematically faithful to the cited papers. The
+serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are
+also genuine and exercised by 3 real multi-process tests.
+**However, two material claims are not test-validated, and one new
+module silently produces a degenerate loss in its primary code path.**
+ADR claims that say "X is added to compose_loss" describe code that
+wasn't actually written. The MockManager → DiLoCo "drop-in" is
+unverified end-to-end.
+Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2
+BLOCKERs + 4 SUGGESTIONs + 2 NITs**.
+---
+## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0)
+**Severity:** BLOCKER
+**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86`
+The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)`
+log-prob tensors before passing them to `generalized_jsd_loss`, which
+calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector
+is exactly 1.0; its log is 0. Therefore both `student_log_probs` and
+`teacher_log_probs` are identically zero, the JSD between them is 0,
+and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or
+the actual log-prob values.**
+```python
+>>> import torch
+>>> import torch.nn.functional as F
+>>> F.log_softmax(torch.randn(2, 3, 1), dim=-1)
+tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]])
+```
+The docstring calls this "a deliberate approximation," but it is not
+an approximation — it's a mathematically degenerate operation that
+silently disables channel 2.
+**Fix direction:**
+- Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising
+  `NotImplementedError` until PRIME-RL surfaces full logits.
+- Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL
+  has working SDPO; mark it deferred.
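A minimal sketch of the proposed gate, written against a plain shape tuple so it runs without PyTorch. The helper name `require_full_logits` is hypothetical; in the adapter the argument would presumably be `tuple(trainer_lp.shape)`. The extra check on the last axis is an addition beyond the fix direction above, so that an already-unsqueezed `(B, T, 1)` tensor is rejected too.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_norm for x in row]

def require_full_logits(shape):
    """Reject (B, T) log-prob tensors; accept (B, T, V) with V > 1."""
    if len(shape) < 3 or shape[-1] < 2:
        raise NotImplementedError(
            f"SDPO needs full (B, T, V) logits, got shape {tuple(shape)}"
        )

# A singleton vocabulary axis always normalises to log(1) = 0,
# whatever the input value is.
print(log_softmax([-3.2]))   # [0.0]
print(log_softmax([17.0]))   # [0.0]

require_full_logits((2, 3, 32000))   # passes silently
try:
    require_full_logits((2, 3))
except NotImplementedError as err:
    print(err)
```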
+---
+## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added
+**Severity:** BLOCKER
+**Evidence:**
+- `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims:
+  > `composer_replication.compose_loss` gets new optional kwargs:
+  >   - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3
+  >   - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2
+  >   - `taid_schedule_step: int | None = None`
+  >   - `taid_total_steps: int | None = None`
+- `composer_replication/loss.py:54-65` actual signature has **none**
+  of these. `grep -n "dpo_variant\|sdpo_wrapper\|taid"
+  composer_replication/loss.py` returns empty.
+The new losses live in `composer_replication.distillation` as
+standalone functions but **are not wired into the framework's actual
+loss composition.** A user reading ADR-007 + the README would believe
+`compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)`
+works; it would raise `TypeError`. The 17 distillation tests verify
+the standalone losses but never exercise integration.
+**Fix direction:**
+- Either (a) add the kwargs to `compose_loss` and write at least one
+  integration test combining e.g. SDPO+TAID (~30 LOC change), or
+- (b) downgrade ADR-007 status to "Standalone losses landed;
+  integration deferred to Wave 14."
+---
+## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields
+**Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path)
+**Evidence:**
+- `composer_replication/recipes/replaysim/default.yaml` configures
+  `text_length_filter`, `words_num_filter`, `special_characters_filter`,
+  `document_deduplicator` with `text_keys: ["chosen", "rejected"]`.
+- In the record produced by `_dpo_pair_to_dj_record`, `chosen` and
+  `rejected` are **lists of dicts**
+  (`[{"role": "assistant", "content": "..."}]`) — not strings.
+- data-juicer's `text_length_filter` expects string-typed fields;
+  running it on a list will either crash or no-op silently.
+The reason no test catches this: tests only validate the real path *if
+data-juicer is installed*, and even then only check `__init__` succeeds.
+There is no test that calls `normalize()` against a real data-juicer
+executor with the default recipe.
+**Fix direction:**
+- Reshape `_dpo_pair_to_dj_record` to extract `content` strings
+  alongside the messages-format list.
+- Add one test (skip-marked unless `data_juicer` is importable) that
+  runs the real op-graph on 3 hand-crafted records.
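One way to reshape the record is to keep the messages-format lists and add flattened string fields for the string-typed ops. The sketch below assumes the record shape described above. The helper names and the `chosen_text` / `rejected_text` keys are hypothetical and are not the current output of `_dpo_pair_to_dj_record`.

```python
def messages_to_text(messages):
    """Join the `content` of each message into one newline-separated string."""
    return "\n".join(m.get("content", "") for m in messages)

def add_text_fields(record):
    """Return a copy of `record` with string twins of the message lists."""
    out = dict(record)
    out["chosen_text"] = messages_to_text(record["chosen"])
    out["rejected_text"] = messages_to_text(record["rejected"])
    return out

record = {
    "chosen": [{"role": "assistant", "content": "Use a context manager."}],
    "rejected": [{"role": "assistant", "content": "Just call open()."}],
}
flat = add_text_fields(record)
print(flat["chosen_text"])    # Use a context manager.
print(type(flat["chosen"]))   # <class 'list'>
```

With a shape like this, the recipe could point its text keys at the string fields while the list-of-dict fields stay available for multi-turn ops.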
+---
+## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end
+**Severity:** SUGGESTION
+**Evidence:**
+- `composer_replication/diloco/serverless/allreduce.py:188-191` claims
+  MockManager "drops into" `make_diloco_outer_loop`.
+- The only test covering MockManager (`test_mock_manager_shape_compat`)
+  is a `hasattr` smoke that calls `.allreduce` on a `world_size=1`
+  store (passthrough).
+- torchft.Manager has additional surface area
+  (`current_step`, `is_leader`, `_pg`, `report_error`,
+  internal step accounting) that DiLoCo's `_apply_pseudogradient`
+  may consult depending on version.
+**Fix direction:**
+- Add a single integration test that constructs
+  `make_diloco_outer_loop(manager=MockManager(store), ...)` against a
+  tiny `nn.Linear` and runs one outer round — even single-process.
+- Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and
+  add stubs for any methods DiLoCo actually consults beyond `allreduce`.
+---
+## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated
+**Severity:** SUGGESTION (bordering on NIT)
+**Evidence:**
+- README.md and V1_V8_COVERAGE both state: *"9 multi-process tests
+  pinning the allreduce barrier."*
+- Actual breakdown:
+  - 4 single-process unit tests + `test_mock_manager_shape_compat` (5)
+  - 4 multi-process tests spawning subprocesses (parametrized [2,3] of
+    `_runs_allreduce_across_replicas`, `_handles_multiple_rounds`,
+    `_reports_failed_replicas`)
+- Of the 4 multi-process tests, only **3 actually exercise the
+  allreduce barrier**; `_reports_failed_replicas` deliberately raises
+  before any allreduce call.
+**Wave 13 clearly does NOT fake-pass via world_size=1**: the
+multi-process barrier is real. But the count is rounded up.
+**Fix direction:** Replace "9 multi-process tests" with "9 tests
+covering the serverless DiLoCo layer, of which 4 spawn real
+subprocesses and 3 exercise the allreduce barrier across replicas."
+---
+## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs`
+**Severity:** SUGGESTION
+**Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68`
+computes:
+```python
+grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon)
+```
+This is plain REINFORCE with advantage. PRIME-RL's `LossInputs`
+exposes `inference_logprobs` precisely because GRPO-with-replay-buffer
+requires the importance-sampling ratio
+`exp(trainer_lp - inference_lp)` (PPO-style clipped objective).
+The file says "SKELETON" so this isn't a hidden bug per se, but the
+loss is **labeled GRPO and is not GRPO**.
+**Fix direction:** Either implement the ratio + clipping (~20 LOC) or
+rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO.
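For reference, a framework-free sketch of the ratio-plus-clipping objective this finding refers to. It operates on flat Python lists rather than tensors. The function name, the `clip_eps` default of 0.2, and the masked-mean reduction are assumptions for illustration, not PRIME-RL's implementation.

```python
import math

def clipped_policy_loss(trainer_lp, inference_lp, advantages, mask, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over unmasked tokens."""
    total, count = 0.0, 0
    for lp, old_lp, adv, keep in zip(trainer_lp, inference_lp, advantages, mask):
        if not keep:
            continue
        ratio = math.exp(lp - old_lp)  # importance-sampling ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: take the smaller of the two surrogates.
        total += min(ratio * adv, clipped * adv)
        count += 1
    return -total / max(count, 1)

# On-policy tokens (ratio == 1) reduce to the negated mean advantage.
print(clipped_policy_loss([-1.0, -2.0], [-1.0, -2.0], [1.0, 0.5], [1, 1]))  # -0.75
# An off-policy token with a large ratio and positive advantage is clipped.
print(clipped_policy_loss([-1.0], [-2.0], [1.0], [1]))
```

The second call shows the difference from the REINFORCE stub: the ratio is about 2.72, but the contribution is capped at `1 + clip_eps`.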
+---
+## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__`
+**Severity:** NIT (this is documented, but README phrasing is slightly soft)
+**Evidence:** Honestly documented as skeletons in the code, ADR-005,
+and README. NIT: a user trying `ModalExecutor()` gets a runtime error
+rather than an import-time clue.
+**Fix direction:** Low priority. Update README phrase to "skeleton-only
+— raises NotImplementedError until v0.x." Or use a `__getattr__` on
+the package that raises a clearer message.
+---
+## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values)
+**Severity:** NIT
+**Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss`
+with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded
+above by 0; positive values aren't possible from any softmax. The tests
+still verify the formula correctly, but the test inputs aren't legal.
+**Fix direction:** Use negative values — purely cosmetic.
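A sketch of what legal inputs look like, using a standard-library version of the SimPO objective on length-normalised log-probs. The function name `simpo_loss_sketch` and the `beta` / `gamma` defaults are illustrative assumptions and do not mirror the signature of the repository's `simpo_loss`.

```python
import math

def simpo_loss_sketch(chosen_avg_lp, rejected_avg_lp, beta=2.0, gamma=1.0):
    """Mean of -log sigmoid(beta * (chosen - rejected) - gamma) over pairs."""
    losses = []
    for c, r in zip(chosen_avg_lp, rejected_avg_lp):
        margin = beta * (c - r) - gamma
        # softplus(-margin) == -log(sigmoid(margin)), in a stable form
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)

# Length-normalised log-probs are averages of values <= 0, so they are <= 0.
chosen = [-0.5, -0.4, -0.3]
rejected = [-1.5, -1.4, -1.3]
print(simpo_loss_sketch(chosen, rejected))
```

Because the loss depends only on the difference `chosen - rejected`, shifting the original positive test values below zero leaves the expected loss unchanged.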
+---
+## Cross-cutting risk check
+73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still
+pass. The new `composer_replication.diloco.serverless` package is
+purely additive; the existing `make_diloco_outer_loop` is untouched.
+**No cross-wave regressions detected on CPU.** GPU tests + slow CPU
+e2e tests not re-run; regression risk low since Wave 13 doesn't touch
+their dependencies.
+---
+## Summary scorecard
+| Item | Verdict |
+|---|---|
+| Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful |
+| Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) |
+| ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated |
+| MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) |
+| Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) |
+| Replaysim DJNormalizer passthrough | ✅ Works |
+| Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) |
+| PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) |
+| Monarch actors | ✅ Honest skeleton; raises NotImplementedError |
+| Altered-minds tie-in doc | ✅ Design-only, scoped honestly |
+| 35 new tests | All pass; 3 of 4 multi-process tests are genuine (Finding 5) |
+**Recommendation:** Address Findings 1 and 2 before publishing the
+Wave 13 expansion as "closed." Findings 3 and 4 should be addressed
+before any user attempts the real data-juicer or real torchft DiLoCo
+path. Findings 5–8 are cleanup.

pyproject.toml CHANGED

@@ -16,16 +16,23 @@ keywords = [
     "rlvr",
     "grpo",
     "sdpo",
     "dpo",
     "diloco",
     "agentic",
     "coding-agents",
     "composer-2-5",
     "cursor",
     "trl",
     "verl",
     "openenv",
     "torchft",
 ]
 classifiers = [
     "Development Status :: 3 - Alpha",
@@ -47,17 +54,35 @@ dependencies = [
 replay = [
     "httpx>=0.27",
 ]
-# DiLoCo outer-loop optimizer
 diloco = [
     "torchft-nightly",
 ]
-# Production training (TRL GRPOTrainer subclass)
 train = [
     "trl>=0.12",
     "peft>=0.13",
     "accelerate>=1.0",
     "datasets>=3.0",
 ]
 # Everything for development
 dev = [
     "pytest>=8.0",

     "rlvr",
     "grpo",
     "sdpo",
+    "simpo",
+    "taid",
     "dpo",
     "diloco",
+    "decoupled-diloco",
     "agentic",
     "coding-agents",
     "composer-2-5",
     "cursor",
     "trl",
     "verl",
+    "prime-rl",
     "openenv",
     "torchft",
+    "monarch",
+    "modal",
+    "huggingface-jobs",
 ]
 classifiers = [
     "Development Status :: 3 - Alpha",
 replay = [
     "httpx>=0.27",
 ]
+# DiLoCo outer-loop optimizer (single-process)
 diloco = [
     "torchft-nightly",
 ]
+# Decoupled DiLoCo over serverless executors (per ADR-005)
+serverless = [
+    "fsspec>=2024.6",
+    "huggingface_hub>=0.27",   # for hf:// fsspec backend + HF Jobs
+]
+# Replaysim dataset normalization (per ADR-004)
+replaysim = [
+    "data-juicer>=1.0",
+    "composer-replication[replay]",   # replaysim builds on the replay channel
+]
+# Production training (TRL GRPOTrainer subclass — Recipe A)
 train = [
     "trl>=0.12",
     "peft>=0.13",
     "accelerate>=1.0",
     "datasets>=3.0",
 ]
+# PRIME-RL recipe (Recipe C — per ADR-006)
+prime-rl = [
+    "prime-rl>=0.5",
+]
+# Monarch actor mesh (per ADR-006)
+monarch = [
+    "monarch>=0.4.1",
+]
 # Everything for development
 dev = [
     "pytest>=8.0",