| # Wave 13 Adversarial Cross-Model Review | |
| **Reviewer:** Claude Opus 4.7 (sub-agent via delegate_task) | |
| **Date:** 2026-05-26 | |
| **Scope:** Wave 13 additions only (35 new tests, 4 ADRs, 6 new modules) | |
| **Method:** Read-and-grep audit + targeted test runs (CPU) | |
| ## Top-line verdict | |
| **CONDITIONAL PASS with two BLOCKERs.** Wave 13 substantially advances | |
| the brief expansion (serverless DiLoCo abstraction, replaysim | |
| normalization, three distillation losses, PRIME-RL recipe, Monarch | |
| tie-in). The **distillation losses are the strongest deliverable** — | |
| real, well-tested, mathematically faithful to the cited papers. The | |
| serverless-DiLoCo local executor + ObjectStoreAllReduce barrier are | |
| also genuine and exercised by 3 real multi-process tests. | |
| **However, two material claims are not test-validated, and one new | |
| module silently produces a degenerate loss in its primary code path.** | |
| ADR claims that say "X is added to compose_loss" describe code that | |
| wasn't actually written. The MockManager → DiLoCo "drop-in" is | |
| unverified end-to-end. | |
| Wave 11's reviewer found 2 genuine BLOCKERs. This review finds **2 | |
| BLOCKERs + 4 SUGGESTIONs + 2 NITs**. | |
| --- | |
| ## Finding 1 — BLOCKER: PRIME-RL `composer_loss.loss_fn` SDPO term is mathematically degenerate (always 0) | |
| **Severity:** BLOCKER | |
| **Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:79-86` | |
| The PRIME-RL composer-loss adapter applies `unsqueeze(-1)` to `(B, T)` | |
| log-prob tensors before passing them to `generalized_jsd_loss`, which | |
| calls `F.log_softmax(..., dim=-1)`. Softmax of a single-element vector | |
| is exactly 1.0; its log is 0. Therefore both `student_log_probs` and | |
| `teacher_log_probs` are identically zero, the JSD between them is 0, | |
| and the SDPO contribution **is always 0 regardless of `alpha_sdpo` or | |
| the actual log-prob values.** | |
| ```python | |
| >>> import torch | |
| >>> import torch.nn.functional as F | |
| >>> F.log_softmax(torch.randn(2, 3, 1), dim=-1) | |
| tensor([[[0.],[0.],[0.]],[[0.],[0.],[0.]]]) | |
| ``` | |
| The docstring calls this "a deliberate approximation," but it is not | |
| an approximation — it's a mathematically degenerate operation that | |
| silently disables channel 2. | |
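The same degeneracy can be reproduced without PyTorch. The sketch below is a pure-Python stand-in, not the framework's `generalized_jsd_loss`: the helper names `log_softmax` and `jsd` are illustrative, and it assumes a plain Jensen-Shannon divergence with a 0.5 mixing weight. It shows that normalizing over a length-1 axis erases the input values, while a real vocabulary axis does not.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a plain list of floats."""
    m = max(xs)
    log_z = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_z for x in xs]

def jsd(student_logits, teacher_logits):
    """Jensen-Shannon divergence (natural log) between two categorical
    distributions given as unnormalized logits."""
    s = [math.exp(v) for v in log_softmax(student_logits)]
    t = [math.exp(v) for v in log_softmax(teacher_logits)]
    mix = [0.5 * (a + b) for a, b in zip(s, t)]

    def kl(p, q):
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    return 0.5 * kl(s, mix) + 0.5 * kl(t, mix)

# Per-token log-probs of the sampled token, as a (B, T) tensor would hold them.
student_lp = [-0.3, -2.1, -0.9]
teacher_lp = [-1.7, -0.2, -4.0]

# Effect of unsqueeze(-1): each scalar becomes its own length-1 "vocabulary".
degenerate = [jsd([s], [t]) for s, t in zip(student_lp, teacher_lp)]

# With a real vocabulary axis (here V = 3) the divergence is non-zero.
full = jsd([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])

print(degenerate, full)
```

Every entry of `degenerate` is zero even though the student and teacher log-probs differ at each position, which is why no choice of `alpha_sdpo` can make the term contribute.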
| **Fix direction:** | |
| - Gate the SDPO branch behind `len(trainer_lp.shape) >= 3`, raising | |
| `NotImplementedError` until PRIME-RL surfaces full logits. | |
| - Update `prime_rl_recipe.md` and ADR-006 to stop claiming PRIME-RL | |
| has working SDPO; mark it deferred. | |
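A minimal sketch of the suggested gate, under stated assumptions: `require_vocab_axis` is a hypothetical helper (not present in the repository), and it is meant to be called with `tuple(trainer_lp.shape)` before any `unsqueeze`. It also rejects a trailing axis of size 1, since a `(B, T, 1)` tensor would pass a bare rank check while still being degenerate.

```python
def require_vocab_axis(shape):
    """Return the vocabulary size if `shape` looks like (B, T, V) with V >= 2.

    Hypothetical helper: pass the shape of the raw tensor, before any
    unsqueeze. A trailing axis of size 1 is rejected too, because
    log-softmax over it is identically zero.
    """
    if len(shape) < 3 or shape[-1] < 2:
        raise NotImplementedError(
            f"SDPO needs full (B, T, V) logits, got shape {tuple(shape)}"
        )
    return shape[-1]

# Full logits pass the gate.
vocab = require_vocab_axis((2, 3, 32000))

# Per-token log-probs, with or without the unsqueeze, are refused.
rejected = []
for bad_shape in [(2, 3), (2, 3, 1)]:
    try:
        require_vocab_axis(bad_shape)
    except NotImplementedError:
        rejected.append(bad_shape)
```

Failing loudly here turns a silently disabled channel into an explicit "not supported yet" error.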
| --- | |
| ## Finding 2 — BLOCKER: ADR-007 declares `compose_loss` kwargs that were never added | |
| **Severity:** BLOCKER | |
| **Evidence:** | |
| - `docs/adrs/ADR-007-self-distillation-losses.md:103-108` claims: | |
| > `composer_replication.compose_loss` gets new optional kwargs: | |
| > - `dpo_variant: Literal["dpo", "simpo"] = "dpo"` — switches channel 3 | |
| > - `sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none"` — wraps channel 2 | |
| > - `taid_schedule_step: int | None = None` | |
| > - `taid_total_steps: int | None = None` | |
| - The actual signature at `composer_replication/loss.py:54-65` has | |
| **none** of these: `grep -n "dpo_variant\|sdpo_wrapper\|taid" composer_replication/loss.py` | |
| returns no matches. | |
| The new losses live in `composer_replication.distillation` as | |
| standalone functions but **are not wired into the framework's actual | |
| loss composition.** A user reading ADR-007 + the README would believe | |
| `compose_loss(model, inputs, dpo_variant="simpo", sdpo_wrapper="taid", ...)` | |
| works; it would raise `TypeError`. The 17 distillation tests verify | |
| the standalone losses but never exercise integration. | |
| **Fix direction:** | |
| - Either (a) add the kwargs to `compose_loss` and write at least one | |
| integration test combining e.g. SDPO+TAID (~30 LOC change), or | |
| - (b) downgrade ADR-007 status to "Standalone losses landed; | |
| integration deferred to Wave 14." | |
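Whichever option is taken, ADR-versus-code drift of this kind can be pinned with a small signature check. The sketch below uses only `inspect.signature` from the standard library; `missing_kwargs` and `compose_loss_stub` are hypothetical names, and the stub's signature is an assumption standing in for the real function in `composer_replication/loss.py`.

```python
import inspect

# Kwargs that ADR-007 says `compose_loss` accepts (copied from the ADR text).
ADR_007_KWARGS = (
    "dpo_variant",
    "sdpo_wrapper",
    "taid_schedule_step",
    "taid_total_steps",
)

def missing_kwargs(fn, expected=ADR_007_KWARGS):
    """Return the expected keyword names that `fn` does not declare.

    Only explicitly named parameters count; a **kwargs catch-all is not
    treated as support for a documented option.
    """
    params = inspect.signature(fn).parameters
    return [name for name in expected if name not in params]

# Stand-in with an assumed signature, for illustration only.
def compose_loss_stub(model, inputs, alpha_sdpo=0.0):
    return 0.0

drift = missing_kwargs(compose_loss_stub)
print(drift)
```

In the repository the equivalent test would presumably assert `not missing_kwargs(compose_loss)`, failing until either the kwargs land or the ADR is downgraded.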
| --- | |
| ## Finding 3 — SUGGESTION: `default.yaml` replaysim recipe uses string ops on list-of-dict fields | |
| **Severity:** SUGGESTION (would be BLOCKER if a test exercised the real path) | |
| **Evidence:** | |
| - `composer_replication/recipes/replaysim/default.yaml` configures | |
| `text_length_filter`, `words_num_filter`, `special_characters_filter`, | |
| `document_deduplicator` with `text_keys: ["chosen", "rejected"]`. | |
| - In the record produced by `_dpo_pair_to_dj_record`, `chosen` and | |
| `rejected` are **lists of dicts** | |
| (`[{"role": "assistant", "content": "..."}]`) — not strings. | |
| - data-juicer's `text_length_filter` expects string-typed fields; | |
| running it on a list will either crash or no-op silently. | |
| No test catches this because the tests only exercise the real path *if | |
| data-juicer is installed*, and even then only check that `__init__` | |
| succeeds. There is no test that calls `normalize()` against a real | |
| data-juicer executor with the default recipe. | |
| **Fix direction:** | |
| - Reshape `_dpo_pair_to_dj_record` to extract `content` strings | |
| alongside the messages-format list. | |
| - Add one test (skip-marked unless `data_juicer` is importable) that | |
| runs the real op-graph on 3 hand-crafted records. | |
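One possible shape for the first fix, as a sketch only: the function below is a hypothetical variant of `_dpo_pair_to_dj_record` whose real signature is not shown in this review. The `prompt` argument and the `chosen_text` / `rejected_text` field names are assumptions. It keeps the messages-format lists and adds flat string fields that string-based filters could read.

```python
def messages_to_text(messages):
    """Join the `content` of each message into a single string."""
    return "\n".join(m.get("content", "") for m in messages)

def dpo_pair_to_dj_record(prompt, chosen, rejected):
    """Hypothetical record builder: messages lists plus flat text fields."""
    return {
        "prompt": prompt,
        "chosen": chosen,          # original messages-format list
        "rejected": rejected,      # original messages-format list
        "chosen_text": messages_to_text(chosen),      # assumed field name
        "rejected_text": messages_to_text(rejected),  # assumed field name
    }

record = dpo_pair_to_dj_record(
    prompt="Fix the failing test.",
    chosen=[{"role": "assistant", "content": "Patched the off-by-one."}],
    rejected=[{"role": "assistant", "content": "No idea."}],
)
print(record["chosen_text"])
```

Under this layout the recipe's `text_keys` would point at the string fields rather than at `chosen` and `rejected`.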
| --- | |
| ## Finding 4 — SUGGESTION: MockManager → torchft.DiLoCo "drop-in" claim is unverified end-to-end | |
| **Severity:** SUGGESTION | |
| **Evidence:** | |
| - `composer_replication/diloco/serverless/allreduce.py:188-191` claims | |
| MockManager "drops into" `make_diloco_outer_loop`. | |
| - The only test covering MockManager (`test_mock_manager_shape_compat`) | |
| is a `hasattr` smoke that calls `.allreduce` on a `world_size=1` | |
| store (passthrough). | |
| - torchft.Manager has additional surface area | |
| (`current_step`, `is_leader`, `_pg`, `report_error`, | |
| internal step accounting) that DiLoCo's `_apply_pseudogradient` | |
| may consult depending on version. | |
| **Fix direction:** | |
| - Add a single integration test that constructs | |
| `make_diloco_outer_loop(manager=MockManager(store), ...)` against a | |
| tiny `nn.Linear` and runs one outer round — even single-process. | |
| - Audit `torchft/local_sgd.py` for the `Manager`-rooted call sites and | |
| add stubs for any methods DiLoCo actually consults beyond `allreduce`. | |
| --- | |
| ## Finding 5 — SUGGESTION: README claim "9 multi-process tests" is mildly inflated | |
| **Severity:** SUGGESTION (bordering on NIT) | |
| **Evidence:** | |
| - README.md and V1_V8_COVERAGE both state: *"9 multi-process tests | |
| pinning the allreduce barrier."* | |
| - Actual breakdown: | |
| - 4 single-process unit tests + `test_mock_manager_shape_compat` (5) | |
| - 4 multi-process tests spawning subprocesses (parametrized [2,3] of | |
| `_runs_allreduce_across_replicas`, `_handles_multiple_rounds`, | |
| `_reports_failed_replicas`) | |
| - Of the 4 multi-process tests, only **3 actually exercise the | |
| allreduce barrier**; `_reports_failed_replicas` deliberately raises | |
| before any allreduce call. | |
| **Wave 13 clearly does NOT fake-pass via world_size=1**: the | |
| multi-process barrier is real. But the count is rounded up. | |
| **Fix direction:** Replace "9 multi-process tests" with "9 tests | |
| covering the serverless DiLoCo layer, of which 4 spawn real | |
| subprocesses and 3 exercise the allreduce barrier across replicas." | |
| --- | |
| ## Finding 6 — SUGGESTION: PRIME-RL channel 1 is REINFORCE not GRPO; ignores `inference_logprobs` | |
| **Severity:** SUGGESTION | |
| **Evidence:** `composer_replication/recipes/prime_rl/composer_loss.py:62-68` | |
| computes: | |
| ```python | |
| grpo_loss = -(advantages * trainer_lp * mask).sum() / mask.sum().clamp_min(epsilon) | |
| ``` | |
| This is plain REINFORCE with advantage. PRIME-RL's `LossInputs` | |
| exposes `inference_logprobs` precisely because GRPO-with-replay-buffer | |
| requires the importance-sampling ratio | |
| `exp(trainer_lp - inference_lp)` (PPO-style clipped objective). | |
| The file says "SKELETON" so this isn't a hidden bug per se, but the | |
| loss is **labeled GRPO and is not GRPO**. | |
| **Fix direction:** Either implement the ratio + clipping (~20 LOC) or | |
| rename channel-1 comment to "REINFORCE-with-advantage stub" with a TODO. | |
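A rough sketch of what "ratio + clipping" could look like, written over flat per-token Python lists so it runs without PyTorch. It covers only the importance ratio and the PPO-style clip; group-relative advantage normalization is assumed to happen upstream. The name `clipped_policy_loss` and the default `clip_eps=0.2` are assumptions, not values taken from PRIME-RL.

```python
import math

def clipped_policy_loss(trainer_lp, inference_lp, advantages, mask, clip_eps=0.2):
    """PPO-style clipped surrogate over flat per-token lists.

    ratio = exp(trainer_lp - inference_lp); the loss is the negated,
    mask-weighted mean of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    """
    total, count = 0.0, 0.0
    for t_lp, i_lp, adv, m in zip(trainer_lp, inference_lp, advantages, mask):
        ratio = math.exp(t_lp - i_lp)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += m * min(ratio * adv, clipped * adv)
        count += m
    return -total / max(count, 1e-8)

# On-policy case: ratio is 1 everywhere, so the loss is minus the masked
# mean advantage.
on_policy = clipped_policy_loss(
    trainer_lp=[-0.5, -1.0, -2.0],
    inference_lp=[-0.5, -1.0, -2.0],
    advantages=[1.0, -0.5, 2.0],
    mask=[1, 1, 0],
)

# Off-policy case: a large ratio with a positive advantage is capped at 1 + eps.
off_policy = clipped_policy_loss([-0.1], [-1.1], [1.0], [1])

print(on_policy, off_policy)
```

When the trainer and inference log-probs match, the ratio is 1 everywhere and the clip never activates; the ratio and clip only matter once the sampling policy lags the trainer, which is the replay-buffer case that `inference_logprobs` is exposed for.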
| --- | |
| ## Finding 7 — NIT: ModalExecutor / HFJobsExecutor are skeleton-only with `NotImplementedError` in `__init__` | |
| **Severity:** NIT (this is documented, but README phrasing is slightly soft) | |
| **Evidence:** Both executors are honestly documented as skeletons in | |
| the code, ADR-005, and README. The nit: a user calling | |
| `ModalExecutor()` gets a runtime error rather than an import-time clue. | |
| **Fix direction:** Low priority. Update README phrase to "skeleton-only | |
| — raises NotImplementedError until v0.x." Or use a `__getattr__` on | |
| the package that raises a clearer message. | |
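The `__getattr__` option relies on module-level `__getattr__` (PEP 562, Python 3.7+), which is consulted only for names that are not already defined in the module. The sketch below simulates a package with `types.ModuleType`; the module name `serverless_demo` and the error wording are hypothetical, and the approach assumes the skeleton classes are not exported as ordinary attributes.

```python
import types

_SKELETONS = {"ModalExecutor", "HFJobsExecutor"}

def _package_getattr(name):
    """PEP 562 hook: runs only for names missing from the module namespace."""
    if name in _SKELETONS:
        raise ImportError(f"{name} is skeleton-only and not implemented yet")
    raise AttributeError(f"module has no attribute {name!r}")

# Stand-in for the package's __init__.py. In a real package the hook would
# simply be a top-level function named `__getattr__`.
pkg = types.ModuleType("serverless_demo")
pkg.__getattr__ = _package_getattr

try:
    pkg.ModalExecutor
    message = ""
except ImportError as exc:
    message = str(exc)

print(message)
```

The trade-off is that the skeleton classes stop being importable at all, so this only makes sense if nothing else in the package references them.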
| --- | |
| ## Finding 8 — NIT: SimPO test uses positive log-probs (impossible values) | |
| **Severity:** NIT | |
| **Evidence:** `test_distillation_losses.py:27-46` calls `simpo_loss` | |
| with `chosen=tensor([0.5, 0.4, 0.3])`. Log-probabilities are bounded | |
| above by 0; positive values aren't possible from any softmax. The tests | |
| still verify the formula correctly, but the test inputs aren't legal. | |
| **Fix direction:** Use negative values — purely cosmetic. | |
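To illustrate why the change is cosmetic: a SimPO-style loss depends only on the difference between the chosen and rejected scores, so shifting both by the same constant leaves it unchanged. The sketch below is a pure-Python reference, not the repository's `simpo_loss`. It assumes the inputs are length-normalized sequence log-probs, and the `beta` and `gamma` defaults and the rejected values are illustrative.

```python
import math

def simpo_reference(chosen_lp, rejected_lp, beta=2.0, gamma=0.5):
    """Mean of -log(sigmoid(beta * (chosen - rejected) - gamma)).

    Inputs are assumed to be length-normalized sequence log-probs,
    one value per preference pair.
    """
    losses = []
    for c, r in zip(chosen_lp, rejected_lp):
        margin = beta * (c - r) - gamma
        # -log(sigmoid(x)) == log(1 + exp(-x))
        losses.append(math.log1p(math.exp(-margin)))
    return sum(losses) / len(losses)

# Positive "log-probs" like those in the current test (rejected values invented).
illegal_chosen = [0.5, 0.4, 0.3]
illegal_rejected = [0.1, 0.2, 0.1]

# Shift every value down by 1.0: now all are <= 0, margins are unchanged.
legal_chosen = [c - 1.0 for c in illegal_chosen]
legal_rejected = [r - 1.0 for r in illegal_rejected]

loss_illegal = simpo_reference(illegal_chosen, illegal_rejected)
loss_legal = simpo_reference(legal_chosen, legal_rejected)

print(loss_illegal, loss_legal)
```

Both calls return the same value up to floating-point rounding, so the existing assertions should carry over to legal inputs unchanged.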
| --- | |
| ## Cross-cutting risk check | |
| 73 tests passed in 29.29s on the CPU-fast subset. Spike 008 5/5 still | |
| pass. The new `composer_replication.diloco.serverless` package is | |
| purely additive; the existing `make_diloco_outer_loop` is untouched. | |
| **No cross-wave regressions detected on CPU.** GPU tests and slow CPU | |
| e2e tests were not re-run; regression risk is low because Wave 13 | |
| doesn't touch their dependencies. | |
| --- | |
| ## Summary scorecard | |
| | Item | Verdict | | |
| |---|---| | |
| | Distillation module (SimPO/TAID/Entropy-Aware OPD) standalone | ✅ Real, well-tested, paper-faithful | | |
| | Distillation integrated into `compose_loss` | ❌ **Not implemented** despite ADR-007 (Finding 2) | | |
| | ObjectStoreAllReduce + LocalProcessExecutor | ✅ Real multi-process barrier validated | | |
| | MockManager → DiLoCo drop-in | 🟡 Shape-checked only; integration unverified (Finding 4) | | |
| | Modal/HFJobs adapters | 🟡 Honestly documented as skeletons (Finding 7) | | |
| | Replaysim DJNormalizer passthrough | ✅ Works | | |
| | Replaysim default.yaml against real data-juicer | ❌ **Recipe field types don't match record shape** (Finding 3) | | |
| | PRIME-RL composer_loss.loss_fn | ❌ **SDPO term silently 0** (Finding 1); channel 1 is REINFORCE not GRPO (Finding 6) | | |
| | Monarch actors | ✅ Honest skeleton; raises NotImplementedError | | |
| | Altered-minds tie-in doc | ✅ Design-only, scoped honestly | | |
| | 35 new tests | All pass; 3 of the 4 multi-process tests exercise the allreduce barrier (Finding 5) | | |
| **Recommendation:** Address Findings 1 and 2 before publishing the | |
| Wave 13 expansion as "closed." Findings 3 and 4 should be addressed | |
| before any user attempts the real data-juicer or real torchft DiLoCo | |
| path. Findings 5–8 are cleanup. | |