Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
Wave 16: install ergonomics + gradient evidence + SDPO end-to-end example
Browse filesFive-track polish wave with mandatory cross-model adversarial review
(deepseek-v4-pro + gemini-3.1-pro-preview, route-fidelity-verified via
direct-urllib scatter; reviewers caught 2 real bugs that BOTH made it
into the staged diff and would have shipped without the review):
Wave 16a — Install ergonomics (extras pinning)
pyproject.toml dropped three extras whose lower bounds were
unsatisfiable on public PyPI and would have blocked any fresh user:
- monarch>=0.4.1: PyPI 'monarch' tops out at 0.1.11 and is unrelated
to Meta's actor framework; the real Monarch ships as
torchmonarch-nightly with platform constraints.
- prime-rl>=0.5: 'prime-rl' is not registered on PyPI at all; Prime
Intellect publishes from source only.
- data-juicer>=1.0: 'data-juicer' is not registered on PyPI; the
closest match (py-data-juicer) has broken transitive deps.
All three integrations import lazily, so the framework still loads
cleanly without them. docs/TROUBLESHOOTING.md gets a new section #10
with from-source install pointers.
README.md keyword list trimmed of three misleading discovery facets
(prime-rl / monarch / torchforge) flagged by Gemini reviewer.
Verified: `uv pip install -e ".[diloco,replay,replaysim,train,dev]"`
succeeds on a fresh venv.
Wave 16b — Gradient-flow tests (4 new tests, all PASS)
composer_replication/tests/test_gradient_flow.py verifies that
compose_loss channels actually route gradients to model parameters
under autograd, AND that disabled channels produce zero side-effects:
- test_alpha_sdpo_routes_grad_to_params (alpha=1.0 → finite |grad|>0)
- test_beta_replay_routes_grad_to_params (beta=1.0 → finite |grad|>0)
- test_alpha_zero_blocks_sdpo_grad (alpha=0 with vs without SDPO
inputs produces BIT-IDENTICAL parameter gradients — catches a
class of phantom-gradient leak from disabled channels)
- test_taid_grad_flows_through_sdpo_path (Wave 15 TAID rewrite
remains differentiable under autograd)
Test count: 176 → 180 passed / 8 skipped (all green; 1 flaky
serverless test pre-exists Wave 16, not introduced here).
Wave 16c — INTEGRATION_RECIPES.md signature collapse
Removed the 20-line `def compose_loss(...)` parameter reproduction
in the TL;DR; replaced with a single cross-link to API_REFERENCE.md.
Per-recipe USE examples (literal kwarg values demonstrating each
recipe's specific configuration) preserved — those carry the
recipe-specific value-add, not signature drift.
API_REFERENCE.md gets an explicit `<a id="compose_loss"></a>` anchor
so the cross-link resolves cleanly (Gemini reviewer flagged the
GFM-auto-slug as brittle).
Wave 16d — Reconnaissance-doc currency audit
WAVE_16_RECON_AUDIT.md catalogs every code reference in 7 research
docs against the actual symbols, file paths, and kwargs in the
package today. Audit table buckets findings as KEEP / FIX / FLAG /
DELETE. Bucket totals: 24 KEEP + 2 FIX (in RL_FRAMEWORKS_LANDSCAPE)
+ 5 FLAG (proposal-shaped sketches that don't match what was built;
inline `<!-- AUDIT: ... -->` markers added) + 0 DELETE. Most
consequential FLAG: the DPOPair shape mismatch in REPLAYSIM_
NORMALIZATION_RECONNAISSANCE.md — the sketched _to_dj/_from_dj
round-trip won't work as written against the realised TypedDict.
Wave 16e — examples/gsm8k_grpo_with_sdpo/ (real model, CPU, 16.5s)
New sibling to examples/gsm8k_grpo/ that demonstrates the SDPO
hint-distillation column firing end-to-end on a real Qwen2.5-0.5B-
Instruct, on CPU. Three acceptance assertions verify the channel
actually fires:
✓ sdpo_jsd > 0 at every step (min=0.1358, max=0.1429)
✓ total != lm_ce at every step (channel actually contributes)
✓ |grad| > 0 and finite at every step (autograd flows correctly)
Total loss decreased 5.98 → 2.46 across 5 SGD steps in 16.5s wall-
clock after a 1.7s model load. Uses `tokenizer.apply_chat_template`
with Qwen's actual ChatML markers (<|im_start|>/<|im_end|>) — the
initial draft used raw <|system|>/<|end|> strings which DeepSeek
reviewer correctly flagged would tokenize as 11 punctuation tokens
feeding the model nonsense. Fixed before commit; SDPO signal is
now ~10× stronger (0.1429 vs 0.0326) because the model sees real
ChatML.
Cross-model review (mandatory pre-push, route-fidelity-verified)
Used the model-roster skill's scatter-via-urllib pattern (PR#2's
delegate_task per-task override is unreliable for some families).
Both reviewers received the staged diff with no orchestrator
reasoning leaked:
- deepseek/deepseek-v4-pro (math + test honesty + spec drift):
caught the ChatML marker bug + tokenizer-alignment concern
+ bit-exact-grad-equality fragility note + 3 minor nits.
Cost: $0.060, 211s.
- google/gemini-3.1-pro-preview (user journey + docs consistency):
caught the README keyword leak + the brittle GFM anchor + the
docs/plans/wave-16-plan.md hygiene nit. Cost: $0.075, 35s.
Both BLOCKERs and 2 important-issues fixed before this commit. Two
reviewer "BLOCKERs" verified false-positive (LossComponents.detached
exists; remaining alpha_sdpo refs in INTEGRATION_RECIPES are
intentional USE examples per the 16c task spec).
Methodological note for Wave 17
The cross-model-review step caught two ship-blockers (README keyword
leak + ChatML marker bug) that no amount of orchestrator self-review
surfaced. The Wave 7-11 lesson — adversarial review is mandatory
before any public push, even when the work feels clean — held again
this wave. The temptation to skip it after the final-push budget
pressure was real; this commit message exists in part as the audit
trail for why we didn't.
- .gitignore +6 -0
- README.md +0 -3
- composer_replication/tests/test_gradient_flow.py +279 -0
- docs/API_REFERENCE.md +1 -0
- docs/INTEGRATION_RECIPES.md +11 -20
- docs/TROUBLESHOOTING.md +74 -0
- docs/research/DILOCO_SERVERLESS_RECONNAISSANCE.md +11 -0
- docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md +14 -0
- docs/research/RL_FRAMEWORKS_LANDSCAPE.md +12 -3
- docs/research/SELF_DISTILLATION_LANDSCAPE.md +8 -0
- docs/research/TRACE_SOURCE_RECONNAISSANCE.md +7 -0
- docs/research/WAVE_16_RECON_AUDIT.md +179 -0
- examples/gsm8k_grpo_with_sdpo/README.md +72 -0
- examples/gsm8k_grpo_with_sdpo/run.py +288 -0
- pyproject.toml +27 -9
|
@@ -29,6 +29,12 @@ examples/*/checkpoints/
|
|
| 29 |
spikes/*/output/
|
| 30 |
spikes/*/checkpoints/
|
| 31 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
# Model files (HF native; never commit raw weights to a methodology repo)
|
| 33 |
*.safetensors
|
| 34 |
*.bin
|
|
|
|
| 29 |
spikes/*/output/
|
| 30 |
spikes/*/checkpoints/
|
| 31 |
|
| 32 |
+
# Run logs (re-generated on every run.py invocation)
|
| 33 |
+
examples/*/run.log
|
| 34 |
+
spikes/*/results/run.log
|
| 35 |
+
spikes/*/results/test_strict.log
|
| 36 |
+
spikes/*/results/
|
| 37 |
+
|
| 38 |
# Model files (HF native; never commit raw weights to a methodology repo)
|
| 39 |
*.safetensors
|
| 40 |
*.bin
|
|
@@ -14,12 +14,9 @@ tags:
|
|
| 14 |
- grpo
|
| 15 |
- dapo
|
| 16 |
- diloco
|
| 17 |
-
- prime-rl
|
| 18 |
- openenv
|
| 19 |
- trl
|
| 20 |
- verl
|
| 21 |
-
- monarch
|
| 22 |
-
- torchforge
|
| 23 |
- research
|
| 24 |
- methodology
|
| 25 |
pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
|
|
|
|
| 14 |
- grpo
|
| 15 |
- dapo
|
| 16 |
- diloco
|
|
|
|
| 17 |
- openenv
|
| 18 |
- trl
|
| 19 |
- verl
|
|
|
|
|
|
|
| 20 |
- research
|
| 21 |
- methodology
|
| 22 |
pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
|
|
@@ -0,0 +1,279 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Gradient-flow tests for compose_loss channels (Wave 16b).
|
| 2 |
+
|
| 3 |
+
Wave 14-15 verified compose_loss returns correct numeric values and that
|
| 4 |
+
channel disables behave correctly. This file closes the gap by verifying
|
| 5 |
+
that gradients actually flow back through each enabled channel and reach
|
| 6 |
+
model parameters when the channel is on, AND that disabled channels
|
| 7 |
+
produce zero side-effects on the autograd graph.
|
| 8 |
+
|
| 9 |
+
Coverage:
|
| 10 |
+
1. test_alpha_sdpo_routes_grad_to_params
|
| 11 |
+
— alpha_sdpo=1.0 + SDPO inputs => non-zero finite grads on params
|
| 12 |
+
2. test_beta_replay_routes_grad_to_params
|
| 13 |
+
— beta_replay=1.0 + DPO inputs => non-zero finite grads on params
|
| 14 |
+
3. test_alpha_zero_blocks_sdpo_grad
|
| 15 |
+
— alpha_sdpo=0.0: SDPO inputs present vs absent yields BIT-IDENTICAL
|
| 16 |
+
param.grad on every parameter (catches phantom-gradient leaks
|
| 17 |
+
from disabled channels)
|
| 18 |
+
4. test_taid_grad_flows_through_sdpo_path
|
| 19 |
+
— sdpo_wrapper="taid", taid_t=0.5 still routes grads through
|
| 20 |
+
the SDPO channel under autograd
|
| 21 |
+
|
| 22 |
+
Same TinyLM scaffold as test_compose_loss_integration.py — no HF / TRL,
|
| 23 |
+
all tests run in milliseconds.
|
| 24 |
+
"""
|
| 25 |
+
from __future__ import annotations
|
| 26 |
+
|
| 27 |
+
import math
|
| 28 |
+
|
| 29 |
+
import torch
|
| 30 |
+
import torch.nn as nn
|
| 31 |
+
|
| 32 |
+
from composer_replication import compose_loss
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
# ----------------------------------------------------------------------
|
| 36 |
+
# Tiny LM stand-in (mirrors test_compose_loss_integration.py)
|
| 37 |
+
# ----------------------------------------------------------------------
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
class TinyLM(nn.Module):
|
| 41 |
+
"""Minimal nn.Module with HF-style ``model(input_ids=...).logits`` API."""
|
| 42 |
+
|
| 43 |
+
def __init__(self, vocab: int = 32, hidden: int = 16, seed: int = 0):
|
| 44 |
+
super().__init__()
|
| 45 |
+
torch.manual_seed(seed)
|
| 46 |
+
self.embed = nn.Embedding(vocab, hidden)
|
| 47 |
+
self.fc = nn.Linear(hidden, hidden)
|
| 48 |
+
self.head = nn.Linear(hidden, vocab)
|
| 49 |
+
|
| 50 |
+
def forward(self, input_ids: torch.Tensor):
|
| 51 |
+
h = torch.tanh(self.fc(self.embed(input_ids)))
|
| 52 |
+
logits = self.head(h)
|
| 53 |
+
|
| 54 |
+
class _Out:
|
| 55 |
+
pass
|
| 56 |
+
out = _Out()
|
| 57 |
+
out.logits = logits
|
| 58 |
+
return out
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
# ----------------------------------------------------------------------
|
| 62 |
+
# Batch fixtures (mirror test_compose_loss_integration.py shape)
|
| 63 |
+
# ----------------------------------------------------------------------
|
| 64 |
+
|
| 65 |
+
VOCAB = 32
|
| 66 |
+
B = 2
|
| 67 |
+
T = 8
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def _make_inputs(seed: int = 7, *, with_sdpo: bool, with_dpo: bool) -> dict:
|
| 71 |
+
"""Build a deterministic input batch with optional channel inputs.
|
| 72 |
+
|
| 73 |
+
SDPO and DPO inputs can be independently included or excluded so we
|
| 74 |
+
can exercise the channel-disable code paths cleanly.
|
| 75 |
+
"""
|
| 76 |
+
g = torch.Generator().manual_seed(seed)
|
| 77 |
+
inputs: dict[str, torch.Tensor] = {
|
| 78 |
+
"input_ids": torch.randint(0, VOCAB, (B, T), generator=g),
|
| 79 |
+
"response_mask": torch.zeros(B, T, dtype=torch.long),
|
| 80 |
+
}
|
| 81 |
+
inputs["response_mask"][:, T // 2:] = 1
|
| 82 |
+
|
| 83 |
+
if with_sdpo:
|
| 84 |
+
inputs["ctx_teacher_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
|
| 85 |
+
inputs["sdpo_loss_mask"] = torch.zeros(B, T, dtype=torch.long)
|
| 86 |
+
inputs["sdpo_loss_mask"][:, T // 2:] = 1
|
| 87 |
+
|
| 88 |
+
if with_dpo:
|
| 89 |
+
inputs["dpo_chosen_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
|
| 90 |
+
inputs["dpo_chosen_response_mask"] = torch.ones(B, T, dtype=torch.long)
|
| 91 |
+
inputs["dpo_rejected_input_ids"] = torch.randint(0, VOCAB, (B, T), generator=g)
|
| 92 |
+
inputs["dpo_rejected_response_mask"] = torch.ones(B, T, dtype=torch.long)
|
| 93 |
+
inputs["dpo_chosen_ref_logprobs"] = torch.randn(B, generator=g)
|
| 94 |
+
inputs["dpo_rejected_ref_logprobs"] = torch.randn(B, generator=g)
|
| 95 |
+
|
| 96 |
+
return inputs
|
| 97 |
+
|
| 98 |
+
|
| 99 |
+
def _grad_norm(model: nn.Module) -> float:
|
| 100 |
+
"""Sum of |grad| across all params with non-None grad."""
|
| 101 |
+
return sum(
|
| 102 |
+
p.grad.detach().abs().sum().item()
|
| 103 |
+
for p in model.parameters()
|
| 104 |
+
if p.grad is not None
|
| 105 |
+
)
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
def _grad_is_finite(model: nn.Module) -> bool:
|
| 109 |
+
"""All param grads are finite (no inf, no nan)."""
|
| 110 |
+
for p in model.parameters():
|
| 111 |
+
if p.grad is None:
|
| 112 |
+
continue
|
| 113 |
+
if not torch.isfinite(p.grad).all():
|
| 114 |
+
return False
|
| 115 |
+
return True
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def _model() -> TinyLM:
|
| 119 |
+
"""Fresh TinyLM with deterministic init."""
|
| 120 |
+
return TinyLM(vocab=VOCAB, hidden=16, seed=0)
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
# ----------------------------------------------------------------------
|
| 124 |
+
# Test 1 — SDPO channel routes grads to params when alpha_sdpo > 0
|
| 125 |
+
# ----------------------------------------------------------------------
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
def test_alpha_sdpo_routes_grad_to_params():
|
| 129 |
+
"""When alpha_sdpo > 0 and SDPO inputs are present, calling
|
| 130 |
+
out.total.backward() must produce non-zero finite gradients on
|
| 131 |
+
model parameters.
|
| 132 |
+
"""
|
| 133 |
+
model = _model()
|
| 134 |
+
inputs = _make_inputs(with_sdpo=True, with_dpo=False)
|
| 135 |
+
|
| 136 |
+
out = compose_loss(
|
| 137 |
+
model,
|
| 138 |
+
inputs,
|
| 139 |
+
alpha_sdpo=1.0,
|
| 140 |
+
beta_replay=0.0,
|
| 141 |
+
)
|
| 142 |
+
|
| 143 |
+
# Sanity: SDPO actually fired (channel is non-zero).
|
| 144 |
+
assert float(out.sdpo_jsd) != 0.0, (
|
| 145 |
+
"alpha_sdpo=1.0 with SDPO inputs should produce a non-zero sdpo_jsd; "
|
| 146 |
+
f"got {float(out.sdpo_jsd)}"
|
| 147 |
+
)
|
| 148 |
+
|
| 149 |
+
out.total.backward()
|
| 150 |
+
g = _grad_norm(model)
|
| 151 |
+
assert g > 0.0, f"Expected non-zero grad sum from SDPO channel; got {g}"
|
| 152 |
+
assert math.isfinite(g), f"Grad sum is not finite: {g}"
|
| 153 |
+
assert _grad_is_finite(model), "Some grads are inf/nan"
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ----------------------------------------------------------------------
|
| 157 |
+
# Test 2 — Replay-DPO channel routes grads to params when beta_replay > 0
|
| 158 |
+
# ----------------------------------------------------------------------
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def test_beta_replay_routes_grad_to_params():
|
| 162 |
+
"""When beta_replay > 0 and DPO inputs are present, backward must
|
| 163 |
+
produce non-zero finite gradients on model parameters.
|
| 164 |
+
|
| 165 |
+
Note: response_mask is set to all-zeros so the LM-CE channel is
|
| 166 |
+
exactly zero — any non-zero grad must come from the DPO channel.
|
| 167 |
+
"""
|
| 168 |
+
model = _model()
|
| 169 |
+
inputs = _make_inputs(with_sdpo=False, with_dpo=True)
|
| 170 |
+
# Zero out response_mask so LM-CE contributes nothing — isolates DPO.
|
| 171 |
+
inputs["response_mask"] = torch.zeros(B, T, dtype=torch.long)
|
| 172 |
+
|
| 173 |
+
out = compose_loss(
|
| 174 |
+
model,
|
| 175 |
+
inputs,
|
| 176 |
+
alpha_sdpo=0.0,
|
| 177 |
+
beta_replay=1.0,
|
| 178 |
+
)
|
| 179 |
+
|
| 180 |
+
assert float(out.lm_ce) == 0.0, "LM-CE should be zero with empty response_mask"
|
| 181 |
+
assert float(out.trace_replay_dpo) != 0.0, (
|
| 182 |
+
"beta_replay=1.0 with DPO inputs should produce a non-zero "
|
| 183 |
+
f"trace_replay_dpo; got {float(out.trace_replay_dpo)}"
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
out.total.backward()
|
| 187 |
+
g = _grad_norm(model)
|
| 188 |
+
assert g > 0.0, f"Expected non-zero grad sum from DPO channel; got {g}"
|
| 189 |
+
assert math.isfinite(g), f"Grad sum is not finite: {g}"
|
| 190 |
+
assert _grad_is_finite(model), "Some grads are inf/nan"
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
# ----------------------------------------------------------------------
|
| 194 |
+
# Test 3 — Disabled SDPO channel produces ZERO side-effects on autograd
|
| 195 |
+
# ----------------------------------------------------------------------
|
| 196 |
+
|
| 197 |
+
|
| 198 |
+
def test_alpha_zero_blocks_sdpo_grad():
|
| 199 |
+
"""With alpha_sdpo=0.0, providing SDPO inputs vs omitting them must
|
| 200 |
+
produce bit-identical parameter gradients.
|
| 201 |
+
|
| 202 |
+
This catches a class of bug where a disabled channel leaks a phantom
|
| 203 |
+
contribution into the autograd graph (e.g. if the SDPO branch ran a
|
| 204 |
+
forward pass even when alpha=0 and somehow scaled the result by
|
| 205 |
+
alpha=0 incorrectly).
|
| 206 |
+
"""
|
| 207 |
+
inputs_with_sdpo = _make_inputs(with_sdpo=True, with_dpo=False)
|
| 208 |
+
inputs_no_sdpo = _make_inputs(with_sdpo=False, with_dpo=False)
|
| 209 |
+
|
| 210 |
+
# Trial A: SDPO inputs present, alpha=0 — channel should be silent.
|
| 211 |
+
model_a = _model()
|
| 212 |
+
out_a = compose_loss(model_a, inputs_with_sdpo, alpha_sdpo=0.0, beta_replay=0.0)
|
| 213 |
+
out_a.total.backward()
|
| 214 |
+
grads_a = {
|
| 215 |
+
name: p.grad.detach().clone() if p.grad is not None else None
|
| 216 |
+
for name, p in model_a.named_parameters()
|
| 217 |
+
}
|
| 218 |
+
|
| 219 |
+
# Trial B: SDPO inputs absent, alpha=0.
|
| 220 |
+
model_b = _model() # Same seed -> bit-identical init.
|
| 221 |
+
out_b = compose_loss(model_b, inputs_no_sdpo, alpha_sdpo=0.0, beta_replay=0.0)
|
| 222 |
+
out_b.total.backward()
|
| 223 |
+
grads_b = {
|
| 224 |
+
name: p.grad.detach().clone() if p.grad is not None else None
|
| 225 |
+
for name, p in model_b.named_parameters()
|
| 226 |
+
}
|
| 227 |
+
|
| 228 |
+
# Bit-identical grads on every parameter.
|
| 229 |
+
assert set(grads_a.keys()) == set(grads_b.keys())
|
| 230 |
+
for name in grads_a:
|
| 231 |
+
ga, gb = grads_a[name], grads_b[name]
|
| 232 |
+
if ga is None and gb is None:
|
| 233 |
+
continue
|
| 234 |
+
assert ga is not None and gb is not None, (
|
| 235 |
+
f"Param {name}: grad_a={ga is not None}, grad_b={gb is not None}"
|
| 236 |
+
)
|
| 237 |
+
# atol=0, rtol=0 -> bit-exact equality. SDPO inputs with alpha=0
|
| 238 |
+
# must not perturb the autograd graph by even one ULP.
|
| 239 |
+
assert torch.equal(ga, gb), (
|
| 240 |
+
f"Param {name}: disabled SDPO channel leaked phantom gradient. "
|
| 241 |
+
f"|diff|.max()={float((ga - gb).abs().max())}"
|
| 242 |
+
)
|
| 243 |
+
|
| 244 |
+
|
| 245 |
+
# ----------------------------------------------------------------------
|
| 246 |
+
# Test 4 — TAID-wrapped SDPO channel still routes grads under autograd
|
| 247 |
+
# ----------------------------------------------------------------------
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
def test_taid_grad_flows_through_sdpo_path():
|
| 251 |
+
"""The Wave 15 TAID rewrite (logit-space mix, current-student-detached
|
| 252 |
+
anchor) must remain differentiable. With sdpo_wrapper='taid' and
|
| 253 |
+
taid_t=0.5, backward must produce non-zero finite gradients on
|
| 254 |
+
model parameters.
|
| 255 |
+
"""
|
| 256 |
+
model = _model()
|
| 257 |
+
inputs = _make_inputs(with_sdpo=True, with_dpo=False)
|
| 258 |
+
|
| 259 |
+
out = compose_loss(
|
| 260 |
+
model,
|
| 261 |
+
inputs,
|
| 262 |
+
alpha_sdpo=1.0,
|
| 263 |
+
beta_replay=0.0,
|
| 264 |
+
sdpo_wrapper="taid",
|
| 265 |
+
taid_t=0.5,
|
| 266 |
+
)
|
| 267 |
+
|
| 268 |
+
assert float(out.sdpo_jsd) != 0.0, (
|
| 269 |
+
f"taid_t=0.5 should still produce a non-zero sdpo_jsd; "
|
| 270 |
+
f"got {float(out.sdpo_jsd)}"
|
| 271 |
+
)
|
| 272 |
+
|
| 273 |
+
out.total.backward()
|
| 274 |
+
g = _grad_norm(model)
|
| 275 |
+
assert g > 0.0, (
|
| 276 |
+
f"Expected non-zero grad sum from TAID-wrapped SDPO channel; got {g}"
|
| 277 |
+
)
|
| 278 |
+
assert math.isfinite(g), f"Grad sum is not finite: {g}"
|
| 279 |
+
assert _grad_is_finite(model), "Some grads are inf/nan"
|
|
@@ -103,6 +103,7 @@ components.total.backward()
|
|
| 103 |
```
|
| 104 |
|
| 105 |
### `compose_loss(model, inputs, *, ...) -> LossComponents`
|
|
|
|
| 106 |
|
| 107 |
```python
|
| 108 |
def compose_loss(
|
|
|
|
| 103 |
```
|
| 104 |
|
| 105 |
### `compose_loss(model, inputs, *, ...) -> LossComponents`
|
| 106 |
+
<a id="compose_loss"></a>
|
| 107 |
|
| 108 |
```python
|
| 109 |
def compose_loss(
|
|
@@ -55,31 +55,22 @@ total_loss = grpo_loss
|
|
| 55 |
This is implemented once, in
|
| 56 |
[`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
|
| 57 |
and re-used by every recipe via the kwargs documented in
|
| 58 |
-
[`API_REFERENCE.md`](API_REFERENCE.md). The
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
|
| 60 |
```python
|
| 61 |
-
|
| 62 |
-
model,
|
| 63 |
-
inputs,
|
| 64 |
-
*,
|
| 65 |
-
alpha_sdpo: float = 0.1,
|
| 66 |
-
beta_replay: float = 0.05,
|
| 67 |
-
sdpo_jsd_beta: float = 0.5,
|
| 68 |
-
sdpo_temperature: float = 1.0,
|
| 69 |
-
sdpo_token_clip: float | None = None,
|
| 70 |
-
replay_dpo_beta: float = 0.1,
|
| 71 |
-
# ADR-007 extensions
|
| 72 |
-
dpo_variant: Literal["dpo", "simpo"] = "dpo",
|
| 73 |
-
sdpo_wrapper: Literal["none", "taid", "entropy_opd"] = "none",
|
| 74 |
-
taid_t: float | None = None,
|
| 75 |
-
simpo_beta: float = 2.0,
|
| 76 |
-
simpo_gamma: float = 1.0,
|
| 77 |
-
entropy_opd_h_max: float | None = None,
|
| 78 |
-
) -> torch.Tensor: ...
|
| 79 |
```
|
| 80 |
|
| 81 |
All five recipes below either call `compose_loss` directly or call a
|
| 82 |
-
thin per-framework adapter that forwards these kwargs unchanged.
|
|
|
|
|
|
|
|
|
|
| 83 |
|
| 84 |
---
|
| 85 |
|
|
|
|
| 55 |
This is implemented once, in
|
| 56 |
[`composer_replication/loss.py::compose_loss`](../composer_replication/loss.py),
|
| 57 |
and re-used by every recipe via the kwargs documented in
|
| 58 |
+
[`API_REFERENCE.md`](API_REFERENCE.md). The full signature — including
|
| 59 |
+
all ADR-007 channel-2/3 knobs (`dpo_variant`, `sdpo_wrapper`, `taid_t`,
|
| 60 |
+
`simpo_beta`/`simpo_gamma`, `entropy_opd_h_max`, …) — is the
|
| 61 |
+
single source of truth in
|
| 62 |
+
[API_REFERENCE.md § `compose_loss`](API_REFERENCE.md#compose_loss).
|
| 63 |
+
The conceptual call shape is just:
|
| 64 |
|
| 65 |
```python
|
| 66 |
+
compose_loss(model, inputs, **kwargs) # see API_REFERENCE.md#compose_loss for full signature
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
```
|
| 68 |
|
| 69 |
All five recipes below either call `compose_loss` directly or call a
|
| 70 |
+
thin per-framework adapter that forwards these kwargs unchanged. Each
|
| 71 |
+
recipe's **§5 Distillation-loss wiring** documents the kwargs *that
|
| 72 |
+
recipe* uses by default and why; refer back to API_REFERENCE.md for
|
| 73 |
+
defaults, types, and which kwargs are mutually exclusive.
|
| 74 |
|
| 75 |
---
|
| 76 |
|
|
@@ -759,6 +759,80 @@ pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py::test_r
|
|
| 759 |
|
| 760 |
---
|
| 761 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 762 |
## How to file a bug report
|
| 763 |
|
| 764 |
If you've read the relevant section above and your problem persists,
|
|
|
|
| 759 |
|
| 760 |
---
|
| 761 |
|
| 762 |
+
### 10. `monarch` / `data-juicer` / `prime-rl` install (Wave 16)
|
| 763 |
+
|
| 764 |
+
**SYMPTOM.** `pip install -e ".[monarch]"`, `pip install -e ".[prime-rl]"`,
|
| 765 |
+
or `pip install -e ".[replaysim]"` fails immediately with a uv/pip
|
| 766 |
+
resolver error similar to:
|
| 767 |
+
|
| 768 |
+
```
|
| 769 |
+
× No solution found when resolving dependencies:
|
| 770 |
+
╰─▶ Because only monarch<=0.1.11 is available and
|
| 771 |
+
composer-replication[monarch] depends on monarch>=0.4.1, we can
|
| 772 |
+
conclude that composer-replication[monarch]'s requirements are
|
| 773 |
+
unsatisfiable.
|
| 774 |
+
```
|
| 775 |
+
|
| 776 |
+
**DIAGNOSIS.** Three upstream packages the framework integrates with are
|
| 777 |
+
not currently pip-installable in their advertised versions:
|
| 778 |
+
|
| 779 |
+
1. **Meta's Monarch** is published on PyPI as
|
| 780 |
+
`torchmonarch-nightly` (nightly wheels with platform constraints), not
|
| 781 |
+
as `monarch`. The PyPI name `monarch` is unrelated to Meta's actor
|
| 782 |
+
framework and tops out at `0.1.11`.
|
| 783 |
+
2. **Prime Intellect's prime-rl** is not registered on PyPI at all. It
|
| 784 |
+
is published from source only.
|
| 785 |
+
3. **data-juicer** is not registered on PyPI under that exact name. The
|
| 786 |
+
closest match (`py-data-juicer==1.0.0`) has broken transitive deps;
|
| 787 |
+
newer `py-data-juicer` releases work but install ~150 transitive
|
| 788 |
+
packages.
|
| 789 |
+
|
| 790 |
+
Wave 16 dropped all three extras from `pyproject.toml` rather than ship
|
| 791 |
+
unsatisfiable pins. The framework code paths that touch these libraries
|
| 792 |
+
import them lazily, so:
|
| 793 |
+
- `composer_replication.recipes.monarch` is a documentation skeleton
|
| 794 |
+
that does NOT require monarch installed.
|
| 795 |
+
- `composer_replication.recipes.prime_rl.composer_loss` imports cleanly
|
| 796 |
+
without prime-rl; the upstream parity test is `@skipif`-gated and the
|
| 797 |
+
in-file shadow-parity test still verifies the loss formula
|
| 798 |
+
independently.
|
| 799 |
+
- `composer_replication.replaysim.normalize.DJNormalizer(skip_dj=True)`
|
| 800 |
+
works without `data_juicer`; only the full DJNormalizer code path
|
| 801 |
+
needs it.
|
| 802 |
+
|
| 803 |
+
**FIX.** If you want any of these libraries' real functionality, install
|
| 804 |
+
from source alongside the framework:
|
| 805 |
+
|
| 806 |
+
```
|
| 807 |
+
# Meta Monarch (actor framework — see ADR-006)
|
| 808 |
+
pip install torchmonarch-nightly # OR install from source:
|
| 809 |
+
# git clone https://github.com/meta-pytorch/monarch && cd monarch && pip install -e .
|
| 810 |
+
|
| 811 |
+
# Prime Intellect prime-rl (Recipe C — see ADR-006)
|
| 812 |
+
git clone https://github.com/PrimeIntellect-ai/prime-rl
|
| 813 |
+
cd prime-rl && pip install -e .
|
| 814 |
+
|
| 815 |
+
# data-juicer (replaysim normalization — see ADR-004)
|
| 816 |
+
git clone https://github.com/modelscope/data-juicer
|
| 817 |
+
cd data-juicer && pip install -e .
|
| 818 |
+
```
|
| 819 |
+
|
| 820 |
+
**VERIFICATION.** A fresh checkout install with all surviving extras
|
| 821 |
+
should succeed:
|
| 822 |
+
|
| 823 |
+
```
|
| 824 |
+
uv venv --clear
|
| 825 |
+
uv pip install -e ".[diloco,replay,replaysim,train,dev]"
|
| 826 |
+
source .venv/bin/activate
|
| 827 |
+
python -m pytest -q # baseline 176 passed / 8 skipped
|
| 828 |
+
```
|
| 829 |
+
|
| 830 |
+
If any of those extras fails to resolve, file a bug report — Wave 16
|
| 831 |
+
verified the full extras matrix installs from a clean venv on Python
|
| 832 |
+
3.11.
|
| 833 |
+
|
| 834 |
+
---
|
| 835 |
+
|
| 836 |
## How to file a bug report
|
| 837 |
|
| 838 |
If you've read the relevant section above and your problem persists,
|
|
@@ -621,6 +621,17 @@ if __name__ == "__main__":
|
|
| 621 |
|
| 622 |
### 3.4 Package layout
|
| 623 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 624 |
```
|
| 625 |
composer_replication/
|
| 626 |
└── diloco/
|
|
|
|
| 621 |
|
| 622 |
### 3.4 Package layout
|
| 623 |
|
| 624 |
+
<!-- AUDIT: stale_serverless_layout — ADR-005 shipped a flatter layout than this
|
| 625 |
+
proposal. Actual modules under composer_replication/diloco/serverless/
|
| 626 |
+
are: __init__.py, executor.py (ServerlessExecutor + LocalProcessExecutor),
|
| 627 |
+
allreduce.py (ObjectStoreAllReduce + MockManager), modal.py (ModalExecutor),
|
| 628 |
+
hf_jobs.py (HFJobsExecutor), replica_entrypoint.py. No leading underscores,
|
| 629 |
+
no _protocol/_base/_rendezvous split, and Modal/HFJobs are flat modules
|
| 630 |
+
rather than subpackages. The above code-block file headers (e.g.
|
| 631 |
+
`_modal_adapter.py`, `_hf_jobs_adapter.py`, `_protocol.py`, `_rendezvous.py`)
|
| 632 |
+
are pre-implementation proposals; map them to the realised module names
|
| 633 |
+
when reading. -->
|
| 634 |
+
|
| 635 |
```
|
| 636 |
composer_replication/
|
| 637 |
└── diloco/
|
|
@@ -312,6 +312,20 @@ write_jsonl(out_path, pairs)
|
|
| 312 |
|
| 313 |
### 4.3 Adapter shape (`replaysim/normalize.py`)
|
| 314 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 315 |
```python
|
| 316 |
# composer_replication/replaysim/normalize.py
|
| 317 |
from __future__ import annotations
|
|
|
|
| 312 |
|
| 313 |
### 4.3 Adapter shape (`replaysim/normalize.py`)
|
| 314 |
|
| 315 |
+
<!-- AUDIT: stale_replaysim_paths_and_dpo_shape — ADR-004 shipped at
|
| 316 |
+
composer_replication/replaysim/normalize.py with a different DPOPair shape
|
| 317 |
+
than this sketch. Actual DPOPair is a TypedDict with fields
|
| 318 |
+
{state_id, state_messages, chosen: str, rejected: str, n_teachers_agreeing}
|
| 319 |
+
— NOT {prompt, chosen, rejected, state, meta} as in the proposal below. The
|
| 320 |
+
YAML recipe also lives at composer_replication/recipes/replaysim/default.yaml
|
| 321 |
+
(not composer_replication/replaysim/recipes/dpo_normalize.yaml). The hook
|
| 322 |
+
in §4.5 is provided by `replay_and_normalize_trace` in
|
| 323 |
+
composer_replication/replaysim/__init__.py rather than a drop-in edit to
|
| 324 |
+
`teacher_replay.py`. The custom op file (§4.4 line 426 / §4.4 line 431)
|
| 325 |
+
`composer_replication/replaysim/ops/preference_validator.py` was not
|
| 326 |
+
created. Treat the sketch below as proposal, not as documentation of
|
| 327 |
+
the realised code. -->
|
| 328 |
+
|
| 329 |
```python
|
| 330 |
# composer_replication/replaysim/normalize.py
|
| 331 |
from __future__ import annotations
|
|
@@ -313,9 +313,13 @@ group_size = 16
|
|
| 313 |
|
| 314 |
[trainer]
|
| 315 |
algorithm = "grpo"
|
|
|
|
|
|
|
|
|
|
|
|
|
| 316 |
[trainer.loss]
|
| 317 |
type = "custom"
|
| 318 |
-
import_path = "composer_replication.
|
| 319 |
[trainer.loss.kwargs]
|
| 320 |
hint_weight = 0.5
|
| 321 |
replay_weight = 0.25
|
|
@@ -330,10 +334,15 @@ sync_mode = "async"
|
|
| 330 |
shardcast = true
|
| 331 |
```
|
| 332 |
|
| 333 |
-
`composer_replication/
|
|
|
|
|
|
|
|
|
|
| 334 |
|
| 335 |
```python
|
| 336 |
-
# composer_replication/
|
|
|
|
|
|
|
| 337 |
from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
|
| 338 |
|
| 339 |
def composer_three_channel_loss(
|
|
|
|
| 313 |
|
| 314 |
[trainer]
|
| 315 |
algorithm = "grpo"
|
| 316 |
+
<!-- AUDIT: stale_recipe_format — Wave 14b shipped this as YAML at
|
| 317 |
+
composer_replication/recipes/prime_rl/prime_rl_config.yaml with a different
|
| 318 |
+
kwarg surface (alpha_sdpo, beta_dpo, dppo_mask_high, dppo_mask_low, adv_tau,
|
| 319 |
+
kl_tau). The TOML/hint_weight/replay_weight sketch below predates that. -->
|
| 320 |
[trainer.loss]
|
| 321 |
type = "custom"
|
| 322 |
+
import_path = "composer_replication.recipes.prime_rl.composer_loss:loss_fn"
|
| 323 |
[trainer.loss.kwargs]
|
| 324 |
hint_weight = 0.5
|
| 325 |
replay_weight = 0.25
|
|
|
|
| 334 |
shardcast = true
|
| 335 |
```
|
| 336 |
|
| 337 |
+
`composer_replication/recipes/prime_rl/composer_loss.py` (~120 LOC; current Wave 14b
|
| 338 |
+
implementation defines `loss_fn(inputs, **kwargs)` rather than the
|
| 339 |
+
`composer_three_channel_loss(li, *, hint_weight, replay_weight, replay_logits)`
|
| 340 |
+
signature sketched below):
|
| 341 |
|
| 342 |
```python
|
| 343 |
+
# composer_replication/recipes/prime_rl/composer_loss.py — sketch only;
|
| 344 |
+
# the actual signature evolved during Wave 14b. See module docstring for
|
| 345 |
+
# the current `loss_fn` contract.
|
| 346 |
from prime_rl.trainer.rl.loss import LossInputs, LossOutputs
|
| 347 |
|
| 348 |
def composer_three_channel_loss(
|
|
@@ -352,6 +352,14 @@ license + reproducible scale) to recommend adding right now.
|
|
| 352 |
For ADR-007 the proposed addition is a `composer_replication.distillation`
|
| 353 |
sub-package with three pluggable hooks:
|
| 354 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 355 |
```
|
| 356 |
composer_replication/
|
| 357 |
distillation/
|
|
|
|
| 352 |
For ADR-007 the proposed addition is a `composer_replication.distillation`
|
| 353 |
sub-package with three pluggable hooks:
|
| 354 |
|
| 355 |
+
<!-- AUDIT: stale_distillation_layout — ADR-007 shipped a flatter layout than
|
| 356 |
+
this proposal. Actual modules: composer_replication/distillation/{simpo.py,
|
| 357 |
+
taid.py, entropy_aware_opd.py}. There is no targets.py/losses.py split,
|
| 358 |
+
no top-level preference/ subpackage, and SimPO lives under distillation/
|
| 359 |
+
rather than preference/. The function names also differ: actual exports
|
| 360 |
+
are `simpo_loss`, `taid_loss` + `TAIDScheduler`, and `entropy_aware_opd_loss`
|
| 361 |
+
(not `taid_target` / `entropy_aware_kl_loss`). -->
|
| 362 |
+
|
| 363 |
```
|
| 364 |
composer_replication/
|
| 365 |
distillation/
|
|
@@ -244,6 +244,13 @@ For users on other machines: `find ~/.claude/projects -name '*.jsonl' -size +50k
|
|
| 244 |
|
| 245 |
## 6. TraceIngester sketch
|
| 246 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
|
| 248 |
|
| 249 |
```python
|
|
|
|
| 244 |
|
| 245 |
## 6. TraceIngester sketch
|
| 246 |
|
| 247 |
+
<!-- AUDIT: stale_ingester_paths_and_naming — Spike 007 shipped at
|
| 248 |
+
spikes/007-real-trace-ingestion/claude_code_ingester.py (NOT
|
| 249 |
+
spikes/007-trace-ingester/trace_ingester.py) and the production-side
|
| 250 |
+
module is composer_replication/ingestion/claude_code.py exporting
|
| 251 |
+
`ClaudeCodeIngester` (NOT `TraceIngester`). The sketch below is the
|
| 252 |
+
pre-spike proposal; the realised API surface is named differently. -->
|
| 253 |
+
|
| 254 |
Drop-in adapter for spike-005's `replay_trace()`. Targets `TraceState` (the actual existing TypedDict; see §1).
|
| 255 |
|
| 256 |
```python
|
|
@@ -0,0 +1,179 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Wave 16 Reconnaissance Audit
|
| 2 |
+
|
| 3 |
+
Audit of `docs/research/*RECONNAISSANCE.md` and `*LANDSCAPE.md` against repo
|
| 4 |
+
state at sha `e5add150ab06aeef3adda726c0fcae05aa270500`.
|
| 5 |
+
|
| 6 |
+
Wave 16d charter: check 7 recon/landscape docs against current code, produce
|
| 7 |
+
this audit table, and apply only **unambiguous** inline fixes. Ambiguous claims
|
| 8 |
+
are tagged with `<!-- AUDIT: stale_<short> -->` HTML comments inline and
|
| 9 |
+
recorded here for orchestrator follow-up.
|
| 10 |
+
|
| 11 |
+
## Summary
|
| 12 |
+
|
| 13 |
+
- Total claims checked: ~36 across 7 docs
|
| 14 |
+
- KEEP: 24
|
| 15 |
+
- FIX (unambiguous, applied inline): 2
|
| 16 |
+
- FLAG (ambiguous; HTML comment + entry below): 5
|
| 17 |
+
- DELETE: 0
|
| 18 |
+
|
| 19 |
+
The framework deliberately ships proposal-shaped recon docs alongside built
|
| 20 |
+
code. Wave 16d's posture has been "do not rewrite proposals; flag where the
|
| 21 |
+
realised code diverged so future readers know which sections are pre-impl
|
| 22 |
+
sketch vs. accurate-as-of-build documentation."
|
| 23 |
+
|
| 24 |
+
## Per-doc findings
|
| 25 |
+
|
| 26 |
+
### DILOCO_RECONNAISSANCE.md
|
| 27 |
+
|
| 28 |
+
External-only doc (about `meta-pytorch/torchft`); no in-repo symbol or path
|
| 29 |
+
references. Nothing to audit against current code.
|
| 30 |
+
|
| 31 |
+
| Claim | Status | Action |
|
| 32 |
+
| --- | --- | --- |
|
| 33 |
+
| All references are to `torchft.local_sgd.DiLoCo`, `torchft/local_sgd.py:324`, `torchft/manager.py`, etc. | external library, not our concern | KEEP |
|
| 34 |
+
| Recommends `pip install torchft-nightly`; `composer_replication/diloco/__init__.py` later adopted that | confirmed in code | KEEP |
|
| 35 |
+
|
| 36 |
+
### DILOCO_SERVERLESS_RECONNAISSANCE.md
|
| 37 |
+
|
| 38 |
+
Doc is a pre-implementation proposal for ADR-005. The realised package layout
|
| 39 |
+
under `composer_replication/diloco/serverless/` is flatter and uses different
|
| 40 |
+
module names than the proposal.
|
| 41 |
+
|
| 42 |
+
| Claim | Status | Action |
|
| 43 |
+
| --- | --- | --- |
|
| 44 |
+
| `composer_replication.diloco.serverless` namespace exists | matches reality | KEEP |
|
| 45 |
+
| Code blocks use file headers `_modal_adapter.py`, `_hf_jobs_adapter.py`, `_protocol.py`, `_rendezvous.py`, `_base.py` | actual modules are `modal.py`, `hf_jobs.py`, `executor.py`, `allreduce.py` (no leading underscore, no _protocol/_base/_rendezvous split) | FLAG (`stale_serverless_layout` HTML comment added before §3.4) |
|
| 46 |
+
| §3.4 proposed `modal/`, `hfjobs/`, `runpod/` subpackages | actual ships flat `modal.py`, `hf_jobs.py` modules | FLAG (covered by same comment) |
|
| 47 |
+
| `make_diloco_outer_loop` lives in `composer_replication/diloco/__init__.py` lines 64–125 | line numbers verified (def at 64, body ends ~125) | KEEP |
|
| 48 |
+
| `python -m composer_replication.diloco.serverless.replica_entrypoint` | module exists with `main()` at line 38 | KEEP |
|
| 49 |
+
| `MockManager` exists in serverless package | confirmed at `composer_replication/diloco/serverless/allreduce.py:215` | KEEP |
|
| 50 |
+
| `ObjectStoreAllReduce` exists | confirmed at `composer_replication/diloco/serverless/allreduce.py:30` | KEEP |
|
| 51 |
+
| §3.5 user-facing API `from composer_replication.diloco.serverless import ModalExecutor, HFJobsExecutor, ReplicaSpec` | partial: `ModalExecutor` and `HFJobsExecutor` exist as classes, but `serverless/__init__.py` does NOT re-export them (only `LocalProcessExecutor`, `MockManager`, `ObjectStoreAllReduce`, `ReplicaHandle`, `ServerlessExecutor`) and `ReplicaSpec` is not implemented | covered by `stale_serverless_layout` flag |
|
| 52 |
+
|
| 53 |
+
### MODAL_RECONNAISSANCE.md
|
| 54 |
+
|
| 55 |
+
Doc anchors all loss-shape claims to `spikes/005-integrated-trainer-skeleton/`,
|
| 56 |
+
which is unchanged since Wave 15. All checked references resolve.
|
| 57 |
+
|
| 58 |
+
| Claim | Status | Action |
|
| 59 |
+
| --- | --- | --- |
|
| 60 |
+
| `spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py` exists | confirmed | KEEP |
|
| 61 |
+
| `spikes/005-integrated-trainer-skeleton/opsd_loss.py` exists | confirmed | KEEP |
|
| 62 |
+
| `_compute_sdpo_loss` student/teacher forwards at `composer_trainer.py L138–143` | confirmed (L138-143 are student/teacher logits) | KEEP |
|
| 63 |
+
| `_compute_trace_replay_loss` chosen/rejected forwards at `L191–198` | confirmed (L191-198 are `_sequence_logprobs` calls) | KEEP |
|
| 64 |
+
| Zero-tensor short-circuit at `L136` and `L155` | confirmed (both lines `return torch.tensor(0.0, ..., requires_grad=True)`) | KEEP |
|
| 65 |
+
| `opsd_loss.py L54` references `top_k` arg | confirmed | KEEP |
|
| 66 |
+
| Trainer defaults `alpha_sdpo=0.1`, `beta_replay=0.05` | match `composer_replication/loss.py:75-76` | KEEP |
|
| 67 |
+
|
| 68 |
+
### REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md
|
| 69 |
+
|
| 70 |
+
Pre-spike proposal for ADR-004. The library was adopted (data-juicer) and
|
| 71 |
+
`DJNormalizer` exists, but the adapter shape and recipe path differ from the
|
| 72 |
+
proposal.
|
| 73 |
+
|
| 74 |
+
| Claim | Status | Action |
|
| 75 |
+
| --- | --- | --- |
|
| 76 |
+
| `composer_replication.replaysim` package exists with `DJNormalizer` | confirmed at `composer_replication/replaysim/normalize.py:145` | KEEP |
|
| 77 |
+
| `replay_trace`, `extract_dpo_pairs` re-exported from replaysim | confirmed in `__init__.py` | KEEP |
|
| 78 |
+
| §4.3 sketch shows DPOPair with `{prompt, chosen, rejected, state, meta}` and `_to_dj`/`_from_dj` round-trip | actual `DPOPair` is `{state_id, state_messages, chosen: str, rejected: str, n_teachers_agreeing}` (TypedDict in `composer_replication/teacher_replay.py:99`) | FLAG (`stale_replaysim_paths_and_dpo_shape` added before §4.3) |
|
| 79 |
+
| Recipe path `composer_replication/replaysim/recipes/dpo_normalize.yaml` (line 344) | actual is `composer_replication/recipes/replaysim/default.yaml` | FLAG (covered by same comment) |
|
| 80 |
+
| `composer_replication/replaysim/ops/preference_validator.py` (line 431) | not created — no `ops/` subpackage exists | FLAG (covered by same comment) |
|
| 81 |
+
| §4.5 hook into `composer_replication/replaysim/teacher_replay.py` | actual integration path is `replay_and_normalize_trace_sync` in `composer_replication/replaysim/normalize.py:301` (no separate replaysim/teacher_replay.py — teacher_replay lives at top-level `composer_replication/teacher_replay.py`) | FLAG (covered by same comment) |
|
| 82 |
+
|
| 83 |
+
### RL_FRAMEWORKS_LANDSCAPE.md
|
| 84 |
+
|
| 85 |
+
Wave 14b parity rewrite changed the public surface for the PRIME-RL recipe.
|
| 86 |
+
Most file/symbol references in the doc are unambiguously fixable.
|
| 87 |
+
|
| 88 |
+
| Claim | Status | Action |
|
| 89 |
+
| --- | --- | --- |
|
| 90 |
+
| PRIME-RL `LossInputs` / `LossOutputs` interface, fields `trainer_logprobs, inference_logprobs, teacher_logprobs, advantages, loss_mask` | matches `composer_replication/recipes/prime_rl/composer_loss.py:17-28` | KEEP |
|
| 91 |
+
| `import_path = "composer_replication.losses.composer_three_channel_loss"` (line 318) | actual is `composer_replication.recipes.prime_rl.composer_loss:loss_fn` | FIX (applied inline) |
|
| 92 |
+
| `composer_replication/losses.py` (~120 LOC) (line 333) | actual file is `composer_replication/recipes/prime_rl/composer_loss.py`; function is `loss_fn` not `composer_three_channel_loss`; signature is `(inputs, **kwargs)` not `(li, *, hint_weight, replay_weight, replay_logits)` | FIX (filename + AUDIT comment noting signature drift; sketch retained as-is) |
|
| 93 |
+
| Recipe in `recipes/composer_v0_prime_rl.toml` with kwargs `hint_weight`, `replay_weight`, `replay_logits_path` | actual recipe is `composer_replication/recipes/prime_rl/prime_rl_config.yaml` (YAML not TOML) and kwargs are `alpha_sdpo, beta_dpo, dppo_mask_high, dppo_mask_low, adv_tau, kl_tau` | FLAG (`stale_recipe_format` HTML comment added at the recipe block) |
|
| 94 |
+
| Monarch sketch `composer_replication/orchestrator/monarch_runner.py` | not in code; treated as v0.2 sketch (and §6.2 already labels it "Monarch wrap-up sketch (v0.2)") | KEEP (clearly v0.2 forward-looking; no AUDIT needed) |
|
| 95 |
+
| Note that `composer_replication/recipes/monarch/actors.py` exists, providing the monarch actor surface | matches `composer_replication/recipes/monarch/actors.py` | KEEP |
|
| 96 |
+
|
| 97 |
+
### SELF_DISTILLATION_LANDSCAPE.md
|
| 98 |
+
|
| 99 |
+
Audit doc for ADR-007. Three losses (SimPO, TAID, Entropy-Aware OPD) were
|
| 100 |
+
adopted, but the package layout proposed in §"Recommended follow-up wiring"
|
| 101 |
+
is not what was built.
|
| 102 |
+
|
| 103 |
+
| Claim | Status | Action |
|
| 104 |
+
| --- | --- | --- |
|
| 105 |
+
| References `composer_replication/__init__.py` and existing `composer_replication.opsd.generalized_jsd_loss` | confirmed | KEEP |
|
| 106 |
+
| Audited candidate verdicts (SimPO/TAID/EA-OPD recommended) | matches what shipped under `composer_replication/distillation/` | KEEP |
|
| 107 |
+
| Proposed package layout: `composer_replication/distillation/{targets.py, losses.py}` + `composer_replication/preference/{simpo.py, dpo.py}` | actual: flat `composer_replication/distillation/{simpo.py, taid.py, entropy_aware_opd.py}` — no targets/losses split, no top-level `preference/` package | FLAG (`stale_distillation_layout` HTML comment added before the proposal block) |
|
| 108 |
+
| Function names `taid_target`, `entropy_aware_kl_loss`, `fixed_target` | actual exports: `taid_loss` + `TAIDScheduler`, `entropy_aware_opd_loss`, `simpo_loss` | FLAG (covered by same comment) |
|
| 109 |
+
| Composition rule sketch `L_distill = entropy_aware_kl_loss(target = taid_target(...), ...)` | not realised as a single composed function — actual API is per-loss with `compose_loss` mixing channels via `sdpo_wrapper`/`dpo_variant` switches | FLAG (covered by same comment) |
|
| 110 |
+
|
| 111 |
+
### TRACE_SOURCE_RECONNAISSANCE.md
|
| 112 |
+
|
| 113 |
+
Pre-spike audit feeding ADR-002 and Spike 007. Doc cites the actual
|
| 114 |
+
`TraceState`/`DPOPair` TypedDicts correctly (matches current
|
| 115 |
+
`composer_replication/teacher_replay.py:81-104`). The sketch in §6 uses
|
| 116 |
+
spike-shape names that do not match what shipped.
|
| 117 |
+
|
| 118 |
+
| Claim | Status | Action |
|
| 119 |
+
| --- | --- | --- |
|
| 120 |
+
| `TraceState` and `DPOPair` field lists in §1 | match `composer_replication/teacher_replay.py` (TypedDicts at lines 81-104) | KEEP |
|
| 121 |
+
| `spikes/005-integrated-trainer-skeleton/teacher_replay.py` exists | confirmed | KEEP |
|
| 122 |
+
| §6 sketch path `spikes/007-trace-ingester/trace_ingester.py` and class `TraceIngester` | actual spike path is `spikes/007-real-trace-ingestion/claude_code_ingester.py` and the production class is `composer_replication.ingestion.claude_code.ClaudeCodeIngester` | FLAG (`stale_ingester_paths_and_naming` HTML comment added before §6) |
|
| 123 |
+
| Direct inspection of `~/.claude/projects/` JSONL files | not testable from CI; user-machine claim | KEEP |
|
| 124 |
+
| Re-use of `TraceState` from spike-005 `teacher_replay.py` | spike still has it; production also has it | KEEP |
|
| 125 |
+
|
| 126 |
+
## Open items for Wave 17+
|
| 127 |
+
|
| 128 |
+
These are the FLAGged ambiguous claims that need orchestrator decision before
|
| 129 |
+
a confident rewrite:
|
| 130 |
+
|
| 131 |
+
1. **DILOCO_SERVERLESS_RECONNAISSANCE.md §3.4** — proposed serverless package
|
| 132 |
+
layout (`_modal_adapter.py`, `_protocol.py`, etc.) does not match shipped
|
| 133 |
+
layout (`modal.py`, `executor.py`, etc.). Decide: rewrite §3.4 to document
|
| 134 |
+
shipped layout, or keep as historical proposal. Note also: §3.5 references
|
| 135 |
+
`ModalExecutor` and `HFJobsExecutor` as `from … serverless import …` but
|
| 136 |
+
`serverless/__init__.py` only re-exports `LocalProcessExecutor`. Either
|
| 137 |
+
the public re-export should be added (code change, out of Wave 16d scope)
|
| 138 |
+
or §3.5 needs to use the longer module path.
|
| 139 |
+
|
| 140 |
+
2. **REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md §4** — Adapter sketch assumes
|
| 141 |
+
a `DPOPair` shape (`{prompt, chosen, rejected, state, meta}`) that does
|
| 142 |
+
not match the realised TypedDict (`{state_id, state_messages, chosen: str,
|
| 143 |
+
rejected: str, n_teachers_agreeing}`). The §4.3 `_to_dj`/`_from_dj`
|
| 144 |
+
functions in the sketch will not work as written. Decide: rewrite §4 to
|
| 145 |
+
match `replay_and_normalize_trace_sync` in
|
| 146 |
+
`composer_replication/replaysim/normalize.py:301`, or keep as proposal-
|
| 147 |
+
shaped historical context.
|
| 148 |
+
|
| 149 |
+
3. **RL_FRAMEWORKS_LANDSCAPE.md §6.1** — recipe sketch is `.toml` with
|
| 150 |
+
`hint_weight`/`replay_weight` kwargs; reality is `.yaml` with
|
| 151 |
+
`alpha_sdpo`/`beta_dpo`/`dppo_mask_high`/`dppo_mask_low`/`adv_tau`/`kl_tau`.
|
| 152 |
+
The `loss_fn(inputs, **kwargs)` signature also differs from the
|
| 153 |
+
`composer_three_channel_loss(li, *, hint_weight, replay_weight,
|
| 154 |
+
replay_logits)` sketch. Decide: rewrite §6.1 to match shipped recipe, or
|
| 155 |
+
keep as the original landscape proposal.
|
| 156 |
+
|
| 157 |
+
4. **SELF_DISTILLATION_LANDSCAPE.md §"Recommended follow-up wiring"** — the
|
| 158 |
+
`distillation/{targets.py, losses.py}` + `preference/{simpo.py, dpo.py}`
|
| 159 |
+
layout is not what shipped. Actual is flat
|
| 160 |
+
`composer_replication/distillation/{simpo.py, taid.py, entropy_aware_opd.py}`
|
| 161 |
+
with function names `simpo_loss`, `taid_loss` + `TAIDScheduler`,
|
| 162 |
+
`entropy_aware_opd_loss`. Decide: rewrite the wiring sketch or leave as
|
| 163 |
+
proposal-shaped record of pre-ADR thinking.
|
| 164 |
+
|
| 165 |
+
5. **TRACE_SOURCE_RECONNAISSANCE.md §6** — `TraceIngester` sketch differs from
|
| 166 |
+
shipped `ClaudeCodeIngester`. Decide: rewrite §6 to point at the realised
|
| 167 |
+
ingester (would also require updating spike path from
|
| 168 |
+
`007-trace-ingester` to `007-real-trace-ingestion`), or keep as
|
| 169 |
+
pre-spike proposal.
|
| 170 |
+
|
| 171 |
+
## What was NOT changed
|
| 172 |
+
|
| 173 |
+
- `WAVE_*_FINAL_REVIEW.md` files — explicitly out of scope for Wave 16d.
|
| 174 |
+
- Any code under `composer_replication/`, `examples/`, `tests/` — code-side
|
| 175 |
+
fixes (e.g. adding `ModalExecutor`/`HFJobsExecutor` to `serverless/__init__.py`
|
| 176 |
+
re-exports) belong to a code wave, not a doc audit.
|
| 177 |
+
- Whole-section rewrites of any doc — Wave 16d's mandate is "audit + safe
|
| 178 |
+
inline fixes only". Each FLAG above is a candidate for a future targeted
|
| 179 |
+
rewrite wave.
|
|
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# gsm8k_grpo_with_sdpo — SDPO column end-to-end on Qwen2.5-0.5B-Instruct (CPU)
|
| 2 |
+
|
| 3 |
+
This is the sibling to `examples/gsm8k_grpo/` that demonstrates the
|
| 4 |
+
**SDPO hint-distillation column** firing end-to-end on a real
|
| 5 |
+
HuggingFace causal LM, on CPU, in ~90 seconds. Where `gsm8k_grpo/run.py`
|
| 6 |
+
runs plain GRPO with `alpha_sdpo=0`, this script enables `alpha_sdpo=0.5`
|
| 7 |
+
and verifies the SDPO channel actually fires and routes gradients
|
| 8 |
+
through the model.
|
| 9 |
+
|
| 10 |
+
## Run it
|
| 11 |
+
|
| 12 |
+
```bash
|
| 13 |
+
pip install -e ".[train]"
|
| 14 |
+
python examples/gsm8k_grpo_with_sdpo/run.py
|
| 15 |
+
```
|
| 16 |
+
|
| 17 |
+
Expected wall-clock: ~60-120s on CPU (one-time HF model download on
|
| 18 |
+
first run).
|
| 19 |
+
|
| 20 |
+
## What success looks like
|
| 21 |
+
|
| 22 |
+
The script will print 5 SGD steps' worth of channel-decomposed losses
|
| 23 |
+
and end with three ✓ assertions:
|
| 24 |
+
|
| 25 |
+
```
|
| 26 |
+
step 1/5: total=5.9801 lm_ce=5.9087 sdpo_jsd=0.1429 trace_replay_dpo=0.0000 |grad|=6.45e+06
|
| 27 |
+
step 2/5: total=4.2268 lm_ce=4.1573 sdpo_jsd=0.1390 trace_replay_dpo=0.0000 |grad|=1.20e+06
|
| 28 |
+
...
|
| 29 |
+
step 5/5: total=2.4644 lm_ce=2.3962 sdpo_jsd=0.1363 trace_replay_dpo=0.0000 |grad|=1.03e+06
|
| 30 |
+
[4/4] Verifying SDPO column wiring ...
|
| 31 |
+
✓ sdpo_jsd > 0 at every step (min=0.1358, max=0.1429)
|
| 32 |
+
✓ total != lm_ce at every step (min |diff|=0.0679, max=0.0714)
|
| 33 |
+
✓ |grad| > 0 and finite at every step (min=1.01e+06, max=6.45e+06)
|
| 34 |
+
✅ SDPO column wiring verified end-to-end.
|
| 35 |
+
```
|
| 36 |
+
|
| 37 |
+
Wall-clock on the reference run: **16.5s** for 5 SGD steps after a
|
| 38 |
+
1.7s model-load phase (no model download — already cached). The SDPO
|
| 39 |
+
signal magnitude (~0.14) is meaningful here because the script uses
|
| 40 |
+
Qwen's actual ChatML markers (`<|im_start|>` / `<|im_end|>`) via
|
| 41 |
+
`tokenizer.apply_chat_template` — not raw marker strings, which would
|
| 42 |
+
be tokenized as 11 punctuation tokens and the model would see nonsense.
|
| 43 |
+
|
| 44 |
+
If `sdpo_jsd` ever shows up as `0.0000`, the SDPO column is silent —
|
| 45 |
+
that means either (a) `alpha_sdpo=0`, (b) `ctx_teacher_input_ids` is
|
| 46 |
+
missing from the input batch, or (c) the data collator is producing
|
| 47 |
+
empty teacher contexts.
|
| 48 |
+
|
| 49 |
+
## Why this is not a real training run
|
| 50 |
+
|
| 51 |
+
The hint contexts here are hand-crafted — every example gets the same
|
| 52 |
+
generic "remember to verify your arithmetic" hint, and the response
|
| 53 |
+
mask is fabricated to mark the back half of the sequence. This is a
|
| 54 |
+
**code-path smoke test**, not a recipe. Real SDPO training uses a
|
| 55 |
+
`ComposerDataCollator` (see
|
| 56 |
+
`composer_replication.trainer.data_collator`) that generates per-step
|
| 57 |
+
hints from the actual error sites in your trace data.
|
| 58 |
+
|
| 59 |
+
## Cross-references
|
| 60 |
+
|
| 61 |
+
- [`composer_replication.compose_loss`](../../composer_replication/loss.py) — the loss-composition entrypoint
|
| 62 |
+
- [`docs/COMPOSER_RECIPE_MAPPING.md`](../../docs/COMPOSER_RECIPE_MAPPING.md) — how SDPO maps to Cursor's Composer-2.5 hint-distillation
|
| 63 |
+
- [`docs/adrs/ADR-002-channel2-sdpo.md`](../../docs/adrs/ADR-002-channel2-sdpo.md) — SDPO design decision
|
| 64 |
+
- [`examples/gsm8k_grpo/run.py`](../gsm8k_grpo/run.py) — plain GRPO sibling (alpha_sdpo=0)
|
| 65 |
+
|
| 66 |
+
## CPU vs GPU
|
| 67 |
+
|
| 68 |
+
This example is intentionally CPU-only and small (B=2, T=32, 5 steps)
|
| 69 |
+
so it exercises the SDPO code path without needing a GPU. For real
|
| 70 |
+
training on Qwen2.5 at scale, switch to
|
| 71 |
+
`ComposerReplicationTrainer` (TRL-backed) and a real GPU; see
|
| 72 |
+
[`docs/USER_GUIDE.md`](../../docs/USER_GUIDE.md) §8 Recipe A.
|
|
@@ -0,0 +1,288 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""GRPO + SDPO column wiring on Qwen2.5-0.5B-Instruct (CPU end-to-end).
|
| 2 |
+
|
| 3 |
+
This is the sibling to `examples/gsm8k_grpo/` that demonstrates the
|
| 4 |
+
**SDPO hint-distillation column** firing end-to-end on a real
|
| 5 |
+
HuggingFace model, on CPU, without needing TRL's full GRPO training
|
| 6 |
+
loop. Where `gsm8k_grpo/run.py` runs plain GRPO with `alpha_sdpo=0`,
|
| 7 |
+
this script loads the same model and shows that:
|
| 8 |
+
|
| 9 |
+
1. `compose_loss(model, inputs, alpha_sdpo=0.5, ...)` produces a
|
| 10 |
+
non-zero `sdpo_jsd` channel on a real HF causal LM, and
|
| 11 |
+
2. backward through that channel reaches model parameters with
|
| 12 |
+
finite gradients, and
|
| 13 |
+
3. running 5 SGD steps with the SDPO column enabled reduces the
|
| 14 |
+
channel-decomposed total loss.
|
| 15 |
+
|
| 16 |
+
This is the smallest possible "real-model" SDPO-wiring proof — the
|
| 17 |
+
hand-crafted hint contexts here are not realistic training data, they
|
| 18 |
+
just exercise the SDPO code path. For production SDPO, use
|
| 19 |
+
`ComposerReplicationTrainer` with a `ComposerDataCollator` that emits
|
| 20 |
+
`ctx_teacher_input_ids` / `sdpo_loss_mask` columns from your real
|
| 21 |
+
trace data (see `composer_replication.trainer.data_collator`).
|
| 22 |
+
|
| 23 |
+
Usage:
|
| 24 |
+
pip install -e ".[train]"
|
| 25 |
+
python examples/gsm8k_grpo_with_sdpo/run.py
|
| 26 |
+
|
| 27 |
+
Cross-references:
|
| 28 |
+
- `composer_replication.compose_loss` — the loss-composition entrypoint
|
| 29 |
+
- `docs/COMPOSER_RECIPE_MAPPING.md` — how SDPO maps to Cursor's
|
| 30 |
+
Composer-2.5 hint-distillation
|
| 31 |
+
- `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design
|
| 32 |
+
- `examples/gsm8k_grpo/run.py` — plain GRPO (no SDPO) sibling
|
| 33 |
+
"""
|
| 34 |
+
from __future__ import annotations
|
| 35 |
+
|
| 36 |
+
import logging
|
| 37 |
+
import random
|
| 38 |
+
import sys
|
| 39 |
+
import time
|
| 40 |
+
from pathlib import Path
|
| 41 |
+
|
| 42 |
+
import torch
|
| 43 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
| 44 |
+
|
| 45 |
+
from composer_replication import compose_loss
|
| 46 |
+
|
| 47 |
+
# ---------------------------------------------------------------------------
|
| 48 |
+
# Config
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
|
| 51 |
+
MODEL_REPO = "Qwen/Qwen2.5-0.5B-Instruct"
|
| 52 |
+
N_STEPS = 5
|
| 53 |
+
B = 2 # batch size
|
| 54 |
+
T = 32 # sequence length (small to keep CPU fast)
|
| 55 |
+
LR = 1e-5
|
| 56 |
+
ALPHA_SDPO = 0.5 # SDPO column weight; large enough to dominate the signal
|
| 57 |
+
BETA_REPLAY = 0.0 # DPO column off — this example focuses on SDPO
|
| 58 |
+
|
| 59 |
+
OUTPUT_DIR = Path(__file__).resolve().parent / "output"
|
| 60 |
+
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
|
| 61 |
+
|
| 62 |
+
|
| 63 |
+
# ---------------------------------------------------------------------------
|
| 64 |
+
# Tiny GSM8K-shaped fixture — we fabricate the chat strings so the model
|
| 65 |
+
# sees realistic prose. The "hint" is the same prompt with a
|
| 66 |
+
# "remember to verify your arithmetic" line inserted; that's what makes
|
| 67 |
+
# the teacher context differ from the student context.
|
| 68 |
+
# ---------------------------------------------------------------------------
|
| 69 |
+
|
| 70 |
+
PROBLEMS = [
|
| 71 |
+
{
|
| 72 |
+
"question": "Janet has 3 boxes with 4 apples each. How many apples total?",
|
| 73 |
+
"gold": "12",
|
| 74 |
+
},
|
| 75 |
+
{
|
| 76 |
+
"question": "A train travels 60 miles in 2 hours. What's its average speed?",
|
| 77 |
+
"gold": "30",
|
| 78 |
+
},
|
| 79 |
+
]
|
| 80 |
+
|
| 81 |
+
SYS = "You are a math tutor. End with `#### N` where N is the answer."
|
| 82 |
+
HINT = "Hint: re-check your arithmetic before giving the final answer."
|
| 83 |
+
|
| 84 |
+
|
| 85 |
+
def _build_chat_messages(question: str, *, with_hint: bool) -> list[dict]:
|
| 86 |
+
"""Format a single example as chat messages. with_hint=True is the
|
| 87 |
+
teacher context (hint inserted as an extra system turn). Returns the
|
| 88 |
+
OpenAI-style messages list, ready for tokenizer.apply_chat_template.
|
| 89 |
+
|
| 90 |
+
Verified 2026-05-26: Qwen2.5 uses ChatML markers (`<|im_start|>` /
|
| 91 |
+
`<|im_end|>`), NOT `<|system|>` / `<|end|>`. Using
|
| 92 |
+
`apply_chat_template` is the only safe way to format input — raw
|
| 93 |
+
marker strings get tokenized as 11 punctuation tokens and the model
|
| 94 |
+
sees nonsense.
|
| 95 |
+
"""
|
| 96 |
+
messages = [{"role": "system", "content": SYS}]
|
| 97 |
+
if with_hint:
|
| 98 |
+
# Two system turns: Qwen's chat template will format both with
|
| 99 |
+
# <|im_start|>system / <|im_end|> markers.
|
| 100 |
+
messages.append({"role": "system", "content": HINT})
|
| 101 |
+
messages.append({"role": "user", "content": question})
|
| 102 |
+
return messages
|
| 103 |
+
|
| 104 |
+
|
| 105 |
+
def build_inputs(tokenizer) -> dict[str, torch.Tensor]:
|
| 106 |
+
"""Tokenize PROBLEMS into a compose_loss-shaped batch.
|
| 107 |
+
|
| 108 |
+
Returns a dict with:
|
| 109 |
+
- input_ids: (B, T) student rollouts (no hint)
|
| 110 |
+
- response_mask: (B, T)
|
| 111 |
+
- ctx_teacher_input_ids: (B, T) hint-conditioned context
|
| 112 |
+
- sdpo_loss_mask: (B, T) 1 at assistant-response tokens
|
| 113 |
+
"""
|
| 114 |
+
student_msg_lists = [_build_chat_messages(p["question"], with_hint=False) for p in PROBLEMS[:B]]
|
| 115 |
+
teacher_msg_lists = [_build_chat_messages(p["question"], with_hint=True) for p in PROBLEMS[:B]]
|
| 116 |
+
|
| 117 |
+
# Render via Qwen's chat template — produces real <|im_start|>/<|im_end|> tokens.
|
| 118 |
+
student_strs = [
|
| 119 |
+
tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
|
| 120 |
+
for m in student_msg_lists
|
| 121 |
+
]
|
| 122 |
+
teacher_strs = [
|
| 123 |
+
tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
|
| 124 |
+
for m in teacher_msg_lists
|
| 125 |
+
]
|
| 126 |
+
|
| 127 |
+
s_tok = tokenizer(
|
| 128 |
+
student_strs,
|
| 129 |
+
max_length=T,
|
| 130 |
+
truncation=True,
|
| 131 |
+
padding="max_length",
|
| 132 |
+
return_tensors="pt",
|
| 133 |
+
)
|
| 134 |
+
t_tok = tokenizer(
|
| 135 |
+
teacher_strs,
|
| 136 |
+
max_length=T,
|
| 137 |
+
truncation=True,
|
| 138 |
+
padding="max_length",
|
| 139 |
+
return_tensors="pt",
|
| 140 |
+
)
|
| 141 |
+
|
| 142 |
+
# Mark the second half of each sequence as the "response" — purely
|
| 143 |
+
# synthetic for this smoke; in real training the response_mask comes
|
| 144 |
+
# from the rollout pipeline.
|
| 145 |
+
response_mask = torch.zeros(B, T, dtype=torch.long)
|
| 146 |
+
response_mask[:, T // 2:] = 1
|
| 147 |
+
sdpo_loss_mask = response_mask.clone()
|
| 148 |
+
|
| 149 |
+
return {
|
| 150 |
+
"input_ids": s_tok["input_ids"],
|
| 151 |
+
"response_mask": response_mask,
|
| 152 |
+
"ctx_teacher_input_ids": t_tok["input_ids"],
|
| 153 |
+
"sdpo_loss_mask": sdpo_loss_mask,
|
| 154 |
+
}
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
# ---------------------------------------------------------------------------
|
| 158 |
+
# Main
|
| 159 |
+
# ---------------------------------------------------------------------------
|
| 160 |
+
|
| 161 |
+
|
| 162 |
+
def main() -> int:
|
| 163 |
+
random.seed(42)
|
| 164 |
+
torch.manual_seed(42)
|
| 165 |
+
|
| 166 |
+
log_path = OUTPUT_DIR.parent / "run.log"
|
| 167 |
+
logging.basicConfig(
|
| 168 |
+
level=logging.INFO,
|
| 169 |
+
format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
|
| 170 |
+
handlers=[
|
| 171 |
+
logging.StreamHandler(sys.stdout),
|
| 172 |
+
logging.FileHandler(log_path, mode="w"),
|
| 173 |
+
],
|
| 174 |
+
)
|
| 175 |
+
log = logging.getLogger("gsm8k_grpo_with_sdpo")
|
| 176 |
+
|
| 177 |
+
log.info("=" * 64)
|
| 178 |
+
log.info("GRPO + SDPO + GSM8K + Qwen2.5-0.5B-Instruct (CPU)")
|
| 179 |
+
log.info("=" * 64)
|
| 180 |
+
|
| 181 |
+
log.info("[1/4] Loading model + tokenizer ...")
|
| 182 |
+
t0 = time.time()
|
| 183 |
+
tokenizer = AutoTokenizer.from_pretrained(MODEL_REPO)
|
| 184 |
+
if tokenizer.pad_token_id is None:
|
| 185 |
+
tokenizer.pad_token = tokenizer.eos_token
|
| 186 |
+
model = AutoModelForCausalLM.from_pretrained(MODEL_REPO, torch_dtype=torch.float32)
|
| 187 |
+
model.to("cpu")
|
| 188 |
+
n_params = sum(p.numel() for p in model.parameters())
|
| 189 |
+
log.info(" loaded in %.1fs (%.3fB params)", time.time() - t0, n_params / 1e9)
|
| 190 |
+
|
| 191 |
+
log.info("[2/4] Building hint-conditioned batch (B=%d, T=%d) ...", B, T)
|
| 192 |
+
inputs = build_inputs(tokenizer)
|
| 193 |
+
for k, v in inputs.items():
|
| 194 |
+
log.info(" %s: shape=%s, dtype=%s", k, tuple(v.shape), v.dtype)
|
| 195 |
+
|
| 196 |
+
log.info("[3/4] Running %d SGD steps with alpha_sdpo=%.2f ...", N_STEPS, ALPHA_SDPO)
|
| 197 |
+
optim = torch.optim.SGD(model.parameters(), lr=LR)
|
| 198 |
+
history: list[dict[str, float]] = []
|
| 199 |
+
|
| 200 |
+
model.train()
|
| 201 |
+
t0 = time.time()
|
| 202 |
+
for step in range(N_STEPS):
|
| 203 |
+
optim.zero_grad()
|
| 204 |
+
out = compose_loss(
|
| 205 |
+
model,
|
| 206 |
+
inputs,
|
| 207 |
+
alpha_sdpo=ALPHA_SDPO,
|
| 208 |
+
beta_replay=BETA_REPLAY,
|
| 209 |
+
)
|
| 210 |
+
out.total.backward()
|
| 211 |
+
|
| 212 |
+
# Sanity: gradients are finite + non-zero
|
| 213 |
+
gnorm = sum(
|
| 214 |
+
p.grad.abs().sum().item()
|
| 215 |
+
for p in model.parameters()
|
| 216 |
+
if p.grad is not None
|
| 217 |
+
)
|
| 218 |
+
optim.step()
|
| 219 |
+
|
| 220 |
+
components = out.detached()
|
| 221 |
+
components["grad_norm"] = gnorm
|
| 222 |
+
history.append(components)
|
| 223 |
+
log.info(
|
| 224 |
+
" step %d/%d: total=%.4f lm_ce=%.4f sdpo_jsd=%.4f trace_replay_dpo=%.4f |grad|=%.2e",
|
| 225 |
+
step + 1, N_STEPS,
|
| 226 |
+
components["total"], components["lm_ce"],
|
| 227 |
+
components["sdpo_jsd"], components["trace_replay_dpo"],
|
| 228 |
+
gnorm,
|
| 229 |
+
)
|
| 230 |
+
dt = time.time() - t0
|
| 231 |
+
log.info("Training complete in %.1fs (avg %.1fs/step)", dt, dt / N_STEPS)
|
| 232 |
+
|
| 233 |
+
# ------------------------------------------------------------------
|
| 234 |
+
# Acceptance assertions — SDPO column actually fired
|
| 235 |
+
# ------------------------------------------------------------------
|
| 236 |
+
log.info("[4/4] Verifying SDPO column wiring ...")
|
| 237 |
+
|
| 238 |
+
# 1. SDPO channel was non-zero at every step (channel actually fired)
|
| 239 |
+
sdpo_values = [h["sdpo_jsd"] for h in history]
|
| 240 |
+
assert all(s > 0.0 for s in sdpo_values), (
|
| 241 |
+
f"SDPO column is identically zero — channel did not fire. "
|
| 242 |
+
f"sdpo_jsd values: {sdpo_values}"
|
| 243 |
+
)
|
| 244 |
+
log.info(" ✓ sdpo_jsd > 0 at every step (min=%.4f, max=%.4f)",
|
| 245 |
+
min(sdpo_values), max(sdpo_values))
|
| 246 |
+
|
| 247 |
+
# 2. total != lm_ce at every step (SDPO actually contributed to total)
|
| 248 |
+
diffs = [abs(h["total"] - h["lm_ce"]) for h in history]
|
| 249 |
+
assert all(d > 1e-6 for d in diffs), (
|
| 250 |
+
f"total ≈ lm_ce at every step — SDPO contribution is negligible. "
|
| 251 |
+
f"abs(total - lm_ce): {diffs}"
|
| 252 |
+
)
|
| 253 |
+
log.info(" ✓ total != lm_ce at every step (min |diff|=%.4f, max=%.4f)",
|
| 254 |
+
min(diffs), max(diffs))
|
| 255 |
+
|
| 256 |
+
# 3. Gradients were finite + non-zero throughout
|
| 257 |
+
gnorms = [h["grad_norm"] for h in history]
|
| 258 |
+
assert all(g > 0.0 for g in gnorms), (
|
| 259 |
+
f"Some steps had zero gradient norm: {gnorms}"
|
| 260 |
+
)
|
| 261 |
+
import math
|
| 262 |
+
assert all(math.isfinite(g) for g in gnorms), (
|
| 263 |
+
f"Some steps had non-finite gradient norm: {gnorms}"
|
| 264 |
+
)
|
| 265 |
+
log.info(" ✓ |grad| > 0 and finite at every step (min=%.2e, max=%.2e)",
|
| 266 |
+
min(gnorms), max(gnorms))
|
| 267 |
+
|
| 268 |
+
# ------------------------------------------------------------------
|
| 269 |
+
# Summary
|
| 270 |
+
# ------------------------------------------------------------------
|
| 271 |
+
log.info("=" * 64)
|
| 272 |
+
log.info("Summary")
|
| 273 |
+
log.info("=" * 64)
|
| 274 |
+
log.info(" steps: %d", N_STEPS)
|
| 275 |
+
log.info(" alpha_sdpo: %.2f", ALPHA_SDPO)
|
| 276 |
+
log.info(" beta_replay: %.2f", BETA_REPLAY)
|
| 277 |
+
log.info(" model params: %.3fB", n_params / 1e9)
|
| 278 |
+
log.info(" total step 1: %.4f", history[0]["total"])
|
| 279 |
+
log.info(" total step %d: %.4f", N_STEPS, history[-1]["total"])
|
| 280 |
+
log.info(" wall-clock: %.1fs", dt)
|
| 281 |
+
log.info(" log file: %s", log_path)
|
| 282 |
+
log.info("=" * 64)
|
| 283 |
+
log.info("✅ SDPO column wiring verified end-to-end.")
|
| 284 |
+
return 0
|
| 285 |
+
|
| 286 |
+
|
| 287 |
+
if __name__ == "__main__":
|
| 288 |
+
sys.exit(main())
|
|
@@ -30,7 +30,6 @@ keywords = [
|
|
| 30 |
"prime-rl",
|
| 31 |
"openenv",
|
| 32 |
"torchft",
|
| 33 |
-
"monarch",
|
| 34 |
"modal",
|
| 35 |
"huggingface-jobs",
|
| 36 |
]
|
|
@@ -64,8 +63,16 @@ serverless = [
|
|
| 64 |
"huggingface_hub>=0.27", # for hf:// fsspec backend + HF Jobs
|
| 65 |
]
|
| 66 |
# Replaysim dataset normalization (per ADR-004)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
replaysim = [
|
| 68 |
-
"data-juicer>=1.0",
|
| 69 |
"composer-replication[replay]", # replaysim builds on the replay channel
|
| 70 |
]
|
| 71 |
# Production training (TRL GRPOTrainer subclass — Recipe A)
|
|
@@ -76,13 +83,24 @@ train = [
|
|
| 76 |
"datasets>=3.0",
|
| 77 |
]
|
| 78 |
# PRIME-RL recipe (Recipe C — per ADR-006)
|
| 79 |
-
prime-rl
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
#
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
# Everything for development
|
| 87 |
dev = [
|
| 88 |
"pytest>=8.0",
|
|
|
|
| 30 |
"prime-rl",
|
| 31 |
"openenv",
|
| 32 |
"torchft",
|
|
|
|
| 33 |
"modal",
|
| 34 |
"huggingface-jobs",
|
| 35 |
]
|
|
|
|
| 63 |
"huggingface_hub>=0.27", # for hf:// fsspec backend + HF Jobs
|
| 64 |
]
|
| 65 |
# Replaysim dataset normalization (per ADR-004)
|
| 66 |
+
#
|
| 67 |
+
# NOTE: data-juicer is intentionally NOT pinned as an extra. The package
|
| 68 |
+
# named "data-juicer" does not exist on PyPI (the closest match,
|
| 69 |
+
# "py-data-juicer==1.0.0", has broken transitive deps; later py-data-juicer
|
| 70 |
+
# releases work but install ~150 transitive packages). Users who want the
|
| 71 |
+
# DJNormalizer adapter should install data-juicer from source themselves —
|
| 72 |
+
# see docs/TROUBLESHOOTING.md ("monarch / data-juicer install"). The
|
| 73 |
+
# replaysim Python module imports data_juicer lazily, so the framework
|
| 74 |
+
# package imports cleanly without it; only DJNormalizer use-time fails.
|
| 75 |
replaysim = [
|
|
|
|
| 76 |
"composer-replication[replay]", # replaysim builds on the replay channel
|
| 77 |
]
|
| 78 |
# Production training (TRL GRPOTrainer subclass — Recipe A)
|
|
|
|
| 83 |
"datasets>=3.0",
|
| 84 |
]
|
| 85 |
# PRIME-RL recipe (Recipe C — per ADR-006)
|
| 86 |
+
# NOTE: a `prime-rl` extra used to be advertised here pinning
|
| 87 |
+
# `prime-rl>=0.5`. That pin is unsatisfiable: the `prime-rl` PyPI name is
|
| 88 |
+
# not registered. Prime Intellect publishes prime-rl from source only
|
| 89 |
+
# (https://github.com/PrimeIntellect-ai/prime-rl). The framework's
|
| 90 |
+
# composer_replication.recipes.prime_rl adapter handles its absence
|
| 91 |
+
# gracefully (the upstream parity test is skip-marked when prime-rl is
|
| 92 |
+
# not importable) and the in-file shadow-parity test still verifies the
|
| 93 |
+
# loss formula independently. The extra is dropped — see
|
| 94 |
+
# docs/TROUBLESHOOTING.md ("prime-rl install") for installation guidance.
|
| 95 |
+
# NOTE: a `monarch` extra used to be advertised here pinning
|
| 96 |
+
# `monarch>=0.4.1`. That pin is unsatisfiable: PyPI's `monarch` package
|
| 97 |
+
# is unrelated to Meta's actor framework and tops out at 0.1.11. The real
|
| 98 |
+
# Meta Monarch is published as `torchmonarch-nightly` and ships only as
|
| 99 |
+
# nightly wheels with platform constraints. Per ADR-006, full Monarch
|
| 100 |
+
# integration is a v0.2+ bet and the `composer_replication.recipes.monarch`
|
| 101 |
+
# module is a documentation skeleton (importing it does NOT require
|
| 102 |
+
# monarch installed). The extra is dropped — see docs/TROUBLESHOOTING.md
|
| 103 |
+
# ("monarch / data-juicer install") for installation guidance.
|
| 104 |
# Everything for development
|
| 105 |
dev = [
|
| 106 |
"pytest>=8.0",
|