Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
File size: 39,442 Bytes
d9dd3a5 e5add15 d9dd3a5 c0a5ab7 d9dd3a5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 774 775 776 777 778 779 780 781 782 783 784 785 786 787 788 789 790 791 792 793 794 795 796 797 798 799 800 801 802 803 804 805 806 807 808 809 810 811 812 813 814 815 816 817 818 819 820 821 822 823 824 825 826 827 828 829 830 831 832 833 834 835 836 837 838 839 840 841 842 843 844 845 846 847 848 849 850 851 852 853 854 855 856 857 858 859 860 861 862 863 864 865 866 867 868 869 870 871 872 873 874 875 876 877 878 879 880 881 882 883 884 885 886 887 888 889 890 891 892 893 894 895 896 897 898 899 | # TROUBLESHOOTING — Wave 14
This document catalogs every Wave-14-known failure mode in the Composer
Replication Framework, along with how to diagnose, fix, and verify each
one. It is intentionally surgical: the surface area added in Waves 12–14
(SimPO/TAID/Entropy-OPD distillation kwargs, the PRIME-RL composer-loss
adapter, the serverless DiLoCo `MockManager` + `ObjectStoreAllReduce`
path, and the data-juicer-backed replaysim normalizer) introduced new
ways for users to trip themselves up. Each failure mode here is something
a maintainer has actually seen or anticipated during the cross-model
review of Wave 14.
If you hit something not covered below, jump to the
[How to file a bug report](#how-to-file-a-bug-report) section at the end —
the template there gives a maintainer everything they need to reproduce.
---
## Common things to check first
Before reading any further, run through this checklist. ~80% of "framework
broken" reports turn out to be one of these:
1. **Python version.** The framework targets Python 3.10–3.12. The
`pyproject.toml` `target-version` is `py310`. If you are on 3.13+,
transitive deps (notably Ray, pulled in by data-juicer) may not yet
ship wheels and will try to build from source. Run `python --version`.
2. **Fresh virtual environment.** Mixing the framework into an existing
environment that already has `torch`, `transformers`, `trl`, or
`torchft` pinned to incompatible versions is the #1 source of import-
time errors. Create a new venv: `python -m venv .venv && source
.venv/bin/activate && pip install -e .[dev]`.
3. **Editable install.** Most contributors run `pip install -e .` so
that local edits to `composer_replication/` are picked up. If you
`pip install composer-replication` from a registry instead, your
edits to the source tree will be ignored. Confirm with
`pip show composer-replication | grep Location`.
4. **Optional extras.** Several modules are optional-dep gated:
- `[replay]` — adds `httpx` (used for OpenRouter teacher calls).
- `[train]` — adds TRL, peft, accelerate, datasets (production GRPO).
- `[replaysim]` — adds `data-juicer` (and via it, Ray as a transitive).
- `[serverless]` — adds `fsspec`. For non-local rendezvous URIs you
also need a backend-specific fsspec adapter (see Failure Mode 5).
- `[dev]` — adds `pytest`, `ruff`, etc.
If you see `ModuleNotFoundError: No module named 'data_juicer'`, you
forgot the extra. Install with `pip install -e .[replaysim]`.
5. **Run the test suite first.** Before debugging anything, run the
subset of tests touching the area you care about:
```
pytest composer_replication/tests/ # core compose_loss
pytest composer_replication/distillation/tests/ # SimPO / TAID / OPD
pytest composer_replication/recipes/prime_rl/tests/ # PRIME-RL adapter
pytest composer_replication/diloco/serverless/tests/ # MockManager + DiLoCo
pytest composer_replication/replaysim/tests/ # data-juicer normalizer
```
If any green test fails for you locally, the problem is environmental
— fix that before digging into your own code.
6. **Read the docstring of the symbol you're calling.** Wave 14
docstrings are written to be the first line of documentation. The
`compose_loss` docstring (`composer_replication/loss.py`) lists every
required and optional input key. The `MockManager` docstring
enumerates the torchft surface methods it implements.
---
## Failure modes
### 1. `pip install -e .[replaysim]` hangs or fails on Python 3.12 with a Ray-related path error
**SYMPTOM.** Installing the `[replaysim]` extra (which pulls
`data-juicer`) triggers a transitive install of Ray. On Python 3.12, the
first `import ray` (often during `pip` build hooks or the first time
data-juicer is loaded) fails with messages mentioning
`/tmp/ray/session_*` paths, missing `pyarrow` symbols, or `OSError:
[Errno 2] No such file or directory: '/dev/shm/ray-...'` inside Docker.
**DIAGNOSIS.** `data-juicer` declares `ray` as a transitive dependency.
On Python 3.12 the wheel matrix is incomplete for some Ray versions, and
Ray's first-import probes `/dev/shm` and `/tmp/ray` for its session
state. In a sandboxed container, restricted CI runner, or WSL
environment with a non-default `/tmp`, those probes fail. Wave 14
subagent T2 hit this in CI and worked around it by pinning Ray and by
making sure `/tmp` exists and is writable.
**FIX.**
- Prefer Python 3.11 if you're on 3.12+ and don't need 3.12 features.
- If you must stay on 3.12, ensure `/tmp` is writable and pre-create the
session directory: `mkdir -p /tmp/ray && chmod 1777 /tmp/ray`.
- In Docker, mount a real tmpfs at `/dev/shm`:
`docker run --shm-size=2g …`.
- If you don't need replaysim normalization, you can skip the extra
entirely. The `DJNormalizer(skip_dj=True)` passthrough (see
`composer_replication/replaysim/normalize.py:165`) does not import
`data_juicer` and therefore does not import Ray.
**VERIFICATION.** The skip-dj passthrough is exercised by
`test_dj_normalizer_skip_dj_passthrough` and
`test_dj_normalizer_skip_dj_preserves_count` in
`composer_replication/replaysim/tests/test_replaysim.py`. Both run
without `data_juicer` installed:
```
pytest composer_replication/replaysim/tests/test_replaysim.py::test_dj_normalizer_skip_dj_passthrough -xvs
```
If that passes in your environment, your `[replaysim]`-less install is
healthy — only the full data-juicer code path requires Ray.
---
### 2. `compose_loss` produces wrong-looking numbers when combining new kwargs
**SYMPTOM.** You pass several Wave-14 distillation kwargs to
`compose_loss` (e.g. `dpo_variant="simpo"`, `sdpo_wrapper="taid"`,
`taid_schedule_step=0`, `simpo_beta=2.0`, `entropy_opd_h_max=…`), and
the loss curve looks wrong: NaNs, identically-zero `sdpo_jsd` channel,
or a `total` that is bit-different from your reference run with no
distillation kwargs at all.
**DIAGNOSIS.** `compose_loss` now has 13 keyword arguments and the
contract between them is non-trivial. Subagent T1's review identified
three combinations that look reasonable but are unsupported:
- Passing `taid_schedule_step` without `taid_total_steps` (or vice
versa). The function raises `ValueError` clearly, but the message can
scroll past in noisy logs.
- Passing `dpo_variant="simpo"` while still supplying
`dpo_chosen_ref_logprobs`. Those keys are **silently ignored** —
SimPO is reference-free.
- Passing `sdpo_wrapper="taid"` without supplying either
`student_init_logits` OR `student_init_input_ids` in `inputs`. The
function will fall back to a forward pass through the (possibly
drifted) live model, which is a footgun late in training (see Failure
Mode 8).
**FIX.** Read the docstring at the top of
`composer_replication/loss.py` (lines 25–39 list the three pluggable
losses and their preconditions). The general rule:
```python
from composer_replication import compose_loss
# Defaults (no distillation knobs) reproduce legacy 3-channel composition bit-exact.
out = compose_loss(model, inputs)
# To opt into SimPO, pass dpo_variant ONLY. Do not pass ref-logprob keys.
out = compose_loss(model, inputs, dpo_variant="simpo",
simpo_beta=2.0, simpo_gamma=1.0)
# To opt into TAID, pass BOTH schedule_step AND total_steps, AND make sure
# inputs["student_init_logits"] is populated (see Failure Mode 8).
out = compose_loss(model, inputs, sdpo_wrapper="taid",
taid_schedule_step=step, taid_total_steps=total_steps)
```
Setting all 13 kwargs to their defaults is **bit-exact equivalent** to
the pre-Wave-13 3-channel loss; if your defaults call gives different
numbers than your old code, file a bug.
**VERIFICATION.** The bit-exact equivalence and every supported
combination is locked in by the 11 integration tests in
`composer_replication/tests/test_compose_loss_integration.py`. The most
important ones:
- `test_defaults_bit_exact_with_legacy_kwargs` — passing the new kwargs
at their defaults is identical to legacy.
- `test_simpo_does_not_require_ref_logprobs` — SimPO works with the
ref-logprob keys absent from `inputs`.
- `test_taid_alpha_one_recovers_sdpo` — TAID with `alpha_min=alpha_max=1`
reproduces standard SDPO.
- `test_taid_requires_schedule_step` / `test_taid_requires_total_steps` —
the partial-config error path.
```
pytest composer_replication/tests/test_compose_loss_integration.py -xvs
```
---
### 3. `MockManager` works today but silently breaks after a torchft upgrade
**SYMPTOM.** Your serverless DiLoCo run starts, the first outer round
completes, and then `torchft.DiLoCo` raises an `AttributeError` on
something like `_use_async_quorum`, `should_commit`, or
`current_step` — or worse, it silently uses the wrong sync semantics.
**DIAGNOSIS.** `MockManager` is a duck-typed shim that mirrors
`torchft.Manager` rather than subclassing it. The surface it implements
is enumerated in the docstring at
`composer_replication/diloco/serverless/allreduce.py:215`:
> Methods/attributes DiLoCo touches: `allreduce`, `should_commit`,
> `start_quorum`, `current_step`, `disallow_state_dict_read`,
> `allow_state_dict_read`, `register_state_dict_fn`, `_use_async_quorum`
> (attribute), `num_participants`, `rank`.
The two **private** members in that list — `_use_async_quorum` and the
internal `current_step` counter — are private torchft API that may be
renamed without notice in any torchft minor release. Wave 14 subagent
T3 specifically called this out: "If torchft renames `_use_async_quorum`
to anything else, MockManager silently breaks because there is nothing
holding the contract beyond a string."
**FIX.**
- **Pin torchft.** In `pyproject.toml` keep your torchft version pinned
to a known-good range (e.g. `torchft>=0.2,<0.4`). When you need to
upgrade, do so deliberately and re-run the integration tests below
before merging.
- **Watch the deprecation warning.** Wave 14 sets up a clear path to
warn if `_use_async_quorum` is read on a fresh instance — see the
comment at `allreduce.py:255`.
- **Don't pass an arbitrary torchft branch.** If you've patched torchft
locally, the `MockManager` may need updating in lockstep. The
surface-compatibility tests below will catch this in CI.
**VERIFICATION.** The full DiLoCo × MockManager surface is exercised by:
- `test_mock_manager_shape_compat` in
`composer_replication/diloco/serverless/tests/test_serverless_local.py`
— sanity check that all expected methods/attributes exist.
- `test_mockmanager_has_full_diloco_call_surface` in
`composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py`
— runs an end-to-end outer round through real torchft `DiLoCo`,
hitting every method on the surface list above.
- `test_mockmanager_diloco_outer_round_completes` — full one-round
smoke ending in a successful outer SGD step.
If any of these tests turn red after a torchft bump, **do not ship**:
inspect the new torchft Manager surface and update `MockManager`
to match.
```
pytest composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py -xvs
```
---
### 4. SimPO loss curve looks like noise
**SYMPTOM.** You wired in `dpo_variant="simpo"`, the run starts, and
the `trace_replay_dpo` channel either drifts to large negative values
(→ `total` blows up) or oscillates with much higher variance than
standard DPO. The loss curve "looks like noise."
**DIAGNOSIS.** SimPO uses **average per-token log-probability**
(`Σ logπ(c_t) / |c|`), not sum log-prob. From the SimPO docstring
(`composer_replication/distillation/simpo.py:11–18`):
> SimPO drops the reference-policy term, replaces it with a target
> margin γ, and uses **average sequence log-probability instead of
> sum**. […] L_SimPO = -log σ( β · [avg_logπ(c) - avg_logπ(r)] - γ )
If you compute `chosen_logprobs.sum()` (or any unmasked aggregation) and
hand it to SimPO as `chosen_avg_logprobs`, the loss is undefined: β=2.0
times a sum-log-prob is on a totally different scale than β=2.0 times an
average. The result looks plausible per-batch but the optimum is
nowhere near the dataset's true preference signal.
**FIX.** Use the helper
`composer_replication.distillation.simpo.avg_sequence_logprob`:
```python
from composer_replication.distillation.simpo import (
simpo_loss, avg_sequence_logprob,
)
chosen_avg = avg_sequence_logprob(chosen_logprobs, chosen_response_mask)
rejected_avg = avg_sequence_logprob(rejected_logprobs, rejected_response_mask)
loss = simpo_loss(chosen_avg, rejected_avg, beta=2.0, gamma=1.0)
```
The mask is **1 on response tokens, 0 on prompt+padding** — same
convention as the rest of the framework. If you must roll your own
aggregation, divide by `response_mask.sum(dim=-1).clamp_min(1.0)`,
not by `response_mask.shape[-1]`.
**VERIFICATION.** The avg-vs-sum semantics are pinned by
`test_avg_sequence_logprob` in
`composer_replication/distillation/tests/test_distillation_losses.py`,
which constructs known per-token log-probs and asserts the helper
returns the correct per-sequence average. The end-to-end SimPO
loss-shape check is `test_simpo_loss_returns_scalar` in the same file.
```
pytest composer_replication/distillation/tests/test_distillation_losses.py::test_avg_sequence_logprob -xvs
pytest composer_replication/distillation/tests/test_distillation_losses.py::test_simpo_loss_lower_for_better_separation -xvs
```
---
### 5. `ObjectStoreAllReduce` works locally but fails on `s3://` at first allreduce
**SYMPTOM.** You construct
`ObjectStoreAllReduce(uri="s3://my-bucket/run42/", rank=0,
world_size=4)`. The constructor succeeds. The first call to
`allreduce(tensor, name="...")` raises `ImportError: Install s3fs to
access S3` or `botocore.exceptions.NoCredentialsError: Unable to locate
credentials`.
**DIAGNOSIS.** `ObjectStoreAllReduce` uses fsspec to reach the
backend, but **fsspec only ships protocol stubs, not adapters**. The
constructor doesn't know which protocol you'll use and doesn't
eagerly validate, so it accepts any URI. The `s3://` adapter requires:
1. The `s3fs` package (`pip install s3fs`), which is **not** in the
default `[serverless]` extra.
2. Working AWS credentials (env vars, `~/.aws/credentials`, IAM role,
or whatever your environment normally provides to boto3).
The same is true for `gs://` (`gcsfs`), `az://` (`adlfs`), and
`hf://` (`huggingface_hub`'s fsspec integration, which is included if
you have `huggingface_hub` installed).
**FIX.**
- Install the right adapter alongside the framework:
```
pip install s3fs # for s3://
pip install gcsfs # for gs://
pip install adlfs # for az://
```
- Verify credentials work outside the framework first:
```
python -c "import s3fs; print(s3fs.S3FileSystem().ls('my-bucket'))"
```
- If you're running on Modal/HF Jobs, set the credentials as Modal
secrets / HF Jobs env vars in the executor config — not in your
local shell.
The constructor could in principle perform an eager probe (e.g. a
`HEAD` on the rendezvous prefix) to fail fast at init time. Wave 14
deliberately did not add this because it adds a network round-trip on
every replica startup. If you want pre-flight validation in your
training script, call `fsspec.filesystem(protocol).ls(uri)` yourself
before constructing the manager.
**VERIFICATION.** The `file://` and bare-path code paths — the only
ones that don't need an extra adapter — are exercised by:
- `test_object_store_allreduce_local_paths_create_dir`
- `test_object_store_allreduce_world_size_1_passthrough`
- `test_object_store_allreduce_round_id_increments`
…all in
`composer_replication/diloco/serverless/tests/test_serverless_local.py`.
If those pass and your `s3://` URI fails, the framework is fine and
your fsspec adapter or credentials are the problem.
```
pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs
```
---
### 6. Custom replaysim recipe drops every record (or crashes data-juicer)
**SYMPTOM.** You wrote a custom replaysim YAML recipe modeled on
`composer_replication/recipes/replaysim/default.yaml`. It loads
without error, but every input DPO pair is dropped, OR data-juicer
raises `KeyError: 'text_key'`, OR it raises a complaint about
"expected str, got list" inside one of the filters.
**DIAGNOSIS.** Wave 14 fixed two related bugs in the *default* recipe
that custom-recipe authors will hit again. Both are documented in the
header comment at
`composer_replication/recipes/replaysim/default.yaml:21–35`:
1. **`text_keys` plural vs `text_key` singular.** The top-level
dataset contract uses `text_keys: chosen` (plural). Each individual
op uses `text_key: chosen` (singular). They are not interchangeable.
data-juicer's dataset loader validates that the `text_keys` field
exists on every record before any op runs; an op that uses
`text_keys` instead of `text_key` is silently misconfigured.
2. **`chosen` / `rejected` as strings vs as list-of-dicts.**
data-juicer ops like `text_length_filter`, `words_num_filter`,
`special_characters_filter`, and `document_deduplicator` read a
single string field. Pointing them at the chat-messages list
(`chosen_messages`, `rejected_messages`) crashes or silently
no-ops. The framework's `_dpo_pair_to_dj_record` keeps **both**
shapes side-by-side: `chosen`/`rejected` (strings) for filter ops,
and `chosen_messages`/`rejected_messages` (chat-messages list) for
chat-aware ops + the `NormalizedDPOPair` round-trip.
**FIX.** Treat the default recipe as your starting template. Concretely:
- Always declare `text_keys: chosen` at the top.
- For every length/word/special-char op you add, duplicate it: once
with `text_key: chosen`, once with `text_key: rejected`. (Each op
takes only one `text_key` — see comment at lines 31–35 of
`default.yaml`.)
- Never point a filter op at `chosen_messages` or `rejected_messages`.
Those are list-of-dicts; only chat-aware ops accept that shape.
**VERIFICATION.** The two-shape contract is locked in by:
- `test_record_chosen_rejected_are_flat_strings_for_dj_text_ops` —
asserts `chosen` and `rejected` are bare strings on every record
produced by `_dpo_pair_to_dj_record`.
- `test_record_chosen_rejected_messages_carry_chat_shape` — asserts
`chosen_messages` / `rejected_messages` exist as list-of-dicts.
- `test_dj_normalizer_e2e_default_recipe(tmp_path)` — runs the actual
default recipe through real data-juicer end-to-end (skipped if
`data_juicer` isn't importable).
…all in
`composer_replication/replaysim/tests/test_replaysim.py`. If those
pass and your custom recipe still drops everything, diff your YAML
against `default.yaml` until the two shapes align.
```
pytest composer_replication/replaysim/tests/test_replaysim.py -xvs
```
---
### 7. `ValueError: expected (seq,) shape, got (B, T)` from PRIME-RL composer_loss
**SYMPTOM.** You wired the PRIME-RL recipe into a training loop you
adapted from another framework (TRL, openrlhf, etc.), and on the very
first `loss_fn` call you get a `ValueError` mentioning shape
`(seq,)` versus `(B, T)`.
**DIAGNOSIS.** PRIME-RL calls its loss function **one sample at a
time**, with 1-D `(seq,)` tensors — not batched `(B, T)` tensors. The
recipe's docstring spells this out at
`composer_replication/recipes/prime_rl/composer_loss.py:16–30`:
> Note the **per-sample (seq,) shape** — PRIME-RL's runner calls the
> loss function one sample at a time, not on a batched (B, T) tensor.
Wave 14 fixed an earlier draft of the recipe that incorrectly assumed
`(B, T)`. The new version raises a clear `ValueError` if you hand it
the wrong shape, instead of silently broadcasting and producing
nonsense gradients. Users who are used to TRL or openrlhf — both of
which call the loss with batched tensors — see this on day one.
**FIX.**
- If you are running inside PRIME-RL via its `CustomLossConfig`, you
don't need to do anything: PRIME-RL's runner produces `(seq,)`
tensors and the recipe accepts them.
- If you are calling the recipe directly from your own runner, slice
your batch into per-sample 1-D tensors before each call:
```python
for b in range(B):
inputs_b = LossInputs(
trainer_logprobs=batched.trainer_logprobs[b],
inference_logprobs=batched.inference_logprobs[b],
advantages=batched.advantages[b],
loss_mask=batched.loss_mask[b],
teacher_logprobs=None if batched.teacher_logprobs is None
else batched.teacher_logprobs[b],
)
loss = loss_fn(inputs_b, ...)
```
- If you genuinely need a batched API, write a thin wrapper around
`loss_fn`. Don't patch the recipe — its shape contract is dictated
by PRIME-RL, not by us.
**VERIFICATION.** The shape contract is pinned by two tests in
`composer_replication/recipes/prime_rl/tests/test_composer_loss.py`:
- `test_advantages_shape_validates_seq_accepted` — `(seq,)` succeeds.
- `test_advantages_shape_validates_bt_rejected` — `(B, T)` raises
`ValueError`.
```
pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs
```
---
### 8. TAID can't run mid-training because `student_init_logits` is missing
**SYMPTOM.** You decide partway through a training run to enable
`sdpo_wrapper="taid"` (e.g. you read the TAID paper after step 2000
and want to retrofit). The next training step blows up — either with
a `KeyError` for `student_init_logits` / `student_init_input_ids`, or
with a strange-looking loss because the framework fell back to
re-running a forward pass through the *current* (drifted) model
instead of the init model.
**DIAGNOSIS.** TAID interpolates between the **student's distribution
at step 0** and the teacher's distribution. From the TAID docstring at
`composer_replication/distillation/taid.py:10–24`:
> TAID interpolates between an "identity" target (the student's own
> distribution at step 0) and the teacher's distribution, with the
> interpolation coefficient annealed from 0 → 1 over training.
That step-0 reference target has to come from somewhere. The framework
accepts it via either:
1. `inputs["student_init_logits"]` — a precomputed `(B, T, V)` tensor
captured at training start (preferred for production), OR
2. `inputs["student_init_input_ids"]` — input ids for a frozen forward
pass through `model`. **This assumes `model` has not yet drifted
from init.** It is correct only at step 0 or in tests; in
production it silently produces the wrong target.
If you forgot to capture the init logits at step 0, you cannot
faithfully use TAID mid-run.
**FIX.** Capture init logits at step 0 and persist them:
```python
# At step 0, before any optimizer.step() call:
with torch.no_grad():
init_logits = model(input_ids=batch["input_ids"]).logits
# Save to disk if you'll need them across restarts:
torch.save(init_logits, "checkpoints/init_logits_batch0.pt")
inputs["student_init_logits"] = init_logits
# Or, if you have a fixed eval probe set, capture init logits once
# for that fixed set and reuse them every step:
inputs["student_init_logits"] = cached_init_logits
```
If you genuinely have no step-0 snapshot, **TAID is not retrofittable**
to your run. Your options are:
- Restart from a checkpoint that *was* the step-0 model.
- Use a different distillation wrapper (`sdpo_wrapper="entropy_opd"`)
that doesn't need init logits.
- Accept the bias from the live-model fallback path. Don't.
**VERIFICATION.** The precomputed-vs-live-fallback contract is exercised by:
- `test_taid_accepts_precomputed_student_init_logits` in
`composer_replication/tests/test_compose_loss_integration.py` —
passes precomputed logits and asserts the TAID-wrapped channel uses
them.
- `test_taid_alpha_one_recovers_sdpo` — asserts that with
`alpha_min=alpha_max=1.0` (i.e. pure teacher target, init logits
ignored) TAID reproduces standard SDPO. If your training ignores
init logits silently, *this* is the test that would have failed.
```
pytest composer_replication/tests/test_compose_loss_integration.py::test_taid_accepts_precomputed_student_init_logits -xvs
```
---
### 9. `ModalExecutor()` or `HFJobsExecutor()` raises `NotImplementedError` at construction
**SYMPTOM.** You write
`executor = ModalExecutor(app_name="my-app")` (or the HF Jobs
equivalent) in a production script and the constructor immediately
raises:
```
NotImplementedError: ModalExecutor is a v0 skeleton; full implementation pending.
Use LocalProcessExecutor for testing.
```
Same for `HFJobsExecutor`. This is at *init time*, not at the first
`launch_replicas` call.
**DIAGNOSIS.** Per ADR-005 the v0 release ships only the
`ServerlessExecutor` Protocol and the reference `LocalProcessExecutor`.
The Modal and HF Jobs implementations are **import-safe skeletons** —
the classes exist and you can `from … import ModalExecutor`, but
`__init__` raises `NotImplementedError` to prevent silent partial
behavior. See `modal.py:64` and `hf_jobs.py:64`.
This is intentional. We didn't want to ship a half-working Modal
executor that succeeds at `launch_replicas` and then silently fails
two-thirds of the way through `collect`.
**FIX.**
- Use `LocalProcessExecutor` for development, CI, and any single-host
multi-process testing.
- For real cloud deployment in the v0 era, run your training script
directly in Modal/HF Jobs by hand: write your own thin Modal
function that constructs `MockManager(ObjectStoreAllReduce(uri,
rank, world_size))` and runs the training loop. The skeleton
docstrings at `modal.py:24–48` and `hf_jobs.py:26–49` show exactly
the pattern.
- Watch the `BACKLOG.md` for v0 polish — the real implementations are
scheduled.
**VERIFICATION.** That `LocalProcessExecutor` is fully functional and
correctly implements the Protocol is locked in by:
- `test_local_executor_runs_allreduce_across_replicas` in
`composer_replication/diloco/serverless/tests/test_serverless_local.py`
— runs N replicas locally, performs an allreduce across them.
- `test_local_executor_handles_multiple_rounds`
- `test_local_executor_reports_failed_replicas`
If those tests pass, your serverless DiLoCo machinery works — only the
specific cloud adapters are missing. The skeletons themselves are not
under test (raising in `__init__` is the contract).
```
pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs
```
---
### 10. DPPO mask drops every token — "loss became 0" or "no gradients"
**SYMPTOM.** You ported a PPO config from another framework (KL
penalty + clip ε=0.2 + value loss), wired it into the PRIME-RL recipe
with the default `dppo_mask_high=0.2` / `dppo_mask_low=0.2`, and the
training loss is suspiciously close to zero. Inspecting the recipe's
internal `keep_mask` shows nearly every token is being masked out.
**DIAGNOSIS.** PRIME-RL's "DPPO mask" is **not** the same as PPO
clipping, and not even the same as a log-ratio threshold. From the
recipe docstring at
`composer_replication/recipes/prime_rl/composer_loss.py` (mirroring
PRIME-RL upstream `prime_rl/trainer/rl/loss.py` lines 137-148):
> The mask gate is on **probability-space**
> `probs_diff = exp(trainer_lp) - exp(inference_lp)`, NOT on the
> log-ratio. A positive-advantage token is dropped iff
> `probs_diff > dppo_mask_high`; a negative-advantage token iff
> `probs_diff < -dppo_mask_low`. Masked tokens are **dropped from the
> policy-gradient term** but still contribute to the KL penalty.
The defaults `dppo_mask_high=dppo_mask_low=0.2` match PRIME-RL's
`DefaultLossConfig`. Because the gate is on probability-space, the
"in-band" zone is
`exp(trainer_lp) ∈ [exp(inference_lp) - 0.2, exp(inference_lp) + 0.2]`.
For a token with inference probability ~0.5 this is a fairly tight
band; for tokens at probability ~0.001 or ~0.999 the same threshold
behaves very differently from a log-ratio bound. This is by design —
PRIME-RL is bounding the absolute change in token probability, not the
multiplicative change.
The two failure modes:
1. **All tokens masked.** Trainer and inference engines disagree
sharply (fp16 vs bf16, stale rollout cache, mismatched chat
templates) and `probs_diff` exceeds 0.2 almost everywhere.
2. **No tokens masked.** Trainer ≈ inference (e.g. you forgot to step
the optimizer between rollouts) so the bound is never binding and
the policy never sees any DPPO regularization.
**FIX.** Inspect the empirical `probs_diff` distribution before
tuning:
```python
# In your training loop:
probs_diff = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs)
print(torch.quantile(probs_diff.abs(), torch.tensor([0.5, 0.9, 0.99])))
```
For a healthy on-policy run with bf16 trainer + bf16 inference and
fresh rollouts, the central 99% of `|probs_diff|` should sit well
below `0.2`. If yours doesn't, the upstream divergence is the
problem, not the bound. Bumping `dppo_mask_high/low` to 0.5 or 1.0 is
a workaround but it disables the trust-region intent of DPPO.
**Do not** translate PPO ε=0.2 directly. PPO ε=0.2 is a multiplicative
log-ratio bound (`|log_ratio| < log(1.2) ≈ 0.18`); DPPO's 0.2 is an
**additive probability-space** bound. The semantics are different and
the defaults are deliberately tight in probability space.
If you genuinely want to disable the mask (e.g. for bug-isolation),
pass `dppo_mask_high=1e6, dppo_mask_low=1e6` (both are
`Field(..., ge=0)` upstream — negative values are rejected by
both PRIME-RL and our adapter). There is a regression test for
exactly this knob.
**VERIFICATION.**
- `test_dppo_mask_high_drops_positive_advantage_outliers` and
`test_dppo_mask_low_drops_negative_advantage_outliers` in
`composer_replication/recipes/prime_rl/tests/test_composer_loss.py`
— assert that out-of-bound tokens are dropped from the
policy-gradient term (with the upstream sign-of-advantage gate).
- `test_dppo_mask_sign_conditioned_on_advantage` — asserts that a
positive-advantage token with a large *negative* probs_diff is NOT
dropped (PRIME-RL only checks the upper bound for positive-advantage
tokens).
- `test_dppo_bounds_can_be_disabled` — asserts that very wide bounds
(`1e6`) pass every token through.
- `test_parity_with_prime_rl_default_loss_fn` — when `prime-rl` is
installed, runs identical inputs through PRIME-RL upstream and our
adapter and asserts the loss matches.
```
pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs
```
---
### 11. `compose_loss` runs but the GRPO channel doesn't behave like real GRPO
**SYMPTOM.** You read the README, saw the "3-channel composition: GRPO
+ SDPO + trace-replay DPO" tagline, called `compose_loss(model,
inputs)` directly in your training loop, and your reward curve never
moves the way it would in a real GRPO trainer. Or: you compared
against a TRL `GRPOTrainer` baseline and `compose_loss` produces
totally different numbers.
**DIAGNOSIS.** From the docstring at the top of
`composer_replication/loss.py:1–16`:
> This is a verification-harness mirror of
> `ComposerReplicationTrainer._compute_loss` that does NOT depend on
> TRL's GRPOTrainer parent. The GRPO channel is replaced with standard
> LM next-token-prediction cross-entropy, which is the limit GRPO
> converges to under deterministic rewards.
>
> Use it for: CPU smokes on real HF models, unit tests of loss
> composition without spinning up TRL, anywhere we want to verify
> gradient flow through the 3-channel sum without paying TRL's full
> machinery cost.
>
> **Do NOT use it as the production training loss.** Production =
> ComposerReplicationTrainer (a real GRPOTrainer subclass).
The `lm_ce` channel labelled "GRPO" in the LossComponents dataclass is
a **stub**: it is plain language-modeling cross-entropy. It is the
correct channel for verification (gradient flow, channel weighting,
distillation wiring), but it is not GRPO's surrogate objective and
will never produce the same numbers as real GRPO under stochastic
rewards.
Real GRPO requires:
- A reward model or rule-based reward,
- Per-prompt advantage estimation across G samples,
- An importance-sampling-ratio clip / mask.
Those live in TRL's `GRPOTrainer`, in our PRIME-RL recipe at
`composer_replication/recipes/prime_rl/composer_loss.py`, or (when
shipped) in a future VeRL recipe.
**FIX.**
- For production GRPO training, do **not** call `compose_loss` directly.
Instead use one of:
- `composer_replication.trainer.composer_trainer.ComposerReplicationTrainer`
— TRL `GRPOTrainer` subclass, full machinery.
- `composer_replication.recipes.prime_rl.composer_loss.loss_fn` —
PRIME-RL's `CustomLossConfig` adapter (channel 1 is real DPPO-clipped GRPO).
- For ablations, smokes, and unit tests, `compose_loss` is the right
tool — but log the `lm_ce` channel as `lm_ce`, not as `grpo`. The
`LossComponents` dataclass already names the field correctly; if
your wandb logger relabels it as "GRPO loss", fix the label.
**VERIFICATION.**
- The 11-test integration suite at
`composer_replication/tests/test_compose_loss_integration.py` only
asserts gradient flow + bit-exact composition; it deliberately does
not assert any GRPO-specific property of `compose_loss`. That's the
contract.
- The PRIME-RL recipe's real DPPO+KL behavior is asserted by
`test_returns_finite_scalar`,
`test_dppo_mask_high_drops_positive_advantage_outliers`,
`test_dppo_mask_sign_conditioned_on_advantage`, and
`test_parity_with_prime_rl_default_loss_fn` (skip-marked when
`prime-rl` is not installed)
in `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`.
Those tests verify a real importance-sampling-ratio gradient with
PRIME-RL's advantage-conditioned mask, which `compose_loss` would
not pass.
If you find yourself wanting `compose_loss` to behave like real GRPO,
that is the signal to switch to one of the production paths above.
```
pytest composer_replication/tests/test_compose_loss_integration.py::test_defaults_bit_exact_with_legacy_kwargs -xvs
pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py::test_returns_finite_scalar -xvs
```
---
### 10. `monarch` / `data-juicer` / `prime-rl` install (Wave 16)
**SYMPTOM.** `pip install -e ".[monarch]"`, `pip install -e ".[prime-rl]"`,
or `pip install -e ".[replaysim]"` fails immediately with a uv/pip
resolver error similar to:
```
× No solution found when resolving dependencies:
╰─▶ Because only monarch<=0.1.11 is available and
composer-replication[monarch] depends on monarch>=0.4.1, we can
conclude that composer-replication[monarch]'s requirements are
unsatisfiable.
```
**DIAGNOSIS.** Three upstream packages the framework integrates with are
not currently pip-installable in their advertised versions:
1. **Meta's Monarch** is published on PyPI as
`torchmonarch-nightly` (nightly wheels with platform constraints), not
as `monarch`. The PyPI name `monarch` is unrelated to Meta's actor
framework and tops out at `0.1.11`.
2. **Prime Intellect's prime-rl** is not registered on PyPI at all. It
is published from source only.
3. **data-juicer** is not registered on PyPI under that exact name. The
closest match (`py-data-juicer==1.0.0`) has broken transitive deps;
newer `py-data-juicer` releases work but install ~150 transitive
packages.
Wave 16 dropped all three extras from `pyproject.toml` rather than ship
unsatisfiable pins. The framework code paths that touch these libraries
import them lazily, so:
- `composer_replication.recipes.monarch` is a documentation skeleton
that does NOT require monarch installed.
- `composer_replication.recipes.prime_rl.composer_loss` imports cleanly
without prime-rl; the upstream parity test is `@skipif`-gated and the
in-file shadow-parity test still verifies the loss formula
independently.
- `composer_replication.replaysim.normalize.DJNormalizer(skip_dj=True)`
works without `data_juicer`; only the full DJNormalizer code path
needs it.
**FIX.** If you want any of these libraries' real functionality, install
from source alongside the framework:
```
# Meta Monarch (actor framework — see ADR-006)
pip install torchmonarch-nightly # OR install from source:
# git clone https://github.com/meta-pytorch/monarch && cd monarch && pip install -e .
# Prime Intellect prime-rl (Recipe C — see ADR-006)
git clone https://github.com/PrimeIntellect-ai/prime-rl
cd prime-rl && pip install -e .
# data-juicer (replaysim normalization — see ADR-004)
git clone https://github.com/modelscope/data-juicer
cd data-juicer && pip install -e .
```
**VERIFICATION.** A fresh checkout install with all surviving extras
should succeed:
```
uv venv --clear
uv pip install -e ".[diloco,replay,replaysim,train,dev]"
source .venv/bin/activate
python -m pytest -q # baseline 176 passed / 8 skipped
```
If any of those extras fails to resolve, file a bug report — Wave 16
verified the full extras matrix installs from a clean venv on Python
3.11.
---
## How to file a bug report
If you've read the relevant section above and your problem persists,
file a bug. Include **all** sections of the template below — the most
common reason a maintainer can't repro is a missing piece of
environmental context.
```markdown
### What I expected vs what happened
(One paragraph.)
### Repro steps
1. ...
2. ...
3. ...
Minimal self-contained snippet (no `from my_local_thing import …`):
```python
# repro.py
from composer_replication import compose_loss
...
```
### Environment
- OS: (uname -a or `ver` on Windows)
- Python: (python --version)
- composer-replication: (pip show composer-replication | head -3)
- torch: (python -c "import torch; print(torch.__version__)")
- torchft: (python -c "import torchft; print(torchft.__version__)" || echo "n/a")
- transformers / trl: (versions, or "not installed")
- data-juicer / fsspec: (versions, or "not installed")
- s3fs / gcsfs / adlfs: (versions if relevant)
- GPU: (nvidia-smi -L or "CPU only")
- Install method: pip install -e . / wheel / other
- Extras installed: [replay] [replaysim] [serverless] [dev]
### What you've already tried
- [ ] Read the relevant Failure Mode section of docs/TROUBLESHOOTING.md
(which one: ___)
- [ ] Ran `pytest <relevant test path>` and confirmed those tests pass
- [ ] Ran the repro snippet in a fresh venv
- [ ] Confirmed it reproduces on Python 3.11 (if you were on 3.12 / 3.13)
### Logs
(Full traceback. If it's a wrong-loss-curve rather than an exception,
paste loss values for the first 10 steps and link any wandb/tb run.)
### Hypothesis
(Optional. If you have a guess at where the bug is, name the file +
line number. We'll look there first.)
```
A few rules:
- **Do not** paste API keys, AWS credentials, or HuggingFace tokens.
- **Do** include the failing test name if you've narrowed it to one.
- **Do** distinguish "never worked" from "regressed between commit X
and Y." A regression-bisect goes straight to the front of the queue.
- **One bug per issue.** Multi-headed reports lose items in triage.
The Wave-14 surface area is large, but the test suite covers it
densely — every section above corresponds to a green test that proves
the fix worked.
|