| # TROUBLESHOOTING — Wave 14 | |
| This document catalogs every Wave-14-known failure mode in the Composer | |
| Replication Framework, along with how to diagnose, fix, and verify each | |
| one. It is intentionally surgical: the surface area added in Waves 12–14 | |
| (SimPO/TAID/Entropy-OPD distillation kwargs, the PRIME-RL composer-loss | |
| adapter, the serverless DiLoCo `MockManager` + `ObjectStoreAllReduce` | |
| path, and the data-juicer-backed replaysim normalizer) introduced new | |
| ways for users to trip themselves up. Each failure mode here is something | |
| a maintainer has actually seen or anticipated during the cross-model | |
| review of Wave 14. | |
| If you hit something not covered below, jump to the | |
| [How to file a bug report](#how-to-file-a-bug-report) section at the end — | |
| the template there gives a maintainer everything they need to reproduce. | |
| --- | |
| ## Common things to check first | |
| Before reading any further, run through this checklist. ~80% of "framework | |
| broken" reports turn out to be one of these: | |
| 1. **Python version.** The framework targets Python 3.10–3.12. The | |
| `pyproject.toml` `target-version` is `py310`. If you are on 3.13+, | |
| transitive deps (notably Ray, pulled in by data-juicer) may not yet | |
| ship wheels and will try to build from source. Run `python --version`. | |
| 2. **Fresh virtual environment.** Mixing the framework into an existing | |
| environment that already has `torch`, `transformers`, `trl`, or | |
| `torchft` pinned to incompatible versions is the #1 source of | |
| import-time errors. Create a new venv: `python -m venv .venv && source | |
| .venv/bin/activate && pip install -e ".[dev]"` (the quotes keep shells | |
| such as zsh from treating the brackets as a glob pattern). | |
| 3. **Editable install.** Most contributors run `pip install -e .` so | |
| that local edits to `composer_replication/` are picked up. If you | |
| `pip install composer-replication` from a registry instead, your | |
| edits to the source tree will be ignored. Confirm with | |
| `pip show composer-replication | grep Location`. | |
| 4. **Optional extras.** Several modules are optional-dep gated: | |
| - `[replay]` — adds `httpx` (used for OpenRouter teacher calls). | |
| - `[train]` — adds TRL, peft, accelerate, datasets (production GRPO). | |
| - `[replaysim]` — adds `data-juicer` (and via it, Ray as a transitive). | |
| - `[serverless]` — adds `fsspec`. For non-local rendezvous URIs you | |
| also need a backend-specific fsspec adapter (see Failure Mode 5). | |
| - `[dev]` — adds `pytest`, `ruff`, etc. | |
| If you see `ModuleNotFoundError: No module named 'data_juicer'`, the | |
| `[replaysim]` extra is missing. Install it with | |
| `pip install -e ".[replaysim]"`. | |
| 5. **Run the test suite first.** Before debugging anything, run the | |
| subset of tests touching the area you care about: | |
| ``` | |
| pytest composer_replication/tests/ # core compose_loss | |
| pytest composer_replication/distillation/tests/ # SimPO / TAID / OPD | |
| pytest composer_replication/recipes/prime_rl/tests/ # PRIME-RL adapter | |
| pytest composer_replication/diloco/serverless/tests/ # MockManager + DiLoCo | |
| pytest composer_replication/replaysim/tests/ # data-juicer normalizer | |
| ``` | |
| If a test that is green in CI fails for you locally, the problem is | |
| environmental. Fix that before digging into your own code. | |
| 6. **Read the docstring of the symbol you're calling.** Wave 14 | |
| docstrings are written to be the first line of documentation. The | |
| `compose_loss` docstring (`composer_replication/loss.py`) lists every | |
| required and optional input key. The `MockManager` docstring | |
| enumerates the torchft surface methods it implements. | |
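The first and fourth checklist items can be scripted. The following is a minimal sketch, not part of the framework: the helper names `python_supported` and `missing_extras` are hypothetical, and the import names are taken from the extras list above (`data-juicer` imports as `data_juicer`, as the error message in item 4 shows).

```python
import importlib.util
import sys

# Import names for each optional extra, as listed in the checklist above.
EXTRA_MODULES = {
    "replay": ["httpx"],
    "train": ["trl", "peft", "accelerate", "datasets"],
    "replaysim": ["data_juicer"],
    "serverless": ["fsspec"],
}


def python_supported(version=None):
    """True if the interpreter is within the 3.10-3.12 target range."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 10 <= minor <= 12


def missing_extras(extras=EXTRA_MODULES):
    """Map each extra to the modules that cannot be found on sys.path."""
    report = {}
    for extra, modules in extras.items():
        absent = [m for m in modules if importlib.util.find_spec(m) is None]
        if absent:
            report[extra] = absent
    return report


if __name__ == "__main__":
    print("python ok:", python_supported())
    for extra, absent in missing_extras().items():
        print(f'missing for [{extra}]: {", ".join(absent)}')
```

A missing entry only matters if you use that code path. The check looks modules up without importing them, so it does not trigger Ray's first-import probes.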
| --- | |
| ## Failure modes | |
| ### 1. `pip install -e .[replaysim]` hangs or fails on Python 3.12 with a Ray-related path error | |
| **SYMPTOM.** Installing the `[replaysim]` extra (which pulls | |
| `data-juicer`) triggers a transitive install of Ray. On Python 3.12, the | |
| first `import ray` (often during `pip` build hooks or the first time | |
| data-juicer is loaded) fails with messages mentioning | |
| `/tmp/ray/session_*` paths, missing `pyarrow` symbols, or `OSError: | |
| [Errno 2] No such file or directory: '/dev/shm/ray-...'` inside Docker. | |
| **DIAGNOSIS.** `data-juicer` declares `ray` as a transitive dependency. | |
| On Python 3.12 the wheel matrix is incomplete for some Ray versions, and | |
| Ray's first-import probes `/dev/shm` and `/tmp/ray` for its session | |
| state. In a sandboxed container, restricted CI runner, or WSL | |
| environment with a non-default `/tmp`, those probes fail. Wave 14 | |
| subagent T2 hit this in CI and worked around it by pinning Ray and by | |
| making sure `/tmp` exists and is writable. | |
| **FIX.** | |
| - Prefer Python 3.11 if you're on 3.12+ and don't need 3.12 features. | |
| - If you must stay on 3.12, ensure `/tmp` is writable and pre-create the | |
| session directory: `mkdir -p /tmp/ray && chmod 1777 /tmp/ray`. | |
| - In Docker, mount a real tmpfs at `/dev/shm`: | |
| `docker run --shm-size=2g …`. | |
| - If you don't need replaysim normalization, you can skip the extra | |
| entirely. The `DJNormalizer(skip_dj=True)` passthrough (see | |
| `composer_replication/replaysim/normalize.py:165`) does not import | |
| `data_juicer` and therefore does not import Ray. | |
| **VERIFICATION.** The skip-dj passthrough is exercised by | |
| `test_dj_normalizer_skip_dj_passthrough` and | |
| `test_dj_normalizer_skip_dj_preserves_count` in | |
| `composer_replication/replaysim/tests/test_replaysim.py`. Both run | |
| without `data_juicer` installed: | |
| ``` | |
| pytest composer_replication/replaysim/tests/test_replaysim.py::test_dj_normalizer_skip_dj_passthrough -xvs | |
| ``` | |
| If that passes in your environment, your `[replaysim]`-less install is | |
| healthy — only the full data-juicer code path requires Ray. | |
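To confirm that the directories Ray probes are usable before installing anything, a small standard-library check is enough. This is a sketch under the assumption that `/tmp`, `/tmp/ray`, and `/dev/shm` are the relevant locations, as described in the diagnosis above; `probe_writable` is a hypothetical helper, not a framework function.

```python
import os
import tempfile


def probe_writable(path):
    """Return True if a file can be created and removed inside `path`."""
    if not os.path.isdir(path):
        return False
    try:
        # The temporary file is deleted again when the block exits.
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ("/tmp", "/tmp/ray", "/dev/shm"):
        state = "writable" if probe_writable(candidate) else "NOT writable"
        print(candidate, state)
```

Run it inside the same container or CI runner that fails. A "NOT writable" result for `/dev/shm` points at the Docker `--shm-size` fix; one for `/tmp/ray` points at the `mkdir -p` fix.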
| --- | |
| ### 2. `compose_loss` produces wrong-looking numbers when combining new kwargs | |
| **SYMPTOM.** You pass several Wave-14 distillation kwargs to | |
| `compose_loss` (e.g. `dpo_variant="simpo"`, `sdpo_wrapper="taid"`, | |
| `taid_schedule_step=0`, `simpo_beta=2.0`, `entropy_opd_h_max=…`), and | |
| the loss curve looks wrong: NaNs, identically-zero `sdpo_jsd` channel, | |
| or a `total` that is bit-different from your reference run with no | |
| distillation kwargs at all. | |
| **DIAGNOSIS.** `compose_loss` now has 13 keyword arguments and the | |
| contract between them is non-trivial. Subagent T1's review identified | |
| three combinations that look reasonable but are unsupported: | |
| - Passing `taid_schedule_step` without `taid_total_steps` (or vice | |
| versa). The function raises `ValueError` clearly, but the message can | |
| scroll past in noisy logs. | |
| - Passing `dpo_variant="simpo"` while still supplying | |
| `dpo_chosen_ref_logprobs`. Those keys are **silently ignored** — | |
| SimPO is reference-free. | |
| - Passing `sdpo_wrapper="taid"` without supplying either | |
| `student_init_logits` OR `student_init_input_ids` in `inputs`. The | |
| function will fall back to a forward pass through the (possibly | |
| drifted) live model, which is a footgun late in training (see Failure | |
| Mode 8). | |
| **FIX.** Read the docstring at the top of | |
| `composer_replication/loss.py` (lines 25–39 list the three pluggable | |
| losses and their preconditions). The general rule: | |
| ```python | |
| from composer_replication import compose_loss | |
| # Defaults (no distillation knobs) reproduce legacy 3-channel composition bit-exact. | |
| out = compose_loss(model, inputs) | |
| # To opt into SimPO, pass dpo_variant ONLY. Do not pass ref-logprob keys. | |
| out = compose_loss(model, inputs, dpo_variant="simpo", | |
| simpo_beta=2.0, simpo_gamma=1.0) | |
| # To opt into TAID, pass BOTH schedule_step AND total_steps, AND make sure | |
| # inputs["student_init_logits"] is populated (see Failure Mode 8). | |
| out = compose_loss(model, inputs, sdpo_wrapper="taid", | |
| taid_schedule_step=step, taid_total_steps=total_steps) | |
| ``` | |
| Setting all 13 kwargs to their defaults is **bit-exact equivalent** to | |
| the pre-Wave-13 3-channel loss; if your defaults call gives different | |
| numbers than your old code, file a bug. | |
| **VERIFICATION.** The bit-exact equivalence and every supported | |
| combination are locked in by the 11 integration tests in | |
| `composer_replication/tests/test_compose_loss_integration.py`. The most | |
| important ones: | |
| - `test_defaults_bit_exact_with_legacy_kwargs` — passing the new kwargs | |
| at their defaults is identical to legacy. | |
| - `test_simpo_does_not_require_ref_logprobs` — SimPO works with the | |
| ref-logprob keys absent from `inputs`. | |
| - `test_taid_alpha_one_recovers_sdpo` — TAID with `alpha_min=alpha_max=1` | |
| reproduces standard SDPO. | |
| - `test_taid_requires_schedule_step` / `test_taid_requires_total_steps` — | |
| the partial-config error path. | |
| ``` | |
| pytest composer_replication/tests/test_compose_loss_integration.py -xvs | |
| ``` | |
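Because two of the three unsupported combinations do not raise, it can help to lint your call site before training. The sketch below is a hypothetical pre-flight helper, not part of `compose_loss`. It assumes the reference keys live in `inputs` and can be recognised by the substring `ref_logprobs`, which matches the one key named above but is otherwise a guess.

```python
def check_distillation_kwargs(inputs, **kwargs):
    """Return a list of human-readable warnings for risky kwarg combinations."""
    problems = []

    # TAID schedule arguments only make sense as a pair.
    has_step = kwargs.get("taid_schedule_step") is not None
    has_total = kwargs.get("taid_total_steps") is not None
    if has_step != has_total:
        problems.append(
            "taid_schedule_step and taid_total_steps must be passed together"
        )

    # SimPO is reference-free, so ref-logprob keys would be silently ignored.
    if kwargs.get("dpo_variant") == "simpo":
        ignored = sorted(k for k in inputs if "ref_logprobs" in k)
        if ignored:
            problems.append(f"SimPO ignores reference keys: {ignored}")

    # TAID without an init target falls back to the live (drifted) model.
    if kwargs.get("sdpo_wrapper") == "taid":
        init_keys = {"student_init_logits", "student_init_input_ids"}
        if not (init_keys & set(inputs)):
            problems.append(
                "TAID needs student_init_logits or student_init_input_ids"
            )

    return problems
```

Call it with the same `inputs` and keyword arguments you are about to hand to `compose_loss`, and log or raise on a non-empty result.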
| --- | |
| ### 3. `MockManager` works today but silently breaks after a torchft upgrade | |
| **SYMPTOM.** Your serverless DiLoCo run starts, the first outer round | |
| completes, and then `torchft.DiLoCo` raises an `AttributeError` on | |
| something like `_use_async_quorum`, `should_commit`, or | |
| `current_step` — or worse, it silently uses the wrong sync semantics. | |
| **DIAGNOSIS.** `MockManager` is a duck-typed shim that mirrors | |
| `torchft.Manager` rather than subclassing it. The surface it implements | |
| is enumerated in the docstring at | |
| `composer_replication/diloco/serverless/allreduce.py:215`: | |
| > Methods/attributes DiLoCo touches: `allreduce`, `should_commit`, | |
| > `start_quorum`, `current_step`, `disallow_state_dict_read`, | |
| > `allow_state_dict_read`, `register_state_dict_fn`, `_use_async_quorum` | |
| > (attribute), `num_participants`, `rank`. | |
| Two members in that list, `_use_async_quorum` and the internal | |
| `current_step` counter, are private torchft API that may be renamed | |
| without notice in any torchft minor release. Wave 14 subagent | |
| T3 specifically called this out: "If torchft renames `_use_async_quorum` | |
| to anything else, MockManager silently breaks because there is nothing | |
| holding the contract beyond a string." | |
| **FIX.** | |
| - **Pin torchft.** In `pyproject.toml` keep your torchft version pinned | |
| to a known-good range (e.g. `torchft>=0.2,<0.4`). When you need to | |
| upgrade, do so deliberately and re-run the integration tests below | |
| before merging. | |
| - **Watch the deprecation warning.** Wave 14 sets up a clear path to | |
| warn if `_use_async_quorum` is read on a fresh instance — see the | |
| comment at `allreduce.py:255`. | |
| - **Don't pass an arbitrary torchft branch.** If you've patched torchft | |
| locally, the `MockManager` may need updating in lockstep. The | |
| surface-compatibility tests below will catch this in CI. | |
| **VERIFICATION.** The full DiLoCo × MockManager surface is exercised by: | |
| - `test_mock_manager_shape_compat` in | |
| `composer_replication/diloco/serverless/tests/test_serverless_local.py` | |
| — sanity check that all expected methods/attributes exist. | |
| - `test_mockmanager_has_full_diloco_call_surface` in | |
| `composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py` | |
| — runs an end-to-end outer round through real torchft `DiLoCo`, | |
| hitting every method on the surface list above. | |
| - `test_mockmanager_diloco_outer_round_completes` — full one-round | |
| smoke ending in a successful outer SGD step. | |
| If any of these tests turn red after a torchft bump, **do not ship**: | |
| inspect the new torchft Manager surface and update `MockManager` | |
| to match. | |
| ``` | |
| pytest composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py -xvs | |
| ``` | |
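For a quick check outside pytest, the surface list quoted in the diagnosis can be turned into a duck-typing probe. This is a sketch: `DILOCO_SURFACE` copies the names from the docstring quote above, and `missing_surface` is a hypothetical helper rather than something the framework ships.

```python
# Names copied from the MockManager docstring quoted above.
DILOCO_SURFACE = (
    "allreduce", "should_commit", "start_quorum", "current_step",
    "disallow_state_dict_read", "allow_state_dict_read",
    "register_state_dict_fn", "_use_async_quorum",
    "num_participants", "rank",
)


def missing_surface(manager, surface=DILOCO_SURFACE):
    """Return the names from `surface` that `manager` does not expose."""
    return [name for name in surface if not hasattr(manager, name)]
```

Pass it a constructed manager instance rather than the class, since attributes set in `__init__` are not visible on the class itself. An empty list only shows that the names exist; it says nothing about whether their semantics changed, which is what the integration tests cover.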
| --- | |
| ### 4. SimPO loss curve looks like noise | |
| **SYMPTOM.** You wired in `dpo_variant="simpo"`, the run starts, and | |
| the `trace_replay_dpo` channel either drifts to large negative values | |
| (→ `total` blows up) or oscillates with much higher variance than | |
| standard DPO. The loss curve "looks like noise." | |
| **DIAGNOSIS.** SimPO uses **average per-token log-probability** | |
| (`Σ logπ(c_t) / |c|`), not sum log-prob. From the SimPO docstring | |
| (`composer_replication/distillation/simpo.py:11–18`): | |
| > SimPO drops the reference-policy term, replaces it with a target | |
| > margin γ, and uses **average sequence log-probability instead of | |
| > sum**. […] L_SimPO = -log σ( β · [avg_logπ(c) - avg_logπ(r)] - γ ) | |
| If you compute `chosen_logprobs.sum()` (or any unmasked aggregation) and | |
| hand it to SimPO as `chosen_avg_logprobs`, the loss is mis-scaled: β=2.0 | |
| times a summed log-prob grows with response length, whereas β=2.0 times | |
| an average does not. Each batch still yields a finite, plausible-looking | |
| number, but the optimum is nowhere near the dataset's true preference | |
| signal. | |
| **FIX.** Use the helper | |
| `composer_replication.distillation.simpo.avg_sequence_logprob`: | |
| ```python | |
| from composer_replication.distillation.simpo import ( | |
| simpo_loss, avg_sequence_logprob, | |
| ) | |
| chosen_avg = avg_sequence_logprob(chosen_logprobs, chosen_response_mask) | |
| rejected_avg = avg_sequence_logprob(rejected_logprobs, rejected_response_mask) | |
| loss = simpo_loss(chosen_avg, rejected_avg, beta=2.0, gamma=1.0) | |
| ``` | |
| The mask is **1 on response tokens, 0 on prompt+padding** — same | |
| convention as the rest of the framework. If you must roll your own | |
| aggregation, divide by `response_mask.sum(dim=-1).clamp_min(1.0)`, | |
| not by `response_mask.shape[-1]`. | |
| **VERIFICATION.** The avg-vs-sum semantics are pinned by | |
| `test_avg_sequence_logprob` in | |
| `composer_replication/distillation/tests/test_distillation_losses.py`, | |
| which constructs known per-token log-probs and asserts the helper | |
| returns the correct per-sequence average. The end-to-end SimPO | |
| loss-shape check is `test_simpo_loss_returns_scalar` in the same file. | |
| ``` | |
| pytest composer_replication/distillation/tests/test_distillation_losses.py::test_avg_sequence_logprob -xvs | |
| pytest composer_replication/distillation/tests/test_distillation_losses.py::test_simpo_loss_lower_for_better_separation -xvs | |
| ``` | |
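The effect of summing instead of averaging is easy to see on a toy pair. The sketch below uses plain Python lists instead of tensors and re-implements the masked average and the quoted SimPO formula by hand; `masked_avg_logprob` and `simpo_loss_scalar` are illustrative stand-ins, not the framework's `avg_sequence_logprob` and `simpo_loss`.

```python
import math


def masked_avg_logprob(token_logprobs, response_mask):
    """Average log-prob over response tokens only (mask is 1 on response tokens)."""
    total = sum(lp * m for lp, m in zip(token_logprobs, response_mask))
    count = max(sum(response_mask), 1)  # same role as clamp_min(1.0)
    return total / count


def simpo_loss_scalar(chosen, rejected, beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (chosen - rejected) - gamma), computed stably."""
    z = beta * (chosen - rejected) - gamma
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Chosen response: 4 tokens at -0.5 each. Rejected: 2 tokens at -0.8 each.
# Both sequences are padded to length 6; prompt and padding carry mask 0.
chosen_lp = [0.0, -0.5, -0.5, -0.5, -0.5, 0.0]
chosen_mask = [0, 1, 1, 1, 1, 0]
rejected_lp = [0.0, -0.8, -0.8, 0.0, 0.0, 0.0]
rejected_mask = [0, 1, 1, 0, 0, 0]

avg_gap = (masked_avg_logprob(chosen_lp, chosen_mask)
           - masked_avg_logprob(rejected_lp, rejected_mask))
sum_gap = sum(chosen_lp) - sum(rejected_lp)

print("average gap:", avg_gap)  # positive: chosen is better per token
print("summed gap:", sum_gap)   # negative: chosen looks worse because it is longer
print("loss on averages:", simpo_loss_scalar(-0.5, -0.8))
print("loss on sums:", simpo_loss_scalar(sum(chosen_lp), sum(rejected_lp)))
```

The per-token gap favours the chosen response (about 0.3), while the summed gap has the opposite sign (about -0.4) purely because the chosen response is twice as long. That length bias is what makes the curve look like noise.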
| --- | |
| ### 5. `ObjectStoreAllReduce` works locally but fails on `s3://` at first allreduce | |
| **SYMPTOM.** You construct | |
| `ObjectStoreAllReduce(uri="s3://my-bucket/run42/", rank=0, | |
| world_size=4)`. The constructor succeeds. The first call to | |
| `allreduce(tensor, name="...")` raises `ImportError: Install s3fs to | |
| access S3` or `botocore.exceptions.NoCredentialsError: Unable to locate | |
| credentials`. | |
| **DIAGNOSIS.** `ObjectStoreAllReduce` uses fsspec to reach the | |
| backend, but **fsspec knows the cloud protocol names without bundling | |
| their backends**: each backend lives in a separate package that is | |
| imported only when the filesystem is first used. The constructor does | |
| not eagerly validate the URI, so it accepts any scheme. The `s3://` | |
| adapter requires: | |
| 1. The `s3fs` package (`pip install s3fs`), which is **not** in the | |
| default `[serverless]` extra. | |
| 2. Working AWS credentials (env vars, `~/.aws/credentials`, IAM role, | |
| or whatever your environment normally provides to boto3). | |
| The same is true for `gs://` (`gcsfs`), `az://` (`adlfs`), and | |
| `hf://` (`huggingface_hub`'s fsspec integration, which is included if | |
| you have `huggingface_hub` installed). | |
| **FIX.** | |
| - Install the right adapter alongside the framework: | |
| ``` | |
| pip install s3fs # for s3:// | |
| pip install gcsfs # for gs:// | |
| pip install adlfs # for az:// | |
| ``` | |
| - Verify credentials work outside the framework first: | |
| ``` | |
| python -c "import s3fs; print(s3fs.S3FileSystem().ls('my-bucket'))" | |
| ``` | |
| - If you're running on Modal/HF Jobs, set the credentials as Modal | |
| secrets / HF Jobs env vars in the executor config — not in your | |
| local shell. | |
| The constructor could in principle perform an eager probe (e.g. a | |
| `HEAD` on the rendezvous prefix) to fail fast at init time. Wave 14 | |
| deliberately did not add this because it adds a network round-trip on | |
| every replica startup. If you want pre-flight validation in your | |
| training script, call `fsspec.filesystem(protocol).ls(uri)` yourself | |
| before constructing the manager. | |
| **VERIFICATION.** The `file://` and bare-path code paths — the only | |
| ones that don't need an extra adapter — are exercised by: | |
| - `test_object_store_allreduce_local_paths_create_dir` | |
| - `test_object_store_allreduce_world_size_1_passthrough` | |
| - `test_object_store_allreduce_round_id_increments` | |
| …all in | |
| `composer_replication/diloco/serverless/tests/test_serverless_local.py`. | |
| If those pass and your `s3://` URI fails, the framework is fine and | |
| your fsspec adapter or credentials are the problem. | |
| ``` | |
| pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs | |
| ``` | |
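A cheap pre-flight that avoids any network round-trip is to check that the adapter package for your URI scheme is importable. This is a sketch, not framework code: the scheme-to-package table is copied from the diagnosis above, and `required_adapter` and `adapter_installed` are hypothetical helpers.

```python
import importlib.util
from urllib.parse import urlsplit

# Protocol -> package providing the fsspec backend, per the diagnosis above.
ADAPTER_PACKAGES = {
    "s3": "s3fs",
    "gs": "gcsfs",
    "az": "adlfs",
    "hf": "huggingface_hub",
}


def required_adapter(uri):
    """Return the package a rendezvous URI needs, or None for local paths."""
    scheme = urlsplit(uri).scheme
    return ADAPTER_PACKAGES.get(scheme)


def adapter_installed(uri):
    """True if no extra adapter is needed or the needed one is importable."""
    package = required_adapter(uri)
    return package is None or importlib.util.find_spec(package) is not None
```

Calling `adapter_installed("s3://my-bucket/run42/")` before constructing `ObjectStoreAllReduce` catches the `ImportError` case at startup. It does not validate credentials; for that, the `ls` probe described above is still needed.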
| --- | |
| ### 6. Custom replaysim recipe drops every record (or crashes data-juicer) | |
| **SYMPTOM.** You wrote a custom replaysim YAML recipe modeled on | |
| `composer_replication/recipes/replaysim/default.yaml`. It loads | |
| without error, but every input DPO pair is dropped, OR data-juicer | |
| raises `KeyError: 'text_key'`, OR it raises a complaint about | |
| "expected str, got list" inside one of the filters. | |
| **DIAGNOSIS.** Wave 14 fixed two related bugs in the *default* recipe | |
| that custom-recipe authors will hit again. Both are documented in the | |
| header comment at | |
| `composer_replication/recipes/replaysim/default.yaml:21–35`: | |
| 1. **`text_keys` plural vs `text_key` singular.** The top-level | |
| dataset contract uses `text_keys: chosen` (plural). Each individual | |
| op uses `text_key: chosen` (singular). They are not interchangeable. | |
| data-juicer's dataset loader validates that the `text_keys` field | |
| exists on every record before any op runs; an op that uses | |
| `text_keys` instead of `text_key` is silently misconfigured. | |
| 2. **`chosen` / `rejected` as strings vs as list-of-dicts.** | |
| data-juicer ops like `text_length_filter`, `words_num_filter`, | |
| `special_characters_filter`, and `document_deduplicator` read a | |
| single string field. Pointing them at the chat-messages list | |
| (`chosen_messages`, `rejected_messages`) crashes or silently | |
| no-ops. The framework's `_dpo_pair_to_dj_record` keeps **both** | |
| shapes side-by-side: `chosen`/`rejected` (strings) for filter ops, | |
| and `chosen_messages`/`rejected_messages` (chat-messages list) for | |
| chat-aware ops + the `NormalizedDPOPair` round-trip. | |
| **FIX.** Treat the default recipe as your starting template. Concretely: | |
| - Always declare `text_keys: chosen` at the top. | |
| - For every length/word/special-char op you add, duplicate it: once | |
| with `text_key: chosen`, once with `text_key: rejected`. (Each op | |
| takes only one `text_key` — see comment at lines 31–35 of | |
| `default.yaml`.) | |
| - Never point a filter op at `chosen_messages` or `rejected_messages`. | |
| Those are list-of-dicts; only chat-aware ops accept that shape. | |
| **VERIFICATION.** The two-shape contract is locked in by: | |
| - `test_record_chosen_rejected_are_flat_strings_for_dj_text_ops` — | |
| asserts `chosen` and `rejected` are bare strings on every record | |
| produced by `_dpo_pair_to_dj_record`. | |
| - `test_record_chosen_rejected_messages_carry_chat_shape` — asserts | |
| `chosen_messages` / `rejected_messages` exist as list-of-dicts. | |
| - `test_dj_normalizer_e2e_default_recipe(tmp_path)` — runs the actual | |
| default recipe through real data-juicer end-to-end (skipped if | |
| `data_juicer` isn't importable). | |
| …all in | |
| `composer_replication/replaysim/tests/test_replaysim.py`. If those | |
| pass and your custom recipe still drops everything, diff your YAML | |
| against `default.yaml` until the two shapes align. | |
| ``` | |
| pytest composer_replication/replaysim/tests/test_replaysim.py -xvs | |
| ``` | |
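Both recipe mistakes can be caught statically once the YAML is parsed. The sketch below assumes the recipe has a top-level `process` list whose entries are one-key mappings of op name to parameters; that layout is an assumption about your recipe, so adjust the traversal if yours differs. `lint_recipe` is a hypothetical helper, and the example recipe is made up for illustration.

```python
def lint_recipe(recipe):
    """Return a list of problems found in a parsed replaysim recipe dict."""
    problems = []
    if "text_keys" not in recipe:
        problems.append("top level: missing `text_keys`")

    for index, entry in enumerate(recipe.get("process", [])):
        # Each entry is assumed to be a one-key mapping: {op_name: {params}}.
        for op_name, params in entry.items():
            params = params or {}
            where = f"process[{index}] {op_name}"
            if "text_keys" in params:
                problems.append(f"{where}: use singular `text_key` inside an op")
            if str(params.get("text_key", "")).endswith("_messages"):
                problems.append(f"{where}: `text_key` points at a list-of-dicts field")
    return problems


# Illustrative recipe with one correct op and two misconfigured ones.
recipe = {
    "text_keys": "chosen",
    "process": [
        {"text_length_filter": {"text_key": "chosen", "min_len": 1}},
        {"text_length_filter": {"text_keys": "rejected", "min_len": 1}},
        {"words_num_filter": {"text_key": "chosen_messages"}},
    ],
}
for problem in lint_recipe(recipe):
    print(problem)
```

In practice you would feed it the result of loading your recipe file with a YAML parser. The `_messages` check will also flag chat-aware ops that legitimately read the list shape, so treat its output as a prompt to look, not as an error.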
| --- | |
| ### 7. `ValueError: expected (seq,) shape, got (B, T)` from PRIME-RL composer_loss | |
| **SYMPTOM.** You wired the PRIME-RL recipe into a training loop you | |
| adapted from another framework (TRL, openrlhf, etc.), and on the very | |
| first `loss_fn` call you get a `ValueError` mentioning shape | |
| `(seq,)` versus `(B, T)`. | |
| **DIAGNOSIS.** PRIME-RL calls its loss function **one sample at a | |
| time**, with 1-D `(seq,)` tensors — not batched `(B, T)` tensors. The | |
| recipe's docstring spells this out at | |
| `composer_replication/recipes/prime_rl/composer_loss.py:16–30`: | |
| > Note the **per-sample (seq,) shape** — PRIME-RL's runner calls the | |
| > loss function one sample at a time, not on a batched (B, T) tensor. | |
| Wave 14 fixed an earlier draft of the recipe that incorrectly assumed | |
| `(B, T)`. The new version raises a clear `ValueError` if you hand it | |
| the wrong shape, instead of silently broadcasting and producing | |
| nonsense gradients. Users who are used to TRL or openrlhf — both of | |
| which call the loss with batched tensors — see this on day one. | |
| **FIX.** | |
| - If you are running inside PRIME-RL via its `CustomLossConfig`, you | |
| don't need to do anything: PRIME-RL's runner produces `(seq,)` | |
| tensors and the recipe accepts them. | |
| - If you are calling the recipe directly from your own runner, slice | |
| your batch into per-sample 1-D tensors before each call: | |
| ```python | |
| for b in range(B): | |
|     inputs_b = LossInputs( | |
|         trainer_logprobs=batched.trainer_logprobs[b], | |
|         inference_logprobs=batched.inference_logprobs[b], | |
|         advantages=batched.advantages[b], | |
|         loss_mask=batched.loss_mask[b], | |
|         teacher_logprobs=None if batched.teacher_logprobs is None | |
|         else batched.teacher_logprobs[b], | |
|     ) | |
|     loss = loss_fn(inputs_b, ...) | |
| ``` | |
| - If you genuinely need a batched API, write a thin wrapper around | |
| `loss_fn`. Don't patch the recipe — its shape contract is dictated | |
| by PRIME-RL, not by us. | |
| **VERIFICATION.** The shape contract is pinned by two tests in | |
| `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`: | |
| - `test_advantages_shape_validates_seq_accepted` — `(seq,)` succeeds. | |
| - `test_advantages_shape_validates_bt_rejected` — `(B, T)` raises | |
| `ValueError`. | |
| ``` | |
| pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs | |
| ``` | |
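If you do want a batched entry point, the thin wrapper can be as small as the sketch below. It is duck-typed: it works with anything indexable along the first dimension (tensors, arrays, or nested lists) and leaves the reduction over samples to the caller. `per_sample_losses` is a hypothetical name, and passing a plain dict to `loss_fn` is an assumption made to keep the example self-contained; with the recipe you would build a `LossInputs` from each sample instead.

```python
def per_sample_losses(loss_fn, batch):
    """Call `loss_fn` once per sample and return the list of results.

    `batch` maps field names to objects indexable along their first
    dimension; a value of None (e.g. no teacher) is passed through as None.
    """
    sizes = {len(value) for value in batch.values() if value is not None}
    if len(sizes) != 1:
        raise ValueError(f"inconsistent batch sizes: {sorted(sizes)}")
    (batch_size,) = sizes

    results = []
    for b in range(batch_size):
        # Indexing the first dimension turns (B, T) data into (seq,) data.
        sample = {
            name: None if value is None else value[b]
            for name, value in batch.items()
        }
        results.append(loss_fn(sample))
    return results
```

Keeping the loop outside the recipe preserves the `(seq,)` contract that PRIME-RL dictates while giving your own runner a `(B, T)` interface.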
| --- | |
| ### 8. TAID can't run mid-training because `student_init_logits` is missing | |
| **SYMPTOM.** You decide partway through a training run to enable | |
| `sdpo_wrapper="taid"` (e.g. you read the TAID paper after step 2000 | |
| and want to retrofit). The next training step blows up — either with | |
| a `KeyError` for `student_init_logits` / `student_init_input_ids`, or | |
| with a strange-looking loss because the framework fell back to | |
| re-running a forward pass through the *current* (drifted) model | |
| instead of the init model. | |
| **DIAGNOSIS.** TAID interpolates between the **student's distribution | |
| at step 0** and the teacher's distribution. From the TAID docstring at | |
| `composer_replication/distillation/taid.py:10–24`: | |
| > TAID interpolates between an "identity" target (the student's own | |
| > distribution at step 0) and the teacher's distribution, with the | |
| > interpolation coefficient annealed from 0 → 1 over training. | |
| That step-0 reference target has to come from somewhere. The framework | |
| accepts it via either: | |
| 1. `inputs["student_init_logits"]` — a precomputed `(B, T, V)` tensor | |
| captured at training start (preferred for production), OR | |
| 2. `inputs["student_init_input_ids"]` — input ids for a frozen forward | |
| pass through `model`. **This assumes `model` has not yet drifted | |
| from init.** It is correct only at step 0 or in tests; in | |
| production it silently produces the wrong target. | |
| If you forgot to capture the init logits at step 0, you cannot | |
| faithfully use TAID mid-run. | |
| **FIX.** Capture init logits at step 0 and persist them: | |
| ```python | |
| import torch | |
| # At step 0, before any optimizer.step() call: | |
| with torch.no_grad(): | |
|     init_logits = model(input_ids=batch["input_ids"]).logits | |
| # Save to disk if you'll need them across restarts | |
| # (the checkpoints/ directory must already exist): | |
| torch.save(init_logits, "checkpoints/init_logits_batch0.pt") | |
| inputs["student_init_logits"] = init_logits | |
| # Or, if you have a fixed eval probe set, capture init logits once | |
| # for that fixed set and reuse them every step: | |
| inputs["student_init_logits"] = cached_init_logits | |
| ``` | |
| If you genuinely have no step-0 snapshot, **TAID is not retrofittable** | |
| to your run. Your options are: | |
| - Restart from a checkpoint that *was* the step-0 model. | |
| - Use a different distillation wrapper (`sdpo_wrapper="entropy_opd"`) | |
| that doesn't need init logits. | |
| - Accept the bias from the live-model fallback path. Don't. | |
| **VERIFICATION.** The precomputed-vs-live-fallback contract is exercised by: | |
| - `test_taid_accepts_precomputed_student_init_logits` in | |
| `composer_replication/tests/test_compose_loss_integration.py` — | |
| passes precomputed logits and asserts the TAID-wrapped channel uses | |
| them. | |
| - `test_taid_alpha_one_recovers_sdpo` — asserts that with | |
| `alpha_min=alpha_max=1.0` (i.e. pure teacher target, init logits | |
| ignored) TAID reproduces standard SDPO. If your training ignores | |
| init logits silently, *this* is the test that would have failed. | |
| ``` | |
| pytest composer_replication/tests/test_compose_loss_integration.py::test_taid_accepts_precomputed_student_init_logits -xvs | |
| ``` | |
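To see why the step-0 snapshot matters, it helps to write the interpolation out on a toy distribution. The sketch below assumes a linear schedule and a probability-space mix of the two distributions. Both are simplifying assumptions for illustration; the framework's TAID implementation may anneal differently or interpolate in logit space. `taid_alpha` and `interpolated_target` are hypothetical names.

```python
def taid_alpha(step, total_steps, alpha_min=0.0, alpha_max=1.0):
    """Linearly anneal the interpolation coefficient over training."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return alpha_min + (alpha_max - alpha_min) * progress


def interpolated_target(p_init, p_teacher, alpha):
    """Mix the frozen step-0 student distribution with the teacher's."""
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(p_init, p_teacher)]


p_init = [0.7, 0.2, 0.1]     # student distribution captured at step 0
p_teacher = [0.1, 0.3, 0.6]  # teacher distribution for the same position

for step in (0, 500, 1000):
    alpha = taid_alpha(step, total_steps=1000)
    print(step, interpolated_target(p_init, p_teacher, alpha))
```

Early in training the target is dominated by `p_init`. If that slot is filled from the live model instead, the early target tracks whatever the student currently predicts, and the anchoring TAID is meant to provide is lost.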
| --- | |
| ### 9. `ModalExecutor()` or `HFJobsExecutor()` raises `NotImplementedError` at construction | |
| **SYMPTOM.** You write | |
| `executor = ModalExecutor(app_name="my-app")` (or the HF Jobs | |
| equivalent) in a production script and the constructor immediately | |
| raises: | |
| ``` | |
| NotImplementedError: ModalExecutor is a v0 skeleton; full implementation pending. | |
| Use LocalProcessExecutor for testing. | |
| ``` | |
| Same for `HFJobsExecutor`. This is at *init time*, not at the first | |
| `launch_replicas` call. | |
| **DIAGNOSIS.** Per ADR-005 the v0 release ships only the | |
| `ServerlessExecutor` Protocol and the reference `LocalProcessExecutor`. | |
| The Modal and HF Jobs implementations are **import-safe skeletons** — | |
| the classes exist and you can `from … import ModalExecutor`, but | |
| `__init__` raises `NotImplementedError` to prevent silent partial | |
| behavior. See `modal.py:64` and `hf_jobs.py:64`. | |
| This is intentional. We didn't want to ship a half-working Modal | |
| executor that succeeds at `launch_replicas` and then silently fails | |
| two-thirds of the way through `collect`. | |
| **FIX.** | |
| - Use `LocalProcessExecutor` for development, CI, and any single-host | |
| multi-process testing. | |
| - For real cloud deployment in the v0 era, run your training script | |
| directly in Modal/HF Jobs by hand: write your own thin Modal | |
| function that constructs `MockManager(ObjectStoreAllReduce(uri, | |
| rank, world_size))` and runs the training loop. The skeleton | |
| docstrings at `modal.py:24–48` and `hf_jobs.py:26–49` show exactly | |
| the pattern. | |
| - Watch the `BACKLOG.md` for v0 polish — the real implementations are | |
| scheduled. | |
| **VERIFICATION.** That `LocalProcessExecutor` is fully functional and | |
| correctly implements the Protocol is locked in by: | |
| - `test_local_executor_runs_allreduce_across_replicas` in | |
| `composer_replication/diloco/serverless/tests/test_serverless_local.py` | |
| — runs N replicas locally, performs an allreduce across them. | |
| - `test_local_executor_handles_multiple_rounds` | |
| - `test_local_executor_reports_failed_replicas` | |
| If those tests pass, your serverless DiLoCo machinery works — only the | |
| specific cloud adapters are missing. The skeletons themselves are not | |
| under test (raising in `__init__` is the contract). | |
| ``` | |
| pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs | |
| ``` | |
| --- | |
| ### 10. DPPO mask drops every token — "loss became 0" or "no gradients" | |
| **SYMPTOM.** You ported a PPO config from another framework (KL | |
| penalty + clip ε=0.2 + value loss), wired it into the PRIME-RL recipe | |
| with the default `dppo_mask_high=0.2` / `dppo_mask_low=0.2`, and the | |
| training loss is suspiciously close to zero. Inspecting the recipe's | |
| internal `keep_mask` shows nearly every token is being masked out. | |
| **DIAGNOSIS.** PRIME-RL's "DPPO mask" is **not** the same as PPO | |
| clipping, and not even the same as a log-ratio threshold. From the | |
| recipe docstring at | |
| `composer_replication/recipes/prime_rl/composer_loss.py` (mirroring | |
| PRIME-RL upstream `prime_rl/trainer/rl/loss.py` lines 137-148): | |
| > The mask gate is on **probability-space** | |
| > `probs_diff = exp(trainer_lp) - exp(inference_lp)`, NOT on the | |
| > log-ratio. A positive-advantage token is dropped iff | |
| > `probs_diff > dppo_mask_high`; a negative-advantage token iff | |
| > `probs_diff < -dppo_mask_low`. Masked tokens are **dropped from the | |
| > policy-gradient term** but still contribute to the KL penalty. | |
| The defaults `dppo_mask_high=dppo_mask_low=0.2` match PRIME-RL's | |
| `DefaultLossConfig`. Because the gate is on probability-space, the | |
| "in-band" zone is | |
| `exp(trainer_lp) ∈ [exp(inference_lp) - 0.2, exp(inference_lp) + 0.2]`. | |
| For a token with inference probability ~0.5 this is a fairly tight | |
| band; for tokens at probability ~0.001 or ~0.999 the same threshold | |
| behaves very differently from a log-ratio bound. This is by design — | |
| PRIME-RL is bounding the absolute change in token probability, not the | |
| multiplicative change. | |
| The two failure modes: | |
| 1. **All tokens masked.** Trainer and inference engines disagree | |
| sharply (fp16 vs bf16, stale rollout cache, mismatched chat | |
| templates) and `probs_diff` exceeds 0.2 almost everywhere. | |
| 2. **No tokens masked.** Trainer ≈ inference (e.g. you forgot to step | |
| the optimizer between rollouts) so the bound is never binding and | |
| the policy never sees any DPPO regularization. | |
| **FIX.** Inspect the empirical `probs_diff` distribution before | |
| tuning: | |
| ```python | |
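| # In your training loop: | |
| probs_diff = torch.exp(trainer_logprobs) - torch.exp(inference_logprobs) | |
| # torch.quantile needs a float32/float64 input and q on the same device. | |
| abs_diff = probs_diff.detach().abs().float().flatten() | |
| q = torch.tensor([0.5, 0.9, 0.99], device=abs_diff.device) | |
| print(torch.quantile(abs_diff, q)) | |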
| ``` | |
| For a healthy on-policy run with bf16 trainer + bf16 inference and | |
| fresh rollouts, the 99th percentile of `|probs_diff|` should sit well | |
| below `0.2`. If yours doesn't, the upstream divergence is the | |
| problem, not the bound. Bumping `dppo_mask_high/low` to 0.5 or 1.0 is | |
| a workaround but it disables the trust-region intent of DPPO. | |
| **Do not** translate PPO ε=0.2 directly. PPO ε=0.2 is a multiplicative | |
| bound on the importance ratio (`ratio ∈ [0.8, 1.2]`, i.e. | |
| `log_ratio ∈ [log 0.8, log 1.2] ≈ [-0.22, 0.18]`); DPPO's 0.2 is an | |
| **additive probability-space** bound. The semantics are different and | |
| the defaults are deliberately tight in probability space. | |
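To see how far apart the two bounds are, the sketch below converts DPPO's additive band into the range of importance ratios it permits at a few inference probabilities, and prints PPO's fixed ratio window next to it. It is plain arithmetic with no framework code; the helper name `additive_band_as_ratio` is made up for this example.

```python
def additive_band_as_ratio(p_inference, bound=0.2):
    """Ratio range p_trainer / p_inference allowed by |p_trainer - p_inference| <= bound."""
    lo = max(p_inference - bound, 0.0)   # probabilities cannot go below 0
    hi = min(p_inference + bound, 1.0)   # ... or above 1
    return lo / p_inference, hi / p_inference


ppo_window = (1 - 0.2, 1 + 0.2)  # PPO eps=0.2 clips the ratio itself

for p in (0.001, 0.5, 0.999):
    lo, hi = additive_band_as_ratio(p)
    print(f"p_inference={p}: DPPO allows ratio in [{lo:.3f}, {hi:.3f}], "
          f"PPO allows [{ppo_window[0]:.1f}, {ppo_window[1]:.1f}]")
```

At `p_inference=0.5` the additive band corresponds to ratios between 0.6 and 1.4. At `p_inference=0.001` it permits a ratio of roughly 200, which is why a PPO-style ε cannot be carried over unchanged.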
| If you genuinely want to disable the mask (e.g. for bug-isolation), | |
| pass `dppo_mask_high=1e6, dppo_mask_low=1e6` (both are | |
| `Field(..., ge=0)` upstream — negative values are rejected by | |
| both PRIME-RL and our adapter). There is a regression test for | |
| exactly this knob. | |
| **VERIFICATION.** | |
| - `test_dppo_mask_high_drops_positive_advantage_outliers` and | |
| `test_dppo_mask_low_drops_negative_advantage_outliers` in | |
| `composer_replication/recipes/prime_rl/tests/test_composer_loss.py` | |
| — assert that out-of-bound tokens are dropped from the | |
| policy-gradient term (with the upstream sign-of-advantage gate). | |
| - `test_dppo_mask_sign_conditioned_on_advantage` — asserts that a | |
| positive-advantage token with a large *negative* probs_diff is NOT | |
| dropped (PRIME-RL only checks the upper bound for positive-advantage | |
| tokens). | |
| - `test_dppo_bounds_can_be_disabled` — asserts that very wide bounds | |
| (`1e6`) pass every token through. | |
| - `test_parity_with_prime_rl_default_loss_fn` — when `prime-rl` is | |
| installed, runs identical inputs through PRIME-RL upstream and our | |
| adapter and asserts the loss matches. | |
| ``` | |
| pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs | |
| ``` | |
| --- | |
| ### 11. `compose_loss` runs but the GRPO channel doesn't behave like real GRPO | |
| **SYMPTOM.** You read the README, saw the "3-channel composition: GRPO | |
| + SDPO + trace-replay DPO" tagline, called `compose_loss(model, | |
| inputs)` directly in your training loop, and your reward curve never | |
| moves the way it would in a real GRPO trainer. Or: you compared | |
| against a TRL `GRPOTrainer` baseline and `compose_loss` produces | |
| totally different numbers. | |
| **DIAGNOSIS.** From the docstring at the top of | |
| `composer_replication/loss.py:1–16`: | |
| > This is a verification-harness mirror of | |
| > `ComposerReplicationTrainer._compute_loss` that does NOT depend on | |
| > TRL's GRPOTrainer parent. The GRPO channel is replaced with standard | |
| > LM next-token-prediction cross-entropy, which is the limit GRPO | |
| > converges to under deterministic rewards. | |
| > | |
| > Use it for: CPU smokes on real HF models, unit tests of loss | |
| > composition without spinning up TRL, anywhere we want to verify | |
| > gradient flow through the 3-channel sum without paying TRL's full | |
| > machinery cost. | |
| > | |
| > **Do NOT use it as the production training loss.** Production = | |
| > ComposerReplicationTrainer (a real GRPOTrainer subclass). | |
| The `lm_ce` field of the `LossComponents` dataclass fills the slot the | |
| README tagline calls "GRPO", but it is a **stub**: plain | |
| language-modeling cross-entropy. It is the | |
| correct channel for verification (gradient flow, channel weighting, | |
| distillation wiring), but it is not GRPO's surrogate objective and | |
| will never produce the same numbers as real GRPO under stochastic | |
| rewards. | |
| Real GRPO requires: | |
| - A reward model or rule-based reward, | |
| - Per-prompt advantage estimation across G samples, | |
| - An importance-sampling-ratio clip / mask. | |
| Those live in TRL's `GRPOTrainer`, in our PRIME-RL recipe at | |
| `composer_replication/recipes/prime_rl/composer_loss.py`, or (when | |
| shipped) in a future VeRL recipe. | |
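As a concrete picture of the second requirement, the sketch below computes group-relative advantages for G sampled completions of one prompt by standardizing the rewards within the group. It shows the general GRPO idea only. It is not the code used by TRL or the PRIME-RL recipe, the function name is hypothetical, and implementations differ on details such as whether to divide by the standard deviation and which epsilon to use.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group of G samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# G = 4 completions for a single prompt, scored by a rule-based reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)

# If every sample gets the same reward, all advantages are zero and the
# group contributes no policy-gradient signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Plain cross-entropy has no notion of a group or a reward, so nothing like this step exists inside `compose_loss`.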
| **FIX.** | |
| - For production GRPO training, do **not** call `compose_loss` directly. | |
| Instead use one of: | |
| - `composer_replication.trainer.composer_trainer.ComposerReplicationTrainer` | |
| — TRL `GRPOTrainer` subclass, full machinery. | |
| - `composer_replication.recipes.prime_rl.composer_loss.loss_fn` — | |
| PRIME-RL's `CustomLossConfig` adapter (channel 1 is real DPPO-clipped GRPO). | |
| - For ablations, smokes, and unit tests, `compose_loss` is the right | |
| tool — but log the `lm_ce` channel as `lm_ce`, not as `grpo`. The | |
| `LossComponents` dataclass already names the field correctly; if | |
| your wandb logger relabels it as "GRPO loss", fix the label. | |
| **VERIFICATION.** | |
| - The 11-test integration suite at | |
| `composer_replication/tests/test_compose_loss_integration.py` only | |
| asserts gradient flow + bit-exact composition; it deliberately does | |
| not assert any GRPO-specific property of `compose_loss`. That's the | |
| contract. | |
| - The PRIME-RL recipe's real DPPO+KL behavior is asserted by | |
| `test_returns_finite_scalar`, | |
| `test_dppo_mask_high_drops_positive_advantage_outliers`, | |
| `test_dppo_mask_sign_conditioned_on_advantage`, and | |
| `test_parity_with_prime_rl_default_loss_fn` (skip-marked when | |
| `prime-rl` is not installed) | |
| in `composer_replication/recipes/prime_rl/tests/test_composer_loss.py`. | |
| Those tests verify a real importance-sampling-ratio gradient with | |
| PRIME-RL's advantage-conditioned mask, which `compose_loss` would | |
| not pass. | |
| If you find yourself wanting `compose_loss` to behave like real GRPO, | |
| that is the signal to switch to one of the production paths above. | |
| ``` | |
| pytest composer_replication/tests/test_compose_loss_integration.py::test_defaults_bit_exact_with_legacy_kwargs -xvs | |
| pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py::test_returns_finite_scalar -xvs | |
| ``` | |
| --- | |
| ### 12. `monarch` / `data-juicer` / `prime-rl` install (Wave 16) | |
| **SYMPTOM.** `pip install -e ".[monarch]"`, `pip install -e ".[prime-rl]"`, | |
| or `pip install -e ".[replaysim]"` fails immediately with a uv/pip | |
| resolver error similar to: | |
| ``` | |
| × No solution found when resolving dependencies: | |
| ╰─▶ Because only monarch<=0.1.11 is available and | |
| composer-replication[monarch] depends on monarch>=0.4.1, we can | |
| conclude that composer-replication[monarch]'s requirements are | |
| unsatisfiable. | |
| ``` | |
| **DIAGNOSIS.** Three upstream packages the framework integrates with are | |
| not currently pip-installable in their advertised versions: | |
| 1. **Meta's Monarch** is published on PyPI as | |
| `torchmonarch-nightly` (nightly wheels with platform constraints), not | |
| as `monarch`. The PyPI name `monarch` is unrelated to Meta's actor | |
| framework and tops out at `0.1.11`. | |
| 2. **Prime Intellect's prime-rl** is not registered on PyPI at all. It | |
| is published from source only. | |
| 3. **data-juicer** is not registered on PyPI under that exact name. The | |
| closest match (`py-data-juicer==1.0.0`) has broken transitive deps; | |
| newer `py-data-juicer` releases work but install ~150 transitive | |
| packages. | |
| Wave 16 dropped all three extras from `pyproject.toml` rather than ship | |
| unsatisfiable pins. The framework code paths that touch these libraries | |
| import them lazily, so: | |
| - `composer_replication.recipes.monarch` is a documentation skeleton | |
| that does NOT require monarch installed. | |
| - `composer_replication.recipes.prime_rl.composer_loss` imports cleanly | |
| without prime-rl; the upstream parity test is `@skipif`-gated and the | |
| in-file shadow-parity test still verifies the loss formula | |
| independently. | |
| - `composer_replication.replaysim.normalize.DJNormalizer(skip_dj=True)` | |
| works without `data_juicer`; only the full DJNormalizer code path | |
| needs it. | |
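To check which of these optional libraries are importable in the current environment without triggering an import error, a small standard-library probe is enough. This is a sketch, not part of the framework: the function name is hypothetical, and the import names are assumptions (`prime_rl` and `data_juicer` appear in this document; `monarch` is assumed to be the module that `torchmonarch-nightly` provides).

```python
import importlib.util


def optional_dependency_status(module_names=("monarch", "prime_rl", "data_juicer")):
    """Map each top-level module name to True if it can be imported."""
    # find_spec returns None for a missing top-level module instead of raising.
    return {name: importlib.util.find_spec(name) is not None
            for name in module_names}


for name, present in optional_dependency_status().items():
    print(f"{name}: {'installed' if present else 'not installed (lazy paths only)'}")
```

The same `find_spec(...) is not None` check is a common condition for `pytest.mark.skipif`, which is the kind of gating the upstream parity test relies on.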
| **FIX.** If you want any of these libraries' real functionality, install | |
| them separately alongside the framework (from source where no usable | |
| wheel exists): | |
| ``` | |
| # Meta Monarch (actor framework — see ADR-006) | |
| pip install torchmonarch-nightly # OR install from source: | |
| # git clone https://github.com/meta-pytorch/monarch && cd monarch && pip install -e . | |
| # Prime Intellect prime-rl (Recipe C — see ADR-006) | |
| git clone https://github.com/PrimeIntellect-ai/prime-rl | |
| cd prime-rl && pip install -e . | |
| # data-juicer (replaysim normalization — see ADR-004) | |
| git clone https://github.com/modelscope/data-juicer | |
| cd data-juicer && pip install -e . | |
| ``` | |
| **VERIFICATION.** A fresh checkout install with all surviving extras | |
| should succeed: | |
| ``` | |
| uv venv --clear | |
| uv pip install -e ".[diloco,replay,replaysim,train,dev]" | |
| source .venv/bin/activate | |
| python -m pytest -q # baseline 176 passed / 8 skipped | |
| ``` | |
| If any of those extras fails to resolve, file a bug report — Wave 16 | |
| verified the full extras matrix installs from a clean venv on Python | |
| 3.11. | |
| --- | |
| ## How to file a bug report | |
| If you've read the relevant section above and your problem persists, | |
| file a bug. Include **all** sections of the template below — the most | |
| common reason a maintainer can't repro is a missing piece of | |
| environmental context. | |
| ````markdown | |
| ### What I expected vs what happened | |
| (One paragraph.) | |
| ### Repro steps | |
| 1. ... | |
| 2. ... | |
| 3. ... | |
| Minimal self-contained snippet (no `from my_local_thing import …`): | |
| ```python | |
| # repro.py | |
| from composer_replication import compose_loss | |
| ... | |
| ``` | |
| ### Environment | |
| - OS: (uname -a or `ver` on Windows) | |
| - Python: (python --version) | |
| - composer-replication: (pip show composer-replication | head -3) | |
| - torch: (python -c "import torch; print(torch.__version__)") | |
| - torchft: (python -c "import torchft; print(torchft.__version__)" || echo "n/a") | |
| - transformers / trl: (versions, or "not installed") | |
| - data-juicer / fsspec: (versions, or "not installed") | |
| - s3fs / gcsfs / adlfs: (versions if relevant) | |
| - GPU: (nvidia-smi -L or "CPU only") | |
| - Install method: pip install -e . / wheel / other | |
| - Extras installed: [replay] [replaysim] [serverless] [dev] | |
| ### What you've already tried | |
| - [ ] Read the relevant Failure Mode section of docs/TROUBLESHOOTING.md | |
| (which one: ___) | |
| - [ ] Ran `pytest <relevant test path>` and confirmed those tests pass | |
| - [ ] Ran the repro snippet in a fresh venv | |
| - [ ] Confirmed it reproduces on Python 3.11 (if you were on 3.12 / 3.13) | |
| ### Logs | |
| (Full traceback. If it's a wrong-loss-curve rather than an exception, | |
| paste loss values for the first 10 steps and link any wandb/tb run.) | |
| ### Hypothesis | |
| (Optional. If you have a guess at where the bug is, name the file + | |
| line number. We'll look there first.) | |
| ```` | |
| A few rules: | |
| - **Do not** paste API keys, AWS credentials, or HuggingFace tokens. | |
| - **Do** include the failing test name if you've narrowed it to one. | |
| - **Do** distinguish "never worked" from "regressed between commit X | |
| and Y." A regression-bisect goes straight to the front of the queue. | |
| - **One bug per issue.** Multi-headed reports lose items in triage. | |
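Most of the Environment section of the template can be filled in with a short standard-library script instead of running each command by hand. This is a convenience sketch, not a tool shipped with the framework: the function names are hypothetical, and the distribution names in `PACKAGES` are assumptions based on the template and the install notes above (for example `py-data-juicer` for data-juicer).

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names are assumptions based on the template and the install notes.
PACKAGES = [
    "composer-replication", "torch", "torchft", "transformers", "trl",
    "py-data-juicer", "fsspec", "s3fs", "gcsfs", "adlfs",
]


def package_version(name):
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return version(name)
    except PackageNotFoundError:
        return "not installed"


def environment_report(packages=PACKAGES):
    """Build the bullet list for the Environment section of the template."""
    lines = [
        f"- OS: {platform.platform()}",
        f"- Python: {sys.version.split()[0]}",
    ]
    lines += [f"- {name}: {package_version(name)}" for name in packages]
    return "\n".join(lines)


print(environment_report())
```

Paste the output into the issue, then add the GPU, install method, and extras lines by hand, since those cannot be read from package metadata.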
| The Wave-14 surface area is large, but the test suite covers it | |
| densely — every section above corresponds to a green test that proves | |
| the fix worked. | |