
TROUBLESHOOTING — Wave 14

This document catalogs every Wave-14-known failure mode in the Composer Replication Framework, along with how to diagnose, fix, and verify each one. It is intentionally surgical: the surface area added in Waves 12–14 (SimPO/TAID/Entropy-OPD distillation kwargs, the PRIME-RL composer-loss adapter, the serverless DiLoCo MockManager + ObjectStoreAllReduce path, and the data-juicer-backed replaysim normalizer) introduced new ways for users to trip themselves up. Each failure mode here is something a maintainer has actually seen or anticipated during the cross-model review of Wave 14.

If you hit something not covered below, jump to the How to file a bug report section at the end — the template there gives a maintainer everything they need to reproduce.


Common things to check first

Before reading any further, run through this checklist. ~80% of "framework broken" reports turn out to be one of these:

  1. Python version. The framework targets Python 3.10–3.12. The pyproject.toml target-version is py310. If you are on 3.13+, transitive deps (notably Ray, pulled in by data-juicer) may not yet ship wheels and will try to build from source. Run python --version.

  2. Fresh virtual environment. Mixing the framework into an existing environment that already has torch, transformers, trl, or torchft pinned to incompatible versions is the #1 source of import-time errors. Create a new venv: python -m venv .venv && source .venv/bin/activate && pip install -e .[dev].

  3. Editable install. Most contributors run pip install -e . so that local edits to composer_replication/ are picked up. If you pip install composer-replication from a registry instead, your edits to the source tree will be ignored. Confirm with pip show composer-replication | grep -i location; on recent pip versions an editable install also reports an "Editable project location" line that should point at your checkout.

  4. Optional extras. Several modules are optional-dep gated:

    • [replay] — adds httpx (used for OpenRouter teacher calls).
    • [train] — adds TRL, peft, accelerate, datasets (production GRPO).
    • [replaysim] — adds data-juicer (and via it, Ray as a transitive).
    • [serverless] — adds fsspec. For non-local rendezvous URIs you also need a backend-specific fsspec adapter (see Failure Mode 5).
    • [dev] — adds pytest, ruff, etc.

    If you see ModuleNotFoundError: No module named 'data_juicer', you forgot the extra. Install with pip install -e .[replaysim].

  5. Run the test suite first. Before debugging anything, run the subset of tests touching the area you care about:

    pytest composer_replication/tests/                 # core compose_loss
    pytest composer_replication/distillation/tests/    # SimPO / TAID / OPD
    pytest composer_replication/recipes/prime_rl/tests/  # PRIME-RL adapter
    pytest composer_replication/diloco/serverless/tests/ # MockManager + DiLoCo
    pytest composer_replication/replaysim/tests/       # data-juicer normalizer
    

    If any green test fails for you locally, the problem is environmental — fix that before digging into your own code.

  6. Read the docstring of the symbol you're calling. Wave 14 docstrings are written to be the first line of documentation. The compose_loss docstring (composer_replication/loss.py) lists every required and optional input key. The MockManager docstring enumerates the torchft surface methods it implements.
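The Python-version and optional-extras checks above can be scripted. The following is a minimal sketch and not part of the framework: the helper name `preflight` is hypothetical, and the import names listed for each extra are assumptions inferred from the package names above (for example `data_juicer` for [replaysim]).

```python
import importlib.util
import sys

# Import names assumed for each extra; adjust to match your pyproject.toml.
EXTRA_MODULES = {
    "replay": ["httpx"],
    "train": ["trl", "peft", "accelerate", "datasets"],
    "replaysim": ["data_juicer"],
    "serverless": ["fsspec"],
    "dev": ["pytest"],
}


def preflight(version_info=sys.version_info):
    """Return (python_ok, missing), where missing maps extra -> absent modules."""
    # The framework targets Python 3.10 through 3.12.
    python_ok = (3, 10) <= tuple(version_info[:2]) <= (3, 12)
    missing = {}
    for extra, modules in EXTRA_MODULES.items():
        absent = [m for m in modules if importlib.util.find_spec(m) is None]
        if absent:
            missing[extra] = absent
    return python_ok, missing


if __name__ == "__main__":
    ok, missing = preflight()
    print(f"Python {sys.version.split()[0]} supported: {ok}")
    for extra, modules in missing.items():
        print(f"[{extra}] missing: {', '.join(modules)}")
```

A missing module only matters if you use the corresponding code path; the script reports, it does not install anything.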


Failure modes

1. pip install -e .[replaysim] hangs or fails on Python 3.12 with a Ray-related path error

SYMPTOM. Installing the [replaysim] extra (which pulls data-juicer) triggers a transitive install of Ray. On Python 3.12, the first import ray (often during pip build hooks or the first time data-juicer is loaded) fails with messages mentioning /tmp/ray/session_* paths, missing pyarrow symbols, or OSError: [Errno 2] No such file or directory: '/dev/shm/ray-...' inside Docker.

DIAGNOSIS. data-juicer declares ray as a transitive dependency. On Python 3.12 the wheel matrix is incomplete for some Ray versions, and Ray's first-import probes /dev/shm and /tmp/ray for its session state. In a sandboxed container, restricted CI runner, or WSL environment with a non-default /tmp, those probes fail. Wave 14 subagent T2 hit this in CI and worked around it by pinning Ray and by making sure /tmp exists and is writable.

FIX.

  • Prefer Python 3.11 if you're on 3.12+ and don't need 3.12 features.
  • If you must stay on 3.12, ensure /tmp is writable and pre-create the session directory: mkdir -p /tmp/ray && chmod 1777 /tmp/ray.
  • In Docker, mount a real tmpfs at /dev/shm: docker run --shm-size=2g ….
  • If you don't need replaysim normalization, you can skip the extra entirely. The DJNormalizer(skip_dj=True) passthrough (see composer_replication/replaysim/normalize.py:165) does not import data_juicer and therefore does not import Ray.
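Before installing or importing anything Ray-related, you can confirm that the directories named in the symptom are usable. This is a minimal sketch with a hypothetical helper name (`writable_dir`); it does not start Ray, it only checks that each directory exists and accepts a temporary file.

```python
import os
import tempfile


def writable_dir(path):
    """Return True if `path` is an existing directory we can create files in."""
    if not os.path.isdir(path):
        return False
    try:
        # Creating (and automatically deleting) a temp file proves write access.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Paths taken from the error messages described above.
    for path in ("/tmp", "/tmp/ray", "/dev/shm"):
        status = "ok" if writable_dir(path) else "missing or not writable"
        print(f"{path}: {status}")
```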

VERIFICATION. The skip-dj passthrough is exercised by test_dj_normalizer_skip_dj_passthrough and test_dj_normalizer_skip_dj_preserves_count in composer_replication/replaysim/tests/test_replaysim.py. Both run without data_juicer installed:

pytest composer_replication/replaysim/tests/test_replaysim.py::test_dj_normalizer_skip_dj_passthrough -xvs

If that passes in your environment, your [replaysim]-less install is healthy — only the full data-juicer code path requires Ray.


2. compose_loss produces wrong-looking numbers when combining new kwargs

SYMPTOM. You pass several Wave-14 distillation kwargs to compose_loss (e.g. dpo_variant="simpo", sdpo_wrapper="taid", taid_schedule_step=0, simpo_beta=2.0, entropy_opd_h_max=…), and the loss curve looks wrong: NaNs, identically-zero sdpo_jsd channel, or a total that is bit-different from your reference run with no distillation kwargs at all.

DIAGNOSIS. compose_loss now has 13 keyword arguments and the contract between them is non-trivial. Subagent T1's review identified three combinations that look reasonable but are unsupported:

  • Passing taid_schedule_step without taid_total_steps (or vice versa). The function raises ValueError clearly, but the message can scroll past in noisy logs.
  • Passing dpo_variant="simpo" while still supplying dpo_chosen_ref_logprobs. Those keys are silently ignored — SimPO is reference-free.
  • Passing sdpo_wrapper="taid" without supplying either student_init_logits OR student_init_input_ids in inputs. The function will fall back to a forward pass through the (possibly drifted) live model, which is a footgun late in training (see Failure Mode 8).

FIX. Read the docstring at the top of composer_replication/loss.py (lines 25–39 list the three pluggable losses and their preconditions). The general rule:

from composer_replication import compose_loss

# Defaults (no distillation knobs) reproduce legacy 3-channel composition bit-exact.
out = compose_loss(model, inputs)

# To opt into SimPO, pass dpo_variant ONLY. Do not pass ref-logprob keys.
out = compose_loss(model, inputs, dpo_variant="simpo",
                   simpo_beta=2.0, simpo_gamma=1.0)

# To opt into TAID, pass BOTH schedule_step AND total_steps, AND make sure
# inputs["student_init_logits"] is populated (see Failure Mode 8).
out = compose_loss(model, inputs, sdpo_wrapper="taid",
                   taid_schedule_step=step, taid_total_steps=total_steps)

Setting all 13 kwargs to their defaults is bit-exact equivalent to the pre-Wave-13 3-channel loss; if your defaults call gives different numbers than your old code, file a bug.
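The three unsupported combinations from the DIAGNOSIS can be caught before the call. The helper below is a hypothetical pre-flight check, not a framework API: it never calls compose_loss, and the assumption that reference keys contain the substring `ref_logprobs` is inferred from the `dpo_chosen_ref_logprobs` key named above.

```python
def check_compose_loss_kwargs(inputs, **kwargs):
    """Return a list of warnings for the kwarg combinations described above."""
    problems = []

    # TAID schedule: both values or neither. `is not None` keeps step=0 valid.
    has_step = kwargs.get("taid_schedule_step") is not None
    has_total = kwargs.get("taid_total_steps") is not None
    if has_step != has_total:
        problems.append("pass taid_schedule_step and taid_total_steps together")

    # SimPO is reference-free, so reference log-prob keys are silently ignored.
    if kwargs.get("dpo_variant") == "simpo":
        ref_keys = sorted(k for k in inputs if "ref_logprobs" in k)  # assumed naming
        if ref_keys:
            problems.append(f"SimPO ignores reference keys: {ref_keys}")

    # TAID needs a step-0 reference target from one of these two keys.
    if kwargs.get("sdpo_wrapper") == "taid":
        if not {"student_init_logits", "student_init_input_ids"} & set(inputs):
            problems.append("TAID needs student_init_logits in inputs")

    return problems


if __name__ == "__main__":
    demo_inputs = {"dpo_chosen_ref_logprobs": [0.0]}
    for problem in check_compose_loss_kwargs(
        demo_inputs, dpo_variant="simpo", sdpo_wrapper="taid", taid_schedule_step=0
    ):
        print("WARNING:", problem)
```

Logging these warnings once at startup keeps them from scrolling past in noisy training logs.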

VERIFICATION. The bit-exact equivalence and every supported combination is locked in by the 11 integration tests in composer_replication/tests/test_compose_loss_integration.py. The most important ones:

  • test_defaults_bit_exact_with_legacy_kwargs — passing the new kwargs at their defaults is identical to legacy.
  • test_simpo_does_not_require_ref_logprobs — SimPO works with the ref-logprob keys absent from inputs.
  • test_taid_alpha_one_recovers_sdpo — TAID with alpha_min=alpha_max=1 reproduces standard SDPO.
  • test_taid_requires_schedule_step / test_taid_requires_total_steps — the partial-config error path.
pytest composer_replication/tests/test_compose_loss_integration.py -xvs

3. MockManager works today but silently breaks after a torchft upgrade

SYMPTOM. Your serverless DiLoCo run starts, the first outer round completes, and then torchft.DiLoCo raises an AttributeError on something like _use_async_quorum, should_commit, or current_step — or worse, it silently uses the wrong sync semantics.

DIAGNOSIS. MockManager is a duck-typed shim that mirrors torchft.Manager rather than subclassing it. The surface it implements is enumerated in the docstring at composer_replication/diloco/serverless/allreduce.py:215:

Methods/attributes DiLoCo touches: allreduce, should_commit, start_quorum, current_step, disallow_state_dict_read, allow_state_dict_read, register_state_dict_fn, _use_async_quorum (attribute), num_participants, rank.

The two private members in that list — _use_async_quorum and the internal current_step counter — are private torchft API that may be renamed without notice in any torchft minor release. Wave 14 subagent T3 specifically called this out: "If torchft renames _use_async_quorum to anything else, MockManager silently breaks because there is nothing holding the contract beyond a string."

FIX.

  • Pin torchft. In pyproject.toml keep your torchft version pinned to a known-good range (e.g. torchft>=0.2,<0.4). When you need to upgrade, do so deliberately and re-run the integration tests below before merging.
  • Watch the deprecation warning. Wave 14 sets up a clear path to warn if _use_async_quorum is read on a fresh instance — see the comment at allreduce.py:255.
  • Don't pass an arbitrary torchft branch. If you've patched torchft locally, the MockManager may need updating in lockstep. The surface-compatibility tests below will catch this in CI.
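When you do upgrade torchft deliberately, a quick attribute check can flag a renamed member before a run starts. This is a minimal sketch: `missing_surface` is a hypothetical helper, and the name list is copied from the MockManager docstring quoted above rather than read from torchft itself.

```python
# Surface list copied from the MockManager docstring quoted above.
DILOCO_SURFACE = (
    "allreduce", "should_commit", "start_quorum", "current_step",
    "disallow_state_dict_read", "allow_state_dict_read",
    "register_state_dict_fn", "_use_async_quorum", "num_participants", "rank",
)


def missing_surface(manager, names=DILOCO_SURFACE):
    """Return the names from `names` that `manager` does not expose."""
    return [name for name in names if not hasattr(manager, name)]
```

Run it against constructed instances rather than classes, because attributes assigned in `__init__` (such as `_use_async_quorum`) do not exist on the class object. An empty result for MockManager only says the names are present; the integration tests below remain the check on behavior.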

VERIFICATION. The full DiLoCo × MockManager surface is exercised by:

  • test_mock_manager_shape_compat in composer_replication/diloco/serverless/tests/test_serverless_local.py — sanity check that all expected methods/attributes exist.
  • test_mockmanager_has_full_diloco_call_surface in composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py — runs an end-to-end outer round through real torchft DiLoCo, hitting every method on the surface list above.
  • test_mockmanager_diloco_outer_round_completes — full one-round smoke ending in a successful outer SGD step.

If any of these tests turn red after a torchft bump, do not ship: inspect the new torchft Manager surface and update MockManager to match.

pytest composer_replication/diloco/serverless/tests/test_serverless_diloco_integration.py -xvs

4. SimPO loss curve looks like noise

SYMPTOM. You wired in dpo_variant="simpo", the run starts, and the trace_replay_dpo channel either drifts to large negative values (→ total blows up) or oscillates with much higher variance than standard DPO. The loss curve "looks like noise."

DIAGNOSIS. SimPO uses average per-token log-probability (Σ logπ(c_t) / |c|), not sum log-prob. From the SimPO docstring (composer_replication/distillation/simpo.py:11–18):

SimPO drops the reference-policy term, replaces it with a target margin γ, and uses average sequence log-probability instead of sum. […] L_SimPO = -log σ( β · [avg_logπ(c) - avg_logπ(r)] - γ )

If you compute chosen_logprobs.sum() (or any unmasked aggregation) and hand it to SimPO as chosen_avg_logprobs, the loss is mis-scaled rather than visibly broken: β=2.0 times a summed log-prob grows with response length, whereas β=2.0 times an average does not. Each batch still yields a plausible-looking number, but the objective now reacts to length differences between chosen and rejected, and its optimum is nowhere near the dataset's true preference signal.

FIX. Use the helper composer_replication.distillation.simpo.avg_sequence_logprob:

from composer_replication.distillation.simpo import (
    simpo_loss, avg_sequence_logprob,
)

chosen_avg = avg_sequence_logprob(chosen_logprobs, chosen_response_mask)
rejected_avg = avg_sequence_logprob(rejected_logprobs, rejected_response_mask)
loss = simpo_loss(chosen_avg, rejected_avg, beta=2.0, gamma=1.0)

The mask is 1 on response tokens, 0 on prompt+padding — same convention as the rest of the framework. If you must roll your own aggregation, divide by response_mask.sum(dim=-1).clamp_min(1.0), not by response_mask.shape[-1].
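To see why the aggregation matters, here is a toy calculation in plain Python. It is a sketch of the formula quoted in the DIAGNOSIS, not the framework's implementation; `simpo_loss_scalar` and `masked_avg` are hypothetical names, and the log-prob values are made up for illustration.

```python
import math


def simpo_loss_scalar(chosen_score, rejected_score, beta=2.0, gamma=1.0):
    """-log sigmoid(beta * (chosen - rejected) - gamma), per the formula above."""
    z = beta * (chosen_score - rejected_score) - gamma
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def masked_avg(logprobs, mask):
    """Average log-prob over response tokens only (mask == 1)."""
    kept = [lp for lp, m in zip(logprobs, mask) if m]
    return sum(kept) / max(len(kept), 1)


# Toy pair: identical per-token quality (-0.5), but the rejected response is longer.
chosen, chosen_mask = [-0.5] * 4, [1] * 4
rejected, rejected_mask = [-0.5] * 40, [1] * 40

avg_loss = simpo_loss_scalar(masked_avg(chosen, chosen_mask),
                             masked_avg(rejected, rejected_mask))
sum_loss = simpo_loss_scalar(sum(chosen), sum(rejected))
print(f"average-based loss: {avg_loss:.4f}")
print(f"sum-based loss:     {sum_loss:.4f}")
```

With averages the two responses score the same, so the loss reflects only the unmet margin γ. With sums the longer rejected response looks far worse purely because it has more tokens, and the loss collapses toward zero even though the pair carries no preference signal.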

VERIFICATION. The avg-vs-sum semantics are pinned by test_avg_sequence_logprob in composer_replication/distillation/tests/test_distillation_losses.py, which constructs known per-token log-probs and asserts the helper returns the correct per-sequence average. The end-to-end SimPO loss-shape check is test_simpo_loss_returns_scalar in the same file.

pytest composer_replication/distillation/tests/test_distillation_losses.py::test_avg_sequence_logprob -xvs
pytest composer_replication/distillation/tests/test_distillation_losses.py::test_simpo_loss_lower_for_better_separation -xvs

5. ObjectStoreAllReduce works locally but fails on s3:// at first allreduce

SYMPTOM. You construct ObjectStoreAllReduce(uri="s3://my-bucket/run42/", rank=0, world_size=4). The constructor succeeds. The first call to allreduce(tensor, name="...") raises ImportError: Install s3fs to access S3 or botocore.exceptions.NoCredentialsError: Unable to locate credentials.

DIAGNOSIS. ObjectStoreAllReduce uses fsspec to reach the backend, but fsspec itself only registers the remote protocol names; the implementations for s3://, gs://, and az:// live in separate packages. The constructor does not eagerly validate the URI, so it accepts any protocol and the failure only surfaces at the first allreduce. The s3:// adapter requires:

  1. The s3fs package (pip install s3fs), which is not in the default [serverless] extra.
  2. Working AWS credentials (env vars, ~/.aws/credentials, IAM role, or whatever your environment normally provides to boto3).

The same is true for gs:// (gcsfs), az:// (adlfs), and hf:// (huggingface_hub's fsspec integration, which is included if you have huggingface_hub installed).

FIX.

  • Install the right adapter alongside the framework:
    pip install s3fs       # for s3://
    pip install gcsfs      # for gs://
    pip install adlfs      # for az://
    
  • Verify credentials work outside the framework first:
    python -c "import s3fs; print(s3fs.S3FileSystem().ls('my-bucket'))"
    
  • If you're running on Modal/HF Jobs, set the credentials as Modal secrets / HF Jobs env vars in the executor config — not in your local shell.

The constructor could in principle perform an eager probe (e.g. a HEAD on the rendezvous prefix) to fail fast at init time. Wave 14 deliberately did not add this because it adds a network round-trip on every replica startup. If you want pre-flight validation in your training script, call fsspec.filesystem(protocol).ls(uri) yourself before constructing the manager.
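A cheaper pre-flight step that needs no network access is to check that the adapter package for your URI scheme is importable at all. This is a minimal sketch: `missing_adapter` is a hypothetical helper, and the scheme-to-package table contains only the adapters named in this section.

```python
import importlib.util
from urllib.parse import urlparse

# Adapter packages named in this section, keyed by URI scheme.
ADAPTERS = {"s3": "s3fs", "gs": "gcsfs", "az": "adlfs", "hf": "huggingface_hub"}


def missing_adapter(uri):
    """Return the adapter package `uri` needs if it is not importable, else None."""
    scheme = urlparse(uri).scheme
    package = ADAPTERS.get(scheme)
    if package is None:
        # file://, bare paths, or a scheme this table does not cover.
        return None
    return None if importlib.util.find_spec(package) else package


if __name__ == "__main__":
    needed = missing_adapter("s3://my-bucket/run42/")
    if needed:
        print(f"pip install {needed}")
```

This catches only the ImportError half of the symptom. Credentials still have to be verified against the real backend, as in the s3fs one-liner above.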

VERIFICATION. The file:// and bare-path code paths — the only ones that don't need an extra adapter — are exercised by:

  • test_object_store_allreduce_local_paths_create_dir
  • test_object_store_allreduce_world_size_1_passthrough
  • test_object_store_allreduce_round_id_increments

…all in composer_replication/diloco/serverless/tests/test_serverless_local.py. If those pass and your s3:// URI fails, the framework is fine and your fsspec adapter or credentials are the problem.

pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs

6. Custom replaysim recipe drops every record (or crashes data-juicer)

SYMPTOM. You wrote a custom replaysim YAML recipe modeled on composer_replication/recipes/replaysim/default.yaml. It loads without error, but every input DPO pair is dropped, OR data-juicer raises KeyError: 'text_key', OR it raises a complaint about "expected str, got list" inside one of the filters.

DIAGNOSIS. Wave 14 fixed two related bugs in the default recipe that custom-recipe authors will hit again. Both are documented in the header comment at composer_replication/recipes/replaysim/default.yaml:21–35:

  1. text_keys plural vs text_key singular. The top-level dataset contract uses text_keys: chosen (plural). Each individual op uses text_key: chosen (singular). They are not interchangeable. data-juicer's dataset loader validates that the text_keys field exists on every record before any op runs; an op that uses text_keys instead of text_key is silently misconfigured.

  2. chosen / rejected as strings vs as list-of-dicts. data-juicer ops like text_length_filter, words_num_filter, special_characters_filter, and document_deduplicator read a single string field. Pointing them at the chat-messages list (chosen_messages, rejected_messages) crashes or silently no-ops. The framework's _dpo_pair_to_dj_record keeps both shapes side-by-side: chosen/rejected (strings) for filter ops, and chosen_messages/rejected_messages (chat-messages list) for chat-aware ops + the NormalizedDPOPair round-trip.

FIX. Treat the default recipe as your starting template. Concretely:

  • Always declare text_keys: chosen at the top.
  • For every length/word/special-char op you add, duplicate it: once with text_key: chosen, once with text_key: rejected. (Each op takes only one text_key — see comment at lines 31–35 of default.yaml.)
  • Never point a filter op at chosen_messages or rejected_messages. Those are list-of-dicts; only chat-aware ops accept that shape.
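These three rules can be checked mechanically on a parsed recipe. The sketch below is hypothetical and makes one loud assumption: that the recipe, once loaded from YAML, is a dict with a top-level `process` list whose entries are single-key mappings of op name to arguments. Verify that against default.yaml before relying on it. It also flags every op that points at a chat-messages field, so genuinely chat-aware ops would need an allowlist.

```python
MESSAGE_FIELDS = {"chosen_messages", "rejected_messages"}


def lint_recipe(recipe):
    """Return a list of problems found in an already-parsed recipe dict."""
    problems = []
    if "text_keys" not in recipe:
        problems.append("top level: missing text_keys")
    # Assumed layout: process is a list of {op_name: {arg: value}} mappings.
    for index, op in enumerate(recipe.get("process", [])):
        for name, args in op.items():
            args = args or {}
            where = f"process[{index}] {name}"
            if "text_keys" in args:
                problems.append(f"{where}: uses text_keys, expected text_key")
            if args.get("text_key") in MESSAGE_FIELDS:
                problems.append(f"{where}: text_key points at a chat-messages list")
    return problems


if __name__ == "__main__":
    recipe = {
        "process": [
            {"text_length_filter": {"text_keys": "chosen"}},
            {"words_num_filter": {"text_key": "chosen_messages"}},
        ]
    }
    for problem in lint_recipe(recipe):
        print(problem)
```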

VERIFICATION. The two-shape contract is locked in by:

  • test_record_chosen_rejected_are_flat_strings_for_dj_text_ops — asserts chosen and rejected are bare strings on every record produced by _dpo_pair_to_dj_record.
  • test_record_chosen_rejected_messages_carry_chat_shape — asserts chosen_messages / rejected_messages exist as list-of-dicts.
  • test_dj_normalizer_e2e_default_recipe(tmp_path) — runs the actual default recipe through real data-juicer end-to-end (skipped if data_juicer isn't importable).

…all in composer_replication/replaysim/tests/test_replaysim.py. If those pass and your custom recipe still drops everything, diff your YAML against default.yaml until the two shapes align.

pytest composer_replication/replaysim/tests/test_replaysim.py -xvs

7. ValueError: expected (seq,) shape, got (B, T) from PRIME-RL composer_loss

SYMPTOM. You wired the PRIME-RL recipe into a training loop you adapted from another framework (TRL, openrlhf, etc.), and on the very first loss_fn call you get a ValueError mentioning shape (seq,) versus (B, T).

DIAGNOSIS. PRIME-RL calls its loss function one sample at a time, with 1-D (seq,) tensors — not batched (B, T) tensors. The recipe's docstring spells this out at composer_replication/recipes/prime_rl/composer_loss.py:16–30:

Note the per-sample (seq,) shape — PRIME-RL's runner calls the loss function one sample at a time, not on a batched (B, T) tensor.

Wave 14 fixed an earlier draft of the recipe that incorrectly assumed (B, T). The new version raises a clear ValueError if you hand it the wrong shape, instead of silently broadcasting and producing nonsense gradients. Users who are used to TRL or openrlhf — both of which call the loss with batched tensors — see this on day one.

FIX.

  • If you are running inside PRIME-RL via its CustomLossConfig, you don't need to do anything: PRIME-RL's runner produces (seq,) tensors and the recipe accepts them.
  • If you are calling the recipe directly from your own runner, slice your batch into per-sample 1-D tensors before each call:
    for b in range(B):
        inputs_b = LossInputs(
            trainer_logprobs=batched.trainer_logprobs[b],
            inference_logprobs=batched.inference_logprobs[b],
            advantages=batched.advantages[b],
            loss_mask=batched.loss_mask[b],
            teacher_logprobs=None if batched.teacher_logprobs is None
                            else batched.teacher_logprobs[b],
        )
        loss = loss_fn(inputs_b, ...)
    
  • If you genuinely need a batched API, write a thin wrapper around loss_fn. Don't patch the recipe — its shape contract is dictated by PRIME-RL, not by us.
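One possible shape for such a wrapper is sketched below. It is an illustration under stated assumptions, not framework code: the `LossInputs` dataclass here is a stand-in with the same field names as the loop above, `batched_loss` is a hypothetical name, and averaging the per-sample losses is an assumption about the reduction you want. Any extra arguments your `loss_fn` takes should be bound beforehand, for example with functools.partial.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class LossInputs:
    """Stand-in with the fields used in the per-sample loop above."""
    trainer_logprobs: Any
    inference_logprobs: Any
    advantages: Any
    loss_mask: Any
    teacher_logprobs: Optional[Any] = None


def batched_loss(loss_fn: Callable[[LossInputs], Any], batched: LossInputs) -> Any:
    """Call a per-sample loss_fn on each row of `batched` and average the results."""
    batch_size = len(batched.advantages)
    teacher = batched.teacher_logprobs
    losses = []
    for b in range(batch_size):
        # Indexing row b turns a (B, T) field into the (seq,) shape loss_fn expects.
        sample = LossInputs(
            trainer_logprobs=batched.trainer_logprobs[b],
            inference_logprobs=batched.inference_logprobs[b],
            advantages=batched.advantages[b],
            loss_mask=batched.loss_mask[b],
            teacher_logprobs=None if teacher is None else teacher[b],
        )
        losses.append(loss_fn(sample))
    return sum(losses) / batch_size
```

The wrapper relies only on `len()` and row indexing, so it works the same way for nested lists and for 2-D tensors.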

VERIFICATION. The shape contract is pinned by two tests in composer_replication/recipes/prime_rl/tests/test_composer_loss.py:

  • test_advantages_shape_validates_seq_accepted: (seq,) succeeds.
  • test_advantages_shape_validates_bt_rejected: (B, T) raises ValueError.

pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs

8. TAID can't run mid-training because student_init_logits is missing

SYMPTOM. You decide partway through a training run to enable sdpo_wrapper="taid" (e.g. you read the TAID paper after step 2000 and want to retrofit). The next training step blows up — either with a KeyError for student_init_logits / student_init_input_ids, or with a strange-looking loss because the framework fell back to re-running a forward pass through the current (drifted) model instead of the init model.

DIAGNOSIS. TAID interpolates between the student's distribution at step 0 and the teacher's distribution. From the TAID docstring at composer_replication/distillation/taid.py:10–24:

TAID interpolates between an "identity" target (the student's own distribution at step 0) and the teacher's distribution, with the interpolation coefficient annealed from 0 → 1 over training.

That step-0 reference target has to come from somewhere. The framework accepts it via either:

  1. inputs["student_init_logits"] — a precomputed (B, T, V) tensor captured at training start (preferred for production), OR
  2. inputs["student_init_input_ids"] — input ids for a frozen forward pass through model. This assumes model has not yet drifted from init. It is correct only at step 0 or in tests; in production it silently produces the wrong target.

If you forgot to capture the init logits at step 0, you cannot faithfully use TAID mid-run.

FIX. Capture init logits at step 0 and persist them:

import torch

# At step 0, before any optimizer.step() call:
with torch.no_grad():
    init_logits = model(input_ids=batch["input_ids"]).logits
    # Save to disk if you'll need them across restarts:
    torch.save(init_logits, "checkpoints/init_logits_batch0.pt")
    inputs["student_init_logits"] = init_logits

# Or, if you have a fixed eval probe set, capture init logits once
# for that fixed set and reuse them every step:
inputs["student_init_logits"] = cached_init_logits

If you genuinely have no step-0 snapshot, TAID is not retrofittable to your run. Your options are:

  • Restart from a checkpoint that was the step-0 model.
  • Use a different distillation wrapper (sdpo_wrapper="entropy_opd") that doesn't need init logits.
  • Accept the bias from the live-model fallback path. Don't.
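To make the last option impossible to take by accident, a training script can refuse to run TAID when only the live-model fallback is available after step 0. This is a hypothetical guard, not a framework API; the two input keys are the ones described in the DIAGNOSIS above.

```python
def check_taid_reference(inputs, step):
    """Return which reference source TAID would use, or raise if it is unsafe."""
    if "student_init_logits" in inputs:
        return "precomputed"
    if "student_init_input_ids" in inputs and step == 0:
        # At step 0 the live model is still the init model, so a forward pass is valid.
        return "live-forward"
    raise ValueError(
        f"TAID at step {step} needs inputs['student_init_logits'] captured at step 0"
    )
```

Calling this once per step, before the loss call, turns a silently biased target into an immediate error.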

VERIFICATION. The precomputed-vs-live-fallback contract is exercised by:

  • test_taid_accepts_precomputed_student_init_logits in composer_replication/tests/test_compose_loss_integration.py — passes precomputed logits and asserts the TAID-wrapped channel uses them.
  • test_taid_alpha_one_recovers_sdpo — asserts that with alpha_min=alpha_max=1.0 (i.e. pure teacher target, init logits ignored) TAID reproduces standard SDPO. If your training ignores init logits silently, this is the test that would have failed.
pytest composer_replication/tests/test_compose_loss_integration.py::test_taid_accepts_precomputed_student_init_logits -xvs

9. ModalExecutor() or HFJobsExecutor() raises NotImplementedError at construction

SYMPTOM. You write executor = ModalExecutor(app_name="my-app") (or the HF Jobs equivalent) in a production script and the constructor immediately raises:

NotImplementedError: ModalExecutor is a v0 skeleton; full implementation pending.
Use LocalProcessExecutor for testing.

Same for HFJobsExecutor. This is at init time, not at the first launch_replicas call.

DIAGNOSIS. Per ADR-005 the v0 release ships only the ServerlessExecutor Protocol and the reference LocalProcessExecutor. The Modal and HF Jobs implementations are import-safe skeletons — the classes exist and you can from … import ModalExecutor, but __init__ raises NotImplementedError to prevent silent partial behavior. See modal.py:64 and hf_jobs.py:64.

This is intentional. We didn't want to ship a half-working Modal executor that succeeds at launch_replicas and then silently fails two-thirds of the way through collect.

FIX.

  • Use LocalProcessExecutor for development, CI, and any single-host multi-process testing.
  • For real cloud deployment in the v0 era, run your training script directly in Modal/HF Jobs by hand: write your own thin Modal function that constructs MockManager(ObjectStoreAllReduce(uri, rank, world_size)) and runs the training loop. The skeleton docstrings at modal.py:24–48 and hf_jobs.py:26–49 show exactly the pattern.
  • Watch the BACKLOG.md for v0 polish — the real implementations are scheduled.

VERIFICATION. That LocalProcessExecutor is fully functional and correctly implements the Protocol is locked in by:

  • test_local_executor_runs_allreduce_across_replicas in composer_replication/diloco/serverless/tests/test_serverless_local.py — runs N replicas locally, performs an allreduce across them.
  • test_local_executor_handles_multiple_rounds
  • test_local_executor_reports_failed_replicas

If those tests pass, your serverless DiLoCo machinery works — only the specific cloud adapters are missing. The skeletons themselves are not under test (raising in __init__ is the contract).

pytest composer_replication/diloco/serverless/tests/test_serverless_local.py -xvs

10. DPPO mask drops every token — "loss became 0" or "no gradients"

SYMPTOM. You ported a PPO config from another framework (KL penalty + clip ε=0.2 + value loss), wired it into the PRIME-RL recipe with the default dppo_mask_high=0.2 / dppo_mask_low=0.2, and the training loss is suspiciously close to zero. Inspecting the recipe's internal keep_mask shows nearly every token is being masked out.

DIAGNOSIS. PRIME-RL's "DPPO mask" is not the same as PPO clipping, and not even the same as a log-ratio threshold. From the recipe docstring at composer_replication/recipes/prime_rl/composer_loss.py (mirroring PRIME-RL upstream prime_rl/trainer/rl/loss.py lines 137-148):

The mask gate is on probability-space probs_diff = exp(trainer_lp) - exp(inference_lp), NOT on the log-ratio. A positive-advantage token is dropped iff probs_diff > dppo_mask_high; a negative-advantage token iff probs_diff < -dppo_mask_low. Masked tokens are dropped from the policy-gradient term but still contribute to the KL penalty.

The defaults dppo_mask_high=dppo_mask_low=0.2 match PRIME-RL's DefaultLossConfig. Because the gate is in probability space, the "in-band" zone is exp(trainer_lp) ∈ [exp(inference_lp) - 0.2, exp(inference_lp) + 0.2]. For a token with inference probability ~0.5 this is the band [0.3, 0.7]; for tokens at probability ~0.001 or ~0.999 the same threshold behaves very differently from a log-ratio bound, because probabilities cannot leave [0, 1]: a token at 0.001 can never trip the lower bound, and a token at 0.999 can never trip the upper one. This is by design: PRIME-RL is bounding the absolute change in token probability, not the multiplicative change.

The two failure modes:

  1. All tokens masked. Trainer and inference engines disagree sharply (fp16 vs bf16, stale rollout cache, mismatched chat templates) and probs_diff exceeds 0.2 almost everywhere.
  2. No tokens masked. Trainer ≈ inference (e.g. you forgot to step the optimizer between rollouts) so the bound is never binding and the policy never sees any DPPO regularization.

FIX. Inspect the empirical probs_diff distribution before tuning:

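# In your training loop. Cast to float32 first: torch.quantile rejects
# bf16/fp16 inputs, and q must match the input's dtype and device.
probs_diff = (torch.exp(trainer_logprobs) - torch.exp(inference_logprobs)).float()
q = torch.tensor([0.5, 0.9, 0.99], device=probs_diff.device)
print(torch.quantile(probs_diff.abs().flatten(), q))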

For a healthy on-policy run with bf16 trainer + bf16 inference and fresh rollouts, the central 99% of |probs_diff| should sit well below 0.2. If yours doesn't, the upstream divergence is the problem, not the bound. Bumping dppo_mask_high/low to 0.5 or 1.0 is a workaround but it disables the trust-region intent of DPPO.

Do not translate PPO ε=0.2 directly. PPO ε=0.2 bounds the probability ratio multiplicatively (ratio within [0.8, 1.2], i.e. log-ratio between log(0.8) ≈ -0.22 and log(1.2) ≈ 0.18); DPPO's 0.2 is an additive probability-space bound. The semantics are different and the defaults are deliberately tight in probability space.
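The difference is easiest to see on individual tokens. The sketch below implements the gate exactly as quoted from the recipe docstring above, next to a simplified PPO-style ratio check. Both function names are hypothetical, the PPO check ignores the advantage sign for brevity, and treating zero-advantage tokens as kept is an assumption.

```python
import math


def dppo_keep(trainer_lp, inference_lp, advantage, high=0.2, low=0.2):
    """Apply the gate quoted above to one token; True means the token is kept."""
    probs_diff = math.exp(trainer_lp) - math.exp(inference_lp)
    if advantage > 0:
        return probs_diff <= high   # dropped iff probs_diff > high
    if advantage < 0:
        return probs_diff >= -low   # dropped iff probs_diff < -low
    return True  # zero advantage: neither condition applies (assumption)


def ppo_in_clip_range(trainer_lp, inference_lp, eps=0.2):
    """Simplified PPO check: is the probability ratio inside [1 - eps, 1 + eps]?"""
    ratio = math.exp(trainer_lp - inference_lp)
    return 1 - eps <= ratio <= 1 + eps


# (trainer probability, inference probability) for three positive-advantage tokens.
tokens = {
    "rare, 0.001 -> 0.05": (0.05, 0.001),
    "mid,  0.5 -> 0.65": (0.65, 0.5),
    "mid,  0.5 -> 0.75": (0.75, 0.5),
}
for name, (p_trainer, p_inference) in tokens.items():
    t_lp, i_lp = math.log(p_trainer), math.log(p_inference)
    print(f"{name}: DPPO keeps={dppo_keep(t_lp, i_lp, advantage=1.0)}, "
          f"PPO in range={ppo_in_clip_range(t_lp, i_lp)}")
```

The rare token grows 50-fold yet moves less than 0.2 in absolute probability, so the gate keeps it while a ratio clip would not. That is why a config tuned around PPO's ε does not carry over.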

If you genuinely want to disable the mask (e.g. for bug-isolation), pass dppo_mask_high=1e6, dppo_mask_low=1e6 (both are Field(..., ge=0) upstream — negative values are rejected by both PRIME-RL and our adapter). There is a regression test for exactly this knob.

VERIFICATION.

  • test_dppo_mask_high_drops_positive_advantage_outliers and test_dppo_mask_low_drops_negative_advantage_outliers in composer_replication/recipes/prime_rl/tests/test_composer_loss.py — assert that out-of-bound tokens are dropped from the policy-gradient term (with the upstream sign-of-advantage gate).
  • test_dppo_mask_sign_conditioned_on_advantage — asserts that a positive-advantage token with a large negative probs_diff is NOT dropped (PRIME-RL only checks the upper bound for positive-advantage tokens).
  • test_dppo_bounds_can_be_disabled — asserts that very wide bounds (1e6) pass every token through.
  • test_parity_with_prime_rl_default_loss_fn — when prime-rl is installed, runs identical inputs through PRIME-RL upstream and our adapter and asserts the loss matches.
pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py -xvs

11. compose_loss runs but the GRPO channel doesn't behave like real GRPO

SYMPTOM. You read the README, saw the "3-channel composition: GRPO + SDPO + trace-replay DPO" tagline, called compose_loss(model, inputs) directly in your training loop, and your reward curve never moves the way it would in a real GRPO trainer. Or: you compared against a TRL GRPOTrainer baseline and compose_loss produces totally different numbers.

DIAGNOSIS. From the docstring at the top of composer_replication/loss.py:1–16:

This is a verification-harness mirror of ComposerReplicationTrainer._compute_loss that does NOT depend on TRL's GRPOTrainer parent. The GRPO channel is replaced with standard LM next-token-prediction cross-entropy, which is the limit GRPO converges to under deterministic rewards.

Use it for: CPU smokes on real HF models, unit tests of loss composition without spinning up TRL, anywhere we want to verify gradient flow through the 3-channel sum without paying TRL's full machinery cost.

Do NOT use it as the production training loss. Production = ComposerReplicationTrainer (a real GRPOTrainer subclass).

The lm_ce channel labelled "GRPO" in the LossComponents dataclass is a stub: it is plain language-modeling cross-entropy. It is the correct channel for verification (gradient flow, channel weighting, distillation wiring), but it is not GRPO's surrogate objective and will never produce the same numbers as real GRPO under stochastic rewards.

Real GRPO requires:

  • A reward model or rule-based reward,
  • Per-prompt advantage estimation across G samples,
  • An importance-sampling-ratio clip / mask.

Those live in TRL's GRPOTrainer, in our PRIME-RL recipe at composer_replication/recipes/prime_rl/composer_loss.py, or (when shipped) in a future VeRL recipe.
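For orientation, the second ingredient (per-prompt advantage estimation across G samples) looks roughly like this in its textbook form. This is a generic sketch, not code from the framework, TRL, or PRIME-RL; the function name is hypothetical, and using the population standard deviation with a small epsilon is an assumption, since implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of G samples drawn for one prompt within the group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt, scored by a rule-based reward.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# If every sample earns the same reward, all advantages are zero.
print(group_relative_advantages([0.5, 0.5, 0.5]))
```

The second call shows one way the two objectives diverge: when a group's rewards are identical, the policy-gradient term contributes nothing, whereas the lm_ce channel in compose_loss still produces a gradient from its cross-entropy targets.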

FIX.

  • For production GRPO training, do not call compose_loss directly. Instead use one of:
    • composer_replication.trainer.composer_trainer.ComposerReplicationTrainer — TRL GRPOTrainer subclass, full machinery.
    • composer_replication.recipes.prime_rl.composer_loss.loss_fn — PRIME-RL's CustomLossConfig adapter (channel 1 is real DPPO-clipped GRPO).
  • For ablations, smokes, and unit tests, compose_loss is the right tool — but log the lm_ce channel as lm_ce, not as grpo. The LossComponents dataclass already names the field correctly; if your wandb logger relabels it as "GRPO loss", fix the label.

VERIFICATION.

  • The 11-test integration suite at composer_replication/tests/test_compose_loss_integration.py only asserts gradient flow + bit-exact composition; it deliberately does not assert any GRPO-specific property of compose_loss. That's the contract.
  • The PRIME-RL recipe's real DPPO+KL behavior is asserted by test_returns_finite_scalar, test_dppo_mask_high_drops_positive_advantage_outliers, test_dppo_mask_sign_conditioned_on_advantage, and test_parity_with_prime_rl_default_loss_fn (skip-marked when prime-rl is not installed) in composer_replication/recipes/prime_rl/tests/test_composer_loss.py. Those tests verify a real importance-sampling-ratio gradient with PRIME-RL's advantage-conditioned mask, which compose_loss would not pass.

If you find yourself wanting compose_loss to behave like real GRPO, that is the signal to switch to one of the production paths above.

To spot-check one test from each suite:

pytest composer_replication/tests/test_compose_loss_integration.py::test_defaults_bit_exact_with_legacy_kwargs -xvs
pytest composer_replication/recipes/prime_rl/tests/test_composer_loss.py::test_returns_finite_scalar -xvs

10. monarch / data-juicer / prime-rl install (Wave 16)

SYMPTOM. pip install -e ".[monarch]", pip install -e ".[prime-rl]", or pip install -e ".[replaysim]" fails immediately with a uv/pip resolver error similar to:

× No solution found when resolving dependencies:
  ╰─▶ Because only monarch<=0.1.11 is available and
      composer-replication[monarch] depends on monarch>=0.4.1, we can
      conclude that composer-replication[monarch]'s requirements are
      unsatisfiable.

DIAGNOSIS. Three upstream packages the framework integrates with cannot currently be installed from PyPI under the names and versions the extras pinned:

  1. Meta's Monarch is published on PyPI as torchmonarch-nightly (nightly wheels with platform constraints), not as monarch. The PyPI name monarch is unrelated to Meta's actor framework and tops out at 0.1.11.
  2. Prime Intellect's prime-rl is not registered on PyPI at all. It is published from source only.
  3. data-juicer is not registered on PyPI under that exact name. The closest match (py-data-juicer==1.0.0) has broken transitive deps; newer py-data-juicer releases work but install ~150 transitive packages.

Wave 16 dropped all three extras from pyproject.toml rather than ship unsatisfiable pins. The framework code paths that touch these libraries import them lazily, so:

  • composer_replication.recipes.monarch is a documentation skeleton that does NOT require monarch installed.
  • composer_replication.recipes.prime_rl.composer_loss imports cleanly without prime-rl; the upstream parity test is @skipif-gated and the in-file shadow-parity test still verifies the loss formula independently.
  • composer_replication.replaysim.normalize.DJNormalizer(skip_dj=True) works without data_juicer; only the full DJNormalizer code path needs it.
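Because these imports are lazy, a missing library only surfaces when the gated code path runs. The sketch below probes for the optional libraries up front without importing them. The import names are assumptions: data_juicer appears in this document, while monarch and prime_rl are guesses at the upstream top-level module names. Note also that the unrelated PyPI package named monarch could make the first probe report a false positive.

```python
from importlib.util import find_spec

# Assumed top-level import names; adjust if upstream uses different ones.
OPTIONAL_MODULES = ("monarch", "prime_rl", "data_juicer")


def is_available(module_name):
    """Return True if the module can be located, without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for broken parents or modules lacking a spec.
        return False


def optional_dep_report(modules=OPTIONAL_MODULES):
    """Map each optional module name to whether it is importable here."""
    return {name: is_available(name) for name in modules}


if __name__ == "__main__":
    for name, present in optional_dep_report().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A "missing" result for all three is the expected state of a default Wave 16 install and is not an error by itself.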

FIX. If you need the real functionality of any of these libraries, install it separately alongside the framework, either from the correctly named wheel or from source:

# Meta Monarch (actor framework; see ADR-006)
pip install torchmonarch-nightly       # OR install from source:
# git clone https://github.com/meta-pytorch/monarch && pip install -e ./monarch

# Prime Intellect prime-rl (Recipe C; see ADR-006)
# Installing by path avoids a `cd` that would leave later commands in the wrong directory.
git clone https://github.com/PrimeIntellect-ai/prime-rl
pip install -e ./prime-rl

# data-juicer (replaysim normalization; see ADR-004)
git clone https://github.com/modelscope/data-juicer
pip install -e ./data-juicer

VERIFICATION. A fresh checkout install with all surviving extras should succeed:

uv venv --clear
uv pip install -e ".[diloco,replay,replaysim,train,dev]"
source .venv/bin/activate
python -m pytest -q                    # baseline 176 passed / 8 skipped

If any of those extras fails to resolve, file a bug report — Wave 16 verified the full extras matrix installs from a clean venv on Python 3.11.


How to file a bug report

If you've read the relevant section above and your problem persists, file a bug. Include all sections of the template below — the most common reason a maintainer can't repro is a missing piece of environmental context.

### What I expected vs what happened
(One paragraph.)

### Repro steps
1. ...
2. ...
3. ...

Minimal self-contained snippet (no `from my_local_thing import …`):

```python
# repro.py
from composer_replication import compose_loss
...
```

### Environment

  • OS: (uname -a or ver on Windows)
  • Python: (python --version)
  • composer-replication: (pip show composer-replication | head -3)
  • torch: (python -c "import torch; print(torch.__version__)")
  • torchft: (python -c "import torchft; print(torchft.__version__)" || echo "n/a")
  • transformers / trl: (versions, or "not installed")
  • data-juicer / fsspec: (versions, or "not installed")
  • s3fs / gcsfs / adlfs: (versions if relevant)
  • GPU: (nvidia-smi -L or "CPU only")
  • Install method: pip install -e . / wheel / other
  • Extras installed: [replay] [replaysim] [serverless] [dev]

### What you've already tried

  • Read the relevant Failure Mode section of docs/TROUBLESHOOTING.md (which one: ___)
  • Ran pytest <relevant test path> and confirmed those tests pass
  • Ran the repro snippet in a fresh venv
  • Confirmed it reproduces on Python 3.11 (if you were on 3.12 / 3.13)

### Logs

(Full traceback. If it's a wrong-loss-curve rather than an exception, paste loss values for the first 10 steps and link any wandb/tb run.)

### Hypothesis

(Optional. If you have a guess at where the bug is, name the file + line number. We'll look there first.)


A few rules:
- **Do not** paste API keys, AWS credentials, or HuggingFace tokens.
- **Do** include the failing test name if you've narrowed it to one.
- **Do** distinguish "never worked" from "regressed between commit X
  and Y." A regression-bisect goes straight to the front of the queue.
- **One bug per issue.** Multi-headed reports lose items in triage.
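
Most of the Environment fields in the template can be gathered in one go. The script below is a convenience sketch, not something shipped with the framework. The distribution names are taken from the Environment list, except py-data-juicer, which is assumed from the PyPI name mentioned earlier; a from-source install may register under a different name. GPU and install-method details still need to be filled in by hand.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they would appear in `pip list`.
PACKAGES = (
    "composer-replication",
    "torch",
    "torchft",
    "transformers",
    "trl",
    "py-data-juicer",  # assumed distribution name
    "fsspec",
    "s3fs",
    "gcsfs",
    "adlfs",
)


def installed_version(dist_name):
    """Return the installed version string, or 'not installed'."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return "not installed"


def environment_report(packages=PACKAGES):
    """Format OS, Python, and package versions for pasting into a bug report."""
    lines = [
        f"OS: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    lines += [f"{name}: {installed_version(name)}" for name in packages]
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

Run it inside the same virtual environment where the failure occurs, so the versions match what the repro actually used.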

The Wave-14 surface area is large, but the test suite covers it
densely — every section above corresponds to a green test that proves
the fix worked.