# ADR-004 — Replaysim normalization layer for the trace-replay channel
**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13 (deep work loop, expansion phase)
## Context
The brief's V5 clause says:
> use traces from an llm-application usage then replay the traces with
> different models to see at each llm-step what the llm would do. by doing
> this we get distillation data from any number of models that could be
> used to train the target model further
On 2026-05-26 the user added: *"see if we can leverage [a normalization
library] to normalize the data while also making the replaysim dataset
generation."*
The framework currently exposes `composer_replication.teacher_replay`, which provides:
- `replay_trace()` — N-teacher OpenRouter replay, returns
`list[TeacherCallResult]`
- `extract_dpo_pairs()` — converts teacher disagreement to `list[DPOPair]`
This produces preference-pair training data, but with **zero normalization**:
no dedup, no length filtering, no language detection, no quality
filtering, no chat-template validation. The output is closer to "raw
LLM API responses" than to a "training-ready dataset."
For the replaysim to power downstream RL training (V6), the dataset needs
to be production-quality. Hand-rolling that pipeline is a tax we'd rather
not pay.
## Options considered
We audited five candidates; the full audit is in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:
| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace `datatrove` | Apache-2 | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
| Alibaba `data-juicer` | Apache-2 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | **Chosen** |
| NVIDIA `nemo-curator` | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject — GPU-bound for the ops we need |
| Argilla `distilabel` | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject — would replace teacher orchestration, not just normalize |
| Databricks `lilac` | — | n/a | n/a | n/a | n/a | Reject — archived 2024-03 |
## Decision
**Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).**
Reasons:
1. **It's the only candidate with native multi-turn + DPO support in the
*normalization* op-graph.** It ships `pair_preference_mapper`,
`dialog_intent_detection_mapper`, `dialog_topic_detection_mapper`,
and similar ops that operate on chat-format messages directly.
2. **CPU-runnable for our op set.** The differentiating ops we need
(length filter, language ID, chat-template validation, dedup) all
work on CPU. We avoid the NeMo-Curator GPU dependency entirely.
3. **Streaming-friendly.** The op graph is a DAG, so we can pipe `replay_trace`
output into the graph during generation rather than as a post-hoc pass. This
matters for cost discipline: bad teacher outputs get filtered before
contributing to OpenRouter spend on subsequent steps.
4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can
be version-controlled. A user can swap normalization recipes without
touching framework code.
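
The streaming point in reason 3 can be illustrated with a minimal, stdlib-only sketch. It does not use `data-juicer` or the real `replay_trace`; `TeacherOutput`, `stream_filter`, the two checks, and `fake_replay` are all hypothetical names introduced here, and the 2000-character limit is an arbitrary assumption. The idea shown is only that filtering happens per item as results arrive, rather than over a finished list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class TeacherOutput:
    """Hypothetical stand-in for one teacher response at one trace step."""
    step: int
    teacher: str
    text: str


def stream_filter(
    outputs: Iterable[TeacherOutput],
    checks: list[Callable[[TeacherOutput], bool]],
) -> Iterator[TeacherOutput]:
    """Lazily yield only the outputs that pass every check."""
    for out in outputs:
        if all(check(out) for check in checks):
            yield out


def not_empty(out: TeacherOutput) -> bool:
    """Reject whitespace-only responses."""
    return bool(out.text.strip())


def within_length(out: TeacherOutput, max_chars: int = 2000) -> bool:
    """Reject responses longer than an (assumed) character budget."""
    return len(out.text) <= max_chars


def fake_replay() -> Iterator[TeacherOutput]:
    """Simulates replay results arriving step by step (no network calls)."""
    yield TeacherOutput(0, "teacher-a", "Use a binary search.")
    yield TeacherOutput(0, "teacher-b", "   ")
    yield TeacherOutput(1, "teacher-a", "x" * 5000)
    yield TeacherOutput(1, "teacher-b", "Return the midpoint index.")


kept = list(stream_filter(fake_replay(), [not_empty, within_length]))
```

Because `stream_filter` is a generator, a caller could inspect each surviving output before requesting the next step, which is where the cost saving described above would come from.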
## Consequences
### Accepted
- New module `composer_replication.replaysim` lifts the existing
`teacher_replay` logic out of the package's flat namespace and adds:
- `composer_replication.replaysim.normalize`: a `DJNormalizer` adapter
that wraps `data-juicer` op graphs around `replay_trace` output
- `recipes/replaysim/default.yaml` — base normalization recipe (length
filter + chat-template validation + per-turn dedup)
- Optional `recipes/replaysim/with_disagreement_filter.yaml` — adds a
semantic-similarity filter that drops "false disagreements" where
teachers used different wording for the same answer
- New optional dependency `[replaysim]` extra in `pyproject.toml`:
`pip install -e .[replaysim]` pulls `data-juicer`. Core install
doesn't require it.
- The existing `replay_trace` and `extract_dpo_pairs` keep their
signatures. The normalizer is opt-in via a `normalizer=` kwarg on a
new `replay_and_normalize_trace` convenience function.
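
The opt-in `normalizer=` contract might look roughly like the sketch below. This is an illustration under assumptions, not the framework's implementation: the `DPOPair` fields, the `Normalizer` type alias, and `default_normalizer` are hypothetical, and the function takes already-extracted pairs instead of a trace so that the example runs without calling OpenRouter. The stand-in normalizer applies a length filter and exact dedup in plain Python where the real adapter would delegate to a `data-juicer` recipe.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class DPOPair:
    """Hypothetical shape; the real DPOPair fields may differ."""
    prompt: str
    chosen: str
    rejected: str


# A normalizer is any callable that maps a list of pairs to a cleaned list.
Normalizer = Callable[[list[DPOPair]], list[DPOPair]]


def default_normalizer(
    pairs: list[DPOPair], min_chars: int = 1, max_chars: int = 4000
) -> list[DPOPair]:
    """Length filter plus exact dedup; limits are assumed values."""
    seen: set[tuple[str, str, str]] = set()
    kept: list[DPOPair] = []
    for pair in pairs:
        texts = (pair.chosen.strip(), pair.rejected.strip())
        if not all(min_chars <= len(t) <= max_chars for t in texts):
            continue  # drop pairs with an empty or oversized side
        key = (pair.prompt, *texts)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        kept.append(pair)
    return kept


def replay_and_normalize_trace(
    raw_pairs: list[DPOPair],
    normalizer: Optional[Normalizer] = None,
) -> list[DPOPair]:
    """Return pairs untouched unless a normalizer is supplied (opt-in)."""
    return raw_pairs if normalizer is None else normalizer(raw_pairs)


raw = [
    DPOPair("q1", "good", "bad"),
    DPOPair("q1", "good", "bad"),  # exact duplicate
    DPOPair("q2", "", "bad"),      # empty chosen side
    DPOPair("q3", "ok", "no"),
]
unnormalized = replay_and_normalize_trace(raw)
normalized = replay_and_normalize_trace(raw, normalizer=default_normalizer)
```

Defaulting `normalizer` to `None` keeps the behaviour of existing callers unchanged, which matches the intent that the core install does not require `data-juicer`.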
### One-day spike before merge
`pair_preference_mapper` in data-juicer might unconditionally re-synthesize
the `rejected` text via an LLM call. We already have `rejected` from
teacher disagreement and don't want to pay another API call. The recon
flagged this — verify by reading the mapper's source, and if it's LLM-bound,
substitute a plain validator that checks the field exists + isn't empty.
If the spike fails (the mapper IS LLM-bound and isn't easily replaceable),
fall back to writing a custom `DJOp` subclass that validates pre-existing
DPO pairs without re-synthesis. ~50 LOC.
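
As a rough picture of the fallback, the validation logic itself can be a few lines of plain Python. The sketch below assumes samples are dicts with `prompt`, `chosen`, and `rejected` string fields (an assumption about the pair schema) and only checks that each field exists and is non-empty; `validate_dpo_pair` and `keep_valid` are hypothetical names. Wiring this into a `data-juicer` op subclass is deliberately not shown, since that depends on the library's base-class interface, which the spike should confirm.

```python
from typing import Any, Mapping

# Assumed field names for a pre-existing DPO pair.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_dpo_pair(sample: Mapping[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the pair is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = sample.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: missing or not a string")
        elif not value.strip():
            problems.append(f"{field}: empty")
    return problems


def keep_valid(samples: list[Mapping[str, Any]]) -> list[Mapping[str, Any]]:
    """Keep only samples that pass validation; no LLM call is made."""
    return [s for s in samples if not validate_dpo_pair(s)]


good = {"prompt": "q", "chosen": "a", "rejected": "b"}
bad = {"prompt": "q", "chosen": " "}
valid_samples = keep_valid([good, bad])
```

Returning the list of problems instead of a bare boolean makes it easy to log why a pair was dropped, which helps preserve the audit trail.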
### Rejected paths
- **`datatrove`**: would have required hand-rolling all chat-template logic
on top of flat-text ops. Bigger ongoing maintenance cost than
data-juicer's native multi-turn support.
- **`nemo-curator`**: GPU-mandatory ops mean we'd need to pay for GPU during
dataset generation (separate from the replay phase, which is already
GPU-free). Net cost increase for no quality win.
- **`distilabel`**: too broad — its pipeline abstraction would replace our
`replay_trace` entirely. We'd lose direct OpenRouter cost control + the
audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.
### Future work
- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's
`altered-minds` workstream tie-in (per Wave 13 expansion)
- v0.3: revisit `distilabel` if it matures and the balance between
migration cost and ongoing maintenance shifts
## Source
`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26
subagent recon, primary-sourced from each repo's GitHub + DeepWiki).