# ADR-004 — Replaysim normalization layer for the trace-replay channel

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13 (deep work loop, expansion phase)

## Context

The brief's V5 clause says:

> use traces from an llm-application usage then replay the traces with
> different models to see at each llm-step what the llm would do. by doing
> this we get distillation data from any number of models that could be
> used to train the target model further

The user added 2026-05-26: *"see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation."*

Currently the framework has `composer_replication.teacher_replay`:

- `replay_trace()` — N-teacher OpenRouter replay, returns `list[TeacherCallResult]`
- `extract_dpo_pairs()` — converts teacher disagreement to `list[DPOPair]`

This produces preference-pair training data, but with **zero normalization**: no dedup, no length filtering, no language detection, no quality filtering, no chat-template validation. The output is closer to "raw LLM API responses" than "training-ready dataset."

For the replaysim to power downstream RL training (V6), the dataset needs to be production-quality. Hand-rolling that pipeline is a tax we'd rather not pay.

## Options considered

Audited five candidates in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:

| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace `datatrove` | MIT | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
| Alibaba `data-juicer` | Apache-2 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | **Chosen** |
| NVIDIA `nemo-curator` | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject — GPU-bound for the ops we need |
| Argilla `distilabel` | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject — would replace teacher orchestration, not just normalize |
| Databricks `lilac` | — | n/a | n/a | n/a | n/a | Reject — archived 2024-03 |

## Decision

**Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).**

Reasons:

1. **It's the only candidate with native multi-turn + DPO support in the *normalization* op-graph.** Has `pair_preference_mapper`, `dialog_intent_detection_mapper`, `dialog_topic_detection_mapper`, etc. that operate on chat-format messages directly.
2. **CPU-runnable for our op set.** The differentiating ops we need (length filter, language ID, chat-template validation, dedup) all work on CPU. We avoid the NeMo-Curator GPU dependency entirely.
3. **Streaming-friendly.** Op graph is a DAG; we can pipe `replay_trace` output into the graph during generation, not as a post-hoc pass. This matters for cost discipline — bad teacher outputs get filtered before contributing to OpenRouter spend on subsequent steps.
4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can be version-controlled. A user can swap normalization recipes without touching framework code.
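
To make reasons 2 and 3 concrete, the following is a minimal pure-Python sketch of the streaming shape we want: records are filtered lazily, one at a time, on CPU. It does not use data-juicer's API at all; it only illustrates the length-filter and exact-dedup behaviour a recipe would be expected to provide. The record layout (plain dicts with `prompt`, `chosen`, `rejected` keys), the function name `normalize_stream`, and the character thresholds are assumptions made for illustration, since the real `DPOPair` fields are not shown in this ADR.

```python
import hashlib
from typing import Dict, Iterable, Iterator, Set

# Hypothetical record shape: the real DPOPair fields are not specified in this ADR.
Pair = Dict[str, str]


def _fingerprint(prompt: str, chosen: str, rejected: str) -> str:
    """Hash of case- and whitespace-normalized text, used for exact dedup."""
    canon = "\x1f".join(" ".join(t.lower().split()) for t in (prompt, chosen, rejected))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def normalize_stream(
    pairs: Iterable[Pair],
    min_chars: int = 1,      # assumed threshold, not taken from any recipe
    max_chars: int = 8000,   # assumed threshold, not taken from any recipe
) -> Iterator[Pair]:
    """Lazily yield pairs that pass a length filter and exact dedup."""
    seen: Set[str] = set()
    for pair in pairs:
        prompt = pair.get("prompt", "")
        chosen = pair.get("chosen", "")
        rejected = pair.get("rejected", "")
        # Length filter: every field must fall inside [min_chars, max_chars].
        if not all(min_chars <= len(t.strip()) <= max_chars for t in (prompt, chosen, rejected)):
            continue
        # Identical sides carry no preference signal, so drop the pair.
        if chosen.strip() == rejected.strip():
            continue
        # Exact dedup on the normalized (prompt, chosen, rejected) triple.
        key = _fingerprint(prompt, chosen, rejected)
        if key in seen:
            continue
        seen.add(key)
        yield pair
```

Because `normalize_stream` is a generator, it can sit directly downstream of whatever yields replay results, so a rejected record is dropped as soon as it is seen instead of in a later batch pass.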
## Consequences

### Accepted

- New module `composer_replication.replaysim` lifts the existing `teacher_replay` logic out of the package's flat namespace and adds:
  - `composer_replication.replaysim.normalize` — `DJNormalizer` adapter that wraps `data-juicer` op graphs around `replay_trace` output
  - `recipes/replaysim/default.yaml` — base normalization recipe (length filter + chat-template validation + per-turn dedup)
  - Optional `recipes/replaysim/with_disagreement_filter.yaml` — adds a semantic-similarity filter that drops "false disagreements" where teachers used different wording for the same answer
- New optional dependency `[replaysim]` extra in `pyproject.toml`: `pip install -e .[replaysim]` pulls `data-juicer`. Core install doesn't require it.
- The existing `replay_trace` and `extract_dpo_pairs` keep their signatures. The normalizer is opt-in via a `normalizer=` kwarg on a new `replay_and_normalize_trace` convenience function.

### One-day spike before merge

`pair_preference_mapper` in data-juicer might unconditionally re-synthesize the `rejected` text via an LLM call. We already have `rejected` from teacher disagreement and don't want to pay another API call. The recon flagged this — verify by reading the mapper's source, and if it's LLM-bound, substitute a plain validator that checks the field exists + isn't empty.

If the spike fails (the mapper IS LLM-bound and isn't easily replaceable), fall back to writing a custom `DJOp` subclass that validates pre-existing DPO pairs without re-synthesis. ~50 LOC.

### Rejected paths

- **`datatrove`**: would have required hand-rolling all chat-template logic on top of flat-text ops. Bigger ongoing maintenance cost than data-juicer's native multi-turn support.
- **`nemo-curator`**: GPU-mandatory ops mean we'd need to pay for GPU during dataset generation (separate from the replay phase, which is already GPU-free). Net cost increase for no quality win.
- **`distilabel`**: too broad — its pipeline abstraction would replace our `replay_trace` entirely. We'd lose direct OpenRouter cost control + the audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.

### Future work

- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's `altered-minds` workstream tie-in (per Wave 13 expansion)
- v0.3: revisit if `distilabel` becomes more mature and the migration cost vs ongoing-maintenance balance shifts

## Source

`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26 subagent recon, primary-sourced from each repo's GitHub + DeepWiki).
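
The plain validator mentioned under "One-day spike before merge" (check that the field exists and isn't empty, with no LLM call) could look roughly like the sketch below. It assumes records are dicts with string fields named `prompt`, `chosen`, and `rejected`; those names, along with `PairCheck` and `validate_pair`, are hypothetical. It is deliberately written as a standalone function and not as a `DJOp` subclass, because the data-juicer base-class interface is not covered by this ADR and would need to be confirmed during the spike.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# Assumed field names; the actual DPOPair layout is not given in this ADR.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


@dataclass(frozen=True)
class PairCheck:
    """Outcome of validating one pre-existing DPO pair."""
    ok: bool
    reason: str = ""


def validate_pair(record: Mapping[str, Any]) -> PairCheck:
    """Check that each required field exists and is non-empty text.

    Never calls an LLM and never rewrites `rejected`: it only inspects
    what teacher disagreement already produced.
    """
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if not isinstance(value, str):
            return PairCheck(False, f"missing or non-string field: {name}")
        if not value.strip():
            return PairCheck(False, f"empty field: {name}")
    return PairCheck(True)
```

Returning a reason string alongside the boolean keeps dropped pairs explainable, which fits the audit-trail concern raised in the rejected paths.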