
ADR-004 — Replaysim normalization layer for the trace-replay channel

Status: Accepted
Date: 2026-05-26
Wave: 13 (deep work loop, expansion phase)

Context

The brief's V5 clause says:

use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do. by doing this we get distillation data from any number of models that could be used to train the target model further

On 2026-05-26 the user added: "see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation."

Currently the framework has composer_replication.teacher_replay:

  • replay_trace() — N-teacher OpenRouter replay, returns list[TeacherCallResult]
  • extract_dpo_pairs() — converts teacher disagreement to list[DPOPair]

This produces preference-pair training data, but with zero normalization: no dedup, no length filtering, no language detection, no quality filtering, no chat-template validation. The output is closer to "raw LLM API responses" than "training-ready dataset."
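To make the gap concrete, here is a minimal plain-Python sketch of three of the missing checks (chat-template validation, length filtering, dedup) applied to preference pairs. It is an illustration only, not the framework's code: the pair shape (`prompt`, `chosen`, `rejected`), the role set, and the length bounds are all assumptions made for this example.

```python
# Hypothetical pair shape: {"prompt": [chat turns], "chosen": str, "rejected": str}
VALID_ROLES = {"system", "user", "assistant"}  # assumed role set


def has_valid_chat_template(messages) -> bool:
    """True if there is at least one turn and every turn has a known role and text."""
    if not isinstance(messages, list) or not messages:
        return False
    for turn in messages:
        if not isinstance(turn, dict) or turn.get("role") not in VALID_ROLES:
            return False
        content = turn.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return True


def within_length(text, min_chars: int = 1, max_chars: int = 8000) -> bool:
    """Character-count length filter; the bounds are illustrative defaults."""
    return isinstance(text, str) and min_chars <= len(text) <= max_chars


def normalize_pairs(pairs: list[dict]) -> list[dict]:
    """Drop malformed, out-of-range, and duplicate preference pairs."""
    seen = set()
    kept = []
    for pair in pairs:
        if not has_valid_chat_template(pair.get("prompt")):
            continue
        chosen, rejected = pair.get("chosen"), pair.get("rejected")
        if not (within_length(chosen) and within_length(rejected)):
            continue
        # Exact-match dedup on a whitespace- and case-normalized key.
        key = (repr(pair["prompt"]), chosen.strip().lower(), rejected.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Language detection and quality filtering are left out here because they need a model or a classifier, which is part of the reason for preferring a library over hand-rolled code.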

For the replaysim to power downstream RL training (V6), the dataset needs to be production-quality. Hand-rolling that pipeline is a tax we'd rather not pay.

Options considered

We audited five candidates; the full audit is in docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md:

| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace datatrove | MIT | ❌ flat-text only | | | | Deal-breaker on multi-turn |
| Alibaba data-juicer | Apache-2 | ✅ native messages ops | pair_preference_mapper | | ❌ for ops we need | Chosen |
| NVIDIA nemo-curator | Apache-2 | partial | | | ✅ mandatory for differentiating ops | Reject: GPU-bound for the ops we need |
| Argilla distilabel | Apache-2 | ✅ native chat | ✅ formatters | | | Reject: would replace teacher orchestration, not just normalize |
| Databricks lilac | n/a | n/a | n/a | n/a | | Reject: archived 2024-03 |

Decision

Adopt data-juicer (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).

Reasons:

  1. It's the only candidate with native multi-turn + DPO support in the normalization op-graph. It ships pair_preference_mapper, dialog_intent_detection_mapper, dialog_topic_detection_mapper, and similar ops that operate directly on chat-format messages.

  2. CPU-runnable for our op set. The differentiating ops we need (length filter, language ID, chat-template validation, dedup) all work on CPU. We avoid the NeMo-Curator GPU dependency entirely.

  3. Streaming-friendly. The op graph is a DAG, so we can pipe replay_trace output into it during generation rather than as a post-hoc pass. This matters for cost discipline: bad teacher outputs are filtered before they contribute to OpenRouter spend on subsequent steps.

  4. YAML-recipe driven. Recipes live in recipes/replaysim/ and can be version-controlled. A user can swap normalization recipes without touching framework code.
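The cost argument in reason 3 can be sketched in plain Python. This is one possible reading, not the planned implementation: it assumes that a teacher whose output is rejected at one step is skipped on later steps, so no further API calls are made for it. The function name, the result shape, and the `call_teacher` and `keep` callables are hypothetical stand-ins; nothing here touches data-juicer or OpenRouter.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical result shape: {"step": int, "teacher": str, "text": str}
Result = dict


def replay_with_inline_filter(
    prompts: Iterable[str],
    teachers: Iterable[str],
    call_teacher: Callable[[str, str], str],
    keep: Callable[[Result], bool],
) -> Iterator[Result]:
    """Yield accepted results step by step and stop calling rejected teachers."""
    active = list(teachers)
    for step, prompt in enumerate(prompts):
        for teacher in list(active):  # iterate over a copy: `active` is mutated below
            result = {
                "step": step,
                "teacher": teacher,
                "text": call_teacher(teacher, prompt),
            }
            if keep(result):
                yield result
            else:
                # Assumed policy: a rejected teacher gets no calls on later steps.
                active.remove(teacher)
```

Because the function is a generator, filtering happens as each result is produced rather than after the whole trace has been replayed, which is the property a post-hoc pass cannot provide.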

Consequences

Accepted

  • New module composer_replication.replaysim lifts the existing teacher_replay logic out of the package's flat namespace and adds:
    • composer_replication.replaysim.normalize.DJNormalizer: an adapter that wraps data-juicer op graphs around replay_trace output
    • recipes/replaysim/default.yaml — base normalization recipe (length filter + chat-template validation + per-turn dedup)
    • Optional recipes/replaysim/with_disagreement_filter.yaml — adds a semantic-similarity filter that drops "false disagreements" where teachers used different wording for the same answer
  • New optional [replaysim] extra in pyproject.toml: pip install -e .[replaysim] pulls in data-juicer. The core install doesn't require it.
  • The existing replay_trace and extract_dpo_pairs keep their signatures. The normalizer is opt-in via a normalizer= kwarg on a new replay_and_normalize_trace convenience function.
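The opt-in shape described above might look roughly like the following sketch. It is written under assumptions: the real replay_and_normalize_trace signature is not specified in this ADR, so the replay and extract steps are passed in as callables to keep the example self-contained, and `require_extra` is a hypothetical helper illustrating how the optional dependency could be checked at call time.

```python
import importlib.util
from typing import Callable, Optional

# Hypothetical: a normalizer maps a list of pairs to a (possibly shorter) list.
Normalizer = Callable[[list], list]


def require_extra(module: str, extra: str) -> None:
    """Raise a helpful error if an optional dependency is not installed."""
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module!r} is not installed; try: pip install -e .[{extra}]"
        )


def replay_and_normalize_trace(
    trace,
    replay: Callable,    # stand-in for replay_trace
    extract: Callable,   # stand-in for extract_dpo_pairs
    normalizer: Optional[Normalizer] = None,
) -> list:
    """Replay, extract pairs, then normalize only if a normalizer is supplied."""
    pairs = extract(replay(trace))
    if normalizer is None:
        return pairs  # existing behavior: raw pairs, no normalization
    return normalizer(pairs)
```

Defaulting `normalizer` to `None` keeps existing callers unaffected, and a data-juicer-backed normalizer could call something like `require_extra("data_juicer", "replaysim")` when it is constructed, so the core install never imports the library.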

One-day spike before merge

pair_preference_mapper in data-juicer might unconditionally re-synthesize the rejected text via an LLM call. We already have the rejected response from teacher disagreement and don't want to pay for another API call. The recon flagged this. Verify by reading the mapper's source; if it is LLM-bound, substitute a plain validator that checks that the field exists and isn't empty.

If the spike fails (the mapper IS LLM-bound and isn't easily replaceable), fall back to writing a custom DJOp subclass that validates pre-existing DPO pairs without re-synthesis. ~50 LOC.
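The validation logic itself is small. Below is a minimal sketch of the check in plain Python, assuming pairs are dicts with `chosen` and `rejected` string fields (a common DPO convention, but an assumption here). Wiring it into a data-juicer op subclass is deliberately left out, since that depends on the op base-class API the spike still has to confirm.

```python
PAIR_FIELDS = ("chosen", "rejected")  # assumed field names


def is_valid_dpo_pair(sample: dict) -> bool:
    """True if both preference fields exist and hold non-empty text.

    No LLM call is made: the pair is only validated, never re-synthesized.
    """
    for field in PAIR_FIELDS:
        value = sample.get(field)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def validate_pairs(samples: list[dict]) -> list[dict]:
    """Keep only samples that pass is_valid_dpo_pair, preserving order."""
    return [s for s in samples if is_valid_dpo_pair(s)]
```

For example, `validate_pairs([{"chosen": "a", "rejected": "b"}, {"chosen": "a"}])` keeps only the first sample, because the second has no `rejected` field.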

Rejected paths

  • datatrove: would have required hand-rolling all chat-template logic on top of flat-text ops. Bigger ongoing maintenance cost than data-juicer's native multi-turn support.
  • nemo-curator: GPU-mandatory ops mean we'd need to pay for GPU during dataset generation (separate from the replay phase, which is already GPU-free). Net cost increase for no quality win.
  • distilabel: too broad — its pipeline abstraction would replace our replay_trace entirely. We'd lose direct OpenRouter cost control + the audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.

Future work

  • v0.2: add a recipes/replaysim/altered_minds.yaml for the user's altered-minds workstream tie-in (per Wave 13 expansion)
  • v0.3: revisit distilabel if it matures and the balance between migration cost and ongoing maintenance shifts

Source

docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md (2026-05-26 subagent recon, primary-sourced from each repo's GitHub + DeepWiki).