# ADR-004 — Replaysim normalization layer for the trace-replay channel

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13 (deep work loop, expansion phase)

## Context

The brief's V5 clause says:

> use traces from an llm-application usage then replay the traces with
> different models to see at each llm-step what the llm would do. by doing
> this we get distillation data from any number of models that could be
> used to train the target model further

On 2026-05-26 the user added: *"see if we can leverage [a normalization
library] to normalize the data while also making the replaysim dataset
generation."*

Currently the framework has `composer_replication.teacher_replay`:
- `replay_trace()` — N-teacher OpenRouter replay, returns
  `list[TeacherCallResult]`
- `extract_dpo_pairs()` — converts teacher disagreement to `list[DPOPair]`

This produces preference-pair training data, but with **zero normalization**:
no dedup, no length filtering, no language detection, no quality
filtering, no chat-template validation. The output is closer to raw
LLM API responses than to a training-ready dataset.

For the replaysim to power downstream RL training (V6), the dataset needs
to be production-quality. Hand-rolling that pipeline is a tax we'd rather
not pay.
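To give a sense of what hand-rolling involves, here is a minimal sketch of just two of the missing steps (length filtering and exact dedup) in plain Python. The `DPOPair` fields and the character limits are assumptions made for illustration; the real `DPOPair` definition is not shown in this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOPair:
    # Hypothetical stand-in: field names are assumed, not taken from the framework.
    prompt: str
    chosen: str
    rejected: str


def normalize_pairs(pairs, min_chars=1, max_chars=8000):
    """Drop out-of-range, signal-free, and duplicate pairs, preserving order."""
    seen = set()
    kept = []
    for pair in pairs:
        texts = (pair.chosen.strip(), pair.rejected.strip())
        # Length filter: both responses must fall inside the allowed range.
        if any(not (min_chars <= len(t) <= max_chars) for t in texts):
            continue
        # A pair with identical responses carries no preference signal.
        if texts[0] == texts[1]:
            continue
        # Exact dedup on the whitespace-trimmed triple.
        key = (pair.prompt.strip(), texts[0], texts[1])
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Language detection, quality filtering, and chat-template validation would each add comparable code and their own edge cases, which is the maintenance cost this ADR tries to avoid.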

## Options considered

We audited five candidates; the full audit is in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:

| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace `datatrove` | MIT | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
| Alibaba `data-juicer` | Apache-2 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | **Chosen** |
| NVIDIA `nemo-curator` | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject — GPU-bound for the ops we need |
| Argilla `distilabel` | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject — would replace teacher orchestration, not just normalize |
| Databricks `lilac` | — | n/a | n/a | n/a | n/a | Reject — archived 2024-03 |

## Decision

**Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).**

Reasons:

1. **It's the only candidate with native multi-turn + DPO support in the
   *normalization* op-graph.** It ships `pair_preference_mapper`,
   `dialog_intent_detection_mapper`, `dialog_topic_detection_mapper`,
   and related ops that operate directly on chat-format messages.

2. **CPU-runnable for our op set.** The differentiating ops we need
   (length filter, language ID, chat-template validation, dedup) all
   work on CPU. We avoid the NeMo-Curator GPU dependency entirely.

3. **Streaming-friendly.** The op graph is a DAG, so we can pipe
   `replay_trace` output into it during generation rather than running it
   as a post-hoc pass. This matters for cost discipline: bad teacher
   outputs get filtered before contributing to OpenRouter spend on
   subsequent steps.

4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can
   be version-controlled. A user can swap normalization recipes without
   touching framework code.
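The cost argument in reason 3 can be sketched as a generator that checks each teacher output before the next paid call is made. This is one possible reading, not the framework's implementation: `call_teacher` and `keep` are hypothetical callables, and abandoning the trace at the first rejected output is an assumed policy.

```python
def replay_with_inline_filter(steps, call_teacher, keep):
    """Replay a trace step by step, filtering each output before the next call.

    steps        -- iterable of per-step prompts (hypothetical trace shape)
    call_teacher -- callable(prompt, context) -> str; stands in for one paid API call
    keep         -- callable(output) -> bool; stands in for the normalization checks
    """
    context = []
    for prompt in steps:
        output = call_teacher(prompt, context)
        if not keep(output):
            # Assumed policy: stop spending on a trace once an output is rejected.
            break
        context.append(output)
        yield output
```

With a post-hoc pass, every step would be paid for before any output is inspected; here the later calls are never issued once a step is rejected.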

## Consequences

### Accepted

- New module `composer_replication.replaysim` lifts the existing
  `teacher_replay` logic out of the package's flat namespace and adds:
  - `composer_replication.replaysim.normalize` — `DJNormalizer` adapter
    that wraps `data-juicer` op graphs around `replay_trace` output
  - `recipes/replaysim/default.yaml` — base normalization recipe (length
    filter + chat-template validation + per-turn dedup)
  - Optional `recipes/replaysim/with_disagreement_filter.yaml` — adds a
    semantic-similarity filter that drops "false disagreements" where
    teachers used different wording for the same answer
- New optional `[replaysim]` extra in `pyproject.toml`:
  `pip install -e ".[replaysim]"` pulls in `data-juicer`. The core install
  doesn't require it.
- The existing `replay_trace` and `extract_dpo_pairs` keep their
  signatures. The normalizer is opt-in via a `normalizer=` kwarg on a
  new `replay_and_normalize_trace` convenience function.
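A minimal sketch of how the opt-in `normalizer=` kwarg could behave, assuming pairs are plain dicts with `chosen` and `rejected` keys. The function name `replay_and_normalize_trace_sketch`, the injected `replay` and `extract` callables, and the `difflib`-based filter are all illustrative; in particular, `difflib` measures lexical overlap and is only a stand-in for the semantic-similarity filter described above.

```python
from difflib import SequenceMatcher


def drop_false_disagreements(pairs, threshold=0.9):
    """Drop pairs whose chosen/rejected texts are near-identical.

    Lexical stand-in for a semantic-similarity filter; threshold is an assumed value.
    """
    kept = []
    for pair in pairs:
        # Normalize case and whitespace so trivial formatting differences don't count.
        a = " ".join(pair["chosen"].lower().split())
        b = " ".join(pair["rejected"].lower().split())
        if SequenceMatcher(None, a, b).ratio() < threshold:
            kept.append(pair)
    return kept


def replay_and_normalize_trace_sketch(replay, extract, trace, normalizer=None):
    """Run replay + pair extraction, then apply the normalizer only if one is given."""
    pairs = extract(replay(trace))
    if normalizer is None:
        return pairs  # opt-out path: behaviour matches the un-normalized pipeline
    return normalizer(pairs)
```

Keeping the normalizer a plain callable means callers without the `[replaysim]` extra installed are unaffected, and a `data-juicer`-backed adapter can be passed in the same slot.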

### One-day spike before merge

`pair_preference_mapper` in data-juicer might unconditionally re-synthesize
the `rejected` text via an LLM call. We already have `rejected` from
teacher disagreement and don't want to pay another API call. The recon
flagged this — verify by reading the mapper's source, and if it's LLM-bound,
substitute a plain validator that checks the field exists + isn't empty.

If the spike confirms the worst case (the mapper is LLM-bound and can't
easily be swapped out), fall back to writing a custom `DJOp` subclass that
validates pre-existing DPO pairs without re-synthesis. Estimated size: ~50 LOC.
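The validation logic itself is small. The sketch below shows the "field exists and isn't empty" check as plain functions, deliberately without subclassing any `data-juicer` base class, since the exact operator interface is what the spike needs to confirm. The field names `prompt`, `chosen`, and `rejected` are assumptions.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")  # assumed sample schema


def is_valid_dpo_sample(sample, required=REQUIRED_FIELDS):
    """Return True if every required field is present and holds non-blank text."""
    for key in required:
        value = sample.get(key)
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def filter_valid_samples(samples):
    """Keep only samples that pass validation; makes no LLM calls and rewrites nothing."""
    return [s for s in samples if is_valid_dpo_sample(s)]
```

Wrapping `is_valid_dpo_sample` in whatever operator base class the spike identifies should account for most of the remaining lines in the estimate.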

### Rejected paths

- **`datatrove`**: would have required hand-rolling all chat-template logic
  on top of flat-text ops. Bigger ongoing maintenance cost than
  data-juicer's native multi-turn support.
- **`nemo-curator`**: GPU-mandatory ops mean we'd need to pay for GPU during
  dataset generation (separate from the replay phase, which is already
  GPU-free). Net cost increase for no quality win.
- **`distilabel`**: too broad — its pipeline abstraction would replace our
  `replay_trace` entirely. We'd lose direct OpenRouter cost control + the
  audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.

### Future work

- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's
  `altered-minds` workstream tie-in (per Wave 13 expansion)
- v0.3: revisit if `distilabel` becomes more mature and the migration
  cost vs ongoing-maintenance balance shifts

## Source

`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26
subagent recon, primary-sourced from each repo's GitHub + DeepWiki).