	# ADR-004 — Replaysim normalization layer for the trace-replay channel

	Status: Accepted
	Date: 2026-05-26
	Wave: 13 (deep work loop, expansion phase)

	## Context

	The brief's V5 clause says:

	> use traces from an llm-application usage then replay the traces with
	> different models to see at each llm-step what the llm would do. by doing
	> this we get distillation data from any number of models that could be
	> used to train the target model further

	The user added on 2026-05-26: *"see if we can leverage [a normalization
	library] to normalize the data while also making the replaysim dataset
	generation."*

	Currently the framework has `composer_replication.teacher_replay`:
	- `replay_trace()` — N-teacher OpenRouter replay, returns
	`list[TeacherCallResult]`
	- `extract_dpo_pairs()` — converts teacher disagreement to `list[DPOPair]`

	This produces preference-pair training data, but with zero normalization:
	no dedup, no length filtering, no language detection, no quality
	filtering, no chat-template validation. The output is closer to "raw
	LLM API responses" than "training-ready dataset."
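To make the gap concrete, the missing checks could be sketched in plain Python as below. This is only an illustration of what "normalization" means here, not the framework's code: `PairSketch` is a hypothetical stand-in (the real `DPOPair` fields are not shown in this ADR), and the role set and the 8000-character cap are assumed values.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a DPO pair; field names are assumptions.
@dataclass(frozen=True)
class PairSketch:
    prompt_messages: tuple  # tuple of (role, content) tuples
    chosen: str
    rejected: str

VALID_ROLES = {"system", "user", "assistant"}  # assumed role vocabulary

def has_valid_chat_shape(messages) -> bool:
    """Structural check: at least one message, known roles, non-empty content."""
    if not messages:
        return False
    return all(role in VALID_ROLES and content.strip() for role, content in messages)

def within_length(text: str, min_chars: int = 1, max_chars: int = 8000) -> bool:
    """Length filter on the stripped text (bounds are illustrative)."""
    return min_chars <= len(text.strip()) <= max_chars

def normalize_pairs(pairs, max_chars: int = 8000):
    """Apply chat-shape validation, length filtering, and exact dedup, in order."""
    seen = set()
    kept = []
    for pair in pairs:
        if not has_valid_chat_shape(pair.prompt_messages):
            continue
        if not within_length(pair.chosen, max_chars=max_chars):
            continue
        if not within_length(pair.rejected, max_chars=max_chars):
            continue
        # Dedup key ignores surrounding whitespace in the responses.
        key = (pair.prompt_messages, pair.chosen.strip(), pair.rejected.strip())
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept
```

Language detection and quality filtering are left out of the sketch because they need a model or heuristic set that this ADR does not specify.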

	For the replaysim to power downstream RL training (V6), the dataset needs
	to be production-quality. Hand-rolling that pipeline is a tax we'd rather
	not pay.

	## Options considered

	We audited five candidates; the full notes are in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:

	| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU required? | Verdict |
	|---|---|---|---|---|---|---|
	| HuggingFace `datatrove` | Apache-2.0 | ❌ flat-text only | ❌ | ✅ | ❌ | Reject: deal-breaker on multi-turn |
	| Alibaba `data-juicer` | Apache-2.0 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | Chosen |
	| NVIDIA `nemo-curator` | Apache-2.0 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject: GPU-bound for the ops we need |
	| Argilla `distilabel` | Apache-2.0 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject: would replace teacher orchestration, not just normalize |
	| Databricks `lilac` | n/a | n/a | n/a | n/a | n/a | Reject: archived 2024-03 |

	## Decision

	Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).

	Reasons:

	1. **Native multi-turn and DPO support in the normalization op-graph.**
	   It is the only candidate that has both. Ops such as
	   `pair_preference_mapper`, `dialog_intent_detection_mapper`, and
	   `dialog_topic_detection_mapper` operate on chat-format messages directly.

	2. **CPU-runnable for our op set.** The differentiating ops we need
	   (length filter, language ID, chat-template validation, dedup) all
	   work on CPU, so we avoid the NeMo-Curator GPU dependency entirely.

	3. **Streaming-friendly.** The op graph is a DAG, so we can pipe
	   `replay_trace` output into it during generation rather than as a
	   post-hoc pass. This matters for cost discipline: bad teacher outputs
	   get filtered before they contribute to OpenRouter spend on subsequent steps.

	4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can
	   be version-controlled. A user can swap normalization recipes without
	   touching framework code.
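The streaming argument in reason 3 can be illustrated with a plain generator pipeline. This is a sketch of the idea only and does not use data-juicer: `stream_filtered`, `fake_replay`, the two checks, and the result shape (`step`, `response`) are all hypothetical names introduced for this example.

```python
from typing import Callable, Iterable, Iterator

Check = Callable[[dict], bool]

def stream_filtered(results: Iterable[dict], checks: list[Check]) -> Iterator[dict]:
    """Yield each result as soon as it passes every check; drop the rest."""
    for result in results:
        if all(check(result) for check in checks):
            yield result

def non_empty(result: dict) -> bool:
    """Reject blank teacher responses."""
    return bool(result["response"].strip())

def short_enough(result: dict, max_chars: int = 200) -> bool:
    """Reject overly long responses (the cap is an illustrative value)."""
    return len(result["response"]) <= max_chars

def fake_replay(responses: list[str]) -> Iterator[dict]:
    """Stand-in for a teacher replay loop; yields one result per step."""
    for step, response in enumerate(responses):
        yield {"step": step, "response": response}

def kept_steps(responses: list[str]) -> list[int]:
    """Run the filter lazily over the replay stream and report surviving steps."""
    stream = stream_filtered(fake_replay(responses), [non_empty, short_enough])
    return [result["step"] for result in stream]
```

Because both sides are generators, each result is checked before the next one is produced, so a caller that builds later prompts from earlier outputs only ever sees filtered data. For example, `kept_steps(["ok", "", "x" * 500, "fine"])` keeps steps 0 and 3.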

	## Consequences

	### Accepted

	- New module `composer_replication.replaysim` lifts the existing
	  `teacher_replay` logic out of the package's flat namespace and adds:
	  - `composer_replication.replaysim.normalize`: a `DJNormalizer` adapter
	    that wraps `data-juicer` op graphs around `replay_trace` output
	  - `recipes/replaysim/default.yaml`: base normalization recipe (length
	    filter + chat-template validation + per-turn dedup)
	  - Optional `recipes/replaysim/with_disagreement_filter.yaml`: adds a
	    semantic-similarity filter that drops "false disagreements" where
	    teachers used different wording for the same answer
	- New optional dependency `[replaysim]` extra in `pyproject.toml`:
	  `pip install -e .[replaysim]` pulls `data-juicer`. Core install
	  doesn't require it.
	- The existing `replay_trace` and `extract_dpo_pairs` keep their
	  signatures. The normalizer is opt-in via a `normalizer=` kwarg on a
	  new `replay_and_normalize_trace` convenience function.
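A minimal sketch of the opt-in shape described above, under stated assumptions: the real signatures of `replay_trace`, `extract_dpo_pairs`, and `replay_and_normalize_trace` are not given in this ADR, so the function below takes the replay and extraction steps as parameters and uses a hypothetical dict-shaped pair (`prompt`, `chosen`, `rejected`). The example normalizer also swaps in `difflib`'s lexical ratio as a cheap stand-in for the semantic-similarity filter; it is not the filter the recipe would use.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Sequence

Pair = dict  # hypothetical shape: {"prompt": str, "chosen": str, "rejected": str}
Normalizer = Callable[[Sequence[Pair]], list]

def drop_false_disagreements(pairs: Sequence[Pair], threshold: float = 0.9) -> list:
    """Drop pairs whose chosen/rejected texts are nearly identical.

    Lexical similarity only; a real recipe would compare meaning, not characters.
    """
    kept = []
    for pair in pairs:
        ratio = SequenceMatcher(
            None, pair["chosen"].lower(), pair["rejected"].lower()
        ).ratio()
        if ratio < threshold:
            kept.append(pair)
    return kept

def replay_and_normalize_sketch(
    replay_fn: Callable,
    extract_fn: Callable,
    trace,
    normalizer: Optional[Normalizer] = None,
) -> list:
    """Replay, extract pairs, then normalize only if a normalizer was passed."""
    results = replay_fn(trace)
    pairs = extract_fn(results)
    if normalizer is None:
        # Opt-out path: behaviour matches the un-normalized pipeline.
        return list(pairs)
    return normalizer(pairs)
```

Keeping the normalizer a plain callable means the `data-juicer` import can stay inside the adapter, so a core install without the `[replaysim]` extra never touches it.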

	### One-day spike before merge

	`pair_preference_mapper` in data-juicer might unconditionally re-synthesize
	the `rejected` text via an LLM call. We already have `rejected` from
	teacher disagreement and don't want to pay for another API call. The recon
	flagged this risk. Verify it by reading the mapper's source; if the mapper is
	LLM-bound, substitute a plain validator that checks that the field exists
	and is non-empty.

	If the spike fails (the mapper is LLM-bound and can't be swapped out easily),
	fall back to writing a custom `DJOp` subclass that validates pre-existing
	DPO pairs without re-synthesis (roughly 50 LOC).
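The core of such a validator is small. The sketch below shows only the per-sample check as a plain function; wiring it into data-juicer as an op is deliberately not shown, since the base-class API is exactly what the spike needs to confirm. The field names `prompt`, `chosen`, and `rejected` are assumptions about the sample layout.

```python
from typing import Iterable, Mapping

# Assumed field names; adjust to whatever the real DPO pair schema uses.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def is_valid_pair(sample: Mapping, fields: Iterable[str] = REQUIRED_FIELDS) -> bool:
    """Return True when every required field exists and holds non-blank text.

    No LLM call is made: the pair is checked as-is, never re-synthesized.
    """
    for name in fields:
        value = sample.get(name)
        if not isinstance(value, str) or not value.strip():
            return False
    return True

def keep_valid_pairs(samples: Iterable[Mapping]) -> list:
    """Filter a batch down to samples that pass is_valid_pair."""
    return [sample for sample in samples if is_valid_pair(sample)]
```

A pair with a missing or whitespace-only `rejected` field is dropped; everything else passes through untouched.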

	### Rejected paths

	- `datatrove`: would have required hand-rolling all chat-template logic
	  on top of flat-text ops, a bigger ongoing maintenance cost than
	  relying on data-juicer's native multi-turn support.
	- `nemo-curator`: GPU-mandatory ops mean we'd have to pay for GPU time during
	  dataset generation (separate from the replay phase, which is already
	  GPU-free). Net cost increase for no quality win.
	- `distilabel`: too broad. Its pipeline abstraction would replace our
	  `replay_trace` entirely, and we'd lose direct OpenRouter cost control and
	  the audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.

	### Future work

	- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's
	`altered-minds` workstream tie-in (per Wave 13 expansion)
	- v0.3: revisit if `distilabel` becomes more mature and the migration
	cost vs ongoing-maintenance balance shifts

	## Source

	`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26
	subagent recon, primary-sourced from each repo's GitHub + DeepWiki).