ADR-004 — Replaysim normalization layer for the trace-replay channel
Status: Accepted
Date: 2026-05-26
Wave: 13 (deep work loop, expansion phase)
Context
The brief's V5 clause says:
use traces from an llm-application usage then replay the traces with different models to see at each llm-step what the llm would do. by doing this we get distillation data from any number of models that could be used to train the target model further
The user added on 2026-05-26: "see if we can leverage [a normalization library] to normalize the data while also making the replaysim dataset generation."
Currently the framework has composer_replication.teacher_replay:
- replay_trace(): N-teacher OpenRouter replay, returns list[TeacherCallResult]
- extract_dpo_pairs(): converts teacher disagreement to list[DPOPair]
This produces preference-pair training data, but with zero normalization: no dedup, no length filtering, no language detection, no quality filtering, no chat-template validation. The output is closer to "raw LLM API responses" than "training-ready dataset."
For the replaysim to power downstream RL training (V6), the dataset needs to be production-quality. Hand-rolling that pipeline is a tax we'd rather not pay.
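To make the gap concrete, here is a minimal sketch of three of the missing checks (chat-template shape, length filter, exact dedup) in plain Python. It is not the framework's code and not data-juicer's API: the sample layout (a dict with a messages list of role/content dicts), the role set, and the character limits are all assumptions chosen for illustration.

```python
import hashlib
import json

# Assumed role set; a real chat template may allow more (e.g. tool roles).
VALID_ROLES = {"system", "user", "assistant"}


def has_valid_chat_shape(messages):
    """True if every message has a known role and non-empty text,
    and the conversation ends on an assistant turn."""
    if not messages:
        return False
    for msg in messages:
        if msg.get("role") not in VALID_ROLES:
            return False
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    return messages[-1]["role"] == "assistant"


def within_length(messages, min_chars, max_chars):
    """Length filter on the total character count of all turns."""
    total = sum(len(m["content"]) for m in messages)
    return min_chars <= total <= max_chars


def fingerprint(messages):
    """Hash of the conversation after lowercasing and collapsing whitespace."""
    canonical = json.dumps(
        [(m["role"], " ".join(m["content"].split()).lower()) for m in messages]
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def normalize(samples, min_chars=1, max_chars=8000):
    """Keep samples that pass shape and length checks, dropping exact repeats."""
    seen = set()
    kept = []
    for sample in samples:
        messages = sample.get("messages", [])
        if not has_valid_chat_shape(messages):
            continue
        if not within_length(messages, min_chars, max_chars):
            continue
        key = fingerprint(messages)
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept
```

Even this small version leaves out language detection and quality filtering, which is the maintenance burden the ADR wants to avoid owning.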
Options considered
We audited five candidates; the full audit is in docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md:
| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace datatrove | MIT | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
| Alibaba data-juicer | Apache-2 | ✅ native messages ops | ✅ pair_preference_mapper | ✅ | ❌ for ops we need | Chosen |
| NVIDIA nemo-curator | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject: GPU-bound for the ops we need |
| Argilla distilabel | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject: would replace teacher orchestration, not just normalize |
| Databricks lilac | - | n/a | n/a | n/a | n/a | Reject: archived 2024-03 |
Decision
Adopt data-juicer (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).
Reasons:
- It's the only candidate with native multi-turn + DPO support in the normalization op-graph. It has pair_preference_mapper, dialog_intent_detection_mapper, dialog_topic_detection_mapper, etc., which operate on chat-format messages directly.
- CPU-runnable for our op set. The differentiating ops we need (length filter, language ID, chat-template validation, dedup) all work on CPU. We avoid the NeMo-Curator GPU dependency entirely.
- Streaming-friendly. The op graph is a DAG; we can pipe replay_trace output into the graph during generation, not as a post-hoc pass. This matters for cost discipline: bad teacher outputs get filtered before contributing to OpenRouter spend on subsequent steps.
- YAML-recipe driven. Recipes live in recipes/replaysim/ and can be version-controlled. A user can swap normalization recipes without touching framework code.
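The cost argument behind the streaming point can be sketched with a plain Python generator. This is one possible reading of "filtered before contributing to spend on subsequent steps": a teacher whose output fails the filter stops being called for the rest of the trace. The teacher callables, the keep predicate, and the max_strikes policy are hypothetical; none of this is data-juicer's or the framework's actual API.

```python
def replay_with_inline_filter(steps, teachers, keep, max_strikes=1):
    """Yield (step_index, teacher_name, output) for outputs that pass `keep`.

    steps:    iterable of prompts, one per LLM step in the trace
    teachers: dict mapping a teacher name to a callable prompt -> completion
    keep:     predicate deciding whether an output is usable
    A teacher that accumulates `max_strikes` rejected outputs is skipped
    from then on, so it incurs no further calls.
    """
    strikes = {name: 0 for name in teachers}
    for index, prompt in enumerate(steps):
        for name, call in teachers.items():
            if strikes[name] >= max_strikes:
                continue  # benched teacher: no call, no spend
            output = call(prompt)
            if keep(output):
                yield index, name, output
            else:
                strikes[name] += 1
```

Because the filter runs inside the generation loop, the decision to stop paying for a teacher is made per step rather than after the whole trace has been replayed.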
Consequences
Accepted
- New module composer_replication.replaysim lifts the existing teacher_replay logic out of the package's flat namespace and adds:
  - composer_replication.replaysim.normalize: DJNormalizer adapter that wraps data-juicer op graphs around replay_trace output
  - recipes/replaysim/default.yaml: base normalization recipe (length filter + chat-template validation + per-turn dedup)
  - Optional recipes/replaysim/with_disagreement_filter.yaml: adds a semantic-similarity filter that drops "false disagreements" where teachers used different wording for the same answer
- New optional dependency [replaysim] extra in pyproject.toml: pip install -e .[replaysim] pulls data-juicer. Core install doesn't require it.
- The existing replay_trace and extract_dpo_pairs keep their signatures. The normalizer is opt-in via a normalizer= kwarg on a new replay_and_normalize_trace convenience function.
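The opt-in shape described above might look roughly like the sketch below. The real signatures of replay_trace and extract_dpo_pairs are not given in this ADR, so they are passed in as the hypothetical replay and extract callables; the DPOPair dataclass is a stand-in, and token-overlap (Jaccard) similarity is used only as a cheap placeholder for the semantic-similarity filter, which would more plausibly use embeddings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DPOPair:
    """Hypothetical stand-in for the framework's DPOPair."""
    prompt: str
    chosen: str
    rejected: str


def token_jaccard(a, b):
    """Word-set overlap in [0, 1]; a crude proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_false_disagreements(pairs, threshold=0.8):
    """Drop pairs whose chosen and rejected texts are near-identical."""
    return [p for p in pairs if token_jaccard(p.chosen, p.rejected) < threshold]


def replay_and_normalize_trace(trace, replay, extract, normalizer=None):
    """Replay a trace, extract DPO pairs, then optionally normalize them.

    With normalizer=None the result is exactly what replay + extract
    produce, so callers that never opt in see no behavior change.
    """
    pairs = extract(replay(trace))
    if normalizer is None:
        return pairs
    return normalizer(pairs)
```

Keeping the normalizer as a plain callable means a data-juicer-backed adapter and a hand-written filter are interchangeable from the caller's side.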
One-day spike before merge
pair_preference_mapper in data-juicer might unconditionally re-synthesize
the rejected text via an LLM call. We already have rejected from
teacher disagreement and don't want to pay another API call. The recon
flagged this — verify by reading the mapper's source, and if it's LLM-bound,
substitute a plain validator that checks the field exists + isn't empty.
If the spike fails (the mapper IS LLM-bound and isn't easily replaceable),
fall back to writing a custom DJOp subclass that validates pre-existing
DPO pairs without re-synthesis. ~50 LOC.
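The "plain validator" fallback could be as small as the following sketch. It only checks that each field exists and is non-empty, and makes no LLM call. The field names prompt, chosen, and rejected are assumptions about the sample layout, and wrapping this in a data-juicer op class is deliberately left out because that base-class API has not been verified here.

```python
# Assumed field names for a DPO sample; adjust to the real schema.
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_dpo_sample(sample):
    """Return a list of problems; an empty list means the pair is usable as-is."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = sample.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(value, str) or not value.strip():
            problems.append(f"empty or non-string field: {field}")
    return problems


def filter_valid_pairs(samples):
    """Keep only samples that pass validation, without re-synthesizing anything."""
    return [s for s in samples if not validate_dpo_sample(s)]
```

Returning the list of problems rather than a bare boolean keeps an audit trail of why a pair was dropped.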
Rejected paths
- datatrove: would have required hand-rolling all chat-template logic on top of flat-text ops. Bigger ongoing maintenance cost than data-juicer's native multi-turn support.
- nemo-curator: GPU-mandatory ops mean we'd need to pay for GPU during dataset generation (separate from the replay phase, which is already GPU-free). Net cost increase for no quality win.
- distilabel: too broad; its pipeline abstraction would replace our replay_trace entirely. We'd lose direct OpenRouter cost control + the audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.
Future work
- v0.2: add a recipes/replaysim/altered_minds.yaml for the user's altered-minds workstream tie-in (per Wave 13 expansion)
- v0.3: revisit if distilabel becomes more mature and the migration cost vs ongoing-maintenance balance shifts
Source
docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md (2026-05-26
subagent recon, primary-sourced from each repo's GitHub + DeepWiki).