# ADR-004: Replaysim normalization layer for the trace-replay channel

**Status**: Accepted
**Date**: 2026-05-26
**Wave**: 13 (deep work loop, expansion phase)

## Context

The brief's V5 clause says:

> use traces from an llm-application usage then replay the traces with
> different models to see at each llm-step what the llm would do. by doing
> this we get distillation data from any number of models that could be
> used to train the target model further

On 2026-05-26 the user added: *"see if we can leverage [a normalization
library] to normalize the data while also making the replaysim dataset
generation."*

The framework currently provides `composer_replication.teacher_replay`:

- `replay_trace()`: N-teacher OpenRouter replay, returns
  `list[TeacherCallResult]`
- `extract_dpo_pairs()`: converts teacher disagreement to `list[DPOPair]`

This produces preference-pair training data, but with **zero normalization**:
no dedup, no length filtering, no language detection, no quality
filtering, no chat-template validation. The output is closer to "raw
LLM API responses" than to a "training-ready dataset."

For the replaysim to power downstream RL training (V6), the dataset needs
to be production-quality. Hand-rolling that pipeline is a tax we'd rather
not pay.
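
To make the gap concrete, here is a minimal sketch of two of the missing passes (length filtering and dedup) written by hand in plain Python. `PairRecord` is a hypothetical stand-in for the framework's `DPOPair`, whose real fields may differ, and the thresholds are placeholders. It illustrates the kind of logic a normalization library would take over and is not the framework's implementation.

```python
from dataclasses import dataclass


# Hypothetical stand-in for the framework's DPOPair; the real fields may differ.
@dataclass(frozen=True)
class PairRecord:
    prompt: str
    chosen: str
    rejected: str


def normalize_pairs(pairs, min_chars=1, max_chars=4000):
    """Apply a length filter and exact-match dedup to preference pairs."""
    seen = set()
    kept = []
    for pair in pairs:
        texts = (pair.chosen, pair.rejected)
        # Length filter: drop pairs with an empty or oversized completion.
        if any(not (min_chars <= len(t.strip()) <= max_chars) for t in texts):
            continue
        # A pair whose two sides are identical carries no preference signal.
        if pair.chosen.strip() == pair.rejected.strip():
            continue
        # Exact dedup on whitespace-normalized, lowercased content.
        key = tuple(" ".join(s.lower().split()) for s in (pair.prompt, *texts))
        if key in seen:
            continue
        seen.add(key)
        kept.append(pair)
    return kept


raw = [
    PairRecord("Fix the bug", "Use a lock.", "Ignore it."),
    PairRecord("Fix the bug", "use a  lock.", "ignore it."),  # duplicate once normalized
    PairRecord("Fix the bug", "", "Ignore it."),              # empty chosen text
    PairRecord("Fix the bug", "Same.", "Same."),              # no preference signal
]
clean = normalize_pairs(raw)
```

Even this small version leaves out language detection, quality filtering, and chat-template validation, which is where the maintenance cost of a hand-rolled pipeline would accumulate.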

## Options considered

We audited five candidates in `docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md`:

| Library | License | Multi-turn? | DPO pairs? | Streaming? | GPU required? | Verdict |
|---|---|---|---|---|---|---|
| HuggingFace `datatrove` | MIT | ❌ flat-text only | ❌ | ✅ | ❌ | Deal-breaker on multi-turn |
| Alibaba `data-juicer` | Apache-2 | ✅ native `messages` ops | ✅ `pair_preference_mapper` | ✅ | ❌ for ops we need | **Chosen** |
| NVIDIA `nemo-curator` | Apache-2 | partial | ❌ | ✅ | ✅ mandatory for differentiating ops | Reject: GPU-bound for the ops we need |
| Argilla `distilabel` | Apache-2 | ✅ native chat | ✅ formatters | ✅ | ❌ | Reject: would replace teacher orchestration, not just normalize |
| Databricks `lilac` | - | n/a | n/a | n/a | n/a | Reject: archived 2024-03 |

## Decision

**Adopt `data-juicer` (Alibaba/modelscope, Apache-2.0, last push 2026-05-25, 6.4k★).**

Reasons:

1. **It's the only candidate with native multi-turn + DPO support in the
   *normalization* op-graph.** It has `pair_preference_mapper`,
   `dialog_intent_detection_mapper`, `dialog_topic_detection_mapper`,
   and similar ops that operate on chat-format messages directly.
2. **CPU-runnable for our op set.** The differentiating ops we need
   (length filter, language ID, chat-template validation, dedup) all
   work on CPU. We avoid the NeMo-Curator GPU dependency entirely.
3. **Streaming-friendly.** The op graph is a DAG; we can pipe `replay_trace`
   output into the graph during generation, not as a post-hoc pass. This
   matters for cost discipline: bad teacher outputs get filtered before
   contributing to OpenRouter spend on subsequent steps.
4. **YAML-recipe driven.** Recipes live in `recipes/replaysim/` and can
   be version-controlled. A user can swap normalization recipes without
   touching framework code.
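
The cost argument in reason 3 can be sketched in plain Python. Everything here is hypothetical: `replay_with_inline_filter`, the `teachers` mapping, and the `keep` predicate are illustrative names, not part of `replay_trace` or data-juicer. The early exit is one possible reading of "filtered before contributing to spend on subsequent steps".

```python
from typing import Callable, Iterable, Iterator


def replay_with_inline_filter(
    steps: Iterable[str],
    teachers: dict[str, Callable[[str], str]],
    keep: Callable[[str], bool],
) -> Iterator[tuple[int, str, str]]:
    """Yield (step_index, teacher_name, output) for outputs that pass `keep`.

    Replay stops at the first step where no teacher output survives, so the
    remaining steps of that trace incur no further teacher calls.
    """
    for index, prompt in enumerate(steps):
        survivors = 0
        for name, call in teachers.items():
            output = call(prompt)  # stand-in for a paid API call
            if keep(output):
                survivors += 1
                yield index, name, output
        if survivors == 0:
            return


calls = []  # records every simulated API call so the saving is observable


def make_teacher(name, reply):
    def call(prompt):
        calls.append((name, prompt))
        return reply(prompt)
    return call


teachers = {
    "teacher_a": make_teacher("teacher_a", lambda p: "" if p == "step-2" else f"a:{p}"),
    "teacher_b": make_teacher("teacher_b", lambda p: "" if p == "step-2" else f"b:{p}"),
}
kept = list(
    replay_with_inline_filter(
        ["step-1", "step-2", "step-3"],
        teachers,
        keep=lambda out: bool(out.strip()),
    )
)
```

In this toy run both teachers return empty output at `step-2`, so `step-3` is never sent to either of them. A post-hoc pass would have paid for those calls before discarding the trace.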

## Consequences

### Accepted

- New module `composer_replication.replaysim` lifts the existing
  `teacher_replay` logic out of the package's flat namespace and adds:
  - `composer_replication.replaysim.normalize`: `DJNormalizer` adapter
    that wraps `data-juicer` op graphs around `replay_trace` output
  - `recipes/replaysim/default.yaml`: base normalization recipe (length
    filter + chat-template validation + per-turn dedup)
  - Optional `recipes/replaysim/with_disagreement_filter.yaml`: adds a
    semantic-similarity filter that drops "false disagreements" where
    teachers used different wording for the same answer
- New optional `[replaysim]` extra in `pyproject.toml`:
  `pip install -e .[replaysim]` pulls in `data-juicer`. The core install
  doesn't require it.
- The existing `replay_trace` and `extract_dpo_pairs` keep their
  signatures. The normalizer is opt-in via a `normalizer=` kwarg on a
  new `replay_and_normalize_trace` convenience function.
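
A minimal sketch of how the opt-in `normalizer=` kwarg and the disagreement filter could fit together. The signature of `replay_and_normalize_trace` shown here is an assumption (this ADR names the function but does not specify it), the dict record shape is hypothetical, and `difflib.SequenceMatcher` is a lexical stand-in for the semantic-similarity filter, which would need an embedding model.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional

Pair = dict  # hypothetical record shape: {"prompt", "chosen", "rejected"}


def drop_false_disagreements(pairs: list[Pair], threshold: float = 0.9) -> list[Pair]:
    """Drop pairs whose chosen and rejected texts are near-identical.

    SequenceMatcher measures character overlap, not meaning; it stands in
    for the semantic-similarity model the recipe would use.
    """
    kept = []
    for pair in pairs:
        ratio = SequenceMatcher(
            None, pair["chosen"].lower(), pair["rejected"].lower()
        ).ratio()
        if ratio < threshold:
            kept.append(pair)
    return kept


def replay_and_normalize_trace(
    trace: list[str],
    replay: Callable[[list[str]], list[Pair]],
    normalizer: Optional[Callable[[list[Pair]], list[Pair]]] = None,
) -> list[Pair]:
    """Run the replay, then apply the normalizer only if one was passed."""
    pairs = replay(trace)
    return pairs if normalizer is None else normalizer(pairs)


def fake_replay(trace):
    """Stand-in for the real replay; returns one false and one real disagreement."""
    return [
        {"prompt": trace[0], "chosen": "Return the cached value.",
         "rejected": "return the cached value"},
        {"prompt": trace[0], "chosen": "Return the cached value.",
         "rejected": "Raise a KeyError instead."},
    ]


raw = replay_and_normalize_trace(["lookup"], fake_replay)
clean = replay_and_normalize_trace(
    ["lookup"], fake_replay, normalizer=drop_false_disagreements
)
```

Keeping the normalizer as an optional callable means callers who skip the `[replaysim]` extra get the unnormalized pairs unchanged, which matches the goal of leaving the core install free of `data-juicer`.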

### One-day spike before merge

`pair_preference_mapper` in data-juicer might unconditionally re-synthesize
the `rejected` text via an LLM call. We already have `rejected` from
teacher disagreement and don't want to pay for another API call. The recon
flagged this. Verify by reading the mapper's source; if it's LLM-bound,
substitute a plain validator that checks that the field exists and isn't empty.

If the spike fails (the mapper IS LLM-bound and isn't easily replaceable),
fall back to writing a custom `DJOp` subclass that validates pre-existing
DPO pairs without re-synthesis (about 50 LOC).
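
The plain validator could look roughly like the sketch below. It is deliberately written as a standalone function, not as a data-juicer op, because the base-class interface still has to be confirmed during the spike. The field names in `REQUIRED_FIELDS` are assumptions about the pair schema.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")  # assumed field names


def check_pair(sample: dict) -> list[str]:
    """Return a list of problems; an empty list means the pair is usable as-is.

    Only inspects fields that already exist. No LLM call is made and no
    text is re-synthesized.
    """
    problems = []
    for field in REQUIRED_FIELDS:
        value = sample.get(field)
        if value is None:
            problems.append(f"missing:{field}")
        elif not isinstance(value, str) or not value.strip():
            problems.append(f"invalid:{field}")
    return problems


samples = [
    {"prompt": "p", "chosen": "good", "rejected": "bad"},
    {"prompt": "p", "chosen": "good"},                     # rejected is absent
    {"prompt": "p", "chosen": "good", "rejected": "   "},  # rejected is blank
]
valid = [s for s in samples if not check_pair(s)]
report = [check_pair(s) for s in samples]
```

Returning the list of problems, not just a boolean, keeps a record of why each pair was dropped, which may be useful for the audit trail.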

### Rejected paths

- **`datatrove`**: would have required hand-rolling all chat-template logic
  on top of flat-text ops. Bigger ongoing maintenance cost than
  data-juicer's native multi-turn support.
- **`nemo-curator`**: GPU-mandatory ops mean we'd need to pay for GPU during
  dataset generation (separate from the replay phase, which is already
  GPU-free). Net cost increase for no quality win.
- **`distilabel`**: too broad. Its pipeline abstraction would replace our
  `replay_trace` entirely. We'd lose direct OpenRouter cost control and the
  audit trail. Possible v0.3 migration if data-juicer becomes a bottleneck.

### Future work

- v0.2: add a `recipes/replaysim/altered_minds.yaml` for the user's
  `altered-minds` workstream tie-in (per Wave 13 expansion)
- v0.3: revisit if `distilabel` becomes more mature and the balance between
  migration cost and ongoing maintenance shifts

## Source

`docs/research/REPLAYSIM_NORMALIZATION_RECONNAISSANCE.md` (2026-05-26
subagent recon, primary-sourced from each repo's GitHub + DeepWiki).