Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
docs(adr-013): close acceptance-gate boxes 1-5; only user-gated spend remains
The LMA-side runner scaffold (the one outstanding non-budget-gated box) now
exists in the sister repo: llm-mental-alterations/composer_replication_runs/
with 15 green mock-driven tests (collect+pass with no torch/trl/modal/framework
installed), documented in LMA ADR-0017. Marks boxes 1-5 done; the only open
item is the explicit user go/no-go on real LMA-checkpoint / Modal budget spend.
docs/adrs/ADR-013-lma-integration-channel-ladder.md
CHANGED
@@ -90,21 +90,27 @@ experiment design.** Specifically:
 
 ## Acceptance gate
 
-- [
+- [x] `MMLUFormatReward` implemented + tested: correct→+1, wrong→0,
 unparseable→−0.2, multiple-answers→−0.1, length-penalty; a "always C" /
 option-prior exploit is detectable via logged option distribution. Rationale
 style is NOT scored.
-- [
+- [x] `dual_kl_logger` logs both KLs; unit test on a toy policy/ref pair asserts
 KL(p‖p)==0 and KL increases as the policy moves.
-- [
+- [x] `channel_ladder_configs()` returns A0–A4 with the documented α/β/kl_beta;
 unit test asserts A1 has both channels off, A2 SDPO-only, A3 replay-only.
-- [
+- [x] LMA runner scaffold exists with mock-driven unit tests (no real model load,
 no Modal, no budget spend) proving the wiring: altered-ckpt → collator →
-ComposerReplicationTrainer(A2 config) → reward_fn → step.
-
+ComposerReplicationTrainer(A2 config) → reward_fn → step. **Shipped in the LMA
+repo (NOT this one), 2026-05-29:** `llm-mental-alterations/composer_replication_runs/`
+(`moral_scenarios_replay.py`, `train_grpo.py`, `eval_post_rl.py`,
+`tests/test_runner_wiring.py` — 15 mock tests green, collect+pass with no
+torch/trl/modal/framework installed). Documented in LMA ADR-0017.
+- [x] `docs/ALTERED_MINDS_TIE_IN.md` updated: Phase-3 hyperparameters replaced by
 a pointer to this ADR's ladder; the amplification-risk finding documented.
 - [ ] **Out of scope (user-gated):** any real LMA checkpoint load or Modal/budget
-spend. Documented as the explicit go-decision.
+spend. Documented as the explicit go-decision. ← **the single remaining gate;
+this is the go/no-go the user holds.** All capability is built + tested up to
+this line.
 
 ## More Information
 
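
The first two acceptance-gate boxes describe a reward mapping and a KL sanity check that are easy to picture in code. The following is a minimal, standalone sketch and not the repository's `MMLUFormatReward` or `dual_kl_logger`. Only the reward values (+1, 0, −0.2, −0.1) come from the ADR text. The `Answer: <letter>` output format, the function names, the `max_chars` threshold, the `length_penalty` magnitude, and the choice to measure KL against a single reference distribution are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: the completion states its choice as "Answer: <A-D>".
_ANSWER = re.compile(r"Answer:\s*\(?([A-D])\b")


def parse_options(completion):
    """Return the sorted distinct option letters stated in a completion."""
    return sorted(set(_ANSWER.findall(completion)))


def mmlu_format_reward(completion, gold, max_chars=2000, length_penalty=0.1):
    """Score one completion; max_chars and length_penalty are hypothetical."""
    options = parse_options(completion)
    if not options:
        reward = -0.2          # unparseable
    elif len(options) > 1:
        reward = -0.1          # multiple answers
    else:
        reward = 1.0 if options[0] == gold else 0.0
    if len(completion) > max_chars:
        reward -= length_penalty  # assumed flat penalty for overlong outputs
    return reward


def option_distribution(completions):
    """Count chosen options so an 'always C' policy shows up in the logs."""
    counts = Counter()
    for completion in completions:
        options = parse_options(completion)
        counts[options[0] if len(options) == 1 else "invalid"] += 1
    return counts


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mix(ref, target, t):
    """Move a toy policy a fraction t of the way from ref to target."""
    return [(1 - t) * r + t * g for r, g in zip(ref, target)]


if __name__ == "__main__":
    print(mmlu_format_reward("Because of X. Answer: B", "B"))
    print(option_distribution(["Answer: C", "Answer: C", "Answer: A", "unsure"]))
    ref = [0.25, 0.25, 0.25, 0.25]
    target = [0.7, 0.1, 0.1, 0.1]
    print([kl_divergence(mix(ref, target, t), ref) for t in (0.0, 0.5, 1.0)])
```

Counting options separately from scoring them keeps the exploit check independent of the reward: a policy that always emits the same letter can still score well on a skewed batch, but the skew is visible in the logged counts.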