docs+seeds: file remaining-work issues (Seeds) + project-state doc
Initialize git-native issue tracking (.seeds/) and file 6 remaining-work
issues with dependency wiring:
- A2 SDPO-only arm (build runner + error-trace dataset) [ready]
- A3 replay-DPO-only arm (needs preference corpus) [blocked on A2]
- A4 combined arm + final A0-A4 comparison [blocked on A2+A3]
- Higher-lr PO-objective sweep (make DAPO/GSPO clip-higher fire) [ready]
- Docker substrate e2e (hardware-gated) [ready]
- SECURITY: rotate exposed HF token (user-only, P1) [ready]
Add docs/PROJECT_STATE_AND_REMAINING_WORK.md: single-page snapshot of what
is proven, what remains, the Channel-3-not-from-Composer provenance
correction (per ADR-014), and the load-bearing gotchas (main-lags-master,
strip_thinking=False for SDPO, OUTPUT_DIR clobber, objective-diagnostic
logging).
Channel-3 framing corrected to "framework's own additive research channel,
NOT Composer's recipe" throughout.
- .gitattributes +3 -0
- docs/PROJECT_STATE_AND_REMAINING_WORK.md +88 -0
@@ -0,0 +1,3 @@
.seeds/issues.jsonl merge=union
.seeds/templates.jsonl merge=union
.seeds/plans.jsonl merge=union

@@ -0,0 +1,88 @@
# Project State & Remaining Work — composer-replication-framework

**Snapshot date:** 2026-06-09
**Framework HEAD:** `aae66fa` (ADR-014 PO-objective menu) + `8d2e6fc` (seeds sync)
**LMA consumer HEAD:** `37c0ea5` (DAPO-vs-Dr.GRPO washout) on `Codeseys/llm-mental-alterations`

This doc is the single-page "where things stand" record. Issue tracking is git-native
via [Seeds](https://github.com/jayminwest/seeds) (`sd` CLI) in `.seeds/`. Run `sd ready`
for unblocked work, `sd list` for everything, `sd show <id>` for detail.

---

## What this framework is (honest one-paragraph)

A reusable RL/data-gen framework that replicates Cursor's **Composer 2.5** post-training
recipe at small scale, whose north-star consumer is the **llm-mental-alterations (LMA)**
project (apply targeted RL to a personality-altered SFT model and measure washout vs
amplification). Past-skeleton, production-shaped: 8 subpackages, 232 tests pass / 18 skip,
installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.

## The 3-channel loss — with HONEST provenance

`grpo + alpha_sdpo·sdpo_kl + beta_replay·trace_replay_dpo`

| Channel | What | Composer provenance |
|---|---|---|
| **1 — base PO objective** | Selectable MENU (ADR-014): `loss_type ∈ {grpo, dr_grpo, bnpo, dapo, cispo, luspo, sapo, vespo}`, default **Dr.GRPO** | ✅ CONFIRMED — Composer's base is Dr.GRPO (k1 KL; TRL uses k3, documented delta) |
| **2 — SDPO/OPSD** | On-policy self-distillation vs hint-conditioned teacher | ✅ CONFIRMED — IS Composer 2.5's "targeted RL with textual feedback" |
| **3 — trace-replay-DPO** | Preference DPO vs frontier teachers | ⚠️ **FRAMEWORK'S OWN ADDITIVE CHANNEL — NOT Composer.** Primary sources (blog + arXiv:2603.24477 §4.1) have no DPO / preference-pairs / multi-teacher. It's a deliberate β-gated research probe in the A0→A4 ladder. The code is fine; only the "this replicates Composer" framing was ever wrong, and ADR-014 records the correction. |
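
As a minimal sketch of how the three channels combine, assuming each channel has already been reduced to a scalar loss: the helper name `combined_loss` and the example numbers below are hypothetical and are not the framework's API. Only the weighting `grpo + alpha_sdpo·sdpo_kl + beta_replay·trace_replay_dpo` comes from the formula above.

```python
def combined_loss(grpo: float, sdpo_kl: float, trace_replay_dpo: float,
                  alpha_sdpo: float = 0.0, beta_replay: float = 0.0) -> float:
    """Weighted sum of the three channel losses (hypothetical helper)."""
    return grpo + alpha_sdpo * sdpo_kl + beta_replay * trace_replay_dpo


# Illustrative numbers only, not measured values.
base_only = combined_loss(0.5, 0.2, 0.1)
with_sdpo = combined_loss(0.5, 0.2, 0.1, alpha_sdpo=0.5)
all_three = combined_loss(0.5, 0.2, 0.1, alpha_sdpo=0.5, beta_replay=1.0)
print(base_only, with_sdpo, all_three)
```

A weight of zero switches its channel off, which is one way to read the "β-gated" description of Channel 3: with `beta_replay=0` the trace-replay-DPO term contributes nothing to the total.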

## What's PROVEN (don't re-litigate)

- **CPU SDPO fires** through the real collator alignment indices (JSD 0.057, gradient flows).
- **A10G GPU train-proof**: Qwen2.5-0.5B, bf16, 30 steps, loss 4.73→0.005 monotone.
- **A1 8B run EXECUTED**: 200 steps GRPO-only on `wave-h-5-llama-31-8b-seed42`, A100, reward
0.331→0.751, KL≈0.0014. Alteration eval at N=895.
- **DAPO-vs-Dr.GRPO washout** (A1): washout is **objective-INVARIANT at lr=1e-6** because DAPO's
clip-higher (`clip_ratio/high_mean`) never engaged (=0 across all 200 steps). We measured
"DAPO that couldn't fire its difference," not a real objective comparison.
- **nanochat post-training arc** (separate Modal laydown): SFT ChatCORE 0.3076, GSM8K-GRPO
Pass@1 doubled 0.0525→0.1250, SDPO end-to-end train smoke PASS.
- **832/832 real-trace SDPO alignment** (Wave 21) with `strip_thinking=False`.
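
The DAPO-vs-Dr.GRPO result above hinges on checking whether the distinguishing diagnostic ever moved. A minimal sketch of that check follows, assuming per-step metrics are available as a list of dicts keyed by metric name. The log format and the helper `knob_engaged` are hypothetical; only the metric name `clip_ratio/high_mean` comes from the text.

```python
from typing import Iterable, Mapping


def knob_engaged(step_logs: Iterable[Mapping[str, float]],
                 key: str = "clip_ratio/high_mean",
                 eps: float = 0.0) -> bool:
    """Return True if the diagnostic exceeded eps on at least one logged step."""
    values = [log[key] for log in step_logs if key in log]
    if not values:
        # A missing metric is not the same as a metric that stayed at zero.
        raise KeyError(f"diagnostic {key!r} was never logged")
    return any(v > eps for v in values)


# Illustrative logs: the diagnostic stays at 0 for all 200 steps.
inert_run = [{"clip_ratio/high_mean": 0.0} for _ in range(200)]
print(knob_engaged(inert_run))  # False: the knob never engaged
```

Raising on a missing key rather than returning `False` keeps "not logged" distinct from "logged and inert", since the two call for different follow-ups.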

## Remaining work (filed as Seeds — `sd ready` / `sd show <id>`)

### Ready / unblocked
- **`…-cb74` (P1 security):** ROTATE the exposed HF write-token `hf_uRP…`. On-disk plaintext
scrubbed 2026-05-29; token itself never rotated → treat as compromised. **User-only.**
- **`…-211e` (P2):** Higher-lr PO-objective sweep — make DAPO/GSPO clip-higher actually fire.
The informative experiment the washout null pointed to. Likely *more* informative than the
A2-A4 ladder at lr=1e-6 (same inert-knob risk). GPU + budget-gated.
- **`…-4936` (P2):** A2 SDPO-only ladder arm — build the runner + error-trace dataset. The
big build: `modal_ladder_a1.py` is hardcoded to A1; `train_grpo.py` is a plan-builder with a
placeholder pip name. Needs a real SDPO dataset (`strip_thinking=False`, seq≥1536) + an A100
entrypoint off the proven A1 image. GPU + budget-gated.
- **`…-245d` (P4):** Docker substrate e2e — test exists + skips cleanly; hardware-blocked (no
Docker host). Run the 4 gates against a real container when a Docker host exists.

### Blocked (dependency-gated)
- **`…-42f5` (P3):** A3 replay-DPO-only arm — needs a trace-replay-DPO preference corpus.
Blocked on A2 (shared runner/infra). Framework's-own-channel washout probe.
- **`…-dd7b` (P3):** A4 combined arm + final A0–A4 comparison table. Blocked on A2 **and** A3
(its value is reading the combined effect against the isolated baselines).
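
The ready/blocked split above follows from the dependency wiring. As a rough illustration only (this is not how the Seeds `sd` CLI is implemented; the dict layout and the `ready_issues` helper are hypothetical, and short labels stand in for the elided issue ids):

```python
def ready_issues(deps: dict[str, set[str]], closed: set[str]) -> list[str]:
    """Open issues whose dependencies are all closed, in sorted order."""
    return sorted(
        issue for issue, needs in deps.items()
        if issue not in closed and needs <= closed
    )


# Dependency wiring as described in this document.
deps = {
    "A2": set(),
    "A3": {"A2"},
    "A4": {"A2", "A3"},
    "lr-sweep": set(),
    "docker-e2e": set(),
    "rotate-token": set(),
}

print(ready_issues(deps, closed=set()))
print(ready_issues(deps, closed={"A2"}))
```

With nothing closed, A3 and A4 are absent from the ready list; closing A2 releases A3, while A4 stays blocked until A3 is also closed.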

## Load-bearing gotchas (carry these forward)

1. **`main` LAGS `master` on both HF repos.** Any Modal `git clone … && pip install` MUST
`git checkout master` / pin a master SHA, or the import fails with `ImportError: make_dr_grpo_config`.
2. **SDPO on real agent traces requires `strip_thinking=False`** — ~67% of error-recovery
turns are pure thinking; stripping yields empty masks. Keep `max_seq_len ≥ 1536`.
3. **OUTPUT_DIR clobber:** any sweep dimension (objective/lr/seed) you'll compare side-by-side
MUST be in the output path, or the later run overwrites the earlier checkpoint.
4. **Size Modal timeout off the SLOWEST objective** (DAPO overlong-mask ~26s/it, not Dr.GRPO ~17s/it).
5. **Log the distinguishing diagnostic** for any PO-objective ablation (`clip_ratio/high_mean`
for DAPO, sequence-level ratio for GSPO). A 0 means "knob didn't engage," NOT "objectives equal."
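
Gotchas 3 and 4 can be sketched in a few lines. The helper names, the directory layout, and the 1.5x safety margin are assumptions made for illustration; the per-iteration timings (~26 s/it for DAPO, ~17 s/it for Dr.GRPO) and the 200-step run length are taken from this document.

```python
from pathlib import PurePosixPath


def run_output_dir(root: str, objective: str, lr: float, seed: int) -> str:
    """Encode every compared sweep dimension in the path so runs cannot collide."""
    return str(PurePosixPath(root) / f"obj-{objective}" / f"lr-{lr:g}" / f"seed-{seed}")


def timeout_seconds(steps: int, sec_per_it: dict[str, float], margin: float = 1.5) -> int:
    """Size the timeout off the slowest objective in the sweep."""
    return int(steps * max(sec_per_it.values()) * margin)


paths = {run_output_dir("outputs", obj, 1e-6, 42) for obj in ("dapo", "dr_grpo")}
assert len(paths) == 2  # distinct objectives get distinct directories

print(run_output_dir("outputs", "dapo", 1e-6, 42))            # outputs/obj-dapo/lr-1e-06/seed-42
print(timeout_seconds(200, {"dapo": 26.0, "dr_grpo": 17.0}))  # 7800
```

Sizing from the Dr.GRPO timing alone would give 200 × 17 × 1.5 = 5100 s, less than the 5200 s a 200-step DAPO run needs at ~26 s/it before any margin.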

## In-flight as of this snapshot

- A `docs/refine-2026-06` branch (ccode-ultracode lane) is refining the `docs/` corpus:
propagating the Channel-3 provenance correction + ADR-014 menu into stale docs, archiving
dated `WAVE_*_FINAL_REVIEW` artifacts, refreshing README/BACKLOG. Docs-only; human review
before merge.

## Pointers

- Canonical wiki hub: `~/wiki/projects/composer-replication-framework.md`
- ADRs: `docs/adrs/` (001–014); ADR-014 is newest + records the Channel-3 provenance decision.
- LMA consumer + A1 results: `Codeseys/llm-mental-alterations` → `composer_replication_runs/`.