How to use Codeseys/composer-replication-framework with Transformers:

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto")
```
docs: main is canonical branch; retire main-lags-master foot-gun
Record the 2026-06-09 decision that `main` is the canonical branch on both
HF repos. `main` and `master` converged via fast-forward; push to `main`
going forward. Reframe gotcha #1 from "main lags master" to "RESOLVED, keep
it synced": the Modal SHA pins stay correct, but their stale-main rationale
no longer applies.
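The convergence recorded here (`main` catching up to `master` by fast-forward) is a statement about commit ancestry: a branch can be fast-forwarded only if its tip is an ancestor of the other tip. Below is a minimal sketch of that check, assuming a hypothetical in-memory commit graph with made-up SHAs; it is an illustration, not tooling from this repo.

```python
from typing import Dict, List

# Hypothetical toy commit graph: each commit SHA maps to its parent SHAs.
CommitGraph = Dict[str, List[str]]


def is_ancestor(graph: CommitGraph, ancestor: str, descendant: str) -> bool:
    """True if `ancestor` is reachable from `descendant` by walking parent links.

    A commit counts as its own ancestor, as with `git merge-base --is-ancestor`.
    """
    stack = [descendant]
    seen = set()
    while stack:
        sha = stack.pop()
        if sha == ancestor:
            return True
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(graph.get(sha, []))
    return False


def branch_sync_status(graph: CommitGraph, main: str, master: str) -> str:
    """Classify how the `main` tip relates to the `master` tip."""
    if main == master:
        return "in sync"
    if is_ancestor(graph, main, master):
        return "main lags master (fast-forward main)"
    if is_ancestor(graph, master, main):
        return "master lags main (fast-forward master mirror)"
    return "diverged (needs merge or rebase)"


if __name__ == "__main__":
    # Toy linear history c1 <- c2 <- c3, with main frozen at c1.
    history = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
    print(branch_sync_status(history, main="c1", master="c3"))  # main lags master (fast-forward main)
    # After the fast-forward, both tips point at c3.
    print(branch_sync_status(history, main="c3", master="c3"))  # in sync
```

Against a real clone, `git merge-base --is-ancestor main master` performs the same test: it exits 0 when `main` is an ancestor of (or equal to) `master`, that is, when fast-forwarding `main` is possible.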
docs/PROJECT_STATE_AND_REMAINING_WORK.md
```diff
@@ -62,10 +62,22 @@ installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.
 - **`…-dd7b` (P3):** A4 combined arm + final A0–A4 comparison table. Blocked on A2 **and** A3
 (its value is reading the combined effect against the isolated baselines).
 
+## Branch convention (canonical: `main`)
+
+**`main` is the canonical branch on both HF repos** (decided 2026-06-09). As of that date
+`main == master == fb13ea3` (framework) / `37c0ea5` (LMA) — converged via clean fast-forward.
+**Push to `main`** (or to both in lockstep). `master` is retained only as a mirror; do not let
+the two drift. A fresh `git clone` now defaults to `main` and gets the complete tree (incl.
+`make_dr_grpo_config` + ADR-014), so the old "must checkout master" foot-gun is RETIRED as long
+as `main` stays current.
+
 ## Load-bearing gotchas (carry these forward)
 
-1. **
-
+1. **Branch sync (RESOLVED 2026-06-09, keep it that way).** `main` previously LAGGED `master`
+(frozen at Wave 19), which is why older Modal images pin a `master` SHA "because main predates
+`make_dr_grpo_config`." That divergence is now fixed (`main == master`). **Keep pushing to
+`main`** so it never lags again; the SHA pins in Modal images remain correct but their
+"main is stale" rationale no longer applies once both branches stay in sync.
 2. **SDPO on real agent traces requires `strip_thinking=False`** — ~67% of error-recovery
 turns are pure thinking; stripping yields empty masks. Keep `max_seq_len ≥ 1536`.
 3. **OUTPUT_DIR clobber:** any sweep dimension (objective/lr/seed) you'll compare side-by-side
```
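Gotcha 2 can be made concrete with a toy loss mask. This is a sketch under stated assumptions, not the framework's code: the `(kind, n_tokens)` segment format, `assistant_loss_mask`, and `SDPOTraceConfig` are hypothetical; only `strip_thinking`, `max_seq_len`, and the 1536 floor come from the gotcha itself. It shows why a turn that is pure thinking contributes zero supervised tokens once thinking is stripped.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical trace format: an assistant turn is a list of (kind, n_tokens)
# segments, where kind is "thinking" or "text".
Segment = Tuple[str, int]


def assistant_loss_mask(turn: List[Segment], strip_thinking: bool) -> List[int]:
    """Return the loss mask (1 = supervised token) for one assistant turn.

    When strip_thinking is True, thinking segments are removed from the sequence
    before masking, so a turn that is pure thinking leaves an empty mask.
    """
    mask: List[int] = []
    for kind, n_tokens in turn:
        if strip_thinking and kind == "thinking":
            continue  # segment dropped entirely, contributes no tokens
        mask.extend([1] * n_tokens)
    return mask


@dataclass
class SDPOTraceConfig:
    """Hypothetical config holder; only the field names come from the notes."""

    strip_thinking: bool = False
    max_seq_len: int = 1536

    def validate(self) -> None:
        # Guard rails mirroring gotcha 2.
        if self.strip_thinking:
            raise ValueError("SDPO on real agent traces needs strip_thinking=False")
        if self.max_seq_len < 1536:
            raise ValueError(f"max_seq_len={self.max_seq_len} is below the 1536 floor")


if __name__ == "__main__":
    pure_thinking_turn = [("thinking", 412)]
    print(len(assistant_loss_mask(pure_thinking_turn, strip_thinking=True)))   # 0
    print(len(assistant_loss_mask(pure_thinking_turn, strip_thinking=False)))  # 412
    SDPOTraceConfig().validate()  # passes with the defaults
```

Failing fast in a validator like this surfaces the misconfiguration at launch rather than as a run that quietly sees empty masks.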