Reinforcement Learning
Transformers
English
post-training
distillation
agentic-coding
composer-2.5
cursor
kimi-k2
grpo
dapo
diloco
openenv
trl
verl
research
methodology
Instructions to use Codeseys/composer-replication-framework with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Codeseys/composer-replication-framework with Transformers:
# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("Codeseys/composer-replication-framework", dtype="auto") - Notebooks
- Google Colab
- Kaggle
| # Wave 7–10 Final Review — Cross-model adversarial check | |
| **Reviewer**: external model, Phase 11 of the deep work loop. | |
| **Date**: 2026-05-26. | |
| **Mandate**: find substantive flaws. The research thesis is | |
| primary-source-validated; this attacks *implementation correctness* and | |
| *scope creep*, not the thesis. | |
| --- | |
| ## (a) Are the tests real evidence or theater? | |
| ### Spike 006 (Qwen2.5-0.5B-Instruct CPU smoke, 9 tests) | |
| **Verdict: mostly tautology, with usable ablation tests.** | |
| The headline "loss 0.7390 → 0.0031, 99.6% reduction in 5 steps" is | |
| technically true and substantively near-tautological: | |
| 1. **The same fixed ~50-token batch is reused for all 5 steps.** | |
| `build_batch` returns one conversation; the test loop calls | |
| `compose_loss(model, batch)` five times in a row. No reshuffle, no | |
| second batch, no held-out anything. | |
| 2. **0.5B params × AdamW(lr=1e-5) × identical 50-token batch × | |
| 5 steps = textbook memorization regime.** A randomly-initialized MLP | |
| would also reduce loss in this setup. The test does not distinguish | |
| "the 3-channel composition is correct" from "AdamW reduces fixed-batch | |
| loss on any non-degenerate objective." | |
| 3. **The SDPO channel is zero throughout** (`sdpo_jsd=0.0` on every | |
| row of `loss_curve.csv`). The verdict calls this "correct fallback | |
| behavior"; what it actually is is *the entire SDPO channel never | |
| being tested by this smoke*. The fallback is a literal `_zero(device)`. | |
| `generalized_jsd_loss` has no end-to-end test on a real HF model | |
| anywhere in the codebase. **This is the largest evidence gap for | |
| V8.** | |
| 4. **DPO uses dummy hard-coded reference logprobs** (`-30.0`, `-35.0`). | |
| This tests that `-logsigmoid(small_positive)` is differentiable, not | |
| that the trace-replay-DPO pipeline (reference-policy precompute + | |
| collator + loss) wires together. | |
| 5. The "loss decreases" assertion is `losses[-1] < losses[0]` — the | |
| weakest version of monotonicity. | |
| **Genuine value**: model-loads test, chat-template test, the three | |
| α=0 / β=0 ablation tests. The ablations would catch a regression where | |
| weights stop disabling channels. The "5-step decrease" is the weakest | |
| test in the file. | |
| **Run.log inconsistency, not flagged anywhere**: | |
| `examples/qwen_05b_quickstart/run.log` shows step-1 total = 0.0379; | |
| the spike `verdict.md` quotes step-1 total = 0.2090 for the same code, | |
| same model. Either the seed isn't pinned through the model forward | |
| (likely — `torch.manual_seed(42)` is in `build_batch` only), or the | |
| package's `compose_loss` differs subtly from the spike's. **Quoting | |
| exact numbers from a non-reproducible run as evidence is the sloppy | |
| version of every research-replication scandal.** | |
| ### Spike 007 (Claude Code ingester, 15 tests) | |
| **Verdict: strongest test suite of the three. Caveats apply.** | |
| Real engineering value: | |
| - Synthetic fixture exercises the actual record types (assistant, | |
| user/tool_result, summary, system, sidechain). Tests assert structural | |
| properties: history grows monotonically, `[THINKING]` stripped on | |
| replay but kept in student_action, unique state_ids, tool_use | |
| serialization, tool_result tagging. | |
| - `test_truncated_line_tolerated` would catch a real failure-mode | |
| removal of the JSON-decode try/except. | |
| - Subagent and sidechain skip tests catch real production cases. | |
| Caveats: | |
| - **The "real session" test is hardcoded to one path on the author's | |
| machine** (`/home/codeseys/.claude/projects/…/e4a34e2b-….jsonl`). | |
| No env var, no fixture-discovery; the test is `skipif(not exists)`. | |
| This is a manual integration test, not a CI test. ADR-002 said "CI | |
| users substitute their own"; the substitution mechanism doesn't | |
| exist. | |
| - The synthetic fixture is **author-written** and presumably designed | |
| alongside the ingester. There is no scrubbed third-party fixture. | |
| - Acceptance criterion #3 in BACKLOG ("end-to-end smoke: real trace → | |
| ingester → collator → 1-step `composer_total_loss`") is **unmet** — | |
| the spike stops at "ingester emits TraceStates correctly." There is | |
| no test that takes ingested records, runs the data collator, and runs | |
| through `compose_loss`. | |
| This suite would catch real regressions in the ingester. Its weakest | |
| property: ships no contributor-runnable real-trace test. | |
| ### Spike 008 (DiLoCo, 5 tests, single-process) | |
| **Verdict: the caveat is honest but says the test does not test what | |
| users will assume it tests.** | |
| BACKLOG acceptance criterion: *"Smoke test: 2 replicas × 4 inner steps | |
| × 2 outer rounds on the toy model from Spike 005, both replicas converge | |
| toward the same solution within tolerance."* | |
| What ships: **one** replica, mock manager whose `allreduce` is a | |
| **`passthrough` no-op** (test_diloco_smoke.py:78). This is "one | |
| replica's outer optimizer machinery fires," not "two replicas | |
| converge." The acceptance criterion was silently re-defined; the | |
| spike's verdict.md calls this a "limitation" but it is a redefinition. | |
| The recon doc (per ADR-003) claimed a "ready-to-paste" pattern with real | |
| shared-buffer averaging. The implementation hits a "post-hook | |
| sequencing bug." **One of the recon claim and the implementation is | |
| wrong**, and the gap is buried in verdict.md instead of fixed. | |
| **Genuine value**: `test_diloco_pseudogradient_sign_convention` is the | |
| **single best test in all of Wave 7-10**. It pins the sign convention | |
| with a concrete arithmetic prediction (`final == θ_initial + nudge`) | |
| and reports `wrong_sign_diff` on failure. A future torchft upgrade that | |
| flips the sign breaks this test loudly. ADR-003 specifically flagged | |
| this hazard, and the test catches it. Credit where due. | |
| **Separate flaw in `composer_diloco.py` docstring (lines 13–28)**: the | |
| "wrong-sign pseudogradient combined with SGD's subtract-grad semantics | |
| gives net step in the local-Δ direction once momentum builds up" gloss | |
| is incoherent. There is no "wrong-sign" pseudogradient. | |
| `θ_initial − θ_local` is the exact DiLoCo paper convention; SGD's | |
| `p ← p − lr·g` semantics are designed for it. The test is correct; the | |
| prose explaining why is wrong, and will mislead anyone porting the | |
| convention. | |
| --- | |
| ## (b) Is the package a real framework or a shim? | |
| **Verdict: a structured shim around three real components and two | |
| stubs. Not yet a framework.** | |
| What `pip install composer-replication` delivers: | |
| - `compose_loss` — labeled in its own top docstring as "Do NOT use as | |
| the production training loss." Re-exported as the headline package | |
| API and used in the quickstart. | |
| - `build_batch` — a hard-coded fixed-conversation factory built for the | |
| smoke (factorial / binary-search examples). Anyone using this in | |
| real training is using example code as production. | |
| - `ClaudeCodeIngester` — real, working component. Solid. | |
| - `generalized_jsd_loss` — real, working (extracted from OPSD, MIT). | |
| - `extract_dpo_pairs`, `replay_trace`, teacher specs — real, but | |
| require OpenRouter credentials + spend. | |
| - `ComposerReplicationTrainer` — TRL `GRPOTrainer` subclass. | |
| Useful only with `[train]` extra. Not exercised end-to-end on any | |
| real model in this repo. | |
| - `make_diloco_outer_loop` — wrapper. Useful only with `[diloco]` extra. | |
| What is missing for "pip install and start training": | |
| 1. No GPU end-to-end example. The brief targets Qwen3-7B / Qwen3-32B. | |
| 2. No CLI. `pyproject.toml` declares no `[project.scripts]`. | |
| 3. No config schema (Hydra/Pydantic). Users hand-construct teacher | |
| specs, hint generators, data collators. | |
| 4. The `[train]` extra pulls TRL but **no integration test** of | |
| `ComposerReplicationTrainer` against a real GRPO rollout exists in | |
| this repo. Spike 005 used TinyLM; Spike 006 stubbed GRPO out | |
| precisely to avoid TRL. | |
| 5. **`build_batch` should not be public API.** It belongs in | |
| `examples/`. Re-exporting at top level implies it is a general-purpose | |
| utility. | |
| 6. **Two sources of truth**: `composer_replication/loss.py` is a | |
| near-copy of `spikes/006-…/compose_loss.py` with one import path | |
| changed. The spike tests still import from the spike file. A bug fix | |
| in one will not propagate. Same for `composer_diloco.py` ↔ | |
| `composer_replication/diloco/__init__.py`. | |
| Real framework value: | |
| - `ClaudeCodeIngester` with non-trivial logic. | |
| - `generalized_jsd_loss` with token-clip + temperature. | |
| - DiLoCo wrapper with sign-pinning test. | |
| - Sane package layout with optional extras for heavy deps. | |
| Net: **a successful directory restructure plus an installable wrapper | |
| around three real components and two stubs.** Calling Wave 10 "framework | |
| is installable with working entrypoints (✅)" is letter-of-the-law; | |
| the brief's "framework" connotation isn't yet earned. | |
| --- | |
| ## (c) ADR defensibility | |
| ### ADR-001 (local 5090 over Modal) | |
| **Reasoning defensible; execution missing.** The | |
| "iteration cycle 25–40s vs 3–5min" argument is concrete and matches | |
| reality. The "verification smoke, not production" framing is correct. | |
| **Gap**: Spike 002a-mini was never run on the 5090 either. Phase 10 in | |
| DEEP_WORK_LOOP_LOG.md is ⏳ pending. ADR-001 chose the 5090 over Modal, | |
| and **then nothing ran on either.** No `nvidia-smi` snapshot, no GPU | |
| step-time CSV, no bf16 numerics check. The "rule out CPU-only blind | |
| spots" goal is unmet. The ADR should be marked "Accepted (execution | |
| deferred)" or the spike should run. | |
| ### ADR-002 (Claude Code JSONL trace source) | |
| **Defensible on every dimension the ADR considers; the dimensions are | |
| partial.** "1,015 real sessions, zero acquisition cost" is real. License | |
| and schema-stability arguments are well-sourced. | |
| **Adversarial counter not in the ADR**: Claude Code JSONL is the most | |
| self-serving choice. The framework targets training a coding-agent model. | |
| The training data is the author's own Claude Code sessions where the | |
| agent was Claude. The teacher pool (Spike 001) is OpenRouter-based and | |
| *includes Claude*. So: | |
| - "student action" = what Claude did. | |
| - teacher pool includes Claude. | |
| - DPO pairs = teachers' agreement vs Claude's literal text. | |
| This is **circular imitation**: training a future model to imitate | |
| Claude using Claude's outputs as the gold reference and Claude as one | |
| of the disagreement teachers. The teacher-disagreement signal density | |
| argument from Spike 001 is strongest with diverse teachers. With this | |
| trace source, the student-action is locked to one teacher family, | |
| biasing the disagreement signal. The ADR doesn't consider this; the | |
| ingester README doesn't flag it. **The ADR rationalizes the easy path | |
| without naming the data-leakage tradeoff.** | |
| ### ADR-003 (torchft for DiLoCo) | |
| **Genuinely defensible choice.** Meta-maintained library; rolling-own | |
| trap correctly identified; license analysis (rejecting `diloco_simple`) | |
| is right; sign-convention risk named and tested. | |
| **Gap is in delivery, not decision.** ADR-003 §Consequences §1 says: | |
| "2 replicas, 4 inner steps, 2 outer rounds on a TinyMLP, shared-buffer | |
| mock allreduce, assertions: replica equality after sync, params actually | |
| moved, Nesterov state populated, sync count matches expected." Spike 008 | |
| implements one replica + passthrough manager. The ADR commits to an | |
| implementation that the spike does not deliver, and the gap is flagged | |
| only in the spike's verdict, not in the ADR. | |
| If the recon doc said the pattern was "ready-to-paste" but actually | |
| hits a sequencing bug, **the recon doc is wrong** and an adversarial | |
| reviewer is allowed to point that out. | |
| --- | |
| ## (d) Scorecard inflation | |
| The 5/10 → 9/10 update overstates. Test by test: | |
| - **Test 6 (DiLoCo integrated in runnable stack) → ✅?** | |
| Letter-of-law yes, spirit no. `make_diloco_outer_loop` exists and | |
| fires on one replica. **Zero references to torchft or DiLoCo in | |
| `composer_trainer.py`** — DiLoCo is not integrated with the trainer. | |
| No two-replica integration test, no real distributed run. | |
| - **Test 7 (real HF model loads + runs) → ✅?** | |
| Yes — most legitimately closed item. Caveats from §(a) about depth | |
| of evidence apply, but the literal test is met. | |
| - **Test 8 (real LLM-application trace ingested end-to-end) → ✅?** | |
| Mostly yes. Ingester real and tested. **BACKLOG acceptance criterion | |
| #3 ("end-to-end: real trace → ingester → collator → 1-step | |
| `composer_total_loss`") is unmet.** | |
| - **Test 9 (framework installable with working entrypoints) → ✅?** | |
| Letter-of-law yes, spirit partial. `pip install -e .` works; the | |
| quickstart runs the smoke harness. Production entrypoint | |
| (`ComposerReplicationTrainer` driven by a config) does not exist. | |
| - **Test 10 (non-author can complete the journey) → ✅?** | |
| No. The supporting evidence is "Quickstart README + working | |
| installable demonstrate the full path on Qwen2.5-0.5B in <5min, $0." | |
| Test 10's original journey was "I have Qwen3-7B, I want a | |
| Composer-style variant." The parenthetical concession in the update | |
| ("For Qwen3-7B etc., GPU phase still gates the empirical demo") | |
| ✅'s the item anyway. | |
| **Honest re-scoring**: 5/10 → **7/10 ✅, 1/10 ⚠️ partial (test 8), | |
| 2/10 ❌ in spirit (tests 6, 10).** "9/10" overstates by ~2 points. | |
| --- | |
| ## (e) Commit quality | |
| ``` | |
| ac05fbf Wave 10 — packaging: composer_replication is now pip-installable | |
| d52e126 Tidy .gitignore (de-dup *.jsonl, restore section blank lines) | |
| a35a8d7 Spike 007: include synthetic_session.jsonl fixture in repo | |
| 57af35d Wave 7+8+9: spikes 006/007/008 — close vision-validation gaps V2/V5/V8 | |
| ac4bfb4 Wave 7: Phase 2-4 of deep work loop — backlog, parallel research, three ADRs | |
| 040eff8 Wave 6: vision validation self-audit (5/10 to 9/10 in 5 days, no GPU) | |
| ``` | |
| - `ac05fbf`, `d52e126`: accurate. | |
| - `a35a8d7`: accurate. Implies `57af35d` shipped a Spike 007 that did | |
| not actually run cleanly for anyone cloning before this commit. Mild | |
| overclaim risk on `57af35d`. | |
| - **`57af35d` is the single most overclaiming commit.** Title: "close | |
| vision-validation gaps V2/V5/V8." | |
| - V8: closed in the weakest sense (tautology critique above). | |
| - V5: structural ingestion closes; BACKLOG acceptance #3 unmet. | |
| - V2: silently re-defined (one replica, no convergence). | |
| Three closures claimed; one partial, one redefined. | |
| - **Chronology problem**: `040eff8` (Wave 6) declared the **5/10 → 9/10 | |
| forecast** in the commit subject. `ac4bfb4` (Wave 7, *next* commit) | |
| added the BACKLOG and ADRs — i.e., the *plan* to make the forecast | |
| true. `57af35d` (Wave 7-9) executed and ratified the 9/10 without | |
| re-auditing whether each item was actually closed in spirit. **No | |
| commit re-audits the scorecard against actually delivered evidence.** | |
| --- | |
| ## (f) Adversarial reviewer's strongest line of attack | |
| > "You have a research replication framework whose only published smoke | |
| > is a 5-step fixed-batch overfit on a 0.5B model on CPU, where the SDPO | |
| > channel is silently disabled (sdpo_jsd=0 throughout), the DPO channel | |
| > uses dummy reference logprobs, and the GRPO channel is replaced with | |
| > a stub. Of the three channels you advertise, **zero are tested | |
| > end-to-end on a real HF model.** Your DiLoCo integration is one | |
| > replica with a no-op `allreduce`. Your real-trace ingester is tested | |
| > against a fixture you wrote yourself plus a hardcoded path on your | |
| > laptop. Your scorecard moved from 5/10 to 9/10 with no GPU spend, no | |
| > third-party validation, and one commit that closed three vision- | |
| > validation gaps with one commit message. You are asking the reader to | |
| > believe that a $9B-startup commercial product is replicated by a CPU | |
| > smoke and three green test files — none of which the company itself | |
| > would call 'replicated.'" | |
| **Weakest defense**: "It's just v0.1 / smoke phase / GPU is the next | |
| phase." The *commit log and scorecard claim otherwise.* The defense | |
| "v0.1 caveat" only works if the v0.1 framing is honest at the top of | |
| the README and scorecard — and it is not. | |
| **Strongest actual defense**: the four primary-source-validated recon | |
| docs and Spike 001's measured cost floor. The *thesis* is credible and | |
| auditable. The *implementation phase* is overclaimed. | |
| --- | |
| ## What to fix before publishing publicly (priority order) | |
| ### 1. Re-state the scorecard honestly (BLOCKER) | |
| Replace 5/10 → 9/10 with **5/10 → 7/10 ✅, 1/10 ⚠️, 2/10 ❌-spirit.** | |
| List the spirit-failures explicitly (test 6 trainer integration, test 8 | |
| end-to-end, test 10 non-author). Single most important fix; everything | |
| else compounds on the inflated scorecard. | |
| ### 2. Fix Spike 008's V2 claim (BLOCKER) | |
| Either (a) add a real two-replica multiprocessing test (ADR-003 says | |
| this is feasible; the spike claims it isn't — reconcile), or (b) mark | |
| V2 as ⚠️ partial and rewrite BACKLOG: "machinery fires on one replica, | |
| sign convention pinned; cross-replica convergence deferred to GPU | |
| phase." Pick one. | |
| ### 3. Strengthen Spike 006 against the tautology critique | |
| Two cheap wins: | |
| - Test that loss decreases on **two alternating fixed batches** over 10 | |
| rounds (not just one memorized batch). | |
| - Test where **`alpha_sdpo=10.0` and SDPO actually fires** (truncate | |
| ctx_teacher to T_s tokens for matching shape). The SDPO channel is | |
| *not exercised on a real HF model anywhere* in the codebase. Largest | |
| evidence gap for V8. | |
| ### 4. Run Spike 002a-mini on the local 5090 | |
| ADR-001 made the choice; the spike was not run. Either drop the ADR | |
| (decision deferred) or run the spike (~30 min wall-clock per ADR's own | |
| estimate). Until then, the framework has zero GPU evidence of any kind. | |
| ### 5. Fix the run.log / verdict.md numerical inconsistency | |
| Quickstart run.log shows step-1=0.0379; spike verdict shows step-1=0.2090. | |
| Either pin the seed properly or document non-reproducibility and quote | |
| a band rather than exact numbers. | |
| ### 6. Acknowledge Claude Code JSONL's circularity in ADR-002 | |
| Add a "Risks accepted" entry naming the data-leakage concern: training | |
| on Claude's outputs while Claude is in the teacher pool produces a | |
| biased disagreement signal. Spike 007 README should also flag it. | |
| ### 7. Decide what `compose_loss` and `build_batch` are | |
| Either rename to `compose_loss_smoke` (and keep | |
| `ComposerReplicationTrainer._compute_loss` as production), or make | |
| `compose_loss` actually production-grade and demote `build_batch` out | |
| of public API. Production-disclaimed harness as the package's headline | |
| import is confusing. | |
| ### 8. Eliminate dual sources of truth | |
| `spikes/006-…/compose_loss.py` ↔ `composer_replication/loss.py`, and | |
| `spikes/008-…/composer_diloco.py` ↔ `composer_replication/diloco/__init__.py`. | |
| Make the spike import from the package; delete the duplicate. | |
| ### 9. Add the missing real-trace end-to-end test in Spike 007 | |
| Take ingester output → Spike 005 data collator → 1 step of `compose_loss`. | |
| This is BACKLOG acceptance #3; ~50 lines of test code closes V5's | |
| spirit gap. | |
| ### 10. Fix the sign-convention docstring in `composer_diloco.py` | |
| Replace the incoherent "wrong-sign + SGD subtract = right answer with | |
| momentum" gloss with: *"DiLoCo defines pseudo-gradient as | |
| `θ_initial − θ_local`; this is the negative of the local update | |
| direction, and standard SGD subtracts gradients, so the outer step | |
| moves in the local-update direction. No negation required."* The test | |
| is correct; the prose explaining it isn't. | |
| --- | |
| ## Credit where due | |
| - **Spike 007's `ClaudeCodeIngester`** is real, working, well-tested | |
| software with non-trivial logic (sidechain skip, thinking-block | |
| strip-on-replay, malformed-line tolerance). The synthetic fixture | |
| exercises the structural cases properly. | |
| - **Spike 008's pseudogradient-sign-convention test** is the single | |
| best test in all of Wave 7-10. It pins a known torchft hazard with an | |
| explicit arithmetic prediction and a `wrong_sign_diff` reported on | |
| failure. | |
| - **Spike 006's α=0 / β=0 ablation tests** would catch real regressions | |
| and document channel-disable semantics. | |
| - **All three ADRs are properly traceable to recon documents** | |
| (MODAL_RECONNAISSANCE, TRACE_SOURCE_RECONNAISSANCE, | |
| DILOCO_RECONNAISSANCE). The decisions can be challenged; the *process* | |
| is auditable, which is rare. | |
| - **Package layout** (`loss`, `batch`, `opsd`, `teacher_replay`, | |
| `ingestion/claude_code`, `diloco`, `trainer`) is sane; optional | |
| extras correctly avoid forcing TRL/torchft on every install. | |
| The work product is not zero. It is overclaimed by roughly one | |
| scorecard tier and one BACKLOG acceptance criterion. Fixing items | |
| 1, 2, 3, 5 above moves the framework from "publishable with a generous | |
| reviewer" to "publishable with a critical reviewer." Items 4 and 6 | |
| move it from "research replication" to "evidenced research replication." | |