Baladithya Balamurugan Claude Opus 4.8 (1M context) committed on Jun 9

Commit

c11cf49

1 Parent(s): 4e6e82e

Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt

Backlog-resolution Wave 1 (branch backlog/goal-resolution-2026-06).

Bugs fixed:
- B1 (P0): generate spikes/007 synthetic_session_with_error.jsonl fixture
(never committed — .gitignore *.jsonl whitelisted the sibling but not this
one) → the 8 failing test_trace_examples_adapter tests now pass. Added the
missing .gitignore whitelist line.
- B2 (P2): [dev] extra was un-installable on Apple Silicon (pulled Linux-only
torchft-nightly). Dropped diloco from base [dev]; added [dev-full] for Linux.
- B3 (P2): [serverless] extra now includes s3fs/boto3/kubernetes (needed for
real S3 rendezvous + the EKSExecutor/SageMakerExecutor adapters).

Newly unblocked:
- D1 (…-245d): Docker substrate E2E — Docker now available on host; both gates
(4-gate substrate inversion + cache-scrub-in-container) PASS. Long-standing
"hardware-blocked" item closed.

Doc/API debt:
- B4: reconciled divergent test counts (115/176/210/232) to one canonical
figure measured this session: 266 passed / 62 skipped / 328 collected, with
an env-variance note, in docs/V1_V8_COVERAGE.md; updated PROJECT_STATE,
BACKLOG, TROUBLESHOOTING to point at it.
- B5: replaced stale /mnt/e/ WSL footer paths with repo-relative in
USER_GUIDE/API_REFERENCE/INTEGRATION_RECIPES.
- B6: fixed dead ADR-002-channel2-sdpo.md link → ADR-008 (README + 2 run.py).
- B7: re-export make_dr_grpo_config/make_po_config/PO_OBJECTIVES at the
trainer-subpackage AND top-package level; documented the config factories +
the PO-objective menu table in API_REFERENCE.
- B8: corrected _refine-2026-06-SUMMARY self-stale "not merged/3 commits" →
merged at 4e6e82e/6 commits; fixed OVERVIEW foot-gun cross-ref.

Plus the deep-research deliverable: research/notes/final_report_*.md (the
multi-model MCTS tree-of-work design) + supporting vault notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitignore +7 -0
BACKLOG.md +1 -1
composer_replication/__init__.py +7 -2
composer_replication/trainer/__init__.py +12 -2
docs/API_REFERENCE.md +59 -1
docs/BACKLOG_RESOLUTION_2026-06-09.md +60 -0
docs/INTEGRATION_RECIPES.md +1 -1
docs/OVERVIEW.md +4 -2
docs/PROJECT_STATE_AND_REMAINING_WORK.md +1 -1
docs/TROUBLESHOOTING.md +1 -1
docs/USER_GUIDE.md +1 -1
docs/V1_V8_COVERAGE.md +18 -2
docs/_refine-2026-06-SUMMARY.md +6 -2
examples/gsm8k_grpo/run.py +1 -1
examples/gsm8k_grpo_with_sdpo/README.md +1 -1
examples/gsm8k_grpo_with_sdpo/run.py +1 -1
pyproject.toml +19 -2
research/audit_findings.json +11 -0
research/comparisons.md +60 -0
research/critic-findings-depth.json +35 -0
research/critic-findings-dialectic.json +48 -0
research/critic-findings-instruction.json +14 -0
research/critic-findings-width.json +48 -0
research/loci.json +68 -0
research/notes/191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model.md +258 -0
research/notes/201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning.md +229 -0
research/notes/221114275-solving-math-word-problems-with-process-and-outcome-based-feedback.md +205 -0
research/notes/230104104-mastering-diverse-domains-through-world-models.md +196 -0
research/notes/230520050-lets-verify-step-by-step.md +207 -0
research/notes/230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter.md +205 -0
research/notes/240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l.md +200 -0
research/notes/240515383-generating-code-world-models-with-large-language-models-guided-by-mont.md +196 -0
research/notes/240701476-tree-search-for-language-model-agents.md +200 -0
research/notes/240919256-hybridflow-a-flexible-and-efficient-rlhf-framework.md +222 -0
research/notes/241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and.md +207 -0
research/notes/241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati.md +194 -0
research/notes/241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor.md +226 -0
research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md +214 -0
research/notes/250314391-how-much-do-llms-learn-from-negative-examples.md +189 -0
research/notes/250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein.md +217 -0
research/notes/250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model.md +210 -0
research/notes/250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen.md +200 -0
research/notes/250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro.md +214 -0
research/notes/250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on.md +251 -0
research/notes/250921240-tree-search-for-llm-agent-reinforcement-learning.md +202 -0
research/notes/251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod.md +291 -0
research/notes/251121654-evilgenie-a-reward-hacking-benchmark.md +199 -0
research/notes/251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo.md +210 -0
research/notes/260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight.md +210 -0
research/notes/260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas.md +206 -0

.gitignore CHANGED

@@ -46,3 +46,10 @@ spikes/*/results/
 !spikes/*/states.jsonl
 !spikes/*/results.jsonl
 !**/synthetic_session.jsonl
+!**/synthetic_session_with_error.jsonl
+# hyperresearch tooling (local agent scaffolding — not project source)
+.hyperresearch/
+.claude/
+CLAUDE.md
+research/temp/
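
The whitelist fix above relies on gitignore's "last matching rule wins" behaviour. A minimal sketch of that rule, assuming a simplified matcher that only looks at basenames (real gitignore handles directories and anchoring too); `is_ignored` and the fixture path are illustrative, not part of the repository:

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def is_ignored(path: str, rules: list[str]) -> bool:
    """Simplified gitignore check: the last matching rule decides.

    Only basename-style patterns are modelled (`*.jsonl`, `**/name`).
    """
    name = PurePosixPath(path).name
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule.lstrip("!").removeprefix("**/")
        if fnmatch(name, pattern):
            # A later match overrides any earlier one.
            ignored = not negated
    return ignored


# Illustrative path; the exact fixture location is an assumption.
fixture = "spikes/007/synthetic_session_with_error.jsonl"
before = ["*.jsonl", "!**/synthetic_session.jsonl"]
after = before + ["!**/synthetic_session_with_error.jsonl"]

print(is_ignored(fixture, before))
print(is_ignored(fixture, after))
```

On a real checkout, `git check-ignore -v <path>` reports which rule (if any) ignores a path and is the authoritative check.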

BACKLOG.md CHANGED

@@ -82,7 +82,7 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
 - **ADR-008/009/010 (Datagen, Layered Hints, Dr.GRPO+SDPO)**: Shipped, examples documented.
 - **Cross-Family Architectural Review**: Shipped (`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`).
 - **Alignment / V&V Closure**: ADR-011 (SDPO alignment indices), ADR-012 (close review findings), ADR-013 (LMA integration channel-ladder) shipped.
-- **Test Suites**: 210 passed / 16 skipped.
+- **Test Suites**: 266 passed / 62 skipped (measured 2026-06-09; canonical count + env-variance note in docs/V1_V8_COVERAGE.md).
 - **Real Examples**: `examples/gsm8k_grpo/`, `examples/sdpo_with_real_traces_production/`.
 ## Deferred (post-loop, GPU-gated)

composer_replication/__init__.py CHANGED

@@ -94,8 +94,13 @@ from composer_replication.teacher_replay import (
     replay_trace,
 )
-# Trainer (Spike 005)
-from composer_replication.trainer import ComposerReplicationTrainer
+# Trainer (Spike 005) + policy-optimization config factories (ADR-008/ADR-014)
+from composer_replication.trainer import (
+    PO_OBJECTIVES,
+    ComposerReplicationTrainer,
+    make_dr_grpo_config,
+    make_po_config,
+)
 # DiLoCo (Spike 008) — optional, requires torchft
 try:

composer_replication/trainer/__init__.py CHANGED

@@ -5,6 +5,16 @@ Per docs/adrs/ADR-003 (also wraps DiLoCo when training distributed).
 """
 from __future__ import annotations
-from composer_replication.trainer.composer_trainer import ComposerReplicationTrainer
-__all__ = ["ComposerReplicationTrainer"]
+from composer_replication.trainer.composer_trainer import (
+    PO_OBJECTIVES,
+    ComposerReplicationTrainer,
+    make_dr_grpo_config,
+    make_po_config,
+)
+__all__ = [
+    "ComposerReplicationTrainer",
+    "make_dr_grpo_config",
+    "make_po_config",
+    "PO_OBJECTIVES",
+]
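
The two `__init__.py` changes above re-export the same names at two package levels. A minimal sketch of a smoke check that both levels expose the identical objects; `reexport_mismatches` is a hypothetical helper, and the demo uses the standard library (`json` re-exports `JSONDecodeError` from `json.decoder`) so it runs without the project installed:

```python
import importlib


def reexport_mismatches(top: str, sub: str, names: list[str]) -> list[str]:
    """Return names that are missing from either module or bound to different objects."""
    top_mod = importlib.import_module(top)
    sub_mod = importlib.import_module(sub)
    missing = object()  # sentinel, so a legitimate None attribute is not misread
    bad = []
    for name in names:
        a = getattr(top_mod, name, missing)
        b = getattr(sub_mod, name, missing)
        if a is missing or b is missing or a is not b:
            bad.append(name)
    return bad


# Stand-in demo on the standard library.
print(reexport_mismatches("json", "json.decoder", ["JSONDecodeError"]))
print(reexport_mismatches("json", "json.decoder", ["dumps"]))
```

With the package installed, the equivalent call would pass `"composer_replication"`, `"composer_replication.trainer"`, and the four names added in this commit; an empty list would indicate the re-exports line up.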

docs/API_REFERENCE.md CHANGED

@@ -926,6 +926,64 @@ trainer = ComposerReplicationTrainer(
 # trainer.train()  # uses overridden _compute_loss
 ```
+### `make_dr_grpo_config(**overrides) -> trl.GRPOConfig`
+Builds a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe (Composer 2.5's
+base objective per the Composer 2 tech report, arXiv:2603.24477; Dr.GRPO =
+Liu et al. arXiv:2503.20783). Forces three knobs unless explicitly overridden,
+with drift-guard assertions:
+- `loss_type="dr_grpo"` — removes GRPO's length-standardization length bias.
+- `scale_rewards="none"` — NO std-dev advantage normalization (Dr.GRPO requirement).
+- `num_iterations=1` — single-epoch / strict on-policy.
+Any field is overridable via kwargs (`learning_rate=`, `output_dir=`, `beta=`, …).
+**Honest KL-estimator delta** (ADR-012 #1): TRL 1.5.0's `GRPOTrainer._compute_loss`
+uses the **k3** estimator `exp(ref_logp−logp)−(ref_logp−logp)−1`, NOT the k1
+estimator `−log r` the Dr.GRPO/Composer report frames; the delta is small for r≈1
+and TRL is not monkeypatched — the delta is documented, not hidden. Exported from
+both `composer_replication` and `composer_replication.trainer`.
+```python
+from composer_replication import make_dr_grpo_config
+args = make_dr_grpo_config(output_dir="runs/x", learning_rate=1e-6)
+```
+### `make_po_config(objective="dr_grpo", **overrides) -> trl.GRPOConfig`
+Builds a `trl.GRPOConfig` for a **named policy-optimization objective** from the
+`PO_OBJECTIVES` menu (ADR-014). All presets are PURE CONFIG over trl 1.5.0's
+`GRPOTrainer` (verified by introspection) — no custom `_compute_loss` needed.
+`**overrides` set/override any `GRPOConfig` field on top.
+- Raises `ValueError` on an unknown objective (lists the valid menu).
+- Raises `AssertionError` if a requested knob silently failed to apply (drift guard;
+  e.g. GSPO guards `importance_sampling_level=="sequence"`).
+```python
+from composer_replication import make_po_config, PO_OBJECTIVES
+args = make_po_config("dapo", output_dir="runs/dapo", learning_rate=2e-6)
+```
+### `PO_OBJECTIVES: dict[str, dict]`
+The selectable base policy-optimization objectives (named presets over real trl
+1.5.0 `GRPOConfig` knobs). Keys and what each sets:
+| Objective | `loss_type` | `scale_rewards` | Distinguishing knob | Paper |
+|---|---|---|---|---|
+| `grpo` | `grpo` | `group` (std-norm) | IS=`token` | DeepSeekMath 2402.03300 |
+| `dr_grpo` *(default)* | `dr_grpo` | `none` | length-bias removed | 2503.20783 |
+| `bnpo` | `bnpo` | `batch` | batch-normalized | trl |
+| `dapo` | `dapo` | `none` | `epsilon_high=0.28` (decoupled clip-higher), `mask_truncated_completions`, `beta=0` | 2503.14476 |
+| `gspo` | `grpo` | `group` | `importance_sampling_level="sequence"` | Qwen 2507.18071 |
+| `cispo` | `cispo` | `none` | `epsilon_high=5.0` (detached IS coef) | MiniMax-M1 2506.13585 |
+> **Diagnostic gotcha:** for any PO-objective ablation, log the *distinguishing*
+> diagnostic (`clip_ratio/high_mean` for DAPO, the sequence-level ratio for GSPO).
+> A `0` means the knob never engaged — NOT that the objectives are equal. (This is
+> exactly the inert-knob artifact the A1 DAPO-vs-Dr.GRPO washout hit at lr=1e-6.)
 ### `class TraceTurn(TypedDict, total=False)` — `trainer.data_collator`
 ```python
@@ -1460,4 +1518,4 @@ Untested-contract symbols (⚠️) and skeletons (🟡) are flagged inline above
 ---
-**Document path**: `/mnt/e/CS/HF/composer-replication-framework/docs/API_REFERENCE.md`
+**Document path**: `docs/API_REFERENCE.md` (repo-relative)
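
The `make_po_config` contract documented in this hunk (unknown objective raises `ValueError`, a knob that did not stick raises `AssertionError`) can be sketched without TRL. This is a hedged illustration only: `StandInConfig`, `PRESETS`, and `make_config` are hypothetical stand-ins, the default field values are arbitrary, and just two rows of the objective table are copied in.

```python
from dataclasses import dataclass


@dataclass
class StandInConfig:
    """Hypothetical stand-in for trl.GRPOConfig; only the fields this sketch needs."""
    loss_type: str = "grpo"
    scale_rewards: str = "group"
    num_iterations: int = 1
    learning_rate: float = 1e-6
    output_dir: str = "runs/tmp"


# Two presets taken from the objective table; the real menu has more keys.
PRESETS = {
    "grpo": {"loss_type": "grpo", "scale_rewards": "group"},
    "dr_grpo": {"loss_type": "dr_grpo", "scale_rewards": "none", "num_iterations": 1},
}


def make_config(objective: str = "dr_grpo", **overrides) -> StandInConfig:
    if objective not in PRESETS:
        raise ValueError(f"unknown objective {objective!r}; valid: {sorted(PRESETS)}")
    # Explicit overrides win over preset values.
    requested = {**PRESETS[objective], **overrides}
    cfg = StandInConfig(**requested)
    # Drift guard: every requested knob must be visible on the built config.
    for key, want in requested.items():
        got = getattr(cfg, key)
        if got != want:
            raise AssertionError(f"{key}: requested {want!r}, config has {got!r}")
    return cfg


cfg = make_config("dr_grpo", learning_rate=2e-6)
print(cfg.loss_type, cfg.scale_rewards, cfg.learning_rate)
```

With a plain dataclass the guard can never fire; it earns its keep when the config class may rewrite or drop a value during its own initialisation, which is the silent failure the drift guard is described as catching.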

docs/BACKLOG_RESOLUTION_2026-06-09.md ADDED

	@@ -0,0 +1,60 @@

+# Backlog Resolution — 2026-06-09
+Goal-driven systematic resolution of every pending item. This doc is the live audit + wave plan.
+## Phase 1 — Commit / working-tree state (captured 2026-06-09)
+- **Branch:** `main` (canonical) at `4e6e82e` = `origin/main` = `origin/master` (synced).
+- **Working branch for this effort:** `backlog/goal-resolution-2026-06` (off `main`).
+- **Untracked (from the hyperresearch run + tooling):** `research/` artifacts (query, scaffold, loci, comparisons, critic-findings, patch/polish logs, `notes/final_report_*`), `.hyperresearch/` (SQLite vault), `.claude/skills/` (16 hyperresearch step skills), `CLAUDE.md` (hyperresearch-injected). Decision: the deep-research deliverable (`research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` + supporting artifacts) is worth committing as project research; `.hyperresearch/` (binary SQLite) and tooling scaffolding should be gitignored.
+- **Host capabilities NEW since last audit:** **Docker IS available** (`docker info` ok) → unblocks the substrate-E2E item. `.venv` (py3.13, torch 2.12, trl 1.5.1) present.
+## Phase 2 — Backlog audit (every item, categorized)
+### A. Real bugs / regressions (do NOW, no gating)
+| ID | Item | Priority | Complexity | Status |
+|---|---|---|---|---|
+| B1 | 8 failing tests: gitignored `synthetic_session_with_error.jsonl` fixture never committed (`.gitignore:45 *.jsonl` whitelists `synthetic_session.jsonl` but not the `_with_error` sibling). Breaks `composer_replication/ingestion/tests/test_trace_examples_adapter.py` (core pkg) + `examples/sdpo_with_real_traces_production/run.py`. | P0 | trivial | OPEN |
+| B2 | `[dev]` extra un-installable on Apple Silicon (pulls `torchft-nightly`, Linux-x86_64-only wheels) → `uv pip install -e '.[dev]'` fails entirely. | P2 | low | OPEN |
+| B3 | `[serverless]` extra missing `s3fs`/`boto3`/`kubernetes` (needed for real S3 rendezvous + the planned EKSExecutor). | P2 | low | OPEN |
+### B. Doc/state debt (do NOW)
+| ID | Item | Priority | Status |
+|---|---|---|---|
+| B4 | Test-count drift: docs claim 115 / 210 / 232 / 176 in different places; real count must be measured + reconciled to one canonical number (V1_V8_COVERAGE.md). | P2 | OPEN |
+| B5 | Stale WSL `/mnt/e/CS/HF/...` absolute-path footers in API_REFERENCE.md:1463, USER_GUIDE.md:703, INTEGRATION_RECIPES.md:985 (+ research/* occurrences). | P3 | OPEN |
+| B6 | Dead link `examples/gsm8k_grpo_with_sdpo/README.md:66 → docs/adrs/ADR-002-channel2-sdpo.md` (should be ADR-008-drgrpo-sdpo-live-channel.md). | P3 | OPEN |
+| B7 | API_REFERENCE.md missing the trainer config factories `make_dr_grpo_config` (ADR-008) + `make_po_config`/`PO_OBJECTIVES` (ADR-014) — real public API undocumented. | P2 | OPEN |
+| B8 | `_refine-2026-06-SUMMARY.md` self-stale ("not merged, 3 commits" — actually merged, 6 commits); README/OVERVIEW→TROUBLESHOOTING dangling foot-gun cross-ref. | P3 | OPEN |
+### C. Code-buildable Phase-0 deltas from the research report (do NOW — mockable, no GPU/cloud)
+| ID | Item | Priority | Complexity | Status |
+|---|---|---|---|---|
+| C1 | **Held-out disjoint eval + depth/generation kill-switch** — the "documented repo gap" + most load-bearing collapse safeguard (#2). Self-evolving flywheel is unsafe without it. CPU-testable. | P1 | med | OPEN |
+| C2 | **`EKSExecutor`** satisfying the `ServerlessExecutor` Protocol (launch_replicas=K8s indexed Jobs, poll/cancel/collect, S3 via ObjectStoreAllReduce) — ~150 LOC, mockable like ModalSpawnExecutor (its test uses `_MockFunctionCall`). The named-but-unimplemented `K8sExecutor` slot (executor.py:41). | P2 | med | OPEN |
+| C3 | Containerize `LocalSubprocessSandbox` (gVisor/Docker runtime) — now that Docker exists, the sandbox-execution path can be made real. | P3 | med | OPEN |
+### D. Hardware/host-gated — NOW RUNNABLE (Docker present)
+| ID | Item | Priority | Status |
+|---|---|---|---|
+| D1 (`…-245d`) | Docker substrate E2E (`composer_replication/datagen/tests/test_docker_substrate_e2e.py`) — the 4 inversion gates + cache-scrub on a real `python:3.11-slim` container. Was skipif-gated on `docker info`; **Docker now available → RUN IT**. | P4→now | OPEN |
+### E. Code-buildable, RUN-gated (build harness/tests; real run needs GPU+budget — user-only)
+| ID | Item | Priority | Status |
+|---|---|---|---|
+| E1 (`…-4936`) | A2 SDPO-only ladder runner + error-trace dataset builder. `modal_ladder_a1.py` hardcoded to A1. Build the runner + dataset tooling + CPU/mock tests; real A100 run is user-gated. | P2 | OPEN (build harness) |
+| E2 (`…-211e`) | Higher-lr PO-objective sweep harness — make DAPO/GSPO clip-higher fire; log the distinguishing diagnostic. Build the sweep config/driver + assertions; real run user-gated. | P2 | OPEN (build harness) |
+| E3 | `SageMakerExecutor` (~150 LOC, boto3 create_training_job, same S3 rendezvous) — mockable. | P3 | OPEN |
+### F. Genuinely gated — cannot execute here (document + verify only)
+| ID | Item | Priority | Status |
+|---|---|---|---|
+| F1 (`…-cb74`) | **ROTATE exposed HF write-token** — USER-ONLY (requires HF account access). AUDIT done: no live token in tracked tree (only env-var reads). Action = user rotates on huggingface.co. | P1 | DOCUMENTED (user-only) |
+| F2 | Real 8B LMA run (A2/A3/A4 arms `…-42f5`,`…-dd7b`) + higher-lr sweep RUNS — GPU + budget + user go/no-go. Harness buildable (E1/E2); the spend is user-only. | — | GATED (harness only) |
+## Wave plan
+- **Wave 1 (parallel):** B1, B2, B3, B4, B5, B6, B7, B8 (bugs + doc debt) ‖ D1 (Docker E2E) ‖ research fan-out (Tavily/Exa/DeepWiki) for C1/C2/E1/E2 best practices.
+- **Wave 2 (parallel, after research):** C1 (held-out eval + kill-switch) ‖ C2 (EKSExecutor) ‖ C3 (containerized sandbox) ‖ E1/E2/E3 harnesses.
+- **Concurrent review team:** audits each wave's diff, feeds findings back.
+- **Wave 3+:** reconcile review findings, fix, repeat until zero open + tests green.
+- **Final:** full suite green, docs reconciled, everything committed.
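
Item D1 in the audit above describes a test that is skip-gated on `docker info`. A minimal sketch of such a probe, assuming only that a working daemon makes `docker info` exit with status 0; `docker_available` is a hypothetical helper name, not the repository's:

```python
import shutil
import subprocess


def docker_available(timeout_s: float = 10.0) -> bool:
    """True only if a docker CLI is on PATH and `docker info` exits with status 0."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # CLI present but not runnable, or the daemon is not answering.
        return False
    return result.returncode == 0


print("docker available:", docker_available())
```

In a pytest module the result would typically feed `pytest.mark.skipif(not docker_available(), reason="needs a live Docker daemon")`, so the substrate tests run on Docker hosts and skip cleanly elsewhere.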

docs/INTEGRATION_RECIPES.md CHANGED

@@ -982,4 +982,4 @@ adapter boundary, not because the loss math is wrong.
 ---
-**File path:** `/mnt/e/CS/HF/composer-replication-framework/docs/INTEGRATION_RECIPES.md`
+**File path:** `docs/INTEGRATION_RECIPES.md` (repo-relative)

docs/OVERVIEW.md CHANGED

@@ -67,8 +67,10 @@ where channel 1 is real GRPO rather than the LM-CE stub. See
 3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
    GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
-See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
-for known foot-guns.
+See [`BACKLOG.md`](../BACKLOG.md) for the live gap list, the **Foot-guns worth knowing
+on day one** section just below for the day-one gotchas (branch sync, `strip_thinking`,
+k1/k3, `compose_loss`-is-harness), and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
+for install/runtime failure modes.
 ## Foot-guns worth knowing on day one
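
The k1/k3 gotcha named in this hunk refers to the two KL estimators quoted in this commit's API_REFERENCE addition: k1 = −log r and k3 = exp(log r) − log r − 1, with log r = ref_logp − logp. A small worked sketch on a toy three-outcome distribution (the probabilities are made up for illustration) shows that both average to the same KL under the policy, and that per sample they differ by exactly 1 − r:

```python
import math


def k1(log_r: float) -> float:
    """k1 estimator: -log r."""
    return -log_r


def k3(log_r: float) -> float:
    """k3 estimator: exp(log r) - log r - 1, non-negative for every sample."""
    return math.exp(log_r) - log_r - 1.0


# Toy distributions (assumed values): q plays the policy, p the reference.
q = [0.5, 0.3, 0.2]
p = [0.4, 0.4, 0.2]
log_rs = [math.log(pi / qi) for pi, qi in zip(p, q)]

exact_kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
mean_k1 = sum(qi * k1(lr) for qi, lr in zip(q, log_rs))
mean_k3 = sum(qi * k3(lr) for qi, lr in zip(q, log_rs))
print(f"KL={exact_kl:.6f} E[k1]={mean_k1:.6f} E[k3]={mean_k3:.6f}")

# Per sample, k1 - k3 = 1 - r, which shrinks as r approaches 1.
for lr in log_rs:
    print(f"r={math.exp(lr):.3f} k1={k1(lr):+.5f} k3={k3(lr):+.5f}")
```

The sketch does not call TRL and says nothing about gradients; it only illustrates why the two estimators agree in expectation while individual k1 samples can be negative and k3 samples cannot.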

docs/PROJECT_STATE_AND_REMAINING_WORK.md CHANGED

@@ -15,7 +15,7 @@ for unblocked work, `sd list` for everything, `sd show <id>` for detail.
 A reusable RL/data-gen framework that replicates Cursor's **Composer 2.5** post-training
 recipe at small scale, whose north-star consumer is the **llm-mental-alterations (LMA)**
 project (apply targeted RL to a personality-altered SFT model and measure washout vs
-amplification). Past-skeleton, production-shaped: 8 subpackages, 232 tests pass / 18 skip,
+amplification). Past-skeleton, production-shaped: 8 subpackages, 266 tests pass / 62 skip (measured 2026-06-09; see docs/V1_V8_COVERAGE.md for the canonical count + why skips vary by env),
 installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.
 ## The 3-channel loss — with HONEST provenance

docs/TROUBLESHOOTING.md CHANGED

@@ -824,7 +824,7 @@ should succeed:
 uv venv --clear
 uv pip install -e ".[diloco,replay,replaysim,train,dev]"
 source .venv/bin/activate
-python -m pytest -q                    # baseline 176 passed / 8 skipped
+python -m pytest -q                    # baseline 266 passed / 62 skipped (2026-06-09; varies by optional deps/Docker — see docs/V1_V8_COVERAGE.md)
 ```
 If any of those extras fails to resolve, file a bug report — Wave 16

docs/USER_GUIDE.md CHANGED

@@ -700,4 +700,4 @@ Run the full suite with `pytest` from the repo root.
 ---
-**File path:** `/mnt/e/CS/HF/composer-replication-framework/docs/USER_GUIDE.md`
+**File path:** `docs/USER_GUIDE.md` (repo-relative)

docs/V1_V8_COVERAGE.md CHANGED

@@ -112,7 +112,23 @@ The user expanded the brief mid-loop:
 **Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).
-The framework now covers the full expanded brief. **Total tests passing
-post-Wave-15: 115 + 1 skip-marked.** Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked).
+The framework now covers the full expanded brief.
+**Canonical test count (measured 2026-06-09 on this tree): 266 passed / 62 skipped / 328 collected.**
+Wave-by-wave growth of the *passing-on-a-minimal-CPU-env* subset: 72 (W12) → 93 (W13)
+→ 124 (W14) → 130 (W14b) → 115 (W15) → … → **266 (2026-06-09)** as later waves
+(datagen, ADR-011/012/013/014, serverless, ingestion adapter) added subpackages and tests.
+**Why the skip count varies by environment (and why older docs cite 115 / 176 / 210 / 232):**
+the suite has ~328 collected tests; how many *run vs skip* depends on what optional
+deps / host capabilities are present. Tests `skipif`-gate on: `torchft` (DiLoCo
+integration — Linux-x86_64-only, absent on macOS arm64), `modal`, `data-juicer`,
+`prime-rl`, the `/tmp/{opsd,taid}-clone` upstream-parity clones, a real Claude Code
+session log, and a live **Docker** host. On a minimal CPU env many of those skip;
+on a Docker-enabled host the substrate-E2E gates RUN (proven 2026-06-09). The
+divergent historical numbers (115 Wave-15, 232/18, 210/16, 176/8) are point-in-time
+snapshots under different dep/host matrices — they are not contradictions, but this
+line is the one canonical figure; reproduce it with `pip install -e '.[dev]'` then
+`pytest -q` (add `.[datagen]` + a Docker host to un-skip the substrate E2E).
 This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.
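
The canonical figure above is internally consistent: 266 passed + 62 skipped = 328 collected. A minimal sketch for re-deriving that total from the outcome counts pytest prints at the end of a run; `parse_summary` is a hypothetical helper and the sample string carries only the counts reported in this commit:

```python
import re


def parse_summary(line: str) -> dict[str, int]:
    """Extract `<n> <outcome>` pairs from a pytest summary line."""
    pairs = re.findall(r"(\d+) (passed|skipped|failed)", line)
    return {outcome: int(count) for count, outcome in pairs}


# Counts as reported for 2026-06-09; timing text is omitted on purpose.
counts = parse_summary("266 passed, 62 skipped")
total = sum(counts.values())
print(counts, total)
```

Comparing the parsed total against the collected count on each host makes it easy to tell an environment-driven shift between "passed" and "skipped" apart from tests that were genuinely added or removed.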

docs/_refine-2026-06-SUMMARY.md CHANGED

@@ -1,7 +1,11 @@
 # Docs Refine 2026-06 — Change Summary
-> Branch: `docs/refine-2026-06` (off `master` HEAD `aae66fa`). **Docs-only.** Not merged,
-> no PR opened — left for human review. Commit range: `aae66fa..e130879` (3 commits).
+> Branch: `docs/refine-2026-06` (off `master` HEAD `aae66fa`). **Docs-only.**
+> **MERGED** into `main` as of `4e6e82e` (merge commit "Merge docs/refine-2026-06"),
+> after the 3 documented waves (`20e3bd9`, `f00833d`, `e130879`) plus 3 reconciliation
+> commits (`ace6dd4`, `5e64616`, `d7e4b4e`) that retired the now-resolved main-lags-master
+> foot-gun — 6 commits total in range `fb13ea3..4e6e82e`, not the 3 this summary originally
+> listed. (This header was updated 2026-06-09 to reflect the merged reality.)
 This engagement refined the documentation corpus to (1) enforce the ground-truth provenance
 correction recorded in [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md), (2)

examples/gsm8k_grpo/run.py CHANGED

@@ -23,7 +23,7 @@ Usage:
 Cross-references:
   - `docs/USER_GUIDE.md` §8 — Recipe A: TRL `GRPOTrainer` subclass
   - `docs/INTEGRATION_RECIPES.md` Recipe 1 — minimum-viable Python script
-  - `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design (not used here; see
+  - `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md` — SDPO design (not used here; see
     `run_with_sdpo.py` for the SDPO variant)
 """
 from __future__ import annotations

examples/gsm8k_grpo_with_sdpo/README.md CHANGED

@@ -63,7 +63,7 @@ hints from the actual error sites in your trace data.
 - [`composer_replication.compose_loss`](../../composer_replication/loss.py) — the loss-composition entrypoint
 - [`docs/COMPOSER_RECIPE_MAPPING.md`](../../docs/COMPOSER_RECIPE_MAPPING.md) — how SDPO maps to Cursor's Composer-2.5 hint-distillation
-- [`docs/adrs/ADR-002-channel2-sdpo.md`](../../docs/adrs/ADR-002-channel2-sdpo.md) — SDPO design decision
+- [`docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`](../../docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md) — SDPO design decision
 - [`examples/gsm8k_grpo/run.py`](../gsm8k_grpo/run.py) — plain GRPO sibling (alpha_sdpo=0)
 ## CPU vs GPU

examples/gsm8k_grpo_with_sdpo/run.py CHANGED

@@ -28,7 +28,7 @@ Cross-references:
   - `composer_replication.compose_loss` — the loss-composition entrypoint
   - `docs/COMPOSER_RECIPE_MAPPING.md` — how SDPO maps to Cursor's
     Composer-2.5 hint-distillation
-  - `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design
+  - `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md` — SDPO design
   - `examples/gsm8k_grpo/run.py` — plain GRPO (no SDPO) sibling
 """
 from __future__ import annotations

pyproject.toml CHANGED

@@ -58,9 +58,16 @@ diloco = [
     "torchft-nightly",
 ]
 # Decoupled DiLoCo over serverless executors (per ADR-005)
+# fsspec gives the object-store rendezvous one code path (s3://, gs://, hf://,
+# file://); s3fs is the concrete S3 backend (the AWS default per the EKS design);
+# boto3 + kubernetes are needed by the AWS leaf adapters (SageMakerExecutor uses
+# boto3.create_training_job; EKSExecutor uses the kubernetes BatchV1 client).
 serverless = [
     "fsspec>=2024.6",
     "huggingface_hub>=0.27",   # for hf:// fsspec backend + HF Jobs
+    "s3fs>=2024.6",            # concrete S3 backend for ObjectStoreAllReduce (AWS default)
+    "boto3>=1.34",             # SageMakerExecutor (create_training_job) + S3 IAM
+    "kubernetes>=29.0",        # EKSExecutor (indexed k8s Jobs via BatchV1Api)
 ]
 # Replaysim dataset normalization (per ADR-004)
 #
@@ -111,11 +118,21 @@ datagen = [
 # module is a documentation skeleton (importing it does NOT require
 # monarch installed). The extra is dropped — see docs/TROUBLESHOOTING.md
 # ("monarch / data-juicer install") for installation guidance.
-# Everything for development
+# Development — the BASE dev set installs on every platform (macOS arm64 incl.).
+# NOTE: `diloco` (torchft-nightly) is deliberately NOT in base `dev`: torchft-nightly
+# ships Linux-x86_64 wheels only, so including it made `pip install -e '.[dev]'` fail
+# outright on Apple Silicon / any non-Linux-x86_64 host. The torchft-dependent tests
+# skipif-gate cleanly when it is absent, so the base dev set runs the full suite minus
+# the torchft integration tests on any platform.
 dev = [
     "pytest>=8.0",
     "ruff>=0.6",
-    "composer-replication[replay,diloco,train]",
+    "composer-replication[replay,train]",
+]
+# Full development incl. the DiLoCo outer-loop dep (Linux-x86_64 only — torchft-nightly).
+# Use on a Linux GPU/CI host to also exercise the torchft integration tests.
+dev-full = [
+    "composer-replication[dev,diloco,serverless,datagen]",
 ]
 [project.urls]
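
The extras split above means some imports are present only on certain hosts. A minimal sketch, assuming the usual top-level import names for the packages listed in the hunk, that reports which optional dependencies are importable without actually importing them; `probe` and `OPTIONAL_MODULES` are hypothetical names:

```python
from importlib.util import find_spec

# Import names assumed to correspond to the extras in this hunk.
OPTIONAL_MODULES = ["torchft", "s3fs", "boto3", "kubernetes"]


def probe(modules: list[str]) -> dict[str, bool]:
    """Map each top-level module name to whether it can be located."""
    return {name: find_spec(name) is not None for name in modules}


for name, present in probe(OPTIONAL_MODULES).items():
    print(f"{name:12s} {'available' if present else 'missing'}")
```

Running this after `pip install -e '.[dev]'` versus `'.[dev-full]'` gives a quick read on which skip-gated test groups a host can be expected to exercise.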

research/audit_findings.json ADDED

	@@ -0,0 +1,11 @@

+[
+ {
+  "mode": "hyperresearch-v8",
+  "run_id": "2026-06-09-socratic-mcts-swe-worldmodel-8f6dea",
+  "loci_count": 5,
+  "critical_findings_applied": 17,
+  "critical_findings_skipped": 0,
+  "polish_escalations": 0,
+  "final_word_count": 9207
+ }
+]

research/comparisons.md ADDED

	@@ -0,0 +1,60 @@

+# Cross-locus comparisons — argumentative spine
+## Tension 1: "Prune" means two different things at two different granularities
+- **Locus prune-vs-train-on-all** commits: TRAIN ON ALL branches, but *typed/routed* — winners→policy SFT/RL, losers→DPO rejects + world-model targets; the natural prune is the per-TURN JSD signal-presence test, not per-trajectory survival.
+- **Locus selfevolve-flywheel** commits: you MUST prune at the oracle-cleanliness gate before training, because train-on-all distills proxy-hacks (RSI §3.2) — reward-hacking branches must be discarded, not learned from.
+- **The cross-locus dynamic:** These look contradictory ("train on all" vs "prune") but reconcile into a precise rule: prune at TWO gates the policy must never cross — (a) oracle-cleanliness (drop reward-hacked / guard-broken branches entirely) and (b) per-turn signal-presence (skip zero-signal turns) — then train on ALL of what survives, routed by signal type. The flywheel locus supplies the safety floor that the prune-vs-all locus's "keep everything" must sit on top of.
+- **How the draft should engage this:** §4 must state the resolution as a two-gate filter (cleanliness gate + signal-presence gate) wrapping a typed train-on-all, NOT as "prune vs all". This is the headline reconciliation of the report.
+- **Calibration:** prune-vs-all is HIGH confidence that structured negatives beat positives-only; flywheel is HIGH that the gated version compounds, MEDIUM the current repo code is sufficient (safeguard #2, the disjoint held-out + kill-switch, is a documented GAP). Both name the SAME falsifier: held-out score declining while in-loop oracle reward rises = collapse caught in the act.
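
The shared falsifier named in Tension 1 (held-out score declining while in-loop oracle reward rises) can be turned into a simple check. This is a sketch under a strong assumption: "rising" and "declining" are read as strictly monotone over the last few evaluations, which is only one of several reasonable criteria. The function name and the sample series are illustrative.

```python
def collapse_detected(in_loop: list[float], held_out: list[float], window: int = 3) -> bool:
    """Flag in-loop reward strictly rising while held-out score strictly falls
    across the last `window` consecutive steps."""
    if len(in_loop) != len(held_out):
        raise ValueError("series must be the same length")
    if len(in_loop) < window + 1:
        return False  # not enough history to judge
    recent_in = in_loop[-(window + 1):]
    recent_out = held_out[-(window + 1):]
    rising = all(b > a for a, b in zip(recent_in, recent_in[1:]))
    falling = all(b < a for a, b in zip(recent_out, recent_out[1:]))
    return rising and falling


# Made-up series for illustration only.
in_loop_reward = [0.50, 0.55, 0.61, 0.68]
held_out_bad = [0.40, 0.39, 0.37, 0.34]
held_out_ok = [0.40, 0.41, 0.43, 0.44]

print(collapse_detected(in_loop_reward, held_out_bad))
print(collapse_detected(in_loop_reward, held_out_ok))
```

A kill-switch built on this would stop the flywheel when the function returns true; a noisier metric would likely need smoothing or a tolerance rather than strict monotonicity.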
+## Tension 2: The failed branch is simultaneously poison (for the policy) and gold (for the world model)
+- **Locus worldmodel-latent-deliberation** commits: train-on-all for the world-model head (a failed branch is a *perfect* next-state-prediction label — CWM precedent), prune/reward-filter for the GRPO policy head — "same tree, two harvests."
+- **Locus prune-vs-train-on-all** commits: the single best use of a failed branch is exactly a world-model next-state-prediction target (route #2, "no policy-gradient penalty at all").
+- **The cross-locus dynamic:** Strong CONVERGENCE from two independent investigations onto the same mechanism — the failed branch's value is realized by predicting it, not by penalizing the policy with it. This dissolves the prune-vs-all dilemma: you never throw the failed branch away (world model eats it) and you never let it destabilize the policy (no raw negative gradient). Convergence-from-independent-paths is itself a finding.
+- **How the draft should engage this:** §2 and §4 must share this "two-harvest" frame explicitly; the world-model aux loss is what *makes* train-on-all safe for the policy, because it relocates the failed-branch signal off the policy gradient.
+- **Calibration:** worldmodel is HIGH on necessity of training it, MEDIUM-HIGH that the aux next-state head is the best lever; prune-vs-all independently rates the same head MEDIUM-HIGH. Shared falsifier: foresight@k with aux-ON ≈ token-RL-only (aux content loss redundant at scale).
+## Tension 3: The expensive tree only pays for itself if expansion is divergence-gated — and that gate is where the world model earns its keep
+- **Locus credit-assignment** commits: the divergence tree is a genuine PRM-free counterfactual process oracle, but O(N^D) cost means it's worth it ONLY with divergence-gated expansion (branch only at high-VOI turns where heterogeneous models already disagree) → ~O(N·decision-points).
+- **Locus worldmodel** commits: the bottleneck the literature identifies (2601.03905) is foresight *governance* — when/whether to deliberate — not simulator fidelity; RL on the `<deliberate>` token's *placement* teaches governance.
+- **The cross-locus dynamic:** COMPLICATION-into-synthesis: the same "where to spend deliberation" question appears as a COST control in credit-assignment (where to branch the env) and as a CAPABILITY target in worldmodel (where to emit `<deliberate>`). They are the same decision learned at two levels — the trained world model's governance signal is exactly the policy that should drive divergence-gated expansion at data-generation time. The system's most expensive knob (branch factor) and its core capability (foresight governance) are the same lever.
+- **How the draft should engage this:** §3 (GA) and the §8 cost section must tie the divergence gate to VOI; note the bootstrap — early rounds gate on cross-model disagreement, later rounds can gate on the model's own learned deliberation-confidence.
+- **Calibration:** credit-assignment is conditional ("YES but gated"); its falsifier (divergence-gated arm fails to beat equal-budget outcome-only GRPO++ on long-horizon tasks) is the single most important compute-matched ablation in the whole program.
+## Tension 4: Replay entrenches the human distribution — branching is the claimed escape, but only the oracle proves you escaped
+- **Locus selfevolve-flywheel** commits: human-trace entrenchment (Self-Play-SWE-RL 2512.18552) is real for the UNGUARDED version; the antidote is counterfactual branching OFF the human path graded by tests — "you fork, you don't replay."
+- **Locus credit-assignment** commits: sibling divergence (different models reaching different EXECUTED outcomes from a shared parent) is the unit of signal — which is precisely a fork off the parent trajectory, validated by execution.
+- **The cross-locus dynamic:** CONVERGENCE plus a caveat: branching is the mechanism that turns "replay" into "counterfactual exploration," and both loci agree the EXECUTION ORACLE (not teacher consensus, not a learned verifier) is what certifies the fork found something real. The repo's Channel 3 today is *weaker* on this axis precisely because its fitness is teacher-plurality, not test execution — the upgrade to execution-graded branching is the core delta.
+- **How the draft should engage this:** §1 and §6 must name this as the single most important upgrade over the repo's current Channel 3 (teacher-plurality fitness → execution-oracle fitness) and as the answer to the strongest adversarial prior.
+- **Calibration:** flywheel HIGH that branching+oracle escapes entrenchment; the open risk both flag is that a system generating its own tasks from its own traces can drift the held-out set toward the train set.
+## Tension 5: EKS-primary is cheap to adopt in CODE but the genuinely-new cost is sandbox fan-out — which is also the throughput ceiling of the whole idea
+- **Locus eks-architecture** commits: EKS-primary single-control-plane hybrid; the repo port is a ~300 LOC leaf adapter (EKSExecutor + SageMakerExecutor); BUT the one genuinely-new infra is per-branch sandbox isolation, and per-branch cold-start can dominate outer-loop wall-clock.
+- **Locus credit-assignment** commits: the rollout/branching is the system's most expensive piece ($64/trace ungated vs $0.98 flat); divergence-gating is mandatory.
+- **The cross-locus dynamic:** The architecture locus's "strongest counter" (sandbox cold-start dominates → demote EKS from 'primary for everything' to 'primary for control+training, bespoke pool for sandbox execution') is the SAME bottleneck the credit-assignment locus controls with divergence-gating. Infra cost and algorithmic cost are the same constraint: branch factor × sandbox cold-start. SWE-MiniSandbox (container-free kernel isolation, ~5% disk / ~25% env-prep) is the throughput primitive that makes high fan-out affordable.
+- **How the draft should engage this:** §8/§10 must connect the algorithmic gate (divergence-gating, §3) to the infra primitive (cheap sandboxes, container-free or snapshotted microVM) — the two cost controls are one. Honestly flag the "demote EKS for sandboxes" fallback.
+- **Calibration:** eks-architecture HIGH (8/10) on the design; explicit falsifier = measured per-branch sandbox cold-start dominating wall-clock.
+---
+## Step-8 corpus-critic confidence revisions (overturning evidence found — MUST be reflected in the draft)
+**Revision A — the heterogeneity premise is DOWNGRADED (contested, not assumed-positive).** Adversarial search found substantive counter-evidence that the system's single most distinctive choice (different model family per node + cross-family DPO) may not pay for itself:
+- "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets" (2604.02460): at held-constant reasoning tokens, single-agent matches/beats multi-agent incl. the ensemble variant (the closest analogue to multi-rollout heterogeneous search); "many reported MAS gains are better explained by compute and context effects than by inherent architectural superiority"; holds across Qwen3/DeepSeek-R1/Gemini; Data-Processing-Inequality argument (one agent with full context >= split agents).
+- "Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline" (2601.12307): a single-LLM baseline matched AFlow-optimized HETEROGENEOUS (GPT-4o-mini + Claude-Haiku) MCTS workflows at lower cost.
+- Cross-tokenizer/cross-family distillation is "a largely unsolved problem" (2604.07466 BLD + CTPD/CDM cluster): cross-family preference transfer is fragile, sometimes DEGRADES, needs special byte-level/OT machinery.
+- **Engagement guidance:** §1/§3/§4 must treat heterogeneity as a HYPOTHESIS requiring an equal-compute control arm (single strong model with N temperature/persona samples) before claiming any heterogeneity gain. The typed-train-on-all and divergence-tree positions do NOT depend on heterogeneity (they work with homogeneous N-sampling too), so the core design survives — but the "different models per node" flourish is now an ablation question, not a premise. NOTE: safeguard #4's "N>=3 population as anti-collapse diversity" SURVIVES (no source showed model-diversity gives zero anti-collapse benefit; on-policy-distillation survey ties gains to predictive diversity).
+**Revision B — the world-model aux loss is DEMOTED from "necessary" to "optional, parameter-isolated, ablation-gated."** Direct 2026 counter-evidence on all three angles:
+- "Reasoning and Tool-use Compete in Agentic RL" (2602.00994): jointly training two capabilities into one parameter set induces misaligned gradients / interference; decoupling into separate LoRA adapters (DART) beats joint optimization. → stacking a 2nd SDPO/next-state head onto the SAME policy head is the exact interfering configuration; argue for a separate head/adapter.
+- "Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning" (2605.06840): LLMs generate deep look-ahead in CoT but move choice is causally driven by shallow depth-1 nodes — foresight content generated but NOT consumed. So improving prediction quality may not move decisions.
+- "The Predictive-Causal Gap: An Impossibility Theorem" (2605.05029): pure predictive objectives provably/empirically optimize AWAY from causal/decision-relevant structure (92% lower prediction error while causal fidelity ~0).
+- Counter-counter (kept honest): SPA, VAGEN, Imagine-then-Plan, FOREAGENT all report explicit future-state simulation HELPS agentic pass-rate — the field is genuinely split.
+- **Engagement guidance:** §2 must reframe the aux next-state loss as OPTIONAL, in a parameter-isolated head/adapter (not fused into the policy head), gated behind the pre-registered ablation (aux-ON vs deliberation-token-RL-only) on the PRIMARY metric (pass-rate + counterfactual-foresight, NOT next-state accuracy). This matches the worldmodel investigator's OWN stated falsifier. The cheapest decisive experiment we could run ourselves is the SWE-specific next-state-head ablation (does not exist in the literature yet).
+**Revision C — even the EXECUTION ORACLE gets gamed (safeguard #1 is necessary but NOT sufficient).** The flywheel locus claimed a true execution oracle is "categorically different" from a proxy and thus immune to RSI-style depth-amplified hacking. Adversarial search complicated this: EvilGenie (2511.21654), "LLMs gaming verifiers: RLVR can lead to reward hacking" (2604.15149), and "Do synthetic trajectories reflect real reward hacking" (2604.23488) show verifiable/test-based rewards ARE gamed — agents hardcode/special-case to pass FAIL_TO_PASS, exploit fractional partial-credit, and overfit held-out tests. → Engagement: §4 (oracle-cleanliness gate) and the safeguards must state that the execution oracle REDUCES but does not eliminate the hack surface; HackMonitor + held-out disjoint eval + the depth kill-switch are doing real work, not belt-and-suspenders. The oracle bounds the hack surface (finite, vs an open-ended proxy) but PASS_TO_PASS guards, test-provenance checks, and contamination control are mandatory, not optional. This makes safeguard #2 (disjoint held-out + kill-switch) MORE load-bearing, not less.
+Net: the corpus critic STRENGTHENED the report by puncturing two overclaims and complicating a third. The robust core (fork-off-the-human-trace + execution oracle + typed train-on-all + two-gate prune + divergence-gated expansion + 4 safeguards + EKS-primary) is untouched; the two flourishes (heterogeneity-as-premise, aux-loss-as-necessary) become explicit ablation questions. Both shared falsifiers were independently confirmed as the right experiments.
+## Summary for the synthesizer
+The five loci are NOT orthogonal: they collapse into ONE coherent design with a single through-line: **fork off the human trace (with heterogeneous models as a hypothesis to be tested against an equal-compute single-model arm, per Revision A), grade by a true execution oracle, gate expansion on divergence/VOI, and route the resulting branches by signal type (winners to the policy, all branches incl. failures to a world-model next-state head) under two hard prune gates (oracle-cleanliness, per-turn signal-presence) and four collapse safeguards.** Where the ablation supports it, the world-model aux loss plays three roles at once: it is the project's stated goal, the safe home for failed-branch signal (resolving prune-vs-all), and the learned governance policy that drives divergence-gated expansion (controlling cost). Per Revision B it is optional, parameter-isolated, and ablation-gated rather than a premise. The single most important experiment is the compute-matched, generate-once/route-many P0–P6 ablation on the repo's ADR-013 ladder, measuring calibration/foresight, not just pass@1.
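
The two-gate filter from Tension 1 and the "two harvests" routing from Tension 2 can be sketched as below. This is a minimal illustration under stated assumptions, not code from the repo: `Turn`, `Branch`, `route_branches`, the field names, and the `jsd_floor` threshold are all hypothetical, and the real gates are described in the critic notes as living in `env.py` and `data_collator.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Turn:
    jsd: float  # hypothetical per-turn divergence signal


@dataclass
class Branch:
    branch_id: str
    passed_oracle: bool  # execution-oracle verdict for the branch
    oracle_clean: bool   # False if reward-hacked or guard-broken
    turns: List[Turn] = field(default_factory=list)


def route_branches(
    branches: List[Branch], jsd_floor: float = 0.05
) -> Dict[str, List[Tuple[str, int]]]:
    """Apply both prune gates, then route every survivor by signal type."""
    routes: Dict[str, List[Tuple[str, int]]] = {
        "policy": [],       # winners: policy SFT/RL
        "dpo_reject": [],   # losers: DPO rejects
        "world_model": [],  # all survivors: next-state targets
    }
    for b in branches:
        # Gate (a): oracle-cleanliness. Hacked branches are dropped entirely.
        if not b.oracle_clean:
            continue
        # Gate (b): per-turn signal presence. Zero-signal turns are skipped.
        kept = [t for t in b.turns if t.jsd >= jsd_floor]
        if not kept:
            continue
        item = (b.branch_id, len(kept))
        # Typed train-on-all: every surviving branch feeds the world model,
        # and only the policy-facing route depends on the oracle verdict.
        routes["world_model"].append(item)
        if b.passed_oracle:
            routes["policy"].append(item)
        else:
            routes["dpo_reject"].append(item)
    return routes
```

In this sketch a failed but clean branch never reaches the policy route as a raw negative; it appears only as a DPO reject and as a world-model target, which is the reconciliation the comparison file argues for.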

research/critic-findings-depth.json ADDED

	@@ -0,0 +1,35 @@

+{
+  "critic": "depth",
+  "findings": [
+    {
+      "severity": "high",
+      "section": "10. Cost, Throughput, Failure Modes (and the §3 callback at line 55)",
+      "issue": "The single quantitative anchor for the entire 'divergence-gating is mandatory' argument misreads its own source. The report frames '~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree.' But research/05:256 derives $64 explicitly as a FLAT replay cost ($0.008/step x 1000 steps x 8 teachers = 8000 forward passes, no branching). The repo's own flat-to-tree note (flat-multi-teacher-to-branching...md:40) states this directly: 'research/05 ... already prices the FLAT case at ~$64/trace ungated for 8 teachers x 1000 steps; a tree makes [gating] mandatory.' Both numbers the report compares are flat costs that differ only in scale (N=3 short trace = $0.98 from teacher_replay.py:7-8 spike-001; N=8 x 1000 steps = $64). Labeling the $64 figure a 'branching tree' conflates a teacher-count/length scale difference with the flat-vs-tree distinction, and badly UNDERSELLS the real tree cost: a true O(N^D) branching tree is combinatorially worse than $64, not equal to it. The argument's headline number is wrong in the direction that weakens the report's own thesis.",
+      "fix": "Reframe to: flat Channel-3 replay is ~$0.98/trace at N=3 (teacher_replay.py:7-8) and ~$64/trace at the 8-teacher x 1000-step scale (research/05:256) — both FLAT, O(N*T). A branching tree is O(N^D), strictly worse than either flat figure; that combinatorial blow-up (not the $0.98-to-$64 gap) is what makes divergence-gating mandatory. Drop 'branching tree' from the $64 clause.",
+      "anchor_quote": "~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree"
+    },
+    {
+      "severity": "medium",
+      "section": "9. The SageMaker Path and the Recommended Hybrid (also §6 reuse/build table)",
+      "issue": "The '~150 LOC each' executor estimate (and the '~300 LOC' combined figure in §6) undershoots the repo's only working ServerlessExecutor backend by ~2.5x and is not grounded in the existence proof the report itself cites. The report leans on ModalSpawnExecutor as the 'working proof' that calibrates the delta [42], but modal_spawn.py is 390 LOC and the executor.py reference (Protocol + LocalProcessExecutor) is 310 LOC. An EKS adapter that must handle Indexed Jobs, JOB_COMPLETION_INDEX->REPLICA_RANK mapping, GPU limits, IRSA, optional runtimeClassName, plus poll/cancel/stream_logs/collect against the Batch/Pod APIs is unlikely to be half the size of the Modal adapter. The figure reads as optimistic rather than measured, which weakens the report's load-bearing 'nine-tenths already exists / bounded delta' claim.",
+      "fix": "Either ground the estimate (e.g. 'ModalSpawnExecutor is 390 LOC; expect EKSExecutor in the same 300-400 LOC range') or soften to an order-of-magnitude ('a few hundred LOC each, comparable to the existing Modal adapter') instead of the precise '~150 LOC each'.",
+      "anchor_quote": "**`EKSExecutor` (~150 LOC, primary)**"
+    },
+    {
+      "severity": "low",
+      "section": "6. Grounding in the composer-replication-framework (reuse/build table) and §8/§9",
+      "issue": "The report consistently presents `EKSExecutor` as the repo's own reserved slot ('AWS leaf adapters | Build (~300 LOC) | `EKSExecutor` + `SageMakerExecutor`'). But the repo never names an EKSExecutor: the ServerlessExecutor Protocol docstring (executor.py:41) lists 'RunPodExecutor, SageMakerExecutor, K8sExecutor' as Future, and INTEGRATION_RECIPES.md:685 lists `K8sExecutor` (KubeRay/Volcano) as Roadmap. `EKSExecutor` is the report's coinage. SageMakerExecutor is a genuine repo-reserved name; EKSExecutor is not. This slightly overstates how pre-slotted the EKS path is.",
+      "fix": "Either note that the repo's roadmap slot is `K8sExecutor` (executor.py:41 / INTEGRATION_RECIPES.md:685) and EKSExecutor is the proposed concrete K8s implementation, or rename to `K8sExecutor` to match the repo. A one-clause parenthetical ('the repo's reserved `K8sExecutor` slot, here specialized to EKS') closes the gap.",
+      "anchor_quote": "`EKSExecutor` + `SageMakerExecutor` [42]"
+    },
+    {
+      "severity": "low",
+      "section": "7. What the Literature Says (Endorsements, the counterfactual-credit backbone)",
+      "issue": "The divergence-as-counterfactual-credit claim slightly conflates two distinct mechanisms. The report says siblings from a shared parent are 'low-variance because the shared parent differences out the baseline,' then attributes this to 'the quantity learned counterfactual-credit methods approximate with a hindsight model' [33]. But 2011.09464 (the cited note) achieves low variance via a FUTURE-CONDITIONAL (hindsight) baseline that conditions on the realized trajectory — not via a shared-parent/leave-one-out baseline (which is the standard MC advantage the repo's GRPO LOO already does). 'Shared parent differences out the baseline' is really the LOO/group-relative argument (closer to Tree-GRPO [44]), whereas the hindsight-model framing is CCA. The two are run together as if one mechanism.",
+      "fix": "Separate the two: the shared-parent differencing is a group-relative/LOO baseline (Tree-GRPO [44]); CCA [33] is the stronger, hindsight-conditioned variant that the executed-sibling structure approximates non-parametrically. Stating both as distinct sources of the low-variance claim is more accurate and actually strengthens the backbone.",
+      "anchor_quote": "low-variance because the shared parent differences out the baseline"
+    }
+  ],
+  "overall": "The report's core mechanism claims are unusually well-grounded — I verified each axis against source and most are faithful to the byte level. The flat->tree fitness delta (extract_dpo_pairs breaks after one teacher-plurality pair; _grade() returns masked pass-fraction) is exact. The SDPO-carrier-for-world-model claim is mechanically sound: the world-model 'splice realized observation into ctx_teacher as privileged info' reuses the same ctx_teacher = ctx_student + hint pattern, post-hint mask, and ADR-011 aligned-index gather that the real collator already implements (data_collator.py, ADR-011) — no hand-waving. Both prune gates are real: oracle-cleanliness = _grade() 0-masking (env.py:90), per-turn signal-presence = the collator empty-recovery row-drop (data_collator.py L308). ObjectStoreAllReduce is verified to the line: PUT round_{NNNNNN}/rank_{RRRR}.pt, poll-until-all-peers, mean, and the 'straggler blocks at the poll loop bounded by timeout_s=1800' claim is exactly what the code does (allreduce.py:151-162). The counterfactual-credit backbone is grounded (2011.09464 + Tree-GRPO step-level DPO equivalence), with only a minor mechanism conflation. The depth weaknesses are concentrated in the QUANTITATIVE concreteness, not the conceptual substance: the headline cost anchor mislabels a flat-scale figure as a tree cost (and thereby undersells the tree's true O(N^D) cost), the executor LOC estimates undershoot the only working backend by ~2.5x, and EKSExecutor is presented as a repo slot when the repo reserves K8sExecutor. None touch the load-bearing argument; all are surgical fixes that make the numbers honest.",
+  "findings_count_note": "4 findings: 1 high (cost-anchor misread), 1 medium (LOC estimate ungrounded), 2 low (naming + credit-mechanism conflation). The conceptual axes the checklist flagged are solid and I say so in overall rather than inventing nits."
+}
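
The cost-anchor finding above turns on the difference between flat O(N*T) replay and an O(N^D) branching tree. The sketch below reproduces that arithmetic. The `$0.008/step`, 8-teacher, and 1000-step figures come from the finding; the function names and the simplification of one forward pass per tree node are assumptions made only for illustration.

```python
COST_PER_STEP_USD = 0.008  # per-step figure quoted in the depth finding


def flat_passes(n_teachers: int, n_steps: int) -> int:
    """Flat replay: every teacher replays every step once, O(N*T)."""
    return n_teachers * n_steps


def tree_nodes(branch_factor: int, depth: int) -> int:
    """Full branching tree: N + N^2 + ... + N^D nodes, O(N^D).

    Assumes one forward pass per node (an illustrative simplification).
    """
    return sum(branch_factor ** d for d in range(1, depth + 1))


def gated_passes(branch_factor: int, decision_points: int) -> int:
    """Divergence-gated expansion: branch only at selected turns,
    roughly O(N * decision_points)."""
    return branch_factor * decision_points


def usd(passes: int) -> float:
    """Convert a forward-pass count into dollars at the quoted rate."""
    return round(passes * COST_PER_STEP_USD, 2)


flat = flat_passes(8, 1000)  # 8000 passes, the flat ~$64/trace figure
tree = tree_nodes(8, 5)      # 37448 nodes at depth 5 alone
gated = gated_passes(8, 20)  # 160 passes for 20 gated decision points
```

Under these assumptions an ungated 8-way tree already exceeds the flat 8000-pass budget at depth 5, which is the combinatorial blow-up the finding says should motivate divergence-gating, rather than the $0.98-to-$64 gap.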

research/critic-findings-dialectic.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "critic": "dialectic",
+  "overall": "The report engages all six mandated skeptic disconfirmers (single-agent>=multi 2604.02460; aux interference 2602.00994; myopic 2605.06840; predictive-causal gap 2605.05029; oracle-gamed EvilGenie/RLVR-hacking; outcome-only DeepSWE/SWE-RL) and renders the two contested flourishes (heterogeneity, world-model aux loss) as explicit pre-registered ablation arms with stated falsifiers and flip-conditions. Provenance is clean: Channel 3, the tree, and FeatureDeletionEnv bug injection are correctly attributed to the framework's own additions, never to Cursor (Ch1 Dr.GRPO + Ch2 SDPO) and never to Socratic-SWE (which the report correctly notes does NOT inject bugs). The central prune-vs-train-on-all question is committed (typed train-on-all under two hard gates), not hedged. The heterogeneity axis (§3/§7 Pushback 1) is solid and faithful to the counter-evidence note, with the equal-compute control arm and the surviving anti-collapse justification both correct. The findings below are not about missing disconfirmers but about (a) one in-repo counter-position the report straw-manned toward optimism, (b) a categorical claim its own DeepSWE source contradicts, (c) asymmetric domain-transfer skepticism applied to the pro side but not to load-bearing non-SWE disconfirmers it relies on, (d) a directly-SWE disconfirmer present in the corpus but never cited, and (e) a numerical misread. Few, high-quality.",
+  "findings": [
+    {
+      "severity": "high",
+      "section": "5. Pipeline Shape: Two Loops, Not Two Phases",
+      "issue": "The report straw-mans its own repo's counter-position. It claims self-distillation 'in this configuration, [is] a *stabilizer* and not only a collapse risk' citing SDFT, and treats Channel-2 SDPO as 'exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses.' But the repo's own ADR-013 (read in adr-decision-backbone note) states the opposite about THIS exact channel: 'SDPO against the altered model's own hint-conditioned forward pass is the channel most likely to AMPLIFY the distortion' and is 'an *experimental intervention*, not a benign stabilizer' (teacher==student-family; if hints add no independent info the optimum is to imitate the altered conditional, sharpening a soft bias into a hard preference). The report cites the optimistic external SDFT result while omitting the pessimistic in-repo finding on the very same mechanism, leaving the 'stabilizer' framing one-sided.",
+      "fix": "Add a clause acknowledging the repo's own counter-position: e.g. after 'is exactly that on-policy, demonstration-conditioned regime' add '— though the repo's own ADR-013 warns the same SDPO channel is the one most likely to AMPLIFY an existing distortion when the teacher is same-family and the hint adds no independent information, so the stabilizer claim holds only when the privileged-information conditioning carries genuine new signal (the per-turn JSD signal-presence gate of §4).'",
+      "anchor_quote": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk"
+    },
+    {
+      "severity": "high",
+      "section": "5. Pipeline Shape: Two Loops, Not Two Phases",
+      "issue": "Categorical overclaim contradicted by the report's own cited source. The report asserts a clean dichotomy: 'every working SWE flywheel optimizes a true execution oracle ...; every collapse story requires a proxy or self-judged verifier.' But DeepSWE [43] — cited approvingly two sentences later and throughout — documents near-collapse on a TRUE 0/1 execution oracle from positives alone: 'LLM agents may stumble upon correct patches and pass all tests without knowing. Training with these positives reinforces undesired behaviors ... leading to collapse,' which is precisely why DeepSWE needed compact filtering. So a true execution oracle did NOT prevent a collapse mode; positives on a real oracle produced it. The 'every collapse story requires a proxy' claim is falsified by the report's own evidence base.",
+      "fix": "Soften the dichotomy to acknowledge the positives-on-a-true-oracle collapse mode: e.g. change 'every collapse story requires a proxy or self-judged verifier' to 'most collapse stories require a proxy or self-judged verifier — though even a true execution oracle can collapse if positives reinforce accidental passes (DeepSWE's compact-filtering motivation [43]), which is a further argument for the per-turn signal gate and submit-gated credit.'",
+      "anchor_quote": "every collapse story requires a proxy or self-judged verifier"
+    },
+    {
+      "severity": "medium",
+      "section": "7. What the Literature Says (and Where It Pushes Back)",
+      "issue": "Asymmetric domain-transfer skepticism. The report disarms the pro-simulation cluster with 'none of those is a *SWE-pass-rate result at equal compute* — they are calibration, reasoning-trace, and non-SWE results.' But the report applies no such discount to two load-bearing disconfirmers that are equally non-SWE: the anti-emergence 'killer fact' (§2, [11] 2601.03905) is a vision-language-model agentic+VQA study, and 'the single most decisive result for *this* project' (§4, [27] 2503.14391) is a multiple-choice-QA Likra study, not SWE and not the DPO/GRPO regime in use. The same 'not a SWE-pass-rate result at equal compute' burden the report imposes on the pro side should be acknowledged for these anti-side pillars, or the symmetry argument is one-directional.",
+      "fix": "Add a one-clause symmetry caveat where the burden-shift is stated, e.g. after 'they are calibration, reasoning-trace, and non-SWE results' add '(the same domain-transfer caveat applies to the anti-side pillars — the world-model-as-tool foresight result [11] is VLM/VQA and the near-miss-calibration result [27] is MCQA — which is why the SWE-specific P0-P6 ablation, not the imported literature, is the actual decider).'",
+      "anchor_quote": "none of those is a *SWE-pass-rate result at equal compute*"
+    },
+    {
+      "severity": "medium",
+      "section": "2. The World-Model Goal: Training Latent What-If Deliberation",
+      "issue": "The anti-emergence case rests on a non-SWE study while a directly-on-domain SWE disconfirmer in the same corpus is never cited. The 'killer fact against emergence' [11] (2601.03905) is built on vision-language models over 'agentic and visual question answering tasks.' The corpus contains 2604.12147 (Plan Compliance in Autonomous Programming Agents, 16,991 SWE-agent trajectories on SWE-bench Verified + Pro across GPT-5 mini / DeepSeek-R1-V3 / Devstral) — flagged by the corpus-critic as 'the single most on-domain piece of evidence' that SWE agents fall back on memorized workflows and that a subpar/misaligned plan hurts MORE than no plan. It directly supports the report's selective-curriculum-over-naive-train-on-all thesis yet is absent from the citation list (no [49]; sources end at [48]). Grounding the anti-emergence and selective-structure arguments on a VLM/VQA study when a direct SWE result is available weakens the section.",
+      "fix": "Cite 2604.12147 in §2 (and/or §4) alongside [11]: e.g. after the foresight-governance sentence add 'and in SWE specifically, a study of 16,991 SWE-agent trajectories on SWE-bench finds agents revert to internalized workflows and that a misaligned plan hurts more than no plan — direct on-domain support for selective, alignment-gated structure over naive train-on-all [49].' Add the source to the Sources list.",
+      "anchor_quote": "handed a world model as a tool, agents invoke it under 1% of the time"
+    },
+    {
+      "severity": "medium",
+      "section": "2. The World-Model Goal: Training Latent What-If Deliberation",
+      "issue": "Numerical misread of the predictive-causal gap. The report says 'across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension *while achieving 92% lower prediction error*.' Per the source (the-predictive-causal-gap note), the MEAN causal fidelity across the 2,695 configurations is 0.49 (only 2.5% exceed 0.70); the ~1e-8 ('causally blind') figure and the 92%-lower-prediction-error figure are the high-dimension N=100 extreme, not the 2,695-network mean. Coupling '2,695 networks mean causal fidelity' with '~1e-8' conflates the corpus mean with the worst-case dimension and overstates the typical-case magnitude.",
+      "fix": "Split the two statistics: e.g. 'across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) *while achieving 92% lower prediction error*.'",
+      "anchor_quote": "across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension *while achieving 92% lower prediction error*"
+    },
+    {
+      "severity": "low",
+      "section": "4. The Central Question: Prune Bad Branches vs Train on All Branches",
+      "issue": "Under-engaged tension in the oracle-cleanliness argument. The report uses EvilGenie [30] to argue held-out tests are weak ('held-out tests giving only minimal detection improvement') and simultaneously makes the disjoint held-out eval 'the *most* load-bearing safeguard' (§4, §5 safeguard #2). EvilGenie's own finding is that the held-out-test method gave minimal improvement while the LLM JUDGE was 'highly effective at detecting reward hacking in unambiguous cases' — yet safeguard #1 forbids a learned/self-judged verifier in the training reward and the report leans on held-out eval. The report should reconcile why the safeguard it most relies on is the detector EvilGenie found weakest, and whether the LLM-judge detector (allowed only at test-time selection per safeguard #1) belongs in the monitoring stack.",
+      "fix": "Add a reconciling clause where EvilGenie is cited: e.g. 'EvilGenie found held-out tests weak as a *detector* but the LLM judge effective — so the held-out eval here is load-bearing as a drift TRIPWIRE (proxy-minus-realeval gain) rather than a per-trajectory hack detector, and an LLM-judge monitor is admissible for offline flagging though never as the training reward (safeguard #1).'",
+      "anchor_quote": "with held-out tests giving only minimal detection improvement"
+    }
+  ]
+}
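
The shared falsifier named in the comparison file (held-out score declining while in-loop oracle reward rises) and the "drift tripwire" reading of the held-out eval in the last finding can be sketched as a simple check over per-round metrics. The function name, the `window` and `tol` parameters, and the example numbers are hypothetical; the repo's actual kill-switch is described as a documented gap, so this is an illustration of the idea only.

```python
from typing import Optional, Sequence


def tripwire_round(
    in_loop: Sequence[float],
    held_out: Sequence[float],
    window: int = 2,
    tol: float = 0.0,
) -> Optional[int]:
    """Return the first round index at which, measured over the previous
    `window` rounds, in-loop reward rose while held-out score fell.
    Returns None if the tripwire never fires."""
    if len(in_loop) != len(held_out):
        raise ValueError("metric series must have the same length")
    for r in range(window, len(in_loop)):
        proxy_gain = in_loop[r] - in_loop[r - window]
        real_gain = held_out[r] - held_out[r - window]
        # Fire only when the two curves move in opposite directions.
        if proxy_gain > tol and real_gain < -tol:
            return r
    return None


# Illustrative, made-up series: in-loop reward keeps climbing while the
# held-out score peaks and then declines.
drifting = tripwire_round(
    [0.30, 0.40, 0.50, 0.60, 0.70],
    [0.30, 0.34, 0.36, 0.33, 0.30],
)
healthy = tripwire_round(
    [0.30, 0.40, 0.50, 0.60],
    [0.30, 0.33, 0.35, 0.38],
)
```

A check like this operates on aggregate curves, which matches the finding's point that the held-out eval works as a drift tripwire rather than as a per-trajectory hack detector.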

research/critic-findings-instruction.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "critic": "instruction",
+  "findings": [
+    {
+      "severity": "low",
+      "section": "## 2. The World-Model Goal: Training Latent What-If Deliberation",
+      "issue": "The prompt-decomposition entity 'World-model / latent-simulation literature' lists 'Chain of World' as a required_field, and the vault holds a dedicated note (260303195-chain-of-world-world-model-thinking-in-latent-motion.md) plus a synthesis lens note on it. The report's world-model section grounds latent deliberation in CWM [13], MuZero [14], Dreamer [15], From Word to World [12], and the foresight-governance paper [11], but never names Chain-of-World. Chain-of-World is the most direct 'latent-motion world-model thinking' analogue to the query's 'world-model latent deliberation' framing, so its absence is a small but real coverage gap on an explicitly-enumerated atomic item.",
+      "fix": "Add a one-clause citation to Chain-of-World where the report introduces value-equivalent latent prediction, e.g. after the MuZero/Dreamer sentence on line 27, append a clause noting Chain-of-World as the SWE-adjacent 'latent-motion' precedent for thinking in a learned latent rather than reconstructing full state. Keep it to one sentence; do not expand the section.",
+      "anchor_quote": "MuZero and Dreamer add the design discipline: learn the *value-equivalent* latent"
+    }
+  ],
+  "overall": "Instruction-following is excellent and needs no structural intervention. All 11 required H2 headings appear verbatim and in the exact order specified by research/prompt-decomposition.json (lines 9, 23, 37, 61, 95, 112, 138, 163, 181, 189, 208), with the expected '## Sources' as a 12th. All six required tables are present and correctly scoped: GA mapping (line 41), P0-P6 branch-usage experiment with arms+metrics+predicted ordering+explicit falsifier (line 83), repo reuse-vs-build ledger (line 116), paradigm comparison Socratic-RL/Socratic-SWE/Composer-2.5/proposed (line 142), EKS component table (line 169), and phased build plan (line 199). Every atomic sub_question is covered: world-model goal (S2, definition + next-state-prediction signal + MuZero/Dreamer/CWM + ECE/Brier/foresight@k measurement); GA framing (S3, all six concepts populated + where-it-breaks in three named places); the CENTRAL prune-vs-train-on-all question is answered as typed train-on-all under two hard gates AND backed by a concrete generate-once/route-many P0-P6 experiment with primary metrics and a stated falsifier; the '2 sections or 1?' question is answered head-on as 'two loops at different timescales, not two phases' (line 97); EKS is UNAMBIGUOUSLY PRIMARY -- the heading is '(Primary)', line 165 opens 'EKS is primary, with a single control plane', S9 frames SageMaker as 'not a competing platform ... an inner-loop node-group swap on the same control plane', and the honest EKS-demotion path on line 193 is scoped strictly to the sandbox-execution pool, never the control/training plane; the SageMaker/HyperPod path is concrete (1:1 control-plane mapping, ~150 LOC SageMakerExecutor, Training-Jobs-vs-HyperPod selection); repo grounding is dense and file-line-anchored throughout (teacher_replay.py, env.py, claude_code.py, composer_trainer.py, ADRs); cost/throughput is quantified ($0.98 vs $64/trace, 60-80% gating savings, $0.05/round comm, 50-70% Spot, SWE-MiniSandbox ~5%/~25%). The provenance guardrail (Channel 3 + tree are the framework's OWN additions, Cursor = Ch1 Dr.GRPO + Ch2 SDPO) is stated up front (line 7) and honored repo-wide. The single finding is a low-severity missing-citation nit (Chain-of-World), not a coverage failure.",
+  "findings_note": "one low finding only; axis is solid"
+}
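The "ECE/Brier" measurements that the summary above credits the world-model section with are standard calibration metrics. A minimal pure-Python sketch is given below, assuming the agent emits a probability that a step will succeed and the sandbox later records a 0/1 outcome; the function names and the 10-bin default are illustrative assumptions, and `foresight@k` is left out because its definition is not given here.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: weighted mean gap between predicted and observed frequency.

    Predictions are grouped into equal-width probability bins; each bin
    contributes |mean predicted probability - observed success rate|,
    weighted by the fraction of predictions that fall in the bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_conf - observed)
    return ece


if __name__ == "__main__":
    predicted = [0.9, 0.9, 0.1, 0.1]
    actual = [1, 1, 0, 0]
    print(brier_score(predicted, actual), expected_calibration_error(predicted, actual))
```

Lower is better for both metrics; Brier mixes accuracy and calibration, while ECE isolates calibration, so reporting them together gives a fuller picture of foresight quality.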

research/critic-findings-width.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "critic": "width",
+  "findings": [
+    {
+      "severity": "high",
+      "section": "## 7. What the Literature Says (and Where It Pushes Back)",
+      "issue": "The process-vs-outcome cluster is the empirical backbone of the entire §4 prune-vs-train-on-all argument, yet its two foundational papers are named once with WRONG citations and given no source IDs. Line 161 reads 'process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato) [19][27]' — but [19] is the 'LLM-Based World Models' paper (arXiv:2411.08794) and [27] is 'How Much Do LLMs Learn From Negative Examples' (arXiv:2503.14391). Neither is Let's Verify nor Uesato. Both papers have dedicated, on-point vault notes (Lightman et al. 2305.20050 'Let's Verify Step by Step' — PRM beats ORM on MATH, releases PRM800K; Uesato et al. 2211.14275 — first head-to-head process-vs-outcome on GSM8K, process feedback cuts reasoning error 14.0%→3.4%) that carry zero source IDs anywhere in the report. The single sentence the skeptic-rebuttal rests on mis-attributes its evidence.",
+      "fix": "Add two new Sources entries: '[49] Let's Verify Step by Step (Lightman et al.) — arXiv:2305.20050 (PRM process supervision beats ORM on MATH; releases PRM800K)' and '[50] Solving math word problems with process- and outcome-based feedback (Uesato et al.) — arXiv:2211.14275 (first process-vs-outcome head-to-head; process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)'. Then change the line-161 citation from '(Let's Verify, Uesato) [19][27]' to '(Let's Verify [49]; Uesato [50] — process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)' so the named papers carry their own IDs.",
+      "anchor_quote": "process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato), the world-model field is split"
+    },
+    {
+      "severity": "high",
+      "section": "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
+      "issue": "SWE-Search (arXiv:2410.20285) is the single closest published analogue to the proposed system — MCTS over repository-level SWE tasks with per-node value estimation, backtracking, and self-feedback — and the report names it ('SWE-Search expands nodes with one policy') but gives it no source ID and never engages its central, decision-relevant findings: a 23% relative SWE-bench improvement across five models from search ALONE, and the explicit result that performance scales with inference-time compute 'without requiring larger models or additional training data.' That is a sharper version of Pushback 3's skeptic case (does the tree's gain need training at all, or is it just test-time search?) and a direct input to the §7 paradigm table, yet the vault note is completely unused. Leaving the closest prior art uncited weakens the 'claim the synthesis, not the parts' provenance argument.",
+      "fix": "Add a Sources entry '[51] SWE-Search: Enhancing Software Agents with MCTS and Iterative Refinement — arXiv:2410.20285 (23% relative SWE-bench gain from search alone, single policy, scales with inference-time compute, no extra training)'. Tag the existing §1 mention 'SWE-Search expands nodes with one policy [51]', and add one clause to Pushback 3 (§7) noting SWE-Search already shows per-node SWE search helps at TEST time without training — so the tree must justify the marginal value of folding that search into TRAINING, not just adding search.",
+      "anchor_quote": "SWE-Search expands nodes with one policy; Symphony does heterogeneous-LM planning"
+    },
+    {
+      "severity": "medium",
+      "section": "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
+      "issue": "Symphony (arXiv:2601.22623, NeurIPS 2025) is the strongest pro-heterogeneity result in the vault — a heterogeneous-LM MCTS planner whose explicit thesis is that single-agent MCTS yields 'insufficient diversity among generated branches' and that a heterogeneous LM pool 'enhances rollout diversity and facilitates more effective exploration,' outperforming SOTA when given API models. The report names Symphony once in §1 with no source ID, then builds Pushback 1 (heterogeneity-is-a-hypothesis) almost entirely on the anti-heterogeneity sources [21][22][23], leaving the heterogeneity-as-ablation framing under-steelmanned. Symphony is precisely the source that says the system's distinctive choice (different model per node) buys exploration diversity — the very 'anti-collapse diversity' the report concedes survives (safeguard 4) but does not source on the capability side.",
+      "fix": "Add a Sources entry '[52] SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous LM Assembly — arXiv:2601.22623 (NeurIPS 2025; single-agent MCTS gives insufficient branch diversity; heterogeneous LM pool improves rollout diversity and exploration)'. In §3 Pushback 2 (heterogeneity), add a sentence acknowledging Symphony [52] as the counter-result that frames heterogeneity's surviving justification (exploration/branch diversity), so the ablation is set up as a genuine two-sided question rather than a near-foregone demotion.",
+      "anchor_quote": "Symphony does heterogeneous-LM planning"
+    },
+    {
+      "severity": "medium",
+      "section": "## 2. The World-Model Goal: Training Latent What-If Deliberation",
+      "issue": "Section 2 grounds the latent-deliberation 'value-equivalent / never reconstruct the full state' argument on MuZero [14] and Dreamer [15] (both pre-LLM RL) but leaves the most on-point 2026 vault note — Chain of World (arXiv:2603.03195, CVPR 2026) — entirely uncited. Chain of World is precisely a 'World Model Thinking in Latent Motion' paradigm that factorizes dynamics into a disentangled latent and predicts terminal state rather than reconstructing redundant background — the exact value-equivalent-latent point the report wants to make for SWE ('predict the signed FAIL_TO_PASS delta, never reconstruct the full token sea'). prompt-decomposition.json explicitly lists 'Chain of World' as a required field of the world-model literature cluster, so its absence is a coverage miss against the decomposition.",
+      "fix": "Add a Sources entry '[53] Chain of World: World Model Thinking in Latent Motion — arXiv:2603.03195 (CVPR 2026; disentangled latent-motion world model predicts terminal state instead of reconstructing redundant background)'. In §2, append to the MuZero/Dreamer value-equivalent sentence a clause: 'and the latent-motion line carries the same discipline into 2026 — factorize dynamics into a compact latent and predict the consequential terminal state, not the full frame [53]', tying the embodied result to the SWE next-state-delta target.",
+      "anchor_quote": "never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]"
+    },
+    {
+      "severity": "medium",
+      "section": "## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
+      "issue": "The report cites EvilGenie [30] only for its hacking-prevalence half ('explicit hardcoding / test-file edits by Codex and Claude Code') and then declares the disjoint held-out eval 'the most load-bearing safeguard.' But EvilGenie's other headline finding is that an LLM judge is HIGHLY EFFECTIVE at detecting reward hacking in unambiguous cases while held-out unit tests give only 'minimal improvement.' The report uses the held-out-is-weak half (line 79) but omits the LLM-judge-is-strong half — which is decision-relevant: it suggests a cheaper, validated hack DETECTOR (distinct from a learned reward) that the report's own safeguard framing ('learned verifier allowed only at test-time selection') would permit. Omitting it makes the held-out eval look like the only option when the source the report already cites offers a complementary one.",
+      "fix": "In §4 (line 79) after 'with held-out tests giving only minimal detection improvement', add: '— while in the same study an LLM judge proved highly effective at flagging unambiguous hacks, suggesting an offline LLM-judge hack-detector (never a training reward) as a cheaper complement to the held-out gate [30].' This uses the already-cited [30] note's second finding without adding a source.",
+      "anchor_quote": "with held-out tests giving only minimal detection improvement"
+    },
+    {
+      "severity": "low",
+      "section": "## 8. Implementing on AWS EKS (Primary)",
+      "issue": "The SWE-rebench / Nebius infrastructure vault note (behind-swe-rebench, a 26KB substantive source on production SWE-task collection + eval-at-scale, evaluating thousands of SWE instances per hour with distributed container orchestration on TractoAI) is unused. It is the most directly-relevant existence proof for the report's central infra claim that mass SWE-task sandboxing/eval is an established distributed pattern — and for §6's task-construction discussion (mining (problem, test-set) pairs from resolved GitHub issues, exactly the FeatureDeletionEnv substrate-inversion pattern). The EKS section leans on DeepSWE's 512-container limit [43] but omits the one note built specifically around scaling SWE-task execution infrastructure.",
+      "fix": "Add a Sources entry '[54] Behind SWE-rebench: infrastructure to collect/evaluate SWE tasks at scale — nebius.com (distributed container orchestration evaluating thousands of SWE instances/hour; (problem,test-set) pairs mined from resolved GitHub issues)'. In §8's data-plane or throughput discussion, add one clause citing [54] as production evidence that thousands-per-hour distributed SWE-task execution is an established pattern, reinforcing the 'EKS-primary is cheap to adopt' claim.",
+      "anchor_quote": "DeepSWE itself ran rollout collection on Kubernetes with a Cluster Autoscaler over 1000+ CPU cores [45][43]."
+    }
+  ],
+  "overall": "The report is unusually wide and dense: 48 source IDs, all 11 required section headings present, and genuinely deep engagement with the hardest clusters — reward-hacking-with-verifiable-rewards (EvilGenie/RLVR-gaming/synthetic-trajectories all cited [29][30][31]), self-evolving-collapse (survey §8.3 [38], RSI [29], Self-Play-SWE-RL [8]), the negatives/credit-assignment cluster ([25][26][27][28][33]), and the EKS sandbox/SWE-MiniSandbox/KubeRay/verl/HyperPod stack ([41][42][45][46][48]) are all well-used. Width is therefore strong, not weak. The findings are targeted, not a scattershot: the single highest-value gap is a citation MISMATCH — the process-vs-outcome backbone (Let's Verify [Lightman 2305.20050] and Uesato 2211.14275), which underpins all of §4, is named once with the wrong source IDs and the two foundational vault notes carry no IDs at all. The second is the omission of SWE-Search (2410.20285), the closest published per-turn-MCTS-on-SWE prior art, named but uncited and unengaged on its sharpest point (search helps at test time without training). The remaining four are smaller: an under-steelmanned pro-heterogeneity source (Symphony), the most on-point 2026 latent world-model note (Chain of World, also a decomposition required-field miss), a half-used EvilGenie finding (LLM-judge detector), and an unused SWE-rebench infra note. Fixing the two high-severity items (both surgical Source-list additions plus a one-line citation correction) materially strengthens the report's evidentiary spine; the rest are optional polish. No structural rework needed."
+}
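Each finding above carries the same five keys (`severity`, `section`, `issue`, `fix`, `anchor_quote`), with severities of `low`, `medium`, or `high`. A small checker along these lines could confirm that a findings file is well-formed and that every `anchor_quote` can be located in the report. This is a sketch under the assumption that `anchor_quote` is meant to be a verbatim substring of the report; `check_findings` is a hypothetical helper, not something in the repo.

```python
from typing import List

ALLOWED_SEVERITIES = {"low", "medium", "high"}
REQUIRED_KEYS = {"severity", "section", "issue", "fix", "anchor_quote"}


def check_findings(findings_doc: dict, report_text: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for i, finding in enumerate(findings_doc.get("findings", [])):
        missing = REQUIRED_KEYS - set(finding)
        if missing:
            problems.append(f"finding {i}: missing keys {sorted(missing)}")
            continue  # remaining checks need the full set of keys
        if finding["severity"] not in ALLOWED_SEVERITIES:
            problems.append(f"finding {i}: unknown severity {finding['severity']!r}")
        if finding["anchor_quote"] not in report_text:
            problems.append(f"finding {i}: anchor_quote not found in report")
    return problems


if __name__ == "__main__":
    report = "SWE-Search expands nodes with one policy; Symphony does heterogeneous-LM planning."
    doc = {
        "findings": [
            {
                "severity": "medium",
                "section": "## 3.",
                "issue": "example issue",
                "fix": "example fix",
                "anchor_quote": "Symphony does heterogeneous-LM planning",
            }
        ]
    }
    print(check_findings(doc, report))
```

Running such a check before applying fixes would surface any anchor that has drifted from the report text, so a fix is never applied at the wrong location.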

research/loci.json ADDED

	@@ -0,0 +1,68 @@

+{
+  "loci": [
+    {
+      "name": "prune-vs-train-on-all",
+      "one_line": "Does training on losing/failed branches (vs pruning to winners-only) better instill counterfactual foresight + introspection — and HOW must negatives be used to help rather than destabilize?",
+      "flavor": "dialectical",
+      "importance": 10,
+      "uncertainty": 9,
+      "disagreement": 9,
+      "decision_impact": 10,
+      "composite_score": 38,
+      "source_budget": 15,
+      "rationale": "The user's explicitly-named CENTRAL question. Genuine empirical fork: RAFT/positives-only is stable & competitive (2504.11343) and naive negative gradient destabilizes (2505.18830), vs negatives carry unique signal that improves agent tuning (2402.11651, 2503.14391, expert-failures). Resolving it changes the entire dataset-construction design (prune the tree vs keep it as typed signal). Must produce an argued position + concrete experiment, grounded in the repo's ADR-013 A0-A4 ladder."
+    },
+    {
+      "name": "worldmodel-latent-deliberation",
+      "one_line": "Can latent 'what-if' deliberation (predict next repo-state before acting) be trained into a SWE agent via an auxiliary next-state-prediction objective, or does it emerge from scale — and how do you measure it?",
+      "flavor": "dialectical",
+      "importance": 9,
+      "uncertainty": 8,
+      "disagreement": 7,
+      "decision_impact": 9,
+      "composite_score": 33,
+      "source_budget": 12,
+      "rationale": "The user's core GOAL (the 'world-model thinking' aim). Fork: LLMs are implicit world models / emerges from scale (2512.18832, 2411.08794) vs agents fail to USE world models for foresight without explicit training (2601.03905) + MuZero/Chain-of-World train it explicitly (1911.08265, 2603.03195). Decision-relevant: determines whether to add the aux loss + a deliberation token, and how to measure (calibration / foresight accuracy). Must map onto the repo's SDPO channel as the natural carrier."
+    },
+    {
+      "name": "selfevolve-flywheel-vs-collapse",
+      "one_line": "Does the closed-loop multi-model MCTS + self-distillation flywheel compound improvement, or collapse into reward-hacking / diversity-loss / human-trace entrenchment — and what design choices prevent collapse?",
+      "flavor": "dialectical",
+      "importance": 9,
+      "uncertainty": 8,
+      "disagreement": 8,
+      "decision_impact": 9,
+      "composite_score": 34,
+      "source_budget": 11,
+      "rationale": "Determines whether the whole genetic-algorithm flywheel is sound. Strong adversarial convergence (reward-hacking worsens with depth — RSI ICLR2026; collapse from closed-loop self-distillation — self-evolving survey §8.3; replay entrenches human distribution — Self-Play-SWE-RL 2512.18552) vs working flywheels (Socratic-SWE +7.8, DeepSWE, SWE-RL). Resolution = keep a true execution ORACLE + heterogeneous-model population as anti-collapse diversity. High decision impact on safeguards."
+    },
+    {
+      "name": "credit-assignment-tree-as-process-signal",
+      "one_line": "Does the multi-model tree's divergence structure give cheap, dense PROCESS-level credit assignment that beats outcome-only RL — without training a separate PRM?",
+      "flavor": "technical",
+      "importance": 8,
+      "uncertainty": 6,
+      "disagreement": 7,
+      "decision_impact": 8,
+      "composite_score": 29,
+      "source_budget": 8,
+      "rationale": "The mechanism that makes the idea pay off. Process-supervision helps (Let's-Verify 2305.20050, PRM 2211.14275, Cursor's own targeted-feedback motivation) vs outcome-only suffices (DeepSWE, SWE-RL, min-form 2504.15275). The tree manufactures process signal cheaply from divergence + auto-generated textual feedback (wiring into the SDPO hint hook). Counterfactual credit-assignment theory (2011.09464, 2306.16803) is the formal backbone. Technical synthesis, moderate uncertainty."
+    },
+    {
+      "name": "eks-architecture-and-substrate-mapping",
+      "one_line": "What is the concrete EKS-primary (+ SageMaker-hybrid) architecture, and what is the minimal delta to map the repo's ServerlessExecutor/ObjectStoreAllReduce/DiLoCo onto it?",
+      "flavor": "technical",
+      "importance": 10,
+      "uncertainty": 4,
+      "disagreement": 5,
+      "decision_impact": 9,
+      "composite_score": 28,
+      "source_budget": 10,
+      "rationale": "The explicit DELIVERABLE ('how we could do it on sagemaker and/or eks, eks primarily'). Lower uncertainty (AWS-documented patterns: JARK/verl-on-EKS, KubeRay, Karpenter, GPU time-slicing/MIG, gVisor/Kata sandboxes, HyperPod) but very high decision impact — the report must commit to a concrete design. Includes the EKSExecutor delta, the sandbox-fan-out, the outer/inner loop placement, and the EKS-vs-SageMaker hybrid split."
+    }
+  ],
+  "skip_loci": [
+    {"name": "multimodel-tree-novelty-claim", "reason": "Resolved without depth: the honest position is the COMBINATION is novel, not the primitives (SWE-Search/tree-search use single models; Symphony mixes models for planning; Channel 3 already does flat multi-teacher). Folds into §1 framing, not a depth locus."},
+    {"name": "which-RL-engine-trl-vs-verl-vs-prime-rl", "reason": "Already decided in repo (ADR-006: TRL hosts SDPO since it needs full logits; verl/PRIME-RL for scale-out). Engineering choice, reported in §6/§8, not a contested research locus."}
+  ]
+}
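In all five loci above, `composite_score` equals the sum of `importance`, `uncertainty`, `disagreement`, and `decision_impact` (for example 10 + 9 + 9 + 10 = 38). The file does not state that rule, so the sketch below treats it as an inferred assumption and shows how one might verify it and rank loci by score; `check_and_rank` is a hypothetical helper.

```python
from typing import Dict, List

SCORE_FIELDS = ("importance", "uncertainty", "disagreement", "decision_impact")


def composite(locus: Dict) -> int:
    """Sum of the four component scores (assumed definition of composite_score)."""
    return sum(locus[field] for field in SCORE_FIELDS)


def check_and_rank(loci: List[Dict]) -> List[Dict]:
    """Validate each composite_score, then return loci sorted by score, highest first."""
    for locus in loci:
        expected = composite(locus)
        if locus["composite_score"] != expected:
            raise ValueError(
                f"{locus['name']}: composite_score {locus['composite_score']} != {expected}"
            )
    return sorted(loci, key=lambda locus: locus["composite_score"], reverse=True)


if __name__ == "__main__":
    sample = [
        {"name": "worldmodel-latent-deliberation", "importance": 9, "uncertainty": 8,
         "disagreement": 7, "decision_impact": 9, "composite_score": 33},
        {"name": "prune-vs-train-on-all", "importance": 10, "uncertainty": 9,
         "disagreement": 9, "decision_impact": 10, "composite_score": 38},
    ]
    for locus in check_and_rank(sample):
        print(locus["name"], locus["composite_score"])
```

Note that the file lists `worldmodel-latent-deliberation` (33) before `selfevolve-flywheel-vs-collapse` (34), so the on-disk order is not strictly by composite score; sorting as above would swap those two.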

research/notes/191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model.md ADDED

	@@ -0,0 +1,258 @@

+---
+title: '[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned
+  Model'
+id: 191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:01.612030Z'
+updated: '2026-06-09T04:22:19.478605Z'
+source: https://arxiv.org/abs/1911.08265
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:01.506414Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned
+  Model'
+---
+[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
+Computer Science > Machine Learning
+arXiv:1911.08265 (cs)
+[Submitted on 19 Nov 2019 (v1), last revised 21 Feb 2020 (this version, v2)]
+Title: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
+Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver
+Abstract:
+Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
+Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
+Cite as: arXiv:1911.08265 [cs.LG] (or arXiv:1911.08265v2 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.1911.08265 (arXiv-issued DOI via DataCite)
+Related DOI: https://doi.org/10.1038/s41586-020-03051-4
+Submission history
+From: Julian Schrittwieser
+[v1] Tue, 19 Nov 2019 13:58:52 UTC (3,106 KB)
+[v2] Fri, 21 Feb 2020 18:05:30 UTC (2,973 KB)
+Ancillary files: atari_evaluations.json, atari_repeatability.json, atari_results.json, atari_scaling.json, atari_trainX_evalX.json, board_game_elos.json, go_policy_improvement.json, go_scaling.json, pseudocode.py, qlearning_pacman_ablations.json (5 additional files not shown)

research/notes/201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning.md ADDED

	@@ -0,0 +1,229 @@

+---
+title: '[2011.09464] Counterfactual Credit Assignment in Model-Free Reinforcement
+  Learning'
+id: 201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:23:24.356402Z'
+updated: '2026-06-09T04:23:49.899744Z'
+source: https://arxiv.org/abs/2011.09464
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:23:23.972635Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'Foundational counterfactual credit assignment in RL: future-conditional
+  baselines/critics separate skill from luck (action''s true influence on reward)
+  at provably low variance — the theory under ''where the trace diverged is the high-value
+  signal''.'
+---
+[2011.09464] Counterfactual Credit Assignment in Model-Free Reinforcement Learning
+Computer Science > Machine Learning
+arXiv:2011.09464 (cs)
+[Submitted on 18 Nov 2020 (v1), last revised 14 Dec 2021 (this version, v2)]
+Title: Counterfactual Credit Assignment in Model-Free Reinforcement Learning
+Authors: Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, Éric Moulines, Marcus Hutter, Lars Buesing, Rémi Munos
+Abstract:
+Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
+Subjects: Machine Learning (cs.LG)
+Cite as: arXiv:2011.09464 [cs.LG] (or arXiv:2011.09464v2 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.2011.09464 (arXiv-issued DOI via DataCite)
+Submission history
+From: Thomas Mesnard
+[v1] Wed, 18 Nov 2020 18:41:44 UTC (25,181 KB)
+[v2] Tue, 14 Dec 2021 13:36:12 UTC (3,053 KB)

research/notes/221114275-solving-math-word-problems-with-process-and-outcome-based-feedback.md ADDED

	@@ -0,0 +1,205 @@

+---
+title: '[2211.14275] Solving math word problems with process- and outcome-based feedback'
+id: 221114275-solving-math-word-problems-with-process-and-outcome-based-feedback
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:23:24.366439Z'
+updated: '2026-06-09T04:23:56.531901Z'
+source: https://arxiv.org/abs/2211.14275
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:23:24.269109Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'DeepMind (Uesato et al. 2022): the original head-to-head of process- vs
+  outcome-based feedback; final-answer error parity but process feedback drastically
+  cuts reasoning/trace errors — motivates rewarding the path, not just the result.'
+---
+[2211.14275] Solving math word problems with process- and outcome-based feedback
+arXiv:2211.14275 (cs)
+[Submitted on 25 Nov 2022]
+Title: Solving math word problems with process- and outcome-based feedback
+Authors: Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
+Abstract:
+Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.
+Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+Cite as: arXiv:2211.14275 [cs.LG] (or arXiv:2211.14275v1 [cs.LG] for this version)
+DOI: https://doi.org/10.48550/arXiv.2211.14275 (arXiv-issued DOI via DataCite)
+Submission history
+From: Jonathan Uesato
+[v1] Fri, 25 Nov 2022 18:19:44 UTC (306 KB)

research/notes/230104104-mastering-diverse-domains-through-world-models.md ADDED Viewed

	@@ -0,0 +1,196 @@

+---
+title: '[2301.04104] Mastering Diverse Domains through World Models'
+id: 230104104-mastering-diverse-domains-through-world-models
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:01.614454Z'
+updated: '2026-06-09T04:22:19.838964Z'
+source: https://arxiv.org/abs/2301.04104
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:01.600829Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2301.04104] Mastering Diverse Domains through World Models'
+---
+[2301.04104] Mastering Diverse Domains through World Models
+arXiv:2301.04104 (cs)
+[Submitted on 10 Jan 2023 (v1), last revised 17 Apr 2024 (this version, v2)]
+Title: Mastering Diverse Domains through World Models
+Authors: Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap
+Abstract:
+Developing a general algorithm that learns to solve tasks across a wide range of applications has been a fundamental challenge in artificial intelligence. Although current reinforcement learning algorithms can be readily applied to tasks similar to what they have been developed for, configuring them for new application domains requires significant human expertise and experimentation. We present DreamerV3, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behavior by imagining future scenarios. Robustness techniques based on normalization, balancing, and transformations enable stable learning across domains. Applied out of the box, Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula. This achievement has been posed as a significant challenge in artificial intelligence that requires exploring farsighted strategies from pixels and sparse rewards in an open world. Our work allows solving challenging control problems without extensive experimentation, making reinforcement learning broadly applicable.
+Comments: Website: this https URL
+Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
+Cite as: arXiv:2301.04104 [cs.AI] (or arXiv:2301.04104v2 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2301.04104 (arXiv-issued DOI via DataCite)
+Submission history
+From: Danijar Hafner
+[v1] Tue, 10 Jan 2023 18:12:16 UTC (2,210 KB)
+[v2] Wed, 17 Apr 2024 17:41:20 UTC (2,520 KB)

research/notes/230520050-lets-verify-step-by-step.md ADDED Viewed

	@@ -0,0 +1,207 @@

+---
+title: '[2305.20050] Let''s Verify Step by Step'
+id: 230520050-lets-verify-step-by-step
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:23:24.363053Z'
+updated: '2026-06-09T04:23:54.507242Z'
+source: https://arxiv.org/abs/2305.20050
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:23:24.177998Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'OpenAI (Lightman et al. 2023): process supervision (PRM, per-step labels)
+  substantially outperforms outcome supervision (ORM) on MATH and yields more reliable
+  reward — the canonical empirical case that step-level credit beats outcome-only.'
+---
+[2305.20050] Let's Verify Step by Step
+arXiv:2305.20050 (cs)
+[Submitted on 31 May 2023]
+Title: Let's Verify Step by Step
+Authors: Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
+Abstract:
+In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
+Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+Cite as: arXiv:2305.20050 [cs.LG] (or arXiv:2305.20050v1 [cs.LG] for this version)
+DOI: https://doi.org/10.48550/arXiv.2305.20050 (arXiv-issued DOI via DataCite)
+Submission history
+From: Karl Cobbe
+[v1] Wed, 31 May 2023 17:24:00 UTC (10,363 KB)

research/notes/230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter.md ADDED Viewed

	@@ -0,0 +1,205 @@

+---
+title: '[2306.16803] Would I have gotten that reward? Long-term credit assignment
+  by counterfactual contribution analysis'
+id: 230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:23:24.359888Z'
+updated: '2026-06-09T04:23:53.115536Z'
+source: https://arxiv.org/abs/2306.16803
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:23:24.081414Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'COCOA (Counterfactual Contribution Analysis, NeurIPS 2023 spotlight), which
+  builds on Hindsight Credit Assignment (HCA): estimates each action''s contribution
+  to long-term reward via counterfactual models; process-level credit at trajectory
+  scale, directly applicable to crediting the divergence step in a tree-of-work branch.'
+---
+[2306.16803] Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis
+arXiv:2306.16803 (cs)
+[Submitted on 29 Jun 2023 (v1), last revised 31 Oct 2023 (this version, v2)]
+Title: Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis
+Authors: Alexander Meulemans, Simon Schug, Seijin Kobayashi, Nathaniel Daw, Gregory Wayne
+Abstract:
+To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?'. We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
+Comments: NeurIPS 2023 spotlight
+Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
+Cite as: arXiv:2306.16803 [cs.LG] (or arXiv:2306.16803v2 [cs.LG] for this version)
+DOI: https://doi.org/10.48550/arXiv.2306.16803 (arXiv-issued DOI via DataCite)
+Submission history
+From: Simon Schug
+[v1] Thu, 29 Jun 2023 09:27:27 UTC (1,429 KB)
+[v2] Tue, 31 Oct 2023 10:28:50 UTC (1,611 KB)

research/notes/240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l.md ADDED Viewed

	@@ -0,0 +1,200 @@

+---
+title: '[2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning
+  Large Language Models as Agents'
+id: 240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:45.892505Z'
+updated: '2026-06-09T04:25:02.502012Z'
+source: https://arxiv.org/abs/2402.11651
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:45.281028Z'
+fetch_provider: builtin
+status: active
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning
+  Large Language Models as Agents'
+---
+[2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
+arXiv:2402.11651 (cs)
+[Submitted on 18 Feb 2024 (v1), last revised 16 Apr 2024 (this version, v2)]
+Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
+Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
+Abstract:
+Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tuning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.
+Comments: Agent, LLM, Large Language Model
+Subjects: Computation and Language (cs.CL)
+ACM classes: I.2.7
+Cite as: arXiv:2402.11651 [cs.CL] (or arXiv:2402.11651v2 [cs.CL] for this version)
+DOI: https://doi.org/10.48550/arXiv.2402.11651 (arXiv-issued DOI via DataCite)
+Submission history
+From: Renxi Wang
+[v1] Sun, 18 Feb 2024 17:10:07 UTC (10,199 KB)
+[v2] Tue, 16 Apr 2024 11:41:13 UTC (10,670 KB)

research/notes/240515383-generating-code-world-models-with-large-language-models-guided-by-mont.md ADDED Viewed

	@@ -0,0 +1,196 @@

+---
+title: '[2405.15383] Generating Code World Models with Large Language Models Guided
+  by Monte Carlo Tree Search'
+id: 240515383-generating-code-world-models-with-large-language-models-guided-by-mont
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:12.336468Z'
+source: https://arxiv.org/abs/2405.15383
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:12.331439Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2405.15383] Generating Code World Models with Large Language Models Guided
+  by Monte Carlo Tree Search'
+---
+[2405.15383] Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
+arXiv:2405.15383 (cs)
+[Submitted on 24 May 2024 (v1), last revised 30 Oct 2024 (this version, v2)]
+Title: Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
+Authors: Nicola Dainese, Matteo Merler, Minttu Alakuijala, Pekka Marttinen
+Abstract:
+In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has potential to be more precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach in an offline RL setting, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
+Comments: Accepted at NeurIPS 2024, Main Track. 11 pages in main text, 40 pages including references and supplementary materials. 2 figures and 3 tables in the main text, 9 figures and 12 tables when including the supplementary materials. Website at this https URL
+Subjects: Artificial Intelligence (cs.AI)
+Cite as: arXiv:2405.15383 [cs.AI] (or arXiv:2405.15383v2 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2405.15383 (arXiv-issued DOI via DataCite)
+Submission history
+From: Nicola Dainese
+[v1] Fri, 24 May 2024 09:31:26 UTC (238 KB)
+[v2] Wed, 30 Oct 2024 14:19:57 UTC (864 KB)

research/notes/240701476-tree-search-for-language-model-agents.md ADDED Viewed

	@@ -0,0 +1,200 @@

+---
+title: '[2407.01476] Tree Search for Language Model Agents'
+id: 240701476-tree-search-for-language-model-agents
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:47.411468Z'
+source: https://arxiv.org/abs/2407.01476
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:47.237299Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2407.01476] Tree Search for Language Model Agents'
+---
+[2407.01476] Tree Search for Language Model Agents
+Computer Science > Artificial Intelligence
+arXiv:2407.01476 (cs)
+[Submitted on 1 Jul 2024 (v1), last revised 8 Feb 2026 (this version, v4)]
+Title: Tree Search for Language Model Agents
+Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
+Abstract:
+Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at
+this https URL.
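The best-first tree search described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `best_first_search`, `expand`, `value_fn`, `is_goal`, and `budget` are hypothetical, and the toy arithmetic environment stands in for a real web environment and an LM-based value function.

```python
import heapq
from typing import Callable, Hashable, Iterable, List, Optional, Tuple


def best_first_search(
    start: Hashable,
    expand: Callable[[Hashable], Iterable[Hashable]],
    value_fn: Callable[[Hashable], float],
    is_goal: Callable[[Hashable], bool],
    budget: int = 50,
) -> Tuple[Optional[Hashable], int]:
    """Expand the highest-valued frontier state first.

    Returns (goal state or best state seen, number of expansions used).
    All parameter names are illustrative assumptions, not the paper's API.
    """
    counter = 0  # tie-breaker so heapq never has to compare states directly
    # heapq is a min-heap, so values are negated to pop the best state first.
    frontier: List[Tuple[float, int, Hashable]] = [(-value_fn(start), counter, start)]
    seen = {start}
    best_state, best_value = start, value_fn(start)
    expansions = 0

    while frontier and expansions < budget:
        neg_value, _, state = heapq.heappop(frontier)
        if -neg_value > best_value:
            best_state, best_value = state, -neg_value
        if is_goal(state):
            return state, expansions
        expansions += 1
        for child in expand(state):
            if child not in seen:
                seen.add(child)
                counter += 1
                heapq.heappush(frontier, (-value_fn(child), counter, child))

    return best_state, expansions


# Toy environment: reach TARGET from 1 using the actions "+1" and "*2".
TARGET = 37
found, used = best_first_search(
    start=1,
    expand=lambda s: (s + 1, s * 2),
    value_fn=lambda s: -abs(TARGET - s),  # closer to the target scores higher
    is_goal=lambda s: s == TARGET,
)
```

The `budget` argument caps the number of expansions, which is one simple way to trade extra test-time compute for a better result, in the spirit of the scaling claim in the abstract.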
+Comments: 13 pages. Models and code available at this https URL
+Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
+Cite as: arXiv:2407.01476 [cs.AI] (or arXiv:2407.01476v4 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2407.01476 (arXiv-issued DOI via DataCite)
+Submission history
+From: Jing Yu Koh
+[v1] Mon, 1 Jul 2024 17:07:55 UTC (2,417 KB)
+[v2] Sat, 12 Oct 2024 19:58:57 UTC (2,435 KB)
+[v3] Wed, 24 Sep 2025 05:46:23 UTC (2,501 KB)
+[v4] Sun, 8 Feb 2026 15:06:40 UTC (2,495 KB)

research/notes/240919256-hybridflow-a-flexible-and-efficient-rlhf-framework.md ADDED Viewed

	@@ -0,0 +1,222 @@

+---
+title: '[2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework'
+id: 240919256-hybridflow-a-flexible-and-efficient-rlhf-framework
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:38.815877Z'
+updated: '2026-06-09T04:26:22.137190Z'
+source: https://arxiv.org/abs/2409.19256
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:38.794804Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+deprecated: false
+summary: '[2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework'
+---
+[2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework
+Computer Science > Machine Learning
+arXiv:2409.19256 (cs)
+[Submitted on 28 Sep 2024 (v1), last revised 2 Oct 2024 (this version, v2)]
+Title: HybridFlow: A Flexible and Efficient RLHF Framework
+Authors: Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
+Abstract:
+Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53$\times$~20.57$\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at
+this https URL.
+Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
+ACM classes: I.2
+Cite as: arXiv:2409.19256 [cs.LG] (or arXiv:2409.19256v2 [cs.LG] for this version)
+DOI: https://doi.org/10.48550/arXiv.2409.19256 (arXiv-issued DOI via DataCite)
+Related DOI: https://doi.org/10.1145/3689031.3696075
+Submission history
+From: Guangming Sheng
+[v1] Sat, 28 Sep 2024 06:20:03 UTC (1,755 KB)
+[v2] Wed, 2 Oct 2024 04:01:47 UTC (1,775 KB)

research/notes/241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and.md ADDED Viewed

	@@ -0,0 +1,207 @@

+---
+title: '[2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search
+  and Iterative Refinement'
+id: 241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:47.408752Z'
+source: https://arxiv.org/abs/2410.20285
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:47.075405Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree
+  Search and Iterative Refinement'
+---
+[2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
+Computer Science > Artificial Intelligence
+arXiv:2410.20285 (cs)
+[Submitted on 26 Oct 2024 (v1), last revised 2 Apr 2025 (this version, v6)]
+Title: SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
+Authors: Antonis Antoniades, Albert Örwall, Kexun Zhang, Yuxi Xie, Anirudh Goyal, William Wang
+Abstract:
+Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often follow linear, sequential processes that prevent backtracking and exploration of alternative solutions, limiting their ability to rethink their strategies when initial approaches prove ineffective. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased inference-time compute through deeper search, providing a pathway to improve software agents without requiring larger models or additional training data. This highlights the potential of self-evaluation driven search techniques in complex software engineering environments.
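The MCTS loop with a hybrid value function described in the abstract can be illustrated with a small sketch. This assumes standard UCT selection and is not the SWE-Search codebase: `Node`, `uct_score`, `select_child`, and `backpropagate` are hypothetical names, and the numeric values and text notes are hard-coded stand-ins for what an LLM-based Value Agent would return.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursive field comparison
class Node:
    action: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)
    visits: int = 0
    value_sum: float = 0.0                              # quantitative part
    feedback: List[str] = field(default_factory=list)   # qualitative part

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def uct_score(child: Node, c: float = 1.41) -> float:
    """Standard UCT: mean value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    parent_visits = child.parent.visits if child.parent else child.visits
    return child.mean_value() + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(node: Node) -> Node:
    return max(node.children, key=uct_score)


def backpropagate(node: Optional[Node], value: float, note: str) -> None:
    """Push a numeric value and a natural-language note up to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node.feedback.append(note)
        node = node.parent


# Tiny tree: two candidate edits with made-up evaluations.
root = Node(action="root")
edit_a = Node(action="patch utils.py", parent=root)
edit_b = Node(action="patch parser.py", parent=root)
root.children = [edit_a, edit_b]

backpropagate(edit_a, 0.2, "tests still fail after this edit")
backpropagate(edit_b, 0.9, "targeted test now passes")
chosen = select_child(root)
```

Storing the text notes alongside the numeric statistics is one simple way to make qualitative feedback available when the agent revisits a branch; how the paper actually feeds that feedback back into the agent is not specified in the abstract.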
+Comments: Main body: 10 pages, 5 figures. Appendix: 5 pages, 4 figures. Open-source codebase
+Subjects: Artificial Intelligence (cs.AI)
+Cite as: arXiv:2410.20285 [cs.AI] (or arXiv:2410.20285v6 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2410.20285 (arXiv-issued DOI via DataCite)
+Submission history
+From: Antonis Antoniades
+[v1] Sat, 26 Oct 2024 22:45:56 UTC (4,189 KB)
+[v2] Tue, 29 Oct 2024 18:25:20 UTC (4,189 KB)
+[v3] Sun, 15 Dec 2024 07:55:42 UTC (4,196 KB)
+[v4] Mon, 17 Feb 2025 23:13:48 UTC (4,196 KB)
+[v5] Sun, 2 Mar 2025 19:42:45 UTC (4,196 KB)
+[v6] Wed, 2 Apr 2025 04:13:19 UTC (3,821 KB)

research/notes/241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati.md ADDED Viewed

	@@ -0,0 +1,194 @@

+---
+title: '[2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous
+  Evaluations are Needed'
+id: 241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:37.365228Z'
+source: https://arxiv.org/abs/2411.08794
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:37.362163Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous
+  Evaluations are Needed'
+---
+[2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
+Computer Science > Artificial Intelligence
+arXiv:2411.08794 (cs)
+[Submitted on 13 Nov 2024 (v1), last revised 19 Mar 2026 (this version, v2)]
+Title: LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
+Authors: Chang Yang, Xinrun Wang, Junzhe Jiang, Qinggang Zhang, Xiao Huang
+Abstract:
+World model emerges as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world models are either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. In this work, we propose a comprehensive evaluation of the world models with LLMs from the decision making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023;2024) and curate the rule-based policy of each environment for the diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world models can be used for decision making solely. Finally, we conduct the comprehensive evaluation of the advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for the tasks which require the domain knowledge, ii) the performance of the world model with LLM will be decreased for long-term decision-making tasks, and iii) the combination of different functionalities of the world model will brings additional unstabilities of the performance.
+Comments: Accepted to TMLR
+Subjects: Artificial Intelligence (cs.AI)
+Cite as: arXiv:2411.08794 [cs.AI] (or arXiv:2411.08794v2 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2411.08794 (arXiv-issued DOI via DataCite)
+Submission history
+From: Xinrun Wang
+[v1] Wed, 13 Nov 2024 17:19:32 UTC (501 KB)
+[v2] Thu, 19 Mar 2026 02:08:09 UTC (374 KB)

research/notes/241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor.md ADDED Viewed

	@@ -0,0 +1,226 @@

+---
+title: '[2411.14499] Understanding World or Predicting Future? A Comprehensive Survey
+  of World Models'
+id: 241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:01.609371Z'
+updated: '2026-06-09T04:22:19.116800Z'
+source: https://arxiv.org/abs/2411.14499
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:01.397314Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2411.14499] Understanding World or Predicting Future? A Comprehensive Survey
+  of World Models'
+---
+[2411.14499] Understanding World or Predicting Future? A Comprehensive Survey of World Models
+Computer Science > Computation and Language
+arXiv:2411.14499 (cs)
+[Submitted on 21 Nov 2024 (v1), last revised 10 Dec 2025 (this version, v4)]
+Title: Understanding World or Predicting Future? A Comprehensive Survey of World Models
+Authors: Jingtao Ding, Yunke Zhang, Yu Shang, Jie Feng, Yuheng Zhang, Zefang Zong, Yuan Yuan, Hongyuan Su, Nian Li, Jinghua Piao, Yucheng Deng, Nicholas Sukiennik, Chen Gao, Fengli Xu, Yong Li
+Abstract:
+The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in
+this https URL.
+Comments: Extended version of the original ACM CSUR paper, 49 pages, 6 figures, 8 tables
+Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
+Cite as: arXiv:2411.14499 [cs.CL] (or arXiv:2411.14499v4 [cs.CL] for this version)
+DOI: https://doi.org/10.48550/arXiv.2411.14499 (arXiv-issued DOI via DataCite)
+Submission history
+From: Jingtao Ding
+[v1] Thu, 21 Nov 2024 03:58:50 UTC (4,019 KB)
+[v2] Wed, 25 Jun 2025 02:31:33 UTC (4,612 KB)
+[v3] Sat, 15 Nov 2025 14:33:14 UTC (4,613 KB)
+[v4] Wed, 10 Dec 2025 02:53:14 UTC (4,613 KB)

research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md ADDED Viewed

	@@ -0,0 +1,214 @@

+---
+title: '[2502.18449] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on
+  Open Software Evolution'
+id: 250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:56.974939Z'
+updated: '2026-06-09T04:25:34.163662Z'
+source: https://arxiv.org/abs/2502.18449
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:55.251716Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: Wei et al. (Meta AI/UIUC/CMU), NeurIPS 2025. First RL approach scaling LLM
+  reasoning to real-world SWE using GitHub PR software-evolution data + lightweight
+  rule-based reward (difflib SequenceMatcher similarity to oracle patch, -1 for malformed);
+  GRPO optimizer; Llama3-SWE-RL-70B hits 41.0% SWE-bench Verified.
+---
+[2502.18449] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
+Computer Science > Software Engineering
+arXiv:2502.18449 (cs)
+[Submitted on 25 Feb 2025 (v1), last revised 1 Dec 2025 (this version, v2)]
+Title: SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
+Authors: Yuxiang Wei, Olivier Duchenne, Jade Copet, Quentin Carbonneaux, Lingming Zhang, Daniel Fried, Gabriel Synnaeve, Rishabh Singh, Sida I. Wang
+Abstract:
+The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
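The rule-based reward summarized in this note's front matter (`difflib` `SequenceMatcher` similarity to the oracle patch, -1 for malformed output) can be sketched as below. This is an illustration under assumptions rather than the paper's code: `patch_reward` and `looks_like_patch` are hypothetical names, and the well-formedness check (presence of a `@@` hunk header) is a placeholder for whatever format validation the authors actually apply.

```python
import difflib


def looks_like_patch(text: str) -> bool:
    """Placeholder format check (assumption): require at least one hunk header."""
    return any(line.startswith("@@") for line in text.splitlines())


def patch_reward(predicted: str, oracle: str) -> float:
    """Return -1.0 for malformed output, else a similarity score in [0, 1]."""
    if not looks_like_patch(predicted):
        return -1.0
    return difflib.SequenceMatcher(None, predicted, oracle).ratio()


oracle_patch = (
    "--- a/calc.py\n"
    "+++ b/calc.py\n"
    "@@ -1,2 +1,2 @@\n"
    " def add(a, b):\n"
    "-    return a - b\n"
    "+    return a + b\n"
)
# Same hunk, but the proposed fix is different from the oracle's.
near_miss = oracle_patch.replace("return a + b", "return b + a + 0")

exact_reward = patch_reward(oracle_patch, oracle_patch)
partial_reward = patch_reward(near_miss, oracle_patch)
malformed_reward = patch_reward("I think the bug is in add().", oracle_patch)
```

A similarity-based reward like this needs no test execution, which is what makes it lightweight; the trade-off is that a functionally correct patch that is textually different from the oracle receives only partial credit.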
+Comments: Accepted to NeurIPS 2025 Main Track
+Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+Cite as: arXiv:2502.18449 [cs.SE] (or arXiv:2502.18449v2 [cs.SE] for this version)
+https://doi.org/10.48550/arXiv.2502.18449
+Submission history (from Yuxiang Wei):
+[v1] Tue, 25 Feb 2025 18:45:04 UTC (1,534 KB)
+[v2] Mon, 1 Dec 2025 00:16:59 UTC (812 KB)
+## Related
+- [[pdf]]

research/notes/250314391-how-much-do-llms-learn-from-negative-examples.md ADDED Viewed

	@@ -0,0 +1,189 @@

+---
+title: '[2503.14391] How much do LLMs learn from negative examples?'
+id: 250314391-how-much-do-llms-learn-from-negative-examples
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:45.902975Z'
+updated: '2026-06-09T04:25:03.616963Z'
+source: https://arxiv.org/abs/2503.14391
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:45.659994Z'
+fetch_provider: builtin
+status: active
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2503.14391] How much do LLMs learn from negative examples?'
+---
+[2503.14391] How much do LLMs learn from negative examples?
+arXiv:2503.14391 (cs), Computer Science > Computation and Language
+Submitted on 18 Mar 2025
+Authors: Shadi Hamdan, Deniz Yuret
+Abstract:
+Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
+Comments: 8 pages, 6 figures
+Subjects: Computation and Language (cs.CL)
+MSC classes: 68T50, 68T05
+ACM classes: I.2.6; I.2.7
+Cite as: arXiv:2503.14391 [cs.CL] (or arXiv:2503.14391v1 [cs.CL] for this version)
+https://doi.org/10.48550/arXiv.2503.14391
+Submission history (from Deniz Yuret):
+[v1] Tue, 18 Mar 2025 16:26:29 UTC (38 KB)

research/notes/250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein.md ADDED Viewed

	@@ -0,0 +1,217 @@

+---
+title: '[2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling
+  to Reinforce'
+id: 250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:45.896417Z'
+updated: '2026-06-09T04:25:02.883023Z'
+source: https://arxiv.org/abs/2504.11343
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:45.404807Z'
+fetch_provider: builtin
+status: active
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling
+  to Reinforce'
+---
+[2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
+arXiv:2504.11343 (cs), Computer Science > Machine Learning
+Submitted on 15 Apr 2025 (v1), last revised 12 Jun 2025 (this version, v2)
+Authors: Wei Xiong, Jiarui Yao, Yuhui Xu, Bo Pang, Lei Wang, Doyen Sahoo, Junnan Li, Nan Jiang, Tong Zhang, Caiming Xiong, Hanze Dong
+Abstract:
+Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields performance competitive with GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
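The two sample-selection rules named in the abstract can be sketched as below. This is a minimal illustration under assumptions, not the authors' code: rewards are taken to be binary (1 = correct, 0 = incorrect), and the helper names `filter_mixed_prompts` and `raft_positive_samples` are hypothetical.

```python
from typing import Dict, List


def filter_mixed_prompts(rewards_by_prompt: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Reinforce-Rej style filter (sketch): drop prompts whose sampled
    responses are entirely correct or entirely incorrect."""
    kept = {}
    for prompt, rewards in rewards_by_prompt.items():
        has_correct = any(r > 0 for r in rewards)
        has_incorrect = any(r <= 0 for r in rewards)
        if has_correct and has_incorrect:
            kept[prompt] = rewards
    return kept


def raft_positive_samples(rewards_by_prompt: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """RAFT style selection (sketch): indices of positively rewarded
    responses per prompt; only these would be trained on."""
    return {p: [i for i, r in enumerate(rs) if r > 0] for p, rs in rewards_by_prompt.items()}


batch = {
    "p1": [1, 0, 1, 0],  # mixed outcomes: kept by the filter
    "p2": [0, 0, 0, 0],  # entirely incorrect: dropped
    "p3": [1, 1, 1, 1],  # entirely correct: dropped
}
kept = filter_mixed_prompts(batch)
positives = raft_positive_samples(batch)
```

Only prompts with mixed outcomes survive the filter, which is the set where a group-relative signal exists at all.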
+Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
+Cite as: arXiv:2504.11343 [cs.LG] (or arXiv:2504.11343v2 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.2504.11343
+Submission history (from Hanze Dong):
+[v1] Tue, 15 Apr 2025 16:15:02 UTC (228 KB)
+[v2] Thu, 12 Jun 2025 06:03:24 UTC (192 KB)

research/notes/250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model.md ADDED Viewed

	@@ -0,0 +1,210 @@

+---
+title: '[2504.15275] Stop Summation: Min-Form Credit Assignment Is All Process Reward
+  Model Needs for Reasoning'
+id: 250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:23:24.370137Z'
+updated: '2026-06-09T04:23:59.340315Z'
+source: https://arxiv.org/abs/2504.15275
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:23:24.356247Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'Cheng et al. 2025: argues PRM credit should be MIN-form (bottleneck step)
+  not summed across steps — a concrete step-level credit-assignment rule for which
+  branch/step carries the signal; relevant to prune-vs-train-on-all design.'
+---
+[2504.15275] Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
+arXiv:2504.15275 (cs), Computer Science > Artificial Intelligence
+Submitted on 21 Apr 2025 (v1), last revised 23 Oct 2025 (this version, v3)
+Authors: Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, Fei-Yue Wang
+Abstract:
+Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at
+this https URL
+.
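The contrast between sum-form and min-form credit assignment described in the abstract can be made concrete with a small sketch. It assumes per-step process rewards are already available as a list and that "future rewards" includes the current step; the function names are hypothetical and this is not the PURE implementation.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical return-to-go: gamma-discounted sum of current and future rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form return-to-go: minimum over current and future rewards."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in range(len(rewards) - 1, -1, -1):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


# Hypothetical process rewards for a 4-step solution with one bad step (index 2).
step_rewards = [0.9, 0.8, -0.5, 0.7]
sum_returns = sum_form_returns(step_rewards)
min_returns = min_form_returns(step_rewards)
```

In the sum form, the early high-reward steps still carry a large positive return despite the bad step. In the min form, every step up to and including the bottleneck is valued at the bottleneck reward, and the value range stays bounded by the per-step reward range.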
+Comments: Accepted by NeurIPS 2025
+Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
+Cite as: arXiv:2504.15275 [cs.AI] (or arXiv:2504.15275v3 [cs.AI] for this version)
+https://doi.org/10.48550/arXiv.2504.15275
+Submission history (from Jie Cheng):
+[v1] Mon, 21 Apr 2025 17:59:02 UTC (321 KB)
+[v2] Fri, 23 May 2025 07:38:41 UTC (321 KB)
+[v3] Thu, 23 Oct 2025 16:28:10 UTC (332 KB)

research/notes/250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen.md ADDED Viewed

	@@ -0,0 +1,200 @@

+---
+title: '[2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement
+  Optimization'
+id: 250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:45.899882Z'
+updated: '2026-06-09T04:25:03.267186Z'
+source: https://arxiv.org/abs/2505.18830
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:45.517727Z'
+fetch_provider: builtin
+status: active
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement
+  Optimization'
+---
+[2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
+arXiv:2505.18830 (cs), Computer Science > Machine Learning
+Submitted on 24 May 2025
+Authors: Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis
+Abstract:
+Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
+Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
+Cite as: arXiv:2505.18830 [cs.LG] (or arXiv:2505.18830v1 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.2505.18830
+Submission history (from Wenlong Deng):
+[v1] Sat, 24 May 2025 18:58:51 UTC (2,068 KB)

research/notes/250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro.md ADDED Viewed

	@@ -0,0 +1,214 @@

+---
+title: '[2506.13358] Socratic RL: A Novel Framework for Efficient Knowledge Acquisition
+  through Iterative Reflection and Viewpoint Distillation'
+id: 250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:19:31.995934Z'
+source: https://arxiv.org/abs/2506.13358
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:19:31.874122Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'Socratic-RL (arXiv 2506.13358): decoupled Teacher-Student RL — Teacher extracts causal "viewpoints" from interaction histories, meta-learns via utility uplift U(v), distills viewpoints into Student weights via KL (L_distill). Process- not outcome-reward. ID VERIFIED REAL.'
+---
+[2506.13358] Socratic RL: A Novel Framework for Efficient Knowledge Acquisition through Iterative Reflection and Viewpoint Distillation
+arXiv:2506.13358 (cs), Computer Science > Artificial Intelligence
+Submitted on 16 Jun 2025
+Authors: Xiangfan Wu
+Abstract:
+Current Reinforcement Learning (RL) methodologies for Large Language Models (LLMs) often rely on simplistic, outcome-based reward signals (e.g., final answer correctness), which limits the depth of learning from each interaction. This paper introduces Socratic Reinforcement Learning (Socratic-RL), a novel, process-oriented framework designed to address this limitation. Socratic-RL operates on the principle that deeper understanding is achieved by reflecting on the causal reasons for errors and successes within the reasoning process itself. The framework employs a decoupled "Teacher-Student" architecture, where a "Teacher AI" analyzes interaction histories, extracts causal insights, and formulates them into structured "viewpoints." These viewpoints, acting as distilled guidance, are then used by a "Student AI" to enhance its subsequent reasoning. A key innovation is the iterative self-improvement of the Teacher AI, enabling its reflective capabilities to evolve through a meta-learning loop. To manage the accumulation of knowledge, a distillation mechanism compresses learned viewpoints into the Student's parameters. By focusing on process rather than just outcome, Socratic-RL presents a pathway toward enhanced sample efficiency, superior interpretability, and a more scalable architecture for self-improving AI systems. This paper details the foundational concepts, formal mechanisms, synergies, challenges, and a concrete research roadmap for this proposed framework.
+Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
+Cite as: arXiv:2506.13358 [cs.AI] (or arXiv:2506.13358v1 [cs.AI] for this version)
+https://doi.org/10.48550/arXiv.2506.13358
+Submission history (from Xiangfan Wu):
+[v1] Mon, 16 Jun 2025 10:57:58 UTC (561 KB)
+---
+## METHOD DETAIL (extracted from full HTML text, arxiv.org/html/2506.13358v1; institutional)
+VERIFICATION: arXiv ID **2506.13358** confirmed REAL (matches user transcript). Submitted 16 Jun 2025. Computer Science > Artificial Intelligence. Position paper: foundational concepts, formal mechanisms, synergies — process-oriented (not outcome-only) RL for LLMs.
+### Decoupled Teacher–Student architecture
+Two specialized agents. **Teacher AI** analyzes interaction histories, extracts CAUSAL insights, formulates structured **"viewpoints."** **Student AI** focuses purely on task-solving. Specialization: Student becomes expert at solving; Teacher becomes expert at reflection/causal analysis.
+### Viewpoints
+A viewpoint = "a piece of structured, human-readable text representing a generalizable principle, a heuristic, a causal explanation, or a counter-example" (e.g. "In arithmetic, operations inside parentheses must be evaluated first"). Viewpoints are PREPENDED to the Student's input context: **π_S(a_t | s_t, V; θ_S)** where V is the active viewpoint set. Knowledge base **V_KB** persists across episodes; active set V reset post-distillation.
+### Meta-learning loop (Teacher self-improvement)
+Teacher quality = viewpoint utility uplift on probe tasks 𝒫_probe:
+**U(v) = E[Score(π_S(·|p, V∪{v}))] − E[Score(π_S(·|p, V))]**.
+Teacher is refined to "generate viewpoints that construct the most effective prompts for the Student." This is the key innovation: reflective capability EVOLVES (a meta-learning loop on the Teacher), not static.
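
The utility uplift U(v) above can be read as a difference of two mean scores over the probe set. The following is a minimal sketch under stated assumptions: `toy_score` is a hypothetical stand-in for running the Student and grading its answer, and `PROBES`, `PAREN_RULE`, and `viewpoint_utility` are illustrative names that do not come from the paper or from this repository.

```python
from statistics import mean
from typing import Callable, FrozenSet, Iterable

# Assumed scoring interface: score(probe, viewpoints) -> float in [0, 1].
ScoreFn = Callable[[str, FrozenSet[str]], float]

PAREN_RULE = "In arithmetic, operations inside parentheses must be evaluated first"
PROBES = ["2 * (3 + 4)", "5 + 6", "(1 + 2) * 3", "7 - 2"]


def viewpoint_utility(
    score: ScoreFn,
    probes: Iterable[str],
    active: FrozenSet[str],
    candidate: str,
) -> float:
    """U(v) = E[Score | V u {v}] - E[Score | V], estimated on the probe set."""
    probe_list = list(probes)
    with_v = mean(score(p, active | {candidate}) for p in probe_list)
    without_v = mean(score(p, active) for p in probe_list)
    return with_v - without_v


def toy_score(probe: str, viewpoints: FrozenSet[str]) -> float:
    """Toy Student: probes with parentheses are solved only if the rule is in context."""
    needs_rule = "(" in probe
    has_rule = PAREN_RULE in viewpoints
    return 1.0 if (not needs_rule or has_rule) else 0.0


uplift = viewpoint_utility(toy_score, PROBES, frozenset(), PAREN_RULE)
```

In this toy setup half of the probes depend on the rule, so the candidate viewpoint earns a positive uplift, while a viewpoint the toy Student never uses earns zero.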
+### Distillation mechanism (bound context growth → compress into weights)
+Train a new Student π_S' via KL minimization so it acts as if it knows the principle without seeing it:
+**L_distill = E[ D_KL( π_S(·|Input, v; θ_S) ‖ π_S'(·|Input; θ_S') ) ]**.
+Alternative distillation strategies named: **DPO** (preferred/rejected pairs) and **Instruction Tuning** (reformat V_KB into training examples).
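
As a rough illustration of the L_distill objective, the sketch below averages a forward KL divergence between a viewpoint-conditioned "teacher" distribution and the unconditioned new Student distribution over a batch of inputs. It assumes plain categorical distributions given as lists of probabilities; `kl_divergence` and `distill_loss` are hypothetical helper names, not the paper's or the framework's code, and a real implementation would work on token-level logits.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) for two categorical distributions over the same support."""
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(
    teacher_dists: Sequence[Sequence[float]],
    student_dists: Sequence[Sequence[float]],
) -> float:
    """Mean KL over a batch: teacher sees (Input, v), new Student sees Input only."""
    pairs = list(zip(teacher_dists, student_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy batch of two inputs with a two-action vocabulary.
teacher = [[0.5, 0.5], [0.9, 0.1]]
student = [[0.25, 0.75], [0.9, 0.1]]
loss = distill_loss(teacher, student)
```

The loss reaches zero only when the new Student reproduces the viewpoint-conditioned distribution on every input, which is the sense in which it "acts as if it knows the principle without seeing it."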
+### Process- vs outcome-reward
+Standard RL = "simplistic, outcome-oriented reward (e.g. final answer correctness)." Socratic-RL = "automated process supervision" over "the causal chain of successes and failures within the reasoning process itself" — contrast to RLHF scalar outcome rewards.
+### Claimed benefits / named algorithm
+Sample efficiency (richer process signals), interpretability (V_KB = human-readable "glass-box" log of acquired knowledge), scalability (distillation resets context window). **Algorithm 1: The Socratic-RL Core Loop** — 4 phases: Student Interaction → Teacher Reflection → Meta-Learning (Teacher Evolution) → Knowledge Distillation.
+### Relevance to composer-replication-framework
+Teacher→viewpoint→Student-context→distill-into-weights is the conceptual parent of the framework's **HintGenerator** (ADR-009: template → raw-error → LLM-judge → sibling-bootstrap) and **SDPO Channel 2** (hint-conditioned same-model teacher; generalized_jsd / OPSD kernel — "knows the principle without seeing the hint" == the L_distill KL-to-hint-conditioned-teacher objective). The Teacher meta-learning loop maps onto the user's "outer slow dataset-construction loop." On the user's PRUNE-vs-TRAIN-ON-ALL question this paper is pro-DISTILL-the-causal-insight (TRAIN on the extracted viewpoint), not pro-prune; viewpoints are textual-critique-guided mutation in the genetic-algorithm framing.

research/notes/250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on.md ADDED Viewed

	@@ -0,0 +1,251 @@

+---
+title: '[2507.21046] A Survey of Self-Evolving Agents: What, When, How, and Where
+  to Evolve on the Path to Artificial Super Intelligence'
+id: 250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:24:56.997446Z'
+updated: '2026-06-09T04:25:34.638684Z'
+source: https://arxiv.org/abs/2507.21046
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:24:56.751495Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: 'Survey taxonomizing self-evolving agents (what/when/how/where to evolve);
+  Section 8.3 catalogs emergent risks: misevolution, uncontrolled behavior drift,
+  deployment-time reward hacking in memory evolution, Alignment Tipping Process, model
+  collapse from closed-loop RL on static synthetic data.'
+---
+[2507.21046] A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
+Computer Science > Artificial Intelligence, arXiv:2507.21046 (cs)
+[Submitted on 28 Jul 2025 (v1), last revised 16 Jan 2026 (this version, v4)]
+Title: A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
+Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
+Abstract:
+Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions: what, when, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI) where agents evolve autonomously and perform beyond human-level intelligence across tasks.
+Comments: 77 pages, 9 figures, Transactions on Machine Learning Research (01/2026)
+Subjects: Artificial Intelligence (cs.AI)
+Cite as: arXiv:2507.21046 [cs.AI] (or arXiv:2507.21046v4 [cs.AI] for this version)
+https://doi.org/10.48550/arXiv.2507.21046
+Submission history (from Xinzhe Juan):
+[v1] Mon, 28 Jul 2025 17:59:05 UTC (3,709 KB)
+[v2] Wed, 30 Jul 2025 17:59:37 UTC (3,753 KB)
+[v3] Fri, 1 Aug 2025 17:17:09 UTC (3,753 KB)
+[v4] Fri, 16 Jan 2026 20:59:08 UTC (3,766 KB)
+## Related
+- [[pdf]]

research/notes/250921240-tree-search-for-llm-agent-reinforcement-learning.md ADDED Viewed

	@@ -0,0 +1,202 @@

+---
+title: '[2509.21240] Tree Search for LLM Agent Reinforcement Learning'
+id: 250921240-tree-search-for-llm-agent-reinforcement-learning
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:47.413734Z'
+source: https://arxiv.org/abs/2509.21240
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:47.405048Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2509.21240] Tree Search for LLM Agent Reinforcement Learning'
+---
+[2509.21240] Tree Search for LLM Agent Reinforcement Learning
+Computer Science > Machine Learning, arXiv:2509.21240 (cs)
+[Submitted on 25 Sep 2025 (v1), last revised 18 Mar 2026 (this version, v3)]
+Title: Tree Search for LLM Agent Reinforcement Learning
+Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
+Abstract:
+Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
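
To make the intra-tree and inter-tree grouping concrete, here is a minimal sketch that computes mean-centered group-relative advantages from outcome rewards at the leaves of several trees. Several things here are assumptions rather than statements from the abstract: each tree is represented only by a flat list of leaf rewards, the two levels are combined by simple addition, and no standard-deviation normalization is applied. The function names are hypothetical.

```python
from statistics import mean
from typing import List


def group_relative(rewards: List[float]) -> List[float]:
    """Mean-centered advantage of each reward within its group."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def tree_grpo_advantages(trees: List[List[float]]) -> List[List[float]]:
    """Combine intra-tree and inter-tree group-relative advantages per leaf.

    `trees[i]` holds the outcome rewards of the leaves of tree i (all trees
    are assumed to be rollouts for the same prompt).
    """
    # Inter-tree level: every leaf is compared against all leaves of all trees.
    flat = [r for tree in trees for r in tree]
    inter = group_relative(flat)

    combined: List[List[float]] = []
    offset = 0
    for tree in trees:
        # Intra-tree level: each leaf is compared against its sibling leaves only.
        intra = group_relative(tree)
        combined.append([a + inter[offset + k] for k, a in enumerate(intra)])
        offset += len(tree)
    return combined


advantages = tree_grpo_advantages([[1.0, 0.0], [0.0, 0.0]])
```

In this toy input the successful leaf is credited both against its sibling and against the whole batch, while the leaves of the all-failure tree receive only the inter-tree signal.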
+Comments: ICLR 2026, Code: this https URL
+Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
+Cite as: arXiv:2509.21240 [cs.LG] (or arXiv:2509.21240v3 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.2509.21240
+Submission history (from Yuxiang Ji):
+[v1] Thu, 25 Sep 2025 14:37:09 UTC (974 KB)
+[v2] Sat, 11 Oct 2025 09:55:47 UTC (938 KB)
+[v3] Wed, 18 Mar 2026 09:49:32 UTC (983 KB)

research/notes/251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod.md ADDED Viewed

	@@ -0,0 +1,291 @@

+---
+title: '[2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with
+  World Models'
+id: 251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:12.331553Z'
+source: https://arxiv.org/abs/2510.02387
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:12.211946Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with
+  World Models'
+---
+[2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with World Models
+Computer Science > Software Engineering, arXiv:2510.02387 (cs)
+[Submitted on 30 Sep 2025]
+Title: CWM: An Open-Weights LLM for Research on Code Generation with World Models
+Authors: FAIR CodeGen team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O'Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, Gabriel Synnaeve
+Abstract:
+We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
+Comments: 58 pages
+Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
+MSC classes: 68T07
+ACM classes: I.2.7
+Cite as: arXiv:2510.02387 [cs.SE] (or arXiv:2510.02387v1 [cs.SE] for this version)
+https://doi.org/10.48550/arXiv.2510.02387
+Submission history (from Gabriel Synnaeve):
+[v1] Tue, 30 Sep 2025 21:47:10 UTC (1,662 KB)

research/notes/251121654-evilgenie-a-reward-hacking-benchmark.md ADDED Viewed

	@@ -0,0 +1,199 @@

+---
+title: '[2511.21654] EvilGenie: A Reward Hacking Benchmark'
+id: 251121654-evilgenie-a-reward-hacking-benchmark
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+- locus-prune-vs-train-on-all
+- locus-eks-architecture-and-substrate-mapping
+- locus-credit-assignment-tree-as-process-signal
+created: '2026-06-09T04:56:26.236010Z'
+source: https://arxiv.org/abs/2511.21654
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:56:25.940448Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2511.21654] EvilGenie: A Reward Hacking Benchmark'
+---
+[2511.21654] EvilGenie: A Reward Hacking Benchmark
+Computer Science > Machine Learning, arXiv:2511.21654 (cs)
+[Submitted on 26 Nov 2025 (v1), last revised 17 May 2026 (this version, v2)]
+Title: EvilGenie: A Reward Hacking Benchmark
+Authors: Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
+Abstract:
+We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect's basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at this https URL.
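
One of the three measurements named in the abstract, test file edit detection, can be illustrated by hashing the test files before and after an agent run and diffing the two snapshots. This is only a sketch of the general idea, assuming the tests live in a single directory; `snapshot` and `edited_files` are hypothetical helpers and are not taken from the EvilGenie codebase.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def snapshot(test_dir: str) -> Dict[str, str]:
    """Map each file under test_dir (relative path) to its SHA-256 digest."""
    root = Path(test_dir)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def edited_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return files that were modified, deleted, or newly added between snapshots."""
    modified = {name for name in before.keys() & after.keys() if before[name] != after[name]}
    removed = before.keys() - after.keys()
    added = after.keys() - before.keys()
    return sorted(modified | removed | added)
```

A harness would call `snapshot` on the test directory before handing control to the agent, call it again afterwards, and flag the episode whenever `edited_files` returns a non-empty list.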
+Subjects: Machine Learning (cs.LG)
+ACM classes: I.2.7
+Cite as: arXiv:2511.21654 [cs.LG] (or arXiv:2511.21654v2 [cs.LG] for this version)
+https://doi.org/10.48550/arXiv.2511.21654
+Submission history (from Jonathan Gabor):
+[v1] Wed, 26 Nov 2025 18:27:17 UTC (75 KB)
+[v2] Sun, 17 May 2026 22:54:07 UTC (42 KB)

research/notes/251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo.md ADDED Viewed

	@@ -0,0 +1,210 @@

+---
+title: '[2512.18832] From Word to World: Can Large Language Models be Implicit Text-based
+  World Models?'
+id: 251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:01.604517Z'
+updated: '2026-06-09T04:22:18.450487Z'
+source: https://arxiv.org/abs/2512.18832
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:01.134259Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2512.18832] From Word to World: Can Large Language Models be Implicit Text-based
+  World Models?'
+---
+[2512.18832] From Word to World: Can Large Language Models be Implicit Text-based World Models?
+Computer Science > Computation and Language, arXiv:2512.18832 (cs)
+[Submitted on 21 Dec 2025 (v1), last revised 5 Mar 2026 (this version, v2)]
+Title: From Word to World: Can Large Language Models be Implicit Text-based World Models?
+Authors: Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji
+Abstract:
+Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating a clear boundary on when world modeling effectively supports agent learning.
+Subjects: Computation and Language (cs.CL)
+Cite as: arXiv:2512.18832 [cs.CL] (or arXiv:2512.18832v2 [cs.CL] for this version)
+https://doi.org/10.48550/arXiv.2512.18832
+Submission history (from Yixia Li):
+[v1] Sun, 21 Dec 2025 17:28:42 UTC (2,094 KB)
+[v2] Thu, 5 Mar 2026 07:26:37 UTC (2,094 KB)

research/notes/260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight.md ADDED Viewed

	@@ -0,0 +1,210 @@

+---
+title: '[2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight'
+id: 260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+created: '2026-06-09T04:22:01.607019Z'
+updated: '2026-06-09T04:22:18.777288Z'
+source: https://arxiv.org/abs/2601.03905
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:22:01.290034Z'
+fetch_provider: builtin
+status: draft
+type: note
+tier: institutional
+content_type: paper
+deprecated: false
+summary: '[2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight'
+---
+[2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight
+Computer Science > Artificial Intelligence, arXiv:2601.03905 (cs)
+[Submitted on 7 Jan 2026 (v1), last revised 8 Jan 2026 (this version, v2)]
+Title: Current Agents Fail to Leverage World Model as Tool for Foresight
+Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
+Abstract:
+Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
+Comments: 36 Pages, 13 Figures, 17 Tables (Meta data updated)
+Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
+Cite as: arXiv:2601.03905 [cs.AI] (or arXiv:2601.03905v2 [cs.AI] for this version)
+DOI: https://doi.org/10.48550/arXiv.2601.03905 (arXiv-issued DOI via DataCite)
+Submission history
+From: Cheng Qian
+[v1] Wed, 7 Jan 2026 13:15:23 UTC (12,754 KB)
+[v2] Thu, 8 Jan 2026 02:36:21 UTC (12,754 KB)

research/notes/260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas.md ADDED

	@@ -0,0 +1,206 @@

+---
+title: '[2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single
+  Agent Baseline'
+id: 260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas
+tags:
+- socratic-mcts-swe-worldmodel-8f6dea
+- locus-prune-vs-train-on-all
+- locus-eks-architecture-and-substrate-mapping
+- locus-credit-assignment-tree-as-process-signal
+created: '2026-06-09T04:52:22.844340Z'
+source: https://arxiv.org/abs/2601.12307
+source_domain: arxiv.org
+fetched_at: '2026-06-09T04:52:22.442504Z'
+fetch_provider: builtin
+status: draft
+type: note
+deprecated: false
+summary: '[2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single
+  Agent Baseline'
+---
+[2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
+Computer Science > Multiagent Systems
+arXiv:2601.12307 (cs)
+[Submitted on 18 Jan 2026]
+Title: Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
+Authors: Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding
+Abstract:
+Recent advances in LLM-based multi-agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single-LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi-turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain-specific reasoning, and real-world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this finding, we propose **OneFlow**, an algorithm that automatically tailors workflows for single-agent execution, reducing inference costs compared to existing automatic multi-agent design frameworks without trading off accuracy. These results position the single-LLM implementation of multi-agent workflows as a strong baseline for MAS research. We also note that single-LLM methods cannot capture heterogeneous workflows due to the lack of KV cache sharing across different LLMs, highlighting future opportunities in developing *truly* heterogeneous multi-agent systems.
+Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
+Cite as: arXiv:2601.12307 [cs.MA] (or arXiv:2601.12307v1 [cs.MA] for this version)
+DOI: https://doi.org/10.48550/arXiv.2601.12307 (arXiv-issued DOI via DataCite)
+Submission history
+From: Jiawei Xu
+[v1] Sun, 18 Jan 2026 08:16:09 UTC (429 KB)