Baladithya Balamurugan Claude Opus 4.8 (1M context) committed on
Commit c11cf49 · 1 Parent(s): 4e6e82e

Wave 1: fix 8 failing tests + unblock Docker E2E + dep/doc debt

Backlog-resolution Wave 1 (branch backlog/goal-resolution-2026-06).

Bugs fixed:
- B1 (P0): generate spikes/007 synthetic_session_with_error.jsonl fixture
(never committed — .gitignore *.jsonl whitelisted the sibling but not this
one) → the 8 failing test_trace_examples_adapter tests now pass. Added the
missing .gitignore whitelist line.
- B2 (P2): [dev] extra was un-installable on Apple Silicon (pulled Linux-only
torchft-nightly). Dropped diloco from base [dev]; added [dev-full] for Linux.
- B3 (P2): [serverless] extra now includes s3fs/boto3/kubernetes (needed for
real S3 rendezvous + the EKSExecutor/SageMakerExecutor adapters).

Newly unblocked:
- D1 (…-245d): Docker substrate E2E — Docker now available on host; both gates
(4-gate substrate inversion + cache-scrub-in-container) PASS. Long-standing
"hardware-blocked" item closed.

Doc/API debt:
- B4: reconciled divergent test counts (115/176/210/232) to one canonical
figure measured this session: 266 passed / 62 skipped / 328 collected, with
an env-variance note, in docs/V1_V8_COVERAGE.md; updated PROJECT_STATE,
BACKLOG, TROUBLESHOOTING to point at it.
- B5: replaced stale /mnt/e/ WSL footer paths with repo-relative in
USER_GUIDE/API_REFERENCE/INTEGRATION_RECIPES.
- B6: fixed dead ADR-002-channel2-sdpo.md link → ADR-008 (README + 2 run.py).
- B7: re-export make_dr_grpo_config/make_po_config/PO_OBJECTIVES at the
trainer-subpackage AND top-package level; documented the config factories +
the PO-objective menu table in API_REFERENCE.
- B8: corrected _refine-2026-06-SUMMARY self-stale "not merged/3 commits" →
merged at 4e6e82e/6 commits; fixed OVERVIEW foot-gun cross-ref.

Plus the deep-research deliverable: research/notes/final_report_*.md (the
multi-model MCTS tree-of-work design) + supporting vault notes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitignore +7 -0
  2. BACKLOG.md +1 -1
  3. composer_replication/__init__.py +7 -2
  4. composer_replication/trainer/__init__.py +12 -2
  5. docs/API_REFERENCE.md +59 -1
  6. docs/BACKLOG_RESOLUTION_2026-06-09.md +60 -0
  7. docs/INTEGRATION_RECIPES.md +1 -1
  8. docs/OVERVIEW.md +4 -2
  9. docs/PROJECT_STATE_AND_REMAINING_WORK.md +1 -1
  10. docs/TROUBLESHOOTING.md +1 -1
  11. docs/USER_GUIDE.md +1 -1
  12. docs/V1_V8_COVERAGE.md +18 -2
  13. docs/_refine-2026-06-SUMMARY.md +6 -2
  14. examples/gsm8k_grpo/run.py +1 -1
  15. examples/gsm8k_grpo_with_sdpo/README.md +1 -1
  16. examples/gsm8k_grpo_with_sdpo/run.py +1 -1
  17. pyproject.toml +19 -2
  18. research/audit_findings.json +11 -0
  19. research/comparisons.md +60 -0
  20. research/critic-findings-depth.json +35 -0
  21. research/critic-findings-dialectic.json +48 -0
  22. research/critic-findings-instruction.json +14 -0
  23. research/critic-findings-width.json +48 -0
  24. research/loci.json +68 -0
  25. research/notes/191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model.md +258 -0
  26. research/notes/201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning.md +229 -0
  27. research/notes/221114275-solving-math-word-problems-with-process-and-outcome-based-feedback.md +205 -0
  28. research/notes/230104104-mastering-diverse-domains-through-world-models.md +196 -0
  29. research/notes/230520050-lets-verify-step-by-step.md +207 -0
  30. research/notes/230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter.md +205 -0
  31. research/notes/240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l.md +200 -0
  32. research/notes/240515383-generating-code-world-models-with-large-language-models-guided-by-mont.md +196 -0
  33. research/notes/240701476-tree-search-for-language-model-agents.md +200 -0
  34. research/notes/240919256-hybridflow-a-flexible-and-efficient-rlhf-framework.md +222 -0
  35. research/notes/241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and.md +207 -0
  36. research/notes/241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati.md +194 -0
  37. research/notes/241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor.md +226 -0
  38. research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md +214 -0
  39. research/notes/250314391-how-much-do-llms-learn-from-negative-examples.md +189 -0
  40. research/notes/250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein.md +217 -0
  41. research/notes/250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model.md +210 -0
  42. research/notes/250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen.md +200 -0
  43. research/notes/250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro.md +214 -0
  44. research/notes/250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on.md +251 -0
  45. research/notes/250921240-tree-search-for-llm-agent-reinforcement-learning.md +202 -0
  46. research/notes/251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod.md +291 -0
  47. research/notes/251121654-evilgenie-a-reward-hacking-benchmark.md +199 -0
  48. research/notes/251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo.md +210 -0
  49. research/notes/260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight.md +210 -0
  50. research/notes/260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas.md +206 -0
.gitignore CHANGED
@@ -46,3 +46,10 @@ spikes/*/results/
46
  !spikes/*/states.jsonl
47
  !spikes/*/results.jsonl
48
  !**/synthetic_session.jsonl
 
46
  !spikes/*/states.jsonl
47
  !spikes/*/results.jsonl
48
  !**/synthetic_session.jsonl
49
+ !**/synthetic_session_with_error.jsonl
50
+
51
+ # hyperresearch tooling (local agent scaffolding — not project source)
52
+ .hyperresearch/
53
+ .claude/
54
+ CLAUDE.md
55
+ research/temp/
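
For illustration only, and not part of this commit: the whitelist fix relies on gitignore's rule that the last matching pattern decides. `*.jsonl` ignores every JSONL file, and each `!` line re-includes one name. The sketch below models that under a loud simplification: it matches on basename with `fnmatch` and does not model directory-scoped rules or anchoring. The helper name `is_ignored` is hypothetical.

```python
from __future__ import annotations

from fnmatch import fnmatch


def is_ignored(basename: str, rules: list[str]) -> bool:
    """Simplified gitignore check: the last matching rule decides.

    Assumption: only basename patterns are modelled; real gitignore also
    handles directory-scoped patterns and anchoring.
    """
    ignored = False
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        if pattern.startswith("**/"):
            pattern = pattern[3:]  # "**/name" matches name at any depth
        if fnmatch(basename, pattern):
            ignored = not negated
    return ignored


# Rules as described in the commit message, before and after the fix.
before = ["*.jsonl", "!**/synthetic_session.jsonl"]
after = before + ["!**/synthetic_session_with_error.jsonl"]

fixture = "synthetic_session_with_error.jsonl"
print(is_ignored(fixture, before))  # ignored before the fix
print(is_ignored(fixture, after))   # re-included after the fix
```

On a real checkout, `git check-ignore -v <path>` reports which rule matched and is the authoritative check.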
BACKLOG.md CHANGED
@@ -82,7 +82,7 @@ Updated 2026-05-29 to reflect shipped waves (ingestion, diloco, packaging, datag
82
  - **ADR-008/009/010 (Datagen, Layered Hints, Dr.GRPO+SDPO)**: Shipped, examples documented.
83
  - **Cross-Family Architectural Review**: Shipped (`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`).
84
  - **Alignment / V&V Closure**: ADR-011 (SDPO alignment indices), ADR-012 (close review findings), ADR-013 (LMA integration channel-ladder) shipped.
85
- - **Test Suites**: 210 passed / 16 skipped.
86
  - **Real Examples**: `examples/gsm8k_grpo/`, `examples/sdpo_with_real_traces_production/`.
87
 
88
  ## Deferred (post-loop, GPU-gated)
 
82
  - **ADR-008/009/010 (Datagen, Layered Hints, Dr.GRPO+SDPO)**: Shipped, examples documented.
83
  - **Cross-Family Architectural Review**: Shipped (`docs/reviews/cross-family-adr-008-009-010-2026-05-29/`).
84
  - **Alignment / V&V Closure**: ADR-011 (SDPO alignment indices), ADR-012 (close review findings), ADR-013 (LMA integration channel-ladder) shipped.
85
+ - **Test Suites**: 266 passed / 62 skipped (measured 2026-06-09; canonical count + env-variance note in docs/V1_V8_COVERAGE.md).
86
  - **Real Examples**: `examples/gsm8k_grpo/`, `examples/sdpo_with_real_traces_production/`.
87
 
88
  ## Deferred (post-loop, GPU-gated)
composer_replication/__init__.py CHANGED
@@ -94,8 +94,13 @@ from composer_replication.teacher_replay import (
94
  replay_trace,
95
  )
96
 
97
- # Trainer (Spike 005)
98
- from composer_replication.trainer import ComposerReplicationTrainer
99
 
100
  # DiLoCo (Spike 008) — optional, requires torchft
101
  try:
 
94
  replay_trace,
95
  )
96
 
97
+ # Trainer (Spike 005) + policy-optimization config factories (ADR-008/ADR-014)
98
+ from composer_replication.trainer import (
99
+ PO_OBJECTIVES,
100
+ ComposerReplicationTrainer,
101
+ make_dr_grpo_config,
102
+ make_po_config,
103
+ )
104
 
105
  # DiLoCo (Spike 008) — optional, requires torchft
106
  try:
composer_replication/trainer/__init__.py CHANGED
@@ -5,6 +5,16 @@ Per docs/adrs/ADR-003 (also wraps DiLoCo when training distributed).
5
  """
6
  from __future__ import annotations
7
 
8
- from composer_replication.trainer.composer_trainer import ComposerReplicationTrainer
9
 
10
- __all__ = ["ComposerReplicationTrainer"]
 
5
  """
6
  from __future__ import annotations
7
 
8
+ from composer_replication.trainer.composer_trainer import (
9
+ PO_OBJECTIVES,
10
+ ComposerReplicationTrainer,
11
+ make_dr_grpo_config,
12
+ make_po_config,
13
+ )
14
 
15
+ __all__ = [
16
+ "ComposerReplicationTrainer",
17
+ "make_dr_grpo_config",
18
+ "make_po_config",
19
+ "PO_OBJECTIVES",
20
+ ]
docs/API_REFERENCE.md CHANGED
@@ -926,6 +926,64 @@ trainer = ComposerReplicationTrainer(
926
  # trainer.train() # uses overridden _compute_loss
927
  ```
928
 
929
  ### `class TraceTurn(TypedDict, total=False)` — `trainer.data_collator`
930
 
931
  ```python
@@ -1460,4 +1518,4 @@ Untested-contract symbols (⚠️) and skeletons (🟡) are flagged inline above
1460
 
1461
  ---
1462
 
1463
- **Document path**: `/mnt/e/CS/HF/composer-replication-framework/docs/API_REFERENCE.md`
 
926
  # trainer.train() # uses overridden _compute_loss
927
  ```
928
 
929
+ ### `make_dr_grpo_config(**overrides) -> trl.GRPOConfig`
930
+
931
+ Builds a `trl.GRPOConfig` configured to the **Dr. GRPO** recipe (Composer 2.5's
932
+ base objective per the Composer 2 tech report, arXiv:2603.24477; Dr.GRPO =
933
+ Liu et al. arXiv:2503.20783). Forces three knobs unless explicitly overridden,
934
+ with drift-guard assertions:
935
+
936
+ - `loss_type="dr_grpo"` — removes GRPO's length-standardization length bias.
937
+ - `scale_rewards="none"` — NO std-dev advantage normalization (Dr.GRPO requirement).
938
+ - `num_iterations=1` — single-epoch / strict on-policy.
939
+
940
+ Any field is overridable via kwargs (`learning_rate=`, `output_dir=`, `beta=`, …).
941
+ **Honest KL-estimator delta** (ADR-012 #1): TRL 1.5.0's `GRPOTrainer._compute_loss`
942
+ uses the **k3** estimator `exp(ref_logp−logp)−(ref_logp−logp)−1`, NOT the k1
943
+ estimator `−log r` the Dr.GRPO/Composer report frames; the delta is small for r≈1
944
+ and TRL is not monkeypatched — the delta is documented, not hidden. Exported from
945
+ both `composer_replication` and `composer_replication.trainer`.
946
+
947
+ ```python
948
+ from composer_replication import make_dr_grpo_config
949
+ args = make_dr_grpo_config(output_dir="runs/x", learning_rate=1e-6)
950
+ ```
951
+
952
+ ### `make_po_config(objective="dr_grpo", **overrides) -> trl.GRPOConfig`
953
+
954
+ Builds a `trl.GRPOConfig` for a **named policy-optimization objective** from the
955
+ `PO_OBJECTIVES` menu (ADR-014). All presets are PURE CONFIG over trl 1.5.0's
956
+ `GRPOTrainer` (verified by introspection) — no custom `_compute_loss` needed.
957
+ `**overrides` set/override any `GRPOConfig` field on top.
958
+
959
+ - Raises `ValueError` on an unknown objective (lists the valid menu).
960
+ - Raises `AssertionError` if a requested knob silently failed to apply (drift guard;
961
+ e.g. GSPO guards `importance_sampling_level=="sequence"`).
962
+
963
+ ```python
964
+ from composer_replication import make_po_config, PO_OBJECTIVES
965
+ args = make_po_config("dapo", output_dir="runs/dapo", learning_rate=2e-6)
966
+ ```
967
+
968
+ ### `PO_OBJECTIVES: dict[str, dict]`
969
+
970
+ The selectable base policy-optimization objectives (named presets over real trl
971
+ 1.5.0 `GRPOConfig` knobs). Keys and what each sets:
972
+
973
+ | Objective | `loss_type` | `scale_rewards` | Distinguishing knob | Paper |
974
+ |---|---|---|---|---|
975
+ | `grpo` | `grpo` | `group` (std-norm) | IS=`token` | DeepSeekMath 2402.03300 |
976
+ | `dr_grpo` *(default)* | `dr_grpo` | `none` | length-bias removed | 2503.20783 |
977
+ | `bnpo` | `bnpo` | `batch` | batch-normalized | trl |
978
+ | `dapo` | `dapo` | `none` | `epsilon_high=0.28` (decoupled clip-higher), `mask_truncated_completions`, `beta=0` | 2503.14476 |
979
+ | `gspo` | `grpo` | `group` | `importance_sampling_level="sequence"` | Qwen 2507.18071 |
980
+ | `cispo` | `cispo` | `none` | `epsilon_high=5.0` (detached IS coef) | MiniMax-M1 2506.13585 |
981
+
982
+ > **Diagnostic gotcha:** for any PO-objective ablation, log the *distinguishing*
983
+ > diagnostic (`clip_ratio/high_mean` for DAPO, the sequence-level ratio for GSPO).
984
+ > A `0` means the knob never engaged — NOT that the objectives are equal. (This is
985
+ > exactly the inert-knob artifact the A1 DAPO-vs-Dr.GRPO washout hit at lr=1e-6.)
986
+
987
  ### `class TraceTurn(TypedDict, total=False)` — `trainer.data_collator`
988
 
989
  ```python
 
1518
 
1519
  ---
1520
 
1521
+ **Document path**: `docs/API_REFERENCE.md` (repo-relative)
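
For illustration only, and not part of this commit: the KL-estimator note in the `make_dr_grpo_config` entry can be checked numerically. The sketch below writes out the two per-sample estimators quoted there, with `log r = ref_logp - logp`, and shows that they differ by `r - 1`, which is small when `r` is near 1. The function names `k1` and `k3` are chosen here for readability and are not TRL or repo APIs.

```python
import math


def k1(logp: float, ref_logp: float) -> float:
    """k1 estimator: -log r, where log r = ref_logp - logp."""
    return -(ref_logp - logp)


def k3(logp: float, ref_logp: float) -> float:
    """k3 estimator: exp(log r) - log r - 1 (always non-negative)."""
    log_r = ref_logp - logp
    return math.exp(log_r) - log_r - 1.0


logp = -1.0  # arbitrary policy log-prob for one token
for log_r in (-0.5, -0.05, 0.0, 0.05, 0.5):
    ref_logp = logp + log_r
    a, b = k1(logp, ref_logp), k3(logp, ref_logp)
    # The per-sample gap k3 - k1 equals r - 1.
    print(f"log r={log_r:+.2f}  k1={a:+.4f}  k3={b:.4f}  gap={b - a:+.4f}")
```

Per sample, k1 can be negative while k3 cannot, so per-token KL values logged from the two estimators are not directly comparable even when their averages are close.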
docs/BACKLOG_RESOLUTION_2026-06-09.md ADDED
@@ -0,0 +1,60 @@
1
+ # Backlog Resolution — 2026-06-09
2
+
3
+ Goal-driven systematic resolution of every pending item. This doc is the live audit + wave plan.
4
+
5
+ ## Phase 1 — Commit / working-tree state (captured 2026-06-09)
6
+
7
+ - **Branch:** `main` (canonical) at `4e6e82e` = `origin/main` = `origin/master` (synced).
8
+ - **Working branch for this effort:** `backlog/goal-resolution-2026-06` (off `main`).
9
+ - **Untracked (from the hyperresearch run + tooling):** `research/` artifacts (query, scaffold, loci, comparisons, critic-findings, patch/polish logs, `notes/final_report_*`), `.hyperresearch/` (SQLite vault), `.claude/skills/` (16 hyperresearch step skills), `CLAUDE.md` (hyperresearch-injected). Decision: the deep-research deliverable (`research/notes/final_report_socratic-mcts-swe-worldmodel-8f6dea.md` + supporting artifacts) is worth committing as project research; `.hyperresearch/` (binary SQLite) and tooling scaffolding should be gitignored.
10
+ - **Host capabilities NEW since last audit:** **Docker IS available** (`docker info` ok) → unblocks the substrate-E2E item. `.venv` (py3.13, torch 2.12, trl 1.5.1) present.
11
+
12
+ ## Phase 2 — Backlog audit (every item, categorized)
13
+
14
+ ### A. Real bugs / regressions (do NOW, no gating)
15
+ | ID | Item | Priority | Complexity | Status |
16
+ |---|---|---|---|---|
17
+ | B1 | 8 failing tests: gitignored `synthetic_session_with_error.jsonl` fixture never committed (`.gitignore:45 *.jsonl` whitelists `synthetic_session.jsonl` but not the `_with_error` sibling). Breaks `composer_replication/ingestion/tests/test_trace_examples_adapter.py` (core pkg) + `examples/sdpo_with_real_traces_production/run.py`. | P0 | trivial | OPEN |
18
+ | B2 | `[dev]` extra un-installable on Apple Silicon (pulls `torchft-nightly`, Linux-x86_64-only wheels) → `uv pip install -e '.[dev]'` fails entirely. | P2 | low | OPEN |
19
+ | B3 | `[serverless]` extra missing `s3fs`/`boto3`/`kubernetes` (needed for real S3 rendezvous + the planned EKSExecutor). | P2 | low | OPEN |
20
+
21
+ ### B. Doc/state debt (do NOW)
22
+ | ID | Item | Priority | Status |
23
+ |---|---|---|---|
24
+ | B4 | Test-count drift: docs claim 115 / 210 / 232 / 176 in different places; real count must be measured + reconciled to one canonical number (V1_V8_COVERAGE.md). | P2 | OPEN |
25
+ | B5 | Stale WSL `/mnt/e/CS/HF/...` absolute-path footers in API_REFERENCE.md:1463, USER_GUIDE.md:703, INTEGRATION_RECIPES.md:985 (+ research/* occurrences). | P3 | OPEN |
26
+ | B6 | Dead link `examples/gsm8k_grpo_with_sdpo/README.md:66 → docs/adrs/ADR-002-channel2-sdpo.md` (should be ADR-008-drgrpo-sdpo-live-channel.md). | P3 | OPEN |
27
+ | B7 | API_REFERENCE.md missing the trainer config factories `make_dr_grpo_config` (ADR-008) + `make_po_config`/`PO_OBJECTIVES` (ADR-014) — real public API undocumented. | P2 | OPEN |
28
+ | B8 | `_refine-2026-06-SUMMARY.md` self-stale ("not merged, 3 commits" — actually merged, 6 commits); README/OVERVIEW→TROUBLESHOOTING dangling foot-gun cross-ref. | P3 | OPEN |
29
+
30
+ ### C. Code-buildable Phase-0 deltas from the research report (do NOW — mockable, no GPU/cloud)
31
+ | ID | Item | Priority | Complexity | Status |
32
+ |---|---|---|---|---|
33
+ | C1 | **Held-out disjoint eval + depth/generation kill-switch** — the "documented repo gap" + most load-bearing collapse safeguard (#2). Self-evolving flywheel is unsafe without it. CPU-testable. | P1 | med | OPEN |
34
+ | C2 | **`EKSExecutor`** satisfying the `ServerlessExecutor` Protocol (launch_replicas=K8s indexed Jobs, poll/cancel/collect, S3 via ObjectStoreAllReduce) — ~150 LOC, mockable like ModalSpawnExecutor (its test uses `_MockFunctionCall`). The named-but-unimplemented `K8sExecutor` slot (executor.py:41). | P2 | med | OPEN |
35
+ | C3 | Containerize `LocalSubprocessSandbox` (gVisor/Docker runtime) — now that Docker exists, the sandbox-execution path can be made real. | P3 | med | OPEN |
36
+
37
+ ### D. Hardware/host-gated — NOW RUNNABLE (Docker present)
38
+ | ID | Item | Priority | Status |
39
+ |---|---|---|---|
40
+ | D1 (`…-245d`) | Docker substrate E2E (`composer_replication/datagen/tests/test_docker_substrate_e2e.py`) — the 4 inversion gates + cache-scrub on a real `python:3.11-slim` container. Was skipif-gated on `docker info`; **Docker now available → RUN IT**. | P4→now | OPEN |
41
+
42
+ ### E. Code-buildable, RUN-gated (build harness/tests; real run needs GPU+budget — user-only)
43
+ | ID | Item | Priority | Status |
44
+ |---|---|---|---|
45
+ | E1 (`…-4936`) | A2 SDPO-only ladder runner + error-trace dataset builder. `modal_ladder_a1.py` hardcoded to A1. Build the runner + dataset tooling + CPU/mock tests; real A100 run is user-gated. | P2 | OPEN (build harness) |
46
+ | E2 (`…-211e`) | Higher-lr PO-objective sweep harness — make DAPO/GSPO clip-higher fire; log the distinguishing diagnostic. Build the sweep config/driver + assertions; real run user-gated. | P2 | OPEN (build harness) |
47
+ | E3 | `SageMakerExecutor` (~150 LOC, boto3 create_training_job, same S3 rendezvous) — mockable. | P3 | OPEN |
48
+
49
+ ### F. Genuinely gated — cannot execute here (document + verify only)
50
+ | ID | Item | Priority | Status |
51
+ |---|---|---|---|
52
+ | F1 (`…-cb74`) | **ROTATE exposed HF write-token** — USER-ONLY (requires HF account access). AUDIT done: no live token in tracked tree (only env-var reads). Action = user rotates on huggingface.co. | P1 | DOCUMENTED (user-only) |
53
+ | F2 | Real 8B LMA run (A2/A3/A4 arms `…-42f5`,`…-dd7b`) + higher-lr sweep RUNS — GPU + budget + user go/no-go. Harness buildable (E1/E2); the spend is user-only. | — | GATED (harness only) |
54
+
55
+ ## Wave plan
56
+ - **Wave 1 (parallel):** B1, B2, B3, B4, B5, B6, B7, B8 (bugs + doc debt) ‖ D1 (Docker E2E) ‖ research fan-out (Tavily/Exa/DeepWiki) for C1/C2/E1/E2 best practices.
57
+ - **Wave 2 (parallel, after research):** C1 (held-out eval + kill-switch) ‖ C2 (EKSExecutor) ‖ C3 (containerized sandbox) ‖ E1/E2/E3 harnesses.
58
+ - **Concurrent review team:** audits each wave's diff, feeds findings back.
59
+ - **Wave 3+:** reconcile review findings, fix, repeat until zero open + tests green.
60
+ - **Final:** full suite green, docs reconciled, everything committed.
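
For illustration only, and not part of this commit: item C2 describes an executor that satisfies a `ServerlessExecutor` Protocol and can be tested against a mock. The sketch below shows that pattern in general terms. Every signature, return type, and class name in it is an assumption, because the real Protocol in `executor.py` is not shown in this diff.

```python
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class ExecutorProtocolSketch(Protocol):
    """Hypothetical shape; the repo's real ServerlessExecutor may differ."""

    def launch_replicas(self, n: int) -> list[str]: ...
    def poll(self, handle: str) -> str: ...
    def cancel(self, handle: str) -> None: ...
    def collect(self, handle: str) -> dict[str, str]: ...


class InMemoryExecutor:
    """Mock executor: no Kubernetes and no S3, only a status dict."""

    def __init__(self) -> None:
        self._status: dict[str, str] = {}

    def launch_replicas(self, n: int) -> list[str]:
        handles = [f"replica-{i}" for i in range(n)]
        for handle in handles:
            self._status[handle] = "succeeded"  # mock jobs finish instantly
        return handles

    def poll(self, handle: str) -> str:
        return self._status[handle]

    def cancel(self, handle: str) -> None:
        self._status[handle] = "cancelled"

    def collect(self, handle: str) -> dict[str, str]:
        return {"handle": handle, "status": self._status[handle]}


executor = InMemoryExecutor()
handles = executor.launch_replicas(3)
executor.cancel(handles[-1])
print([executor.poll(h) for h in handles])
```

Because the Protocol is structural, a mock like this needs no inheritance from the Protocol class; any object with the four methods passes the `isinstance` check.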
docs/INTEGRATION_RECIPES.md CHANGED
@@ -982,4 +982,4 @@ adapter boundary, not because the loss math is wrong.
982
 
983
  ---
984
 
985
- **File path:** `/mnt/e/CS/HF/composer-replication-framework/docs/INTEGRATION_RECIPES.md`
 
982
 
983
  ---
984
 
985
+ **File path:** `docs/INTEGRATION_RECIPES.md` (repo-relative)
docs/OVERVIEW.md CHANGED
@@ -67,8 +67,10 @@ where channel 1 is real GRPO rather than the LM-CE stub. See
67
  3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
68
  GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
69
 
70
- See [`BACKLOG.md`](../BACKLOG.md) for the live gap list and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
71
- for known foot-guns.
72
 
73
  ## Foot-guns worth knowing on day one
74
 
 
67
  3. **The empirical question** — does the method actually beat plain GRPO at scale? — is the
68
  GPU-budget-gated v0.1 work (Spikes 002b/003/004) and remains open by design.
69
 
70
+ See [`BACKLOG.md`](../BACKLOG.md) for the live gap list, the **Foot-guns worth knowing
71
+ on day one** section just below for the day-one gotchas (branch sync, `strip_thinking`,
72
+ k1/k3, `compose_loss`-is-harness), and [`docs/TROUBLESHOOTING.md`](TROUBLESHOOTING.md)
73
+ for install/runtime failure modes.
74
 
75
  ## Foot-guns worth knowing on day one
76
 
docs/PROJECT_STATE_AND_REMAINING_WORK.md CHANGED
@@ -15,7 +15,7 @@ for unblocked work, `sd list` for everything, `sd show <id>` for detail.
15
  A reusable RL/data-gen framework that replicates Cursor's **Composer 2.5** post-training
16
  recipe at small scale, whose north-star consumer is the **llm-mental-alterations (LMA)**
17
  project (apply targeted RL to a personality-altered SFT model and measure washout vs
18
- amplification). Past-skeleton, production-shaped: 8 subpackages, 232 tests pass / 18 skip,
19
  installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.
20
 
21
  ## The 3-channel loss — with HONEST provenance
 
15
  A reusable RL/data-gen framework that replicates Cursor's **Composer 2.5** post-training
16
  recipe at small scale, whose north-star consumer is the **llm-mental-alterations (LMA)**
17
  project (apply targeted RL to a personality-altered SFT model and measure washout vs
18
+ amplification). Past-skeleton, production-shaped: 8 subpackages, 266 tests pass / 62 skip (measured 2026-06-09; see docs/V1_V8_COVERAGE.md for the canonical count + why skips vary by env),
19
  installable, with worked GSM8K-GRPO + SDPO-real-trace + A1-8B examples.
20
 
21
  ## The 3-channel loss — with HONEST provenance
docs/TROUBLESHOOTING.md CHANGED
@@ -824,7 +824,7 @@ should succeed:
824
  uv venv --clear
825
  uv pip install -e ".[diloco,replay,replaysim,train,dev]"
826
  source .venv/bin/activate
827
- python -m pytest -q # baseline 176 passed / 8 skipped
828
  ```
829
 
830
  If any of those extras fails to resolve, file a bug report — Wave 16
 
824
  uv venv --clear
825
  uv pip install -e ".[diloco,replay,replaysim,train,dev]"
826
  source .venv/bin/activate
827
+ python -m pytest -q # baseline 266 passed / 62 skipped (2026-06-09; varies by optional deps/Docker — see docs/V1_V8_COVERAGE.md)
828
  ```
829
 
830
  If any of those extras fails to resolve, file a bug report — Wave 16
docs/USER_GUIDE.md CHANGED
@@ -700,4 +700,4 @@ Run the full suite with `pytest` from the repo root.
700
 
701
  ---
702
 
703
- **File path:** `/mnt/e/CS/HF/composer-replication-framework/docs/USER_GUIDE.md`
 
700
 
701
  ---
702
 
703
+ **File path:** `docs/USER_GUIDE.md` (repo-relative)
docs/V1_V8_COVERAGE.md CHANGED
@@ -112,7 +112,23 @@ The user expanded the brief mid-loop:
112
 
113
  **Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).
114
 
115
- The framework now covers the full expanded brief. **Total tests passing
116
- post-Wave-15: 115 + 1 skip-marked.** Wave-by-wave evolution: 72 (W12) → 93 (W13) → 124 (W14) → 130 (W14b) → 115 (W15: TAID rewrite consolidated 16 schedule-tests into 7 t-parameterized tests; OPSD upstream-parity test added skip-marked).
117
 
118
  This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.
 
112
 
113
  **Wave 13 test addition**: 35 new tests passing (17 distillation + 9 serverless multi-process + 9 replaysim).
114
 
115
+ The framework now covers the full expanded brief.
116
+
117
+ **Canonical test count (measured 2026-06-09 on this tree): 266 passed / 62 skipped / 328 collected.**
118
+ Wave-by-wave growth of the *passing-on-a-minimal-CPU-env* subset: 72 (W12) → 93 (W13)
119
+ → 124 (W14) → 130 (W14b) → 115 (W15) → … → **266 (2026-06-09)** as later waves
120
+ (datagen, ADR-011/012/013/014, serverless, ingestion adapter) added subpackages and tests.
121
+
122
+ **Why the skip count varies by environment (and why older docs cite 115 / 176 / 210 / 232):**
123
+ the suite has ~328 collected tests; how many *run vs skip* depends on what optional
124
+ deps / host capabilities are present. Tests `skipif`-gate on: `torchft` (DiLoCo
125
+ integration — Linux-x86_64-only, absent on macOS arm64), `modal`, `data-juicer`,
126
+ `prime-rl`, the `/tmp/{opsd,taid}-clone` upstream-parity clones, a real Claude Code
127
+ session log, and a live **Docker** host. On a minimal CPU env many of those skip;
128
+ on a Docker-enabled host the substrate-E2E gates RUN (proven 2026-06-09). The
129
+ divergent historical numbers (115 Wave-15, 232/18, 210/16, 176/8) are point-in-time
130
+ snapshots under different dep/host matrices — they are not contradictions, but this
131
+ line is the one canonical figure; reproduce it with `pip install -e '.[dev]'` then
132
+ `pytest -q` (add `.[datagen]` + a Docker host to un-skip the substrate E2E).
133
 
134
  This is the canonical running test count; other docs reference V1_V8_COVERAGE rather than restating.
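
For illustration only, and not part of this commit: the environment-variance note says tests skip when an optional dependency or host capability is absent, which is why passed and skipped counts shift between machines while the collected total stays fixed. The sketch below reproduces that behaviour with the standard library's `unittest`, to stay dependency-free; the repo itself is described as using pytest `skipif`. The test class and helper names are hypothetical, and checking for the `docker` CLI on `PATH` is an assumed stand-in for the `docker info` probe mentioned in the backlog doc.

```python
from __future__ import annotations

import importlib.util
import shutil
import unittest


def has_module(name: str) -> bool:
    """True if a top-level module is importable in this environment."""
    return importlib.util.find_spec(name) is not None


HAS_TORCHFT = has_module("torchft")
# Assumption: CLI presence stands in for a working daemon (`docker info`).
HAS_DOCKER = shutil.which("docker") is not None


class OptionalCapabilityTests(unittest.TestCase):
    @unittest.skipUnless(HAS_TORCHFT, "torchft not installed")
    def test_torchft_path(self) -> None:
        self.assertTrue(HAS_TORCHFT)

    @unittest.skipUnless(HAS_DOCKER, "docker CLI not found")
    def test_docker_path(self) -> None:
        self.assertTrue(HAS_DOCKER)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(OptionalCapabilityTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
passed = result.testsRun - len(result.skipped) - len(result.failures) - len(result.errors)
# Same identity as the canonical figure: passed + skipped == collected.
print(f"{passed} passed / {len(result.skipped)} skipped / {result.testsRun} collected")
```

The split between passed and skipped depends on the host, but the total is always 2 here, just as the canonical figure keeps 266 + 62 = 328 collected.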
docs/_refine-2026-06-SUMMARY.md CHANGED
@@ -1,7 +1,11 @@
1
  # Docs Refine 2026-06 — Change Summary
2
 
3
- > Branch: `docs/refine-2026-06` (off `master` HEAD `aae66fa`). **Docs-only.** Not merged,
4
- > no PR opened left for human review. Commit range: `aae66fa..e130879` (3 commits).
5
 
6
  This engagement refined the documentation corpus to (1) enforce the ground-truth provenance
7
  correction recorded in [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md), (2)
 
1
  # Docs Refine 2026-06 — Change Summary
2
 
3
+ > Branch: `docs/refine-2026-06` (off `master` HEAD `aae66fa`). **Docs-only.**
4
+ > **MERGED** into `main` as of `4e6e82e` (merge commit "Merge docs/refine-2026-06"),
5
+ > after the 3 documented waves (`20e3bd9`, `f00833d`, `e130879`) plus 3 reconciliation
6
+ > commits (`ace6dd4`, `5e64616`, `d7e4b4e`) that retired the now-resolved main-lags-master
7
+ > foot-gun — 6 commits total in range `fb13ea3..4e6e82e`, not the 3 this summary originally
8
+ > listed. (This header was updated 2026-06-09 to reflect the merged reality.)
9
 
10
  This engagement refined the documentation corpus to (1) enforce the ground-truth provenance
11
  correction recorded in [ADR-014](adrs/ADR-014-policy-optimization-objective-menu.md), (2)
examples/gsm8k_grpo/run.py CHANGED
@@ -23,7 +23,7 @@ Usage:
23
  Cross-references:
24
  - `docs/USER_GUIDE.md` §8 — Recipe A: TRL `GRPOTrainer` subclass
25
  - `docs/INTEGRATION_RECIPES.md` Recipe 1 — minimum-viable Python script
26
- - `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design (not used here; see
27
  `run_with_sdpo.py` for the SDPO variant)
28
  """
29
  from __future__ import annotations
 
23
  Cross-references:
24
  - `docs/USER_GUIDE.md` §8 — Recipe A: TRL `GRPOTrainer` subclass
25
  - `docs/INTEGRATION_RECIPES.md` Recipe 1 — minimum-viable Python script
26
+ - `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md` — SDPO design (not used here; see
27
  `run_with_sdpo.py` for the SDPO variant)
28
  """
29
  from __future__ import annotations
examples/gsm8k_grpo_with_sdpo/README.md CHANGED
@@ -63,7 +63,7 @@ hints from the actual error sites in your trace data.
63
 
64
  - [`composer_replication.compose_loss`](../../composer_replication/loss.py) — the loss-composition entrypoint
65
  - [`docs/COMPOSER_RECIPE_MAPPING.md`](../../docs/COMPOSER_RECIPE_MAPPING.md) — how SDPO maps to Cursor's Composer-2.5 hint-distillation
66
- - [`docs/adrs/ADR-002-channel2-sdpo.md`](../../docs/adrs/ADR-002-channel2-sdpo.md) — SDPO design decision
67
  - [`examples/gsm8k_grpo/run.py`](../gsm8k_grpo/run.py) — plain GRPO sibling (alpha_sdpo=0)
68
 
69
  ## CPU vs GPU
 
63
 
64
  - [`composer_replication.compose_loss`](../../composer_replication/loss.py) — the loss-composition entrypoint
65
  - [`docs/COMPOSER_RECIPE_MAPPING.md`](../../docs/COMPOSER_RECIPE_MAPPING.md) — how SDPO maps to Cursor's Composer-2.5 hint-distillation
66
+ - [`docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md`](../../docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md) — SDPO design decision
67
  - [`examples/gsm8k_grpo/run.py`](../gsm8k_grpo/run.py) — plain GRPO sibling (alpha_sdpo=0)
68
 
69
  ## CPU vs GPU
examples/gsm8k_grpo_with_sdpo/run.py CHANGED
@@ -28,7 +28,7 @@ Cross-references:
28
  - `composer_replication.compose_loss` — the loss-composition entrypoint
29
  - `docs/COMPOSER_RECIPE_MAPPING.md` — how SDPO maps to Cursor's
30
  Composer-2.5 hint-distillation
31
- - `docs/adrs/ADR-002-channel2-sdpo.md` — SDPO design
32
  - `examples/gsm8k_grpo/run.py` — plain GRPO (no SDPO) sibling
33
  """
34
  from __future__ import annotations
 
28
  - `composer_replication.compose_loss` — the loss-composition entrypoint
29
  - `docs/COMPOSER_RECIPE_MAPPING.md` — how SDPO maps to Cursor's
30
  Composer-2.5 hint-distillation
31
+ - `docs/adrs/ADR-008-drgrpo-sdpo-live-channel.md` — SDPO design
32
  - `examples/gsm8k_grpo/run.py` — plain GRPO (no SDPO) sibling
33
  """
34
  from __future__ import annotations
pyproject.toml CHANGED
@@ -58,9 +58,16 @@ diloco = [
58
  "torchft-nightly",
59
  ]
60
  # Decoupled DiLoCo over serverless executors (per ADR-005)
61
  serverless = [
62
  "fsspec>=2024.6",
63
  "huggingface_hub>=0.27", # for hf:// fsspec backend + HF Jobs
64
  ]
65
  # Replaysim dataset normalization (per ADR-004)
66
  #
@@ -111,11 +118,21 @@ datagen = [
111
  # module is a documentation skeleton (importing it does NOT require
112
  # monarch installed). The extra is dropped — see docs/TROUBLESHOOTING.md
113
  # ("monarch / data-juicer install") for installation guidance.
114
- # Everything for development
115
  dev = [
116
  "pytest>=8.0",
117
  "ruff>=0.6",
118
- "composer-replication[replay,diloco,train]",
119
  ]
120
 
121
  [project.urls]
 
58
  "torchft-nightly",
59
  ]
60
  # Decoupled DiLoCo over serverless executors (per ADR-005)
61
+ # fsspec gives the object-store rendezvous one code path (s3://, gs://, hf://,
62
+ # file://); s3fs is the concrete S3 backend (the AWS default per the EKS design);
63
+ # boto3 + kubernetes are needed by the AWS leaf adapters (SageMakerExecutor uses
64
+ # boto3.create_training_job; EKSExecutor uses the kubernetes BatchV1 client).
65
  serverless = [
66
  "fsspec>=2024.6",
67
  "huggingface_hub>=0.27", # for hf:// fsspec backend + HF Jobs
68
+ "s3fs>=2024.6", # concrete S3 backend for ObjectStoreAllReduce (AWS default)
69
+ "boto3>=1.34", # SageMakerExecutor (create_training_job) + S3 IAM
70
+ "kubernetes>=29.0", # EKSExecutor (indexed k8s Jobs via BatchV1Api)
71
  ]
72
  # Replaysim dataset normalization (per ADR-004)
73
  #
 
118
  # module is a documentation skeleton (importing it does NOT require
119
  # monarch installed). The extra is dropped — see docs/TROUBLESHOOTING.md
120
  # ("monarch / data-juicer install") for installation guidance.
121
+ # Development: the BASE dev set installs on every platform (macOS arm64 incl.).
122
+ # NOTE: `diloco` (torchft-nightly) is deliberately NOT in base `dev`: torchft-nightly
123
+ # ships Linux-x86_64 wheels only, so including it made `pip install -e '.[dev]'` fail
124
+ # outright on Apple Silicon / any non-Linux-x86_64 host. The torchft-dependent tests
125
+ # skipif-gate cleanly when it is absent, so the base dev set runs the full suite minus
126
+ # the torchft integration tests on any platform.
127
  dev = [
128
  "pytest>=8.0",
129
  "ruff>=0.6",
130
+ "composer-replication[replay,train]",
131
+ ]
132
+ # Full development incl. the DiLoCo outer-loop dep (Linux-x86_64 only — torchft-nightly).
133
+ # Use on a Linux GPU/CI host to also exercise the torchft integration tests.
134
+ dev-full = [
135
+ "composer-replication[dev,diloco,serverless,datagen]",
136
  ]
137
 
138
  [project.urls]
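Reviewer note: the new `dev` comment relies on the torchft-dependent tests skipping cleanly when the package is absent. A minimal sketch of such a gate follows, using only the standard library. The helper `has_module`, the constants, and the assumption that `torchft-nightly` is importable as `torchft` are illustrative and are not taken from the repo.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises for a missing parent package or a module whose
        # __spec__ is unset; treat both as "not available".
        return False


# Assumption: the package installed by torchft-nightly is importable as "torchft".
HAS_TORCHFT = has_module("torchft")
SKIP_REASON = "torchft not installed (Linux-x86_64 only); install the dev-full extra"
```

A test module could then decorate its torchft integration tests with `pytest.mark.skipif(not HAS_TORCHFT, reason=SKIP_REASON)`, so a base `.[dev]` install on macOS arm64 reports them as skipped rather than failing at import time.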
research/audit_findings.json ADDED
@@ -0,0 +1,11 @@
1
+ [
2
+ {
3
+ "mode": "hyperresearch-v8",
4
+ "run_id": "2026-06-09-socratic-mcts-swe-worldmodel-8f6dea",
5
+ "loci_count": 5,
6
+ "critical_findings_applied": 17,
7
+ "critical_findings_skipped": 0,
8
+ "polish_escalations": 0,
9
+ "final_word_count": 9207
10
+ }
11
+ ]
research/comparisons.md ADDED
@@ -0,0 +1,60 @@
1
+ # Cross-locus comparisons — argumentative spine
2
+
3
+ ## Tension 1: "Prune" means two different things at two different granularities
4
+ - **Locus prune-vs-train-on-all** commits: TRAIN ON ALL branches, but *typed/routed* — winners→policy SFT/RL, losers→DPO rejects + world-model targets; the natural prune is the per-TURN JSD signal-presence test, not per-trajectory survival.
5
+ - **Locus selfevolve-flywheel** commits: you MUST prune at the oracle-cleanliness gate before training, because train-on-all distills proxy-hacks (RSI §3.2) — reward-hacking branches must be discarded, not learned from.
6
+ - **The cross-locus dynamic:** These look contradictory ("train on all" vs "prune") but reconcile into a precise rule: prune at TWO gates the policy must never cross — (a) oracle-cleanliness (drop reward-hacked / guard-broken branches entirely) and (b) per-turn signal-presence (skip zero-signal turns) — then train on ALL of what survives, routed by signal type. The flywheel locus supplies the safety floor that the prune-vs-all locus's "keep everything" must sit on top of.
7
+ - **How the draft should engage this:** §4 must state the resolution as a two-gate filter (cleanliness gate + signal-presence gate) wrapping a typed train-on-all, NOT as "prune vs all". This is the headline reconciliation of the report.
8
+ - **Calibration:** prune-vs-all is HIGH confidence that structured negatives beat positives-only; flywheel is HIGH that the gated version compounds, MEDIUM the current repo code is sufficient (safeguard #2, the disjoint held-out + kill-switch, is a documented GAP). Both name the SAME falsifier: held-out score declining while in-loop oracle reward rises = collapse caught in the act.
9
+
10
+ ## Tension 2: The failed branch is simultaneously poison (for the policy) and gold (for the world model)
11
+ - **Locus worldmodel-latent-deliberation** commits: train-on-all for the world-model head (a failed branch is a *perfect* next-state-prediction label — CWM precedent), prune/reward-filter for the GRPO policy head — "same tree, two harvests."
12
+ - **Locus prune-vs-train-on-all** commits: the single best use of a failed branch is exactly a world-model next-state-prediction target (route #2, "no policy-gradient penalty at all").
13
+ - **The cross-locus dynamic:** Strong CONVERGENCE from two independent investigations onto the same mechanism — the failed branch's value is realized by predicting it, not by penalizing the policy with it. This dissolves the prune-vs-all dilemma: you never throw the failed branch away (world model eats it) and you never let it destabilize the policy (no raw negative gradient). Convergence-from-independent-paths is itself a finding.
14
+ - **How the draft should engage this:** §2 and §4 must share this "two-harvest" frame explicitly; the world-model aux loss is what *makes* train-on-all safe for the policy, because it relocates the failed-branch signal off the policy gradient.
15
+ - **Calibration:** worldmodel is HIGH on necessity of training it, MEDIUM-HIGH that the aux next-state head is the best lever; prune-vs-all independently rates the same head MEDIUM-HIGH. Shared falsifier: foresight@k with aux-ON ≈ token-RL-only (aux content loss redundant at scale).
16
+
17
+ ## Tension 3: The expensive tree only pays for itself if expansion is divergence-gated — and that gate is where the world model earns its keep
18
+ - **Locus credit-assignment** commits: the divergence tree is a genuine PRM-free counterfactual process oracle, but O(N^D) cost means it's worth it ONLY with divergence-gated expansion (branch only at high-VOI turns where heterogeneous models already disagree) → ~O(N·decision-points).
19
+ - **Locus worldmodel** commits: the bottleneck the literature identifies (2601.03905) is foresight *governance* — when/whether to deliberate — not simulator fidelity; RL on the `<deliberate>` token's *placement* teaches governance.
20
+ - **The cross-locus dynamic:** COMPLICATION-into-synthesis: the same "where to spend deliberation" question appears as a COST control in credit-assignment (where to branch the env) and as a CAPABILITY target in worldmodel (where to emit `<deliberate>`). They are the same decision learned at two levels — the trained world model's governance signal is exactly the policy that should drive divergence-gated expansion at data-generation time. The system's most expensive knob (branch factor) and its core capability (foresight governance) are the same lever.
21
+ - **How the draft should engage this:** §3 (GA) and the §8 cost section must tie the divergence gate to VOI; note the bootstrap — early rounds gate on cross-model disagreement, later rounds can gate on the model's own learned deliberation-confidence.
22
+ - **Calibration:** credit-assignment is conditional ("YES but gated"); its falsifier (divergence-gated arm fails to beat equal-budget outcome-only GRPO++ on long-horizon tasks) is the single most important compute-matched ablation in the whole program.
23
+
24
+ ## Tension 4: Replay entrenches the human distribution — branching is the claimed escape, but only the oracle proves you escaped
25
+ - **Locus selfevolve-flywheel** commits: human-trace entrenchment (Self-Play-SWE-RL 2512.18552) is real for the UNGUARDED version; the antidote is counterfactual branching OFF the human path graded by tests — "you fork, you don't replay."
26
+ - **Locus credit-assignment** commits: sibling divergence (different models reaching different EXECUTED outcomes from a shared parent) is the unit of signal — which is precisely a fork off the parent trajectory, validated by execution.
27
+ - **The cross-locus dynamic:** CONVERGENCE plus a caveat: branching is the mechanism that turns "replay" into "counterfactual exploration," and both loci agree the EXECUTION ORACLE (not teacher consensus, not a learned verifier) is what certifies the fork found something real. The repo's Channel 3 today is *weaker* on this axis precisely because its fitness is teacher-plurality, not test execution — the upgrade to execution-graded branching is the core delta.
28
+ - **How the draft should engage this:** §1 and §6 must name this as the single most important upgrade over the repo's current Channel 3 (teacher-plurality fitness → execution-oracle fitness) and as the answer to the strongest adversarial prior.
29
+ - **Calibration:** flywheel HIGH that branching+oracle escapes entrenchment; the open risk both flag is that a system generating its own tasks from its own traces can drift the held-out set toward the train set.
30
+
31
+ ## Tension 5: EKS-primary is cheap to adopt in CODE but the genuinely-new cost is sandbox fan-out — which is also the throughput ceiling of the whole idea
32
+ - **Locus eks-architecture** commits: EKS-primary single-control-plane hybrid; the repo port is a ~300 LOC leaf adapter (EKSExecutor + SageMakerExecutor); BUT the one genuinely-new infra is per-branch sandbox isolation, and per-branch cold-start can dominate outer-loop wall-clock.
33
+ - **Locus credit-assignment** commits: the rollout/branching is the system's most expensive piece ($64/trace ungated vs $0.98 flat); divergence-gating is mandatory.
34
+ - **The cross-locus dynamic:** The architecture locus's "strongest counter" (sandbox cold-start dominates → demote EKS from 'primary for everything' to 'primary for control+training, bespoke pool for sandbox execution') is the SAME bottleneck the credit-assignment locus controls with divergence-gating. Infra cost and algorithmic cost are the same constraint: branch factor × sandbox cold-start. SWE-MiniSandbox (container-free kernel isolation, ~5% disk / ~25% env-prep) is the throughput primitive that makes high fan-out affordable.
35
+ - **How the draft should engage this:** §8/§10 must connect the algorithmic gate (divergence-gating, §3) to the infra primitive (cheap sandboxes, container-free or snapshotted microVM) — the two cost controls are one. Honestly flag the "demote EKS for sandboxes" fallback.
36
+ - **Calibration:** eks-architecture HIGH (8/10) on the design; explicit falsifier = measured per-branch sandbox cold-start dominating wall-clock.
37
+
38
+ ---
39
+
40
+ ## Step-8 corpus-critic confidence revisions (overturning evidence found — MUST be reflected in the draft)
41
+
42
+ **Revision A — the heterogeneity premise is DOWNGRADED (contested, not assumed-positive).** Adversarial search found substantive counter-evidence that the system's single most distinctive choice (different model family per node + cross-family DPO) may not pay for itself:
43
+ - "Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets" (2604.02460): at held-constant reasoning tokens, single-agent matches/beats multi-agent incl. the ensemble variant (the closest analogue to multi-rollout heterogeneous search); "many reported MAS gains are better explained by compute and context effects than by inherent architectural superiority"; holds across Qwen3/DeepSeek-R1/Gemini; Data-Processing-Inequality argument (one agent with full context >= split agents).
44
+ - "Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline" (2601.12307): a single-LLM baseline matched AFlow-optimized HETEROGENEOUS (GPT-4o-mini + Claude-Haiku) MCTS workflows at lower cost.
45
+ - Cross-tokenizer/cross-family distillation is "a largely unsolved problem" (2604.07466 BLD + CTPD/CDM cluster): cross-family preference transfer is fragile, sometimes DEGRADES, needs special byte-level/OT machinery.
46
+ - **Engagement guidance:** §1/§3/§4 must treat heterogeneity as a HYPOTHESIS requiring an equal-compute control arm (single strong model with N temperature/persona samples) before claiming any heterogeneity gain. The typed-train-on-all and divergence-tree positions do NOT depend on heterogeneity (they work with homogeneous N-sampling too), so the core design survives — but the "different models per node" flourish is now an ablation question, not a premise. NOTE: safeguard #4's "N>=3 population as anti-collapse diversity" SURVIVES (no source showed model-diversity gives zero anti-collapse benefit; on-policy-distillation survey ties gains to predictive diversity).
47
+
48
+ **Revision B — the world-model aux loss is DEMOTED from "necessary" to "optional, parameter-isolated, ablation-gated."** Direct 2026 counter-evidence on all three angles:
49
+ - "Reasoning and Tool-use Compete in Agentic RL" (2602.00994): jointly training two capabilities into one parameter set induces misaligned gradients / interference; decoupling into separate LoRA adapters (DART) beats joint optimization. → stacking a 2nd SDPO/next-state head onto the SAME policy head is the exact interfering configuration; argue for a separate head/adapter.
50
+ - "Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning" (2605.06840): LLMs generate deep look-ahead in CoT but move choice is causally driven by shallow depth-1 nodes — foresight content generated but NOT consumed. So improving prediction quality may not move decisions.
51
+ - "The Predictive-Causal Gap: An Impossibility Theorem" (2605.05029): pure predictive objectives provably/empirically optimize AWAY from causal/decision-relevant structure (92% lower prediction error while causal fidelity ~0).
52
+ - Counter-counter (kept honest): SPA, VAGEN, Imagine-then-Plan, FOREAGENT all report explicit future-state simulation HELPS agentic pass-rate — the field is genuinely split.
53
+ - **Engagement guidance:** §2 must reframe the aux next-state loss as OPTIONAL, in a parameter-isolated head/adapter (not fused into the policy head), gated behind the pre-registered ablation (aux-ON vs deliberation-token-RL-only) on the PRIMARY metric (pass-rate + counterfactual-foresight, NOT next-state accuracy). This matches the worldmodel investigator's OWN stated falsifier. The cheapest decisive experiment we could run ourselves is the SWE-specific next-state-head ablation (does not exist in the literature yet).
54
+
55
+ **Revision C — even the EXECUTION ORACLE gets gamed (safeguard #1 is necessary but NOT sufficient).** The flywheel locus claimed a true execution oracle is "categorically different" from a proxy and thus immune to RSI-style depth-amplified hacking. Adversarial search complicated this: EvilGenie (2511.21654), "LLMs gaming verifiers: RLVR can lead to reward hacking" (2604.15149), and "Do synthetic trajectories reflect real reward hacking" (2604.23488) show verifiable/test-based rewards ARE gamed — agents hardcode/special-case to pass FAIL_TO_PASS, exploit fractional partial-credit, and overfit held-out tests. → Engagement: §4 (oracle-cleanliness gate) and the safeguards must state that the execution oracle REDUCES but does not eliminate the hack surface; HackMonitor + held-out disjoint eval + the depth kill-switch are doing real work, not belt-and-suspenders. The oracle bounds the hack surface (finite, vs an open-ended proxy) but PASS_TO_PASS guards, test-provenance checks, and contamination control are mandatory, not optional. This makes safeguard #2 (disjoint held-out + kill-switch) MORE load-bearing, not less.
56
+
57
+ Net: the corpus critic STRENGTHENED the report by puncturing two overclaims and complicating a third. The robust core (fork-off-the-human-trace + execution oracle + typed train-on-all + two-gate prune + divergence-gated expansion + 4 safeguards + EKS-primary) is untouched; the two flourishes (heterogeneity-as-premise, aux-loss-as-necessary) become explicit ablation questions. Both shared falsifiers were independently confirmed as the right experiments.
58
+
59
+ ## Summary for the synthesizer
60
+ The five loci are NOT orthogonal — they collapse into ONE coherent design with a single through-line: **fork off the human trace with heterogeneous models, grade by a true execution oracle, gate expansion on divergence/VOI, and route the resulting branches by signal type — winners to the policy, all branches (incl. failures) to a world-model next-state head — under two hard prune gates (oracle-cleanliness, per-turn signal-presence) and four collapse safeguards.** The world-model aux loss is the keystone: it is simultaneously the project's stated goal, the safe home for failed-branch signal (resolving prune-vs-all), and the learned governance policy that drives divergence-gated expansion (controlling cost). The single most important experiment is the compute-matched, generate-once/route-many P0–P6 ablation on the repo's ADR-013 ladder, measuring calibration/foresight, not just pass@1.
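Reviewer note: the two-gate filter wrapping a typed train-on-all (Tension 1 and the summary above) can be sketched as a small routing function. This is an illustration under stated assumptions, not the repo's implementation: the `Turn` and `Branch` types, the `jsd` field, the `oracle_clean` flag, and the threshold are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    jsd: float  # hypothetical per-turn divergence signal


@dataclass
class Branch:
    passed: bool        # execution-oracle verdict for the branch
    oracle_clean: bool  # False if reward-hacked or guard-broken
    turns: list = field(default_factory=list)


def route(branches, jsd_threshold=0.0):
    """Apply both prune gates, then route every survivor by signal type."""
    routed = {"policy": [], "dpo_reject": [], "world_model": []}
    for b in branches:
        # Gate (a): oracle-cleanliness. Hacked branches are dropped entirely.
        if not b.oracle_clean:
            continue
        # Gate (b): per-turn signal presence. Zero-signal turns are skipped.
        kept = [t for t in b.turns if t.jsd > jsd_threshold]
        if not kept:
            continue
        survivor = Branch(b.passed, True, kept)
        # Every survivor, including failures, is a world-model target.
        routed["world_model"].append(survivor)
        # Winners feed the policy; losers become DPO rejects.
        routed["policy" if b.passed else "dpo_reject"].append(survivor)
    return routed
```

The point of the sketch is ordering: both gates run before any routing, so nothing reward-hacked reaches any training set, while failed but clean branches still contribute to the world-model and DPO-reject sets.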
research/critic-findings-depth.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "critic": "depth",
3
+ "findings": [
4
+ {
5
+ "severity": "high",
6
+ "section": "10. Cost, Throughput, Failure Modes (and the §3 callback at line 55)",
7
+ "issue": "The single quantitative anchor for the entire 'divergence-gating is mandatory' argument misreads its own source. The report frames '~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree.' But research/05:256 derives $64 explicitly as a FLAT replay cost ($0.008/step x 1000 steps x 8 teachers = 8000 forward passes, no branching). The repo's own flat-to-tree note (flat-multi-teacher-to-branching...md:40) states this directly: 'research/05 ... already prices the FLAT case at ~$64/trace ungated for 8 teachers x 1000 steps; a tree makes [gating] mandatory.' Both numbers the report compares are flat costs that differ only in scale (N=3 short trace = $0.98 from teacher_replay.py:7-8 spike-001; N=8 x 1000 steps = $64). Labeling the $64 figure a 'branching tree' conflates a teacher-count/length scale difference with the flat-vs-tree distinction, and badly UNDERSELLS the real tree cost: a true O(N^D) branching tree is combinatorially worse than $64, not equal to it. The argument's headline number is wrong in the direction that weakens the report's own thesis.",
8
+ "fix": "Reframe to: flat Channel-3 replay is ~$0.98/trace at N=3 (teacher_replay.py:7-8) and ~$64/trace at the 8-teacher x 1000-step scale (research/05:256) — both FLAT, O(N*T). A branching tree is O(N^D), strictly worse than either flat figure; that combinatorial blow-up (not the $0.98-to-$64 gap) is what makes divergence-gating mandatory. Drop 'branching tree' from the $64 clause.",
9
+ "anchor_quote": "~$0.98/trace flat-ungated versus ~$64/trace for an ungated eight-teacher thousand-step branching tree"
10
+ },
11
+ {
12
+ "severity": "medium",
13
+ "section": "9. The SageMaker Path and the Recommended Hybrid (also §6 reuse/build table)",
14
+ "issue": "The '~150 LOC each' executor estimate (and the '~300 LOC' combined figure in §6) undershoots the repo's only working ServerlessExecutor backend by ~2.5x and is not grounded in the existence proof the report itself cites. The report leans on ModalSpawnExecutor as the 'working proof' that calibrates the delta [42], but modal_spawn.py is 390 LOC and the executor.py reference (Protocol + LocalProcessExecutor) is 310 LOC. An EKS adapter that must handle Indexed Jobs, JOB_COMPLETION_INDEX->REPLICA_RANK mapping, GPU limits, IRSA, optional runtimeClassName, plus poll/cancel/stream_logs/collect against the Batch/Pod APIs is unlikely to be half the size of the Modal adapter. The figure reads as optimistic rather than measured, which weakens the report's load-bearing 'nine-tenths already exists / bounded delta' claim.",
15
+ "fix": "Either ground the estimate (e.g. 'ModalSpawnExecutor is 390 LOC; expect EKSExecutor in the same 300-400 LOC range') or soften to an order-of-magnitude ('a few hundred LOC each, comparable to the existing Modal adapter') instead of the precise '~150 LOC each'.",
16
+ "anchor_quote": "**`EKSExecutor` (~150 LOC, primary)**"
17
+ },
18
+ {
19
+ "severity": "low",
20
+ "section": "6. Grounding in the composer-replication-framework (reuse/build table) and §8/§9",
21
+ "issue": "The report consistently presents `EKSExecutor` as the repo's own reserved slot ('AWS leaf adapters | Build (~300 LOC) | `EKSExecutor` + `SageMakerExecutor`'). But the repo never names an EKSExecutor: the ServerlessExecutor Protocol docstring (executor.py:41) lists 'RunPodExecutor, SageMakerExecutor, K8sExecutor' as Future, and INTEGRATION_RECIPES.md:685 lists `K8sExecutor` (KubeRay/Volcano) as Roadmap. `EKSExecutor` is the report's coinage. SageMakerExecutor is a genuine repo-reserved name; EKSExecutor is not. This slightly overstates how pre-slotted the EKS path is.",
22
+ "fix": "Either note that the repo's roadmap slot is `K8sExecutor` (executor.py:41 / INTEGRATION_RECIPES.md:685) and EKSExecutor is the proposed concrete K8s implementation, or rename to `K8sExecutor` to match the repo. A one-clause parenthetical ('the repo's reserved `K8sExecutor` slot, here specialized to EKS') closes the gap.",
23
+ "anchor_quote": "`EKSExecutor` + `SageMakerExecutor` [42]"
24
+ },
25
+ {
26
+ "severity": "low",
27
+ "section": "7. What the Literature Says (Endorsements, the counterfactual-credit backbone)",
28
+ "issue": "The divergence-as-counterfactual-credit claim slightly conflates two distinct mechanisms. The report says siblings from a shared parent are 'low-variance because the shared parent differences out the baseline,' then attributes this to 'the quantity learned counterfactual-credit methods approximate with a hindsight model' [33]. But 2011.09464 (the cited note) achieves low variance via a FUTURE-CONDITIONAL (hindsight) baseline that conditions on the realized trajectory — not via a shared-parent/leave-one-out baseline (which is the standard MC advantage the repo's GRPO LOO already does). 'Shared parent differences out the baseline' is really the LOO/group-relative argument (closer to Tree-GRPO [44]), whereas the hindsight-model framing is CCA. The two are run together as if one mechanism.",
29
+ "fix": "Separate the two: the shared-parent differencing is a group-relative/LOO baseline (Tree-GRPO [44]); CCA [33] is the stronger, hindsight-conditioned variant that the executed-sibling structure approximates non-parametrically. Stating both as distinct sources of the low-variance claim is more accurate and actually strengthens the backbone.",
30
+ "anchor_quote": "low-variance because the shared parent differences out the baseline"
31
+ }
32
+ ],
33
+ "overall": "The report's core mechanism claims are unusually well-grounded — I verified each axis against source and most are faithful to the byte level. The flat->tree fitness delta (extract_dpo_pairs breaks after one teacher-plurality pair; _grade() returns masked pass-fraction) is exact. The SDPO-carrier-for-world-model claim is mechanically sound: the world-model 'splice realized observation into ctx_teacher as privileged info' reuses the same ctx_teacher = ctx_student + hint pattern, post-hint mask, and ADR-011 aligned-index gather that the real collator already implements (data_collator.py, ADR-011) — no hand-waving. Both prune gates are real: oracle-cleanliness = _grade() 0-masking (env.py:90), per-turn signal-presence = the collator empty-recovery row-drop (data_collator.py L308). ObjectStoreAllReduce is verified to the line: PUT round_{NNNNNN}/rank_{RRRR}.pt, poll-until-all-peers, mean, and the 'straggler blocks at the poll loop bounded by timeout_s=1800' claim is exactly what the code does (allreduce.py:151-162). The counterfactual-credit backbone is grounded (2011.09464 + Tree-GRPO step-level DPO equivalence), with only a minor mechanism conflation. The depth weaknesses are concentrated in the QUANTITATIVE concreteness, not the conceptual substance: the headline cost anchor mislabels a flat-scale figure as a tree cost (and thereby undersells the tree's true O(N^D) cost), the executor LOC estimates undershoot the only working backend by ~2.5x, and EKSExecutor is presented as a repo slot when the repo reserves K8sExecutor. None touch the load-bearing argument; all are surgical fixes that make the numbers honest.",
34
+ "findings_count_note": "4 findings: 1 high (cost-anchor misread), 1 medium (LOC estimate ungrounded), 2 low (naming + credit-mechanism conflation). The conceptual axes the checklist flagged are solid and I say so in overall rather than inventing nits."
35
+ }
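Reviewer note: the high-severity finding above turns on simple arithmetic, so it may help to write it out. The sketch below reproduces the flat figure quoted from research/05 ($0.008/step x 1000 steps x 8 teachers) and contrasts flat O(N*T) growth with O(N^D) tree growth. The function names are illustrative, and the depth of 5 is an arbitrary example value, not a number from the report.

```python
def flat_passes(n_teachers: int, n_steps: int) -> int:
    """Forward passes for flat replay: every teacher scores every step, O(N*T)."""
    return n_teachers * n_steps


def flat_cost(cost_per_step: float, n_steps: int, n_teachers: int) -> float:
    """Dollar cost of flat replay at a fixed per-step price."""
    return cost_per_step * n_steps * n_teachers


def tree_leaves(branch_factor: int, depth: int) -> int:
    """Leaves of an ungated tree that branches at every level, O(N^D)."""
    return branch_factor ** depth


def gated_expansions(branch_factor: int, n_decision_points: int) -> int:
    """Expansions when branching only at selected turns, ~O(N * decision points)."""
    return branch_factor * n_decision_points


if __name__ == "__main__":
    print(flat_passes(8, 1000))          # flat case from the finding
    print(round(flat_cost(0.008, 1000, 8), 2))
    print(tree_leaves(8, 5))             # ungated tree, example depth only
    print(gated_expansions(8, 5))        # same branch factor, five gated turns
```

Even at a depth of 5 the ungated tree has more leaves than the flat case has forward passes, which is the combinatorial blow-up the fix asks the report to cite in place of the $0.98-to-$64 gap.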
research/critic-findings-dialectic.json ADDED
@@ -0,0 +1,48 @@
1
+ {
2
+ "critic": "dialectic",
3
+ "overall": "The report engages all six mandated skeptic disconfirmers (single-agent>=multi 2604.02460; aux interference 2602.00994; myopic 2605.06840; predictive-causal gap 2605.05029; oracle-gamed EvilGenie/RLVR-hacking; outcome-only DeepSWE/SWE-RL) and renders the two contested flourishes (heterogeneity, world-model aux loss) as explicit pre-registered ablation arms with stated falsifiers and flip-conditions. Provenance is clean: Channel 3, the tree, and FeatureDeletionEnv bug injection are correctly attributed to the framework's own additions, never to Cursor (Ch1 Dr.GRPO + Ch2 SDPO) and never to Socratic-SWE (which the report correctly notes does NOT inject bugs). The central prune-vs-train-on-all question is committed (typed train-on-all under two hard gates), not hedged. The heterogeneity axis (§3/§7 Pushback 1) is solid and faithful to the counter-evidence note, with the equal-compute control arm and the surviving anti-collapse justification both correct. The findings below are not about missing disconfirmers but about (a) one in-repo counter-position the report straw-manned toward optimism, (b) a categorical claim its own DeepSWE source contradicts, (c) asymmetric domain-transfer skepticism applied to the pro side but not to load-bearing non-SWE disconfirmers it relies on, (d) a directly-SWE disconfirmer present in the corpus but never cited, and (e) a numerical misread. Few, high-quality.",
4
+ "findings": [
5
+ {
6
+ "severity": "high",
7
+ "section": "5. Pipeline Shape: Two Loops, Not Two Phases",
8
+ "issue": "The report straw-mans its own repo's counter-position. It claims self-distillation 'in this configuration, [is] a *stabilizer* and not only a collapse risk' citing SDFT, and treats Channel-2 SDPO as 'exactly that on-policy, demonstration-conditioned regime, not the static-synthetic-data regime that collapses.' But the repo's own ADR-013 (read in adr-decision-backbone note) states the opposite about THIS exact channel: 'SDPO against the altered model's own hint-conditioned forward pass is the channel most likely to AMPLIFY the distortion' and is 'an *experimental intervention*, not a benign stabilizer' (teacher==student-family; if hints add no independent info the optimum is to imitate the altered conditional, sharpening a soft bias into a hard preference). The report cites the optimistic external SDFT result while omitting the pessimistic in-repo finding on the very same mechanism, leaving the 'stabilizer' framing one-sided.",
9
+ "fix": "Add a clause acknowledging the repo's own counter-position: e.g. after 'is exactly that on-policy, demonstration-conditioned regime' add '— though the repo's own ADR-013 warns the same SDPO channel is the one most likely to AMPLIFY an existing distortion when the teacher is same-family and the hint adds no independent information, so the stabilizer claim holds only when the privileged-information conditioning carries genuine new signal (the per-turn JSD signal-presence gate of §4).'",
10
+ "anchor_quote": "Self-distillation in the inner loop is, in this configuration, a *stabilizer* and not only a collapse risk"
11
+ },
12
+ {
13
+ "severity": "high",
14
+ "section": "5. Pipeline Shape: Two Loops, Not Two Phases",
15
+ "issue": "Categorical overclaim contradicted by the report's own cited source. The report asserts a clean dichotomy: 'every working SWE flywheel optimizes a true execution oracle ...; every collapse story requires a proxy or self-judged verifier.' But DeepSWE [43] — cited approvingly two sentences later and throughout — documents near-collapse on a TRUE 0/1 execution oracle from positives alone: 'LLM agents may stumble upon correct patches and pass all tests without knowing. Training with these positives reinforces undesired behaviors ... leading to collapse,' which is precisely why DeepSWE needed compact filtering. So a true execution oracle did NOT prevent a collapse mode; positives on a real oracle produced it. The 'every collapse story requires a proxy' claim is falsified by the report's own evidence base.",
16
+ "fix": "Soften the dichotomy to acknowledge the positives-on-a-true-oracle collapse mode: e.g. change 'every collapse story requires a proxy or self-judged verifier' to 'most collapse stories require a proxy or self-judged verifier — though even a true execution oracle can collapse if positives reinforce accidental passes (DeepSWE's compact-filtering motivation [43]), which is a further argument for the per-turn signal gate and submit-gated credit.'",
17
+ "anchor_quote": "every collapse story requires a proxy or self-judged verifier"
18
+ },
19
+ {
20
+ "severity": "medium",
21
+ "section": "7. What the Literature Says (and Where It Pushes Back)",
22
+ "issue": "Asymmetric domain-transfer skepticism. The report disarms the pro-simulation cluster with 'none of those is a *SWE-pass-rate result at equal compute* — they are calibration, reasoning-trace, and non-SWE results.' But the report applies no such discount to two load-bearing disconfirmers that are equally non-SWE: the anti-emergence 'killer fact' (§2, [11] 2601.03905) is a vision-language-model agentic+VQA study, and 'the single most decisive result for *this* project' (§4, [27] 2503.14391) is a multiple-choice-QA Likra study, not SWE and not the DPO/GRPO regime in use. The same 'not a SWE-pass-rate result at equal compute' burden the report imposes on the pro side should be acknowledged for these anti-side pillars, or the symmetry argument is one-directional.",
23
+ "fix": "Add a one-clause symmetry caveat where the burden-shift is stated, e.g. after 'they are calibration, reasoning-trace, and non-SWE results' add '(the same domain-transfer caveat applies to the anti-side pillars — the world-model-as-tool foresight result [11] is VLM/VQA and the near-miss-calibration result [27] is MCQA — which is why the SWE-specific P0-P6 ablation, not the imported literature, is the actual decider).'",
24
+ "anchor_quote": "none of those is a *SWE-pass-rate result at equal compute*"
25
+ },
26
+ {
27
+ "severity": "medium",
28
+ "section": "2. The World-Model Goal: Training Latent What-If Deliberation",
29
+ "issue": "The anti-emergence case rests on a non-SWE study while a directly-on-domain SWE disconfirmer in the same corpus is never cited. The 'killer fact against emergence' [11] (2601.03905) is built on vision-language models over 'agentic and visual question answering tasks.' The corpus contains 2604.12147 (Plan Compliance in Autonomous Programming Agents, 16,991 SWE-agent trajectories on SWE-bench Verified + Pro across GPT-5 mini / DeepSeek-R1-V3 / Devstral) — flagged by the corpus-critic as 'the single most on-domain piece of evidence' that SWE agents fall back on memorized workflows and that a subpar/misaligned plan hurts MORE than no plan. It directly supports the report's selective-curriculum-over-naive-train-on-all thesis yet is absent from the citation list (no [49]; sources end at [48]). Grounding the anti-emergence and selective-structure arguments on a VLM/VQA study when a direct SWE result is available weakens the section.",
30
+ "fix": "Cite 2604.12147 in §2 (and/or §4) alongside [11]: e.g. after the foresight-governance sentence add 'and in SWE specifically, a study of 16,991 SWE-agent trajectories on SWE-bench finds agents revert to internalized workflows and that a misaligned plan hurts more than no plan — direct on-domain support for selective, alignment-gated structure over naive train-on-all [49].' Add the source to the Sources list.",
31
+ "anchor_quote": "handed a world model as a tool, agents invoke it under 1% of the time"
32
+ },
33
+ {
34
+ "severity": "medium",
35
+ "section": "2. The World-Model Goal: Training Latent What-If Deliberation",
36
+ "issue": "Numerical misread of the predictive-causal gap. The report says 'across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension *while achieving 92% lower prediction error*.' Per the source (the-predictive-causal-gap note), the MEAN causal fidelity across the 2,695 configurations is 0.49 (only 2.5% exceed 0.70); the ~1e-8 ('causally blind') figure and the 92%-lower-prediction-error figure are the high-dimension N=100 extreme, not the 2,695-network mean. Coupling '2,695 networks mean causal fidelity' with '~1e-8' conflates the corpus mean with the worst-case dimension and overstates the typical-case magnitude.",
37
+ "fix": "Split the two statistics: e.g. 'across 2,695 networks mean causal fidelity is 0.49 (only 2.5% exceed 0.70), and at high dimension (N=100) the optimal encoder becomes causally blind (~1e-8) *while achieving 92% lower prediction error*.'",
38
+ "anchor_quote": "across 2,695 networks mean causal fidelity collapses toward ~1e-8 at high dimension *while achieving 92% lower prediction error*"
39
+ },
40
+ {
41
+ "severity": "low",
42
+ "section": "4. The Central Question: Prune Bad Branches vs Train on All Branches",
43
+ "issue": "Under-engaged tension in the oracle-cleanliness argument. The report uses EvilGenie [30] to argue held-out tests are weak ('held-out tests giving only minimal detection improvement') and simultaneously makes the disjoint held-out eval 'the *most* load-bearing safeguard' (§4, §5 safeguard #2). EvilGenie's own finding is that the held-out-test method gave minimal improvement while the LLM JUDGE was 'highly effective at detecting reward hacking in unambiguous cases' — yet safeguard #1 forbids a learned/self-judged verifier in the training reward and the report leans on held-out eval. The report should reconcile why the safeguard it most relies on is the detector EvilGenie found weakest, and whether the LLM-judge detector (allowed only at test-time selection per safeguard #1) belongs in the monitoring stack.",
44
+ "fix": "Add a reconciling clause where EvilGenie is cited: e.g. 'EvilGenie found held-out tests weak as a *detector* but the LLM judge effective — so the held-out eval here is load-bearing as a drift TRIPWIRE (proxy-minus-realeval gain) rather than a per-trajectory hack detector, and an LLM-judge monitor is admissible for offline flagging though never as the training reward (safeguard #1).'",
45
+ "anchor_quote": "with held-out tests giving only minimal detection improvement"
46
+ }
47
+ ]
48
+ }
research/critic-findings-instruction.json ADDED
@@ -0,0 +1,14 @@
1
+ {
2
+ "critic": "instruction",
3
+ "findings": [
4
+ {
5
+ "severity": "low",
6
+ "section": "## 2. The World-Model Goal: Training Latent What-If Deliberation",
7
+ "issue": "The prompt-decomposition entity 'World-model / latent-simulation literature' lists 'Chain of World' as a required_field, and the vault holds a dedicated note (260303195-chain-of-world-world-model-thinking-in-latent-motion.md) plus a synthesis lens note on it. The report's world-model section grounds latent deliberation in CWM [13], MuZero [14], Dreamer [15], From Word to World [12], and the foresight-governance paper [11], but never names Chain-of-World. Chain-of-World is the most direct 'latent-motion world-model thinking' analogue to the query's 'world-model latent deliberation' framing, so its absence is a small but real coverage gap on an explicitly-enumerated atomic item.",
8
+ "fix": "Add a one-clause citation to Chain-of-World where the report introduces value-equivalent latent prediction, e.g. after the MuZero/Dreamer sentence on line 27, append a clause noting Chain-of-World as the SWE-adjacent 'latent-motion' precedent for thinking in a learned latent rather than reconstructing full state. Keep it to one sentence; do not expand the section.",
9
+ "anchor_quote": "MuZero and Dreamer add the design discipline: learn the *value-equivalent* latent"
10
+ }
11
+ ],
12
+ "overall": "Instruction-following is excellent and needs no structural intervention. All 11 required H2 headings appear verbatim and in the exact order specified by research/prompt-decomposition.json (lines 9, 23, 37, 61, 95, 112, 138, 163, 181, 189, 208), with the expected '## Sources' as a 12th. All six required tables are present and correctly scoped: GA mapping (line 41), P0-P6 branch-usage experiment with arms+metrics+predicted ordering+explicit falsifier (line 83), repo reuse-vs-build ledger (line 116), paradigm comparison Socratic-RL/Socratic-SWE/Composer-2.5/proposed (line 142), EKS component table (line 169), and phased build plan (line 199). Every atomic sub_question is covered: world-model goal (S2, definition + next-state-prediction signal + MuZero/Dreamer/CWM + ECE/Brier/foresight@k measurement); GA framing (S3, all six concepts populated + where-it-breaks in three named places); the CENTRAL prune-vs-train-on-all question is answered as typed train-on-all under two hard gates AND backed by a concrete generate-once/route-many P0-P6 experiment with primary metrics and a stated falsifier; the '2 sections or 1?' question is answered head-on as 'two loops at different timescales, not two phases' (line 97); EKS is UNAMBIGUOUSLY PRIMARY -- the heading is '(Primary)', line 165 opens 'EKS is primary, with a single control plane', S9 frames SageMaker as 'not a competing platform ... an inner-loop node-group swap on the same control plane', and the honest EKS-demotion path on line 193 is scoped strictly to the sandbox-execution pool, never the control/training plane; the SageMaker/HyperPod path is concrete (1:1 control-plane mapping, ~150 LOC SageMakerExecutor, Training-Jobs-vs-HyperPod selection); repo grounding is dense and file-line-anchored throughout (teacher_replay.py, env.py, claude_code.py, composer_trainer.py, ADRs); cost/throughput is quantified ($0.98 vs $64/trace, 60-80% gating savings, $0.05/round comm, 50-70% Spot, SWE-MiniSandbox ~5%/~25%). The provenance guardrail (Channel 3 + tree are the framework's OWN additions, Cursor = Ch1 Dr.GRPO + Ch2 SDPO) is stated up front (line 7) and honored repo-wide. The single finding is a low-severity missing-citation nit (Chain-of-World), not a coverage failure.",
13
+ "findings_note": "one low finding only; axis is solid"
14
+ }
research/critic-findings-width.json ADDED
@@ -0,0 +1,48 @@
1
+ {
2
+ "critic": "width",
3
+ "findings": [
4
+ {
5
+ "severity": "high",
6
+ "section": "## 7. What the Literature Says (and Where It Pushes Back)",
7
+ "issue": "The process-vs-outcome cluster is the empirical backbone of the entire §4 prune-vs-train-on-all argument, yet its two foundational papers are named once with WRONG citations and given no source IDs. Line 161 reads 'process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato) [19][27]' — but [19] is the 'LLM-Based World Models' paper (arXiv:2411.08794) and [27] is 'How Much Do LLMs Learn From Negative Examples' (arXiv:2503.14391). Neither is Let's Verify nor Uesato. Both papers have dedicated, on-point vault notes (Lightman et al. 2305.20050 'Let's Verify Step by Step' — PRM beats ORM on MATH, releases PRM800K; Uesato et al. 2211.14275 — first head-to-head process-vs-outcome on GSM8K, process feedback cuts reasoning error 14.0%→3.4%) that carry zero source IDs anywhere in the report. The single sentence the skeptic-rebuttal rests on mis-attributes its evidence.",
8
+ "fix": "Add two new Sources entries: '[49] Let's Verify Step by Step (Lightman et al.) — arXiv:2305.20050 (PRM process supervision beats ORM on MATH; releases PRM800K)' and '[50] Solving math word problems with process- and outcome-based feedback (Uesato et al.) — arXiv:2211.14275 (first process-vs-outcome head-to-head; process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)'. Then change the line-161 citation from '(Let's Verify, Uesato) [19][27]' to '(Let's Verify [49]; Uesato [50] — process feedback cuts reasoning error 14.0%→3.4% at final-answer parity)' so the named papers carry their own IDs.",
9
+ "anchor_quote": "process supervision genuinely beats outcome on reasoning traces (Let's Verify, Uesato), the world-model field is split"
10
+ },
11
+ {
12
+ "severity": "high",
13
+ "section": "## 1. What We Are Actually Building: From Multi-Teacher Replay to a Counterfactual Tree of Work",
14
+ "issue": "SWE-Search (arXiv:2410.20285) is the single closest published analogue to the proposed system — MCTS over repository-level SWE tasks with per-node value estimation, backtracking, and self-feedback — and the report names it ('SWE-Search expands nodes with one policy') but gives it no source ID and never engages its central, decision-relevant findings: a 23% relative SWE-bench improvement across five models from search ALONE, and the explicit result that performance scales with inference-time compute 'without requiring larger models or additional training data.' That is a sharper version of Pushback 3's skeptic case (does the tree's gain need training at all, or is it just test-time search?) and a direct input to the §7 paradigm table, yet the vault note is completely unused. Leaving the closest prior art uncited weakens the 'claim the synthesis, not the parts' provenance argument.",
15
+ "fix": "Add a Sources entry '[51] SWE-Search: Enhancing Software Agents with MCTS and Iterative Refinement — arXiv:2410.20285 (23% relative SWE-bench gain from search alone, single policy, scales with inference-time compute, no extra training)'. Tag the existing §1 mention 'SWE-Search expands nodes with one policy [51]', and add one clause to Pushback 3 (§7) noting SWE-Search already shows per-node SWE search helps at TEST time without training — so the tree must justify the marginal value of folding that search into TRAINING, not just adding search.",
16
+ "anchor_quote": "SWE-Search expands nodes with one policy; Symphony does heterogeneous-LM planning"
17
+ },
18
+ {
19
+ "severity": "medium",
20
+ "section": "## 3. The Genetic-Algorithm Framing — Where It Holds and Where It Breaks",
21
+ "issue": "Symphony (arXiv:2601.22623, NeurIPS 2025) is the strongest pro-heterogeneity result in the vault — a heterogeneous-LM MCTS planner whose explicit thesis is that single-agent MCTS yields 'insufficient diversity among generated branches' and that a heterogeneous LM pool 'enhances rollout diversity and facilitates more effective exploration,' outperforming SOTA when given API models. The report names Symphony once in §1 with no source ID, then builds Pushback 1 (heterogeneity-is-a-hypothesis) almost entirely on the anti-heterogeneity sources [21][22][23], leaving the heterogeneity-as-ablation framing under-steelmanned. Symphony is precisely the source that says the system's distinctive choice (different model per node) buys exploration diversity — the very 'anti-collapse diversity' the report concedes survives (safeguard 4) but does not source on the capability side.",
22
+ "fix": "Add a Sources entry '[52] SYMPHONY: Synergistic Multi-agent Planning with Heterogeneous LM Assembly — arXiv:2601.22623 (NeurIPS 2025; single-agent MCTS gives insufficient branch diversity; heterogeneous LM pool improves rollout diversity and exploration)'. In §3 Pushback 2 (heterogeneity), add a sentence acknowledging Symphony [52] as the counter-result that frames heterogeneity's surviving justification (exploration/branch diversity), so the ablation is set up as a genuine two-sided question rather than a near-foregone demotion.",
23
+ "anchor_quote": "Symphony does heterogeneous-LM planning"
24
+ },
25
+ {
26
+ "severity": "medium",
27
+ "section": "## 2. The World-Model Goal: Training Latent What-If Deliberation",
28
+ "issue": "Section 2 grounds the latent-deliberation 'value-equivalent / never reconstruct the full state' argument on MuZero [14] and Dreamer [15] (both pre-LLM RL) but leaves the most on-point 2026 vault note — Chain of World (arXiv:2603.03195, CVPR 2026) — entirely uncited. Chain of World is precisely a 'World Model Thinking in Latent Motion' paradigm that factorizes dynamics into a disentangled latent and predicts terminal state rather than reconstructing redundant background — the exact value-equivalent-latent point the report wants to make for SWE ('predict the signed FAIL_TO_PASS delta, never reconstruct the full token sea'). prompt-decomposition.json explicitly lists 'Chain of World' as a required field of the world-model literature cluster, so its absence is a coverage miss against the decomposition.",
29
+ "fix": "Add a Sources entry '[53] Chain of World: World Model Thinking in Latent Motion — arXiv:2603.03195 (CVPR 2026; disentangled latent-motion world model predicts terminal state instead of reconstructing redundant background)'. In §2, append to the MuZero/Dreamer value-equivalent sentence a clause: 'and the latent-motion line carries the same discipline into 2026 — factorize dynamics into a compact latent and predict the consequential terminal state, not the full frame [53]', tying the embodied result to the SWE next-state-delta target.",
30
+ "anchor_quote": "never reconstruct the full state, a high-entropy sea of irrelevant tokens [14][15]"
31
+ },
32
+ {
33
+ "severity": "medium",
34
+ "section": "## 4. The Central Question: Prune Bad Branches vs Train on All Branches",
35
+ "issue": "The report cites EvilGenie [30] only for its hacking-prevalence half ('explicit hardcoding / test-file edits by Codex and Claude Code') and then declares the disjoint held-out eval 'the most load-bearing safeguard.' But EvilGenie's other headline finding is that an LLM judge is HIGHLY EFFECTIVE at detecting reward hacking in unambiguous cases while held-out unit tests give only 'minimal improvement.' The report uses the held-out-is-weak half (line 79) but omits the LLM-judge-is-strong half — which is decision-relevant: it suggests a cheaper, validated hack DETECTOR (distinct from a learned reward) that the report's own safeguard framing ('learned verifier allowed only at test-time selection') would permit. Omitting it makes the held-out eval look like the only option when the source the report already cites offers a complementary one.",
36
+ "fix": "In §4 (line 79) after 'with held-out tests giving only minimal detection improvement', add: '— while in the same study an LLM judge proved highly effective at flagging unambiguous hacks, suggesting an offline LLM-judge hack-detector (never a training reward) as a cheaper complement to the held-out gate [30].' This uses the already-cited [30] note's second finding without adding a source.",
37
+ "anchor_quote": "with held-out tests giving only minimal detection improvement"
38
+ },
39
+ {
40
+ "severity": "low",
41
+ "section": "## 8. Implementing on AWS EKS (Primary)",
42
+ "issue": "The SWE-rebench / Nebius infrastructure vault note (behind-swe-rebench, a 26KB substantive source on production SWE-task collection + eval-at-scale, evaluating thousands of SWE instances per hour with distributed container orchestration on TractoAI) is unused. It is the most directly-relevant existence proof for the report's central infra claim that mass SWE-task sandboxing/eval is an established distributed pattern — and for §6's task-construction discussion (mining (problem, test-set) pairs from resolved GitHub issues, exactly the FeatureDeletionEnv substrate-inversion pattern). The EKS section leans on DeepSWE's 512-container limit [43] but omits the one note built specifically around scaling SWE-task execution infrastructure.",
43
+ "fix": "Add a Sources entry '[54] Behind SWE-rebench: infrastructure to collect/evaluate SWE tasks at scale — nebius.com (distributed container orchestration evaluating thousands of SWE instances/hour; (problem,test-set) pairs mined from resolved GitHub issues)'. In §8's data-plane or throughput discussion, add one clause citing [54] as production evidence that thousands-per-hour distributed SWE-task execution is an established pattern, reinforcing the 'EKS-primary is cheap to adopt' claim.",
44
+ "anchor_quote": "DeepSWE itself ran rollout collection on Kubernetes with a Cluster Autoscaler over 1000+ CPU cores [45][43]."
45
+ }
46
+ ],
47
+ "overall": "The report is unusually wide and dense: 48 source IDs, all 11 required section headings present, and genuinely deep engagement with the hardest clusters — reward-hacking-with-verifiable-rewards (EvilGenie/RLVR-gaming/synthetic-trajectories all cited [29][30][31]), self-evolving-collapse (survey §8.3 [38], RSI [29], Self-Play-SWE-RL [8]), the negatives/credit-assignment cluster ([25][26][27][28][33]), and the EKS sandbox/SWE-MiniSandbox/KubeRay/verl/HyperPod stack ([41][42][45][46][48]) are all well-used. Width is therefore strong, not weak. The findings are targeted, not a scattershot: the single highest-value gap is a citation MISMATCH — the process-vs-outcome backbone (Let's Verify [Lightman 2305.20050] and Uesato 2211.14275), which underpins all of §4, is named once with the wrong source IDs and the two foundational vault notes carry no IDs at all. The second is the omission of SWE-Search (2410.20285), the closest published per-turn-MCTS-on-SWE prior art, named but uncited and unengaged on its sharpest point (search helps at test time without training). The remaining four are smaller: an under-steelmanned pro-heterogeneity source (Symphony), the most on-point 2026 latent world-model note (Chain of World, also a decomposition required-field miss), a half-used EvilGenie finding (LLM-judge detector), and an unused SWE-rebench infra note. Fixing the two high-severity items (both surgical Source-list additions plus a one-line citation correction) materially strengthens the report's evidentiary spine; the rest are optional polish. No structural rework needed."
48
+ }
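The critic files above share one shape: a `critic` name, a `findings` list whose entries carry `severity`, `section`, `issue`, `fix`, and `anchor_quote`, and an `overall` summary. A minimal sketch of how such a file might be triaged is shown here, assuming the fixes are applied by locating each `anchor_quote` verbatim in the report text. The function name `triage`, the severity ordering, and the sample report excerpt are illustrative assumptions, not part of the repository.

```python
import json

# Lower rank sorts first; unknown severities sort last.
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}


def triage(findings_doc, report_text):
    """Order findings by severity and check each anchor_quote against the report."""
    rows = []
    for finding in findings_doc["findings"]:
        rows.append({
            "severity": finding["severity"],
            "section": finding["section"],
            # A fix can only be placed mechanically if the quote matches verbatim.
            "anchor_found": finding["anchor_quote"] in report_text,
        })
    rows.sort(key=lambda row: SEVERITY_RANK.get(row["severity"], len(SEVERITY_RANK)))
    return rows


# Hypothetical two-finding document and report excerpt, for illustration only.
SAMPLE_DOC = json.loads("""
{
  "critic": "width",
  "findings": [
    {"severity": "medium", "section": "## 3", "issue": "...", "fix": "...",
     "anchor_quote": "Symphony does heterogeneous-LM planning"},
    {"severity": "high", "section": "## 1", "issue": "...", "fix": "...",
     "anchor_quote": "SWE-Search expands nodes with one policy"}
  ]
}
""")
SAMPLE_REPORT = "SWE-Search expands nodes with one policy; other text follows."

ROWS = triage(SAMPLE_DOC, SAMPLE_REPORT)
```

A finding whose `anchor_found` is false would need manual placement, which is one reason to keep anchor quotes short and copied exactly from the report.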
research/loci.json ADDED
@@ -0,0 +1,68 @@
1
+ {
2
+ "loci": [
3
+ {
4
+ "name": "prune-vs-train-on-all",
5
+ "one_line": "Does training on losing/failed branches (vs pruning to winners-only) better instill counterfactual foresight + introspection — and HOW must negatives be used to help rather than destabilize?",
6
+ "flavor": "dialectical",
7
+ "importance": 10,
8
+ "uncertainty": 9,
9
+ "disagreement": 9,
10
+ "decision_impact": 10,
11
+ "composite_score": 38,
12
+ "source_budget": 15,
13
+ "rationale": "The user's explicitly-named CENTRAL question. Genuine empirical fork: RAFT/positives-only is stable & competitive (2504.11343) and naive negative gradient destabilizes (2505.18830), vs negatives carry unique signal that improves agent tuning (2402.11651, 2503.14391, expert-failures). Resolving it changes the entire dataset-construction design (prune the tree vs keep it as typed signal). Must produce an argued position + concrete experiment, grounded in the repo's ADR-013 A0-A4 ladder."
14
+ },
15
+ {
16
+ "name": "worldmodel-latent-deliberation",
17
+ "one_line": "Can latent 'what-if' deliberation (predict next repo-state before acting) be trained into a SWE agent via an auxiliary next-state-prediction objective, or does it emerge from scale — and how do you measure it?",
18
+ "flavor": "dialectical",
19
+ "importance": 9,
20
+ "uncertainty": 8,
21
+ "disagreement": 7,
22
+ "decision_impact": 9,
23
+ "composite_score": 33,
24
+ "source_budget": 12,
25
+ "rationale": "The user's core GOAL (the 'world-model thinking' aim). Fork: LLMs are implicit world models / emerges from scale (2512.18832, 2411.08794) vs agents fail to USE world models for foresight without explicit training (2601.03905) + MuZero/Chain-of-World train it explicitly (1911.08265, 2603.03195). Decision-relevant: determines whether to add the aux loss + a deliberation token, and how to measure (calibration / foresight accuracy). Must map onto the repo's SDPO channel as the natural carrier."
26
+ },
27
+ {
28
+ "name": "selfevolve-flywheel-vs-collapse",
29
+ "one_line": "Does the closed-loop multi-model MCTS + self-distillation flywheel compound improvement, or collapse into reward-hacking / diversity-loss / human-trace entrenchment — and what design choices prevent collapse?",
30
+ "flavor": "dialectical",
31
+ "importance": 9,
32
+ "uncertainty": 8,
33
+ "disagreement": 8,
34
+ "decision_impact": 9,
35
+ "composite_score": 34,
36
+ "source_budget": 11,
37
+ "rationale": "Determines whether the whole genetic-algorithm flywheel is sound. Strong adversarial convergence (reward-hacking worsens with depth — RSI ICLR2026; collapse from closed-loop self-distillation — self-evolving survey §8.3; replay entrenches human distribution — Self-Play-SWE-RL 2512.18552) vs working flywheels (Socratic-SWE +7.8, DeepSWE, SWE-RL). Resolution = keep a true execution ORACLE + heterogeneous-model population as anti-collapse diversity. High decision impact on safeguards."
38
+ },
39
+ {
40
+ "name": "credit-assignment-tree-as-process-signal",
41
+ "one_line": "Does the multi-model tree's divergence structure give cheap, dense PROCESS-level credit assignment that beats outcome-only RL — without training a separate PRM?",
42
+ "flavor": "technical",
43
+ "importance": 8,
44
+ "uncertainty": 6,
45
+ "disagreement": 7,
46
+ "decision_impact": 8,
47
+ "composite_score": 29,
48
+ "source_budget": 8,
49
+ "rationale": "The mechanism that makes the idea pay off. Process-supervision helps (Let's-Verify 2305.20050, PRM 2211.14275, Cursor's own targeted-feedback motivation) vs outcome-only suffices (DeepSWE, SWE-RL, min-form 2504.15275). The tree manufactures process signal cheaply from divergence + auto-generated textual feedback (wiring into the SDPO hint hook). Counterfactual credit-assignment theory (2011.09464, 2306.16803) is the formal backbone. Technical synthesis, moderate uncertainty."
50
+ },
51
+ {
52
+ "name": "eks-architecture-and-substrate-mapping",
53
+ "one_line": "What is the concrete EKS-primary (+ SageMaker-hybrid) architecture, and what is the minimal delta to map the repo's ServerlessExecutor/ObjectStoreAllReduce/DiLoCo onto it?",
54
+ "flavor": "technical",
55
+ "importance": 10,
56
+ "uncertainty": 4,
57
+ "disagreement": 5,
58
+ "decision_impact": 9,
59
+ "composite_score": 28,
60
+ "source_budget": 10,
61
+ "rationale": "The explicit DELIVERABLE ('how we could do it on sagemaker and/or eks, eks primarily'). Lower uncertainty (AWS-documented patterns: JARK/verl-on-EKS, KubeRay, Karpenter, GPU time-slicing/MIG, gVisor/Kata sandboxes, HyperPod) but very high decision impact — the report must commit to a concrete design. Includes the EKSExecutor delta, the sandbox-fan-out, the outer/inner loop placement, and the EKS-vs-SageMaker hybrid split."
62
+ }
63
+ ],
64
+ "skip_loci": [
65
+ {"name": "multimodel-tree-novelty-claim", "reason": "Resolved without depth: the honest position is the COMBINATION is novel, not the primitives (SWE-Search/tree-search use single models; Symphony mixes models for planning; Channel 3 already does flat multi-teacher). Folds into §1 framing, not a depth locus."},
66
+ {"name": "which-RL-engine-trl-vs-verl-vs-prime-rl", "reason": "Already decided in repo (ADR-006: TRL hosts SDPO since it needs full logits; verl/PRIME-RL for scale-out). Engineering choice, reported in §6/§8, not a contested research locus."}
67
+ ]
68
+ }
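In every locus above, the recorded `composite_score` equals the plain sum of `importance`, `uncertainty`, `disagreement`, and `decision_impact`. That reading is inferred from the numbers; the file does not state a formula. The sketch below re-derives the scores under that assumption and ranks the loci. The helper name `composite` is hypothetical.

```python
# Axis names and values are copied from research/loci.json as shown in the diff.
AXES = ("importance", "uncertainty", "disagreement", "decision_impact")

LOCI = [
    {"name": "prune-vs-train-on-all", "importance": 10, "uncertainty": 9,
     "disagreement": 9, "decision_impact": 10, "composite_score": 38},
    {"name": "worldmodel-latent-deliberation", "importance": 9, "uncertainty": 8,
     "disagreement": 7, "decision_impact": 9, "composite_score": 33},
    {"name": "selfevolve-flywheel-vs-collapse", "importance": 9, "uncertainty": 8,
     "disagreement": 8, "decision_impact": 9, "composite_score": 34},
    {"name": "credit-assignment-tree-as-process-signal", "importance": 8,
     "uncertainty": 6, "disagreement": 7, "decision_impact": 8, "composite_score": 29},
    {"name": "eks-architecture-and-substrate-mapping", "importance": 10,
     "uncertainty": 4, "disagreement": 5, "decision_impact": 9, "composite_score": 28},
]


def composite(locus):
    """Assumed formula: unweighted sum of the four axis scores."""
    return sum(locus[axis] for axis in AXES)


# Loci whose recorded score disagrees with the assumed formula (none expected).
MISMATCHES = [l["name"] for l in LOCI if composite(l) != l["composite_score"]]

# Highest composite first.
RANKED = [l["name"] for l in sorted(LOCI, key=composite, reverse=True)]
```

Note that `source_budget` in the file does not follow the composite ranking exactly, so it appears to be set by a separate judgment rather than derived from the score.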
research/notes/191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model.md ADDED
@@ -0,0 +1,258 @@
1
+ ---
2
+ title: '[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned
3
+ Model'
4
+ id: 191108265-mastering-atari-go-chess-and-shogi-by-planning-with-a-learned-model
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:01.612030Z'
8
+ updated: '2026-06-09T04:22:19.478605Z'
9
+ source: https://arxiv.org/abs/1911.08265
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:22:01.506414Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: '[1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned
19
+ Model'
20
+ ---
21
+
22
+ [1911.08265] Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
+ arXiv:1911.08265 (cs)
+ [Submitted on 19 Nov 2019 (v1), last revised 21 Feb 2020 (this version, v2)]
+ Title: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model
+ Authors: Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, David Silver
+ Abstract:
58
+ Constructing agents with planning capabilities has long been one of the main challenges in the pursuit of artificial intelligence. Tree-based planning methods have enjoyed huge success in challenging domains, such as chess and Go, where a perfect simulator is available. However, in real-world problems the dynamics governing the environment are often complex and unknown. In this work we present the MuZero algorithm which, by combining a tree-based search with a learned model, achieves superhuman performance in a range of challenging and visually complex domains, without any knowledge of their underlying dynamics. MuZero learns a model that, when applied iteratively, predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and the value function. When evaluated on 57 different Atari games - the canonical video game environment for testing AI techniques, in which model-based planning approaches have historically struggled - our new algorithm achieved a new state of the art. When evaluated on Go, chess and shogi, without any knowledge of the game rules, MuZero matched the superhuman performance of the AlphaZero algorithm that was supplied with the game rules.
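The abstract's central idea is a learned model that is applied iteratively and predicts only reward, policy, and value at each step. A toy sketch of that unroll structure follows. The three functions mirror the representation, dynamics, and prediction roles described in the MuZero paper, but their bodies are hand-written stand-ins (assumptions for illustration), not learned networks and not the paper's `pseudocode.py`.

```python
import math


def represent(observation):
    """Encode a raw observation into a scalar latent state (toy: the mean)."""
    return sum(observation) / len(observation)


def dynamics(state, action):
    """Advance the latent state and emit a predicted reward (toy linear rule)."""
    next_state = 0.9 * state + 0.1 * action
    reward = next_state - state
    return next_state, reward


def predict(state):
    """Map a latent state to a two-action policy (softmax) and a value."""
    exps = [math.exp(state), math.exp(-state)]
    total = sum(exps)
    policy = [e / total for e in exps]
    value = math.tanh(state)
    return policy, value


def unroll(observation, actions):
    """Apply the model iteratively in latent space, never rebuilding the observation."""
    state = represent(observation)
    steps = []
    for action in actions:
        state, reward = dynamics(state, action)
        policy, value = predict(state)
        steps.append({"reward": reward, "policy": policy, "value": value})
    return steps


# Hypothetical observation and action sequence.
STEPS = unroll([0.0, 1.0, 2.0], [1, 0, 1])
```

The point of the structure is that planning consumes only the three predicted quantities, so the latent state is free to discard anything that does not affect them.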
+ Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
+ Cite as: arXiv:1911.08265 [cs.LG] (or arXiv:1911.08265v2 [cs.LG] for this version)
+ DOI: https://doi.org/10.48550/arXiv.1911.08265 (arXiv-issued DOI via DataCite)
+ Related DOI: https://doi.org/10.1038/s41586-020-03051-4
+ Submission history (from Julian Schrittwieser):
+ [v1] Tue, 19 Nov 2019 13:58:52 UTC (3,106 KB)
+ [v2] Fri, 21 Feb 2020 18:05:30 UTC (2,973 KB)
+ Ancillary files: atari_evaluations.json, atari_repeatability.json, atari_results.json, atari_scaling.json, atari_trainX_evalX.json, board_game_elos.json, go_policy_improvement.json, go_scaling.json, pseudocode.py, qlearning_pacman_ablations.json (5 additional files not shown)
research/notes/201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning.md ADDED
@@ -0,0 +1,229 @@
1
+ ---
2
+ title: '[2011.09464] Counterfactual Credit Assignment in Model-Free Reinforcement
3
+ Learning'
4
+ id: 201109464-counterfactual-credit-assignment-in-model-free-reinforcement-learning
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:23:24.356402Z'
8
+ updated: '2026-06-09T04:23:49.899744Z'
9
+ source: https://arxiv.org/abs/2011.09464
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:23:23.972635Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: 'Foundational counterfactual credit assignment in RL: future-conditional
19
+ baselines/critics separate skill from luck (action''s true influence on reward)
20
+ at provably low variance — the theory under ''where the trace diverged is the high-value
21
+ signal''.'
22
+ ---
23
+
24
+ [2011.09464] Counterfactual Credit Assignment in Model-Free Reinforcement Learning
+ arXiv:2011.09464 (cs)
+ [Submitted on 18 Nov 2020 (v1), last revised 14 Dec 2021 (this version, v2)]
+ Title: Counterfactual Credit Assignment in Model-Free Reinforcement Learning
+ Authors: Thomas Mesnard, Théophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Tom Stepleton, Nicolas Heess, Arthur Guez, Éric Moulines, Marcus Hutter, Lars Buesing, Rémi Munos
+ Abstract:
64
+ Credit assignment in reinforcement learning is the problem of measuring an action's influence on future rewards. In particular, this requires separating skill from luck, i.e. disentangling the effect of an action on rewards from that of external factors and subsequent actions. To achieve this, we adapt the notion of counterfactuals from causality theory to a model-free RL setup. The key idea is to condition value functions on future events, by learning to extract relevant information from a trajectory. We formulate a family of policy gradient algorithms that use these future-conditional value functions as baselines or critics, and show that they are provably low variance. To avoid the potential bias from conditioning on future information, we constrain the hindsight information to not contain information about the agent's actions. We demonstrate the efficacy and validity of our algorithm on a number of illustrative and challenging problems.
65
+ Subjects:
66
+ Machine Learning (cs.LG)
67
+ Cite as:
68
+ arXiv:2011.09464
69
+ [cs.LG]
70
+ (or
71
+ arXiv:2011.09464v2
72
+ [cs.LG]
73
+ for this version)
74
+ https://doi.org/10.48550/arXiv.2011.09464
75
+ Focus to learn more
76
+ arXiv-issued DOI via DataCite
77
+ Submission history
78
+ From: Thomas Mesnard [
79
+ view email
80
+ ]
81
+ [v1]
82
+ Wed, 18 Nov 2020 18:41:44 UTC (25,181 KB)
83
+ [v2]
84
+ Tue, 14 Dec 2021 13:36:12 UTC (3,053 KB)
research/notes/221114275-solving-math-word-problems-with-process-and-outcome-based-feedback.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ title: '[2211.14275] Solving math word problems with process- and outcome-based feedback'
+ id: 221114275-solving-math-word-problems-with-process-and-outcome-based-feedback
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:23:24.366439Z'
+ updated: '2026-06-09T04:23:56.531901Z'
+ source: https://arxiv.org/abs/2211.14275
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:23:24.269109Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ tier: institutional
+ content_type: paper
+ deprecated: false
+ summary: 'DeepMind (Uesato et al. 2022): the original head-to-head of process- vs
+   outcome-based feedback; final-answer error parity but process feedback drastically
+   cuts reasoning/trace errors; motivates rewarding the path, not just the result.'
+ ---
+
+ [2211.14275] Solving math word problems with process- and outcome-based feedback
+ arXiv:2211.14275 (cs)
+ [Submitted on 25 Nov 2022]
+ Authors: Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, Irina Higgins
+ Abstract: Recent work has shown that asking language models to generate reasoning steps improves performance on many reasoning tasks. When moving beyond prompting, this raises the question of how we should supervise such models: outcome-based approaches which supervise the final result, or process-based approaches which supervise the reasoning process itself? Differences between these approaches might naturally be expected not just in final-answer errors but also in reasoning errors, which can be difficult to detect and are problematic in many real-world domains such as education. We run the first comprehensive comparison between process- and outcome-based approaches trained on a natural language task, GSM8K. We find that pure outcome-based supervision produces similar final-answer error rates with less label supervision. However, for correct reasoning steps we find it necessary to use process-based supervision or supervision from learned reward models that emulate process-based feedback. In total, we improve the previous best results from 16.8% $\to$ 12.7% final-answer error and 14.0% $\to$ 3.4% reasoning error among final-answer-correct solutions.
+ Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+ Cite as: arXiv:2211.14275 [cs.LG] (or arXiv:2211.14275v1 [cs.LG] for this version)
+ https://doi.org/10.48550/arXiv.2211.14275 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Jonathan Uesato
+ [v1] Fri, 25 Nov 2022 18:19:44 UTC (306 KB)
research/notes/230104104-mastering-diverse-domains-through-world-models.md ADDED
@@ -0,0 +1,196 @@
+ ---
+ title: '[2301.04104] Mastering Diverse Domains through World Models'
+ id: 230104104-mastering-diverse-domains-through-world-models
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:22:01.614454Z'
+ updated: '2026-06-09T04:22:19.838964Z'
+ source: https://arxiv.org/abs/2301.04104
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:22:01.600829Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ tier: institutional
+ content_type: paper
+ deprecated: false
+ summary: '[2301.04104] Mastering Diverse Domains through World Models'
+ ---
+
+ [2301.04104] Mastering Diverse Domains through World Models
+ arXiv:2301.04104 (cs)
+ [Submitted on 10 Jan 2023 (v1), last revised 17 Apr 2024 (this version, v2)]
+ Authors: Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap
+ Abstract: Developing a general algorithm that learns to solve tasks across a wide range of applications has been a fundamental challenge in artificial intelligence. Although current reinforcement learning algorithms can be readily applied to tasks similar to what they have been developed for, configuring them for new application domains requires significant human expertise and experimentation. We present DreamerV3, a general algorithm that outperforms specialized methods across over 150 diverse tasks, with a single configuration. Dreamer learns a model of the environment and improves its behavior by imagining future scenarios. Robustness techniques based on normalization, balancing, and transformations enable stable learning across domains. Applied out of the box, Dreamer is the first algorithm to collect diamonds in Minecraft from scratch without human data or curricula. This achievement has been posed as a significant challenge in artificial intelligence that requires exploring farsighted strategies from pixels and sparse rewards in an open world. Our work allows solving challenging control problems without extensive experimentation, making reinforcement learning broadly applicable.
+ Comments: Website: this https URL
+ Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
+ Cite as: arXiv:2301.04104 [cs.AI] (or arXiv:2301.04104v2 [cs.AI] for this version)
+ https://doi.org/10.48550/arXiv.2301.04104 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Danijar Hafner
+ [v1] Tue, 10 Jan 2023 18:12:16 UTC (2,210 KB)
+ [v2] Wed, 17 Apr 2024 17:41:20 UTC (2,520 KB)
research/notes/230520050-lets-verify-step-by-step.md ADDED
@@ -0,0 +1,207 @@
+ ---
+ title: '[2305.20050] Let''s Verify Step by Step'
+ id: 230520050-lets-verify-step-by-step
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:23:24.363053Z'
+ updated: '2026-06-09T04:23:54.507242Z'
+ source: https://arxiv.org/abs/2305.20050
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:23:24.177998Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ tier: institutional
+ content_type: paper
+ deprecated: false
+ summary: 'OpenAI (Lightman et al. 2023): process supervision (PRM, per-step labels)
+   substantially outperforms outcome supervision (ORM) on MATH and yields more reliable
+   reward; the canonical empirical case that step-level credit beats outcome-only.'
+ ---
+
+ [2305.20050] Let's Verify Step by Step
+ arXiv:2305.20050 (cs)
+ [Submitted on 31 May 2023]
+ Authors: Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
+ Abstract: In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
+ Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
+ Cite as: arXiv:2305.20050 [cs.LG] (or arXiv:2305.20050v1 [cs.LG] for this version)
+ https://doi.org/10.48550/arXiv.2305.20050 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Karl Cobbe
+ [v1] Wed, 31 May 2023 17:24:00 UTC (10,363 KB)
research/notes/230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter.md ADDED
@@ -0,0 +1,205 @@
+ ---
+ title: '[2306.16803] Would I have gotten that reward? Long-term credit assignment
+   by counterfactual contribution analysis'
+ id: 230616803-would-i-have-gotten-that-reward-long-term-credit-assignment-by-counter
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:23:24.359888Z'
+ updated: '2026-06-09T04:23:53.115536Z'
+ source: https://arxiv.org/abs/2306.16803
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:23:24.081414Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ tier: institutional
+ content_type: paper
+ deprecated: false
+ summary: 'HCA+/Counterfactual Contribution Analysis (NeurIPS 2023): estimates each
+   action''s marginal contribution to long-term reward via hindsight/counterfactual
+   models; process-level credit at trajectory scale, directly applicable to crediting
+   the divergence step in a tree-of-work branch.'
+ ---
+
+ [2306.16803] Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis
+ arXiv:2306.16803 (cs)
+ [Submitted on 29 Jun 2023 (v1), last revised 31 Oct 2023 (this version, v2)]
+ Authors: Alexander Meulemans, Simon Schug, Seijin Kobayashi, Nathaniel Daw, Gregory Wayne
+ Abstract: To make reinforcement learning more sample efficient, we need better credit assignment methods that measure an action's influence on future rewards. Building upon Hindsight Credit Assignment (HCA), we introduce Counterfactual Contribution Analysis (COCOA), a new family of model-based credit assignment algorithms. Our algorithms achieve precise credit assignment by measuring the contribution of actions upon obtaining subsequent rewards, by quantifying a counterfactual query: 'Would the agent still have reached this reward if it had taken another action?'. We show that measuring contributions w.r.t. rewarding states, as is done in HCA, results in spurious estimates of contributions, causing HCA to degrade towards the high-variance REINFORCE estimator in many relevant environments. Instead, we measure contributions w.r.t. rewards or learned representations of the rewarding objects, resulting in gradient estimates with lower variance. We run experiments on a suite of problems specifically designed to evaluate long-term credit assignment capabilities. By using dynamic programming, we measure ground-truth policy gradients and show that the improved performance of our new model-based credit assignment methods is due to lower bias and variance compared to HCA and common baselines. Our results demonstrate how modeling action contributions towards rewarding outcomes can be leveraged for credit assignment, opening a new path towards sample-efficient reinforcement learning.
+ Comments: NeurIPS 2023 spotlight
+ Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
+ Cite as: arXiv:2306.16803 [cs.LG] (or arXiv:2306.16803v2 [cs.LG] for this version)
+ https://doi.org/10.48550/arXiv.2306.16803 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Simon Schug
+ [v1] Thu, 29 Jun 2023 09:27:27 UTC (1,429 KB)
+ [v2] Tue, 31 Oct 2023 10:28:50 UTC (1,611 KB)
70
+ Full-text links:
71
+ Access Paper:
72
+ View a PDF of the paper titled Would I have gotten that reward? Long-term credit assignment by counterfactual contribution analysis, by Alexander Meulemans and 4 other authors
73
+ View PDF
74
+ TeX Source
75
+ view license
76
+ Current browse context:
77
+ cs.LG
78
+ < prev
79
+ |
80
+ next >
81
+ new
82
+ |
83
+ recent
84
+ |
85
+ 2023-06
86
+ Change to browse by:
87
+ cs
88
+ stat
89
+ stat.ML
90
+ References & Citations
91
+ NASA ADS
92
+ Google Scholar
93
+ Semantic Scholar
94
+ export BibTeX citation
95
+ Loading...
96
+ BibTeX formatted citation
97
+ ×
98
+ loading...
99
+ Data provided by:
100
+ Bookmark
101
+ Bibliographic Tools
102
+ Bibliographic and Citation Tools
103
+ Bibliographic Explorer Toggle
104
+ Bibliographic Explorer
105
+ (
106
+ What is the Explorer?
107
+ )
108
+ Connected Papers Toggle
109
+ Connected Papers
110
+ (
111
+ What is Connected Papers?
112
+ )
113
+ Litmaps Toggle
114
+ Litmaps
115
+ (
116
+ What is Litmaps?
117
+ )
118
+ scite.ai Toggle
119
+ scite Smart Citations
120
+ (
121
+ What are Smart Citations?
122
+ )
123
+ Code, Data, Media
124
+ Code, Data and Media Associated with this Article
125
+ alphaXiv Toggle
126
+ alphaXiv
127
+ (
128
+ What is alphaXiv?
129
+ )
130
+ Links to Code Toggle
131
+ CatalyzeX Code Finder for Papers
132
+ (
133
+ What is CatalyzeX?
134
+ )
135
+ DagsHub Toggle
136
+ DagsHub
137
+ (
138
+ What is DagsHub?
139
+ )
140
+ GotitPub Toggle
141
+ Gotit.pub
142
+ (
143
+ What is GotitPub?
144
+ )
145
+ Huggingface Toggle
146
+ Hugging Face
147
+ (
148
+ What is Huggingface?
149
+ )
150
+ ScienceCast Toggle
151
+ ScienceCast
152
+ (
153
+ What is ScienceCast?
154
+ )
155
+ Demos
156
+ Demos
157
+ Replicate Toggle
158
+ Replicate
159
+ (
160
+ What is Replicate?
161
+ )
162
+ Spaces Toggle
163
+ Hugging Face Spaces
164
+ (
165
+ What is Spaces?
166
+ )
167
+ Spaces Toggle
168
+ TXYZ.AI
169
+ (
170
+ What is TXYZ.AI?
171
+ )
172
+ Related Papers
173
+ Recommenders and Search Tools
174
+ Link to Influence Flower
175
+ Influence Flower
176
+ (
177
+ What are Influence Flowers?
178
+ )
179
+ Core recommender toggle
180
+ CORE Recommender
181
+ (
182
+ What is CORE?
183
+ )
184
+ IArxiv recommender toggle
185
+ IArxiv Recommender
186
+ (
187
+ What is IArxiv?
188
+ )
189
+ Author
190
+ Venue
191
+ Institution
192
+ Topic
193
+ About arXivLabs
+ arXivLabs: experimental projects with community collaborators
+ arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
+ Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
+ Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
research/notes/240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l.md ADDED
@@ -0,0 +1,200 @@
+ ---
+ title: '[2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning
+   Large Language Models as Agents'
+ id: 240211651-learning-from-failure-integrating-negative-examples-when-fine-tuning-l
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:24:45.892505Z'
+ updated: '2026-06-09T04:25:02.502012Z'
+ source: https://arxiv.org/abs/2402.11651
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:24:45.281028Z'
+ fetch_provider: builtin
+ status: active
+ type: note
+ tier: institutional
+ content_type: paper
+ deprecated: false
+ summary: '[2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning
+   Large Language Models as Agents'
+ ---
+
+ [2402.11651] Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
+ Computer Science > Computation and Language
+ arXiv:2402.11651 (cs)
+ [Submitted on 18 Feb 2024 (v1), last revised 16 Apr 2024 (this version, v2)]
+ Title: Learning From Failure: Integrating Negative Examples when Fine-tuning Large Language Models as Agents
+ Authors: Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, Timothy Baldwin
+ Abstract:
+ Large language models (LLMs) have achieved success in acting as agents, which interact with environments through tools such as search engines. However, LLMs are optimized for language generation instead of tool use during training or alignment, limiting their effectiveness as agents. To resolve this problem, previous work has first collected interaction trajectories between LLMs and environments, using only trajectories that successfully finished the task to fine-tune smaller models, making fine-tuning data scarce and acquiring it both difficult and costly. Discarding failed trajectories also leads to significant wastage of data and resources and limits the possible optimization paths during fine-tuning. In this paper, we argue that unsuccessful trajectories offer valuable insights, and LLMs can learn from these trajectories through appropriate quality control and fine-tuning strategies. By simply adding a prefix or suffix that tells the model whether to generate a successful trajectory during training, we improve model performance by a large margin on mathematical reasoning, multi-hop question answering, and strategic question answering tasks. We further analyze the inference results and find that our method provides a better trade-off between valuable information and errors in unsuccessful trajectories. To our knowledge, we are the first to demonstrate the value of negative trajectories and their application in agent-tuning scenarios. Our findings offer guidance for developing better agent-tuning methods and low-resource data usage techniques.
+ Comments: Agent, LLM, Large Language Model
+ Subjects: Computation and Language (cs.CL)
+ ACM classes: I.2.7
+ Cite as: arXiv:2402.11651 [cs.CL] (or arXiv:2402.11651v2 [cs.CL] for this version)
+ https://doi.org/10.48550/arXiv.2402.11651 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Renxi Wang
+ [v1] Sun, 18 Feb 2024 17:10:07 UTC (10,199 KB)
+ [v2] Tue, 16 Apr 2024 11:41:13 UTC (10,670 KB)
research/notes/240515383-generating-code-world-models-with-large-language-models-guided-by-mont.md ADDED
@@ -0,0 +1,196 @@
+ ---
+ title: '[2405.15383] Generating Code World Models with Large Language Models Guided
+   by Monte Carlo Tree Search'
+ id: 240515383-generating-code-world-models-with-large-language-models-guided-by-mont
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:22:12.336468Z'
+ source: https://arxiv.org/abs/2405.15383
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:22:12.331439Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ deprecated: false
+ summary: '[2405.15383] Generating Code World Models with Large Language Models Guided
+   by Monte Carlo Tree Search'
+ ---
+
+ [2405.15383] Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
+ Computer Science > Artificial Intelligence
+ arXiv:2405.15383 (cs)
+ [Submitted on 24 May 2024 (v1), last revised 30 Oct 2024 (this version, v2)]
+ Title: Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search
+ Authors: Nicola Dainese, Matteo Merler, Minttu Alakuijala, Pekka Marttinen
+ Abstract:
+ In this work we consider Code World Models, world models generated by a Large Language Model (LLM) in the form of Python code for model-based Reinforcement Learning (RL). Calling code instead of LLMs for planning has potential to be more precise, reliable, interpretable, and extremely efficient. However, writing appropriate Code World Models requires the ability to understand complex instructions, to generate exact code with non-trivial logic and to self-debug a long program with feedback from unit tests and environment trajectories. To address these challenges, we propose Generate, Improve and Fix with Monte Carlo Tree Search (GIF-MCTS), a new code generation strategy for LLMs. To test our approach in an offline RL setting, we introduce the Code World Models Benchmark (CWMB), a suite of program synthesis and planning tasks comprised of 18 diverse RL environments paired with corresponding textual descriptions and curated trajectories. GIF-MCTS surpasses all baselines on the CWMB and two other benchmarks, and we show that the Code World Models synthesized with it can be successfully used for planning, resulting in model-based RL agents with greatly improved sample efficiency and inference speed.
+ Comments: Accepted at NeurIPS 2024, Main Track. 11 pages in main text, 40 pages including references and supplementary materials. 2 figures and 3 tables in the main text, 9 figures and 12 tables when including the supplementary materials. Website at this https URL
+ Subjects: Artificial Intelligence (cs.AI)
+ Cite as: arXiv:2405.15383 [cs.AI] (or arXiv:2405.15383v2 [cs.AI] for this version)
+ https://doi.org/10.48550/arXiv.2405.15383 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Nicola Dainese
+ [v1] Fri, 24 May 2024 09:31:26 UTC (238 KB)
+ [v2] Wed, 30 Oct 2024 14:19:57 UTC (864 KB)
research/notes/240701476-tree-search-for-language-model-agents.md ADDED
@@ -0,0 +1,200 @@
+ ---
+ title: '[2407.01476] Tree Search for Language Model Agents'
+ id: 240701476-tree-search-for-language-model-agents
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:22:47.411468Z'
+ source: https://arxiv.org/abs/2407.01476
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:22:47.237299Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ deprecated: false
+ summary: '[2407.01476] Tree Search for Language Model Agents'
+ ---
+
+ [2407.01476] Tree Search for Language Model Agents
+ Computer Science > Artificial Intelligence
+ arXiv:2407.01476 (cs)
+ [Submitted on 1 Jul 2024 (v1), last revised 8 Feb 2026 (this version, v4)]
+ Title: Tree Search for Language Model Agents
+ Authors: Jing Yu Koh, Stephen McAleer, Daniel Fried, Ruslan Salakhutdinov
+ Abstract:
+ Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work. Our code and models are publicly released at this https URL.
+ Comments: 13 pages. Models and code available at this https URL
+ Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
+ Cite as: arXiv:2407.01476 [cs.AI] (or arXiv:2407.01476v4 [cs.AI] for this version)
+ https://doi.org/10.48550/arXiv.2407.01476 (arXiv-issued DOI via DataCite)
+ Submission history
+ From: Jing Yu Koh
+ [v1] Mon, 1 Jul 2024 17:07:55 UTC (2,417 KB)
+ [v2] Sat, 12 Oct 2024 19:58:57 UTC (2,435 KB)
+ [v3] Wed, 24 Sep 2025 05:46:23 UTC (2,501 KB)
+ [v4] Sun, 8 Feb 2026 15:06:40 UTC (2,495 KB)
research/notes/240919256-hybridflow-a-flexible-and-efficient-rlhf-framework.md ADDED
@@ -0,0 +1,222 @@
+ ---
+ title: '[2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework'
+ id: 240919256-hybridflow-a-flexible-and-efficient-rlhf-framework
+ tags:
+ - socratic-mcts-swe-worldmodel-8f6dea
+ created: '2026-06-09T04:24:38.815877Z'
+ updated: '2026-06-09T04:26:22.137190Z'
+ source: https://arxiv.org/abs/2409.19256
+ source_domain: arxiv.org
+ fetched_at: '2026-06-09T04:24:38.794804Z'
+ fetch_provider: builtin
+ status: draft
+ type: note
+ tier: institutional
+ deprecated: false
+ summary: '[2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework'
+ ---
+
+ [2409.19256] HybridFlow: A Flexible and Efficient RLHF Framework
+ Computer Science > Machine Learning
+ arXiv:2409.19256 (cs)
+ [Submitted on 28 Sep 2024 (v1), last revised 2 Oct 2024 (this version, v2)]
+ Title: HybridFlow: A Flexible and Efficient RLHF Framework
+ Authors: Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
+ Abstract:
+ Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53$\times$~20.57$\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at this https URL.
+ Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
+ ACM classes: I.2
+ Cite as: arXiv:2409.19256 [cs.LG] (or arXiv:2409.19256v2 [cs.LG] for this version)
+ https://doi.org/10.48550/arXiv.2409.19256 (arXiv-issued DOI via DataCite)
+ Related DOI: https://doi.org/10.1145/3689031.3696075 (DOI(s) linking to related resources)
+ Submission history
+ From: Guangming Sheng
+ [v1] Sat, 28 Sep 2024 06:20:03 UTC (1,755 KB)
+ [v2] Wed, 2 Oct 2024 04:01:47 UTC (1,775 KB)
+ Full-text links:
83
+ Access Paper:
84
+ View a PDF of the paper titled HybridFlow: A Flexible and Efficient RLHF Framework, by Guangming Sheng and 8 other authors
85
+ View PDF
86
+ HTML (experimental)
87
+ TeX Source
88
+ view license
89
+ Current browse context:
90
+ cs.LG
91
+ < prev
92
+ |
93
+ next >
94
+ new
95
+ |
96
+ recent
97
+ |
98
+ 2024-09
99
+ Change to browse by:
100
+ cs
101
+ cs.DC
102
+ References & Citations
103
+ NASA ADS
104
+ Google Scholar
105
+ Semantic Scholar
106
+ export BibTeX citation
107
+ Loading...
108
+ BibTeX formatted citation
109
+ ×
110
+ loading...
111
+ Data provided by:
112
+ Bookmark
113
+ Bibliographic Tools
114
+ Bibliographic and Citation Tools
115
+ Bibliographic Explorer Toggle
116
+ Bibliographic Explorer
117
+ (
118
+ What is the Explorer?
119
+ )
120
+ Connected Papers Toggle
121
+ Connected Papers
122
+ (
123
+ What is Connected Papers?
124
+ )
125
+ Litmaps Toggle
126
+ Litmaps
127
+ (
128
+ What is Litmaps?
129
+ )
130
+ scite.ai Toggle
131
+ scite Smart Citations
132
+ (
133
+ What are Smart Citations?
134
+ )
135
+ Code, Data, Media
136
+ Code, Data and Media Associated with this Article
137
+ alphaXiv Toggle
138
+ alphaXiv
139
+ (
140
+ What is alphaXiv?
141
+ )
142
+ Links to Code Toggle
143
+ CatalyzeX Code Finder for Papers
144
+ (
145
+ What is CatalyzeX?
146
+ )
147
+ DagsHub Toggle
148
+ DagsHub
149
+ (
150
+ What is DagsHub?
151
+ )
152
+ GotitPub Toggle
153
+ Gotit.pub
154
+ (
155
+ What is GotitPub?
156
+ )
157
+ Huggingface Toggle
158
+ Hugging Face
159
+ (
160
+ What is Huggingface?
161
+ )
162
+ Links to Code Toggle
163
+ Papers with Code
164
+ (
165
+ What is Papers with Code?
166
+ )
167
+ ScienceCast Toggle
168
+ ScienceCast
169
+ (
170
+ What is ScienceCast?
171
+ )
172
+ Demos
173
+ Demos
174
+ Replicate Toggle
175
+ Replicate
176
+ (
177
+ What is Replicate?
178
+ )
179
+ Spaces Toggle
180
+ Hugging Face Spaces
181
+ (
182
+ What is Spaces?
183
+ )
184
+ Spaces Toggle
185
+ TXYZ.AI
186
+ (
187
+ What is TXYZ.AI?
188
+ )
189
+ Related Papers
190
+ Recommenders and Search Tools
191
+ Link to Influence Flower
192
+ Influence Flower
193
+ (
194
+ What are Influence Flowers?
195
+ )
196
+ Core recommender toggle
197
+ CORE Recommender
198
+ (
199
+ What is CORE?
200
+ )
201
+ IArxiv recommender toggle
202
+ IArxiv Recommender
203
+ (
204
+ What is IArxiv?
205
+ )
206
+ Author
207
+ Venue
208
+ Institution
209
+ Topic
210
+ About arXivLabs
211
+ arXivLabs: experimental projects with community collaborators
212
+ arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
213
+ Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
214
+ Have an idea for a project that will add value for arXiv's community?
215
+ Learn more about arXivLabs
216
+ .
217
+ Which authors of this paper are endorsers?
218
+ |
219
+ Disable MathJax
220
+ (
221
+ What is MathJax?
222
+ )
research/notes/241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and.md ADDED
@@ -0,0 +1,207 @@
1
+ ---
2
+ title: '[2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search
3
+ and Iterative Refinement'
4
+ id: 241020285-swe-search-enhancing-software-agents-with-monte-carlo-tree-search-and
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:47.408752Z'
8
+ source: https://arxiv.org/abs/2410.20285
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:22:47.075405Z'
11
+ fetch_provider: builtin
12
+ status: draft
13
+ type: note
14
+ deprecated: false
15
+ summary: '[2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree
16
+ Search and Iterative Refinement'
17
+ ---
18
+
19
+ [2410.20285] SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
20
+ Computer Science > Artificial Intelligence
21
+ arXiv:2410.20285
22
+ (cs)
23
+ [Submitted on 26 Oct 2024 (
24
+ v1
25
+ ), last revised 2 Apr 2025 (this version, v6)]
26
+ Title:
27
+ SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement
28
+ Authors:
29
+ Antonis Antoniades
30
+ ,
31
+ Albert Örwall
32
+ ,
33
+ Kexun Zhang
34
+ ,
35
+ Yuxi Xie
36
+ ,
37
+ Anirudh Goyal
38
+ ,
39
+ William Wang
40
+ View a PDF of the paper titled SWE-Search: Enhancing Software Agents with Monte Carlo Tree Search and Iterative Refinement, by Antonis Antoniades and 5 other authors
41
+ View PDF
42
+ Abstract:
43
+ Software engineers operating in complex and dynamic environments must continuously adapt to evolving requirements, learn iteratively from experience, and reconsider their approaches based on new insights. However, current large language model (LLM)-based software agents often follow linear, sequential processes that prevent backtracking and exploration of alternative solutions, limiting their ability to rethink their strategies when initial approaches prove ineffective. To address these challenges, we propose SWE-Search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with a self-improvement mechanism to enhance software agents' performance on repository-level software tasks. SWE-Search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This enables self-feedback loops where agents iteratively refine their strategies based on both quantitative numerical evaluations and qualitative natural language assessments of pursued trajectories. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased inference-time compute through deeper search, providing a pathway to improve software agents without requiring larger models or additional training data. This highlights the potential of self-evaluation driven search techniques in complex software engineering environments.
44
+ Comments:
45
+ Main body: 10 pages, 5 figures. Appendix: 5 pages, 4 figures. Open-source codebase
46
+ Subjects:
47
+ Artificial Intelligence (cs.AI)
48
+ Cite as:
49
+ arXiv:2410.20285
50
+ [cs.AI]
51
+ (or
52
+ arXiv:2410.20285v6
53
+ [cs.AI]
54
+ for this version)
55
+ https://doi.org/10.48550/arXiv.2410.20285
56
+ Focus to learn more
57
+ arXiv-issued DOI via DataCite
58
+ Submission history
59
+ From: Antonis Antoniades
62
+ [v1]
63
+ Sat, 26 Oct 2024 22:45:56 UTC (4,189 KB)
64
+ [v2]
65
+ Tue, 29 Oct 2024 18:25:20 UTC (4,189 KB)
66
+ [v3]
67
+ Sun, 15 Dec 2024 07:55:42 UTC (4,196 KB)
68
+ [v4]
69
+ Mon, 17 Feb 2025 23:13:48 UTC (4,196 KB)
70
+ [v5]
71
+ Sun, 2 Mar 2025 19:42:45 UTC (4,196 KB)
72
+ [v6]
73
+ Wed, 2 Apr 2025 04:13:19 UTC (3,821 KB)
research/notes/241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati.md ADDED
@@ -0,0 +1,194 @@
1
+ ---
2
+ title: '[2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous
3
+ Evaluations are Needed'
4
+ id: 241108794-llm-based-world-models-can-make-decisions-solely-but-rigorous-evaluati
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:37.365228Z'
8
+ source: https://arxiv.org/abs/2411.08794
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:22:37.362163Z'
11
+ fetch_provider: builtin
12
+ status: draft
13
+ type: note
14
+ deprecated: false
15
+ summary: '[2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous
16
+ Evaluations are Needed'
17
+ ---
18
+
19
+ [2411.08794] LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
20
+ Computer Science > Artificial Intelligence
21
+ arXiv:2411.08794
22
+ (cs)
23
+ [Submitted on 13 Nov 2024 (
24
+ v1
25
+ ), last revised 19 Mar 2026 (this version, v2)]
26
+ Title:
27
+ LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed
28
+ Authors:
29
+ Chang Yang
30
+ ,
31
+ Xinrun Wang
32
+ ,
33
+ Junzhe Jiang
34
+ ,
35
+ Qinggang Zhang
36
+ ,
37
+ Xiao Huang
41
+ Abstract:
42
+ World model emerges as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, the world models are either evaluated as a general world simulator, or as a functional module of the agent, i.e., predicting the transitions to assist the planning. In this work, we propose a comprehensive evaluation of the world models with LLMs from the decision making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023;2024) and curate the rule-based policy of each environment for the diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world models can be used for decision making solely. Finally, we conduct the comprehensive evaluation of the advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on the environments for the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially for the tasks which require the domain knowledge, ii) the performance of the world model with LLM decreases for long-term decision-making tasks, and iii) the combination of different functionalities of the world model brings additional instability in performance.
43
+ Comments:
44
+ Accepted to TMLR
45
+ Subjects:
46
+ Artificial Intelligence (cs.AI)
47
+ Cite as:
48
+ arXiv:2411.08794
49
+ [cs.AI]
50
+ (or
51
+ arXiv:2411.08794v2
52
+ [cs.AI]
53
+ for this version)
54
+ https://doi.org/10.48550/arXiv.2411.08794
56
+ arXiv-issued DOI via DataCite
57
+ Submission history
58
+ From: Xinrun Wang
61
+ [v1]
62
+ Wed, 13 Nov 2024 17:19:32 UTC (501 KB)
63
+ [v2]
64
+ Thu, 19 Mar 2026 02:08:09 UTC (374 KB)
research/notes/241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor.md ADDED
@@ -0,0 +1,226 @@
1
+ ---
2
+ title: '[2411.14499] Understanding World or Predicting Future? A Comprehensive Survey
3
+ of World Models'
4
+ id: 241114499-understanding-world-or-predicting-future-a-comprehensive-survey-of-wor
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:01.609371Z'
8
+ updated: '2026-06-09T04:22:19.116800Z'
9
+ source: https://arxiv.org/abs/2411.14499
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:22:01.397314Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: '[2411.14499] Understanding World or Predicting Future? A Comprehensive Survey
19
+ of World Models'
20
+ ---
21
+
22
+ [2411.14499] Understanding World or Predicting Future? A Comprehensive Survey of World Models
23
+ Computer Science > Computation and Language
24
+ arXiv:2411.14499
25
+ (cs)
26
+ [Submitted on 21 Nov 2024 (
27
+ v1
28
+ ), last revised 10 Dec 2025 (this version, v4)]
29
+ Title:
30
+ Understanding World or Predicting Future? A Comprehensive Survey of World Models
31
+ Authors:
32
+ Jingtao Ding
33
+ ,
34
+ Yunke Zhang
35
+ ,
36
+ Yu Shang
37
+ ,
38
+ Jie Feng
39
+ ,
40
+ Yuheng Zhang
41
+ ,
42
+ Zefang Zong
43
+ ,
44
+ Yuan Yuan
45
+ ,
46
+ Hongyuan Su
47
+ ,
48
+ Nian Li
49
+ ,
50
+ Jinghua Piao
51
+ ,
52
+ Yucheng Deng
53
+ ,
54
+ Nicholas Sukiennik
55
+ ,
56
+ Chen Gao
57
+ ,
58
+ Fengli Xu
59
+ ,
60
+ Yong Li
64
+ Abstract:
65
+ The concept of world models has garnered significant attention due to advancements in multimodal large language models such as GPT-4 and video generation models such as Sora, which are central to the pursuit of artificial general intelligence. This survey offers a comprehensive review of the literature on world models. Generally, world models are regarded as tools for either understanding the present state of the world or predicting its future dynamics. This review presents a systematic categorization of world models, emphasizing two primary functions: (1) constructing internal representations to understand the mechanisms of the world, and (2) predicting future states to simulate and guide decision-making. Initially, we examine the current progress in these two categories. We then explore the application of world models in key domains, including generative games, autonomous driving, robotics, and social simulacra, with a focus on how each domain utilizes these aspects. Finally, we outline key challenges and provide insights into potential future research directions. We summarize the representative papers along with their code repositories in
66
+ this https URL.
68
+ Comments:
69
+ Extended version of the original ACM CSUR paper, 49 pages, 6 figures, 8 tables
70
+ Subjects:
71
+ Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
73
+ Cite as:
74
+ arXiv:2411.14499
75
+ [cs.CL]
76
+ (or
77
+ arXiv:2411.14499v4
78
+ [cs.CL]
79
+ for this version)
80
+ https://doi.org/10.48550/arXiv.2411.14499
82
+ arXiv-issued DOI via DataCite
83
+ Submission history
84
+ From: Jingtao Ding
87
+ [v1]
88
+ Thu, 21 Nov 2024 03:58:50 UTC (4,019 KB)
89
+ [v2]
90
+ Wed, 25 Jun 2025 02:31:33 UTC (4,612 KB)
91
+ [v3]
92
+ Sat, 15 Nov 2025 14:33:14 UTC (4,613 KB)
93
+ [v4]
94
+ Wed, 10 Dec 2025 02:53:14 UTC (4,613 KB)
research/notes/250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft.md ADDED
@@ -0,0 +1,214 @@
1
+ ---
2
+ title: '[2502.18449] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on
3
+ Open Software Evolution'
4
+ id: 250218449-swe-rl-advancing-llm-reasoning-via-reinforcement-learning-on-open-soft
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:24:56.974939Z'
8
+ updated: '2026-06-09T04:25:34.163662Z'
9
+ source: https://arxiv.org/abs/2502.18449
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:24:55.251716Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: Wei et al. (Meta AI/UIUC/CMU), NeurIPS 2025. First RL approach scaling LLM
19
+ reasoning to real-world SWE using GitHub PR software-evolution data + lightweight
20
+ rule-based reward (difflib SequenceMatcher similarity to oracle patch, -1 for malformed);
21
+ GRPO optimizer; Llama3-SWE-RL-70B hits 41.0% SWE-bench Verified.
22
+ ---
23
+
24
+ [2502.18449] SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
25
+ Computer Science > Software Engineering
26
+ arXiv:2502.18449
27
+ (cs)
28
+ [Submitted on 25 Feb 2025 (
29
+ v1
30
+ ), last revised 1 Dec 2025 (this version, v2)]
31
+ Title:
32
+ SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
33
+ Authors:
34
+ Yuxiang Wei
35
+ ,
36
+ Olivier Duchenne
37
+ ,
38
+ Jade Copet
39
+ ,
40
+ Quentin Carbonneaux
41
+ ,
42
+ Lingming Zhang
43
+ ,
44
+ Daniel Fried
45
+ ,
46
+ Gabriel Synnaeve
47
+ ,
48
+ Rishabh Singh
49
+ ,
50
+ Sida I. Wang
54
+ Abstract:
55
+ The recent DeepSeek-R1 release has demonstrated the immense potential of reinforcement learning (RL) in enhancing the general reasoning capabilities of large language models (LLMs). While DeepSeek-R1 and other follow-up work primarily focus on applying RL to competitive coding and math problems, this paper introduces SWE-RL, the first approach to scale RL-based LLM reasoning for real-world software engineering. Leveraging a lightweight rule-based reward (e.g., the similarity score between ground-truth and LLM-generated solutions), SWE-RL enables LLMs to autonomously recover a developer's reasoning processes and solutions by learning from extensive open-source software evolution data -- the record of a software's entire lifecycle, including its code snapshots, code changes, and events such as issues and pull requests. Trained on top of Llama 3, our resulting reasoning model, Llama3-SWE-RL-70B, achieves a 41.0% solve rate on SWE-bench Verified -- a human-verified collection of real-world GitHub issues. To our knowledge, this is the best performance reported for medium-sized (<100B) LLMs to date, even comparable to leading proprietary LLMs like GPT-4o. Surprisingly, despite performing RL solely on software evolution data, Llama3-SWE-RL has even emerged with generalized reasoning skills. For example, it shows improved results on five out-of-domain tasks, namely, function coding, library use, code reasoning, mathematics, and general language understanding, whereas a supervised-finetuning baseline even leads to performance degradation on average. Overall, SWE-RL opens up a new direction to improve the reasoning capabilities of LLMs through reinforcement learning on massive software engineering data.
56
+ Comments:
57
+ Accepted to NeurIPS 2025 Main Track
58
+ Subjects:
59
+ Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
61
+ Cite as:
62
+ arXiv:2502.18449
63
+ [cs.SE]
64
+ (or
65
+ arXiv:2502.18449v2
66
+ [cs.SE]
67
+ for this version)
68
+ https://doi.org/10.48550/arXiv.2502.18449
70
+ arXiv-issued DOI via DataCite
71
+ Submission history
72
+ From: Yuxiang Wei
75
+ [v1]
76
+ Tue, 25 Feb 2025 18:45:04 UTC (1,534 KB)
77
+ [v2]
78
+ Mon, 1 Dec 2025 00:16:59 UTC (812 KB)
211
+
212
+ ## Related
213
+
214
+ - [[pdf]]
research/notes/250314391-how-much-do-llms-learn-from-negative-examples.md ADDED
@@ -0,0 +1,189 @@
1
+ ---
2
+ title: '[2503.14391] How much do LLMs learn from negative examples?'
3
+ id: 250314391-how-much-do-llms-learn-from-negative-examples
4
+ tags:
5
+ - socratic-mcts-swe-worldmodel-8f6dea
6
+ created: '2026-06-09T04:24:45.902975Z'
7
+ updated: '2026-06-09T04:25:03.616963Z'
8
+ source: https://arxiv.org/abs/2503.14391
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:24:45.659994Z'
11
+ fetch_provider: builtin
12
+ status: active
13
+ type: note
14
+ tier: institutional
15
+ content_type: paper
16
+ deprecated: false
17
+ summary: '[2503.14391] How much do LLMs learn from negative examples?'
18
+ ---
19
+
20
+ [2503.14391] How much do LLMs learn from negative examples?
21
+ Computer Science > Computation and Language
22
+ arXiv:2503.14391
23
+ (cs)
24
+ [Submitted on 18 Mar 2025]
25
+ Title:
26
+ How much do LLMs learn from negative examples?
27
+ Authors:
28
+ Shadi Hamdan
29
+ ,
30
+ Deniz Yuret
33
+ Abstract:
34
+ Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
35
+ Comments:
36
+ 8 pages, 6 figures
37
+ Subjects:
38
+ Computation and Language (cs.CL)
39
+ MSC classes: 68T50, 68T05
40
+ ACM classes: I.2.6; I.2.7
45
+ Cite as:
46
+ arXiv:2503.14391
47
+ [cs.CL]
48
+ (or
49
+ arXiv:2503.14391v1
50
+ [cs.CL]
51
+ for this version)
52
+ https://doi.org/10.48550/arXiv.2503.14391
54
+ arXiv-issued DOI via DataCite
55
+ Submission history
56
+ From: Deniz Yuret
59
+ [v1]
60
+ Tue, 18 Mar 2025 16:26:29 UTC (38 KB)
research/notes/250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein.md ADDED
@@ -0,0 +1,217 @@
1
+ ---
2
+ title: '[2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling
3
+ to Reinforce'
4
+ id: 250411343-a-minimalist-approach-to-llm-reasoning-from-rejection-sampling-to-rein
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:24:45.896417Z'
8
+ updated: '2026-06-09T04:25:02.883023Z'
9
+ source: https://arxiv.org/abs/2504.11343
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:24:45.404807Z'
12
+ fetch_provider: builtin
13
+ status: active
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: '[2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling
19
+ to Reinforce'
20
+ ---
21
+
22
+ [2504.11343] A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
23
+ Computer Science > Machine Learning
24
+ arXiv:2504.11343
25
+ (cs)
26
+ [Submitted on 15 Apr 2025 (
27
+ v1
28
+ ), last revised 12 Jun 2025 (this version, v2)]
29
+ Title:
30
+ A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
31
+ Authors:
32
+ Wei Xiong
33
+ ,
34
+ Jiarui Yao
35
+ ,
36
+ Yuhui Xu
37
+ ,
38
+ Bo Pang
39
+ ,
40
+ Lei Wang
41
+ ,
42
+ Doyen Sahoo
43
+ ,
44
+ Junnan Li
45
+ ,
46
+ Nan Jiang
47
+ ,
48
+ Tong Zhang
49
+ ,
50
+ Caiming Xiong
51
+ ,
52
+ Hanze Dong
56
+ Abstract:
57
+ Reinforcement learning (RL) has become a prevailing approach for fine-tuning large language models (LLMs) on complex reasoning tasks. Among recent methods, GRPO stands out for its empirical success in training models such as DeepSeek-R1, yet the sources of its effectiveness remain poorly understood. In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. Surprisingly, we find that a simple rejection sampling baseline, RAFT, which trains only on positively rewarded samples, yields competitive performance than GRPO and PPO. Our ablation studies reveal that GRPO's main advantage arises from discarding prompts with entirely incorrect responses, rather than from its reward normalization. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples. Reinforce-Rej improves KL efficiency and stability, serving as a lightweight yet effective alternative to more complex RL algorithms. We advocate RAFT as a robust and interpretable baseline, and suggest that future advances should focus on more principled designs for incorporating negative samples, rather than relying on them indiscriminately. Our findings provide guidance for future work in reward-based LLM post-training.
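The two sample filters named in the abstract can be sketched as follows. This is a minimal illustration that assumes binary rewards and a dictionary mapping each prompt to its sampled responses; the names `raft_filter` and `reinforce_rej_filter` are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List, Tuple

# (response text, reward); rewards are assumed binary: 1.0 correct, 0.0 incorrect.
Sample = Tuple[str, float]


def raft_filter(groups: Dict[str, List[Sample]]) -> Dict[str, List[Sample]]:
    """RAFT-style filter: keep only positively rewarded responses."""
    kept: Dict[str, List[Sample]] = {}
    for prompt, samples in groups.items():
        positives = [(resp, rew) for resp, rew in samples if rew > 0]
        if positives:  # prompts with no correct response contribute nothing
            kept[prompt] = positives
    return kept


def reinforce_rej_filter(groups: Dict[str, List[Sample]]) -> Dict[str, List[Sample]]:
    """Reinforce-Rej-style filter: drop prompts whose responses are
    all incorrect or all correct, keep mixed groups unchanged."""
    kept: Dict[str, List[Sample]] = {}
    for prompt, samples in groups.items():
        n_correct = sum(1 for _, rew in samples if rew > 0)
        if 0 < n_correct < len(samples):
            kept[prompt] = samples
    return kept


groups = {
    "mixed": [("a", 1.0), ("b", 0.0)],
    "all_wrong": [("c", 0.0), ("d", 0.0)],
    "all_right": [("e", 1.0), ("f", 1.0)],
}
raft_kept = raft_filter(groups)
rej_kept = reinforce_rej_filter(groups)
```

In this toy input, the RAFT-style filter keeps the correct responses from `mixed` and `all_right`, while the Reinforce-Rej-style filter keeps only the `mixed` prompt, including its incorrect response.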
58
+ Subjects:
59
+ Machine Learning (cs.LG)
60
+ ; Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
61
+ Cite as:
62
+ arXiv:2504.11343
63
+ [cs.LG]
64
+ (or
65
+ arXiv:2504.11343v2
66
+ [cs.LG]
67
+ for this version)
68
+ https://doi.org/10.48550/arXiv.2504.11343
69
+ Focus to learn more
70
+ arXiv-issued DOI via DataCite
71
+ Submission history
72
+ From: Hanze Dong [
73
+ view email
74
+ ]
75
+ [v1]
76
+ Tue, 15 Apr 2025 16:15:02 UTC (228 KB)
77
+ [v2]
78
+ Thu, 12 Jun 2025 06:03:24 UTC (192 KB)
research/notes/250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model.md ADDED
@@ -0,0 +1,210 @@
1
+ ---
2
+ title: '[2504.15275] Stop Summation: Min-Form Credit Assignment Is All Process Reward
3
+ Model Needs for Reasoning'
4
+ id: 250415275-stop-summation-min-form-credit-assignment-is-all-process-reward-model
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:23:24.370137Z'
8
+ updated: '2026-06-09T04:23:59.340315Z'
9
+ source: https://arxiv.org/abs/2504.15275
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:23:24.356247Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: 'Cheng et al. 2025: argues PRM credit should be MIN-form (bottleneck step)
19
+ not summed across steps — a concrete step-level credit-assignment rule for which
20
+ branch/step carries the signal; relevant to prune-vs-train-on-all design.'
21
+ ---
22
+
23
+ [2504.15275] Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
24
+ Computer Science > Artificial Intelligence
25
+ arXiv:2504.15275
26
+ (cs)
27
+ [Submitted on 21 Apr 2025 (
28
+ v1
29
+ ), last revised 23 Oct 2025 (this version, v3)]
30
+ Title:
31
+ Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
32
+ Authors:
33
+ Jie Cheng
34
+ ,
35
+ Gang Xiong
36
+ ,
37
+ Ruixi Qiao
38
+ ,
39
+ Lijun Li
40
+ ,
41
+ Chao Guo
42
+ ,
43
+ Junle Wang
44
+ ,
45
+ Yisheng Lv
46
+ ,
47
+ Fei-Yue Wang
51
+ Abstract:
52
+ Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the value function range and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches enabling min-form credit assignment achieve comparable reasoning performance to verifiable reward-based methods within only 30% steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning! Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at
53
+ this https URL
54
+ .
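The contrast between sum-form and min-form credit assignment described in the abstract can be sketched as below. This is an illustration of the two definitions only, not the PURE implementation; it assumes one scalar process reward per reasoning step and that "future rewards" at step t include step t itself.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Canonical value target: gamma-decayed sum of rewards from step t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form value target: minimum reward from step t onward."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


step_rewards = [0.9, 0.8, -0.5, 0.7]  # hypothetical per-step PRM scores
sum_targets = sum_form_returns(step_rewards)
min_targets = min_form_returns(step_rewards)
```

With these toy scores, the sum-form target for the first step stays high because the early high-reward steps outweigh the bad third step, whereas the min-form target for every step up to and including the third is capped at the bottleneck value -0.5. The min-form range is also bounded by the per-step reward range, which matches the abstract's point about limiting the value function range.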
55
+ Comments:
56
+ Accepted by NeurIPS 2025
57
+ Subjects:
58
+ Artificial Intelligence (cs.AI)
59
+ ; Machine Learning (cs.LG)
60
+ Cite as:
61
+ arXiv:2504.15275
62
+ [cs.AI]
63
+ (or
64
+ arXiv:2504.15275v3
65
+ [cs.AI]
66
+ for this version)
67
+ https://doi.org/10.48550/arXiv.2504.15275
68
+ Focus to learn more
69
+ arXiv-issued DOI via DataCite
70
+ Submission history
71
+ From: Jie Cheng [
72
+ view email
73
+ ]
74
+ [v1]
75
+ Mon, 21 Apr 2025 17:59:02 UTC (321 KB)
76
+ [v2]
77
+ Fri, 23 May 2025 07:38:41 UTC (321 KB)
78
+ [v3]
79
+ Thu, 23 Oct 2025 16:28:10 UTC (332 KB)
research/notes/250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen.md ADDED
@@ -0,0 +1,200 @@
1
+ ---
2
+ title: '[2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement
3
+ Optimization'
4
+ id: 250518830-on-the-effect-of-negative-gradient-in-group-relative-deep-reinforcemen
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:24:45.899882Z'
8
+ updated: '2026-06-09T04:25:03.267186Z'
9
+ source: https://arxiv.org/abs/2505.18830
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:24:45.517727Z'
12
+ fetch_provider: builtin
13
+ status: active
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: '[2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement
19
+ Optimization'
20
+ ---
21
+
22
+ [2505.18830] On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
23
+ Computer Science > Machine Learning
24
+ arXiv:2505.18830
25
+ (cs)
26
+ [Submitted on 24 May 2025]
27
+ Title:
28
+ On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
29
+ Authors:
30
+ Wenlong Deng
31
+ ,
32
+ Yi Ren
33
+ ,
34
+ Muchen Li
35
+ ,
36
+ Danica J. Sutherland
37
+ ,
38
+ Xiaoxiao Li
39
+ ,
40
+ Christos Thrampoulidis
44
+ Abstract:
45
+ Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens. Experiments on math reasoning benchmarks demonstrate that NTHR effectively mitigates LLD, yielding consistent performance gains across models ranging from 0.5B to 3B parameters.
46
+ Subjects:
47
+ Machine Learning (cs.LG)
48
+ ; Computation and Language (cs.CL)
49
+ Cite as:
50
+ arXiv:2505.18830
51
+ [cs.LG]
52
+ (or
53
+ arXiv:2505.18830v1
54
+ [cs.LG]
55
+ for this version)
56
+ https://doi.org/10.48550/arXiv.2505.18830
57
+ Focus to learn more
58
+ arXiv-issued DOI via DataCite
59
+ Submission history
60
+ From: Wenlong Deng [
61
+ view email
62
+ ]
63
+ [v1]
64
+ Sat, 24 May 2025 18:58:51 UTC (2,068 KB)
research/notes/250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro.md ADDED
@@ -0,0 +1,214 @@
1
+ ---
2
+ title: '[2506.13358] Socratic RL: A Novel Framework for Efficient Knowledge Acquisition
3
+ through Iterative Reflection and Viewpoint Distillation'
4
+ id: 250613358-socratic-rl-a-novel-framework-for-efficient-knowledge-acquisition-thro
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:19:31.995934Z'
8
+ source: https://arxiv.org/abs/2506.13358
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:19:31.874122Z'
11
+ fetch_provider: builtin
12
+ status: draft
13
+ type: note
14
+ tier: institutional
15
+ content_type: paper
16
+ deprecated: false
17
+ summary: 'Socratic-RL (arXiv 2506.13358): decoupled Teacher-Student RL — Teacher extracts causal "viewpoints" from interaction histories, meta-learns via utility uplift U(v), distills viewpoints into Student weights via KL (L_distill). Process- not outcome-reward. ID VERIFIED REAL.'
18
+ ---
19
+
20
+ [2506.13358] Socratic RL: A Novel Framework for Efficient Knowledge Acquisition through Iterative Reflection and Viewpoint Distillation
21
+ Computer Science > Artificial Intelligence
22
+ arXiv:2506.13358
23
+ (cs)
24
+ [Submitted on 16 Jun 2025]
25
+ Title:
26
+ Socratic RL: A Novel Framework for Efficient Knowledge Acquisition through Iterative Reflection and Viewpoint Distillation
27
+ Authors:
28
+ Xiangfan Wu
32
+ Abstract:
33
+ Current Reinforcement Learning (RL) methodologies for Large Language Models (LLMs) often rely on simplistic, outcome-based reward signals (e.g., final answer correctness), which limits the depth of learning from each interaction. This paper introduces Socratic Reinforcement Learning (Socratic-RL), a novel, process-oriented framework designed to address this limitation. Socratic-RL operates on the principle that deeper understanding is achieved by reflecting on the causal reasons for errors and successes within the reasoning process itself. The framework employs a decoupled "Teacher-Student" architecture, where a "Teacher AI" analyzes interaction histories, extracts causal insights, and formulates them into structured "viewpoints." These viewpoints, acting as distilled guidance, are then used by a "Student AI" to enhance its subsequent reasoning. A key innovation is the iterative self-improvement of the Teacher AI, enabling its reflective capabilities to evolve through a meta-learning loop. To manage the accumulation of knowledge, a distillation mechanism compresses learned viewpoints into the Student's parameters. By focusing on process rather than just outcome, Socratic-RL presents a pathway toward enhanced sample efficiency, superior interpretability, and a more scalable architecture for self-improving AI systems. This paper details the foundational concepts, formal mechanisms, synergies, challenges, and a concrete research roadmap for this proposed framework.
34
+ Subjects:
35
+ Artificial Intelligence (cs.AI)
36
+ ; Machine Learning (cs.LG); Multiagent Systems (cs.MA)
37
+ Cite as:
38
+ arXiv:2506.13358
39
+ [cs.AI]
40
+ (or
41
+ arXiv:2506.13358v1
42
+ [cs.AI]
43
+ for this version)
44
+ https://doi.org/10.48550/arXiv.2506.13358
45
+ Focus to learn more
46
+ arXiv-issued DOI via DataCite
47
+ Submission history
48
+ From: Xiangfan Wu [
49
+ view email
50
+ ]
51
+ [v1]
52
+ Mon, 16 Jun 2025 10:57:58 UTC (561 KB)
185
+ ---
186
+
187
+ ## METHOD DETAIL (extracted from full HTML text, arxiv.org/html/2506.13358v1; institutional)
188
+
189
+ VERIFICATION: arXiv ID **2506.13358** confirmed REAL (matches user transcript). Submitted 16 Jun 2025. Computer Science > Artificial Intelligence. Position paper: foundational concepts, formal mechanisms, synergies — process-oriented (not outcome-only) RL for LLMs.
190
+
191
+ ### Decoupled Teacher–Student architecture
192
+ Two specialized agents. **Teacher AI** analyzes interaction histories, extracts CAUSAL insights, formulates structured **"viewpoints."** **Student AI** focuses purely on task-solving. Specialization: Student becomes expert at solving; Teacher becomes expert at reflection/causal analysis.
193
+
194
+ ### Viewpoints
195
+ A viewpoint = "a piece of structured, human-readable text representing a generalizable principle, a heuristic, a causal explanation, or a counter-example" (e.g. "In arithmetic, operations inside parentheses must be evaluated first"). Viewpoints are PREPENDED to the Student's input context: **π_S(a_t | s_t, V; θ_S)** where V is the active viewpoint set. Knowledge base **V_KB** persists across episodes; active set V reset post-distillation.
196
+
197
+ ### Meta-learning loop (Teacher self-improvement)
198
+ Teacher quality = viewpoint utility uplift on probe tasks 𝒫_probe:
199
+ **U(v) = E[Score(π_S(·|p, V∪{v}))] − E[Score(π_S(·|p, V))]**.
200
+ Teacher is refined to "generate viewpoints that construct the most effective prompts for the Student." This is the key innovation: reflective capability EVOLVES (a meta-learning loop on the Teacher), not static.
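A minimal sketch of the utility uplift U(v), assuming the Student can be treated as a callable that receives a probe task plus the active viewpoint set, and that a scalar `score` function exists. `toy_student`, `toy_score`, and `utility_uplift` are hypothetical names used for illustration; the paper is a position paper and does not provide this code.

```python
from statistics import mean
from typing import Callable, FrozenSet, Sequence

# Assumed interfaces: a Student maps (probe, viewpoints in context) to an answer,
# and a Scorer maps (probe, answer) to a scalar score.
Student = Callable[[str, FrozenSet[str]], str]
Scorer = Callable[[str, str], float]


def utility_uplift(
    student: Student,
    score: Scorer,
    probes: Sequence[str],
    active: FrozenSet[str],
    candidate: str,
) -> float:
    """Empirical estimate of U(v): mean probe score with V ∪ {v} minus mean with V."""
    with_v = mean(score(p, student(p, active | {candidate})) for p in probes)
    without_v = mean(score(p, student(p, active)) for p in probes)
    return with_v - without_v


PAREN_RULE = "In arithmetic, operations inside parentheses must be evaluated first"


def toy_student(probe: str, viewpoints: FrozenSet[str]) -> str:
    # Toy behaviour: answers correctly only when the rule is in context.
    return "14" if PAREN_RULE in viewpoints else "10"


def toy_score(probe: str, answer: str) -> float:
    return 1.0 if answer == "14" else 0.0


uplift = utility_uplift(toy_student, toy_score, ["2*(3+4)"], frozenset(), PAREN_RULE)
```

A positive uplift marks a viewpoint worth keeping in V_KB. In the framework as described, this signal is what the Teacher would be refined against; how that refinement is carried out is not specified here.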
201
+
202
+ ### Distillation mechanism (bound context growth → compress into weights)
203
+ Train a new Student π_S' via KL minimization so it acts as if it knows the principle without seeing it:
204
+ **L_distill = E[ D_KL( π_S(·|Input, v; θ_S) ‖ π_S'(·|Input; θ_S') ) ]**.
205
+ Alternative distillation strategies named: **DPO** (preferred/rejected pairs) and **Instruction Tuning** (reformat V_KB into training examples).
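The L_distill objective can be sketched over discrete next-token distributions as follows. This is a plain-Python illustration that assumes both policies expose a probability vector over the same vocabulary for each input; in practice this would be computed on model logits in a training framework, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """D_KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def distill_loss(
    with_viewpoint: List[Sequence[float]],
    without_viewpoint: List[Sequence[float]],
) -> float:
    """Mean KL between the viewpoint-conditioned Student (acting as teacher)
    and the new Student that sees only the input."""
    pairs = list(zip(with_viewpoint, without_viewpoint))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy distributions over a 2-token vocabulary for two inputs.
teacher_side = [[0.9, 0.1], [0.5, 0.5]]
student_side = [[0.5, 0.5], [0.5, 0.5]]
loss_before = distill_loss(teacher_side, student_side)
loss_matched = distill_loss(teacher_side, teacher_side)
```

The loss reaches zero when the new Student reproduces the viewpoint-conditioned distribution without the viewpoint in its context, which is the "knows the principle without seeing it" condition stated above.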
206
+
207
+ ### Process- vs outcome-reward
208
+ Standard RL = "simplistic, outcome-oriented reward (e.g. final answer correctness)." Socratic-RL = "automated process supervision" over "the causal chain of successes and failures within the reasoning process itself" — contrast to RLHF scalar outcome rewards.
209
+
210
+ ### Claimed benefits / named algorithm
211
+ Sample efficiency (richer process signals), interpretability (V_KB = human-readable "glass-box" log of acquired knowledge), scalability (distillation resets context window). **Algorithm 1: The Socratic-RL Core Loop** — 4 phases: Student Interaction → Teacher Reflection → Meta-Learning (Teacher Evolution) → Knowledge Distillation.
212
+
213
+ ### Relevance to composer-replication-framework
214
+ Teacher→viewpoint→Student-context→distill-into-weights is the conceptual parent of the framework's **HintGenerator** (ADR-009: template → raw-error → LLM-judge → sibling-bootstrap) and **SDPO Channel 2** (hint-conditioned same-model teacher; generalized_jsd / OPSD kernel — "knows the principle without seeing the hint" == the L_distill KL-to-hint-conditioned-teacher objective). The Teacher meta-learning loop maps onto the user's "outer slow dataset-construction loop." On the user's PRUNE-vs-TRAIN-ON-ALL question this paper is pro-DISTILL-the-causal-insight (TRAIN on the extracted viewpoint), not pro-prune; viewpoints are textual-critique-guided mutation in the genetic-algorithm framing.
research/notes/250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on.md ADDED
@@ -0,0 +1,251 @@
1
+ ---
2
+ title: '[2507.21046] A Survey of Self-Evolving Agents: What, When, How, and Where
3
+ to Evolve on the Path to Artificial Super Intelligence'
4
+ id: 250721046-a-survey-of-self-evolving-agents-what-when-how-and-where-to-evolve-on
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:24:56.997446Z'
8
+ updated: '2026-06-09T04:25:34.638684Z'
9
+ source: https://arxiv.org/abs/2507.21046
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:24:56.751495Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: 'Survey taxonomizing self-evolving agents (what/when/how/where to evolve);
19
+ Section 8.3 catalogs emergent risks: misevolution, uncontrolled behavior drift,
20
+ deployment-time reward hacking in memory evolution, Alignment Tipping Process, model
21
+ collapse from closed-loop RL on static synthetic data.'
22
+ ---
23
+
24
+ [2507.21046] A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
25
+ [Submitted on 28 Jul 2025 (v1), last revised 16 Jan 2026 (this version, v4)]
33
+ Authors: Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, Hongru Wang, Han Xiao, Yuhang Zhou, Shaokun Zhang, Jiayi Zhang, Jinyu Xiang, Yixiong Fang, Qiwen Zhao, Dongrui Liu, Qihan Ren, Cheng Qian, Zhenhailong Wang, Minda Hu, Huazheng Wang, Qingyun Wu, Heng Ji, Mengdi Wang
90
+ Abstract:
91
+ Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks but remain fundamentally static, unable to adapt their internal parameters to novel tasks, evolving knowledge domains, or dynamic interaction contexts. As LLMs are increasingly deployed in open-ended, interactive environments, this static nature has become a critical bottleneck, necessitating agents that can adaptively reason, act, and evolve in real time. This paradigm shift -- from scaling static models to developing self-evolving agents -- has sparked growing interest in architectures and methods enabling continual learning and adaptation from data, interactions, and experiences. This survey provides the first systematic and comprehensive review of self-evolving agents, organizing the field around three foundational dimensions: what, when, and how to evolve. We examine evolutionary mechanisms across agent components (e.g., models, memory, tools, architecture), categorize adaptation methods by stages (e.g., intra-test-time, inter-test-time), and analyze the algorithmic and architectural designs that guide evolutionary adaptation (e.g., scalar rewards, textual feedback, single-agent and multi-agent systems). Additionally, we analyze evaluation metrics and benchmarks tailored for self-evolving agents, highlight applications in domains such as coding, education, and healthcare, and identify critical challenges and research directions in safety, scalability, and co-evolutionary dynamics. By providing a structured framework for understanding and designing self-evolving agents, this survey establishes a roadmap for advancing more adaptive, robust, and versatile agentic systems in both research and real-world deployments, and ultimately sheds light on the realization of Artificial Super Intelligence (ASI) where agents evolve autonomously and perform beyond human-level intelligence across tasks.
92
+ Comments: 77 pages, 9 figures, Transactions on Machine Learning Research (01/2026)
+ Subjects: Artificial Intelligence (cs.AI)
+ Cite as: arXiv:2507.21046 [cs.AI] (or arXiv:2507.21046v4 [cs.AI] for this version)
+ DOI: https://doi.org/10.48550/arXiv.2507.21046 (arXiv-issued DOI via DataCite)
+ Submission history (from Xinzhe Juan):
+ [v1] Mon, 28 Jul 2025 17:59:05 UTC (3,709 KB)
+ [v2] Wed, 30 Jul 2025 17:59:37 UTC (3,753 KB)
+ [v3] Fri, 1 Aug 2025 17:17:09 UTC (3,753 KB)
+ [v4] Fri, 16 Jan 2026 20:59:08 UTC (3,766 KB)
248
+
249
+ ## Related
250
+
251
+ - [[pdf]]
research/notes/250921240-tree-search-for-llm-agent-reinforcement-learning.md ADDED
@@ -0,0 +1,202 @@
1
+ ---
2
+ title: '[2509.21240] Tree Search for LLM Agent Reinforcement Learning'
3
+ id: 250921240-tree-search-for-llm-agent-reinforcement-learning
4
+ tags:
5
+ - socratic-mcts-swe-worldmodel-8f6dea
6
+ created: '2026-06-09T04:22:47.413734Z'
7
+ source: https://arxiv.org/abs/2509.21240
8
+ source_domain: arxiv.org
9
+ fetched_at: '2026-06-09T04:22:47.405048Z'
10
+ fetch_provider: builtin
11
+ status: draft
12
+ type: note
13
+ deprecated: false
14
+ summary: '[2509.21240] Tree Search for LLM Agent Reinforcement Learning'
15
+ ---
16
+
17
+ [2509.21240] Tree Search for LLM Agent Reinforcement Learning
18
+ [Submitted on 25 Sep 2025 (v1), last revised 18 Mar 2026 (this version, v3)]
26
+ Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
40
+ Abstract:
41
+ Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
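The two-level advantage estimate described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the mean/std normalization, the additive combination of the intra-tree and inter-tree terms, and the names `group_relative` and `tree_advantages` are all hypothetical choices made here.

```python
from statistics import mean, pstdev
from typing import Dict, List

def group_relative(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def tree_advantages(trees: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Combine an intra-tree and an inter-tree advantage per leaf.

    `trees` maps a tree id to the outcome rewards of its leaf rollouts.
    Summing the two terms is an assumption made for this sketch.
    """
    # Inter-tree level: normalize across every rollout of every tree.
    all_rewards = [r for leaves in trees.values() for r in leaves]
    inter = group_relative(all_rewards)

    advantages: Dict[str, List[float]] = {}
    offset = 0
    for tree_id, leaves in trees.items():
        # Intra-tree level: normalize only against siblings in the same tree.
        intra = group_relative(leaves)
        inter_slice = inter[offset:offset + len(leaves)]
        advantages[tree_id] = [a + b for a, b in zip(intra, inter_slice)]
        offset += len(leaves)
    return advantages

adv = tree_advantages({"t1": [1.0, 0.0], "t2": [0.0, 0.0]})
```

In this toy input, the two leaves of `t1` differ only after their shared prefix, so the intra-tree term separates them even though the only supervision is the outcome reward; the leaves of `t2` are indistinguishable within their tree and receive only the inter-tree term.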
42
+ Comments: ICLR 2026, Code: this https URL
+ Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
+ Cite as: arXiv:2509.21240 [cs.LG] (or arXiv:2509.21240v3 [cs.LG] for this version)
+ DOI: https://doi.org/10.48550/arXiv.2509.21240 (arXiv-issued DOI via DataCite)
+ Submission history (from Yuxiang Ji):
+ [v1] Thu, 25 Sep 2025 14:37:09 UTC (974 KB)
+ [v2] Sat, 11 Oct 2025 09:55:47 UTC (938 KB)
+ [v3] Wed, 18 Mar 2026 09:49:32 UTC (983 KB)
research/notes/251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod.md ADDED
@@ -0,0 +1,291 @@
1
+ ---
2
+ title: '[2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with
3
+ World Models'
4
+ id: 251002387-cwm-an-open-weights-llm-for-research-on-code-generation-with-world-mod
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:12.331553Z'
8
+ source: https://arxiv.org/abs/2510.02387
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:22:12.211946Z'
11
+ fetch_provider: builtin
12
+ status: draft
13
+ type: note
14
+ deprecated: false
15
+ summary: '[2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with
16
+ World Models'
17
+ ---
18
+
19
+ [2510.02387] CWM: An Open-Weights LLM for Research on Code Generation with World Models
20
+ [Submitted on 30 Sep 2025]
26
+ Authors: FAIR CodeGen team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O'Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, Gabriel Synnaeve
131
+ Abstract:
132
+ We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.
133
+ Comments: 58 pages
+ Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
+ MSC classes: 68T07
+ ACM classes: I.2.7
+ Cite as: arXiv:2510.02387 [cs.SE] (or arXiv:2510.02387v1 [cs.SE] for this version)
+ DOI: https://doi.org/10.48550/arXiv.2510.02387 (arXiv-issued DOI via DataCite)
+ Submission history (from Gabriel Synnaeve):
+ [v1] Tue, 30 Sep 2025 21:47:10 UTC (1,662 KB)
research/notes/251121654-evilgenie-a-reward-hacking-benchmark.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ title: '[2511.21654] EvilGenie: A Reward Hacking Benchmark'
3
+ id: 251121654-evilgenie-a-reward-hacking-benchmark
4
+ tags:
5
+ - socratic-mcts-swe-worldmodel-8f6dea
6
+ - locus-prune-vs-train-on-all
7
+ - locus-eks-architecture-and-substrate-mapping
8
+ - locus-credit-assignment-tree-as-process-signal
9
+ created: '2026-06-09T04:56:26.236010Z'
10
+ source: https://arxiv.org/abs/2511.21654
11
+ source_domain: arxiv.org
12
+ fetched_at: '2026-06-09T04:56:25.940448Z'
13
+ fetch_provider: builtin
14
+ status: draft
15
+ type: note
16
+ deprecated: false
17
+ summary: '[2511.21654] EvilGenie: A Reward Hacking Benchmark'
18
+ ---
19
+
20
+ [2511.21654] EvilGenie: A Reward Hacking Benchmark
21
+ [Submitted on 26 Nov 2025 (v1), last revised 17 May 2026 (this version, v2)]
29
+ Authors: Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld
38
+ Abstract:
39
+ We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect's `basic_agent` scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI's Codex, Anthropic's Claude Code, and Google's Gemini CLI. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at this https URL.
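Of the three detection methods named in the abstract, test file edit detection is the simplest to sketch. The version below is an assumption about how such a check could work (hashing test files before and after an agent run); the abstract does not describe the benchmark's actual detector, and the names `snapshot` and `edited_files` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

def snapshot(test_dir: Path) -> Dict[str, str]:
    """Map each file under test_dir (relative path) to its SHA-256 digest."""
    return {
        str(path.relative_to(test_dir)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(test_dir.rglob("*"))
        if path.is_file()
    }

def edited_files(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return files that were modified, added, or deleted between two snapshots."""
    names = before.keys() | after.keys()
    return sorted(name for name in names if before.get(name) != after.get(name))
```

A harness would call `snapshot` on the test directory before handing control to the agent and again afterwards; a non-empty `edited_files` result flags the run for review. A check like this cannot see hardcoded test cases in the solution itself, which is presumably why the benchmark pairs it with held-out tests and LLM judges.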
42
+ Subjects: Machine Learning (cs.LG)
+ ACM classes: I.2.7
+ Cite as: arXiv:2511.21654 [cs.LG] (or arXiv:2511.21654v2 [cs.LG] for this version)
+ DOI: https://doi.org/10.48550/arXiv.2511.21654 (arXiv-issued DOI via DataCite)
+ Submission history (from Jonathan Gabor):
+ [v1] Wed, 26 Nov 2025 18:27:17 UTC (75 KB)
+ [v2] Sun, 17 May 2026 22:54:07 UTC (42 KB)
research/notes/251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo.md ADDED
@@ -0,0 +1,210 @@
1
+ ---
2
+ title: '[2512.18832] From Word to World: Can Large Language Models be Implicit Text-based
3
+ World Models?'
4
+ id: 251218832-from-word-to-world-can-large-language-models-be-implicit-text-based-wo
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ created: '2026-06-09T04:22:01.604517Z'
8
+ updated: '2026-06-09T04:22:18.450487Z'
9
+ source: https://arxiv.org/abs/2512.18832
10
+ source_domain: arxiv.org
11
+ fetched_at: '2026-06-09T04:22:01.134259Z'
12
+ fetch_provider: builtin
13
+ status: draft
14
+ type: note
15
+ tier: institutional
16
+ content_type: paper
17
+ deprecated: false
18
+ summary: '[2512.18832] From Word to World: Can Large Language Models be Implicit Text-based
19
+ World Models?'
20
+ ---
21
+
22
+ [2512.18832] From Word to World: Can Large Language Models be Implicit Text-based World Models?
23
+ [Submitted on 21 Dec 2025 (v1), last revised 5 Mar 2026 (this version, v2)]
31
+ Authors: Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji
54
+ Abstract:
55
+ Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. Meanwhile, these gains depend critically on behavioral coverage and environment complexity, delineating a clear boundary on when world modeling effectively supports agent learning.
56
+ Subjects: Computation and Language (cs.CL)
+ Cite as: arXiv:2512.18832 [cs.CL] (or arXiv:2512.18832v2 [cs.CL] for this version)
65
+ https://doi.org/10.48550/arXiv.2512.18832
67
+ arXiv-issued DOI via DataCite
68
+ Submission history
69
+ From: Yixia Li
+ [v1] Sun, 21 Dec 2025 17:28:42 UTC (2,094 KB)
+ [v2] Thu, 5 Mar 2026 07:26:37 UTC (2,094 KB)
research/notes/260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight.md ADDED
@@ -0,0 +1,210 @@
1
+ ---
2
+ title: '[2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight'
3
+ id: 260103905-current-agents-fail-to-leverage-world-model-as-tool-for-foresight
4
+ tags:
5
+ - socratic-mcts-swe-worldmodel-8f6dea
6
+ created: '2026-06-09T04:22:01.607019Z'
7
+ updated: '2026-06-09T04:22:18.777288Z'
8
+ source: https://arxiv.org/abs/2601.03905
9
+ source_domain: arxiv.org
10
+ fetched_at: '2026-06-09T04:22:01.290034Z'
11
+ fetch_provider: builtin
12
+ status: draft
13
+ type: note
14
+ tier: institutional
15
+ content_type: paper
16
+ deprecated: false
17
+ summary: '[2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight'
18
+ ---
19
+
20
+ [2601.03905] Current Agents Fail to Leverage World Model as Tool for Foresight
21
+ Computer Science > Artificial Intelligence
22
+ arXiv:2601.03905
23
+ (cs)
24
+ [Submitted on 7 Jan 2026 (v1), last revised 8 Jan 2026 (this version, v2)]
27
+ Title:
28
+ Current Agents Fail to Leverage World Model as Tool for Foresight
29
+ Authors: Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji
54
+ Abstract:
55
+ Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents' capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.
56
+ Comments:
57
+ 36 Pages, 13 Figures, 17 Tables (Meta data updated)
58
+ Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
+ Cite as: arXiv:2601.03905 [cs.AI] (or arXiv:2601.03905v2 [cs.AI] for this version)
68
+ https://doi.org/10.48550/arXiv.2601.03905
70
+ arXiv-issued DOI via DataCite
71
+ Submission history
72
+ From: Cheng Qian
+ [v1] Wed, 7 Jan 2026 13:15:23 UTC (12,754 KB)
+ [v2] Thu, 8 Jan 2026 02:36:21 UTC (12,754 KB)
research/notes/260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas.md ADDED
@@ -0,0 +1,206 @@
1
+ ---
2
+ title: '[2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single
3
+ Agent Baseline'
4
+ id: 260112307-rethinking-the-value-of-multi-agent-workflow-a-strong-single-agent-bas
5
+ tags:
6
+ - socratic-mcts-swe-worldmodel-8f6dea
7
+ - locus-prune-vs-train-on-all
8
+ - locus-eks-architecture-and-substrate-mapping
9
+ - locus-credit-assignment-tree-as-process-signal
10
+ created: '2026-06-09T04:52:22.844340Z'
11
+ source: https://arxiv.org/abs/2601.12307
12
+ source_domain: arxiv.org
13
+ fetched_at: '2026-06-09T04:52:22.442504Z'
14
+ fetch_provider: builtin
15
+ status: draft
16
+ type: note
17
+ deprecated: false
18
+ summary: '[2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single
19
+ Agent Baseline'
20
+ ---
21
+
22
+ [2601.12307] Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
23
+ Computer Science > Multiagent Systems
24
+ arXiv:2601.12307
25
+ (cs)
26
+ [Submitted on 18 Jan 2026]
27
+ Title:
28
+ Rethinking the Value of Multi-Agent Workflow: A Strong Single Agent Baseline
29
+ Authors: Jiawei Xu, Arief Koesdwiady, Sisong Bei, Yan Han, Baixiang Huang, Dakuo Wang, Yutong Chen, Zheshen Wang, Peihao Wang, Pan Li, Ying Ding
54
+ Abstract:
55
+ Recent advances in LLM-based multi-agent systems (MAS) show that workflows composed of multiple LLM agents with distinct roles, tools, and communication patterns can outperform single-LLM baselines on complex tasks. However, most frameworks are homogeneous, where all agents share the same base LLM and differ only in prompts, tools, and positions in the workflow. This raises the question of whether such workflows can be simulated by a single agent through multi-turn conversations. We investigate this across seven benchmarks spanning coding, mathematics, general question answering, domain-specific reasoning, and real-world planning and tool use. Our results show that a single agent can reach the performance of homogeneous workflows with an efficiency advantage from KV cache reuse, and can even match the performance of an automatically optimized heterogeneous workflow. Building on this finding, we propose **OneFlow**, an algorithm that automatically tailors workflows for single-agent execution, reducing inference costs compared to existing automatic multi-agent design frameworks without trading off accuracy. These results position the single-LLM implementation of multi-agent workflows as a strong baseline for MAS research. We also note that single-LLM methods cannot capture heterogeneous workflows due to the lack of KV cache sharing across different LLMs, highlighting future opportunities in developing *truly* heterogeneous multi-agent systems.
56
+ Subjects: Multiagent Systems (cs.MA); Computation and Language (cs.CL); Machine Learning (cs.LG)
+ Cite as: arXiv:2601.12307 [cs.MA] (or arXiv:2601.12307v1 [cs.MA] for this version)
66
+ https://doi.org/10.48550/arXiv.2601.12307
68
+ arXiv-issued DOI via DataCite
69
+ Submission history
70
+ From: Jiawei Xu
+ [v1] Sun, 18 Jan 2026 08:16:09 UTC (429 KB)