Spike v0.0 laydown + spike 001 VALIDATED
Decompose v0.0 into 4 risk-ordered sub-spikes:
- 001 teacher-replay-cost (kill-switch, ✅ EXECUTED + VALIDATED in this commit)
- 002a trace-collection-trl (planned)
- 002b trace-collection-prime-rl (comparison-spike)
- 003 dpo-pairs-from-disagreement (planned)
- 004 ab-train-grpo-vs-trace-replay-dpo (terminal experiment)
Spike 001 result (real, not scaffolding):
- 150 teacher API calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro via OpenRouter)
- 50 synthesized agentic-coding states as fixture
- 0 errors, total spend $0.98
- Mean per-trace cost: $0.98 (target: <$5) ✅ — 5x headroom
- p95 step latency: 20.5s (target: <30s) ✅
- p99 step latency: 23.2s (target: <60s) ✅
- Cost composition: Opus dominates ($0.81 of $0.98); DeepSeek + GPT-5 essentially free
This validates the framework's central economic claim. Even ungated (no VOI,
no tiered teachers), cost is ~$1/trace not $64/trace. With VOI gating in v0.1
we project ~$0.30/trace. Trace-replay-distillation is economically viable.
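The headroom and the projected saving quoted above follow from simple arithmetic on the reported figures. A minimal sketch, using only numbers stated in this commit message; the variable names are illustrative and are not taken from `analyze.py`:

```python
# Figures reported in this commit message.
MEAN_TRACE_COST_USD = 0.98       # ungated, 3 teachers, 50 states
COST_CAP_USD = 5.00              # spike threshold for per-trace cost
PROJECTED_GATED_COST_USD = 0.30  # v0.1 projection with VOI gating
P95_LATENCY_S, P95_CAP_S = 20.5, 30.0
P99_LATENCY_S, P99_CAP_S = 23.2, 60.0

# How many times the measured cost fits under the cap.
headroom = COST_CAP_USD / MEAN_TRACE_COST_USD

# Fraction of the ungated cost that VOI gating would have to remove
# for the ~$0.30/trace projection to hold.
implied_savings = 1 - PROJECTED_GATED_COST_USD / MEAN_TRACE_COST_USD

# The spike counts as validated only if all three thresholds are met.
all_pass = (
    MEAN_TRACE_COST_USD < COST_CAP_USD
    and P95_LATENCY_S < P95_CAP_S
    and P99_LATENCY_S < P99_CAP_S
)

print(f"headroom: {headroom:.1f}x")
print(f"implied VOI savings: {implied_savings:.0%}")
print(f"all thresholds met: {all_pass}")
```

The implied saving of roughly 69% falls inside the 60-80% range that `analyze.py` cites for VOI gating, so the v0.1 projection is at least consistent with that estimate.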
Files:
- spikes/README.md (decomposition + risk ordering)
- spikes/001-teacher-replay-cost/{README.md, synthesize_trace.py, replay.py,
analyze.py, states.jsonl (fixture, 50 states), verdict.md}
- spikes/00{2a,2b,3,4}-*/README.md (planning stubs)
- README.md updated with v0.0 status banner
- .gitignore updated to track states.jsonl fixture but not results.jsonl
Next: wait for direction on whether to dispatch spike 002a (TRL trace
collection on Modal A100) and 002b in parallel via subagents.
- .gitignore +6 -0
- README.md +2 -2
- spikes/001-teacher-replay-cost/README.md +52 -0
- spikes/001-teacher-replay-cost/analyze.py +200 -0
- spikes/001-teacher-replay-cost/replay.py +194 -0
- spikes/001-teacher-replay-cost/states.jsonl +50 -0
- spikes/001-teacher-replay-cost/synthesize_trace.py +271 -0
- spikes/001-teacher-replay-cost/verdict.md +44 -0
- spikes/002a-trace-collection-trl/README.md +39 -0
- spikes/002b-trace-collection-prime-rl/README.md +31 -0
- spikes/003-dpo-pairs-from-disagreement/README.md +34 -0
- spikes/004-ab-train-grpo-vs-trace-replay-dpo/README.md +47 -0
- spikes/README.md +60 -0
@@ -33,6 +33,12 @@ wandb/
 data/processed/
 data/external/
 
+# But spike fixtures (synthetic input states) ARE checked in — reproducibility
+!spikes/**/states.jsonl
+
 # Logs / runtime
 logs/
 *.log
+
+# Spike 001 raw API responses (large + privacy)
+spikes/001-teacher-replay-cost/results.jsonl
@@ -27,13 +27,13 @@ pretty_name: "Composer 2.5 Replication Framework — Research Synthesis"
 
 # Composer 2.5 Replication Framework
 
-> **Repo type:** `model` (methodology). **Status:** Research synthesis (2026-05-25).
+> **Repo type:** `model` (methodology). **Status:** Research synthesis + v0.0 spike kickoff (2026-05-25).
 > **Author:** [Codeseys](https://huggingface.co/Codeseys)
 > **Goal:** Replicate Cursor's [Composer 2.5](https://cursor.com/blog/composer-2-5) (a post-trained Kimi K2.5 specialised for agentic coding) on **any** HuggingFace base model, using a synthesis of decentralized RL post-training techniques.
 
 This repository is the **"paper of the project"** — it is the methodology / research / framework specification for an open replication of Cursor's Composer 2.5 system, plus a **novel multi-teacher trace-replay distillation channel** that stacks on top of the Composer recipe.
 
-
+**v0.0 spike kickoff (2026-05-25):** the kill-switch feasibility test (`spikes/001-teacher-replay-cost/`) is **✅ VALIDATED** — 150 real teacher API calls (Opus 4.7 + GPT-5 + DeepSeek V4 Pro via OpenRouter), $0.98 mean per-trace cost (vs. $5 cap), 20.5 s p95 step latency. The novel research direction is economically viable. See `spikes/README.md` for the full 4-stage spike plan.
 
 ---
 
@@ -0,0 +1,52 @@
+# Spike 001 — Teacher Replay Cost & Latency Floor
+
+> **Risk:** HIGH (kill-switch). If teacher API cost or latency is unacceptable, the trace-replay channel dies and v0.0 stops here.
+> **Status:** 🟢 EXECUTING (this session)
+
+## Question (Given / When / Then)
+
+**Given** a frozen 100-step agentic-coding trace and a state at step `t`,
+**when** N=3 frozen teachers (Opus 4.7, GPT-5, DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions,
+**then** total per-trace teacher cost is **< $5** and wallclock per step is **< 30 s**.
+
+## Why this first
+
+- **Cheapest possible falsification** of the framework's central economic assumption (~$3/trace with VOI gating, ~$64/trace without).
+- v0.0 doesn't yet use VOI gating; we measure the *ungated* baseline so we know how much VOI gating buys us in v0.1.
+- No GPU required. ~1 hour of API budget at most.
+
+## Approach
+
+1. **Synthesize a small frozen trace** (no real student model needed for cost-floor measurement). Use a fixed sample of 50 SWE-bench-lite-shaped agentic-coding "states" — multi-turn function-call sequences — written by hand or pulled from a dataset.
+2. **At each step,** call all 3 teachers in parallel via OpenRouter chat completions. Capture: completion token count, prompt token count, latency, cost per call.
+3. **Record per-trace and per-step aggregates** in `results.jsonl`.
+4. **Verdict** based on:
+   - Total cost < $5 ✅ / > $5 ❌
+   - p95 step latency < 30 s ✅ / > 30 s ❌
+   - p99 step latency < 60 s ✅ / > 60 s ❌
+
+## Files
+
+- `synthesize_trace.py` — generates 50 mock agentic states (fixture, no real model)
+- `replay.py` — calls 3 teachers in parallel for each state, logs cost+latency
+- `analyze.py` — aggregates `results.jsonl` into the verdict table
+- `results.jsonl` — raw per-call results (gitignored — has actual API responses)
+- `verdict.md` — final verdict + recommendation
+
+## Constraints
+
+- OpenRouter API key from `~/.hermes/.env` (already configured)
+- Hard-cap at $20 spend; abort if exceeded
+- Use `httpx` async client for parallel teacher calls (lib already in Hermes venv)
+
+## Teacher slugs (verified live on roster)
+
+- `anthropic/claude-opus-4.7` (Opus 4.7) — primary frontier, $15/$75 per Mtok
+- `openai/gpt-5` — OpenAI frontier, $1.25/$10 per Mtok
+- `deepseek/deepseek-v4-pro` — open-weight frontier, $1.10/$4.40 per Mtok
+
+(Verified against `~/wiki/_meta/openrouter-roster.md` 🟢 live frontier list before run.)
+
+## Verdict (TBD)
+
+Will be appended to this file in `## Verdict: VALIDATED | PARTIAL | INVALIDATED` format per the `spike` skill.
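Step 2 of the Approach in the README above queries all three teachers in parallel, so the wallclock for a step should track the slowest teacher, not the sum of all three. A minimal sketch of that fan-out, assuming simulated teachers: `FAKE_TEACHERS`, `fake_call`, and `replay_step` are hypothetical names, and `asyncio.sleep` stands in for the HTTP round trip so the example runs without an API key.

```python
import asyncio
import time

# Hypothetical stand-ins for the three teachers: name -> simulated latency (s).
FAKE_TEACHERS = {"teacher-a": 0.05, "teacher-b": 0.10, "teacher-c": 0.02}


async def fake_call(name: str, delay_s: float) -> dict:
    """Simulate one teacher call and time it the way a real call would be timed."""
    t0 = time.perf_counter()
    await asyncio.sleep(delay_s)  # stands in for the HTTP round trip
    return {"teacher": name, "latency_s": time.perf_counter() - t0}


async def replay_step() -> tuple[list[dict], float]:
    """Fan out to all teachers concurrently; return the rows and the step wallclock."""
    t0 = time.perf_counter()
    rows = await asyncio.gather(
        *(fake_call(name, delay) for name, delay in FAKE_TEACHERS.items())
    )
    return list(rows), time.perf_counter() - t0


rows, wallclock_s = asyncio.run(replay_step())
slowest_s = max(r["latency_s"] for r in rows)  # what a parallel step costs in time
serial_s = sum(r["latency_s"] for r in rows)   # what a sequential loop would cost
print(f"wallclock={wallclock_s:.3f}s slowest={slowest_s:.3f}s serial={serial_s:.3f}s")
```

This is why the per-step latency thresholds are best read against the maximum latency across teachers: adding a fast teacher barely moves step latency, while one slow teacher sets it.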
@@ -0,0 +1,200 @@
+"""analyze.py — Aggregate results.jsonl into a verdict table."""
+
+import json
+import statistics
+from collections import defaultdict
+from pathlib import Path
+
+RESULTS_PATH = Path(__file__).parent / "results.jsonl"
+VERDICT_PATH = Path(__file__).parent / "verdict.md"
+
+
+def load_results():
+    if not RESULTS_PATH.exists():
+        raise SystemExit(f"No results at {RESULTS_PATH}. Run replay.py first.")
+    rows = [json.loads(l) for l in RESULTS_PATH.read_text().splitlines() if l.strip()]
+    return rows
+
+
+def analyze(rows):
+    by_teacher = defaultdict(list)
+    by_state = defaultdict(list)
+    for r in rows:
+        if r["error"] is not None:
+            continue
+        by_teacher[r["teacher_slug"]].append(r)
+        by_state[r["state_id"]].append(r)
+    return by_teacher, by_state
+
+
+def pct(values, p):
+    s = sorted(values)
+    if not s:
+        return 0.0
+    k = (len(s) - 1) * p / 100
+    f, c = int(k), min(int(k) + 1, len(s) - 1)
+    return s[f] + (s[c] - s[f]) * (k - f)
+
+
+def teacher_summary(by_teacher):
+    rows_out = []
+    for slug, calls in sorted(by_teacher.items()):
+        n = len(calls)
+        latencies = [c["latency_s"] for c in calls]
+        costs = [c["cost_usd"] for c in calls]
+        prompt_tok = [c["prompt_tokens"] for c in calls]
+        comp_tok = [c["completion_tokens"] for c in calls]
+        rows_out.append(
+            {
+                "slug": slug,
+                "n": n,
+                "p50_lat_s": round(statistics.median(latencies), 2) if latencies else 0,
+                "p95_lat_s": round(pct(latencies, 95), 2) if latencies else 0,
+                "p99_lat_s": round(pct(latencies, 99), 2) if latencies else 0,
+                "mean_cost_usd": round(statistics.mean(costs), 6) if costs else 0,
+                "total_cost_usd": round(sum(costs), 4),
+                "mean_prompt_tok": round(statistics.mean(prompt_tok)) if prompt_tok else 0,
+                "mean_comp_tok": round(statistics.mean(comp_tok)) if comp_tok else 0,
+            }
+        )
+    return rows_out
+
+
+def per_state_summary(by_state):
+    """Aggregate cost across all 3 teachers per state — this is what 'per trace step' costs."""
+    step_costs = []
+    step_lats = []  # latency = max across the 3 teachers (parallel call wallclock)
+    for state_id, calls in by_state.items():
+        # Only include states where ALL teachers completed
+        # (else cost+latency are misleading)
+        if len(calls) < 3:
+            continue
+        step_costs.append(sum(c["cost_usd"] for c in calls))
+        step_lats.append(max(c["latency_s"] for c in calls))
+    return {
+        "n_complete_steps": len(step_costs),
+        "p50_step_cost_usd": round(statistics.median(step_costs), 5) if step_costs else 0,
+        "p95_step_cost_usd": round(pct(step_costs, 95), 5) if step_costs else 0,
+        "mean_step_cost_usd": round(statistics.mean(step_costs), 5) if step_costs else 0,
+        "p50_step_lat_s": round(statistics.median(step_lats), 2) if step_lats else 0,
+        "p95_step_lat_s": round(pct(step_lats, 95), 2) if step_lats else 0,
+        "p99_step_lat_s": round(pct(step_lats, 99), 2) if step_lats else 0,
+    }
+
+
+def project_to_trace_cost(per_step, steps_per_trace=50):
+    """Per the spike thesis: 50 steps per trace, 3 teachers each. Project total."""
+    return {
+        "steps_per_trace": steps_per_trace,
+        "p50_trace_cost_usd": round(per_step["p50_step_cost_usd"] * steps_per_trace, 4),
+        "p95_trace_cost_usd": round(per_step["p95_step_cost_usd"] * steps_per_trace, 4),
+        "mean_trace_cost_usd": round(per_step["mean_step_cost_usd"] * steps_per_trace, 4),
+    }
+
+
+def verdict_str(per_step_p95_lat, per_step_p99_lat, mean_trace_cost):
+    """Return ✅ VALIDATED | ⚠️ PARTIAL | ❌ INVALIDATED + reason."""
+    cost_pass = mean_trace_cost < 5.0
+    p95_pass = per_step_p95_lat < 30.0
+    p99_pass = per_step_p99_lat < 60.0
+    n_pass = sum([cost_pass, p95_pass, p99_pass])
+    if n_pass == 3:
+        return "✅ VALIDATED", "All three thresholds met."
+    elif n_pass == 0:
+        return "❌ INVALIDATED", "All three thresholds violated — channel is unviable as designed."
+    else:
+        notes = []
+        if not cost_pass:
+            notes.append(f"cost ${mean_trace_cost:.2f}/trace exceeds $5 cap")
+        if not p95_pass:
+            notes.append(f"p95 latency {per_step_p95_lat:.1f}s exceeds 30s cap")
+        if not p99_pass:
+            notes.append(f"p99 latency {per_step_p99_lat:.1f}s exceeds 60s cap")
+        return "⚠️ PARTIAL", "; ".join(notes)
+
+
+def main():
+    rows = load_results()
+    by_teacher, by_state = analyze(rows)
+    teachers = teacher_summary(by_teacher)
+    per_step = per_state_summary(by_state)
+    per_trace = project_to_trace_cost(per_step, steps_per_trace=50)
+
+    n_total = sum(t["n"] for t in teachers)
+    n_errors = sum(1 for r in rows if r["error"] is not None)
+    total_cost = sum(t["total_cost_usd"] for t in teachers)
+
+    verdict, reason = verdict_str(
+        per_step["p95_step_lat_s"], per_step["p99_step_lat_s"],
+        per_trace["mean_trace_cost_usd"],
+    )
+
+    md = []
+    md.append("# Spike 001 — Teacher Replay Cost & Latency Floor — VERDICT")
+    md.append("")
+    md.append(f"**Verdict:** {verdict}")
+    md.append(f"**Reason:** {reason}")
+    md.append("")
+    md.append(f"Calls completed: {n_total} ({n_errors} errors)")
+    md.append(f"Total spike spend: ${total_cost:.4f}")
+    md.append("")
+    md.append("## Per-teacher")
+    md.append("")
+    md.append("| Teacher | n | p50 lat | p95 lat | p99 lat | mean $ | total $ | mean prompt tok | mean comp tok |")
+    md.append("|---|---|---|---|---|---|---|---|---|")
+    for t in teachers:
+        md.append(
+            f"| `{t['slug']}` | {t['n']} | {t['p50_lat_s']}s | {t['p95_lat_s']}s | {t['p99_lat_s']}s "
+            f"| ${t['mean_cost_usd']:.6f} | ${t['total_cost_usd']:.4f} "
+            f"| {t['mean_prompt_tok']} | {t['mean_comp_tok']} |"
+        )
+    md.append("")
+    md.append("## Per-step (parallel 3-teacher call)")
+    md.append("")
+    md.append(f"- Complete states (all 3 teachers replied): **{per_step['n_complete_steps']}**")
+    md.append(f"- Median step cost (sum across 3 teachers): **${per_step['p50_step_cost_usd']:.5f}**")
+    md.append(f"- p95 step cost: ${per_step['p95_step_cost_usd']:.5f}")
+    md.append(f"- Median step wallclock latency (max across teachers): **{per_step['p50_step_lat_s']}s**")
+    md.append(f"- p95 step latency: {per_step['p95_step_lat_s']}s")
+    md.append(f"- p99 step latency: {per_step['p99_step_lat_s']}s")
+    md.append("")
+    md.append("## Projected per-trace (50 steps × 3 teachers, ungated)")
+    md.append("")
+    md.append(f"- Mean: **${per_trace['mean_trace_cost_usd']:.4f}**")
+    md.append(f"- p50: ${per_trace['p50_trace_cost_usd']:.4f}")
+    md.append(f"- p95: ${per_trace['p95_trace_cost_usd']:.4f}")
+    md.append("")
+    md.append("## Thresholds")
+    md.append("")
+    md.append("| Threshold | Target | Actual | Pass? |")
+    md.append("|---|---|---|---|")
+    md.append(f"| Mean per-trace cost | < $5 | ${per_trace['mean_trace_cost_usd']:.4f} | "
+              f"{'✅' if per_trace['mean_trace_cost_usd'] < 5.0 else '❌'} |")
+    md.append(f"| p95 step latency | < 30 s | {per_step['p95_step_lat_s']}s | "
+              f"{'✅' if per_step['p95_step_lat_s'] < 30.0 else '❌'} |")
+    md.append(f"| p99 step latency | < 60 s | {per_step['p99_step_lat_s']}s | "
+              f"{'✅' if per_step['p99_step_lat_s'] < 60.0 else '❌'} |")
+    md.append("")
+    md.append("## Recommendation")
+    md.append("")
+    if "VALIDATED" in verdict:
+        md.append("- Proceed to spike 002 (trace-collection-trl) and 003 (DPO-pair extraction).")
+        md.append("- Cost is acceptable even at v0.0's ungated baseline; VOI gating in v0.1 will buy headroom.")
+        md.append("- Use the per-teacher latency table to decide whether any teacher is too slow to keep in the rotation.")
+    elif "PARTIAL" in verdict:
+        md.append("- Consider mitigations before proceeding:")
+        md.append(" - VOI gating (skip teachers when student entropy is low) — 60–80% cost savings")
+        md.append(" - Tiered teachers (cheap teacher first, escalate on disagreement) — 2–3× savings")
+        md.append(" - Drop the most expensive teacher, run with N=2")
+        md.append(" - Reduce `MAX_TOKENS` in replay.py from 200 to 100")
+    else:
+        md.append("- Channel is unviable as designed. Reorient framework to Composer-only recipe.")
+        md.append("- Trace-replay-distillation as a research direction is closed.")
+
+    VERDICT_PATH.write_text("\n".join(md) + "\n")
+    print("\n".join(md))
+    print(f"\nWrote verdict to {VERDICT_PATH}")
+
+
+if __name__ == "__main__":
+    main()
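The `pct` helper in `analyze.py` above interpolates linearly between the two nearest order statistics. A minimal sketch that cross-checks it against the standard library, assuming made-up latencies; the `latencies` list is illustrative data and is not drawn from `results.jsonl`.

```python
import math
import statistics


def pct(values, p):
    """Linear-interpolation percentile, same formula as pct() in analyze.py."""
    s = sorted(values)
    if not s:
        return 0.0
    k = (len(s) - 1) * p / 100           # fractional index into the sorted list
    f, c = int(k), min(int(k) + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)


# Made-up latencies in seconds, for illustration only.
latencies = [4.2, 7.9, 11.3, 12.8, 15.0, 18.4, 20.1, 23.5]

# quantiles(..., n=100) returns 99 cut points; index p - 1 is the p-th percentile.
# The "inclusive" method treats the data as containing its own min and max,
# which matches the (len - 1) * p / 100 indexing used by pct().
cuts = statistics.quantiles(latencies, n=100, method="inclusive")
for p in (50, 95, 99):
    assert math.isclose(pct(latencies, p), cuts[p - 1])
    print(f"p{p}: {pct(latencies, p):.3f}s")
```

With only 50 steps per run, p99 is interpolated between the two largest samples, so it is close to the observed maximum and should be read as a rough tail estimate.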
@@ -0,0 +1,194 @@
+"""replay.py — Query N=3 frozen teachers in parallel for each state.
+
+For each state in states.jsonl:
+  1. Send the messages to all 3 teachers concurrently via OpenRouter.
+  2. Capture: completion text, prompt+completion tokens, latency, $ cost.
+  3. Append per-call result row to results.jsonl.
+
+Hard-cap at $20 total spend; abort early if exceeded.
+"""
+
+import asyncio
+import json
+import os
+import sys
+import time
+from pathlib import Path
+
+import httpx
+
+# ----------------------------------------------------------------------------
+# Configuration
+# ----------------------------------------------------------------------------
+
+# Load OPENROUTER_API_KEY from ~/.hermes/.env if not in env
+HERMES_ENV = Path.home() / ".hermes" / ".env"
+if HERMES_ENV.exists() and "OPENROUTER_API_KEY" not in os.environ:
+    for line in HERMES_ENV.read_text().splitlines():
+        line = line.strip()
+        if line.startswith("OPENROUTER_API_KEY="):
+            os.environ["OPENROUTER_API_KEY"] = line.split("=", 1)[1].strip().strip('"').strip("'")
+            break
+
+API_KEY = os.environ.get("OPENROUTER_API_KEY")
+if not API_KEY:
+    print("ERROR: OPENROUTER_API_KEY not set", file=sys.stderr)
+    sys.exit(2)
+
+OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
+
+# Teacher slugs and per-Mtok prices (verified live on roster 2026-05-25).
+# Cost = (prompt_tokens / 1_000_000) * input_price + (completion_tokens / 1_000_000) * output_price
+TEACHERS = [
+    {
+        "slug": "anthropic/claude-opus-4.7",
+        "input_per_mtok": 15.0,
+        "output_per_mtok": 75.0,
+    },
+    {
+        "slug": "openai/gpt-5",
+        "input_per_mtok": 1.25,
+        "output_per_mtok": 10.0,
+    },
+    {
+        "slug": "deepseek/deepseek-v4-pro",
+        "input_per_mtok": 1.10,
+        "output_per_mtok": 4.40,
+    },
+]
+
+MAX_TOTAL_USD = 20.0  # hard cap
+MAX_TOKENS = 200  # constrain output to keep cost predictable
+
+STATES_PATH = Path(__file__).parent / "states.jsonl"
+RESULTS_PATH = Path(__file__).parent / "results.jsonl"
+
+
+# ----------------------------------------------------------------------------
+# Per-call worker
+# ----------------------------------------------------------------------------
+
+async def call_teacher(
+    client: httpx.AsyncClient,
+    state: dict,
+    teacher: dict,
+) -> dict:
+    """Issue one teacher call. Returns a result row dict."""
+    payload = {
+        "model": teacher["slug"],
+        "messages": state["messages"],
+        "max_tokens": MAX_TOKENS,
+        "temperature": 0.2,
+    }
+    headers = {
+        "Authorization": f"Bearer {API_KEY}",
+        "Content-Type": "application/json",
+        "HTTP-Referer": "https://huggingface.co/Codeseys/composer-replication-framework",
+        "X-Title": "composer-replication-framework spike-001",
+    }
+    t0 = time.perf_counter()
+    err = None
+    response_text = None
+    prompt_tokens = 0
+    completion_tokens = 0
+    served_model = None
+    try:
+        r = await client.post(OPENROUTER_URL, json=payload, headers=headers, timeout=120.0)
+        r.raise_for_status()
+        data = r.json()
+        response_text = data["choices"][0]["message"]["content"]
+        served_model = data.get("model", teacher["slug"])
+        usage = data.get("usage", {})
+        prompt_tokens = usage.get("prompt_tokens", 0)
+        completion_tokens = usage.get("completion_tokens", 0)
+    except Exception as e:
+        err = repr(e)[:300]
+    t1 = time.perf_counter()
+    latency_s = round(t1 - t0, 3)
+    cost_usd = (
+        (prompt_tokens / 1_000_000) * teacher["input_per_mtok"]
+        + (completion_tokens / 1_000_000) * teacher["output_per_mtok"]
+    )
+    return {
+        "state_id": state["id"],
+        "teacher_slug": teacher["slug"],
+        "served_model": served_model,
+        "latency_s": latency_s,
+        "prompt_tokens": prompt_tokens,
+        "completion_tokens": completion_tokens,
+        "cost_usd": round(cost_usd, 6),
+        "response_text": response_text,
+        "error": err,
+    }
+
+
+# ----------------------------------------------------------------------------
+# Main loop
+# ----------------------------------------------------------------------------
+
+async def main():
+    if not STATES_PATH.exists():
+        print(f"ERROR: {STATES_PATH} not found. Run synthesize_trace.py first.", file=sys.stderr)
+        sys.exit(2)
+
+    states = [json.loads(l) for l in STATES_PATH.read_text().splitlines() if l.strip()]
+    print(f"Loaded {len(states)} states. {len(TEACHERS)} teachers. {len(states) * len(TEACHERS)} calls total.")
+
+    # Resume support — skip states that already have results for all teachers
+    done_state_teacher = set()
+    if RESULTS_PATH.exists():
+        for line in RESULTS_PATH.read_text().splitlines():
+            try:
+                row = json.loads(line)
+                if row.get("error") is None:
+                    done_state_teacher.add((row["state_id"], row["teacher_slug"]))
+            except Exception:
+                pass
+    if done_state_teacher:
+        print(f"Resume: {len(done_state_teacher)} successful calls already in results.jsonl")
+
+    total_cost = 0.0
+    n_calls = 0
+    n_skipped = 0
+
+    async with httpx.AsyncClient() as client:
+        with RESULTS_PATH.open("a") as out:
+            for state in states:
+                # Compose only the teacher calls not yet done
+                tasks = []
+                for teacher in TEACHERS:
+                    if (state["id"], teacher["slug"]) in done_state_teacher:
+                        n_skipped += 1
+                        continue
+                    tasks.append(call_teacher(client, state, teacher))
+                if not tasks:
+                    continue
+
+                results = await asyncio.gather(*tasks)
+                for row in results:
+                    out.write(json.dumps(row) + "\n")
+                    out.flush()
+                    if row["error"] is None:
+                        total_cost += row["cost_usd"]
+                        n_calls += 1
+                    else:
+                        print(f" ERROR {state['id']} {row['teacher_slug']}: {row['error']}")
+
+                state_summary = " | ".join(
+                    f"{r['teacher_slug'].split('/')[-1]:<22} {r['latency_s']:>5.2f}s ${r['cost_usd']:.4f}"
+                    f" {r['prompt_tokens']:>4}+{r['completion_tokens']:<3}t"
+                    + (" ERR" if r["error"] else "")
+                    for r in results
+                )
+                print(f"{state['id']} {state_summary} | cum=${total_cost:.4f}")
+
+                if total_cost > MAX_TOTAL_USD:
+                    print(f"\n!! Hit MAX_TOTAL_USD=${MAX_TOTAL_USD:.2f} cap, aborting after {n_calls} new calls.")
+                    break
+
+    print(f"\nDone. {n_calls} new calls, {n_skipped} resumed. Total cost: ${total_cost:.4f}")
+    print(f"Results in: {RESULTS_PATH}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
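In `replay.py` above, the spend cap is compared after a step's calls have already been billed, and `total_cost` starts from zero on a resumed run. One possible complement is a pre-flight bound computed from the same price table. A minimal sketch, assuming billed completion tokens never exceed `MAX_TOKENS`; `worst_case_step_cost` and `can_dispatch` are hypothetical helpers that are not part of this commit, and the 300-token prompt size is an illustrative guess.

```python
MAX_TOKENS = 200      # same output cap as replay.py
MAX_TOTAL_USD = 20.0  # same hard cap as replay.py

# (input $/Mtok, output $/Mtok), copied from the TEACHERS table in replay.py.
PRICES = {
    "anthropic/claude-opus-4.7": (15.0, 75.0),
    "openai/gpt-5": (1.25, 10.0),
    "deepseek/deepseek-v4-pro": (1.10, 4.40),
}


def call_cost_usd(prompt_tokens, completion_tokens, input_per_mtok, output_per_mtok):
    """Same cost formula as replay.py: tokens scaled to millions times the price."""
    return (
        (prompt_tokens / 1_000_000) * input_per_mtok
        + (completion_tokens / 1_000_000) * output_per_mtok
    )


def worst_case_step_cost(prompt_tokens):
    """Upper bound for one step: every teacher emits the full MAX_TOKENS."""
    return sum(
        call_cost_usd(prompt_tokens, MAX_TOKENS, inp, out)
        for inp, out in PRICES.values()
    )


def can_dispatch(spent_usd, prompt_tokens):
    """Hypothetical pre-flight check: does this step keep spend within the cap?"""
    return spent_usd + worst_case_step_cost(prompt_tokens) <= MAX_TOTAL_USD


step = worst_case_step_cost(300)  # 300 prompt tokens is an assumed size
print(f"worst-case step: ${step:.6f}, 50 steps: ${step * 50:.2f}")
```

Under these assumptions the bound comes to about $0.023 per step, or about $1.15 for 50 steps, which is the same order of magnitude as the $0.98 measured in the spike.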
@@ -0,0 +1,50 @@
| 1 |
+
{"id": "state-000", "task": "Fix the failing test in tests/test_auth.py::test_login_with_email", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Fix the failing test in tests/test_auth.py::test_login_with_email"}, {"role": "assistant", "content": "Progress so far: Read auth.py:42-90 and saw the login function uses `username` field. Read test file and saw test calls login(email='x@y.com'). Schema mismatch confirmed."}, {"role": "user", "content": "What's the next tool call? (a) edit auth.py to accept email, (b) edit test to use username, (c) read more files to understand the schema convention."}]}
|
| 2 |
+
{"id": "state-001", "task": "Add rate-limiting middleware to the Flask app", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Add rate-limiting middleware to the Flask app"}, {"role": "assistant", "content": "Progress so far: Listed the project structure: app.py, routes/*.py, middleware/. No rate-limit lib in requirements.txt. Confirmed Flask-Limiter is the canonical choice via web search."}, {"role": "user", "content": "Next tool call? (a) pip-install flask-limiter, (b) write the middleware first then add to deps, (c) read app.py to see middleware registration pattern."}]}
|
| 3 |
+
{"id": "state-002", "task": "Refactor the parse_config function \u2014 it's 200 lines and has 3 responsibilities", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Refactor the parse_config function \u2014 it's 200 lines and has 3 responsibilities"}, {"role": "assistant", "content": "Progress so far: Read config.py:1-200. Identified 3 concerns: file IO (lines 1-50), schema validation (51-130), env-var override (131-200)."}, {"role": "user", "content": "Next? (a) extract each concern into its own function in same file, (b) split into 3 modules under config/, (c) write tests first to lock current behavior before refactor."}]}
{"id": "state-003", "task": "The CI is failing with 'pytest: command not found' on the new runner", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: The CI is failing with 'pytest: command not found' on the new runner"}, {"role": "assistant", "content": "Progress so far: Read .github/workflows/ci.yml. Found the runner uses ubuntu-latest. Saw setup-python action but no pytest install step."}, {"role": "user", "content": "Next? (a) add `pip install pytest` to the workflow, (b) move pytest to dev-dependencies in pyproject.toml and use `pip install -e '.[dev]'`, (c) check if poetry is being used and use poetry install instead."}]}
{"id": "state-004", "task": "Implement pagination on /api/users endpoint", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Implement pagination on /api/users endpoint"}, {"role": "assistant", "content": "Progress so far: Read routes/users.py. Current: `return jsonify(User.query.all())`. Read README \u2014 saw mention of cursor-based pagination preferred."}, {"role": "user", "content": "Next? (a) add ?page= and ?per_page= query args (offset-based), (b) implement cursor-based with last-id token, (c) check if there's a pagination utility already in utils/."}]}
{"id": "state-005", "task": "Migrate the SQLAlchemy models from declarative to ORM 2.0 style", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Migrate the SQLAlchemy models from declarative to ORM 2.0 style"}, {"role": "assistant", "content": "Progress so far: Read models/user.py. Current style uses `Column(Integer, primary_key=True)`. Read SQLAlchemy 2.0 docs \u2014 new style uses `Mapped[int] = mapped_column(...)`."}, {"role": "user", "content": "Next? (a) migrate user.py first as proof-of-concept, (b) write a migration script that does all models at once, (c) check if Alembic auto-generated migrations will break with the new style."}]}
{"id": "state-006", "task": "The webhook endpoint is dropping events under load", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: The webhook endpoint is dropping events under load"}, {"role": "assistant", "content": "Progress so far: Read webhooks/handler.py. Saw events processed synchronously. Looked at logs \u2014 confirmed timeouts at p95 = 8s. Spec says webhooks must respond <2s."}, {"role": "user", "content": "Next? (a) move processing to a background worker (Celery/RQ), (b) batch events with a 1s window before processing, (c) profile the handler first to find the actual bottleneck."}]}
{"id": "state-007", "task": "Add type hints to legacy code in lib/parser.py", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Add type hints to legacy code in lib/parser.py"}, {"role": "assistant", "content": "Progress so far: Read parser.py \u2014 800 lines, 0 type hints. Project uses pydantic v2 elsewhere. Saw mypy config in pyproject.toml is strict."}, {"role": "user", "content": "Next? (a) add hints function-by-function, top-down, (b) run pyright in inference mode to seed annotations, then refine, (c) skip lib/parser.py in mypy config and revisit later."}]}
{"id": "state-008", "task": "Reduce the Docker image size \u2014 it's 2.3 GB", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Reduce the Docker image size \u2014 it's 2.3 GB"}, {"role": "assistant", "content": "Progress so far: Read Dockerfile. Base = python:3.11 (full, ~900MB). Apt install of build-essential, gcc, postgres-client. pip install -r requirements.txt."}, {"role": "user", "content": "Next? (a) switch to python:3.11-slim, (b) multi-stage build with build-essential only in build stage, (c) use distroless image."}]}
{"id": "state-009", "task": "Add caching to the slow get_user_dashboard function", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Add caching to the slow get_user_dashboard function"}, {"role": "assistant", "content": "Progress so far: Read views/dashboard.py:get_user_dashboard. 5 DB queries, ~300ms. Project uses Redis for sessions; no @cache decorator pattern yet."}, {"role": "user", "content": "Next? (a) wrap with functools.lru_cache (in-process), (b) use Flask-Caching with Redis backend, (c) denormalize the query into a single JOIN first."}]}
{"id": "state-010", "task": "Investigate intermittent 500 errors in production /api/order/checkout", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Investigate intermittent 500 errors in production /api/order/checkout"}, {"role": "assistant", "content": "Progress so far: Read order_service.py. Read recent logs via tail. Saw 12 instances of UnboundLocalError in last 24h, all in inventory_check branch."}, {"role": "user", "content": "Next? (a) read inventory_check function first, (b) add a try/except with proper logging to bound the issue, (c) check git log for recent changes to that branch."}]}
{"id": "state-011", "task": "Generate API documentation from docstrings", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generate API documentation from docstrings"}, {"role": "assistant", "content": "Progress so far: Read docs/. Empty. Read pyproject.toml \u2014 sphinx and autoapi already in dev-deps but not configured. Read 5 source files \u2014 all have Google-style docstrings."}, {"role": "user", "content": "Next? (a) write conf.py with autoapi extension, (b) try mkdocs+mkdocstrings instead, (c) check if there's a docs branch with prior config."}]}
{"id": "state-012", "task": "The integration test test_payment_flow is flaky", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: The integration test test_payment_flow is flaky"}, {"role": "assistant", "content": "Progress so far: Read test_payment_flow.py. Uses real Stripe API in test mode. Looked at last 30 CI runs \u2014 fails ~15% of the time, all timeouts on Stripe webhook step."}, {"role": "user", "content": "Next? (a) mock Stripe at the boundary, (b) add @pytest.mark.flaky retry, (c) increase timeout from 5s to 30s."}]}
{"id": "state-013", "task": "Migrate from print() to structured logging", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Migrate from print() to structured logging"}, {"role": "assistant", "content": "Progress so far: Read 8 source files. Found 47 print() calls. Project uses structlog in main.py only. Logging config exists in config/logging.yaml."}, {"role": "user", "content": "Next? (a) sed replace print -> logger.info per file, (b) add a ruff/flake8 rule that bans print and fix the lint errors, (c) write a custom AST codemod for safe replacement."}]}
{"id": "state-014", "task": "Add OAuth2 social login (Google + GitHub)", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Add OAuth2 social login (Google + GitHub)"}, {"role": "assistant", "content": "Progress so far: Read auth/. Current: email+password only with bcrypt. Read pyproject \u2014 no oauth lib. Project uses Flask. Saw Authlib is canonical for Flask."}, {"role": "user", "content": "Next? (a) install authlib and write Google client first, (b) install authlib and write a generic OAuth abstraction supporting both, (c) read existing user model to plan account-linking schema."}]}
{"id": "state-015", "task": "The frontend can't connect to the backend after the API URL change", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: The frontend can't connect to the backend after the API URL change"}, {"role": "assistant", "content": "Progress so far: Read web/.env. API_URL still points to localhost:5000. Read web/src/api/client.ts \u2014 uses import.meta.env.VITE_API_URL."}, {"role": "user", "content": "Next? (a) update .env to new URL, (b) check if there's a .env.production that should be used, (c) check the deployment config to see how envs are propagated."}]}
{"id": "state-016", "task": "Add Sentry error tracking", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Add Sentry error tracking"}, {"role": "assistant", "content": "Progress so far: Read main.py \u2014 no sentry init. Read pyproject \u2014 sentry-sdk in deps but unused. Found SENTRY_DSN in env vars on staging."}, {"role": "user", "content": "Next? (a) add sentry_sdk.init() in main.py with env-aware DSN, (b) check if there's a config/sentry.py that should hold init logic, (c) write the integration with FlaskIntegration explicit."}]}
{"id": "state-017", "task": "Database migrations are out of sync between dev and staging", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Database migrations are out of sync between dev and staging"}, {"role": "assistant", "content": "Progress so far: Ran alembic current on dev: head = 8f3a2b. On staging: 7d1c9a. Read migrations/ \u2014 saw 8f3a2b has only schema changes (no data)."}, {"role": "user", "content": "Next? (a) run alembic upgrade head on staging, (b) check the schema diff first to confirm no data loss, (c) take a backup of staging DB then upgrade."}]}
{"id": "state-018", "task": "Reduce p99 latency on /api/search from 1.2s to <300ms", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Reduce p99 latency on /api/search from 1.2s to <300ms"}, {"role": "assistant", "content": "Progress so far: Read search/handler.py. Saw raw SQL with LIKE '%term%'. Read schema \u2014 no full-text index on the searched columns."}, {"role": "user", "content": "Next? (a) add tsvector full-text index on PostgreSQL, (b) integrate Meilisearch as separate search service, (c) profile the query first to find if it's the LIKE or N+1."}]}
{"id": "state-019", "task": "The new feature flag system needs to support per-user overrides", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: The new feature flag system needs to support per-user overrides"}, {"role": "assistant", "content": "Progress so far: Read features/flags.py. Current: env-var-based, global on/off. Read users model \u2014 has user_id and group_id fields."}, {"role": "user", "content": "Next? (a) add a flag_overrides table keyed by (flag_name, user_id), (b) integrate Unleash or Flagsmith service, (c) build a simple JSON-in-DB override mechanism first."}]}
{"id": "state-020", "task": "Generated agentic state #0: Fix S3 upload retry logic", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #0: Fix S3 upload retry logic"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 0: Read uploader.py, no backoff seen"}, {"role": "user", "content": "Tool-call decision point 0: Add tenacity or hand-roll exponential backoff?"}]}
{"id": "state-021", "task": "Generated agentic state #1: Add JWT refresh tokens", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #1: Add JWT refresh tokens"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 1: Read auth.py, only access tokens exist"}, {"role": "user", "content": "Tool-call decision point 1: Standalone refresh table or JWT denylist?"}]}
{"id": "state-022", "task": "Generated agentic state #2: Reduce CI from 12min to <5min", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #2: Reduce CI from 12min to <5min"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 2: Read .github/ci.yml, single job"}, {"role": "user", "content": "Tool-call decision point 2: Split into matrix or use cache?"}]}
{"id": "state-023", "task": "Generated agentic state #3: Migrate from npm to pnpm", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #3: Migrate from npm to pnpm"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 3: Read package.json, no workspaces"}, {"role": "user", "content": "Tool-call decision point 3: Move to monorepo first or do straight swap?"}]}
{"id": "state-024", "task": "Generated agentic state #4: Add Prometheus metrics", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #4: Add Prometheus metrics"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 4: Read app, no /metrics endpoint"}, {"role": "user", "content": "Tool-call decision point 4: Use prometheus_client lib or middleware?"}]}
{"id": "state-025", "task": "Generated agentic state #5: The websocket disconnects after 60s", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #5: The websocket disconnects after 60s"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 5: Read ws server, no ping config"}, {"role": "user", "content": "Tool-call decision point 5: Configure ping_interval or use heartbeat at app level?"}]}
{"id": "state-026", "task": "Generated agentic state #6: Migrate Redis from v6 to v7", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #6: Migrate Redis from v6 to v7"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 6: Read docker-compose.yml, redis:6-alpine"}, {"role": "user", "content": "Tool-call decision point 6: Test in dev branch first or upgrade in place?"}]}
{"id": "state-027", "task": "Generated agentic state #7: Add OpenAPI spec generation", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #7: Add OpenAPI spec generation"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 7: Read api/, manual flask routes"}, {"role": "user", "content": "Tool-call decision point 7: Switch to FastAPI or use apispec for Flask?"}]}
{"id": "state-028", "task": "Generated agentic state #8: The cron job missed 2 runs last week", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #8: The cron job missed 2 runs last week"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 8: Read cron config, single-instance"}, {"role": "user", "content": "Tool-call decision point 8: Add HA cron solution or just retry-on-failure?"}]}
{"id": "state-029", "task": "Generated agentic state #9: Reduce memory usage in worker.py", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #9: Reduce memory usage in worker.py"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 9: Read worker, sees 2GB resident"}, {"role": "user", "content": "Tool-call decision point 9: Profile with memray or guess at top suspects?"}]}
{"id": "state-030", "task": "Generated agentic state #10: Add rate limit per IP not per user", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #10: Add rate limit per IP not per user"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 10: Read limiter config, user-keyed"}, {"role": "user", "content": "Tool-call decision point 10: Switch to IP key or composite (IP+user)?"}]}
{"id": "state-031", "task": "Generated agentic state #11: The TLS cert expires in 7 days", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #11: The TLS cert expires in 7 days"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 11: Read prod config, manual cert"}, {"role": "user", "content": "Tool-call decision point 11: Switch to Let's Encrypt or extend manually?"}]}
{"id": "state-032", "task": "Generated agentic state #12: Migrate logs from local files to CloudWatch", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #12: Migrate logs from local files to CloudWatch"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 12: Read logging config, FileHandler"}, {"role": "user", "content": "Tool-call decision point 12: watchtower lib or fluentd sidecar?"}]}
{"id": "state-033", "task": "Generated agentic state #13: Add CSRF protection to forms", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #13: Add CSRF protection to forms"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 13: Read web/, no CSRF tokens"}, {"role": "user", "content": "Tool-call decision point 13: Flask-WTF or write minimal token middleware?"}]}
{"id": "state-034", "task": "Generated agentic state #14: The tests pass locally but fail on CI", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #14: The tests pass locally but fail on CI"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 14: Saw fail in test_dates.py, timezone issue"}, {"role": "user", "content": "Tool-call decision point 14: Set TZ=UTC in CI or fix tests?"}]}
{"id": "state-035", "task": "Generated agentic state #15: Add E2E test for the checkout flow", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #15: Add E2E test for the checkout flow"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 15: Read tests/, no Playwright/Cypress"}, {"role": "user", "content": "Tool-call decision point 15: Playwright or Cypress?"}]}
{"id": "state-036", "task": "Generated agentic state #16: Reduce Docker build cache miss rate", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #16: Reduce Docker build cache miss rate"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 16: Read Dockerfile, COPY . . at top"}, {"role": "user", "content": "Tool-call decision point 16: Reorder for better caching or BuildKit cache mounts?"}]}
{"id": "state-037", "task": "Generated agentic state #17: The user-uploaded images aren't being resized", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #17: The user-uploaded images aren't being resized"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 17: Read upload handler, no Pillow call"}, {"role": "user", "content": "Tool-call decision point 17: Sync resize on upload or async via queue?"}]}
{"id": "state-038", "task": "Generated agentic state #18: Add audit log for admin actions", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #18: Add audit log for admin actions"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 18: Read admin/, no logging"}, {"role": "user", "content": "Tool-call decision point 18: Decorator-based or event sourcing?"}]}
{"id": "state-039", "task": "Generated agentic state #19: Migrate from black to ruff format", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #19: Migrate from black to ruff format"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 19: Read pyproject, black configured"}, {"role": "user", "content": "Tool-call decision point 19: Direct swap or run both during transition?"}]}
{"id": "state-040", "task": "Generated agentic state #20: Add backup verification", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #20: Add backup verification"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 20: Read backup script, no restore test"}, {"role": "user", "content": "Tool-call decision point 20: Test by restoring to staging or use checksums?"}]}
{"id": "state-041", "task": "Generated agentic state #21: Reduce database connection count", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #21: Reduce database connection count"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 21: Saw 200 idle conns, pool size 50"}, {"role": "user", "content": "Tool-call decision point 21: PgBouncer or fix connection leak first?"}]}
{"id": "state-042", "task": "Generated agentic state #22: Add feature toggle UI for non-engineers", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #22: Add feature toggle UI for non-engineers"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 22: Read flags system, env-var only"}, {"role": "user", "content": "Tool-call decision point 22: Build admin page or use third-party tool?"}]}
{"id": "state-043", "task": "Generated agentic state #23: The API returns 500 on POST with empty body", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #23: The API returns 500 on POST with empty body"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 23: Read handler, json.loads on body"}, {"role": "user", "content": "Tool-call decision point 23: Validate body length or use pydantic?"}]}
{"id": "state-044", "task": "Generated agentic state #24: Add SLI/SLO definitions", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #24: Add SLI/SLO definitions"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 24: Read monitoring, only uptime tracked"}, {"role": "user", "content": "Tool-call decision point 24: Define p95 latency SLO or error rate SLO first?"}]}
{"id": "state-045", "task": "Generated agentic state #25: Reduce S3 cost on infrequent files", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #25: Reduce S3 cost on infrequent files"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 25: Read storage, all in STANDARD"}, {"role": "user", "content": "Tool-call decision point 25: Lifecycle to IA or Glacier?"}]}
{"id": "state-046", "task": "Generated agentic state #26: Migrate from cron to Kubernetes CronJob", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #26: Migrate from cron to Kubernetes CronJob"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 26: Read deploy, classic cron"}, {"role": "user", "content": "Tool-call decision point 26: Direct migrate or run both for a week?"}]}
{"id": "state-047", "task": "Generated agentic state #27: Add request tracing with OpenTelemetry", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #27: Add request tracing with OpenTelemetry"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 27: Read app, no instrumentation"}, {"role": "user", "content": "Tool-call decision point 27: Auto-instrument or manual spans?"}]}
{"id": "state-048", "task": "Generated agentic state #28: The notification service is dropping messages", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #28: The notification service is dropping messages"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 28: Read service, fire-and-forget"}, {"role": "user", "content": "Tool-call decision point 28: Add ack pattern or use durable queue?"}]}
{"id": "state-049", "task": "Generated agentic state #29: Reduce Sentry noise from non-actionable errors", "messages": [{"role": "system", "content": "You are a senior software engineer working as an agentic coding assistant. You have access to tool calls (read_file, edit_file, search_files, run_tests, web_search). Your job at each step is to pick the best next tool call given the current state. Reply with: (a) which option (a/b/c) you'd pick, (b) one sentence explaining why, (c) the exact tool call you'd make."}, {"role": "user", "content": "Task: Generated agentic state #29: Reduce Sentry noise from non-actionable errors"}, {"role": "assistant", "content": "Progress so far: Mid-rollout context 29: Saw 500 events/day, 80% same TypeError"}, {"role": "user", "content": "Tool-call decision point 29: Filter at sample rate or fix the bug?"}]}
@@ -0,0 +1,271 @@
"""synthesize_trace.py — Generate 50 mock agentic-coding states for cost-floor testing.

We don't need a real student model to measure teacher API cost/latency. Use
hand-written agentic-coding states that look like a SWE-bench-lite agent
mid-rollout, ~250-500 tokens of context per state, paired with a tool-call
choice point.

Output: states.jsonl with one state per line.
"""

import json
from pathlib import Path

# ----------------------------------------------------------------------------
# 50 agentic-coding states. Each state is a partial conversation (system +
# user task + N assistant/tool turns) ending at a tool-call decision point.
# Variety: Python, JS, debugging, refactor, search-and-replace, test-failure-fix.
# ----------------------------------------------------------------------------

TASK_TEMPLATES = [
    # (task description, mid-rollout context, decision-point query)
    (
        "Fix the failing test in tests/test_auth.py::test_login_with_email",
        "Read auth.py:42-90 and saw the login function uses `username` field. "
        "Read test file and saw test calls login(email='x@y.com'). "
        "Schema mismatch confirmed.",
        "What's the next tool call? (a) edit auth.py to accept email, "
        "(b) edit test to use username, (c) read more files to understand "
        "the schema convention.",
    ),
    (
        "Add rate-limiting middleware to the Flask app",
        "Listed the project structure: app.py, routes/*.py, middleware/. "
        "No rate-limit lib in requirements.txt. Confirmed Flask-Limiter is the "
        "canonical choice via web search.",
        "Next tool call? (a) pip-install flask-limiter, (b) write the middleware "
        "first then add to deps, (c) read app.py to see middleware registration "
        "pattern.",
    ),
    (
        "Refactor the parse_config function — it's 200 lines and has 3 responsibilities",
        "Read config.py:1-200. Identified 3 concerns: file IO (lines 1-50), "
        "schema validation (51-130), env-var override (131-200).",
        "Next? (a) extract each concern into its own function in same file, "
        "(b) split into 3 modules under config/, (c) write tests first to "
        "lock current behavior before refactor.",
    ),
    (
        "The CI is failing with 'pytest: command not found' on the new runner",
        "Read .github/workflows/ci.yml. Found the runner uses ubuntu-latest. "
        "Saw setup-python action but no pytest install step.",
        "Next? (a) add `pip install pytest` to the workflow, (b) move pytest "
        "to dev-dependencies in pyproject.toml and use `pip install -e '.[dev]'`, "
        "(c) check if poetry is being used and use poetry install instead.",
    ),
    (
        "Implement pagination on /api/users endpoint",
        "Read routes/users.py. Current: `return jsonify(User.query.all())`. "
        "Read README — saw mention of cursor-based pagination preferred.",
        "Next? (a) add ?page= and ?per_page= query args (offset-based), "
        "(b) implement cursor-based with last-id token, (c) check if there's "
        "a pagination utility already in utils/.",
    ),
    (
        "Migrate the SQLAlchemy models from declarative to ORM 2.0 style",
        "Read models/user.py. Current style uses `Column(Integer, primary_key=True)`. "
        "Read SQLAlchemy 2.0 docs — new style uses `Mapped[int] = mapped_column(...)`.",
        "Next? (a) migrate user.py first as proof-of-concept, (b) write a "
        "migration script that does all models at once, (c) check if Alembic "
        "auto-generated migrations will break with the new style.",
    ),
    (
        "The webhook endpoint is dropping events under load",
        "Read webhooks/handler.py. Saw events processed synchronously. "
        "Looked at logs — confirmed timeouts at p95 = 8s. Spec says webhooks "
        "must respond <2s.",
        "Next? (a) move processing to a background worker (Celery/RQ), "
        "(b) batch events with a 1s window before processing, (c) profile "
        "the handler first to find the actual bottleneck.",
    ),
    (
        "Add type hints to legacy code in lib/parser.py",
        "Read parser.py — 800 lines, 0 type hints. Project uses pydantic v2 "
        "elsewhere. Saw mypy config in pyproject.toml is strict.",
        "Next? (a) add hints function-by-function, top-down, "
        "(b) run pyright in inference mode to seed annotations, then refine, "
        "(c) skip lib/parser.py in mypy config and revisit later.",
    ),
    (
        "Reduce the Docker image size — it's 2.3 GB",
        "Read Dockerfile. Base = python:3.11 (full, ~900MB). Apt install of "
        "build-essential, gcc, postgres-client. pip install -r requirements.txt.",
        "Next? (a) switch to python:3.11-slim, (b) multi-stage build with "
        "build-essential only in build stage, (c) use distroless image.",
    ),
    (
        "Add caching to the slow get_user_dashboard function",
        "Read views/dashboard.py:get_user_dashboard. 5 DB queries, ~300ms. "
        "Project uses Redis for sessions; no @cache decorator pattern yet.",
        "Next? (a) wrap with functools.lru_cache (in-process), (b) use "
        "Flask-Caching with Redis backend, (c) denormalize the query into "
        "a single JOIN first.",
    ),
    (
        "Investigate intermittent 500 errors in production /api/order/checkout",
        "Read order_service.py. Read recent logs via tail. Saw 12 instances "
        "of UnboundLocalError in last 24h, all in inventory_check branch.",
        "Next? (a) read inventory_check function first, (b) add a try/except "
        "with proper logging to bound the issue, (c) check git log for recent "
        "changes to that branch.",
    ),
    (
        "Generate API documentation from docstrings",
        "Read docs/. Empty. Read pyproject.toml — sphinx and autoapi already "
        "in dev-deps but not configured. Read 5 source files — all have "
        "Google-style docstrings.",
        "Next? (a) write conf.py with autoapi extension, (b) try mkdocs+mkdocstrings "
        "instead, (c) check if there's a docs branch with prior config.",
    ),
    (
        "The integration test test_payment_flow is flaky",
        "Read test_payment_flow.py. Uses real Stripe API in test mode. "
        "Looked at last 30 CI runs — fails ~15% of the time, all timeouts on "
        "Stripe webhook step.",
        "Next? (a) mock Stripe at the boundary, (b) add @pytest.mark.flaky retry, "
        "(c) increase timeout from 5s to 30s.",
    ),
    (
        "Migrate from print() to structured logging",
        "Read 8 source files. Found 47 print() calls. Project uses structlog "
        "in main.py only. Logging config exists in config/logging.yaml.",
        "Next? (a) sed replace print -> logger.info per file, (b) add a "
        "ruff/flake8 rule that bans print and fix the lint errors, "
        "(c) write a custom AST codemod for safe replacement.",
    ),
    (
        "Add OAuth2 social login (Google + GitHub)",
        "Read auth/. Current: email+password only with bcrypt. Read pyproject "
        "— no oauth lib. Project uses Flask. Saw Authlib is canonical for Flask.",
        "Next? (a) install authlib and write Google client first, "
        "(b) install authlib and write a generic OAuth abstraction "
        "supporting both, (c) read existing user model to plan account-linking "
        "schema.",
    ),
    (
        "The frontend can't connect to the backend after the API URL change",
        "Read web/.env. API_URL still points to localhost:5000. "
        "Read web/src/api/client.ts — uses import.meta.env.VITE_API_URL.",
        "Next? (a) update .env to new URL, (b) check if there's a .env.production "
        "that should be used, (c) check the deployment config to see how envs "
        "are propagated.",
    ),
    (
        "Add Sentry error tracking",
        "Read main.py — no sentry init. Read pyproject — sentry-sdk in deps "
        "but unused. Found SENTRY_DSN in env vars on staging.",
        "Next? (a) add sentry_sdk.init() in main.py with env-aware DSN, "
        "(b) check if there's a config/sentry.py that should hold init logic, "
        "(c) write the integration with FlaskIntegration explicit.",
    ),
    (
        "Database migrations are out of sync between dev and staging",
        "Ran alembic current on dev: head = 8f3a2b. On staging: 7d1c9a. "
        "Read migrations/ — saw 8f3a2b has only schema changes (no data).",
        "Next? (a) run alembic upgrade head on staging, "
        "(b) check the schema diff first to confirm no data loss, "
        "(c) take a backup of staging DB then upgrade.",
    ),
    (
        "Reduce p99 latency on /api/search from 1.2s to <300ms",
        "Read search/handler.py. Saw raw SQL with LIKE '%term%'. "
        "Read schema — no full-text index on the searched columns.",
        "Next? (a) add tsvector full-text index on PostgreSQL, "
        "(b) integrate Meilisearch as separate search service, "
        "(c) profile the query first to find if it's the LIKE or N+1.",
    ),
    (
        "The new feature flag system needs to support per-user overrides",
        "Read features/flags.py. Current: env-var-based, global on/off. "
        "Read users model — has user_id and group_id fields.",
        "Next? (a) add a flag_overrides table keyed by (flag_name, user_id), "
        "(b) integrate Unleash or Flagsmith service, "
        "(c) build a simple JSON-in-DB override mechanism first.",
    ),
    # 30 more — abbreviated for length but real shapes
    *[
        (
            f"Generated agentic state #{i}: {scenario}",
            f"Mid-rollout context {i}: {context}",
            f"Tool-call decision point {i}: {choice}",
        )
        for i, (scenario, context, choice) in enumerate(
            [
                ("Fix S3 upload retry logic", "Read uploader.py, no backoff seen", "Add tenacity or hand-roll exponential backoff?"),
                ("Add JWT refresh tokens", "Read auth.py, only access tokens exist", "Standalone refresh table or JWT denylist?"),
                ("Reduce CI from 12min to <5min", "Read .github/ci.yml, single job", "Split into matrix or use cache?"),
                ("Migrate from npm to pnpm", "Read package.json, no workspaces", "Move to monorepo first or do straight swap?"),
                ("Add Prometheus metrics", "Read app, no /metrics endpoint", "Use prometheus_client lib or middleware?"),
                ("The websocket disconnects after 60s", "Read ws server, no ping config", "Configure ping_interval or use heartbeat at app level?"),
                ("Migrate Redis from v6 to v7", "Read docker-compose.yml, redis:6-alpine", "Test in dev branch first or upgrade in place?"),
                ("Add OpenAPI spec generation", "Read api/, manual flask routes", "Switch to FastAPI or use apispec for Flask?"),
                ("The cron job missed 2 runs last week", "Read cron config, single-instance", "Add HA cron solution or just retry-on-failure?"),
                ("Reduce memory usage in worker.py", "Read worker, sees 2GB resident", "Profile with memray or guess at top suspects?"),
                ("Add rate limit per IP not per user", "Read limiter config, user-keyed", "Switch to IP key or composite (IP+user)?"),
                ("The TLS cert expires in 7 days", "Read prod config, manual cert", "Switch to Let's Encrypt or extend manually?"),
                ("Migrate logs from local files to CloudWatch", "Read logging config, FileHandler", "watchtower lib or fluentd sidecar?"),
                ("Add CSRF protection to forms", "Read web/, no CSRF tokens", "Flask-WTF or write minimal token middleware?"),
                ("The tests pass locally but fail on CI", "Saw fail in test_dates.py, timezone issue", "Set TZ=UTC in CI or fix tests?"),
                ("Add E2E test for the checkout flow", "Read tests/, no Playwright/Cypress", "Playwright or Cypress?"),
                ("Reduce Docker build cache miss rate", "Read Dockerfile, COPY . . at top", "Reorder for better caching or BuildKit cache mounts?"),
                ("The user-uploaded images aren't being resized", "Read upload handler, no Pillow call", "Sync resize on upload or async via queue?"),
                ("Add audit log for admin actions", "Read admin/, no logging", "Decorator-based or event sourcing?"),
                ("Migrate from black to ruff format", "Read pyproject, black configured", "Direct swap or run both during transition?"),
                ("Add backup verification", "Read backup script, no restore test", "Test by restoring to staging or use checksums?"),
                ("Reduce database connection count", "Saw 200 idle conns, pool size 50", "PgBouncer or fix connection leak first?"),
                ("Add feature toggle UI for non-engineers", "Read flags system, env-var only", "Build admin page or use third-party tool?"),
                ("The API returns 500 on POST with empty body", "Read handler, json.loads on body", "Validate body length or use pydantic?"),
                ("Add SLI/SLO definitions", "Read monitoring, only uptime tracked", "Define p95 latency SLO or error rate SLO first?"),
                ("Reduce S3 cost on infrequent files", "Read storage, all in STANDARD", "Lifecycle to IA or Glacier?"),
                ("Migrate from cron to Kubernetes CronJob", "Read deploy, classic cron", "Direct migrate or run both for a week?"),
                ("Add request tracing with OpenTelemetry", "Read app, no instrumentation", "Auto-instrument or manual spans?"),
                ("The notification service is dropping messages", "Read service, fire-and-forget", "Add ack pattern or use durable queue?"),
                ("Reduce Sentry noise from non-actionable errors", "Saw 500 events/day, 80% same TypeError", "Filter at sample rate or fix the bug?"),
            ]
        )
    ],
]


def state_to_messages(task: str, context: str, choice: str) -> list[dict]:
    """Format one state as a chat-completion message list."""
    return [
        {
            "role": "system",
            "content": (
                "You are a senior software engineer working as an agentic coding "
                "assistant. You have access to tool calls (read_file, edit_file, "
                "search_files, run_tests, web_search). Your job at each step is "
                "to pick the best next tool call given the current state. Reply "
                "with: (a) which option (a/b/c) you'd pick, (b) one sentence "
                "explaining why, (c) the exact tool call you'd make."
            ),
        },
        {"role": "user", "content": f"Task: {task}"},
        {"role": "assistant", "content": f"Progress so far: {context}"},
        {"role": "user", "content": choice},
    ]


def main():
    out_path = Path(__file__).parent / "states.jsonl"
    states = []
    for i, (task, context, choice) in enumerate(TASK_TEMPLATES):
        state = {
            "id": f"state-{i:03d}",
            "task": task,
            "messages": state_to_messages(task, context, choice),
        }
        states.append(state)

    out_path.write_text("\n".join(json.dumps(s) for s in states) + "\n")
    print(f"Wrote {len(states)} states to {out_path}")
    # Show prompt size statistics
    total_chars = sum(
        sum(len(m["content"]) for m in s["messages"]) for s in states
    )
    print(f"Total prompt chars: {total_chars:,} (~{total_chars // 4:,} tokens estimate)")


if __name__ == "__main__":
    main()
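
A quick sanity check on the generated fixture can catch a malformed state before any paid teacher call is made. The sketch below is not part of the commit; `check_states` and `EXPECTED_ROLES` are hypothetical names, and the only assumptions are the three keys (`id`, `task`, `messages`) and the four-message role order that the script above writes.

```python
import json

# Role order produced by state_to_messages() in the script above.
EXPECTED_ROLES = ["system", "user", "assistant", "user"]


def check_states(lines):
    """Return a list of human-readable problems found in states.jsonl-style lines."""
    problems = []
    seen_ids = set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate a trailing blank line
        state = json.loads(line)
        missing = {"id", "task", "messages"} - state.keys()
        if missing:
            problems.append(f"line {lineno}: missing keys {sorted(missing)}")
            continue
        if state["id"] in seen_ids:
            problems.append(f"line {lineno}: duplicate id {state['id']}")
        seen_ids.add(state["id"])
        roles = [m["role"] for m in state["messages"]]
        if roles != EXPECTED_ROLES:
            problems.append(f"line {lineno}: unexpected roles {roles}")
    return problems
```

To run it against the real fixture, something like `check_states(Path("states.jsonl").read_text().splitlines())` should return an empty list.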
@@ -0,0 +1,44 @@
# Spike 001 — Teacher Replay Cost & Latency Floor — VERDICT

**Verdict:** ✅ VALIDATED
**Reason:** All three thresholds met.

Calls completed: 150 (0 errors)
Total spike spend: $0.9790

## Per-teacher

| Teacher | n | p50 lat | p95 lat | p99 lat | mean $ | total $ | mean prompt tok | mean comp tok |
|---|---|---|---|---|---|---|---|---|
| `anthropic/claude-opus-4.7` | 50 | 3.44s | 4.6s | 4.96s | $0.016126 | $0.8063 | 250 | 165 |
| `deepseek/deepseek-v4-pro` | 50 | 7.11s | 16.24s | 21.76s | $0.001314 | $0.0657 | 171 | 256 |
| `openai/gpt-5` | 50 | 4.97s | 10.05s | 21.96s | $0.002140 | $0.1070 | 176 | 192 |

## Per-step (parallel 3-teacher call)

- Complete states (all 3 teachers replied): **50**
- Median step cost (sum across 3 teachers): **$0.02120**
- p95 step cost: $0.02268
- Median step wallclock latency (max across teachers): **7.75s**
- p95 step latency: 20.45s
- p99 step latency: 23.24s
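
The per-step figures combine the three parallel teacher calls: cost is summed across teachers, while wallclock latency is the slowest reply. A minimal sketch of that aggregation follows. The record fields (`state_id`, `teacher`, `cost_usd`, `latency_s`) and the linear-interpolation percentile are assumptions for illustration; the actual `results.jsonl` schema and the percentile method used by `analyze.py` are not shown in this commit view.

```python
from collections import defaultdict


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def per_step_stats(calls, n_teachers=3):
    """Aggregate per-call records into per-step cost and latency statistics."""
    by_state = defaultdict(list)
    for call in calls:
        by_state[call["state_id"]].append(call)

    # A step counts only if every teacher replied for that state.
    complete = [
        group for group in by_state.values()
        if len({c["teacher"] for c in group}) == n_teachers
    ]
    # Teachers run in parallel: cost adds up, wallclock is the slowest reply.
    costs = [sum(c["cost_usd"] for c in group) for group in complete]
    latencies = [max(c["latency_s"] for c in group) for group in complete]
    return {
        "complete_states": len(complete),
        "median_step_cost": percentile(costs, 50),
        "p95_step_cost": percentile(costs, 95),
        "median_step_latency": percentile(latencies, 50),
        "p95_step_latency": percentile(latencies, 95),
        "p99_step_latency": percentile(latencies, 99),
    }
```

Taking the maximum latency per step is what makes the slowest teacher's tail dominate the p95 and p99 step numbers, even though the most expensive teacher is the fastest one in the table above.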

## Projected per-trace (50 steps × 3 teachers, ungated)

- Mean: **$0.9790**
- p50: $1.0600
- p95: $1.1340
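
The projection appears to treat the 50-state fixture as one 50-step trace, so the mean equals the total spend and the p50 and p95 are the per-step figures scaled by 50. The arithmetic below reproduces the reported numbers under that assumption; the variable names are illustrative and the inputs are copied from the tables in this report.

```python
# Figures copied from this report; names are illustrative, not from analyze.py.
STEPS_PER_TRACE = 50

teacher_totals = {
    "anthropic/claude-opus-4.7": 0.8063,
    "deepseek/deepseek-v4-pro": 0.0657,
    "openai/gpt-5": 0.1070,
}
median_step_cost = 0.02120
p95_step_cost = 0.02268

# Mean per-trace cost: one 50-step trace is the whole fixture, so it is the total spend.
mean_trace_cost = sum(teacher_totals.values())
# p50 / p95 per-trace cost: scale the per-step percentile by the number of steps.
p50_trace_cost = STEPS_PER_TRACE * median_step_cost
p95_trace_cost = STEPS_PER_TRACE * p95_step_cost
# Share of spend attributable to the most expensive teacher.
opus_share = teacher_totals["anthropic/claude-opus-4.7"] / mean_trace_cost

print(f"mean ${mean_trace_cost:.4f}, p50 ${p50_trace_cost:.4f}, p95 ${p95_trace_cost:.4f}")
print(f"Opus share of spend: {opus_share:.0%}")
```

Note that the p50 exceeds the mean here, which is possible because the two are computed from different statistics (median step cost versus mean step cost).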

## Thresholds

| Threshold | Target | Actual | Pass? |
|---|---|---|---|
| Mean per-trace cost | < $5 | $0.9790 | ✅ |
| p95 step latency | < 30 s | 20.45s | ✅ |
| p99 step latency | < 60 s | 23.24s | ✅ |
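
The verdict follows mechanically from the table: every actual value must fall strictly below its target. A small sketch of that kill-switch check is given here, assuming strict upper bounds as the `<` targets suggest. The `Threshold` class and the `"FAILED"` label are hypothetical; the report itself only shows the VALIDATED outcome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    name: str
    limit: float   # strict upper bound taken from the "Target" column
    actual: float  # measured value from the "Actual" column

    @property
    def passed(self) -> bool:
        return self.actual < self.limit


def verdict(thresholds):
    """Return VALIDATED only if every threshold passes (kill-switch semantics)."""
    return "VALIDATED" if all(t.passed for t in thresholds) else "FAILED"


checks = [
    Threshold("Mean per-trace cost (USD)", 5.0, 0.9790),
    Threshold("p95 step latency (s)", 30.0, 20.45),
    Threshold("p99 step latency (s)", 60.0, 23.24),
]

for t in checks:
    print(f"{t.name}: {t.actual} < {t.limit} -> {'pass' if t.passed else 'fail'}")
print(verdict(checks))
```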

## Recommendation

- Proceed to spike 002a (trace-collection-trl) and 003 (DPO-pair extraction).
- Cost is acceptable even at v0.0's ungated baseline; VOI gating in v0.1 will buy headroom.
- Use the per-teacher latency table to decide whether any teacher is too slow to keep in the rotation.
@@ -0,0 +1,39 @@
# Spike 002a — Trace Collection via TRL + OpenEnv

> **Risk:** MEDIUM. Validates whether TRL's `GRPOTrainer` + OpenEnv environment registry produce clean, schema-stable trace JSONL.
> **Status:** 📋 planned (depends on 001 verdict)
> **Comparison spike:** runs head-to-head with `002b-trace-collection-prime-rl/`.

## Question (Given / When / Then)

**Given** Qwen3-7B base + TRL `GRPOTrainer` + a SWE-bench-lite OpenEnv,
**when** we run 100 rollouts,
**then** all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift, and the JSONL is loadable by spike 003 without preprocessing.

## Approach (TBD)

1. Set up a minimal TRL `GRPOTrainer` config pointing at SWE-bench-lite as an OpenEnv environment.
2. Run 100 rollouts, capturing trace tuples to `traces.jsonl`.
3. Verify schema, count truncations, count missing reward signals.
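
Step 3 could look roughly like the sketch below. Since the approach is still TBD, everything here is an assumption: it supposes one JSON object per line with keys named after the tuple elements (`state_t`, `action_t`, `reward_t`), a hypothetical boolean `truncated` flag set by the collector, and it treats any record whose key set differs from the first record's as schema drift. It does not use TRL or OpenEnv APIs.

```python
import json

# Assumed field names, taken from the (state_t, action_t, reward_t) tuple above.
REQUIRED_KEYS = ("state_t", "action_t", "reward_t")


def validate_traces(lines):
    """Count completeness problems in traces.jsonl-style lines."""
    stats = {
        "records": 0,
        "missing_fields": 0,
        "missing_reward": 0,
        "schema_drift": 0,
        "truncated": 0,
    }
    reference_keys = None
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        stats["records"] += 1
        if any(key not in record for key in REQUIRED_KEYS):
            stats["missing_fields"] += 1
        # An absent reward and an explicit null both count as a missing signal.
        if record.get("reward_t") is None:
            stats["missing_reward"] += 1
        # Hypothetical flag; a real collector may signal truncation differently.
        if record.get("truncated", False):
            stats["truncated"] += 1
        keys = frozenset(record)
        if reference_keys is None:
            reference_keys = keys
        elif keys != reference_keys:
            stats["schema_drift"] += 1
    return stats
```

A clean run under these assumptions would report zero for every counter except `records`.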

## Why this risk-tier

If the trace stream is dirty (missing fields, schema drift mid-rollout, truncated states), spike 003 (DPO-pair extraction) gets nothing useful. But this risk is *medium*, not *high*, because both TRL and OpenEnv are well-tested upstream; the failure mode is integration glue, not feasibility.

## Files (planned)

- `setup.py` — TRL + verifiers + transformers + accelerate install
- `train_config.py` — minimal `GRPOConfig`
- `run_rollout.py` — collect 100 rollouts to traces.jsonl
- `validate_schema.py` — schema check + completeness stats
- `traces.jsonl` (gitignored — large; uploaded to dataset repo)
- `verdict.md` — final verdict

## Hardware

- 1× A100 80GB on Modal (per `modal-llm-training` skill)
- Wallclock estimate: ~4–8 hours for 100 rollouts (depends on rollout length)

## Blocked on

Spike 001 verdict. If 001 fails, this spike is moot.
|
|
@@ -0,0 +1,31 @@
| 1 |
+
# Spike 002b — Trace Collection via PRIME-RL + verifiers
|
| 2 |
+
|
| 3 |
+
> **Risk:** MEDIUM. Comparison-spike sibling to 002a.
|
| 4 |
+
> **Status:** 📋 planned (depends on 001 verdict)
|
| 5 |
+
> **Comparison spike:** runs head-to-head with `002a-trace-collection-trl/`.
|
| 6 |
+
|
| 7 |
+
## Question (Given / When / Then)
|
| 8 |
+
|
| 9 |
+
**Given** Qwen3-7B base + PRIME-RL substrate + the verifiers env library,
|
| 10 |
+
**when** we run 100 rollouts,
|
| 11 |
+
**then** all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift.
|
| 12 |
+
|
| 13 |
+
## Why a comparison
|
| 14 |
+
|
| 15 |
+
PRIME-RL and TRL+OpenEnv are the two leading contenders for the v0.1 substrate (per `framework/composer-replication-framework.md` § "How the 5 component pieces fit together"). v0.0 should settle on one of them for v0.1, so both are run and compared along these dimensions:
|
| 16 |
+
|
| 17 |
+
| Dimension | TRL+OpenEnv (002a) | PRIME-RL+verifiers (002b) |
|---|---|---|
| Setup complexity | TBD | TBD |
| Trace JSONL schema cleanliness | TBD | TBD |
| Rollout wallclock per trace | TBD | TBD |
| Decentralization story (for v0.2) | weaker | stronger (INTELLECT-2 proven) |
| Algorithm correctness (DAPO + GRPO loss) | strong | strong |
|
| 24 |
+
|
| 25 |
+
## Files (planned)
|
| 26 |
+
|
| 27 |
+
Same shape as 002a. The differences are in `setup.py` (uses the `prime-rl` package + `verifiers` lib) and `run_rollout.py` (uses PRIME-RL's orchestrator/inference split).
|
| 28 |
+
|
| 29 |
+
## Blocked on
|
| 30 |
+
|
| 31 |
+
Spike 001 verdict.
|
|
@@ -0,0 +1,34 @@
| 1 |
+
# Spike 003 — DPO Pair Extraction from Teacher Disagreement
|
| 2 |
+
|
| 3 |
+
> **Risk:** MEDIUM. Validates that teacher disagreement at the step level carries non-trivial signal as preference pairs.
|
| 4 |
+
> **Status:** 📋 planned (depends on 001 + 002 verdicts)
|
| 5 |
+
|
| 6 |
+
## Question (Given / When / Then)
|
| 7 |
+
|
| 8 |
+
**Given** N=3 teacher action distributions per trace step and the student's own action,
|
| 9 |
+
**when** we extract preference pairs by:
|
| 10 |
+
- "majority of teachers agree on action X, student picked Y" → preference pair `chosen=X, rejected=Y`
|
| 11 |
+
- "teachers disagree but a majority differs from student" → preference pair `chosen=majority, rejected=student`
|
| 12 |
+
|
| 13 |
+
**then** the resulting DPO dataset has:
|
| 14 |
+
- ≥ 5 pairs/trace (signal density)
|
| 15 |
+
- non-trivial KL distance between `chosen` and `rejected` (the pairs aren't degenerate)
|
| 16 |
+
- per-step disagreement rate > 30% across 3 teachers (otherwise N=3 is too few)
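The disagreement-rate criterion can be made concrete with a small sketch. The definition used here is an assumption, not something this spike has fixed: a step counts as a disagreement when the teachers do not all choose the same action, and actions are compared as plain strings.

```python
def disagreement_rate(teacher_actions_per_step):
    """Fraction of steps where the teachers do not all choose the same action."""
    if not teacher_actions_per_step:
        return 0.0
    disagreeing = sum(
        1 for actions in teacher_actions_per_step if len(set(actions)) > 1
    )
    return disagreeing / len(teacher_actions_per_step)


# Toy data: one inner list per step, one entry per teacher (hypothetical action labels).
steps = [
    ["edit", "edit", "edit"],
    ["edit", "run_tests", "edit"],
    ["read", "read", "read"],
    ["run_tests", "edit", "read"],
]
rate = disagreement_rate(steps)
print(f"disagreement rate: {rate:.0%}")
```

Real agent actions are free-form text, so an exact string match would likely overcount disagreement; some normalization or clustering of actions would be needed before applying the 30% threshold.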
|
| 17 |
+
|
| 18 |
+
## Approach (TBD)
|
| 19 |
+
|
| 20 |
+
1. Load `traces.jsonl` from spike 002 + `teacher_actions.jsonl` from spike 001's pattern (extended to 002's traces).
|
| 21 |
+
2. For each step, compute student-vs-teacher disagreement and majority-teacher action.
|
| 22 |
+
3. Emit DPO pairs to `dpo_pairs.jsonl`.
|
| 23 |
+
4. Validate stats: pairs/trace, average chosen-rejected logprob delta, action-token KL.
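Steps 2 and 3 might be sketched as follows. This is a minimal illustration under stated assumptions: "majority" is read as a strict majority of the teachers, actions are compared as exact strings, and the `chosen` / `rejected` keys follow the pair notation used above. A real DPO record would also need the prompt (the step's state), which is omitted here.

```python
from collections import Counter


def extract_pair(teacher_actions, student_action):
    """Return a preference pair, or None if no strict teacher majority differs from the student."""
    action, votes = Counter(teacher_actions).most_common(1)[0]
    has_majority = votes > len(teacher_actions) / 2
    if has_majority and action != student_action:
        return {"chosen": action, "rejected": student_action}
    return None


# Toy steps: (teacher actions, student action). Labels are hypothetical.
steps = [
    (["edit", "edit", "edit"], "read"),       # unanimous teachers, student differs
    (["edit", "run_tests", "edit"], "edit"),  # majority matches the student: no pair
    (["edit", "run_tests", "read"], "edit"),  # no majority: no pair
    (["read", "read", "edit"], "edit"),       # 2-of-3 majority differs from the student
]
pairs = [p for t, s in steps if (p := extract_pair(t, s)) is not None]
print(pairs)
```

Dividing `len(pairs)` by the number of traces would give the pairs/trace figure that the signal-density criterion checks.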
|
| 24 |
+
|
| 25 |
+
## Files (planned)
|
| 26 |
+
|
| 27 |
+
- `extract_pairs.py` — convert traces + teacher actions → DPO pairs
|
| 28 |
+
- `validate_signal.py` — compute disagreement rate + pair statistics
|
| 29 |
+
- `dpo_pairs.jsonl` (gitignored — uploaded to dataset repo)
|
| 30 |
+
- `verdict.md`
|
| 31 |
+
|
| 32 |
+
## Blocked on
|
| 33 |
+
|
| 34 |
+
Spikes 001 + 002 verdicts.
|
|
@@ -0,0 +1,47 @@
| 1 |
+
# Spike 004 — A/B Train: Plain GRPO vs GRPO + Trace-Replay-DPO
|
| 2 |
+
|
| 3 |
+
> **Risk:** TERMINAL. The experiment that validates or invalidates the v0.0 claim.
|
| 4 |
+
> **Status:** 📋 planned (depends on 001 + 002 + 003 verdicts)
|
| 5 |
+
|
| 6 |
+
## Question (Given / When / Then)
|
| 7 |
+
|
| 8 |
+
**Given** the trace dataset from spike 002 + the DPO pairs from spike 003,
|
| 9 |
+
**when** we train two Qwen3-7B variants on SWE-bench-lite:
|
| 10 |
+
- **(A) Plain GRPO baseline** — TRL `GRPOTrainer` only, no trace-replay channel
|
| 11 |
+
- **(B) GRPO + trace-replay-DPO** — same training data, additional DPO loss term from spike 003 pairs
|
| 12 |
+
|
| 13 |
+
**then** variant (B) outperforms variant (A) by **≥ 2 points pass@1** on held-out SWE-bench-lite, with statistical significance (p < 0.05 over 3 seeds per variant).
|
| 14 |
+
|
| 15 |
+
## Why "≥ 2 points"
|
| 16 |
+
|
| 17 |
+
- Below 2 pts: noise-level difference, not worth the additional teacher-cost overhead. Channel is dead.
|
| 18 |
+
- 2–5 pts: validates the channel; v0.1 should add VOI gating + tiered teachers to make it economic.
|
| 19 |
+
- Above 5 pts: channel is a clear win; v0.1 should be the priority research direction.
|
| 20 |
+
|
| 21 |
+
## Approach (TBD)
|
| 22 |
+
|
| 23 |
+
1. Use 002's chosen substrate (TRL or PRIME-RL).
|
| 24 |
+
2. Set up two configs: A (plain) and B (with DPO loss).
|
| 25 |
+
3. Train 3 seeds each on Modal A100-80GB.
|
| 26 |
+
4. Eval on SWE-bench-lite held-out.
|
| 27 |
+
5. Compute pass@1 with confidence intervals.
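For step 5, one possible way to get a confidence interval on the pass@1 difference is a paired bootstrap over held-out tasks. The sketch below is an assumption about how that could be done, not the planned `compare.py`: it takes per-task 0/1 outcomes for each variant on the same task list, and the function name and parameters are hypothetical.

```python
import random


def paired_bootstrap_ci(pass_a, pass_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(pass_b) - mean(pass_a), resampling tasks jointly."""
    if len(pass_a) != len(pass_b) or not pass_a:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)
    n = len(pass_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample task indices with replacement; both variants share the indices.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(pass_b[i] - pass_a[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy outcomes for 10 tasks (1 = solved). Illustrative only, not experimental results.
variant_a = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
variant_b = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
low, high = paired_bootstrap_ci(variant_a, variant_b)
print(f"95% CI for pass@1 difference: [{low:.2f}, {high:.2f}]")
```

With 3 seeds per variant, the per-task outcomes would first have to be aggregated across seeds (for example, averaged per task); how to do that is left open here.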
|
| 28 |
+
|
| 29 |
+
## Cost
|
| 30 |
+
|
| 31 |
+
- 2 variants × 3 seeds = 6 training runs
|
| 32 |
+
- Each ~8 hr on A100-80GB
|
| 33 |
+
- ~$50 each → **~$300 in GPU compute**
|
| 34 |
+
- Plus SWE-bench-lite eval (~$50–100)
|
| 35 |
+
|
| 36 |
+
## Files (planned)
|
| 37 |
+
|
| 38 |
+
- `train_a_baseline.py` — plain GRPO config
|
| 39 |
+
- `train_b_with_replay.py` — GRPO + trace-replay-DPO config
|
| 40 |
+
- `eval_swe_bench.py` — held-out evaluation harness
|
| 41 |
+
- `compare.py` — paired-bootstrap CI on pass@1 differences
|
| 42 |
+
- `results/` — per-seed eval outputs
|
| 43 |
+
- `verdict.md` — final verdict + recommendation for v0.1
|
| 44 |
+
|
| 45 |
+
## Blocked on
|
| 46 |
+
|
| 47 |
+
Spikes 001 + 002 + 003 all VALIDATED or PARTIAL. If any of those fail outright, this spike doesn't run.
|
|
@@ -0,0 +1,60 @@
| 1 |
+
# v0.0 Spike — Composer Replication Framework
|
| 2 |
+
|
| 3 |
+
> Decomposed from the framework synthesis (`framework/composer-replication-framework.md`).
|
| 4 |
+
> Goal of v0.0: **prove the trace-replay multi-teacher distillation channel adds signal on top of plain GRPO**, on the smallest viable model.
|
| 5 |
+
> If the spike validates, we move to v0.1 (full Composer recipe). If it invalidates, the framework still has value (Composer recipe alone) but the novel claim is dead and we reorient.
|
| 6 |
+
|
| 7 |
+
## Risk-ordered decomposition
|
| 8 |
+
|
| 9 |
+
| # | Spike | Validates (Given / When / Then) | Why this risk first | Status |
|---|-------|----------------------------------|---------------------|--------|
| **001** | `001-teacher-replay-cost` | **Given** a frozen 100-step agentic-coding trace and a state at step `t`, **when** N=3 frozen teachers (Opus 4.7 / GPT-5 / DeepSeek V4 Pro) are queried via OpenRouter for next-action distributions, **then** total per-trace teacher cost is < $5 and wallclock per step is < 30 s. | If teachers cost $50+/trace or take 5 min/step, the channel is unviable regardless of whether it improves training. **Kill-switch first.** | 📋 planned |
| **002a** | `002a-trace-collection-trl` | **Given** Qwen3-7B base + TRL `GRPOTrainer` + a SWE-bench-lite OpenEnv, **when** we run 100 rollouts, **then** all rollouts emit complete `(state_t, action_t, reward_t)` tuples to JSONL with no truncation or schema drift. | Without a clean trace stream, no signal to replay. Validates TRL+OpenEnv plumbing. | 📋 planned |
| **002b** | `002b-trace-collection-prime-rl` | Same as 002a but with PRIME-RL substrate. | Comparison: which framework's trace export is cleaner? | 📋 planned |
| **003** | `003-dpo-pairs-from-disagreement` | **Given** N=3 teacher action distributions per trace step and the student's own action, **when** we extract preference pairs by "majority of teachers > student" + "student > minority", **then** the resulting DPO dataset has ≥ 5 pairs/trace and a non-trivial KL distance from random pairs. | The reward shape needs to actually carry signal, not just exist. | 📋 planned |
| **004** | `004-ab-train-grpo-vs-trace-replay-dpo` | **Given** the trace dataset from 002, **when** we train two Qwen3-7B variants, (A) plain GRPO baseline and (B) GRPO + trace-replay-DPO, and evaluate on SWE-bench-lite, **then** variant (B) outperforms (A) by ≥ 2 pt pass@1 with statistical significance. | The terminal experiment that validates or invalidates the v0.0 claim. | 📋 planned |
|
| 16 |
+
|
| 17 |
+
## Spike order rationale
|
| 18 |
+
|
| 19 |
+
1. **001 (teacher cost) first** — single most likely thing to kill the framework. Cheap to run (~$5–20), takes ~1 hour, no GPU.
|
| 20 |
+
2. **002a / 002b in parallel** — independent feasibility checks for the two competing trace-collection substrates. ~half a day each. Compare verdicts head-to-head.
|
| 21 |
+
3. **003 reward-shape check** — once we have *any* trace + teacher data, validate the DPO-pair extraction works as a reward signal before paying for the full A/B training run.
|
| 22 |
+
4. **004 the actual experiment** — only run after 001/002/003 all green. Costs the GPU budget; should not be wasted on a framework that already failed an earlier feasibility gate.
|
| 23 |
+
|
| 24 |
+
## Out of scope for v0.0 (deferred to v0.1)
|
| 25 |
+
|
| 26 |
+
- Composer's hint-distillation loss (the per-turn KL from a hint-conditioned forward pass)
|
| 27 |
+
- The Feature Deletion environment (use SWE-bench-lite as the env)
|
| 28 |
+
- DiLoCo / decentralized training (single-node FSDP2 is fine at 7B)
|
| 29 |
+
- Monarch / Forge (use Ray + verifiers, the PRIME-RL stack)
|
| 30 |
+
- MoE base (use dense Qwen3-7B; saner v0.0 target)
|
| 31 |
+
- VOI gating, tiered teachers (do the full N=3 query at every step in v0.0; cost mitigation is a v0.1 optimization)
|
| 32 |
+
|
| 33 |
+
## Budget
|
| 34 |
+
|
| 35 |
+
| Item | Estimate | Source |
|---|---|---|
| Teacher API calls (OpenRouter) | ~$50–150 | 100 traces × ~50 step replays × 3 teachers × ~$0.005/call |
| GPU compute (Qwen3-7B fine-tune × 2 variants) | ~$60–120 | Modal A100-80GB, ~8 hr each variant |
| Dev wallclock | ~5–7 days | Single operator |
| **Total** | **~$200 + dev time** | Cheapest viable falsification of the novel claim |
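The teacher-API line can be checked with a few lines of arithmetic. The inputs are the approximate figures from the Source column of the table; the variable names are only for illustration.

```python
# Approximate figures from the budget table's Source column.
traces = 100
step_replays_per_trace = 50
teachers = 3
cost_per_call_usd = 0.005

total_calls = traces * step_replays_per_trace * teachers
teacher_cost_usd = round(total_calls * cost_per_call_usd, 2)

print(f"{total_calls} calls, about ${teacher_cost_usd:.0f} in teacher API spend")
```

The point estimate of about $75 sits inside the ~$50–150 range given in the table.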
|
| 41 |
+
|
| 42 |
+
## Success criteria for v0.0
|
| 43 |
+
|
| 44 |
+
- 001: $/trace + s/step verdict in `001-teacher-replay-cost/README.md`
|
| 45 |
+
- 002a, 002b: clean JSONL + verdict on which substrate to use for v0.1
|
| 46 |
+
- 003: DPO-pair stats verdict
|
| 47 |
+
- 004: A/B pass@1 with confidence interval, plain text and chart
|
| 48 |
+
|
| 49 |
+
If 004 is **VALIDATED** → publish the result, write v0.1 plan.
|
| 50 |
+
If **PARTIAL** (e.g., only some teacher mixes work) → narrow the claim, re-spike with the working subset.
|
| 51 |
+
If **INVALIDATED** → close the trace-replay channel as a research direction; v0.1 framework still ships with Composer-only recipe.
|
| 52 |
+
|
| 53 |
+
## Citations
|
| 54 |
+
|
| 55 |
+
All five primary research notes (`research/01..05*.md`) cite the source papers and code repos that informed each design choice. The sources most relevant at spike time are:
|
| 56 |
+
|
| 57 |
+
- Cursor (2026): Composer 2.5 blog post — recipe shape and the targeted-RL hint-distillation idea
|
| 58 |
+
- Microsoft (2024): rStar / rStar-Math — closest precedent to trace-replay (single-teacher MCTS)
|
| 59 |
+
- Hugging Face (2025): TRL `GRPOTrainer` + OpenEnv integration — algorithm reference
|
| 60 |
+
- Prime Intellect (2026): PRIME-RL + INTELLECT-2 — production decentralized substrate
|