tosi-n committed on
Commit
65053bf
verified
1 Parent(s): 09f121c

Upload README.md with huggingface_hub

  1. README.md +85 -0
README.md ADDED
@@ -0,0 +1,85 @@
1
+ ---
2
+ license: apache-2.0
3
+ base_model: poolside/Laguna-XS.2
4
+ tags:
5
+ - reinforcement-learning
6
+ - lora
7
+ - trading
8
+ - coding-agent
9
+ - verifiers
10
+ - prime-intellect
11
+ - poolside-hackathon
12
+ library_name: peft
13
+ ---
14
+
15
+ # TradePool: a self-improving trading coding-agent (Laguna XS.2 LoRA)
16
+
17
+ **Poolside × Prime Intellect Research Hackathon, Foundations track.**
18
+
19
+ A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the
20
+ model becomes a **coding agent that writes causal crypto trading-strategy functions**,
21
+ scored by a leak-proof out-of-sample backtest.
22
+
23
+ ## The idea in one line
24
+ > Trading discipline that normally lives as *prompt text* (a memory file of rules) is
25
+ > turned into **adapter weights** by rewarding disciplined, profitable behaviour on
26
+ > held-out market data. The verifier *is* the backtest.
27
+
28
+ ## How it works
29
+ 1. **Environment** (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`):
30
+ the agent is given a Base-chain token's in-sample price history + a library of causal
31
+ indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write
32
+ `def strategy(features, position) -> target_position`.
33
+ 2. **Verifier / reward**: the strategy runs bar-by-bar over a **held-out** window
34
+ (lookahead is structurally impossible; the function never sees future bars), scored by
35
+ a weighted rubric:
36
+ - OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) ·
37
+ sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
38
+ - Hard gates → reward 0: invalid code, lookahead, NaN equity, **do-nothing strategies**.
39
+ 3. **Training**: Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128,
40
+ `rollouts_per_example=8`, `enable_thinking=false`. FREE hosted Laguna run.
41
+
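The environment and verifier steps above can be sketched as follows. This is a minimal illustration, not the `trade_pool` implementation: the feature names (`ma_fast`, `ma_slow`), the long-only clamp, the fee model, and the helper `causal_backtest` are all assumptions made for the example. It shows why lookahead is structurally impossible: at each bar the strategy only receives features computed from a slice that ends at the current bar.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]
Strategy = Callable[[Features, float], float]


def sma(prices: List[float], window: int) -> float:
    """Simple moving average of the last `window` prices (fewer if history is short)."""
    tail = prices[-window:]
    return sum(tail) / len(tail)


def strategy(features: Features, position: float) -> float:
    """Toy rule (hypothetical): fully long when the fast MA is above the slow MA."""
    return 1.0 if features["ma_fast"] > features["ma_slow"] else 0.0


def causal_backtest(prices: List[float], strat: Strategy, fee: float = 0.001) -> List[float]:
    """Run `strat` bar by bar and return the equity curve, starting at 1.0.

    At bar t the strategy sees features built from prices[: t + 1] only;
    the position it chooses then earns the return from bar t to bar t + 1.
    """
    equity = [1.0]
    position = 0.0
    for t in range(len(prices) - 1):
        history = prices[: t + 1]  # slice ends at the current bar: no future data
        features = {
            "close": history[-1],
            "ma_fast": sma(history, 3),
            "ma_slow": sma(history, 8),
        }
        # Assumed long-only exposure in [0, 1].
        target = min(1.0, max(0.0, float(strat(features, position))))
        cost = fee * abs(target - position)  # assumed proportional cost on turnover
        position = target
        bar_return = prices[t + 1] / prices[t] - 1.0
        equity.append(equity[-1] * (1.0 + position * bar_return - cost))
    return equity
```

Because each decision depends only on the history slice, running the backtest on a truncated price series reproduces the corresponding prefix of the full equity curve, which is a cheap sanity check for causality.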
42
+ ## Results
43
+ RL produced a clear reward climb on the training environment, peaking around step 13 and then settling slightly below the peak:
44
+
45
+ | Stage | Total reward |
46
+ |---|---|
47
+ | step ~0 (baseline) | ~0.15 |
48
+ | step ~8 | 0.19 |
49
+ | step ~11 | 0.28 |
50
+ | step ~13 (peak) | ~0.42 |
51
+ | step ~50 (final) | ~0.34–0.41 |
52
+
53
+ Every rubric component improved together (not single-metric gaming):
54
+ `reward_valid` 0.30 → ~0.70 (writes valid trading code far more often),
55
+ `reward_sharpe` 0.10 → 0.33, drawdown/exposure/cost all up. Held-out-symbol eval on base
56
+ Laguna scored `reward_valid` 0.75 / `reward_sharpe` 0.45, confirming the env is in the
57
+ healthy trainable band before training.
58
+
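For readers relating the component scores to the total reward, the weighted rubric with hard gates might be combined roughly as below. This is a sketch under stated assumptions: the weights are the ones listed in the rubric, but the component key names, the [0, 1] clamping, and the way a do-nothing strategy is detected (`n_trades == 0`) are hypothetical and may differ from the actual environment.

```python
from typing import Dict

# Weights as listed in the rubric; they sum to 1.0. Key names are illustrative.
WEIGHTS: Dict[str, float] = {
    "sharpe": 0.40,
    "beats_buy_and_hold": 0.20,
    "drawdown": 0.15,
    "exposure": 0.10,
    "cost": 0.05,
    "valid": 0.10,
}


def total_reward(
    components: Dict[str, float],
    *,
    valid_code: bool = True,
    lookahead: bool = False,
    nan_equity: bool = False,
    n_trades: int = 1,
) -> float:
    """Weighted sum of component scores, with hard gates that force reward 0."""
    # Hard gates: invalid code, lookahead, NaN equity, or a do-nothing strategy.
    if (not valid_code) or lookahead or nan_equity or n_trades == 0:
        return 0.0
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Assumption: each component is already normalised; clamp to [0, 1].
        score = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * score
    return total
```

Gating before the weighted sum means a strategy cannot collect partial credit (for example, for low drawdown) simply by never trading.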
59
+ ## The novel contribution: closing the self-improvement loop
60
+ - **Weights channel:** each RL iteration warm-starts from the prior adapter
61
+ (`checkpoint_id`), a genuine parametric continuation.
62
+ - **Curriculum channel:** a reflection step reads the prior adapter's out-of-sample eval
63
+ and shifts the next run's objective (sharpe → min-drawdown → balanced) and focuses on the
64
+ weakest symbols, so the agent's own results drive its next curriculum.
65
+ - **Falsifiable proof ("memory is the adapter"):** the discipline block (distilled from
66
+ 618 real prior trading decisions) can be **stripped from the prompt**
67
+ (`use_seed_principles=false`); if the trained adapter stays disciplined, the rules now
68
+ live in the weights, not the prompt.
69
+
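The curriculum channel described above could look something like the following sketch. The function `next_curriculum`, the per-symbol score dictionary, and the rule of advancing the objective by one stage per iteration are all assumptions for illustration; the README states only that a reflection step reads the out-of-sample eval, shifts the objective, and focuses on the weakest symbols.

```python
from typing import Dict, List, Tuple

# Objective progression named in the README.
OBJECTIVES: List[str] = ["sharpe", "min-drawdown", "balanced"]


def next_curriculum(
    eval_by_symbol: Dict[str, float],
    objective: str,
    k: int = 2,
) -> Tuple[str, List[str]]:
    """Pick the next objective and the k weakest symbols from the prior eval.

    Assumption: the objective advances one stage per iteration and stays at
    the last stage once reached; the real reflection step may decide differently.
    """
    idx = OBJECTIVES.index(objective)
    next_objective = OBJECTIVES[min(idx + 1, len(OBJECTIVES) - 1)]
    # Lowest out-of-sample scores first: these symbols get the focus next run.
    weakest = sorted(eval_by_symbol, key=lambda s: eval_by_symbol[s])[:k]
    return next_objective, weakest
```

For example, with hypothetical symbols, `next_curriculum({"AAA": 0.5, "BBB": 0.1, "CCC": 0.3}, "sharpe")` selects `"min-drawdown"` as the next objective and focuses on `BBB` and `CCC`.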
70
+ ## Files
71
+ - `trade_pool/`: the full `verifiers` environment (features, causal backtester, executor,
72
+ rubric, data); installable, builds to a wheel, and bundles its own OHLCV tape.
73
+ - `adapter/`: the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
74
+ - `configs/`: the RL training config(s).
75
+ - `reward_curve.txt`, `eval_*.json`: training and eval metrics.
76
+
77
+ ## Reproduce
78
+ ```bash
79
+ prime env push --path ./trade_pool --visibility PRIVATE # -> <you>/trade-pool
80
+ prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
81
+ prime train run configs/iter_1.toml # FREE hosted Laguna RL
82
+ prime deployments create <adapter_id> # serve the adapter
83
+ ```
84
+
85
+ Built at the Poolside London hackathon, 29–30 May 2026. Team: **TradePool** (Tosin Dairo).