+---
+license: apache-2.0
+base_model: poolside/Laguna-XS.2
+tags:
+  - reinforcement-learning
+  - lora
+  - trading
+  - coding-agent
+  - verifiers
+  - prime-intellect
+  - poolside-hackathon
+library_name: peft
+---
+# TradePool — a self-improving trading coding-agent (Laguna XS.2 LoRA)
+**Poolside × Prime Intellect Research Hackathon — Foundations track.**
+A LoRA adapter for `poolside/Laguna-XS.2`, trained with reinforcement learning so the
+model becomes a **coding agent that writes causal crypto trading-strategy functions**,
+scored by a leak-proof out-of-sample backtest.
+## The idea in one line
+> Trading discipline that normally lives as *prompt text* (a memory file of rules) is
+> turned into **adapter weights** by rewarding disciplined, profitable behaviour on
+> held-out market data. The verifier *is* the backtest.
+## How it works
+1. **Environment** (`verifiers`, v0 `SingleTurnEnv`, pushed to `stimulir/trade-pool`):
+   the agent is given a Base-chain token's in-sample price history + a library of causal
+   indicators (RSI, MACD, MAs, z-score, Bollinger, volatility) and must write
+   `def strategy(features, position) -> target_position`.
+2. **Verifier / reward** — the strategy runs bar-by-bar over a **held-out** window
+   (lookahead is structurally impossible; the function never sees future bars), scored by
+   a weighted rubric:
+   - OOS Sharpe (0.40) · beats buy-and-hold (0.20) · drawdown control (0.15) ·
+     sane exposure (0.10) · transaction cost (0.05) · valid+actually-trades (0.10)
+   - Hard gates → reward 0: invalid code, lookahead, NaN equity, **do-nothing strategies**.
+3. **Training** — Prime Hosted RL (GRPO), `poolside/Laguna-XS.2`, 50 steps, batch 128,
+   `rollouts_per_example=8`, `enable_thinking=false`, on a free hosted Laguna run.
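The scoring flow in step 2 can be sketched roughly as follows. This is a minimal illustration, not the code in `trade_pool/`: the feature dict with a `zscore` key, the long-only clamp to `[0, 1]`, the `fee` value, and the helper names `backtest` and `total_reward` are all assumptions made for the example. Only the rubric weights and the `strategy(features, position)` signature come from the description above. Component scores are taken as already scaled to `[0, 1]`, and the invalid-code and lookahead gates are not modelled here.

```python
import math
from typing import Callable, Dict, List, Tuple

# Rubric weights as listed above; everything else in this sketch is an assumption.
WEIGHTS = {
    "sharpe": 0.40,
    "beats_hold": 0.20,
    "drawdown": 0.15,
    "exposure": 0.10,
    "cost": 0.05,
    "valid": 0.10,
}

Strategy = Callable[[Dict[str, float], float], float]


def strategy(features: Dict[str, float], position: float) -> float:
    """Toy mean-reversion rule: long when the z-score is low, flat when it is high."""
    z = features["zscore"]  # hypothetical feature name
    if z < -1.0:
        return 1.0
    if z > 1.0:
        return 0.0
    return position  # otherwise keep the current position


def backtest(
    strat: Strategy,
    closes: List[float],
    feature_rows: List[Dict[str, float]],
    fee: float = 0.001,  # assumed proportional cost per unit of position change
) -> Tuple[List[float], int]:
    """Walk the held-out bars in order; at bar t the strategy sees only row t."""
    position, equity, curve, trades = 0.0, 1.0, [1.0], 0
    for t in range(len(closes) - 1):
        # Clamp to long-only exposure (an assumption of this sketch).
        target = min(1.0, max(0.0, float(strat(feature_rows[t], position))))
        if target != position:
            equity *= 1.0 - fee * abs(target - position)
            trades += 1
            position = target
        # The decision made at bar t earns the return from bar t to bar t + 1.
        equity *= 1.0 + position * (closes[t + 1] / closes[t] - 1.0)
        curve.append(equity)
    return curve, trades


def total_reward(components: Dict[str, float], curve: List[float], trades: int) -> float:
    """Apply two of the hard gates, then take the weighted sum of component scores."""
    if trades == 0 or any(math.isnan(e) for e in curve):
        return 0.0  # do-nothing strategy or NaN equity
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Made-up prices and z-scores, purely for illustration.
closes = [100.0, 98.0, 95.0, 97.0, 101.0, 104.0, 103.0]
feature_rows = [{"zscore": z} for z in [0.0, -0.5, -1.5, -0.2, 0.8, 1.6, 0.3]]
curve, trades = backtest(strategy, closes, feature_rows)
```

Because `backtest` only ever passes `feature_rows[t]` to the strategy, a function written against this interface has no handle on later bars, which is the sense in which lookahead is ruled out by construction rather than by inspection.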
+## Results
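+Reward on the training environment rose from ~0.15 at baseline to a peak of ~0.42 near
+step 13, then settled at ~0.34–0.41 by step 50 (a clear gain, though not a monotonic one):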
+| Stage | Total reward |
+|---|---|
+| step ~0 (baseline) | ~0.15 |
+| step ~8  | 0.19 |
+| step ~11 | 0.28 |
+| step ~13 (peak) | ~0.42 |
+| step ~50 (final) | ~0.34–0.41 |
+Every rubric component improved together (not single-metric gaming):
+`reward_valid` 0.30 → ~0.70 (writes valid trading code far more often),
+`reward_sharpe` 0.10 → 0.33, and the drawdown, exposure, and cost components all rose. A
+held-out-symbol eval of the base (untrained) Laguna model scored `reward_valid` 0.75 and
+`reward_sharpe` 0.45, which suggests the env sat in a healthy trainable band before
+training.
+## The novel contribution: closing the self-improvement loop
+- **Weights channel:** each RL iteration warm-starts from the prior adapter
+  (`checkpoint_id`) — genuine parametric continuation.
+- **Curriculum channel:** a reflection step reads the prior adapter's out-of-sample eval,
+  shifts the next run's objective (sharpe → min-drawdown → balanced), and focuses on the
+  weakest symbols, so the agent's own results drive its next curriculum.
+- **Falsifiable proof ("memory is the adapter"):** the discipline block (distilled from
+  618 real prior trading decisions) can be **stripped from the prompt**
+  (`use_seed_principles=false`); if the trained adapter stays disciplined, the rules now
+  live in the weights, not the prompt.
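One way to picture the weights and curriculum channels together is the small sketch below. It is hypothetical: the rule of advancing one stage per iteration, the choice of the two lowest-reward symbols, the plan dictionary layout, and the symbol names and scores are assumptions for illustration. Only the objective order and the `checkpoint_id` name come from the text above.

```python
from typing import Dict, List

# Objective order taken from the description above.
OBJECTIVES = ["sharpe", "min-drawdown", "balanced"]


def next_objective(current: str) -> str:
    """Advance one stage along the curriculum, staying at the final stage."""
    i = OBJECTIVES.index(current)
    return OBJECTIVES[min(i + 1, len(OBJECTIVES) - 1)]


def weakest_symbols(eval_rewards: Dict[str, float], k: int = 2) -> List[str]:
    """Return the k symbols with the lowest out-of-sample reward."""
    return sorted(eval_rewards, key=eval_rewards.get)[:k]


def reflect(prev_checkpoint: str, objective: str, eval_rewards: Dict[str, float]) -> dict:
    """Build a hypothetical plan for the next RL iteration from the prior eval."""
    return {
        "checkpoint_id": prev_checkpoint,        # weights channel: warm start
        "objective": next_objective(objective),  # curriculum channel: new objective
        "focus_symbols": weakest_symbols(eval_rewards),
    }


# Made-up checkpoint name, symbols, and rewards, purely for illustration.
plan = reflect("adapter-iter-1", "sharpe", {"AAA": 0.41, "BBB": 0.12, "CCC": 0.27})
```

In this toy setup the plan carries the prior adapter forward, moves the objective from `sharpe` to `min-drawdown`, and concentrates the next run on the two symbols where the adapter scored worst.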
+## Files
+- `trade_pool/` — the full `verifiers` environment (features, causal backtester, executor,
+  rubric, data) — installable, builds to a wheel, bundles its own OHLCV tape.
+- `adapter/` — the trained LoRA adapter weights for `poolside/Laguna-XS.2`.
+- `configs/` — the RL training config(s).
+- `reward_curve.txt`, `eval_*.json` — training + eval metrics.
+## Reproduce
+```bash
+prime env push --path ./trade_pool --visibility PRIVATE     # -> <you>/trade-pool
+prime eval run <you>/trade-pool -m poolside/laguna-xs.2 -n 8 -r 1
+prime train run configs/iter_1.toml                          # FREE hosted Laguna RL
+prime deployments create <adapter_id>                        # serve the adapter
+```
+Built at the Poolside London hackathon, 29–30 May 2026. Team: **TradePool** (Tosin Dairo).