# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
---
## 1. Recommended Modal GPU type & estimated cost
### 1.1 Pricing table (from primary source)
All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.
| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
|----------------|----------------------|--------------|----------|--------|------------------------|
| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
| **Nvidia L4** | `"L4"` | **0.000222** | **0.799**| 24 GB | ✅ **Recommended** — cheapest GPU that fits comfortably |
| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill — Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB| `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
| Nvidia A100-80GB| `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |
(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
**Auxiliary costs** (also primary, same page):
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.
### 1.2 Why L4, not A10 or A100-40GB
The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*
Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2,T=1024: ~1–2 GB
- Teacher fwd (no grad, no act save): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
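The back-of-envelope sum above can be reproduced in a few lines. This is a minimal sketch: the helper name `estimate_vram_gb` is hypothetical, and the activation, teacher, DPO, and overhead entries are the rough ranges quoted in the list, not measurements.

```python
from __future__ import annotations


def estimate_vram_gb(n_params: float = 0.5e9) -> tuple[float, float]:
    """Return a (low, high) VRAM estimate in GB for the smoke workload."""
    gb = 1e9
    weights = 2 * n_params / gb     # bf16 weights: 2 bytes per parameter
    adam_state = 8 * n_params / gb  # AdamW m+v in fp32: 8 bytes per parameter
    grads = 2 * n_params / gb       # bf16 gradients: 2 bytes per parameter
    # Rough (low, high) ranges in GB, copied from the list above (assumed, not measured).
    ranges = {
        "student_activations": (1.0, 2.0),
        "teacher_forward": (0.3, 0.3),
        "dpo_forwards": (2.0, 3.0),
        "framework_overhead": (2.0, 2.0),
    }
    fixed = weights + adam_state + grads
    low = fixed + sum(lo for lo, _ in ranges.values())
    high = fixed + sum(hi for _, hi in ranges.values())
    return low, high


low, high = estimate_vram_gb()
print(f"estimated VRAM: {low:.1f}-{high:.1f} GB (L4 has 24 GB)")
```

Treat the result as a sanity bound only; the number that matters is the `peak_vram` the smoke itself reports.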
**A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload where step time is already dominated by Python overhead (see §3). Pay the L4 rate.
**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).
**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
### 1.3 Cost projection for the actual smoke
Assume a 30-min wall-clock budget that breaks down realistically as:
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch+transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
- Logging, save, exit: 5 s.
**Realistic total: ~3–7 minutes of GPU-billed time per run.**
Cost per run on L4:
- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
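**Per-run all-in: $0.08 – $0.13 on L4**, conservatively charging the full 7-minute CPU/RAM overhead to both bounds. At ~$0.10 per run that is roughly 50 runs within the $5 cap (about 38 at the $0.13 worst case). Comfortable.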
For comparison, the same scenario with the same run lengths and the same ~$0.037 CPU/RAM overhead costs ~$0.09 – $0.17 per run on A10 and ~$0.14 – $0.28 on A100-40GB. Still all under cap, but L4 is the rational pick.
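The per-run arithmetic can be scripted so it is easy to redo when prices change. A minimal sketch, assuming the per-second rates from §1.1 and the 4-core / 16 GiB container used in this section; `run_cost` and its `aux_seconds` parameter are hypothetical names introduced here for illustration.

```python
from __future__ import annotations

from typing import Optional

# Per-second rates copied from the pricing table in section 1.1.
GPU_PRICE_PER_SEC = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_PRICE_PER_CORE_SEC = 0.0000131
RAM_PRICE_PER_GIB_SEC = 0.00000222


def run_cost(gpu: str, gpu_seconds: float, cores: float = 4.0,
             ram_gib: float = 16.0, aux_seconds: Optional[float] = None) -> float:
    """GPU cost plus CPU/RAM cost for one run, in dollars.

    aux_seconds lets the CPU/RAM overhead be charged for a different duration
    than the GPU time; it defaults to the GPU time.
    """
    if aux_seconds is None:
        aux_seconds = gpu_seconds
    gpu_cost = gpu_seconds * GPU_PRICE_PER_SEC[gpu]
    aux_cost = aux_seconds * (cores * CPU_PRICE_PER_CORE_SEC + ram_gib * RAM_PRICE_PER_GIB_SEC)
    return gpu_cost + aux_cost


# Same convention as the text: 3- and 7-minute GPU time, 7-minute CPU/RAM overhead.
low = run_cost("L4", 180, aux_seconds=420)
high = run_cost("L4", 420, aux_seconds=420)
print(f"L4: ${low:.2f} - ${high:.2f} per run, ~{int(5 / high)} worst-case runs under a $5 cap")
```

Pass another key from `GPU_PRICE_PER_SEC` to compare GPUs under the same run-length assumptions.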
### 1.4 Region & preemption multipliers (DON'T trip on these)
From the pricing-page footer:
- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr and is unjustified.
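To see what these multipliers do to the hourly rate, here is a small sketch. The helper `effective_hourly` and its parameters are hypothetical (they are not Modal API arguments), and it assumes the two multipliers compound multiplicatively, which the pricing footer quoted above does not confirm.

```python
def effective_hourly(price_per_sec: float, region_mult: float = 1.0,
                     nonpreemptible: bool = False) -> float:
    """Hourly price in dollars after applying the pricing-footer multipliers.

    Assumption: region and non-preemptible multipliers multiply together.
    """
    mult = region_mult * (3.0 if nonpreemptible else 1.0)
    return price_per_sec * 3600 * mult


L4_PER_SEC = 0.000222  # from the table in section 1.1
print(f"default:         ${effective_hourly(L4_PER_SEC):.2f}/hr")
print(f"region-pinned:   ${effective_hourly(L4_PER_SEC, region_mult=1.75):.2f}/hr")
print(f"non-preemptible: ${effective_hourly(L4_PER_SEC, nonpreemptible=True):.2f}/hr")
```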
---
## 2. Minimal `modal_app.py` skeleton
This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
```python
"""modal_app.py: GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:  modal run modal_app.py
Logs: the function's print() output streams to your terminal.
"""
from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination: the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",  # needed because HF_HUB_ENABLE_HF_TRANSFER=1 is set below
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Silence the tokenizers fork warning.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# NOTE: a secret named "huggingface-token" must already exist in the workspace.
# If it doesn't, delete this line and the `secrets=` argument below.
hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])


# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",            # see §1: cheapest 24 GB option that fits
    cpu=4.0,             # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,    # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,     # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16 (supported on L4, Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch (a single forward+backward
    # against an LM-head loss), *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```
### 2.1 What's deliberately *not* in the skeleton
- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
### 2.2 What you do need to add when you wire the real loss
Replace the synthetic `for step in range(50)` body with:
```python
from data_collator import ComposerDataCollator # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
```
The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
---
## 3. Gotchas that bite *this specific workload*
The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):
```python
student_logits = model(input_ids=inputs["input_ids"]).logits  # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```
Two issues:
1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
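The logits arithmetic in item 1 is easy to re-run for other batch shapes. A minimal sketch; the helper name `logits_bytes` is hypothetical, and the vocabulary size is the 151,936 figure quoted above.

```python
def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int = 2) -> int:
    """Size of one [batch, seq_len, vocab] logits tensor (bf16 = 2 bytes per element)."""
    return batch * seq_len * vocab * bytes_per_elem


one = logits_bytes(batch=2, seq_len=1024, vocab=151_936)
print(f"one logits tensor: {one / 1e6:.0f} MB")
print(f"student + teacher: {2 * one / 1e9:.2f} GB")
# Doubling both batch and sequence length quadruples the footprint.
print(f"at B=4, T=2048:    {2 * logits_bytes(4, 2048, 151_936) / 1e9:.2f} GB for the pair")
```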
### 3.2 The DPO channel does TWO more grad'd forwards per step
`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:
| Forward | Grad? | Notes |
|---------|-------|-------|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |
**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
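The table above reduces to a small counting rule, sketched here. The helper `forwards_per_step` is hypothetical, and it assumes the SDPO teacher forward is skipped together with the student forward when `alpha_sdpo` is 0, which the table implies but does not state.

```python
from __future__ import annotations


def forwards_per_step(alpha_sdpo: float, beta_replay: float) -> dict[str, int]:
    """Count forward passes per training step for a given channel weighting."""
    with_grad = 1  # the parent GRPO forward always runs
    no_grad = 0
    if alpha_sdpo != 0:
        with_grad += 1  # SDPO student forward
        no_grad += 1    # SDPO teacher forward (assumed skipped when alpha_sdpo == 0)
    if beta_replay != 0:
        with_grad += 2  # DPO chosen + DPO rejected
    return {"with_grad": with_grad, "no_grad": no_grad}


print(forwards_per_step(alpha_sdpo=0.1, beta_replay=0.05))  # trainer defaults
print(forwards_per_step(alpha_sdpo=0.0, beta_replay=0.0))   # GRPO only
```

Activation memory scales with the `with_grad` count, so zeroing one weight at a time is a quick way to attribute a memory peak to a channel.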
### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun
In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:
```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```
This tensor is **not connected to the model** in the autograd graph: it is a leaf with `requires_grad=True` and no parent op. Summed into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, it contributes nothing to the parameter gradients and doesn't break things. The footgun shows up on a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs). Because the leaf itself requires grad, `total.backward()` does *not* raise `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`. It succeeds, leaves every model parameter's `.grad` as `None`, and `optimizer.step()` silently updates nothing. **The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`, and the symptom (a zero loss with unchanged weights) is easy to miss.
Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
### 3.4 `torch.cuda.synchronize()` before timing reads
If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
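Once the step times are measured correctly, summarizing them is plain arithmetic. The sketch below is a stdlib alternative to the index-based median/p95 in the skeleton; the helper name `summarize_step_times` is hypothetical and the sample timings are synthetic.

```python
from __future__ import annotations

import statistics


def summarize_step_times(step_times_s: list[float]) -> dict[str, float]:
    """Median, 95th percentile (both in ms), and total (in s) of per-step wall times."""
    ms = [t * 1000 for t in step_times_s]
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(ms, n=20)[-1]
    return {
        "median_ms": statistics.median(ms),
        "p95_ms": p95,
        "total_s": sum(step_times_s),
    }


# Synthetic timings standing in for the `step_times` list collected in the loop.
fake_times = [0.30 + 0.01 * (i % 5) for i in range(50)]
report = summarize_step_times(fake_times)
print({k: round(v, 2) for k, v in report.items()})
```

The first step usually includes one-off CUDA and allocator warm-up, so it is worth checking whether it dominates the p95 before reading much into that number.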
### 3.5 The HF cache Volume must be `commit()`ed
From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
### 3.6 What does NOT bite
These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:
- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
---
## 4. Decision rule: Modal vs the local 5090
### 4.1 The numbers
**Local 5090** (32 GB VRAM, Blackwell; dense bf16 tensor throughput is on the order of 0.2 PFLOPS, since the ~1.6 PFLOPS headline figure is a low-precision tensor number rather than bf16):
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: **~15 seconds of pure compute**.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- **Wall clock: ~25–40 seconds.**
- **Cost: $0** (electricity ignored: the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
- Step time for the same workload on L4: **~1.5–4 s per step.** (On paper the L4 has roughly half the 5090's dense bf16 throughput and about one-sixth of its memory bandwidth, ~300 GB/s vs ~1.8 TB/s. At B=2 the step is dominated by memory traffic and kernel-launch overhead rather than peak FLOPS, so a realistic gap is ~5–10×.) Call it 2 s.
- 50 steps: **~100 seconds of pure compute**.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
- **Cost: $0.08–$0.13 per run.**
### 4.2 The decision rule
> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**
Reasoning:
1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every Modal round-trip costs the time of roughly ten local runs.
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (or just import-and-run in a notebook).
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
**When Modal becomes correct:**
| Scenario | Modal? | Why |
|----------|--------|-----|
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
### 4.3 Recommended workflow
1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
3. **For the real training run** (when it's an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits) — the L4 step-time of ~2 s would translate to 2 s × 10,000 steps = ~5.5 hours which is fine for a smoke but painful for a real run.
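The step-count arithmetic in item 3 generalizes to a quick projection. A minimal sketch, using the 10,000-step and 2 s figures from the text and the L4 per-second rate from §1.1; the helper `projected_run` is hypothetical, the cost is GPU-only (no CPU/RAM), and it assumes the step time stays at the smoke value.

```python
from __future__ import annotations


def projected_run(steps: int, step_s: float, gpu_price_per_sec: float) -> tuple[float, float]:
    """Return (hours, GPU-only dollars) for a run of `steps` at `step_s` seconds each."""
    seconds = steps * step_s
    return seconds / 3600, seconds * gpu_price_per_sec


hours, gpu_cost = projected_run(steps=10_000, step_s=2.0, gpu_price_per_sec=0.000222)
print(f"L4: {hours:.2f} h, ~${gpu_cost:.2f} GPU-only")
```

Under these assumptions the L4 run is cheap but slow, which is why the recommendation above turns on wall-clock time rather than price.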
---
## 5. References
All claims in this document are sourced from:
- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
- **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
- **Local trainer code reviewed**:
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
---
## 6. TL;DR
1. **GPU: L4. Cost: ~$0.10/run. Budget: roughly 50 runs before hitting the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
3. **Workload-specific gotchas (§3):** the 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit is a footgun on steps where every channel short-circuits, and `volume.commit()` is mandatory.
4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.