# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke

**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.

---

## 1. Recommended Modal GPU type & estimated cost

### 1.1 Pricing table (from primary source)

All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.

| GPU            | Modal `gpu=` string  | $ / sec      | $ / hour | VRAM   | Verdict for this smoke |
|----------------|----------------------|--------------|----------|--------|------------------------|
| Nvidia T4      | `"T4"`               | 0.000164     | 0.590    | 16 GB  | Too small for safe headroom on 3 fwd passes |
| **Nvidia L4**  | `"L4"`               | **0.000222** | **0.799**| 24 GB  | ✅ **Recommended** — cheapest GPU that fits comfortably |
| Nvidia A10     | `"A10"`              | 0.000306     | 1.102    | 24 GB  | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S    | `"L40S"`             | 0.000542     | 1.951    | 48 GB  | Overkill — Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB| `"A100-40GB"`       | 0.000583     | 2.099    | 40 GB  | Overkill |
| Nvidia A100-80GB| `"A100-80GB"`       | 0.000694     | 2.498    | 80 GB  | Overkill |
| Nvidia H100    | `"H100!"`            | 0.001097     | 3.949    | 80 GB  | Wasteful |

(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)

**Auxiliary costs** (also primary, same page):
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.

### 1.2 Why L4, not A10 or A100-40GB

The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The reasoning: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*

Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2,T=1024: ~1–2 GB
- Teacher fwd (no grad, no act save): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
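
The tally above can be reproduced mechanically. The sketch below is only a back-of-envelope calculator: the ranges are the estimates from the list, and the names `VRAM_ITEMS_GB` and `vram_budget` are illustrative, not part of the trainer.

```python
# Back-of-envelope VRAM tally for the smoke (B=2, T=1024, bf16).
# Each entry is (low, high) in GB, taken from the estimates listed above.
VRAM_ITEMS_GB = {
    "weights": (1.0, 1.0),
    "adamw_state_fp32_m_v": (4.0, 4.0),   # 8 bytes x 0.5B params
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_fwd_no_grad": (0.3, 0.3),
    "dpo_chosen_rejected": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}


def vram_budget(items):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


L4_VRAM_GB = 24
low_gb, high_gb = vram_budget(VRAM_ITEMS_GB)
print(f"estimated peak: {low_gb:.1f} to {high_gb:.1f} GB")
print(f"headroom on L4 at the high end: {L4_VRAM_GB - high_gb:.1f} GB")
```

Swapping in a different batch size or sequence length means re-estimating the activation rows by hand; the calculator does not model how they scale.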

**A10 is also fine**, but it costs 38% more for ~30–50% extra throughput on a workload where step time is already dominated by Python overhead (see §3). Pay the L4 rate.

**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).

**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.

### 1.3 Cost projection for the actual smoke

Assume a 30-min wall-clock budget that breaks down realistically as:
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch+transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
- Logging, save, exit: 5 s.

**Realistic total: ~3–7 minutes of GPU-billed time per run.**
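
The breakdown can be summed directly. This is a minimal sketch that uses the phase estimates from the list as-is; the dictionary keys and the `wall_seconds` helper are illustrative names.

```python
# Phase estimates from the list above, in seconds (low, high), for a first run.
PHASES_S = {
    "cold_start_image_pull": (30, 90),
    "hf_model_download": (15, 45),
    "setup_in_function": (20, 40),
    "train_50_steps": (100, 200),
    "log_save_exit": (5, 5),
}


def wall_seconds(phases, skip=()):
    """Sum low and high estimates, optionally skipping phases (e.g. cached ones)."""
    low = sum(lo for name, (lo, _) in phases.items() if name not in skip)
    high = sum(hi for name, (_, hi) in phases.items() if name not in skip)
    return low, high


first_run = wall_seconds(PHASES_S)
cached_run = wall_seconds(PHASES_S, skip=("hf_model_download",))
for label, (low, high) in (("first run", first_run), ("model cached", cached_run)):
    print(f"{label}: {low}-{high} s ({low / 60:.1f}-{high / 60:.1f} min)")
```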

Cost per run on L4:
- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037

**Per-run all-in: roughly $0.06 – $0.13 on L4** (GPU plus CPU/RAM, each billed for the same 3–7 minutes). At ~$0.10 per run that is about 50 runs before nudging the $5 cap. Comfortable.

For comparison, the same 3–7 minute scenario costs ~$0.07 – $0.17 per run on A10 and ~$0.12 – $0.28 on A100-40GB. Still all under cap, but L4 is the rational pick.
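
The per-run figures follow from the per-second rates in §1.1. The sketch below assumes the GPU, CPU, and RAM are all billed for the same duration, with 4 cores and 16 GiB as in the skeleton; `run_cost` is an illustrative helper, not a Modal API.

```python
# Per-second rates copied from the pricing table and auxiliary costs in section 1.1.
GPU_RATE_PER_SEC = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_RATE_PER_CORE_SEC = 0.0000131
RAM_RATE_PER_GIB_SEC = 0.00000222


def run_cost(gpu, seconds, cores=4, ram_gib=16):
    """Estimated all-in cost in USD of one run billed for `seconds`."""
    per_sec = (
        GPU_RATE_PER_SEC[gpu]
        + cores * CPU_RATE_PER_CORE_SEC
        + ram_gib * RAM_RATE_PER_GIB_SEC
    )
    return seconds * per_sec


for gpu in GPU_RATE_PER_SEC:
    low, high = run_cost(gpu, 180), run_cost(gpu, 420)
    print(f"{gpu:>10}: ${low:.3f} to ${high:.3f} per run")

runs_under_cap = int(5.00 // run_cost("L4", 420))
print(f"worst-case L4 runs under the $5 cap: {runs_under_cap}")
```

The comparison holds the duration fixed across GPU types; a faster GPU would shorten the training phase, so the gap in practice is somewhat smaller than these numbers suggest.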

### 1.4 Region & preemption multipliers (DON'T trip on these)

From the pricing-page footer:
- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible mode would push L4 to ~$2.40/hr and is unjustified.
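
To see what the multipliers do to the L4 rate, here is a small sketch. It applies each multiplier on its own; whether region and non-preemptible multipliers stack is not stated above, so the sketch does not combine them.

```python
L4_HOURLY_USD = 0.799  # from the pricing table in section 1.1

# Multipliers quoted from the pricing-page footer, as (low, high).
MULTIPLIERS = {
    "default (preemptible, no region pin)": (1.0, 1.0),
    "region pinned": (1.5, 1.75),
    "non-preemptible": (3.0, 3.0),
}


def effective_hourly(base, bounds):
    """Apply a (low, high) multiplier range to a base hourly rate."""
    low, high = bounds
    return base * low, base * high


for label, bounds in MULTIPLIERS.items():
    low, high = effective_hourly(L4_HOURLY_USD, bounds)
    print(f"{label}: ${low:.2f} to ${high:.2f} per hour")
```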

---

## 2. Minimal `modal_app.py` skeleton

This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.

```python
"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:    modal run modal_app.py
Logs:   the function's print() output streams to your terminal.
"""

from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) Image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination. The trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# Before wiring the real loss, confirm that the pinned trl release actually
# exports GRPOTrainer (bump the pin if the import fails) and re-verify that
# GRPOTrainer._compute_loss is still the correct override hook
# (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",                  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",                   # required once HF_HUB_ENABLE_HF_TRANSFER=1 is set below
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Reproducibility for the smoke.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# Secret.from_name() fails at run time if the named secret does not exist in
# the workspace, so it stays commented out until it is actually needed.
# hf_secret = modal.Secret.from_name("huggingface-token")

# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                       # see §1: cheapest 24 GB option that fits
    cpu=4.0,                        # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,               # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,                # hard 30-min cap matches the smoke spec
    # secrets=[hf_secret],          # re-enable together with hf_secret above
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region — the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model. bf16 — L4 supports it (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch — a single forward+backward
    # against an LM-head loss — *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024 — matches the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```

### 2.1 What's deliberately *not* in the skeleton

- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.

### 2.2 What you do need to add when you wire the real loss

Replace the synthetic `for step in range(50)` body with:

```python
from data_collator import ComposerDataCollator        # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
```

The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
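
A minimal sketch of such a stubbed batch follows. The chosen-side, teacher, and SDPO key names come from the paragraph above; the `dpo_rejected_*` names, all shapes, and all dtypes are assumptions that mirror the chosen-side keys and must be checked against the real data collator. `make_stub_batch` is a hypothetical helper.

```python
import torch


def make_stub_batch(batch_size=2, seq_len=1024, vocab_size=151_936, device="cpu", seed=0):
    """Fixed synthetic inputs for exercising the 3-channel loss path.

    Shapes, dtypes, and the dpo_rejected_* key names are ASSUMPTIONS.
    """
    gen = torch.Generator().manual_seed(seed)  # CPU generator for repeatable ids

    def random_ids():
        ids = torch.randint(0, vocab_size, (batch_size, seq_len), generator=gen)
        return ids.to(device)

    def response_mask():
        # Treat the second half of every sequence as the response.
        mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
        mask[:, seq_len // 2:] = True
        return mask.to(device)

    input_ids = random_ids()
    return {
        "input_ids": input_ids,
        # A copy of input_ids keeps the teacher context trivial.
        "ctx_teacher_input_ids": input_ids.clone(),
        "sdpo_loss_mask": response_mask(),
        "dpo_chosen_input_ids": random_ids(),
        "dpo_chosen_response_mask": response_mask(),
        "dpo_chosen_ref_logprobs": torch.zeros(batch_size, device=device),
        # ASSUMED names, mirroring the chosen-side keys:
        "dpo_rejected_input_ids": random_ids(),
        "dpo_rejected_response_mask": response_mask(),
        "dpo_rejected_ref_logprobs": torch.zeros(batch_size, device=device),
    }


batch = make_stub_batch(seq_len=16)
for name, tensor in batch.items():
    print(f"{name:<28} {tuple(tensor.shape)} {tensor.dtype}")
```

Using a fixed seed keeps the batch identical across runs, so a change in loss values between two runs points at the code rather than at the data.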

---

## 3. Gotchas that bite *this specific workload*

The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:

### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly

`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):

```python
student_logits = model(input_ids=inputs["input_ids"]).logits      # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```

Two issues:

1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
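
The logits arithmetic in item 1 generalizes to other shapes. A small sketch, where `logits_bytes` is an illustrative helper:

```python
BYTES_PER_ELEMENT = {"bf16": 2, "fp16": 2, "fp32": 4}
QWEN25_VOCAB = 151_936  # vocab size quoted above


def logits_bytes(batch_size, seq_len, vocab_size, dtype="bf16"):
    """Size in bytes of one [batch, seq, vocab] logits tensor."""
    return batch_size * seq_len * vocab_size * BYTES_PER_ELEMENT[dtype]


one_tensor = logits_bytes(2, 1024, QWEN25_VOCAB)
print(f"one logits tensor (bf16): {one_tensor / 1e6:.0f} MB")
print(f"student + teacher (bf16): {2 * one_tensor / 1e9:.2f} GB")

# Same shape at the spec's upper limits (B=4, T=2048) for comparison.
upper = logits_bytes(4, 2048, QWEN25_VOCAB)
print(f"one logits tensor at B=4, T=2048: {upper / 1e9:.2f} GB")
```

If the loss upcasts logits to fp32 internally, pass `dtype="fp32"` to see the doubled footprint; whether `generalized_jsd_loss` does so is not established here.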

### 3.2 The DPO channel does TWO more grad'd forwards per step

`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:

| Forward | Grad? | Notes |
|---------|-------|-------|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |

**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
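
One possible reading of "alternating steps" is to zero one auxiliary weight per step, since the table above shows each channel's forwards only run when its weight is non-zero. The sketch below is a hypothetical schedule, not something the trainer implements; `channel_weights` is an illustrative name and the defaults are the trainer defaults quoted in this section.

```python
def channel_weights(step, alpha_sdpo=0.1, beta_replay=0.05, alternate=True):
    """Return (alpha, beta) for a step.

    With alternate=True, even steps run only SDPO and odd steps run only
    the DPO replay channel, so at most one auxiliary channel holds
    activations at a time.
    """
    if not alternate:
        return alpha_sdpo, beta_replay
    if step % 2 == 0:
        return alpha_sdpo, 0.0
    return 0.0, beta_replay


for step in range(4):
    alpha, beta = channel_weights(step)
    print(f"step {step}: alpha_sdpo={alpha} beta_replay={beta}")
```

Alternating halves how often each auxiliary channel contributes, so the weights may need rescaling to keep the same effective mix; that trade-off is left open here.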

For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.

### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun

In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:

```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```

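This tensor is **not connected to the model**: it is a leaf with `requires_grad=True` but no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, it contributes nothing to the parameter gradients and doesn't break things as long as another channel is live. The trap is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs). Because the leaves carry `requires_grad=True`, `total` still has a `grad_fn`, so `total.backward()` succeeds, but no model parameter receives a gradient and `optimizer.step()` is silently a no-op. (Had the zeros been created without `requires_grad=True`, the same step would raise `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`.) **The smoke will short-circuit the SDPO and DPO channels on every step** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`, so those loss paths would never be exercised.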

Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
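
How a short-circuit zero interacts with `backward()` is easy to check in isolation. The sketch below uses a stand-in parameter `w` rather than the real trainer, and `short_circuit_zero` is a hypothetical helper that mimics the return statement quoted above (minus the device argument).

```python
import torch

# Stand-in for model weights; the real trainer is not needed to see the effect.
w = torch.nn.Parameter(torch.ones(3))


def short_circuit_zero(requires_grad):
    """Mimics the zero returned when a loss channel is skipped."""
    return torch.tensor(0.0, requires_grad=requires_grad)


# Case 1: every channel skipped, zeros created WITH requires_grad=True.
total = short_circuit_zero(True) + 0.1 * short_circuit_zero(True) + 0.05 * short_circuit_zero(True)
total.backward()                      # no exception is raised
grad_when_all_skipped = w.grad        # still None: nothing reached the weights

# Case 2: every channel skipped, zeros created WITHOUT requires_grad.
total = short_circuit_zero(False) + 0.1 * short_circuit_zero(False)
try:
    total.backward()
    raised = None
except RuntimeError as err:
    raised = str(err)

# Case 3: one live channel plus a skipped one; the weights get a gradient.
total = (w * 2.0).sum() + 0.1 * short_circuit_zero(True)
total.backward()
grad_with_live_channel = w.grad.clone()

print("all skipped, requires_grad=True :", grad_when_all_skipped)
print("all skipped, requires_grad=False:", raised)
print("one live channel                :", grad_with_live_channel)
```

A cheap guard for the smoke is to check, after `backward()`, that at least one model parameter has a non-`None` `.grad`; that catches the silent case without depending on which channels were live.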

### 3.4 `torch.cuda.synchronize()` before timing reads

If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.

### 3.5 The HF cache Volume must be `commit()`ed

From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.

### 3.6 What does NOT bite

These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:

- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.

---

## 4. Decision rule: Modal vs the local 5090

### 4.1 The numbers

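**Local 5090** (32 GB VRAM, Blackwell, ~1.6 PFLOPS vendor-quoted peak, which is a low-precision sparse-tensor figure; dense bf16 throughput is several times lower):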
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: **~15 seconds of pure compute**.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- **Wall clock: ~25–40 seconds.**
- **Cost: $0** (electricity ignored — the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).

**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
- Step time for the same workload on L4: **~1.5–4 s per step.** (L4 is roughly 13× lower bf16 throughput than 5090, but the workload at B=2 won't saturate the 5090, so realistic gap is ~5–10×.) Call it 2 s.
- 50 steps: **~100 seconds of pure compute**.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
- **Cost: $0.06–$0.13 per run.**

### 4.2 The decision rule

> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**

Reasoning:

1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every 5-minute Modal round-trip is time in which the smoke could have run ~10 times locally.
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for image, watching logs. Local is `python local_smoke.py` (or just import-and-run in a notebook).
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.

**When Modal becomes correct:**

| Scenario | Modal? | Why |
|----------|--------|-----|
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
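
For the sweep row, the fan-out could look like the sketch below. The grid values are purely illustrative, and the Modal call in the comment assumes `smoke()` has been extended to accept `(alpha_sdpo, beta_replay)` arguments, which the skeleton in §2 does not do.

```python
from itertools import product

# Illustrative sweep values; 4 x 2 = 8 configurations.
ALPHAS = [0.0, 0.05, 0.1, 0.2]
BETAS = [0.0, 0.05]


def sweep_grid(alphas, betas):
    """Every (alpha_sdpo, beta_replay) pair as a list of argument tuples."""
    return list(product(alphas, betas))


grid = sweep_grid(ALPHAS, BETAS)
for alpha, beta in grid:
    print(f"alpha_sdpo={alpha} beta_replay={beta}")

# Inside a Modal local entrypoint, and ASSUMING smoke(alpha_sdpo, beta_replay)
# exists, the parallel fan-out would be roughly:
#     for result in smoke.starmap(grid):
#         print(result)
```

Each tuple becomes one container invocation, so the sweep cost is about eight times the per-run estimate in §1.3 while the wall clock stays close to a single run.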

### 4.3 Recommended workflow

1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
3. **For the real training run** (as opposed to a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). At the L4 step time of ~2 s, 10,000 steps would take 2 s × 10,000 = ~5.5 hours; that step time is fine for a 50-step smoke but painful for a real run.
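
One way to keep `local_smoke.py` and `modal_app.py` from drifting apart is to factor shared logic into plain functions that both import. As a sketch, the final-report statistics from the skeleton can live in a helper like the hypothetical `summarize_step_times` below, which uses the same index rule as §2.

```python
def summarize_step_times(step_times):
    """Median and p95 step time in ms, plus total seconds.

    Uses the same index-based rule as the skeleton: element n // 2 of the
    sorted list for the median and element int(n * 0.95) for p95.
    """
    if not step_times:
        raise ValueError("no step times recorded")
    ordered = sorted(step_times)
    n = len(ordered)
    median_ms = ordered[n // 2] * 1000
    p95_ms = ordered[min(int(n * 0.95), n - 1)] * 1000
    return {"median_ms": median_ms, "p95_ms": p95_ms, "total_s": sum(step_times)}


# Example: 47 steady steps plus three slow ones (e.g. warm-up steps).
report = summarize_step_times([0.30] * 47 + [0.90, 1.00, 1.20])
print(f"median={report['median_ms']:.1f}ms p95={report['p95_ms']:.1f}ms "
      f"total={report['total_s']:.1f}s")
```

Because the helper has no Modal or CUDA dependency, it can be unit-tested locally and imported unchanged into the Modal function.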

---

## 5. References

All claims in this document are sourced from:

- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
- **Volumes**: <https://modal.com/docs/guide/volumes>, for `commit()` semantics and HF cache persistence.
- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
- **Local trainer code reviewed**:
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`

---

## 6. TL;DR

1. **GPU: L4. Cost: ~$0.10/run. Budget: roughly 50 runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
2. **Skeleton: §2.** `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
3. **Workload-specific gotchas: §3.** The 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can leave a step with no usable gradients, and `volume.commit()` is mandatory.
4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.