# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
---
## 1. Recommended Modal GPU type & estimated cost
### 1.1 Pricing table (from primary source)
All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.
| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
|----------------|----------------------|--------------|----------|--------|------------------------|
| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
| **Nvidia L4** | `"L4"` | **0.000222** | **0.799**| 24 GB | ✅ **Recommended** — cheapest GPU that fits comfortably |
| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill — Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB| `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
| Nvidia A100-80GB| `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |
(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
**Auxiliary costs** (also primary, same page):
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.
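The hourly column and the per-hour auxiliary figures are just the per-second rates multiplied by 3600. A minimal sketch that reproduces them from the rates quoted above (the dictionary and function names are illustrative, not part of any Modal API):

```python
# Per-second rates in USD, as quoted in the pricing table above.
GPU_RATE_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100!": 0.001097,
}
CPU_RATE_PER_CORE_SEC = 0.0000131
RAM_RATE_PER_GIB_SEC = 0.00000222


def per_hour(rate_per_sec: float) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return rate_per_sec * 3600


if __name__ == "__main__":
    for name, rate in GPU_RATE_PER_SEC.items():
        print(f"{name:<10} ${per_hour(rate):.3f}/hour")
    ratio = GPU_RATE_PER_SEC["A10"] / GPU_RATE_PER_SEC["L4"]
    print(f"A10 / L4 price ratio: {ratio:.2f}x")
```

Re-running this against freshly fetched rates is a quick way to check whether the table has gone stale.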
### 1.2 Why L4, not A10 or A100-40GB
The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes" and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does up to five forward passes per step (GRPO, SDPO student, SDPO teacher under no-grad, DPO chosen, DPO rejected; see §3.2). Even with the teacher forward held in memory you are nowhere near 24 GB.*
Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2,T=1024: ~1–2 GB
- Teacher fwd (no grad, no act save): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
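The subtotal can be checked by summing the low and high ends of each line item. A small sketch using the estimates from the list above; the key names are illustrative, and the figures remain back-of-envelope guesses rather than measurements:

```python
# (low, high) estimates in GB, copied from the list above.
BUDGET_GB = {
    "weights": (1.0, 1.0),
    "adamw_state_fp32": (4.0, 4.0),
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward": (0.3, 0.3),
    "dpo_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}


def total_range(budget: dict) -> tuple:
    """Sum the low ends and the high ends separately."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def adamw_state_gb(n_params: float, bytes_per_param: int = 8) -> float:
    """fp32 first and second moments: 4 + 4 bytes per parameter."""
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    low, high = total_range(BUDGET_GB)
    print(f"estimated VRAM: {low:.1f} to {high:.1f} GB (L4 has 24 GB)")
```

The sum comes to 11.3–13.3 GB, which the list rounds to ~11–14 GB.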
**A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate.
**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).
**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
### 1.3 Cost projection for the actual smoke
Assume a 30-min wall-clock budget that breaks down realistically as:
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch + transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
- Logging, save, exit: 5 s.
**Realistic total: ~3–7 minutes of GPU-billed time per run.**
Cost per run on L4:
- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
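**Per-run all-in: $0.06 – $0.13 on L4.** The CPU/RAM overhead scales with run time, so the 3-minute case adds only ~$0.016 rather than the full ~$0.037. At ~$0.10 per run you can run the smoke ~50× before nudging the $5 cap, and still ~38× at the upper bound. Comfortable.
For comparison, holding the same wall-clock times and CPU/RAM allocation: A10 comes to ~$0.07 – $0.17 per run and A100-40GB to ~$0.12 – $0.28. Still all under cap, but L4 is the rational pick.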
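The per-run arithmetic is easy to recompute for any GPU and duration. A minimal sketch, assuming the per-second rates from §1.1 and the 4-core / 16 GiB allocation used in the projection (`run_cost` is a hypothetical helper, not a Modal API):

```python
# Per-second rates in USD, from the pricing table in §1.1.
GPU_RATES = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_RATE_PER_CORE_SEC = 0.0000131
RAM_RATE_PER_GIB_SEC = 0.00000222


def run_cost(seconds: float, gpu: str, cores: float = 4, ram_gib: float = 16) -> float:
    """GPU charge plus CPU and RAM charges for one run of the given length."""
    gpu_cost = seconds * GPU_RATES[gpu]
    aux_cost = seconds * (cores * CPU_RATE_PER_CORE_SEC + ram_gib * RAM_RATE_PER_GIB_SEC)
    return gpu_cost + aux_cost


if __name__ == "__main__":
    for gpu in GPU_RATES:
        low, high = run_cost(180, gpu), run_cost(420, gpu)
        print(f"{gpu:<10} ${low:.3f} to ${high:.3f} per run")
    print("L4 runs under a $5 cap at the 7-minute bound:", int(5 / run_cost(420, "L4")))
```

On these inputs the 3-minute L4 case comes to about $0.06 and the 7-minute case to about $0.13. Faster GPUs would in practice finish sooner, so holding the duration fixed overstates their cost somewhat.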
### 1.4 Region & preemption multipliers (DON'T trip on these)
From the pricing-page footer:
- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr and is unjustified.
---
## 2. Minimal `modal_app.py` skeleton
This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
```python
"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.
Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.
Run: modal run modal_app.py
Logs: the function's print() output streams to your terminal.
"""
from __future__ import annotations
import modal
# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        # Needed because HF_HUB_ENABLE_HF_TRANSFER=1 is set below; without the
        # package, downloads raise an error. Unpinned here; pin to match local.
        "hf_transfer",
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Silence the tokenizers fork warning and avoid thread oversubscription.
        "TOKENIZERS_PARALLELISM": "false",
    })
)
# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)
# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")
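# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# NOTE: the run fails if no secret named "huggingface-token" exists in the
# workspace. Create it first, or delete this line and `secrets=[hf_secret]`.
hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])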
# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                      # see §1: cheapest 24 GB option that fits
    cpu=4.0,                       # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,              # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,               # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16, which the L4 supports (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch (a single forward+backward
    # against an LM-head loss), *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```
### 2.1 What's deliberately *not* in the skeleton
- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
### 2.2 What you do need to add when you wire the real loss
Replace the synthetic `for step in range(50)` body with:
```python
from data_collator import ComposerDataCollator # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
```
The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
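One possible shape for such a stub is sketched below. It uses only the key names listed above plus rejected-side counterparts; the rejected key names, the tensor shapes, and the per-sequence layout of `dpo_chosen_ref_logprobs` are assumptions that must be checked against the collator and `composer_trainer.py` before use. `make_synthetic_batch` is a hypothetical helper.

```python
import torch


def make_synthetic_batch(batch_size=2, seq_len=16, vocab_size=1000, seed=0):
    """Build fixed CPU tensors for the loss-path keys (shapes are assumptions)."""
    gen = torch.Generator().manual_seed(seed)

    def random_ids():
        return torch.randint(0, vocab_size, (batch_size, seq_len), generator=gen)

    # Treat the second half of every sequence as the "response" span.
    response_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
    response_mask[:, seq_len // 2:] = True

    input_ids = random_ids()
    return {
        "input_ids": input_ids,
        # A copy of input_ids keeps the teacher context trivial.
        "ctx_teacher_input_ids": input_ids.clone(),
        "sdpo_loss_mask": response_mask.clone(),
        "dpo_chosen_input_ids": random_ids(),
        "dpo_chosen_response_mask": response_mask.clone(),
        # ASSUMPTION: one reference log-prob per sequence.
        "dpo_chosen_ref_logprobs": torch.zeros(batch_size),
        # ASSUMPTION: rejected-side keys mirror the chosen-side names.
        "dpo_rejected_input_ids": random_ids(),
        "dpo_rejected_response_mask": response_mask.clone(),
        "dpo_rejected_ref_logprobs": torch.zeros(batch_size),
    }


if __name__ == "__main__":
    for key, value in make_synthetic_batch().items():
        print(f"{key:<28} {tuple(value.shape)} {value.dtype}")
```

For the real smoke, the same dictionary would be built at B=2, T=1024 with the Qwen vocabulary size and moved to the GPU before being handed to the trainer.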
---
## 3. Gotchas that bite *this specific workload*
The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):
```python
student_logits = model(input_ids=inputs["input_ids"]).logits # with grad
with torch.no_grad():
teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```
Two issues:
1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
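The logits footprint in point 1 is plain arithmetic over batch, sequence length, vocabulary size, and element width. A short sketch that reproduces the figure and extends it to the workload's stated ceiling of 4 sequences × 2048 tokens (`logits_bytes` is an illustrative helper):

```python
QWEN25_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch: int, seq_len: int, vocab: int, bytes_per_elem: int = 2) -> int:
    """Size of one [batch, seq_len, vocab] logits tensor (bf16 = 2 bytes/elem)."""
    return batch * seq_len * vocab * bytes_per_elem


if __name__ == "__main__":
    smoke = logits_bytes(2, 1024, QWEN25_VOCAB)
    ceiling = logits_bytes(4, 2048, QWEN25_VOCAB)
    print(f"B=2, T=1024: {smoke / 1e6:.0f} MB per tensor, {2 * smoke / 1e9:.2f} GB for two")
    print(f"B=4, T=2048: {ceiling / 1e9:.2f} GB per tensor")
```

At the ceiling shape a single logits tensor is four times larger, which is where restricting the KL to the teacher's top-k tokens starts to pay off.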
### 3.2 The DPO channel does TWO more grad'd forwards per step
`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:
| Forward | Grad? | Notes |
|---------|-------|-------|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |
**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.
For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
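The "alternating steps" idea can be expressed as a per-step weight schedule. The sketch below is hypothetical: it assumes the trainer could accept per-step values for `alpha_sdpo` and `beta_replay`, which the source does not confirm, and it counts grad-carrying forwards according to the table above.

```python
def channel_weights(step: int, alpha_sdpo: float = 0.1, beta_replay: float = 0.05):
    """Enable one auxiliary channel per step: SDPO on even steps, DPO on odd."""
    if step % 2 == 0:
        return alpha_sdpo, 0.0
    return 0.0, beta_replay


def grad_forwards(alpha: float, beta: float) -> int:
    """Grad-carrying forwards per step: GRPO, SDPO student, DPO chosen + rejected."""
    return 1 + (1 if alpha != 0 else 0) + (2 if beta != 0 else 0)


if __name__ == "__main__":
    print("every channel on:", grad_forwards(0.1, 0.05), "grad'd forwards")
    for step in range(4):
        alpha, beta = channel_weights(step)
        print(f"step {step}: alpha={alpha} beta={beta} -> {grad_forwards(alpha, beta)} grad'd forwards")
```

Alternating caps the autograd graph at three grad'd forwards instead of four, at the price of each auxiliary channel contributing on only half of the steps.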
### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun
In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:
```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```
This is **not connected to the model's autograd graph**: it is a leaf tensor with `requires_grad=True` and no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, the `0.0` contributes nothing to the parameter gradients and doesn't break things. The footgun is the step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): because the leaves themselves require grad, `total.backward()` still succeeds, but no model parameter receives a gradient and `optimizer.step()` silently does nothing. The familiar `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` appears only if a short-circuit path returns a zero *without* `requires_grad=True`. **The smoke will hit the silent case** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`.
Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
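How zero leaves of this kind behave under `backward()` can be checked on CPU in a few lines. The sketch below uses a stand-in `torch.nn.Linear` rather than the real trainer, and the helper names are illustrative; it shows that a sum of `requires_grad=True` zero leaves backpropagates without error while leaving every model parameter without a gradient, and that the `RuntimeError` arises only when the zeros do not require grad.

```python
import torch


def zero_leaf() -> torch.Tensor:
    """Mimic the short-circuit return value (on CPU here)."""
    return torch.tensor(0.0, requires_grad=True)


def all_short_circuited(model: torch.nn.Module) -> bool:
    """True if backward() runs but no parameter ends up with a gradient."""
    model.zero_grad(set_to_none=True)
    total = zero_leaf() + 0.1 * zero_leaf() + 0.05 * zero_leaf()
    total.backward()  # no error: the leaves themselves require grad
    return all(p.grad is None for p in model.parameters())


def one_live_channel(model: torch.nn.Module) -> bool:
    """True if a single model-dependent term is enough to populate gradients."""
    model.zero_grad(set_to_none=True)
    live = model(torch.ones(2, 4)).mean()
    total = live + 0.1 * zero_leaf()
    total.backward()
    return all(p.grad is not None for p in model.parameters())


def plain_zero_raises() -> bool:
    """True if zeros without requires_grad make backward() raise."""
    total = torch.tensor(0.0) + torch.tensor(0.0)
    try:
        total.backward()
    except RuntimeError:
        return True
    return False


if __name__ == "__main__":
    net = torch.nn.Linear(4, 1)
    print("silent no-op when all channels short-circuit:", all_short_circuited(net))
    print("gradients flow with one live channel:", one_live_channel(net))
    print("RuntimeError without requires_grad:", plain_zero_raises())
```

A cheap guard in the smoke is to assert after `backward()` that at least one parameter has a non-`None` `.grad`.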
### 3.4 `torch.cuda.synchronize()` before timing reads
If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
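The same rule can be wrapped in a small timing helper. This is a sketch with a hypothetical `timed` function: the `sync` argument defaults to a no-op so the helper also runs on machines without a GPU, and on CUDA you would pass `torch.cuda.synchronize`.

```python
import time


def timed(fn, sync=lambda: None):
    """Run fn() and return (result, seconds), syncing the device around it."""
    sync()  # drain previously queued work so it is not charged to fn
    start = time.perf_counter()
    result = fn()
    sync()  # wait for fn's own kernels before reading the clock
    return result, time.perf_counter() - start


if __name__ == "__main__":
    value, seconds = timed(lambda: sum(range(1_000_000)))
    print(f"result={value} elapsed={seconds * 1000:.2f} ms")
```

Inside the training loop this would look like `_, dt = timed(train_step, sync=torch.cuda.synchronize)`, where `train_step` is whatever callable performs the forward, backward, and optimizer step.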
### 3.5 The HF cache Volume must be `commit()`ed
From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
### 3.6 What does NOT bite
These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:
- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
---
## 4. Decision rule: Modal vs the local 5090
### 4.1 The numbers
**Local 5090** (32 GB VRAM, Blackwell, ~0.2 PFLOPS dense bf16):
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: **~15 seconds of pure compute**.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- **Wall clock: ~25–40 seconds.**
- **Cost: $0** (electricity ignored; the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16):
- Step time for the same workload on L4: **~1.5–4 s per step.** (On paper the L4 trails the 5090 by only ~2× in dense bf16 throughput but by ~6× in memory bandwidth, roughly 0.3 TB/s vs 1.8 TB/s; at B=2 a 0.5B step is dominated by bandwidth and launch overhead, so the realistic gap is ~5–10×.) Call it 2 s.
- 50 steps: **~100 seconds of pure compute**.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
- **Cost: $0.06–$0.13 per run** (GPU plus CPU/RAM at the 3- and 7-minute bounds from §1.3).
### 4.2 The decision rule
> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**
Reasoning:
1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting on Modal is a minute in which the smoke could have run twice locally.
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (or just import-and-run in a notebook); see §4.3.
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
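The latency argument in point 1 comes down to how many fix-and-retry cycles fit in a fixed window. A minimal sketch using the wall-clock estimates from §4.1 (the constants are those estimates, not measurements):

```python
LOCAL_RUN_S = 30        # ~30 s per smoke on the 5090
MODAL_RUN_S = 5 * 60    # ~5 min per smoke on Modal, cold start included
WINDOW_S = 30 * 60      # the 30-minute verification budget


def runs_in(window_s: int, per_run_s: int) -> int:
    """Number of complete runs that fit in the window."""
    return window_s // per_run_s


if __name__ == "__main__":
    local = runs_in(WINDOW_S, LOCAL_RUN_S)
    remote = runs_in(WINDOW_S, MODAL_RUN_S)
    print(f"local: {local} runs, Modal: {remote} runs, ratio {local // remote}x")
```

This ignores the time spent editing code between runs, which narrows the gap in practice but does not change which side wins.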
**When Modal becomes correct:**
| Scenario | Modal? | Why |
|----------|--------|-----|
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |
### 4.3 Recommended workflow
1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
3. **For the real training run**, start with A100-40GB on Modal (or H100 if you've got the credits). At the L4 step time of ~2 s, 10,000 steps would take 2 s × 10,000 = 20,000 s ≈ 5.5 hours: tolerable for a one-off check but painful for a real run.
---
## 5. References
All claims in this document are sourced from:
- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
- **Volumes**: <https://modal.com/docs/guide/volumes>, for `commit()` semantics and HF cache persistence.
- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
- **Local trainer code reviewed**:
- `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
- `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
---
## 6. TL;DR
1. **GPU: L4. Cost: ~$0.10/run. Budget: roughly 50 re-runs fit under the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
2. **Skeleton: §2.** `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
3. **Workload-specific gotchas: §3.** The 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can leave `backward()` with no gradient reaching the model, and `volume.commit()` is mandatory.
4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.