	# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke

	Audience: trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
	Workload: `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences, each ≤ 2048 tokens. Goal: prove the loss runs end-to-end and capture memory and step time. This is a smoke test, not a training run.
	Cap: $5. Local hardware: RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
	Bottom line up front: Run it locally on the 5090. Modal is the wrong tool for this specific job. The skeleton and price math below are for future scale-out, not the smoke.

	---

	## 1. Recommended Modal GPU type & estimated cost

	### 1.1 Pricing table (from primary source)

	All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per second of compute, not per minute or hour.

	| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
	|-----|---------------------|---------|----------|------|------------------------|
	| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
	| Nvidia L4 | `"L4"` | 0.000222 | 0.799 | 24 GB | ✅ Recommended: cheapest GPU that fits comfortably |
	| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
	| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill; Modal's default rec, but unjustified at 0.5B |
	| Nvidia A100-40GB | `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
	| Nvidia A100-80GB | `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
	| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |

	(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)

	Auxiliary costs (also primary, same page):
	- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
	- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
	- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
	- Starter plan: $30 / month free credits — your smoke is free if you haven't burned the budget elsewhere.
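
The hourly column and the price ratios quoted in this section follow directly from the per-second rates. A minimal sketch of that arithmetic, using only the numbers from the table above (the helper names `hourly` and `ratio_to` are illustrative and not part of any Modal API):

```python
# Per-second GPU rates (USD) as listed in the pricing table above.
GPU_RATE_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100": 0.001097,
}


def hourly(rate_per_sec: float) -> float:
    """Convert a per-second rate into a per-hour rate."""
    return rate_per_sec * 3600


def ratio_to(gpu: str, baseline: str = "L4") -> float:
    """Return how many times more expensive `gpu` is than `baseline`."""
    return GPU_RATE_PER_SEC[gpu] / GPU_RATE_PER_SEC[baseline]


for name, rate in GPU_RATE_PER_SEC.items():
    print(f"{name:<10} ${hourly(rate):.3f}/h  {ratio_to(name):.2f}x the L4 rate")
```

Run as-is, this reproduces the `$ / hour` column to three decimals and the ratios of roughly 1.38× (A10) and 2.6× (A100-40GB) relative to L4.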

	### 1.2 Why L4, not A10 or A100-40GB

	The skill mlops/modal-llm-training defaults to L4/A10 for "small smokes" and that holds here. The framing: Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.

	Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
	- Weights: ~1.0 GB
	- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
	- Gradients (bf16): ~1 GB
	- Activations for student fwd at B=2,T=1024: ~1–2 GB
	- Teacher fwd (no grad, no act save): ~0.3 GB
	- DPO chosen+rejected fwds (with grad): ~2–3 GB
	- HF transformers overhead, KV scratch, framework: ~2 GB
	- Subtotal: ~11–14 GB — comfortably inside 24 GB on L4.
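
The budget above can be tallied in a few lines. This is a sketch under the same assumptions as the list (about 0.5B parameters, bf16 weights and gradients, fp32 AdamW moments); the activation and overhead ranges are the rough guesses from the list, not measurements.

```python
PARAMS = 0.5e9  # approximate parameter count used in the estimate above
GB = 1e9

# (low, high) estimates in GB. The first three rows follow from bytes per
# parameter; the remaining rows are the rough ranges quoted in the list.
budget_gb = {
    "weights (bf16, 2 B/param)": (PARAMS * 2 / GB, PARAMS * 2 / GB),
    "AdamW m+v (fp32, 8 B/param)": (PARAMS * 8 / GB, PARAMS * 8 / GB),
    "gradients (bf16, 2 B/param)": (PARAMS * 2 / GB, PARAMS * 2 / GB),
    "student activations": (1.0, 2.0),
    "teacher forward (no grad)": (0.3, 0.3),
    "DPO chosen + rejected forwards": (2.0, 3.0),
    "framework overhead": (2.0, 2.0),
}

low_gb = sum(lo for lo, _ in budget_gb.values())
high_gb = sum(hi for _, hi in budget_gb.values())

L4_VRAM_GB = 24
print(f"estimated peak: {low_gb:.1f} to {high_gb:.1f} GB "
      f"(headroom on L4: {L4_VRAM_GB - high_gb:.1f} GB or more)")
```

The measured `peak_vram` from the smoke is the number to trust; this tally only says which order of magnitude to expect.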

	A10 is also fine but costs 38% more for ~30–50% extra throughput on a workload where step time is already dominated by Python overhead rather than GPU throughput (see §3). Pay the L4 rate.

	A100-40GB is wrong. You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: "Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…" (<https://modal.com/docs/guide/gpu#b200-gpus>).

	T4 is ruled out because: (a) it has only 16 GB VRAM, which is tight given 3 fwd passes; (b) the older Turing architecture lacks bf16 hardware, so you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.

	### 1.3 Cost projection for the actual smoke

	Assume a 30-min wall-clock budget that breaks down realistically as:
	- Container cold-start + image pull: 30–90 s on the first run. Modal's container infrastructure boots in ~1 s, but an image with torch + transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
	- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — should be cached on a Modal Volume after run 1.
	- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
	- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): 100–200 s.
	- Logging, save, exit: 5 s.

	Realistic total: ~3–7 minutes of GPU-billed time per run.

	Cost per run on L4:
	- Lower bound (3 min): 180 s × $0.000222 = $0.040
	- Upper bound (7 min): 420 s × $0.000222 = $0.093
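	- Plus CPU/RAM overhead (4 cores + 16 GiB RAM, at the 7-min upper bound): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037

	Per-run all-in: $0.08 – $0.13 on L4, conservatively adding the full 7-min CPU/RAM overhead to both bounds. At ~$0.10 per run you can run the smoke ~50× before nudging the $5 cap. Comfortable.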

	For comparison, over the same 3–7 minute range and with the same ~$0.037 CPU/RAM overhead: A10 comes to roughly $0.09 – $0.17 per run and A100-40GB to roughly $0.14 – $0.28. Still all under cap, but L4 is the rational pick.
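
The per-run figures can be recomputed for any GPU and duration with a small helper. This sketch uses only the rates from §1.1 and bills CPU and RAM for the same duration as the GPU; `run_cost` is an illustrative name. Billing host resources this way gives slightly lower floors than the figures above, which add the full 7-minute CPU/RAM overhead to both bounds.

```python
# Per-second rates (USD) from the pricing table and auxiliary-cost list.
GPU_RATE = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_RATE = 0.0000131    # per physical core
RAM_RATE = 0.00000222   # per GiB


def run_cost(seconds: float, gpu: str = "L4",
             cores: float = 4, ram_gib: float = 16) -> float:
    """Estimated cost of one run: GPU plus CPU/RAM billed for the same duration."""
    host_rate = cores * CPU_RATE + ram_gib * RAM_RATE
    return seconds * (GPU_RATE[gpu] + host_rate)


for gpu in GPU_RATE:
    low, high = run_cost(180, gpu), run_cost(420, gpu)
    print(f"{gpu:<10} 3 min: ${low:.3f}   7 min: ${high:.3f}   "
          f"runs under $5 at 7 min each: {int(5 // high)}")
```

At the 7-minute upper bound the L4 run costs about $0.13, so the $5 cap covers roughly 38 worst-case runs; the ~50 figure corresponds to the ~$0.10 typical case.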

	### 1.4 Region & preemption multipliers (DON'T trip on these)

	From the pricing-page footer:
	- Region selection: 1.5–1.75× base price. Don't pin to a region unless you must.
	- Non-preemptible execution: 3× base price. Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr and is unjustified.
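
To see what each multiplier does to the L4 rate, here is a short sketch. It applies each multiplier on its own; how Modal combines them when both apply is not stated in the text above, so no combined figure is computed.

```python
L4_HOURLY = 0.000222 * 3600  # base L4 rate in USD per hour


def with_multiplier(base_hourly: float, multiplier: float) -> float:
    """Apply a single pricing multiplier to a base hourly rate."""
    return base_hourly * multiplier


non_preemptible = with_multiplier(L4_HOURLY, 3.0)
region_low = with_multiplier(L4_HOURLY, 1.5)
region_high = with_multiplier(L4_HOURLY, 1.75)

print(f"base L4:          ${L4_HOURLY:.2f}/h")
print(f"non-preemptible:  ${non_preemptible:.2f}/h")
print(f"region-pinned:    ${region_low:.2f} to ${region_high:.2f}/h")
```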

	---

	## 2. Minimal `modal_app.py` skeleton

	This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.

	```python
	"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.

	Goal: run ~50 forward+backward steps of the 3-channel loss
	(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
	capture peak VRAM and per-step latency, and exit. Single L4, single container.

	Run: modal run modal_app.py
	Logs: the function's print() output streams to your terminal.
	"""

	from __future__ import annotations

	import modal

	# ---------------------------------------------------------------------------
	# 1) App + image
	# ---------------------------------------------------------------------------
	# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
	# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
	# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
	# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
	# correct override hook (DeepWiki audit anchor: huggingface/trl).
	image = (
	    modal.Image.debian_slim(python_version="3.11")
	    .apt_install("git")
	    .pip_install(
	        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
	        "transformers==4.46.3",
	        "accelerate==1.1.1",
	        "peft==0.14.0",
	        "trl==0.12.2",  # confirm this release exposes GRPOTrainer before wiring the real loss
	        "datasets==3.1.0",
	        "huggingface_hub==0.26.5",
	        "hf_transfer",  # required because HF_HUB_ENABLE_HF_TRANSFER=1 is set below
	    )
	    .env({
	        # Force HF to use the mounted Volume for model + dataset cache.
	        "HF_HOME": "/cache/hf",
	        "TRANSFORMERS_CACHE": "/cache/hf",
	        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
	        # Make Python flush prints immediately so we see step times live.
	        "PYTHONUNBUFFERED": "1",
	        # Reproducibility for the smoke.
	        "TOKENIZERS_PARALLELISM": "false",
	    })
	)

	# ---------------------------------------------------------------------------
	# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
	# ---------------------------------------------------------------------------
	# 1 GB of Qwen weights persists here. First run pays the download cost,
	# every subsequent run reuses the volume. Below 1 TiB / mo: free.
	hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

	# ---------------------------------------------------------------------------
	# 3) App + secrets
	# ---------------------------------------------------------------------------
	app = modal.App("composer-replication-smoke")

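	# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
	# The named secret must already exist in your workspace, otherwise the run
	# fails at startup; drop this line and `secrets=[hf_secret]` if you don't need it.
	hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])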

	# ---------------------------------------------------------------------------
	# 4) The smoke function
	# ---------------------------------------------------------------------------
	@app.function(
	    image=image,
	    gpu="L4",  # see §1: cheapest 24 GB option that fits
	    cpu=4.0,  # 4 cores is plenty for tokenization on a sub-1B
	    memory=16 * 1024,  # 16 GiB RAM is plenty
	    volumes={"/cache": hf_cache},
	    timeout=60 * 30,  # hard 30-min cap matches the smoke spec
	    secrets=[hf_secret],
	    # NB: keep preemptible (default). Don't pay 3× to pin.
	    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
	)
	def smoke():
	    import time
	    import torch
	    from transformers import AutoModelForCausalLM, AutoTokenizer

	    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

	    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
	          f"device={torch.cuda.get_device_name(0)} "
	          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

	    # -------------------------------------------------------------------
	    # Load tokenizer + model in bf16; the L4 supports it (Ada Lovelace).
	    # -------------------------------------------------------------------
	    t0 = time.perf_counter()
	    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
	    model = AutoModelForCausalLM.from_pretrained(
	        MODEL_ID,
	        cache_dir="/cache/hf",
	        torch_dtype=torch.bfloat16,
	        device_map="cuda:0",
	    )
	    model.train()
	    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
	          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

	    # -------------------------------------------------------------------
	    # 50-step verification loop.
	    #
	    # NOTE: this stub uses a synthetic batch (a single forward+backward
	    # against an LM-head loss), not the full 3-channel loss. The point
	    # is to (a) verify the Modal harness, (b) measure the per-step time
	    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
	    #
	    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
	    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
	    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
	    # -------------------------------------------------------------------
	    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

	    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
	    B, T = 2, 1024
	    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
	    labels = input_ids.clone()

	    torch.cuda.reset_peak_memory_stats()
	    step_times = []
	    for step in range(50):
	        t = time.perf_counter()
	        out = model(input_ids=input_ids, labels=labels)
	        out.loss.backward()
	        optimizer.step()
	        optimizer.zero_grad(set_to_none=True)
	        # Wait for queued GPU work so dt measures the full step, not dispatch.
	        torch.cuda.synchronize()
	        dt = time.perf_counter() - t
	        step_times.append(dt)
	        if step % 10 == 0:
	            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
	                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

	    # -------------------------------------------------------------------
	    # Final report.
	    # -------------------------------------------------------------------
	    median_ms = sorted(step_times)[len(step_times) // 2] * 1000
	    p95_ms = sorted(step_times)[int(len(step_times) * 0.95)] * 1000
	    peak_gb = torch.cuda.max_memory_allocated() / 1e9
	    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
	          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

	    # Persist cache for the next run.
	    hf_cache.commit()


	@app.local_entrypoint()
	def main():
	    smoke.remote()
	```
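
The skeleton's final report takes the median and the 95th percentile by indexing into the sorted step times. A standalone sketch of that summary, runnable without a GPU (the function name `summarize` is illustrative), makes it easy to sanity-check the indexing before a paid run:

```python
def summarize(step_times_s):
    """Return (median_ms, p95_ms) using the same index-based picks as the skeleton."""
    ordered = sorted(step_times_s)
    median_ms = ordered[len(ordered) // 2] * 1000
    p95_ms = ordered[int(len(ordered) * 0.95)] * 1000
    return median_ms, p95_ms


# 50 fake step times: 1 ms, 2 ms, ..., 50 ms, deliberately unsorted.
fake_times = [i / 1000 for i in range(50, 0, -1)]
median_ms, p95_ms = summarize(fake_times)
print(f"median={median_ms:.1f}ms p95={p95_ms:.1f}ms")
```

With 50 samples the picks are indices 25 and 47 of the sorted list. Note that step 0 usually includes one-time CUDA warm-up, so the median is a more representative number than the mean.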

	### 2.1 What's deliberately not in the skeleton

	- No `flash-attn` install. The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
	- No `bitsandbytes`, no `unsloth`, no `xformers`. All add build complexity. None give you anything on a smoke.
	- No DeepSpeed, no FSDP, no `accelerate launch`. This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
	- No web endpoint, no `@app.cls`, no `enter` method. A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
	- No `min_containers` or `buffer_containers`. Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
	- No `Image.from_registry`. `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.

	### 2.2 What you do need to add when you wire the real loss

	Replace the synthetic `for step in range(50)` body with:

	```python
	from data_collator import ComposerDataCollator # spike 005 path
	from trl_path.composer_trainer import ComposerReplicationTrainer
	# ...
	# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
	# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
	# point is to verify the loss path, not the rollout path.
	```

	The smoke does not need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.

	---

	## 3. Gotchas that bite this specific workload

	The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:

	### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly

	`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):

	```python
	student_logits = model(input_ids=inputs["input_ids"]).logits # with grad
	with torch.no_grad():
	    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
	```

	Two issues:

	1. Both logits tensors are held simultaneously in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. Negligible on a 24 GB L4 but worth noting because logits are surprisingly fat for the Qwen vocab.
	2. Use the `top_k` arg in `generalized_jsd_loss` if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: "top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)." On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
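
The logits figure above is simple shape arithmetic, and a helper makes it easy to re-run for other batch shapes. A minimal sketch; `logits_bytes` is an illustrative name, and 2 bytes per element corresponds to bf16:

```python
QWEN_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch: int, seq_len: int, vocab: int = QWEN_VOCAB,
                 bytes_per_elem: int = 2) -> int:
    """Size in bytes of one [batch, seq_len, vocab] logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


one_tensor = logits_bytes(2, 1024)      # student OR teacher logits
both_tensors = 2 * one_tensor           # student AND teacher held together
spec_max = logits_bytes(4, 2048)        # largest shape allowed by the smoke spec

print(f"one tensor:   {one_tensor / 1e6:.0f} MB")
print(f"both tensors: {both_tensors / 1e9:.2f} GB")
print(f"B=4, T=2048:  {spec_max / 1e9:.2f} GB per tensor")
```

At the spec's upper shape (B=4, T=2048) a single bf16 logits tensor is already about 2.5 GB, which is where restricting the KL to the teacher's top-k tokens starts to matter.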

	### 3.2 The DPO channel does TWO more grad'd forwards per step

	`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:

	| Forward | Grad? | Notes |
	|---------|-------|-------|
	| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
	| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
	| Teacher in SDPO | no | hint-conditioned context |
	| DPO chosen | yes | only when beta_replay ≠ 0 |
	| DPO rejected | yes | only when beta_replay ≠ 0 |

	That's up to 4 grad'd forwards before the backward. PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: enable gradient checkpointing or run the SDPO/DPO channels in alternating steps rather than every step.

	For the smoke specifically: set `alpha_sdpo=0.1` and `beta_replay=0.05` (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
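
A tiny helper that mirrors the table above keeps the per-step forward count straight while toggling channels. This is a hypothetical bookkeeping function, not part of the trainer, and it assumes the teacher forward only runs when the SDPO channel is active.

```python
def forwards_per_step(alpha_sdpo: float, beta_replay: float) -> tuple[int, int]:
    """Return (with_grad, no_grad) forward-pass counts for one training step."""
    with_grad = 1   # GRPO forward via super()._compute_loss
    no_grad = 0
    if alpha_sdpo != 0:
        with_grad += 1   # SDPO student forward
        no_grad += 1     # SDPO teacher forward (assumed skipped when alpha is 0)
    if beta_replay != 0:
        with_grad += 2   # DPO chosen + DPO rejected
    return with_grad, no_grad


for alpha, beta in [(0.1, 0.05), (0.1, 0.0), (0.0, 0.05), (0.0, 0.0)]:
    grad, nograd = forwards_per_step(alpha, beta)
    print(f"alpha_sdpo={alpha} beta_replay={beta}: "
          f"{grad} grad'd forward(s), {nograd} no-grad forward(s)")
```

With the trainer defaults this gives 4 grad'd forwards plus 1 no-grad forward, which is the worst case the activation-memory check is meant to cover.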

	### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun

	In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:

	```python
	return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
	```

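	This is not in the autograd graph: it is a leaf tensor with `requires_grad=True` and no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, the `0.0` contributes nothing to any model parameter's gradient and doesn't break things. The trap is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): because each placeholder has `requires_grad=True`, `total.backward()` does not raise. It succeeds silently, no model parameter receives a gradient, and `optimizer.step()` changes nothing. (Without `requires_grad=True` the same step would fail loudly with `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`.) The smoke drifts toward this case whenever your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`: those channels then contribute no gradient and the run looks green without ever exercising them.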

	Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
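
How this placeholder pattern behaves under autograd can be checked on CPU in a few lines. The sketch below uses a small `torch.nn.Linear` as a stand-in for the real model (an assumption for illustration; it does not import the trainer). It shows that a sum of `requires_grad=True` placeholders backpropagates without error while leaving every parameter's `.grad` unset, and that the `RuntimeError` quoted above appears when the placeholders lack `requires_grad=True`.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in for the real model


def short_circuit() -> torch.Tensor:
    # Same pattern as the trainer snippet: a fresh leaf with requires_grad=True.
    return torch.tensor(0.0, requires_grad=True)


# Case 1: every channel short-circuits. backward() succeeds, model gets nothing.
total = short_circuit() + 0.1 * short_circuit() + 0.05 * short_circuit()
total.backward()
no_param_grads = all(p.grad is None for p in model.parameters())

# Case 2: one channel actually touches the model, so parameters get gradients.
model.zero_grad(set_to_none=True)
real_loss = model(torch.ones(2, 4)).mean()
(real_loss + 0.1 * short_circuit()).backward()
has_param_grads = all(p.grad is not None for p in model.parameters())

# Case 3: placeholders WITHOUT requires_grad=True fail loudly instead.
try:
    (torch.tensor(0.0) + 0.1 * torch.tensor(0.0)).backward()
    raised = False
except RuntimeError:
    raised = True

print(f"all-placeholder step left params without grads: {no_param_grads}")
print(f"real channel produced param grads:              {has_param_grads}")
print(f"plain zero tensors raise on backward:           {raised}")
```

A cheap guard in the smoke is to assert after `backward()` that at least one model parameter has a non-`None` gradient, so a fully short-circuited step cannot pass unnoticed.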

	### 3.4 `torch.cuda.synchronize()` before timing reads

	If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.

	### 3.5 The HF cache Volume must be `commit()`ed

	From <https://modal.com/docs/guide/volumes>: Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`). The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.

	### 3.6 What does NOT bite

	These are the lessons from `mlops/modal-llm-training` that are not relevant to a 0.5B smoke — don't waste mental cycles on them:

	- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
	- ❌ `accelerate launch` / multi-process distributed. Single GPU.
	- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
	- ❌ Tensor parallelism / sequence parallelism. Single GPU.
	- ❌ Multi-node clusters. Single node.
	- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
	- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
	- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.

	---

	## 4. Decision rule: Modal vs the local 5090

	### 4.1 The numbers

	Local 5090 (32 GB VRAM, Blackwell, roughly 0.2 PFLOPS dense bf16; the petaflop-scale figures quoted for this card are for sparse low-precision formats):
	- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect ~150–400 ms per step, a rough guess from parameter count and Blackwell's bf16 throughput. Call it 300 ms.
	- 50 steps: ~15 seconds of pure compute.
	- Plus model load (one-time, from local HF cache): ~5 seconds.
	- Plus data collator setup: ~3 seconds.
	- Wall clock: ~25–40 seconds.
	- Cost: $0 (ignoring electricity: the 5090 draws ~600 W under load, and 600 W × 40 s ≈ 6.7 Wh ≈ $0.001).

	Modal L4 (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16):
	- Step time for the same workload on L4: ~1.5–4 s per step. (On paper the L4 has a bit over half the 5090's dense bf16 throughput and roughly one-sixth of its memory bandwidth; a small-batch step like this is unlikely to be purely compute-bound, so a ~5–10× gap is a working guess until measured.) Call it 2 s.
	- 50 steps: ~100 seconds of pure compute.
	- Plus container cold start, image pull, model download (cached after run 1), CUDA init: 30–90 s on first run, 20–40 s afterward.
	- Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).
	- Cost: $0.08–$0.13 per run.
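
The compute and electricity figures above reduce to two multiplications. A short sketch using the "call it" step times from the lists; the electricity price is an assumed round number, not something taken from this report:

```python
STEPS = 50


def compute_seconds(step_s: float, steps: int = STEPS) -> float:
    """Pure compute time for a run, ignoring load and cold-start overhead."""
    return steps * step_s


local_s = compute_seconds(0.3)   # "call it 300 ms" per step on the 5090
modal_s = compute_seconds(2.0)   # "call it 2 s" per step on the L4

# Electricity for one local run.
WATTS = 600
RUN_SECONDS = 40
USD_PER_KWH = 0.15  # assumption: illustrative residential price
energy_wh = WATTS * RUN_SECONDS / 3600
energy_usd = energy_wh / 1000 * USD_PER_KWH

print(f"5090 compute: {local_s:.0f} s   L4 compute: {modal_s:.0f} s "
      f"({modal_s / local_s:.1f}x)")
print(f"local energy: {energy_wh:.1f} Wh, about ${energy_usd:.3f}")
```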

	### 4.2 The decision rule

	> For this specific 30-min smoke: run on the 5090. Do not use Modal.

	Reasoning:

	1. Latency: the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a 10× iteration penalty on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Each 5-minute Modal round-trip is time in which the smoke could have run about 10 times locally.
	2. Memory headroom: the 5090's 32 GB is larger than the L4's 24 GB. There is no memory motivation to leave the local box.
	3. Network friction: every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (or just import-and-run in a notebook).
	4. Cost asymmetry vs. iteration cost: $0.10/run is not the issue. The issue is that 30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss.
	5. The framework hasn't been verified to run end-to-end yet. The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.

	When Modal becomes correct:

	| Scenario | Modal? | Why |
	|----------|--------|-----|
	| 30-min smoke on 0.5B (this task) | No | 5090 wins on every dimension |
	| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | Yes | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
	| Scale to Qwen2.5-7B (real training) | Yes | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
	| Scale to multi-node (40B+) | Yes (with caveats) | Modal multi-node is in beta; see <https://modal.com/docs/guide/multi-node-training> |
	| 24/7 inference of trained model | Maybe | Depends on QPS; Modal serverless wins for spiky, loses for steady |

	### 4.3 Recommended workflow

	1. Write the smoke as `local_smoke.py` that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
	2. Then drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
	3. For the real training run (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). The L4 step time of ~2 s is fine for a 50-step smoke, but 2 s × 10,000 steps ≈ 5.6 hours is painful for a real run.
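
The step-count arithmetic in item 3 generalizes to any step time and run length. A small sketch; the 10,000-step run is the illustrative figure from the text, and the GPU-only cost ignores CPU, RAM, and any multipliers:

```python
L4_RATE_PER_SEC = 0.000222  # USD, from the pricing table in §1.1


def run_hours(steps: int, step_s: float) -> float:
    """Wall-clock hours of pure compute for `steps` steps at `step_s` seconds each."""
    return steps * step_s / 3600


def gpu_cost(steps: int, step_s: float, rate_per_sec: float = L4_RATE_PER_SEC) -> float:
    """GPU-only cost in USD for the same run."""
    return steps * step_s * rate_per_sec


hours = run_hours(10_000, 2.0)
cost = gpu_cost(10_000, 2.0)
print(f"10,000 steps at 2 s/step on L4: {hours:.1f} h, about ${cost:.2f} of GPU time")
```

The dollar figure is small; the argument for a faster GPU on a real run is the hours of wall-clock time, not the bill.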

	---

	## 5. References

	All claims in this document are sourced from:

	- Pricing: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
	- GPU naming: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
	- Cold starts: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
	- Volumes: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
	- Region/preemption multipliers: pricing page footer + <https://modal.com/docs/guide/preemption>.
	- Multi-node beta: <https://modal.com/docs/guide/multi-node-training>.
	- Examples (for `Image.pip_install` patterns): <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
	- TRL `GRPOTrainer._compute_loss` extension point: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
	- Local trainer code reviewed:
	  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
	  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`

	---

	## 6. TL;DR

	1. GPU: L4. Cost: ~$0.10/run, so roughly 50 re-runs fit under the $5 cap. Don't pay for A10, A100, or H100 on a 0.5B smoke.
	2. Skeleton: §2 — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
	3. Workload-specific gotchas: §3 — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can break `backward()`, and `volume.commit()` is mandatory.
	4. Decision: run on the 5090, not Modal. 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.