# Modal Reconnaissance: Composer 2.5 Replication GPU Smoke

**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).

**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences of ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture memory + step time.** This is *not* training; it is a smoke test.

**Cap:** $5.

**Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).

**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton and price math below are for future scale-out, not the smoke.

---

## 1. Recommended Modal GPU type & estimated cost

### 1.1 Pricing table (from primary source)

All values are copied verbatim from Modal's pricing page (fetched for this report; see §5). Modal bills per **second** of compute, not per minute or hour.

| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
|------------------|---------------|--------------|-----------|-------|------------------------|
| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom with several forward passes per step |
| **Nvidia L4** | `"L4"` | **0.000222** | **0.799** | 24 GB | ✅ **Recommended**: cheapest GPU that fits comfortably |
| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill: Modal's default recommendation, but unjustified at 0.5B |
| Nvidia A100-40GB | `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
| Nvidia A100-80GB | `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |

(The `!` suffix in `H100!` pins the function to an H100 and opts out of Modal's automatic H200 upgrade. See the GPU naming reference in §5.)
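Because billing is per second, the hourly column and the relative price of each GPU follow directly from the per-second rates. The sketch below recomputes both from the table's own numbers; the dictionary and helper names are illustrative and are not part of Modal's API.

```python
# Per-second rates exactly as listed in the pricing table above.
PER_SECOND_USD = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100": 0.001097,
}


def hourly(rate_per_second: float) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return rate_per_second * 3600


def ratio_to_l4(gpu: str) -> float:
    """Price of `gpu` relative to the L4 baseline."""
    return PER_SECOND_USD[gpu] / PER_SECOND_USD["L4"]


for gpu, rate in PER_SECOND_USD.items():
    print(f"{gpu:<10} ${hourly(rate):.3f}/hr  {ratio_to_l4(gpu):.2f}x L4")
```

The A10 comes out at about 1.38× the L4 rate and the A100-40GB at about 2.6×, which are the ratios the GPU comparison in this section relies on.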
**Auxiliary costs** (also primary, same page):

- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Minimum 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: **$30 / month free credits**, so the smoke is free if you haven't burned the budget elsewhere.

### 1.2 Why L4, not A10 or A100-40GB

The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss runs up to 5 forward passes per step (GRPO, SDPO student with grad, SDPO teacher without grad, DPO chosen and rejected; see §3.2). Even with the teacher forward held in memory you are nowhere near 24 GB.*

Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):

- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2, T=1024: ~1–2 GB
- Teacher fwd (no grad, no activations saved): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- **Subtotal: ~11–14 GB**, comfortably inside 24 GB on L4.

**A10 is also fine**, but it costs 38% more for ~30–50% extra throughput on a workload whose step time at this size is likely bound by Python and kernel-launch overhead rather than raw FLOPS. Pay the L4 rate.

**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (Modal GPU guide; see §5).

**T4 declined** because: (a) it has only 16 GB VRAM, which is tight given the multiple forward passes; (b) the older Turing architecture lacks bf16 hardware, so you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
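The back-of-envelope VRAM budget above can be reproduced in a few lines. This is a sketch under the same assumptions as the list (0.5B parameters, bf16 weights and gradients, fp32 AdamW moments, and the activation and overhead figures quoted there); the function name and its arguments are hypothetical.

```python
GB = 1e9
N_PARAMS = 0.5e9  # approximate parameter count used in the estimate above


def vram_budget_gb(
    student_act=(1.0, 2.0),  # student forward activations, GB (low, high)
    dpo_act=(2.0, 3.0),      # DPO chosen + rejected activations, GB (low, high)
    teacher=0.3,             # teacher forward without saved activations, GB
    overhead=2.0,            # framework, KV scratch, allocator slack, GB
):
    """Return (low, high) VRAM estimates in GB for one training step."""
    weights = N_PARAMS * 2 / GB  # bf16: 2 bytes per parameter
    grads = N_PARAMS * 2 / GB    # bf16 gradients
    adam = N_PARAMS * 8 / GB     # fp32 first and second moments
    fixed = weights + grads + adam + teacher + overhead
    low = fixed + student_act[0] + dpo_act[0]
    high = fixed + student_act[1] + dpo_act[1]
    return low, high


low, high = vram_budget_gb()
print(f"estimated VRAM: {low:.1f}-{high:.1f} GB (L4 has 24 GB)")
```

Changing the activation ranges is the quickest way to see how much headroom remains if batch size or sequence length grows toward the workload's stated maximum.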
### 1.3 Cost projection for the actual smoke

Assume a 30-min wall-clock budget that breaks down realistically as:

- Container cold-start + image pull: 30–90 s on the first run. Modal's container infrastructure boots in ~1 s, but an image with torch+transformers takes a one-time pull. See the cold-start reference in §5.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s. **This should be cached on a Modal Volume after run 1.**
- Setup inside the function (CUDA init, `from_pretrained`, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
- Logging, save, exit: 5 s.

**Realistic total: ~3–7 minutes of GPU-billed time per run.**

Cost per run on L4:

- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
- Plus CPU/RAM overhead (4 cores, 16 GiB RAM): 420 s × (4 × $0.0000131 + 16 × $0.00000222) ≈ $0.037 at the upper bound, or ≈ $0.016 for a 180 s run.

**Per-run all-in: $0.06 – $0.13 on L4.** That is roughly 40–90 runs before nudging the $5 cap. Comfortable.

For comparison, the same durations on A10 cost ~$0.07 – $0.17 per run, and on A100-40GB ~$0.12 – $0.28. All are under the cap, but L4 is the rational pick.

### 1.4 Region & preemption multipliers (DON'T trip on these)

From the pricing-page footer:

- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible mode would push L4 to ~$2.40/hr and is unjustified.

---

## 2. Minimal `modal_app.py` skeleton

This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative; keep them as close as practical to the local environment to avoid version drift between local debugging and Modal runs. One known exception: the RTX 5090 (Blackwell, sm_120) needs a PyTorch build with CUDA 12.8 support (PyTorch 2.7 or later), so the local environment cannot share the `torch==2.4.1` pin; align the remaining packages instead.
```python
"""modal_app.py: GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:
    modal run modal_app.py

Logs: the function's print() output streams to your terminal.
"""
from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) Image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination. The trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
# Caveat: to the best of this report's knowledge GRPOTrainer first shipped in
# trl 0.14, so confirm the trl pin against the version the trainer skeleton
# imports before wiring the real loss. The synthetic stub below does not
# import trl, so it runs either way.
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from the default PyPI index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",  # required when HF_HUB_ENABLE_HF_TRANSFER=1 is set
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Silence tokenizers fork-parallelism warnings.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# ~1 GB of Qwen weights persists here. The first run pays the download cost;
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# Referencing a named Secret that does not exist in the workspace fails at
# launch, so this stays commented out until it is actually required.
# hf_secret = modal.Secret.from_name("huggingface-token")

# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",               # see §1: cheapest 24 GB option that fits
    cpu=4.0,                # 4 cores is plenty for tokenization on a sub-1B model
    memory=16 * 1024,       # 16 GiB RAM (the value is in MiB)
    volumes={"/cache": hf_cache},
    timeout=60 * 30,        # hard 30-min cap matches the smoke spec
    # secrets=[hf_secret],  # uncomment together with hf_secret above
    # NB: keep preemptible (default). Don't pay 3× for non-preemptible.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16; the L4 (Ada Lovelace) supports it.
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch (a single forward+backward
    # against an LM-head loss), *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```

### 2.1 What's deliberately *not* in the skeleton

- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (it compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image because `trl` imports it and because `device_map` relies on it. We never launch through it.
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and for amortizing model load across many calls, which is irrelevant when you call once.
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving, and they cost money. Default scale-from-zero is right.
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.

### 2.2 What you do need to add when you wire the real loss

Replace the synthetic `for step in range(50)` body with:

```python
from data_collator import ComposerDataCollator  # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer

# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples).
# Don't pull a real RL rollout; the point is to verify the loss path,
# not the rollout path.
```

The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys that `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That is the real verification: does the 3-channel loss compute and back-propagate without shape errors? The trainer skeleton's logging will report per-channel values.

---

## 3. Gotchas that bite *this specific workload*

The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:

### 3.1 The teacher forward in SDPO doubles your effective batch memory, but only briefly

`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):

```python
student_logits = model(input_ids=inputs["input_ids"]).logits  # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```

Two things to note:

1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss`: they are handed to `generalized_jsd_loss`, which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2, T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them come to ~1.2 GB. **Negligible on a 24 GB L4**, but worth noting because logits are surprisingly fat for the Qwen vocab.
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
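The logits arithmetic in item 1 generalizes to other batch shapes. The helper below is a small illustrative sketch (the function name is hypothetical) that reproduces the 622 MB figure and shows how quickly it grows at the workload's stated maximum of 4 sequences × 2048 tokens.

```python
QWEN25_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch: int, seq_len: int, vocab: int = QWEN25_VOCAB,
                 bytes_per_elem: int = 2) -> int:
    """Size in bytes of one [batch, seq_len, vocab] logits tensor (bf16 by default)."""
    return batch * seq_len * vocab * bytes_per_elem


smoke_shape = logits_bytes(2, 1024)
max_shape = logits_bytes(4, 2048)

print(f"B=2, T=1024: {smoke_shape / 1e6:.0f} MB per logits tensor")
print(f"B=4, T=2048: {max_shape / 1e9:.2f} GB per logits tensor")
print(f"student + teacher at B=2, T=1024: {2 * smoke_shape / 1e9:.2f} GB")
```

At the maximum shape a single logits tensor is four times larger, so the student/teacher pair alone approaches 5 GB. Any fp32 upcast inside the loss would double these figures; whether `generalized_jsd_loss` upcasts is not stated here and should be checked in `opsd_loss.py`.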
### 3.2 The DPO channel does TWO more grad'd forwards per step

`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both run with grad. So each training step is:

| Forward | Grad? | Notes |
|---------|-------|-------|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |

**That's up to 4 grad'd forwards before the backward.** PyTorch holds activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 activation tapes is ~4 GB at B=2, T=1024), but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.

For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify that activation memory peaks below 16 GB. If it does not, suspect the data collator producing over-long sequences.

### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun

In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:

```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```

This tensor is **not connected to the model**: it is a fresh leaf with `requires_grad=True` and no parent op. Summed into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, it contributes nothing to any model parameter's gradient and is harmless as long as at least one other term depends on the model. The hazard is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs). Under standard PyTorch semantics `total.backward()` then *succeeds*, because the leaf zeros give `total` a `grad_fn`, but no parameter receives a gradient and `optimizer.step()` silently does nothing. The `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` only appears when the zero is created *without* `requires_grad=True`; the flag trades a loud failure for a silent one.
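The behaviour of a detached zero leaf is easy to confirm on CPU. The sketch below uses a toy `torch.nn.Linear` as a hypothetical stand-in for the policy model and rebuilds the short-circuit return without its device argument; it illustrates PyTorch semantics only and is not the trainer's code.

```python
import torch

# Toy stand-in for the policy model (hypothetical; any nn.Module behaves the same).
model = torch.nn.Linear(4, 1)


def short_circuit_zero() -> torch.Tensor:
    # Same construction as the short-circuit return, minus the device argument.
    return torch.tensor(0.0, requires_grad=True)


# Hypothetical step in which every channel short-circuits.
alpha, beta = 0.1, 0.05
total = short_circuit_zero() + alpha * short_circuit_zero() + beta * short_circuit_zero()

total.backward()  # no exception: the leaf zeros give `total` a grad_fn
params_without_grad = [name for name, p in model.named_parameters() if p.grad is None]
print("backward ok; parameters with no gradient:", params_without_grad)

# Contrast: the same sum built from plain zeros (no requires_grad) fails loudly.
plain_total = torch.tensor(0.0) + torch.tensor(0.0)
try:
    plain_total.backward()
    raised = False
except RuntimeError:
    raised = True
print("plain zeros raise on backward:", raised)
```

A cheap guard for the smoke is to assert, right after `backward()`, that at least one model parameter has a non-`None` `.grad`.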
**The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`. Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.

### 3.4 `torch.cuda.synchronize()` before timing reads

CUDA kernels launch asynchronously, so if you don't call `torch.cuda.synchronize()` before reading `time.perf_counter()` you measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this; the rule is the same as on a local GPU.

### 3.5 The HF cache Volume must be `commit()`ed

From the Modal Volumes documentation (§5): *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If the cache is never committed, run 2 re-downloads the model. Modal also documents automatic background commits for Volumes, so treat the explicit call as a safeguard and verify the current behaviour against the docs. This is the only "Modal-flavored" gotcha that bites a smoke.

### 3.6 What does NOT bite

These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke**. Don't waste mental cycles on them:

- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min, i.e. 10% overhead, which is not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF, so Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.

---

## 4. Decision rule: Modal vs the local 5090

### 4.1 The numbers

**Local 5090** (32 GB VRAM, Blackwell; by approximate published specs ~0.2 PFLOPS dense bf16 and ~1.8 TB/s memory bandwidth; the ~1.6 PFLOPS headline figure applies to FP4, not bf16):

- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter count and Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: **~15 seconds of pure compute**.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- **Wall clock: ~25–40 seconds.**
- **Cost: $0** (electricity ignored: the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).

**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16, ~0.3 TB/s memory bandwidth):

- Step time for the same workload on L4: **~1.5–4 s per step.** (By spec sheet the L4 has roughly half the 5090's dense bf16 throughput and about one-sixth of its memory bandwidth. A small-batch 0.5B step leans on bandwidth and launch overhead, so a ~5–10× gap is the working estimate until measured.) Call it 2 s.
- 50 steps: **~100 seconds of pure compute**.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
- **Cost: $0.08–$0.13 per run.**

### 4.2 The decision rule

> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**

Reasoning:

1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting for Modal is a minute in which the smoke could have run twice locally.
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (or just import-and-run in a notebook).
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue.
   The issue is that **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues: wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.

**When Modal becomes correct:**

| Scenario | Modal? | Why |
|----------|--------|-----|
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8 Modal containers in parallel beat 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so the 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta; see the multi-node reference in §5 |
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky load, loses for steady load |

### 4.3 Recommended workflow

1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Use the same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator and the Volume-specific lines. Iterate there until 50 steps run cleanly.
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step time against the 5090. That's a one-shot validation, not a development loop.
3. **For the real training run** (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). The L4 step time of ~2 s would translate to 2 s × 10,000 steps ≈ 5.5 hours, which is tolerable once but painful to iterate on.

---

## 5. References

All claims in this document are sourced from:

- **Pricing**: Modal pricing page (canonical; updated regularly by Modal, so re-fetch if cost-sensitive).
  Per-second numbers in §1.1 were captured from this page at report-write time.
- **GPU naming**: Modal GPU guide; confirms the `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
- **Cold starts**: Modal cold-start guide; "Containers boot in about one second", and the warm-up period is image pull + global imports + `enter` methods.
- **Volumes**: Modal Volumes documentation; `commit()` semantics for HF cache persistence.
- **Region/preemption multipliers**: pricing page footer and the Modal docs on region selection and preemption.
- **Multi-node beta**: Modal multi-node documentation.
- **Examples (for `Image.pip_install` patterns)**: Modal's examples repository; see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in the `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed that `super()._compute_loss(model, inputs)` works as the framework's parent call.
- **Local trainer code reviewed**:
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`

---

## 6. TL;DR

1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50 re-runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
2. **Skeleton: §2.** `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
3. **Workload-specific gotchas: §3.** The 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can leave a step with no parameter gradients without raising, and the HF cache Volume should be committed explicitly with `commit()`.
4. **Decision: run on the 5090, not Modal.** The 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.
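The per-run cost in item 1 follows from the §1.3 arithmetic and can be reproduced as below. This is a sketch under that section's assumptions (L4 rate, 4 cores, 16 GiB RAM, 3–7 minutes billed, host resources billed for the same duration as the GPU); the constant and function names are illustrative.

```python
L4_RATE = 0.000222    # USD per GPU-second (pricing table, §1.1)
CPU_RATE = 0.0000131  # USD per physical core per second
RAM_RATE = 0.00000222 # USD per GiB per second
BUDGET_USD = 5.0


def run_cost(seconds: float, gpu_rate: float = L4_RATE,
             cores: int = 4, ram_gib: int = 16) -> float:
    """All-in cost of one run: GPU plus CPU and RAM for the same duration."""
    gpu = seconds * gpu_rate
    host = seconds * (cores * CPU_RATE + ram_gib * RAM_RATE)
    return gpu + host


low, high = run_cost(180), run_cost(420)
print(f"per run: ${low:.3f} to ${high:.3f}")
print(f"runs within ${BUDGET_USD:.0f}: {int(BUDGET_USD / high)} to {int(BUDGET_USD / low)}")
```

Passing a different `gpu_rate` from the pricing table gives the equivalent range for the A10 or A100-40GB at the same durations.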