Modal Reconnaissance — Composer 2.5 Replication GPU Smoke

Audience: trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (spikes/005-integrated-trainer-skeleton/).
Workload: Qwen/Qwen2.5-0.5B-Instruct (≈ 1 GB fp16 weights), 50 forward+backward steps, custom 3-channel loss = GRPO + α·SDPO-KL + β·trace-replay-DPO. Batch size ≤ 4 sequences, each ≤ 2048 tokens.
Goal: prove the loss runs end-to-end and capture memory use and step time. This is a smoke test, not training.
Cap: $5.
Local hardware: RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
Bottom line up front: Run it locally on the 5090. Modal is the wrong tool for this specific job. The skeleton and price math below are for future scale-out, not the smoke.

1. Recommended Modal GPU type & estimated cost

1.1 Pricing table (from primary source)

All values copied verbatim from https://modal.com/pricing (fetched for this report). Modal bills per second of compute, not per minute or hour.

GPU	Modal `gpu=` string	$ / sec	$ / hour	VRAM	Verdict for this smoke
Nvidia T4	`"T4"`	0.000164	0.590	16 GB	Too small for safe headroom on 3 fwd passes
Nvidia L4	`"L4"`	0.000222	0.799	24 GB	✅ Recommended — cheapest GPU that fits comfortably
Nvidia A10	`"A10"`	0.000306	1.102	24 GB	Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B
Nvidia L40S	`"L40S"`	0.000542	1.951	48 GB	Overkill — Modal's default rec, but unjustified at 0.5B
Nvidia A100-40GB	`"A100-40GB"`	0.000583	2.099	40 GB	Overkill
Nvidia A100-80GB	`"A100-80GB"`	0.000694	2.498	80 GB	Overkill
Nvidia H100	`"H100!"`	0.001097	3.949	80 GB	Wasteful

(H100! suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s.)

Auxiliary costs (also primary, same page):

CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
Starter plan: $30 / month free credits — your smoke is free if you haven't burned the budget elsewhere.

1.2 Why L4, not A10 or A100-40GB

The mlops/modal-llm-training skill defaults to L4/A10 for "small smokes", and that holds here. The framing: Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss runs up to five forward passes per step: the GRPO forward, the SDPO student (with grad), the SDPO teacher (no grad), and the DPO chosen and rejected forwards (both with grad); see §3.2. Even with the teacher forward held in memory you are nowhere near 24 GB.

Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):

Weights: ~1.0 GB
Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
Gradients (bf16): ~1 GB
Activations for student fwd at B=2,T=1024: ~1–2 GB
Teacher fwd (no grad, no act save): ~0.3 GB
DPO chosen+rejected fwds (with grad): ~2–3 GB
HF transformers overhead, KV scratch, framework: ~2 GB
Subtotal: ~11–14 GB — comfortably inside 24 GB on L4.
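The subtotal can be re-derived with a few lines of Python. This is only a sketch that sums the estimates listed above; the dictionary keys are illustrative labels, not names from the trainer code, and single-point estimates are stored as a range with equal ends.

```python
# Back-of-envelope VRAM budget in GB, copied from the list above.
# Each entry is a (low, high) range.
BUDGET_GB = {
    "weights": (1.0, 1.0),
    "adamw_state_fp32": (4.0, 4.0),      # 8 bytes x 0.5e9 params
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward": (0.3, 0.3),
    "dpo_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}
L4_VRAM_GB = 24


def total_range(budget):
    """Sum the low ends and the high ends of every (low, high) entry."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low, high = total_range(BUDGET_GB)
print(f"estimated peak: {low:.1f}-{high:.1f} GB")
print(f"worst-case headroom on L4: {L4_VRAM_GB - high:.1f} GB")
```

Changing a single entry (for example, the activation range at a larger batch) shows quickly how much headroom remains before a 24 GB card becomes tight.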

A10 is also fine but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate.

A100-40GB is wrong. You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: "Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…" (https://modal.com/docs/guide/gpu#b200-gpus).

T4 declined because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up transformers flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.

1.3 Cost projection for the actual smoke

Assume a 30-min wall-clock budget that breaks down realistically as:

Container cold-start + image pull: 30–90 s on the first run. Modal's container infrastructure boots in ~1 s, but an image with torch+transformers needs a one-time pull. See https://modal.com/docs/guide/cold-start.
HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — should be cached on a Modal Volume after run 1.
Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): 100–200 s.
Logging, save, exit: 5 s.

Realistic total: ~3–7 minutes of GPU-billed time per run.

Cost per run on L4:

Lower bound (3 min): 180 s × $0.000222 = $0.040
Upper bound (7 min): 420 s × $0.000222 = $0.093
Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037

Per-run all-in: $0.08 – $0.13 on L4, conservatively charging the full ~$0.037 CPU/RAM overhead to both bounds. At ~$0.10 per run you can run the smoke ~50× before reaching the $5 cap. Comfortable.

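For comparison, the same 3–7 minute scenario with the same ~$0.037 CPU/RAM overhead comes to ~$0.09 – $0.17 per run on A10 and ~$0.14 – $0.28 on A100-40GB (only the GPU line scales with the GPU price). Still all under cap, but L4 is the rational pick.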
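The per-run arithmetic can be reproduced with a short sketch. The prices are the per-second figures from §1.1; the function name and the default of 4 cores and 16 GiB are illustrative choices matching the scenario above, and the same run length is assumed for every GPU.

```python
# Per-second prices copied from the pricing table and auxiliary costs in section 1.1.
GPU_PER_SEC = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_PER_CORE_SEC = 0.0000131
RAM_PER_GIB_SEC = 0.00000222


def run_cost(gpu, seconds, cores=4, ram_gib=16):
    """Return (gpu_cost, cpu_ram_cost) in dollars for one run of `seconds`."""
    gpu_cost = GPU_PER_SEC[gpu] * seconds
    overhead = (cores * CPU_PER_CORE_SEC + ram_gib * RAM_PER_GIB_SEC) * seconds
    return gpu_cost, overhead


for gpu in GPU_PER_SEC:
    for seconds in (180, 420):  # the 3-minute and 7-minute bounds
        g, o = run_cost(gpu, seconds)
        print(f"{gpu:>10} {seconds:>4d}s gpu=${g:.3f} cpu+ram=${o:.3f} total=${g + o:.3f}")
```

Because this sketch scales CPU and RAM with the run length, its 180 s totals land a little under the low end of the ranges quoted above.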

1.4 Region & preemption multipliers (DON'T trip on these)

From the pricing-page footer:

Region selection: 1.5–1.75× base price. Don't pin to a region unless you must.
Non-preemptible execution: 3× base price. Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr (3 × $0.799) and is unjustified here.
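To see what the multipliers do to the L4 rate, here is a small sketch. The multipliers are the ones quoted above; how they combine when both apply is not stated here, so each is applied on its own, and the labels are purely illustrative.

```python
L4_BASE_PER_HOUR = 0.799  # $/hour, from the pricing table in section 1.1

# Multipliers quoted from the pricing-page footer, applied one at a time.
MULTIPLIERS = {
    "default (preemptible, no region pin)": 1.0,
    "region pinned, low end": 1.5,
    "region pinned, high end": 1.75,
    "non-preemptible": 3.0,
}


def effective_rates(base_per_hour, multipliers):
    """Map each pricing option to its effective hourly GPU rate in dollars."""
    return {label: base_per_hour * m for label, m in multipliers.items()}


rates = effective_rates(L4_BASE_PER_HOUR, MULTIPLIERS)
for label, rate in rates.items():
    seven_min = rate * 7 / 60  # GPU-only cost of a 7-minute run
    print(f"{label:<38} ${rate:.2f}/hr  7-min run: ${seven_min:.3f}")
```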

2. Minimal `modal_app.py` skeleton

This is the actual file to drop into the repo, e.g. at spikes/005-integrated-trainer-skeleton/modal_app.py. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.

"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:    modal run modal_app.py
Logs:   the function's print() output streams to your terminal.
"""

from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",                  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",                   # required: HF_HUB_ENABLE_HF_TRANSFER=1 is set below
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Reproducibility for the smoke.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

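# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# Secret.from_name refers to a secret that must already exist in the workspace;
# create "huggingface-token" first, or drop this line and the secrets= argument.
hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])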

# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                       # see §1: cheapest 24 GB option that fits
    cpu=4.0,                        # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,               # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,                # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Non-preemptible execution costs 3×.
    # NB: don't pin region — the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model. bf16 — L4 supports it (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch — a single forward+backward
    # against an LM-head loss — *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024 — matches the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()

2.1 What's deliberately not in the skeleton

No flash-attn install. The flash-attn wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
No bitsandbytes, no unsloth, no xformers. All add build complexity. None give you anything on a smoke.
No DeepSpeed, no FSDP, no accelerate launch. This is single-GPU; accelerate is in the image only because trl imports it. We don't invoke it.
No web endpoint, no @app.cls, no enter method. A @app.function() with no warm-up is correct for a one-shot smoke. enter/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
No min_containers or buffer_containers. Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
No Image.from_registry. debian_slim + pip_install is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.

2.2 What you do need to add when you wire the real loss

Replace the synthetic for step in range(50) body with:

from data_collator import ComposerDataCollator        # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.

The smoke does not need a real rollout/sampling phase. Stub inputs with the keys _compute_sdpo_loss and _compute_trace_replay_loss consume (ctx_teacher_input_ids, dpo_chosen_input_ids, dpo_chosen_response_mask, dpo_chosen_ref_logprobs, sdpo_loss_mask, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
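A possible shape for such a stub batch is sketched below. Only the key names come from the text above; every shape, dtype and default (including the tiny vocabulary and the per-sequence shape of the reference log-probs) is an assumption for illustration, and the dict is deliberately incomplete because the key list above is elided. The real collator defines the authoritative layout.

```python
import torch


def make_stub_batch(batch_size=2, seq_len=64, vocab_size=1000, seed=0):
    """Fixed synthetic tensors for the batch keys named in the text.

    Hypothetical helper: shapes and dtypes are guesses, not the collator's contract.
    """
    gen = torch.Generator().manual_seed(seed)

    def random_ids():
        return torch.randint(0, vocab_size, (batch_size, seq_len), generator=gen)

    input_ids = random_ids()
    return {
        "input_ids": input_ids,
        # A copy of input_ids keeps the teacher context trivial.
        "ctx_teacher_input_ids": input_ids.clone(),
        "sdpo_loss_mask": torch.ones(batch_size, seq_len, dtype=torch.bool),
        "dpo_chosen_input_ids": random_ids(),
        "dpo_chosen_response_mask": torch.ones(batch_size, seq_len, dtype=torch.bool),
        # Assumed to be one summed log-prob per sequence.
        "dpo_chosen_ref_logprobs": torch.zeros(batch_size),
    }


batch = make_stub_batch()
for name, tensor in batch.items():
    print(f"{name:<26} {tuple(tensor.shape)} {tensor.dtype}")
```

Seeding the generator keeps the batch identical across runs, so a change in loss values between two smoke runs points at the code rather than the data.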

3. Gotchas that bite this specific workload

The Modal docs and the mlops/modal-llm-training skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:

3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly

ComposerReplicationTrainer._compute_sdpo_loss does this (composer_trainer.py L138–143):

student_logits = model(input_ids=inputs["input_ids"]).logits      # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits

Two issues:

Both logits tensors are held simultaneously in _compute_sdpo_loss — they're handed to generalized_jsd_loss which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is 2 * 1024 * 151936 * 2 bytes ≈ 622 MB. Two of them = ~1.2 GB. Negligible on a 24 GB L4 but worth noting because logits are surprisingly fat for the Qwen vocab.
Use the top_k arg in generalized_jsd_loss if you ever want to scale this up. The docstring (opsd_loss.py L54) explicitly recommends it: "top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)." On the smoke, leave it None to verify the unrestricted path; flip it on for real training.
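The logits-size arithmetic above generalizes to other batch shapes. A small sketch, with a hypothetical helper name, using the vocabulary size quoted above and 2 bytes per bf16 element:

```python
QWEN25_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch, seq_len, vocab=QWEN25_VOCAB, bytes_per_elem=2):
    """Size in bytes of one [batch, seq_len, vocab] logits tensor (bf16 by default)."""
    return batch * seq_len * vocab * bytes_per_elem


one = logits_bytes(2, 1024)
print(f"one logits tensor at B=2, T=1024: {one / 1e6:.0f} MB")
print(f"student + teacher together:       {2 * one / 1e9:.2f} GB")

# Upper end of the smoke spec: 4 sequences of 2048 tokens.
big = logits_bytes(4, 2048)
print(f"one logits tensor at B=4, T=2048: {big / 1e9:.2f} GB")
```

The size grows linearly in both batch and sequence length, which is why restricting the KL to a top-k slice of the vocabulary pays off once the shapes grow.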

3.2 The DPO channel does TWO more grad'd forwards per step

_compute_trace_replay_loss (composer_trainer.py L191–198) calls _sequence_logprobs(model, dpo_chosen_…) and _sequence_logprobs(model, dpo_rejected_…). Both are with-grad. So each training step is:

Forward	Grad?	Notes
`super()._compute_loss` (GRPO)	yes	parent's standard fwd
Student in SDPO	yes	only when alpha_sdpo ≠ 0
Teacher in SDPO	no	hint-conditioned context
DPO chosen	yes	only when beta_replay ≠ 0
DPO rejected	yes	only when beta_replay ≠ 0

That's up to 4 grad'd forwards before the backward. PyTorch will hold activations for all of them in the autograd graph until .backward() runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: enable gradient checkpointing or run the SDPO/DPO channels in alternating steps rather than every step.

For the smoke specifically: set alpha_sdpo=0.1 and beta_replay=0.05 (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.

3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun

In composer_trainer.py L136 and L155, when SDPO is short-circuited:

return torch.tensor(0.0, device=_device_of(model), requires_grad=True)

This is not in the autograd graph: it's a leaf tensor with requires_grad=True but no parent op. When you sum it into total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo, the 0.0 contributes nothing to the model's gradients and doesn't break things. The trap is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): total is then built only from these leaves, so total.backward() succeeds, yet no model parameter receives a gradient and optimizer.step() silently does nothing. The RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn only appears when a placeholder is created without requires_grad=True. The smoke will short-circuit the SDPO and DPO channels if your synthetic batch lacks ctx_teacher_input_ids and dpo_chosen_input_ids.

Fix in the smoke: ensure the synthetic batch includes at least one ctx_teacher_input_ids row (it can be a copy of input_ids to keep things trivial) so SDPO doesn't short-circuit on every step, and check after backward() that at least one model parameter has a non-None grad.
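A minimal CPU-only sketch of the short-circuit behaviour, assuming plain PyTorch autograd semantics. The torch.nn.Linear stand-in and the helper name are hypothetical, and the 0.1 and 0.05 weights are the defaults quoted in §3.2. It shows what backward() does when the total is built only from zero placeholders, with and without requires_grad=True.

```python
import torch

model = torch.nn.Linear(4, 1)  # stand-in for the real model


def short_circuit_loss():
    # Same construction as the trainer's short-circuit return value.
    return torch.tensor(0.0, requires_grad=True)


# All three channels short-circuit: the total is a sum of leaf placeholders.
total = short_circuit_loss() + 0.1 * short_circuit_loss() + 0.05 * short_circuit_loss()
total.backward()  # does not raise: the sum has a grad_fn

params_without_grad = all(p.grad is None for p in model.parameters())
print("backward() succeeded, model got no gradients:", params_without_grad)

# Contrast: placeholders built without requires_grad=True.
plain_total = torch.tensor(0.0) + 0.1 * torch.tensor(0.0)
try:
    plain_total.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# A cheap guard for the smoke loop: fail loudly if a step trained nothing.
step_trained_something = any(p.grad is not None for p in model.parameters())
print("step produced parameter gradients:", step_trained_something)
```

A guard like the last check, run after every backward(), turns a silent no-op step into something visible in the smoke's log.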

3.4 `torch.cuda.synchronize()` before timing reads

If you don't torch.cuda.synchronize() before reading time.perf_counter() you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
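The timing pattern can be factored into a small helper that takes the synchronization call as a parameter. This is a sketch with hypothetical function names; it uses a CPU stand-in for the training step so it runs anywhere, and the percentile indexing mirrors the skeleton's final report.

```python
import time


def time_steps(step_fn, n_steps, sync=lambda: None):
    """Run step_fn n_steps times, calling sync() before each clock read."""
    durations = []
    for _ in range(n_steps):
        sync()  # drain any queued work before starting the clock
        start = time.perf_counter()
        step_fn()
        sync()  # wait for this step's work to finish before stopping the clock
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Median and p95 in milliseconds, using simple index-based percentiles."""
    ordered = sorted(durations)
    return {
        "median_ms": ordered[len(ordered) // 2] * 1000,
        "p95_ms": ordered[int(len(ordered) * 0.95)] * 1000,
    }


# CPU-only stand-in for a training step.
durations = time_steps(lambda: sum(range(10_000)), n_steps=20)
print(summarize(durations))
```

On a GPU the same helper would be called with sync=torch.cuda.synchronize and the real step function in place of the lambda.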

3.5 The HF cache Volume must be `commit()`ed

From https://modal.com/docs/guide/volumes: writes to a Volume are visible to later runs only once they have been committed (or uploaded through volume.batch_upload). The skeleton calls hf_cache.commit() explicitly at the end rather than relying on any background or shutdown-time commit. If the cache is never committed, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.

3.6 What does NOT bite

These are the lessons from mlops/modal-llm-training that are not relevant to a 0.5B smoke — don't waste mental cycles on them:

❌ FSDP / DeepSpeed sharding setup. Single GPU.
❌ accelerate launch / multi-process distributed. Single GPU.
❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
❌ Tensor parallelism / sequence parallelism. Single GPU.
❌ Multi-node clusters. Single node.
❌ Memory snapshotting (enable_memory_snapshot=True). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
❌ Region pinning for data locality. The whole input is from_pretrained, served by HF — Modal's default region is fine.
❌ Custom CUDA install (Image.from_registry("nvidia/cuda:…")). The pre-built torch wheel ships its own CUDA.

4. Decision rule: Modal vs the local 5090

4.1 The numbers

Local 5090 (32 GB VRAM, Blackwell, ~0.2 PFLOPS dense bf16):

Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect ~150–400 ms per step based on parameter count and Blackwell's bf16 throughput. Call it 300 ms.
50 steps: ~15 seconds of pure compute.
Plus model load (one-time, from local HF cache): ~5 seconds.
Plus data collator setup: ~3 seconds.
Wall clock: ~25–40 seconds.
Cost: $0 (electricity ignored: the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).

Modal L4 (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16):

Step time for the same workload on L4: ~1.5–4 s per step. (On published dense bf16 throughput the L4 is only about 1.7× slower than the 5090; the ~5–10× gap assumed here is a conservative planning figure that also allows for the L4's much lower memory bandwidth, not a measurement.) Call it 2 s.
50 steps: ~100 seconds of pure compute.
Plus container cold start, image pull, model download (cached after run 1), CUDA init: 30–90 s on first run, 20–40 s afterward.
Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).
Cost: $0.08–$0.13 per run.

4.2 The decision rule

For this specific 30-min smoke: run on the 5090. Do not use Modal.

Reasoning:

Latency: the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a 10× iteration penalty on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting for Modal is a minute in which the smoke could have run twice locally.
Memory headroom: the 5090's 32 GB is larger than the L4's 24 GB. There is no memory motivation to leave the local box.
Network friction: every Modal run requires modal run, syncing local code, waiting for image, watching logs. Local is python local_smoke.py (see §4.3), or just import-and-run in a notebook.
Cost asymmetry vs. iteration cost: $0.10/run is not the issue. The issue is 30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss.
The framework hasn't been verified to run end-to-end yet. The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the requires_grad=True zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
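The latency argument can be put in numbers with a short sketch. It uses the ~30 s and ~5 min wall-clock estimates from §4.1 and ignores the time spent editing code between runs, so it is an upper bound on both sides; the function name is illustrative.

```python
def runs_per_window(window_s, run_s):
    """How many back-to-back runs of length run_s fit into window_s seconds."""
    return int(window_s // run_s)


WINDOW_S = 30 * 60   # the 30-minute smoke budget
LOCAL_RUN_S = 30     # ~30 s per run on the 5090
MODAL_RUN_S = 5 * 60  # ~5 min per run on Modal, cold start included

local_runs = runs_per_window(WINDOW_S, LOCAL_RUN_S)
modal_runs = runs_per_window(WINDOW_S, MODAL_RUN_S)
print(f"local 5090: up to {local_runs} runs in 30 min")
print(f"Modal L4:   up to {modal_runs} runs in 30 min")
print(f"iteration ratio: {local_runs / modal_runs:.0f}x")
```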

When Modal becomes correct:

Scenario	Modal?	Why
30-min smoke on 0.5B (this task)	No	5090 wins on every dimension
Sweep alpha_sdpo, beta_replay across 8 configs in parallel	Yes	8× Modal containers in parallel beats 8 sequential runs on one 5090
Scale to Qwen2.5-7B (real training)	Yes	7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100
Scale to multi-node (40B+)	Yes (with caveats)	Modal multi-node is in beta — see https://modal.com/docs/guide/multi-node-training
24/7 inference of trained model	Maybe	Depends on QPS; Modal serverless wins for spiky, loses for steady

4.3 Recommended workflow

Write the smoke as local_smoke.py that runs on the 5090. Same body as modal_app.py's smoke() function, minus the @app.function decorator. Iterate there until 50 steps run cleanly.
Then drop the body into modal_app.py (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
For the real training run (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). At the L4 step time of ~2 s, 10,000 steps would take 2 s × 10,000 = 20,000 s ≈ 5.5 hours: a step time that is fine for a smoke becomes painful for a real run.
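The projection for a longer run is simple enough to script. A sketch using the ~2 s L4 step-time estimate and the L4 hourly price from §1.1; the cost covers the GPU line only, and the step count of 10,000 is the illustrative figure used above.

```python
L4_PER_HOUR = 0.799  # $/hour, from the pricing table in section 1.1


def projected_hours(step_seconds, n_steps):
    """Wall-clock hours of pure stepping, ignoring startup and checkpointing."""
    return step_seconds * n_steps / 3600


hours = projected_hours(step_seconds=2.0, n_steps=10_000)
gpu_cost = hours * L4_PER_HOUR
print(f"10,000 steps at 2 s/step: {hours:.2f} h, GPU-only cost on L4: ${gpu_cost:.2f}")
```

Plugging in a measured step time from the smoke's final report gives a more trustworthy projection than the estimate used here.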

5. References

All claims in this document are sourced from:

Pricing: https://modal.com/pricing (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
GPU naming: https://modal.com/docs/guide/gpu — confirms gpu="L4", gpu="A10" (not "A10G"), gpu="A100-40GB", gpu="H100!" syntax.
Cold starts: https://modal.com/docs/guide/cold-start — "Containers boot in about one second" + the warm-up period is image pull + global imports + enter methods.
Volumes: https://modal.com/docs/guide/volumes — commit() semantics for HF cache persistence.
Region/preemption multipliers: pricing page footer + https://modal.com/docs/guide/preemption.
Multi-node beta: https://modal.com/docs/guide/multi-node-training.
Examples (for Image.pip_install patterns): https://github.com/modal-labs/modal-examples — see 06_gpu_and_ml/llm-finetuning/ for similar 0.5B/3B finetune patterns.
TRL GRPOTrainer._compute_loss extension point: verified in composer_trainer.py header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed super()._compute_loss(model, inputs) works as the framework's parent-call.
Local trainer code reviewed:
- /mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
- /mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py

6. TL;DR

GPU: L4. Cost: ~$0.10/run, so roughly 50 runs fit under the $5 cap. Don't pay for A10, A100, or H100 on a 0.5B smoke.
Skeleton: §2 — gpu="L4", 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
Workload-specific gotchas: §3 — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor requires_grad=True short-circuit can break backward(), and volume.commit() is mandatory.
Decision: run on the 5090, not Modal. 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.