# Modal Reconnaissance — Composer 2.5 Replication GPU Smoke

**Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`).
**Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke.
**Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
**Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.

---

## 1. Recommended Modal GPU type & estimated cost

### 1.1 Pricing table (from primary source)

All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour.

| GPU            | Modal `gpu=` string  | $ / sec      | $ / hour | VRAM   | Verdict for this smoke |
|----------------|----------------------|--------------|----------|--------|------------------------|
| Nvidia T4      | `"T4"`               | 0.000164     | 0.590    | 16 GB  | Too small for safe headroom on 3 fwd passes |
| **Nvidia L4**  | `"L4"`               | **0.000222** | **0.799**| 24 GB  | ✅ **Recommended** — cheapest GPU that fits comfortably |
| Nvidia A10     | `"A10"`              | 0.000306     | 1.102    | 24 GB  | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S    | `"L40S"`             | 0.000542     | 1.951    | 48 GB  | Overkill — Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB| `"A100-40GB"`       | 0.000583     | 2.099    | 40 GB  | Overkill |
| Nvidia A100-80GB| `"A100-80GB"`       | 0.000694     | 2.498    | 80 GB  | Overkill |
| Nvidia H100    | `"H100!"`            | 0.001097     | 3.949    | 80 GB  | Wasteful |

(`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.)
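
The hourly column and the relative-price comparisons follow directly from the per-second rates. A minimal sketch of that arithmetic, with the rates hard-coded from the table above (the dictionary and helper names are illustrative and not part of any Modal API):

```python
# Per-second GPU rates in USD, copied from the pricing table above.
RATES_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100": 0.001097,
}


def per_hour(rate_per_sec: float) -> float:
    """Convert a per-second rate to a per-hour rate."""
    return rate_per_sec * 3600


def relative_to(gpu: str, baseline: str = "L4") -> float:
    """Price of `gpu` as a multiple of the baseline GPU's price."""
    return RATES_PER_SEC[gpu] / RATES_PER_SEC[baseline]


if __name__ == "__main__":
    for name, rate in RATES_PER_SEC.items():
        print(f"{name:<10} ${per_hour(rate):.3f}/hr  {relative_to(name):.2f}x L4")
```

Re-running this after re-fetching the pricing page is a quick way to refresh the table if Modal changes its rates.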

**Auxiliary costs** (also primary, same page):
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere.

### 1.2 Why L4, not A10 or A100-40GB

The `mlops/modal-llm-training` skill defaults to L4/A10 for "small smokes", and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. On top of the GRPO forward, the 3-channel loss adds up to 4 more forward passes per step (student with grad and teacher without grad for SDPO, chosen + rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.*

Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2,T=1024: ~1–2 GB
- Teacher fwd (no grad, no act save): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4.
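
The subtotal can be reproduced by summing the low and high estimates from the list. The sketch below hard-codes those estimates (they are the back-of-envelope figures above, not measurements), and the names are illustrative:

```python
# (low, high) estimates in GB, taken from the bullet list above.
BUDGET_GB = {
    "weights": (1.0, 1.0),
    "optimizer_state": (4.0, 4.0),
    "gradients": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward": (0.3, 0.3),
    "dpo_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}


def adamw_state_gb(n_params: float, bytes_per_param: int = 8) -> float:
    """AdamW keeps two fp32 moments per parameter: 2 x 4 bytes."""
    return n_params * bytes_per_param / 1e9


def total_range(budget: dict) -> tuple:
    """Sum the low and high ends of every budget line."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def headroom_gb(budget: dict, vram_gb: float) -> float:
    """VRAM left over if every line lands at its high estimate."""
    return vram_gb - total_range(budget)[1]


if __name__ == "__main__":
    low, high = total_range(BUDGET_GB)
    print(f"estimated use: {low:.1f}-{high:.1f} GB")
    print(f"headroom on a 24 GB L4: {headroom_gb(BUDGET_GB, 24.0):.1f} GB")
```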

**A10 is also fine**, but it costs 38% more for ~30–50% extra throughput on a workload whose step time is largely Python overhead rather than GPU compute. Pay the L4 rate.

**A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>).

**T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.

### 1.3 Cost projection for the actual smoke

Assume a 30-min wall-clock budget that breaks down realistically as:
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infrastructure boots in ~1 s, but an image with torch + transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**.
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**.
- Logging, save, exit: 5 s.

**Realistic total: ~3–7 minutes of GPU-billed time per run.**

Cost per run on L4:
- Lower bound (3 min): 180 s × $0.000222 = **$0.040**
- Upper bound (7 min): 420 s × $0.000222 = **$0.093**
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037

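**Per-run all-in: $0.06 – $0.13 on L4** (GPU plus CPU/RAM, each billed over the same 3–7 min). At ~$0.10 for a typical run you can run the smoke ~50× before nudging the $5 cap (about 38× even if every run hits the upper bound). Comfortable.

For comparison, the same 3–7 min scenario costs ~$0.07 – $0.17 per run on A10 and ~$0.12 – $0.28 on A100-40GB. Only the GPU line scales with the GPU rate; the CPU/RAM charge stays the same. Still all under cap, but L4 is the rational pick.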
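
The per-run figures combine a GPU charge with a CPU/RAM charge over the same billed duration. A small sketch of that calculation, using the per-second rates from §1.1 and the 4-core / 16 GiB container assumed above (function and constant names are illustrative):

```python
# Per-second rates in USD from the pricing section.
GPU_PER_SEC = {"L4": 0.000222, "A10": 0.000306, "A100-40GB": 0.000583}
CPU_PER_CORE_SEC = 0.0000131
RAM_PER_GIB_SEC = 0.00000222


def run_cost(seconds: float, gpu_per_sec: float,
             cores: float = 4, ram_gib: float = 16) -> float:
    """All-in cost of one run: GPU plus CPU plus RAM for `seconds`."""
    gpu = seconds * gpu_per_sec
    cpu_ram = seconds * (cores * CPU_PER_CORE_SEC + ram_gib * RAM_PER_GIB_SEC)
    return gpu + cpu_ram


def runs_under_cap(cap_usd: float, cost_per_run: float) -> int:
    """Whole number of runs that fit under a spending cap."""
    return int(cap_usd / cost_per_run)


if __name__ == "__main__":
    for gpu, rate in GPU_PER_SEC.items():
        lo, hi = run_cost(180, rate), run_cost(420, rate)
        print(f"{gpu:<10} ${lo:.3f} - ${hi:.3f} per run")
    worst = run_cost(420, GPU_PER_SEC["L4"])
    print(f"worst-case L4 runs under $5: {runs_under_cap(5.0, worst)}")
```

Note that this holds the duration fixed across GPU types; a faster GPU would shorten the training portion, so the A10 and A100 figures are upper estimates for the same workload.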

### 1.4 Region & preemption multipliers (DON'T trip on these)

From the pricing-page footer:
- **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must.
- **Non-preemptible execution: 3× base price.** Default is preemptible; leave it. A 30-min smoke that gets preempted is fine: just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr and is unjustified.
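
To see what each multiplier does to the L4 rate, the sketch below applies them one at a time. Whether the region and non-preemptible multipliers stack is not stated in the source, so this deliberately does not combine them; the names are illustrative:

```python
L4_PER_HOUR = 0.799                 # from the pricing table
REGION_MULTIPLIERS = (1.5, 1.75)    # low and high end of the region surcharge
NON_PREEMPTIBLE_MULTIPLIER = 3.0


def scaled_rate(base_per_hour: float, multiplier: float) -> float:
    """Hourly rate after applying a single pricing multiplier."""
    return base_per_hour * multiplier


if __name__ == "__main__":
    lo, hi = (scaled_rate(L4_PER_HOUR, m) for m in REGION_MULTIPLIERS)
    print(f"L4 with region pin:  ${lo:.2f} - ${hi:.2f} / hr")
    print(f"L4 non-preemptible: ${scaled_rate(L4_PER_HOUR, NON_PREEMPTIBLE_MULTIPLIER):.2f} / hr")
```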

---

## 2. Minimal `modal_app.py` skeleton

This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.

```python
"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:    modal run modal_app.py
Logs:   the function's print() output streams to your terminal.
"""

from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
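    .pip_install(
        "torch==2.4.1",                  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",                   # confirm this pin exports GRPOTrainer; else use your local trl version
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",                   # required by HF_HUB_ENABLE_HF_TRANSFER=1 below; unpinned
    )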
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Reproducibility for the smoke.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

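# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# Secret.from_name() fails at launch if the named secret is missing from the
# workspace, so default to an empty inline secret (a genuine no-op). For a
# gated model, swap in modal.Secret.from_name("huggingface-token").
hf_secret = modal.Secret.from_dict({})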

# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                       # see §1: cheapest 24 GB option that fits
    cpu=4.0,                        # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,               # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,                # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region — the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"

    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model. bf16 — L4 supports it (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch — a single forward+backward
    # against an LM-head loss — *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024 — matches the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```

### 2.1 What's deliberately *not* in the skeleton

- **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke.
- **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
- **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once.
- **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right.
- **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.

### 2.2 What you do need to add when you wire the real loss

Replace the synthetic `for step in range(50)` body with:

```python
from data_collator import ComposerDataCollator        # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
```

The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values.
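
Before spending GPU time, it can help to fail fast when the stubbed batch is missing a key the loss functions read. The sketch below checks only the keys named in this document; the source elides the rest of the list, so `REQUIRED_KEYS_PARTIAL` is knowingly incomplete, and both helper names are hypothetical:

```python
# Keys named in this document. The full list is elided in the source,
# so treat this tuple as a partial, illustrative set.
REQUIRED_KEYS_PARTIAL = (
    "input_ids",
    "ctx_teacher_input_ids",
    "dpo_chosen_input_ids",
    "dpo_chosen_response_mask",
    "dpo_chosen_ref_logprobs",
    "sdpo_loss_mask",
)


def missing_keys(batch: dict, required=REQUIRED_KEYS_PARTIAL) -> list:
    """Return the required keys that are absent from the batch."""
    return [k for k in required if k not in batch]


def check_batch(batch: dict, required=REQUIRED_KEYS_PARTIAL) -> None:
    """Raise KeyError up front instead of failing mid-step on the GPU."""
    missing = missing_keys(batch, required)
    if missing:
        raise KeyError(f"synthetic batch is missing keys: {missing}")


if __name__ == "__main__":
    stub = {k: None for k in REQUIRED_KEYS_PARTIAL}
    check_batch(stub)  # passes silently
    print(missing_keys({"input_ids": None}))
```

Extend the tuple with whatever additional keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` actually consume in your checkout.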

---

## 3. Gotchas that bite *this specific workload*

The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:

### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly

`ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143):

```python
student_logits = model(input_ids=inputs["input_ids"]).logits      # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```

Two issues:

1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab.
2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
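
The logits footprint scales linearly with batch size, sequence length, and vocabulary size, so it is easy to project to other shapes. A minimal sketch of the calculation in item 1 (the helper name is illustrative; the vocabulary size is the one quoted above):

```python
QWEN25_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch: int, seq_len: int,
                 vocab: int = QWEN25_VOCAB, bytes_per_elem: int = 2) -> int:
    """Size of one [batch, seq_len, vocab] logits tensor (bf16 = 2 bytes)."""
    return batch * seq_len * vocab * bytes_per_elem


if __name__ == "__main__":
    one = logits_bytes(2, 1024)
    print(f"one logits tensor:   {one / 1e6:.0f} MB")
    print(f"student + teacher:   {2 * one / 1e9:.2f} GB")
    print(f"at B=4, T=2048 each: {logits_bytes(4, 2048) / 1e9:.2f} GB")
```

At the workload's stated ceiling (B=4, T=2048) a single tensor is four times the B=2, T=1024 figure, which is where the logits start to matter in the memory budget.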

### 3.2 The DPO channel does TWO more grad'd forwards per step

`_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is:

| Forward | Grad? | Notes |
|---------|-------|-------|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |

**That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step.

For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify that peak memory stays below 16 GB. If it does not, the most likely cause is a data collator producing over-long sequences.
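
The table can be turned into a quick count of forwards per step for a given coefficient setting. This sketch assumes the teacher forward runs only when SDPO is enabled, which the table implies but does not state; the function name is illustrative:

```python
def forwards_per_step(alpha_sdpo: float, beta_replay: float) -> tuple:
    """Return (total forwards, forwards with grad) for one training step."""
    total, with_grad = 1, 1          # GRPO forward always runs, with grad
    if alpha_sdpo != 0:
        total += 2                   # SDPO student + teacher
        with_grad += 1               # only the student keeps activations
    if beta_replay != 0:
        total += 2                   # DPO chosen + rejected
        with_grad += 2               # both are with-grad
    return total, with_grad


if __name__ == "__main__":
    print(forwards_per_step(0.1, 0.05))   # trainer defaults
    print(forwards_per_step(0.0, 0.0))    # GRPO only
```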

### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun

In `composer_trainer.py` L136 and L155, when SDPO is short-circuited:

```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```

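This tensor is **not in the autograd graph**: it is a leaf with `requires_grad=True` and no parent op. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo`, it contributes nothing to the model's gradients. The footgun is that this failure is silent. Because the zero tensor itself requires grad, `total.backward()` does **not** raise, even on a step where every channel short-circuited (e.g., a smoke step with no error sites and no DPO pairs); it completes, the short-circuited channels send no gradient to the model, and if no other channel is live `optimizer.step()` does nothing. (The familiar `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` only appears when nothing in the loss requires grad, which is exactly what `requires_grad=True` suppresses.) **The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`: it will appear to pass without ever exercising SDPO or DPO.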

Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row (it can be a copy of `input_ids` to keep things trivial) so SDPO doesn't short-circuit on every step.
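
A cheap guard is to check, right after `backward()`, that at least one model parameter actually received a gradient. The sketch below is duck-typed on a `.grad` attribute so it runs without torch; both helper names are hypothetical and not part of the trainer skeleton:

```python
from types import SimpleNamespace


def count_params_with_grad(params) -> int:
    """Return how many parameters had .grad populated by backward()."""
    return sum(1 for p in params if getattr(p, "grad", None) is not None)


def assert_step_trained(params) -> int:
    """Raise if backward() left every parameter without a gradient."""
    n = count_params_with_grad(params)
    if n == 0:
        raise RuntimeError(
            "backward() produced no parameter gradients; "
            "every loss channel may have short-circuited"
        )
    return n


if __name__ == "__main__":
    # Stand-ins for torch parameters: only the .grad attribute matters here.
    live = [SimpleNamespace(grad=None), SimpleNamespace(grad=0.5)]
    print(assert_step_trained(live))
```

In the smoke loop this would be called as `assert_step_trained(model.parameters())` after `backward()` and before `zero_grad(set_to_none=True)`. It only catches the case where no channel reaches the model; to confirm an individual channel ran, rely on the per-channel values the trainer logs.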

### 3.4 `torch.cuda.synchronize()` before timing reads

If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.

### 3.5 The HF cache Volume must be `commit()`ed

From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.

### 3.6 What does NOT bite

These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them:

- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.

---

## 4. Decision rule: Modal vs the local 5090

### 4.1 The numbers

**Local 5090** (32 GB VRAM, Blackwell, ~0.2 PFLOPS dense bf16):
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter count and Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: **~15 seconds of pure compute**.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- **Wall clock: ~25–40 seconds.**
- **Cost: $0** (electricity ignored: the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).

**Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16):
- Step time for the same workload on L4: **~1.5–4 s per step.** (On paper the L4 has roughly half the 5090's dense bf16 throughput and roughly one-sixth of its memory bandwidth; small-batch training at B=2 leans on bandwidth and launch overhead more than on peak FLOPS, so a realistic gap is ~5–10×.) Call it 2 s.
- 50 steps: **~100 seconds of pure compute**.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**.
- **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).**
- **Cost: $0.06–$0.13 per run.**
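
The practical difference shows up as iterations per debugging session. A small sketch using the round wall-clock figures above (~30 s local, ~5 min on Modal) and the 30-minute smoke budget; the helper name is illustrative:

```python
def runs_per_window(window_s: float, run_s: float) -> int:
    """Whole runs that fit back-to-back in a time window."""
    return int(window_s // run_s)


if __name__ == "__main__":
    window = 30 * 60                      # the 30-minute smoke budget
    local = runs_per_window(window, 30)   # ~30 s per local run
    modal = runs_per_window(window, 300)  # ~5 min per Modal run
    print(f"local 5090: {local} runs, Modal L4: {modal} runs, "
          f"ratio {local / modal:.0f}x")
```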

### 4.2 The decision rule

> **For this specific 30-min smoke: run on the 5090. Do not use Modal.**

Reasoning:

1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Each ~5-minute Modal round-trip is time in which the smoke could have run ~10 times locally.
2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box.
3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for the image, and watching logs. Local is `python local_smoke.py` (see §4.3), or just import-and-run in a notebook.
4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**.
5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.

**When Modal becomes correct:**

| Scenario | Modal? | Why |
|----------|--------|-----|
| 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> |
| 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady |

### 4.3 Recommended workflow

1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
3. **For the real training run** (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). At the L4's ~2 s/step, 10,000 steps would take 2 s × 10,000 = ~5.5 hours: acceptable for a one-off check, painful for a real run.
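
The same step-time arithmetic extends to any run length. A minimal sketch, assuming a constant step time and counting the GPU charge only (CPU/RAM and startup overhead are left out); the helper names are illustrative:

```python
L4_PER_SEC = 0.000222  # from the pricing table


def training_hours(step_s: float, steps: int) -> float:
    """Wall-clock hours of pure compute at a constant step time."""
    return step_s * steps / 3600


def gpu_cost(step_s: float, steps: int, rate_per_sec: float) -> float:
    """GPU-only charge for the compute portion of a run."""
    return step_s * steps * rate_per_sec


if __name__ == "__main__":
    print(f"{training_hours(2.0, 10_000):.2f} h on L4 at 2 s/step")
    print(f"${gpu_cost(2.0, 10_000, L4_PER_SEC):.2f} GPU charge")
```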

---

## 5. References

All claims in this document are sourced from:

- **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
- **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax.
- **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods.
- **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence.
- **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>.
- **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>.
- **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns.
- **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
- **Local trainer code reviewed**:
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`

---

## 6. TL;DR

1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke.
2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
3. **Workload-specific gotchas: §3** — 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit lets `backward()` succeed even when a channel never ran, and `volume.commit()` is mandatory.
4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.