| # Modal Reconnaissance — Composer 2.5 Replication GPU Smoke | |
| **Audience:** trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (`spikes/005-integrated-trainer-skeleton/`). | |
| **Workload:** `Qwen/Qwen2.5-0.5B-Instruct` (≈ 1 GB fp16 weights), ~50 forward+backward steps, custom 3-channel loss = `GRPO + α·SDPO-KL + β·trace-replay-DPO`. Batch size ≤ 4 sequences ≤ 2048 tokens. **Goal: prove the loss runs end-to-end and capture mem + step time.** This is *not* training — it's a smoke. | |
| **Cap:** $5. **Local hardware:** RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`). | |
| **Bottom line up front:** *Run it locally on the 5090.* Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke. | |
| --- | |
| ## 1. Recommended Modal GPU type & estimated cost | |
| ### 1.1 Pricing table (from primary source) | |
| All values copied verbatim from <https://modal.com/pricing> (fetched for this report). Modal bills per **second** of compute, not per minute or hour. | |
| | GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke | | |
| |----------------|----------------------|--------------|----------|--------|------------------------| | |
| | Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes | | |
| | **Nvidia L4** | `"L4"` | **0.000222** | **0.799**| 24 GB | ✅ **Recommended** — cheapest GPU that fits comfortably | | |
| | Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B | | |
| | Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill — Modal's default rec, but unjustified at 0.5B | | |
| | Nvidia A100-40GB| `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill | | |
| | Nvidia A100-80GB| `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill | | |
| | Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful | | |
| (`H100!` suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See <https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s>.) | |
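The per-hour column and the relative-price remarks in the table can be re-derived from the per-second rates. A minimal sketch, assuming the rates above are still current; the dictionary and helper names are illustrative and not part of any Modal API:

```python
# Per-second GPU rates (USD) copied from the table above.
# Re-check them against the pricing page before relying on the output.
RATE_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100!": 0.001097,
}


def per_hour(gpu: str) -> float:
    """Hourly price implied by per-second billing."""
    return RATE_PER_SEC[gpu] * 3600


def price_ratio(gpu: str, baseline: str = "L4") -> float:
    """How many times more expensive `gpu` is than `baseline`."""
    return RATE_PER_SEC[gpu] / RATE_PER_SEC[baseline]


if __name__ == "__main__":
    for gpu in RATE_PER_SEC:
        print(f"{gpu:<10} ${per_hour(gpu):.3f}/h  {price_ratio(gpu):.2f}x L4")
```

Keeping the rates in one table makes it cheap to re-run the comparison whenever the pricing page changes.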
| **Auxiliary costs** (also primary, same page): | |
| - CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container. | |
| - RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour. | |
| - Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace). | |
| - Starter plan: **$30 / month free credits** — your smoke is free if you haven't burned the budget elsewhere. | |
| ### 1.2 Why L4, not A10 or A100-40GB | |
| The skill mlops/modal-llm-training defaults to L4/A10 for "small smokes" and that holds here. The framing: *Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.* | |
| Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16): | |
| - Weights: ~1.0 GB | |
| - Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params) | |
| - Gradients (bf16): ~1 GB | |
| - Activations for student fwd at B=2,T=1024: ~1–2 GB | |
| - Teacher fwd (no grad, no act save): ~0.3 GB | |
| - DPO chosen+rejected fwds (with grad): ~2–3 GB | |
| - HF transformers overhead, KV scratch, framework: ~2 GB | |
| - **Subtotal: ~11–14 GB** — comfortably inside 24 GB on L4. | |
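The envelope above can be written as arithmetic so it is easy to re-run for a different model size. A rough sketch: the 0.5e9 parameter count and the activation and overhead figures are the estimates from the list, not measurements, and `estimate_vram_gb` is an illustrative helper name.

```python
GB = 1e9  # decimal gigabytes, matching the list above

# (low, high) estimates in GB copied from the list above; none are measured.
ESTIMATES_GB = {
    "student_activations": (1.0, 2.0),
    "teacher_forward": (0.3, 0.3),
    "dpo_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}


def estimate_vram_gb(n_params: float = 0.5e9) -> tuple[float, float]:
    """Return a (low, high) peak-VRAM estimate in GB."""
    weights = n_params * 2 / GB  # bf16 weights: 2 bytes per parameter
    grads = n_params * 2 / GB    # bf16 gradients: 2 bytes per parameter
    adamw = n_params * 8 / GB    # fp32 m and v: 8 bytes per parameter
    fixed = weights + grads + adamw
    low = fixed + sum(lo for lo, _ in ESTIMATES_GB.values())
    high = fixed + sum(hi for _, hi in ESTIMATES_GB.values())
    return low, high


low, high = estimate_vram_gb()
print(f"estimated peak: {low:.1f}-{high:.1f} GB (L4 has 24 GB)")
```

Only the parameter-proportional terms scale when `n_params` changes; the activation terms would need to be re-estimated for a different batch shape.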
| **A10 is also fine** but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate. | |
| **A100-40GB is wrong.** You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: *"Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…"* (<https://modal.com/docs/guide/gpu#b200-gpus>). | |
| **T4 declined** because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up `transformers` flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code. | |
| ### 1.3 Cost projection for the actual smoke | |
| Assume a 30-min wall-clock budget that breaks down realistically as: | |
| - Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch+transformers needs a one-time pull. See <https://modal.com/docs/guide/cold-start>. | |
| - HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — **should be cached on a Modal Volume after run 1**. | |
| - Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s. | |
| - 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): **100–200 s**. | |
| - Logging, save, exit: 5 s. | |
| **Realistic total: ~3–7 minutes of GPU-billed time per run.** | |
| Cost per run on L4: | |
| - Lower bound (3 min): 180 s × $0.000222 = **$0.040** | |
| - Upper bound (7 min): 420 s × $0.000222 = **$0.093** | |
| - Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037 | |
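| **Per-run all-in: roughly $0.06 – $0.13 on L4** (GPU plus CPU/RAM billed over the same 3–7 minutes). Even at the upper bound that is ~38 runs inside the $5 cap, and ~50 at a typical $0.10. Comfortable. | |
| For comparison, at the same wall-clock an A10 comes to ~$0.07 – $0.17 per run and an A100-40GB to ~$0.12 – $0.28. All are still far under the cap, but L4 is the rational pick. | |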
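The per-run figures can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions already stated in this section: the rates from §1.1, a container with 4 cores and 16 GiB of RAM, and CPU/RAM billed for the same number of seconds as the GPU. `run_cost` is an illustrative helper, not a Modal API.

```python
L4_PER_SEC = 0.000222          # USD per GPU-second (table in §1.1)
A10_PER_SEC = 0.000306
CPU_PER_CORE_SEC = 0.0000131   # USD per physical core-second
RAM_PER_GIB_SEC = 0.00000222   # USD per GiB-second
BUDGET_USD = 5.00


def run_cost(seconds: float, gpu_per_sec: float = L4_PER_SEC,
             cores: int = 4, ram_gib: int = 16) -> float:
    """All-in cost of one run: GPU + CPU + RAM over the same duration."""
    per_sec = gpu_per_sec + cores * CPU_PER_CORE_SEC + ram_gib * RAM_PER_GIB_SEC
    return seconds * per_sec


low, high = run_cost(180), run_cost(420)        # 3-minute and 7-minute runs
runs_within_cap = int(BUDGET_USD // high)       # worst-case runs under the cap

print(f"L4:  ${low:.3f} - ${high:.3f} per run, {runs_within_cap} worst-case runs")
print(f"A10: ${run_cost(180, A10_PER_SEC):.3f} - ${run_cost(420, A10_PER_SEC):.3f} per run")
```

Swapping `gpu_per_sec` is enough to compare GPU types; a faster GPU would also shorten the run, so treating the wall-clock as equal gives a conservative upper bound for the pricier options.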
| ### 1.4 Region & preemption multipliers (DON'T trip on these) | |
| From the pricing-page footer: | |
| - **Region selection: 1.5–1.75× base price.** Don't pin to a region unless you must. | |
| - **Non-preemptible execution: 3× base price.** Default is preemptible, so leave it. A 30-min smoke that gets preempted is fine; just retry. Opting into non-preemptible execution would push L4 to ~$2.40/hr and is unjustified. | |
| --- | |
| ## 2. Minimal `modal_app.py` skeleton | |
| This is the actual file to drop into the repo, e.g. at `spikes/005-integrated-trainer-skeleton/modal_app.py`. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs. | |
```python
"""modal_app.py: GPU smoke for the Composer 2.5 Replication Framework.

Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.

Run:  modal run modal_app.py
Logs: the function's print() output streams to your terminal.
"""
from __future__ import annotations

import modal

# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination; the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12.
# Before wiring the real loss, confirm that the pinned trl release actually
# ships GRPOTrainer (`from trl import GRPOTrainer`); if it does not, raise the
# pin to whatever you run locally. If you bump any of these, re-verify
# GRPOTrainer._compute_loss is still the correct override hook
# (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        "hf_transfer",  # required because HF_HUB_ENABLE_HF_TRANSFER=1 is set below
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Silence the tokenizers fork/parallelism warning.
        "TOKENIZERS_PARALLELISM": "false",
    })
)

# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)

# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")

# No Secret is attached: Qwen2.5-0.5B is open. Secret.from_name() on a name
# that does not exist in the workspace makes `modal run` fail, so only add
#   secrets=[modal.Secret.from_name("huggingface-token")]
# to @app.function below once you switch to a gated model and have created it.


# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",            # see §1: cheapest 24 GB option that fits
    cpu=4.0,             # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,    # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,     # hard 30-min cap matches the smoke spec
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16; L4 supports it (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch with a single forward+backward
    # against an LM-head loss, *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()  # wait for the GPU before reading the clock
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
```
| ### 2.1 What's deliberately *not* in the skeleton | |
| - **No `flash-attn` install.** The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45. | |
| - **No `bitsandbytes`, no `unsloth`, no `xformers`.** All add build complexity. None give you anything on a smoke. | |
| - **No DeepSpeed, no FSDP, no `accelerate launch`.** This is single-GPU; `accelerate` is in the image because `trl` imports it and because `from_pretrained(device_map=...)` requires it. We never invoke its launcher. | |
| - **No web endpoint, no `@app.cls`, no `enter` method.** A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls — not relevant when you call once. | |
| - **No `min_containers` or `buffer_containers`.** Those are warm-pool knobs for serving — they cost money. Default scale-from-zero is right. | |
| - **No `Image.from_registry`.** `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit. | |
| ### 2.2 What you do need to add when you wire the real loss | |
| Replace the synthetic `for step in range(50)` body with: | |
```python
from data_collator import ComposerDataCollator  # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
```
| The smoke does **not** need a real rollout/sampling phase. Stub `inputs` with the keys `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification — does the 3-channel loss compute and back-propagate without shape errors. The trainer skeleton's logging will tell you per-channel values. | |
| --- | |
| ## 3. Gotchas that bite *this specific workload* | |
| The Modal docs and the `mlops/modal-llm-training` skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do: | |
| ### 3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly | |
| `ComposerReplicationTrainer._compute_sdpo_loss` does this (composer_trainer.py L138–143): | |
```python
student_logits = model(input_ids=inputs["input_ids"]).logits  # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
```
| Two issues: | |
| 1. **Both logits tensors are held simultaneously** in `_compute_sdpo_loss` — they're handed to `generalized_jsd_loss` which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2,T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. **Negligible on a 24 GB L4** but worth noting because logits are surprisingly fat for the Qwen vocab. | |
| 2. **Use the `top_k` arg in `generalized_jsd_loss`** if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: *"top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)."* On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training. | |
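The logits footprint scales linearly in batch size, sequence length, and vocabulary size, so it is worth recomputing before changing the batch shape. A small sketch of that arithmetic; `logits_bytes` is an illustrative helper, and the second shape is simply the workload ceiling stated at the top of this document (batch 4, 2048 tokens):

```python
QWEN25_VOCAB = 151_936   # vocabulary size quoted above
BF16_BYTES = 2


def logits_bytes(batch: int, seq_len: int, vocab: int = QWEN25_VOCAB,
                 bytes_per_elem: int = BF16_BYTES) -> int:
    """Size in bytes of one [batch, seq_len, vocab] logits tensor."""
    return batch * seq_len * vocab * bytes_per_elem


smoke_shape = logits_bytes(2, 1024)     # the B=2, T=1024 smoke batch
max_shape = logits_bytes(4, 2048)       # the B=4, T=2048 ceiling

for name, size in [("B=2,T=1024", smoke_shape), ("B=4,T=2048", max_shape)]:
    # Student and teacher logits are alive at the same time, hence the 2x.
    print(f"{name}: {size / 1e6:.0f} MB per tensor, {2 * size / 1e9:.2f} GB for both")
```

At the ceiling shape the two tensors together are about 5 GB, which still fits on a 24 GB card but is no longer negligible.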
| ### 3.2 The DPO channel does TWO more grad'd forwards per step | |
| `_compute_trace_replay_loss` (composer_trainer.py L191–198) calls `_sequence_logprobs(model, dpo_chosen_…)` and `_sequence_logprobs(model, dpo_rejected_…)`. Both are with-grad. So each training step is: | |
| | Forward | Grad? | Notes | | |
| |---------|-------|-------| | |
| | `super()._compute_loss` (GRPO) | yes | parent's standard fwd | | |
| | Student in SDPO | yes | only when alpha_sdpo ≠ 0 | | |
| | Teacher in SDPO | no | hint-conditioned context | | |
| | DPO chosen | yes | only when beta_replay ≠ 0 | | |
| | DPO rejected | yes | only when beta_replay ≠ 0 | | |
| **That's up to 4 grad'd forwards before the backward.** PyTorch will hold activations for all of them in the autograd graph until `.backward()` runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: **enable gradient checkpointing or run the SDPO/DPO channels in alternating steps** rather than every step. | |
| For the smoke specifically: **set `alpha_sdpo=0.1` and `beta_replay=0.05`** (the trainer defaults) and verify that the activation memory peak stays below 16 GB. If it does not, suspect a bug in the data collator producing too-long sequences. | |
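The table can be turned into a quick counter for how many forward passes a given configuration triggers. A sketch only: it assumes the teacher forward is skipped whenever SDPO is off (the table does not state this), and it uses the rough "about 1 GB of activations per grad'd forward" figure implied by the 4 GB estimate above. `forwards_per_step` is an illustrative name.

```python
ACT_GB_PER_GRAD_FORWARD = 1.0  # rough figure at B=2, T=1024 on the 0.5B model


def forwards_per_step(alpha_sdpo: float = 0.1, beta_replay: float = 0.05) -> dict:
    """Count forward passes per training step for the 3-channel loss."""
    with_grad = 1   # GRPO parent forward always runs
    no_grad = 0
    if alpha_sdpo != 0:
        with_grad += 1   # SDPO student
        no_grad += 1     # SDPO teacher (assumed to be skipped when SDPO is off)
    if beta_replay != 0:
        with_grad += 2   # DPO chosen + DPO rejected
    return {
        "with_grad": with_grad,
        "no_grad": no_grad,
        "approx_activation_gb": with_grad * ACT_GB_PER_GRAD_FORWARD,
    }


print(forwards_per_step())                    # trainer defaults
print(forwards_per_step(0.0, 0.0))            # GRPO only
```

Comparing the default configuration with the GRPO-only one gives a quick expectation for how much the peak should move when a channel is switched off.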
| ### 3.3 `requires_grad=True` on the zero-tensor short-circuit is a footgun | |
| In `composer_trainer.py` L136 and L155, when SDPO is short-circuited: | |
```python
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
```
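| This is **not in the autograd graph**: it is a leaf tensor with `requires_grad=True` but no parent op, so nothing connects it to the model's parameters. When you sum it into `total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo` alongside at least one real channel, it is harmless. *But* on a step where ALL three channels short-circuited (e.g., a smoke step with no error sites and no DPO pairs), `total.backward()` still runs without error because the sum has a `grad_fn`, yet no gradient reaches any model parameter: every `p.grad` stays `None` and `optimizer.step()` silently does nothing. The familiar `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn` only appears if the placeholder is built *without* `requires_grad=True`, so the flag trades a loud failure for a silent one. **The smoke will hit this** if your synthetic batch lacks `ctx_teacher_input_ids` and `dpo_chosen_input_ids`. | |
| Fix in the smoke: ensure the synthetic batch includes at least one `ctx_teacher_input_ids` row so SDPO doesn't short-circuit on every step, and assert after `backward()` that at least one parameter has a non-`None` `.grad`. A copy of `input_ids` keeps things trivial, but because the teacher forward uses the same `model`, identical inputs give essentially identical logits and a divergence of essentially zero; perturb a few token ids if you want a non-zero SDPO value. | |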
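How such a placeholder behaves under autograd can be checked on CPU in a few lines. This is a standalone sketch, not the trainer's code: a tiny `torch.nn.Linear` stands in for the model, and `placeholder` and `grads_reach_model` are hypothetical helpers introduced only for the demonstration.

```python
import torch


def placeholder(requires_grad: bool) -> torch.Tensor:
    # Fresh leaf tensor, mirroring the short-circuit return quoted above.
    return torch.tensor(0.0, requires_grad=requires_grad)


def grads_reach_model(model: torch.nn.Module) -> bool:
    """True if at least one parameter received a gradient."""
    return any(p.grad is not None for p in model.parameters())


model = torch.nn.Linear(4, 1)  # tiny stand-in for the real network
alpha, beta = 0.1, 0.05

# Case 1: every channel short-circuits. backward() runs, the model gets nothing.
total = placeholder(True) + alpha * placeholder(True) + beta * placeholder(True)
total.backward()
all_short_circuited = grads_reach_model(model)

# Case 2: one real channel plus two placeholders. Gradients flow as usual.
real_loss = model(torch.ones(2, 4)).pow(2).mean()
(real_loss + alpha * placeholder(True) + beta * placeholder(True)).backward()
one_real_channel = grads_reach_model(model)

# Case 3: the same placeholders without requires_grad raise the RuntimeError.
try:
    (placeholder(False) + alpha * placeholder(False)).backward()
    raised = False
except RuntimeError:
    raised = True

print(all_short_circuited, one_real_channel, raised)  # False True True
```

A check like `grads_reach_model` after the first `backward()` of the smoke is a cheap guard against a step that trains nothing.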
| ### 3.4 `torch.cuda.synchronize()` before timing reads | |
| If you don't `torch.cuda.synchronize()` before reading `time.perf_counter()` you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local. | |
| ### 3.5 The HF cache Volume must be `commit()`ed | |
| From <https://modal.com/docs/guide/volumes>: *Volume writes are not persisted across runs unless you call `.commit()` (or use `volume.batch_upload`).* The skeleton calls `hf_cache.commit()` at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke. | |
| ### 3.6 What does NOT bite | |
| These are the lessons from `mlops/modal-llm-training` that are **not relevant to a 0.5B smoke** — don't waste mental cycles on them: | |
| - ❌ FSDP / DeepSpeed sharding setup. Single GPU. | |
| - ❌ `accelerate launch` / multi-process distributed. Single GPU. | |
| - ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B. | |
| - ❌ Tensor parallelism / sequence parallelism. Single GPU. | |
| - ❌ Multi-node clusters. Single node. | |
| - ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min — 10% overhead, not worth the snapshot setup time. | |
| - ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF — Modal's default region is fine. | |
| - ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA. | |
| --- | |
| ## 4. Decision rule: Modal vs the local 5090 | |
| ### 4.1 The numbers | |
| **Local 5090** (32 GB VRAM, Blackwell, roughly 0.2 PFLOPS dense bf16, ~1.8 TB/s memory bandwidth): | |
| - Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect **~150–400 ms per step** based on parameter count and Blackwell's bf16 throughput. Call it 300 ms. | |
| - 50 steps: **~15 seconds of pure compute**. | |
| - Plus model load (one-time, from local HF cache): ~5 seconds. | |
| - Plus data collator setup: ~3 seconds. | |
| - **Wall clock: ~25–40 seconds.** | |
| - **Cost: $0** (electricity ignored: the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001). | |
| **Modal L4** (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS dense bf16, ~0.3 TB/s memory bandwidth): | |
| - Step time for the same workload on L4: **~2–4 s per step.** (On paper the L4 has a bit more than half the 5090's dense bf16 throughput and about one-sixth of its memory bandwidth; the ~5–10× gap assumed here is a planning estimate, not a measurement.) Call it 2 s. | |
| - 50 steps: **~100 seconds of pure compute**. | |
| - Plus container cold start, image pull, model download (cached after run 1), CUDA init: **30–90 s on first run, 20–40 s afterward**. | |
| - **Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).** | |
| - **Cost: ~$0.06–$0.13 per run** (GPU plus CPU/RAM over 3–7 billed minutes). | |
| ### 4.2 The decision rule | |
| > **For this specific 30-min smoke: run on the 5090. Do not use Modal.** | |
| Reasoning: | |
| 1. **Latency:** the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a **10× iteration penalty** on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting on Modal is time for two more local runs. | |
| 2. **Memory headroom:** the 5090's 32 GB is **larger** than the L4's 24 GB. There is no memory motivation to leave the local box. | |
| 3. **Network friction:** every Modal run requires `modal run`, syncing local code, waiting for image, watching logs. Local is a plain `python` invocation of the same loop (see §4.3), or just import-and-run in a notebook. | |
| 4. **Cost asymmetry vs. iteration cost:** $0.10/run is not the issue. The issue is **30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss**. | |
| 5. **The framework hasn't been verified to run end-to-end yet.** The first hundred bugs you'll find are local Python issues — wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism. | |
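The latency argument is easiest to see as iterations per debugging session. A small sketch using the round numbers from point 1 (about 30 s per local run, about 5 minutes per Modal run) and the 30-minute budget of the smoke; `runs_in_window` is an illustrative helper:

```python
def runs_in_window(window_s: float, run_s: float) -> int:
    """How many complete runs fit into a time window."""
    return int(window_s // run_s)


WINDOW_S = 30 * 60          # the 30-minute smoke budget
LOCAL_RUN_S = 30            # ~30 s per run on the 5090
MODAL_RUN_S = 5 * 60        # ~5 min per run on Modal, cold start included

local_runs = runs_in_window(WINDOW_S, LOCAL_RUN_S)
modal_runs = runs_in_window(WINDOW_S, MODAL_RUN_S)

print(f"local: {local_runs} runs, Modal: {modal_runs} runs "
      f"({local_runs // modal_runs}x more fix-and-retry cycles locally)")
```

The ratio ignores time spent reading errors and editing code, which is the same in both setups and only makes the fixed Modal overhead a smaller share of each cycle, not a smaller absolute cost.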
| **When Modal becomes correct:** | |
| | Scenario | Modal? | Why | | |
| |----------|--------|-----| | |
| | 30-min smoke on 0.5B (this task) | **No** | 5090 wins on every dimension | | |
| | Sweep alpha_sdpo, beta_replay across 8 configs in parallel | **Yes** | 8× Modal containers in parallel beats 8 sequential runs on one 5090 | | |
| | Scale to Qwen2.5-7B (real training) | **Yes** | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 | | |
| | Scale to multi-node (40B+) | **Yes (with caveats)** | Modal multi-node is in beta — see <https://modal.com/docs/guide/multi-node-training> | | |
| | 24/7 inference of trained model | **Maybe** | Depends on QPS; Modal serverless wins for spiky, loses for steady | | |
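The first three rows of the table reduce to two checks: does the training state fit in local VRAM, and is there parallel work to fan out. A hedged sketch of that rule; the 12 bytes per parameter (bf16 weights, bf16 gradients, fp32 AdamW state) follows the estimate in §1.2 and ignores activations, and `where_to_run` is an illustrative function, not part of the framework:

```python
LOCAL_VRAM_GB = 32  # RTX 5090


def training_state_gb(n_params: float) -> float:
    """bf16 weights + bf16 grads + fp32 AdamW m/v = 12 bytes per parameter."""
    return n_params * 12 / 1e9


def where_to_run(n_params: float, parallel_configs: int = 1) -> str:
    """Return 'local' or 'modal' following the decision table above."""
    if training_state_gb(n_params) > LOCAL_VRAM_GB:
        return "modal"   # does not fit on the local card
    if parallel_configs > 1:
        return "modal"   # a sweep runs in parallel containers
    return "local"       # single small run: iterate locally


print(where_to_run(0.5e9))                       # the smoke
print(where_to_run(0.5e9, parallel_configs=8))   # 8-config sweep
print(where_to_run(7e9))                         # 7B training
```

The multi-node and inference rows depend on factors this sketch does not model (cluster availability, request rate), so they are left out.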
| ### 4.3 Recommended workflow | |
| 1. **Write the smoke as `local_smoke.py`** that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly. | |
| 2. **Then** drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop. | |
| 3. **For the real training run** (an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). The L4's ~2 s step time is fine for a 50-step smoke, but 2 s × 10,000 steps ≈ 5.5 hours, which is painful for a real run. | |
| --- | |
| ## 5. References | |
| All claims in this document are sourced from: | |
| - **Pricing**: <https://modal.com/pricing> (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time. | |
| - **GPU naming**: <https://modal.com/docs/guide/gpu> — confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax. | |
| - **Cold starts**: <https://modal.com/docs/guide/cold-start> — "Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods. | |
| - **Volumes**: <https://modal.com/docs/guide/volumes> — `commit()` semantics for HF cache persistence. | |
| - **Region/preemption multipliers**: pricing page footer + <https://modal.com/docs/guide/preemption>. | |
| - **Multi-node beta**: <https://modal.com/docs/guide/multi-node-training>. | |
| - **Examples (for `Image.pip_install` patterns)**: <https://github.com/modal-labs/modal-examples> — see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns. | |
| - **TRL `GRPOTrainer._compute_loss` extension point**: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call. | |
| - **Local trainer code reviewed**: | |
| - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py` | |
| - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py` | |
| --- | |
| ## 6. TL;DR | |
| 1. **GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap.** Don't pay for A10, A100, or H100 on a 0.5B smoke. | |
| 2. **Skeleton: §2** — `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin. | |
| 3. **Workload-specific gotchas: §3.** The 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can let `backward()` run without any gradient reaching the model, and `volume.commit()` is mandatory. | |
| 4. **Decision: run on the 5090, not Modal.** 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training. | |