Modal Reconnaissance — Composer 2.5 Replication GPU Smoke
Audience: trainer integrator running a one-shot 30-minute verification smoke for the Composer 2.5 Replication Framework (spikes/005-integrated-trainer-skeleton/).
Workload: Qwen/Qwen2.5-0.5B-Instruct (≈ 1 GB fp16 weights), 50 forward+backward steps, custom 3-channel loss = GRPO + α·SDPO-KL + β·trace-replay-DPO. Batch size ≤ 4 sequences ≤ 2048 tokens. Goal: prove the loss runs end-to-end and capture mem + step time. This is not training; it's a smoke.
Cap: $5. Local hardware: RTX 5090, 32 GB VRAM, Modal CLI already configured (`~/.modal.toml`).
Bottom line up front: Run it locally on the 5090. Modal is the wrong tool for this specific job. The skeleton + price math below is for future scale-out, not the smoke.
1. Recommended Modal GPU type & estimated cost
1.1 Pricing table (from primary source)
All values copied verbatim from https://modal.com/pricing (fetched for this report). Modal bills per second of compute, not per minute or hour.
| GPU | Modal `gpu=` string | $ / sec | $ / hour | VRAM | Verdict for this smoke |
|---|---|---|---|---|---|
| Nvidia T4 | `"T4"` | 0.000164 | 0.590 | 16 GB | Too small for safe headroom on 3 fwd passes |
| Nvidia L4 | `"L4"` | 0.000222 | 0.799 | 24 GB | ✅ Recommended: cheapest GPU that fits comfortably |
| Nvidia A10 | `"A10"` | 0.000306 | 1.102 | 24 GB | Acceptable; ~38% pricier than L4 for marginal speedup at sub-1B |
| Nvidia L40S | `"L40S"` | 0.000542 | 1.951 | 48 GB | Overkill: Modal's default rec, but unjustified at 0.5B |
| Nvidia A100-40GB | `"A100-40GB"` | 0.000583 | 2.099 | 40 GB | Overkill |
| Nvidia A100-80GB | `"A100-80GB"` | 0.000694 | 2.498 | 80 GB | Overkill |
| Nvidia H100 | `"H100!"` | 0.001097 | 3.949 | 80 GB | Wasteful |
(H100! suffix = pin to H100, opt out of Modal's automatic H200 upgrade. See https://modal.com/docs/guide/gpu#automatic-upgrades-to-h200s.)
Auxiliary costs (also primary, same page):
- CPU: $0.0000131 / physical-core / sec → ~$0.047 / core-hour. Min 0.125 cores per container.
- RAM: $0.00000222 / GiB / sec → ~$0.008 / GiB-hour.
- Volumes: $0.09 / GiB / month (first 1 TiB / mo free on the workspace).
- Starter plan: $30 / month free credits — your smoke is free if you haven't burned the budget elsewhere.
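As a quick sanity check on the figures above, the per-hour values are just the per-second rates times 3600. The sketch below redoes that conversion with the rates copied from this section; the names `RATE_PER_SEC` and `per_hour` are illustrative, not part of any Modal API.

```python
# Per-second rates in USD, copied from the pricing figures in section 1.1.
RATE_PER_SEC = {
    "T4": 0.000164,
    "L4": 0.000222,
    "A10": 0.000306,
    "L40S": 0.000542,
    "A100-40GB": 0.000583,
    "A100-80GB": 0.000694,
    "H100!": 0.001097,
}
CPU_PER_CORE_SEC = 0.0000131   # USD per physical core per second
RAM_PER_GIB_SEC = 0.00000222   # USD per GiB per second


def per_hour(rate_per_sec: float) -> float:
    """Convert a per-second price into a per-hour price."""
    return rate_per_sec * 3600


hourly = {gpu: round(per_hour(rate), 3) for gpu, rate in RATE_PER_SEC.items()}
cpu_core_hour = per_hour(CPU_PER_CORE_SEC)
ram_gib_hour = per_hour(RAM_PER_GIB_SEC)

# Premium of the A10 over the L4, as a fraction of the L4 rate.
a10_premium = RATE_PER_SEC["A10"] / RATE_PER_SEC["L4"] - 1

for gpu, usd in hourly.items():
    print(f"{gpu:>10s}: ${usd:.3f}/hour")
print(f"A10 premium over L4: {a10_premium:.0%}")
```

Re-running this after re-fetching the pricing page is a cheap way to catch a stale rate before it leaks into the cost projection.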
1.2 Why L4, not A10 or A100-40GB
The skill mlops/modal-llm-training defaults to L4/A10 for "small smokes" and that holds here. The framing: Qwen2.5-0.5B in fp16 is ~1 GB of weights. The 3-channel loss does ≥3 forward passes per step (student-grad, teacher no-grad for SDPO, chosen+rejected for DPO). Even with the teacher forward held in memory you are nowhere near 24 GB.
Concrete VRAM math for the workload (back of envelope, batch=2, seq=1024, bf16):
- Weights: ~1.0 GB
- Optimizer state (AdamW, fp32 m+v): ~4 GB (8 bytes × 0.5B params)
- Gradients (bf16): ~1 GB
- Activations for student fwd at B=2,T=1024: ~1–2 GB
- Teacher fwd (no grad, no act save): ~0.3 GB
- DPO chosen+rejected fwds (with grad): ~2–3 GB
- HF transformers overhead, KV scratch, framework: ~2 GB
- Subtotal: ~11–14 GB — comfortably inside 24 GB on L4.
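The budget above can be summed mechanically. This is a minimal sketch that only re-adds the estimates from the list; the dictionary keys are labels chosen here, and none of the numbers are measurements.

```python
# Back-of-envelope VRAM budget in GB (batch=2, seq=1024, bf16).
# Each entry is a (low, high) estimate copied from the list above.
BUDGET_GB = {
    "weights": (1.0, 1.0),
    "optimizer_state_adamw_fp32": (4.0, 4.0),
    "gradients_bf16": (1.0, 1.0),
    "student_activations": (1.0, 2.0),
    "teacher_forward_no_grad": (0.3, 0.3),
    "dpo_chosen_rejected_forwards": (2.0, 3.0),
    "framework_overhead": (2.0, 2.0),
}
L4_VRAM_GB = 24


def total_range(budget):
    """Sum the low and high ends of every budget line."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


low_gb, high_gb = total_range(BUDGET_GB)
headroom_gb = L4_VRAM_GB - high_gb

# AdamW keeps two fp32 moments per parameter: 2 * 4 bytes * 0.5e9 params.
optimizer_gb = 2 * 4 * 0.5e9 / 1e9

print(f"estimated peak: {low_gb:.1f}-{high_gb:.1f} GB, "
      f"headroom on L4: {headroom_gb:.1f} GB")
```

The summed range (11.3 to 13.3 GB) sits inside the ~11–14 GB subtotal, leaving roughly 10 GB of slack on a 24 GB card.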
A10 is also fine but costs 38% more for ~30–50% extra throughput on a workload where the GPU is already step-time-bound by Python overhead (see §3). Pay the L4 rate.
A100-40GB is wrong. You're paying 2.6× the L4 rate for memory you don't use and FLOPS that, on a 0.5B model with bs=2, you can't saturate. The Modal docs explicitly warn against this: "Before you jump for the most powerful (and so most expensive) GPU, make sure you understand where the bottlenecks are…" (https://modal.com/docs/guide/gpu#b200-gpus).
T4 declined because: (a) only 16 GB VRAM — tight given 3 fwd passes; (b) old Turing arch lacks bf16 hardware; you'd be on fp16/fp32, which trips up transformers flash-attention paths and adds debug surface area on a smoke that's already debugging custom loss code.
1.3 Cost projection for the actual smoke
Assume a 30-min wall-clock budget that breaks down realistically as:
- Container cold-start + image pull: 30–90 s on the first run. Modal's container infra boots in ~1 s, but an image with torch+transformers needs a one-time pull. See https://modal.com/docs/guide/cold-start.
- HF model download (Qwen 2.5-0.5B = ~1 GB shards) on first run: 15–45 s — should be cached on a Modal Volume after run 1.
- Setup inside fn (CUDA init, model.from_pretrained, optimizer build): 20–40 s.
- 50 training steps × ~2–4 s/step (3-channel loss, bs=2, seq=1024 on L4): 100–200 s.
- Logging, save, exit: 5 s.
Realistic total: ~3–7 minutes of GPU-billed time per run.
Cost per run on L4:
- Lower bound (3 min): 180 s × $0.000222 = $0.040
- Upper bound (7 min): 420 s × $0.000222 = $0.093
- Plus CPU/RAM overhead (4 cores × 16 GB RAM): ~420 s × (4 × $0.0000131 + 16 × $0.00000222) = ~$0.037
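Per-run all-in: $0.08 – $0.13 on L4. The low figure conservatively adds the full 420 s of CPU/RAM overhead to the 3-minute GPU cost; prorating that overhead to 180 s gives about $0.06. At ~$0.10 per run you can run the smoke ~50× before nudging the $5 cap. Comfortable.
For comparison, the same scenario with the same flat CPU/RAM overhead: A10 ~$0.09 – $0.17 per run; A100-40GB ~$0.14 – $0.28. Still all under cap, but L4 is the rational pick.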
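The L4 projection can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions stated in this section (4 cores, 16 GiB RAM, 3 to 7 minutes billed); `run_cost` is a helper invented here, and real bills depend on actual container lifetime.

```python
L4_RATE = 0.000222      # USD per GPU-second (section 1.1)
CPU_RATE = 0.0000131    # USD per physical core per second
RAM_RATE = 0.00000222   # USD per GiB per second


def run_cost(gpu_seconds, gpu_rate=L4_RATE, cores=4, ram_gib=16,
             overhead_seconds=None):
    """GPU cost plus CPU/RAM cost for one run.

    overhead_seconds defaults to gpu_seconds (prorated overhead); pass a
    fixed value to reproduce the flat 420 s overhead used in the text.
    """
    if overhead_seconds is None:
        overhead_seconds = gpu_seconds
    gpu = gpu_seconds * gpu_rate
    overhead = overhead_seconds * (cores * CPU_RATE + ram_gib * RAM_RATE)
    return gpu + overhead


low_flat = run_cost(180, overhead_seconds=420)   # 3 min GPU, flat overhead
high = run_cost(420)                             # 7 min GPU, 7 min overhead
low_prorated = run_cost(180)                     # 3 min GPU, 3 min overhead

CAP_USD = 5.0
runs_at_typical = int(CAP_USD / 0.10)            # at ~$0.10 per run
runs_at_worst = int(CAP_USD // high)             # at the 7-minute figure

print(f"per run: ${low_flat:.3f} to ${high:.3f} "
      f"(prorated low: ${low_prorated:.3f})")
print(f"runs under cap: {runs_at_typical} typical, {runs_at_worst} worst case")
```

Even at the worst-case figure the cap covers dozens of runs, so cost is not the constraint for this smoke.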
1.4 Region & preemption multipliers (DON'T trip on these)
From the pricing-page footer:
- Region selection: 1.5–1.75× base price. Don't pin to a region unless you must.
- Non-preemptible execution: 3× base price. Default is preemptible; leave it. A 30-min smoke that gets preempted is fine; just retry. Setting `gpu_preempted=False` (or otherwise using non-preemptible mode) would push L4 to ~$2.40/hr (3 × $0.799) and is unjustified.
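To make the multipliers concrete, the sketch below applies each one to the L4 hourly rate. It treats the two multipliers separately, since this section does not say whether they stack; the constant names are illustrative.

```python
L4_PER_HOUR = 0.799                 # USD per hour (section 1.1)
NON_PREEMPTIBLE_MULT = 3.0          # from the pricing-page footer
REGION_MULT_RANGE = (1.5, 1.75)     # from the pricing-page footer

non_preemptible_per_hour = L4_PER_HOUR * NON_PREEMPTIBLE_MULT
region_low = L4_PER_HOUR * REGION_MULT_RANGE[0]
region_high = L4_PER_HOUR * REGION_MULT_RANGE[1]

print(f"L4 non-preemptible: ${non_preemptible_per_hour:.2f}/hour")
print(f"L4 region-pinned:   ${region_low:.2f} to ${region_high:.2f}/hour")
```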
2. Minimal modal_app.py skeleton
This is the actual file to drop into the repo, e.g. at spikes/005-integrated-trainer-skeleton/modal_app.py. It is intentionally one file, with no abstraction, sized for the smoke. Image pins are conservative — match what the user is running locally to avoid version drift between local debugging and Modal runs.
"""modal_app.py — GPU smoke for the Composer 2.5 Replication Framework.
Goal: run ~50 forward+backward steps of the 3-channel loss
(GRPO + SDPO-KL + trace-replay-DPO) against Qwen/Qwen2.5-0.5B-Instruct,
capture peak VRAM and per-step latency, and exit. Single L4, single container.
Run: modal run modal_app.py
Logs: the function's print() output streams to your terminal.
"""
from __future__ import annotations
import modal
# ---------------------------------------------------------------------------
# 1) App + image
# ---------------------------------------------------------------------------
# Pin torch to a CUDA build that matches Modal's L4 driver (CUDA 12.x).
# Pin transformers/peft/trl to a known-good combination — the trainer skeleton
# was developed against transformers >= 4.45 and trl >= 0.12 for GRPOTrainer.
# If you bump any of these, re-verify GRPOTrainer._compute_loss is still the
# correct override hook (DeepWiki audit anchor: huggingface/trl).
image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("git")
    .pip_install(
        "torch==2.4.1",  # CUDA 12.1 wheel from PyPI default index
        "transformers==4.46.3",
        "accelerate==1.1.1",
        "peft==0.14.0",
        "trl==0.12.2",
        "datasets==3.1.0",
        "huggingface_hub==0.26.5",
        # Needed because HF_HUB_ENABLE_HF_TRANSFER=1 is set below; without it
        # huggingface_hub refuses to download. Left unpinned here.
        "hf_transfer",
    )
    .env({
        # Force HF to use the mounted Volume for model + dataset cache.
        "HF_HOME": "/cache/hf",
        "TRANSFORMERS_CACHE": "/cache/hf",
        "HF_HUB_ENABLE_HF_TRANSFER": "1",  # parallel download for the model
        # Make Python flush prints immediately so we see step times live.
        "PYTHONUNBUFFERED": "1",
        # Reproducibility for the smoke.
        "TOKENIZERS_PARALLELISM": "false",
    })
)
# ---------------------------------------------------------------------------
# 2) Persistent volume for HF cache (so model isn't re-downloaded each run)
# ---------------------------------------------------------------------------
# 1 GB of Qwen weights persists here. First run pays the download cost,
# every subsequent run reuses the volume. Below 1 TiB / mo: free.
hf_cache = modal.Volume.from_name("hf-cache-composer-smoke", create_if_missing=True)
# ---------------------------------------------------------------------------
# 3) App + secrets
# ---------------------------------------------------------------------------
app = modal.App("composer-replication-smoke")
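# Optional: only needed if you switch to a gated model. Qwen2.5-0.5B is open.
# A Secret with this name must already exist in the workspace; if it does not,
# remove it from `secrets=[...]` in the function decorator below.
hf_secret = modal.Secret.from_name("huggingface-token", required_keys=[])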
# ---------------------------------------------------------------------------
# 4) The smoke function
# ---------------------------------------------------------------------------
@app.function(
    image=image,
    gpu="L4",                 # see §1: cheapest 24 GB option that fits
    cpu=4.0,                  # 4 cores is plenty for tokenization on a sub-1B
    memory=16 * 1024,         # 16 GiB RAM is plenty
    volumes={"/cache": hf_cache},
    timeout=60 * 30,          # hard 30-min cap matches the smoke spec
    secrets=[hf_secret],
    # NB: keep preemptible (default). Don't pay 3× to pin.
    # NB: don't pin region; the 1.5–1.75× tax is unjustified for a smoke.
)
def smoke():
    import time

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
    print(f"[smoke] torch={torch.__version__} cuda={torch.version.cuda} "
          f"device={torch.cuda.get_device_name(0)} "
          f"vram={torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")

    # -------------------------------------------------------------------
    # Load tokenizer + model in bf16; L4 supports it (Ada Lovelace).
    # -------------------------------------------------------------------
    t0 = time.perf_counter()
    tok = AutoTokenizer.from_pretrained(MODEL_ID, cache_dir="/cache/hf")
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        cache_dir="/cache/hf",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
    )
    model.train()
    print(f"[smoke] model load: {time.perf_counter()-t0:.1f}s "
          f"params={sum(p.numel() for p in model.parameters())/1e6:.1f}M")

    # -------------------------------------------------------------------
    # 50-step verification loop.
    #
    # NOTE: this stub uses a synthetic batch (a single forward+backward
    # against an LM-head loss), *not* the full 3-channel loss. The point
    # is to (a) verify the Modal harness, (b) measure the per-step time
    # of a vanilla AutoModelForCausalLM step on this GPU as a baseline.
    #
    # Replace the body of the for-loop with the actual ComposerReplicationTrainer
    # `_compute_loss` call once data_collator outputs are stubbed/mocked.
    # See: spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py
    # -------------------------------------------------------------------
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

    # Synthetic batch: bs=2, seq=1024, matching the realistic smoke shape.
    B, T = 2, 1024
    input_ids = torch.randint(0, tok.vocab_size, (B, T), device="cuda:0")
    labels = input_ids.clone()

    torch.cuda.reset_peak_memory_stats()
    step_times = []
    for step in range(50):
        t = time.perf_counter()
        out = model(input_ids=input_ids, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        dt = time.perf_counter() - t
        step_times.append(dt)
        if step % 10 == 0:
            print(f"[smoke] step {step:>3d} loss={out.loss.item():.4f} "
                  f"dt={dt*1000:.1f}ms peak_vram={torch.cuda.max_memory_allocated()/1e9:.2f}GB")

    # -------------------------------------------------------------------
    # Final report.
    # -------------------------------------------------------------------
    median_ms = sorted(step_times)[len(step_times)//2] * 1000
    p95_ms = sorted(step_times)[int(len(step_times)*0.95)] * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"\n[smoke] DONE. median_step={median_ms:.1f}ms p95={p95_ms:.1f}ms "
          f"peak_vram={peak_gb:.2f}GB total_time={sum(step_times):.1f}s")

    # Persist cache for the next run.
    hf_cache.commit()


@app.local_entrypoint()
def main():
    smoke.remote()
2.1 What's deliberately not in the skeleton
- No `flash-attn` install. The `flash-attn` wheel build is a notorious time sink on Modal images (compiles against the CUDA toolkit). For a 0.5B smoke, SDPA (PyTorch's built-in scaled-dot-product attention) is fine and is on by default in transformers ≥ 4.45.
- No `bitsandbytes`, no `unsloth`, no `xformers`. All add build complexity. None give you anything on a smoke.
- No DeepSpeed, no FSDP, no `accelerate launch`. This is single-GPU; `accelerate` is in the image only because `trl` imports it. We don't invoke it.
- No web endpoint, no `@app.cls`, no `enter` method. A `@app.function()` with no warm-up is correct for a one-shot smoke. `enter`/lifecycle methods are for serving and amortizing model load across many calls, which is not relevant when you call once.
- No `min_containers` or `buffer_containers`. Those are warm-pool knobs for serving, and they cost money. Default scale-from-zero is right.
- No `Image.from_registry`. `debian_slim` + `pip_install` is faster than pulling a CUDA base image when you don't need a custom CUDA toolkit.
2.2 What you do need to add when you wire the real loss
Replace the body of the synthetic `for step in range(50)` loop with:
from data_collator import ComposerDataCollator # spike 005 path
from trl_path.composer_trainer import ComposerReplicationTrainer
# ...
# Build a small fixed dataset of (prompt, response, hint, dpo_pair) tuples
# inline in the smoke (10–20 examples). Don't pull a real RL rollout — the
# point is to verify the loss path, not the rollout path.
The smoke does not need a real rollout/sampling phase. Stub inputs with the keys that `_compute_sdpo_loss` and `_compute_trace_replay_loss` consume (`ctx_teacher_input_ids`, `dpo_chosen_input_ids`, `dpo_chosen_response_mask`, `dpo_chosen_ref_logprobs`, `sdpo_loss_mask`, …) using fixed tensors. That's the real verification: does the 3-channel loss compute and back-propagate without shape errors? The trainer skeleton's logging will tell you per-channel values.
3. Gotchas that bite this specific workload
The Modal docs and the mlops/modal-llm-training skill cover ~30 lessons aimed at 7B–30B training. Most of them don't apply here. The ones that do:
3.1 The teacher forward in SDPO doubles your effective batch memory — but only briefly
ComposerReplicationTrainer._compute_sdpo_loss does this (composer_trainer.py L138–143):
student_logits = model(input_ids=inputs["input_ids"]).logits  # with grad
with torch.no_grad():
    teacher_logits = model(input_ids=inputs["ctx_teacher_input_ids"]).logits
Two issues:
- Both logits tensors are held simultaneously in `_compute_sdpo_loss`: they're handed to `generalized_jsd_loss`, which keeps them alive for the JSD math. For Qwen 2.5-0.5B (vocab = 151,936), one logits tensor at B=2, T=1024 in bf16 is `2 * 1024 * 151936 * 2 bytes ≈ 622 MB`. Two of them = ~1.2 GB. Negligible on a 24 GB L4 but worth noting because logits are surprisingly fat for the Qwen vocab.
- Use the `top_k` arg in `generalized_jsd_loss` if you ever want to scale this up. The docstring (`opsd_loss.py` L54) explicitly recommends it: "top_k: restrict KL to top-k tokens of the teacher distribution. Saves compute on large vocabularies (Qwen3 vocab = 152K)." On the smoke, leave it `None` to verify the unrestricted path; flip it on for real training.
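The logits-size arithmetic above generalizes to other batch shapes. A minimal sketch follows; `logits_bytes` is a helper named here for illustration, and the second shape is simply the workload's stated upper bound (4 sequences of 2048 tokens).

```python
QWEN25_VOCAB = 151_936  # vocabulary size quoted above


def logits_bytes(batch, seq_len, vocab_size=QWEN25_VOCAB, bytes_per_elem=2):
    """Size of one [batch, seq_len, vocab] logits tensor (bf16 = 2 bytes)."""
    return batch * seq_len * vocab_size * bytes_per_elem


one_tensor = logits_bytes(2, 1024)         # student OR teacher logits
both_tensors = 2 * one_tensor              # student AND teacher held together
upper_bound = logits_bytes(4, 2048)        # largest shape in the workload spec

print(f"B=2,T=1024: {one_tensor / 1e6:.0f} MB each, "
      f"{both_tensors / 1e9:.2f} GB for the pair")
print(f"B=4,T=2048: {upper_bound / 1e9:.2f} GB each")
```

Because the size scales linearly in both batch and sequence length, the largest allowed shape costs four times as much per tensor as the B=2, T=1024 case.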
3.2 The DPO channel does TWO more grad'd forwards per step
_compute_trace_replay_loss (composer_trainer.py L191–198) calls _sequence_logprobs(model, dpo_chosen_…) and _sequence_logprobs(model, dpo_rejected_…). Both are with-grad. So each training step is:
| Forward | Grad? | Notes |
|---|---|---|
| `super()._compute_loss` (GRPO) | yes | parent's standard fwd |
| Student in SDPO | yes | only when alpha_sdpo ≠ 0 |
| Teacher in SDPO | no | hint-conditioned context |
| DPO chosen | yes | only when beta_replay ≠ 0 |
| DPO rejected | yes | only when beta_replay ≠ 0 |
That's up to 4 grad'd forwards before the backward. PyTorch will hold activations for all of them in the autograd graph until .backward() runs. For the smoke this is fine (0.5B × 4 act tapes ~ 4 GB at B=2,T=1024) but for any real training run on a larger model: enable gradient checkpointing or run the SDPO/DPO channels in alternating steps rather than every step.
For the smoke specifically: set alpha_sdpo=0.1 and beta_replay=0.05 (the trainer defaults) and verify activation memory peaks below 16 GB. If it doesn't, there's a bug in the data collator producing too-long sequences.
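The forward-pass table can be restated as a small function of the two loss weights. This is only a sketch of the bookkeeping: the pass labels are invented here, and it assumes the teacher forward is skipped together with the student forward when `alpha_sdpo` is zero, which the table does not state explicitly.

```python
def forwards_per_step(alpha_sdpo: float, beta_replay: float):
    """List (name, needs_grad) for each forward pass in one training step."""
    passes = [("grpo", True)]  # parent's standard forward, always present
    if alpha_sdpo != 0:
        passes.append(("sdpo_student", True))
        passes.append(("sdpo_teacher", False))  # assumption: tied to alpha_sdpo
    if beta_replay != 0:
        passes.append(("dpo_chosen", True))
        passes.append(("dpo_rejected", True))
    return passes


defaults = forwards_per_step(alpha_sdpo=0.1, beta_replay=0.05)
grad_forwards = sum(1 for _, needs_grad in defaults if needs_grad)
grpo_only = forwards_per_step(alpha_sdpo=0.0, beta_replay=0.0)

print(f"defaults: {len(defaults)} forwards, {grad_forwards} with grad")
print(f"GRPO only: {len(grpo_only)} forward")
```

Toggling one weight to zero at a time is a cheap way to attribute a memory spike to a specific channel during the smoke.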
3.3 requires_grad=True on the zero-tensor short-circuit is a footgun
In composer_trainer.py L136 and L155, when SDPO is short-circuited:
return torch.tensor(0.0, device=_device_of(model), requires_grad=True)
This is not in the autograd graph: it's a leaf tensor with requires_grad=True but no parent op, so it is connected to no model parameter. When you sum it into total = grpo_loss + alpha * sdpo_kl + beta * replay_dpo, the 0.0 contributes nothing to the parameter gradients and doesn't break things. The trap is a step where ALL three channels short-circuit (e.g., a smoke step with no error sites and no DPO pairs): total.backward() then completes without error, because the leaves themselves require grad, but no model parameter receives a gradient and optimizer.step() silently does nothing. Drop requires_grad=True from the zero tensor and the same step instead fails loudly with RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. The smoke will hit the short-circuit path if your synthetic batch lacks ctx_teacher_input_ids and dpo_chosen_input_ids.
Fix in the smoke: ensure the synthetic batch includes at least one ctx_teacher_input_ids row (it can be a copy of input_ids to keep things trivial) so SDPO doesn't short-circuit on every step.
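The autograd behavior of the short-circuit tensor is easy to check on CPU. The sketch below uses a stand-in parameter rather than the real trainer, and `short_circuit_zero` is a hypothetical helper mirroring the return statement quoted above. Under these assumptions, `backward()` on an all-short-circuit sum completes but leaves the parameter without a gradient, while the quoted RuntimeError appears only when the zero tensors are built without `requires_grad`.

```python
import torch

# Stand-in for a model parameter (hypothetical; not the real trainer).
weight = torch.nn.Parameter(torch.ones(3))


def short_circuit_zero(requires_grad: bool) -> torch.Tensor:
    """Mirror of the short-circuit return: a scalar leaf with no parent op."""
    return torch.tensor(0.0, requires_grad=requires_grad)


# Case 1: every channel short-circuits with requires_grad=True.
total = (short_circuit_zero(True)
         + 0.1 * short_circuit_zero(True)
         + 0.05 * short_circuit_zero(True))
total.backward()                         # completes: the leaves require grad
param_got_no_grad = weight.grad is None  # nothing flowed to the parameter

# Case 2: the same sum without requires_grad cannot be back-propagated.
plain_total = short_circuit_zero(False) + 0.1 * short_circuit_zero(False)
try:
    plain_total.backward()
    raised = False
except RuntimeError:
    raised = True

# Case 3: one channel that actually touches the parameter restores gradients.
real_total = (weight * 2.0).sum() + 0.1 * short_circuit_zero(True)
real_total.backward()

print(f"all-short-circuit step left weight.grad unset: {param_got_no_grad}")
print(f"plain zero tensors raise on backward(): {raised}")
print(f"weight.grad with one live channel: {weight.grad.tolist()}")
```

In the smoke, asserting that at least one model parameter has a non-`None` `.grad` after `backward()` catches the silent case.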
3.4 torch.cuda.synchronize() before timing reads
If you don't torch.cuda.synchronize() before reading time.perf_counter() you'll measure CPU dispatch time, not GPU step time. The skeleton above includes it. The Modal runtime doesn't change this — same rule as local.
3.5 The HF cache Volume must be commit()ed
From https://modal.com/docs/guide/volumes: Volume writes are not persisted across runs unless you call .commit() (or use volume.batch_upload). The skeleton calls hf_cache.commit() at the end. If you forget, run 2 will re-download the model. This is the only "Modal-flavored" gotcha that bites a smoke.
3.6 What does NOT bite
These are the lessons from mlops/modal-llm-training that are not relevant to a 0.5B smoke — don't waste mental cycles on them:
- ❌ FSDP / DeepSpeed sharding setup. Single GPU.
- ❌ `accelerate launch` / multi-process distributed. Single GPU.
- ❌ Flash-attention version pinning vs torch version. SDPA is fine for 0.5B.
- ❌ Tensor parallelism / sequence parallelism. Single GPU.
- ❌ Multi-node clusters. Single node.
- ❌ Memory snapshotting (`enable_memory_snapshot=True`). It's a 30-min one-shot. The cold-start penalty is ~30 s on a smoke that runs for 5 min: 10% overhead, not worth the snapshot setup time.
- ❌ Region pinning for data locality. The whole input is `from_pretrained`, served by HF, so Modal's default region is fine.
- ❌ Custom CUDA install (`Image.from_registry("nvidia/cuda:…")`). The pre-built torch wheel ships its own CUDA.
4. Decision rule: Modal vs the local 5090
4.1 The numbers
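Local 5090 (32 GB VRAM, Blackwell, ~1.6 PFLOPS peak tensor throughput; that headline figure is a low-precision peak, and dense bf16 throughput is substantially lower):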
- Step time for Qwen-0.5B at B=2, T=1024, 3-channel loss (≈4 grad'd fwds + bwd): expect ~150–400 ms per step based on parameter-count + Blackwell's bf16 throughput. Call it 300 ms.
- 50 steps: ~15 seconds of pure compute.
- Plus model load (one-time, from local HF cache): ~5 seconds.
- Plus data collator setup: ~3 seconds.
- Wall clock: ~25–40 seconds.
- Cost: $0 (electricity ignored — the 5090 draws ~600 W under load × 40 s = 6.7 Wh ≈ $0.001).
Modal L4 (24 GB VRAM, Ada Lovelace, ~0.12 PFLOPS bf16):
- Step time for the same workload on L4: ~1.5–4 s per step. (On the peak figures quoted in this section the L4 is roughly 13× behind the 5090, but peak numbers overstate the difference and the workload at B=2 won't saturate the 5090, so the realistic gap is ~5–10×.) Call it 2 s.
- 50 steps: ~100 seconds of pure compute.
- Plus container cold start, image pull, model download (cached after run 1), CUDA init: 30–90 s on first run, 20–40 s afterward.
- Wall clock: ~3–5 minutes per run (worst case 7 min on a cold first run).
- Cost: $0.08–$0.13 per run.
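The two estimates can be lined up side by side. This sketch only re-adds the point estimates quoted in this section (300 ms and 2 s per step, plus the stated fixed overheads); none of them are measurements, and the variable names are illustrative.

```python
STEPS = 50


def compute_seconds(steps, sec_per_step):
    """Pure compute time for a fixed number of steps."""
    return steps * sec_per_step


local_compute = compute_seconds(STEPS, 0.3)   # 5090 point estimate
modal_compute = compute_seconds(STEPS, 2.0)   # L4 point estimate
compute_gap = modal_compute / local_compute

# Fixed overheads quoted above: local model load + collator setup, and the
# upper ends of Modal's first-run and cached-run start-up ranges.
local_total = local_compute + 5 + 3
modal_cold_total = modal_compute + 90
modal_warm_total = modal_compute + 40

print(f"compute only: local {local_compute:.0f}s vs Modal {modal_compute:.0f}s "
      f"({compute_gap:.1f}x)")
print(f"with overheads: local {local_total:.0f}s, "
      f"Modal cold {modal_cold_total:.0f}s, warm {modal_warm_total:.0f}s")
```

The compute-only ratio lands inside the ~5–10× gap estimated above; the summed totals come in slightly under the quoted wall-clock figures, which presumably also absorb interpreter start-up and in-function setup.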
4.2 The decision rule
For this specific 30-min smoke: run on the 5090. Do not use Modal.
Reasoning:
- Latency: the 5090 finishes the smoke in ~30 s. Modal's L4 needs ~5 minutes including cold start. That's a 10× iteration penalty on a workload where the entire point is iterate-and-fix-the-shape-error cycles. Every minute spent waiting for Modal is a minute in which the smoke could have run about twice locally.
- Memory headroom: the 5090's 32 GB is larger than the L4's 24 GB. There is no memory motivation to leave the local box.
- Network friction: every Modal run requires `modal run`, syncing local code, waiting for image, watching logs. Local is `python local_smoke.py` (see §4.3), or just import-and-run in a notebook.
- Cost asymmetry vs. iteration cost: $0.10/run is not the issue. The issue is that 30 minutes of attention spent on Modal infra is 30 minutes not spent debugging the loss.
- The framework hasn't been verified to run end-to-end yet. The first hundred bugs you'll find are local Python issues: wrong tensor shapes, missing keys in the collator, the `requires_grad=True` zero-tensor footgun (§3.3), TRL version mismatches. Debugging those over a Modal round-trip is masochism.
When Modal becomes correct:
| Scenario | Modal? | Why |
|---|---|---|
| 30-min smoke on 0.5B (this task) | No | 5090 wins on every dimension |
| Sweep alpha_sdpo, beta_replay across 8 configs in parallel | Yes | 8× Modal containers in parallel beats 8 sequential runs on one 5090 |
| Scale to Qwen2.5-7B (real training) | Yes | 7B needs >32 GB for grad+optimizer, so 5090 is out; you want A100-80GB or H100 |
| Scale to multi-node (40B+) | Yes (with caveats) | Modal multi-node is in beta — see https://modal.com/docs/guide/multi-node-training |
| 24/7 inference of trained model | Maybe | Depends on QPS; Modal serverless wins for spiky, loses for steady |
4.3 Recommended workflow
- Write the smoke as `local_smoke.py` that runs on the 5090. Same body as `modal_app.py`'s `smoke()` function, minus the `@app.function` decorator. Iterate there until 50 steps run cleanly.
- Then drop the body into `modal_app.py` (the skeleton in §2). The Modal version's value is to verify "does it run on cloud Linux without local dotfile interference" and to baseline L4 step-time vs the 5090. That's a one-shot validation, not a development loop.
- For the real training run (when it's an actual training run, not a smoke), start with A100-40GB on Modal (or H100 if you've got the credits). The L4 step-time of ~2 s would translate to 2 s × 10,000 steps = ~5.5 hours; that per-step cost is tolerable for a 50-step smoke but painful over a real run.
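The duration figure in the last item is plain arithmetic, and the same numbers give a rough GPU-only price. A minimal sketch, assuming the ~2 s L4 step time and the 10,000-step run length used above, and ignoring CPU/RAM charges:

```python
L4_RATE = 0.000222   # USD per GPU-second (section 1.1)


def run_hours(steps: int, sec_per_step: float) -> float:
    """Wall-clock hours of pure step time."""
    return steps * sec_per_step / 3600


steps, sec_per_step = 10_000, 2.0
hours = run_hours(steps, sec_per_step)
gpu_only_cost = steps * sec_per_step * L4_RATE

print(f"{steps} steps at {sec_per_step:.0f}s/step: {hours:.2f} hours, "
      f"about ${gpu_only_cost:.2f} of L4 time")
```

At that length a single L4 run would consume most of the $5 cap on GPU time alone, which is another reason to budget real training separately from the smoke.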
5. References
All claims in this document are sourced from:
- Pricing: https://modal.com/pricing (canonical; updated regularly by Modal — re-fetch if cost-sensitive). Per-second numbers in §1.1 captured from this page at report-write time.
- GPU naming: https://modal.com/docs/guide/gpu (confirms `gpu="L4"`, `gpu="A10"` (not `"A10G"`), `gpu="A100-40GB"`, `gpu="H100!"` syntax).
- Cold starts: https://modal.com/docs/guide/cold-start ("Containers boot in about one second" + the warm-up period is image pull + global imports + `enter` methods).
- Volumes: https://modal.com/docs/guide/volumes (`commit()` semantics for HF cache persistence).
- Region/preemption multipliers: pricing page footer + https://modal.com/docs/guide/preemption.
- Multi-node beta: https://modal.com/docs/guide/multi-node-training.
- Examples (for `Image.pip_install` patterns): https://github.com/modal-labs/modal-examples (see `06_gpu_and_ml/llm-finetuning/` for similar 0.5B/3B finetune patterns).
- TRL `GRPOTrainer._compute_loss` extension point: verified in `composer_trainer.py` header comment ("DeepWiki audit of huggingface/trl, 2026-05-25"). Confirmed `super()._compute_loss(model, inputs)` works as the framework's parent-call.
- Local trainer code reviewed:
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/trl_path/composer_trainer.py`
  - `/mnt/e/CS/HF/composer-replication-framework/spikes/005-integrated-trainer-skeleton/opsd_loss.py`
6. TL;DR
- GPU: L4. Cost: ~$0.10/run. Total budget burn: ~50× re-runs before the $5 cap. Don't pay for A10, A100, or H100 on a 0.5B smoke.
- Skeleton: §2. `gpu="L4"`, 4 cores, 16 GB RAM, 30-min timeout, persistent HF cache Volume, default preemption, no region pin.
- Workload-specific gotchas: §3. The 3-channel loss does up to 4 grad'd forwards/step (memory headroom check), the zero-tensor `requires_grad=True` short-circuit can break `backward()`, and `volume.commit()` is mandatory.
- Decision: run on the 5090, not Modal. 5090 finishes the smoke in ~30 s vs Modal's ~5 min including cold start, with $0 marginal cost and 10× faster iteration. Reserve Modal for parameter sweeps and 7B+ training.