razvan/ml-intern-codex-plugin


razvan committed 19 days ago

Commit c5bfec7 (verified) · 1 parent: d402c63

Upload plugins/mlintern/skills/hf-jobs/SKILL.md with huggingface_hub

Files changed (1)

plugins/mlintern/skills/hf-jobs/SKILL.md +99 -0

	@@ -0,0 +1,99 @@

+---
+name: hf-jobs
+description: "Run Python scripts and Docker commands on Hugging Face cloud infrastructure. Submit training, evaluation, conversion, and long-running experiments."
+disable-model-invocation: false
+---
+# hf-jobs — Hugging Face Cloud Jobs
+## Purpose
+Run ML workloads on Hugging Face cloud infrastructure with GPU and CPU hardware. Submit jobs, monitor status, inspect logs, and cancel when needed.
+## Tools
+- `hf_jobs`: Submit and manage HF Jobs.
+## Operations
+| Operation | Description |
+|---|---|
+| `run` | Run a Docker command |
+| `uv` | Run a Python script with uv |
+| `ps` | List active jobs |
+| `logs` | Stream job logs |
+| `inspect` | Get job metadata |
+| `cancel` | Cancel a running job |
+| `scheduled run` | Schedule a Docker command |
+| `scheduled uv` | Schedule a Python script |
+| `scheduled ps` | List scheduled jobs |
+| `scheduled inspect` | Inspect a scheduled job |
+| `scheduled delete` | Delete a scheduled job |
+| `scheduled suspend` | Pause a scheduled job |
+| `scheduled resume` | Resume a scheduled job |
+## Python Mode (uv)
+Run a Python script with dependencies:
+```json
+{
+  "operation": "uv",
+  "script": "print('hello world')",
+  "dependencies": ["transformers", "trl", "datasets"],
+  "hardware_flavor": "t4-small",
+  "timeout": "4h",
+  "env": {"TRACKIO_PROJECT": "my-project"}
+}
+```
+For training scripts, set:
+- `push_to_hub=True` and `hub_model_id`
+- `report_to="trackio"` with `trackio_space_id` and `trackio_project`
+- Realistic `timeout` (at least 2 hours for real training)
+- Correct `hardware_flavor` for model size:
+  - 1-3B params: `t4-small` or `a10g-small`
+  - 7-13B params: `a10g-large` or `a100-large`
+  - 30-70B params: `a100x4` or `l40sx4`
+  - 70B+ params: `a100x8`
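The size-to-hardware list above can be read as a small lookup. The following is a minimal sketch, not part of the skill: `suggest_flavors` is a hypothetical helper, and the handling of sizes the list does not name (under 1B, 3-7B, 13-30B) is an assumption that rounds up to the next listed tier.

```python
# Tiers taken from the hardware list in this skill: (upper bound in billions
# of parameters, suggested flavors). Rounding the unlisted ranges (3-7B,
# 13-30B) up to the next tier is an assumption, not something the skill states.
_TIERS = [
    (3.0, ["t4-small", "a10g-small"]),
    (13.0, ["a10g-large", "a100-large"]),
]


def suggest_flavors(params_billions: float) -> list[str]:
    """Return the hardware flavors the list suggests for a model size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    if params_billions >= 70:
        return ["a100x8"]
    for upper_bound, flavors in _TIERS:
        if params_billions <= upper_bound:
            return list(flavors)
    # Anything above 13B and below 70B lands in the multi-GPU tier.
    return ["a100x4", "l40sx4"]


print(suggest_flavors(7))
```

The function only encodes the list. It does not account for quantization, sequence length, or batch size, which also drive VRAM needs.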
+## Docker Mode (run)
+Run a Docker image with a command:
+```json
+{
+  "operation": "run",
+  "command": ["python", "-c", "print('hello')"],
+  "image": "python:3.11",
+  "hardware_flavor": "cpu-basic",
+  "timeout": "30m"
+}
+```
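Both payloads above are plain JSON objects, so they can be assembled and checked before submission. The sketch below builds a `uv` payload with the fields shown in the Python Mode example and enforces the "at least 2 hours" guidance for training. `timeout_seconds` and `build_uv_payload` are hypothetical helpers, and the parser accepts only the `m` and `h` suffixes that appear in this skill; whether the tool accepts other formats is not stated here.

```python
import json
import re

# Only the two units that appear in this skill's examples ("30m", "4h").
_UNIT_SECONDS = {"m": 60, "h": 3600}


def timeout_seconds(timeout: str) -> int:
    """Convert a timeout such as "4h" or "30m" to seconds."""
    match = re.fullmatch(r"(\d+)([mh])", timeout)
    if match is None:
        raise ValueError(f"unsupported timeout format: {timeout!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def build_uv_payload(script, dependencies, hardware_flavor, timeout,
                     env=None, training=False):
    """Build a `uv` payload using the field names from the example above."""
    if training and timeout_seconds(timeout) < 2 * 3600:
        raise ValueError("training jobs should allow at least 2 hours")
    payload = {
        "operation": "uv",
        "script": script,
        "dependencies": list(dependencies),
        "hardware_flavor": hardware_flavor,
        "timeout": timeout,
    }
    if env:
        payload["env"] = dict(env)
    return payload


payload = build_uv_payload(
    script="print('hello world')",
    dependencies=["transformers", "trl", "datasets"],
    hardware_flavor="t4-small",
    timeout="4h",
    env={"TRACKIO_PROJECT": "my-project"},
    training=True,
)
print(json.dumps(payload, indent=2))
```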
+## Preflight Checklist
+Before submitting:
+- [ ] Reference implementation or docs identified.
+- [ ] Dataset schema verified.
+- [ ] Model repo and tokenizer verified.
+- [ ] Smoke test completed (locally or in a small job).
+- [ ] Hardware choice justified by model size and VRAM needs.
+- [ ] Timeout set realistically.
+- [ ] `push_to_hub=True` and `hub_model_id` set for training outputs.
+- [ ] Monitoring configured (Trackio or logged metrics).
+- [ ] For sweeps: one job first, then the batch.
+## Monitoring
+During a job:
+1. Check logs early with `hf_jobs(operation="logs", job_id="...")`.
+2. If setup/import/data loading fails, stop and fix the script.
+3. Avoid launching a full batch until one job is clearly training.
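Step 2 above amounts to scanning the first log lines for setup, import, or data-loading failures. A minimal sketch, assuming the logs are already available as a list of strings: `first_failure` and the marker list are illustrative choices, not something the `hf_jobs` tool provides, and the sample log lines are made up.

```python
# Substrings that usually indicate a setup, import, or data-loading failure
# in Python output. The selection is illustrative, not exhaustive.
FAILURE_MARKERS = (
    "Traceback (most recent call last):",
    "ModuleNotFoundError",
    "ImportError",
    "FileNotFoundError",
)


def first_failure(log_lines):
    """Return (line_number, marker) for the first failure sign, else None."""
    for number, line in enumerate(log_lines, start=1):
        for marker in FAILURE_MARKERS:
            if marker in line:
                return number, marker
    return None


# Hypothetical log excerpt, for illustration only.
sample_logs = [
    "Installing dependencies",
    "Starting script",
    "Traceback (most recent call last):",
    "ModuleNotFoundError: No module named 'trl'",
]
print(first_failure(sample_logs))
```

A non-`None` result is a reason to cancel and fix the script before launching more jobs.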
+## After a Job
+1. Verify the output repo or artifact exists.
+2. Read metrics/logs.
+3. Decide whether to tune, rerun, or finalize.
+4. Record the job URL/ID, source commit, config, metrics, and artifact URLs.
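The record in step 4 can be kept as one JSON line per job. The sketch below is one possible shape, assuming a flat record is enough: `JobRecord` and its field names are hypothetical, chosen to mirror the items listed in that step, and every value in the example is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class JobRecord:
    """One finished job: the items listed in step 4 of "After a Job"."""
    job_id: str
    job_url: str
    source_commit: str
    config: dict
    metrics: dict = field(default_factory=dict)
    artifact_urls: list = field(default_factory=list)

    def to_json_line(self) -> str:
        # sort_keys keeps the output stable, which makes records easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder values for illustration only.
record = JobRecord(
    job_id="job-123",
    job_url="https://example.com/jobs/job-123",
    source_commit="abc1234",
    config={"hardware_flavor": "t4-small", "timeout": "4h"},
    metrics={"eval_loss": 1.23},
    artifact_urls=["https://example.com/models/my-model"],
)
print(record.to_json_line())
```

Appending each line to a log file gives a searchable history of runs when comparing sweeps.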