razvan committed
Commit c5bfec7 · verified · 1 Parent(s): d402c63

Upload plugins/mlintern/skills/hf-jobs/SKILL.md with huggingface_hub

plugins/mlintern/skills/hf-jobs/SKILL.md ADDED
@@ -0,0 +1,99 @@
+ ---
+ name: hf-jobs
+ description: "Run Python scripts and Docker commands on Hugging Face cloud infrastructure. Submit training, evaluation, conversion, and long-running experiments."
+ disable-model-invocation: false
+ ---
+
+ # hf-jobs — Hugging Face Cloud Jobs
+
+ ## Purpose
+
+ Run ML workloads on Hugging Face cloud infrastructure with GPU and CPU hardware. Submit jobs, monitor status, inspect logs, and cancel when needed.
+
+ ## Tools
+
+ - `hf_jobs`: Submit and manage HF Jobs.
+
+ ## Operations
+
+ | Operation | Description |
+ |---|---|
+ | `run` | Run a Docker command |
+ | `uv` | Run a Python script with uv |
+ | `ps` | List active jobs |
+ | `logs` | Stream job logs |
+ | `inspect` | Get job metadata |
+ | `cancel` | Cancel a running job |
+ | `scheduled run` | Schedule a Docker command |
+ | `scheduled uv` | Schedule a Python script |
+ | `scheduled ps` | List scheduled jobs |
+ | `scheduled inspect` | Inspect a scheduled job |
+ | `scheduled delete` | Delete a scheduled job |
+ | `scheduled suspend` | Pause a scheduled job |
+ | `scheduled resume` | Resume a scheduled job |
+
+ ## Python Mode (uv)
+
+ Run a Python script with dependencies:
+
+ ```json
+ {
+   "operation": "uv",
+   "script": "print('hello world')",
+   "dependencies": ["transformers", "trl", "datasets"],
+   "hardware_flavor": "t4-small",
+   "timeout": "4h",
+   "env": {"TRACKIO_PROJECT": "my-project"}
+ }
+ ```
+
+ For training scripts, set:
+ - `push_to_hub=True` and `hub_model_id`, so training outputs are pushed to the Hub
+ - `report_to="trackio"` with `trackio_space_id` and `trackio_project`
+ - A realistic `timeout` (at least 2 hours for real training)
+ - A `hardware_flavor` that matches the model size:
+   - 1-3B params: `t4-small` or `a10g-small`
+   - 7-13B params: `a10g-large` or `a100-large`
+   - 30-70B params: `a100x4` or `l40sx4`
+   - 70B+ params: `a100x8`
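The size tiers above can be sketched as a small lookup. This is an illustrative helper, not part of the `hf_jobs` tool: `suggest_flavors` and `build_uv_job` are hypothetical names, and two behaviors are assumptions rather than rules from this file: sizes that fall between the listed ranges round up to the next tier, and the third tier stops just below 70B.

```python
def suggest_flavors(params_billions: float) -> list[str]:
    """Return the flavors suggested above for a model of the given size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    # Assumption: sizes between the listed ranges (e.g. 5B or 20B) round up
    # to the next tier, and models under 1B use the smallest tier.
    if params_billions <= 3:
        return ["t4-small", "a10g-small"]
    if params_billions <= 13:
        return ["a10g-large", "a100-large"]
    if params_billions < 70:
        return ["a100x4", "l40sx4"]
    return ["a100x8"]


def build_uv_job(script: str, params_billions: float, timeout: str = "4h") -> dict:
    """Assemble a `uv` payload shaped like the JSON example above."""
    return {
        "operation": "uv",
        "script": script,
        "dependencies": ["transformers", "trl", "datasets"],
        # Taking the first suggestion is an arbitrary choice for this sketch.
        "hardware_flavor": suggest_flavors(params_billions)[0],
        "timeout": timeout,
    }


job = build_uv_job("print('hello world')", params_billions=7)
print(job["hardware_flavor"])  # a10g-large
```

The lookup only encodes the rule of thumb. The final choice should still be justified by actual VRAM needs, as the preflight checklist asks.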
+
+ ## Docker Mode (run)
+
+ Run a Docker image with a command:
+
+ ```json
+ {
+   "operation": "run",
+   "command": ["python", "-c", "print('hello')"],
+   "image": "python:3.11",
+   "hardware_flavor": "cpu-basic",
+   "timeout": "30m"
+ }
+ ```
+
+ ## Preflight Checklist
+
+ Before submitting:
+ - [ ] Reference implementation or docs identified.
+ - [ ] Dataset schema verified.
+ - [ ] Model repo and tokenizer verified.
+ - [ ] Smoke test completed (locally or in a small job).
+ - [ ] Hardware choice justified by model size and VRAM needs.
+ - [ ] Timeout set realistically.
+ - [ ] `push_to_hub=True` and `hub_model_id` set for training outputs.
+ - [ ] Monitoring configured (Trackio or logged metrics).
+ - [ ] For sweeps: one job first, then the batch.
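A few checklist items are mechanical enough to check in code before submitting. The sketch below is one way to do that under stated assumptions: `timeout_seconds` and `preflight_problems` are hypothetical helpers, the timeout format (an integer followed by `s`, `m`, `h`, or `d`) is inferred from the `"4h"` and `"30m"` examples in this file, and the training check is a plain substring search over the script text. Items such as dataset schema verification or the smoke test still need to be done separately.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def timeout_seconds(timeout: str) -> int:
    """Convert a string such as "4h" or "30m" to seconds (assumed format)."""
    match = re.fullmatch(r"(\d+)([smhd])", timeout.strip())
    if match is None:
        raise ValueError(f"unrecognized timeout: {timeout!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def preflight_problems(job: dict, is_training: bool = False) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not job.get("hardware_flavor"):
        problems.append("hardware_flavor is not set")
    timeout = job.get("timeout")
    if timeout is None:
        problems.append("timeout is not set")
    else:
        try:
            seconds = timeout_seconds(timeout)
        except ValueError as error:
            problems.append(str(error))
        else:
            # The 2-hour floor for training comes from the guidance above.
            if is_training and seconds < 2 * 3600:
                problems.append("timeout is under 2 hours for a training job")
    if is_training:
        script = job.get("script", "")
        if "push_to_hub" not in script or "hub_model_id" not in script:
            problems.append("script does not mention push_to_hub and hub_model_id")
    return problems


job = {
    "operation": "uv",
    "script": "print('hello world')",
    "hardware_flavor": "t4-small",
    "timeout": "30m",
}
for problem in preflight_problems(job, is_training=True):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets all issues be fixed in a single pass before the job is submitted.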
+
+ ## Monitoring
+
+ During a job:
+ 1. Check logs early with `hf_jobs(operation="logs", job_id="...")`.
+ 2. If setup, imports, or data loading fail, cancel the job and fix the script.
+ 3. Do not launch a full batch until one job is clearly training.
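Step 2 can be partly automated by scanning the first log lines for common Python failure markers. A minimal sketch, assuming the logs are already available as a list of strings (how they are fetched through `hf_jobs` is not shown here) and that the marker list, which is only illustrative, fits your script. `first_failure` is a hypothetical helper name.

```python
from typing import Iterable, Optional

# Illustrative markers for setup, import, and data-loading failures.
FAILURE_MARKERS = (
    "Traceback (most recent call last)",
    "ModuleNotFoundError",
    "ImportError",
    "FileNotFoundError",
)


def first_failure(log_lines: Iterable[str]) -> Optional[str]:
    """Return the first log line containing a failure marker, or None."""
    for line in log_lines:
        if any(marker in line for marker in FAILURE_MARKERS):
            return line.strip()
    return None


sample_logs = [
    "Installing dependencies...",
    "ModuleNotFoundError: No module named 'trl'",
]
print(first_failure(sample_logs))
```

A `None` result only means no known marker appeared. It does not confirm that the job is training, so the logs still need a read before a batch is launched.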
+
+ ## After a Job
+
+ 1. Verify the output repo or artifact exists.
+ 2. Read metrics/logs.
+ 3. Decide whether to tune, rerun, or finalize.
+ 4. Record the job URL/ID, source commit, config, metrics, and artifact URLs.
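The record in step 4 can be kept as one JSON line per job. The sketch below is only one possible layout: `JobRecord` and its field names are hypothetical, chosen to mirror the items listed in step 4, and every value in the usage example is a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class JobRecord:
    """One finished job, with the fields listed in step 4 above."""
    job_id: str
    job_url: str
    source_commit: str
    config: dict
    metrics: dict = field(default_factory=dict)
    artifact_urls: list = field(default_factory=list)

    def to_json_line(self) -> str:
        # Sorted keys keep the lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


record = JobRecord(
    job_id="<job-id>",
    job_url="<job-url>",
    source_commit="<commit-sha>",
    config={"hardware_flavor": "t4-small", "timeout": "4h"},
    artifact_urls=["<model-repo-url>"],
)
print(record.to_json_line())
```

Appending each line to a single log file gives a searchable history of runs without extra tooling.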