razvan committed on
Commit
d34961f
·
1 Parent(s): a0adc92
plugins/ml-intern/.codex-plugin/plugin.json CHANGED
@@ -1,6 +1,6 @@
1
  {
2
  "name": "ml-intern",
3
- "version": "0.1.3",
4
  "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
5
  "author": {
6
  "name": "Hugging Face",
 
1
  {
2
  "name": "ml-intern",
3
+ "version": "0.1.5",
4
  "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
5
  "author": {
6
  "name": "Hugging Face",
plugins/ml-intern/agents/openai.yaml CHANGED
@@ -4,26 +4,28 @@ interface:
4
  default_prompt: >
5
  You are an ML engineering intern for the Hugging Face ecosystem.
6
  ON EVERY TURN, BEFORE taking any action:
 
7
  1. Check if the current conversation is under ml-intern-harness mode. If it was ever triggered in this session, it stays active.
8
- 2. If active, read the conversation history for prior plan state and evidence.
9
  3. If the user's message is ML-related (training, fine-tuning, dataset, model, benchmark, RAG, embedding, diffusion, LoRA, DPO, GRPO, SFT, TRL, transformers, trackio, Hugging Face, HF, evaluate, inspect, plan, architecture, design, research), STAY in harness mode.
10
  4. If the user says vague follow-ups like "go ahead", "do it", "now what", "continue", "next step", "proceed", infer the next harness phase from the plan and execute it WITHOUT asking for clarification.
11
- 5. Call update_plan for tasks with 3+ steps. Start with a full plan before deep work.
12
  6. Use hf-paper-search for novel or research-backed tasks.
13
  7. Validate datasets with hf-dataset-search before training.
14
  8. Read current HF docs with hf-docs before writing code.
15
  9. Find GitHub examples with github-example-search before implementing.
16
  10. Submit jobs with hf-jobs, never without preflight.
17
  11. After each turn, check if the next step maps to the ml-intern-harness workflow. If yes, re-invoke it. Do NOT act as a generic assistant on ML tasks.
18
- 12. If the user explicitly says "stop using ml-intern" or the task is clearly non-ML (e.g., "what's the weather"), exit harness mode.
19
-
20
  Research-first workflow:
21
  - Clarify the deliverable in one sentence.
22
- - For paper-backed or novel tasks, search papers first, trace citations.
 
23
  - Validate datasets and models before implementation.
24
  - Implement smallest working version only after research.
25
  - Smoke test before full runs.
26
  - Evaluate and ship artifacts.
27
- - If the user only wants a plan, stop after the full research floor and return the plan with evidence checked. Do not implement.
28
-
29
  CRITICAL: The harness must drive the workflow across multiple turns. Do not drop to generic Codex behavior after the first response. The harness is session-persistent.
 
4
  default_prompt: >
5
  You are an ML engineering intern for the Hugging Face ecosystem.
6
  ON EVERY TURN, BEFORE taking any action:
7
+ 0. Call harness-state get_state before any other action. Use the returned phase as your starting point, not conversation history alone.
8
  1. Check if the current conversation is under ml-intern-harness mode. If it was ever triggered in this session, it stays active.
9
+ 2. If active, restate which harness phase you are in before proceeding (e.g., "Harness active — Phase 2: Research papers and datasets").
10
  3. If the user's message is ML-related (training, fine-tuning, dataset, model, benchmark, RAG, embedding, diffusion, LoRA, DPO, GRPO, SFT, TRL, transformers, trackio, Hugging Face, HF, evaluate, inspect, plan, architecture, design, research), STAY in harness mode.
11
  4. If the user says vague follow-ups like "go ahead", "do it", "now what", "continue", "next step", "proceed", infer the next harness phase from the plan and execute it WITHOUT asking for clarification.
12
+ 5. Call update_plan at the START of the session and at EVERY phase transition. Keep exactly one item in_progress at all times. Do not advance phases without updating the plan first.
13
  6. Use hf-paper-search for novel or research-backed tasks.
14
  7. Validate datasets with hf-dataset-search before training.
15
  8. Read current HF docs with hf-docs before writing code.
16
  9. Find GitHub examples with github-example-search before implementing.
17
  10. Submit jobs with hf-jobs, never without preflight.
18
  11. After each turn, check if the next step maps to the ml-intern-harness workflow. If yes, re-invoke it. Do NOT act as a generic assistant on ML tasks.
19
+ 12. If the user explicitly says "stop using ml-intern" or the task is clearly non-ML (e.g., "what's the weather"), call harness-state set_state with active: false and exit harness mode.
20
+
21
  Research-first workflow:
22
  - Clarify the deliverable in one sentence.
23
+ - Research floor (minimum): papers → datasets (inspect at least one candidate) → code examples (read at least one working file) → HF docs for any API you'll call → external constraints. Do not skip layers.
24
+ - For plan-only outputs, prefix the plan with a compact evidence table: Source / Artifact | Verified finding | Design implication | Confidence. Do not return prose summaries as the primary evidence format.
25
  - Validate datasets and models before implementation.
26
  - Implement smallest working version only after research.
27
  - Smoke test before full runs.
28
  - Evaluate and ship artifacts.
29
+ - If the user only wants a plan, stop after the full research floor and return the plan with evidence table. Do not implement.
30
+
31
  CRITICAL: The harness must drive the workflow across multiple turns. Do not drop to generic Codex behavior after the first response. The harness is session-persistent.
plugins/ml-intern/skills/harness-state/SKILL.md ADDED
@@ -0,0 +1,78 @@
+ ---
+ name: harness-state
+ description: "Read and write the ml-intern harness state (active flag, current phase number, phase name). Call get_state at the start of every harness turn. Call set_state after every phase transition."
+ disable-model-invocation: false
+ ***
+
+ # harness-state
+
+ ## Purpose
+
+ Persist and retrieve the ml-intern harness mode flag and current workflow phase across turns.
+ Codex does not natively carry arbitrary session state between model calls — this skill fills that gap by writing state to a local JSON file in the `.codex-plugin/` store.
+
+ ## When To Call This Skill
+
+ | Moment | Action |
+ |---|---|
+ | First turn in a new session | `get_state` — establish baseline |
+ | Every harness turn (before responding) | `get_state` — confirm active + phase |
+ | Harness first triggered | `set_state` with `active: true`, `phase: 1`, `phase_name: "Clarify"` |
+ | After completing a phase and moving to the next | `set_state` with updated phase number and name |
+ | User says "stop using ml-intern" | `set_state` with `active: false` |
+
+ ## Operations
+
+ ### get_state
+
+ Returns the current harness state. If no state file exists yet, returns the default (inactive, phase 0).
+
+ ```json
+ { "operation": "get_state" }
+ ```
+
+ Response shape:
+ ```json
+ {
+   "active": true,
+   "phase": 2,
+   "phase_name": "Research papers and datasets"
+ }
+ ```
+
+ ### set_state
+
+ Writes new harness state. All fields are required.
+
+ ```json
+ {
+   "operation": "set_state",
+   "active": true,
+   "phase": 3,
+   "phase_name": "Read HF docs and code examples"
+ }
+ ```
+
+ ## Phase Reference
+
+ Use these canonical phase names when calling `set_state`:
+
+ | Phase | Name |
+ |---|---|
+ | 1 | Clarify |
+ | 2 | Research papers and datasets |
+ | 3 | Read HF docs and code examples |
+ | 4 | Implement |
+ | 5 | Smoke test |
+ | 6 | Run full job |
+ | 7 | Evaluate |
+ | 8 | Ship |
+
+ For tasks that skip phases (e.g. plan-only requests that stop at phase 2), still set the phase to wherever you actually are. Do not skip `set_state` calls — they are the only durable record of phase across turns.
+
+ ## Rules
+
+ - Always call `get_state` before the first substantive action on any harness turn.
+ - Always call `set_state` immediately after transitioning to a new phase — before doing work in the new phase.
+ - Never infer phase from conversation history if a state file exists — the file is the source of truth.
+ - If `get_state` returns `active: false` but the current message is ML-related, set state to active before proceeding.
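The rules above can be read as a small decision function that runs at the start of each turn. The following is a minimal sketch, not code from this commit: `state_to_write` and `advance` are hypothetical helper names, and resuming at the stored phase when a previously stopped harness is reactivated is an assumption the skill text does not spell out.

```python
# Canonical phase names, copied from the Phase Reference table.
PHASES = {
    1: "Clarify",
    2: "Research papers and datasets",
    3: "Read HF docs and code examples",
    4: "Implement",
    5: "Smoke test",
    6: "Run full job",
    7: "Evaluate",
    8: "Ship",
}


def state_to_write(state, ml_related, stop_requested):
    """Return the payload for set_state, or None if no write is needed.

    `state` is the dict returned by get_state.
    """
    if stop_requested:
        # "stop using ml-intern": deactivate, keep the recorded phase.
        if not state["active"]:
            return None
        return {**state, "active": False}
    if ml_related and not state["active"]:
        # Assumption: resume at the stored phase if it is a known one,
        # otherwise start at phase 1 ("Clarify").
        phase = state["phase"] if state["phase"] in PHASES else 1
        return {"active": True, "phase": phase, "phase_name": PHASES[phase]}
    return None


def advance(state):
    """Return the state for the next phase, capped at the last phase."""
    nxt = min(state["phase"] + 1, max(PHASES))
    return {"active": True, "phase": nxt, "phase_name": PHASES[nxt]}
```

In this sketch the caller would pass any non-`None` result to `set_state` before doing work in the new phase, which matches the ordering the rules require.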
plugins/ml-intern/skills/harness-state/scripts/state.py ADDED
@@ -0,0 +1,104 @@
+ #!/usr/bin/env python3
+ """
+ harness-state skill script.
+ Operations: get_state, set_state
+ State is stored in .codex-plugin/harness_state.json relative to the repo root.
+ Falls back to the current working directory if the .codex-plugin dir is not found.
+ """
+
+ import json
+ import os
+ import sys
+ from pathlib import Path
+
+
+ STATE_FILENAME = "harness_state.json"
+
+ DEFAULT_STATE = {
+     "active": False,
+     "phase": 0,
+     "phase_name": "",
+ }
+
+
+ def find_state_dir() -> Path:
+     """Walk up from cwd looking for .codex-plugin/. Fall back to cwd."""
+     cwd = Path.cwd()
+     for parent in [cwd, *cwd.parents]:
+         candidate = parent / ".codex-plugin"
+         if candidate.is_dir():
+             return candidate
+     # Fallback: use cwd itself
+     return cwd
+
+
+ def state_path() -> Path:
+     return find_state_dir() / STATE_FILENAME
+
+
+ def read_state() -> dict:
+     path = state_path()
+     if not path.exists():
+         return dict(DEFAULT_STATE)
+     try:
+         with open(path) as f:
+             data = json.load(f)
+         # Fill missing keys with defaults
+         return {**DEFAULT_STATE, **data}
+     except (json.JSONDecodeError, OSError):
+         return dict(DEFAULT_STATE)
+
+
+ def write_state(active: bool, phase: int, phase_name: str) -> dict:
+     state = {"active": active, "phase": phase, "phase_name": phase_name}
+     path = state_path()
+     path.parent.mkdir(parents=True, exist_ok=True)
+     with open(path, "w") as f:
+         json.dump(state, f, indent=2)
+     return state
+
+
+ def main():
+     if len(sys.argv) < 2:
+         print(json.dumps({"error": "Usage: state.py <json_input>"}))
+         sys.exit(1)
+
+     try:
+         args = json.loads(sys.argv[1])
+     except json.JSONDecodeError as e:
+         print(json.dumps({"error": f"Invalid JSON input: {e}"}))
+         sys.exit(1)
+
+     operation = args.get("operation")
+
+     if operation == "get_state":
+         result = read_state()
+         print(json.dumps(result))
+
+     elif operation == "set_state":
+         active = args.get("active")
+         phase = args.get("phase")
+         phase_name = args.get("phase_name", "")
+
+         if active is None or phase is None:
+             print(json.dumps({"error": "set_state requires 'active' (bool) and 'phase' (int)"}))
+             sys.exit(1)
+
+         if not isinstance(active, bool):
+             print(json.dumps({"error": "'active' must be a boolean"}))
+             sys.exit(1)
+
+         if not isinstance(phase, int):
+             print(json.dumps({"error": "'phase' must be an integer"}))
+             sys.exit(1)
+
+         result = write_state(active=active, phase=phase, phase_name=str(phase_name))
+         print(json.dumps({"ok": True, "state": result}))
+
+     else:
+         print(json.dumps({"error": f"Unknown operation '{operation}'. Valid: get_state, set_state"}))
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
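Two edge cases in the script above may be worth a follow-up: a state file containing valid JSON that is not an object (for example a list) would make the `{**DEFAULT_STATE, **data}` merge raise a `TypeError`, and a write that is interrupted partway could leave a truncated file behind. The sketch below is one possible hardening, not part of this commit. `load_state` and `save_state_atomic` are hypothetical names, and both take an explicit path rather than calling `find_state_dir()`.

```python
import json
import os
import tempfile
from pathlib import Path

DEFAULT_STATE = {"active": False, "phase": 0, "phase_name": ""}


def load_state(path: Path) -> dict:
    """Read state, falling back to defaults on any unreadable or malformed file."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers a missing file; ValueError covers bad JSON or bad encoding.
        return dict(DEFAULT_STATE)
    if not isinstance(data, dict):
        # Valid JSON, but not an object: treat it like a corrupt file.
        return dict(DEFAULT_STATE)
    return {**DEFAULT_STATE, **data}


def save_state_atomic(path: Path, state: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except Exception:
        os.unlink(tmp_name)
        raise
```

Writing the temp file next to the target keeps the final `os.replace` on a single filesystem, so a reader sees either the old state or the new one.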
plugins/ml-intern/skills/ml-intern-harness/SKILL.md CHANGED
@@ -2,7 +2,7 @@
2
  name: ml-intern-harness
3
  description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
4
  disable-model-invocation: false
5
- ---
6
 
7
  # ML Intern Harness
8
 
@@ -94,6 +94,32 @@ Preferred shape:
94
 
95
  When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
96
 
97
  ## High-Risk Mistakes To Avoid
98
 
99
  - Hallucinated imports or trainer arguments from outdated memory.
@@ -140,11 +166,12 @@ Minimum research floor:
140
  - **Docs**: Read current HF docs for any library/API that the plan depends on.
141
  - **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
142
 
143
- For plan-only outputs, return a compact evidence table before the plan when useful:
144
- - Source or artifact.
145
- - What was verified.
146
- - Design implication.
147
- - Confidence or gap.
 
148
 
149
  If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
150
 
@@ -168,6 +195,7 @@ When delegation is not allowed:
168
  - Perform the same probes directly in the main context.
169
  - State the limitation briefly as a process note only.
170
  - Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
 
171
 
172
  Research prompt pattern to emulate:
173
  - Start from anchor papers or landmark work.
@@ -245,15 +273,14 @@ Use the `hf-jobs` skill for job submission and monitoring.
245
  When something fails:
246
  - Read the full error and relevant logs.
247
  - Do not retry the exact same command without changing the cause.
248
- - Import error: fetch docs/example, patch import/config.
249
- - Dataset KeyError: re-inspect schema, patch preprocessing.
250
- - OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
251
- - Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
252
- - Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
 
253
  - If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
254
 
255
- Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
256
-
257
  ## Completion Standard
258
 
259
  Before final response, verify:
@@ -265,4 +292,4 @@ Return:
265
  - Source repo links (branch, commit, PR).
266
  - Hugging Face artifact URLs (model, dataset, Space, job).
267
  - Metrics or evaluation results.
268
- - Known gaps, failures, or next experiments.
 
2
  name: ml-intern-harness
3
  description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
4
  disable-model-invocation: false
5
+ ***
6
 
7
  # ML Intern Harness
8
 
 
94
 
95
  When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
96
 
+ ### Example plan shape
+
+ The following shows the exact structure to use when calling `update_plan`. IDs are stable integers assigned at plan creation and never reused. Exactly one item is `in_progress` at any time. The entire list is replaced on every call — never partial updates. Only mark an item `completed` after it fully succeeds.
+
+ ```
+ update_plan:
+   todos:
+     - id: 1
+       content: "Research papers"
+       status: completed
+     - id: 2
+       content: "Inspect datasets"
+       status: in_progress
+     - id: 3
+       content: "Read HF docs and code examples"
+       status: pending
+     - id: 4
+       content: "Implement training script"
+       status: pending
+     - id: 5
+       content: "Smoke test and submit job"
+       status: pending
+ ```
+
+ Do not use freeform status strings such as "done", "wip", or "not started". Only `pending`, `in_progress`, and `completed` are valid.
+
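The plan constraints described above (unique IDs, three valid statuses, one `in_progress` item) are easy to check mechanically. Here is a minimal sketch of such a check. `validate_plan` is a hypothetical helper, not something this plugin ships, and allowing zero `in_progress` items once every item is `completed` is an assumption made to accommodate the final plan-only `update_plan` call.

```python
VALID_STATUSES = {"pending", "in_progress", "completed"}


def validate_plan(todos):
    """Return a list of human-readable problems; an empty list means the plan is valid."""
    errors = []

    ids = [item.get("id") for item in todos]
    if len(ids) != len(set(ids)):
        errors.append("duplicate ids")

    for item in todos:
        status = item.get("status")
        if status not in VALID_STATUSES:
            errors.append(f"item {item.get('id')}: invalid status {status!r}")

    in_progress = sum(1 for item in todos if item.get("status") == "in_progress")
    all_done = bool(todos) and all(item.get("status") == "completed" for item in todos)
    # Assumption: a fully completed plan may have no in_progress item.
    if in_progress != 1 and not all_done:
        errors.append(f"expected exactly one in_progress item, found {in_progress}")

    return errors
```

Run against the example plan above, this returns an empty list. A plan using a status such as "done" would produce an "invalid status" entry.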
123
  ## High-Risk Mistakes To Avoid
124
 
125
  - Hallucinated imports or trainer arguments from outdated memory.
 
166
  - **Docs**: Read current HF docs for any library/API that the plan depends on.
167
  - **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
168
 
+ For plan-only outputs, return a compact evidence table before the plan:
+
+ | Source / Artifact | What was verified | Design implication | Confidence |
+ |---|---|---|---|
+
+ Use `verified`, `inferred`, or `not checked` in the Confidence column. Do not return prose summaries as the primary evidence format — the table is the required handoff format.
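The evidence table format above could be produced by a small formatter that also rejects Confidence values outside the three allowed ones. The following is a hedged sketch. `evidence_table` is a hypothetical helper, and the row in the usage note is a placeholder, not a real finding.

```python
HEADER = ("Source / Artifact", "What was verified", "Design implication", "Confidence")
ALLOWED_CONFIDENCE = ("verified", "inferred", "not checked")


def evidence_table(rows):
    """Render (source, finding, implication, confidence) tuples as a Markdown table."""
    lines = ["| " + " | ".join(HEADER) + " |", "|---|---|---|---|"]
    for source, finding, implication, confidence in rows:
        if confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(f"confidence must be one of {ALLOWED_CONFIDENCE}, got {confidence!r}")
        # Escape pipes so a cell cannot break the table layout.
        cells = [str(c).replace("|", "\\|") for c in (source, finding, implication, confidence)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

For example, `evidence_table([("<dataset id>", "<what was inspected>", "<effect on the plan>", "inferred")])` returns the header, the separator, and one data row.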
175
 
176
  If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
177
 
 
195
  - Perform the same probes directly in the main context.
196
  - State the limitation briefly as a process note only.
197
  - Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
198
+ - Return findings as a compact evidence table (Source / Artifact | Verified finding | Design implication | Confidence) before the plan. Do not return prose summaries as the primary evidence format.
199
 
200
  Research prompt pattern to emulate:
201
  - Start from anchor papers or landmark work.
 
273
  When something fails:
274
  - Read the full error and relevant logs.
275
  - Do not retry the exact same command without changing the cause.
276
+ - **Import error**: fetch the current docs or example file, patch the import or config name. Do not guess from memory.
277
+ - **Dataset KeyError**: re-inspect the schema, patch preprocessing to match actual column names.
278
+ - **OOM**: reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps` proportionally to preserve effective batch size, set `gradient_checkpointing=True`. Do NOT switch SFT to LoRA, reduce `max_length`, or change the training method without explicit user approval.
279
+ - **Divergence / NaN**: lower the learning rate, check labels and rewards for correctness, inspect representative samples. Do not silently substitute a different optimizer or scheduler.
280
+ - **Weak metric**: compare against the paper recipe step by step, inspect error cases, propose a targeted sweep. Do not silently change datasets, models, or methods.
281
+ - **Silent substitution is never allowed**: if preserving the original request is impossible, explain the constraint and ask for approval before making any scope change.
282
  - If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
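For the OOM rule above, "proportionally" means the product of `per_device_train_batch_size` and `gradient_accumulation_steps` stays constant. A minimal arithmetic sketch follows. `rebalance_grad_accum` is a hypothetical helper name, and refusing a split that does not divide evenly, instead of silently changing the effective batch size, is a design assumption in keeping with the no-silent-substitution rule.

```python
def rebalance_grad_accum(per_device_batch, grad_accum, new_per_device_batch):
    """Return the gradient_accumulation_steps that keeps the effective batch size.

    Effective batch size per device = per_device_train_batch_size * gradient_accumulation_steps.
    """
    if new_per_device_batch <= 0:
        raise ValueError("new per-device batch size must be positive")
    effective = per_device_batch * grad_accum
    if effective % new_per_device_batch != 0:
        # An uneven split would change the effective batch size, so refuse.
        raise ValueError(
            f"{new_per_device_batch} does not divide the effective batch size {effective}"
        )
    return effective // new_per_device_batch
```

Halving a per-device batch of 8 with 4 accumulation steps gives `rebalance_grad_accum(8, 4, 4) == 8`, so the effective batch size stays at 32.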
283
 
 
 
284
  ## Completion Standard
285
 
286
  Before final response, verify:
 
292
  - Source repo links (branch, commit, PR).
293
  - Hugging Face artifact URLs (model, dataset, Space, job).
294
  - Metrics or evaluation results.
295
+ - Known gaps, failures, or next experiments.