update

Files changed (5)

plugins/ml-intern/.codex-plugin/plugin.json +1 -1
plugins/ml-intern/agents/openai.yaml +9 -7
plugins/ml-intern/skills/harness-state/SKILL.md +78 -0
plugins/ml-intern/skills/harness-state/scripts/state.py +104 -0
plugins/ml-intern/skills/ml-intern-harness/SKILL.md +41 -14

plugins/ml-intern/.codex-plugin/plugin.json CHANGED

@@ -1,6 +1,6 @@
 {
   "name": "ml-intern",
-  "version": "0.1.3",
   "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
   "author": {
     "name": "Hugging Face",

 {
   "name": "ml-intern",
+  "version": "0.1.5",
   "description": "Hugging Face ML Intern for Codex — research ML papers first, inspect models and datasets, run training and evaluation jobs, and ship ML artifacts.",
   "author": {
     "name": "Hugging Face",

plugins/ml-intern/agents/openai.yaml CHANGED

@@ -4,26 +4,28 @@ interface:
   default_prompt: >
     You are an ML engineering intern for the Hugging Face ecosystem.
     ON EVERY TURN, BEFORE taking any action:
     1. Check if the current conversation is under ml-intern-harness mode. If it was ever triggered in this session, it stays active.
-    2. If active, read the conversation history for prior plan state and evidence.
     3. If the user's message is ML-related (training, fine-tuning, dataset, model, benchmark, RAG, embedding, diffusion, LoRA, DPO, GRPO, SFT, TRL, transformers, trackio, Hugging Face, HF, evaluate, inspect, plan, architecture, design, research), STAY in harness mode.
     4. If the user says vague follow-ups like "go ahead", "do it", "now what", "continue", "next step", "proceed", infer the next harness phase from the plan and execute it WITHOUT asking for clarification.
-    5. Call update_plan for tasks with 3+ steps. Start with a full plan before deep work.
     6. Use hf-paper-search for novel or research-backed tasks.
     7. Validate datasets with hf-dataset-search before training.
     8. Read current HF docs with hf-docs before writing code.
     9. Find GitHub examples with github-example-search before implementing.
     10. Submit jobs with hf-jobs, never without preflight.
     11. After each turn, check if the next step maps to the ml-intern-harness workflow. If yes, re-invoke it. Do NOT act as a generic assistant on ML tasks.
-    12. If the user explicitly says "stop using ml-intern" or the task is clearly non-ML (e.g., "what's the weather"), exit harness mode.
     Research-first workflow:
     - Clarify the deliverable in one sentence.
-    - For paper-backed or novel tasks, search papers first, trace citations.
     - Validate datasets and models before implementation.
     - Implement smallest working version only after research.
     - Smoke test before full runs.
     - Evaluate and ship artifacts.
-    - If the user only wants a plan, stop after the full research floor and return the plan with evidence checked. Do not implement.
     CRITICAL: The harness must drive the workflow across multiple turns. Do not drop to generic Codex behavior after the first response. The harness is session-persistent.

   default_prompt: >
     You are an ML engineering intern for the Hugging Face ecosystem.
     ON EVERY TURN, BEFORE taking any action:
+    0. Call harness-state get_state before any other action. Use the returned phase as your starting point, not conversation history alone.
     1. Check if the current conversation is under ml-intern-harness mode. If it was ever triggered in this session, it stays active.
+    2. If active, restate which harness phase you are in before proceeding (e.g., "Harness active — Phase 2: Research papers and datasets").
     3. If the user's message is ML-related (training, fine-tuning, dataset, model, benchmark, RAG, embedding, diffusion, LoRA, DPO, GRPO, SFT, TRL, transformers, trackio, Hugging Face, HF, evaluate, inspect, plan, architecture, design, research), STAY in harness mode.
     4. If the user says vague follow-ups like "go ahead", "do it", "now what", "continue", "next step", "proceed", infer the next harness phase from the plan and execute it WITHOUT asking for clarification.
+    5. Call update_plan at the START of the session and at EVERY phase transition. Keep exactly one item in_progress at all times. Do not advance phases without updating the plan first.
     6. Use hf-paper-search for novel or research-backed tasks.
     7. Validate datasets with hf-dataset-search before training.
     8. Read current HF docs with hf-docs before writing code.
     9. Find GitHub examples with github-example-search before implementing.
     10. Submit jobs with hf-jobs, never without preflight.
     11. After each turn, check if the next step maps to the ml-intern-harness workflow. If yes, re-invoke it. Do NOT act as a generic assistant on ML tasks.
+    12. If the user explicitly says "stop using ml-intern" or the task is clearly non-ML (e.g., "what's the weather"), call harness-state set_state with active: false and exit harness mode.
     Research-first workflow:
     - Clarify the deliverable in one sentence.
+    - Research floor (minimum): papers → datasets (inspect at least one candidate) → code examples (read at least one working file) → HF docs for any API you'll call → external constraints. Do not skip layers.
+    - For plan-only outputs, prefix the plan with a compact evidence table: Source / Artifact | Verified finding | Design implication | Confidence. Do not return prose summaries as the primary evidence format.
     - Validate datasets and models before implementation.
     - Implement smallest working version only after research.
     - Smoke test before full runs.
     - Evaluate and ship artifacts.
+    - If the user only wants a plan, stop after the full research floor and return the plan with evidence table. Do not implement.
     CRITICAL: The harness must drive the workflow across multiple turns. Do not drop to generic Codex behavior after the first response. The harness is session-persistent.

plugins/ml-intern/skills/harness-state/SKILL.md ADDED

@@ -0,0 +1,78 @@

+---
+name: harness-state
+description: "Read and write the ml-intern harness state (active flag, current phase number, phase name). Call get_state at the start of every harness turn. Call set_state after every phase transition."
+disable-model-invocation: false
+***
+# harness-state
+## Purpose
+Persist and retrieve the ml-intern harness mode flag and current workflow phase across turns.
+Codex does not natively carry arbitrary session state between model calls — this skill fills that gap by writing state to a local JSON file in the `.codex-plugin/` store.
+## When To Call This Skill
+| Moment | Action |
+|---|---|
+| First turn in a new session | `get_state` — establish baseline |
+| Every harness turn (before responding) | `get_state` — confirm active + phase |
+| Harness first triggered | `set_state` with `active: true`, `phase: 1`, `phase_name: "Clarify"` |
+| After completing a phase and moving to the next | `set_state` with updated phase number and name |
+| User says "stop using ml-intern" | `set_state` with `active: false` |
+## Operations
+### get_state
+Returns the current harness state. If no state file exists yet, returns the default (inactive, phase 0).
+```json
+{ "operation": "get_state" }
+```
+Response shape:
+```json
+{
+  "active": true,
+  "phase": 2,
+  "phase_name": "Research papers and datasets"
+}
+```
+### set_state
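+Writes new harness state. `active` and `phase` are required; always send `phase_name` as well, because the script stores an empty string when it is omitted.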
+```json
+{
+  "operation": "set_state",
+  "active": true,
+  "phase": 3,
+  "phase_name": "Read HF docs and code examples"
+}
+```
+## Phase Reference
+Use these canonical phase names when calling `set_state`:
+| Phase | Name |
+|---|---|
+| 1 | Clarify |
+| 2 | Research papers and datasets |
+| 3 | Read HF docs and code examples |
+| 4 | Implement |
+| 5 | Smoke test |
+| 6 | Run full job |
+| 7 | Evaluate |
+| 8 | Ship |
+For tasks that skip phases (e.g. plan-only requests that stop at phase 2), still set the phase to wherever you actually are. Do not skip `set_state` calls — they are the only durable record of phase across turns.
+## Rules
+- Always call `get_state` before the first substantive action on any harness turn.
+- Always call `set_state` immediately after transitioning to a new phase — before doing work in the new phase.
+- Never infer phase from conversation history if a state file exists — the file is the source of truth.
+- If `get_state` returns `active: false` but the current message is ML-related, set state to active before proceeding.
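A minimal sketch of how a caller might build `set_state` payloads from the Phase Reference table above. The names `PHASES`, `set_state_payload`, and `advance` are hypothetical and not part of the plugin; only the payload fields and the phase names come from the skill file.

```python
import json

# Canonical phase names copied from the Phase Reference table.
PHASES = {
    1: "Clarify",
    2: "Research papers and datasets",
    3: "Read HF docs and code examples",
    4: "Implement",
    5: "Smoke test",
    6: "Run full job",
    7: "Evaluate",
    8: "Ship",
}


def set_state_payload(active: bool, phase: int) -> str:
    """Build the JSON argument for a set_state call (hypothetical helper)."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase {phase}; expected 1-8")
    return json.dumps({
        "operation": "set_state",
        "active": active,
        "phase": phase,
        "phase_name": PHASES[phase],
    })


def advance(state: dict) -> str:
    """Payload for moving from the current phase to the next one."""
    # Phase 0 is the inactive default, so the first transition lands on phase 1.
    # The last phase is sticky: advancing from 8 stays at 8.
    next_phase = min(state.get("phase", 0) + 1, max(PHASES))
    return set_state_payload(active=True, phase=next_phase)


if __name__ == "__main__":
    print(advance({"active": True, "phase": 2, "phase_name": PHASES[2]}))
```

Looking names up in one table keeps `phase` and `phase_name` from drifting apart, which matters because the state file is treated as the source of truth.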

plugins/ml-intern/skills/harness-state/scripts/state.py ADDED

@@ -0,0 +1,104 @@

+#!/usr/bin/env python3
+"""
+harness-state skill script.
+Operations: get_state, set_state
+State is stored in .codex-plugin/harness_state.json relative to the repo root.
+Falls back to the current working directory if the .codex-plugin dir is not found.
+"""
+import json
+import os
+import sys
+from pathlib import Path
+STATE_FILENAME = "harness_state.json"
+DEFAULT_STATE = {
+    "active": False,
+    "phase": 0,
+    "phase_name": "",
+}
+def find_state_dir() -> Path:
+    """Walk up from cwd looking for .codex-plugin/. Fall back to cwd."""
+    cwd = Path.cwd()
+    for parent in [cwd, *cwd.parents]:
+        candidate = parent / ".codex-plugin"
+        if candidate.is_dir():
+            return candidate
+    # Fallback: use cwd itself
+    return cwd
+def state_path() -> Path:
+    return find_state_dir() / STATE_FILENAME
+def read_state() -> dict:
+    path = state_path()
+    if not path.exists():
+        return dict(DEFAULT_STATE)
+    try:
+        with open(path) as f:
+            data = json.load(f)
+        # Fill missing keys with defaults
+        return {**DEFAULT_STATE, **data}
+    except (json.JSONDecodeError, OSError, TypeError):  # TypeError: valid JSON that is not an object
+        return dict(DEFAULT_STATE)
+def write_state(active: bool, phase: int, phase_name: str) -> dict:
+    state = {"active": active, "phase": phase, "phase_name": phase_name}
+    path = state_path()
+    path.parent.mkdir(parents=True, exist_ok=True)
+    with open(path, "w") as f:
+        json.dump(state, f, indent=2)
+    return state
+def main():
+    if len(sys.argv) < 2:
+        print(json.dumps({"error": "Usage: state.py <json_input>"}))
+        sys.exit(1)
+    try:
+        args = json.loads(sys.argv[1])
+    except json.JSONDecodeError as e:
+        print(json.dumps({"error": f"Invalid JSON input: {e}"}))
+        sys.exit(1)
+    operation = args.get("operation")
+    if operation == "get_state":
+        result = read_state()
+        print(json.dumps(result))
+    elif operation == "set_state":
+        active = args.get("active")
+        phase = args.get("phase")
+        phase_name = args.get("phase_name", "")
+        if active is None or phase is None:
+            print(json.dumps({"error": "set_state requires 'active' (bool) and 'phase' (int)"}))
+            sys.exit(1)
+        if not isinstance(active, bool):
+            print(json.dumps({"error": "'active' must be a boolean"}))
+            sys.exit(1)
+        if isinstance(phase, bool) or not isinstance(phase, int):  # bool is a subclass of int
+            print(json.dumps({"error": "'phase' must be an integer"}))
+            sys.exit(1)
+        result = write_state(active=active, phase=phase, phase_name=str(phase_name))
+        print(json.dumps({"ok": True, "state": result}))
+    else:
+        print(json.dumps({"error": f"Unknown operation '{operation}'. Valid: get_state, set_state"}))
+        sys.exit(1)
+if __name__ == "__main__":
+    main()
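`write_state` opens the target file and writes into it directly, so an interrupted write could leave truncated JSON, which `read_state` would then silently treat as the default (inactive) state. One possible hardening, sketched here under the assumption that a write-then-rename pattern is acceptable for this plugin; `write_state_atomic` is a hypothetical name, not part of the script.

```python
import json
import os
import tempfile
from pathlib import Path


def write_state_atomic(path: Path, state: dict) -> None:
    """Write state to a temp file next to path, then rename it over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Creating the temp file in the same directory keeps the rename on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        # os.replace swaps the file in a single step, overwriting any existing file.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = Path(d) / "harness_state.json"
        write_state_atomic(target, {"active": True, "phase": 1, "phase_name": "Clarify"})
        print(target.read_text())
```

With this pattern a reader sees either the old file or the complete new one, never a partially written state.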

plugins/ml-intern/skills/ml-intern-harness/SKILL.md CHANGED

@@ -2,7 +2,7 @@
 name: ml-intern-harness
 description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
 disable-model-invocation: false
----
 # ML Intern Harness
@@ -94,6 +94,32 @@ Preferred shape:
 When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
 ## High-Risk Mistakes To Avoid
 - Hallucinated imports or trainer arguments from outdated memory.
@@ -140,11 +166,12 @@ Minimum research floor:
 - **Docs**: Read current HF docs for any library/API that the plan depends on.
 - **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
-For plan-only outputs, return a compact evidence table before the plan when useful:
-- Source or artifact.
-- What was verified.
-- Design implication.
-- Confidence or gap.
 If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
@@ -168,6 +195,7 @@ When delegation is not allowed:
 - Perform the same probes directly in the main context.
 - State the limitation briefly as a process note only.
 - Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
 Research prompt pattern to emulate:
 - Start from anchor papers or landmark work.
@@ -245,15 +273,14 @@ Use the `hf-jobs` skill for job submission and monitoring.
 When something fails:
 - Read the full error and relevant logs.
 - Do not retry the exact same command without changing the cause.
-- Import error: fetch docs/example, patch import/config.
-- Dataset KeyError: re-inspect schema, patch preprocessing.
-- OOM: reduce per-device batch size while increasing gradient accumulation to keep effective batch size; enable gradient checkpointing; or choose larger hardware. Do not switch methods.
-- Divergence/NaN: lower learning rate, check labels/rewards, inspect samples.
-- Weak metric: compare against paper recipes, inspect errors, tune with a small sweep.
 - If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
-Do not hide compromises. If preserving the original request is impossible, explain the constraint and ask for approval.
 ## Completion Standard
 Before final response, verify:
@@ -265,4 +292,4 @@ Return:
 - Source repo links (branch, commit, PR).
 - Hugging Face artifact URLs (model, dataset, Space, job).
 - Metrics or evaluation results.
-- Known gaps, failures, or next experiments.

 name: ml-intern-harness
 description: "The core ML Intern skill. Use for any ML engineering task on the Hugging Face ecosystem: research, validate, implement, test, run jobs, evaluate, and ship artifacts. Triggers for fine-tuning, training, evaluation, dataset preparation, model cards, and paper-to-implementation tasks."
 disable-model-invocation: false
+***
 # ML Intern Harness
 When the user only wants a plan, the final `update_plan` call should still mark the synthesis step completed before returning.
+### Example plan shape
+The following shows the exact structure to use when calling `update_plan`. IDs are stable integers assigned at plan creation and never reused. Exactly one item is `in_progress` at any time. Every call replaces the entire list, so never send a partial update. Only mark an item `completed` after it fully succeeds.
+```
+update_plan:
+  todos:
+    - id: 1
+      content: "Research papers"
+      status: completed
+    - id: 2
+      content: "Inspect datasets"
+      status: in_progress
+    - id: 3
+      content: "Read HF docs and code examples"
+      status: pending
+    - id: 4
+      content: "Implement training script"
+      status: pending
+    - id: 5
+      content: "Smoke test and submit job"
+      status: pending
+```
+Do not use freeform status strings such as "done", "wip", or "not started". Only `pending`, `in_progress`, and `completed` are valid.
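The plan rules above (unique IDs, the three allowed statuses, exactly one `in_progress` item) can be checked mechanically before a plan is sent. This is a rough sketch only; `validate_plan` is a hypothetical helper, and the todo layout is assumed to mirror the example plan shape.

```python
VALID_STATUSES = {"pending", "in_progress", "completed"}


def validate_plan(todos: list[dict]) -> list[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    problems = []
    ids = [t.get("id") for t in todos]
    if len(ids) != len(set(ids)):
        problems.append("ids must be unique")
    for t in todos:
        if t.get("status") not in VALID_STATUSES:
            problems.append(f"item {t.get('id')}: invalid status {t.get('status')!r}")
    # Follows the rule as written; a fully completed plan would need this relaxed.
    in_progress = sum(1 for t in todos if t.get("status") == "in_progress")
    if in_progress != 1:
        problems.append(f"expected exactly one in_progress item, found {in_progress}")
    return problems


if __name__ == "__main__":
    plan = [
        {"id": 1, "content": "Research papers", "status": "completed"},
        {"id": 2, "content": "Inspect datasets", "status": "in_progress"},
        {"id": 3, "content": "Read HF docs and code examples", "status": "pending"},
    ]
    print(validate_plan(plan))
```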
 ## High-Risk Mistakes To Avoid
 - Hallucinated imports or trainer arguments from outdated memory.
 - **Docs**: Read current HF docs for any library/API that the plan depends on.
 - **External constraints**: Use current web/official docs for non-HF platform constraints, policies, rate limits, pricing, or APIs.
+For plan-only outputs, return a compact evidence table before the plan:
+| Source / Artifact | What was verified | Design implication | Confidence |
+|---|---|---|---|
+Use `verified`, `inferred`, or `not checked` in the Confidence column. Do not return prose summaries as the primary evidence format — the table is the required handoff format.
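As an illustration of the evidence table format, the sketch below renders rows into the four-column Markdown table and rejects Confidence values outside the three allowed ones. `render_evidence_table` is a hypothetical helper and the row contents are placeholders, not real findings.

```python
CONFIDENCE_LEVELS = ("verified", "inferred", "not checked")
HEADER = ("Source / Artifact", "What was verified", "Design implication", "Confidence")


def render_evidence_table(rows: list[tuple[str, str, str, str]]) -> str:
    """Render (source, verified, implication, confidence) rows as a Markdown table."""
    lines = ["| " + " | ".join(HEADER) + " |", "|---|---|---|---|"]
    for source, verified, implication, confidence in rows:
        if confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {CONFIDENCE_LEVELS}, got {confidence!r}")
        # Escape pipes so cell text cannot break the table layout.
        cells = [c.replace("|", "\\|") for c in (source, verified, implication, confidence)]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_evidence_table([
        ("<dataset id>", "<columns inspected>", "<preprocessing choice>", "verified"),
        ("<paper id>", "<recipe detail>", "<hyperparameter choice>", "inferred"),
    ]))
```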
 If runtime policy prevents spawning a research sub-agent, note that only as a process limitation; do not use it as a reason to skip dataset, code, docs, or citation-graph research.
 - Perform the same probes directly in the main context.
 - State the limitation briefly as a process note only.
 - Still preserve the upstream research order: papers first, then datasets, then docs/examples, then current external constraints.
+- Return findings as a compact evidence table (Source / Artifact | Verified finding | Design implication | Confidence) before the plan. Do not return prose summaries as the primary evidence format.
 Research prompt pattern to emulate:
 - Start from anchor papers or landmark work.
 When something fails:
 - Read the full error and relevant logs.
 - Do not retry the exact same command without changing the cause.
+- **Import error**: fetch the current docs or example file, patch the import or config name. Do not guess from memory.
+- **Dataset KeyError**: re-inspect the schema, patch preprocessing to match actual column names.
+- **OOM**: reduce `per_device_train_batch_size`, increase `gradient_accumulation_steps` proportionally to preserve effective batch size, set `gradient_checkpointing=True`. Do NOT switch SFT to LoRA, reduce `max_length`, or change the training method without explicit user approval.
+- **Divergence / NaN**: lower the learning rate, check labels and rewards for correctness, inspect representative samples. Do not silently substitute a different optimizer or scheduler.
+- **Weak metric**: compare against the paper recipe step by step, inspect error cases, propose a targeted sweep. Do not silently change datasets, models, or methods.
+- **Silent substitution is never allowed**: if preserving the original request is impossible, explain the constraint and ask for approval before making any scope change.
 - If the issue is ambiguous, return to the most authoritative source available before making a speculative change.
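The OOM rule relies on the effective batch size staying constant. A worked sketch of that arithmetic, assuming effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps` × number of devices; the helper names and the `shrink_factor` parameter are hypothetical.

```python
def effective_batch_size(per_device_batch: int, grad_accum: int, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device_batch * grad_accum * num_devices


def rebalance_for_oom(per_device_batch: int, grad_accum: int, shrink_factor: int = 2) -> tuple[int, int]:
    """Shrink the per-device batch and grow accumulation by the same factor."""
    if per_device_batch % shrink_factor != 0:
        raise ValueError("per-device batch size must be divisible by shrink_factor")
    return per_device_batch // shrink_factor, grad_accum * shrink_factor


if __name__ == "__main__":
    before = (8, 4)
    after = rebalance_for_oom(*before)
    # Both configurations give the same effective batch size on 2 devices.
    print(before, effective_batch_size(*before, num_devices=2))
    print(after, effective_batch_size(*after, num_devices=2))
```

Because only the split between micro-batch and accumulation steps changes, the training method and the samples seen per optimizer step stay as requested.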
 ## Completion Standard
 Before final response, verify:
 - Source repo links (branch, commit, PR).
 - Hugging Face artifact URLs (model, dataset, Space, job).
 - Metrics or evaluation results.
+- Known gaps, failures, or next experiments.