razvan committed · Commit 203220c · verified · 1 Parent(s): 5b68ff9

Upload plugins/mlintern/skills/hf-dataset-search/SKILL.md with huggingface_hub

plugins/mlintern/skills/hf-dataset-search/SKILL.md ADDED
@@ -0,0 +1,61 @@
+ ---
+ name: hf-dataset-search
+ description: "Search Hugging Face Hub for datasets, inspect schema, splits, sample rows, and training-method compatibility."
+ disable-model-invocation: false
+ ---
+
+ # hf-dataset-search — Hugging Face Dataset Discovery
+
+ ## Purpose
+
+ Find and validate Hugging Face datasets before using them in training or evaluation, so that downstream code never relies on a hallucinated schema or on incompatible data.
+
+ ## Tools
+
+ - `dataset_search`: Search HF Hub datasets by query, tags, or task.
+ - `hub_repo_details`: Get dataset metadata and README.
+
+ ## Dataset Viewer Inspection
+
+ The plugin does not expose a direct `hf_inspect_dataset` tool. To inspect schema, splits, and sample rows, use the dataset inspection script:
+
+ ```bash
+ python skills/ml-intern-harness/scripts/inspect_dataset.py <dataset-id> --split train --sample-rows 3
+ ```
+
+ This queries the Hugging Face Dataset Viewer API for:
+ - Validity status
+ - Configs and splits
+ - Schema (column names and types)
+ - Representative rows
+ - Parquet file availability
+ - SFT/DPO/GRPO compatibility notes
+
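For orientation, the kind of Dataset Viewer queries listed above can be sketched in plain Python. This is not the source of `inspect_dataset.py`, which is not shown in this commit; the helper names (`viewer_url`, `fetch_json`, `list_splits`, `column_types`, `inspect`) are hypothetical, and the sketch assumes the public `is-valid`, `splits`, and `first-rows` endpoints of `https://datasets-server.huggingface.co`. Gated or private datasets would also need an authorization header, which is omitted here.

```python
import json
import urllib.parse
import urllib.request

# Public Dataset Viewer API root (assumption: the script talks to this service).
API_ROOT = "https://datasets-server.huggingface.co"


def viewer_url(endpoint: str, **params: str) -> str:
    """Build a Dataset Viewer API URL such as .../splits?dataset=<id>."""
    return f"{API_ROOT}/{endpoint}?{urllib.parse.urlencode(params)}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """GET a URL and decode the JSON body (raises on HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_splits(splits_response: dict) -> list:
    """Return (config, split) pairs from a /splits response."""
    return [(s["config"], s["split"]) for s in splits_response.get("splits", [])]


def column_types(first_rows_response: dict) -> dict:
    """Map column name to feature type from a /first-rows response."""
    return {f["name"]: f["type"] for f in first_rows_response.get("features", [])}


def inspect(dataset: str, sample_rows: int = 3) -> None:
    """Print validity, splits, schema, and a few rows for one dataset."""
    print("validity:", fetch_json(viewer_url("is-valid", dataset=dataset)))
    splits = list_splits(fetch_json(viewer_url("splits", dataset=dataset)))
    for config, split in splits:
        first = fetch_json(
            viewer_url("first-rows", dataset=dataset, config=config, split=split)
        )
        print(config, split, column_types(first))
        for item in first.get("rows", [])[:sample_rows]:
            print("  ", item["row"])
```

Calling `inspect("HuggingFaceH4/ultrachat_200k")` would issue one request per split, so for datasets with many configs it may be worth restricting the loop to the split of interest.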
+ ## Workflow
+
+ 1. Search for candidate datasets with `dataset_search`.
+ 2. Inspect metadata with `hub_repo_details` (set `repo_type="dataset"`).
+ 3. Run `inspect_dataset.py` for schema and sample row details.
+ 4. Verify training-method compatibility:
+    - SFT: needs `messages`, `text`, or a `prompt`/`completion` pair
+    - DPO: needs `prompt`, `chosen`, `rejected`
+    - GRPO: needs `prompt`
+ 5. Surface class imbalance, missing values, unexpected formats, or unsafe substitutions.
+
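The column rules in step 4 can be expressed as a small lookup. The following is a minimal sketch, assuming only the column names are known; `REQUIRED_COLUMNS` and `compatible_methods` are hypothetical names, and the check does not validate column types (for example, that `messages` really holds a list of role/content entries).

```python
from typing import Iterable, List

# Column requirements copied from step 4 of the workflow. Each method maps to
# a list of alternatives; one fully present alternative is enough.
REQUIRED_COLUMNS = {
    "SFT": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "DPO": [{"prompt", "chosen", "rejected"}],
    "GRPO": [{"prompt"}],
}


def compatible_methods(columns: Iterable[str]) -> List[str]:
    """Return the training methods whose required columns are all present."""
    present = set(columns)
    return [
        method
        for method, alternatives in REQUIRED_COLUMNS.items()
        if any(required <= present for required in alternatives)
    ]
```

For example, `compatible_methods(["prompt", "chosen", "rejected"])` returns `["DPO", "GRPO"]`: the preference columns satisfy DPO, `prompt` alone satisfies GRPO, and SFT is excluded because there is no `completion`, `text`, or `messages` column.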
+ ## Example
+
+ ```
+ _dataset_search(query="instruction following", sort="downloads")
+ _hub_repo_details(repo_ids=["HuggingFaceH4/ultrachat_200k"], repo_type="dataset")
+ python skills/ml-intern-harness/scripts/inspect_dataset.py HuggingFaceH4/ultrachat_200k --split train_sft --sample-rows 3
+ ```
+
+ ## Validation Checklist
+
+ Before using a dataset:
+ - [ ] Dataset is valid and has Dataset Viewer coverage.
+ - [ ] Configs/splits match expectations.
+ - [ ] Column names and types are compatible with the trainer.
+ - [ ] Sample rows look reasonable.
+ - [ ] Row count is sufficient for the task.
+ - [ ] License and gating are acceptable.
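
The last checklist item can be partly automated. The sketch below is an illustration rather than part of the skill: `license_and_gating_issues` is a hypothetical helper, the allow-list is whatever the project chooses, and it assumes Hub tags of the form `license:<id>` plus a `gated` value such as the one exposed on `huggingface_hub`'s `DatasetInfo`.

```python
from typing import Iterable, List, Optional, Union


def license_and_gating_issues(
    tags: Iterable[str],
    gated: Optional[Union[str, bool]],
    allowed_licenses: Iterable[str],
) -> List[str]:
    """Return human-readable problems; an empty list means the item passes."""
    # Hub tags carry the license as "license:<id>", e.g. "license:mit".
    licenses = {t.split(":", 1)[1] for t in tags if t.startswith("license:")}
    issues = []
    if not licenses:
        issues.append("no license tag found")
    elif not licenses & set(allowed_licenses):
        issues.append(f"license not in allow-list: {sorted(licenses)}")
    if gated:
        # Gated datasets need an access request before they can be downloaded.
        issues.append(f"dataset is gated ({gated})")
    return issues


# Possible usage with huggingface_hub (requires network access):
# from huggingface_hub import HfApi
# info = HfApi().dataset_info("HuggingFaceH4/ultrachat_200k")
# print(license_and_gating_issues(info.tags or [], info.gated, {"mit", "apache-2.0"}))
```

A missing license tag is reported rather than treated as acceptable, since an unlabeled dataset still needs a manual look at its README.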