Spaces: HuggingFaceH4/harbor-visualiser

AdithyaSK (HF Staff) committed on May 28

Commit e587765, 1 parent: ff3a8e7

Recognise registry.json as a positive Harbor-dataset signal

Per Harbor docs (harborframework.com/docs/datasets), a registry.json at the
dataset root registers it with the Harbor CLI (--registry-path). It's NOT
required — Terminal-Bench, DABstep, TitanBench all ship without one — but
when present it's a definitive Harbor-dataset signal.

- list_hf_tasks now treats registry.json at root as a positive marker that
skips the task.toml subdir sampling (cheaper, more correct).
- Empty-state message lists all three recognised markers: registry.json,
tasks/ (nested), or top-level dirs with task.toml (flat).
- Verified: Repo2RLEnv (registry+nested) 127, DABstep (flat) 450,
Terminal-Bench (flat) 89, TitanBench (nested) 2, TaskTrove (not Harbor) 0.
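
The order in which the three markers are checked can be summarised as a small decision function. This is a minimal sketch, not the code in viewer/hub.py: `Entry`, `classify_layout` and the `flat_has_task_toml` flag are hypothetical names introduced for illustration, and the function works on an in-memory root listing instead of calling the Hub.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one item in a dataset's root listing."""

    path: str
    is_dir: bool


def classify_layout(root: list[Entry], flat_has_task_toml: bool = False) -> str | None:
    """Return the marker that matched ("nested", "registry", "flat") or None.

    flat_has_task_toml is an assumed input: whether a sampled top-level
    directory was found to contain a task.toml.
    """
    names = {e.path: e for e in root}
    # A tasks/ directory wins first, whether or not registry.json exists.
    if "tasks" in names and names["tasks"].is_dir:
        return "nested"
    # Candidate task folders: top-level directories that are not dotfiles.
    has_dirs = any(e.is_dir and not e.path.startswith(".") for e in root)
    # registry.json at the root lets us accept the folders without sampling.
    if "registry.json" in names and has_dirs:
        return "registry"
    # Otherwise the flat layout must be confirmed by a task.toml sample.
    if has_dirs and flat_has_task_toml:
        return "flat"
    return None
```

A dataset with both `registry.json` and `tasks/` is reported as "nested" here, because the `tasks/` check runs first and the registry marker only matters in the flat branch.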

Files changed (2)

static/app.js +1 -1
viewer/hub.py +22 -13

static/app.js CHANGED

@@ -373,7 +373,7 @@ async function renderWorkspace(params) {
          <p>Select a task from the list to view its spec, files & run command.</p></div>`
       : `<div class="emptysel"><div class="ic">${ICON.info}</div>
          <p><strong style="color:var(--text)">No Harbor tasks found in this dataset.</strong><br>
-         The visualiser looks for <code>task.toml</code> files (either at the root or under <code>tasks/</code>). This dataset doesn't seem to follow the Harbor task-spec format.</p></div>`;
+         The visualiser recognises Harbor datasets by either a <code>registry.json</code> at the root, a <code>tasks/</code> folder (nested layout), or top-level dirs containing <code>task.toml</code> (flat layout). This dataset doesn't follow any of those.</p></div>`;
   }
   // ── load one task's detail into the tree + content (no full re-render) ──

viewer/hub.py CHANGED

@@ -101,25 +101,34 @@ def list_hf_tasks(dataset_id: str, revision: str | None = None, *, ttl: float =
     root = list(api.list_repo_tree(dataset_id, repo_type="dataset", revision=revision, recursive=False))
     names = {e.path: e for e in root}
+    # `registry.json` at the root is a positive signal that this is a Harbor
+    # dataset (Repo2RLEnv pushes it; harbor's --registry-path consumes it).
+    # It's *not* required — terminal-bench-2.0, dabstep-harbor, titanbench all
+    # ship without one — but its presence skips the task.toml sampling below.
+    has_registry = "registry.json" in names
     if "tasks" in names and _is_dir(names["tasks"]):
         sub = api.list_repo_tree(dataset_id, "tasks", repo_type="dataset", revision=revision, recursive=False)
         ids = sorted(e.path.split("/")[-1] for e in sub if _is_dir(e))
     else:
         # Flat layout: top-level folders MAY be tasks (skip dotfiles/README/etc.).
-        # But some datasets (e.g. TaskTrove) have top-level dirs that aren't Harbor
-        # tasks — they hold `tasks.parquet` or similar. Verify the layout by sampling
-        # the first few candidates for a `task.toml`; if none have one, this isn't a
-        # Harbor task-spec dataset and we return [] rather than listing random folders.
+        # Some datasets (e.g. TaskTrove) have top-level dirs that aren't Harbor
+        # tasks — they hold `tasks.parquet` or similar. Verify by sampling the
+        # first few candidates for a `task.toml`. If `registry.json` is at the
+        # root we already know this is a Harbor dataset and skip the check.
         candidates = sorted(e.path for e in root if _is_dir(e) and not e.path.startswith("."))
-        ids = []
-        for sample in candidates[:3]:
-            try:
-                sub = list(api.list_repo_tree(dataset_id, sample, repo_type="dataset", revision=revision, recursive=False))
-            except Exception:  # noqa: BLE001
-                continue
-            if any(getattr(e, "path", "").endswith("task.toml") for e in sub):
-                ids = candidates
-                break
+        if has_registry:
+            ids = candidates
+        else:
+            ids = []
+            for sample in candidates[:3]:
+                try:
+                    sub = list(api.list_repo_tree(dataset_id, sample, repo_type="dataset", revision=revision, recursive=False))
+                except Exception:  # noqa: BLE001
+                    continue
+                if any(getattr(e, "path", "").endswith("task.toml") for e in sub):
+                    ids = candidates
+                    break
     _TASKS_CACHE[key] = (ids, now)
     return ids
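
One way to check the "cheaper" claim offline is to pull the flat-layout branch into a function that takes the directory lister as a parameter and count how many listings it makes. The sketch below is an assumption-laden illustration, not part of the commit: `flat_task_ids`, `fake_list_dir`, `FAKE_TREE` and the use of `OSError` as the failure signal are all hypothetical, and the fake repository contents are invented for the demonstration.

```python
from __future__ import annotations

from typing import Callable, Iterable


def flat_task_ids(
    candidates: list[str],
    has_registry: bool,
    list_dir: Callable[[str], Iterable[str]],
    sample_size: int = 3,
) -> list[str]:
    """Return candidates as task ids if the layout looks like Harbor, else []."""
    if has_registry:
        # registry.json present: trust the layout, make no extra listing calls.
        return candidates
    for sample in candidates[:sample_size]:
        try:
            children = list(list_dir(sample))
        except OSError:
            # Assumption: the lister signals failure with OSError; skip that sample.
            continue
        if any(p.endswith("task.toml") for p in children):
            # One confirmed task.toml is enough to accept every candidate.
            return candidates
    return []


# Fake repository contents, invented purely for this demonstration.
FAKE_TREE = {
    "task-a": ["task-a/task.toml", "task-a/instruction.md"],
    "task-b": ["task-b/task.toml"],
    "data": ["data/tasks.parquet"],
}
calls: list[str] = []  # records every directory the fake lister was asked for


def fake_list_dir(path: str) -> list[str]:
    calls.append(path)
    return FAKE_TREE[path]
```

With `has_registry=True` the `calls` list stays empty, whereas without it the function lists up to `sample_size` directories and stops at the first one containing a `task.toml`. Like the original, the sketch accepts every candidate once a single sample matches, so a non-task folder sitting next to real tasks is still returned.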