evalstate HF Staff committed (verified)
Commit ebb3165 · 1 Parent(s): 7e6cbce

Deploy Transformers PR API
README.md CHANGED
@@ -5,7 +5,7 @@ colorFrom: indigo
 colorTo: blue
 sdk: docker
 app_port: 7860
-short_description: Live API for Transformers PR similarity search.
+short_description: Live API for Transformers PR search and issue clustering.
 datasets:
 - evalstate/transformers-pr
 tags:
@@ -20,12 +20,6 @@ tags:
 
 Machine-oriented API for PR similarity search.
 
-Canonical storage roles:
-
-- dataset repo: published latest state and canonical current analysis
-- mounted bucket: mutable operational cache only
-- Space disk: ephemeral runtime storage
-
 Defaults for this deployment:
 
 - repo: `huggingface/transformers`
src/slop_farmer.egg-info/PKG-INFO CHANGED
@@ -61,6 +61,16 @@ forward from the previous snapshot when the new snapshot does not already have i
 log a cache-hit summary for the run. This is useful for incremental scrapes where many
 review units are unchanged and can safely reuse cached hybrid decisions.
 
+To push that local cache back to the dataset repo for future remote-first runs, use either:
+
+- `publish-analysis-artifacts --save-cache` during canonical analysis publication
+- `save-cache` to upload `analysis-state/` on its own
+
+Hybrid review execution is bounded-parallel. Use `--hybrid-llm-concurrency N` or
+`analysis.hybrid_llm_concurrency: N` to cap concurrent review units. `1` keeps the
+lowest provider pressure; higher values can reduce wall-clock time at the cost of more
+provider pressure.
+
 ## Scope
 
 Cluster PRs by touched repository areas.
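For context on the bounded-parallel wording in the hunk above, a minimal sketch of the pattern, assuming a plain `asyncio.Semaphore`; the unit type and the review call are illustrative stand-ins, not the package's internals:

```python
import asyncio

# Hypothetical shape of bounded-parallel hybrid review: a semaphore caps
# in-flight review units at the configured concurrency.
async def review_all(units: list[str], concurrency: int) -> list[str]:
    semaphore = asyncio.Semaphore(concurrency)

    async def review_one(unit: str) -> str:
        async with semaphore:  # at most `concurrency` provider calls at once
            await asyncio.sleep(0.01)  # stand-in for one hybrid LLM review call
            return f"reviewed:{unit}"

    return await asyncio.gather(*(review_one(unit) for unit in units))

# asyncio.run(review_all(["pr-1", "pr-2", "pr-3"], concurrency=1))
```

With `concurrency=1` the loop degrades to sequential execution, which matches the documented lowest-provider-pressure setting.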
@@ -81,23 +91,40 @@ uv run slop-farmer scrape \
 --max-prs 50
 ```
 
-To publish a snapshot to the Hub:
+To refresh the canonical dataset repo:
 
 ```bash
-uv run slop-farmer scrape \
-  --repo huggingface/transformers \
-  --output-dir data \
-  --hf-repo-id burtenshaw/transformers-pr-slop-dataset \
-  --publish
+uv run slop-farmer --config configs/transformers.yaml refresh-dataset
 ```
 
-When `--publish` is used, `slop-farmer` now also generates and uploads new contributor reviewer artifacts by default:
+`refresh-dataset` publishes raw tables plus cheap artifacts like:
 
 - `new_contributors.parquet`
 - `new-contributors-report.json`
 - `new-contributors-report.md`
-
-Use `--no-new-contributor-report` to skip them.
+- `pr-scope-clusters.json`
+
+To publish expensive hybrid analysis artifacts after a local `analyze` run:
+
+```bash
+uv run slop-farmer --config configs/transformers.yaml publish-analysis-artifacts \
+  --analysis-id hybrid-gpt54mini-v3 \
+  --canonical \
+  --save-cache
+```
+
+This writes an immutable archived run under
+`snapshots/<snapshot_id>/analysis-runs/<analysis_id>/...` and, with `--canonical`,
+updates the stable `analysis/current/` alias. With `--save-cache`, it also uploads the
+snapshot-local `analysis-state/` directory to repo-root `analysis-state/` as mutable
+operational cache for future hybrid runs.
+
+To upload only the cache without publishing canonical analysis:
+
+```bash
+uv run slop-farmer --config configs/transformers.yaml save-cache \
+  --snapshot-dir runs/transformers-recent-60d/data/snapshots/20260418T170534Z
+```
 
 ## Nightly incremental runs
 
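To make the archived-run versus canonical-alias layout from the hunk above concrete, a sketch with hypothetical stand-ins for the repo's real `snapshot_paths` helpers (the helper names appear elsewhere in this commit; these implementations are assumptions):

```python
# Illustrative only: plain string formatting standing in for the package's
# analysis_run_artifact_path / current_analysis_artifact_path helpers.
def analysis_run_artifact_path(snapshot_id: str, analysis_id: str, filename: str) -> str:
    # immutable archived run, one directory per (snapshot, analysis) pair
    return f"snapshots/{snapshot_id}/analysis-runs/{analysis_id}/{filename}"

def current_analysis_artifact_path(filename: str) -> str:
    # stable alias updated only by --canonical publishes
    return f"analysis/current/{filename}"

print(analysis_run_artifact_path(
    "20260418T170534Z", "hybrid-gpt54mini-v3", "analysis-report-hybrid.json"
))
# snapshots/20260418T170534Z/analysis-runs/hybrid-gpt54mini-v3/analysis-report-hybrid.json
print(current_analysis_artifact_path("analysis-report-hybrid.json"))
# analysis/current/analysis-report-hybrid.json
```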
@@ -170,6 +197,7 @@ materialize versioned Hub **dataset repos**; it does not currently read HF bucke
 Compatibility wrappers remain available:
 
 - `scripts/submit_transformers_dataset_job.sh`
+- `scripts/submit_diffusers_dataset_job.sh`
 - `scripts/submit_openclaw_dataset_job.sh`
 
 For the current storage model and recommended modes, see
@@ -183,11 +211,15 @@ You can analyze the published Hugging Face dataset directly without scraping Git
 uv run slop-farmer analyze \
   --snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
   --ranking-backend hybrid \
-  --model "gpt-5-mini?reasoning=low" \
+  --model "gpt-5.4-mini?service_tier=flex" \
   --output /tmp/gh-live-latest-1000x1000-hybrid.json
 ```
 
-This materializes the dataset-viewer parquet export into a local snapshot cache under `eval_data/snapshots/` and writes `analysis-report.json` next to it.
+This materializes the dataset-viewer parquet export into a local snapshot cache under
+`eval_data/snapshots/` and writes a local analysis report next to it. Publishing
+canonical hybrid analysis is a separate `publish-analysis-artifacts` step, and updating
+the remote hybrid cache source is `publish-analysis-artifacts --save-cache` or
+standalone `save-cache`.
 
 Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:
@@ -283,28 +315,15 @@ By default this writes:
 
 next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
 
-## Full end-to-end workflow
-
-You can run scrape + publish + analyze + markdown + dashboard export in one command:
-
-```bash
-uv run slop-farmer full-pipeline \
-  --repo huggingface/transformers \
-  --dataset YOURNAME/transformers-pr-slop-dataset \
-  --model "gpt-5-mini?reasoning=low"
-```
-
-This writes outputs under a repo-anchored workspace directory, for example:
-
-- `runs/transformers/data/`
-- `runs/transformers/web/public/data/`
-
-Optional age caps are based on `created_at`:
-
-```bash
-  --issue-max-age-days 30 \
-  --pr-max-age-days 14
-```
+## Recommended end-to-end sequence
+
+For canonical upkeep, prefer the explicit sequence:
+
+1. `refresh-dataset`
+2. `analyze`
+3. `publish-analysis-artifacts --save-cache`
+4. `dashboard-data`
+5. deploy dashboard and API if needed
 
 ## Validation checks
 
@@ -312,6 +331,7 @@ Before committing or wiring new package moves into automation, run:
 
 ```bash
 uv run python scripts/enforce_packaging.py
+uv run python scripts/check_hf_cli_secrets.py
 uv run --extra dev ruff format --check src tests scripts jobs
 uv run --extra dev ruff check src tests scripts jobs
 uv run --extra dev ty check src tests scripts jobs
@@ -324,6 +344,9 @@ uv run --extra dev pytest -q
 - `data` must not import `reports`
 - `reports` must not import `app`
 
+`scripts/check_hf_cli_secrets.py` rejects `hf ... --secrets NAME=value` so access
+tokens cannot be exposed via process argv.
+
 ## YAML config-driven runs
 
 You can keep repo-specific pipeline defaults in a YAML file and apply them to all
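The rejection rule described in the `check_hf_cli_secrets.py` hunk above is simple to sketch; the pattern and reporting here are assumptions, not the real script's logic:

```python
import re
import sys

# Sketch of the guard: flag any `hf ... --secrets NAME=value` invocation,
# which would leak the secret value through process argv.
SECRETS_INLINE = re.compile(r"\bhf\b[^\n]*--secrets\s+\w+=\S+")

def check_text(text: str) -> list[str]:
    return [line for line in text.splitlines() if SECRETS_INLINE.search(line)]

if __name__ == "__main__":
    offending = check_text(sys.stdin.read())
    for line in offending:
        print(f"inline secret value: {line.strip()}", file=sys.stderr)
    sys.exit(1 if offending else 0)
```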
@@ -378,9 +401,9 @@ uv run slop-farmer --config configs/diffusers.yaml dataset-status
 Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
 an explicit local snapshot instead.
 
-If you run `analyze` before `publish-snapshot`, the uploaded snapshot will also include
-`analysis-state/`, which makes the hybrid cache portable across machines and reusable in
-later snapshots when `analysis.cached_analysis: true` is enabled.
+`analysis-state/` is mutable operational cache only. You can upload it to the dataset
+repo with `save-cache` or `publish-analysis-artifacts --save-cache`, but it is still not
+the canonical analysis read surface.
 
 ## Export static dashboard data
 
@@ -402,6 +425,15 @@ This writes:
 
 The dashboard is intentionally summary-first and links out to GitHub for deep detail.
 
+When `--analysis-input` is omitted, `dashboard-data` now prefers:
+
+1. `analysis/current/manifest.json`
+2. `analysis/current/analysis-report-hybrid.json`
+3. snapshot-local fallback only when canonical current analysis is absent
+
+If the canonical current manifest exists but the required artifact is missing, dashboard export
+fails loudly instead of silently drifting to snapshot-local analysis.
+
 ## Deploy a dashboard to a Hugging Face Space
 
 Use the generic deploy script:
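A sketch of the preference order and fail-loud behavior documented in the `dashboard-data` hunk above, assuming simple path probing (the helper name and exact error type are illustrative):

```python
from pathlib import Path

# Illustrative resolution only. `repo_root` is the materialized dataset view;
# the fail-loud branch mirrors the documented behavior, not the exact code.
def resolve_analysis_input(repo_root: Path, snapshot_dir: Path) -> Path:
    manifest = repo_root / "analysis" / "current" / "manifest.json"
    report = repo_root / "analysis" / "current" / "analysis-report-hybrid.json"
    if manifest.exists():
        if not report.exists():
            # canonical manifest present but artifact missing: fail loudly
            raise FileNotFoundError(f"canonical analysis artifact missing: {report}")
        return report
    # snapshot-local fallback only when canonical current analysis is absent
    return snapshot_dir / "analysis-report-hybrid.json"
```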
@@ -420,6 +452,12 @@ Repo-specific wrappers are also available:
 - `scripts/deploy_transformers_dashboard_space.sh`
 - `scripts/deploy_openclaw_dashboard_space.sh`
 
+Repo-specific end-to-end dashboard update helpers are also available:
+
+- `scripts/update_transformers_dashboard.sh`
+- `scripts/update_diffusers_dashboard.sh`
+- `scripts/update_openclaw_dashboard.sh`
+
 Or use the CLI wrapper with a YAML config:
 
 ```bash
@@ -432,21 +470,23 @@ The repo includes the FastAPI service for the read-oriented PR similarity surfac
 The standalone `pr-search` client now lives in the downstream `pr-search-cli`
 package.
 
-Deploy the OpenClaw API Space with:
+Repo-specific wrappers are available for the current deployed APIs:
 
 ```bash
+scripts/update_diffusers_pr_search_api.sh
+scripts/update_transformers_pr_search_api.sh
 scripts/update_openclaw_pr_search_api.sh
 ```
 
 Or use the generic deploy script directly:
 
 ```bash
-SPACE_ID=evalstate/openclaw-pr-api \
-SPACE_TITLE="OpenClaw PR API" \
-DEFAULT_REPO=openclaw/openclaw \
+SPACE_ID=evalstate/transformers-pr-api \
+SPACE_TITLE="Transformers PR API" \
+DEFAULT_REPO=huggingface/transformers \
 GHR_BASE_URL=https://ghreplica.dutiful.dev \
-HF_REPO_ID=evalstate/openclaw-pr \
-BUCKET_ID=evalstate/openclaw-pr-api-data \
+HF_REPO_ID=evalstate/transformers-pr \
+BUCKET_ID=evalstate/transformers-pr-api-data \
 scripts/deploy_pr_search_space.sh
 ```
 
@@ -455,14 +495,62 @@ This deploy flow:
 
 - creates or updates a Docker Space
 - uploads a minimal app bundle with a generated Space `README.md`
 - sets runtime variables for the API
-- mounts the configured HF bucket at `/data`
+- mounts the configured HF bucket at `/data` as mutable operational cache only
+
+Serving defaults:
+
+- dataset repo = canonical published state
+- API materializes one self-consistent dataset view
+- canonical `analysis/current/` is the default analysis surface when present
+- archived analysis is selectable explicitly with `snapshot_id` + `analysis_id`
 
 After the Space is live, you can query it either through the in-repo admin CLI:
 
 ```bash
-uv run slop-farmer pr-search status --repo openclaw/openclaw
-uv run slop-farmer pr-search similar 67096 --repo openclaw/openclaw
+uv run slop-farmer pr-search status --repo huggingface/transformers
+uv run slop-farmer pr-search similar 44940 --repo huggingface/transformers
 ```
 
 Or through the downstream `pr-search-cli` package, which owns the standalone
 `pr-search` executable.
+
+## Transformers migration cheat sheet
+
+To move Transformers onto the current architecture:
+
+### 1. Recreate the scheduled dataset refresh job with the generic wrapper
+
+```bash
+CONFIG_PATH=configs/transformers.yaml \
+LABEL=transformers-dataset-refresh \
+SCHEDULE='@daily' \
+scripts/submit_transformers_dataset_job.sh
+```
+
+This is the canonical scheduled writer for raw/latest dataset state.
+
+### 2. Run analysis and publish canonical hybrid analysis
+
+```bash
+ANALYSIS_ID=hybrid-gpt54mini-v3 scripts/update_transformers_dashboard.sh
+```
+
+That sequence:
+
+- refreshes dataset if requested
+- writes local hybrid analysis output
+- publishes canonical `analysis/current/`
+- saves repo-root `analysis-state/` for future hybrid cache reuse
+- rebuilds PR scope
+- deploys the dashboard
+
+### 3. Deploy the Transformers API Space
+
+```bash
+scripts/update_transformers_pr_search_api.sh
+```
+
+Optional runtime bucket:
+
+- default wrapper bucket id: `evalstate/transformers-pr-api-data`
+- treat it as mutable operational cache only, not canonical published storage
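Scripted version of the admin CLI queries shown in the hunk above; the commands come from the README, but the assumption that they print JSON to stdout is an inference, not documented behavior:

```python
import json
import subprocess

# Wraps `uv run slop-farmer pr-search ...`; output format is an assumption.
def pr_search(*args: str) -> dict:
    completed = subprocess.run(
        ["uv", "run", "slop-farmer", "pr-search", *args],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)

status = pr_search("status", "--repo", "huggingface/transformers")
similar = pr_search("similar", "44940", "--repo", "huggingface/transformers")
```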
src/slop_farmer.egg-info/SOURCES.txt CHANGED
@@ -19,9 +19,10 @@ src/slop_farmer/app/hf_checkpoint_import.py
 src/slop_farmer/app/pipeline.py
 src/slop_farmer/app/pr_search.py
 src/slop_farmer/app/pr_search_api.py
-src/slop_farmer/app/publish.py
+src/slop_farmer/app/publish_analysis.py
+src/slop_farmer/app/publish_dataset_snapshot.py
+src/slop_farmer/app/save_cache.py
 src/slop_farmer/app/snapshot_state.py
-src/slop_farmer/app/workflow.py
 src/slop_farmer/data/__init__.py
 src/slop_farmer/data/dataset_card.py
 src/slop_farmer/data/ghreplica_api.py
@@ -56,10 +57,12 @@ tests/test_cli.py
 tests/test_config.py
 tests/test_dashboard.py
 tests/test_dataset_status.py
+tests/test_deploy.py
 tests/test_farmer_setup_assets.py
 tests/test_ghreplica_api.py
 tests/test_github_api.py
 tests/test_hf_checkpoint_import.py
+tests/test_hf_cli_secrets_check.py
 tests/test_http.py
 tests/test_links.py
 tests/test_new_contributor_report.py
@@ -68,7 +71,11 @@ tests/test_pipeline_checkpoint_resume.py
 tests/test_pr_scope.py
 tests/test_pr_search.py
 tests/test_pr_search_api.py
-tests/test_publish.py
+tests/test_publish_analysis.py
+tests/test_published_layout.py
+tests/test_save_cache.py
 tests/test_snapshot_state.py
+tests/test_submit_dataset_job.py
+tests/test_update_dashboard_scripts.py
 tests/test_update_transformers_dataset.py
 tests/test_viewer_layout.py
src/slop_farmer/app/cli.py CHANGED
@@ -23,6 +23,7 @@ from slop_farmer.config import (
     PrSearchRefreshOptions,
     PublishAnalysisArtifactsOptions,
     RepoRef,
+    SaveCacheOptions,
     SnapshotAdoptOptions,
 )
 from slop_farmer.reports.duplicate_prs import DEFAULT_DUPLICATE_PR_MODEL
@@ -63,6 +64,7 @@ def build_parser(*, config_path: Path | None = None) -> argparse.ArgumentParser:
     _add_new_contributor_report_parser(subparsers, defaults["new-contributor-report"])
     _add_dashboard_data_parser(subparsers, defaults["dashboard-data"])
     _add_publish_analysis_artifacts_parser(subparsers, defaults["publish-analysis-artifacts"])
+    _add_save_cache_parser(subparsers, defaults["save-cache"])
     _add_deploy_dashboard_parser(subparsers, defaults["deploy-dashboard"])
     _add_dataset_status_parser(subparsers, defaults["dataset-status"])
     return parser
@@ -80,6 +82,7 @@ def _load_parser_defaults(config_path: Path | None) -> dict[str, dict[str, Any]]
         "new-contributor-report",
         "dashboard-data",
         "publish-analysis-artifacts",
+        "save-cache",
         "deploy-dashboard",
         "dataset-status",
     )
@@ -897,6 +900,11 @@ def _add_publish_analysis_artifacts_parser(subparsers: Any, defaults: dict[str,
         type=Path,
         help="Optional explicit snapshot directory containing analysis-report-hybrid.json.",
     )
+    publish_analysis.add_argument(
+        "--analysis-input",
+        type=Path,
+        help="Optional explicit hybrid analysis report JSON to publish instead of snapshot-dir discovery.",
+    )
     publish_analysis.add_argument(
         "--hf-repo-id",
         default=defaults.get("hf-repo-id"),
@@ -910,6 +918,12 @@ def _add_publish_analysis_artifacts_parser(subparsers: Any, defaults: dict[str,
         default=bool(defaults.get("canonical", False)),
         help="Also update the stable analysis/current canonical alias.",
     )
+    publish_analysis.add_argument(
+        "--save-cache",
+        action="store_true",
+        default=bool(defaults.get("save-cache", False)),
+        help="Also upload snapshot-local analysis-state/ as mutable operational cache at repo-root analysis-state/.",
+    )
     publish_analysis.add_argument(
         "--private-hf-repo",
         action="store_true",
@@ -918,6 +932,36 @@ def _add_publish_analysis_artifacts_parser(subparsers: Any, defaults: dict[str,
     )


+def _add_save_cache_parser(subparsers: Any, defaults: dict[str, Any]) -> None:
+    save_cache = subparsers.add_parser(
+        "save-cache",
+        help="Upload snapshot-local analysis-state/ as mutable operational cache to a dataset repo.",
+    )
+    save_cache.add_argument(
+        "--output-dir",
+        type=Path,
+        default=Path(defaults.get("output-dir", "data")),
+        help="Pipeline workspace root containing snapshots/latest.json.",
+    )
+    save_cache.add_argument(
+        "--snapshot-dir",
+        type=Path,
+        help="Optional explicit snapshot directory containing analysis-state/.",
+    )
+    save_cache.add_argument(
+        "--hf-repo-id",
+        default=defaults.get("hf-repo-id"),
+        required=defaults.get("hf-repo-id") is None,
+        help="Target Hugging Face dataset repo id.",
+    )
+    save_cache.add_argument(
+        "--private-hf-repo",
+        action="store_true",
+        default=bool(defaults.get("private-hf-repo", False)),
+        help="Create the target dataset repo as private if needed.",
+    )
+
+
 def _add_deploy_dashboard_parser(subparsers: Any, defaults: dict[str, Any]) -> None:
     deploy_dashboard = subparsers.add_parser(
         "deploy-dashboard",
@@ -1512,9 +1556,30 @@ def _run_publish_analysis_artifacts(args: argparse.Namespace, config_path: Path
                 PublishAnalysisArtifactsOptions(
                     output_dir=args.output_dir,
                     snapshot_dir=args.snapshot_dir,
+                    analysis_input=args.analysis_input,
                     hf_repo_id=args.hf_repo_id,
                     analysis_id=args.analysis_id,
                     canonical=args.canonical,
+                    save_cache=args.save_cache,
+                    private_hf_repo=args.private_hf_repo,
+                )
+            ),
+            indent=2,
+        )
+    )
+
+
+def _run_save_cache(args: argparse.Namespace, config_path: Path | None) -> None:
+    del config_path
+    from slop_farmer.app.save_cache import run_save_cache
+
+    print(
+        json.dumps(
+            run_save_cache(
+                SaveCacheOptions(
+                    output_dir=args.output_dir,
+                    snapshot_dir=args.snapshot_dir,
+                    hf_repo_id=args.hf_repo_id,
                     private_hf_repo=args.private_hf_repo,
                 )
             ),
@@ -1543,6 +1608,7 @@ def main() -> None:
         "deploy-dashboard": _run_deploy_dashboard,
         "dataset-status": _run_dataset_status,
         "publish-analysis-artifacts": _run_publish_analysis_artifacts,
+        "save-cache": _run_save_cache,
     }
     handler = handlers.get(args.command)
     if handler is None:
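A small usage check of the wiring added above; the `command` attribute is taken from the `handlers.get(args.command)` dispatch shown in the diff, and the asserted defaults assume no repo config overrides them:

```python
from slop_farmer.app.cli import build_parser

# Exercise the new subcommand end-to-end through the parser; flag names come
# from the diff above.
args = build_parser().parse_args(
    ["save-cache", "--hf-repo-id", "evalstate/transformers-pr"]
)
assert args.command == "save-cache"
assert str(args.output_dir) == "data"  # parser default workspace root
assert args.snapshot_dir is None       # falls back to snapshots/latest.json
assert args.private_hf_repo is False
```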
src/slop_farmer/app/pr_search_api.py CHANGED
@@ -1,7 +1,8 @@
 from __future__ import annotations
 
+import asyncio
 import os
-from contextlib import asynccontextmanager
+from contextlib import asynccontextmanager, suppress
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Literal
@@ -11,10 +12,19 @@ from fastapi.responses import JSONResponse
 
 from slop_farmer.config import PrSearchRefreshOptions
 from slop_farmer.data.ghreplica_api import GhReplicaProbeUnavailableError, GhrProbeClient
-from slop_farmer.data.snapshot_materialize import materialize_hf_dataset_snapshot
+from slop_farmer.data.snapshot_materialize import (
+    load_hf_current_analysis_manifest,
+    load_hf_current_search_manifest,
+    materialize_hf_current_analysis_surface,
+    materialize_hf_current_search_index,
+    materialize_hf_dataset_snapshot,
+)
 from slop_farmer.data.snapshot_paths import (
     CURRENT_ANALYSIS_MANIFEST_PATH,
     default_hf_materialize_dir,
+    load_current_analysis_manifest,
+    load_current_search_manifest,
+    repo_relative_path_to_local,
 )
 from slop_farmer.reports.analysis_service import (
     get_analysis_best,
@@ -64,6 +74,8 @@ class PrSearchApiSettings:
     http_max_retries: int = 5
     refresh_if_missing: bool = False
     rebuild_on_start: bool = False
+    current_analysis_poll_seconds: int = 0
+    current_search_poll_seconds: int = 0
     include_drafts: bool = False
     include_closed: bool = False
     similar_limit_default: int = 10
@@ -100,6 +112,8 @@ class PrSearchApiSettings:
             http_max_retries=_env_int("HTTP_MAX_RETRIES", 5),
             refresh_if_missing=_env_bool("REFRESH_IF_MISSING", False),
             rebuild_on_start=_env_bool("REBUILD_ON_START", False),
+            current_analysis_poll_seconds=_env_int("CURRENT_ANALYSIS_POLL_SECONDS", 0),
+            current_search_poll_seconds=_env_int("CURRENT_SEARCH_POLL_SECONDS", 0),
             include_drafts=_env_bool("INCLUDE_DRAFTS", False),
             include_closed=_env_bool("INCLUDE_CLOSED", False),
             similar_limit_default=_env_int("SIMILAR_LIMIT_DEFAULT", 10),
@@ -125,13 +139,28 @@ def create_app(settings: PrSearchApiSettings | None = None) -> FastAPI:
         app.state.settings = api_settings
         app.state.ready = False
         app.state.startup_error = None
+        app.state.current_analysis_refresh_error = None
+        app.state.current_search_refresh_error = None
+        refresh_task: asyncio.Task[None] | None = None
         try:
            _bootstrap_snapshot_assets(api_settings)
+            _bootstrap_current_search_index(api_settings)
            _bootstrap_index(api_settings)
            app.state.ready = _is_ready(api_settings)
+            if (
+                api_settings.current_analysis_poll_seconds > 0
+                or api_settings.current_search_poll_seconds > 0
+            ):
+                refresh_task = asyncio.create_task(_run_current_asset_refresh_loop(app))
         except Exception as exc:
             app.state.startup_error = str(exc)
-        yield
+        try:
+            yield
+        finally:
+            if refresh_task is not None:
+                refresh_task.cancel()
+                with suppress(asyncio.CancelledError):
+                    await refresh_task
 
     app = FastAPI(title="slop PR search API", version="0.1.1", lifespan=lifespan)
 
@@ -628,6 +657,84 @@ def _bootstrap_snapshot_assets(settings: PrSearchApiSettings) -> None:
     )
 
 
+def _bootstrap_current_search_index(settings: PrSearchApiSettings) -> None:
+    if settings.snapshot_dir is not None or settings.hf_repo_id is None:
+        return
+    _refresh_current_search_index(settings)
+
+
+async def _run_current_asset_refresh_loop(app: FastAPI) -> None:
+    settings = app.state.settings
+    interval = min_non_zero(
+        settings.current_analysis_poll_seconds,
+        settings.current_search_poll_seconds,
+    )
+    while True:
+        await asyncio.sleep(interval)
+        if settings.current_search_poll_seconds > 0:
+            try:
+                _refresh_current_search_index(settings)
+            except Exception as exc:
+                app.state.current_search_refresh_error = str(exc)
+            else:
+                app.state.current_search_refresh_error = None
+        if settings.current_analysis_poll_seconds > 0:
+            try:
+                _refresh_current_analysis_surface(settings)
+            except Exception as exc:
+                app.state.current_analysis_refresh_error = str(exc)
+            else:
+                app.state.current_analysis_refresh_error = None
+
+
+def _refresh_current_analysis_surface(settings: PrSearchApiSettings) -> bool:
+    if settings.hf_repo_id is None or settings.snapshot_dir is not None:
+        return False
+    remote_manifest = load_hf_current_analysis_manifest(
+        repo_id=settings.hf_repo_id,
+        revision=settings.hf_revision,
+    )
+    if remote_manifest is None:
+        return False
+    local_manifest = _load_local_current_analysis_manifest(settings)
+    if _analysis_manifest_identity(local_manifest) == _analysis_manifest_identity(remote_manifest):
+        return False
+    materialize_hf_current_analysis_surface(
+        repo_id=settings.hf_repo_id,
+        local_dir=_materialized_snapshot_dir(settings) or settings.output_dir,
+        revision=settings.hf_revision,
+    )
+    return True
+
+
+def _refresh_current_search_index(settings: PrSearchApiSettings) -> bool:
+    if settings.hf_repo_id is None or settings.snapshot_dir is not None:
+        return False
+    remote_manifest = load_hf_current_search_manifest(
+        repo_id=settings.hf_repo_id,
+        revision=settings.hf_revision,
+    )
+    if remote_manifest is None:
+        return False
+    local_manifest = _load_local_current_search_manifest(settings)
+    if _search_manifest_identity(local_manifest) == _search_manifest_identity(remote_manifest):
+        return False
+    manifest_path = _local_current_search_manifest_path(settings)
+    staged_db_path = settings.index_path.with_name(f".{settings.index_path.name}.download")
+    staged_manifest_path = manifest_path.with_name(f".{manifest_path.name}.download")
+    materialize_hf_current_search_index(
+        repo_id=settings.hf_repo_id,
+        db_path=staged_db_path,
+        manifest_path=staged_manifest_path,
+        revision=settings.hf_revision,
+    )
+    get_pr_search_status(staged_db_path, repo=settings.default_repo)
+    settings.index_path.parent.mkdir(parents=True, exist_ok=True)
+    staged_db_path.replace(settings.index_path)
+    staged_manifest_path.replace(manifest_path)
+    return True
+
+
 def _needs_refresh(settings: PrSearchApiSettings) -> bool:
     if settings.rebuild_on_start:
         return True
@@ -714,6 +821,64 @@ def _surface_available(snapshot_dir: Path, *, surface: Literal["issues", "contri
     return (snapshot_dir / "new-contributors-report.json").exists()
 
 
+def _load_local_current_analysis_manifest(settings: PrSearchApiSettings) -> dict[str, Any] | None:
+    materialized_snapshot_dir = _materialized_snapshot_dir(settings)
+    if materialized_snapshot_dir is None:
+        return None
+    manifest_path = repo_relative_path_to_local(
+        materialized_snapshot_dir, CURRENT_ANALYSIS_MANIFEST_PATH
+    )
+    if not manifest_path.exists():
+        return None
+    return load_current_analysis_manifest(manifest_path)
+
+
+def _local_current_search_manifest_path(settings: PrSearchApiSettings) -> Path:
+    return settings.index_path.with_name("current-search-manifest.json")
+
+
+def _load_local_current_search_manifest(settings: PrSearchApiSettings) -> dict[str, Any] | None:
+    manifest_path = _local_current_search_manifest_path(settings)
+    if not manifest_path.exists():
+        return None
+    return load_current_search_manifest(manifest_path)
+
+
+def _analysis_manifest_identity(
+    payload: dict[str, Any] | None,
+) -> tuple[str | None, str | None, str | None]:
+    if payload is None:
+        return (None, None, None)
+    snapshot_id = payload.get("snapshot_id")
+    analysis_id = payload.get("analysis_id")
+    published_at = payload.get("published_at")
+    return (
+        None if snapshot_id is None else str(snapshot_id),
+        None if analysis_id is None else str(analysis_id),
+        None if published_at is None else str(published_at),
+    )
+
+
+def _search_manifest_identity(
+    payload: dict[str, Any] | None,
+) -> tuple[str | None, str | None, str | None]:
+    if payload is None:
+        return (None, None, None)
+    snapshot_id = payload.get("snapshot_id")
+    run_id = payload.get("run_id")
+    published_at = payload.get("published_at")
+    return (
+        None if snapshot_id is None else str(snapshot_id),
+        None if run_id is None else str(run_id),
+        None if published_at is None else str(published_at),
+    )
+
+
+def min_non_zero(*values: int) -> int:
+    candidates = [value for value in values if value > 0]
+    return min(candidates) if candidates else 300
+
+
 def _limit(value: int | None, *, default: int, maximum: int) -> int:
     limit = default if value is None else value
     if limit < 1:
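`min_non_zero` above is small enough to pin down with a few checks; this mirrors the function exactly as added in the diff:

```python
from slop_farmer.app.pr_search_api import min_non_zero

# The refresh loop wakes at the shortest enabled interval; zero means a poller
# is disabled, and 300 seconds is the fallback when everything is disabled.
assert min_non_zero(600, 0) == 600
assert min_non_zero(600, 120) == 120
assert min_non_zero(0, 0) == 300
```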
src/slop_farmer/app/publish_analysis.py CHANGED
@@ -9,6 +9,7 @@ from typing import Any, Protocol, cast
 
 from huggingface_hub import CommitOperationAdd, HfApi, hf_hub_download
 
+from slop_farmer.app.save_cache import _save_analysis_cache_api
 from slop_farmer.config import PublishAnalysisArtifactsOptions
 from slop_farmer.data.parquet_io import read_json
 from slop_farmer.data.snapshot_paths import (
@@ -44,6 +45,16 @@ class HubApiLike:
         repo_type: str,
     ) -> Any: ...
 
+    def upload_folder(
+        self,
+        *,
+        repo_id: str,
+        folder_path: Path,
+        path_in_repo: str,
+        repo_type: str,
+        commit_message: str,
+    ) -> None: ...
+
 
 @dataclass(frozen=True, slots=True)
 class PublishableAnalysisArtifacts:
@@ -59,9 +70,11 @@ def run_publish_analysis_artifacts(options: PublishAnalysisArtifactsOptions) ->
     snapshot_dir = resolve_snapshot_dir_from_output(options.output_dir, options.snapshot_dir)
     return publish_analysis_artifacts(
         snapshot_dir=snapshot_dir,
+        analysis_input=options.analysis_input,
         hf_repo_id=options.hf_repo_id,
         analysis_id=options.analysis_id,
         canonical=options.canonical,
+        save_cache=options.save_cache,
         private=options.private_hf_repo,
     )
 
@@ -69,19 +82,23 @@
 def publish_analysis_artifacts(
     *,
     snapshot_dir: Path,
+    analysis_input: Path | None,
     hf_repo_id: str,
     analysis_id: str,
     canonical: bool,
     private: bool,
+    save_cache: bool = False,
     log: Callable[[str], None] | None = None,
 ) -> dict[str, Any]:
     return _publish_analysis_artifacts_api(
         cast("HubApiLike", HfApi()),
         snapshot_dir=snapshot_dir,
+        analysis_input=analysis_input,
         hf_repo_id=hf_repo_id,
         analysis_id=analysis_id,
         canonical=canonical,
         private=private,
+        save_cache=save_cache,
         log=log,
     )
 
@@ -90,13 +107,15 @@ def _publish_analysis_artifacts_api(
     api: HubApiLike,
     *,
     snapshot_dir: Path,
+    analysis_input: Path | None = None,
     hf_repo_id: str,
     analysis_id: str,
     canonical: bool,
     private: bool,
+    save_cache: bool = False,
     log: Callable[[str], None] | None = None,
 ) -> dict[str, Any]:
-    artifacts = _discover_publishable_analysis(snapshot_dir)
+    artifacts = _discover_publishable_analysis(snapshot_dir, analysis_input=analysis_input)
     published_at = _iso_now()
     channel = "canonical" if canonical else "comparison"
     archived_manifest = build_archived_analysis_run_manifest(
@@ -150,21 +169,37 @@
         commit_message=f"Publish analysis {analysis_id} for snapshot {artifacts.snapshot_id}",
         repo_type="dataset",
     )
-    result = {
+    cache_result = (
+        _save_analysis_cache_api(
+            api,
+            snapshot_dir=snapshot_dir,
+            hf_repo_id=hf_repo_id,
+            private=private,
+            log=log,
+        )
+        if save_cache
+        else None
+    )
+    result: dict[str, Any] = {
         "repo": artifacts.repo,
         "dataset_id": hf_repo_id,
         "snapshot_id": artifacts.snapshot_id,
        "analysis_id": analysis_id,
         "canonical": canonical,
+        "save_cache": save_cache,
         "published_at": published_at,
         "artifact_paths": [operation.path_in_repo for operation in operations],
     }
+    if cache_result is not None:
+        result["cache"] = cache_result
     if log:
         log(f"Published analysis artifacts to {hf_repo_id}")
     return result
 
 
-def _discover_publishable_analysis(snapshot_dir: Path) -> PublishableAnalysisArtifacts:
+def _discover_publishable_analysis(
+    snapshot_dir: Path, *, analysis_input: Path | None
+) -> PublishableAnalysisArtifacts:
     manifest_path = snapshot_dir / ROOT_MANIFEST_FILENAME
     if not manifest_path.exists():
         raise FileNotFoundError(f"Snapshot manifest is missing: {manifest_path}")
@@ -176,7 +211,11 @@ def _discover_publishable_analysis(
     if not repo:
         raise ValueError(f"Snapshot manifest at {manifest_path} does not define repo.")
 
-    report_path = snapshot_dir / ANALYSIS_REPORT_FILENAME_BY_VARIANT["hybrid"]
+    report_path = (
+        analysis_input.resolve()
+        if analysis_input is not None
+        else snapshot_dir / ANALYSIS_REPORT_FILENAME_BY_VARIANT["hybrid"]
+    )
     if not report_path.exists():
         raise FileNotFoundError(f"Hybrid analysis report is missing: {report_path}")
     report_payload = read_json(report_path)
@@ -196,7 +235,7 @@ def _discover_publishable_analysis(
     if model is not None:
         model = str(model)
 
-    reviews_path = snapshot_dir / HYBRID_ANALYSIS_REVIEWS_FILENAME
+    reviews_path = report_path.with_name(f"{report_path.stem}.llm-reviews.json")
     return PublishableAnalysisArtifacts(
         repo=repo,
         snapshot_id=snapshot_id,
@@ -266,12 +305,13 @@ def _commit_operations(
     current_manifest: dict[str, Any] | None,
     snapshot_manifest: dict[str, Any],
 ) -> list[CommitOperationAdd]:
+    report_filename = ANALYSIS_REPORT_FILENAME_BY_VARIANT["hybrid"]
     operations = [
         CommitOperationAdd(
             path_in_repo=analysis_run_artifact_path(
                 artifacts.snapshot_id,
                 analysis_id,
-                artifacts.report_path.name,
+                report_filename,
             ),
             path_or_fileobj=artifacts.report_path,
         ),
@@ -290,7 +330,7 @@
             path_in_repo=analysis_run_artifact_path(
                 artifacts.snapshot_id,
                 analysis_id,
-                artifacts.reviews_path.name,
+                HYBRID_ANALYSIS_REVIEWS_FILENAME,
             ),
             path_or_fileobj=artifacts.reviews_path,
         )
@@ -299,7 +339,7 @@
     operations.extend(
         [
             CommitOperationAdd(
-                path_in_repo=current_analysis_artifact_path(artifacts.report_path.name),
+                path_in_repo=current_analysis_artifact_path(report_filename),
                 path_or_fileobj=artifacts.report_path,
             ),
             CommitOperationAdd(
@@ -311,7 +351,7 @@
     if artifacts.reviews_path is not None:
         operations.append(
             CommitOperationAdd(
-                path_in_repo=current_analysis_artifact_path(artifacts.reviews_path.name),
+                path_in_repo=current_analysis_artifact_path(HYBRID_ANALYSIS_REVIEWS_FILENAME),
                path_or_fileobj=artifacts.reviews_path,
            )
        )
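The `--analysis-input` path also determines where the companion reviews file is looked up; the naming rule from `_discover_publishable_analysis` above is plain `pathlib` and easy to verify in isolation:

```python
from pathlib import Path

# Derivation used when --analysis-input is given: the reviews file sits next
# to the report, sharing its stem.
report_path = Path("/tmp/analysis-report-hybrid.json")
reviews_path = report_path.with_name(f"{report_path.stem}.llm-reviews.json")
assert reviews_path == Path("/tmp/analysis-report-hybrid.llm-reviews.json")
```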
src/slop_farmer/app/publish_pr_search_index.py ADDED
@@ -0,0 +1,141 @@
+from __future__ import annotations
+
+import json
+from collections.abc import Callable
+from dataclasses import dataclass
+from datetime import UTC, datetime
+from pathlib import Path
+from typing import Any, Protocol, cast
+
+from huggingface_hub import CommitOperationAdd, HfApi
+
+from slop_farmer.data.snapshot_paths import (
+    CURRENT_SEARCH_DB_PATH,
+    CURRENT_SEARCH_MANIFEST_PATH,
+    build_current_search_manifest,
+)
+from slop_farmer.reports.pr_search_service import get_pr_search_status
+
+
+class HubApiLike(Protocol):
+    def create_repo(
+        self,
+        repo_id: str,
+        *,
+        repo_type: str,
+        private: bool,
+        exist_ok: bool,
+    ) -> None: ...
+
+    def create_commit(
+        self,
+        repo_id: str,
+        operations: list[CommitOperationAdd],
+        *,
+        commit_message: str,
+        repo_type: str,
+    ) -> Any: ...
+
+
+@dataclass(frozen=True, slots=True)
+class PublishablePrSearchIndex:
+    repo: str
+    snapshot_id: str
+    run_id: str
+    db_path: Path
+    source_type: str | None
+    hf_repo_id: str | None
+    hf_revision: str | None
+    row_counts: dict[str, Any]
+
+
+def publish_current_pr_search_index(
+    *,
+    db_path: Path,
+    hf_repo_id: str,
+    private: bool = False,
+    log: Callable[[str], None] | None = None,
+) -> dict[str, Any]:
+    return _publish_current_pr_search_index_api(
+        cast("HubApiLike", HfApi()),
+        db_path=db_path,
+        hf_repo_id=hf_repo_id,
+        private=private,
+        log=log,
+    )
+
+
+def _publish_current_pr_search_index_api(
+    api: HubApiLike,
+    *,
+    db_path: Path,
+    hf_repo_id: str,
+    private: bool = False,
+    log: Callable[[str], None] | None = None,
+) -> dict[str, Any]:
+    publishable = _discover_publishable_pr_search_index(db_path)
+    published_at = _iso_now()
+    manifest = build_current_search_manifest(
+        repo=publishable.repo,
+        snapshot_id=publishable.snapshot_id,
+        run_id=publishable.run_id,
+        published_at=published_at,
+        source_type=publishable.source_type,
+        hf_repo_id=publishable.hf_repo_id,
+        hf_revision=publishable.hf_revision,
+        row_counts=publishable.row_counts,
+    )
+    operations = [
+        CommitOperationAdd(
+            path_in_repo=CURRENT_SEARCH_DB_PATH,
+            path_or_fileobj=publishable.db_path,
+        ),
+        CommitOperationAdd(
+            path_in_repo=CURRENT_SEARCH_MANIFEST_PATH,
+            path_or_fileobj=_json_bytes(manifest),
+        ),
+    ]
+    if log:
+        log(f"Ensuring Hub dataset repo exists: {hf_repo_id}")
+    api.create_repo(hf_repo_id, repo_type="dataset", private=private, exist_ok=True)
+    if log:
+        log(f"Publishing PR search index run {publishable.run_id} for {publishable.snapshot_id}")
+    api.create_commit(
+        hf_repo_id,
+        operations,
+        commit_message=f"Publish PR search index {publishable.run_id} for {publishable.snapshot_id}",
+        repo_type="dataset",
+    )
+    result = {
+        "repo": publishable.repo,
+        "dataset_id": hf_repo_id,
+        "snapshot_id": publishable.snapshot_id,
+        "run_id": publishable.run_id,
+        "published_at": published_at,
+        "artifact_paths": [operation.path_in_repo for operation in operations],
+    }
+    if log:
+        log(f"Published PR search index to {hf_repo_id}")
+    return result
+
+
+def _discover_publishable_pr_search_index(db_path: Path) -> PublishablePrSearchIndex:
+    status = get_pr_search_status(db_path)
+    return PublishablePrSearchIndex(
+        repo=str(status["repo"]),
+        snapshot_id=str(status["snapshot_id"]),
+        run_id=str(status["id"]),
+        db_path=db_path.resolve(),
+        source_type=(None if status.get("source_type") is None else str(status.get("source_type"))),
+        hf_repo_id=None if status.get("hf_repo_id") is None else str(status.get("hf_repo_id")),
+        hf_revision=(None if status.get("hf_revision") is None else str(status.get("hf_revision"))),
+        row_counts=dict(status.get("row_counts") or {}),
+    )
+
+
+def _json_bytes(payload: dict[str, Any]) -> bytes:
+    return (json.dumps(payload, indent=2, sort_keys=True) + "\n").encode("utf-8")
+
+
+def _iso_now() -> str:
+    return datetime.now(tz=UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z")
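A hedged usage sketch of the new module; the db path below is an illustrative local index location, and the call needs an authenticated Hub token with write access:

```python
from pathlib import Path

from slop_farmer.app.publish_pr_search_index import publish_current_pr_search_index

# Publishes the local SQLite index plus a current-search manifest to the
# dataset repo (at CURRENT_SEARCH_DB_PATH / CURRENT_SEARCH_MANIFEST_PATH).
# Discovery reads run metadata from the index via get_pr_search_status, so the
# db must be a real, already-built index.
result = publish_current_pr_search_index(
    db_path=Path("data/pr-search.sqlite3"),  # assumed local index location
    hf_repo_id="evalstate/transformers-pr",
    log=print,
)
print(result["run_id"], result["artifact_paths"])
```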
src/slop_farmer/app/save_cache.py ADDED
@@ -0,0 +1,115 @@
+from __future__ import annotations
+
+from collections.abc import Callable
+from pathlib import Path
+from typing import Any, Protocol, cast
+
+from huggingface_hub import HfApi
+
+from slop_farmer.config import SaveCacheOptions
+from slop_farmer.data.parquet_io import read_json
+from slop_farmer.data.snapshot_paths import ROOT_MANIFEST_FILENAME, resolve_snapshot_dir_from_output
+
+ANALYSIS_STATE_DIRNAME = "analysis-state"
+
+
+class HubApiLike(Protocol):
+    def create_repo(
+        self,
+        repo_id: str,
+        *,
+        repo_type: str,
+        private: bool,
+        exist_ok: bool,
+    ) -> None: ...
+
+    def upload_folder(
+        self,
+        *,
+        repo_id: str,
+        folder_path: Path,
+        path_in_repo: str,
+        repo_type: str,
+        commit_message: str,
+    ) -> None: ...
+
+
+def run_save_cache(options: SaveCacheOptions) -> dict[str, Any]:
+    snapshot_dir = resolve_snapshot_dir_from_output(options.output_dir, options.snapshot_dir)
+    return save_analysis_cache(
+        snapshot_dir=snapshot_dir,
+        hf_repo_id=options.hf_repo_id,
+        private=options.private_hf_repo,
+    )
+
+
+def save_analysis_cache(
+    *,
+    snapshot_dir: Path,
+    hf_repo_id: str,
+    private: bool,
+    log: Callable[[str], None] | None = None,
+) -> dict[str, Any]:
+    return _save_analysis_cache_api(
+        cast("HubApiLike", HfApi()),
+        snapshot_dir=snapshot_dir,
+        hf_repo_id=hf_repo_id,
+        private=private,
+        log=log,
+    )
+
+
+def _save_analysis_cache_api(
+    api: HubApiLike,
+    *,
+    snapshot_dir: Path,
+    hf_repo_id: str,
+    private: bool,
+    log: Callable[[str], None] | None = None,
+) -> dict[str, Any]:
+    cache_dir = snapshot_dir / ANALYSIS_STATE_DIRNAME
+    if not cache_dir.exists():
+        raise FileNotFoundError(f"Analysis cache directory is missing: {cache_dir}")
+    if not cache_dir.is_dir():
+        raise NotADirectoryError(f"Analysis cache path is not a directory: {cache_dir}")
+    artifact_paths = _cache_artifact_paths(cache_dir)
+    if not artifact_paths:
+        raise ValueError(f"Analysis cache directory is empty: {cache_dir}")
+
+    manifest_path = snapshot_dir / ROOT_MANIFEST_FILENAME
+    manifest = read_json(manifest_path) if manifest_path.exists() else {}
+    if not isinstance(manifest, dict):
+        raise ValueError(f"Snapshot manifest at {manifest_path} must contain a JSON object.")
+    snapshot_id = str(manifest.get("snapshot_id") or snapshot_dir.name).strip()
+    repo = str(manifest.get("repo") or "").strip()
+
+    if log:
+        log(f"Ensuring Hub dataset repo exists: {hf_repo_id}")
+    api.create_repo(hf_repo_id, repo_type="dataset", private=private, exist_ok=True)
+    if log:
+        log(f"Saving analysis cache for snapshot {snapshot_id}")
+    api.upload_folder(
+        repo_id=hf_repo_id,
+        folder_path=cache_dir,
+        path_in_repo=ANALYSIS_STATE_DIRNAME,
+        repo_type="dataset",
+        commit_message=f"Save analysis cache for snapshot {snapshot_id}",
+    )
+    result = {
+        "dataset_id": hf_repo_id,
+        "snapshot_id": snapshot_id,
+        "artifact_paths": [f"{ANALYSIS_STATE_DIRNAME}/{path}" for path in artifact_paths],
+    }
+    if repo:
+        result["repo"] = repo
+    if log:
+        log(f"Saved analysis cache to {hf_repo_id}")
+    return result
+
+
+def _cache_artifact_paths(cache_dir: Path) -> list[str]:
+    return sorted(
+        str(path.relative_to(cache_dir).as_posix())
+        for path in cache_dir.rglob("*")
+        if path.is_file()
+    )
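The upload summary's `artifact_paths` come from `_cache_artifact_paths` above; its behavior is easy to verify in isolation:

```python
import tempfile
from pathlib import Path

from slop_farmer.app.save_cache import _cache_artifact_paths

# The helper lists files recursively, as sorted POSIX paths relative to the
# cache directory; directories themselves are skipped.
with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    (cache_dir / "hybrid").mkdir()
    (cache_dir / "hybrid" / "reviews.json").write_text("{}")
    (cache_dir / "state.json").write_text("{}")
    assert _cache_artifact_paths(cache_dir) == ["hybrid/reviews.json", "state.json"]
```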
src/slop_farmer/app_config.py CHANGED
@@ -234,6 +234,10 @@ def _dashboard_config_defaults(config_path: Path) -> dict[str, dict[str, Any]]:
             "output-dir": str(data_dir) if data_dir else None,
             "hf-repo-id": dataset_id,
         },
+        "save-cache": {
+            "output-dir": str(data_dir) if data_dir else None,
+            "hf-repo-id": dataset_id,
+        },
         "deploy-dashboard": {
             "pipeline-data-dir": str(data_dir) if data_dir else None,
             "web-dir": str(web_dir) if web_dir else None,
src/slop_farmer/config.py CHANGED
@@ -244,9 +244,19 @@ class DatasetRefreshOptions:
 class PublishAnalysisArtifactsOptions:
     output_dir: Path
     snapshot_dir: Path | None
+    analysis_input: Path | None
     hf_repo_id: str
     analysis_id: str
     canonical: bool = False
+    save_cache: bool = False
+    private_hf_repo: bool = False
+
+
+@dataclass(slots=True)
+class SaveCacheOptions:
+    output_dir: Path
+    snapshot_dir: Path | None
+    hf_repo_id: str
     private_hf_repo: bool = False
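Constructing the new options object directly, with field names from the diff above; the dataset id is this deployment's, the paths are illustrative:

```python
from pathlib import Path

from slop_farmer.config import SaveCacheOptions

# snapshot_dir=None defers to the workspace's snapshots/latest.json pointer,
# matching the CLI defaults shown earlier in this commit.
options = SaveCacheOptions(
    output_dir=Path("data"),
    snapshot_dir=None,
    hf_repo_id="evalstate/transformers-pr",
)
assert options.private_hf_repo is False  # default from the dataclass
```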
261
 
262
 
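A sketch of how the new dataclass might be filled in (the field names come from this diff; the concrete values and the latest-snapshot fallback are assumptions):

```python
from pathlib import Path

# Hypothetical values; in the CLI these would come from flags such as
# --output-dir and --hf-repo-id (see the save-cache defaults above).
options = SaveCacheOptions(
    output_dir=Path("data"),
    snapshot_dir=None,  # presumably resolved via snapshots/latest.json when omitted
    hf_repo_id="burtenshaw/transformers-pr-slop-dataset",
    private_hf_repo=False,
)
```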
src/slop_farmer/data/snapshot_materialize.py CHANGED
@@ -15,6 +15,8 @@ from slop_farmer.data.parquet_io import read_json, write_text
 from slop_farmer.data.snapshot_paths import (
     CONTRIBUTOR_ARTIFACT_FILENAMES,
     CURRENT_ANALYSIS_MANIFEST_PATH,
+     CURRENT_SEARCH_DB_PATH,
+     CURRENT_SEARCH_MANIFEST_PATH,
     LEGACY_ANALYSIS_FILENAMES,
     PR_SCOPE_CLUSTERS_FILENAME,
     RAW_TABLE_FILENAMES,
@@ -24,6 +26,7 @@ from slop_farmer.data.snapshot_paths import (
     STATE_WATERMARK_PATH,
     load_archived_analysis_run_manifest,
     load_current_analysis_manifest,
+     load_current_search_manifest,
     repo_relative_path_to_local,
 )
@@ -64,6 +67,121 @@ def materialize_hf_dataset_snapshot(
     )


+ def load_hf_current_analysis_manifest(
+     *,
+     repo_id: str,
+     revision: str | None = None,
+ ) -> dict[str, Any] | None:
+     try:
+         downloaded = Path(
+             hf_hub_download(
+                 repo_id=repo_id,
+                 repo_type="dataset",
+                 filename=CURRENT_ANALYSIS_MANIFEST_PATH,
+                 revision=revision,
+             )
+         )
+     except Exception:
+         return None
+     return load_current_analysis_manifest(downloaded)
+
+
+ def materialize_hf_current_analysis_surface(
+     *,
+     repo_id: str,
+     local_dir: Path,
+     revision: str | None = None,
+ ) -> Path:
+     local_dir.mkdir(parents=True, exist_ok=True)
+     manifest = load_hf_current_analysis_manifest(repo_id=repo_id, revision=revision)
+     if manifest is None:
+         raise FileNotFoundError(
+             f"HF dataset {repo_id} does not expose {CURRENT_ANALYSIS_MANIFEST_PATH!r}"
+         )
+
+     staged_downloads: list[tuple[Path, Path]] = []
+
+     def stage(repo_path: str, *, required: bool) -> None:
+         try:
+             downloaded = Path(
+                 hf_hub_download(
+                     repo_id=repo_id,
+                     repo_type="dataset",
+                     filename=repo_path,
+                     revision=revision,
+                 )
+             )
+         except Exception:
+             if required:
+                 raise
+             return
+         staged_downloads.append((downloaded, repo_relative_path_to_local(local_dir, repo_path)))
+
+     for repo_path in (
+         ROOT_MANIFEST_FILENAME,
+         SNAPSHOTS_LATEST_PATH,
+         "issues.parquet",
+         "pull_requests.parquet",
+     ):
+         stage(repo_path, required=False)
+
+     for repo_path in manifest.get("artifacts", {}).values():
+         if isinstance(repo_path, str) and repo_path:
+             stage(repo_path, required=True)
+
+     stage(CURRENT_ANALYSIS_MANIFEST_PATH, required=True)
+
+     for downloaded, destination in staged_downloads:
+         _copy_downloaded_file(downloaded, destination)
+
+     return local_dir
+
+
+ def load_hf_current_search_manifest(
+     *,
+     repo_id: str,
+     revision: str | None = None,
+ ) -> dict[str, Any] | None:
+     try:
+         downloaded = Path(
+             hf_hub_download(
+                 repo_id=repo_id,
+                 repo_type="dataset",
+                 filename=CURRENT_SEARCH_MANIFEST_PATH,
+                 revision=revision,
+             )
+         )
+     except Exception:
+         return None
+     return load_current_search_manifest(downloaded)
+
+
+ def materialize_hf_current_search_index(
+     *,
+     repo_id: str,
+     db_path: Path,
+     manifest_path: Path,
+     revision: str | None = None,
+ ) -> dict[str, Any]:
+     manifest = load_hf_current_search_manifest(repo_id=repo_id, revision=revision)
+     if manifest is None:
+         raise FileNotFoundError(
+             f"HF dataset {repo_id} does not expose {CURRENT_SEARCH_MANIFEST_PATH!r}"
+         )
+     db_repo_path = str(manifest.get("artifacts", {}).get("db") or CURRENT_SEARCH_DB_PATH)
+     downloaded_db = Path(
+         hf_hub_download(
+             repo_id=repo_id,
+             repo_type="dataset",
+             filename=db_repo_path,
+             revision=revision,
+         )
+     )
+     _copy_downloaded_file(downloaded_db, db_path)
+     write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n", manifest_path)
+     return manifest
+
+
 def _materialize_hf_snapshot_repo_snapshot(
     *,
     repo_id: str,
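A sketch of how the deployed Space could hydrate its search index at startup using the helper above (the local paths are illustrative; only the function and manifest keys come from this diff):

```python
from pathlib import Path

# Illustrative runtime location on the Space's local disk; the index is
# re-materialized from the published dataset repo on boot.
runtime_dir = Path("/tmp/slop-farmer")
runtime_dir.mkdir(parents=True, exist_ok=True)

manifest = materialize_hf_current_search_index(
    repo_id="evalstate/transformers-pr",
    db_path=runtime_dir / "pr-search.duckdb",
    manifest_path=runtime_dir / "search-manifest.json",
)
print(manifest["snapshot_id"], manifest.get("row_counts"))
```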
src/slop_farmer/data/snapshot_paths.py CHANGED
@@ -48,6 +48,10 @@ LEGACY_ANALYSIS_FILENAMES: tuple[str, ...] = (
 CURRENT_ANALYSIS_DIR = PurePosixPath("analysis/current")
 CURRENT_ANALYSIS_MANIFEST_PATH = str(CURRENT_ANALYSIS_DIR / ROOT_MANIFEST_FILENAME)
 ANALYSIS_MANIFEST_SCHEMA_VERSION = 1
+ CURRENT_SEARCH_DIR = PurePosixPath("search/current")
+ CURRENT_SEARCH_MANIFEST_PATH = str(CURRENT_SEARCH_DIR / ROOT_MANIFEST_FILENAME)
+ CURRENT_SEARCH_DB_PATH = str(CURRENT_SEARCH_DIR / "pr-search.duckdb")
+ SEARCH_MANIFEST_SCHEMA_VERSION = 1


 @dataclass(frozen=True, slots=True)
@@ -90,6 +94,10 @@ def current_analysis_artifact_path(filename: str) -> str:
     return str(CURRENT_ANALYSIS_DIR / filename)


+ def current_search_artifact_path(filename: str) -> str:
+     return str(CURRENT_SEARCH_DIR / filename)
+
+
 def repo_key(repo_slug: str) -> str:
     return _path_key(repo_slug)

@@ -195,6 +203,39 @@ def load_archived_analysis_run_manifest(path: Path) -> dict[str, Any]:
     return validate_archived_analysis_run_manifest(payload)


+ def build_current_search_manifest(
+     *,
+     repo: str,
+     snapshot_id: str,
+     run_id: str,
+     published_at: str,
+     source_type: str | None,
+     hf_repo_id: str | None,
+     hf_revision: str | None,
+     row_counts: dict[str, Any] | None,
+ ) -> dict[str, Any]:
+     payload = {
+         "schema_version": SEARCH_MANIFEST_SCHEMA_VERSION,
+         "repo": repo,
+         "snapshot_id": snapshot_id,
+         "run_id": run_id,
+         "published_at": published_at,
+         "source_type": source_type,
+         "hf_repo_id": hf_repo_id,
+         "hf_revision": hf_revision,
+         "artifacts": {"db": CURRENT_SEARCH_DB_PATH},
+         "row_counts": row_counts or {},
+     }
+     return validate_current_search_manifest(payload)
+
+
+ def load_current_search_manifest(path: Path) -> dict[str, Any]:
+     payload = read_json(path)
+     if not isinstance(payload, dict):
+         raise ValueError(f"Current search manifest at {path} must contain a JSON object.")
+     return validate_current_search_manifest(payload)
+
+
 def resolve_default_dashboard_analysis_report(
     snapshot_dir: Path,
 ) -> ResolvedAnalysisReportPath | None:
@@ -289,6 +330,52 @@ def validate_archived_analysis_run_manifest(payload: dict[str, Any]) -> dict[str, Any]:
     return _validate_analysis_manifest(payload, require_archived_artifacts=False)


+ def validate_current_search_manifest(payload: dict[str, Any]) -> dict[str, Any]:
+     schema_version = int(payload.get("schema_version", SEARCH_MANIFEST_SCHEMA_VERSION))
+     if schema_version != SEARCH_MANIFEST_SCHEMA_VERSION:
+         raise ValueError(
+             f"Current search manifest schema_version must be {SEARCH_MANIFEST_SCHEMA_VERSION}."
+         )
+     repo = str(payload.get("repo") or "").strip()
+     snapshot_id = str(payload.get("snapshot_id") or "").strip()
+     run_id = str(payload.get("run_id") or "").strip()
+     published_at = str(payload.get("published_at") or "").strip()
+     artifacts = payload.get("artifacts")
+     if not repo:
+         raise ValueError("Current search manifest must define repo.")
+     if not snapshot_id:
+         raise ValueError("Current search manifest must define snapshot_id.")
+     if not run_id:
+         raise ValueError("Current search manifest must define run_id.")
+     if not published_at:
+         raise ValueError("Current search manifest must define published_at.")
+     if not isinstance(artifacts, dict):
+         raise ValueError("Current search manifest must define an artifacts object.")
+     db_path = artifacts.get("db")
+     if db_path != CURRENT_SEARCH_DB_PATH:
+         raise ValueError(f"Current search manifest db artifact must be {CURRENT_SEARCH_DB_PATH!r}.")
+     return {
+         "schema_version": SEARCH_MANIFEST_SCHEMA_VERSION,
+         "repo": repo,
+         "snapshot_id": snapshot_id,
+         "run_id": run_id,
+         "published_at": published_at,
+         "source_type": (
+             None if payload.get("source_type") is None else str(payload.get("source_type"))
+         ),
+         "hf_repo_id": None if payload.get("hf_repo_id") is None else str(payload.get("hf_repo_id")),
+         "hf_revision": (
+             None if payload.get("hf_revision") is None else str(payload.get("hf_revision"))
+         ),
+         "artifacts": {"db": CURRENT_SEARCH_DB_PATH},
+         "row_counts": (
+             {str(key): value for key, value in payload.get("row_counts", {}).items()}
+             if isinstance(payload.get("row_counts"), dict)
+             else {}
+         ),
+     }
+
+
 def load_latest_snapshot_pointer(snapshots_root: Path) -> Path | None:
     resolved_snapshots_root = snapshots_root.resolve()
     latest_path = resolved_snapshots_root / "latest.json"
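To make the manifest contract concrete, a sketch of the round trip the two helpers above enforce (all values are illustrative; only the function names and required keys come from this diff):

```python
from datetime import datetime, timezone

# Illustrative values only; build_current_search_manifest pins the db
# artifact to CURRENT_SEARCH_DB_PATH and validates the result.
manifest = build_current_search_manifest(
    repo="huggingface/transformers",
    snapshot_id="example-snapshot",
    run_id="example-run",
    published_at=datetime.now(timezone.utc).isoformat(),
    source_type="hf_dataset",  # free-form string, passed through as-is
    hf_repo_id="evalstate/transformers-pr",
    hf_revision=None,
    row_counts={"pull_requests": 1000},
)
assert manifest["artifacts"]["db"] == CURRENT_SEARCH_DB_PATH
```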