Spaces: Deploy Transformers PR API (Running)

Browse files

- README.md +1 -7
- src/slop_farmer.egg-info/PKG-INFO +130 -42
- src/slop_farmer.egg-info/SOURCES.txt +10 -3
- src/slop_farmer/app/cli.py +66 -0
- src/slop_farmer/app/pr_search_api.py +168 -3
- src/slop_farmer/app/publish_analysis.py +49 -9
- src/slop_farmer/app/publish_pr_search_index.py +141 -0
- src/slop_farmer/app/save_cache.py +115 -0
- src/slop_farmer/app_config.py +4 -0
- src/slop_farmer/config.py +10 -0
- src/slop_farmer/data/snapshot_materialize.py +118 -0
- src/slop_farmer/data/snapshot_paths.py +87 -0
README.md
CHANGED

@@ -5,7 +5,7 @@ colorFrom: indigo
 colorTo: blue
 sdk: docker
 app_port: 7860
-short_description: Live API for Transformers PR
+short_description: Live API for Transformers PR search and issue clustering.
 datasets:
 - evalstate/transformers-pr
 tags:

@@ -20,12 +20,6 @@ tags:
 
 Machine-oriented API for PR similarity search.
 
-Canonical storage roles:
-
-- dataset repo: published latest state and canonical current analysis
-- mounted bucket: mutable operational cache only
-- Space disk: ephemeral runtime storage
-
 Defaults for this deployment:
 
 - repo: `huggingface/transformers`
src/slop_farmer.egg-info/PKG-INFO
CHANGED

@@ -61,6 +61,16 @@ forward from the previous snapshot when the new snapshot does not already have i
 log a cache-hit summary for the run. This is useful for incremental scrapes where many
 review units are unchanged and can safely reuse cached hybrid decisions.
 
+To push that local cache back to the dataset repo for future remote-first runs, use either:
+
+- `publish-analysis-artifacts --save-cache` during canonical analysis publication
+- `save-cache` to upload `analysis-state/` on its own
+
+Hybrid review execution is bounded-parallel. Use `--hybrid-llm-concurrency N` or
+`analysis.hybrid_llm_concurrency: N` to cap concurrent review units. `1` keeps the
+lowest provider pressure; higher values can reduce wall-clock time at the cost of more
+provider pressure.
+
 ## Scope
 
 Cluster PRs by touched repository areas.

@@ -81,23 +91,40 @@ uv run slop-farmer scrape \
 --max-prs 50
 ```
 
-To
+To refresh the canonical dataset repo:
 
 ```bash
-uv run slop-farmer
-  --repo huggingface/transformers \
-  --output-dir data \
-  --hf-repo-id burtenshaw/transformers-pr-slop-dataset \
-  --publish
+uv run slop-farmer --config configs/transformers.yaml refresh-dataset
 ```
 
+`refresh-dataset` publishes raw tables plus cheap artifacts like:
+
 - `new_contributors.parquet`
 - `new-contributors-report.json`
 - `new-contributors-report.md`
+- `pr-scope-clusters.json`
+
+To publish expensive hybrid analysis artifacts after a local `analyze` run:
+
+```bash
+uv run slop-farmer --config configs/transformers.yaml publish-analysis-artifacts \
+  --analysis-id hybrid-gpt54mini-v3 \
+  --canonical \
+  --save-cache
+```
+
+This writes an immutable archived run under
+`snapshots/<snapshot_id>/analysis-runs/<analysis_id>/...` and, with `--canonical`,
+updates the stable `analysis/current/` alias. With `--save-cache`, it also uploads the
+snapshot-local `analysis-state/` directory to repo-root `analysis-state/` as mutable
+operational cache for future hybrid runs.
+
+To upload only the cache without publishing canonical analysis:
 
+```bash
+uv run slop-farmer --config configs/transformers.yaml save-cache \
+  --snapshot-dir runs/transformers-recent-60d/data/snapshots/20260418T170534Z
+```
 
 ## Nightly incremental runs

@@ -170,6 +197,7 @@ materialize versioned Hub **dataset repos**; it does not currently read HF bucke
 Compatibility wrappers remain available:
 
 - `scripts/submit_transformers_dataset_job.sh`
+- `scripts/submit_diffusers_dataset_job.sh`
 - `scripts/submit_openclaw_dataset_job.sh`
 
 For the current storage model and recommended modes, see

@@ -183,11 +211,15 @@ You can analyze the published Hugging Face dataset directly without scraping Git
 uv run slop-farmer analyze \
 --snapshot-dir eval_data/snapshots/gh-live-latest-1000x1000 \
 --ranking-backend hybrid \
---model "gpt-5-mini?
+--model "gpt-5.4-mini?service_tier=flex" \
 --output /tmp/gh-live-latest-1000x1000-hybrid.json
 ```
 
-This materializes the dataset-viewer parquet export into a local snapshot cache under
+This materializes the dataset-viewer parquet export into a local snapshot cache under
+`eval_data/snapshots/` and writes a local analysis report next to it. Publishing
+canonical hybrid analysis is a separate `publish-analysis-artifacts` step, and updating
+the remote hybrid cache source is `publish-analysis-artifacts --save-cache` or
+standalone `save-cache`.
 
 Repo-local defaults for `analyze` can be stored in `pyproject.toml` under `[tool.slop-farmer.analyze]`. This repo currently defaults to:

@@ -283,28 +315,15 @@ By default this writes:
 
 next to the snapshot, including GitHub profile links, repo issue/PR search links, and example authored artifacts.
 
-##
-
-You can run scrape + publish + analyze + markdown + dashboard export in one command:
-
-```bash
-uv run slop-farmer full-pipeline \
-  --repo huggingface/transformers \
-  --dataset YOURNAME/transformers-pr-slop-dataset \
-  --model "gpt-5-mini?reasoning=low"
-```
-
-```bash
-  --issue-max-age-days 30 \
-  --pr-max-age-days 14
-```
+## Recommended end-to-end sequence
+
+For canonical upkeep, prefer the explicit sequence:
+
+1. `refresh-dataset`
+2. `analyze`
+3. `publish-analysis-artifacts --save-cache`
+4. `dashboard-data`
+5. deploy dashboard and API if needed
 
 ## Validation checks

@@ -312,6 +331,7 @@ Before committing or wiring new package moves into automation, run:
 
 ```bash
 uv run python scripts/enforce_packaging.py
+uv run python scripts/check_hf_cli_secrets.py
 uv run --extra dev ruff format --check src tests scripts jobs
 uv run --extra dev ruff check src tests scripts jobs
 uv run --extra dev ty check src tests scripts jobs

@@ -324,6 +344,9 @@ uv run --extra dev pytest -q
 - `data` must not import `reports`
 - `reports` must not import `app`
 
+`scripts/check_hf_cli_secrets.py` rejects `hf ... --secrets NAME=value` so access
+tokens cannot be exposed via process argv.
+
 ## YAML config-driven runs
 
 You can keep repo-specific pipeline defaults in a YAML file and apply them to all

@@ -378,9 +401,9 @@ uv run slop-farmer --config configs/diffusers.yaml dataset-status
 Those reader commands default to `dataset_id` when configured. Pass `--snapshot-dir` to force
 an explicit local snapshot instead.
 
-
-`
-
+`analysis-state/` is mutable operational cache only. You can upload it to the dataset
+repo with `save-cache` or `publish-analysis-artifacts --save-cache`, but it is still not
+the canonical analysis read surface.
 
 ## Export static dashboard data

@@ -402,6 +425,15 @@ This writes:
 
 The dashboard is intentionally summary-first and links out to GitHub for deep detail.
 
+When `--analysis-input` is omitted, `dashboard-data` now prefers:
+
+1. `analysis/current/manifest.json`
+2. `analysis/current/analysis-report-hybrid.json`
+3. snapshot-local fallback only when canonical current analysis is absent
+
+If the canonical current manifest exists but the required artifact is missing, dashboard export
+fails loudly instead of silently drifting to snapshot-local analysis.
+
 ## Deploy a dashboard to a Hugging Face Space
 
 Use the generic deploy script:

@@ -420,6 +452,12 @@ Repo-specific wrappers are also available:
 - `scripts/deploy_transformers_dashboard_space.sh`
 - `scripts/deploy_openclaw_dashboard_space.sh`
 
+Repo-specific end-to-end dashboard update helpers are also available:
+
+- `scripts/update_transformers_dashboard.sh`
+- `scripts/update_diffusers_dashboard.sh`
+- `scripts/update_openclaw_dashboard.sh`
+
 Or use the CLI wrapper with a YAML config:
 
 ```bash

@@ -432,21 +470,23 @@ The repo includes the FastAPI service for the read-oriented PR similarity surfac
 The standalone `pr-search` client now lives in the downstream `pr-search-cli`
 package.
 
-
+Repo-specific wrappers are available for the current deployed APIs:
 
 ```bash
+scripts/update_diffusers_pr_search_api.sh
+scripts/update_transformers_pr_search_api.sh
 scripts/update_openclaw_pr_search_api.sh
 ```
 
 Or use the generic deploy script directly:
 
 ```bash
-SPACE_ID=evalstate/
-SPACE_TITLE="
-DEFAULT_REPO=
+SPACE_ID=evalstate/transformers-pr-api \
+SPACE_TITLE="Transformers PR API" \
+DEFAULT_REPO=huggingface/transformers \
 GHR_BASE_URL=https://ghreplica.dutiful.dev \
-HF_REPO_ID=evalstate/
-BUCKET_ID=evalstate/
+HF_REPO_ID=evalstate/transformers-pr \
+BUCKET_ID=evalstate/transformers-pr-api-data \
 scripts/deploy_pr_search_space.sh
 ```

@@ -455,14 +495,62 @@ This deploy flow:
 - creates or updates a Docker Space
 - uploads a minimal app bundle with a generated Space `README.md`
 - sets runtime variables for the API
-- mounts the configured HF bucket at `/data`
+- mounts the configured HF bucket at `/data` as mutable operational cache only
+
+Serving defaults:
+
+- dataset repo = canonical published state
+- API materializes one self-consistent dataset view
+- canonical `analysis/current/` is the default analysis surface when present
+- archived analysis is selectable explicitly with `snapshot_id` + `analysis_id`
 
 After the Space is live, you can query it either through the in-repo admin CLI:
 
 ```bash
-uv run slop-farmer pr-search status --repo
-uv run slop-farmer pr-search similar
+uv run slop-farmer pr-search status --repo huggingface/transformers
+uv run slop-farmer pr-search similar 44940 --repo huggingface/transformers
 ```
 
 Or through the downstream `pr-search-cli` package, which owns the standalone
 `pr-search` executable.
+
+## Transformers migration cheat sheet
+
+To move Transformers onto the current architecture:
+
+### 1. Recreate the scheduled dataset refresh job with the generic wrapper
+
+```bash
+CONFIG_PATH=configs/transformers.yaml \
+LABEL=transformers-dataset-refresh \
+SCHEDULE='@daily' \
+scripts/submit_transformers_dataset_job.sh
+```
+
+This is the canonical scheduled writer for raw/latest dataset state.
+
+### 2. Run analysis and publish canonical hybrid analysis
+
+```bash
+ANALYSIS_ID=hybrid-gpt54mini-v3 scripts/update_transformers_dashboard.sh
+```
+
+That sequence:
+
+- refreshes dataset if requested
+- writes local hybrid analysis output
+- publishes canonical `analysis/current/`
+- saves repo-root `analysis-state/` for future hybrid cache reuse
+- rebuilds PR scope
+- deploys the dashboard
+
+### 3. Deploy the Transformers API Space
+
+```bash
+scripts/update_transformers_pr_search_api.sh
+```
+
+Optional runtime bucket:
+
+- default wrapper bucket id: `evalstate/transformers-pr-api-data`
+- treat it as mutable operational cache only, not canonical published storage
src/slop_farmer.egg-info/SOURCES.txt
CHANGED

@@ -19,9 +19,10 @@ src/slop_farmer/app/hf_checkpoint_import.py
 src/slop_farmer/app/pipeline.py
 src/slop_farmer/app/pr_search.py
 src/slop_farmer/app/pr_search_api.py
-src/slop_farmer/app/
+src/slop_farmer/app/publish_analysis.py
+src/slop_farmer/app/publish_dataset_snapshot.py
+src/slop_farmer/app/save_cache.py
 src/slop_farmer/app/snapshot_state.py
-src/slop_farmer/app/workflow.py
 src/slop_farmer/data/__init__.py
 src/slop_farmer/data/dataset_card.py
 src/slop_farmer/data/ghreplica_api.py

@@ -56,10 +57,12 @@ tests/test_cli.py
 tests/test_config.py
 tests/test_dashboard.py
 tests/test_dataset_status.py
+tests/test_deploy.py
 tests/test_farmer_setup_assets.py
 tests/test_ghreplica_api.py
 tests/test_github_api.py
 tests/test_hf_checkpoint_import.py
+tests/test_hf_cli_secrets_check.py
 tests/test_http.py
 tests/test_links.py
 tests/test_new_contributor_report.py

@@ -68,7 +71,11 @@ tests/test_pipeline_checkpoint_resume.py
 tests/test_pr_scope.py
 tests/test_pr_search.py
 tests/test_pr_search_api.py
-tests/
+tests/test_publish_analysis.py
+tests/test_published_layout.py
+tests/test_save_cache.py
 tests/test_snapshot_state.py
+tests/test_submit_dataset_job.py
+tests/test_update_dashboard_scripts.py
 tests/test_update_transformers_dataset.py
 tests/test_viewer_layout.py
src/slop_farmer/app/cli.py
CHANGED

@@ -23,6 +23,7 @@ from slop_farmer.config import (
     PrSearchRefreshOptions,
     PublishAnalysisArtifactsOptions,
     RepoRef,
+    SaveCacheOptions,
     SnapshotAdoptOptions,
 )
 from slop_farmer.reports.duplicate_prs import DEFAULT_DUPLICATE_PR_MODEL

@@ -63,6 +64,7 @@ def build_parser(*, config_path: Path | None = None) -> argparse.ArgumentParser:
     _add_new_contributor_report_parser(subparsers, defaults["new-contributor-report"])
     _add_dashboard_data_parser(subparsers, defaults["dashboard-data"])
     _add_publish_analysis_artifacts_parser(subparsers, defaults["publish-analysis-artifacts"])
+    _add_save_cache_parser(subparsers, defaults["save-cache"])
     _add_deploy_dashboard_parser(subparsers, defaults["deploy-dashboard"])
     _add_dataset_status_parser(subparsers, defaults["dataset-status"])
     return parser

@@ -80,6 +82,7 @@ def _load_parser_defaults(config_path: Path | None) -> dict[str, dict[str, Any]]
         "new-contributor-report",
         "dashboard-data",
         "publish-analysis-artifacts",
+        "save-cache",
         "deploy-dashboard",
         "dataset-status",
     )

@@ -897,6 +900,11 @@ def _add_publish_analysis_artifacts_parser(subparsers: Any, defaults: dict[str,
         type=Path,
         help="Optional explicit snapshot directory containing analysis-report-hybrid.json.",
     )
+    publish_analysis.add_argument(
+        "--analysis-input",
+        type=Path,
+        help="Optional explicit hybrid analysis report JSON to publish instead of snapshot-dir discovery.",
+    )
     publish_analysis.add_argument(
         "--hf-repo-id",
         default=defaults.get("hf-repo-id"),

@@ -910,6 +918,12 @@
         default=bool(defaults.get("canonical", False)),
         help="Also update the stable analysis/current canonical alias.",
     )
+    publish_analysis.add_argument(
+        "--save-cache",
+        action="store_true",
+        default=bool(defaults.get("save-cache", False)),
+        help="Also upload snapshot-local analysis-state/ as mutable operational cache at repo-root analysis-state/.",
+    )
     publish_analysis.add_argument(
         "--private-hf-repo",
         action="store_true",

@@ -918,6 +932,36 @@
     )
 
 
+def _add_save_cache_parser(subparsers: Any, defaults: dict[str, Any]) -> None:
+    save_cache = subparsers.add_parser(
+        "save-cache",
+        help="Upload snapshot-local analysis-state/ as mutable operational cache to a dataset repo.",
+    )
+    save_cache.add_argument(
+        "--output-dir",
+        type=Path,
+        default=Path(defaults.get("output-dir", "data")),
+        help="Pipeline workspace root containing snapshots/latest.json.",
+    )
+    save_cache.add_argument(
+        "--snapshot-dir",
+        type=Path,
+        help="Optional explicit snapshot directory containing analysis-state/.",
+    )
+    save_cache.add_argument(
+        "--hf-repo-id",
+        default=defaults.get("hf-repo-id"),
+        required=defaults.get("hf-repo-id") is None,
+        help="Target Hugging Face dataset repo id.",
+    )
+    save_cache.add_argument(
+        "--private-hf-repo",
+        action="store_true",
+        default=bool(defaults.get("private-hf-repo", False)),
+        help="Create the target dataset repo as private if needed.",
+    )
+
+
 def _add_deploy_dashboard_parser(subparsers: Any, defaults: dict[str, Any]) -> None:
     deploy_dashboard = subparsers.add_parser(
         "deploy-dashboard",

@@ -1512,9 +1556,30 @@ def _run_publish_analysis_artifacts(args: argparse.Namespace, config_path: Path
             PublishAnalysisArtifactsOptions(
                 output_dir=args.output_dir,
                 snapshot_dir=args.snapshot_dir,
+                analysis_input=args.analysis_input,
                 hf_repo_id=args.hf_repo_id,
                 analysis_id=args.analysis_id,
                 canonical=args.canonical,
+                save_cache=args.save_cache,
+                private_hf_repo=args.private_hf_repo,
+            )
+        ),
+        indent=2,
+    )
+)
+
+
+def _run_save_cache(args: argparse.Namespace, config_path: Path | None) -> None:
+    del config_path
+    from slop_farmer.app.save_cache import run_save_cache
+
+    print(
+        json.dumps(
+            run_save_cache(
+                SaveCacheOptions(
+                    output_dir=args.output_dir,
+                    snapshot_dir=args.snapshot_dir,
+                    hf_repo_id=args.hf_repo_id,
                     private_hf_repo=args.private_hf_repo,
                 )
             ),

@@ -1543,6 +1608,7 @@ def main() -> None:
         "deploy-dashboard": _run_deploy_dashboard,
         "dataset-status": _run_dataset_status,
         "publish-analysis-artifacts": _run_publish_analysis_artifacts,
+        "save-cache": _run_save_cache,
     }
     handler = handlers.get(args.command)
     if handler is None:
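
A quick way to sanity-check the new CLI wiring is to build the parser and parse a `save-cache` invocation. This is a sketch, not repo code, assuming `build_parser()` works without a config file (defaults then come back empty, so `--hf-repo-id` must be passed) and that the subparsers use `dest="command"`, as the handler table above suggests:

```python
# Sketch: exercising the save-cache subcommand wiring added in the diff above.
from slop_farmer.app.cli import build_parser

parser = build_parser()
args = parser.parse_args(
    [
        "save-cache",
        "--hf-repo-id",
        "evalstate/transformers-pr",  # hypothetical target dataset repo
        "--snapshot-dir",
        "data/snapshots/20260418T170534Z",  # hypothetical snapshot directory
    ]
)
assert args.command == "save-cache"  # main() dispatches this to _run_save_cache
```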
src/slop_farmer/app/pr_search_api.py
CHANGED

@@ -1,7 +1,8 @@
 from __future__ import annotations
 
+import asyncio
 import os
-from contextlib import asynccontextmanager
+from contextlib import asynccontextmanager, suppress
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Any, Literal

@@ -11,10 +12,19 @@ from fastapi.responses import JSONResponse
 
 from slop_farmer.config import PrSearchRefreshOptions
 from slop_farmer.data.ghreplica_api import GhReplicaProbeUnavailableError, GhrProbeClient
-from slop_farmer.data.snapshot_materialize import
+from slop_farmer.data.snapshot_materialize import (
+    load_hf_current_analysis_manifest,
+    load_hf_current_search_manifest,
+    materialize_hf_current_analysis_surface,
+    materialize_hf_current_search_index,
+    materialize_hf_dataset_snapshot,
+)
 from slop_farmer.data.snapshot_paths import (
     CURRENT_ANALYSIS_MANIFEST_PATH,
     default_hf_materialize_dir,
+    load_current_analysis_manifest,
+    load_current_search_manifest,
+    repo_relative_path_to_local,
 )
 from slop_farmer.reports.analysis_service import (
     get_analysis_best,

@@ -64,6 +74,8 @@ class PrSearchApiSettings:
     http_max_retries: int = 5
     refresh_if_missing: bool = False
     rebuild_on_start: bool = False
+    current_analysis_poll_seconds: int = 0
+    current_search_poll_seconds: int = 0
     include_drafts: bool = False
     include_closed: bool = False
     similar_limit_default: int = 10

@@ -100,6 +112,8 @@
             http_max_retries=_env_int("HTTP_MAX_RETRIES", 5),
             refresh_if_missing=_env_bool("REFRESH_IF_MISSING", False),
             rebuild_on_start=_env_bool("REBUILD_ON_START", False),
+            current_analysis_poll_seconds=_env_int("CURRENT_ANALYSIS_POLL_SECONDS", 0),
+            current_search_poll_seconds=_env_int("CURRENT_SEARCH_POLL_SECONDS", 0),
             include_drafts=_env_bool("INCLUDE_DRAFTS", False),
             include_closed=_env_bool("INCLUDE_CLOSED", False),
             similar_limit_default=_env_int("SIMILAR_LIMIT_DEFAULT", 10),

@@ -125,13 +139,28 @@ def create_app(settings: PrSearchApiSettings | None = None) -> FastAPI:
         app.state.settings = api_settings
         app.state.ready = False
         app.state.startup_error = None
+        app.state.current_analysis_refresh_error = None
+        app.state.current_search_refresh_error = None
+        refresh_task: asyncio.Task[None] | None = None
         try:
             _bootstrap_snapshot_assets(api_settings)
+            _bootstrap_current_search_index(api_settings)
             _bootstrap_index(api_settings)
             app.state.ready = _is_ready(api_settings)
+            if (
+                api_settings.current_analysis_poll_seconds > 0
+                or api_settings.current_search_poll_seconds > 0
+            ):
+                refresh_task = asyncio.create_task(_run_current_asset_refresh_loop(app))
         except Exception as exc:
             app.state.startup_error = str(exc)
-
+        try:
+            yield
+        finally:
+            if refresh_task is not None:
+                refresh_task.cancel()
+                with suppress(asyncio.CancelledError):
+                    await refresh_task
 
     app = FastAPI(title="slop PR search API", version="0.1.1", lifespan=lifespan)

@@ -628,6 +657,84 @@ def _bootstrap_snapshot_assets(settings: PrSearchApiSettings) -> None:
     )
 
 
+def _bootstrap_current_search_index(settings: PrSearchApiSettings) -> None:
+    if settings.snapshot_dir is not None or settings.hf_repo_id is None:
+        return
+    _refresh_current_search_index(settings)
+
+
+async def _run_current_asset_refresh_loop(app: FastAPI) -> None:
+    settings = app.state.settings
+    interval = min_non_zero(
+        settings.current_analysis_poll_seconds,
+        settings.current_search_poll_seconds,
+    )
+    while True:
+        await asyncio.sleep(interval)
+        if settings.current_search_poll_seconds > 0:
+            try:
+                _refresh_current_search_index(settings)
+            except Exception as exc:
+                app.state.current_search_refresh_error = str(exc)
+            else:
+                app.state.current_search_refresh_error = None
+        if settings.current_analysis_poll_seconds > 0:
+            try:
+                _refresh_current_analysis_surface(settings)
+            except Exception as exc:
+                app.state.current_analysis_refresh_error = str(exc)
+            else:
+                app.state.current_analysis_refresh_error = None
+
+
+def _refresh_current_analysis_surface(settings: PrSearchApiSettings) -> bool:
+    if settings.hf_repo_id is None or settings.snapshot_dir is not None:
+        return False
+    remote_manifest = load_hf_current_analysis_manifest(
+        repo_id=settings.hf_repo_id,
+        revision=settings.hf_revision,
+    )
+    if remote_manifest is None:
+        return False
+    local_manifest = _load_local_current_analysis_manifest(settings)
+    if _analysis_manifest_identity(local_manifest) == _analysis_manifest_identity(remote_manifest):
+        return False
+    materialize_hf_current_analysis_surface(
+        repo_id=settings.hf_repo_id,
+        local_dir=_materialized_snapshot_dir(settings) or settings.output_dir,
+        revision=settings.hf_revision,
+    )
+    return True
+
+
+def _refresh_current_search_index(settings: PrSearchApiSettings) -> bool:
+    if settings.hf_repo_id is None or settings.snapshot_dir is not None:
+        return False
+    remote_manifest = load_hf_current_search_manifest(
+        repo_id=settings.hf_repo_id,
+        revision=settings.hf_revision,
+    )
+    if remote_manifest is None:
+        return False
+    local_manifest = _load_local_current_search_manifest(settings)
+    if _search_manifest_identity(local_manifest) == _search_manifest_identity(remote_manifest):
+        return False
+    manifest_path = _local_current_search_manifest_path(settings)
+    staged_db_path = settings.index_path.with_name(f".{settings.index_path.name}.download")
+    staged_manifest_path = manifest_path.with_name(f".{manifest_path.name}.download")
+    materialize_hf_current_search_index(
+        repo_id=settings.hf_repo_id,
+        db_path=staged_db_path,
+        manifest_path=staged_manifest_path,
+        revision=settings.hf_revision,
+    )
+    get_pr_search_status(staged_db_path, repo=settings.default_repo)
+    settings.index_path.parent.mkdir(parents=True, exist_ok=True)
+    staged_db_path.replace(settings.index_path)
+    staged_manifest_path.replace(manifest_path)
+    return True
+
+
 def _needs_refresh(settings: PrSearchApiSettings) -> bool:
     if settings.rebuild_on_start:
         return True

@@ -714,6 +821,64 @@ def _surface_available(snapshot_dir: Path, *, surface: Literal["issues", "contri
     return (snapshot_dir / "new-contributors-report.json").exists()
 
 
+def _load_local_current_analysis_manifest(settings: PrSearchApiSettings) -> dict[str, Any] | None:
+    materialized_snapshot_dir = _materialized_snapshot_dir(settings)
+    if materialized_snapshot_dir is None:
+        return None
+    manifest_path = repo_relative_path_to_local(
+        materialized_snapshot_dir, CURRENT_ANALYSIS_MANIFEST_PATH
+    )
+    if not manifest_path.exists():
+        return None
+    return load_current_analysis_manifest(manifest_path)
+
+
+def _local_current_search_manifest_path(settings: PrSearchApiSettings) -> Path:
+    return settings.index_path.with_name("current-search-manifest.json")
+
+
+def _load_local_current_search_manifest(settings: PrSearchApiSettings) -> dict[str, Any] | None:
+    manifest_path = _local_current_search_manifest_path(settings)
+    if not manifest_path.exists():
+        return None
+    return load_current_search_manifest(manifest_path)
+
+
+def _analysis_manifest_identity(
+    payload: dict[str, Any] | None,
+) -> tuple[str | None, str | None, str | None]:
+    if payload is None:
+        return (None, None, None)
+    snapshot_id = payload.get("snapshot_id")
+    analysis_id = payload.get("analysis_id")
+    published_at = payload.get("published_at")
+    return (
+        None if snapshot_id is None else str(snapshot_id),
+        None if analysis_id is None else str(analysis_id),
+        None if published_at is None else str(published_at),
+    )
+
+
+def _search_manifest_identity(
+    payload: dict[str, Any] | None,
+) -> tuple[str | None, str | None, str | None]:
+    if payload is None:
+        return (None, None, None)
+    snapshot_id = payload.get("snapshot_id")
+    run_id = payload.get("run_id")
+    published_at = payload.get("published_at")
+    return (
+        None if snapshot_id is None else str(snapshot_id),
+        None if run_id is None else str(run_id),
+        None if published_at is None else str(published_at),
+    )
+
+
+def min_non_zero(*values: int) -> int:
+    candidates = [value for value in values if value > 0]
+    return min(candidates) if candidates else 300
+
+
 def _limit(value: int | None, *, default: int, maximum: int) -> int:
     limit = default if value is None else value
    if limit < 1:
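
The refresh loop's cadence and change detection reduce to small pure helpers, which makes them easy to check in isolation. A short sketch using the functions added above, assuming the module imports cleanly outside a running Space:

```python
# Sketch: poll-interval selection for the refresh loop, per the diff above.
from slop_farmer.app.pr_search_api import min_non_zero

assert min_non_zero(0, 0) == 300      # nothing configured: fall back to 300 seconds
assert min_non_zero(600, 120) == 120  # poll at the tighter of the two cadences
assert min_non_zero(0, 45) == 45      # a zero (disabled) value is ignored
```

Two design points worth noting: identity tuples compare `snapshot_id`/`analysis_id` (or `run_id`) plus `published_at`, so a republish with identical ids but a new `published_at` still triggers a re-materialize; and the search index is downloaded to a staged `.download` path, validated with `get_pr_search_status`, then atomically swapped in with `Path.replace`, so a failed download never clobbers the serving index.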
src/slop_farmer/app/publish_analysis.py
CHANGED

@@ -9,6 +9,7 @@ from typing import Any, Protocol, cast
 
 from huggingface_hub import CommitOperationAdd, HfApi, hf_hub_download
 
+from slop_farmer.app.save_cache import _save_analysis_cache_api
 from slop_farmer.config import PublishAnalysisArtifactsOptions
 from slop_farmer.data.parquet_io import read_json
 from slop_farmer.data.snapshot_paths import (

@@ -44,6 +45,16 @@ class HubApiLike(Protocol):
         repo_type: str,
     ) -> Any: ...
 
+    def upload_folder(
+        self,
+        *,
+        repo_id: str,
+        folder_path: Path,
+        path_in_repo: str,
+        repo_type: str,
+        commit_message: str,
+    ) -> None: ...
+
 
 @dataclass(frozen=True, slots=True)
 class PublishableAnalysisArtifacts:

@@ -59,9 +70,11 @@ def run_publish_analysis_artifacts(options: PublishAnalysisArtifactsOptions) ->
     snapshot_dir = resolve_snapshot_dir_from_output(options.output_dir, options.snapshot_dir)
     return publish_analysis_artifacts(
         snapshot_dir=snapshot_dir,
+        analysis_input=options.analysis_input,
         hf_repo_id=options.hf_repo_id,
         analysis_id=options.analysis_id,
         canonical=options.canonical,
+        save_cache=options.save_cache,
         private=options.private_hf_repo,
     )

@@ -69,19 +82,23 @@ def publish_analysis_artifacts(
 def publish_analysis_artifacts(
     *,
     snapshot_dir: Path,
+    analysis_input: Path | None,
     hf_repo_id: str,
     analysis_id: str,
     canonical: bool,
     private: bool,
+    save_cache: bool = False,
     log: Callable[[str], None] | None = None,
 ) -> dict[str, Any]:
     return _publish_analysis_artifacts_api(
         cast("HubApiLike", HfApi()),
         snapshot_dir=snapshot_dir,
+        analysis_input=analysis_input,
         hf_repo_id=hf_repo_id,
         analysis_id=analysis_id,
         canonical=canonical,
         private=private,
+        save_cache=save_cache,
         log=log,
     )

@@ -90,13 +107,15 @@ def _publish_analysis_artifacts_api(
     api: HubApiLike,
     *,
     snapshot_dir: Path,
+    analysis_input: Path | None = None,
     hf_repo_id: str,
     analysis_id: str,
     canonical: bool,
     private: bool,
+    save_cache: bool = False,
     log: Callable[[str], None] | None = None,
 ) -> dict[str, Any]:
-    artifacts = _discover_publishable_analysis(snapshot_dir)
+    artifacts = _discover_publishable_analysis(snapshot_dir, analysis_input=analysis_input)
     published_at = _iso_now()
     channel = "canonical" if canonical else "comparison"
     archived_manifest = build_archived_analysis_run_manifest(

@@ -150,21 +169,37 @@
         commit_message=f"Publish analysis {analysis_id} for snapshot {artifacts.snapshot_id}",
         repo_type="dataset",
     )
-
+    cache_result = (
+        _save_analysis_cache_api(
+            api,
+            snapshot_dir=snapshot_dir,
+            hf_repo_id=hf_repo_id,
+            private=private,
+            log=log,
+        )
+        if save_cache
+        else None
+    )
+    result: dict[str, Any] = {
         "repo": artifacts.repo,
         "dataset_id": hf_repo_id,
         "snapshot_id": artifacts.snapshot_id,
         "analysis_id": analysis_id,
         "canonical": canonical,
+        "save_cache": save_cache,
         "published_at": published_at,
         "artifact_paths": [operation.path_in_repo for operation in operations],
     }
+    if cache_result is not None:
+        result["cache"] = cache_result
     if log:
         log(f"Published analysis artifacts to {hf_repo_id}")
     return result
 
 
-def _discover_publishable_analysis(
+def _discover_publishable_analysis(
+    snapshot_dir: Path, *, analysis_input: Path | None
+) -> PublishableAnalysisArtifacts:
     manifest_path = snapshot_dir / ROOT_MANIFEST_FILENAME
     if not manifest_path.exists():
         raise FileNotFoundError(f"Snapshot manifest is missing: {manifest_path}")

@@ -176,7 +211,11 @@ def _discover_publishable_analysis(snapshot_dir: Path) -> PublishableAnalysisArt
     if not repo:
         raise ValueError(f"Snapshot manifest at {manifest_path} does not define repo.")
 
-    report_path =
+    report_path = (
+        analysis_input.resolve()
+        if analysis_input is not None
+        else snapshot_dir / ANALYSIS_REPORT_FILENAME_BY_VARIANT["hybrid"]
+    )
     if not report_path.exists():
         raise FileNotFoundError(f"Hybrid analysis report is missing: {report_path}")
     report_payload = read_json(report_path)

@@ -196,7 +235,7 @@ def _discover_publishable_analysis(snapshot_dir: Path) -> PublishableAnalysisArt
     if model is not None:
         model = str(model)
 
-    reviews_path =
+    reviews_path = report_path.with_name(f"{report_path.stem}.llm-reviews.json")
     return PublishableAnalysisArtifacts(
         repo=repo,
         snapshot_id=snapshot_id,

@@ -266,12 +305,13 @@ def _commit_operations(
     current_manifest: dict[str, Any] | None,
     snapshot_manifest: dict[str, Any],
 ) -> list[CommitOperationAdd]:
+    report_filename = ANALYSIS_REPORT_FILENAME_BY_VARIANT["hybrid"]
     operations = [
         CommitOperationAdd(
             path_in_repo=analysis_run_artifact_path(
                 artifacts.snapshot_id,
                 analysis_id,
-
+                report_filename,
             ),
             path_or_fileobj=artifacts.report_path,
         ),

@@ -290,7 +330,7 @@
             path_in_repo=analysis_run_artifact_path(
                 artifacts.snapshot_id,
                 analysis_id,
-
+                HYBRID_ANALYSIS_REVIEWS_FILENAME,
             ),
             path_or_fileobj=artifacts.reviews_path,
         )

@@ -299,7 +339,7 @@
     operations.extend(
         [
             CommitOperationAdd(
-                path_in_repo=current_analysis_artifact_path(
+                path_in_repo=current_analysis_artifact_path(report_filename),
                 path_or_fileobj=artifacts.report_path,
             ),
             CommitOperationAdd(

@@ -311,7 +351,7 @@
     if artifacts.reviews_path is not None:
         operations.append(
             CommitOperationAdd(
-                path_in_repo=current_analysis_artifact_path(
+                path_in_repo=current_analysis_artifact_path(HYBRID_ANALYSIS_REVIEWS_FILENAME),
                 path_or_fileobj=artifacts.reviews_path,
            )
        )
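
One detail worth calling out in `_discover_publishable_analysis`: when `--analysis-input` points at an explicit report, the reviews sidecar is resolved next to that report by a pure naming rule. A self-contained sketch of just that rule (the path is a made-up example):

```python
# Sketch: the reviews-sidecar naming rule used by _discover_publishable_analysis above.
from pathlib import Path

report = Path("/tmp/gh-live-latest-1000x1000-hybrid.json")
reviews = report.with_name(f"{report.stem}.llm-reviews.json")
assert reviews == Path("/tmp/gh-live-latest-1000x1000-hybrid.llm-reviews.json")
```

Because the rule keys off the report's own stem, an explicitly supplied `--analysis-input` report and its sidecar always travel together.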
src/slop_farmer/app/publish_pr_search_index.py
ADDED
|
@@ -0,0 +1,141 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
```python
from __future__ import annotations

import json
from collections.abc import Callable
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any, Protocol, cast

from huggingface_hub import CommitOperationAdd, HfApi

from slop_farmer.data.snapshot_paths import (
    CURRENT_SEARCH_DB_PATH,
    CURRENT_SEARCH_MANIFEST_PATH,
    build_current_search_manifest,
)
from slop_farmer.reports.pr_search_service import get_pr_search_status


class HubApiLike(Protocol):
    def create_repo(
        self,
        repo_id: str,
        *,
        repo_type: str,
        private: bool,
        exist_ok: bool,
    ) -> None: ...

    def create_commit(
        self,
        repo_id: str,
        operations: list[CommitOperationAdd],
        *,
        commit_message: str,
        repo_type: str,
    ) -> Any: ...


@dataclass(frozen=True, slots=True)
class PublishablePrSearchIndex:
    repo: str
    snapshot_id: str
    run_id: str
    db_path: Path
    source_type: str | None
    hf_repo_id: str | None
    hf_revision: str | None
    row_counts: dict[str, Any]


def publish_current_pr_search_index(
    *,
    db_path: Path,
    hf_repo_id: str,
    private: bool = False,
    log: Callable[[str], None] | None = None,
) -> dict[str, Any]:
    return _publish_current_pr_search_index_api(
        cast("HubApiLike", HfApi()),
        db_path=db_path,
        hf_repo_id=hf_repo_id,
        private=private,
        log=log,
    )


def _publish_current_pr_search_index_api(
    api: HubApiLike,
    *,
    db_path: Path,
    hf_repo_id: str,
    private: bool = False,
    log: Callable[[str], None] | None = None,
) -> dict[str, Any]:
    publishable = _discover_publishable_pr_search_index(db_path)
    published_at = _iso_now()
    manifest = build_current_search_manifest(
        repo=publishable.repo,
        snapshot_id=publishable.snapshot_id,
        run_id=publishable.run_id,
        published_at=published_at,
        source_type=publishable.source_type,
        hf_repo_id=publishable.hf_repo_id,
        hf_revision=publishable.hf_revision,
        row_counts=publishable.row_counts,
    )
    operations = [
        CommitOperationAdd(
            path_in_repo=CURRENT_SEARCH_DB_PATH,
            path_or_fileobj=publishable.db_path,
        ),
        CommitOperationAdd(
            path_in_repo=CURRENT_SEARCH_MANIFEST_PATH,
            path_or_fileobj=_json_bytes(manifest),
        ),
    ]
    if log:
        log(f"Ensuring Hub dataset repo exists: {hf_repo_id}")
    api.create_repo(hf_repo_id, repo_type="dataset", private=private, exist_ok=True)
    if log:
        log(f"Publishing PR search index run {publishable.run_id} for {publishable.snapshot_id}")
    api.create_commit(
        hf_repo_id,
        operations,
        commit_message=f"Publish PR search index {publishable.run_id} for {publishable.snapshot_id}",
        repo_type="dataset",
    )
    result = {
        "repo": publishable.repo,
        "dataset_id": hf_repo_id,
        "snapshot_id": publishable.snapshot_id,
        "run_id": publishable.run_id,
        "published_at": published_at,
        "artifact_paths": [operation.path_in_repo for operation in operations],
    }
    if log:
        log(f"Published PR search index to {hf_repo_id}")
    return result


def _discover_publishable_pr_search_index(db_path: Path) -> PublishablePrSearchIndex:
    status = get_pr_search_status(db_path)
    return PublishablePrSearchIndex(
        repo=str(status["repo"]),
        snapshot_id=str(status["snapshot_id"]),
        run_id=str(status["id"]),
        db_path=db_path.resolve(),
        source_type=(None if status.get("source_type") is None else str(status.get("source_type"))),
        hf_repo_id=None if status.get("hf_repo_id") is None else str(status.get("hf_repo_id")),
        hf_revision=(None if status.get("hf_revision") is None else str(status.get("hf_revision"))),
        row_counts=dict(status.get("row_counts") or {}),
    )


def _json_bytes(payload: dict[str, Any]) -> bytes:
    return (json.dumps(payload, indent=2, sort_keys=True) + "\n").encode("utf-8")


def _iso_now() -> str:
    return datetime.now(tz=UTC).replace(microsecond=0).isoformat().replace("+00:00", "Z")
```
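Taken together, the module reads the active index's status out of the DuckDB file, builds a manifest, and commits both artifacts to the dataset repo in a single commit. A minimal sketch of calling it; the db path and dataset id below are placeholders, not values from this PR:

```python
from pathlib import Path

from slop_farmer.app.publish_pr_search_index import publish_current_pr_search_index

# Placeholder paths/ids for illustration only.
result = publish_current_pr_search_index(
    db_path=Path("data/pr-search.duckdb"),
    hf_repo_id="example-org/transformers-pr-index",
    log=print,
)
# artifact_paths lists the db and manifest paths under search/current/.
print(result["run_id"], result["artifact_paths"])
```

The `HubApiLike` protocol exists so tests can inject a fake Hub client in place of `HfApi`, which the private `_publish_current_pr_search_index_api` entry point accepts directly.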
src/slop_farmer/app/save_cache.py
ADDED
@@ -0,0 +1,115 @@
```python
from __future__ import annotations

from collections.abc import Callable
from pathlib import Path
from typing import Any, Protocol, cast

from huggingface_hub import HfApi

from slop_farmer.config import SaveCacheOptions
from slop_farmer.data.parquet_io import read_json
from slop_farmer.data.snapshot_paths import ROOT_MANIFEST_FILENAME, resolve_snapshot_dir_from_output

ANALYSIS_STATE_DIRNAME = "analysis-state"


class HubApiLike(Protocol):
    def create_repo(
        self,
        repo_id: str,
        *,
        repo_type: str,
        private: bool,
        exist_ok: bool,
    ) -> None: ...

    def upload_folder(
        self,
        *,
        repo_id: str,
        folder_path: Path,
        path_in_repo: str,
        repo_type: str,
        commit_message: str,
    ) -> None: ...


def run_save_cache(options: SaveCacheOptions) -> dict[str, Any]:
    snapshot_dir = resolve_snapshot_dir_from_output(options.output_dir, options.snapshot_dir)
    return save_analysis_cache(
        snapshot_dir=snapshot_dir,
        hf_repo_id=options.hf_repo_id,
        private=options.private_hf_repo,
    )


def save_analysis_cache(
    *,
    snapshot_dir: Path,
    hf_repo_id: str,
    private: bool,
    log: Callable[[str], None] | None = None,
) -> dict[str, Any]:
    return _save_analysis_cache_api(
        cast("HubApiLike", HfApi()),
        snapshot_dir=snapshot_dir,
        hf_repo_id=hf_repo_id,
        private=private,
        log=log,
    )


def _save_analysis_cache_api(
    api: HubApiLike,
    *,
    snapshot_dir: Path,
    hf_repo_id: str,
    private: bool,
    log: Callable[[str], None] | None = None,
) -> dict[str, Any]:
    cache_dir = snapshot_dir / ANALYSIS_STATE_DIRNAME
    if not cache_dir.exists():
        raise FileNotFoundError(f"Analysis cache directory is missing: {cache_dir}")
    if not cache_dir.is_dir():
        raise NotADirectoryError(f"Analysis cache path is not a directory: {cache_dir}")
    artifact_paths = _cache_artifact_paths(cache_dir)
    if not artifact_paths:
        raise ValueError(f"Analysis cache directory is empty: {cache_dir}")

    manifest_path = snapshot_dir / ROOT_MANIFEST_FILENAME
    manifest = read_json(manifest_path) if manifest_path.exists() else {}
    if not isinstance(manifest, dict):
        raise ValueError(f"Snapshot manifest at {manifest_path} must contain a JSON object.")
    snapshot_id = str(manifest.get("snapshot_id") or snapshot_dir.name).strip()
    repo = str(manifest.get("repo") or "").strip()

    if log:
        log(f"Ensuring Hub dataset repo exists: {hf_repo_id}")
    api.create_repo(hf_repo_id, repo_type="dataset", private=private, exist_ok=True)
    if log:
        log(f"Saving analysis cache for snapshot {snapshot_id}")
    api.upload_folder(
        repo_id=hf_repo_id,
        folder_path=cache_dir,
        path_in_repo=ANALYSIS_STATE_DIRNAME,
        repo_type="dataset",
        commit_message=f"Save analysis cache for snapshot {snapshot_id}",
    )
    result = {
        "dataset_id": hf_repo_id,
        "snapshot_id": snapshot_id,
        "artifact_paths": [f"{ANALYSIS_STATE_DIRNAME}/{path}" for path in artifact_paths],
    }
    if repo:
        result["repo"] = repo
    if log:
        log(f"Saved analysis cache to {hf_repo_id}")
    return result


def _cache_artifact_paths(cache_dir: Path) -> list[str]:
    return sorted(
        str(path.relative_to(cache_dir).as_posix())
        for path in cache_dir.rglob("*")
        if path.is_file()
    )
```
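Because the Hub client sits behind the `HubApiLike` protocol here too, the upload path can be exercised offline. A sketch with a hand-rolled recorder; the recorder class and the paths are test scaffolding, not part of this module, and it assumes the snapshot dir contains a non-empty `analysis-state/` subdirectory:

```python
from pathlib import Path
from typing import Any

from slop_farmer.app.save_cache import _save_analysis_cache_api


class RecordingHubApi:
    """Test double for HubApiLike: records calls instead of talking to the Hub."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, dict[str, Any]]] = []

    def create_repo(self, repo_id: str, *, repo_type: str, private: bool, exist_ok: bool) -> None:
        self.calls.append(("create_repo", {"repo_id": repo_id, "private": private}))

    def upload_folder(
        self, *, repo_id: str, folder_path: Path, path_in_repo: str,
        repo_type: str, commit_message: str,
    ) -> None:
        self.calls.append(("upload_folder", {"path_in_repo": path_in_repo}))


api = RecordingHubApi()
result = _save_analysis_cache_api(
    api,
    snapshot_dir=Path("data/snapshots/example"),  # placeholder snapshot dir
    hf_repo_id="example-org/pr-cache",            # placeholder dataset id
    private=True,
)
assert [name for name, _ in api.calls] == ["create_repo", "upload_folder"]
```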
src/slop_farmer/app_config.py
CHANGED
```diff
@@ -234,6 +234,10 @@ def _dashboard_config_defaults(config_path: Path) -> dict[str, dict[str, Any]]:
         "output-dir": str(data_dir) if data_dir else None,
         "hf-repo-id": dataset_id,
     },
+    "save-cache": {
+        "output-dir": str(data_dir) if data_dir else None,
+        "hf-repo-id": dataset_id,
+    },
     "deploy-dashboard": {
         "pipeline-data-dir": str(data_dir) if data_dir else None,
         "web-dir": str(web_dir) if web_dir else None,
```
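With this in place, the `save-cache` command inherits the same output-dir and dataset-id defaults as the other commands. Roughly, the defaults mapping gains a slice like the following; the resolved values shown are illustrative:

```python
# Sketch of the new slice of the defaults mapping, assuming data_dir resolved
# to "data" and dataset_id to "evalstate/transformers-pr" (both illustrative).
defaults_slice = {
    "save-cache": {
        "output-dir": "data",
        "hf-repo-id": "evalstate/transformers-pr",
    },
}
```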
src/slop_farmer/config.py
CHANGED
```diff
@@ -244,9 +244,19 @@ class DatasetRefreshOptions:
 class PublishAnalysisArtifactsOptions:
     output_dir: Path
     snapshot_dir: Path | None
+    analysis_input: Path | None
     hf_repo_id: str
     analysis_id: str
     canonical: bool = False
+    save_cache: bool = False
+    private_hf_repo: bool = False
+
+
+@dataclass(slots=True)
+class SaveCacheOptions:
+    output_dir: Path
+    snapshot_dir: Path | None
+    hf_repo_id: str
     private_hf_repo: bool = False
 
 
```
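`run_save_cache` in `save_cache.py` consumes these options directly. A hedged example of wiring them up; the paths and dataset id are placeholders:

```python
from pathlib import Path

from slop_farmer.app.save_cache import run_save_cache
from slop_farmer.config import SaveCacheOptions

options = SaveCacheOptions(
    output_dir=Path("data"),                   # placeholder pipeline output root
    snapshot_dir=None,                         # let resolve_snapshot_dir_from_output pick one
    hf_repo_id="example-org/transformers-pr",  # placeholder dataset id
)
result = run_save_cache(options)
print(result["snapshot_id"], len(result["artifact_paths"]))
```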
src/slop_farmer/data/snapshot_materialize.py
CHANGED
```diff
@@ -15,6 +15,8 @@ from slop_farmer.data.parquet_io import read_json, write_text
 from slop_farmer.data.snapshot_paths import (
     CONTRIBUTOR_ARTIFACT_FILENAMES,
     CURRENT_ANALYSIS_MANIFEST_PATH,
+    CURRENT_SEARCH_DB_PATH,
+    CURRENT_SEARCH_MANIFEST_PATH,
     LEGACY_ANALYSIS_FILENAMES,
     PR_SCOPE_CLUSTERS_FILENAME,
     RAW_TABLE_FILENAMES,
@@ -24,6 +26,7 @@ from slop_farmer.data.snapshot_paths import (
     STATE_WATERMARK_PATH,
     load_archived_analysis_run_manifest,
     load_current_analysis_manifest,
+    load_current_search_manifest,
     repo_relative_path_to_local,
 )
 
@@ -64,6 +67,121 @@ def materialize_hf_dataset_snapshot(
     )
 
 
+def load_hf_current_analysis_manifest(
+    *,
+    repo_id: str,
+    revision: str | None = None,
+) -> dict[str, Any] | None:
+    try:
+        downloaded = Path(
+            hf_hub_download(
+                repo_id=repo_id,
+                repo_type="dataset",
+                filename=CURRENT_ANALYSIS_MANIFEST_PATH,
+                revision=revision,
+            )
+        )
+    except Exception:
+        return None
+    return load_current_analysis_manifest(downloaded)
+
+
+def materialize_hf_current_analysis_surface(
+    *,
+    repo_id: str,
+    local_dir: Path,
+    revision: str | None = None,
+) -> Path:
+    local_dir.mkdir(parents=True, exist_ok=True)
+    manifest = load_hf_current_analysis_manifest(repo_id=repo_id, revision=revision)
+    if manifest is None:
+        raise FileNotFoundError(
+            f"HF dataset {repo_id} does not expose {CURRENT_ANALYSIS_MANIFEST_PATH!r}"
+        )
+
+    staged_downloads: list[tuple[Path, Path]] = []
+
+    def stage(repo_path: str, *, required: bool) -> None:
+        try:
+            downloaded = Path(
+                hf_hub_download(
+                    repo_id=repo_id,
+                    repo_type="dataset",
+                    filename=repo_path,
+                    revision=revision,
+                )
+            )
+        except Exception:
+            if required:
+                raise
+            return
+        staged_downloads.append((downloaded, repo_relative_path_to_local(local_dir, repo_path)))
+
+    for repo_path in (
+        ROOT_MANIFEST_FILENAME,
+        SNAPSHOTS_LATEST_PATH,
+        "issues.parquet",
+        "pull_requests.parquet",
+    ):
+        stage(repo_path, required=False)
+
+    for repo_path in manifest.get("artifacts", {}).values():
+        if isinstance(repo_path, str) and repo_path:
+            stage(repo_path, required=True)
+
+    stage(CURRENT_ANALYSIS_MANIFEST_PATH, required=True)
+
+    for downloaded, destination in staged_downloads:
+        _copy_downloaded_file(downloaded, destination)
+
+    return local_dir
+
+
+def load_hf_current_search_manifest(
+    *,
+    repo_id: str,
+    revision: str | None = None,
+) -> dict[str, Any] | None:
+    try:
+        downloaded = Path(
+            hf_hub_download(
+                repo_id=repo_id,
+                repo_type="dataset",
+                filename=CURRENT_SEARCH_MANIFEST_PATH,
+                revision=revision,
+            )
+        )
+    except Exception:
+        return None
+    return load_current_search_manifest(downloaded)
+
+
+def materialize_hf_current_search_index(
+    *,
+    repo_id: str,
+    db_path: Path,
+    manifest_path: Path,
+    revision: str | None = None,
+) -> dict[str, Any]:
+    manifest = load_hf_current_search_manifest(repo_id=repo_id, revision=revision)
+    if manifest is None:
+        raise FileNotFoundError(
+            f"HF dataset {repo_id} does not expose {CURRENT_SEARCH_MANIFEST_PATH!r}"
+        )
+    db_repo_path = str(manifest.get("artifacts", {}).get("db") or CURRENT_SEARCH_DB_PATH)
+    downloaded_db = Path(
+        hf_hub_download(
+            repo_id=repo_id,
+            repo_type="dataset",
+            filename=db_repo_path,
+            revision=revision,
+        )
+    )
+    _copy_downloaded_file(downloaded_db, db_path)
+    write_text(json.dumps(manifest, indent=2, sort_keys=True) + "\n", manifest_path)
+    return manifest
+
+
 def _materialize_hf_snapshot_repo_snapshot(
     *,
     repo_id: str,
```
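On the consuming side (for example, a Space booting up), the published index can be pulled straight from the dataset repo with the new helper. A minimal sketch; the local destinations are placeholders:

```python
from pathlib import Path

from slop_farmer.data.snapshot_materialize import materialize_hf_current_search_index

manifest = materialize_hf_current_search_index(
    repo_id="evalstate/transformers-pr",              # dataset this Space mounts
    db_path=Path("/tmp/pr-search.duckdb"),            # placeholder local db location
    manifest_path=Path("/tmp/search-manifest.json"),  # placeholder local manifest copy
)
print(manifest["snapshot_id"], manifest["row_counts"])
```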
src/slop_farmer/data/snapshot_paths.py
CHANGED
```diff
@@ -48,6 +48,10 @@ LEGACY_ANALYSIS_FILENAMES: tuple[str, ...] = (
 CURRENT_ANALYSIS_DIR = PurePosixPath("analysis/current")
 CURRENT_ANALYSIS_MANIFEST_PATH = str(CURRENT_ANALYSIS_DIR / ROOT_MANIFEST_FILENAME)
 ANALYSIS_MANIFEST_SCHEMA_VERSION = 1
+CURRENT_SEARCH_DIR = PurePosixPath("search/current")
+CURRENT_SEARCH_MANIFEST_PATH = str(CURRENT_SEARCH_DIR / ROOT_MANIFEST_FILENAME)
+CURRENT_SEARCH_DB_PATH = str(CURRENT_SEARCH_DIR / "pr-search.duckdb")
+SEARCH_MANIFEST_SCHEMA_VERSION = 1
 
 
 @dataclass(frozen=True, slots=True)
@@ -90,6 +94,10 @@ def current_analysis_artifact_path(filename: str) -> str:
     return str(CURRENT_ANALYSIS_DIR / filename)
 
 
+def current_search_artifact_path(filename: str) -> str:
+    return str(CURRENT_SEARCH_DIR / filename)
+
+
 def repo_key(repo_slug: str) -> str:
     return _path_key(repo_slug)
 
@@ -195,6 +203,39 @@ def load_archived_analysis_run_manifest(path: Path) -> dict[str, Any]:
     return validate_archived_analysis_run_manifest(payload)
 
 
+def build_current_search_manifest(
+    *,
+    repo: str,
+    snapshot_id: str,
+    run_id: str,
+    published_at: str,
+    source_type: str | None,
+    hf_repo_id: str | None,
+    hf_revision: str | None,
+    row_counts: dict[str, Any] | None,
+) -> dict[str, Any]:
+    payload = {
+        "schema_version": SEARCH_MANIFEST_SCHEMA_VERSION,
+        "repo": repo,
+        "snapshot_id": snapshot_id,
+        "run_id": run_id,
+        "published_at": published_at,
+        "source_type": source_type,
+        "hf_repo_id": hf_repo_id,
+        "hf_revision": hf_revision,
+        "artifacts": {"db": CURRENT_SEARCH_DB_PATH},
+        "row_counts": row_counts or {},
+    }
+    return validate_current_search_manifest(payload)
+
+
+def load_current_search_manifest(path: Path) -> dict[str, Any]:
+    payload = read_json(path)
+    if not isinstance(payload, dict):
+        raise ValueError(f"Current search manifest at {path} must contain a JSON object.")
+    return validate_current_search_manifest(payload)
+
+
 def resolve_default_dashboard_analysis_report(
     snapshot_dir: Path,
 ) -> ResolvedAnalysisReportPath | None:
@@ -289,6 +330,52 @@ def validate_archived_analysis_run_manifest(payload: dict[str, Any]) -> dict[str, Any]:
     return _validate_analysis_manifest(payload, require_archived_artifacts=False)
 
 
+def validate_current_search_manifest(payload: dict[str, Any]) -> dict[str, Any]:
+    schema_version = int(payload.get("schema_version", SEARCH_MANIFEST_SCHEMA_VERSION))
+    if schema_version != SEARCH_MANIFEST_SCHEMA_VERSION:
+        raise ValueError(
+            f"Current search manifest schema_version must be {SEARCH_MANIFEST_SCHEMA_VERSION}."
+        )
+    repo = str(payload.get("repo") or "").strip()
+    snapshot_id = str(payload.get("snapshot_id") or "").strip()
+    run_id = str(payload.get("run_id") or "").strip()
+    published_at = str(payload.get("published_at") or "").strip()
+    artifacts = payload.get("artifacts")
+    if not repo:
+        raise ValueError("Current search manifest must define repo.")
+    if not snapshot_id:
+        raise ValueError("Current search manifest must define snapshot_id.")
+    if not run_id:
+        raise ValueError("Current search manifest must define run_id.")
+    if not published_at:
+        raise ValueError("Current search manifest must define published_at.")
+    if not isinstance(artifacts, dict):
+        raise ValueError("Current search manifest must define an artifacts object.")
+    db_path = artifacts.get("db")
+    if db_path != CURRENT_SEARCH_DB_PATH:
+        raise ValueError(f"Current search manifest db artifact must be {CURRENT_SEARCH_DB_PATH!r}.")
+    return {
+        "schema_version": SEARCH_MANIFEST_SCHEMA_VERSION,
+        "repo": repo,
+        "snapshot_id": snapshot_id,
+        "run_id": run_id,
+        "published_at": published_at,
+        "source_type": (
+            None if payload.get("source_type") is None else str(payload.get("source_type"))
+        ),
+        "hf_repo_id": None if payload.get("hf_repo_id") is None else str(payload.get("hf_repo_id")),
+        "hf_revision": (
+            None if payload.get("hf_revision") is None else str(payload.get("hf_revision"))
+        ),
+        "artifacts": {"db": CURRENT_SEARCH_DB_PATH},
+        "row_counts": (
+            {str(key): value for key, value in payload.get("row_counts", {}).items()}
+            if isinstance(payload.get("row_counts"), dict)
+            else {}
+        ),
+    }
+
+
 def load_latest_snapshot_pointer(snapshots_root: Path) -> Path | None:
     resolved_snapshots_root = snapshots_root.resolve()
     latest_path = resolved_snapshots_root / "latest.json"
```
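For reference, a manifest that round-trips through `validate_current_search_manifest` can be produced like this; all field values are illustrative:

```python
from slop_farmer.data.snapshot_paths import build_current_search_manifest

manifest = build_current_search_manifest(
    repo="huggingface/transformers",
    snapshot_id="snap-2024-06-01",        # hypothetical snapshot id
    run_id="run-0001",                    # hypothetical run id
    published_at="2024-06-01T00:05:00Z",
    source_type="hf",
    hf_repo_id="evalstate/transformers-pr",
    hf_revision=None,
    row_counts={"pull_requests": 1000},   # hypothetical counts
)
# The validator pins the db artifact to the canonical path under search/current/.
assert manifest["artifacts"]["db"] == "search/current/pr-search.duckdb"
```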