# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.
## Objectives

We optimized for:

1. **Quality/Correctness** on realistic user tasks
2. **Functional Coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions
## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:

- `scripts/hf_hub_prompt_variants.json`
## Harness components

### 1) Quality pack (challenge prompts)

- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
- Scores five dimensions per case, each out of 10: endpoint, efficiency, reasoning, safety, clarity
- Also records usage metrics:
  - tool-call count
  - input/output/total tokens
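As a rough illustration, per-case quality can be summarized as the mean of the five dimensions. The record shape below is hypothetical, not the scorer's actual output format:

```python
# Hypothetical shape of one scored challenge case; the real output of
# scripts/score_hf_hub_community_challenges.py may differ.
case = {
    "scores": {"endpoint": 9, "efficiency": 8, "reasoning": 9, "safety": 10, "clarity": 9},
    "usage": {"tool_calls": 3, "input_tokens": 1200, "output_tokens": 340},
}

def case_quality(case: dict) -> float:
    """Mean of the five 0-10 dimensions for a single case."""
    scores = case["scores"].values()
    return sum(scores) / len(scores)

print(case_quality(case))  # 9.0
```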
### 2) Coverage pack (non-overlapping API coverage)

- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack
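The coverage pack's match rates reduce to simple proportions. A minimal sketch, with field names that are assumptions for illustration rather than the scorer's actual schema:

```python
def match_rates(results: list[dict]) -> tuple[float, float]:
    """Fraction of cases whose endpoint / HTTP method matched the expected value.

    Each result dict is assumed (hypothetically) to carry expected_* and actual_* keys.
    """
    n = len(results)
    endpoint = sum(r["actual_endpoint"] == r["expected_endpoint"] for r in results) / n
    method = sum(r["actual_method"] == r["expected_method"] for r in results) / n
    return endpoint, method

results = [
    {"expected_endpoint": "/api/models", "actual_endpoint": "/api/models",
     "expected_method": "GET", "actual_method": "GET"},
    {"expected_endpoint": "/api/datasets", "actual_endpoint": "/api/models",
     "expected_method": "GET", "actual_method": "GET"},
]
print(match_rates(results))  # (0.5, 1.0)
```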
### 3) Prompt A/B runner

- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces combined summary + plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`
### 4) Single-variant runner (for follow-up iterations)

- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g., v4)
## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:

- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`
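The composite can be expressed directly. A minimal sketch, assuming all three inputs are normalized to [0, 1]:

```python
def composite(challenge_quality: float, coverage_endpoint: float, coverage_method: float) -> float:
    # Weights from the harness summary: quality dominates; coverage acts as a tie-breaker.
    return 0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method

print(round(composite(0.9, 0.8, 0.7), 2))  # 0.85
```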
## Why v3 was promoted

Observed trend:

- v1: best raw quality, but very high token/tool cost
- v2: large efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 offers the best quality/efficiency tradeoff and is now in production.
## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```
### Run just the current production card (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```
## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs. v4
5. Promote only if quality is maintained or improved at acceptable cost
## Deployment workflow

Space target:

- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.