# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.

## Objectives

We optimized for:

1. **Quality/correctness** on realistic user tasks
2. **Functional coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions

## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:

- `scripts/hf_hub_prompt_variants.json`

## Harness components

### 1) Quality pack (challenge prompts)

- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
  - Also records usage metrics:
    - tool-call count
    - input/output/total tokens

### 2) Coverage pack (non-overlapping API coverage)

- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack

### 3) Prompt A/B runner

- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces a combined summary plus plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`

### 4) Single-variant runner (for follow-up iterations)

- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)

## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:

- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`

## Why v3 was promoted

Observed trend:

- v1: best raw quality, very high token/tool cost
- v2: large efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 offers the best quality/efficiency tradeoff and is now the production card.

## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```

### Run just the current production prompt (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```

## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs. v4
5. Promote only if quality is maintained or improved at acceptable cost

## Deployment workflow

Space target:

- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.
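As a closing illustration, the composite ranking and promotion rule described above can be sketched in Python. The weights (0.6/0.3/0.1) are the ones the harness summary reports; the function names, the cost tolerance, and all numeric values below are illustrative assumptions, not code or results from the harness.

```python
def composite_score(challenge_quality: float,
                    coverage_endpoint: float,
                    coverage_method: float) -> float:
    """Weighted composite from the harness summary:
    0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method.
    """
    return (0.6 * challenge_quality
            + 0.3 * coverage_endpoint
            + 0.1 * coverage_method)


def should_promote(candidate: dict, incumbent: dict,
                   cost_tolerance: float = 1.10) -> bool:
    """Promote only if the composite is maintained/improved and total token
    cost stays within an (illustrative) tolerance of the incumbent's."""
    cand = composite_score(candidate["quality"], candidate["endpoint"], candidate["method"])
    inc = composite_score(incumbent["quality"], incumbent["endpoint"], incumbent["method"])
    return cand >= inc and candidate["tokens"] <= cost_tolerance * incumbent["tokens"]


# Illustrative numbers only (not measured results):
v3 = {"quality": 8.4, "endpoint": 0.92, "method": 0.88, "tokens": 41_000}
v4 = {"quality": 8.6, "endpoint": 0.92, "method": 0.90, "tokens": 43_000}
print(round(composite_score(8.6, 0.92, 0.90), 3))  # -> 5.526
print(should_promote(v4, v3))                      # -> True (quality up, cost within 10%)
```

Note the mixed scales: challenge quality is on a /10 scale while coverage match rates are fractions, so the composite is only meaningful for ranking variants against each other, not as an absolute score.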