# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.
## Objectives

We optimized for:

- Quality/correctness on realistic user tasks
- Functional coverage of the API surface
- Efficiency (token and tool-call cost)
- Safe behavior on destructive/unsupported actions
## Prompt versions

- v1 = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- v2 = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- v3 = compact + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
  - Current production card: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry: `scripts/hf_hub_prompt_variants.json`
## Harness components

1) Quality pack (challenge prompts)
   - Prompts: `scripts/hf_hub_community_challenges.txt`
   - Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
   - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
   - Also records usage metrics:
     - tool-call count
     - input/output/total tokens
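For orientation, one scored challenge case might be represented as below. The field names are hypothetical (the real schema lives in `scripts/score_hf_hub_community_challenges.py`), but the five /10 dimensions and the usage metrics mirror the list above:

```python
# Hypothetical shape of one scored challenge case; the actual scorer's
# field names may differ, but it records the same five /10 dimensions
# plus per-case usage metrics.
case = {
    "scores": {"endpoint": 9, "efficiency": 8, "reasoning": 9,
               "safety": 10, "clarity": 8},
    "usage": {"tool_calls": 3, "input_tokens": 1200,
              "output_tokens": 350, "total_tokens": 1550},
}

def mean_quality(case: dict) -> float:
    """Average the five per-dimension /10 scores of a single case."""
    scores = case["scores"].values()
    return sum(scores) / len(scores)
```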
2) Coverage pack (non-overlapping API coverage)
   - Cases: `scripts/hf_hub_community_coverage_prompts.json`
   - Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
   - Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack
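Endpoint/method "match rate" here is just the fraction of cases whose chosen endpoint (or HTTP method) equals the expected one. A minimal sketch under that assumption (the real scorer may normalize values differently):

```python
def match_rate(expected: list, actual: list) -> float:
    """Fraction of cases where the model's choice exactly matches the
    expected endpoint/method. Assumes the two lists are aligned by case."""
    if not expected:
        return 0.0
    hits = sum(e == a for e, a in zip(expected, actual))
    return hits / len(expected)
```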
3) Prompt A/B runner
   - Script: `scripts/eval_hf_hub_prompt_ab.py`
   - Runs both packs per variant/model
   - Produces a combined summary (`docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`) plus plots under `docs/hf_hub_prompt_ab/`
4) Single-variant runner (for follow-up iterations)
   - Script: `scripts/run_hf_hub_prompt_variant.py`
   - Useful when testing only one new prompt version (e.g. v4)
## Decision rule used

We evaluate with a balanced view:

- Challenge quality score (primary)
- Coverage endpoint/method match rates
- Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:
`0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`
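The composite is easy to reproduce for sanity checks; a sketch using the weights above (argument names are illustrative, not the harness's actual keys):

```python
def composite_score(challenge_quality: float,
                    coverage_endpoint: float,
                    coverage_method: float) -> float:
    """Weighted composite used to rank prompt variants in the summary:
    60% challenge quality, 30% endpoint match, 10% method match."""
    return (0.6 * challenge_quality
            + 0.3 * coverage_endpoint
            + 0.1 * coverage_method)
```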
## Why v3 was promoted

Observed trend:

- v1: best raw quality, very high token/tool cost
- v2: large efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 offers the best quality/efficiency tradeoff and is now in production.
## Re-running tests

Full A/B across variants:

```shell
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```
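The `--variants` value is a comma-separated list of `id=cards_dir` pairs; a sketch of how such a flag can be parsed (not necessarily the runner's actual argument handling):

```python
def parse_variants(spec: str) -> dict:
    """Parse a comma-separated list of id=cards_dir pairs,
    e.g. 'v1=dirA,v2=dirB' -> {'v1': 'dirA', 'v2': 'dirB'}."""
    pairs = (item.split("=", 1) for item in spec.split(",") if item)
    return {variant_id: cards_dir for variant_id, cards_dir in pairs}
```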
Run just the current production variant (v3):

```shell
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```
## Next recommended loop (for v4+)

- Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
- Make one focused prompt change
- Run `run_hf_hub_prompt_variant.py` for v4
- If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs v4
- Promote only if quality is maintained/improved with acceptable cost
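The promotion criterion ("quality maintained/improved with acceptable cost") can be made explicit; a sketch in which the 10% cost tolerance is an illustrative choice, not a harness constant:

```python
def should_promote(candidate: dict, baseline: dict,
                   max_cost_ratio: float = 1.10) -> bool:
    """Promote only if the candidate's composite is at least the
    baseline's and its token cost stays within max_cost_ratio of the
    baseline's. The 1.10 tolerance is illustrative, not a fixed rule."""
    quality_ok = candidate["composite"] >= baseline["composite"]
    cost_ok = candidate["total_tokens"] <= max_cost_ratio * baseline["total_tokens"]
    return quality_ok and cost_ok
```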
## Deployment workflow

Space target: `spaces/evalstate/hf-papers` (https://huggingface.co/spaces/evalstate/hf-papers/)

Use the `hf` CLI deployment helper: `scripts/publish_space.sh`

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.