
Continue Testing Guide

This document describes the testing approach and harness used to improve and validate the hf_hub_community prompt.

Objectives

We optimized for:

  1. Quality/Correctness on realistic user tasks
  2. Functional Coverage of the API surface
  3. Efficiency (token and tool-call cost)
  4. Safe behavior on destructive/unsupported actions

Prompt versions

  • v1 = original long prompt (high-quality baseline)
    • Reference card: .fast-agent/evals/hf_hub_only/hf_hub_community.md
  • v2 = compact prompt (major efficiency gains, minor regressions)
    • Reference card: .fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md
  • v3 = compact + targeted anti-regression rules
    • Reference card: .fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md
    • Current production card: .fast-agent/tool-cards/hf_hub_community.md

Variant registry:

  • scripts/hf_hub_prompt_variants.json

Harness components

1) Quality pack (challenge prompts)

  • Prompts: scripts/hf_hub_community_challenges.txt
  • Runner/scorer: scripts/score_hf_hub_community_challenges.py
  • Scores each case on endpoint, efficiency, reasoning, safety, and clarity (0-10 each)
  • Also records usage metrics:
    • tool-call count
    • input/output/total tokens
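The scored dimensions and usage metrics above can be sketched as a per-case record. This is a minimal illustration; the field names and aggregation are assumptions, not the actual schema used by score_hf_hub_community_challenges.py:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ChallengeResult:
    # Hypothetical per-case record; real field names may differ.
    endpoint: float    # each dimension scored out of 10
    efficiency: float
    reasoning: float
    safety: float
    clarity: float
    tool_calls: int
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def quality(self) -> float:
        # Unweighted mean of the five scored dimensions (an assumption).
        return mean([self.endpoint, self.efficiency, self.reasoning,
                     self.safety, self.clarity])

results = [
    ChallengeResult(9, 8, 9, 10, 8, tool_calls=3, input_tokens=1200, output_tokens=400),
    ChallengeResult(7, 9, 8, 10, 9, tool_calls=2, input_tokens=900, output_tokens=350),
]
avg_quality = mean(r.quality for r in results)
total_calls = sum(r.tool_calls for r in results)
```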

2) Coverage pack (non-overlapping API coverage)

  • Cases: scripts/hf_hub_community_coverage_prompts.json
  • Runner/scorer: scripts/score_hf_hub_community_coverage.py
  • Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack
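A minimal sketch of how the endpoint and method match rates could be computed; the case shape shown here is assumed for illustration, not the real JSON format of the coverage pack:

```python
# Each case pairs the expected (method, endpoint) with what the model called.
cases = [
    {"expected": ("GET", "/api/models"),       "actual": ("GET", "/api/models")},
    {"expected": ("POST", "/api/discussions"), "actual": ("GET", "/api/discussions")},
    {"expected": ("GET", "/api/datasets"),     "actual": ("GET", "/api/models")},
]

def match_rates(cases):
    # Endpoint rate: right path regardless of HTTP method.
    endpoint_hits = sum(c["expected"][1] == c["actual"][1] for c in cases)
    # Method rate: right path AND right HTTP method.
    method_hits = sum(c["expected"] == c["actual"] for c in cases)
    n = len(cases)
    return endpoint_hits / n, method_hits / n

endpoint_rate, method_rate = match_rates(cases)
```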

3) Prompt A/B runner

  • Script: scripts/eval_hf_hub_prompt_ab.py
  • Runs both packs per variant/model
  • Produces combined summary + plots:
    • docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}
    • plots under docs/hf_hub_prompt_ab/

4) Single-variant runner (for follow-up iterations)

  • Script: scripts/run_hf_hub_prompt_variant.py
  • Useful when testing only one new prompt version (e.g. v4)

Decision rule used

Variants are compared on three criteria, in priority order:

  1. Challenge quality score (primary)
  2. Coverage endpoint/method match rates
  3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in harness summary:

  • 0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method
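The composite can be written out directly. Note the sketch assumes all three inputs are on the same scale (e.g. normalised to 0-1); how the harness actually normalises the /10 challenge quality against the coverage rates is not specified here:

```python
def composite(challenge_quality, coverage_endpoint, coverage_method):
    """Composite used in the harness summary (weights from this guide)."""
    return (0.6 * challenge_quality
            + 0.3 * coverage_endpoint
            + 0.1 * coverage_method)

# Illustrative values, all normalised to 0-1.
score = composite(0.87, 0.92, 0.85)
```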

Why v3 was promoted

Observed trend:

  • v1: best raw quality, very high token/tool cost
  • v2: huge efficiency gains, small functional regressions
  • v3: recovered those regressions while retaining v2-like efficiency

Result: v3 offers the best quality/efficiency tradeoff and is now in production.

Re-running tests

Full A/B across variants

python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
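The --variants argument is a comma-separated list of id=cards-dir pairs, as shown above. A sketch of how such a spec could be parsed (the actual parsing in eval_hf_hub_prompt_ab.py may differ):

```python
def parse_variants(spec):
    """Parse 'v1=path/a,v2=path/b' into {'v1': 'path/a', 'v2': 'path/b'}.

    Illustrative only; not taken from eval_hf_hub_prompt_ab.py.
    """
    out = {}
    for item in spec.split(","):
        variant_id, _, cards_dir = item.partition("=")
        out[variant_id.strip()] = cards_dir.strip()
    return out

variants = parse_variants("v1=.fast-agent/evals/hf_hub_only,v3=.fast-agent/tool-cards")
```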

Run just current production (v3)

python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss

Next recommended loop (for v4+)

  1. Duplicate the v3 card into .fast-agent/evals/hf_hub_prompt_v4/cards/
  2. Make one focused prompt change
  3. Run run_hf_hub_prompt_variant.py for v4
  4. If promising, run eval_hf_hub_prompt_ab.py with v3 vs v4
  5. Promote only if quality is maintained/improved with acceptable cost
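Step 5's promotion rule can be made explicit as a simple gate. The thresholds below are illustrative assumptions, not harness defaults:

```python
def should_promote(new, baseline, quality_tol=0.0, cost_headroom=1.10):
    """Promote only if composite quality is maintained/improved and total
    token cost stays within budget (here: at most 10% above baseline).

    `new` and `baseline` are dicts with 'composite' and 'total_tokens';
    this shape and the default thresholds are assumptions.
    """
    quality_ok = new["composite"] >= baseline["composite"] - quality_tol
    cost_ok = new["total_tokens"] <= baseline["total_tokens"] * cost_headroom
    return quality_ok and cost_ok

ok = should_promote({"composite": 0.89, "total_tokens": 5200},
                    {"composite": 0.88, "total_tokens": 5000})
```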

Deployment workflow

Space target:

  • spaces/evalstate/hf-papers
  • https://huggingface.co/spaces/evalstate/hf-papers/

Use hf CLI deployment helper:

scripts/publish_space.sh

This script uploads changed files (excluding noisy local artifacts) to the Space via hf upload.
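In Python, the upload step is roughly equivalent to huggingface_hub's upload_folder with ignore patterns. This is a sketch of what publish_space.sh effectively does; the exclude list below is an assumption standing in for the script's actual one:

```python
from huggingface_hub import HfApi

# Assumed stand-in for the script's "noisy local artifacts" exclude list.
IGNORE = ["*.pyc", "__pycache__/*", ".fast-agent/*", "docs/*"]

def publish(local_dir=".", repo_id="evalstate/hf-papers", dry_run=True):
    """Upload the working tree to the Space, skipping ignored patterns.

    dry_run=True returns the plan without touching the network.
    """
    if dry_run:
        return repo_id, IGNORE
    api = HfApi()  # requires a logged-in token with write access
    api.upload_folder(
        folder_path=local_dir,
        repo_id=repo_id,
        repo_type="space",
        ignore_patterns=IGNORE,
    )

repo, patterns = publish(dry_run=True)
```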