hf-papers / continue-testing.md
evalstate HF Staff
docs: add continue-testing guide and hf CLI deployment helper updates (commit 30dc122, verified)
# Continue Testing Guide
This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.
## Objectives
We optimized for:
1. **Quality/Correctness** on realistic user tasks
2. **Functional Coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions
## Prompt versions
- **v1** = original long prompt (high quality baseline)
- Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
- Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact + targeted anti-regression rules
- Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`
Variant registry:
- `scripts/hf_hub_prompt_variants.json`
## Harness components
### 1) Quality pack (challenge prompts)
- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores each case on endpoint, efficiency, reasoning, safety, and clarity (out of 10 each)
- Also records usage metrics:
- tool-call count
- input/output/total tokens
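The per-case usage metrics feed the harness summary as averages. A minimal sketch of that aggregation, assuming a hypothetical per-case record shape (the real scorer's field names may differ):

```python
from statistics import mean

# Hypothetical per-case usage records; the actual scorer's schema may differ.
cases = [
    {"tool_calls": 3, "input_tokens": 1200, "output_tokens": 350},
    {"tool_calls": 5, "input_tokens": 2100, "output_tokens": 500},
]

def summarize_usage(cases):
    """Average tool-call count and total token cost across scored cases."""
    return {
        "avg_tool_calls": mean(c["tool_calls"] for c in cases),
        "avg_total_tokens": mean(
            c["input_tokens"] + c["output_tokens"] for c in cases
        ),
    }

print(summarize_usage(cases))  # {'avg_tool_calls': 4, 'avg_total_tokens': 2075}
```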
### 2) Coverage pack (non-overlapping API coverage)
- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack
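The endpoint/method match rates reduce to a simple comparison of expected versus observed calls per case. A sketch under an assumed result shape (field names here are illustrative, not the coverage pack's actual JSON schema):

```python
# Hypothetical coverage results: expected vs. observed API call per case.
results = [
    {"expected_endpoint": "/api/models", "got_endpoint": "/api/models",
     "expected_method": "GET", "got_method": "GET"},
    {"expected_endpoint": "/api/datasets", "got_endpoint": "/api/models",
     "expected_method": "GET", "got_method": "POST"},
]

def match_rates(results):
    """Fraction of cases with the correct endpoint and the correct method."""
    n = len(results)
    endpoint = sum(r["expected_endpoint"] == r["got_endpoint"] for r in results) / n
    method = sum(r["expected_method"] == r["got_method"] for r in results) / n
    return endpoint, method

print(match_rates(results))  # (0.5, 0.5)
```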
### 3) Prompt A/B runner
- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces combined summary + plots:
- `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
- plots under `docs/hf_hub_prompt_ab/`
### 4) Single-variant runner (for follow-up iterations)
- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)
## Decision rule used
We rank variants on three criteria, in descending priority:
1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)
Composite used in harness summary:
- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`
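The composite is a straight weighted sum. A sketch, assuming all three inputs are normalized to the same 0-1 scale before mixing (the harness summary may normalize differently):

```python
def composite(challenge_quality, coverage_endpoint, coverage_method):
    """Harness composite: 60% challenge quality, 30% coverage endpoint
    match rate, 10% coverage method match rate."""
    return (0.6 * challenge_quality
            + 0.3 * coverage_endpoint
            + 0.1 * coverage_method)

# Example: quality 0.8, endpoint match 0.9, method match 0.8.
print(round(composite(0.8, 0.9, 0.8), 3))  # 0.83
```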
## Why v3 was promoted
Observed trend:
- v1: best raw quality, very high token/tool cost
- v2: huge efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency
Result: v3 offers the best quality/efficiency tradeoff and is now the production card.
## Re-running tests
### Full A/B across variants
```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```
### Run just current production (v3)
```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```
## Next recommended loop (for v4+)
1. Duplicate v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs v4
5. Promote only if quality is maintained/improved with acceptable cost
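The promotion decision in step 5 can be sketched as a simple gate. The threshold and record fields below are illustrative assumptions, not the project's actual policy:

```python
def should_promote(new, baseline, max_token_increase=0.10):
    """Promote only if composite quality is maintained or improved and
    total token cost grows by at most max_token_increase (assumed 10%)."""
    quality_ok = new["composite"] >= baseline["composite"]
    cost_ok = new["total_tokens"] <= baseline["total_tokens"] * (1 + max_token_increase)
    return quality_ok and cost_ok

# Hypothetical summary numbers for a v3-vs-v4 comparison.
v3 = {"composite": 0.83, "total_tokens": 40_000}
v4 = {"composite": 0.85, "total_tokens": 41_000}
print(should_promote(v4, v3))  # True
```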
## Deployment workflow
Space target:
- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`
Use the `hf` CLI deployment helper:
```bash
scripts/publish_space.sh
```
This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.
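The "excluding noisy local artifacts" step can be approximated with glob patterns. A sketch using `fnmatch`; the patterns below are assumptions, not the actual exclusion list in `publish_space.sh`:

```python
from fnmatch import fnmatch

# Illustrative exclusion patterns; the real script's list may differ.
EXCLUDE = ["*.log", "*.tmp", "__pycache__/*"]

def files_to_upload(paths):
    """Keep only files matching no exclusion pattern."""
    return [p for p in paths if not any(fnmatch(p, pat) for pat in EXCLUDE)]

print(files_to_upload(["app.py", "run.log", "__pycache__/x.pyc"]))  # ['app.py']
```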