# Continue Testing Guide

This file documents the test approach and harness used to improve and validate the `hf_hub_community` prompt.

## Objectives

We optimized for:

1. **Quality/correctness** on realistic user tasks
2. **Functional coverage** of the API surface
3. **Efficiency** (token and tool-call cost)
4. **Safe behavior** on destructive/unsupported actions

## Prompt versions

- **v1** = original long prompt (high-quality baseline)
  - Reference card: `.fast-agent/evals/hf_hub_only/hf_hub_community.md`
- **v2** = compact prompt (major efficiency gains, minor regressions)
  - Reference card: `.fast-agent/evals/hf_hub_prompt_compact/cards/hf_hub_community.md`
- **v3** = compact + targeted anti-regression rules
  - Reference card: `.fast-agent/evals/hf_hub_prompt_v3/cards/hf_hub_community.md`
- **Current production card**: `.fast-agent/tool-cards/hf_hub_community.md`

Variant registry:

- `scripts/hf_hub_prompt_variants.json`

## Harness components

### 1) Quality pack (challenge prompts)

- Prompts: `scripts/hf_hub_community_challenges.txt`
- Runner/scorer: `scripts/score_hf_hub_community_challenges.py`
  - Scores endpoint/efficiency/reasoning/safety/clarity (/10 per case)
  - Also records usage metrics:
    - tool-call count
    - input/output/total tokens

### 2) Coverage pack (non-overlapping API coverage)

- Cases: `scripts/hf_hub_community_coverage_prompts.json`
- Runner/scorer: `scripts/score_hf_hub_community_coverage.py`
- Targets endpoint/method correctness for capabilities not fully stressed by the challenge pack

### 3) Prompt A/B runner

- Script: `scripts/eval_hf_hub_prompt_ab.py`
- Runs **both packs** per variant/model
- Produces a combined summary plus plots:
  - `docs/hf_hub_prompt_ab/prompt_ab_summary.{md,json,csv}`
  - plots under `docs/hf_hub_prompt_ab/`

### 4) Single-variant runner (for follow-up iterations)

- Script: `scripts/run_hf_hub_prompt_variant.py`
- Useful when testing only one new prompt version (e.g. v4)

## Decision rule used

We evaluate with a balanced view:

1. Challenge quality score (primary)
2. Coverage endpoint/method match rates
3. Total tool calls and tokens (efficiency tie-breakers)

Composite used in the harness summary:

- `0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method`

## Why v3 was promoted

Observed trend:

- v1: best raw quality, very high token/tool cost
- v2: large efficiency gains, small functional regressions
- v3: recovered those regressions while retaining v2-like efficiency

Result: v3 offers the best quality/efficiency tradeoff and is now the production card.

## Re-running tests

### Full A/B across variants

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants v1=.fast-agent/evals/hf_hub_only,v2=.fast-agent/evals/hf_hub_prompt_compact/cards,v3=.fast-agent/tool-cards \
  --models gpt-oss \
  --timeout 240
```

### Run just the current production prompt (v3)

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/tool-cards \
  --model gpt-oss
```

## Next recommended loop (for v4+)

1. Duplicate the v3 card into `.fast-agent/evals/hf_hub_prompt_v4/cards/`
2. Make one focused prompt change
3. Run `run_hf_hub_prompt_variant.py` for v4
4. If promising, run `eval_hf_hub_prompt_ab.py` with v3 vs. v4
5. Promote only if quality is maintained or improved at acceptable cost

## Deployment workflow

Space target:

- `spaces/evalstate/hf-papers`
- `https://huggingface.co/spaces/evalstate/hf-papers/`

Use the `hf` CLI deployment helper:

```bash
scripts/publish_space.sh
```

This script uploads changed files (excluding noisy local artifacts) to the Space via `hf upload`.
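As a closing illustration, the composite ranking and promotion rule described above can be sketched in Python. The weights (0.6/0.3/0.1) are the ones the harness summary reports; the function names, the cost tolerance, and all numeric values below are illustrative assumptions, not code or results from the harness.

```python
def composite_score(challenge_quality: float,
                    coverage_endpoint: float,
                    coverage_method: float) -> float:
    """Weighted composite from the harness summary:
    0.6 * challenge_quality + 0.3 * coverage_endpoint + 0.1 * coverage_method.
    """
    return (0.6 * challenge_quality
            + 0.3 * coverage_endpoint
            + 0.1 * coverage_method)


def should_promote(candidate: dict, incumbent: dict,
                   cost_tolerance: float = 1.10) -> bool:
    """Promote only if the composite is maintained/improved and total token
    cost stays within an (illustrative) tolerance of the incumbent's."""
    cand = composite_score(candidate["quality"], candidate["endpoint"], candidate["method"])
    inc = composite_score(incumbent["quality"], incumbent["endpoint"], incumbent["method"])
    return cand >= inc and candidate["tokens"] <= cost_tolerance * incumbent["tokens"]


# Illustrative numbers only (not measured results):
v3 = {"quality": 8.4, "endpoint": 0.92, "method": 0.88, "tokens": 41_000}
v4 = {"quality": 8.6, "endpoint": 0.92, "method": 0.90, "tokens": 43_000}
print(round(composite_score(8.6, 0.92, 0.90), 3))  # -> 5.526
print(should_promote(v4, v3))                      # -> True (quality up, cost within 10%)
```

Note the mixed scales: challenge quality is on a /10 scale while coverage match rates are fractions, so the composite is only meaningful for ranking variants against each other, not as an absolute score.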