# Tool Description A/B Setup
This harness benchmarks how tool-description quality affects tool-use behavior and intent capture.
Assume commands are run from the repo root.
## Files added
- `scripts/eval_tool_description_ab.py`
- `scripts/tool_description_variants.json`
- outputs go to `docs/tool_description_eval/`
- generated variant cards go to `.fast-agent/evals/tool_desc_ab/cards/<variant>/`
## What it varies
For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set where it updates:
1. `hf_hub_community.md` frontmatter `description`
2. `hf_api_tool.py` function docstring for `hf_api_request`
Then it runs the same prompts across selected models.
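A variant entry in `tool_description_variants.json` might look like the following; the field names here are illustrative assumptions, not the script's actual schema:

```json
[
  {
    "name": "baseline",
    "card_description": "Query Hugging Face Hub community endpoints.",
    "docstring": "Make a request to the Hugging Face Hub API."
  },
  {
    "name": "verbose",
    "card_description": "Query Hugging Face Hub community endpoints (discussions, PRs, comments). Prefer this tool for any Hub community lookup.",
    "docstring": "Make a request to the Hugging Face Hub API. Pass the endpoint path and optional query parameters; returns parsed JSON."
  }
]
```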
Execution modes:
- **Direct (default):** runs `hf_hub_community` directly (best for endpoint-level scoring).
- **Indirect (`--indirect`):** runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.
## Metrics collected
Per run:
- return code
- whether tool was called
- endpoint call count
- first endpoint used
- first-call correctness (challenge-aware heuristics)
- challenge score (reusing `score_hf_hub_community_challenges.py` when available)
Aggregates by `(variant, model)`:
- success rate
- tool-use rate
- average endpoint calls
- first-call OK rate
- average score total
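The aggregation step above can be sketched as follows. Field names (`returncode`, `tool_called`, etc.) are assumptions based on the per-run metrics listed, not the script's actual record schema:

```python
from collections import defaultdict

def aggregate(runs):
    """Group per-run records by (variant, model) and compute summary rates.

    `runs` is a list of dicts, one per run; keys mirror the per-run
    metrics above (illustrative names, not the script's exact schema).
    """
    groups = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)

    summary = {}
    for key, rs in groups.items():
        n = len(rs)
        summary[key] = {
            "success_rate": sum(r["returncode"] == 0 for r in rs) / n,
            "tool_use_rate": sum(r["tool_called"] for r in rs) / n,
            "avg_endpoint_calls": sum(r["endpoint_calls"] for r in rs) / n,
            "first_call_ok_rate": sum(r["first_call_ok"] for r in rs) / n,
            "avg_score_total": sum(r["score_total"] for r in rs) / n,
        }
    return summary
```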
## Run
```bash
python scripts/eval_tool_description_ab.py \
--models gpt-oss \
--base-cards-dir .fast-agent/tool-cards \
--prompts scripts/hf_hub_community_challenges.txt \
--variants scripts/tool_description_variants.json \
--start 1 --end 10
```
Multi-model example:
```bash
python scripts/eval_tool_description_ab.py \
--models gpt-oss,gpt-5-mini,gpt-4.1-mini
```
Indirect (single sub-agent tool) example:
```bash
python scripts/eval_tool_description_ab.py \
--models gpt-oss \
--indirect
```
## Outputs
- `docs/tool_description_eval/tool_description_ab_detailed.json`
- `docs/tool_description_eval/tool_description_ab_summary.json`
- `docs/tool_description_eval/tool_description_ab_summary.csv`
- `docs/tool_description_eval/tool_description_ab_summary.md`
- `docs/tool_description_eval/tool_description_ab_pairwise.json`
- `docs/tool_description_eval/tool_description_ab_pairwise.csv`
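To compare variants after a run, the summary JSON can be loaded and sorted by a metric. This is a minimal sketch assuming the file is a flat list of row dicts with `model`, `variant`, and metric keys; the actual layout may differ:

```python
import json

def rank_variants(summary_path, model, metric="success_rate"):
    """Rank variants for one model by a summary metric.

    Assumes the summary JSON is a list of per-(variant, model) rows;
    key names are assumptions based on the aggregate metrics above.
    """
    with open(summary_path) as f:
        rows = json.load(f)
    rows = [r for r in rows if r.get("model") == model]
    return sorted(rows, key=lambda r: r.get(metric, 0.0), reverse=True)
```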
The `--models` argument takes comma-separated aliases/IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.