# Tool Description A/B Setup

This harness benchmarks how tool-description quality affects tool-use behavior and intent capture.

Assume all commands are run from the repo root.
## Files added

- `scripts/eval_tool_description_ab.py`
- `scripts/tool_description_variants.json`
- Outputs go to `docs/tool_description_eval/`.
- Generated variant cards go to `.fast-agent/evals/tool_desc_ab/cards/<variant>/`.
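The script defines the real schema of `tool_description_variants.json`; a minimal sketch of a plausible shape is below. The variant names and field names (`card_description`, `docstring`) are illustrative assumptions, not the script's actual keys:

```json
{
  "baseline": {
    "card_description": "Query Hugging Face Hub community resources.",
    "docstring": "Make a request against the Hugging Face Hub API."
  },
  "verbose": {
    "card_description": "Query Hugging Face Hub community resources (models, datasets, discussions).",
    "docstring": "Make a single request against a named Hugging Face Hub API endpoint and return the parsed JSON response."
  }
}
```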
## What it varies

For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set in which it updates:

1. the `description` field in the `hf_hub_community.md` frontmatter
2. the `hf_api_request` function docstring in `hf_api_tool.py`

It then runs the same prompts across the selected models.

Execution modes:

- **Direct (default):** runs `hf_hub_community` directly (best for endpoint-level scoring).
- **Indirect (`--indirect`):** runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.
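The frontmatter swap in step 1 above can be sketched as follows. This is a simplified illustration, not the script's actual implementation: it assumes the card's frontmatter is delimited by `---` lines and that `description` is a single-line scalar (the real script may use a YAML parser instead).

```python
import re

def set_card_description(card_text: str, new_description: str) -> str:
    """Replace the `description:` field inside markdown YAML frontmatter."""
    # Split at the closing `---` so the regex only touches the frontmatter.
    head, sep, body = card_text.partition("\n---\n")
    head = re.sub(
        r"^description:.*$",
        f"description: {new_description}",
        head,
        count=1,
        flags=re.MULTILINE,
    )
    return head + sep + body

# Hypothetical card content for illustration.
card = """---
name: hf_hub_community
description: old text
---
Body of the tool card.
"""
print(set_card_description(card, "Search HF Hub community resources."))
```

Per-variant card sets would then be written under `.fast-agent/evals/tool_desc_ab/cards/<variant>/` before each run.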
## Metrics collected

Per run:

- return code
- whether the tool was called
- endpoint call count
- first endpoint used
- first-call correctness (challenge-aware heuristics)
- challenge score (reusing `score_hf_hub_community_challenges.py` when available)

Aggregated by `(variant, model)`:

- success rate
- tool-use rate
- average endpoint calls
- first-call OK rate
- average total score
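The per-`(variant, model)` aggregation above can be sketched as below. The record field names (`ok`, `tool_called`, `endpoint_calls`, `first_call_ok`, `score_total`) are hypothetical; check the detailed JSON output for the real ones.

```python
from collections import defaultdict

def aggregate(runs):
    """Group per-run records by (variant, model) and compute rates/averages."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)
    summary = {}
    for key, rs in groups.items():
        n = len(rs)
        summary[key] = {
            "success_rate": sum(r["ok"] for r in rs) / n,
            "tool_use_rate": sum(r["tool_called"] for r in rs) / n,
            "avg_endpoint_calls": sum(r["endpoint_calls"] for r in rs) / n,
            "first_call_ok_rate": sum(r["first_call_ok"] for r in rs) / n,
            "avg_score_total": sum(r["score_total"] for r in rs) / n,
        }
    return summary

# Two synthetic runs for the same cell.
runs = [
    {"variant": "baseline", "model": "gpt-oss", "ok": 1, "tool_called": 1,
     "endpoint_calls": 2, "first_call_ok": 1, "score_total": 3.0},
    {"variant": "baseline", "model": "gpt-oss", "ok": 0, "tool_called": 1,
     "endpoint_calls": 4, "first_call_ok": 0, "score_total": 1.0},
]
print(aggregate(runs)[("baseline", "gpt-oss")])
```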
## Run

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10
```
Multi-model example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini
```
Indirect (single sub-agent tool) example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect
```
## Outputs

- `docs/tool_description_eval/tool_description_ab_detailed.json`
- `docs/tool_description_eval/tool_description_ab_summary.json`
- `docs/tool_description_eval/tool_description_ab_summary.csv`
- `docs/tool_description_eval/tool_description_ab_summary.md`
- `docs/tool_description_eval/tool_description_ab_pairwise.json`
- `docs/tool_description_eval/tool_description_ab_pairwise.csv`

The `--models` value is a comma-separated list of aliases/IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.
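The summary CSV can be inspected with the standard library. The column names below (`variant`, `model`, `success_rate`, `first_call_ok_rate`) are assumptions for illustration; check the generated header for the real ones, and substitute the real file path for the inline sample.

```python
import csv
import io

# Synthetic stand-in for tool_description_ab_summary.csv (hypothetical columns).
sample = """variant,model,success_rate,first_call_ok_rate
baseline,gpt-oss,0.6,0.5
verbose,gpt-oss,0.8,0.7
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Pick the variant with the highest success rate for this model.
best = max(rows, key=lambda r: float(r["success_rate"]))
print(best["variant"], best["success_rate"])  # → verbose 0.8
```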