# Tool Description A/B Setup

This harness benchmarks how tool-description quality affects tool-use quality and intent capture. Assume all commands are run from the repo root.

## Files added

- `scripts/eval_tool_description_ab.py`
- `scripts/tool_description_variants.json`
- outputs go to `docs/tool_description_eval/`
- generated variant cards go to `.fast-agent/evals/tool_desc_ab/cards//`

## What it varies

For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set in which it updates:

1. the `description` field in the `hf_hub_community.md` frontmatter
2. the function docstring of `hf_api_request` in `hf_api_tool.py`

It then runs the same prompts across the selected models.

Execution modes:

- **Direct (default):** runs `hf_hub_community` directly (best for endpoint-level scoring).
- **Indirect (`--indirect`):** runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.

## Metrics collected

Per run:

- return code
- whether the tool was called
- endpoint call count
- first endpoint used
- first-call correctness (challenge-aware heuristics)
- challenge score (reusing `score_hf_hub_community_challenges.py` when available)

Aggregated by `(variant, model)`:

- success rate
- tool-use rate
- average endpoint calls
- first-call OK rate
- average total score

## Run

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10
```

Multi-model example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini
```

Indirect (single sub-agent tool) example:

```bash
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect
```

## Outputs

- `docs/tool_description_eval/tool_description_ab_detailed.json`
- `docs/tool_description_eval/tool_description_ab_summary.json`
- `docs/tool_description_eval/tool_description_ab_summary.csv`
- `docs/tool_description_eval/tool_description_ab_summary.md`
- `docs/tool_description_eval/tool_description_ab_pairwise.json`
- `docs/tool_description_eval/tool_description_ab_pairwise.csv`

The model list is comma-separated aliases/IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.
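The exact schema of `scripts/tool_description_variants.json` is defined by the script and not shown here; as a rough sketch, each variant plausibly names itself and supplies the two texts the harness swaps in (the card frontmatter `description` and the `hf_api_request` docstring). All field names below (`name`, `card_description`, `docstring`) are assumptions for illustration only:

```python
import json

# Hypothetical sketch of tool_description_variants.json — field names are
# illustrative assumptions, not the script's actual schema.
variants = [
    {
        "name": "baseline",
        # Swapped into the `description` frontmatter of hf_hub_community.md
        "card_description": "Query Hugging Face Hub community endpoints.",
        # Swapped into the hf_api_request docstring in hf_api_tool.py
        "docstring": "Call a Hugging Face Hub API endpoint and return JSON.",
    },
    {
        "name": "detailed",
        "card_description": (
            "Query Hugging Face Hub community data (discussions, comments, "
            "likes); prefer the most specific endpoint for the request."
        ),
        "docstring": (
            "Make one request to a Hugging Face Hub API endpoint, passing "
            "the endpoint path and optional query parameters."
        ),
    },
]

print(json.dumps(variants, indent=2))
```

Under this assumed shape, A/B comparison reduces to running the same prompt set once per variant and diffing the per-variant aggregates.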
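The `(variant, model)` aggregation behind the summary files can be sketched as follows. The per-run record fields (`returncode`, `tool_called`, `endpoint_calls`, `first_call_ok`, `score`) are assumed names for illustration, not the script's actual detailed-JSON schema:

```python
from collections import defaultdict

# Illustrative per-run records; field names are assumptions, not the
# harness's actual output schema.
runs = [
    {"variant": "baseline", "model": "gpt-oss", "returncode": 0,
     "tool_called": True, "endpoint_calls": 2, "first_call_ok": True, "score": 3.0},
    {"variant": "baseline", "model": "gpt-oss", "returncode": 1,
     "tool_called": False, "endpoint_calls": 0, "first_call_ok": False, "score": 0.0},
]

def aggregate(runs):
    """Group runs by (variant, model) and compute the five summary metrics."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)
    summary = {}
    for key, rs in groups.items():
        n = len(rs)
        summary[key] = {
            "success_rate": sum(r["returncode"] == 0 for r in rs) / n,
            "tool_use_rate": sum(r["tool_called"] for r in rs) / n,
            "avg_endpoint_calls": sum(r["endpoint_calls"] for r in rs) / n,
            "first_call_ok_rate": sum(r["first_call_ok"] for r in rs) / n,
            "avg_score": sum(r["score"] for r in rs) / n,
        }
    return summary

print(aggregate(runs)[("baseline", "gpt-oss")])
# → success 0.5, tool use 0.5, 1.0 avg calls, first-call OK 0.5, avg score 1.5
```

Keying on the `(variant, model)` tuple keeps each model's sensitivity to description quality separable, which is the comparison the pairwise outputs report.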