Tool Description A/B Setup
This harness benchmarks how tool-description quality affects tool-use behavior and intent capture.
Assume commands are run from the repo root.
Files added
- `scripts/eval_tool_description_ab.py`
- `scripts/tool_description_variants.json`
- outputs go to `docs/tool_description_eval/`
- generated variant cards go to `.fast-agent/evals/tool_desc_ab/cards/<variant>/`
What it varies
For each variant in `tool_description_variants.json`, the script creates a temporary tool-card set where it updates:
- the frontmatter `description` in `hf_hub_community.md`
- the function docstring for `hf_api_request` in `hf_api_tool.py`

It then runs the same prompts across the selected models.
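The variant-application step can be sketched as below. This is a minimal illustration, not the script's actual code: it assumes the card has a single-line `description:` frontmatter field and that `hf_api_request` has a one-line docstring, and the helper name `apply_variant` is hypothetical.

```python
import re
from pathlib import Path

def apply_variant(card_dir: Path, out_dir: Path, description: str) -> None:
    """Copy a tool-card pair into out_dir, swapping in a variant description."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # Rewrite the frontmatter `description` field in the markdown card.
    md = (card_dir / "hf_hub_community.md").read_text()
    md = re.sub(r"(?m)^description:.*$", f"description: {description}", md, count=1)
    (out_dir / "hf_hub_community.md").write_text(md)

    # Rewrite the docstring of hf_api_request in the tool module.
    py = (card_dir / "hf_api_tool.py").read_text()
    py = re.sub(
        r'(def hf_api_request\([^)]*\):\n\s*)"""[^"]*"""',
        lambda m: f'{m.group(1)}"""{description}"""',
        py,
        count=1,
    )
    (out_dir / "hf_api_tool.py").write_text(py)
```

Writing each variant into its own directory (rather than mutating the base cards) keeps the base card set untouched and lets all variants coexist under `.fast-agent/evals/tool_desc_ab/cards/<variant>/`.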
Execution modes:
- Direct (default): runs `hf_hub_community` directly (best for endpoint-level scoring).
- Indirect (`--indirect`): runs via a generated wrapper agent that exposes exactly one sub-agent tool: `hf_hub_community`.
Metrics collected
Per run:
- return code
- whether tool was called
- endpoint call count
- first endpoint used
- first-call correctness (challenge-aware heuristics)
- challenge score (reusing `score_hf_hub_community_challenges.py` when available)
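A per-run record might look like the following. The field names are illustrative, mirroring the metrics listed above, and are not necessarily the script's actual schema:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class RunMetrics:
    """One record per (variant, model, prompt) run; illustrative schema only."""
    variant: str
    model: str
    prompt_id: int
    return_code: int
    tool_called: bool
    endpoint_calls: int
    first_endpoint: Optional[str]
    first_call_ok: Optional[bool]   # challenge-aware correctness heuristic
    score_total: Optional[float]    # from the challenge scorer, when available
```

Keeping every per-run field in the detailed output means aggregates can always be recomputed offline without rerunning any model.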
Aggregates by (variant, model):
- success rate
- tool-use rate
- average endpoint calls
- first-call OK rate
- average score total
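The roll-up from per-run records to (variant, model) aggregates can be sketched as follows. This assumes each run is a dict carrying the per-run fields listed earlier; the real script's schema and grouping code may differ:

```python
from collections import defaultdict

def aggregate(runs: list[dict]) -> dict[tuple[str, str], dict]:
    """Group per-run records by (variant, model) and compute the aggregates."""
    groups: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)

    out = {}
    for key, rs in groups.items():
        n = len(rs)
        # Average score only over runs that were actually scored.
        scored = [r["score_total"] for r in rs if r.get("score_total") is not None]
        out[key] = {
            "success_rate": sum(r["return_code"] == 0 for r in rs) / n,
            "tool_use_rate": sum(r["tool_called"] for r in rs) / n,
            "avg_endpoint_calls": sum(r["endpoint_calls"] for r in rs) / n,
            "first_call_ok_rate": sum(bool(r.get("first_call_ok")) for r in rs) / n,
            "avg_score_total": sum(scored) / len(scored) if scored else None,
        }
    return out
```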
Run
```shell
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10
```
Multi-model example:
```shell
python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini
```
Indirect (single sub-agent tool) example:
```shell
python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect
```
Outputs
- `docs/tool_description_eval/tool_description_ab_detailed.json`
- `docs/tool_description_eval/tool_description_ab_summary.json`
- `docs/tool_description_eval/tool_description_ab_summary.csv`
- `docs/tool_description_eval/tool_description_ab_summary.md`
- `docs/tool_description_eval/tool_description_ab_pairwise.json`
- `docs/tool_description_eval/tool_description_ab_pairwise.csv`
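For a quick post-run comparison, the summary JSON can be loaded directly. This sketch assumes the summary is a list of per-(variant, model) records with the aggregate field names above; adjust to the file's actual shape if it differs:

```python
import json
from pathlib import Path

def summarize(path: Path) -> list[str]:
    """Render one line per (variant, model) from the summary JSON."""
    rows = json.loads(Path(path).read_text())
    lines = []
    for row in sorted(rows, key=lambda r: (r["variant"], r["model"])):
        lines.append(
            f'{row["variant"]} {row["model"]} '
            f'success={row["success_rate"]:.2f} tool_use={row["tool_use_rate"]:.2f}'
        )
    return lines

# e.g. for line in summarize(Path("docs/tool_description_eval/tool_description_ab_summary.json")): print(line)
```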
Model list syntax is comma-separated aliases/IDs, e.g. `--models gpt-5-mini,haiku,kimi25,glm,grok-4-fast`.