
Tool Description A/B Setup

This harness benchmarks how the quality of a tool's description affects tool-use behavior and intent capture.

Assume commands are run from the repo root.

Files added

  • scripts/eval_tool_description_ab.py
  • scripts/tool_description_variants.json
  • outputs go to docs/tool_description_eval/
  • generated variant cards go to .fast-agent/evals/tool_desc_ab/cards/<variant>/

What it varies

For each variant in tool_description_variants.json, the script creates a temporary tool-card set in which it updates:

  1. hf_hub_community.md frontmatter description
  2. hf_api_tool.py function docstring for hf_api_request

Then it runs the same prompts across selected models.
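The variant-application step can be sketched roughly as below. This is illustrative only: the real script is scripts/eval_tool_description_ab.py, and the variant field names ("card_description", "docstring") are assumptions, not the harness's actual schema.

```python
import re
from pathlib import Path

def apply_variant(base_dir: Path, out_dir: Path, variant: dict) -> None:
    """Copy the two touched files into out_dir, swapping in this variant's
    description text. Field names are illustrative assumptions."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. Rewrite the `description:` line in the card's YAML frontmatter.
    card = (base_dir / "hf_hub_community.md").read_text()
    card = re.sub(r"(?m)^description:.*$",
                  f"description: {variant['card_description']}",
                  card, count=1)
    (out_dir / "hf_hub_community.md").write_text(card)

    # 2. Replace the first docstring in hf_api_tool.py, assumed to be
    #    the hf_api_request function docstring.
    tool = (base_dir / "hf_api_tool.py").read_text()
    tool = re.sub(r'"""[\s\S]*?"""',
                  f'"""{variant["docstring"]}"""',
                  tool, count=1)
    (out_dir / "hf_api_tool.py").write_text(tool)
```

Each variant thus gets its own card directory, so runs against different variants never share mutated state.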

Execution modes:

  • Direct (default): runs hf_hub_community directly (best for endpoint-level scoring).
  • Indirect (--indirect): runs via a generated wrapper agent that exposes exactly one sub-agent tool: hf_hub_community.

Metrics collected

Per run:

  • return code
  • whether tool was called
  • endpoint call count
  • first endpoint used
  • first-call correctness (challenge-aware heuristics)
  • challenge score (reusing score_hf_hub_community_challenges.py when available)

Aggregates by (variant, model):

  • success rate
  • tool-use rate
  • average endpoint calls
  • first-call OK rate
  • average score total
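The roll-up from per-run records to per-(variant, model) aggregates is the usual mean-over-group computation; a minimal sketch follows. The record keys here ("returncode", "tool_called", etc.) are assumptions about the detailed-JSON schema, not confirmed field names.

```python
from collections import defaultdict
from statistics import mean

def aggregate(runs: list[dict]) -> dict:
    """Group per-run records by (variant, model) and average each metric.
    Record keys are illustrative, not the harness's actual schema."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)

    summary = {}
    for key, rs in groups.items():
        summary[key] = {
            # Rates are means of booleans (True counts as 1).
            "success_rate": mean(r["returncode"] == 0 for r in rs),
            "tool_use_rate": mean(bool(r["tool_called"]) for r in rs),
            "avg_endpoint_calls": mean(r["endpoint_calls"] for r in rs),
            "first_call_ok_rate": mean(bool(r["first_call_ok"]) for r in rs),
            "avg_score_total": mean(r["score_total"] for r in rs),
        }
    return summary
```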

Run

python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10

Multi-model example:

python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini

Indirect (single sub-agent tool) example:

python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect

Outputs

  • docs/tool_description_eval/tool_description_ab_detailed.json
  • docs/tool_description_eval/tool_description_ab_summary.json
  • docs/tool_description_eval/tool_description_ab_summary.csv
  • docs/tool_description_eval/tool_description_ab_summary.md
  • docs/tool_description_eval/tool_description_ab_pairwise.json
  • docs/tool_description_eval/tool_description_ab_pairwise.csv
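For quick inspection of the CSV summary, something like the following works; the column names used here are assumptions, so check the generated header row first.

```python
import csv
from pathlib import Path

def summarize(csv_path: Path) -> list[str]:
    """Return one formatted line per (variant, model) row of the summary CSV.
    Column names are assumed; adjust to the actual generated header."""
    lines = []
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            lines.append(
                f"{row['variant']:<20} {row['model']:<14} "
                f"success={row['success_rate']} "
                f"first_call_ok={row['first_call_ok_rate']}"
            )
    return lines
```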

Model list syntax is comma-separated aliases/IDs, e.g. --models gpt-5-mini,haiku,kimi25,glm,grok-4-fast.
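Parsing that argument amounts to a comma split with whitespace trimming; a one-function sketch (the helper name is hypothetical):

```python
def parse_models(spec: str) -> list[str]:
    """Split a --models value into individual aliases/IDs,
    tolerating spaces after commas and empty entries."""
    return [m.strip() for m in spec.split(",") if m.strip()]
```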