
Tool Description A/B Setup

This harness benchmarks how the quality of a tool's description affects tool-use behavior and intent capture.

Assume commands are run from the repo root.

Files added

  • scripts/eval_tool_description_ab.py
  • scripts/tool_description_variants.json
  • outputs go to docs/tool_description_eval/
  • generated variant cards go to .fast-agent/evals/tool_desc_ab/cards/<variant>/

What it varies

For each variant in tool_description_variants.json, the script creates a temporary tool-card set in which it updates:

  1. hf_hub_community.md frontmatter description
  2. hf_api_tool.py function docstring for hf_api_request

Then it runs the same prompts across selected models.
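The variant-application step can be sketched roughly as below. This is illustrative only: the real script is scripts/eval_tool_description_ab.py, and the variant field names ("card_description", "docstring") are assumptions, not the harness's actual schema.

```python
import re
from pathlib import Path

def apply_variant(base_dir: Path, out_dir: Path, variant: dict) -> None:
    """Copy the two touched files into out_dir, swapping in this variant's
    description text. Field names are illustrative assumptions."""
    out_dir.mkdir(parents=True, exist_ok=True)

    # 1. Rewrite the `description:` line in the card's YAML frontmatter.
    card = (base_dir / "hf_hub_community.md").read_text()
    card = re.sub(r"(?m)^description:.*$",
                  f"description: {variant['card_description']}",
                  card, count=1)
    (out_dir / "hf_hub_community.md").write_text(card)

    # 2. Replace the first docstring in hf_api_tool.py, assumed to be
    #    the hf_api_request function docstring.
    tool = (base_dir / "hf_api_tool.py").read_text()
    tool = re.sub(r'"""[\s\S]*?"""',
                  f'"""{variant["docstring"]}"""',
                  tool, count=1)
    (out_dir / "hf_api_tool.py").write_text(tool)
```

Each variant thus gets its own card directory, so runs against different variants never share mutated state.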

Execution modes:

  • Direct (default): runs hf_hub_community directly (best for endpoint-level scoring).
  • Indirect (--indirect): runs via a generated wrapper agent that exposes exactly one sub-agent tool: hf_hub_community.

Metrics collected

Per run:

  • return code
  • whether tool was called
  • endpoint call count
  • first endpoint used
  • first-call correctness (challenge-aware heuristics)
  • challenge score (reusing score_hf_hub_community_challenges.py when available)

Aggregates by (variant, model):

  • success rate
  • tool-use rate
  • average endpoint calls
  • first-call OK rate
  • average score total
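The roll-up from per-run records to per-(variant, model) aggregates is the usual mean-over-group computation; a minimal sketch follows. The record keys here ("returncode", "tool_called", etc.) are assumptions about the detailed-JSON schema, not confirmed field names.

```python
from collections import defaultdict
from statistics import mean

def aggregate(runs: list[dict]) -> dict:
    """Group per-run records by (variant, model) and average each metric.
    Record keys are illustrative, not the harness's actual schema."""
    groups = defaultdict(list)
    for r in runs:
        groups[(r["variant"], r["model"])].append(r)

    summary = {}
    for key, rs in groups.items():
        summary[key] = {
            # Rates are means of booleans (True counts as 1).
            "success_rate": mean(r["returncode"] == 0 for r in rs),
            "tool_use_rate": mean(bool(r["tool_called"]) for r in rs),
            "avg_endpoint_calls": mean(r["endpoint_calls"] for r in rs),
            "first_call_ok_rate": mean(bool(r["first_call_ok"]) for r in rs),
            "avg_score_total": mean(r["score_total"] for r in rs),
        }
    return summary
```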

Run

python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --base-cards-dir .fast-agent/tool-cards \
  --prompts scripts/hf_hub_community_challenges.txt \
  --variants scripts/tool_description_variants.json \
  --start 1 --end 10

Multi-model example:

python scripts/eval_tool_description_ab.py \
  --models gpt-oss,gpt-5-mini,gpt-4.1-mini

Indirect (single sub-agent tool) example:

python scripts/eval_tool_description_ab.py \
  --models gpt-oss \
  --indirect

Outputs

  • docs/tool_description_eval/tool_description_ab_detailed.json
  • docs/tool_description_eval/tool_description_ab_summary.json
  • docs/tool_description_eval/tool_description_ab_summary.csv
  • docs/tool_description_eval/tool_description_ab_summary.md
  • docs/tool_description_eval/tool_description_ab_pairwise.json
  • docs/tool_description_eval/tool_description_ab_pairwise.csv
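For quick inspection of the CSV summary, something like the following works; the column names used here are assumptions, so check the generated header row first.

```python
import csv
from pathlib import Path

def summarize(csv_path: Path) -> list[str]:
    """Return one formatted line per (variant, model) row of the summary CSV.
    Column names are assumed; adjust to the actual generated header."""
    lines = []
    with csv_path.open(newline="") as f:
        for row in csv.DictReader(f):
            lines.append(
                f"{row['variant']:<20} {row['model']:<14} "
                f"success={row['success_rate']} "
                f"first_call_ok={row['first_call_ok_rate']}"
            )
    return lines
```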

Model list syntax is comma-separated aliases/IDs, e.g. --models gpt-5-mini,haiku,kimi25,glm,grok-4-fast.
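Parsing that argument amounts to a comma split with whitespace trimming; a one-function sketch (the helper name is hypothetical):

```python
def parse_models(spec: str) -> list[str]:
    """Split a --models value into individual aliases/IDs,
    tolerating spaces after commas and empty entries."""
    return [m.strip() for m in spec.split(",") if m.strip()]
```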