# Scripts: Eval runners and scoring

## Core scripts
- `score_hf_hub_community_challenges.py` - Runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py` - Endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py` - Tool-routing/confusion benchmark for a single model.
- `run_tool_routing_batch.py` - Batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py` - A/B benchmark of tool-description variants.
- `eval_hf_hub_prompt_ab.py` - A/B benchmark for `hf_hub_community` prompt/card variants across the challenge and coverage packs, with plots.
- `run_hf_hub_prompt_variant.py` - Runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.
- `plot_tool_description_eval.py` - Generates plots from the A/B summary CSV.
## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`
## Quick examples

```sh
python scripts/score_hf_hub_community_challenges.py
```

```sh
python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```sh
python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```sh
python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```sh
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```
Run only one variant (example: v3):
```sh
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```
Repository compact variant (already added):
```sh
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```
To run description A/B against a non-default winner card set:
```sh
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```
## Convenience

- `run_all_evals.sh` - Runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.
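The chained sequence can be sketched as a dry run. This is a sketch only: the script names come from this README, but the exact flags and ordering inside `run_all_evals.sh` are assumptions.

```shell
# Dry-run sketch of the sequence run_all_evals.sh performs.
# Script names come from this README; exact flags and ordering are assumptions.
dry_run() {
  echo "would run: $1"
}

dry_run "python scripts/score_hf_hub_community_challenges.py"
dry_run "python scripts/run_tool_routing_batch.py --models gpt-oss,gpt-5-mini --agent hf_hub_community --agent-cards .fast-agent/tool-cards"
dry_run "python scripts/eval_tool_description_ab.py --models gpt-oss"
dry_run "python scripts/plot_tool_description_eval.py"
```

Printing the commands instead of executing them makes the intended ordering visible without requiring the repository scripts to be present.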
## fast-agent runtime notes (eval scripts)

- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq`/analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
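Because the raw results are plain JSON, they can be inspected with `jq`. A minimal sketch against a synthetic file; the real session-history schema may differ, so the field names here (`turns`, `tool`) are assumptions, not the actual fast-agent format:

```shell
# Synthetic stand-in for a session-history file; the real fast-agent
# schema may differ ("turns" and "tool" are assumed field names).
cat > /tmp/example_history.json <<'EOF'
{"turns": [{"tool": "hf_hub_community"}, {"tool": "hf_hub_community"}, {"tool": "other"}]}
EOF

# Count calls per tool, as a jq analysis workflow might.
jq '[.turns[].tool] | group_by(.) | map({tool: .[0], n: length})' /tmp/example_history.json
```

Swap the path for a file under one of the raw-result directories above and adjust the filter to the actual schema.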
## Tool description A/B modes

- Default is direct (`--agent hf_hub_community`) so endpoint-level scoring remains available.
- Optional indirect mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.
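An indirect-mode invocation would presumably just add the flag to the direct example above; `--indirect` itself is documented here, but combining it with these particular flags is an assumption:

```shell
# Assumed invocation: --indirect is documented above, the remaining
# flags mirror the direct-mode example earlier in this README.
python scripts/eval_tool_description_ab.py \
  --indirect \
  --models gpt-oss
```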