# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py` - Runs/scoring for the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py` - Endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py` - Tool-routing/confusion benchmark for one model.
- `run_tool_routing_batch.py` - Batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py` - A/B benchmark of tool description variants.
- `eval_hf_hub_prompt_ab.py` - A/B benchmark for hf_hub_community prompt/card variants across challenge + coverage packs, with plots.
- `run_hf_hub_prompt_variant.py` - Runs a single prompt variant (e.g. v3 only) on both challenge + coverage packs.
- `plot_tool_description_eval.py` - Plot generator from the A/B summary CSV.

## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`

## Quick examples

```bash
python scripts/score_hf_hub_community_challenges.py
```

```bash
python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Repository compact variant (already added):

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

To run the description A/B against a non-default winner card set:

```bash
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```

## Convenience

- `run_all_evals.sh` - Runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.

## fast-agent runtime notes (eval scripts)

- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq` / analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`

## Tool description A/B modes

- Default is **direct** (`--agent hf_hub_community`), so endpoint-level scoring remains available.
- Optional **indirect** mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.
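An indirect-mode run can be sketched along the lines of the direct-mode examples above; only the `--indirect` flag is confirmed by the notes here, and the remaining flags are assumed to match the direct invocation:

```bash
# Sketch: route the tool-description A/B through the generated router card
# (one sub-agent tool, hf_hub_community) instead of calling the agent directly.
# Flags other than --indirect mirror the direct-mode example and are assumptions.
python scripts/eval_tool_description_ab.py \
  --indirect \
  --models gpt-oss
```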