# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py`
  - Runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py`
  - Endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py`
  - Tool-routing/confusion benchmark for a single model.
- `run_tool_routing_batch.py`
  - Batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py`
  - A/B benchmark of tool-description variants.
- `eval_hf_hub_prompt_ab.py`
  - A/B benchmark for hf_hub_community prompt/card variants across the challenge and coverage packs, with plots.
- `run_hf_hub_prompt_variant.py`
  - Runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.
- `plot_tool_description_eval.py`
  - Generates plots from the A/B summary CSV.
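The batch wrapper's model fan-out can be sketched roughly as below. This is illustrative only: `build_scorer_cmds` is a hypothetical helper, and the flag names are taken from the CLI examples in this README rather than from the wrapper's actual internals.

```python
import shlex


def build_scorer_cmds(models_csv, agent, cards_dir):
    """Build one score_tool_routing_confusion.py invocation per model.

    Sketch only: the real run_tool_routing_batch.py may construct
    its commands differently.
    """
    cmds = []
    for model in models_csv.split(","):
        model = model.strip()
        if not model:
            continue
        cmds.append([
            "python", "scripts/score_tool_routing_confusion.py",
            "--model", model,
            "--agent", agent,
            "--agent-cards", cards_dir,
        ])
    return cmds


# Print the commands instead of executing them:
for cmd in build_scorer_cmds("gpt-oss,gpt-5-mini", "hf_hub_community",
                             ".fast-agent/tool-cards"):
    print(shlex.join(cmd))
```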
## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`
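A minimal loader for one of the `.txt` challenge packs might look like the sketch below. Note the file format is an assumption here (one prompt per line, blank lines skipped); the real runners may use a richer layout, so the demo runs against a throwaway file rather than the actual pack.

```python
import os
import tempfile
from pathlib import Path


def load_challenges(path):
    """Read challenge prompts from a .txt pack.

    Assumption: one prompt per line, blank lines skipped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip()]


# Demo against a temporary file, not the real challenge pack:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Find the most-downloaded text-generation model\n\n")
    f.write("List datasets tagged 'legal'\n")
prompts = load_challenges(f.name)
os.unlink(f.name)
print(prompts)
```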
## Quick examples

Score the community challenge pack:

```bash
python scripts/score_hf_hub_community_challenges.py
```

Run the routing benchmark for one model:

```bash
python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Run the routing benchmark across several models:

```bash
python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Score the coverage pack:

```bash
python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Compare prompt/card variants (A/B):

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Run the A/B with the repository's compact variant (already added):

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

Run the description A/B against a non-default winner card set:

```bash
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```
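The `--variants` flag in the examples above takes comma-separated `name=cards_dir` pairs. A minimal parser for that convention could look like this; it is a sketch, not the actual argument handling in `eval_hf_hub_prompt_ab.py`.

```python
def parse_variants(spec):
    """Parse 'name=dir,name=dir' into a {name: cards_dir} mapping.

    Sketch of the --variants convention shown in the README examples;
    the real script's parser may differ.
    """
    variants = {}
    for pair in spec.split(","):
        name, _, cards_dir = pair.partition("=")
        if not name or not cards_dir:
            raise ValueError(f"expected name=dir, got {pair!r}")
        variants[name.strip()] = cards_dir.strip()
    return variants


print(parse_variants(
    "baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards"))
```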
## Convenience

- `run_all_evals.sh`
  - Runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.

## fast-agent runtime notes (eval scripts)

- Eval runners prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They pass `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq` / analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
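Because the raw results are plain JSON, a quick schema-agnostic look is possible even without `jq`. The sketch below deliberately assumes nothing about the session-history schema; it only reports the top-level shape, and the demo writes a throwaway file instead of reading a real run.

```python
import json
import os
import tempfile
from pathlib import Path


def summarize(path):
    """Describe the top-level shape of a results JSON file without
    assuming the fast-agent session-history schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return f"list of {len(data)} entries"
    if isinstance(data, dict):
        return "dict with keys: " + ", ".join(sorted(data))
    return type(data).__name__


# Demo on a throwaway file (real runs live under docs/...):
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"role": "user"}, {"role": "assistant"}], f)
print(summarize(f.name))
os.unlink(f.name)
```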
## Tool description A/B modes

- The default is **direct** (`--agent hf_hub_community`), so endpoint-level scoring remains available.
- An optional **indirect** mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.