
# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py`: runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py`: endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py`: tool-routing/confusion benchmark for a single model.
- `run_tool_routing_batch.py`: batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py`: A/B benchmark of tool description variants.
- `eval_hf_hub_prompt_ab.py`: A/B benchmark of `hf_hub_community` prompt/card variants across the challenge and coverage packs, with plots.
- `run_hf_hub_prompt_variant.py`: runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.
- `plot_tool_description_eval.py`: generates plots from the A/B summary CSV.
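The batch wrapper's job can be sketched as follows. This is a hypothetical reconstruction, not the script's actual implementation: it assumes `run_tool_routing_batch.py` simply expands the comma-separated `--models` value into one `score_tool_routing_confusion.py` invocation per model, reusing the documented CLI flags.

```python
def build_commands(models_csv, agent, cards_dir):
    """Expand a comma-separated model list into per-model scoring commands.

    Hypothetical sketch: flag names mirror the documented CLI, but the real
    batch wrapper's internals are not shown in this README.
    """
    commands = []
    for model in models_csv.split(","):
        commands.append([
            "python", "scripts/score_tool_routing_confusion.py",
            "--model", model.strip(),
            "--agent", agent,
            "--agent-cards", cards_dir,
        ])
    return commands

cmds = build_commands("gpt-oss,gpt-5-mini", "hf_hub_community", ".fast-agent/tool-cards")
for cmd in cmds:
    print(" ".join(cmd))
```

Each command list could then be handed to `subprocess.run` in sequence, which is why the batch wrapper accepts the same `--agent`/`--agent-cards` flags as the single-model scorer.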

## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`

## Quick examples

```sh
python scripts/score_hf_hub_community_challenges.py

python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```sh
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Repository compact variant (already added):

```sh
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

To run the description A/B against a non-default winner card set:

```sh
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```

## Convenience

- `run_all_evals.sh`: runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.

## fast-agent runtime notes (eval scripts)

- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in the fast-agent session-history format, so existing `jq`/analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
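Since the persisted run histories are plain JSON, they can be analyzed in Python as easily as with `jq`. The sketch below is illustrative only: the exact session-history schema is not documented here, so it assumes each results file is a JSON list of message records with `role` and `content` fields (an assumption, not the real format), and it writes a synthetic file so the snippet is self-contained.

```python
import json
import os
import tempfile

# Synthetic stand-in for a persisted run history; the real session-history
# schema may differ (this two-field message shape is an assumption).
sample = [
    {"role": "user", "content": "List trending models"},
    {"role": "assistant", "content": "Here are the trending models..."},
]

path = os.path.join(tempfile.mkdtemp(), "run.json")
with open(path, "w") as f:
    json.dump(sample, f)

def count_assistant_turns(results_path):
    """Count assistant messages in one persisted run history file."""
    with open(results_path) as f:
        history = json.load(f)
    return sum(1 for m in history if m.get("role") == "assistant")

print(count_assistant_turns(path))
```

The same count could be done with `jq '[.[] | select(.role == "assistant")] | length' run.json`, which is the kind of existing workflow the session-history format is meant to keep working.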

## Tool description A/B modes

- Default is direct mode (`--agent hf_hub_community`), so endpoint-level scoring remains available.
- Optional indirect mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.