
# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py`: runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py`: endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py`: tool-routing/confusion benchmark for a single model.
- `run_tool_routing_batch.py`: batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py`: A/B benchmark of tool description variants.
- `eval_hf_hub_prompt_ab.py`: A/B benchmark of `hf_hub_community` prompt/card variants across the challenge and coverage packs, with plots.
- `run_hf_hub_prompt_variant.py`: runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.
- `plot_tool_description_eval.py`: generates plots from the A/B summary CSV.
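The batch wrapper's job can be sketched as follows. This is a hypothetical reconstruction, not the script's actual implementation: it assumes `run_tool_routing_batch.py` simply expands the comma-separated `--models` value into one `score_tool_routing_confusion.py` invocation per model, reusing the documented CLI flags.

```python
def build_commands(models_csv, agent, cards_dir):
    """Expand a comma-separated model list into per-model scoring commands.

    Hypothetical sketch: flag names mirror the documented CLI, but the real
    batch wrapper's internals are not shown in this README.
    """
    commands = []
    for model in models_csv.split(","):
        commands.append([
            "python", "scripts/score_tool_routing_confusion.py",
            "--model", model.strip(),
            "--agent", agent,
            "--agent-cards", cards_dir,
        ])
    return commands

cmds = build_commands("gpt-oss,gpt-5-mini", "hf_hub_community", ".fast-agent/tool-cards")
for cmd in cmds:
    print(" ".join(cmd))
```

Each command list could then be handed to `subprocess.run` in sequence, which is why the batch wrapper accepts the same `--agent`/`--agent-cards` flags as the single-model scorer.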

## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`

## Quick examples

```sh
python scripts/score_hf_hub_community_challenges.py

python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards

python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```sh
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Repository compact variant (already added):

```sh
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

To run the description A/B against a non-default winner card set:

```sh
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```

## Convenience

- `run_all_evals.sh`: runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.

## fast-agent runtime notes (eval scripts)

- Eval runners now prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They also use `--results` to persist run histories as JSON.
- Output JSON files are in the fast-agent session-history format, so existing `jq`/analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
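Since the persisted run histories are plain JSON, they can be analyzed in Python as easily as with `jq`. The sketch below is illustrative only: the exact session-history schema is not documented here, so it assumes each results file is a JSON list of message records with `role` and `content` fields (an assumption, not the real format), and it writes a synthetic file so the snippet is self-contained.

```python
import json
import os
import tempfile

# Synthetic stand-in for a persisted run history; the real session-history
# schema may differ (this two-field message shape is an assumption).
sample = [
    {"role": "user", "content": "List trending models"},
    {"role": "assistant", "content": "Here are the trending models..."},
]

path = os.path.join(tempfile.mkdtemp(), "run.json")
with open(path, "w") as f:
    json.dump(sample, f)

def count_assistant_turns(results_path):
    """Count assistant messages in one persisted run history file."""
    with open(results_path) as f:
        history = json.load(f)
    return sum(1 for m in history if m.get("role") == "assistant")

print(count_assistant_turns(path))
```

The same count could be done with `jq '[.[] | select(.role == "assistant")] | length' run.json`, which is the kind of existing workflow the session-history format is meant to keep working.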

## Tool description A/B modes

- Default is direct mode (`--agent hf_hub_community`), so endpoint-level scoring remains available.
- Optional indirect mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.