# Scripts: Eval runners and scoring

## Core scripts

- `score_hf_hub_community_challenges.py`
  - Runs and scores the HF Hub community challenge pack.
- `score_hf_hub_community_coverage.py`
  - Endpoint-coverage pack for capabilities not covered by the main challenge pack.
- `score_tool_routing_confusion.py`
  - Tool-routing/confusion benchmark for a single model.
- `run_tool_routing_batch.py`
  - Batch wrapper around `score_tool_routing_confusion.py`.
- `eval_tool_description_ab.py`
  - A/B benchmark of tool-description variants.
- `eval_hf_hub_prompt_ab.py`
  - A/B benchmark for hf_hub_community prompt/card variants across the challenge and coverage packs, with plots.
- `run_hf_hub_prompt_variant.py`
  - Runs a single prompt variant (e.g. v3 only) on both the challenge and coverage packs.
- `plot_tool_description_eval.py`
  - Generates plots from the A/B summary CSV.
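The batch wrapper's model fan-out can be sketched roughly as below. This is illustrative only: `build_scorer_cmds` is a hypothetical helper, and the flag names are taken from the CLI examples in this README rather than from the wrapper's actual internals.

```python
import shlex


def build_scorer_cmds(models_csv, agent, cards_dir):
    """Build one score_tool_routing_confusion.py invocation per model.

    Sketch only: the real run_tool_routing_batch.py may construct
    its commands differently.
    """
    cmds = []
    for model in models_csv.split(","):
        model = model.strip()
        if not model:
            continue
        cmds.append([
            "python", "scripts/score_tool_routing_confusion.py",
            "--model", model,
            "--agent", agent,
            "--agent-cards", cards_dir,
        ])
    return cmds


# Print the commands instead of executing them:
for cmd in build_scorer_cmds("gpt-oss,gpt-5-mini", "hf_hub_community",
                             ".fast-agent/tool-cards"):
    print(shlex.join(cmd))
```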
## Input data files

- `hf_hub_community_challenges.txt`
- `tool_routing_challenges.txt`
- `tool_routing_expected.json`
- `tool_description_variants.json`
- `hf_hub_community_coverage_prompts.json`
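A minimal loader for one of the `.txt` challenge packs might look like the sketch below. Note the file format is an assumption here (one prompt per line, blank lines skipped); the real runners may use a richer layout, so the demo runs against a throwaway file rather than the actual pack.

```python
import os
import tempfile
from pathlib import Path


def load_challenges(path):
    """Read challenge prompts from a .txt pack.

    Assumption: one prompt per line, blank lines skipped.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip()]


# Demo against a temporary file, not the real challenge pack:
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Find the most-downloaded text-generation model\n\n")
    f.write("List datasets tagged 'legal'\n")
prompts = load_challenges(f.name)
os.unlink(f.name)
print(prompts)
```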
## Quick examples

Score the community challenge pack:

```bash
python scripts/score_hf_hub_community_challenges.py
```

Run the routing benchmark for one model:

```bash
python scripts/score_tool_routing_confusion.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Run the routing benchmark across several models:

```bash
python scripts/run_tool_routing_batch.py \
  --models gpt-oss,gpt-5-mini \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Score the coverage pack:

```bash
python scripts/score_hf_hub_community_coverage.py \
  --model gpt-oss \
  --agent hf_hub_community \
  --agent-cards .fast-agent/tool-cards
```

Compare prompt/card variants (A/B):

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards \
  --models gpt-oss
```

Run only one variant (example: v3):

```bash
python scripts/run_hf_hub_prompt_variant.py \
  --variant-id v3 \
  --cards-dir .fast-agent/evals/hf_hub_prompt_v3/cards \
  --model gpt-oss
```

Run the A/B with the repository's compact variant (already added):

```bash
python scripts/eval_hf_hub_prompt_ab.py \
  --variants baseline=.fast-agent/tool-cards,compact=.fast-agent/evals/hf_hub_prompt_compact/cards \
  --models gpt-oss
```

Run the description A/B against a non-default winner card set:

```bash
python scripts/eval_tool_description_ab.py \
  --base-cards-dir /abs/path/to/winner/cards \
  --models gpt-oss
```
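The `--variants` flag in the examples above takes comma-separated `name=cards_dir` pairs. A minimal parser for that convention could look like this; it is a sketch, not the actual argument handling in `eval_hf_hub_prompt_ab.py`.

```python
def parse_variants(spec):
    """Parse 'name=dir,name=dir' into a {name: cards_dir} mapping.

    Sketch of the --variants convention shown in the README examples;
    the real script's parser may differ.
    """
    variants = {}
    for pair in spec.split(","):
        name, _, cards_dir = pair.partition("=")
        if not name or not cards_dir:
            raise ValueError(f"expected name=dir, got {pair!r}")
        variants[name.strip()] = cards_dir.strip()
    return variants


print(parse_variants(
    "baseline=.fast-agent/tool-cards,compact=/abs/path/to/compact/cards"))
```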
## Convenience

- `run_all_evals.sh`
  - Runs community scoring, the routing batch, the tool-description A/B, and plotting in sequence.

## fast-agent runtime notes (eval scripts)

- Eval runners prefer `fast-agent go --no-env` to avoid creating environment-side session artifacts.
- They pass `--results` to persist run histories as JSON.
- Output JSON files are in fast-agent session-history format, so existing `jq` / analysis workflows continue to work.
- Default raw result locations:
  - Community challenges: `docs/hf_hub_community_eval_results/`
  - Tool routing: `docs/tool_routing_eval/raw_results/`
  - Tool description A/B: `docs/tool_description_eval/raw_results/`
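Because the raw results are plain JSON, a quick schema-agnostic look is possible even without `jq`. The sketch below deliberately assumes nothing about the session-history schema; it only reports the top-level shape, and the demo writes a throwaway file instead of reading a real run.

```python
import json
import os
import tempfile
from pathlib import Path


def summarize(path):
    """Describe the top-level shape of a results JSON file without
    assuming the fast-agent session-history schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return f"list of {len(data)} entries"
    if isinstance(data, dict):
        return "dict with keys: " + ", ".join(sorted(data))
    return type(data).__name__


# Demo on a throwaway file (real runs live under docs/...):
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump([{"role": "user"}, {"role": "assistant"}], f)
print(summarize(f.name))
os.unlink(f.name)
```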
## Tool description A/B modes

- The default is **direct** (`--agent hf_hub_community`), so endpoint-level scoring remains available.
- An optional **indirect** mode (`--indirect`) wraps calls through a generated router card that exposes exactly one sub-agent tool: `hf_hub_community`.