	# Evaluation Audit & Methodology

	Status: Under Independent Verification

	## Critical Findings

	An audit of the Stack 2.9 evaluation infrastructure identified the following issues:

	### 1. Incomplete Test Sets

	- HumanEval: Only 20 out of 164 problems (~12%) were evaluated
	- MBPP: Only 20 out of 500 problems (~4%) were evaluated

	The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore not representative of full benchmark performance.
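
To give a rough sense of why a 20-problem subset is too small, the sketch below computes the half-width of a normal-approximation 95% interval for a pass rate. It is illustrative only: it assumes each problem is an independent pass/fail trial and plugs in the claimed 76.8% purely as an example rate. The function name `binomial_margin` is hypothetical and not part of the repository.

```python
import math


def binomial_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a pass rate p over n problems."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    # Compare the audited subset size with the full HumanEval size.
    for n_problems in (20, 164):
        margin = binomial_margin(0.768, n_problems)
        print(f"{n_problems} problems: +/- {margin * 100:.1f} percentage points")
```

Under these assumptions the uncertainty at 20 problems is roughly three times larger than at 164, so a subset score says little about full-benchmark performance.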

	### 2. Missing Model Inference

	Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:

	- The scripts return pre-written canonical solutions instead of running model inference
	- No API calls were made to the Ollama, OpenAI, or Anthropic providers
	- No model-generated outputs exist in the `results/` directory
	- The `results/humaneval.json` file reports a 0% failure rate, and it comes from a broken run

	Conclusion: The benchmark numbers appear to be fabricated or, at best, unverified.
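
One way to audit for this failure mode is to compare stored generations against the dataset's reference solutions. The sketch below assumes HumanEval-style records with `task_id` and `canonical_solution` fields; the `generations` mapping (task id to completion text) is a hypothetical results layout, not the format the repository necessarily uses.

```python
def normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so formatting noise does not hide a match."""
    return "\n".join(
        line.rstrip() for line in code.strip().splitlines() if line.strip()
    )


def find_canonical_copies(problems, generations):
    """Return task ids whose stored generation matches the canonical solution.

    problems: list of dicts with "task_id" and "canonical_solution".
    generations: dict mapping task_id -> completion text (assumed layout).
    """
    flagged = []
    for problem in problems:
        generated = generations.get(problem["task_id"])
        if generated is None:
            continue  # nothing stored for this problem
        if normalize(generated) == normalize(problem["canonical_solution"]):
            flagged.append(problem["task_id"])
    return flagged
```

A handful of exact matches can occur legitimately on trivial problems; a match on every problem indicates the script never called a model.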

	### 3. Tool Use Benchmark Unimplemented

	The claimed 94.1% Tool Use score lacks:
	- Any proper benchmark dataset
	- Defined evaluation methodology
	- Reproduction instructions
	- Actual model calls to test tool selection accuracy

	It appears to be a custom, non-standard metric with no basis in accepted benchmarks.

	---

	## Proper Evaluation Framework

	We have built a new, rigorous evaluation infrastructure:

	### Official Datasets

	```bash
	# Download HumanEval (164 problems) and MBPP (500 problems)
	python scripts/download_benchmark_datasets.py --data-dir ./data
	```

	This script:
	- Fetches HumanEval from OpenAI's official dataset
	- Fetches MBPP from Google's benchmark suite
	- Ensures correct formatting and ground-truth solutions
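
After downloading, a quick sanity check on record counts can catch truncated or subset files before any evaluation runs. The sketch below assumes the datasets are stored as JSON Lines, one problem per line; the file layout and the helper names `count_jsonl_records` and `check_dataset` are assumptions for illustration, not the download script's actual behavior.

```python
import json
from pathlib import Path

# Problem counts stated in this document for the full benchmarks.
EXPECTED_COUNTS = {"humaneval": 164, "mbpp": 500}


def count_jsonl_records(path) -> int:
    """Count non-blank lines, parsing each one to confirm it is valid JSON."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on malformed records
                count += 1
    return count


def check_dataset(name: str, path) -> int:
    """Raise ValueError if the file does not hold the full benchmark."""
    found = count_jsonl_records(path)
    expected = EXPECTED_COUNTS[name]
    if found != expected:
        raise ValueError(f"{name}: expected {expected} problems, found {found}")
    return found
```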

	### Unified Evaluation Runner

	`stack-2.9-eval/run_proper_evaluation.py` provides:

	```bash
	python stack-2.9-eval/run_proper_evaluation.py \
	  --benchmark humaneval \
	  --provider ollama \
	  --model qwen2.5-coder:32b \
	  --k-samples 100 \
	  --output-dir ./results/humaneval_run
	```

	Features:
	- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
	- Proper `pass@k` calculation with confidence intervals
	- Per-problem detailed logs (JSON)
	- Reproducible random sampling (seeds)
	- Parallel evaluation (configurable workers)
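
For reference, `pass@k` is commonly computed with the unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of samples generated for a problem and c is the number that pass the tests. The sketch below is a minimal standalone version of that formula; it is not taken from `run_proper_evaluation.py`, and the function names are illustrative.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With k = 1 the estimator reduces to c / n per problem, so generating many samples per problem lowers the variance of the pass@1 estimate without changing what it measures.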

	### Evaluation Checklist

	To ensure transparency, every proper evaluation must:

	1. ✅ Use full official benchmark (164 HumanEval, 500 MBPP)
	2. ✅ Call real model inference via `model_client.py`
	3. ✅ Generate at least 100 samples per problem (`--k-samples 100`) for pass@1 estimation
	4. ✅ Store all generation outputs for audit
	5. ✅ Compute standard deviation and confidence intervals
	6. ✅ Publish full JSON logs to `results/` directory
	7. ✅ Document exact model version, quantization, and provider settings
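
Item 5 can be met in several ways. One option, sketched below, is a percentile bootstrap over per-problem scores. This is an assumption about method, not a description of what the runner does; `bootstrap_ci` and its parameters are hypothetical names, and the fixed seed is there so the interval is reproducible.

```python
import random
import statistics


def bootstrap_ci(per_problem_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean, stdev, lower, upper) using a percentile bootstrap.

    per_problem_scores: one pass@k estimate in [0, 1] per benchmark problem
    (at least two values, since the sample stdev is undefined otherwise).
    """
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(per_problem_scores)
    # Resample problems with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(per_problem_scores, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return (
        statistics.fmean(per_problem_scores),
        statistics.stdev(per_problem_scores),
        lower,
        upper,
    )
```

Resampling at the problem level reflects that the benchmark is a sample of tasks; it does not capture variance from decoding randomness within a single problem.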

	---

	## Current Status

	The previously claimed scores have been removed from README.md and BENCHMARKS.md. They are replaced with:

	| Benchmark | Status | Notes |
	|-----------|--------|-------|
	| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
	| MBPP | Pending verification | Full 500-problem evaluation setup ready |
	| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
	| GSM8K | Not started | Math reasoning evaluation planned |

	Expected baseline (Qwen2.5-Coder-32B):
	- HumanEval: ~70-72% Pass@1
	- MBPP: ~75-77% Pass@1

	Stack 2.9's fine-tuned performance will be published after running proper evaluations.

	---

	## What Changed

	- Created `scripts/download_benchmark_datasets.py` for official datasets
	- Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
	- Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
	- Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py`
	- Updated README.md, BENCHMARKS.md, website pages to remove false claims

	---

	## How to Publish Verified Scores

	1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
	2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100 --output-dir ./results/humaneval_run`
	3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
	4. Update README.md with actual numbers once verified
	5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility

	Do NOT publish the previously claimed percentages. They are invalid.