Evaluation Audit & Methodology
Status: Under Independent Verification
Critical Findings
After comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:
1. Incomplete Test Sets
- HumanEval: Only 20 out of 164 problems (~12%) were evaluated
- MBPP: Only 20 out of 500 problems (~4%) were evaluated
The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore not representative of full benchmark performance.
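To illustrate why a 20-problem subset cannot support a headline number, a back-of-envelope confidence interval (normal approximation for a binomial proportion; the 76.8% figure is the claimed score, not a verified one) shows the margin of error at n=20 is roughly three times that of the full benchmark:

```python
import math

def margin_95(p: float, n: int) -> float:
    """Half-width of a 95% normal-approximation CI for a pass rate p over n problems."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Claimed 76.8% on a 20-problem HumanEval subset vs. the full 164-problem set.
print(f"n=20:  +/- {margin_95(0.768, 20):.1%}")   # → n=20:  +/- 18.5%
print(f"n=164: +/- {margin_95(0.768, 164):.1%}")  # → n=164: +/- 6.5%
```

A score of 76.8% measured on 20 problems is statistically indistinguishable from anything between roughly 58% and 95%.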
2. Missing Model Inference
Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:
- The scripts return pre-written canonical solutions instead of running actual model inference
- No API calls to Ollama/OpenAI/Anthropic providers were made
- No model-generated outputs exist in the `results/` directory
- The `results/humaneval.json` file reports a 0% failure rate from a broken run
Conclusion: The benchmark numbers appear to be fabricated or, at best, unverified.
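One cheap automated guard against this failure mode is to flag any result record whose generation is byte-identical to the benchmark's reference solution. A minimal sketch (field names here are illustrative, not the actual log schema):

```python
def flag_suspicious(results: list[dict]) -> list[str]:
    """Return task_ids whose 'generation' exactly matches the benchmark's
    'canonical_solution' -- a strong sign that no inference actually ran.
    Field names are assumptions, not the real result format."""
    return [r["task_id"] for r in results
            if r["generation"].strip() == r["canonical_solution"].strip()]

records = [
    {"task_id": "HumanEval/0", "generation": "return sorted(xs)",
     "canonical_solution": "return sorted(xs)"},
    {"task_id": "HumanEval/1", "generation": "return xs[::-1]",
     "canonical_solution": "return list(reversed(xs))"},
]
print(flag_suspicious(records))  # → ['HumanEval/0']
```

An occasional exact match is possible on trivial problems, but a high match rate across the suite should fail the audit.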
3. Tool Use Benchmark Unimplemented
The claimed 94.1% Tool Use score lacks:
- Any proper benchmark dataset
- Defined evaluation methodology
- Reproduction instructions
- Actual model calls to test tool selection accuracy
It appears to be a custom, non-standard metric with no basis in accepted benchmarks.
Proper Evaluation Framework
We have built a new, rigorous evaluation infrastructure:
Official Datasets
```bash
# Download HumanEval (164 problems) and MBPP (500 problems)
python scripts/download_benchmark_datasets.py --data-dir ./data
```
This script:
- Fetches HumanEval from OpenAI's official dataset
- Fetches MBPP from Google's benchmark suite
- Ensures correct formatting and ground-truth solutions
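A simple post-download sanity check is to count the records before running anything. A sketch for JSONL-formatted benchmark files (the on-disk paths and format are assumptions about what the download script produces):

```python
import io
import json

def count_records(fp) -> int:
    """Count non-empty JSONL records, validating that each line parses as JSON."""
    return sum(1 for line in fp if line.strip() and json.loads(line) is not None)

# In practice, check the downloaded files (paths are assumptions):
#   count_records(open("data/HumanEval.jsonl")) == 164
#   count_records(open("data/mbpp.jsonl")) == 500
sample = io.StringIO('{"task_id": "HumanEval/0"}\n\n{"task_id": "HumanEval/1"}\n')
print(count_records(sample))  # → 2
```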
Unified Evaluation Runner
`stack-2.9-eval/run_proper_evaluation.py` provides:

```bash
python stack-2.9-eval/run_proper_evaluation.py \
  --benchmark humaneval \
  --provider ollama \
  --model qwen2.5-coder:32b \
  --k-samples 100 \
  --output-dir ./results/humaneval_run
```
Features:
- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
- Proper `pass@k` calculation with confidence intervals
- Per-problem detailed logs (JSON)
- Reproducible random sampling (seeds)
- Parallel evaluation (configurable workers)
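The `pass@k` calculation should use the standard unbiased estimator from the original HumanEval release, not a naive "best of k" over a fixed batch. A self-contained sketch of that estimator (whether the runner implements it exactly this way is an assumption):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass the tests, k <= n. Returns the probability that
    at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 100 samples and 40 passing, pass@1 is the raw pass rate...
print(round(pass_at_k(100, 40, 1), 3))   # → 0.4
# ...while pass@10 is far higher, since only one of ten draws must pass.
print(round(pass_at_k(100, 40, 10), 3))
```

The per-problem values are then averaged across the benchmark to produce the reported score, which is why storing all n generations per problem matters for auditability.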
Evaluation Checklist
To ensure transparency, every proper evaluation must:
- ✅ Use full official benchmark (164 HumanEval, 500 MBPP)
- ✅ Call real model inference via `model_client.py`
- ✅ Run with k≥100 samples for pass@1 estimation
- ✅ Store all generation outputs for audit
- ✅ Compute standard deviation and confidence intervals
- ✅ Publish full JSON logs to the `results/` directory
- ✅ Document exact model version, quantization, and provider settings
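To make the "store everything" items concrete, each generation can be written as one JSONL line carrying enough metadata to reproduce it. A sketch of such a record (the field names are assumptions for illustration, not the runner's actual schema):

```python
import json
import time

# Hypothetical per-problem log record; one JSONL line per generation.
record = {
    "task_id": "HumanEval/0",
    "model": "qwen2.5-coder:32b",
    "provider": "ollama",
    "seed": 1234,
    "temperature": 0.2,
    "generation": "def has_close_elements(...): ...",
    "passed": True,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
line = json.dumps(record)
print(line[:40])
```

Logging the seed, sampling parameters, and exact model tag alongside every generation is what lets a third party re-run and diff the results.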
Current Status
The previously claimed scores have been removed from README.md and BENCHMARKS.md. They are replaced with:
| Benchmark | Status | Notes |
|---|---|---|
| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
| MBPP | Pending verification | Full 500-problem evaluation setup ready |
| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
| GSM8K | Not started | Math reasoning evaluation planned |
Expected baseline (Qwen2.5-Coder-32B):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1
Stack 2.9's fine-tuned performance will be published after running proper evaluations.
What Changed
- Created `scripts/download_benchmark_datasets.py` for official datasets
- Created `stack-2.9-eval/run_proper_evaluation.py`, a unified runner
- Created `stack-2.9-eval/test_evaluation_setup.py` to validate the environment
- Added deprecation warnings to the flawed `human_eval.py`, `mbpp_eval.py`, and `tool_use_eval.py`
- Updated README.md, BENCHMARKS.md, and website pages to remove the false claims
How to Publish Verified Scores
1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
4. Update README.md with actual numbers once verified
5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
Do NOT publish the previously claimed percentages. They are invalid.