# Evaluation Audit & Methodology

**Status:** Under Independent Verification

## Critical Findings

After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:

### 1. Incomplete Test Sets

- **HumanEval**: only **20 of 164 problems** (~12%) were evaluated
- **MBPP**: only **20 of 500 problems** (~4%) were evaluated

The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance.

### 2. Missing Model Inference

Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:

- The scripts return **pre-written canonical solutions** instead of performing actual model inference
- No API calls to the Ollama/OpenAI/Anthropic providers were made
- No model-generated outputs exist in the `results/` directory
- The `results/humaneval.json` file contains a 0% failure rate from a broken run

**Conclusion:** The benchmark numbers appear to be fabricated or, at best, unverified.

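For contrast, here is a hedged sketch of what a genuine inference call looks like. It targets Ollama's documented `/api/generate` REST endpoint; the function names and defaults are illustrative, not the actual `model_client.py` API:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "qwen2.5-coder:32b",
                  host: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a non-streaming generation request for Ollama's /api/generate."""
    payload = json.dumps({"model": model, "prompt": prompt,
                          "stream": False}).encode()
    return urllib.request.Request(
        f"{host}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama server and return its completion."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]
```

Any honest evaluation script must go through a call like this, so that every completion in `results/` is traceable to a real model response.
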
### 3. Tool Use Benchmark Unimplemented

The claimed 94.1% Tool Use score lacks:
- Any proper benchmark dataset
- A defined evaluation methodology
- Reproduction instructions
- Actual model calls to test tool-selection accuracy

It appears to be a custom, non-standard metric with no basis in any accepted benchmark.

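A proper Tool Use metric needs, at minimum, a dataset of cases with ground-truth tool labels and a scorer over real model predictions. A minimal sketch, assuming a hypothetical case schema (`query`, `expected_tool`) and a `predict` callable that would wrap real model inference:

```python
def tool_selection_accuracy(cases, predict):
    """Fraction of cases where the model picks the ground-truth tool.

    cases:   list of {"query": str, "expected_tool": str} dicts
    predict: callable mapping a query string to a predicted tool name
             (in a real run, this wraps an actual model call)
    """
    correct = sum(predict(c["query"]) == c["expected_tool"] for c in cases)
    return correct / len(cases)
```
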
---

## Proper Evaluation Framework

We have built a new, rigorous evaluation infrastructure:

### Official Datasets

```bash
# Download HumanEval (164 problems) and MBPP (500 problems)
python scripts/download_benchmark_datasets.py --data-dir ./data
```

This script fetches:
- HumanEval from OpenAI's official dataset
- MBPP from Google's benchmark suite

It also verifies correct formatting and the presence of ground-truth solutions.

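Downloaded files should be sanity-checked before any run. A small sketch of such a guard; the `task_id` key follows HumanEval's JSONL schema, while the function names are illustrative:

```python
import json

def load_jsonl(path):
    """Load one benchmark problem per line from a JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def verify_dataset(problems, expected_count, key="task_id"):
    """Guard against partial downloads and duplicated entries."""
    ids = [p[key] for p in problems]
    if len(ids) != expected_count:
        raise ValueError(f"expected {expected_count} problems, got {len(ids)}")
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate task ids")
    return True

# e.g. verify_dataset(load_jsonl("data/HumanEval.jsonl"), 164)
```

A check like this would have caught the 20-problem truncation described above before any scores were reported.
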
### Unified Evaluation Runner

`stack-2.9-eval/run_proper_evaluation.py` provides:

```bash
python stack-2.9-eval/run_proper_evaluation.py \
    --benchmark humaneval \
    --provider ollama \
    --model qwen2.5-coder:32b \
    --k-samples 100 \
    --output-dir ./results/humaneval_run
```

Features:
- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
- Proper `pass@k` calculation with confidence intervals
- Per-problem detailed logs (JSON)
- Reproducible random sampling (seeded)
- Parallel evaluation (configurable worker count)

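For reference, the runner's `pass@k` figures should match the standard unbiased estimator from the original HumanEval paper; a minimal version:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples generated per problem
    c: number of samples that passed the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The per-benchmark score is then the mean of `pass_at_k` over all problems; naive averaging of best-of-n results overestimates it.
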
### Evaluation Checklist

To ensure transparency, every proper evaluation must:

1. ✅ Use the full official benchmarks (164 HumanEval, 500 MBPP problems)
2. ✅ Call real model inference via `model_client.py`
3. ✅ Run with k ≥ 100 samples for pass@1 estimation
4. ✅ Store all generation outputs for audit
5. ✅ Compute standard deviations and confidence intervals
6. ✅ Publish full JSON logs to the `results/` directory
7. ✅ Document the exact model version, quantization, and provider settings

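The confidence intervals in item 5 can be obtained, for example, by bootstrapping the per-problem pass indicators; a minimal sketch (the function name and defaults are ours):

```python
import random
import statistics

def bootstrap_ci(passes, n_boot=10000, alpha=0.05, seed=0):
    """Mean pass rate with a percentile-bootstrap (1 - alpha) CI.

    passes: per-problem 0/1 outcomes from a single pass@1 run
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(passes)
    means = sorted(sum(rng.choices(passes, k=n)) / n for _ in range(n_boot))
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.mean(passes), (lo, hi)
```

On 164 HumanEval problems the resulting interval is wide (several points either way), which is exactly why single-run point estimates should never be published bare.
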
---

## Current Status

The previously claimed scores have been **removed** from README.md and BENCHMARKS.md. They are replaced with:

| Benchmark | Status | Notes |
|-----------|--------|-------|
| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
| MBPP | Pending verification | Full 500-problem evaluation setup ready |
| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
| GSM8K | Not started | Math reasoning evaluation planned |

Expected baseline (Qwen2.5-Coder-32B):
- HumanEval: ~70-72% pass@1
- MBPP: ~75-77% pass@1

Stack 2.9's fine-tuned performance will be published after the proper evaluations have been run.

---

## What Changed

- Created `scripts/download_benchmark_datasets.py` to fetch the official datasets
- Created `stack-2.9-eval/run_proper_evaluation.py`, the unified runner
- Created `stack-2.9-eval/test_evaluation_setup.py` to validate the environment
- Added deprecation warnings to the flawed `human_eval.py`, `mbpp_eval.py`, and `tool_use_eval.py`
- Updated README.md, BENCHMARKS.md, and the website pages to remove the false claims

---

## How to Publish Verified Scores

1. Prepare the datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
2. Run the evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
3. Review the logs in `./results/humaneval_run/` (including per-problem generations)
4. Update README.md with the actual numbers once verified
5. Commit the full JSON results to `stack-2.9-eval/results/` for reproducibility

**Do NOT publish** the previously claimed percentages. They are invalid.