Evaluation Audit & Methodology
Status: Under Independent Verification
Critical Findings
After comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:
1. Incomplete Test Sets
- HumanEval: Only 20 out of 164 problems (~12%) were evaluated
- MBPP: Only 20 out of 500 problems (~4%) were evaluated
The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore not representative of full benchmark performance.
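To illustrate why a 20-problem subset cannot support a headline number, a back-of-envelope confidence interval (normal approximation for a binomial proportion; the 76.8% figure is the claimed score, not a verified one) shows the margin of error at n=20 is roughly three times that of the full benchmark:

```python
import math

def margin_95(p: float, n: int) -> float:
    """Half-width of a 95% normal-approximation CI for a pass rate p over n problems."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

# Claimed 76.8% on a 20-problem HumanEval subset vs. the full 164-problem set.
print(f"n=20:  +/- {margin_95(0.768, 20):.1%}")   # → n=20:  +/- 18.5%
print(f"n=164: +/- {margin_95(0.768, 164):.1%}")  # → n=164: +/- 6.5%
```

A score of 76.8% measured on 20 problems is statistically indistinguishable from anything between roughly 58% and 95%.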
2. Missing Model Inference
Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:
- The scripts return pre-written canonical solutions instead of running actual model inference
- No API calls to Ollama/OpenAI/Anthropic providers were made
- No model-generated outputs exist in the `results/` directory
- The `results/humaneval.json` file reports a 0% failure rate from a broken run
Conclusion: The benchmark numbers appear to be fabricated or, at best, unverified.
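One cheap automated guard against this failure mode is to flag any result record whose generation is byte-identical to the benchmark's reference solution. A minimal sketch (field names here are illustrative, not the actual log schema):

```python
def flag_suspicious(results: list[dict]) -> list[str]:
    """Return task_ids whose 'generation' exactly matches the benchmark's
    'canonical_solution' -- a strong sign that no inference actually ran.
    Field names are assumptions, not the real result format."""
    return [r["task_id"] for r in results
            if r["generation"].strip() == r["canonical_solution"].strip()]

records = [
    {"task_id": "HumanEval/0", "generation": "return sorted(xs)",
     "canonical_solution": "return sorted(xs)"},
    {"task_id": "HumanEval/1", "generation": "return xs[::-1]",
     "canonical_solution": "return list(reversed(xs))"},
]
print(flag_suspicious(records))  # → ['HumanEval/0']
```

An occasional exact match is possible on trivial problems, but a high match rate across the suite should fail the audit.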
3. Tool Use Benchmark Unimplemented
The claimed 94.1% Tool Use score lacks:
- Any proper benchmark dataset
- Defined evaluation methodology
- Reproduction instructions
- Actual model calls to test tool selection accuracy
It appears to be a custom, non-standard metric with no basis in accepted benchmarks.
Proper Evaluation Framework
We have built a new, rigorous evaluation infrastructure:
Official Datasets
```bash
# Download HumanEval (164 problems) and MBPP (500 problems)
python scripts/download_benchmark_datasets.py --data-dir ./data
```
This script:
- Fetches HumanEval from OpenAI's official dataset
- Fetches MBPP from Google's benchmark suite
- Ensures correct formatting and ground-truth solutions
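A simple post-download sanity check is to count the records before running anything. A sketch for JSONL-formatted benchmark files (the on-disk paths and format are assumptions about what the download script produces):

```python
import io
import json

def count_records(fp) -> int:
    """Count non-empty JSONL records, validating that each line parses as JSON."""
    return sum(1 for line in fp if line.strip() and json.loads(line) is not None)

# In practice, check the downloaded files (paths are assumptions):
#   count_records(open("data/HumanEval.jsonl")) == 164
#   count_records(open("data/mbpp.jsonl")) == 500
sample = io.StringIO('{"task_id": "HumanEval/0"}\n\n{"task_id": "HumanEval/1"}\n')
print(count_records(sample))  # → 2
```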
Unified Evaluation Runner
`stack-2.9-eval/run_proper_evaluation.py` provides:

```bash
python stack-2.9-eval/run_proper_evaluation.py \
  --benchmark humaneval \
  --provider ollama \
  --model qwen2.5-coder:32b \
  --k-samples 100 \
  --output-dir ./results/humaneval_run
```
Features:
- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
- Proper `pass@k` calculation with confidence intervals
- Per-problem detailed logs (JSON)
- Reproducible random sampling (seeds)
- Parallel evaluation (configurable workers)
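The `pass@k` calculation should use the standard unbiased estimator from the original HumanEval release, not a naive "best of k" over a fixed batch. A self-contained sketch of that estimator (whether the runner implements it exactly this way is an assumption):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass the tests, k <= n. Returns the probability that
    at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: guaranteed pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With 100 samples and 40 passing, pass@1 is the raw pass rate...
print(round(pass_at_k(100, 40, 1), 3))   # → 0.4
# ...while pass@10 is far higher, since only one of ten draws must pass.
print(round(pass_at_k(100, 40, 10), 3))
```

The per-problem values are then averaged across the benchmark to produce the reported score, which is why storing all n generations per problem matters for auditability.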
Evaluation Checklist
To ensure transparency, every proper evaluation must:
- ✅ Use full official benchmark (164 HumanEval, 500 MBPP)
- ✅ Call real model inference via `model_client.py`
- ✅ Run with k≥100 samples for pass@1 estimation
- ✅ Store all generation outputs for audit
- ✅ Compute standard deviation and confidence intervals
- ✅ Publish full JSON logs to the `results/` directory
- ✅ Document exact model version, quantization, and provider settings
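To make the "store everything" items concrete, each generation can be written as one JSONL line carrying enough metadata to reproduce it. A sketch of such a record (the field names are assumptions for illustration, not the runner's actual schema):

```python
import json
import time

# Hypothetical per-problem log record; one JSONL line per generation.
record = {
    "task_id": "HumanEval/0",
    "model": "qwen2.5-coder:32b",
    "provider": "ollama",
    "seed": 1234,
    "temperature": 0.2,
    "generation": "def has_close_elements(...): ...",
    "passed": True,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
}
line = json.dumps(record)
print(line[:40])
```

Logging the seed, sampling parameters, and exact model tag alongside every generation is what lets a third party re-run and diff the results.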
Current Status
The previously claimed scores have been removed from README.md and BENCHMARKS.md. They are replaced with:
| Benchmark | Status | Notes |
|---|---|---|
| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
| MBPP | Pending verification | Full 500-problem evaluation setup ready |
| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
| GSM8K | Not started | Math reasoning evaluation planned |
Expected baseline (Qwen2.5-Coder-32B):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1
Stack 2.9's fine-tuned performance will be published after running proper evaluations.
What Changed
- Created `scripts/download_benchmark_datasets.py` for official datasets
- Created `stack-2.9-eval/run_proper_evaluation.py`, a unified runner
- Created `stack-2.9-eval/test_evaluation_setup.py` to validate the environment
- Added deprecation warnings to the flawed `human_eval.py`, `mbpp_eval.py`, and `tool_use_eval.py`
- Updated README.md, BENCHMARKS.md, and website pages to remove the false claims
How to Publish Verified Scores
1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
4. Update README.md with actual numbers once verified
5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
Do NOT publish the previously claimed percentages. They are invalid.