# Evaluation Audit & Methodology

**Status:** Under Independent Verification

## Critical Findings

After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:

### 1. Incomplete Test Sets

- **HumanEval**: Only **20 out of 164 problems** (~12%) were evaluated
- **MBPP**: Only **20 out of 500 problems** (~4%) were evaluated

The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance.

### 2. Missing Model Inference

Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:

- The scripts return **pre-written canonical solutions** instead of actual model inference
- No API calls to Ollama/OpenAI/Anthropic providers were made
- No model-generated outputs exist in the `results/` directory
- The `results/humaneval.json` file contains a 0% failure rate from a broken run

**Conclusion:** The benchmark numbers appear to be fabricated or, at best, unverified.

### 3. Tool Use Benchmark Unimplemented

The claimed 94.1% Tool Use score lacks:
- A proper benchmark dataset
- A defined evaluation methodology
- Reproduction instructions
- Actual model calls to test tool selection accuracy

It appears to be a custom, non-standard metric with no basis in accepted benchmarks.

---

## Proper Evaluation Framework

We have built a new, rigorous evaluation infrastructure:

### Official Datasets

```bash
# Download HumanEval (164 problems) and MBPP (500 problems)
python scripts/download_benchmark_datasets.py --data-dir ./data
```

This script:
- fetches HumanEval from OpenAI's official dataset
- fetches MBPP from Google's benchmark suite
- verifies correct formatting and ground-truth solutions for both
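
For orientation, here is a minimal sketch of what the fetch step might look like. The URLs are the commonly cited locations of the official files (OpenAI's `human-eval` repository and Google's `google-research` MBPP directory) and the output layout is an assumption; the actual `scripts/download_benchmark_datasets.py` may differ in detail.

```python
# Sketch of a benchmark download step (assumed URLs and output paths;
# the real scripts/download_benchmark_datasets.py may differ).
import gzip
import urllib.request
from pathlib import Path

SOURCES = {
    # HumanEval ships as gzipped JSONL in OpenAI's human-eval repo.
    "humaneval.jsonl.gz": "https://github.com/openai/human-eval/raw/master/data/HumanEval.jsonl.gz",
    # MBPP ships as plain JSONL in Google's google-research repo.
    "mbpp.jsonl": "https://raw.githubusercontent.com/google-research/google-research/master/mbpp/mbpp.jsonl",
}

def download(data_dir: str = "./data") -> None:
    out = Path(data_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, url in SOURCES.items():
        target = out / name
        urllib.request.urlretrieve(url, target)
        # Decompress HumanEval so both datasets end up as plain JSONL on disk.
        if target.suffix == ".gz":
            target.with_suffix("").write_bytes(gzip.decompress(target.read_bytes()))

if __name__ == "__main__":
    download()
```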

### Unified Evaluation Runner

`stack-2.9-eval/run_proper_evaluation.py` provides:

```bash
python stack-2.9-eval/run_proper_evaluation.py \
    --benchmark humaneval \
    --provider ollama \
    --model qwen2.5-coder:32b \
    --k-samples 100 \
    --output-dir ./results/humaneval_run
```

Features:
- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
- Proper `pass@k` calculation with confidence intervals (estimator sketched after this list)
- Per-problem detailed logs (JSON)
- Reproducible random sampling (seeds)
- Parallel evaluation (configurable workers)
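
The `pass@k` feature is presumably the standard unbiased estimator from the HumanEval paper (Chen et al., 2021): generate n samples per problem, count the c that pass the unit tests, and estimate pass@k as 1 - C(n-c, k)/C(n, k). A self-contained sketch (not necessarily the exact code in `run_proper_evaluation.py`):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them passed the tests.

    Computes 1 - C(n-c, k) / C(n, k), i.e. the probability that at least
    one of k samples drawn without replacement passes, in a numerically
    stable product form.
    """
    if n - c < k:
        return 1.0  # every subset of size k contains at least one pass
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 100 samples for one problem, 37 of which pass.
print(pass_at_k(n=100, c=37, k=1))   # ~= 0.37 (pass@1 reduces to c/n)
print(pass_at_k(n=100, c=37, k=10))  # probability at least one of 10 passes
```

Confidence intervals can then be obtained by bootstrapping the per-problem estimates across the benchmark.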

### Evaluation Checklist

To ensure transparency, every proper evaluation must:

1. ✅ Use the full official benchmarks (164 HumanEval, 500 MBPP)
2. ✅ Call real model inference via `model_client.py`
3. ✅ Run with k≥100 samples for pass@1 estimation
4. ✅ Store all generation outputs for audit (example record below)
5. ✅ Compute standard deviation and confidence intervals
6. ✅ Publish full JSON logs to the `results/` directory
7. ✅ Document exact model version, quantization, and provider settings
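
Items 4-7 imply one auditable record per problem. The field names below are hypothetical, shown only to illustrate what "store all generation outputs" might mean in practice; the actual schema is whatever `run_proper_evaluation.py` writes:

```python
# Hypothetical shape of one per-problem audit record (field names are
# illustrative, not the actual schema of run_proper_evaluation.py).
example_record = {
    "task_id": "HumanEval/0",
    "provider": "ollama",
    "model": "qwen2.5-coder:32b",
    "seed": 1234,
    "temperature": 0.2,
    "samples": [
        {"completion": "def has_close_elements(...): ...", "passed": True},
        # ... one entry per generated sample, k entries in total
    ],
    "pass_at_1": 0.37,
}
```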

---

## Current Status

The previously claimed scores have been **removed** from README.md and BENCHMARKS.md and replaced with:

| Benchmark | Status | Notes |
|-----------|--------|-------|
| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
| MBPP | Pending verification | Full 500-problem evaluation setup ready |
| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required (case format sketched below) |
| GSM8K | Not started | Math reasoning evaluation planned |
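
For the Tool Use row, one possible shape for a single tool-calling test case is sketched below. It is purely illustrative: the tool names, fields, and scoring rules are assumptions, not an existing OpenClaw schema, and the eventual benchmark design may look quite different.

```python
# One possible tool-calling test case (illustrative only; tool names,
# arguments, and scoring fields are assumptions, not an existing schema).
tool_use_case = {
    "id": "tool-use-0001",
    "prompt": "Find every TODO comment under src/ and summarise them.",
    "available_tools": ["grep", "read_file", "write_file", "run_shell"],
    "expected_tool_calls": [
        {"tool": "grep", "args": {"pattern": "TODO", "path": "src/"}},
    ],
    "scoring": {
        "must_call": ["grep"],            # correct tool selected
        "must_not_call": ["write_file"],  # no unnecessary side effects
    },
}
```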

Expected baseline (Qwen2.5-Coder-32B):
- HumanEval: ~70-72% pass@1
- MBPP: ~75-77% pass@1

Stack 2.9's fine-tuned performance will be published after running proper evaluations.

---

## What Changed

- Created `scripts/download_benchmark_datasets.py` for official datasets
- Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
- Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
- Added deprecation warnings to the flawed `human_eval.py`, `mbpp_eval.py`, and `tool_use_eval.py` (example below)
- Updated README.md, BENCHMARKS.md, and the website pages to remove the false claims
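
The deprecation warnings mentioned above can be as simple as a module-level guard that fires on import; a sketch of that approach (the exact wording in the legacy scripts may differ):

```python
# Example module-level deprecation guard for a legacy evaluation script
# (illustrative; the exact message in human_eval.py / mbpp_eval.py may differ).
import warnings

warnings.warn(
    "human_eval.py is deprecated: it returns canonical solutions instead of "
    "model output. Use stack-2.9-eval/run_proper_evaluation.py instead.",
    DeprecationWarning,
    stacklevel=2,
)
```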

---

## How to Publish Verified Scores

1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100 --output-dir ./results/humaneval_run`
3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
4. Update README.md with actual numbers once verified
5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility

**Do NOT publish** the previously claimed percentages. They are invalid.