Stack 2.9 vs Leading AI Models
Pass@1 scores on standard coding benchmarks
| Model | HumanEval | MBPP | SWE-bench | Tool Use | Parameters |
|---|---|---|---|---|---|
| Stack 2.9 | TBD | TBD | TBD | TBD | 32B |
| Qwen2.5-Coder-32B | 76.8% | 82.3% | 18.2% | 78.5% | 32B |
| CodeLlama-34B | 62.2% | 70.1% | 12.8% | 65.2% | 34B |
| DeepSeek-Coder-33B | 70.7% | 75.8% | 15.6% | 72.1% | 33B |
| Claude 3.5 Sonnet | 71.2% | 78.4% | 34.1% | 89.3% | N/A |
| GPT-4 | 67.8% | 74.2% | 28.5% | 82.1% | ~1.7T |
| Gemini Pro 1.5 | 64.5% | 71.8% | 22.3% | 75.4% | N/A |
OpenClaw-specific capabilities where Stack 2.9 shines:
- File operations: read, write, edit, search, move files
- Code: execute, debug, test, refactor
- System: shell, git, docker, process management
- Network: HTTP, WebSocket, database queries
- Workflows: complex chained operations
- Data: parse, format, validate, convert
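To make the "complex chained operations" category concrete, here is a minimal sketch of a search → edit → execute chain. The agent's own tool API is not shown in this document, so the sketch uses only the Python standard library; the `chained_fix` function and the `TODO`/`DONE` marker are illustrative assumptions, not part of Stack 2.9.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def chained_fix(repo: Path) -> str:
    """Hypothetical chained operation mirroring the categories above:
    file operations (search/edit), then code execution (subprocess)."""
    # File operations: search each Python file for a marker and edit in place.
    for src in repo.glob("*.py"):
        text = src.read_text()
        if "TODO" in text:
            src.write_text(text.replace("TODO", "DONE"))
    # Code execution / process management: run the edited entry point.
    result = subprocess.run(
        [sys.executable, str(repo / "main.py")],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# Usage: a throwaway "repo" containing a single script.
with tempfile.TemporaryDirectory() as d:
    repo = Path(d)
    (repo / "main.py").write_text('print("TODO")\n')
    print(chained_fix(repo))  # the edit step rewrites TODO before the run
```

The point of the sketch is the ordering: edits land on disk before the process launch, so a single chained call can both transform a file and observe the transformed behavior.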
Stack 2.9 gets better the more you use it.*

\* Based on simulated self-improvement training; actual performance varies by use case.