Ran a small controlled study on a frozen 40-task slice of Harbor Terminal-Bench-Pro, using the same model (minimax/minimax-m2.5) with two agent harnesses: Goose and OpenHands-SDK.
Under the base setup, reducing the turn budget from 100 to 60 pushed the two harnesses in opposite directions:
* Goose: 0.450 → 0.525
* OpenHands-SDK: 0.575 → 0.500
A tweaked 60-turn setup brought OpenHands-SDK back to 0.575. At their best, both harnesses reached the same 0.575 pass rate.
What surprised me most was the token profile: in this setup, the reported token usage for OpenHands-SDK was dramatically higher than Goose's, even though both converged to the same best score.
Same model, same task slice, different harness behavior under a tighter interaction budget.
Dataset:
namanvats/harbor-goose-openhands-benchmark
Code/configs:
https://github.com/namanvats/harbor-agent-ablation
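As a sanity check on the numbers above: on a frozen 40-task slice, every reported pass rate should be an integer count of passing tasks out of 40. A minimal sketch of that check (the run labels are my own shorthand, not names from the repo):

```python
# On a 40-task slice, each pass rate should equal k/40 for some integer k.
N_TASKS = 40

reported = {
    "goose_100_turns": 0.450,            # 18/40
    "goose_60_turns": 0.525,             # 21/40
    "openhands_100_turns": 0.575,        # 23/40
    "openhands_60_turns": 0.500,         # 20/40
    "openhands_60_turns_tweaked": 0.575, # 23/40
}

for name, rate in reported.items():
    passed = rate * N_TASKS
    # Tolerate float rounding when verifying the count is integral.
    assert abs(passed - round(passed)) < 1e-9, f"{name}: {rate} is not k/40"
    print(f"{name}: {round(passed)}/{N_TASKS} -> {rate:.3f}")
```

All five rates clear this check, which is consistent with the claim that both harnesses' best runs landed on the same 23/40 = 0.575.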