Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs • Paper • 2605.30611 • Published 6 days ago • 83
FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models • Paper • 2605.15482 • Published 20 days ago • 8
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation • Paper • 2605.22355 • Published 13 days ago • 174
GFT: From Imitation to Reward Fine-Tuning with Unbiased Group Advantages and Dynamic Coefficient Rectification • Paper • 2604.14258 • Published Apr 15 • 23
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents • Paper • 2604.07429 • Published Apr 8 • 121
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models • Paper • 2604.08340 • Published Apr 9 • 8