Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation Paper • 2602.12125 • Published 20 days ago • 58
Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments Paper • 2602.11964 • Published 20 days ago • 12
Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR Paper • 2602.05261 • Published 28 days ago • 49
Scaling Embeddings Outperforms Scaling Experts in Language Models Paper • 2601.21204 • Published Jan 29 • 100