VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training Paper • 2602.10693 • Published 23 days ago • 216
Does Your Reasoning Model Implicitly Know When to Stop Thinking? Paper • 2602.08354 • Published 25 days ago • 258
QED-Nano: Teaching a Tiny Model to Prove Hard Theorems • Running • Featured • 57 • Who needs 1T parameters? Olympiad proofs with a 4B model