AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints Paper • 2606.05622 • Published 1 day ago • 30
Ψ-Bench: Evaluating Persona-Sensitive Influencing in Persuasive Dialogues Paper • 2606.02754 • Published 4 days ago • 13
Advancing Creative Physical Intelligence in Large Multimodal Models Paper • 2605.26396 • Published 12 days ago • 19
Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe Paper • 2604.13016 • Published Apr 14 • 109
CreativityBench: Evaluating Agent Creative Reasoning via Affordance-Based Tool Repurposing Paper • 2605.02910 • Published about 1 month ago • 22
PEARL: Self-Evolving Assistant for Time Management with Reinforcement Learning Paper • 2601.11957 • Published Jan 28 • 3
NarrativeTrack: Evaluating Video Language Models Beyond the Frame Paper • 2601.01095 • Published Jan 3 • 8
Tool-R0: Self-Evolving LLM Agents for Tool-Learning from Zero Data Paper • 2602.21320 • Published Feb 24 • 12
AgentDoG Collection A Diagnostic Guardrail Framework for AI Agent Safety and Security • 12 items • Updated 24 days ago • 112
Multimodal Policy Internalization for Conversational Agents Paper • 2510.09474 • Published Oct 10, 2025 • 5