Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Paper • 2606.11409 • Published 11 days ago • 9
Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance? Paper • 2606.12250 • Published 10 days ago
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models Paper • 2603.06148 • Published Mar 6 • 2
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Paper • 2605.31433 • Published 22 days ago • 28
RbtAct: Rebuttal as Supervision for Actionable Review Feedback Generation Paper • 2603.09723 • Published Mar 10 • 7