MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization Paper • 2605.10784 • Published 5 days ago • 1
F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking Paper • 2605.12995 • Published 3 days ago • 1
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning Paper • 2605.02913 • Published Apr 8 • 9
rohan2810/movielens_heissen_theta_normalized_massdpo_theta_normalized_llama-3.2-3b-instruct_0.1_3_lastlaye Updated Mar 28