STAR: Efficient Preference-based Reinforcement Learning via Dual Regularization

Fengshuo Bai; Rui Zhao; Hongming Zhang; Sijia Cui; Shao Zhang; Bo Xu; Lei Han; Ying Wen; Yaodong Yang

NeurIPS 2025

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

Preference-based reinforcement learning (PbRL) bypasses complex reward engineering by learning from human feedback. However, due to the high cost of obtaining feedback, PbRL typically relies on a limited set of preference-labeled samples. This data scarcity introduces two key inefficiencies: (1) the reward model overfits to the limited feedback, leading to poor generalization to unseen samples, and (2) the agent exploits the learned reward model, exacerbating overestimation of action values in temporal difference (TD) learning. To address these issues, we propose STAR, an efficient PbRL method that integrates preference margin regularization and policy regularization. Preference margin regularization mitigates overfitting by introducing a bounded margin in reward optimization, preventing excessive bias toward specific feedback. Policy regularization bootstraps a conservative estimate $\widehat{Q}$ from well-supported state-action pairs in the replay memory, reducing overestimation during policy learning. Experimental results show that STAR improves feedback efficiency, achieving 34.8% higher performance in online settings and 29.7% in offline settings compared to state-of-the-art methods. Ablation studies confirm that STAR facilitates more robust reward and value function learning. The videos of this project are released at https://sites.google.com/view/pbrl-star.
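
The abstract does not give the form of the preference margin, so the following is only an illustrative sketch, not STAR's actual objective. It shows the standard Bradley-Terry cross-entropy loss commonly used for reward learning in PbRL, plus a hypothetical `margin` argument (an assumption introduced here) that caps the predicted return gap between two segments. Once the preferred segment leads by `margin`, the loss stops decreasing, which is one simple way a bounded margin could limit overfitting to individual labels.

```python
import numpy as np


def bt_preference_loss(returns_0, returns_1, labels, margin=None):
    """Cross-entropy loss under the Bradley-Terry preference model.

    returns_0, returns_1: predicted reward sums for each pair of segments.
    labels: 1.0 if segment 1 is preferred, 0.0 if segment 0 is preferred.
    margin: hypothetical cap on the absolute return gap (an assumption for
            illustration; the abstract does not define the margin).
    """
    gap = np.asarray(returns_1, dtype=float) - np.asarray(returns_0, dtype=float)
    if margin is not None:
        # Beyond the cap, widening the gap no longer changes the loss.
        gap = np.clip(gap, -margin, margin)
    labels = np.asarray(labels, dtype=float)
    # Stable log-sigmoid: log sigmoid(x) = -logaddexp(0, -x).
    log_p1 = -np.logaddexp(0.0, -gap)  # log P(segment 1 preferred)
    log_p0 = -np.logaddexp(0.0, gap)   # log P(segment 0 preferred)
    return float(np.mean(-(labels * log_p1 + (1.0 - labels) * log_p0)))


# Without a margin, a larger gap keeps lowering the loss; with one, it saturates.
print(bt_preference_loss([0.0], [5.0], [1.0]))
print(bt_preference_loss([0.0], [50.0], [1.0], margin=1.0))
```

With `margin=None` the function reduces to the plain Bradley-Terry loss, so the two settings can be compared directly on the same preference data.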
