Arrow Research search
Back to ICML

ICML 2024

MaxMin-RLHF: Alignment with Diverse Human Preferences

Conference Paper Accept (Poster) Artificial Intelligence ยท Machine Learning

Abstract

Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, the single reward model overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting its insufficiency in representing diverse human preferences. Next, we propose to learn a mixture of reward models via an expectation-maximization algorithm and solve a MaxMin alignment objective inspired by the Egalitarian principle in social choice theory to better honor diverse human preferences. We present comprehensive experimental results on small-scale (GPT-2) and large-scale language (with Tulu2-7B)) and show the efficacy of the proposed approach in the presence of diversity among human preferences. We remark that our findings in this work are not only limited to language models but also extend to reinforcement learning in general.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
International Conference on Machine Learning
Archive span
1993-2025
Indexed papers
16471
Paper id
195387566296576824