
NeurIPS 2025

Direct Alignment with Heterogeneous Preferences

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Alignment with human preferences is commonly framed using a universal reward function, even though human preferences are inherently heterogeneous. We formalize this heterogeneity by introducing user types and examine the limits of the homogeneity assumption. We show that aligning a single policy to heterogeneous preferences is best achieved by using the average reward across user types; doing so, however, requires additional information about the annotators. We examine improvements under different information settings, focusing on direct alignment methods. We find that minimal information can yield first-order improvements, while full feedback from each user type leads to consistent learning of the optimal policy. Surprisingly, no sample-efficient consistent direct loss exists in this latter setting. These results reveal a fundamental tension between consistency and sample efficiency in direct policy alignment.
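As a minimal sketch of the averaging claim (the notation here is assumed for illustration, not taken from the paper): with user types u ~ p(u) and type-specific rewards r_u, a KL-regularized objective for a single policy reduces to aligning against the population-average reward, since the expectation over types commutes with the expectation over responses:

\pi^\star = \arg\max_{\pi}\; \mathbb{E}_{u \sim p(u)}\, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r_u(x, y) \right] - \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) = \arg\max_{\pi}\; \mathbb{E}_{y \sim \pi}\!\left[ \bar{r}(x, y) \right] - \beta\, \mathrm{KL}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right),

where \bar{r}(x, y) = \mathbb{E}_{u \sim p(u)}\!\left[ r_u(x, y) \right].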
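One way annotator-type information could enter a direct-alignment loss is by reweighting preference pairs by their type probabilities, so the loss targets the population-average reward above. The sketch below is a hypothetical illustration in the style of a DPO loss; the function name, weighting scheme, and all parameters are assumptions, not the paper's construction.

import torch
import torch.nn.functional as F

def type_averaged_dpo_loss(policy_logp_w, policy_logp_l,
                           ref_logp_w, ref_logp_l,
                           type_weights, beta=0.1):
    # Implicit reward margin of each preference pair under the
    # KL-regularized reward parameterization used by DPO-style losses.
    margins = beta * ((policy_logp_w - ref_logp_w)
                      - (policy_logp_l - ref_logp_l))
    per_pair_loss = -F.logsigmoid(margins)
    # Reweight each pair by the probability of its annotator's type,
    # so the expected loss matches the type-averaged objective.
    return (type_weights * per_pair_loss).sum() / type_weights.sum()

# Toy usage with random log-probabilities for 8 preference pairs.
n = 8
lw, ll = torch.randn(n), torch.randn(n)
rw, rl = torch.randn(n), torch.randn(n)
weights = torch.full((n,), 1.0 / n)  # uniform over annotator types
print(type_averaged_dpo_loss(lw, ll, rw, rl, weights).item())

With uniform weights this reduces to the ordinary pairwise loss; non-uniform weights correspond to a population in which some user types are more prevalent than others.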

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: Annual Conference on Neural Information Processing Systems
Archive span: 1987–2025
Indexed papers: 30776
Paper id: 370729958708395237