LLM Safety Alignment is Divergence Estimation in Disguise

Rajdeep Haldar; Ziyi Wang; Guang Lin; Yue Xing; Qifan Song

NeurIPS 2025

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

We present a theoretical framework showing that popular LLM alignment methods—including RLHF and its variants—can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance–refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also acts as a statistically significant indicator for model safety.
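
For orientation, one standard way to estimate a KL divergence from samples is the Donsker-Varadhan variational representation, sketched below. The abstract does not state which representation KLDO relies on, so treat this choice as an assumption. The symbols are also illustrative: P stands for the aligned distribution, Q for the unaligned one, and T for a learnable critic function.

```latex
% Donsker-Varadhan variational representation of the KL divergence.
% Assumed notation (not from the abstract): P = aligned distribution,
% Q = unaligned distribution, T = critic function over which we maximize.
\[
D_{\mathrm{KL}}(P \,\|\, Q)
  \;=\; \sup_{T}\; \Big\{ \mathbb{E}_{x \sim P}\big[T(x)\big]
  \;-\; \log \mathbb{E}_{x \sim Q}\big[e^{T(x)}\big] \Big\}
\]
% The supremum is attained at T^*(x) = \log \frac{dP}{dQ}(x) + c for any constant c,
% so any fixed T gives a lower bound on the divergence.
```

Under this reading, a training objective that maximizes the bracketed quantity over a parameterized T acts as a divergence estimator, which is the kind of connection the abstract describes.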

Context

Venue: Annual Conference on Neural Information Processing Systems
Paper id: 270334257266930152