SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization

Yue Huang; Xiangqi Wang; Xiangliang Zhang

doi:10.1609/aaai.v40i37.40384

AAAI 2026

Conference Paper; AAAI Technical Track on Natural Language Processing II; Artificial Intelligence

Abstract

In high-stakes scenarios (such as self-harm, legal, or medical queries), LLMs must be both trustworthy and helpful. However, these goals often conflict. We propose priority alignment, a new alignment paradigm that enforces a strict “trustworthy-before-helpful” ordering: optimization of helpfulness is conditioned on first meeting trustworthiness thresholds (e.g., harmlessness or honesty). To realize this, we introduce Self-Priority Alignment (SPA), a fully unsupervised framework in which the model generates diverse responses, self-evaluates and refines them, and applies dual-criterion denoising to remove inconsistency and control variance. From the denoised responses, SPA constructs lexicographically ordered preference pairs and fine-tunes the model using an uncertainty-weighted alignment loss that emphasizes high-confidence, high-gap decisions. Experiments across multiple benchmarks show that SPA improves helpfulness without compromising safety, outperforming strong baselines while preserving general capabilities. Our results demonstrate that SPA provides a scalable and interpretable alignment strategy for critical LLM applications.
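
The "trustworthy-before-helpful" ordering can be illustrated with a minimal sketch of lexicographic preference-pair construction. This is not the authors' implementation: the `Scored` record, the `[0, 1]` score ranges, the `trust_threshold` value, and the tie-breaking rule below the threshold are all assumptions made for illustration, and the denoising step and the uncertainty-weighted loss are omitted.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Scored:
    """A candidate response with hypothetical self-evaluated scores in [0, 1]."""
    text: str
    trust: float        # assumed trustworthiness score (e.g., harmlessness)
    helpfulness: float  # assumed helpfulness score


def priority_key(r: Scored, trust_threshold: float) -> tuple:
    """Lexicographic key: meeting the trust threshold always comes first.

    Among responses that meet the threshold, helpfulness decides.
    Among responses that miss it, the higher trust score decides
    (an assumed tie-breaking rule, not taken from the paper).
    """
    meets = r.trust >= trust_threshold
    return (meets, r.helpfulness if meets else r.trust)


def build_pairs(cands, trust_threshold=0.5):
    """Return (chosen, rejected) text pairs under the lexicographic order."""
    pairs = []
    for a, b in combinations(cands, 2):
        ka = priority_key(a, trust_threshold)
        kb = priority_key(b, trust_threshold)
        if ka == kb:
            continue  # no preference signal when the keys tie
        chosen, rejected = (a, b) if ka > kb else (b, a)
        pairs.append((chosen.text, rejected.text))
    return pairs


candidates = [
    Scored("safe-helpful", trust=0.9, helpfulness=0.8),
    Scored("safe-terse", trust=0.9, helpfulness=0.3),
    Scored("unsafe-helpful", trust=0.2, helpfulness=0.95),
]
pairs = build_pairs(candidates)
```

In this sketch the response with the highest helpfulness score (`unsafe-helpful`) is never chosen, because it fails the assumed trust threshold; helpfulness only separates the two responses that pass it.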