The Alignment Problem from a Deep Learning Perspective

Richard Ngo; Lawrence Chan; Sören Mindermann

ICLR 2024

Conference Paper · Accept (poster) · Artificial Intelligence · Machine Learning

Abstract

AI systems based on deep learning have reached or surpassed human performance in a range of narrow domains. In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities at many critical tasks. In this position paper, we examine the technical difficulty of fine-tuning hypothetical AGI systems based on pretrained deep models to pursue goals that are aligned with human interests. We argue that, if trained like today's most capable models, AGI systems could learn to act deceptively to receive higher reward, learn internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. AGIs with these properties would be difficult to align and may appear aligned even when they are not.

Keywords

Alignment
Safety
AGI
Position paper

Context

Venue: International Conference on Learning Representations
Archive span: 2013-2025
Indexed papers: 10294
Paper id: 343561306763173738