Semantic Representation Attack against Aligned Large Language Models

Jiawei Lian; Jianhong Pan; Lefan Wang; Yi Wang; Shaohui Mei; Lap-Pui Chau

NeurIPS 2025

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Large Language Models (LLMs) increasingly employ alignment techniques to prevent harmful outputs. Despite these safeguards, attackers can circumvent them by crafting prompts that induce LLMs to generate harmful content. Current methods typically target exact affirmative responses, suffering from limited convergence, unnatural prompts, and high computational costs. We introduce semantic representation attacks, a novel paradigm that fundamentally reconceptualizes adversarial objectives against aligned LLMs. Rather than targeting exact textual patterns, our approach exploits the semantic representation space that can elicit diverse responses that share equivalent harmful meanings. This innovation resolves the inherent trade-off between attack effectiveness and prompt naturalness that plagues existing methods. Our Semantic Representation Heuristic Search (SRHS) algorithm efficiently generates semantically coherent adversarial prompts by maintaining interpretability during incremental search. We establish rigorous theoretical guarantees for semantic convergence and demonstrate that SRHS achieves unprecedented attack success rates (89.4% averaged across 18 LLMs, including 100% on 11 models) while significantly reducing computational requirements. Extensive experiments show that our method consistently outperforms existing approaches.
