
AAAI 2025

Single Character Perturbations Break LLM Alignment

Conference Paper AAAI Technical Track on AI Alignment Artificial Intelligence

Abstract

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space or other single-character token to the end of a model's input. In a study of a variety of open-source models, we demonstrate that this simple perturbation causes the majority of models to generate harmful outputs with very high probability. We further find that both Claude and GPT-3.5 exhibit the same behavior. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to answer in lists or other formatted responses, overriding training signals to refuse unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.
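The perturbation the abstract describes can be sketched in a few lines. This is an illustrative assumption of how such inputs might be constructed, not code from the paper; the helper name, the example prompt template, and the candidate character set are hypothetical, though the paper highlights the plain space character.

```python
# Hypothetical sketch: append a single-character perturbation to the end of a
# prompt string before it is tokenized and sent to a model. The list of
# candidate characters below is an assumption for illustration.

def perturb(prompt: str, char: str = " ") -> str:
    """Return the prompt with a single trailing character appended."""
    return prompt + char

base_prompt = "Describe the capital of France."
variants = [perturb(base_prompt, c) for c in [" ", "\t", "."]]
for v in variants:
    print(repr(v))  # trailing character is visible in the repr
```

Because the appended character changes the final token(s) of the input, the tokenized sequence the model sees can differ from the refusal-trained distribution, which is the mechanism the abstract points to.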

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: AAAI Conference on Artificial Intelligence
Archive span: 1980-2026
Indexed papers: 28718
Paper id: 109381314004725734