
AAAI 2025

Single Character Perturbations Break LLM Alignment

Conference Paper AAAI Technical Track on AI Alignment Artificial Intelligence

Abstract

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space or other single-character token to the end of a model's input. In a study of a variety of open-source models, we demonstrate that this simple perturbation causes the majority of models to generate harmful outputs with very high probability. We further find that both Claude and GPT-3.5 exhibit the same behavior. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to answer in lists or other formatted responses, overriding training signals to refuse unsafe requests. Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.
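The perturbation the abstract describes can be sketched in a few lines. This is an illustrative assumption of how such inputs might be constructed, not code from the paper; the helper name, the example prompt template, and the candidate character set are hypothetical, though the paper highlights the plain space character.

```python
# Hypothetical sketch: append a single-character perturbation to the end of a
# prompt string before it is tokenized and sent to a model. The list of
# candidate characters below is an assumption for illustration.

def perturb(prompt: str, char: str = " ") -> str:
    """Return the prompt with a single trailing character appended."""
    return prompt + char

base_prompt = "Describe the capital of France."
variants = [perturb(base_prompt, c) for c in [" ", "\t", "."]]
for v in variants:
    print(repr(v))  # trailing character is visible in the repr
```

Because the appended character changes the final token(s) of the input, the tokenized sequence the model sees can differ from the refusal-trained distribution, which is the mechanism the abstract points to.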

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: AAAI Conference on Artificial Intelligence
Archive span: 1980-2026
Indexed papers: 28718
Paper id: 109381314004725734