AAAI 2026 Conference Paper
Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs
- Xikang Yang
- Biyu Zhou
- Xuehai Tang
- Jizhong Han
- Songlin Hu
Large Language Models (LLMs) demonstrate impressive capabilities across diverse tasks, yet their safety mechanisms remain susceptible to adversarial exploitation of cognitive biases---systematic deviations from rational judgment. Unlike prior studies that focus on isolated biases, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. Specifically, we propose CognitiveAttack, a novel red-teaming framework that adaptively selects optimal ensembles from 154 cognitive biases defined in social psychology and engineers them into adversarial prompts that effectively compromise LLM safety mechanisms. Experimental results reveal systemic vulnerabilities across 30 mainstream LLMs, particularly open-source variants. CognitiveAttack achieves a substantially higher attack success rate than the state-of-the-art black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defenses. Through quantitative analysis of successful jailbreaks, we further identify vulnerability patterns in safety-aligned LLMs under synergistic cognitive biases, validating multi-bias interactions as a potent yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.