
NeurIPS 2025

On the Robustness of Verbal Confidence of LLMs in Adversarial Attacks

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Robust verbal confidence generated by large language models (LLMs) is crucial for the deployment of LLMs to help ensure transparency, trust, and safety in many applications, including those involving human-AI interactions. In this paper, we present the first comprehensive study on the robustness of verbal confidence under adversarial attacks. We introduce attack frameworks targeting verbal confidence scores through both perturbation and jailbreak-based methods, and demonstrate that these attacks can significantly impair verbal confidence estimates and lead to frequent answer changes. We examine a variety of prompting strategies, model sizes, and application domains, revealing that current verbal confidence is vulnerable and that commonly used defence techniques are largely ineffective or counterproductive. Our findings underscore the need to design robust mechanisms for confidence expression in LLMs, as even subtle semantic-preserving modifications can lead to misleading confidence in responses.
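As a rough illustration of the kind of probe the abstract describes, the sketch below queries a model for an answer plus a verbal confidence score, applies a simple semantic-preserving perturbation, and compares the two readings. It is not the paper's attack framework: the prompt template, the `perturb` synonym swap, and the `dummy_model` placeholder are all illustrative assumptions standing in for a real LLM client and a real perturbation method.

```python
import re
from typing import Callable, Tuple

# Illustrative prompt template asking for an answer plus a verbal confidence score.
CONFIDENCE_PROMPT = (
    "Answer the question, then state how confident you are as a percentage.\n"
    "Question: {question}\n"
    "Respond exactly as: Answer: <answer> Confidence: <0-100>%"
)

def verbal_confidence(model: Callable[[str], str], question: str) -> Tuple[str, float]:
    """Ask the model for an answer and a verbal confidence score, and parse both."""
    reply = model(CONFIDENCE_PROMPT.format(question=question))
    ans = re.search(r"Answer:\s*(.*?)\s*Confidence:", reply, re.S)
    conf = re.search(r"Confidence:\s*(\d+(?:\.\d+)?)\s*%", reply)
    return (ans.group(1) if ans else reply.strip(),
            float(conf.group(1)) if conf else float("nan"))

def perturb(question: str) -> str:
    """Toy semantic-preserving edit: swap one word for a close synonym."""
    return question.replace("biggest", "largest")

if __name__ == "__main__":
    # Placeholder model so the sketch runs end to end; replace with a real LLM call.
    def dummy_model(prompt: str) -> str:
        return "Answer: Jupiter Confidence: 90%"

    q = "What is the biggest planet in the Solar System?"
    base = verbal_confidence(dummy_model, q)
    pert = verbal_confidence(dummy_model, perturb(q))
    # A robust model should keep both the answer and the stated confidence
    # roughly stable under such an edit; large swings are the kind of
    # fragility the paper studies.
    print("original:", base, "| perturbed:", pert)
```

In practice one would substitute an actual chat-completion call for `dummy_model` and a stronger perturbation (paraphrase, typo injection, or a jailbreak-style prefix) for the synonym swap; the comparison logic stays the same.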

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: Annual Conference on Neural Information Processing Systems
Archive span: 1987-2025
Indexed papers: 30776
Paper id: 679443960552201214