Author name cluster

Pankayaraj Pathmanathan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers

1 author row

AAAI Conference 2026 Conference Paper

AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment

Pankayaraj Pathmanathan
Udari Madhushani Sehwag
Michael-Andrei Panaitescu-Liess
Cho-Yu Jason Chiang
Furong Huang

With the increasing adoption of reinforcement learning with human feedback (RLHF) to align large language models (LLMs), the risk of backdoor installation during the alignment process has grown, potentially leading to unintended and harmful behaviors. Existing backdoor attacks mostly focus on simpler tasks, such as sequence classification, making them either difficult to install in LLM alignment or installable but easily detectable and removable. In this work, we introduce AdvBDGen, a generative fine-tuning framework that automatically creates prompt-specific paraphrases as triggers, enabling stealthier and more resilient backdoor attacks in LLM alignment. AdvBDGen is designed to exploit the disparities in learning speeds between strong and weak discriminators to craft backdoors that are both installable and stealthy. Using as little as 3% of the fine-tuning data, AdvBDGen can install highly effective backdoor triggers that, once installed, not only jailbreak LLMs during inference but also exhibit greater stability against input perturbations and improved robustness to trigger removal methods. Our findings highlight the growing vulnerability of LLM alignment pipelines to advanced backdoor attacks, underscoring the pressing need for more robust defense mechanisms.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?

Michael-Andrei Panaitescu-Liess
Zora Che
Bang An
Yuancheng Xu
Pankayaraj Pathmanathan
Souradip Chakraborty
Sicheng Zhu
Tom Goldstein

Large Language Models (LLMs) have demonstrated impressive capabilities in generating diverse and contextually rich text. However, concerns regarding copyright infringement arise as LLMs may inadvertently produce copyrighted material. In this paper, we first investigate the effectiveness of watermarking LLMs as a deterrent against the generation of copyrighted texts. Through theoretical analysis and empirical evaluation, we demonstrate that incorporating watermarks into LLMs significantly reduces the likelihood of generating copyrighted content, thereby addressing a critical concern in the deployment of LLMs. However, we also find that watermarking can have unintended consequences on Membership Inference Attacks (MIAs), which aim to discern whether a sample was part of the pretraining dataset and may be used to detect copyright violations. Surprisingly, we find that watermarking adversely affects the success rate of MIAs, complicating the task of detecting copyrighted text in the pretraining dataset. These results reveal the complex interplay between different regulatory measures, which may impact each other in unforeseen ways. Finally, we propose an adaptive technique to improve the success rate of a recent MIA under watermarking. Our findings underscore the importance of developing adaptive methods to study critical problems in LLMs with potential legal implications.

PDF Details DOI

AAAI Conference 2025 Conference Paper

Is Poisoning a Real Threat to DPO? Maybe More So Than You Think

Pankayaraj Pathmanathan
Souradip Chakraborty
Xiangyu Liu
Yongyuan Liang
Furong Huang

Recent advancements in Reinforcement Learning with Human Feedback (RLHF) have significantly impacted the alignment of Large Language Models (LLMs). The sensitivity of reinforcement learning algorithms such as Proximal Policy Optimization (PPO) has led to new line work on Direct Preference Optimization (DPO), which treats RLHF in a supervised learning framework. The increased practical use of these RLHF methods warrants an analysis of their vulnerabilities. In this work, we investigate the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning, a first of its kind. We comprehensively analyze DPO's vulnerabilities under different types of attacks, i.e., backdoor and non-backdoor attacks, and different poisoning methods across a wide array of language models, i.e., LLama 7B, Mistral 7B, and Gemma 7B. We find that unlike PPO-based methods, which, when it comes to backdoor attacks, require at least 4% of the data to be poisoned to elicit harmful behavior, we exploit the vulnerabilities of DPO by simpler methods so we can poison the model with only as much as 0.5% of the data. We further the investigate efficacy of the existing defence methods and find that these poisoning attacks can evade the existing data anomaly detection methods.

PDF Details DOI