Arrow Research

Author name cluster

Rajdeep Haldar

Papers that may be associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches; it is not a full identity-disambiguation profile.

2 papers
1 author row

Possible papers (2)

TMLR · 2026 · Journal Article

Adversarial Vulnerability from On-Manifold Inseparability and Poor Off-Manifold Convergence

  • Rajdeep Haldar
  • Yue Xing
  • Qifan Song
  • Guang Lin

We introduce a new perspective on adversarial vulnerability in image classification: fragility can arise from poor convergence in off-manifold directions. We model data as lying on low-dimensional manifolds, where on-manifold directions correspond to high-variance, data-aligned features and off-manifold directions capture low-variance, nuanced features. This variance disparity makes the optimization problem ill-conditioned for standard first-order optimizers, such as gradient descent, leading to slow or incomplete convergence in off-manifold directions. When data is inseparable along the on-manifold directions, robustness depends on learning these subtle off-manifold features, and failure to converge leaves models exposed to adversarial perturbations. On the theoretical side, we formalize this mechanism through convergence analyses of logistic regression and two-layer linear networks under first-order methods. These results highlight how ill-conditioning slows or prevents convergence in off-manifold directions, motivating the use of second-order methods, which mitigate ill-conditioning and achieve convergence across all directions. Empirically, we demonstrate that even without adversarial training, robustness improves significantly with extended training or second-order optimization, underscoring convergence as a central factor. As an auxiliary empirical finding, we observe that batch normalization suppresses these robustness gains, consistent with its implicit bias toward uniform-margin rather than max-margin solutions. By introducing the notions of on- and off-manifold convergence, this work provides a novel theoretical explanation for adversarial vulnerability.
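
To make the ill-conditioning argument concrete, here is a minimal sketch (not code from the paper) contrasting gradient descent and Newton's method on a two-feature logistic regression: a high-variance, uninformative "on-manifold" feature and a low-variance, perfectly label-aligned "off-manifold" feature. The variances, step size, iteration counts, and L2 regularizer are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n) * 2 - 1              # labels in {-1, +1}
x_on = rng.normal(0.0, 10.0, size=n)                # high variance, label-independent
x_off = 0.01 * y + rng.normal(0.0, 0.001, size=n)   # low variance, separates the classes
X = np.stack([x_on, x_off], axis=1)
lam = 1e-3                                          # L2 term keeps the optimum finite

def grad(w):
    p = 1.0 / (1.0 + np.exp(y * (X @ w)))           # sigmoid(-margin)
    return -(X * (p * y)[:, None]).mean(axis=0) + lam * w

def hess(w):
    s = 1.0 / (1.0 + np.exp(-y * (X @ w)))          # sigmoid(margin)
    return (X.T * (s * (1 - s))) @ X / n + lam * np.eye(2)

w_gd = np.zeros(2)                                  # gradient descent: step size is
for _ in range(5000):                               # capped by the high-curvature
    w_gd -= 1e-2 * grad(w_gd)                       # on-manifold direction

w_nt = np.zeros(2)                                  # Newton: rescales by curvature,
for _ in range(20):                                 # so all directions converge
    w_nt -= np.linalg.solve(hess(w_nt), grad(w_nt))

print("GD     weights:", w_gd)                      # off-manifold weight barely moved
print("Newton weights:", w_nt)                      # off-manifold weight near its optimum
```

With these numbers the curvature is roughly 25 in the on-manifold direction and about 1e-3 in the off-manifold direction, so any step size small enough to be stable on-manifold leaves the off-manifold weight almost untouched after thousands of gradient steps, which is the convergence gap the abstract describes.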

NeurIPS · 2025 · Conference Paper

LLM Safety Alignment is Divergence Estimation in Disguise

  • Rajdeep Haldar
  • Ziyi Wang
  • Guang Lin
  • Yue Xing
  • Qifan Song

We present a theoretical framework showing that popular LLM alignment methods, including RLHF and its variants, can be understood as divergence estimators between aligned (safe or preferred) and unaligned (harmful or less-preferred) distributions. This perspective explains the emergence of separation in the latent space between safe and harmful prompts after alignment. As an application of our general divergence framework, we propose KLDO, a novel KL divergence-based alignment method, and empirically validate its effectiveness. We further show that using compliance–refusal datasets, rather than standard preference-based datasets, leads to stronger separation and improved safety alignment. Finally, to quantify the separation effect, we propose a distance-based metric in the prompt representation space, which also serves as a statistically significant indicator of model safety.
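
The abstract does not give the metric's exact form, so the following is a hedged sketch of one plausible distance-based separation score: the distance between safe and harmful class means in representation space, scaled by within-class spread (a Fisher-ratio-style statistic). The function name, the toy stand-in embeddings, and the score definition are all assumptions for illustration, not the paper's method.

```python
import numpy as np

def separation_score(safe_reps: np.ndarray, harmful_reps: np.ndarray) -> float:
    """Distance between class means, scaled by within-class spread.

    Hypothetical Fisher-ratio-style score; the paper's actual metric may
    differ. Rows are prompt representations, columns are dimensions.
    """
    mu_s, mu_h = safe_reps.mean(axis=0), harmful_reps.mean(axis=0)
    between = np.linalg.norm(mu_s - mu_h)
    within = 0.5 * (safe_reps.std(axis=0).mean() + harmful_reps.std(axis=0).mean())
    return between / max(within, 1e-12)

# Toy usage with Gaussian stand-ins for prompt embeddings before and after alignment.
rng = np.random.default_rng(0)
d = 64
pre_safe, pre_harm = rng.normal(0.0, 1.0, (200, d)), rng.normal(0.2, 1.0, (200, d))
post_safe, post_harm = rng.normal(0.0, 1.0, (200, d)), rng.normal(1.5, 1.0, (200, d))
print("pre-alignment separation :", separation_score(pre_safe, pre_harm))    # small
print("post-alignment separation:", separation_score(post_safe, post_harm))  # large
```

On this toy data the post-alignment score is substantially larger than the pre-alignment score, mirroring the latent-space separation effect the abstract reports after alignment.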