Arrow Research search

Author name cluster

Ashwinee Panda

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers

14

TMLR Journal 2025 Journal Article

Continual Pre-training of MoEs: How robust is your router?

  • Benjamin Thérien
  • Charles-Étienne Joseph
  • Zain Sarwar
  • Ashwinee Panda
  • Anirban Das
  • Shi-Xiong Zhang
  • Stephen Rawls
  • Sambit Sahu

Sparsely-activated Mixture of Experts (MoE) transformers are promising architectures for foundation models. Compared to dense transformers that require the same amount of floating-point operations (FLOPs) per forward pass, MoEs benefit from improved sample efficiency at training time and achieve much stronger performance. Many closed-source and open-source frontier language models have thus adopted an MoE architecture. Naturally, practitioners will want to extend the capabilities of these models with large amounts of newly collected data without completely re-training them. Prior work has shown that a simple combination of replay, learning rate re-warming, and re-decaying can enable the continual pre-training (CPT) of dense decoder-only transformers with minimal performance degradation compared to full re-training. In the case of decoder-only MoE transformers, however, it is unclear how the routing algorithm will impact continual pre-training performance: 1) *do the MoE transformer's routers exacerbate forgetting relative to a dense model?*; 2) *do the routers maintain a balanced load on previous distributions after CPT?*; 3) *are the same strategies applied to dense models sufficient to continually pre-train MoE LLMs?* In what follows, we conduct a large-scale study training a 500M parameter dense transformer and four 500M-active/2B-total parameter MoE transformers, following the Switch Transformer architecture and a granular DeepSeek-inspired architecture. Each model is trained for 600B tokens. Our results establish a surprising robustness to distribution shifts for MoEs using both Sinkhorn-Balanced and Z-and-Aux-loss-balanced routing algorithms, even in MoEs continually pre-trained without replay. Moreover, we show that MoE LLMs maintain their sample efficiency (relative to a FLOP-matched dense model) during CPT and that they can match the performance of a fully re-trained MoE at a fraction of the cost.

NeurIPS Conference 2025 Conference Paper

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

  • Ashwinee Panda
  • Vatsal Baherwani
  • Zain Sarwar
  • Benjamin Thérien
  • Sambit Sahu
  • Tom Goldstein
  • Supriyo Chakraborty

Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to training instability and suboptimal performance. We present a lightweight approximation method that gives the MoE router a dense gradient update while continuing to sparsely activate its parameters. Our method, which we refer to as Default MoE, substitutes missing expert activations with default outputs consisting of an exponential moving average of expert outputs previously seen over the course of training. This allows the router to receive signals from every expert for each token, leading to significant improvements in training performance. Our Default MoE outperforms standard TopK routing in a variety of settings without requiring significant computational overhead.

NeurIPS Conference 2025 Conference Paper

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

  • Kevin Hayes
  • Micah Goldblum
  • Vikash Sehwag
  • Gowthami Somepalli
  • Ashwinee Panda
  • Tom Goldstein

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3. 5-Medium, SD3. 5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

NeurIPS Conference 2025 Conference Paper

Gemstones: A Model Suite for Multi-Faceted Scaling Laws

  • Sean McLeish
  • John Kirchenbauer
  • David Miller
  • Siddharth Singh
  • Abhinav Bhatele
  • Micah Goldblum
  • Ashwinee Panda
  • Tom Goldstein

Scaling laws are typically fit using a family of models with a narrow range of frozen hyperparameter choices. In this work we study scaling laws using multiple architectural shapes and hyperparameter choices, highlighting their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: an open-source scaling law dataset, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters and diverse architectural shapes; including ablations over learning rate and cooldown. Our checkpoints enable more complex studies of scaling, such as analyzing the relationship between width and depth. By examining our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting.

ICLR Conference 2025 Conference Paper

Privacy Auditing of Large Language Models

  • Ashwinee Panda
  • Xinyu Tang 0003
  • Christopher A. Choquette-Choo
  • Milad Nasr
  • Prateek Mittal

Current techniques for privacy auditing of large language models (LLMs) have limited efficacy---they rely on basic approaches to generate canaries which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries surpass prior approaches. For example, on the Qwen2.5-0.5B model, our designed canaries achieve $49.6\%$ TPR at $1\%$ FPR, vastly surpassing the prior approach's $4.2\%$ TPR at $1\%$ FPR. Our method can be used to provide a privacy audit of $\varepsilon \approx 1$ for a model trained with theoretical $\varepsilon$ of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.

TMLR Journal 2025 Journal Article

Private Fine-tuning of Large Language Models with Zeroth-order Optimization

  • Xinyu Tang
  • Ashwinee Panda
  • Milad Nasr
  • Saeed Mahloujifar
  • Prateek Mittal

Differentially private stochastic gradient descent (DP-SGD) allows models to be trained in a privacy-preserving manner, but has proven difficult to scale to the era of foundation models. We introduce DP-ZO, a private fine-tuning method for large language models by privatizing zeroth order optimization methods. A key insight into the design of our method is that the direction of the gradient in the zeroth-order optimization we use is random and the only information from training data is the step size, i.e., a scalar. Therefore, we only need to privatize the scalar step size, which is memory-efficient. DP-ZO provides a strong privacy-utility trade-off across different tasks, and model sizes that are comparable to DP-SGD in $(\varepsilon,\delta)$-DP. Notably, DP-ZO possesses significant advantages over DP-SGD in memory efficiency, and obtains higher utility in $\varepsilon$-DP when using the Laplace mechanism.

ICLR Conference 2025 Conference Paper

Safety Alignment Should be Made More Than Just a Few Tokens Deep

  • Xiangyu Qi
  • Ashwinee Panda
  • Kaifeng Lyu
  • Xiao Ma 0010
  • Subhrajit Roy
  • Ahmad Beirami
  • Prateek Mittal
  • Peter Henderson 0002

The safety alignment of current Large Language Models (LLMs) is vulnerable. Simple attacks, or even benign fine-tuning, can jailbreak aligned models. We note that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We unifiedly refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and show how this issue universally contributes to multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. The key contribution of this work is that we demonstrate how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. We show that deepening the safety alignment beyond the first few tokens can meaningfully improve robustness against some common exploits. We also design a regularized fine-tuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.

ICML Conference 2024 Conference Paper

A New Linear Scaling Rule for Private Adaptive Hyperparameter Optimization

  • Ashwinee Panda
  • Xinyu Tang 0003
  • Saeed Mahloujifar
  • Vikash Sehwag
  • Prateek Mittal

An open problem in differentially private deep learning is hyperparameter optimization (HPO). DP-SGD introduces new hyperparameters and complicates existing ones, forcing researchers to painstakingly tune hyperparameters with hundreds of trials, which in turn makes it impossible to account for the privacy cost of HPO without destroying the utility. We propose an adaptive HPO method that uses cheap trials (in terms of privacy cost and runtime) to estimate optimal hyperparameters and scales them up. We obtain state-of-the-art performance on 22 benchmark tasks, across computer vision and natural language processing, across pretraining and finetuning, across architectures and a wide range of $\varepsilon \in [0. 01, 8. 0]$, all while accounting for the privacy cost of HPO.

ICLR Conference 2024 Conference Paper

Privacy-Preserving In-Context Learning for Large Language Models

  • Tong Wu
  • Ashwinee Panda
  • Jiachen T. Wang
  • Prateek Mittal

In-context learning (ICL) is an important capability of Large Language Models (LLMs), enabling these models to dynamically adapt based on specific, in-context exemplars, thereby improving accuracy and relevance. However, LLM's responses may leak the sensitive private information contained in in-context exemplars. To address this challenge, we propose Differentially Private In-context Learning (DP-ICL), a general paradigm for privatizing ICL tasks. The key idea for DP-ICL paradigm is generating differentially private responses through a noisy consensus among an ensemble of LLM's responses based on disjoint exemplar sets. Based on the general paradigm of DP-ICL, we instantiate several techniques showing how to privatize ICL for text classification and language generation. We experiment on four text classification benchmarks and two language generation tasks, and our empirical findings suggest that our DP-ICL achieves a strong utility-privacy tradeoff.

ICLR Conference 2024 Conference Paper

Teach LLMs to Phish: Stealing Private Information from Language Models

  • Ashwinee Panda
  • Christopher A. Choquette-Choo
  • Zhengming Zhang 0001
  • Yaoqing Yang 0002
  • Prateek Mittal

When large language models are trained on private data, it can be a \textit{significant} privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new \emph{practical} data extraction attack that we call ``neural phishing''. This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data with upwards of $10\%$ attack success rates, at times, as high as $50\%$. Our attack assumes only that an adversary can insert as few as $10$s of benign-appearing sentences into the training dataset using only vague priors on the structure of the user data.

AAAI Conference 2024 Conference Paper

Visual Adversarial Examples Jailbreak Aligned Large Language Models

  • Xiangyu Qi
  • Kaixuan Huang
  • Ashwinee Panda
  • Peter Henderson
  • Mengdi Wang
  • Prateek Mittal

Warning: this paper contains data, prompts, and model outputs that are offensive in nature. Recently, there has been a surge of interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) such as Flamingo and GPT-4. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. Intriguingly, we discover that a single visual adversarial example can universally jailbreak an aligned LLM, compelling it to heed a wide range of harmful instructions (that it otherwise would not) and generate harmful content that transcends the narrow scope of a `few-shot' derogatory corpus initially employed to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. Our findings also connect the long-studied adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend toward multimodality in frontier foundation models.

NeurIPS Conference 2023 Conference Paper

Differentially Private Image Classification by Learning Priors from Random Processes

  • Xinyu Tang
  • Ashwinee Panda
  • Vikash Sehwag
  • Prateek Mittal

In privacy-preserving machine learning, differentially private stochastic gradient descent (DP-SGD) performs worse than SGD due to per-sample gradient clipping and noise addition. A recent focus in private learning research is improving the performance of DP-SGD on private data by incorporating priors that are learned on real-world public data. In this work, we explore how we can improve the privacy-utility tradeoff of DP-SGD by learning priors from images generated by random processes and transferring these priors to private data. We propose DP-RandP, a three-phase approach. We attain new state-of-the-art accuracy when training from scratch on CIFAR10, CIFAR100, MedMNIST and ImageNet for a range of privacy budgets $\\varepsilon \\in [1, 8]$. In particular, we improve the previous best reported accuracy on CIFAR10 from $60. 6 \\%$ to $72. 3 \\%$ for $\\varepsilon=1$.

ICML Conference 2022 Conference Paper

Neurotoxin: Durable Backdoors in Federated Learning

  • Zhengming Zhang 0001
  • Ashwinee Panda
  • Linyue Song
  • Yaoqing Yang 0002
  • Michael W. Mahoney
  • Prateek Mittal
  • Kannan Ramchandran
  • Joseph E. Gonzalez

Federated learning (FL) systems have an inherent vulnerability to adversarial backdoor attacks during training due to their decentralized nature. The goal of the attacker is to implant backdoors in the learned model with poisoned updates such that at test time, the model’s outputs can be fixed to a given target for certain inputs (e. g. , if a user types “people from New York” into a mobile keyboard app that uses a backdoored next word prediction model, the model will autocomplete their sentence to “people in New York are rude”). Prior work has shown that backdoors can be inserted in FL, but these backdoors are not durable: they do not remain in the model after the attacker stops uploading poisoned updates because training continues, and in production FL systems an inserted backdoor may not survive until deployment. We propose Neurotoxin, a simple one-line backdoor attack that functions by attacking parameters that are changed less in magnitude during training. We conduct an exhaustive evaluation across ten natural language processing and computer vision tasks and find that we can double the durability of state of the art backdoors by adding a single line with Neurotoxin.

ICML Conference 2020 Conference Paper

FetchSGD: Communication-Efficient Federated Learning with Sketching

  • Daniel Rothchild
  • Ashwinee Panda
  • Enayat Ullah
  • Nikita Ivkin
  • Ion Stoica
  • Vladimir Braverman
  • Joseph E. Gonzalez
  • Raman Arora

Existing approaches to federated learning suffer from a communication bottleneck as well as convergence issues due to sparse client participation. In this paper we introduce a novel algorithm, called FetchSGD, to overcome these challenges. FetchSGD compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers. A key insight in the design of FetchSGD is that, because the Count Sketch is linear, momentum and error accumulation can both be carried out within the sketch. This allows the algorithm to move momentum and error accumulation from clients to the central aggregator, overcoming the challenges of sparse client participation while still achieving high compression rates and good convergence. We prove that FetchSGD has favorable convergence guarantees, and we demonstrate its empirical effectiveness by training two residual networks and a transformer model.