Arrow Research search

Author name cluster

Gowthami Somepalli

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

NeurIPS Conference 2025 Conference Paper

FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

  • Kevin Hayes
  • Micah Goldblum
  • Vikash Sehwag
  • Gowthami Somepalli
  • Ashwinee Panda
  • Tom Goldstein

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3. 5-Medium, SD3. 5-Large) and the corresponding annotations from VLMs (Molmo, InternVL3, Pixtral) annotated by an LLM (Llama3) to test whether VLMs correctly identify the failure mode in a generated image. By analyzing failure modes on a curated set of prompts, we reveal systematic errors in attribute fidelity and object representation. Our findings suggest that current metrics are insufficient to capture these nuanced errors, highlighting the importance of targeted benchmarks for advancing generative model reliability and interpretability.

NeurIPS Conference 2024 Conference Paper

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

  • Abhimanyu Hans
  • Yuxin Wen
  • Neel Jain
  • John Kirchenbauer
  • Hamid Kazemi
  • Prajwal Singhania
  • Siddharth Singh
  • Gowthami Somepalli

Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, a randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale LLaMA-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks. Code and checkpoints: https: //github. com/ahans30/goldfish-loss

NeurIPS Conference 2024 Conference Paper

CALVIN: Improved Contextual Video Captioning via Instruction Tuning

  • Gowthami Somepalli
  • Arkabandhu Chowdhury
  • Ronen Basri
  • Jonas Geiping
  • Tom Goldstein
  • David Jacobs

The recent emergence of powerful Vision-Language models (VLMs) has significantly improved image captioning. Some of these models are extended to caption videos as well. However, their capabilities to understand complex scenes are limited, and the descriptions they provide for scenes tend to be overly verbose and focused on the superficial appearance of objects. Scene descriptions, especially in movies, require a deeper contextual understanding, unlike general-purpose video captioning. To address this challenge, we propose a model, CALVIN, a specialized video LLM that leverages previous movie context to generate fully "contextual" scene descriptions. To achieve this, we train our model on a suite of tasks that integrate both image-based question-answering and video captioning within a unified framework, before applying instruction tuning to refine the model's ability to provide scene captions. Lastly, we observe that our model responds well to prompt engineering and few-shot in-context learning techniques, enabling the user to adapt it to any new movie with very little additional annotation.

ICLR Conference 2024 Conference Paper

NEFTune: Noisy Embeddings Improve Instruction Finetuning

  • Neel Jain
  • Ping-Yeh Chiang
  • Yuxin Wen
  • John Kirchenbauer
  • Hong-Min Chu
  • Gowthami Somepalli
  • Brian R. Bartoldson
  • Bhavya Kailkhura

We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves $29.79$\% on AlpacaEval, which rises to $64.69$\% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a $10$\% improvement, with ShareGPT an $8$\% improvement, and with OpenPlatypus an $8$\% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune. Particularly, we see these improvements on the conversational abilities of the instruction model and not on traditional tasks like those on the OpenLLM Leaderboard, where performance is the same.

NeurIPS Conference 2023 Conference Paper

A Performance-Driven Benchmark for Feature Selection in Tabular Deep Learning

  • Valeriia Cherepanova
  • Roman Levin
  • Gowthami Somepalli
  • Jonas Geiping
  • C. Bayan Bruss
  • Andrew G. Wilson
  • Tom Goldstein
  • Micah Goldblum

Academic tabular benchmarks often contain small sets of curated features. In contrast, data scientists typically collect as many features as possible into their datasets, and even engineer new features from existing ones. To prevent over-fitting in subsequent downstream modeling, practitioners commonly use automated feature selection methods that identify a reduced subset of informative features. Existing benchmarks for tabular feature selection consider classical downstream models, toy synthetic datasets, or do not evaluate feature selectors on the basis of downstream performance. We construct a challenging feature selection benchmark evaluated on downstream neural networks including transformers, using real datasets and multiple methods for generating extraneous features. We also propose an input-gradient-based analogue of LASSO for neural networks that outperforms classical feature selection methods on challenging problems such as selecting from corrupted or second-order features.

NeurIPS Conference 2023 Conference Paper

Battle of the Backbones: A Large-Scale Comparison of Pretrained Models across Computer Vision Tasks

  • Micah Goldblum
  • Hossein Souri
  • Renkun Ni
  • Manli Shu
  • Viraj Prabhu
  • Gowthami Somepalli
  • Prithvijit Chattopadhyay
  • Mark Ibrahim

Neural network based computer vision systems are typically built on a backbone, a pretrained or randomly initialized feature extractor. Several years ago, the default option was an ImageNet-trained convolutional neural network. However, the recent past has seen the emergence of countless backbones pretrained using various algorithms and datasets. While this abundance of choice has led to performance increases for a range of systems, it is difficult for practitioners to make informed decisions about which backbone to choose. Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more. Furthermore, BoB sheds light on promising directions for the research community to advance computer vision by illuminating strengths and weakness of existing approaches through a comprehensive analysis conducted on more than 1500 training runs. While vision transformers (ViTs) and self-supervised learning (SSL) are increasingly popular, we find that convolutional neural networks pretrained in a supervised fashion on large training sets still perform best on most tasks among the models we consider. Moreover, in apples-to-apples comparisons on the same architectures and similarly sized pretraining datasets, we find that SSL backbones are highly competitive, indicating that future works should perform SSL pretraining with advanced architectures and larger pretraining datasets. We release the raw results of our experiments along with code that allows researchers to put their own backbones through the gauntlet here: https: //github. com/hsouri/Battle-of-the-Backbones.

ICLR Conference 2023 Conference Paper

How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization

  • Jonas Geiping
  • Micah Goldblum
  • Gowthami Somepalli
  • Ravid Shwartz-Ziv
  • Tom Goldstein
  • Andrew Gordon Wilson

Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data. Moreover, we find that data augmentations which encourage invariances can be more valuable than invariance alone, especially on small and medium sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.

NeurIPS Conference 2023 Conference Paper

Understanding and Mitigating Copying in Diffusion Models

  • Gowthami Somepalli
  • Vasu Singla
  • Micah Goldblum
  • Jonas Geiping
  • Tom Goldstein

Images generated by diffusion models like Stable Diffusion are increasingly widespread. Recent works and even lawsuits have shown that these models are prone to replicating their training data, unbeknownst to the user. In this paper, we first analyze this memorization problem in text-to-image diffusion models. While it is widely believed that duplicated images in the training set are responsible for content replication at inference time, we observe that the text conditioning of the model plays a similarly important role. In fact, we see in our experiments that data replication often does not happen for unconditional models, while it is common in the text-conditional case. Motivated by our findings, we then propose several techniques for reducing data replication at both training and inference time by randomizing and augmenting image captions in the training set. Code is available at https: //github. com/somepago/DCR.

NeurIPS Conference 2021 Conference Paper

PatchGame: Learning to Signal Mid-level Patches in Referential Games

  • Kamal Gupta
  • Gowthami Somepalli
  • Anubhav Anubhav
  • Vinoj Yasanga Jayasundara Magalle Hewa
  • Matthias Zwicker
  • Abhinav Shrivastava

We study a referential game (a type of signaling game) where two agents communicate with each other via a discrete bottleneck to achieve a common goal. In our referential game, the goal of the speaker is to compose a message or a symbolic representation of "important" image patches, while the task for the listener is to match the speaker's message to a different view of the same image. We show that it is indeed possible for the two agents to develop a communication protocol without explicit or implicit supervision. We further investigate the developed protocol and show the applications in speeding up recent Vision Transformers by using only important patches, and as pre-training for downstream recognition tasks (e. g. , classification).

UAI Conference 2021 Conference Paper

Unsupervised anomaly detection with adversarial mirrored autoencoders

  • Gowthami Somepalli
  • Yexin Wu
  • Yogesh Balaji
  • Bhanukiran Vinzamuri
  • Soheil Feizi

Detecting out-of-distribution (OOD) samples is of paramount importance in all Machine Learning applications. Deep generative modeling has emerged as a dominant paradigm to model complex data distributions without labels. However, prior work has shown that generative models tend to assign higher likelihoods to OOD samples compared to the data distribution on which they were trained. First, we propose Adversarial Mirrored Autoencoder (AMA), a variant of Adversarial Autoencoder, which uses a mirrored Wasserstein loss in the discriminator to enforce better semantic-level reconstruction. We also propose a latent space regularization to learn a compact manifold for in-distribution samples. The use of AMA produces better feature representations that improve anomaly detection performance. Second, we put forward an alternative measure of anomaly score to replace the reconstruction-based metric which has been traditionally used in generative model-based anomaly detection methods. Our method outperforms the current state-of-the-art methods for anomaly detection on several OOD detection benchmarks.