Arrow Research search

Author name cluster

Walter Gerych

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers (12)

AAAI Conference 2026 Conference Paper

Neural Tangent Kernels Under Stochastic Data Augmentation

  • Joshua DeOliveira
  • Sajal Chakroborty
  • Walter Gerych
  • Elke Rundensteiner

The learning dynamics of modern neural networks remain an open problem in deep learning. The Neural Tangent Kernel (NTK) offers an elegant description of training dynamics in the infinite-width limit, yet its classical formulation assumes a static dataset. Modern training practice departs from this strong assumption through the use of on-the-fly data augmentations (e.g., additive noise). In this work, we conduct an NTK-driven analysis of how data transformations affect a neural net's evolution in function space. Our theoretical contributions characterize how repeated Gaussian perturbations drawn from NTK-derived covariances can steer neural-network optimization toward user-specified behavior. These theoretical insights are empirically validated by controlled experiments. Taken together, our results lay the foundation for a promising research direction that turns the NTK from a descriptive into a prescriptive tool, enabling control of training trajectories and inference-time generalization behavior through grounded interventions.
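As a point of reference for the objects this abstract discusses (a generic NTK illustration, not the paper's derivation), the sketch below computes the analytic infinite-width NTK of a one-hidden-layer ReLU network, a Monte Carlo estimate of the kernel averaged over additive Gaussian augmentations, and the closed-form mean training trajectory under gradient flow with a fixed kernel:

```python
import numpy as np

def relu_ntk(X1, X2):
    """Analytic infinite-width NTK of a one-hidden-layer ReLU network
    (standard NTK parameterization, as in Jacot et al. 2018)."""
    n1 = np.linalg.norm(X1, axis=1, keepdims=True)
    n2 = np.linalg.norm(X2, axis=1, keepdims=True)
    cos = np.clip((X1 @ X2.T) / (n1 * n2.T + 1e-12), -1.0, 1.0)
    theta = np.arccos(cos)
    # NNGP (arc-cosine) kernel of the hidden ReLU layer
    nngp = (n1 * n2.T) * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
    # derivative kernel of the ReLU nonlinearity
    dot = (np.pi - theta) / (2 * np.pi)
    return nngp + (X1 @ X2.T) * dot

def mean_prediction(K, y, eta, t):
    """Mean training-set prediction under gradient flow on the squared
    loss with fixed kernel K and zero init: f_t = (I - exp(-eta*K*t)) y."""
    w, V = np.linalg.eigh(K)
    decay = 1.0 - np.exp(-eta * np.maximum(w, 0.0) * t)
    return V @ (decay * (V.T @ y))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = rng.normal(size=20)

# Static-data NTK vs. a Monte Carlo estimate of the kernel averaged over
# additive Gaussian augmentations, E_eps[Theta(x + eps, x' + eps')].
K_static = relu_ntk(X, X)
K_aug = np.mean(
    [relu_ntk(X + 0.3 * rng.normal(size=X.shape),
              X + 0.3 * rng.normal(size=X.shape)) for _ in range(200)],
    axis=0)
```

Comparing the function-space trajectories `mean_prediction(K_static, ...)` and `mean_prediction(K_aug, ...)` shows how augmentation reshapes the kernel, and hence the training dynamics, before any model is trained.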

NeurIPS Conference 2025 Conference Paper

An Investigation of Memorization Risk in Healthcare Foundation Models

  • Sana Tonekaboni
  • Lena Stempfle
  • Adibvafa Fallahpour
  • Walter Gerych
  • Marzyeh Ghassemi

Foundation models trained on large-scale de-identified electronic health records (EHRs) hold promise for clinical applications. However, their capacity to memorize patient information raises important privacy concerns. In this work, we introduce a suite of black-box evaluation tests to assess privacy-related memorization risks in foundation models trained on structured EHR data. Our framework includes methods for probing memorization at both the embedding and generative levels, and aims to distinguish between model generalization and harmful memorization in clinically relevant settings. We contextualize memorization in terms of its potential to compromise patient privacy, particularly for vulnerable subgroups. We validate our approach on a publicly available EHR foundation model and release an open-source toolkit to facilitate reproducible and collaborative privacy assessments in healthcare AI.

ICLR Conference 2025 Conference Paper

Learning under Temporal Label Noise

  • Sujay Nagaraj
  • Walter Gerych
  • Sana Tonekaboni
  • Anna Goldenberg
  • Berk Ustun
  • Thomas Hartvigsen

Many time series classification tasks, where labels vary over time, are affected by label noise that also varies over time. Such noise can cause label quality to improve, worsen, or periodically change over time. We first propose and formalize temporal label noise, an unstudied problem for sequential classification of time series. In this setting, multiple labels are recorded over time while being corrupted by a time-dependent noise function. We demonstrate the importance of modeling the temporal nature of the label noise function and show that existing methods consistently underperform in this setting. We then propose methods to train noise-tolerant classifiers by estimating the temporal label noise function directly from data. We show that our methods lead to state-of-the-art performance under diverse types of temporal label noise on real-world datasets.
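To make the setting concrete (a standard forward loss correction extended with a time index, offered as an illustration rather than the paper's estimator), a time-dependent noise function can be written as a transition matrix Q_t per step, and the corrected loss scores the model's clean-label predictions after pushing them through Q_t:

```python
import numpy as np

def forward_corrected_nll(probs, noisy_labels, Q_t):
    """Time-dependent forward loss correction (illustrative sketch).
    probs:        (T, C) predicted clean-label distributions over time
    noisy_labels: (T,)   observed (possibly corrupted) labels
    Q_t:          (T, C, C) with Q_t[t, i, j] = P(observed=j | clean=i)
    """
    # Distribution over observed labels: P(obs=j) = sum_i p(i) Q_t[i, j]
    noisy_probs = np.einsum('tc,tcj->tj', probs, Q_t)
    ll = np.log(noisy_probs[np.arange(len(noisy_labels)), noisy_labels] + 1e-12)
    return -ll.mean()
```

With Q_t equal to the identity at every step this reduces to the ordinary negative log-likelihood; a time-varying Q_t lets the classifier remain consistent for the clean labels even as noise rates drift.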

AAAI Conference 2025 Conference Paper

Mapping from Meaning: Addressing the Miscalibration of Prompt-Sensitive Language Models

  • Kyle Cox
  • Jiawei Xu
  • Yikun Han
  • Rong Xu
  • Tianhao Li
  • Chi-Yang Hsu
  • Tianlong Chen
  • Walter Gerych

An interesting behavior in large language models (LLMs) is prompt sensitivity. When provided with different but semantically equivalent versions of the same prompt, models may produce very different distributions of answers. This suggests that the uncertainty reflected in a model's output distribution for one prompt may not reflect the model's uncertainty about the meaning of the prompt. We model prompt sensitivity as a type of generalization error, and show that sampling across the semantic concept space with paraphrasing perturbations improves uncertainty calibration without compromising accuracy. Additionally, we introduce a new metric for uncertainty decomposition in black-box LLMs that improves upon entropy-based decomposition by modeling semantic continuities in natural language generation. We show that this decomposition metric can be used to quantify how much LLM uncertainty is attributed to prompt sensitivity. Our work introduces a new way to improve uncertainty calibration in prompt-sensitive language models, and provides evidence that some LLMs fail to exhibit consistent general reasoning about the meanings of their inputs.
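The entropy-based decomposition that this abstract says it improves upon can be sketched as follows (a generic baseline, not the paper's new metric): sample answer distributions across K paraphrases, then split the entropy of the mixture into a within-paraphrase term and a between-paraphrase gap that quantifies prompt sensitivity:

```python
import numpy as np

def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

def prompt_sensitivity_decomposition(dists):
    """dists: (K, C) answer distributions from K paraphrases of one prompt.
    Returns (total, within, between):
      total   - entropy of the paraphrase-averaged distribution
      within  - mean per-paraphrase entropy
      between - their gap, i.e. the mutual information between the answer
                and the paraphrase (zero iff all paraphrases agree)."""
    mixed = dists.mean(axis=0)
    total = entropy(mixed)
    within = entropy(dists, axis=-1).mean()
    return total, within, total - within
```

A semantically consistent model yields a near-zero `between` term; a large gap indicates that uncertainty is attributable to prompt sensitivity rather than to the question itself.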

AAAI Conference 2025 Conference Paper

The Surprising Effectiveness of Infinite-Width NTKs for Characterizing and Improving Model Training

  • Joshua DeOliveira
  • Walter Gerych
  • Elke Rundensteiner

Developments in deep neural nets have trended towards increasingly larger overparameterized architectures, resulting in lengthy training sessions with ever more elusive training dynamics. Thus, ensuring these models learn accurate generalizable representations of data efficiently is challenging. Previous works have developed specialized techniques from data-pruning, architecture selection, pseudo-label generation, bias identification, or label refurbishment to improve downstream training. Problematically, most methods require prohibitively expensive iterative model training. In this paper, we demonstrate that we can exploit the recent neural tangent kernel (NTK) theory for understanding and improving model training behavior before ever training a model. First, we show a powerful signal derived from the NTK theory can be computed remarkably fast. We then leverage this signal for the design of a unified suite of surprisingly effective tools for the four important tasks of architecture selection, pseudo-label verification, bias identification, and label refurbishment, all requiring zero model training.

AAAI Conference 2024 Conference Paper

Amalgamating Multi-Task Models with Heterogeneous Architectures

  • Jidapa Thadajarassiri
  • Walter Gerych
  • Xiangnan Kong
  • Elke Rundensteiner

Multi-task learning (MTL) is essential for real-world applications that handle multiple tasks simultaneously, such as self-driving cars. MTL methods improve the performance of all tasks by utilizing information across tasks to learn a robust shared representation. However, acquiring sufficient labeled data tends to be extremely expensive, especially when having to support many tasks. Recently, Knowledge Amalgamation (KA) has emerged as an effective strategy for addressing the lack of labels by instead learning directly from pretrained models (teachers). KA learns one unified multi-task student that masters all tasks across all teachers. Existing KA methods for MTL are limited to teachers with identical architectures, and thus take layer-to-layer approaches. Unfortunately, in practice, teachers may have heterogeneous architectures; their layers may not be aligned and their dimensionalities or scales may be incompatible. Amalgamating multi-task teachers with heterogeneous architectures remains an open problem. For this, we design Versatile Common Feature Consolidator (VENUS), the first solution to this problem. VENUS fuses knowledge from the shared representations of each teacher into one unified generalized representation for all tasks. Specifically, we design the Feature Consolidator network that leverages an array of teacher-specific trainable adaptors. These adaptors enable the student to learn from multiple teachers, even if they have incompatible learned representations. We demonstrate that VENUS outperforms five alternative methods on numerous benchmark datasets across a broad spectrum of experiments.

NeurIPS Conference 2024 Conference Paper

BendVLM: Test-Time Debiasing of Vision-Language Embeddings

  • Walter Gerych
  • Haoran Zhang
  • Kimia Hamidieh
  • Eileen Pan
  • Maanas Sharma
  • Thomas Hartvigsen
  • Marzyeh Ghassemi

Vision-language (VL) embedding models have been shown to encode biases present in their training data, such as societal biases that prescribe negative characteristics to members of various racial and gender identities. Due to their wide-spread adoption for various tasks ranging from few-shot classification to text-guided image generation, debiasing VL models is crucial. Debiasing approaches that fine-tune the VL model often suffer from catastrophic forgetting. On the other hand, fine-tuning-free methods typically utilize a ``one-size-fits-all" approach that assumes that correlation with the spurious attribute can be explained using a single linear direction across all possible inputs. In this work, we propose a nonlinear, fine-tuning-free approach for VL embedding model debiasing that tailors the debiasing operation to each unique input. This allows for a more flexible debiasing approach. Additionally, we do not require knowledge of the set of inputs a priori to inference time, making our method more appropriate for online tasks such as retrieval and text guided image generation.
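For context, the "one-size-fits-all" linear baseline that this abstract contrasts itself against (not the paper's nonlinear method) can be written in a few lines: project a single spurious-attribute direction out of every embedding, then renormalize, since VL embeddings are typically compared by cosine similarity:

```python
import numpy as np

def linear_debias(emb, bias_dir):
    """Single-direction linear debiasing baseline.
    emb:      (N, D) embedding matrix
    bias_dir: (D,)   estimated spurious-attribute direction
    Removes the component of every embedding along bias_dir and
    renormalizes each row to the unit sphere."""
    v = bias_dir / np.linalg.norm(bias_dir)
    out = emb - np.outer(emb @ v, v)
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

The limitation the abstract points at is visible here: the same direction `v` is subtracted for every input, whereas an input-adaptive method would choose the debiasing operation per embedding.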

AAAI Conference 2024 Conference Paper

Who Knows the Answer? Finding the Best Model and Prompt for Each Query Using Confidence-Based Search

  • Walter Gerych
  • Yara Rizk
  • Vatche Isahagian
  • Vinod Muthusamy
  • Evelyn Duesterwald
  • Praveen Venkateswaran

There are increasingly many large language models (LLMs) available to the public. While these LLMs have exhibited impressive abilities on a variety of tasks, any individual LLM may do well on some tasks and poorly on others. Additionally, the performance of these models is heavily dependent on the choice of prompt template used. For instance, they exhibit sensitivity to the few-shot examples chosen and brittleness to the wording of instructions. Moreover, a prompt template that makes a model perform well for one input may not be the optimal template for another input. This necessitates an approach for adaptively selecting LLM and prompt template pairs for each input. Recent work has shown that the accuracy of an LLM's responses is correlated with the LLM's confidence in those responses. Thus, a natural choice for selecting which model and prompt template to use is to select the pair that is most confident in its response. However, existing confidence metrics are expensive to calculate, necessitating multiple calls to each LLM and prompt pair. We thus propose an approach to predict the confidence of each pair using an auxiliary regression model that is inexpensive to run. Using this auxiliary model, we select the LLM and prompt template with the highest predicted confidence for a given input. Results on a range of benchmark datasets show that our confidence-based instance-level prompt search method consistently improves the performance of LLMs.

NeurIPS Conference 2023 Conference Paper

Debiasing Pretrained Generative Models by Uniformly Sampling Semantic Attributes

  • Walter Gerych
  • Kevin Hickey
  • Luke Buquicchio
  • Kavin Chandrasekaran
  • Abdulaziz Alajaji
  • Elke A. Rundensteiner
  • Emmanuel Agu

Generative models are being increasingly used in science and industry applications. Unfortunately, they often perpetuate the biases present in their training sets, such as societal biases causing certain groups to be underrepresented in the data. For instance, image generators may overwhelmingly produce images of white people due to few non-white samples in their training data. It is imperative to debias generative models so they synthesize an equal number of instances for each group, while avoiding the prohibitive expense of retraining the model. We thus propose a distribution mapping module that produces samples from a fair noise distribution, such that the pretrained generative model produces semantically uniform outputs, i.e., an equal number of instances for each group, when conditioned on these samples. This does not involve retraining the generator, nor does it require any real training data. Experiments on debiasing generators trained on popular real-world datasets show that our method outperforms existing approaches.

AAAI Conference 2023 Conference Paper

Knowledge Amalgamation for Multi-Label Classification via Label Dependency Transfer

  • Jidapa Thadajarassiri
  • Thomas Hartvigsen
  • Walter Gerych
  • Xiangnan Kong
  • Elke Rundensteiner

Multi-label classification (MLC), which assigns multiple labels to each instance, is crucial to domains from computer vision to text mining. Conventional methods for MLC require huge amounts of labeled data to capture complex dependencies between labels. However, such labeled datasets are expensive, or even impossible, to acquire. Worse yet, these pre-trained MLC models can only be used for the particular label set covered in the training data. Despite this severe limitation, few methods exist for expanding the set of labels predicted by pre-trained models. Instead, practitioners typically must acquire vast amounts of new labeled data and retrain a new model from scratch. Here, we propose combining the knowledge from multiple pre-trained models (teachers) to train a new student model that covers the union of the labels predicted by this set of teachers. This student supports a broader label set than any one of its teachers without using labeled data. We call this new problem knowledge amalgamation for multi-label classification. Our new method, Adaptive KNowledge Transfer (ANT), trains a student by learning from each teacher's partial knowledge of label dependencies to infer the global dependencies between all labels across the teachers. We show that ANT succeeds in unifying label dependencies among teachers, outperforming five state-of-the-art methods on eight real-world datasets.

AAAI Conference 2022 Conference Paper

Recovering the Propensity Score from Biased Positive Unlabeled Data

  • Walter Gerych
  • Thomas Hartvigsen
  • Luke Buquicchio
  • Emmanuel Agu
  • Elke Rundensteiner

Positive-Unlabeled (PU) learning methods train a classifier to distinguish between the positive and negative classes given only positive and unlabeled data. While traditional PU methods require the labeled positive samples to be an unbiased sample of the positive distribution, in practice the labeled sample is often a biased draw from the true distribution. Prior work shows that if we know the likelihood that each positive instance will be selected for labeling, referred to as the propensity score, then the biased sample can be used for PU learning. Unfortunately, no prior work has proposed an inference strategy under which the propensity score is identifiable. In this work, we propose two sets of assumptions under which the propensity score can be uniquely determined: one in which no assumption is made on the functional form of the propensity score (requiring assumptions on the data distribution), and a second that loosens the data assumptions while assuming a functional form for the propensity score. We then propose inference strategies for each case. Our empirical study shows that our approach significantly outperforms the state-of-the-art propensity estimation methods on a rich variety of benchmark datasets.
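The role the propensity score plays can be stated in one line (the standard PU identity in the Elkan-and-Noto tradition; estimating e(x) itself is the hard part the paper addresses): if s indicates "observed as a labeled positive" and e(x) = P(s=1 | y=1, x) is the propensity score, then the clean positive posterior is recovered by dividing out e(x).

```python
import numpy as np

def positive_posterior(p_labeled, propensity):
    """Recover P(y=1 | x) from the observable labeling probability:
        P(y=1 | x) = P(s=1 | x) / e(x)
    p_labeled:  P(s=1 | x), learnable from positive-vs-unlabeled data
    propensity: e(x) = P(s=1 | y=1, x), the propensity score
    The quotient is clipped to [0, 1] to absorb estimation error."""
    return np.clip(p_labeled / np.clip(propensity, 1e-6, 1.0), 0.0, 1.0)
```

This makes the paper's question precise: p_labeled is directly estimable, so everything hinges on whether e(x) is identifiable from the biased sample.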

NeurIPS Conference 2021 Conference Paper

Recurrent Bayesian Classifier Chains for Exact Multi-Label Classification

  • Walter Gerych
  • Tom Hartvigsen
  • Luke Buquicchio
  • Emmanuel Agu
  • Elke A. Rundensteiner

Exact multi-label classification is the task of assigning each datapoint a set of class labels such that the assigned set exactly matches the ground truth. Optimizing for exact multi-label classification is important in domains where missing a single label can be especially costly, such as in object detection for autonomous vehicles or symptom classification for disease diagnosis. Recurrent Classifier Chains (RCCs), a recurrent neural network extension of ensemble-based classifier chains, are the state-of-the-art exact multi-label classification method for maximizing subset accuracy. However, RCCs iteratively predict classes with an unprincipled ordering, and therefore indiscriminately condition class probabilities. These disadvantages make RCCs prone to predicting inaccurate label sets. In this work, we propose Recurrent Bayesian Classifier Chains (RBCCs), which learn a Bayesian network of class dependencies and leverage this network in order to condition the prediction of child nodes only on their parents. By conditioning predictions in this way, we perform principled and non-noisy class prediction. We demonstrate the effectiveness of our RBCC method on a variety of real-world multi-label datasets, where we routinely outperform state-of-the-art methods for exact multi-label classification.
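The parent-only conditioning idea can be illustrated with a toy label DAG (the graph, probabilities, and greedy decoding below are all hypothetical; the paper's model is a recurrent network, not a lookup table): each label's probability depends only on the hard predictions of its parents, not on every previously predicted label as in a full chain.

```python
# Toy sketch of parent-conditioned multi-label decoding on a label DAG
# 0 -> 1, 0 -> 2. Hypothetical conditional probability tables map each
# combination of parent predictions to P(label = 1).
parents = {0: [], 1: [0], 2: [0]}
cpt = {
    0: {(): 0.7},
    1: {(0,): 0.2, (1,): 0.9},
    2: {(0,): 0.1, (1,): 0.8},
}

def predict_labelset(threshold=0.5):
    """Greedy decoding in topological order: fix each parent's hard
    prediction, then condition every child only on its parents."""
    pred = {}
    for node in sorted(parents):  # 0, 1, 2 is already a topological order
        pa = tuple(pred[p] for p in parents[node])
        pred[node] = int(cpt[node][pa] >= threshold)
    return [pred[i] for i in sorted(pred)]
```

Contrast with a chain: here label 2 never sees label 1's prediction, so an error on a non-parent cannot propagate into its conditioning set.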