Arrow Research search

Author name cluster

Jack Gallifant

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

ICLR Conference 2025 Conference Paper

ACES: Automatic Cohort Extraction System for Event-Stream Datasets

  • Justin Xu
  • Jack Gallifant
  • Alistair E. W. Johnson
  • Matthew B. A. McDermott

Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task or cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. We address a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of tasks and cohorts for ML in healthcare and also enable their reproduction, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides: (1) a highly intuitive and expressive domain-specific configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion or exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or Event Stream GPT (ESGPT) formats, or to *any* dataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks in representation learning, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies using this modality. ACES is available at: https://github.com/justin13601/aces.

NeurIPS Conference 2025 Conference Paper

KScope: A Framework for Characterizing the Knowledge Status of Language Models

  • Yuxin Xiao
  • Shan Chen
  • Jack Gallifant
  • Danielle Bitterman
  • Tom Hartvigsen
  • Marzyeh Ghassemi

Characterizing a large language model's (LLM's) knowledge of a given question is challenging. As a result, prior work has primarily examined LLM behavior under knowledge conflicts, where the model's internal parametric memory contradicts information in the external context. However, this does not fully reflect how well the model knows the answer to the question. In this paper, we first introduce a taxonomy of five knowledge statuses based on the consistency and correctness of LLM knowledge modes. We then propose KScope, a hierarchical framework of statistical tests that progressively refines hypotheses about knowledge modes and characterizes LLM knowledge into one of these five statuses. We apply KScope to nine LLMs across four datasets and systematically establish: (1) Supporting context narrows knowledge gaps across models. (2) Context features related to difficulty, relevance, and familiarity drive successful knowledge updates. (3) LLMs exhibit similar feature preferences when partially correct or conflicted, but diverge sharply when consistently wrong. (4) Context summarization constrained by our feature analysis, together with enhanced credibility, further improves update effectiveness and generalizes across LLMs.

AAAI Conference 2025 Conference Paper

Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs

  • David Restrepo
  • Chenwei Wu
  • Zhengxu Tang
  • Zitao Shuai
  • Thao Nguyen Minh Phan
  • Jun-En Ding
  • Cong-Tinh Dao
  • Jack Gallifant

Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
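As a rough illustration of the cross-lingual comparison described in this abstract, the sketch below computes per-language accuracy on parallel questions and reports the spread between the best and worst language. The gap definition (maximum minus minimum accuracy), the function names, the language codes, and the toy results are all assumptions made for illustration; the abstract does not specify how the paper measures its bias gap.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_language(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of correctly answered questions per language.

    `results` holds (language_code, answered_correctly) pairs, one per
    question asked in that language.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for lang, is_correct in results:
        total[lang] += 1
        correct[lang] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}


def language_gap(accuracy: Dict[str, float]) -> float:
    """Hypothetical gap measure: best-language minus worst-language accuracy."""
    return max(accuracy.values()) - min(accuracy.values())


# Toy, made-up outcomes for the same four questions asked in two languages.
# The language codes are placeholders, not the benchmark's actual languages.
toy_results = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("pt", True), ("pt", False), ("pt", False), ("pt", True),
]

per_language = accuracy_by_language(toy_results)
gap = language_gap(per_language)
print(per_language)  # {'en': 0.75, 'pt': 0.5}
print(gap)           # 0.25
```

Because the questions are parallel across languages, a gap computed this way reflects differences in model behavior rather than differences in question difficulty.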

TIST Journal 2025 Journal Article

Race Against the Machine Learning Courses

  • Riddhi Deshpande
  • Donald Mlombwa
  • Leo Anthony Celi
  • Jack Gallifant
  • Helen D’couto

Despite the rapid integration of AI in healthcare, a critical gap exists in current machine learning courses: the lack of education on identifying and mitigating bias in datasets. This oversight risks perpetuating existing health disparities through biased AI models. Analyzing 11 prominent online courses, we found only 5 addressed dataset bias, often dedicating minimal time compared to technical aspects. This paper urges course developers to prioritize education on data context, equipping learners with the tools to critically evaluate the origin, collection methods, and potential biases inherent in the data. This approach fosters the creation of fair algorithms and the incorporation of diverse data sources, ultimately mitigating the harmful effects of bias in healthcare AI. While this analysis focused on publicly available courses, it underscores the urgency of addressing bias in all healthcare machine learning education. Early intervention in algorithm development is crucial to prevent the amplification of dataset and model bias, ensuring responsible and equitable AI implementation in healthcare.

NeurIPS Conference 2024 Conference Paper

A Closer Look at AUROC and AUPRC under Class Imbalance

  • Matthew B. McDermott
  • Haoran Zhang
  • Lasse H. Hansen
  • Giovanni Angelotti
  • Jack Gallifant

In machine learning (ML), a widespread claim is that the area under the precision-recall curve (AUPRC) is a superior metric for model comparison to the area under the receiver operating characteristic (AUROC) for tasks with class imbalance. This paper refutes this notion on two fronts. First, we theoretically characterize the behavior of AUROC and AUPRC in the presence of model mistakes, establishing clearly that AUPRC is not generally superior in cases of class imbalance. We further show that AUPRC can be a harmful metric as it can unduly favor model improvements in subpopulations with more frequent positive labels, heightening algorithmic disparities. Next, we empirically support our theory using experiments on both semi-synthetic and real-world fairness datasets. Prompted by these insights, we conduct a review of over 1.5 million scientific papers to understand the origin of this invalid claim, finding that it is often made without citation, misattributed to papers that do not argue this point, and aggressively over-generalized from source arguments. Our findings represent a dual contribution: a significant technical advancement in understanding the relationship between AUROC and AUPRC and a stark warning about unchecked assumptions in the ML community.
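For readers unfamiliar with the two metrics, the following minimal sketch computes both from scratch and shows how each responds when the share of negatives grows. It is not the paper's analysis: the toy scores, the function names, and the use of average precision as the AUPRC estimate are assumptions chosen for illustration.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@rank over the ranks where positives occur.

    Assumes no positive shares a score with a negative (no tie handling).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    true_positives = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    if true_positives == 0:
        raise ValueError("need at least one positive")
    return precision_sum / true_positives


# Toy data: 2 positives among 10 samples.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

# Same score distribution, but every negative appears three times,
# which lowers the positive rate from 2/10 to 2/26.
labels_3x: List[int] = []
scores_3x: List[float] = []
for y, s in zip(labels, scores):
    copies = 3 if y == 0 else 1
    labels_3x.extend([y] * copies)
    scores_3x.extend([s] * copies)

print(auroc(labels, scores), average_precision(labels, scores))
# 0.875 0.75
print(auroc(labels_3x, scores_3x), average_precision(labels_3x, scores_3x))
# 0.875 0.625
```

In this toy case AUROC is unchanged when negatives are replicated, while average precision falls with the positive rate, so the two metrics summarize the same ranking on different scales.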

NeurIPS Conference 2024 Conference Paper

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

  • Shan Chen
  • Jack Gallifant
  • Mingye Gao
  • Pedro Moreira
  • Nikolaj Munch
  • Ajay Muthukkumar
  • Arvind Rajan
  • Jaya Kolluri

Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like The Pile influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.