
Author name cluster

Byron Wallace

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

NeurIPS Conference 2025 · Conference Paper

Do Automatic Factuality Metrics Measure Factuality? A Critical Evaluation

  • Sanjana Ramprasad
  • Byron Wallace

Modern LLMs can now produce highly readable abstractive summaries, to the point that traditional automated metrics for evaluating summary quality, such as ROUGE, have saturated. However, LLMs still sometimes introduce inaccuracies into summaries, i.e., information inconsistent with or unsupported by the corresponding source. Measuring the occurrence of these often subtle factual inconsistencies automatically has proved challenging. This in turn has motivated development of metrics intended to measure the factual consistency of generated summaries against sources. But are these approaches measuring what they purport to? Or are they mostly exploiting artifacts? In this work, we stress test a range of automatic factuality metrics—including specialized model-based approaches and LLM-based prompting methods—to probe what they actually capture. Using a shallow classifier to separate “easy” examples for factual evaluation—where surface features suffice—from “hard” cases requiring deeper reasoning, we find that all metrics show substantial performance drops on the latter. Furthermore, some metrics are more sensitive to benign, fact-preserving edits than to factual corrections. Building on this observation, we demonstrate that most automatic factuality metrics can be gamed—that is, their scores can be artificially inflated by appending innocuous, content-free sentences to summaries. Among the metrics tested, the LLM prompt-based ChatGPT-DA approach is the most robust and reliable; however, it exhibits a notable caveat: it likely relies more on parametric knowledge than on the provided source when making judgments. Taken together, our findings call into question the reliability of current factuality metrics and prompt a broader reflection on what these metrics are truly measuring. We conclude with concrete recommendations for improving both benchmark design and metric robustness, particularly in light of their vulnerability to superficial manipulations.
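
The "gaming" probe described in this abstract is easy to picture in code. Below is a minimal, hypothetical sketch (not the paper's implementation): it appends an innocuous, content-free sentence to a summary and checks whether a source-conditioned factuality metric's score inflates. The metric is passed in as a callable, and the filler sentences are made-up examples.

```python
# Hypothetical sketch of the padding probe: a positive "gaming gap" suggests
# the metric rewards superficial padding rather than factual consistency.
from typing import Callable

CONTENT_FREE_SUFFIXES = [
    "This summary reports information from the article.",
    "The details above are described in the source.",
]

def gaming_gap(metric: Callable[[str, str], float], source: str, summary: str) -> float:
    """Largest score increase obtained by appending a vacuous sentence."""
    base = metric(source, summary)
    best = max(metric(source, summary + " " + s) for s in CONTENT_FREE_SUFFIXES)
    return best - base

# Usage, with any metric of signature metric(source, summary) -> float:
# gap = gaming_gap(my_factuality_metric, article_text, generated_summary)
```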

NeurIPS Conference 2025 · Conference Paper

Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models

  • Chantal Shaib
  • Vinith Suriyakumar
  • Byron Wallace
  • Marzyeh Ghassemi

For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information. Recent work shows that “syntactic templates”, frequent sequences of Part-of-Speech (PoS) tags, are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick) and closed (GPT-4o) models. Finally, we present a case study on the implications for LLM security, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
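
A syntactic template, in the sense used above, is just a frequent sequence of PoS tags. The sketch below illustrates one straightforward way to surface such templates from a set of instructions; spaCy, the n-gram length, and the toy inputs are illustrative choices, not the authors' pipeline.

```python
# Minimal sketch: count the most frequent length-n PoS tag sequences.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def pos_templates(texts, n=4, top_k=20):
    """Return the top_k most frequent n-grams of coarse PoS tags across texts."""
    counts = Counter()
    for doc in nlp.pipe(texts):
        tags = [tok.pos_ for tok in doc]
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i : i + n])] += 1
    return counts.most_common(top_k)

# Example: templates in a handful of task instructions
print(pos_templates([
    "Write a short summary of the article.",
    "Translate the following sentence into French.",
]))
```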

AAAI Conference 2018 · Conference Paper

An Interpretable Joint Graphical Model for Fact-Checking From Crowds

  • An Nguyen
  • Aditya Kharosekar
  • Matthew Lease
  • Byron Wallace

Assessing the veracity of claims made on the Internet is an important, challenging, and timely problem. While automated fact-checking models have potential to help people better assess what they read, we argue such models must be explainable, accurate, and fast to be useful in practice; while prediction accuracy is clearly important, model transparency is critical in order for users to trust the system and integrate their own knowledge with model predictions. To achieve this, we propose a novel probabilistic graphical model (PGM) which combines machine learning with crowd annotations. Nodes in our model correspond to claim veracity, article stance regarding claims, reputation of news sources, and annotator reliabilities. We introduce a fast variational method for parameter estimation. Evaluation across two real-world datasets and three scenarios shows that: (1) joint modeling of sources, claims and crowd annotators in a PGM improves the predictive performance and interpretability for predicting claim veracity; and (2) our variational inference method achieves scalably fast parameter estimation, with only modest degradation in performance compared to Gibbs sampling. Regarding model transparency, we designed and deployed a prototype fact-checker Web tool, including a visual interface for explaining model predictions. Results of a small user study indicate that model explanations improve user satisfaction and trust in model predictions. We share our web demo, model source code, and the 13K crowd labels we collected.
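
As a schematic illustration only (not the paper's model or its variational inference code), the sketch below names the variables the abstract says the graphical model ties together: claim veracity, article stance, source reputation, and annotator reliability.

```python
# Illustrative data structures for the latent and observed quantities involved.
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Source:
    name: str
    reputation: float = 0.5          # latent: tendency of the source to report accurately

@dataclass
class Annotator:
    annotator_id: str
    reliability: float = 0.5         # latent: probability of labeling stance correctly

@dataclass
class Article:
    source: Source
    stance: Optional[str] = None     # latent: "for" / "against" / "observing" a claim
    crowd_labels: Dict[str, str] = field(default_factory=dict)  # annotator_id -> observed label

@dataclass
class Claim:
    text: str
    articles: List[Article] = field(default_factory=list)
    veracity: Optional[bool] = None  # latent: the quantity we ultimately want to infer
```

In the paper these quantities are coupled in a joint PGM and the latent values are estimated with a fast variational method; the sketch only shows which quantities interact.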

AAAI Conference 2017 · Conference Paper

Active Discriminative Text Representation Learning

  • Ye Zhang
  • Matthew Lease
  • Byron Wallace

We propose a new active learning (AL) method for text classification with convolutional neural networks (CNNs). In AL, one selects the instances to be manually labeled with the aim of maximizing model performance with minimal effort. Neural models capitalize on word embeddings as representations (features), tuning these to the task at hand. We argue that AL strategies for multi-layered neural models should focus on selecting instances that most affect the embedding space (i.e., induce discriminative word representations). This is in contrast to traditional AL approaches (e.g., entropy-based uncertainty sampling), which specify higher level objectives. We propose a simple approach for sentence classification that selects instances containing words whose embeddings are likely to be updated with the greatest magnitude, thereby rapidly learning discriminative, task-specific embeddings. We extend this approach to document classification by jointly considering: (1) the expected changes to the constituent word representations; and (2) the model’s current overall uncertainty regarding the instance. The relative emphasis placed on these criteria is governed by a stochastic process that favors selecting instances likely to improve representations at the outset of learning, and then shifts toward general uncertainty sampling as AL progresses. Empirical results show that our method outperforms baseline AL approaches on both sentence and document classification tasks. We also show that, as expected, the method quickly learns discriminative word embeddings. To the best of our knowledge, this is the first work on AL addressing neural models for text classification.
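
In the spirit of the acquisition rule described above, a simplified scoring function can estimate the expected magnitude of the update to the embedding layer by taking an expectation of the loss-gradient norm under the model's current predictive distribution (an "expected gradient length" style approximation). The model object, its `embedding` attribute, and the single-instance batching below are assumptions for illustration, not the authors' code.

```python
# PyTorch-style sketch: rank unlabeled instances by the expected L2 norm of the
# gradient of the loss with respect to the embedding weights.
import torch
import torch.nn.functional as F

def expected_embedding_update(model, input_ids: torch.Tensor) -> float:
    """Expected gradient norm on the embedding weights, under p(y | x)."""
    with torch.no_grad():
        probs = F.softmax(model(input_ids), dim=-1).squeeze(0)  # (num_classes,)
    score = 0.0
    for label, p in enumerate(probs):
        model.zero_grad()
        loss = F.cross_entropy(model(input_ids), torch.tensor([label]))
        loss.backward()
        grad = model.embedding.weight.grad  # assumes the model exposes `embedding`
        score += p.item() * grad.norm().item()
    return score

# Instances with the largest scores are sent for labeling first; as learning
# progresses, one can interpolate toward plain uncertainty sampling.
```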

AAAI Conference 2015 · Conference Paper

Graph-Sparse LDA: A Topic Model with Structured Sparsity

  • Finale Doshi-Velez
  • Byron Wallace
  • Ryan Adams

Topic modeling is a powerful tool for uncovering latent structure in many domains, including medicine, finance, and vision. The goals for the model vary depending on the application: sometimes the discovered topics are used for prediction or another downstream task. In other cases, the content of the topic may be of intrinsic scientific interest. Unfortunately, even when one uses modern sparse techniques, discovered topics are often difficult to interpret due to the high dimensionality of the underlying space. To improve topic interpretability, we introduce Graph-Sparse LDA, a hierarchical topic model that uses knowledge of relationships between words (e.g., as encoded by an ontology). In our model, topics are summarized by a few latent concept-words from the underlying graph that explain the observed words. Graph-Sparse LDA recovers sparse, interpretable summaries on two real-world biomedical datasets while matching state-of-the-art prediction performance.
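
The core idea can be illustrated with a toy, made-up ontology (this is not the Graph-Sparse LDA model itself): observed topic words are explained by a few ancestor "concept-words" in a word graph, so a topic can be summarized by those concepts.

```python
# Toy illustration: summarize a set of observed words by ontology ancestors
# that cover them. The tiny graph below is invented for demonstration.
import networkx as nx

ontology = nx.DiGraph()  # edges point from a concept to a more specific word
ontology.add_edges_from([
    ("cardiovascular disease", "hypertension"),
    ("cardiovascular disease", "myocardial infarction"),
    ("hypertension", "high blood pressure"),
])

def concept_summary(observed_words, graph, max_concepts=2):
    """Pick ancestor concepts that cover the most observed words."""
    coverage = {
        node: len(nx.descendants(graph, node) & set(observed_words))
        for node in graph.nodes
    }
    return sorted(coverage, key=coverage.get, reverse=True)[:max_concepts]

print(concept_summary({"high blood pressure", "myocardial infarction"}, ontology))
# -> ['cardiovascular disease', 'hypertension']
```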

AAAI Conference 2014 · Conference Paper

Discovering Better AAAI Keywords via Clustering with Community-Sourced Constraints

  • Kelly Moran
  • Byron Wallace
  • Carla Brodley

Selecting good conference keywords is important because they often determine the composition of review committees and hence which papers are reviewed by whom. But presently conference keywords are generated in an ad-hoc manner by a small set of conference organizers. This approach is plainly not ideal. There is no guarantee, for example, that the generated keyword set aligns with what the community is actually working on and submitting to the conference in a given year. This is especially true in fast-moving fields such as AI. The problem is exacerbated by the tendency of organizers to draw heavily on preceding years’ keyword lists when generating a new set. Rather than a select few ordaining a keyword set that represents AI at large, it would be preferable to generate these keywords more directly from the data, with input from research community members. To this end, we solicited feedback from seven AAAI PC members regarding a previously existing keyword set and used these ‘community-sourced constraints’ to inform a clustering over the abstracts of all submissions to AAAI 2013. We show that the keywords discovered via this data-driven, human-in-the-loop method are at least as preferred (by AAAI PC members) as 2013’s manually generated set, and that they include categories previously overlooked by organizers. Many of the discovered terms were used for this year’s conference.
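
A generic constrained-clustering sketch gives a feel for the kind of pipeline described above: abstracts are vectorized and each is assigned to the nearest cluster center that does not violate a reviewer-provided cannot-link constraint. The TF-IDF features, the toy abstracts, and the single assignment pass are illustrative simplifications, not the procedure used in the paper.

```python
# COP-KMeans-flavoured assignment step over toy abstracts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def violates(i, c, labels, cannot_link):
    """True if putting point i in cluster c breaks a cannot-link constraint."""
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and labels[other] == c:
            return True
    return False

def constrained_assign(X, centers, cannot_link):
    """Assign each point to its nearest center that violates no constraint."""
    labels = -np.ones(len(X), dtype=int)
    for i, x in enumerate(X):
        for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
            if not violates(i, c, labels, cannot_link):
                labels[i] = c
                break
    return labels

abstracts = ["active learning for text", "topic models for medicine",
             "reinforcement learning agents", "bayesian topic modelling"]
X = TfidfVectorizer().fit_transform(abstracts).toarray()
centers = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).cluster_centers_
print(constrained_assign(X, centers, cannot_link=[(0, 1)]))  # keep 0 and 1 apart
```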

AAAI Conference 2014 · Conference Paper

Identifying Differences in Physician Communication Styles with a Log-Linear Transition Component Model

  • Byron Wallace
  • Issa Dahabreh
  • Thomas Trikalinos
  • Michael Barton Laws
  • Ira Wilson
  • Eugene Charniak

We consider the task of grouping doctors with respect to communication patterns exhibited in outpatient visits. We propose a novel approach toward this end in which we model speech act transitions in conversations via a log-linear model incorporating physician-specific components. We train this model over transcripts of outpatient visits annotated with speech act codes and then cluster physicians in (a transformation of) this parameter space. We find significant correlations between the induced groupings and patient survey response data comprising ratings of physician communication. Furthermore, the novel sequential component model we leverage to induce this clustering allows us to explore differences across these groups. This work demonstrates how statistical AI might be used to better understand (and ultimately improve) physician communication.
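
A much-simplified sketch of the underlying idea: represent each physician by speech-act transition statistics and cluster in that space. The paper fits a log-linear model with physician-specific components rather than the raw transition counts shown here, and the speech-act labels and toy visits below are invented for illustration.

```python
# Build per-physician transition features over annotated visits, then cluster.
import numpy as np
from sklearn.cluster import KMeans

SPEECH_ACTS = ["question", "information", "empathy", "instruction"]
IDX = {a: i for i, a in enumerate(SPEECH_ACTS)}

def transition_features(visits):
    """Row-normalized speech-act transition matrix, flattened into a vector."""
    counts = np.ones((len(SPEECH_ACTS), len(SPEECH_ACTS)))  # add-one smoothing
    for visit in visits:                                     # visit = sequence of speech acts
        for prev, nxt in zip(visit, visit[1:]):
            counts[IDX[prev], IDX[nxt]] += 1
    return (counts / counts.sum(axis=1, keepdims=True)).ravel()

physician_visits = {  # toy input: {physician_id: [speech-act sequences]}
    "dr_a": [["question", "information", "empathy"]],
    "dr_b": [["instruction", "instruction", "question"]],
}
features = np.vstack([transition_features(v) for v in physician_visits.values()])
groups = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
print(dict(zip(physician_visits, groups)))
```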