Arrow Research search

Author name cluster

Serena Booth

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
1 author row

Possible papers

12

AAAI Conference 2025 Conference Paper

AI Governance and Lessons Learned as an AI Policy Advisor in the United States Senate

  • Serena Booth

This talk examines the intersection of artificial intelligence and policymaking, focusing on legislative and regulatory frameworks in the United States. It explores the role of key federal agencies, existing technology-agnostic laws affecting AI, and gaps in regulatory oversight that require legislative intervention. Consumer protection laws are analyzed for their relevance to AI governance, particularly in financial services. The discussion also highlights the implications for AI research, emphasizing the importance of interdisciplinary collaboration between computer scientists and policymakers to ensure responsible AI development that aligns with democratic values and societal interests.

RLJ Journal 2025 Journal Article

Goals vs. Rewards: A Preliminary Comparative Study of Objective Specification Mechanisms

  • Septia Rani
  • Serena Booth
  • Sarath Sreedharan

This paper studies two popular objective specification mechanisms for sequential decision-making: goals and rewards. We investigate how easy it is for people without AI expertise to use these different specification mechanisms effectively. Specifically, we investigate how effectively these mechanisms can be used to (a) correctly direct an AI system or robot to generate some desired behavior and (b) predict the behavior encoded in a given objective specification. We first present a formalization of the problems of objective specification and behavior prediction, and we characterize the problems of underspecification and overspecification. We then perform a user study to assess how well participants are able to use rewards and goals as specification mechanisms, and their propensity for overspecification and underspecification with these mechanisms. While participants have a strong preference for using goals as an objective specification mechanism, we find a surprising result: even non-expert users are just as capable of specifying and interpreting reward functions as they are of using goals.
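
The comparison above hinges on how the same task is written down in each mechanism. As a hedged illustration only (a toy gridworld of our own, not a domain from the paper), the sketch below contrasts a goal specification with a reward specification and notes where under- and overspecification can creep in; all names and values are illustrative.

```python
# Illustrative sketch (not from the paper): two ways a non-expert might
# specify the same task -- "reach the charging dock without entering water" --
# in a toy gridworld. Goal specification names desired terminal states;
# reward specification scores every state.

GRID = ["S..~",
        "..~.",
        "...D"]          # S: start, D: dock, ~: water, .: free cell

# (a) Goal specification: a set of acceptable terminal states plus constraints.
goal_spec = {
    "goal_cells": {(2, 3)},               # the dock
    "forbidden_cells": {(0, 3), (1, 2)},  # water cells
}

# (b) Reward specification: a per-state scalar the agent maximizes cumulatively.
def reward_spec(cell):
    if cell == (2, 3):
        return 1.0    # success
    if cell in {(0, 3), (1, 2)}:
        return -1.0   # water
    return -0.01      # small step cost; omitting it (underspecification) can
                      # leave wandering unpenalized, while piling on ad hoc
                      # bonuses risks overspecification.
```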

RLC Conference 2025 Conference Paper

Goals vs. Rewards: A Preliminary Comparative Study of Objective Specification Mechanisms

  • Septia Rani
  • Serena Booth
  • Sarath Sreedharan

This paper studies two popular objective specification mechanisms for sequential decision-making: goals and rewards. We investigate how easy it is for people without AI expertise to use these different specification mechanisms effectively. Specifically, we investigate how effectively these mechanisms can be used to (a) correctly direct an AI system or robot to generate some desired behavior and (b) predict the behavior encoded in a given objective specification. We first present a formalization of the problems of objective specification and behavior prediction, and we characterize the problems of underspecification and overspecification. We then perform a user study to assess how well participants are able to use rewards and goals as specification mechanisms, and their propensity for overspecification and underspecification with these mechanisms. While participants have a strong preference for using goals as an objective specification mechanism, we find a surprising result: even non-expert users are just as capable of specifying and interpreting reward functions as they are of using goals.

RLC Conference 2025 Conference Paper

Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

  • Calarina Muslimani
  • Kerrick Johnstonbaugh
  • Suyog Chandramouli
  • Serena Booth
  • W. Bradley Knox
  • Matthew E. Taylor

Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment: assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11-person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users, and increased the success rate of selecting reward functions that produced performant policies by 41%.
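
The abstract does not give the formula for the Trajectory Alignment Coefficient, so the sketch below is only a plausible reading of the idea: compare a human's ranking of trajectories with the ranking induced by a candidate reward's returns via a rank correlation (Kendall's tau here, an assumption rather than the paper's definition).

```python
# Hedged sketch: the exact Trajectory Alignment Coefficient is defined in the
# paper; this only illustrates comparing a human's ranking of trajectories
# with the ranking induced by a candidate reward function.
from scipy.stats import kendalltau

def return_of(trajectory, reward_fn):
    """Sum a candidate reward function over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def alignment_score(trajectories, human_ranking, reward_fn):
    """Rank correlation between the human's ranking and the reward-induced ranking.
    human_ranking[i] is the human's rank for trajectories[i] (0 = best)."""
    returns = [return_of(t, reward_fn) for t in trajectories]
    # Higher return should correspond to a better (lower) human rank,
    # so correlate returns with negated ranks.
    tau, _ = kendalltau(returns, [-r for r in human_ranking])
    return tau  # 1.0 = perfectly aligned, -1.0 = reversed
```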

RLJ Journal 2025 Journal Article

Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners

  • Calarina Muslimani
  • Kerrick Johnstonbaugh
  • Suyog Chandramouli
  • Serena Booth
  • W. Bradley Knox
  • Matthew E. Taylor

Reinforcement learning agents are fundamentally limited by the quality of the reward functions they learn from, yet reward design is often overlooked under the assumption that a well-defined reward is readily available. However, in practice, designing rewards is difficult, and even when specified, evaluating their correctness is equally problematic: how do we know if a reward function is correctly specified? In our work, we address these challenges by focusing on reward alignment: assessing whether a reward function accurately encodes the preferences of a human stakeholder. As a concrete measure of reward alignment, we introduce the Trajectory Alignment Coefficient to quantify the similarity between a human stakeholder's ranking of trajectory distributions and those induced by a given reward function. We show that the Trajectory Alignment Coefficient exhibits desirable properties, such as not requiring access to a ground truth reward, invariance to potential-based reward shaping, and applicability to online RL. Additionally, in an 11-person user study of RL practitioners, we found that access to the Trajectory Alignment Coefficient during reward selection led to statistically significant improvements. Compared to relying only on reward functions, our metric reduced cognitive workload by 1.5x, was preferred by 82% of users, and increased the success rate of selecting reward functions that produced performant policies by 41%.

AAAI Conference 2024 Conference Paper

Learning Optimal Advantage from Preferences and Mistaking It for Reward

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Sigurdur Orn Adalgeirsson
  • Serena Booth
  • Anca Dragan
  • Peter Stone
  • Scott Niekum

We consider algorithms for learning reward functions from human preferences over pairs of trajectory segments, as used in reinforcement learning from human feedback (RLHF). Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return. Recent work casts doubt on the validity of this assumption, proposing an alternative preference model based upon regret. We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret. We argue that the learned function is an approximation of the optimal advantage function, not a reward function. We find that if a specific pitfall is addressed, this incorrect assumption is not particularly harmful, resulting in a highly shaped reward function. Nonetheless, this incorrect usage of the approximation of the optimal advantage function is less desirable than the appropriate and simpler approach of greedy maximization of it. From the perspective of the regret preference model, we also provide a clearer interpretation of fine-tuning contemporary large language models with RLHF. This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.
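
As a rough sketch of the "greedy maximization" alternative the abstract mentions (an illustration of the idea, not the authors' code): if the function learned from preferences approximates the optimal advantage A*(s, a), it can be used directly by acting greedily on it, rather than being treated as a reward and fed back into another round of RL.

```python
# Sketch only: A_hat is assumed to be a function learned from preferences that
# approximates the optimal advantage A*(s, a); acting greedily on it sidesteps
# the "incorrect usage" of re-optimizing it as if it were a reward.
import numpy as np

def greedy_policy_from_advantage(A_hat, n_states, n_actions):
    """Return, for each state, the action with the highest estimated advantage."""
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        policy[s] = int(np.argmax([A_hat(s, a) for a in range(n_actions)]))
    return policy
```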

TMLR Journal 2024 Journal Article

Models of human preference for learning reward functions

  • W. Bradley Knox
  • Stephane Hatgis-Kessell
  • Serena Booth
  • Scott Niekum
  • Peter Stone
  • Alessandro G Allievi

The utility of reinforcement learning is limited by the alignment of reward functions with the interests of human stakeholders. One promising method for alignment is to learn the reward function from human-generated preferences between pairs of trajectory segments, a type of reinforcement learning from human feedback (RLHF). These human preferences are typically assumed to be informed solely by partial return, the sum of rewards along each segment. We find this assumption to be flawed and propose modeling human preferences instead as informed by each segment’s regret, a measure of a segment’s deviation from optimal decision-making. Given infinitely many preferences generated according to regret, we prove that we can identify a reward function equivalent to the reward function that generated those preferences, and we prove that the previous partial return model lacks this identifiability property in multiple contexts. We empirically show that our proposed regret preference model outperforms the partial return preference model with finite training data in otherwise the same setting. Additionally, we find that our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned. Overall, this work establishes that the choice of preference model is impactful, and our proposed regret preference model provides an improvement upon a core assumption of recent research. We have open sourced our experimental code, the human preferences dataset we gathered, and our training and preference elicitation interfaces for gathering such a dataset.
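
For readers unfamiliar with the two assumptions being compared, a minimal Bradley-Terry-style sketch of the partial-return and regret preference models is given below; the paper's exact definitions (in particular its regret formulation) may differ in detail.

```python
# Hedged sketch of the two preference models discussed in the abstract, both
# written in Bradley-Terry form; an illustration, not the paper's code.
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_prefer_partial_return(return_1, return_2):
    """Partial-return model: segment 1 is preferred according to its summed reward."""
    return logistic(return_1 - return_2)

def p_prefer_regret(regret_1, regret_2):
    """Regret model: segment 1 is preferred when it deviates less from
    optimal decision-making (lower regret)."""
    return logistic(regret_2 - regret_1)
```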

AAAI Conference 2024 Conference Paper

Quality-Diversity Generative Sampling for Learning with Synthetic Data

  • Allen Chang
  • Matthew C. Fontaine
  • Serena Booth
  • Maja J. Matarić
  • Stefanos Nikolaidis

Generative models can serve as surrogates for some real data sources by creating synthetic training datasets, but in doing so they may transfer biases to downstream tasks. We focus on protecting quality and diversity when generating synthetic training datasets. We propose quality-diversity generative sampling (QDGS), a framework for sampling data uniformly across a user-defined measure space, despite the data coming from a biased generator. QDGS is a model-agnostic framework that uses prompt guidance to optimize a quality objective across measures of diversity for synthetically generated data, without fine-tuning the generative model. Using balanced synthetic datasets generated by QDGS, we first debias classifiers trained on color-biased shape datasets as a proof-of-concept. By applying QDGS to facial data synthesis, we prompt for desired semantic concepts, such as skin tone and age, to create an intersectional dataset with a combined blend of visual features. Leveraging this balanced data for training classifiers improves fairness while maintaining accuracy on facial recognition benchmarks. Code available at: https://github.com/Cylumn/qd-generative-sampling.
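
As a hedged sketch of the quality-diversity idea described above (not the QDGS implementation, which is in the linked repository): keep only the best-quality sample per cell of a user-defined measure space, so the retained synthetic data is spread roughly uniformly across that space even when the generator is biased.

```python
# Illustrative sketch only. All function names are placeholders: generator()
# draws a synthetic sample, quality_fn scores it, and measure_fn maps it into
# a user-defined measure space with values in [0, 1).
def build_archive(generator, quality_fn, measure_fn, n_samples, n_bins=10):
    archive = {}  # measure-space cell -> (best quality seen, sample)
    for _ in range(n_samples):
        x = generator()
        cell = tuple(int(m * n_bins) for m in measure_fn(x))
        if cell not in archive or quality_fn(x) > archive[cell][0]:
            archive[cell] = (quality_fn(x), x)
    # Each occupied cell contributes one sample, so the kept data is spread
    # roughly uniformly across the measure space even if the generator is biased.
    return [sample for _, sample in archive.values()]
```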

AAAI Conference 2023 Conference Paper

The Perils of Trial-and-Error Reward Design: Misdesign through Overfitting and Invalid Task Specifications

  • Serena Booth
  • W. Bradley Knox
  • Julie Shah
  • Scott Niekum
  • Peter Stone
  • Alessandro Allievi

In reinforcement learning (RL), a reward function that aligns exactly with a task's true performance metric is often necessarily sparse. For example, a true task metric might encode a reward of 1 upon success and 0 otherwise. The sparsity of these true task metrics can make them hard to learn from, so in practice they are often replaced with alternative dense reward functions. These dense reward functions are typically designed by experts through an ad hoc process of trial and error. In this process, experts manually search for a reward function that improves performance with respect to the task metric while also enabling an RL algorithm to learn faster. This process raises the question of whether the same reward function is optimal for all algorithms, i.e., whether the reward function can be overfit to a particular algorithm. In this paper, we study the consequences of this wide yet unexamined practice of trial-and-error reward design. We first conduct computational experiments that confirm that reward functions can be overfit to learning algorithms and their hyperparameters. We then conduct a controlled observation study which emulates expert practitioners' typical experiences of reward design, in which we similarly find evidence of reward function overfitting. We also find that experts' typical approach to reward design---of adopting a myopic strategy and weighing the relative goodness of each state-action pair---leads to misdesign through invalid task specifications, since RL algorithms use cumulative reward rather than rewards for individual state-action pairs as an optimization target. Code, data: github.com/serenabooth/reward-design-perils
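
A toy example, not drawn from the paper, of the invalid-specification failure mode described in the last sentence: scoring each step in isolation and letting the agent maximize the cumulative score can make a non-finishing policy outscore one that actually completes the task.

```python
# Toy illustration (not from the paper): a true sparse metric versus a dense
# reward built by scoring each step in isolation. Because RL optimizes the
# *cumulative* dense reward, a looping policy can outscore one that finishes.
def true_metric(episode):
    return 1.0 if episode[-1] == "success" else 0.0   # sparse: success or not

def dense_reward(step):
    return {"move_toward_goal": 0.1, "success": 1.0}.get(step, 0.0)

finishing = ["move_toward_goal", "success"]   # reaches the goal
looping   = ["move_toward_goal"] * 30         # never finishes

assert true_metric(finishing) > true_metric(looping)
assert sum(dense_reward(s) for s in finishing) < sum(dense_reward(s) for s in looping)
```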

AAAI Conference 2022 Conference Paper

Do Feature Attribution Methods Correctly Attribute Features?

  • Yilun Zhou
  • Serena Booth
  • Marco Tulio Ribeiro
  • Julie Shah

Feature attribution methods are popular in interpretable machine learning. These methods compute the attribution of each input feature to represent its importance, but there is no consensus on the definition of “attribution”, leading to many competing methods with little systematic evaluation, complicated in particular by the lack of ground truth attribution. To address this, we propose a dataset modification procedure to induce such ground truth. Using this procedure, we evaluate three common methods: saliency maps, attentions, and rationales. We identify several deficiencies and add new perspectives to the growing body of evidence questioning the correctness and reliability of these methods applied to datasets in the wild. We further discuss possible avenues for remedy and recommend that new attribution methods be tested against ground truth before deployment. The code and appendix are available at https://yilunzhou.github.io/feature-attribution-evaluation/.
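
A minimal sketch of the dataset-modification idea, with illustrative names rather than the paper's procedure: plant a small patch whose presence determines the label, so the ground-truth attribution is known by construction and a method's output can be scored by how much attribution mass lands on that patch.

```python
# Hedged sketch only; patch placement, size, and function names are assumptions.
import numpy as np

def plant_patch(image, label, size=3):
    """Overwrite a top-left patch with a label-dependent value (1.0 or 0.0)."""
    img = image.copy()
    img[:size, :size] = 1.0 if label == 1 else 0.0
    return img

def attribution_on_patch(attribution_map, size=3):
    """Fraction of total absolute attribution that falls on the planted patch."""
    total = np.abs(attribution_map).sum()
    return float(np.abs(attribution_map[:size, :size]).sum() / total) if total else 0.0
```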

AAAI Conference 2021 Conference Paper

Bayes-TrEx: a Bayesian Sampling Approach to Model Transparency by Example

  • Serena Booth
  • Yilun Zhou
  • Ankit Shah
  • Julie Shah

Post-hoc explanation methods are gaining popularity for interpreting, understanding, and debugging neural networks. Most analyses using such methods explain decisions in response to inputs drawn from the test set. However, the test set may have few examples that trigger some model behaviors, such as high-confidence failures or ambiguous classifications. To address these challenges, we introduce a flexible model inspection framework: Bayes-TrEx. Given a data distribution, Bayes-TrEx finds in-distribution examples which trigger a specified prediction confidence. We demonstrate several use cases of Bayes-TrEx, including revealing highly confident (mis)classifications, visualizing class boundaries via ambiguous examples, understanding novel-class extrapolation behavior, and exposing neural network overconfidence. We use Bayes-TrEx to study classifiers trained on CLEVR, MNIST, and Fashion-MNIST, and we show that this framework enables more flexible holistic model analysis than just inspecting the test set. Code and supplemental material are available at https://github.com/serenabooth/Bayes-TrEx.
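
A much-simplified sketch of the underlying idea, assuming plain rejection sampling rather than the Bayesian inference procedure Bayes-TrEx actually uses (see the linked repository): draw candidates from a model of the data distribution and keep those whose prediction confidence is close to a specified target.

```python
# Simplified sketch only; sample_prior and classifier_confidence are assumed
# callables, not part of the Bayes-TrEx codebase.
def find_examples(sample_prior, classifier_confidence, target=0.5, tol=0.05,
                  n_draws=10000):
    """sample_prior() -> candidate input; classifier_confidence(x) -> max softmax prob."""
    hits = []
    for _ in range(n_draws):
        x = sample_prior()
        if abs(classifier_confidence(x) - target) <= tol:
            hits.append(x)   # e.g. target=0.5 surfaces ambiguous, boundary-like inputs
    return hits
```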

IJCAI Conference 2019 Conference Paper

Evaluating the Interpretability of the Knowledge Compilation Map: Communicating Logical Statements Effectively

  • Serena Booth
  • Christian Muise
  • Julie Shah

Knowledge compilation techniques translate propositional theories into equivalent forms to increase their computational tractability. But how should we best present these propositional theories to a human? We analyze the standard taxonomy of propositional theories for relative interpretability across three model domains: highway driving, emergency triage, and the chopsticks game. We generate decision-making agents that produce logical explanations for their actions and apply knowledge compilation to these explanations. Then, we evaluate how quickly, accurately, and confidently users comprehend the generated explanations. We find that domain, formula size, and negated logical connectives significantly affect comprehension, while formula properties typically associated with interpretability are not strong predictors of human ability to comprehend the theory.
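
To make the finding about negated logical connectives concrete, a small illustration using sympy (a toy formula of our own, not one of the paper's domains): two logically equivalent forms of the same rule, one stated through negation and one compiled into a negation-light equivalent.

```python
# Toy illustration only; the paper's domains and compilation targets differ.
from sympy import symbols
from sympy.logic.boolalg import simplify_logic, Equivalent, Not
from sympy.logic.inference import satisfiable

rain, wet = symbols("rain wet")

with_negation = ~(rain & ~wet)            # "not (raining and the ground not wet)"
compiled = simplify_logic(with_negation)  # equivalent form with fewer negations, e.g. wet | ~rain

# The two forms are logically equivalent: their non-equivalence is unsatisfiable.
assert satisfiable(Not(Equivalent(with_negation, compiled))) is False
```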