Arrow Research search

Author name cluster

Micah Carroll

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

ICLR Conference 2025 Conference Paper

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

  • Marcus Williams
  • Micah Carroll
  • Adhyyan Narang
  • Constantin Weisser
  • Brendan Murphy
  • Anca D. Dragan

As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative or deceptive tactics to obtain positive feedback from users who are vulnerable to such strategies. We study this phenomenon by training LLMs with reinforcement learning on simulated user feedback in environments of practical LLM usage. In our settings, we find that: 1) Extreme forms of "feedback gaming" such as manipulation and deception are learned reliably; 2) Even if only 2% of users are vulnerable to manipulative strategies, LLMs learn to identify and target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. Instead, we find that while such approaches help in some of our settings, they backfire in others, sometimes even leading to subtler manipulative behaviors. We hope our results can serve as a case study which highlights the risks of using gameable feedback sources -- such as user feedback -- as a target for RL. Our code is publicly available. Warning: some of our examples may be upsetting.

NeurIPS Conference 2025 Conference Paper

Robust and Diverse Multi-Agent Learning via Rational Policy Gradient

  • Niklas Lauffer
  • Ameesh Shah
  • Micah Carroll
  • Sanjit Seshia
  • Stuart J Russell
  • Michael Dennis

Adversarial optimization algorithms that explicitly search for flaws in agents' policies have been successfully applied to finding robust and diverse policies in the context of multi-agent learning. However, the success of adversarial optimization has been largely limited to zero-sum settings because its naive application in cooperative settings leads to a critical failure mode: agents are irrationally incentivized to self-sabotage, blocking the completion of tasks and halting further learning. To address this, we introduce Rationality-preserving Policy Optimization (RPO), a formalism for adversarial optimization that avoids self-sabotage by ensuring agents remain rational; that is, their policies are optimal with respect to some possible partner policy. To solve RPO, we develop Rational Policy Gradient (RPG), which trains agents to maximize their own reward in a modified version of the original game in which we use opponent shaping techniques to optimize the adversarial objective. RPG enables us to extend a variety of existing adversarial optimization algorithms that, no longer subject to the limitations of self-sabotage, can find adversarial examples, improve robustness and adaptability, and learn diverse policies. We empirically validate that our approach achieves strong performance in several popular cooperative and general-sum environments. Our project page can be found at https://rational-policy-gradient.github.io.

ICML Conference 2024 Conference Paper

AI Alignment with Changing and Influenceable Reward Functions

  • Micah Carroll
  • Davis Foote
  • Anand Siththaranjan
  • Stuart Russell 0001
  • Anca D. Dragan

Existing AI alignment approaches assume that preferences are static, which is unrealistic: our preferences change, and may even be influenced by our interactions with AI systems themselves. To clarify the consequences of incorrectly assuming static preferences, we introduce Dynamic Reward Markov Decision Processes (DR-MDPs), which explicitly model preference changes and the AI’s influence on them. We show that despite its convenience, the static-preference assumption may undermine the soundness of existing alignment techniques, leading them to implicitly reward AI systems for influencing user preferences in ways users may not truly want. We then explore potential solutions. First, we offer a unifying perspective on how an agent’s optimization horizon may partially help reduce undesirable AI influence. Then, we formalize different notions of AI alignment that account for preference change from the outset. Comparing the strengths and limitations of 8 such notions of alignment, we find that they all either err towards causing undesirable AI influence, or are overly risk-averse, suggesting that a straightforward solution to the problems of changing preferences may not exist. As there is no avoiding grappling with changing preferences in real-world settings, this makes it all the more important to handle these issues with care, balancing risks and capabilities. We hope our work can provide conceptual clarity and constitute a first step towards AI alignment practices which explicitly account for (and contend with) the changing and influenceable nature of human preferences.

AAMAS Conference 2024 Conference Paper

Defining Deception in Decision Making

  • Marwa Abdulhai
  • Micah Carroll
  • Justin Svegliato
  • Anca Dragan
  • Sergey Levine

With the growing capabilities of machine learning systems, particularly those that communicate or interact with humans, there is an increased risk of systems that can easily deceive and manipulate people. Preventing unintended deception and manipulation therefore represents an important challenge for creating aligned AI systems. To approach this challenge in a principled way, we first need to define deception formally. In this work, we present a concrete definition of deception under the formalism of rational decision making in partially observed Markov decision processes. We propose a general regret theory of deception under which the degree of deception can be quantified in terms of the actor’s beliefs, actions, and utility. We instantiate these principles as reward terms for communication agents, and study the degree to which the behavior aligns with human judgments about deception. We hope our work will represent a step toward systems that aim to avoid deception, and detection mechanisms to identify deceptive agents.
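As a rough, invented sketch (not the paper's actual formalism; the utility matrix and beliefs here are hypothetical), a regret-style measure of deception can compare the expected utility a listener obtains when acting on beliefs induced by a speaker versus on accurate beliefs, with both evaluated under the true state distribution:

```python
import numpy as np

# Hypothetical illustration: quantify deception as the listener's regret
# from best-responding to speaker-induced beliefs rather than accurate ones.
# utilities[a, s] = listener's utility for taking action a in state s.
def deception_regret(utilities, induced_belief, true_belief):
    induced_belief = np.asarray(induced_belief, dtype=float)
    true_belief = np.asarray(true_belief, dtype=float)
    # Best response under each belief over states.
    a_induced = int(np.argmax(utilities @ induced_belief))
    a_accurate = int(np.argmax(utilities @ true_belief))
    # Regret is measured against the true belief: how much utility the
    # listener loses by acting on the induced belief instead.
    values = utilities @ true_belief
    return float(values[a_accurate] - values[a_induced])
```

Under this toy measure, an uninformative but honest message induces zero regret, while a message that flips the listener's best response yields positive regret proportional to the utility lost.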

TMLR Journal 2023 Journal Article

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

  • Stephen Casper
  • Xander Davies
  • Claudia Shi
  • Thomas Krendl Gilbert
  • Jérémy Scheurer
  • Javier Rando
  • Rachel Freedman
  • Tomek Korbak

Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-layered approach to the development of safer AI systems.

ICML Conference 2023 Conference Paper

Who Needs to Know? Minimal Knowledge for Optimal Coordination

  • Niklas Lauffer
  • Ameesh Shah
  • Micah Carroll
  • Michael D. Dennis
  • Stuart Russell 0001

To optimally coordinate with others in cooperative games, it is often crucial to have information about one's collaborators: successful driving requires understanding which side of the road to drive on. However, not every feature of collaborators is strategically relevant: the fine-grained acceleration of drivers may be ignored while maintaining optimal coordination. We show that there is a well-defined dichotomy between strategically relevant and irrelevant information. Moreover, we show that, in dynamic games, this dichotomy has a compact representation that can be efficiently computed via a Bellman backup operator. We apply this algorithm to analyze the strategically relevant information for tasks in both a standard and a partially observable version of the Overcooked environment. Theoretical and empirical results show that our algorithms are significantly more efficient than baselines. Videos are available at https://minknowledge.github.io.

ICML Conference 2022 Conference Paper

Estimating and Penalizing Induced Preference Shifts in Recommender Systems

  • Micah Carroll
  • Anca D. Dragan
  • Stuart Russell 0001
  • Dylan Hadfield-Menell

The content that a recommender system (RS) shows to users influences them. Therefore, when choosing a recommender to deploy, one is implicitly also choosing to induce specific internal states in users. Even more, systems trained via long-horizon optimization will have direct incentives to manipulate users, e.g. shift their preferences so they are easier to satisfy. We focus on induced preference shifts in users. We argue that, before deployment, system designers should: estimate the shifts a recommender would induce; evaluate whether such shifts would be undesirable; and perhaps even actively optimize to avoid problematic shifts. These steps involve two challenging ingredients: estimation requires anticipating how hypothetical policies would influence user preferences if deployed, which we do by using historical user interaction data to train a predictive user model that implicitly contains their preference dynamics; evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted, for which we use the notion of "safe shifts" that define a trust region within which behavior is safe: for instance, the natural way in which users would shift without interference from the system could be deemed "safe". In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders. Additionally, we show that recommenders that optimize for staying in the trust region can avoid manipulative behaviors while still generating engagement.
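As a minimal sketch of the trust-region idea (the function name, distance metric, and radius are invented, not the paper's implementation), a "safe shift" penalty can charge a candidate recommender only for the amount by which its predicted preference trajectory drifts beyond an allowed distance from the natural, no-interference trajectory:

```python
import numpy as np

# Hypothetical "safe shift" trust-region penalty. Both arguments are arrays
# of shape (T, d): predicted user-preference vectors over T timesteps, as
# produced by some learned preference-dynamics model (assumed to exist).
def trust_region_penalty(induced, natural, radius=0.5):
    # Per-timestep distance between induced and natural preference vectors.
    drift = np.linalg.norm(np.asarray(induced) - np.asarray(natural), axis=-1)
    # Only drift exceeding the allowed radius is penalized.
    return float(np.maximum(drift - radius, 0.0).sum())
```

A recommender objective in this spirit would then trade off engagement against the penalty, e.g. maximizing `engagement - lam * trust_region_penalty(...)` for some weight `lam`.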

NeurIPS Conference 2022 Conference Paper

Uni[MASK]: Unified Inference in Sequential Decision Problems

  • Micah Carroll
  • Orr Paradise
  • Jessy Lin
  • Raluca Georgescu
  • Mingfei Sun
  • David Bignell
  • Stephanie Milani
  • Katja Hofmann

Randomly masking and predicting word tokens has been a successful approach in pre-training language models for a variety of downstream tasks. In this work, we observe that the same idea also applies naturally to sequential decision making, where many well-studied tasks like behavior cloning, offline RL, inverse dynamics, and waypoint conditioning correspond to different sequence maskings over a sequence of states, actions, and returns. We introduce the UniMASK framework, which provides a unified way to specify models which can be trained on many different sequential decision making tasks. We show that a single UniMASK model is often capable of carrying out many tasks with performance similar to or better than single-task models. Additionally, after fine-tuning, our UniMASK models consistently outperform comparable single-task models.
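The tasks-as-maskings idea can be sketched concretely (this is an illustrative toy, not the authors' code; the encoding of a trajectory as paired state/action tokens with a binary visibility mask is an assumption):

```python
import numpy as np

# Hypothetical illustration: sequential decision-making tasks expressed as
# binary visibility masks over a length-T trajectory of (state, action)
# tokens. mask[t, 0] == 1 means state_t is shown to the model, and
# mask[t, 1] == 1 means action_t is shown; zero entries are predicted.

def behavior_cloning_mask(T):
    # See all states and all past actions; predict the final action.
    m = np.zeros((T, 2), dtype=int)
    m[:, 0] = 1
    m[:-1, 1] = 1
    return m

def inverse_dynamics_mask(T):
    # See all states; predict every action connecting them.
    m = np.zeros((T, 2), dtype=int)
    m[:, 0] = 1
    return m

def waypoint_conditioning_mask(T):
    # See only the start and goal states; predict everything in between.
    m = np.zeros((T, 2), dtype=int)
    m[0, 0] = 1
    m[-1, 0] = 1
    return m
```

A single masked-prediction model trained over randomly sampled masks can then be queried with any of these task-specific masks at inference time.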

AAMAS Conference 2021 Conference Paper

Evaluating the Robustness of Collaborative Agents

  • Paul Knott
  • Micah Carroll
  • Sam Devlin
  • Kamil Ciosek
  • Katja Hofmann
  • Anca Dragan
  • Rohin Shah

Artificial agents trained by deep reinforcement learning will likely encounter novel situations after deployment that were never seen during training. Our agent must be robust to handle such situations well. However, if we cannot rely on the average training or validation reward as a metric, then how can we effectively evaluate robustness? We take inspiration from the practice of unit testing in software engineering. Specifically, we suggest that when designing AI agents that collaborate with humans, designers should search for potential edge cases in possible partner behavior and possible states encountered, and write tests which check that the behavior of the agent in these edge cases is reasonable. We apply this methodology to build a suite of unit tests for the Overcooked-AI environment, and use this test suite to evaluate three proposals for improving robustness. We find that the test suite provides significant insight into the effects of these proposals that were generally not revealed by looking solely at the average validation reward. For our full paper, see arxiv.org/abs/2101.05507.
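The unit-testing methodology can be illustrated with a toy example (the policy, state encoding, and test names below are invented stand-ins, not the Overcooked-AI test suite): hand-construct an edge-case state and assert on the agent's behavior in it, rather than relying on average reward.

```python
# Hypothetical behavioral "unit tests" for a collaborative agent, in the
# spirit described above. State is a toy (position, goal, partner) tuple
# on a 1-D corridor.
def greedy_policy(state):
    # Toy policy: step toward the goal unless the partner occupies the
    # next cell, in which case wait.
    pos, goal, partner = state
    step = 1 if goal > pos else -1
    nxt = pos + step
    if nxt == partner:
        return "wait"
    return "right" if step == 1 else "left"

def test_agent_waits_when_partner_blocks():
    # Edge case: partner stands directly on the agent's next cell.
    assert greedy_policy((2, 5, 3)) == "wait"

def test_agent_advances_when_path_clear():
    assert greedy_policy((2, 5, 4)) == "right"

if __name__ == "__main__":
    test_agent_waits_when_partner_blocks()
    test_agent_advances_when_path_clear()
    print("edge-case tests passed")
```

An average-reward evaluation could easily miss the blocking edge case if it is rare in the validation distribution, which is exactly what targeted tests like these catch.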

NeurIPS Conference 2019 Conference Paper

On the Utility of Learning about Humans for Human-AI Coordination

  • Micah Carroll
  • Rohin Shah
  • Mark Ho
  • Tom Griffiths
  • Sanjit Seshia
  • Pieter Abbeel
  • Anca Dragan

While we would like agents that can coordinate with humans, current algorithms such as self-play and population-based training create agents that can coordinate with themselves. Agents that assume their partner to be optimal or similar to them can converge to coordination protocols that fail to understand and be understood by humans. To demonstrate this, we introduce a simple environment that requires challenging coordination, based on the popular game Overcooked, and learn a simple model that mimics human play. We evaluate the performance of agents trained via self-play and population-based training. These agents perform very well when paired with themselves, but when paired with our human model, they are significantly worse than agents designed to play with the human model. An experiment with a planning algorithm yields the same conclusion, though only when the human-aware planner is given the exact human model that it is playing with. A user study with real humans shows this pattern as well, though less strongly. Qualitatively, we find that the gains come from having the agent adapt to the human's gameplay. Given this result, we suggest several approaches for designing agents that learn about humans in order to better coordinate with them. Code is available at https://github.com/HumanCompatibleAI/overcooked_ai.