Arrow Research search

Author name cluster

Michael Bowling

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

83 papers
1 author row

Possible papers

83

TMLR Journal 2025 Journal Article

Learning to Be Cautious

  • Montaser Mohammedalamen
  • Dustin Morrill
  • Alexander Sieusahai
  • Yash Satsangi
  • Michael Bowling

A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations. It is generally impossible to anticipate all situations that an autonomous system may face or what behavior would best avoid bad outcomes. An agent that could learn to be cautious would overcome this challenge by discovering for itself when and how to behave cautiously. In contrast, current approaches typically embed task-specific safety information or explicitly cautious behaviors into the system, which is error-prone and imposes extra burdens on practitioners. In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious and an algorithm to demonstrate that it is possible for a system to learn to be cautious. The essential features of our algorithm are that it characterizes reward function uncertainty without task-specific safety information and uses this uncertainty to construct a robust policy. Specifically, we construct robust policies with a $k$-of-$N$ counterfactual regret minimization (CFR) subroutine given a learned reward function uncertainty represented by a neural network ensemble belief. These policies exhibit caution in each of our tasks without any task-specific safety tuning. Our code is available at https://github.com/montaserFath/Learning-to-be-Cautious
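
A minimal sketch of the $k$-of-$N$ robustness objective described in the abstract, assuming a hypothetical `evaluate_return(policy, reward_fn)` evaluator and a list of reward models standing in for the ensemble belief; the paper optimizes this quantity with a CFR subroutine rather than by direct search:

```python
import numpy as np

def k_of_n_objective(policy, reward_ensemble, evaluate_return, k, n, rng):
    """Score a policy by the mean of its k worst returns over N reward
    functions sampled from the ensemble belief: the robustness target
    that k-of-N CFR optimizes. All interfaces here are illustrative."""
    idx = rng.integers(len(reward_ensemble), size=n)
    returns = np.array([evaluate_return(policy, reward_ensemble[i]) for i in idx])
    return np.sort(returns)[:k].mean()  # average of the k lowest of N returns
```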

NeurIPS Conference 2025 Conference Paper

Plasticity as the Mirror of Empowerment

  • David Abel
  • Michael Bowling
  • Andre Barreto
  • Will Dabney
  • Shi Dong
  • Steven Hansen
  • Anna Harutyunyan
  • Khimya Khetarpal

Agents are minimally entities that are influenced by their past observations and act to influence future observations. The latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. The former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
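
For reference, the directed information of Massey (1990), which the generalized directed information above strictly extends, accumulates the causally conditioned dependence of one process on another over time:

```latex
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^t;\, Y_t \mid Y^{t-1}\right)
```

where $X^t = (X_1, \ldots, X_t)$ denotes a process prefix; the generalized quantity itself is defined in the paper and not reproduced here.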

RLC Conference 2025 Conference Paper

Rethinking the Foundations for Continual Reinforcement Learning

  • Esraa Elelimy
  • David Szepesvari
  • Martha White
  • Michael Bowling

In the traditional view of reinforcement learning, the agent's goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, the learning ends. This view contrasts with *continual reinforcement learning*, where learning does not end, and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two paradigms of learning, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of the traditional reinforcement learning foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and the third foundations and replaces them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shed the other two foundations.

RLJ Journal 2025 Journal Article

Rethinking the Foundations for Continual Reinforcement Learning

  • Esraa Elelimy
  • David Szepesvari
  • Martha White
  • Michael Bowling

In the traditional view of reinforcement learning, the agent's goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, the learning ends. This view contrasts with *continual reinforcement learning*, where learning does not end, and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two paradigms of learning, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of the traditional reinforcement learning foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and the third foundations and replaces them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shed the other two foundations.

NeurIPS Conference 2024 Conference Paper

A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning

  • Jacob Adkins
  • Michael Bowling
  • Adam White

The performance of modern reinforcement learning algorithms critically relies on tuning ever-increasing numbers of hyperparameters. Often, small changes in a hyperparameter can lead to drastic changes in performance, and different environments require very different hyperparameter settings to achieve the state-of-the-art performance reported in the literature. We currently lack a scalable and widely accepted approach to characterizing these complex interactions. This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm’s performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.
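
One plausible instantiation of such a sensitivity measure, sketched under the assumption of a completed score table `perf[env, config]`; the paper's exact definition and normalization may differ:

```python
import numpy as np

def hyperparameter_sensitivity(perf):
    """Gap between tuning hyperparameters per environment and committing
    to one configuration across all environments. perf[e, c] is the
    performance of config c on environment e; a larger gap suggests the
    algorithm leans harder on per-environment tuning."""
    per_env_best = perf.max(axis=1).mean()  # tune separately for each env
    best_single = perf.mean(axis=0).max()   # best single config overall
    return per_env_best - best_single
```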

NeurIPS Conference 2024 Conference Paper

Beyond Optimism: Exploration With Partially Observable Rewards

  • Simone Parisi
  • Alireza Kazemipour
  • Michael Bowling

Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if rewards are sometimes unobservable, e.g., in situations of partial monitoring in bandits or the recent formalism of monitored Markov decision processes? In this case, optimism can lead to suboptimal behavior in which the agent does not explore further to collapse its uncertainty. With this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.

AAAI Conference 2024 Conference Paper

Learning Not to Regret

  • David Sychrovský
  • Michal Šustr
  • Elnaz Davoodi
  • Michael Bowling
  • Marc Lanctot
  • Martin Schmid

The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. We present a novel "learning not to regret" framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. We validated our algorithms' faster convergence on a distribution of river poker games. Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements.
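
As background, a minimal sketch of one predictive regret matching step, where a prediction of the next regret biases the current strategy; in Neural Predictive Regret Matching that prediction comes from the meta-learned network, while here it is simply an input:

```python
import numpy as np

def predictive_regret_matching_strategy(cum_regret, predicted_regret):
    """Play proportionally to the positive part of cumulative regret plus
    a prediction of the upcoming regret; with a zero prediction this
    reduces to ordinary regret matching."""
    r = np.maximum(cum_regret + predicted_regret, 0.0)
    total = r.sum()
    if total <= 0.0:  # no positive regret: fall back to uniform play
        return np.full_like(r, 1.0 / len(r))
    return r / total
```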

JAIR Journal 2024 Journal Article

Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models

  • Farzane Aminmansour
  • Taher Jafferjee
  • Ehsan Imani
  • Erin J. Talvitie
  • Michael Bowling
  • Martha White

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models (simulating forwards or backwards) and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the *Hallucinated Value Hypothesis* (HVH): updating the values of real states toward values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna, three of which update real states toward simulated states — so potentially toward hallucinated values — along with our proposed approach, which does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a fruitful direction toward developing Dyna algorithms that are more robust to model error.
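
A tabular sketch of the previously unexplored variant the abstract highlights, assuming hypothetical `predecessor_model` and `reward_model` interfaces: simulated predecessors are updated toward a return grounded in a real state's value, so no real state bootstraps off a hallucinated one:

```python
def backward_multistep_dyna_update(V, real_state, predecessor_model,
                                   reward_model, alpha, gamma, m):
    """Roll a predecessor model backwards from a real state for m steps,
    updating each simulated predecessor toward a multi-step return whose
    bootstrap value comes from real experience."""
    target = V[real_state]  # grounded in a real state's value
    s = real_state
    for _ in range(m):
        prev = predecessor_model.sample(s)   # simulated predecessor of s
        target = reward_model(prev, s) + gamma * target
        V[prev] += alpha * (target - V[prev])  # update the simulated state
        s = prev
    return V
```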

AAMAS Conference 2024 Conference Paper

Monitored Markov Decision Processes

  • Simone Parisi
  • Montaser Mohammedalamen
  • Alireza Kazemipour
  • Matthew E. Taylor
  • Michael Bowling

In reinforcement learning (RL), an agent learns to perform a task by interacting with an environment and receiving feedback (a numerical reward) for its actions. However, the assumption that rewards are always observable is often not applicable in real-world problems. For example, the agent may need to ask a human to supervise its actions or activate a monitoring system to receive feedback. There may even be a period of time before rewards become observable, or a period of time after which rewards are no longer given. In other words, there are cases where the environment generates rewards in response to the agent’s actions but the agent cannot observe them. In this paper, we formalize a novel but general RL framework — Monitored MDPs — where the agent cannot always observe rewards. We discuss the theoretical and practical consequences of this setting, show challenges raised even in toy environments, and propose algorithms to begin to tackle this novel setting. This paper introduces a powerful new formalism that encompasses both new and existing problems and lays the foundation for future research.
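
A sketch of the interaction protocol this formalism implies, with illustrative `env`, `monitor`, and `agent` interfaces (not the paper's API): the environment always generates a reward, but the agent only observes it when the monitor exposes it:

```python
def monitored_interaction(env, monitor, agent, num_steps):
    """Mon-MDP-style loop: rewards exist at every step but may be hidden,
    in which case the agent receives None in their place."""
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, reward = env.step(action)  # reward is always generated...
        observed = reward if monitor.visible(obs, action) else None
        agent.update(obs, action, observed, next_obs)  # ...but maybe unseen
        obs = next_obs
```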

NeurIPS Conference 2024 Conference Paper

Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

  • Esraa Elelimy
  • Adam White
  • Michael Bowling
  • Martha White

Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform GRUs and Transformers across several partially observable environments while using significantly less computation.
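
To illustrate why a diagonal recurrence makes RTRL cheap, here is a real-valued sketch of one recurrent step together with its RTRL sensitivity (LRUs and RTUs use a complex-valued diagonal; the interfaces are illustrative):

```python
import numpy as np

def diagonal_rtrl_step(h, s, lam, W, x):
    """For h_t = lam * h_{t-1} + W @ x_t with diagonal recurrence lam, the
    sensitivity s_t = dh_t/dlam obeys the elementwise recursion
    s_t = h_{t-1} + lam * s_{t-1}, so RTRL costs O(hidden size) per step
    instead of the prohibitive cost of full RTRL."""
    s_new = h + lam * s       # uses h_{t-1}, before the state is updated
    h_new = lam * h + W @ x
    return h_new, s_new
```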

IJCAI Conference 2023 Conference Paper

Rethinking Formal Models of Partially Observable Multiagent Decision Making (Extended Abstract)

  • Vojtěch Kovařík
  • Martin Schmid
  • Neil Burch
  • Michael Bowling
  • Viliam Lisý

Multiagent decision-making in partially observable environments is usually modelled as either an extensive-form game (EFG) in game theory or a partially observable stochastic game (POSG) in multiagent reinforcement learning (MARL). One issue with the current situation is that while most practical problems can be modelled in both formalisms, the relationship of the two models is unclear, which hinders the transfer of ideas between the two communities. A second issue is that while EFGs have recently seen significant algorithmic progress, their classical formalization is unsuitable for efficient presentation of the underlying ideas, such as those around decomposition. To solve the first issue, we introduce factored-observation stochastic games (FOSGs), a minor modification of the POSG formalism which distinguishes between private and public observation and thereby greatly simplifies decomposition. To remedy the second issue, we show that FOSGs and POSGs are naturally connected to EFGs: by "unrolling" a FOSG into its tree form, we obtain an EFG. Conversely, any perfect-recall timeable EFG corresponds to some underlying FOSG in this manner. Moreover, this relationship justifies several minor modifications to the classical EFG formalization that recently appeared as an implicit response to the model's issues with decomposition. Finally, we illustrate the transfer of ideas between EFGs and MARL by presenting three key EFG techniques -- counterfactual regret minimization, sequence form, and decomposition -- in the FOSG framework.

AAMAS Conference 2023 Conference Paper

Targeted Search Control in AlphaZero for Effective Policy Improvement

  • Alexandre Trudeau
  • Michael Bowling

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero’s search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit’s sample efficiency improves when KataGo’s other innovations are incorporated.
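
A minimal sketch of Go-Exploit's search control as described above, with `run_self_play` standing in for AlphaZero's self-play and training step (illustrative interfaces):

```python
import random

def go_exploit_self_play(initial_state, run_self_play, num_games):
    """Start each self-play game from a state sampled from an archive of
    previously visited states of interest, then feed states from the new
    trajectory back into the archive."""
    archive = [initial_state]
    for _ in range(num_games):
        start = random.choice(archive)          # varied, deeper start states
        trajectory = run_self_play(start)       # shorter self-play games
        archive.extend(trajectory.states)       # grow the pool of starts
    return archive
```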

JMLR Journal 2023 Journal Article

Temporal Abstraction in Reinforcement Learning with the Successor Representation

  • Marlos C. Machado
  • Andre Barreto
  • Doina Precup
  • Michael Bowling

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
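
For a fixed policy with state-to-state transition matrix $P$ and discount $\gamma$, the successor representation the article builds on has a simple closed form, sketched here:

```python
import numpy as np

def successor_representation(P, gamma):
    """Psi[s, s'] is the expected discounted number of future visits to s'
    when starting from s and following the fixed policy:
    Psi = (I - gamma * P)^-1."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)
```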

IJCAI Conference 2022 Conference Paper

Approximate Exploitability: Learning a Best Response

  • Finbarr Timbers
  • Nolan Bard
  • Edward Lockhart
  • Marc Lanctot
  • Martin Schmid
  • Neil Burch
  • Julian Schrittwieser
  • Thomas Hubert
  • Michael Bowling

Researchers have shown that neural networks are vulnerable to adversarial examples and subtle environment changes. The resulting errors can look like blunders to humans, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. Such evaluation typically fails to evaluate robustness to worst-case outcomes. Computer poker research has examined how to assess such worst-case performance. Unfortunately, exact computation is infeasible with larger domains, and existing approximations are poker-specific. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, approximating worst-case performance. We demonstrate the technique in several games against a variety of agents, including several AlphaZero-based agents. Supplementary material is available at https://arxiv.org/abs/2004.09677.

IJCAI Conference 2022 Conference Paper

Learning Curricula for Humans: An Empirical Study with Puzzles from The Witness

  • Levi H. S. Lelis
  • João G. G. V. Nova
  • Eugene Chen
  • Nathan R. Sturtevant
  • Carrie Demmans Epp
  • Michael Bowling

The combination of tree search and neural networks has achieved super-human performance in challenging domains. We are interested in transferring to humans the knowledge these learning systems generate. We hypothesize the process in which neural-guided tree search algorithms learn how to solve a set of problems can be used to generate curricula for helping human learners. In this paper we show how the Bootstrap learning system can be modified to learn curricula for humans in a puzzle domain. We evaluate our system in two curriculum learning settings. First, given a small set of problem instances, our system orders the instances to ease the learning process of human learners. Second, given a large set of problem instances, our system returns a small ordered subset of the initial set that can be presented to human learners. We evaluate our curricula with a user study where participants learn how to solve a class of puzzles from the game 'The Witness.' The user-study results suggest one of the curricula our system generates compares favorably with simple baselines and is competitive with the curriculum from the original 'The Witness' game in terms of user retention and effort.

AAAI Conference 2021 Conference Paper

Hindsight and Sequential Rationality of Correlated Play

  • Dustin Morrill
  • Ryan D'Orazio
  • Reca Sarfati
  • Marc Lanctot
  • James R Wright
  • Amy R Greenwald
  • Michael Bowling

Driven by recent successes in two-player, zero-sum game solving and playing, artificial intelligence work on games has increasingly focused on algorithms that produce equilibrium-based strategies. However, this approach has been less effective at producing competent players in general-sum games or those with more than two players than in two-player, zero-sum games. An appealing alternative is to consider adaptive algorithms that ensure strong performance in hindsight relative to what could have been achieved with modified behavior. This approach also leads to a game-theoretic analysis, but in the correlated play that arises from joint learning dynamics rather than factored agent behavior at equilibrium. We develop and advocate for this hindsight rationality framing of learning in general sequential decision-making settings. To this end, we re-examine mediated equilibrium and deviation types in extensive-form games, thereby gaining a more complete understanding and resolving past misconceptions. We present a set of examples illustrating the distinct strengths and weaknesses of each type of equilibrium in the literature, and prove that no tractable concept subsumes all others. This line of inquiry culminates in the definition of the deviation and equilibrium classes that correspond to algorithms in the counterfactual regret minimization (CFR) family, relating them to all others in the literature. Examining CFR in greater detail further leads to a new recursive definition of rationality in correlated play that extends sequential rationality in a way that naturally applies to hindsight evaluation.

AAAI Conference 2021 Conference Paper

Solving Common-Payoff Games with Approximate Policy Iteration

  • Samuel Sokota
  • Edward Lockhart
  • Finbarr Timbers
  • Elnaz Davoodi
  • Ryan D'Orazio
  • Neil Burch
  • Martin Schmid
  • Michael Bowling

For artificially intelligent learning systems to have widespread applicability in real-world settings, it is important that they be able to operate decentrally. Unfortunately, decentralized control is difficult—computing even an epsilon-optimal joint policy is an NEXP-complete problem. Nevertheless, a recently rediscovered insight—that a team of agents can coordinate via common knowledge—has given rise to algorithms capable of finding optimal joint policies in small common-payoff games. The Bayesian action decoder (BAD) leverages this insight and deep reinforcement learning to scale to games as large as two-player Hanabi. However, the approximations it uses to do so prevent it from discovering optimal joint policies even in games small enough to brute force optimal solutions. This work proposes CAPI, a novel algorithm which, like BAD, combines common knowledge with deep reinforcement learning. However, unlike BAD, CAPI prioritizes the propensity to discover optimal joint policies over scalability. While this choice precludes CAPI from scaling to games as large as Hanabi, empirical results demonstrate that, on the games to which CAPI does scale, it is capable of discovering optimal joint policies even when other modern multi-agent reinforcement learning algorithms are unable to do so.

AAMAS Conference 2021 Conference Paper

Sound Algorithms in Imperfect Information Games

  • Michal Šustr
  • Martin Schmid
  • Matej Moravčík
  • Neil Burch
  • Marc Lanctot
  • Michael Bowling

Search has played a fundamental role in computer game research since the very beginning. And while online search has been commonly used in perfect information games such as Chess and Go, online search methods for imperfect information games have only been introduced relatively recently. This paper addresses the question: what is a sound online algorithm in an imperfect information setting of two-player zero-sum games? We argue that the fixed-strategy definitions of exploitability and epsilon-Nash equilibria are ill-suited to measure the worst-case performance of an online algorithm. We thus formalize epsilon-soundness, a concept that connects the worst-case performance of an online algorithm to the performance of an epsilon-Nash equilibrium. Our definition of soundness and the consistency hierarchy finally provide appropriate tools to analyze online algorithms in repeated imperfect information games. We thus inspect some of the previous online algorithms in a new light, bringing new insights into their worst-case performance guarantees.

JAIR Journal 2021 Journal Article

Teaching People by Justifying Tree Search Decisions: An Empirical Study in Curling

  • Cleyton R. Silva
  • Michael Bowling
  • Levi H.S. Lelis

In this research note we show that a simple justification system can be used to teach humans non-trivial strategies of the Olympic sport of curling. This is achieved by justifying the decisions of Kernel Regression UCT (KR-UCT), a tree search algorithm that derives curling strategies by playing the game with itself. Given an action returned by KR-UCT and the expected outcome of that action, we use a decision tree to produce a counterfactual justification of KR-UCT’s decision. The system samples other possible outcomes and selects for presentation the outcomes that are most similar to the expected outcome in terms of visual features and most different in terms of expected end-game value. A user study with 122 people shows that the participants who had access to the justifications produced by our system achieved much higher scores in a curling test than those who only observed the decision made by KR-UCT and those with access to the justifications of a baseline system. This is, to the best of our knowledge, the first work showing that a justification system is able to teach humans non-trivial strategies learned by an algorithm operating in self play.

AAAI Conference 2020 Conference Paper

Count-Based Exploration with the Successor Representation

  • Marlos C. Machado
  • Marc G. Bellemare
  • Michael Bowling

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
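
A one-line sketch of the resulting exploration bonus: while the SR is still being learned, the norm of a state's SR row acts like an inverse visit count, so its reciprocal can be added to the reward (the scale `beta` is illustrative):

```python
import numpy as np

def sr_exploration_bonus(psi_row, beta=1.0):
    """Count-like bonus from the (still-learning) successor representation:
    rarely visited states have small SR-row norms and thus large bonuses."""
    return beta / np.linalg.norm(psi_row, ord=1)
```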

NeurIPS Conference 2020 Conference Paper

Marginal Utility for Planning in Continuous or Large Discrete Action Spaces

  • Zaheen Ahmad
  • Levi Lelis
  • Michael Bowling

Sample-based planning is a powerful family of algorithms for generating intelligent behavior from a model of the environment. Generating good candidate actions is critical to the success of sample-based planners, particularly in continuous or large action spaces. Typically, candidate action generation exhausts the action space, uses domain knowledge, or more recently, involves learning a stochastic policy to provide such search guidance. In this paper we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. The marginal utility of an action generator measures the increase in value of an action over previously generated actions. We validate our approach in both curling, a challenging stochastic domain with continuous state and action spaces, and a location game with a discrete but large action space. We show that a generator trained with the marginal utility objective outperforms hand-coded schemes built on substantial domain knowledge, trained stochastic policies, and other natural objectives for generating actions for sample-based planners.
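
A sketch of the marginal utility of a single generated action under an illustrative action-value estimator `value`: it is the improvement the new action brings over the best previously generated candidate:

```python
def marginal_utility(value, candidate_actions, new_action):
    """Increase in the best achievable value when new_action is added to
    the previously generated candidates (zero if it does not improve)."""
    if not candidate_actions:
        return value(new_action)
    best_so_far = max(value(a) for a in candidate_actions)
    return max(value(new_action) - best_so_far, 0.0)
```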

RLDM Conference 2019 Conference Abstract

Can a Game Require Theory of Mind?

  • Michael Bowling

NeurIPS Conference 2019 Conference Paper

Ease-of-Teaching and Language Structure from Emergent Communication

  • Fushan Li
  • Michael Bowling

Artificial agents have been shown to learn to communicate when needed to complete a cooperative task. Some level of language structure (e.g., compositionality) has been found in the learned communication protocols. This observed structure is often the result of specific environmental pressures during training. By introducing new agents periodically to replace old ones, sequentially and within a population, we explore one such new pressure — ease of teaching — and show its impact on the structure of the resulting language.

AAAI Conference 2019 Conference Paper

Solving Large Extensive-Form Games with Strategy Constraints

  • Trevor Davis
  • Kevin Waugh
  • Michael Bowling

Extensive-form games are a common model for multiagent interactions with imperfect information. In two-player zero-sum games, the typical solution concept is a Nash equilibrium over the unconstrained strategy set for each player. In many situations, however, we would like to constrain the set of possible strategies. For example, constraints are a natural way to model limited resources, risk mitigation, safety, consistency with past observations of behavior, or other secondary objectives for an agent. In small games, optimal strategies under linear constraints can be found by solving a linear program; however, state-of-the-art algorithms for solving large games cannot handle general constraints. In this work we introduce a generalized form of Counterfactual Regret Minimization that provably finds optimal strategies under any feasible set of convex constraints. We demonstrate the effectiveness of our algorithm for finding strategies that mitigate risk in security games, and for opponent modeling in poker games when given only partial observations of private information.

RLDM Conference 2019 Conference Abstract

The Effect of Planning Shape on Dyna-style planning in High-dimensional State Spaces

  • Gordon Z Holland
  • Erin Talvitie
  • Michael Bowling

Dyna is a fundamental approach to model-based reinforcement learning (MBRL) that interleaves planning, acting, and learning in an online setting. In the most typical application of Dyna, the dynamics model is used to generate one-step transitions from selected start states from the agent’s history, which are used to update the agent’s value function or policy as if they were real experiences. In this work, one-step Dyna was applied to several games from the Arcade Learning Environment (ALE). We found that the model-based updates offered surprisingly little benefit over simply performing more updates with the agent’s existing experience, even when using a perfect model. We hypothesize that to get the most from planning, the model must be used to generate unfamiliar experience. To test this, we experimented with the shape of planning in multiple different concrete instantiations of Dyna, performing fewer, longer rollouts, rather than many short rollouts. We found that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models. In addition to these findings regarding Dyna in general, our results represent, to our knowledge, the first time that a learned dynamics model has been successfully used for planning in the ALE, suggesting that Dyna may be a viable approach to MBRL in the ALE and other high-dimensional problems.

AAAI Conference 2019 Conference Paper

Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games Using Baselines

  • Martin Schmid
  • Neil Burch
  • Marc Lanctot
  • Matej Moravcik
  • Rudolf Kadlec
  • Michael Bowling

Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order of magnitude speedup, while the empirical variance decreases by three orders of magnitude. The decreased variance allows for the first time CFR+ to be used with sampling, increasing the speedup to two orders of magnitude.
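
The baseline-corrected estimate at the heart of VR-MCCFR has the familiar control-variate form, written here in simplified notation ($a'$ is the sampled action, $q$ its sampling probability, $b$ the state-action baseline, and $\hat{u}$ the sampled downstream value):

```latex
\hat{v}(a) \;=\; b(a) \;+\; \frac{\mathbb{1}\{a = a'\}}{q(a)}\,\bigl(\hat{u}(a) - b(a)\bigr)
```

Taking the expectation over the sampled action shows the estimate is unbiased for any baseline, and its variance shrinks as $b$ approaches the true values, matching the perfect-baseline result stated above.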

NeurIPS Conference 2018 Conference Paper

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

  • Sriram Srinivasan
  • Marc Lanctot
  • Vinicius Zambaldi
  • Julien Perolat
  • Karl Tuyls
  • Remi Munos
  • Michael Bowling

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.

AAAI Conference 2018 Conference Paper

AIVAT: A New Variance Reduction Technique for Agent Evaluation in Imperfect Information Games

  • Neil Burch
  • Martin Schmid
  • Matej Moravcik
  • Dustin Morrill
  • Michael Bowling

Evaluating agent performance when outcomes are stochastic and agents use randomized strategies can be challenging when there is limited data available. The variance of sampled outcomes may make the simple approach of Monte Carlo sampling inadequate. This is the case for agents playing heads-up no-limit Texas hold’em poker, where man-machine competitions typically involve multiple days of consistent play by multiple players, but still can (and sometimes did) result in statistically insignificant conclusions. In this paper, we introduce AIVAT, a low variance, provably unbiased value assessment tool that exploits an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents. Unlike existing techniques which reduce the variance from chance events, or only consider game ending actions, AIVAT reduces the variance both from choices by nature and by players with a known strategy. The resulting estimator produces results that significantly outperform previous state of the art techniques. It was able to reduce the standard deviation of a Texas hold’em poker man-machine match by 85% and consequently requires 44 times fewer games to draw the same statistical conclusion. AIVAT enabled the first statistically significant AI victory against professional poker players in no-limit hold’em. Furthermore, the technique was powerful enough to produce statistically significant results versus individual players, not just an aggregate pool of the players. We also used AIVAT to analyze a short series of AI vs human poker tournaments, producing statistically significant results with as few as 28 matches.

JAIR Journal 2018 Journal Article

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

  • Marlos C. Machado
  • Marc G. Bellemare
  • Erik Talvitie
  • Joel Veness
  • Matthew Hausknecht
  • Michael Bowling

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
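
The sticky actions mechanism the article introduces is simple enough to state directly; the sketch below uses the repeat probability of 0.25 that the article recommends:

```python
import random

def sticky_action(prev_action, intended_action, stickiness=0.25):
    """With probability `stickiness`, the emulator repeats the previous
    action instead of the newly chosen one, injecting stochasticity into
    otherwise deterministic Atari 2600 games."""
    return prev_action if random.random() < stickiness else intended_action
```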

IJCAI Conference 2018 Conference Paper

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract)

  • Marlos C. Machado
  • Marc G. Bellemare
  • Erik Talvitie
  • Joel Veness
  • Matthew Hausknecht
  • Michael Bowling

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community. In this paper we take a big picture look at how the ALE is being used by the research community. We focus on how diverse the evaluation methodologies in the ALE have become and we highlight some key concerns when evaluating agents in this platform. We use this discussion to present what we consider to be the best practices for future evaluations in the ALE. To further the progress in the field, we also introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions.

RLDM Conference 2017 Conference Abstract

A Laplacian Framework for Option Discovery in Reinforcement Learning

  • Marlos C. Machado
  • Marc G. Bellemare
  • Michael Bowling

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-RL is a well-known approach for representation learning in MDPs. The representations learned with this framework are called proto-value functions (PVFs). In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do so by introducing the concepts of eigenpurposes and eigenbehaviors. Eigenpurposes are intrinsic reward functions that incentivize the agent to traverse the state space by following the principal directions of the learned representation. Each intrinsic reward function leads to a different eigenbehavior, which is the optimal policy for that reward function. We convert eigenbehaviors into options by defining an eigenbehavior's termination condition to be the moment the agent is no longer able to accumulate positive intrinsic reward. Our termination criterion is provably satisfiable in at least one state in every MDP. Intuitively, the options discovered from eigenpurposes traverse the principal directions of the state space. In this paper we show how exploration is greatly improved when the agent's action set is augmented by the options we discover, using as the performance metric the expected number of steps required to navigate between any two states in the MDP when following a random walk. This result is due to the fact that our options capture the diffusion process of a random walk, and that different options act at different time scales. We also demonstrate how the options we discover can be used to accumulate reward. Finally, we introduce a sample-based algorithm for option discovery in scenarios in which linear function approximation is required. We provide anecdotal evidence in Atari 2600 games that the options we discover clearly demonstrate intent, aiming at reaching specific locations on the screen or at executing specific behavioral patterns.
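
A sketch of the eigenpurpose intrinsic reward described above, with `e` a PVF eigenvector and `phi` the state representation: the agent is rewarded for moving along one principal direction of the learned representation:

```python
import numpy as np

def eigenpurpose_reward(e, phi_s, phi_next):
    """Intrinsic reward r_int = e^T (phi(s') - phi(s)); the corresponding
    eigenbehavior is the policy optimal for this reward, terminating once
    no positive intrinsic reward is achievable."""
    return float(e @ (phi_next - phi_s))
```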

IJCAI Conference 2016 Conference Paper

Action Selection for Hammer Shots in Curling

  • Zaheen Farraz Ahmad
  • Robert C. Holte
  • Michael Bowling

Curling is an adversarial two-player game with a continuous state and action space, and stochastic transitions. This paper focuses on one aspect of the full game, namely, finding the optimal "hammer shot," which is the last action taken before a score is tallied. We survey existing methods for finding an optimal action in a continuous, low-dimensional space with stochastic outcomes, and adapt a method based on Delaunay Triangulation to our application. Experiments using our curling physics simulator show that the adapted Delaunay Triangulation's shot selection outperforms other algorithms, and with some caveats, exceeds Olympic-level human performance.

IJCAI Conference 2016 Conference Paper

Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty

  • Timothy Yee
  • Viliam Lisy
  • Michael Bowling

Real world applications of artificial intelligence often require agents to sequentially choose actions from continuous action spaces with execution uncertainty. When good actions are sparse, domain knowledge is often used to identify a discrete set of promising actions. These actions and their uncertain effects are typically evaluated using a recursive search procedure. The reduction of the problem to a discrete search problem causes severe limitations, notably, not exploiting all of the sampled outcomes when evaluating actions, and not using outcomes to help find new actions outside the original set. We propose a new Monte Carlo tree search (MCTS) algorithm specifically designed for exploiting an execution model in this setting. Using kernel regression, it generalizes the information about action quality between actions and to unexplored parts of the action space. In a high fidelity simulator of the olympic sport of curling, we show that this approach significantly outperforms existing MCTS methods.
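
A sketch of the kernel regression estimate that lets the proposed MCTS generalize between nearby actions in a continuous action space (`kernel` is any similarity function, e.g. a Gaussian kernel):

```python
import numpy as np

def kernel_regression_value(a, actions, values, kernel):
    """Value estimate for a (possibly unexplored) action as the
    kernel-weighted average of the values of evaluated actions."""
    w = np.array([kernel(a, ai) for ai in actions])
    return float(w @ np.asarray(values) / w.sum())
```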

AAMAS Conference 2016 Conference Paper

State of the Art Control of Atari Games Using Shallow Reinforcement Learning

  • Yitao Liang
  • Marlos C. Machado
  • Erik Talvitie
  • Michael Bowling

The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN’s impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN’s network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves competitive performance to DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.

NeurIPS Conference 2016 Conference Paper

The Forget-me-not Process

  • Kieran Milan
  • Joel Veness
  • James Kirkpatrick
  • Michael Bowling
  • Anna Koop
  • Demis Hassabis

We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partition a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.

AAAI Conference 2015 Conference Paper

Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes

  • Pascal Poupart
  • Aarti Malhotra
  • Pei Pei
  • Kee-Eung Kim
  • Bongseok Goh
  • Michael Bowling

In many situations, it is desirable to optimize a sequence of decisions by maximizing a primary objective while respecting some constraints with respect to secondary objectives. Such problems can be naturally modeled as constrained partially observable Markov decision processes (CPOMDPs) when the environment is partially observable. In this work, we describe a technique based on approximate linear programming to optimize policies in CPOMDPs. The optimization is performed offline and produces a finite state controller with desirable performance guarantees. The approach outperforms a constrained version of point-based value iteration on a suite of benchmark problems.

AAAI Conference 2015 Conference Paper

Improving Exploration in UCT Using Local Manifolds

  • Sriram Srinivasan
  • Erik Talvitie
  • Michael Bowling

Monte Carlo planning has been proven successful in many sequential decision-making settings, but it suffers from poor exploration when the rewards are sparse. In this paper, we improve exploration in UCT by generalizing across similar states using a given distance metric. When the state space does not have a natural distance metric, we show how a local manifold can be learned from the transition graph of states in the near future to obtain a distance metric. On domains inspired by video games, empirical evidence shows that our algorithm is more sample efficient than UCT, particularly when rewards are sparse.

AAAI Conference 2015 Conference Paper

Optimal Estimation of Multivariate ARMA Models

  • Martha White
  • Junfeng Wen
  • Michael Bowling
  • Dale Schuurmans

Autoregressive moving average (ARMA) models are a fundamental tool in time series analysis that offer intuitive modeling capability and efficient predictors. Unfortunately, the lack of globally optimal parameter estimation strategies for these models remains a problem: application studies often adopt the simpler autoregressive model that can be easily estimated by maximizing (a posteriori) likelihood. We develop a (regularized, imputed) maximum likelihood criterion that admits efficient global estimation via structured matrix norm optimization methods. An empirical evaluation demonstrates the benefits of globally optimal parameter estimation over local and moment matching approaches.
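
For reference, the vector ARMA($p$, $q$) model class whose globally optimal estimation the paper targets, with innovations $\epsilon_t$ and coefficient matrices $A_i$ (autoregressive) and $B_j$ (moving average):

```latex
y_t \;=\; \sum_{i=1}^{p} A_i\, y_{t-i} \;+\; \sum_{j=1}^{q} B_j\, \epsilon_{t-j} \;+\; \epsilon_t
```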

AAAI Conference 2015 Conference Paper

Policy Tree: Adaptive Representation for Policy Gradient

  • Ujjwal Das Gupta
  • Erik Talvitie
  • Michael Bowling

Much of the focus on finding good representations in reinforcement learning has been on learning complex non-linear predictors of value. Policy gradient algorithms, which directly represent the policy, often need fewer parameters to learn good policies. However, they typically employ a fixed parametric representation that may not be sufficient for complex domains. This paper introduces the Policy Tree algorithm, which can learn an adaptive representation of policy in the form of a decision tree over different instantiations of a base policy. Policy gradient is used both to optimize the parameters and to grow the tree by choosing splits that enable the maximum local increase in the expected return of the policy. Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.

AAAI Conference 2015 Conference Paper

Solving Games with Functional Regret Estimation

  • Kevin Waugh
  • Dustin Morrill
  • James Bagnell
  • Michael Bowling

We propose a novel online learning method for minimizing regret in large extensive-form games. The approach learns a function approximator online to estimate the regret for choosing a particular action. A no-regret algorithm uses these estimates in place of the true regrets to define a sequence of policies. We prove the approach sound by providing a bound relating the quality of the function approximation and regret of the algorithm. A corollary is that the method is guaranteed to converge to a Nash equilibrium in self-play so long as the regrets are ultimately realizable by the function approximator. Our technique can be understood as a principled generalization of existing work on abstraction in large games; in our work, both the abstraction as well as the equilibrium are learned during self-play. We demonstrate empirically the method achieves higher quality strategies than state-of-the-art abstraction techniques given the same resources.
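
A sketch of the policy construction the abstract describes, with `regret_net` an illustrative online-trained estimator: predicted regrets simply replace true cumulative regrets inside regret matching:

```python
import numpy as np

def regret_estimation_policy(regret_net, state, actions):
    """Regret matching over estimated regrets: play proportionally to the
    positive part of the function approximator's predictions."""
    est = np.array([regret_net(state, a) for a in actions])
    pos = np.maximum(est, 0.0)
    if pos.sum() <= 0.0:
        return np.full(len(actions), 1.0 / len(actions))
    return pos / pos.sum()
```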

IJCAI Conference 2015 Conference Paper

Solving Heads-Up Limit Texas Hold'em

  • Oskari Tammelin
  • Neil Burch
  • Michael Johanson
  • Michael Bowling

Cepheus is the first computer program to essentially solve a game of imperfect information that is played competitively by humans. The game it plays is heads-up limit Texas hold’em poker, a game with over 10^14 information sets, and a challenge problem for artificial intelligence for over 10 years. Cepheus was trained using a new variant of Counterfactual Regret Minimization (CFR), called CFR+, using 4800 CPUs running for 68 days. In this paper we describe in detail the engineering details required to make this computation a reality. We also prove the theoretical soundness of CFR+ and its component algorithm, regret-matching+. We further give a hint towards understanding the success of CFR+ by proving a tracking regret bound for this new regret matching algorithm. We present results showing the role of the algorithmic components and the engineering choices in the success of CFR+.
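
A minimal sketch of regret-matching+, the component update that distinguishes CFR+ from vanilla CFR: regret-like values are clipped at zero after every update, so an action recovers quickly once it stops accumulating negative regret:

```python
import numpy as np

def regret_matching_plus(Q, instantaneous_regret):
    """One regret-matching+ update: clip accumulated regret-like values at
    zero, then play proportionally to them."""
    Q = np.maximum(Q + instantaneous_regret, 0.0)
    total = Q.sum()
    strategy = Q / total if total > 0 else np.full_like(Q, 1.0 / len(Q))
    return Q, strategy
```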

IJCAI Conference 2015 Conference Paper

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract)

  • Marc Bellemare
  • Yavar Naddaf
  • Joel Veness
  • Michael Bowling

In this extended abstract we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by presenting a benchmark set of domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. We conclude with a brief update on the latest ALE developments. All of the software, including the benchmark agents, is publicly available.

AAAI Conference 2014 Conference Paper

Solving Imperfect Information Games Using Decomposition

  • Neil Burch
  • Michael Johanson
  • Michael Bowling

Decomposition, i.e., independently analyzing possible subgames, has proven to be an essential principle for effective decision-making in perfect information games. However, in imperfect information games, decomposition has proven to be problematic. To date, all proposed techniques for decomposition in imperfect information games have abandoned theoretical guarantees. This work presents the first technique for decomposing an imperfect information game into subgames that can be solved independently, while retaining optimality guarantees on the full-game solution. We can use this technique to construct theoretically justified algorithms that make better use of information available at run-time, overcome memory or disk limitations at run-time, or make a time/space trade-off to overcome memory or disk limitations while solving a game. In particular, we present an algorithm for subgame solving which guarantees performance in the whole game, in contrast to existing methods which may have unbounded error. In addition, we present an offline game solving algorithm, CFR-D, which can produce a Nash equilibrium for a game that is larger than available storage.

AAAI Conference 2014 Conference Paper

Using Response Functions to Measure Strategy Strength

  • Trevor Davis
  • Neil Burch
  • Michael Bowling

Extensive-form games are a powerful tool for representing complex multi-agent interactions. Nash equilibrium strategies are commonly used as a solution concept for extensive-form games, but many games are too large for the computation of Nash equilibria to be tractable. In these large games, exploitability has traditionally been used to measure deviation from Nash equilibrium, and strategies therefore aim to achieve minimal exploitability. However, while exploitability measures a strategy’s worst-case performance, it fails to capture how likely that worst-case is to be observed in practice. In fact, empirical evidence has shown that a less exploitable strategy can perform worse than a more exploitable strategy in one-on-one play against a variety of opponents. In this work, we propose a class of response functions that can be used to measure the strength of a strategy. We prove that standard no-regret algorithms can be used to learn optimal strategies for a scenario where the opponent uses one of these response functions. We demonstrate the effectiveness of this technique in Leduc Hold’em against opponents that use the UCT Monte Carlo tree search algorithm.

AAAI Conference 2013 Conference Paper

Automating Collusion Detection in Sequential Games

  • Parisa Mazrooei
  • Christopher Archibald
  • Michael Bowling

Collusion is the practice of two or more parties deliberately cooperating to the detriment of others. While such behavior may be desirable in certain circumstances, in many it is considered dishonest and unfair. If agents otherwise hold strictly to the established rules, though, collusion can be challenging to police. In this paper, we introduce an automatic method for collusion detection in sequential games. We achieve this through a novel object, called a collusion table, that captures the effects of collusive behavior, i.e., advantage to the colluding parties, without assuming any particular pattern of behavior. We show the effectiveness of this method in the domain of poker, a popular game where collusion is prohibited.

AAMAS Conference 2013 Conference Paper

Rating Players in Games with Real-Valued Outcomes

  • Christopher Archibald
  • Neil Burch
  • Michael Bowling
  • Matthew Rutherford

Game-theoretic models typically associate outcomes with real-valued utilities, and rational agents are expected to maximize their expected utility. Currently fielded agent rating systems, which aim to order a population of agents by strength, focus exclusively on games with discrete outcomes, e.g., win-loss in two-agent settings or an ordering in the multi-agent setting. These rating systems are not well-suited for domains where the absolute magnitude of utility rather than just the relative value is important. We introduce the problem of rating agents in games with real-valued outcomes and survey applicable existing techniques for rating agents in this setting. We then propose a novel rating system and an extension for all of these rating systems to games with more than two agents, showing experimentally the advantages of our proposed system.

IJCAI Conference 2013 Conference Paper

Subset Selection of Search Heuristics

  • Chris Rayner
  • Nathan Sturtevant
  • Michael Bowling

Constructing a strong heuristic function is a central problem in heuristic search. A common approach is to combine a number of heuristics by maximizing over the values from each. If a limit is placed on this number, then a subset selection problem arises. We treat this as an optimization problem, and proceed by translating a natural loss function into a submodular and monotonic utility function under which greedy selection is guaranteed to be near-optimal. We then extend this approach with a sampling scheme that retains provable optimality. Our empirical results show large improvements over existing methods, and give new insight into building heuristics for directed domains.
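
A minimal sketch of the greedy selection this abstract alludes to, on synthetic data (not the paper's benchmarks): because taking a pointwise max over heuristics gives a monotone submodular utility, greedily adding the heuristic with the largest marginal gain carries a (1 - 1/e) near-optimality guarantee.

```python
import numpy as np

def greedy_heuristic_subset(H, k):
    """Pick k rows of H (heuristic values over sample problems) greedily,
    maximizing the sum of the pointwise max, which is monotone and
    submodular, so greedy selection is near-optimal."""
    chosen, current = [], np.zeros(H.shape[1])
    for _ in range(k):
        gains = [np.maximum(current, h).sum() - current.sum() for h in H]
        best = int(np.argmax(gains))
        chosen.append(best)
        current = np.maximum(current, H[best])
    return chosen, current

rng = np.random.default_rng(1)
true_dist = rng.uniform(5, 10, size=200)   # true distances on sample pairs
H = np.array([true_dist * rng.uniform(0.2, 1.0, size=200)
              for _ in range(12)])         # 12 admissible candidate heuristics
subset, combined = greedy_heuristic_subset(H, k=3)
print("chosen heuristics:", subset, "mean combined value:", combined.mean())
```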

AAMAS Conference 2012 Conference Paper

Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization

  • Michael Johanson
  • Nolan Bard
  • Marc Lanctot
  • Richard Gibson
  • Michael Bowling

Recently, there has been considerable progress towards algorithms for approximating Nash equilibrium strategies in extensive games. One such algorithm, Counterfactual Regret Minimization (CFR), has proven to be effective in two-player, zero-sum poker domains. While the basic algorithm is iterative and performs a full game traversal on each iteration, sampling-based approaches are possible. For instance, chance-sampled CFR considers just a single chance outcome per traversal, resulting in faster but less precise iterations. While more iterations are required, chance-sampled CFR requires less time overall to converge. In this work, we present new sampling techniques that consider sets of chance outcomes during each traversal to produce slower, more accurate iterations. By sampling only the public chance outcomes seen by all players, we take advantage of the imperfect information structure of the game to (i) avoid recomputation of strategy probabilities, and (ii) achieve an algorithmic speed improvement, performing O($n^2$) work at terminal nodes in O($n$) time. We demonstrate that this new CFR update converges more quickly than chance-sampled CFR in the large domains of poker and Bluff.

AAAI Conference 2012 Conference Paper

Finding Optimal Abstract Strategies in Extensive-Form Games

  • Michael Johanson
  • Nolan Bard
  • Neil Burch
  • Michael Bowling

Extensive-form games are a powerful model for representing interactions between agents. Nash equilibrium strategies are a common solution concept for extensive-form games and, in two-player zero-sum games, there are efficient algorithms for calculating such strategies. In large games, this computation may require too much memory and time to be tractable. A standard approach in such cases is to apply a lossy state-space abstraction technique to produce a smaller abstract game that can be tractably solved, while hoping that the resulting abstract game equilibrium is close to an equilibrium strategy in the unabstracted game. Recent work has shown that this assumption is unreliable, and an arbitrary Nash equilibrium in the abstract game is unlikely to be even near the least suboptimal strategy that can be represented in that space. In this work, we present for the first time an algorithm which efficiently finds optimal abstract strategies — strategies with minimal exploitability in the unabstracted game. We use this technique to find the least exploitable strategy ever reported for two-player limit Texas hold’em.

AAAI Conference 2012 Conference Paper

Generalized Sampling and Variance in Counterfactual Regret Minimization

  • Richard Gibson
  • Marc Lanctot
  • Neil Burch
  • Duane Szafron
  • Michael Bowling

In large extensive form games with imperfect information, Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing approximate Nash equilibria. While the base algorithm performs a full tree traversal on each iteration, Monte Carlo CFR (MCCFR) reduces the per iteration time cost by traversing just a sampled portion of the tree. On the other hand, MCCFR’s sampled values introduce variance, and the effects of this variance were previously unknown. In this paper, we generalize MCCFR by considering any generic estimator of the sought values. We show that any choice of an estimator can be used to probabilistically minimize regret, provided the estimator is bounded and unbiased. In addition, we relate the variance of the estimator to the convergence rate of an algorithm that calculates regret directly from the estimator. We demonstrate the application of our analysis by defining a new bounded, unbiased estimator with empirically lower variance than MCCFR estimates. Finally, we use this estimator in a new sampling algorithm to compute approximate equilibria in Goofspiel, Bluff, and Texas hold’em poker. Under each of our selected sampling schemes, our new algorithm converges faster than MCCFR.

AAAI Conference 2012 Conference Paper

Investigating Contingency Awareness Using Atari 2600 Games

  • Marc Bellemare
  • Joel Veness
  • Michael Bowling

Contingency awareness is the recognition that some aspects of a future observation are under an agent’s control while others are solely determined by the environment. This paper explores the idea of contingency awareness in reinforcement learning using the platform of Atari 2600 games. We introduce a technique for accurately identifying contingent regions and describe how to exploit this knowledge to generate improved features for value function approximation. We evaluate the performance of our techniques empirically, using 46 unseen, diverse, and challenging games for the Atari 2600 console. Our results suggest that contingency awareness is a generally useful concept for model-free reinforcement learning agents.

JMLR Journal 2012 Journal Article

Linear Fitted-Q Iteration with Multiple Reward Functions

  • Daniel J. Lizotte
  • Michael Bowling
  • Susan A. Murphy

We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
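
As a point of reference, here is a minimal single-reward, finite-horizon fitted-Q iteration with linear function approximation, i.e., the base case the paper generalizes; the multiple-reward treatment and dominated-action identification that are the paper's actual contributions are not shown.

```python
import numpy as np

def fitted_q_iteration(transitions, phi, n_actions, horizon, ridge=1e-3):
    """Finite-horizon fitted-Q: one ridge regression per (stage, action).
    transitions: list of (s, a, r, s_next or None); phi maps a state to features."""
    d = len(phi(transitions[0][0]))
    W = np.zeros((horizon + 1, n_actions, d))      # Q_h(s, a) = W[h, a] @ phi(s)
    for h in range(horizon - 1, -1, -1):
        for a in range(n_actions):
            X, y = [], []
            for s, act, r, s2 in transitions:
                if act != a:
                    continue
                backup = 0.0 if s2 is None else max(
                    W[h + 1, b] @ phi(s2) for b in range(n_actions))
                X.append(phi(s)); y.append(r + backup)
            if X:
                X, y = np.asarray(X), np.asarray(y)
                W[h, a] = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return W

# Toy usage: a 4-state chain where action 1 moves right and pays off at the end.
phi = lambda s: np.array([1.0, float(s)])
data = [(s, a, float(s == 3 and a == 1), None if s == 3 else min(s + a, 3))
        for s in range(4) for a in (0, 1) for _ in range(5)]
W = fitted_q_iteration(data, phi, n_actions=2, horizon=4)
print("Q_0(s=0, a):", [W[0, a] @ phi(0) for a in (0, 1)])
```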

NeurIPS Conference 2012 Conference Paper

Sketch-Based Linear Value Function Approximation

  • Marc Bellemare
  • Joel Veness
  • Michael Bowling

Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance of tug-of-war hashing.
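
The estimator's trick is easy to demonstrate: hash each feature to a bucket and to a random sign, so colliding features cancel in expectation instead of always inflating the estimate. A minimal sketch with hypothetical hashing choices (not the paper's implementation):

```python
import numpy as np

class TugOfWarSketch:
    """Project sparse binary feature indices into a small table using one
    hash for the bucket and a second hash for a +/-1 sign. The signs make
    collision noise zero-mean, so inner products are unbiased (unlike plain
    hashing, where collisions always add)."""
    def __init__(self, width, seed=0):
        self.width = width
        rng = np.random.default_rng(seed)
        self._bucket_key = int(rng.integers(1 << 31))
        self._sign_key = int(rng.integers(1 << 31))

    def _bucket(self, i):
        return hash((self._bucket_key, i)) % self.width

    def _sign(self, i):
        return 1.0 if hash((self._sign_key, i)) % 2 else -1.0

    def project(self, active_features):
        v = np.zeros(self.width)
        for i in active_features:
            v[self._bucket(i)] += self._sign(i)
        return v

sk = TugOfWarSketch(width=64)
a = sk.project(range(0, 50))    # two overlapping binary feature sets
b = sk.project(range(25, 75))
print("estimated overlap:", a @ b, "(true overlap: 25)")
```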

NeurIPS Conference 2012 Conference Paper

Tractable Objectives for Robust Policy Optimization

  • Katherine Chen
  • Michael Bowling

Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.
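
A toy sketch of the percentile-style family of objectives (names and numbers here are illustrative): draw N parameter settings from an arbitrary posterior sampler, evaluate the policy under each, and score it by the mean of its k worst outcomes.

```python
import numpy as np

def k_of_n_objective(policy_return, sample_params, k, n, rng):
    """Percentile-style robust objective: draw n parameter settings from
    the posterior sampler, evaluate the policy under each, and average
    the k worst returns."""
    draws = [sample_params(rng) for _ in range(n)]
    returns = np.sort([policy_return(p) for p in draws])
    return returns[:k].mean()

rng = np.random.default_rng(0)
sample_params = lambda rng: rng.normal(1.0, 0.5)    # posterior over an unknown parameter
policy_return = lambda p: 10.0 - 4.0 * max(p, 0.0)  # toy return of a fixed policy under p
print("5-of-20 robust value:",
      k_of_n_objective(policy_return, sample_params, k=5, n=20, rng=rng))
```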

IJCAI Conference 2011 Conference Paper

Accelerating Best Response Calculation in Large Extensive Games

  • Michael Johanson
  • Kevin Waugh
  • Michael Bowling
  • Martin Zinkevich

One fundamental evaluation criterion for an AI technique is its worst-case performance. For static strategies in extensive games, this can be computed using a best response computation. Conventionally, this requires a full game tree traversal. For very large games, such as poker, that traversal is infeasible to perform on modern hardware. In this paper, we detail a general technique for best response computations that can often avoid a full game tree traversal. Additionally, our method is specifically well-suited for parallel environments. We apply this approach to computing the worst-case performance of a number of strategies in heads-up limit Texas hold'em, which, prior to this work, was not possible. We explore these results thoroughly as they provide insight into the effects of abstraction on worst-case performance in large imperfect information games. This is a topic that has received much attention, but could not previously be examined outside of toy domains.

AAAI Conference 2011 Conference Paper

Euclidean Heuristic Optimization

  • D. Chris Rayner
  • Michael Bowling
  • Nathan Sturtevant

We pose the problem of constructing good search heuristics as an optimization problem: minimizing the loss between the true distances and the heuristic estimates subject to admissibility and consistency constraints. For a well-motivated choice of loss function, we show performing this optimization is tractable. In fact, it corresponds to a recently proposed method for dimensionality reduction. We prove this optimization is guaranteed to produce admissible and consistent heuristics, show that it generalizes and gives insight into differential heuristics, and demonstrate experimentally that it produces strong heuristics on problems from three distinct search domains.

NeurIPS Conference 2011 Conference Paper

Variance Reduction in Monte-Carlo Tree Search

  • Joel Veness
  • Marc Lanctot
  • Michael Bowling

Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can't Stop and Dominion.
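
Of the techniques named, control variates are the easiest to show in isolation: subtract from each rollout return a correlated statistic with known expectation, scaled by the estimated optimal coefficient. A toy sketch, detached from any MCTS machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(rng):
    """Toy stochastic evaluation: a return plus a correlated statistic
    whose expectation (zero) is known exactly."""
    luck = rng.normal()                              # shared randomness
    value = 5.0 + 2.0 * luck + rng.normal(scale=0.5)
    return value, luck

samples = [rollout(rng) for _ in range(10_000)]
v = np.array([s[0] for s in samples])
c = np.array([s[1] for s in samples])
beta = np.cov(v, c)[0, 1] / np.var(c)   # estimated optimal coefficient
adjusted = v - beta * (c - 0.0)         # subtract the known-mean statistic
print("plain standard error:", v.std() / np.sqrt(len(v)))
print("control-variate standard error:", adjusted.std() / np.sqrt(len(v)))
```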

IJCAI Conference 2009 Conference Paper

Learning a Value Analysis Tool for Agent Evaluation

  • Martha White
  • Michael Bowling

Evaluating an agent’s performance in a stochastic setting is necessary for agent development, scientific evaluation, and competitions. Traditionally, evaluation is done using Monte Carlo estimation; the magnitude of the stochasticity in the domain or the high cost of sampling, however, can often prevent the approach from resulting in statistically significant conclusions. Recently, an advantage sum technique has been proposed for constructing unbiased, low variance estimates of agent performance. The technique requires an expert to define a value function over states of the system, essentially a guess of the state’s unknown value. In this work, we propose learning this value function from past interactions between agents in some target population. Our learned value functions have two key advantages: they can be applied in domains where no expert value function is available and they can result in tuned evaluation for a specific population of agents (e.g., novice versus advanced agents). We demonstrate these two advantages in the domain of poker. We show that we can reduce variance over state-of-the-art estimators for a specific population of limit poker players as well as construct the first variance reducing estimators for no-limit poker and multi-player limit poker.

IJCAI Conference 2009 Conference Paper

Probabilistic State Translation in Extensive Games with Large Action Sets

  • David Schnizlein
  • Michael Bowling
  • Duane Szafron

Equilibrium or near-equilibrium solutions to very large extensive form games are often computed by using abstractions to reduce the game size. A common abstraction technique for games with a large number of available actions is to restrict the number of legal actions in every state. This method has been used to discover equilibrium solutions for the game of no-limit heads-up Texas Hold’em. When using a solution to an abstracted game to play one side in the un-abstracted (real) game, the real opponent actions may not correspond to actions in the abstracted game. The most popular method for handling this situation is to translate opponent actions in the real game to the closest legal actions in the abstracted game. We show that this approach can result in a very exploitable player and propose an alternative solution. We use probabilistic mapping to translate a real action into a probability distribution over actions, whose weights are determined by a similarity metric. We show that this approach significantly reduces the exploitability when using an abstract solution in the real game.
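
A minimal sketch of the idea; the similarity metric used here (a log-scale distance with a temperature) is a hypothetical stand-in for the paper's metric. Instead of snapping a real bet to the nearest abstract bet, the observed action is spread over nearby abstract bets in proportion to similarity.

```python
import math

def translate(real_bet, abstract_bets, temperature=1.0):
    """Map a real action (bet size) to a probability distribution over
    abstract actions, weighted by similarity to the observed bet."""
    weights = [math.exp(-abs(math.log(real_bet / b)) / temperature)
               for b in abstract_bets]
    total = sum(weights)
    return {b: w / total for b, w in zip(abstract_bets, weights)}

# A 130-chip bet is split mostly between the 100 and 200 abstract bets.
print(translate(real_bet=130, abstract_bets=[50, 100, 200, 400]))
```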

AAMAS Conference 2009 Conference Paper

Abstraction Pathologies in Extensive Games

  • Kevin Waugh
  • David Schnizlein
  • Michael Bowling
  • Duane Szafron

Extensive games can be used to model many scenarios in which multiple agents interact with an environment. There has been considerable recent research on finding strong strategies in very large, zero-sum extensive games. The standard approach in such work is to employ abstraction techniques to derive a more tractably sized game. An extensive game solver is then employed to compute an equilibrium in that abstract game, and the resulting strategy is presumed to be strong in the full game. Progress in this line of research has focused on solving larger abstract games, which more closely resemble the full game. However, there is an underlying assumption that by abstracting less, and solving a larger game, an agent will have a stronger strategy in the full game. In this work we show that this assumption is not true in general. Refining an abstraction can actually lead to a weaker strategy. We show examples of these abstraction pathologies in a small game of poker that can be analyzed exactly. These examples show that pathologies arise when abstracting both chance nodes as well as a player’s actions. In summary, this paper shows that the standard approach to finding strong strategies for large extensive games rests on shaky ground.

NeurIPS Conference 2009 Conference Paper

Monte Carlo Sampling for Regret Minimization in Extensive Games

  • Marc Lanctot
  • Kevin Waugh
  • Martin Zinkevich
  • Michael Bowling

Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR in expectation. Then, we introduce two sampling schemes, outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new, tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR's bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games.

NeurIPS Conference 2009 Conference Paper

Strategy Grafting in Extensive Games

  • Kevin Waugh
  • Nolan Bard
  • Michael Bowling

Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is used in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement.

AAMAS Conference 2008 Conference Paper

Autonomous Geocaching: Navigation and Goal Finding in Outdoor Domains

  • James Neufeld
  • Michael Bowling
  • Jason Roberts
  • Stephen Walsh
  • Adam Milstein
  • Michael Sokolsky

This paper describes an autonomous robot system designed to solve the challenging task of geocaching. Geocaching involves locating a goal object in an outdoor environment given only its rough GPS position. No additional information about the environment such as road maps, waypoints, or obstacle descriptions is provided, nor is there often a simple straight-line path to the object. This is in contrast to much of the research in robot navigation which often focuses on common structural features, e.g., road following, curb avoidance, or indoor navigation. In addition, uncertainty in GPS positions requires a final local search of the target area after completing the challenging navigation problem. We describe a relatively simple robotic system for completing this task. This system addresses three main issues: building a map from raw sensor readings, navigating to the target region, and searching for the target object. We demonstrate the effectiveness of this system in a variety of complex outdoor environments and compare our system’s performance to that of a human expert teleoperating the robot.

AAMAS Conference 2008 Conference Paper

Sigma Point Policy Iteration

  • Michael Bowling
  • Alborz Geramifard
  • David Wingate

In reinforcement learning, least-squares temporal difference methods (e.g., LSTD and LSPI) are effective, data-efficient techniques for policy evaluation and control with linear value function approximation. These algorithms rely on policy-dependent expectations of the transition and reward functions, which require all experience to be remembered and iterated over for each new policy evaluated. We propose to summarize experience with a compact policy-independent Gaussian model. We show how this policy-independent model can be transformed into a policy-dependent form and used to perform policy evaluation. Because closed-form transformations are rarely available, we introduce an efficient sigma point approximation. We show that the resulting Sigma-Point Policy Iteration algorithm (SPPI) is mathematically equivalent to LSPI for tabular representations and empirically demonstrate comparable performance for approximate representations. However, the experience does not need to be saved or replayed, meaning that for even moderate amounts of experience, SPPI is an order of magnitude faster than LSPI.
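
The sigma-point approximation at the heart of SPPI comes from the unscented transform. A generic sketch (not the SPPI-specific transformation): choose 2n+1 deterministic points that exactly match a Gaussian's mean and covariance, push each through the nonlinearity, and recombine with fixed weights.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Deterministic sigma-point set for a Gaussian: 2n+1 points whose
    weighted empirical mean and covariance match (mean, cov) exactly."""
    n = len(mean)
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean] + [mean + L[:, i] for i in range(n)] \
                 + [mean - L[:, i] for i in range(n)]
    weights = np.array([kappa / (n + kappa)] + [1.0 / (2 * (n + kappa))] * (2 * n))
    return np.array(pts), weights

mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
pts, w = sigma_points(mean, cov)
f = lambda x: np.array([np.sin(x[0]), x[1] ** 2])   # some nonlinear map
approx_mean = (w[:, None] * np.array([f(p) for p in pts])).sum(axis=0)
print("sigma-point estimate of E[f(x)]:", approx_mean)
```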

IJCAI Conference 2007 Conference Paper

Automatic Gait Optimization with Gaussian Process Regression

  • Daniel Lizotte
  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans

Gait optimization is a basic yet challenging problem for both quadrupedal and bipedal robots. Although techniques for automating the process exist, most involve local function optimization procedures that suffer from three key drawbacks. Local optimization techniques are naturally plagued by local optima, make no use of the expensive gait evaluations once a local step is taken, and do not explicitly model noise in gait evaluation. These drawbacks increase the need for a large number of gait evaluations, making optimization slow, data-inefficient, and manually intensive. We present a Bayesian approach based on Gaussian process regression that addresses all three drawbacks. It uses a global search strategy based on a posterior model inferred from all of the individual noisy evaluations. We demonstrate the technique on a quadruped robot, using it to optimize two different criteria: speed and smoothness. We show in both cases our technique requires dramatically fewer gait evaluations than state-of-the-art local gradient approaches.
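
A compact sketch of the skeleton of such an approach: one gait parameter, a synthetic "speed" function, and expected improvement as the acquisition rule (the paper studies related improvement criteria on a real quadruped). Each iteration fits a GP posterior to all noisy trials so far and picks the next trial globally rather than by a local step.

```python
import math
import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """GP regression posterior mean/std at query points Xs (RBF kernel)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mu - best) * Phi + sigma * phi

# Toy "gait speed" as a function of one gait parameter, observed with noise.
speed = lambda p: math.sin(3 * p) + 0.5 * p
rng = np.random.default_rng(0)
X = list(rng.uniform(0, 2, size=3))
y = [speed(x) + 0.1 * rng.normal() for x in X]
grid = np.linspace(0, 2, 200)
for _ in range(10):                    # each iteration = one expensive gait trial
    mu, sd = gp_posterior(np.array(X), np.array(y), grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, max(y)))]
    X.append(x_next); y.append(speed(x_next) + 0.1 * rng.normal())
print("best parameter found:", X[int(np.argmax(y))])
```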

NeurIPS Conference 2007 Conference Paper

Computing Robust Counter-Strategies

  • Michael Johanson
  • Martin Zinkevich
  • Michael Bowling

Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counter-strategy to the inferred posterior of the other agents' behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents' expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold'em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses.

NeurIPS Conference 2007 Conference Paper

Regret Minimization in Games with Incomplete Information

  • Martin Zinkevich
  • Michael Johanson
  • Michael Bowling
  • Carmelo Piccione

Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold’em with as many as $10^{12}$ states, two orders of magnitude larger than previous methods.
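
Counterfactual regret minimization builds on regret matching, which is simple to demonstrate in a normal-form game: in rock-paper-scissors self-play, the average strategies converge to the Nash equilibrium. CFR applies the same update at every information set, using counterfactual values in place of the payoffs below.

```python
import numpy as np

def regret_matching(cum_regret):
    pos = np.maximum(cum_regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(pos), 1 / len(pos))

# Rock-paper-scissors payoff matrix for player 1; regret-matching self-play.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
R1, R2 = np.zeros(3), np.zeros(3)
avg1 = np.zeros(3)
T = 20_000
for _ in range(T):
    p1, p2 = regret_matching(R1), regret_matching(R2)
    u1 = A @ p2        # expected payoff of each pure action for player 1
    u2 = -A.T @ p1     # and for player 2 (zero-sum)
    R1 += u1 - p1 @ u1 # regret for not having played each pure action
    R2 += u2 - p2 @ u2
    avg1 += p1
print("average strategy, converging to (1/3, 1/3, 1/3):", avg1 / T)
```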

NeurIPS Conference 2007 Conference Paper

Stable Dual Dynamic Programming

  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans
  • Daniel Lizotte

Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.

AAAI Conference 2006 Conference Paper

Compact, Convex Upper Bound Iteration for Approximate POMDP Planning

  • Tao Wang
  • Michael Bowling

Partially observable Markov decision processes (POMDPs) are an intuitive and general way to model sequential decision making problems under uncertainty. Unfortunately, even approximate planning in POMDPs is known to be hard, and developing heuristic planners that can deliver reasonable results in practice has proved to be a significant challenge. In this paper, we present a new approach to approximate value-iteration for POMDP planning that is based on quadratic rather than piecewise linear function approximators. Specifically, we approximate the optimal value function by a convex upper bound composed of a fixed number of quadratics, and optimize it at each stage by semidefinite programming. We demonstrate that our approach can achieve competitive approximation quality to current techniques while still maintaining a bounded size representation of the function approximator. Moreover, an upper bound on the optimal value function can be preserved if required. Overall, the technique requires computation time and space that is only linear in the number of iterations (horizon time).

NeurIPS Conference 2006 Conference Paper

iLSTD: Eligibility Traces and Convergence Analysis

  • Alborz Geramifard
  • Michael Bowling
  • Martin Zinkevich
  • Richard Sutton

We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O($n^2$), where $n$ is the number of parameters in the linear function approximator, while iLSTD is O($n$). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough ($n = 10,000$) to show substantial computational advantages over LSTD.
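
A minimal sketch of one iLSTD time step (illustrative; the step-size and dimension-selection schedule follow the paper only loosely): fold the transition into the LSTD statistics A and b, then descend only along the coordinate with the largest residual mu = b - A @ theta.

```python
import numpy as np

def ilstd_step(A, b, mu, theta, phi, phi_next, reward,
               gamma=0.99, alpha=0.01, m=1):
    """Fold one transition into the LSTD statistics, then take m
    single-coordinate descent steps on the residual mu = b - A @ theta."""
    d = phi - gamma * phi_next
    A += np.outer(phi, d)                    # A += phi (phi - gamma phi')^T
    b += reward * phi
    mu += reward * phi - phi * (d @ theta)   # keep mu = b - A @ theta current
    for _ in range(m):
        j = int(np.argmax(np.abs(mu)))       # dimension with largest residual
        delta = alpha * mu[j]
        theta[j] += delta
        mu -= delta * A[:, j]                # O(n) residual maintenance
    return A, b, mu, theta

# Toy usage with sparse binary features.
rng = np.random.default_rng(0)
n = 50
A, b = np.zeros((n, n)), np.zeros(n)
mu, theta = np.zeros(n), np.zeros(n)
for _ in range(1000):
    phi = (rng.random(n) < 0.1).astype(float)
    phi2 = (rng.random(n) < 0.1).astype(float)
    A, b, mu, theta = ilstd_step(A, b, mu, theta, phi, phi2, reward=rng.normal())
print("largest |theta| component:", np.abs(theta).max())
```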

AAAI Conference 2006 Conference Paper

Subjective Mapping

  • Michael Bowling

Extracting a map from a stream of experience is a key problem in robotics and artificial intelligence in general. We propose a technique, called subjective mapping, that seeks to learn a fully specified predictive model, or map, without the need for expert provided models of the robot’s motion and sensor apparatus. We briefly overview the recent advancements presented elsewhere (ICML, IJCAI, and ISRR) that make this possible, examine its significance in relationship to other developments in the field, and outline open issues that remain to be addressed.

AAAI Conference 2005 Conference Paper

Coordination and Adaptation in Impromptu Teams

  • Michael Bowling

Coordinating a team of autonomous agents is one of the major challenges in building effective multiagent systems. Many techniques have been devised for this problem, and coordinated teamwork has been demonstrated even in highly dynamic and adversarial environments. A key assumption of these techniques, though, is that the team members are developed together as a whole. In many multiagent scenarios, this assumption is violated. We study the problem of coordination in impromptu teams, where a team is composed of independent agents each unknown to the others. The team members have their own skills, models, strategies, and coordination mechanisms, and no external organization is imposed upon them. In particular, we propose two techniques, one adaptive and one predictive, for coordinating a single agent that joins an unknown team of existing agents. We experimentally evaluate these mechanisms in the robot soccer domain, while introducing useful baselines for evaluating the performance of impromptu teams. We show some encouraging success while demonstrating this is a very fertile area of research.

IJCAI Conference 2005 Conference Paper

Learning Subjective Representations for Planning

  • Dana Wilkinson
  • Michael Bowling
  • Ali Ghodsi

Planning involves using a model of an agent’s actions to find a sequence of decisions which achieve a desired goal. It is usually assumed that the models are given, and such models often require expert knowledge of the domain. This paper explores subjective representations for planning that are learned directly from agent observations and actions (requiring no initial domain knowledge). A non-linear embedding technique called Action Respecting Embedding is used to construct such a representation. It is then shown how to extract the effects of the agent’s actions as operators in this learned representation. Finally, the learned representation and operators are combined with search to find sequences of actions that achieve given goals. The efficacy of this technique is demonstrated in a challenging robot-vision-inspired image domain.

NeurIPS Conference 2005 Conference Paper

Online Discovery and Learning of Predictive State Representations

  • Peter Mccracken
  • Michael Bowling

Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state-of-the-art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics.

NeurIPS Conference 2004 Conference Paper

Convergence and No-Regret in Multiagent Learning

  • Michael Bowling

Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner's particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).

IJCAI Conference 2003 Conference Paper

Simultaneous Adversarial Multi-Robot Learning

  • Michael Bowling
  • Manuela Veloso

Multi-robot learning faces all of the challenges of robot learning with all of the challenges of multiagent learning. There has been a great deal of recent research on multiagent reinforcement learning in stochastic games, which is the intuitive extension of MDPs to multiple agents. This recent work, although general, has only been applied to small games with at most hundreds of states. On the other hand robot tasks have continuous, and often complex, state and action spaces. Robot learning tasks demand approximation and generalization techniques, which have only received extensive attention in single-agent learning. In this paper we introduce GraWoLF, a general-purpose, scalable, multiagent learning algorithm. It combines gradient-based policy learning techniques with the WoLF ("Win or Learn Fast") variable learning rate. We apply this algorithm to an adversarial multi-robot task with simultaneous learning. We show results of learning both in simulation and on the real robots. These results demonstrate that GraWoLF can learn successful policies, overcoming the many challenges in multi-robot learning.

IJCAI Conference 1999 Conference Paper

Bounding the Suboptimality of Reusing Subproblems

  • Michael Bowling
  • Manuela Veloso

We are interested in the problem of determining a course of action to achieve a desired objective in a non-deterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies within this formulation. This framework has also been used to study state space abstraction, problem decomposition, and policy reuse. These techniques sacrifice optimality of their solution for improved learning speed. In this paper we examine the suboptimality of reusing policies that are solutions to subproblems. This is done within a restricted class of MDPs, namely those where non-zero reward is received only upon reaching a goal state. We introduce the definition of a subproblem within this class and provide motivation for how reuse of subproblem solutions can speed up learning. The contribution of this paper is the derivation of a tight bound on the loss in optimality from this reuse. We examine a bound that is based on Bellman error, which applies to all MDPs, but is not tight enough to be useful. We contribute our own theoretical result that gives an empirically tight bound on this suboptimality.