Arrow Research search

Author name cluster

Michael Bowling

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

83 papers
1 author row

Possible papers

83

TMLR Journal 2025 Journal Article

Learning to Be Cautious

  • Montaser Mohammedalamen
  • Dustin Morrill
  • Alexander Sieusahai
  • Yash Satsangi
  • Michael Bowling

A key challenge in the field of reinforcement learning is to develop agents that behave cautiously in novel situations. It is generally impossible to anticipate all situations that an autonomous system may face or what behavior would best avoid bad outcomes. An agent that could learn to be cautious would overcome this challenge by discovering for itself when and how to behave cautiously. In contrast, current approaches typically embed task-specific safety information or explicitly cautious behaviors into the system, which is error-prone and imposes extra burdens on practitioners. In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious and an algorithm to demonstrate that it is possible for a system to learn to be cautious. The essential features of our algorithm are that it characterizes reward function uncertainty without task-specific safety information and uses this uncertainty to construct a robust policy. Specifically, we construct robust policies with a $k$-of-$N$ counterfactual regret minimization (CFR) subroutine given a learned reward function uncertainty represented by a neural network ensemble belief. These policies exhibit caution in each of our tasks without any task-specific safety tuning. Our code is available at https://github.com/montaserFath/Learning-to-be-Cautious
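
A minimal sketch of the $k$-of-$N$ robustness objective described in the abstract, assuming a hypothetical `evaluate_return(policy, reward_fn)` evaluator and a list of reward models standing in for the ensemble belief; the paper optimizes this quantity with a CFR subroutine rather than by direct search:

```python
import numpy as np

def k_of_n_objective(policy, reward_ensemble, evaluate_return, k, n, rng):
    """Score a policy by the mean of its k worst returns over N reward
    functions sampled from the ensemble belief: the robustness target
    that k-of-N CFR optimizes. All interfaces here are illustrative."""
    idx = rng.integers(len(reward_ensemble), size=n)
    returns = np.array([evaluate_return(policy, reward_ensemble[i]) for i in idx])
    return np.sort(returns)[:k].mean()  # average of the k lowest of N returns
```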

NeurIPS Conference 2025 Conference Paper

Plasticity as the Mirror of Empowerment

  • David Abel
  • Michael Bowling
  • Andre Barreto
  • Will Dabney
  • Shi Dong
  • Steven Hansen
  • Anna Harutyunyan
  • Khimya Khetarpal

Agents are minimally entities that are influenced by their past observations and act to influence future observations. The latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. The former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
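
For reference, the directed information of Massey (1990), which the generalized directed information above strictly extends, accumulates the causally conditioned dependence of one process on another over time:

```latex
I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I\!\left(X^t;\, Y_t \mid Y^{t-1}\right)
```

where $X^t = (X_1, \ldots, X_t)$ denotes a process prefix; the generalized quantity itself is defined in the paper and not reproduced here.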

RLC Conference 2025 Conference Paper

Rethinking the Foundations for Continual Reinforcement Learning

  • Esraa Elelimy
  • David Szepesvari
  • Martha White
  • Michael Bowling

In the traditional view of reinforcement learning, the agent's goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, the learning ends. This view contrasts with *continual reinforcement learning*, where learning does not end, and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two paradigms of learning, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of the traditional reinforcement learning foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and the third foundations and replaces them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shed the other two foundations.

RLJ Journal 2025 Journal Article

Rethinking the Foundations for Continual Reinforcement Learning

  • Esraa Elelimy
  • David Szepesvari
  • Martha White
  • Michael Bowling

In the traditional view of reinforcement learning, the agent's goal is to find an optimal policy that maximizes its expected sum of rewards. Once the agent finds this policy, the learning ends. This view contrasts with *continual reinforcement learning*, where learning does not end, and agents are expected to continually learn and adapt indefinitely. Despite the clear distinction between these two paradigms of learning, much of the progress in continual reinforcement learning has been shaped by foundations rooted in the traditional view of reinforcement learning. In this paper, we first examine whether the foundations of traditional reinforcement learning are suitable for the continual reinforcement learning paradigm. We identify four key pillars of the traditional reinforcement learning foundations that are antithetical to the goals of continual learning: the Markov decision process formalism, the focus on atemporal artifacts, the expected sum of rewards as an evaluation metric, and episodic benchmark environments that embrace the other three foundations. We then propose a new formalism that sheds the first and the third foundations and replaces them with the history process as a mathematical formalism and a new definition of deviation regret, adapted for continual learning, as an evaluation metric. Finally, we discuss possible approaches to shed the other two foundations.

NeurIPS Conference 2024 Conference Paper

A Method for Evaluating Hyperparameter Sensitivity in Reinforcement Learning

  • Jacob Adkins
  • Michael Bowling
  • Adam White

The performance of modern reinforcement learning algorithms critically relies on tuning ever-increasing numbers of hyperparameters. Often, small changes in a hyperparameter can lead to drastic changes in performance, and different environments require very different hyperparameter settings to achieve the state-of-the-art performance reported in the literature. We currently lack a scalable and widely accepted approach to characterizing these complex interactions. This work proposes a new empirical methodology for studying, comparing, and quantifying the sensitivity of an algorithm’s performance to hyperparameter tuning for a given set of environments. We then demonstrate the utility of this methodology by assessing the hyperparameter sensitivity of several commonly used normalization variants of PPO. The results suggest that several algorithmic performance improvements may, in fact, be a result of an increased reliance on hyperparameter tuning.
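
One plausible instantiation of such a sensitivity measure, sketched under the assumption of a completed score table `perf[env, config]`; the paper's exact definition and normalization may differ:

```python
import numpy as np

def hyperparameter_sensitivity(perf):
    """Gap between tuning hyperparameters per environment and committing
    to one configuration across all environments. perf[e, c] is the
    performance of config c on environment e; a larger gap suggests the
    algorithm leans harder on per-environment tuning."""
    per_env_best = perf.max(axis=1).mean()  # tune separately for each env
    best_single = perf.mean(axis=0).max()   # best single config overall
    return per_env_best - best_single
```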

NeurIPS Conference 2024 Conference Paper

Beyond Optimism: Exploration With Partially Observable Rewards

  • Simone Parisi
  • Alireza Kazemipour
  • Michael Bowling

Exploration in reinforcement learning (RL) remains an open challenge. RL algorithms rely on observing rewards to train the agent, and if informative rewards are sparse the agent learns slowly or may not learn at all. To improve exploration and reward discovery, popular algorithms rely on optimism. But what if rewards are sometimes unobservable, e.g., in situations of partial monitoring in bandits or the recent formalism of monitored Markov decision processes? In this case, optimism can lead to suboptimal behavior in which the agent does not explore further to collapse its uncertainty. With this paper, we present a novel exploration strategy that overcomes the limitations of existing methods and guarantees convergence to an optimal policy even when rewards are not always observable. We further propose a collection of tabular environments for benchmarking exploration in RL (with and without unobservable rewards) and show that our method outperforms existing ones.

AAAI Conference 2024 Conference Paper

Learning Not to Regret

  • David Sychrovský
  • Michal Šustr
  • Elnaz Davoodi
  • Michael Bowling
  • Marc Lanctot
  • Martin Schmid

The literature on game-theoretic equilibrium finding predominantly focuses on single games or their repeated play. Nevertheless, numerous real-world scenarios feature playing a game sampled from a distribution of similar, but not identical games, such as playing poker with different public cards or trading correlated assets on the stock market. As these similar games feature similar equilibria, we investigate a way to accelerate equilibrium finding on such a distribution. We present a novel "learning not to regret" framework, enabling us to meta-learn a regret minimizer tailored to a specific distribution. Our key contribution, Neural Predictive Regret Matching, is uniquely meta-learned to converge rapidly for the chosen distribution of games, while having regret minimization guarantees on any game. We validated our algorithms' faster convergence on a distribution of river poker games. Our experiments show that the meta-learned algorithms outpace their non-meta-learned counterparts, achieving more than tenfold improvements.
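
As background, a minimal sketch of one predictive regret matching step, where a prediction of the next regret biases the current strategy; in Neural Predictive Regret Matching that prediction comes from the meta-learned network, while here it is simply an input:

```python
import numpy as np

def predictive_regret_matching_strategy(cum_regret, predicted_regret):
    """Play proportionally to the positive part of cumulative regret plus
    a prediction of the upcoming regret; with a zero prediction this
    reduces to ordinary regret matching."""
    r = np.maximum(cum_regret + predicted_regret, 0.0)
    total = r.sum()
    if total <= 0.0:  # no positive regret: fall back to uniform play
        return np.full_like(r, 1.0 / len(r))
    return r / total
```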

JAIR Journal 2024 Journal Article

Mitigating Value Hallucination in Dyna-Style Planning via Multistep Predecessor Models

  • Farzane Aminmansour
  • Taher Jafferjee
  • Ehsan Imani
  • Erin J. Talvitie
  • Michael Bowling
  • Martha White

Dyna-style reinforcement learning (RL) agents improve sample efficiency over model-free RL agents by updating the value function with simulated experience generated by an environment model. However, it is often difficult to learn accurate models of environment dynamics, and even small errors may result in failure of Dyna agents. In this paper, we highlight that one potential cause of that failure is bootstrapping off of the values of simulated states, and introduce a new Dyna algorithm to avoid this failure. We discuss a design space of Dyna algorithms, based on using successor or predecessor models (simulating forwards or backwards) and using one-step or multi-step updates. Three of the variants have been explored, but surprisingly the fourth variant has not: using predecessor models with multi-step updates. We present the *Hallucinated Value Hypothesis* (HVH): updating the values of real states toward values of simulated states can result in misleading action values which adversely affect the control policy. We discuss and evaluate all four variants of Dyna, three of which update real states toward simulated states — so potentially toward hallucinated values — along with our proposed approach, which does not. The experimental results provide evidence for the HVH, and suggest that using predecessor models with multi-step updates is a fruitful direction toward developing Dyna algorithms that are more robust to model error.
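
A tabular sketch of the previously unexplored variant the abstract highlights, assuming hypothetical `predecessor_model` and `reward_model` interfaces: simulated predecessors are updated toward a return grounded in a real state's value, so no real state bootstraps off a hallucinated one:

```python
def backward_multistep_dyna_update(V, real_state, predecessor_model,
                                   reward_model, alpha, gamma, m):
    """Roll a predecessor model backwards from a real state for m steps,
    updating each simulated predecessor toward a multi-step return whose
    bootstrap value comes from real experience."""
    target = V[real_state]  # grounded in a real state's value
    s = real_state
    for _ in range(m):
        prev = predecessor_model.sample(s)   # simulated predecessor of s
        target = reward_model(prev, s) + gamma * target
        V[prev] += alpha * (target - V[prev])  # update the simulated state
        s = prev
    return V
```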

AAMAS Conference 2024 Conference Paper

Monitored Markov Decision Processes

  • Simone Parisi
  • Montaser Mohammedalamen
  • Alireza Kazemipour
  • Matthew E. Taylor
  • Michael Bowling

In reinforcement learning (RL), an agent learns to perform a task by interacting with an environment and receiving feedback (a numerical reward) for its actions. However, the assumption that rewards are always observable is often not applicable in real-world problems. For example, the agent may need to ask a human to supervise its actions or activate a monitoring system to receive feedback. There may even be a period of time before rewards become observable, or a period of time after which rewards are no longer given. In other words, there are cases where the environment generates rewards in response to the agent’s actions but the agent cannot observe them. In this paper, we formalize a novel but general RL framework — Monitored MDPs — where the agent cannot always observe rewards. We discuss the theoretical and practical consequences of this setting, show challenges raised even in toy environments, and propose algorithms to begin to tackle this novel setting. This paper introduces a powerful new formalism that encompasses both new and existing problems and lays the foundation for future research.
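
A sketch of the interaction protocol this formalism implies, with illustrative `env`, `monitor`, and `agent` interfaces (not the paper's API): the environment always generates a reward, but the agent only observes it when the monitor exposes it:

```python
def monitored_interaction(env, monitor, agent, num_steps):
    """Mon-MDP-style loop: rewards exist at every step but may be hidden,
    in which case the agent receives None in their place."""
    obs = env.reset()
    for _ in range(num_steps):
        action = agent.act(obs)
        next_obs, reward = env.step(action)  # reward is always generated...
        observed = reward if monitor.visible(obs, action) else None
        agent.update(obs, action, observed, next_obs)  # ...but maybe unseen
        obs = next_obs
```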

NeurIPS Conference 2024 Conference Paper

Real-Time Recurrent Learning using Trace Units in Reinforcement Learning

  • Esraa Elelimy
  • Adam White
  • Michael Bowling
  • Martha White

Recurrent Neural Networks (RNNs) are used to learn representations in partially observable environments. For agents that learn online and continually interact with the environment, it is desirable to train RNNs with real-time recurrent learning (RTRL); unfortunately, RTRL is prohibitively expensive for standard RNNs. A promising direction is to use linear recurrent architectures (LRUs), where dense recurrent weights are replaced with a complex-valued diagonal, making RTRL efficient. In this work, we build on these insights to provide a lightweight but effective approach for training RNNs in online RL. We introduce Recurrent Trace Units (RTUs), a small modification on LRUs that we nonetheless find to have significant performance benefits over LRUs when trained with RTRL. We find RTUs significantly outperform GRUs and Transformers across several partially observable environments while using significantly less computation.
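
To illustrate why a diagonal recurrence makes RTRL cheap, here is a real-valued sketch of one recurrent step together with its RTRL sensitivity (LRUs and RTUs use a complex-valued diagonal; the interfaces are illustrative):

```python
import numpy as np

def diagonal_rtrl_step(h, s, lam, W, x):
    """For h_t = lam * h_{t-1} + W @ x_t with diagonal recurrence lam, the
    sensitivity s_t = dh_t/dlam obeys the elementwise recursion
    s_t = h_{t-1} + lam * s_{t-1}, so RTRL costs O(hidden size) per step
    instead of the prohibitive cost of full RTRL."""
    s_new = h + lam * s       # uses h_{t-1}, before the state is updated
    h_new = lam * h + W @ x
    return h_new, s_new
```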

IJCAI Conference 2023 Conference Paper

Rethinking Formal Models of Partially Observable Multiagent Decision Making (Extended Abstract)

  • Vojtěch Kovařík
  • Martin Schmid
  • Neil Burch
  • Michael Bowling
  • Viliam Lisý

Multiagent decision-making in partially observable environments is usually modelled as either an extensive-form game (EFG) in game theory or a partially observable stochastic game (POSG) in multiagent reinforcement learning (MARL). One issue with the current situation is that while most practical problems can be modelled in both formalisms, the relationship of the two models is unclear, which hinders the transfer of ideas between the two communities. A second issue is that while EFGs have recently seen significant algorithmic progress, their classical formalization is unsuitable for efficient presentation of the underlying ideas, such as those around decomposition. To solve the first issue, we introduce factored-observation stochastic games (FOSGs), a minor modification of the POSG formalism which distinguishes between private and public observation and thereby greatly simplifies decomposition. To remedy the second issue, we show that FOSGs and POSGs are naturally connected to EFGs: by "unrolling" a FOSG into its tree form, we obtain an EFG. Conversely, any perfect-recall timeable EFG corresponds to some underlying FOSG in this manner. Moreover, this relationship justifies several minor modifications to the classical EFG formalization that recently appeared as an implicit response to the model's issues with decomposition. Finally, we illustrate the transfer of ideas between EFGs and MARL by presenting three key EFG techniques -- counterfactual regret minimization, sequence form, and decomposition -- in the FOSG framework.

AAMAS Conference 2023 Conference Paper

Targeted Search Control in AlphaZero for Effective Policy Improvement

  • Alexandre Trudeau
  • Michael Bowling

AlphaZero is a self-play reinforcement learning algorithm that achieves superhuman play in chess, shogi, and Go via policy iteration. To be an effective policy improvement operator, AlphaZero’s search requires accurate value estimates for the states appearing in its search tree. AlphaZero trains upon self-play matches beginning from the initial state of a game and only samples actions over the first few moves, limiting its exploration of states deeper in the game tree. We introduce Go-Exploit, a novel search control strategy for AlphaZero. Go-Exploit samples the start state of its self-play trajectories from an archive of states of interest. Beginning self-play trajectories from varied starting states enables Go-Exploit to more effectively explore the game tree and to learn a value function that generalizes better. Producing shorter self-play trajectories allows Go-Exploit to train upon more independent value targets, improving value training. Finally, the exploration inherent in Go-Exploit reduces its need for exploratory actions, enabling it to train under more exploitative policies. In the games of Connect Four and 9x9 Go, we show that Go-Exploit learns with a greater sample efficiency than standard AlphaZero, resulting in stronger performance against reference opponents and in head-to-head play. We also compare Go-Exploit to KataGo, a more sample efficient reimplementation of AlphaZero, and demonstrate that Go-Exploit has a more effective search control strategy. Furthermore, Go-Exploit’s sample efficiency improves when KataGo’s other innovations are incorporated.
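
A minimal sketch of Go-Exploit's search control as described above, with `run_self_play` standing in for AlphaZero's self-play and training step (illustrative interfaces):

```python
import random

def go_exploit_self_play(initial_state, run_self_play, num_games):
    """Start each self-play game from a state sampled from an archive of
    previously visited states of interest, then feed states from the new
    trajectory back into the archive."""
    archive = [initial_state]
    for _ in range(num_games):
        start = random.choice(archive)          # varied, deeper start states
        trajectory = run_self_play(start)       # shorter self-play games
        archive.extend(trajectory.states)       # grow the pool of starts
    return archive
```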

JMLR Journal 2023 Journal Article

Temporal Abstraction in Reinforcement Learning with the Successor Representation

  • Marlos C. Machado
  • Andre Barreto
  • Doina Precup
  • Michael Bowling

Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation, which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the successor representation can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent’s representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the successor representation allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for temporally-extended exploration and on the use of the successor representation to combine them. Our results shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the successor representation, such as eigenoptions and the option keyboard.
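
For a fixed policy with state-to-state transition matrix $P$ and discount $\gamma$, the successor representation the article builds on has a simple closed form, sketched here:

```python
import numpy as np

def successor_representation(P, gamma):
    """Psi[s, s'] is the expected discounted number of future visits to s'
    when starting from s and following the fixed policy:
    Psi = (I - gamma * P)^-1."""
    n = P.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * P)
```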

IJCAI Conference 2022 Conference Paper

Approximate Exploitability: Learning a Best Response

  • Finbarr Timbers
  • Nolan Bard
  • Edward Lockhart
  • Marc Lanctot
  • Martin Schmid
  • Neil Burch
  • Julian Schrittwieser
  • Thomas Hubert
  • Michael Bowling

Researchers have shown that neural networks are vulnerable to adversarial examples and subtle environment changes. The resulting errors can look like blunders to humans, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. Such evaluation typically fails to evaluate robustness to worst-case outcomes. Computer poker research has examined how to assess such worst-case performance. Unfortunately, exact computation is infeasible with larger domains, and existing approximations are poker-specific. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, approximating worst-case performance. We demonstrate the technique in several games against a variety of agents, including several AlphaZero-based agents. Supplementary material is available at https://arxiv.org/abs/2004.09677.

IJCAI Conference 2022 Conference Paper

Learning Curricula for Humans: An Empirical Study with Puzzles from The Witness

  • Levi H. S. Lelis
  • João G. G. V. Nova
  • Eugene Chen
  • Nathan R. Sturtevant
  • Carrie Demmans Epp
  • Michael Bowling

The combination of tree search and neural networks has achieved super-human performance in challenging domains. We are interested in transferring to humans the knowledge these learning systems generate. We hypothesize the process in which neural-guided tree search algorithms learn how to solve a set of problems can be used to generate curricula for helping human learners. In this paper we show how the Bootstrap learning system can be modified to learn curricula for humans in a puzzle domain. We evaluate our system in two curriculum learning settings. First, given a small set of problem instances, our system orders the instances to ease the learning process of human learners. Second, given a large set of problem instances, our system returns a small ordered subset of the initial set that can be presented to human learners. We evaluate our curricula with a user study where participants learn how to solve a class of puzzles from the game 'The Witness.' The user-study results suggest one of the curricula our system generates compares favorably with simple baselines and is competitive with the curriculum from the original 'The Witness' game in terms of user retention and effort.

AAAI Conference 2021 Conference Paper

Hindsight and Sequential Rationality of Correlated Play

  • Dustin Morrill
  • Ryan D'Orazio
  • Reca Sarfati
  • Marc Lanctot
  • James R Wright
  • Amy R Greenwald
  • Michael Bowling

Driven by recent successes in two-player, zero-sum game solving and playing, artificial intelligence work on games has increasingly focused on algorithms that produce equilibrium-based strategies. However, this approach has been less effective at producing competent players in general-sum games or those with more than two players than in two-player, zero-sum games. An appealing alternative is to consider adaptive algorithms that ensure strong performance in hindsight relative to what could have been achieved with modified behavior. This approach also leads to a game-theoretic analysis, but in the correlated play that arises from joint learning dynamics rather than factored agent behavior at equilibrium. We develop and advocate for this hindsight rationality framing of learning in general sequential decision-making settings. To this end, we re-examine mediated equilibrium and deviation types in extensive-form games, thereby gaining a more complete understanding and resolving past misconceptions. We present a set of examples illustrating the distinct strengths and weaknesses of each type of equilibrium in the literature, and prove that no tractable concept subsumes all others. This line of inquiry culminates in the definition of the deviation and equilibrium classes that correspond to algorithms in the counterfactual regret minimization (CFR) family, relating them to all others in the literature. Examining CFR in greater detail further leads to a new recursive definition of rationality in correlated play that extends sequential rationality in a way that naturally applies to hindsight evaluation.

AAAI Conference 2021 Conference Paper

Solving Common-Payoff Games with Approximate Policy Iteration

  • Samuel Sokota
  • Edward Lockhart
  • Finbarr Timbers
  • Elnaz Davoodi
  • Ryan D'Orazio
  • Neil Burch
  • Martin Schmid
  • Michael Bowling

For artificially intelligent learning systems to have widespread applicability in real-world settings, it is important that they be able to operate decentrally. Unfortunately, decentralized control is difficult—computing even an epsilon-optimal joint policy is an NEXP-complete problem. Nevertheless, a recently rediscovered insight—that a team of agents can coordinate via common knowledge—has given rise to algorithms capable of finding optimal joint policies in small common-payoff games. The Bayesian action decoder (BAD) leverages this insight and deep reinforcement learning to scale to games as large as two-player Hanabi. However, the approximations it uses to do so prevent it from discovering optimal joint policies even in games small enough to brute force optimal solutions. This work proposes CAPI, a novel algorithm which, like BAD, combines common knowledge with deep reinforcement learning. However, unlike BAD, CAPI prioritizes the propensity to discover optimal joint policies over scalability. While this choice precludes CAPI from scaling to games as large as Hanabi, empirical results demonstrate that, on the games to which CAPI does scale, it is capable of discovering optimal joint policies even when other modern multi-agent reinforcement learning algorithms are unable to do so.

AAMAS Conference 2021 Conference Paper

Sound Algorithms in Imperfect Information Games

  • Michal Šustr
  • Martin Schmid
  • Matej Moravčík
  • Neil Burch
  • Marc Lanctot
  • Michael Bowling

Search has played a fundamental role in computer game research since the very beginning. And while online search has been commonly used in perfect information games such as Chess and Go, online search methods for imperfect information games have only been introduced relatively recently. This paper addresses the question: what is a sound online algorithm in an imperfect information setting of two-player zero-sum games? We argue that the fixed-strategy definitions of exploitability and epsilon-Nash equilibria are ill-suited to measure the worst-case performance of an online algorithm. We thus formalize epsilon-soundness, a concept that connects the worst-case performance of an online algorithm to the performance of an epsilon-Nash equilibrium. Our definition of soundness and the consistency hierarchy finally provide appropriate tools to analyze online algorithms in repeated imperfect information games. We thus inspect some of the previous online algorithms in a new light, bringing new insights into their worst-case performance guarantees.

JAIR Journal 2021 Journal Article

Teaching People by Justifying Tree Search Decisions: An Empirical Study in Curling

  • Cleyton R. Silva
  • Michael Bowling
  • Levi H.S. Lelis

In this research note we show that a simple justification system can be used to teach humans non-trivial strategies of the Olympic sport of curling. This is achieved by justifying the decisions of Kernel Regression UCT (KR-UCT), a tree search algorithm that derives curling strategies by playing the game with itself. Given an action returned by KR-UCT and the expected outcome of that action, we use a decision tree to produce a counterfactual justification of KR-UCT’s decision. The system samples other possible outcomes and selects for presentation the outcomes that are most similar to the expected outcome in terms of visual features and most different in terms of expected end-game value. A user study with 122 people shows that the participants who had access to the justifications produced by our system achieved much higher scores in a curling test than those who only observed the decision made by KR-UCT and those with access to the justifications of a baseline system. This is, to the best of our knowledge, the first work showing that a justification system is able to teach humans non-trivial strategies learned by an algorithm operating in self play.

AAAI Conference 2020 Conference Paper

Count-Based Exploration with the Successor Representation

  • Marlos C. Machado
  • Marc G. Bellemare
  • Michael Bowling

In this paper we introduce a simple approach for exploration in reinforcement learning (RL) that allows us to develop theoretically justified algorithms in the tabular case but that is also extendable to settings where function approximation is required. Our approach is based on the successor representation (SR), which was originally introduced as a representation defining state generalization by the similarity of successor states. Here we show that the norm of the SR, while it is being learned, can be used as a reward bonus to incentivize exploration. In order to better understand this transient behavior of the norm of the SR we introduce the substochastic successor representation (SSR) and we show that it implicitly counts the number of times each state (or feature) has been observed. We use this result to introduce an algorithm that performs as well as some theoretically sample-efficient approaches. Finally, we extend these ideas to a deep RL algorithm and show that it achieves state-of-the-art performance in Atari 2600 games when in a low sample-complexity regime.
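
A one-line sketch of the resulting exploration bonus: while the SR is still being learned, the norm of a state's SR row acts like an inverse visit count, so its reciprocal can be added to the reward (the scale `beta` is illustrative):

```python
import numpy as np

def sr_exploration_bonus(psi_row, beta=1.0):
    """Count-like bonus from the (still-learning) successor representation:
    rarely visited states have small SR-row norms and thus large bonuses."""
    return beta / np.linalg.norm(psi_row, ord=1)
```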

NeurIPS Conference 2020 Conference Paper

Marginal Utility for Planning in Continuous or Large Discrete Action Spaces

  • Zaheen Ahmad
  • Levi Lelis
  • Michael Bowling

Sample-based planning is a powerful family of algorithms for generating intelligent behavior from a model of the environment. Generating good candidate actions is critical to the success of sample-based planners, particularly in continuous or large action spaces. Typically, candidate action generation exhausts the action space, uses domain knowledge, or more recently, involves learning a stochastic policy to provide such search guidance. In this paper we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. The marginal utility of an action generator measures the increase in value of an action over previously generated actions. We validate our approach in both curling, a challenging stochastic domain with continuous state and action spaces, and a location game with a discrete but large action space. We show that a generator trained with the marginal utility objective outperforms hand-coded schemes built on substantial domain knowledge, trained stochastic policies, and other natural objectives for generating actions for sample-based planners.
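
A sketch of the marginal utility of a single generated action under an illustrative action-value estimator `value`: it is the improvement the new action brings over the best previously generated candidate:

```python
def marginal_utility(value, candidate_actions, new_action):
    """Increase in the best achievable value when new_action is added to
    the previously generated candidates (zero if it does not improve)."""
    if not candidate_actions:
        return value(new_action)
    best_so_far = max(value(a) for a in candidate_actions)
    return max(value(new_action) - best_so_far, 0.0)
```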

RLDM Conference 2019 Conference Abstract

Can a Game Require Theory of Mind?

  • Michael Bowling

NeurIPS Conference 2019 Conference Paper

Ease-of-Teaching and Language Structure from Emergent Communication

  • Fushan Li
  • Michael Bowling

Artificial agents have been shown to learn to communicate when needed to complete a cooperative task. Some level of language structure (e.g., compositionality) has been found in the learned communication protocols. This observed structure is often the result of specific environmental pressures during training. By introducing new agents periodically to replace old ones, sequentially and within a population, we explore one such new pressure — ease of teaching — and show its impact on the structure of the resulting language.

AAAI Conference 2019 Conference Paper

Solving Large Extensive-Form Games with Strategy Constraints

  • Trevor Davis
  • Kevin Waugh
  • Michael Bowling

Extensive-form games are a common model for multiagent interactions with imperfect information. In two-player zero-sum games, the typical solution concept is a Nash equilibrium over the unconstrained strategy set for each player. In many situations, however, we would like to constrain the set of possible strategies. For example, constraints are a natural way to model limited resources, risk mitigation, safety, consistency with past observations of behavior, or other secondary objectives for an agent. In small games, optimal strategies under linear constraints can be found by solving a linear program; however, state-of-the-art algorithms for solving large games cannot handle general constraints. In this work we introduce a generalized form of Counterfactual Regret Minimization that provably finds optimal strategies under any feasible set of convex constraints. We demonstrate the effectiveness of our algorithm for finding strategies that mitigate risk in security games, and for opponent modeling in poker games when given only partial observations of private information.

RLDM Conference 2019 Conference Abstract

The Effect of Planning Shape on Dyna-style planning in High-dimensional State Spaces

  • Gordon Z Holland
  • Erin Talvitie
  • Michael Bowling

Dyna is a fundamental approach to model-based reinforcement learning (MBRL) that interleaves planning, acting, and learning in an online setting. In the most typical application of Dyna, the dynamics model is used to generate one-step transitions from selected start states from the agent’s history, which are used to update the agent’s value function or policy as if they were real experiences. In this work, one-step Dyna was applied to several games from the Arcade Learning Environment (ALE). We found that the model-based updates offered surprisingly little benefit over simply performing more updates with the agent’s existing experience, even when using a perfect model. We hypothesize that to get the most from planning, the model must be used to generate unfamiliar experience. To test this, we experimented with the shape of planning in multiple different concrete instantiations of Dyna, performing fewer, longer rollouts, rather than many short rollouts. We found that planning shape has a profound impact on the efficacy of Dyna for both perfect and learned models. In addition to these findings regarding Dyna in general, our results represent, to our knowledge, the first time that a learned dynamics model has been successfully used for planning in the ALE, suggesting that Dyna may be a viable approach to MBRL in the ALE and other high-dimensional problems.

AAAI Conference 2019 Conference Paper

Variance Reduction in Monte Carlo Counterfactual Regret Minimization (VR-MCCFR) for Extensive Form Games Using Baselines

  • Martin Schmid
  • Neil Burch
  • Marc Lanctot
  • Matej Moravcik
  • Rudolf Kadlec
  • Michael Bowling

Learning strategies for imperfect information games from samples of interaction is a challenging problem. A common method for this setting, Monte Carlo Counterfactual Regret Minimization (MCCFR), can have slow long-term convergence rates due to high variance. In this paper, we introduce a variance reduction technique (VR-MCCFR) that applies to any sampling variant of MCCFR. Using this technique, per-iteration estimated values and updates are reformulated as a function of sampled values and state-action baselines, similar to their use in policy gradient reinforcement learning. The new formulation allows estimates to be bootstrapped from other estimates within the same episode, propagating the benefits of baselines along the sampled trajectory; the estimates remain unbiased even when bootstrapping from other estimates. Finally, we show that given a perfect baseline, the variance of the value estimates can be reduced to zero. Experimental evaluation shows that VR-MCCFR brings an order of magnitude speedup, while the empirical variance decreases by three orders of magnitude. The decreased variance allows for the first time CFR+ to be used with sampling, increasing the speedup to two orders of magnitude.
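
The baseline-corrected estimate at the heart of VR-MCCFR has the familiar control-variate form, written here in simplified notation ($a'$ is the sampled action, $q$ its sampling probability, $b$ the state-action baseline, and $\hat{u}$ the sampled downstream value):

```latex
\hat{v}(a) \;=\; b(a) \;+\; \frac{\mathbb{1}\{a = a'\}}{q(a)}\,\bigl(\hat{u}(a) - b(a)\bigr)
```

Taking the expectation over the sampled action shows the estimate is unbiased for any baseline, and its variance shrinks as $b$ approaches the true values, matching the perfect-baseline result stated above.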

NeurIPS Conference 2018 Conference Paper

Actor-Critic Policy Optimization in Partially Observable Multiagent Environments

  • Sriram Srinivasan
  • Marc Lanctot
  • Vinicius Zambaldi
  • Julien Perolat
  • Karl Tuyls
  • Remi Munos
  • Michael Bowling

Optimization of parameterized policies for reinforcement learning (RL) is an important and challenging problem in artificial intelligence. Among the most common approaches are algorithms based on gradient ascent of a score function representing discounted return. In this paper, we examine the role of these policy gradient and actor-critic algorithms in partially-observable multiagent environments. We show several candidate policy update rules and relate them to a foundation of regret minimization and multiagent learning techniques for the one-shot and tabular cases, leading to previously unknown convergence guarantees. We apply our method to model-free multiagent reinforcement learning in adversarial sequential decision problems (zero-sum imperfect information games), using RL-style function approximation. We evaluate on commonly used benchmark Poker domains, showing performance against fixed policies and empirical convergence to approximate Nash equilibria in self-play with rates similar to or better than a baseline model-free algorithm for zero-sum games, without any domain-specific state space reductions.

AAAI Conference 2018 Conference Paper

AIVAT: A New Variance Reduction Technique for Agent Evaluation in Imperfect Information Games

  • Neil Burch
  • Martin Schmid
  • Matej Moravcik
  • Dustin Morrill
  • Michael Bowling

Evaluating agent performance when outcomes are stochastic and agents use randomized strategies can be challenging when there is limited data available. The variance of sampled outcomes may make the simple approach of Monte Carlo sampling inadequate. This is the case for agents playing heads-up no-limit Texas hold’em poker, where man-machine competitions typically involve multiple days of consistent play by multiple players, but still can (and sometimes did) result in statistically insignificant conclusions. In this paper, we introduce AIVAT, a low variance, provably unbiased value assessment tool that exploits an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents. Unlike existing techniques which reduce the variance from chance events, or only consider game ending actions, AIVAT reduces the variance both from choices by nature and by players with a known strategy. The resulting estimator produces results that significantly outperform previous state of the art techniques. It was able to reduce the standard deviation of a Texas hold’em poker man-machine match by 85% and consequently requires 44 times fewer games to draw the same statistical conclusion. AIVAT enabled the first statistically significant AI victory against professional poker players in no-limit hold’em. Furthermore, the technique was powerful enough to produce statistically significant results versus individual players, not just an aggregate pool of the players. We also used AIVAT to analyze a short series of AI vs human poker tournaments, producing statistically significant results with as few as 28 matches.

JAIR Journal 2018 Journal Article

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents

  • Marlos C. Machado
  • Marc G. Bellemare
  • Erik Talvitie
  • Joel Veness
  • Matthew Hausknecht
  • Michael Bowling

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community, leading to some high-profile success stories such as the much publicized Deep Q-Networks (DQN). In this article we take a big picture look at how the ALE is being used by the research community. We show how diverse the evaluation methodologies in the ALE have become with time, and highlight some key concerns when evaluating agents in the ALE. We use this discussion to present some methodological best practices and provide new benchmark results using these best practices. To further the progress in the field, we introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions. We conclude this big picture look by revisiting challenges posed when the ALE was introduced, summarizing the state-of-the-art in various problems and highlighting problems that remain open.
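
The sticky actions mechanism the article introduces is simple enough to state directly; the sketch below uses the repeat probability of 0.25 that the article recommends:

```python
import random

def sticky_action(prev_action, intended_action, stickiness=0.25):
    """With probability `stickiness`, the emulator repeats the previous
    action instead of the newly chosen one, injecting stochasticity into
    otherwise deterministic Atari 2600 games."""
    return prev_action if random.random() < stickiness else intended_action
```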

IJCAI Conference 2018 Conference Paper

Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents (Extended Abstract)

  • Marlos C. Machado
  • Marc G. Bellemare
  • Erik Talvitie
  • Joel Veness
  • Matthew Hausknecht
  • Michael Bowling

The Arcade Learning Environment (ALE) is an evaluation platform that poses the challenge of building AI agents with general competency across dozens of Atari 2600 games. It supports a variety of different problem settings and it has been receiving increasing attention from the scientific community. In this paper we take a big picture look at how the ALE is being used by the research community. We focus on how diverse the evaluation methodologies in the ALE have become and we highlight some key concerns when evaluating agents in this platform. We use this discussion to present what we consider to be the best practices for future evaluations in the ALE. To further the progress in the field, we also introduce a new version of the ALE that supports multiple game modes and provides a form of stochasticity we call sticky actions.

RLDM Conference 2017 Conference Abstract

A Laplacian Framework for Option Discovery in Reinforcement Learning

  • Marlos C. Machado
  • Marc G. Bellemare
  • Michael Bowling

Representation learning and option discovery are two of the biggest challenges in reinforcement learning (RL). Proto-RL is a well-known approach for representation learning in MDPs. The representations learned with this framework are called proto-value functions (PVFs). In this paper we address the option discovery problem by showing how PVFs implicitly define options. We do so by introducing the concepts of eigenpurposes and eigenbehaviors. Eigenpurposes are intrinsic reward functions that incentivize the agent to traverse the state space by following the principal directions of the learned representation. Each intrinsic reward function leads to a different eigenbehavior, which is the optimal policy for that reward function. We convert eigenbehaviors into options by defining an eigenbehavior's termination condition to be the moment the agent is no longer able to accumulate positive intrinsic reward. Our termination criterion is provably satisfiable in at least one state in every MDP. Intuitively, the options discovered from eigenpurposes traverse the principal directions of the state space. In this paper we show how exploration is greatly improved when the agent's action set is augmented by the options we discover, using as the performance metric the expected number of steps required to navigate between any two states in the MDP when following a random walk. This result is due to the fact that our options capture the diffusion process of a random walk, and that different options act at different time scales. We also demonstrate how the options we discover can be used to accumulate reward. Finally, we introduce a sample-based algorithm for option discovery in scenarios in which linear function approximation is required. We provide anecdotal evidence in Atari 2600 games that the options we discover clearly demonstrate intent, aiming at reaching specific locations on the screen or at executing specific behavioral patterns.
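
A sketch of the eigenpurpose intrinsic reward described above, with `e` a PVF eigenvector and `phi` the state representation: the agent is rewarded for moving along one principal direction of the learned representation:

```python
import numpy as np

def eigenpurpose_reward(e, phi_s, phi_next):
    """Intrinsic reward r_int = e^T (phi(s') - phi(s)); the corresponding
    eigenbehavior is the policy optimal for this reward, terminating once
    no positive intrinsic reward is achievable."""
    return float(e @ (phi_next - phi_s))
```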

IJCAI Conference 2016 Conference Paper

Action Selection for Hammer Shots in Curling

  • Zaheen Farraz Ahmad
  • Robert C. Holte
  • Michael Bowling

Curling is an adversarial two-player game with a continuous state and action space, and stochastic transitions. This paper focuses on one aspect of the full game, namely, finding the optimal "hammer shot," which is the last action taken before a score is tallied. We survey existing methods for finding an optimal action in a continuous, low-dimensional space with stochastic outcomes, and adapt a method based on Delaunay Triangulation to our application. Experiments using our curling physics simulator show that the adapted Delaunay Triangulation's shot selection outperforms other algorithms, and with some caveats, exceeds Olympic-level human performance.

IJCAI Conference 2016 Conference Paper

Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty

  • Timothy Yee
  • Viliam Lisy
  • Michael Bowling

Real world applications of artificial intelligence often require agents to sequentially choose actions from continuous action spaces with execution uncertainty. When good actions are sparse, domain knowledge is often used to identify a discrete set of promising actions. These actions and their uncertain effects are typically evaluated using a recursive search procedure. The reduction of the problem to a discrete search problem causes severe limitations, notably, not exploiting all of the sampled outcomes when evaluating actions, and not using outcomes to help find new actions outside the original set. We propose a new Monte Carlo tree search (MCTS) algorithm specifically designed for exploiting an execution model in this setting. Using kernel regression, it generalizes the information about action quality between actions and to unexplored parts of the action space. In a high fidelity simulator of the olympic sport of curling, we show that this approach significantly outperforms existing MCTS methods.
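
A sketch of the kernel regression estimate that lets the proposed MCTS generalize between nearby actions in a continuous action space (`kernel` is any similarity function, e.g. a Gaussian kernel):

```python
import numpy as np

def kernel_regression_value(a, actions, values, kernel):
    """Value estimate for a (possibly unexplored) action as the
    kernel-weighted average of the values of evaluated actions."""
    w = np.array([kernel(a, ai) for ai in actions])
    return float(w @ np.asarray(values) / w.sum())
```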

AAMAS Conference 2016 Conference Paper

State of the Art Control of Atari Games Using Shallow Reinforcement Learning

  • Yitao Liang
  • Marlos C. Machado
  • Erik Talvitie
  • Michael Bowling

The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN’s impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN’s network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves competitive performance to DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.

NeurIPS Conference 2016 Conference Paper

The Forget-me-not Process

  • Kieran Milan
  • Joel Veness
  • James Kirkpatrick
  • Michael Bowling
  • Anna Koop
  • Demis Hassabis

We introduce the Forget-me-not Process, an efficient, non-parametric meta-algorithm for online probabilistic sequence prediction for piecewise stationary, repeating sources. Our method works by taking a Bayesian approach to partition a stream of data into postulated task-specific segments, while simultaneously building a model for each task. We provide regret guarantees with respect to piecewise stationary data sources under the logarithmic loss, and validate the method empirically across a range of sequence prediction and task identification problems.

AAAI Conference 2015 Conference Paper

Approximate Linear Programming for Constrained Partially Observable Markov Decision Processes

  • Pascal Poupart
  • Aarti Malhotra
  • Pei Pei
  • Kee-Eung Kim
  • Bongseok Goh
  • Michael Bowling

In many situations, it is desirable to optimize a sequence of decisions by maximizing a primary objective while respecting some constraints with respect to secondary objectives. Such problems can be naturally modeled as constrained partially observable Markov decision processes (CPOMDPs) when the environment is partially observable. In this work, we describe a technique based on approximate linear programming to optimize policies in CPOMDPs. The optimization is performed offline and produces a finite state controller with desirable performance guarantees. The approach outperforms a constrained version of point-based value iteration on a suite of benchmark problems.

AAAI Conference 2015 Conference Paper

Improving Exploration in UCT Using Local Manifolds

  • Sriram Srinivasan
  • Erik Talvitie
  • Michael Bowling

Monte Carlo planning has been proven successful in many sequential decision-making settings, but it suffers from poor exploration when the rewards are sparse. In this paper, we improve exploration in UCT by generalizing across similar states using a given distance metric. When the state space does not have a natural distance metric, we show how a local manifold can be learned from the transition graph of states in the near future to obtain a distance metric. On domains inspired by video games, empirical evidence shows that our algorithm is more sample efficient than UCT, particularly when rewards are sparse.

AAAI Conference 2015 Conference Paper

Optimal Estimation of Multivariate ARMA Models

  • Martha White
  • Junfeng Wen
  • Michael Bowling
  • Dale Schuurmans

Autoregressive moving average (ARMA) models are a fundamental tool in time series analysis that offer intuitive modeling capability and efficient predictors. Unfortunately, the lack of globally optimal parameter estimation strategies for these models remains a problem: application studies often adopt the simpler autoregressive model that can be easily estimated by maximizing (a posteriori) likelihood. We develop a (regularized, imputed) maximum likelihood criterion that admits efficient global estimation via structured matrix norm optimization methods. An empirical evaluation demonstrates the benefits of globally optimal parameter estimation over local and moment matching approaches.
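
For reference, the vector ARMA($p$, $q$) model class whose globally optimal estimation the paper targets, with innovations $\epsilon_t$ and coefficient matrices $A_i$ (autoregressive) and $B_j$ (moving average):

```latex
y_t \;=\; \sum_{i=1}^{p} A_i\, y_{t-i} \;+\; \sum_{j=1}^{q} B_j\, \epsilon_{t-j} \;+\; \epsilon_t
```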

AAAI Conference 2015 Conference Paper

Policy Tree: Adaptive Representation for Policy Gradient

  • Ujjwal Das Gupta
  • Erik Talvitie
  • Michael Bowling

Much of the focus on finding good representations in reinforcement learning has been on learning complex non-linear predictors of value. Policy gradient algorithms, which directly represent the policy, often need fewer parameters to learn good policies. However, they typically employ a fixed parametric representation that may not be sufficient for complex domains. This paper introduces the Policy Tree algorithm, which can learn an adaptive representation of policy in the form of a decision tree over different instantiations of a base policy. Policy gradient is used both to optimize the parameters and to grow the tree by choosing splits that enable the maximum local increase in the expected return of the policy. Experiments show that this algorithm can choose genuinely helpful splits and significantly improve upon the commonly used linear Gibbs softmax policy, which we choose as our base policy.

AAAI Conference 2015 Conference Paper

Solving Games with Functional Regret Estimation

  • Kevin Waugh
  • Dustin Morrill
  • James Bagnell
  • Michael Bowling

We propose a novel online learning method for minimizing regret in large extensive-form games. The approach learns a function approximator online to estimate the regret for choosing a particular action. A no-regret algorithm uses these estimates in place of the true regrets to define a sequence of policies. We prove the approach sound by providing a bound relating the quality of the function approximation and regret of the algorithm. A corollary is that the method is guaranteed to converge to a Nash equilibrium in self-play so long as the regrets are ultimately realizable by the function approximator. Our technique can be understood as a principled generalization of existing work on abstraction in large games; in our work, both the abstraction as well as the equilibrium are learned during self-play. We demonstrate empirically the method achieves higher quality strategies than state-of-the-art abstraction techniques given the same resources.
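
A sketch of the policy construction the abstract describes, with `regret_net` an illustrative online-trained estimator: predicted regrets simply replace true cumulative regrets inside regret matching:

```python
import numpy as np

def regret_estimation_policy(regret_net, state, actions):
    """Regret matching over estimated regrets: play proportionally to the
    positive part of the function approximator's predictions."""
    est = np.array([regret_net(state, a) for a in actions])
    pos = np.maximum(est, 0.0)
    if pos.sum() <= 0.0:
        return np.full(len(actions), 1.0 / len(actions))
    return pos / pos.sum()
```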

IJCAI Conference 2015 Conference Paper

Solving Heads-Up Limit Texas Hold'em

  • Oskari Tammelin
  • Neil Burch
  • Michael Johanson
  • Michael Bowling

Cepheus is the first computer program to essentially solve a game of imperfect information that is played competitively by humans. The game it plays is heads-up limit Texas hold’em poker, a game with over 10^14 information sets, and a challenge problem for artificial intelligence for over 10 years. Cepheus was trained using a new variant of Counterfactual Regret Minimization (CFR), called CFR+, using 4800 CPUs running for 68 days. In this paper we describe in detail the engineering details required to make this computation a reality. We also prove the theoretical soundness of CFR+ and its component algorithm, regret-matching+. We further give a hint towards understanding the success of CFR+ by proving a tracking regret bound for this new regret matching algorithm. We present results showing the role of the algorithmic components and the engineering choices in the success of CFR+.
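
A minimal sketch of regret-matching+, the component update that distinguishes CFR+ from vanilla CFR: regret-like values are clipped at zero after every update, so an action recovers quickly once it stops accumulating negative regret:

```python
import numpy as np

def regret_matching_plus(Q, instantaneous_regret):
    """One regret-matching+ update: clip accumulated regret-like values at
    zero, then play proportionally to them."""
    Q = np.maximum(Q + instantaneous_regret, 0.0)
    total = Q.sum()
    strategy = Q / total if total > 0 else np.full_like(Q, 1.0 / len(Q))
    return Q, strategy
```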

IJCAI Conference 2015 Conference Paper

The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract)

  • Marc Bellemare
  • Yavar Naddaf
  • Joel Veness
  • Michael Bowling

In this extended abstract we introduce the Arcade Learning Environment (ALE): both a challenge problem and a platform and methodology for evaluating the development of general, domain-independent AI technology. ALE provides an interface to hundreds of Atari 2600 game environments, each one different, interesting, and designed to be a challenge for human players. ALE presents significant research challenges for reinforcement learning, model learning, model-based planning, imitation learning, transfer learning, and intrinsic motivation. Most importantly, it provides a rigorous testbed for evaluating and comparing approaches to these problems. We illustrate the promise of ALE by presenting a benchmark set of domain-independent agents designed using well-established AI techniques for both reinforcement learning and planning. In doing so, we also propose an evaluation methodology made possible by ALE, reporting empirical results on over 55 different games. We conclude with a brief update on the latest ALE developments. All of the software, including the benchmark agents, is publicly available.

AAAI Conference 2014 Conference Paper

Solving Imperfect Information Games Using Decomposition

  • Neil Burch
  • Michael Johanson
  • Michael Bowling

Decomposition, i.e., independently analyzing possible subgames, has proven to be an essential principle for effective decision-making in perfect information games. However, in imperfect information games, decomposition has proven to be problematic. To date, all proposed techniques for decomposition in imperfect information games have abandoned theoretical guarantees. This work presents the first technique for decomposing an imperfect information game into subgames that can be solved independently, while retaining optimality guarantees on the full-game solution. We can use this technique to construct theoretically justified algorithms that make better use of information available at run-time, overcome memory or disk limitations at run-time, or make a time/space trade-off to overcome memory or disk limitations while solving a game. In particular, we present an algorithm for subgame solving which guarantees performance in the whole game, in contrast to existing methods which may have unbounded error. In addition, we present an offline game solving algorithm, CFR-D, which can produce a Nash equilibrium for a game that is larger than available storage.

AAAI Conference 2014 Conference Paper

Using Response Functions to Measure Strategy Strength

  • Trevor Davis
  • Neil Burch
  • Michael Bowling

Extensive-form games are a powerful tool for representing complex multi-agent interactions. Nash equilibrium strategies are commonly used as a solution concept for extensive-form games, but many games are too large for the computation of Nash equilibria to be tractable. In these large games, exploitability has traditionally been used to measure deviation from Nash equilibrium, and strategies therefore aim to achieve minimal exploitability. However, while exploitability measures a strategy’s worst-case performance, it fails to capture how likely that worst-case is to be observed in practice. In fact, empirical evidence has shown that a less exploitable strategy can perform worse than a more exploitable strategy in one-on-one play against a variety of opponents. In this work, we propose a class of response functions that can be used to measure the strength of a strategy. We prove that standard no-regret algorithms can be used to learn optimal strategies for a scenario where the opponent uses one of these response functions. We demonstrate the effectiveness of this technique in Leduc Hold’em against opponents that use the UCT Monte Carlo tree search algorithm.

AAAI Conference 2013 Conference Paper

Automating Collusion Detection in Sequential Games

  • Parisa Mazrooei
  • Christopher Archibald
  • Michael Bowling

Collusion is the practice of two or more parties deliberately cooperating to the detriment of others. While such behavior may be desirable in certain circumstances, in many it is considered dishonest and unfair. If agents otherwise hold strictly to the established rules, though, collusion can be challenging to police. In this paper, we introduce an automatic method for collusion detection in sequential games. We achieve this through a novel object, called a collusion table, that captures the effects of collusive behavior, i.e., advantage to the colluding parties, without assuming any particular pattern of behavior. We show the effectiveness of this method in the domain of poker, a popular game where collusion is prohibited.

AAMAS Conference 2013 Conference Paper

Rating Players in Games with Real-Valued Outcomes

  • Christopher Archibald
  • Neil Burch
  • Michael Bowling
  • Matthew Rutherford

Game-theoretic models typically associate outcomes with real-valued utilities, and rational agents are expected to maximize their expected utility. Currently fielded agent rating systems, which aim to order a population of agents by strength, focus exclusively on games with discrete outcomes, e.g., win-loss in two-agent settings or an ordering in the multi-agent setting. These rating systems are not well-suited for domains where the absolute magnitude of utility rather than just the relative value is important. We introduce the problem of rating agents in games with real-valued outcomes and survey applicable existing techniques for rating agents in this setting. We then propose a novel rating system and an extension for all of these rating systems to games with more than two agents, showing experimentally the advantages of our proposed system.

IJCAI Conference 2013 Conference Paper

Subset Selection of Search Heuristics

  • Chris Rayner
  • Nathan Sturtevant
  • Michael Bowling

Constructing a strong heuristic function is a central problem in heuristic search. A common approach is to combine a number of heuristics by maximizing over the values from each. If a limit is placed on this number, then a subset selection problem arises. We treat this as an optimization problem, and proceed by translating a natural loss function into a submodular and monotonic utility function under which greedy selection is guaranteed to be near-optimal. We then extend this approach with a sampling scheme that retains provable optimality. Our empirical results show large improvements over existing methods, and give new insight into building heuristics for directed domains.
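
A minimal sketch of the greedy selection this abstract alludes to, on synthetic data (not the paper's benchmarks): because taking a pointwise max over heuristics gives a monotone submodular utility, greedily adding the heuristic with the largest marginal gain carries a (1 - 1/e) near-optimality guarantee.

```python
import numpy as np

def greedy_heuristic_subset(H, k):
    """Pick k rows of H (heuristic values over sample problems) greedily,
    maximizing the sum of the pointwise max, which is monotone and
    submodular, so greedy selection is near-optimal."""
    chosen, current = [], np.zeros(H.shape[1])
    for _ in range(k):
        gains = [np.maximum(current, h).sum() - current.sum() for h in H]
        best = int(np.argmax(gains))
        chosen.append(best)
        current = np.maximum(current, H[best])
    return chosen, current

rng = np.random.default_rng(1)
true_dist = rng.uniform(5, 10, size=200)   # true distances on sample pairs
H = np.array([true_dist * rng.uniform(0.2, 1.0, size=200)
              for _ in range(12)])         # 12 admissible candidate heuristics
subset, combined = greedy_heuristic_subset(H, k=3)
print("chosen heuristics:", subset, "mean combined value:", combined.mean())
```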

AAMAS Conference 2012 Conference Paper

Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization

  • Michael Johanson
  • Nolan Bard
  • Marc Lanctot
  • Richard Gibson
  • Michael Bowling

Recently, there has been considerable progress towards algorithms for approximating Nash equilibrium strategies in extensive games. One such algorithm, Counterfactual Regret Minimization (CFR), has proven to be effective in two-player, zero-sum poker domains. While the basic algorithm is iterative and performs a full game traversal on each iteration, sampling-based approaches are possible. For instance, chance-sampled CFR considers just a single chance outcome per traversal, resulting in faster but less precise iterations. While more iterations are required, chance-sampled CFR requires less time overall to converge. In this work, we present new sampling techniques that consider sets of chance outcomes during each traversal to produce slower, more accurate iterations. By sampling only the public chance outcomes seen by all players, we take advantage of the imperfect information structure of the game to (i) avoid recomputation of strategy probabilities, and (ii) achieve an algorithmic speed improvement, performing O($n^2$) work at terminal nodes in O($n$) time. We demonstrate that this new CFR update converges more quickly than chance-sampled CFR in the large domains of poker and Bluff.

AAAI Conference 2012 Conference Paper

Finding Optimal Abstract Strategies in Extensive-Form Games

  • Michael Johanson
  • Nolan Bard
  • Neil Burch
  • Michael Bowling

Extensive-form games are a powerful model for representing interactions between agents. Nash equilibrium strategies are a common solution concept for extensive-form games and, in two-player zero-sum games, there are efficient algorithms for calculating such strategies. In large games, this computation may require too much memory and time to be tractable. A standard approach in such cases is to apply a lossy state-space abstraction technique to produce a smaller abstract game that can be tractably solved, while hoping that the resulting abstract game equilibrium is close to an equilibrium strategy in the unabstracted game. Recent work has shown that this assumption is unreliable, and an arbitrary Nash equilibrium in the abstract game is unlikely to be even near the least suboptimal strategy that can be represented in that space. In this work, we present for the first time an algorithm which efficiently finds optimal abstract strategies — strategies with minimal exploitability in the unabstracted game. We use this technique to find the least exploitable strategy ever reported for two-player limit Texas hold’em.

AAAI Conference 2012 Conference Paper

Generalized Sampling and Variance in Counterfactual Regret Minimization

  • Richard Gibson
  • Marc Lanctot
  • Neil Burch
  • Duane Szafron
  • Michael Bowling

In large extensive form games with imperfect information, Counterfactual Regret Minimization (CFR) is a popular, iterative algorithm for computing approximate Nash equilibria. While the base algorithm performs a full tree traversal on each iteration, Monte Carlo CFR (MCCFR) reduces the per iteration time cost by traversing just a sampled portion of the tree. On the other hand, MCCFR’s sampled values introduce variance, and the effects of this variance were previously unknown. In this paper, we generalize MCCFR by considering any generic estimator of the sought values. We show that any choice of an estimator can be used to probabilistically minimize regret, provided the estimator is bounded and unbiased. In addition, we relate the variance of the estimator to the convergence rate of an algorithm that calculates regret directly from the estimator. We demonstrate the application of our analysis by defining a new bounded, unbiased estimator with empirically lower variance than MCCFR estimates. Finally, we use this estimator in a new sampling algorithm to compute approximate equilibria in Goofspiel, Bluff, and Texas hold’em poker. Under each of our selected sampling schemes, our new algorithm converges faster than MCCFR.

AAAI Conference 2012 Conference Paper

Investigating Contingency Awareness Using Atari 2600 Games

  • Marc Bellemare
  • Joel Veness
  • Michael Bowling

Contingency awareness is the recognition that some aspects of a future observation are under an agent’s control while others are solely determined by the environment. This paper explores the idea of contingency awareness in reinforcement learning using the platform of Atari 2600 games. We introduce a technique for accurately identifying contingent regions and describe how to exploit this knowledge to generate improved features for value function approximation. We evaluate the performance of our techniques empirically, using 46 unseen, diverse, and challenging games for the Atari 2600 console. Our results suggest that contingency awareness is a generally useful concept for model-free reinforcement learning agents.

JMLR Journal 2012 Journal Article

Linear Fitted-Q Iteration with Multiple Reward Functions

  • Daniel J. Lizotte
  • Michael Bowling
  • Susan A. Murphy

We present a general and detailed development of an algorithm for finite-horizon fitted-Q iteration with an arbitrary number of reward signals and linear value function approximation using an arbitrary number of state features. This includes a detailed treatment of the 3-reward function case using triangulation primitives from computational geometry and a method for identifying globally dominated actions. We also present an example of how our methods can be used to construct a real-world decision aid by considering symptom reduction, weight gain, and quality of life in sequential treatments for schizophrenia. Finally, we discuss future directions in which to take this work that will further enable our methods to make a positive impact on the field of evidence-based clinical decision support.
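
As a point of reference, here is a minimal single-reward, finite-horizon fitted-Q iteration with linear function approximation, i.e., the base case the paper generalizes; the multiple-reward treatment and dominated-action identification that are the paper's actual contributions are not shown.

```python
import numpy as np

def fitted_q_iteration(transitions, phi, n_actions, horizon, ridge=1e-3):
    """Finite-horizon fitted-Q: one ridge regression per (stage, action).
    transitions: list of (s, a, r, s_next or None); phi maps a state to features."""
    d = len(phi(transitions[0][0]))
    W = np.zeros((horizon + 1, n_actions, d))      # Q_h(s, a) = W[h, a] @ phi(s)
    for h in range(horizon - 1, -1, -1):
        for a in range(n_actions):
            X, y = [], []
            for s, act, r, s2 in transitions:
                if act != a:
                    continue
                backup = 0.0 if s2 is None else max(
                    W[h + 1, b] @ phi(s2) for b in range(n_actions))
                X.append(phi(s)); y.append(r + backup)
            if X:
                X, y = np.asarray(X), np.asarray(y)
                W[h, a] = np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ y)
    return W

# Toy usage: a 4-state chain where action 1 moves right and pays off at the end.
phi = lambda s: np.array([1.0, float(s)])
data = [(s, a, float(s == 3 and a == 1), None if s == 3 else min(s + a, 3))
        for s in range(4) for a in (0, 1) for _ in range(5)]
W = fitted_q_iteration(data, phi, n_actions=2, horizon=4)
print("Q_0(s=0, a):", [W[0, a] @ phi(0) for a in (0, 1)])
```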

NeurIPS Conference 2012 Conference Paper

Sketch-Based Linear Value Function Approximation

  • Marc Bellemare
  • Joel Veness
  • Michael Bowling

Hashing is a common method to reduce large, potentially infinite feature vectors to a fixed-size table. In reinforcement learning, hashing is often used in conjunction with tile coding to represent states in continuous spaces. Hashing is also a promising approach to value function approximation in large discrete domains such as Go and Hearts, where feature vectors can be constructed by exhaustively combining a set of atomic features. Unfortunately, the typical use of hashing in value function approximation results in biased value estimates due to the possibility of collisions. Recent work in data stream summaries has led to the development of the tug-of-war sketch, an unbiased estimator for approximating inner products. Our work investigates the application of this new data structure to linear value function approximation. Although in the reinforcement learning setting the use of the tug-of-war sketch leads to biased value estimates, we show that this bias can be orders of magnitude less than that of standard hashing. We provide empirical results on two RL benchmark domains and fifty-five Atari 2600 games to highlight the superior learning performance of tug-of-war hashing.
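
The estimator's trick is easy to demonstrate: hash each feature to a bucket and to a random sign, so colliding features cancel in expectation instead of always inflating the estimate. A minimal sketch with hypothetical hashing choices (not the paper's implementation):

```python
import numpy as np

class TugOfWarSketch:
    """Project sparse binary feature indices into a small table using one
    hash for the bucket and a second hash for a +/-1 sign. The signs make
    collision noise zero-mean, so inner products are unbiased (unlike plain
    hashing, where collisions always add)."""
    def __init__(self, width, seed=0):
        self.width = width
        rng = np.random.default_rng(seed)
        self._bucket_key = int(rng.integers(1 << 31))
        self._sign_key = int(rng.integers(1 << 31))

    def _bucket(self, i):
        return hash((self._bucket_key, i)) % self.width

    def _sign(self, i):
        return 1.0 if hash((self._sign_key, i)) % 2 else -1.0

    def project(self, active_features):
        v = np.zeros(self.width)
        for i in active_features:
            v[self._bucket(i)] += self._sign(i)
        return v

sk = TugOfWarSketch(width=64)
a = sk.project(range(0, 50))    # two overlapping binary feature sets
b = sk.project(range(25, 75))
print("estimated overlap:", a @ b, "(true overlap: 25)")
```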

NeurIPS Conference 2012 Conference Paper

Tractable Objectives for Robust Policy Optimization

  • Katherine Chen
  • Michael Bowling

Robust policy optimization acknowledges that risk-aversion plays a vital role in real-world decision-making. When faced with uncertainty about the effects of actions, the policy that maximizes expected utility over the unknown parameters of the system may also carry with it a risk of intolerably poor performance. One might prefer to accept lower utility in expectation in order to avoid, or reduce the likelihood of, unacceptable levels of utility under harmful parameter realizations. In this paper, we take a Bayesian approach to parameter uncertainty, but unlike other methods avoid making any distributional assumptions about the form of this uncertainty. Instead we focus on identifying optimization objectives for which solutions can be efficiently approximated. We introduce percentile measures: a very general class of objectives for robust policy optimization, which encompasses most existing approaches, including ones known to be intractable. We then introduce a broad subclass of this family for which robust policies can be approximated efficiently. Finally, we frame these objectives in the context of a two-player, zero-sum, extensive-form game and employ a no-regret algorithm to approximate an optimal policy, with computation only polynomial in the number of states and actions of the MDP.
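
A toy sketch of the percentile-style family of objectives (names and numbers here are illustrative): draw N parameter settings from an arbitrary posterior sampler, evaluate the policy under each, and score it by the mean of its k worst outcomes.

```python
import numpy as np

def k_of_n_objective(policy_return, sample_params, k, n, rng):
    """Percentile-style robust objective: draw n parameter settings from
    the posterior sampler, evaluate the policy under each, and average
    the k worst returns."""
    draws = [sample_params(rng) for _ in range(n)]
    returns = np.sort([policy_return(p) for p in draws])
    return returns[:k].mean()

rng = np.random.default_rng(0)
sample_params = lambda rng: rng.normal(1.0, 0.5)    # posterior over an unknown parameter
policy_return = lambda p: 10.0 - 4.0 * max(p, 0.0)  # toy return of a fixed policy under p
print("5-of-20 robust value:",
      k_of_n_objective(policy_return, sample_params, k=5, n=20, rng=rng))
```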

IJCAI Conference 2011 Conference Paper

Accelerating Best Response Calculation in Large Extensive Games

  • Michael Johanson
  • Kevin Waugh
  • Michael Bowling
  • Martin Zinkevich

One fundamental evaluation criterion for an AI technique is its worst-case performance. For static strategies in extensive games, this can be computed using a best response computation. Conventionally, this requires a full game tree traversal. For very large games, such as poker, that traversal is infeasible to perform on modern hardware. In this paper, we detail a general technique for best response computations that can often avoid a full game tree traversal. Additionally, our method is specifically well-suited for parallel environments. We apply this approach to computing the worst-case performance of a number of strategies in heads-up limit Texas hold'em, which, prior to this work, was not possible. We explore these results thoroughly as they provide insight into the effects of abstraction on worst-case performance in large imperfect information games. This is a topic that has received much attention, but could not previously be examined outside of toy domains.

AAAI Conference 2011 Conference Paper

Euclidean Heuristic Optimization

  • D. Chris Rayner
  • Michael Bowling
  • Nathan Sturtevant

We pose the problem of constructing good search heuristics as an optimization problem: minimizing the loss between the true distances and the heuristic estimates subject to admissibility and consistency constraints. For a well-motivated choice of loss function, we show performing this optimization is tractable. In fact, it corresponds to a recently proposed method for dimensionality reduction. We prove this optimization is guaranteed to produce admissible and consistent heuristics, show that it generalizes and gives insight into differential heuristics, and demonstrate experimentally that it produces strong heuristics on problems from three distinct search domains.

NeurIPS Conference 2011 Conference Paper

Variance Reduction in Monte-Carlo Tree Search

  • Joel Veness
  • Marc Lanctot
  • Michael Bowling

Monte-Carlo Tree Search (MCTS) has proven to be a powerful, generic planning technique for decision-making in single-agent and adversarial environments. The stochastic nature of the Monte-Carlo simulations introduces errors in the value estimates, both in terms of bias and variance. Whilst reducing bias (typically through the addition of domain knowledge) has been studied in the MCTS literature, comparatively little effort has focused on reducing variance. This is somewhat surprising, since variance reduction techniques are a well-studied area in classical statistics. In this paper, we examine the application of some standard techniques for variance reduction in MCTS, including common random numbers, antithetic variates and control variates. We demonstrate how these techniques can be applied to MCTS and explore their efficacy on three different stochastic, single-agent settings: Pig, Can't Stop and Dominion.
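
Of the techniques named, control variates are the easiest to show in isolation: subtract from each rollout return a correlated statistic with known expectation, scaled by the estimated optimal coefficient. A toy sketch, detached from any MCTS machinery:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(rng):
    """Toy stochastic evaluation: a return plus a correlated statistic
    whose expectation (zero) is known exactly."""
    luck = rng.normal()                              # shared randomness
    value = 5.0 + 2.0 * luck + rng.normal(scale=0.5)
    return value, luck

samples = [rollout(rng) for _ in range(10_000)]
v = np.array([s[0] for s in samples])
c = np.array([s[1] for s in samples])
beta = np.cov(v, c)[0, 1] / np.var(c)   # estimated optimal coefficient
adjusted = v - beta * (c - 0.0)         # subtract the known-mean statistic
print("plain standard error:", v.std() / np.sqrt(len(v)))
print("control-variate standard error:", adjusted.std() / np.sqrt(len(v)))
```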

IJCAI Conference 2009 Conference Paper

Learning a Value Analysis Tool for Agent Evaluation

  • Martha White
  • Michael Bowling

Evaluating an agent’s performance in a stochastic setting is necessary for agent development, scientific evaluation, and competitions. Traditionally, evaluation is done using Monte Carlo estimation; the magnitude of the stochasticity in the domain or the high cost of sampling, however, can often prevent the approach from resulting in statistically significant conclusions. Recently, an advantage sum technique has been proposed for constructing unbiased, low variance estimates of agent performance. The technique requires an expert to define a value function over states of the system, essentially a guess of the state’s unknown value. In this work, we propose learning this value function from past interactions between agents in some target population. Our learned value functions have two key advantages: they can be applied in domains where no expert value function is available and they can result in tuned evaluation for a specific population of agents (e.g., novice versus advanced agents). We demonstrate these two advantages in the domain of poker. We show that we can reduce variance over state-of-the-art estimators for a specific population of limit poker players as well as construct the first variance reducing estimators for no-limit poker and multi-player limit poker.

IJCAI Conference 2009 Conference Paper

Probabilistic State Translation in Extensive Games with Large Action Sets

  • David Schnizlein
  • Michael Bowling
  • Duane Szafron

Equilibrium or near-equilibrium solutions to very large extensive form games are often computed by using abstractions to reduce the game size. A common abstraction technique for games with a large number of available actions is to restrict the number of legal actions in every state. This method has been used to discover equilibrium solutions for the game of no-limit heads-up Texas Hold’em. When using a solution to an abstracted game to play one side in the un-abstracted (real) game, the real opponent actions may not correspond to actions in the abstracted game. The most popular method for handling this situation is to translate opponent actions in the real game to the closest legal actions in the abstracted game. We show that this approach can result in a very exploitable player and propose an alternative solution. We use probabilistic mapping to translate a real action into a probability distribution over actions, whose weights are determined by a similarity metric. We show that this approach significantly reduces the exploitability when using an abstract solution in the real game.
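
A minimal sketch of the idea; the similarity metric used here (a log-scale distance with a temperature) is a hypothetical stand-in for the paper's metric. Instead of snapping a real bet to the nearest abstract bet, the observed action is spread over nearby abstract bets in proportion to similarity.

```python
import math

def translate(real_bet, abstract_bets, temperature=1.0):
    """Map a real action (bet size) to a probability distribution over
    abstract actions, weighted by similarity to the observed bet."""
    weights = [math.exp(-abs(math.log(real_bet / b)) / temperature)
               for b in abstract_bets]
    total = sum(weights)
    return {b: w / total for b, w in zip(abstract_bets, weights)}

# A 130-chip bet is split mostly between the 100 and 200 abstract bets.
print(translate(real_bet=130, abstract_bets=[50, 100, 200, 400]))
```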

AAMAS Conference 2009 Conference Paper

Abstraction Pathologies in Extensive Games

  • Kevin Waugh
  • David Schnizlein
  • Michael Bowling
  • Duane Szafron

Extensive games can be used to model many scenarios in which multiple agents interact with an environment. There has been considerable recent research on finding strong strategies in very large, zero-sum extensive games. The standard approach in such work is to employ abstraction techniques to derive a more tractably sized game. An extensive game solver is then employed to compute an equilibrium in that abstract game, and the resulting strategy is presumed to be strong in the full game. Progress in this line of research has focused on solving larger abstract games, which more closely resemble the full game. However, there is an underlying assumption that by abstracting less, and solving a larger game, an agent will have a stronger strategy in the full game. In this work we show that this assumption is not true in general. Refining an abstraction can actually lead to a weaker strategy. We show examples of these abstraction pathologies in a small game of poker that can be analyzed exactly. These examples show that pathologies arise when abstracting both chance nodes as well as a player’s actions. In summary, this paper shows that the standard approach to finding strong strategies for large extensive games rests on shaky ground.

NeurIPS Conference 2009 Conference Paper

Monte Carlo Sampling for Regret Minimization in Extensive Games

  • Marc Lanctot
  • Kevin Waugh
  • Martin Zinkevich
  • Michael Bowling

Sequential decision-making with multiple agents and imperfect information is commonly modeled as an extensive game. One efficient method for computing Nash equilibria in large, zero-sum, imperfect information games is counterfactual regret minimization (CFR). In the domain of poker, CFR has proven effective, particularly when using a domain-specific augmentation involving chance outcome sampling. In this paper, we describe a general family of domain-independent CFR sample-based algorithms called Monte Carlo counterfactual regret minimization (MCCFR) of which the original and poker-specific versions are special cases. We start by showing that MCCFR performs the same regret updates as CFR in expectation. Then, we introduce two sampling schemes, outcome sampling and external sampling, showing that both have bounded overall regret with high probability. Thus, they can compute an approximate equilibrium using self-play. Finally, we prove a new, tighter bound on the regret for the original CFR algorithm and relate this new bound to MCCFR's bounds. We show empirically that, although the sample-based algorithms require more iterations, their lower cost per iteration can lead to dramatically faster convergence in various games.

NeurIPS Conference 2009 Conference Paper

Strategy Grafting in Extensive Games

  • Kevin Waugh
  • Nolan Bard
  • Michael Bowling

Extensive games are often used to model the interactions of multiple agents within an environment. Much recent work has focused on increasing the size of an extensive game that can be feasibly solved. Despite these improvements, many interesting games are still too large for such techniques. A common approach for computing strategies in these large games is to first employ an abstraction technique to reduce the original game to an abstract game that is of a manageable size. This abstract game is then solved and the resulting strategy is used in the original game. Most top programs in recent AAAI Computer Poker Competitions use this approach. The trend in this competition has been that strategies found in larger abstract games tend to beat strategies found in smaller abstract games. These larger abstract games have more expressive strategy spaces and therefore contain better strategies. In this paper we present a new method for computing strategies in large games. This method allows us to compute more expressive strategies without increasing the size of abstract games that we are required to solve. We demonstrate the power of the approach experimentally in both small and large games, while also providing a theoretical justification for the resulting improvement.

AAMAS Conference 2008 Conference Paper

Autonomous Geocaching: Navigation and Goal Finding in Outdoor Domains

  • James Neufeld
  • Michael Bowling
  • Jason Roberts
  • Stephen Walsh
  • Adam Milstein
  • Michael Sokolsky

This paper describes an autonomous robot system designed to solve the challenging task of geocaching. Geocaching involves locating a goal object in an outdoor environment given only its rough GPS position. No additional information about the environment such as road maps, waypoints, or obstacle descriptions is provided, nor is there often a simple straight-line path to the object. This is in contrast to much of the research in robot navigation which often focuses on common structural features, e.g., road following, curb avoidance, or indoor navigation. In addition, uncertainty in GPS positions requires a final local search of the target area after completing the challenging navigation problem. We describe a relatively simple robotic system for completing this task. This system addresses three main issues: building a map from raw sensor readings, navigating to the target region, and searching for the target object. We demonstrate the effectiveness of this system in a variety of complex outdoor environments and compare our system’s performance to that of a human expert teleoperating the robot.

AAMAS Conference 2008 Conference Paper

Sigma Point Policy Iteration

  • Michael Bowling
  • Alborz Geramifard
  • David Wingate

In reinforcement learning, least-squares temporal difference methods (e.g., LSTD and LSPI) are effective, data-efficient techniques for policy evaluation and control with linear value function approximation. These algorithms rely on policy-dependent expectations of the transition and reward functions, which require all experience to be remembered and iterated over for each new policy evaluated. We propose to summarize experience with a compact policy-independent Gaussian model. We show how this policy-independent model can be transformed into a policy-dependent form and used to perform policy evaluation. Because closed-form transformations are rarely available, we introduce an efficient sigma point approximation. We show that the resulting Sigma-Point Policy Iteration algorithm (SPPI) is mathematically equivalent to LSPI for tabular representations and empirically demonstrate comparable performance for approximate representations. However, the experience does not need to be saved or replayed, meaning that for even moderate amounts of experience, SPPI is an order of magnitude faster than LSPI.
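
The sigma-point approximation at the heart of SPPI comes from the unscented transform. A generic sketch (not the SPPI-specific transformation): choose 2n+1 deterministic points that exactly match a Gaussian's mean and covariance, push each through the nonlinearity, and recombine with fixed weights.

```python
import numpy as np

def sigma_points(mean, cov, kappa=1.0):
    """Deterministic sigma-point set for a Gaussian: 2n+1 points whose
    weighted empirical mean and covariance match (mean, cov) exactly."""
    n = len(mean)
    L = np.linalg.cholesky((n + kappa) * cov)
    pts = [mean] + [mean + L[:, i] for i in range(n)] \
                 + [mean - L[:, i] for i in range(n)]
    weights = np.array([kappa / (n + kappa)] + [1.0 / (2 * (n + kappa))] * (2 * n))
    return np.array(pts), weights

mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
pts, w = sigma_points(mean, cov)
f = lambda x: np.array([np.sin(x[0]), x[1] ** 2])   # some nonlinear map
approx_mean = (w[:, None] * np.array([f(p) for p in pts])).sum(axis=0)
print("sigma-point estimate of E[f(x)]:", approx_mean)
```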

IJCAI Conference 2007 Conference Paper

Automatic Gait Optimization with Gaussian Process Regression

  • Daniel Lizotte
  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans

Gait optimization is a basic yet challenging problem for both quadrupedal and bipedal robots. Although techniques for automating the process exist, most involve local function optimization procedures that suffer from three key drawbacks. Local optimization techniques are naturally plagued by local optima, make no use of the expensive gait evaluations once a local step is taken, and do not explicitly model noise in gait evaluation. These drawbacks increase the need for a large number of gait evaluations, making optimization slow, data-inefficient, and manually intensive. We present a Bayesian approach based on Gaussian process regression that addresses all three drawbacks. It uses a global search strategy based on a posterior model inferred from all of the individual noisy evaluations. We demonstrate the technique on a quadruped robot, using it to optimize two different criteria: speed and smoothness. We show in both cases our technique requires dramatically fewer gait evaluations than state-of-the-art local gradient approaches.
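
A compact sketch of the skeleton of such an approach: one gait parameter, a synthetic "speed" function, and expected improvement as the acquisition rule (the paper studies related improvement criteria on a real quadruped). Each iteration fits a GP posterior to all noisy trials so far and picks the next trial globally rather than by a local step.

```python
import math
import numpy as np

def rbf(a, b, ell=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ell) ** 2)

def gp_posterior(X, y, Xs, noise=1e-2):
    """GP regression posterior mean/std at query points Xs (RBF kernel)."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best):
    z = (mu - best) / sigma
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mu - best) * Phi + sigma * phi

# Toy "gait speed" as a function of one gait parameter, observed with noise.
speed = lambda p: math.sin(3 * p) + 0.5 * p
rng = np.random.default_rng(0)
X = list(rng.uniform(0, 2, size=3))
y = [speed(x) + 0.1 * rng.normal() for x in X]
grid = np.linspace(0, 2, 200)
for _ in range(10):                    # each iteration = one expensive gait trial
    mu, sd = gp_posterior(np.array(X), np.array(y), grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, max(y)))]
    X.append(x_next); y.append(speed(x_next) + 0.1 * rng.normal())
print("best parameter found:", X[int(np.argmax(y))])
```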

NeurIPS Conference 2007 Conference Paper

Computing Robust Counter-Strategies

  • Michael Johanson
  • Martin Zinkevich
  • Michael Bowling

Adaptation to other initially unknown agents often requires computing an effective counter-strategy. In the Bayesian paradigm, one must find a good counter-strategy to the inferred posterior of the other agents' behavior. In the experts paradigm, one may want to choose experts that are good counter-strategies to the other agents' expected behavior. In this paper we introduce a technique for computing robust counter-strategies for adaptation in multiagent scenarios under a variety of paradigms. The strategies can take advantage of a suspected tendency in the decisions of the other agents, while bounding the worst-case performance when the tendency is not observed. The technique involves solving a modified game, and therefore can make use of recently developed algorithms for solving very large extensive games. We demonstrate the effectiveness of the technique in two-player Texas Hold'em. We show that the computed poker strategies are substantially more robust than best response counter-strategies, while still exploiting a suspected tendency. We also compose the generated strategies in an experts algorithm showing a dramatic improvement in performance over using simple best responses.

NeurIPS Conference 2007 Conference Paper

Regret Minimization in Games with Incomplete Information

  • Martin Zinkevich
  • Michael Johanson
  • Michael Bowling
  • Carmelo Piccione

Extensive games are a powerful model of multiagent decision-making scenarios with incomplete information. Finding a Nash equilibrium for very large instances of these games has received a great deal of recent attention. In this paper, we describe a new technique for solving large games based on regret minimization. In particular, we introduce the notion of counterfactual regret, which exploits the degree of incomplete information in an extensive game. We show how minimizing counterfactual regret minimizes overall regret, and therefore in self-play can be used to compute a Nash equilibrium. We demonstrate this technique in the domain of poker, showing we can solve abstractions of limit Texas Hold’em with as many as $10^{12}$ states, two orders of magnitude larger than previous methods.
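
Counterfactual regret minimization builds on regret matching, which is simple to demonstrate in a normal-form game: in rock-paper-scissors self-play, the average strategies converge to the Nash equilibrium. CFR applies the same update at every information set, using counterfactual values in place of the payoffs below.

```python
import numpy as np

def regret_matching(cum_regret):
    pos = np.maximum(cum_regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(len(pos), 1 / len(pos))

# Rock-paper-scissors payoff matrix for player 1; regret-matching self-play.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)
R1, R2 = np.zeros(3), np.zeros(3)
avg1 = np.zeros(3)
T = 20_000
for _ in range(T):
    p1, p2 = regret_matching(R1), regret_matching(R2)
    u1 = A @ p2        # expected payoff of each pure action for player 1
    u2 = -A.T @ p1     # and for player 2 (zero-sum)
    R1 += u1 - p1 @ u1 # regret for not having played each pure action
    R2 += u2 - p2 @ u2
    avg1 += p1
print("average strategy, converging to (1/3, 1/3, 1/3):", avg1 / T)
```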

NeurIPS Conference 2007 Conference Paper

Stable Dual Dynamic Programming

  • Tao Wang
  • Michael Bowling
  • Dale Schuurmans
  • Daniel Lizotte

Recently, we have introduced a novel approach to dynamic programming and reinforcement learning that is based on maintaining explicit representations of stationary distributions instead of value functions. In this paper, we investigate the convergence properties of these dual algorithms both theoretically and empirically, and show how they can be scaled up by incorporating function approximation.

AAAI Conference 2006 Conference Paper

Compact, Convex Upper Bound Iteration for Approximate POMDP Planning

  • Tao Wang
  • Michael Bowling

Partially observable Markov decision processes (POMDPs) are an intuitive and general way to model sequential decision making problems under uncertainty. Unfortunately, even approximate planning in POMDPs is known to be hard, and developing heuristic planners that can deliver reasonable results in practice has proved to be a significant challenge. In this paper, we present a new approach to approximate value-iteration for POMDP planning that is based on quadratic rather than piecewise linear function approximators. Specifically, we approximate the optimal value function by a convex upper bound composed of a fixed number of quadratics, and optimize it at each stage by semidefinite programming. We demonstrate that our approach can achieve competitive approximation quality to current techniques while still maintaining a bounded size representation of the function approximator. Moreover, an upper bound on the optimal value function can be preserved if required. Overall, the technique requires computation time and space that is only linear in the number of iterations (horizon time).

NeurIPS Conference 2006 Conference Paper

iLSTD: Eligibility Traces and Convergence Analysis

  • Alborz Geramifard
  • Michael Bowling
  • Martin Zinkevich
  • Richard Sutton

We present new theoretical and empirical results with the iLSTD algorithm for policy evaluation in reinforcement learning with linear function approximation. iLSTD is an incremental method for achieving results similar to LSTD, the data-efficient, least-squares version of temporal difference learning, without incurring the full cost of the LSTD computation. LSTD is O($n^2$), where $n$ is the number of parameters in the linear function approximator, while iLSTD is O($n$). In this paper, we generalize the previous iLSTD algorithm and present three new results: (1) the first convergence proof for an iLSTD algorithm; (2) an extension to incorporate eligibility traces without changing the asymptotic computational complexity; and (3) the first empirical results with an iLSTD algorithm for a problem (mountain car) with feature vectors large enough ($n = 10,000$) to show substantial computational advantages over LSTD.
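
A minimal sketch of one iLSTD time step (illustrative; the step-size and dimension-selection schedule follow the paper only loosely): fold the transition into the LSTD statistics A and b, then descend only along the coordinate with the largest residual mu = b - A @ theta.

```python
import numpy as np

def ilstd_step(A, b, mu, theta, phi, phi_next, reward,
               gamma=0.99, alpha=0.01, m=1):
    """Fold one transition into the LSTD statistics, then take m
    single-coordinate descent steps on the residual mu = b - A @ theta."""
    d = phi - gamma * phi_next
    A += np.outer(phi, d)                    # A += phi (phi - gamma phi')^T
    b += reward * phi
    mu += reward * phi - phi * (d @ theta)   # keep mu = b - A @ theta current
    for _ in range(m):
        j = int(np.argmax(np.abs(mu)))       # dimension with largest residual
        delta = alpha * mu[j]
        theta[j] += delta
        mu -= delta * A[:, j]                # O(n) residual maintenance
    return A, b, mu, theta

# Toy usage with sparse binary features.
rng = np.random.default_rng(0)
n = 50
A, b = np.zeros((n, n)), np.zeros(n)
mu, theta = np.zeros(n), np.zeros(n)
for _ in range(1000):
    phi = (rng.random(n) < 0.1).astype(float)
    phi2 = (rng.random(n) < 0.1).astype(float)
    A, b, mu, theta = ilstd_step(A, b, mu, theta, phi, phi2, reward=rng.normal())
print("largest |theta| component:", np.abs(theta).max())
```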

AAAI Conference 2006 Conference Paper

Subjective Mapping

  • Michael Bowling

Extracting a map from a stream of experience is a key problem in robotics and artificial intelligence in general. We propose a technique, called subjective mapping, that seeks to learn a fully specified predictive model, or map, without the need for expert provided models of the robot’s motion and sensor apparatus. We briefly overview the recent advancements presented elsewhere (ICML, IJCAI, and ISRR) that make this possible, examine its significance in relationship to other developments in the field, and outline open issues that remain to be addressed.

AAAI Conference 2005 Conference Paper

Coordination and Adaptation in Impromptu Teams

  • Michael Bowling

Coordinating a team of autonomous agents is one of the major challenges in building effective multiagent systems. Many techniques have been devised for this problem, and coordinated teamwork has been demonstrated even in highly dynamic and adversarial environments. A key assumption of these techniques, though, is that the team members are developed together as a whole. In many multiagent scenarios, this assumption is violated. We study the problem of coordination in impromptu teams, where a team is composed of independent agents each unknown to the others. The team members have their own skills, models, strategies, and coordination mechanisms, and no external organization is imposed upon them. In particular, we propose two techniques, one adaptive and one predictive, for coordinating a single agent that joins an unknown team of existing agents. We experimentally evaluate these mechanisms in the robot soccer domain, while introducing useful baselines for evaluating the performance of impromptu teams. We show some encouraging success while demonstrating this is a very fertile area of research.

IJCAI Conference 2005 Conference Paper

Learning Subjective Representations for Planning

  • Dana Wilkinson
  • Michael Bowling
  • Ali Ghodsi

Planning involves using a model of an agent’s actions to find a sequence of decisions which achieve a desired goal. It is usually assumed that the models are given, and such models often require expert knowledge of the domain. This paper explores subjective representations for planning that are learned directly from agent observations and actions (requiring no initial domain knowledge). A non-linear embedding technique called Action Respecting Embedding is used to construct such a representation. It is then shown how to extract the effects of the agent’s actions as operators in this learned representation. Finally, the learned representation and operators are combined with search to find sequences of actions that achieve given goals. The efficacy of this technique is demonstrated in a challenging robot-vision-inspired image domain.

NeurIPS Conference 2005 Conference Paper

Online Discovery and Learning of Predictive State Representations

  • Peter Mccracken
  • Michael Bowling

Predictive state representations (PSRs) are a method of modeling dynamical systems using only observable data, such as actions and observations, to describe their model. PSRs use predictions about the outcome of future tests to summarize the system state. The best existing techniques for discovery and learning of PSRs use a Monte Carlo approach to explicitly estimate these outcome probabilities. In this paper, we present a new algorithm for discovery and learning of PSRs that uses a gradient descent approach to compute the predictions for the current state. The algorithm takes advantage of the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to constantly improve its prediction quality; something that current state-of-the-art discovery and learning algorithms are unable to do. We give empirical results to show that our constrained gradient algorithm is able to discover core tests using very small amounts of data, and with larger amounts of data can compute accurate predictions of the system dynamics.

NeurIPS Conference 2004 Conference Paper

Convergence and No-Regret in Multiagent Learning

  • Michael Bowling

Learning in a multiagent system is a challenging problem due to two key factors. First, if other agents are simultaneously learning then the environment is no longer stationary, thus undermining convergence guarantees. Second, learning is often susceptible to deception, where the other agents may be able to exploit a learner's particular dynamics. In the worst case, this could result in poorer performance than if the agent was not learning at all. These challenges are identifiable in the two most common evaluation criteria for multiagent learning algorithms: convergence and regret. Algorithms focusing on convergence or regret in isolation are numerous. In this paper, we seek to address both criteria in a single algorithm by introducing GIGA-WoLF, a learning algorithm for normal-form games. We prove the algorithm guarantees at most zero average regret, while demonstrating the algorithm converges in many situations of self-play. We prove convergence in a limited setting and give empirical results in a wider variety of situations. These results also suggest a third new learning criterion combining convergence and regret, which we call negative non-convergence regret (NNR).

IJCAI Conference 2003 Conference Paper

Simultaneous Adversarial Multi-Robot Learning

  • Michael Bowling
  • Manuela Veloso

Multi-robot learning faces all of the challenges of robot learning with all of the challenges of multiagent learning. There has been a great deal of recent research on multiagent reinforcement learning in stochastic games, which is the intuitive extension of MDPs to multiple agents. This recent work, although general, has only been applied to small games with at most hundreds of states. On the other hand robot tasks have continuous, and often complex, state and action spaces. Robot learning tasks demand approximation and generalization techniques, which have only received extensive attention in single-agent learning. In this paper we introduce GraWoLF, a general-purpose, scalable, multiagent learning algorithm. It combines gradient-based policy learning techniques with the WoLF ("Win or Learn Fast") variable learning rate. We apply this algorithm to an adversarial multi-robot task with simultaneous learning. We show results of learning both in simulation and on the real robots. These results demonstrate that GraWoLF can learn successful policies, overcoming the many challenges in multi-robot learning.

IJCAI Conference 1999 Conference Paper

Bounding the Suboptimality of Reusing Subproblems

  • Michael Bowling
  • Manuela Veloso

We are interested in the problem of determining a course of action to achieve a desired objective in a non-deterministic environment. Markov decision processes (MDPs) provide a framework for representing this action selection problem, and there are a number of algorithms that learn optimal policies within this formulation. This framework has also been used to study state space abstraction, problem decomposition, and policy reuse. These techniques sacrifice optimality of their solution for improved learning speed. In this paper we examine the suboptimality of reusing policies that are solutions to subproblems. This is done within a restricted class of MDPs, namely those where non-zero reward is received only upon reaching a goal state. We introduce the definition of a subproblem within this class and provide motivation for how reuse of subproblem solutions can speed up learning. The contribution of this paper is the derivation of a tight bound on the loss in optimality from this reuse. We examine a bound that is based on Bellman error, which applies to all MDPs, but is not tight enough to be useful. We contribute our own theoretical result that gives an empirically tight bound on this suboptimality.