Arrow Research search

Author name cluster

Satinder Singh

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

103 papers
1 author row

Possible papers

103

NeurIPS Conference 2025 Conference Paper

Generating Creative Chess Puzzles

  • Xidong Feng
  • Vivek Veeriah
  • Marcus Chiam
  • Michael Dennis
  • Federico Barbero
  • Johan Obando Ceron
  • Jiaxin Shi
  • Satinder Singh

While Generative AI rapidly advances in various domains, generating truly creative, aesthetic, and counter-intuitive outputs remains a challenge. This paper presents an approach to tackle these difficulties in the domain of chess puzzles. We start by benchmarking Generative AI architectures, and then introduce an RL framework with novel rewards based on chess engine search statistics to overcome some of those shortcomings. The rewards are designed to enhance a puzzle's uniqueness, counter-intuitiveness, diversity, and realism. Our RL approach dramatically increases counter-intuitive puzzle generation by 10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and the best Lichess-trained model (0.4%). Our puzzles meet novelty and diversity benchmarks, retain aesthetic themes, and are rated by human experts as more creative, enjoyable, and counter-intuitive than composed book puzzles, even approaching classic compositions. Our final outcome is a curated booklet of these novel AI-generated puzzles, whose creativity has been acknowledged by three world-renowned experts.

NeurIPS Conference 2025 Conference Paper

Plasticity as the Mirror of Empowerment

  • David Abel
  • Michael Bowling
  • Andre Barreto
  • Will Dabney
  • Shi Dong
  • Steven Hansen
  • Anna Harutyunyan
  • Khimya Khetarpal

Agents are minimally entities that are influenced by their past observations and act to influence future observations. The latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. The former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
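
For context, the directed information of Massey (1990), which the abstract says the new quantity strictly generalizes, is the standard causal analogue of mutual information; the paper's generalized version is not reproduced here. In the usual notation,

$$ I(X^{n} \to Y^{n}) \;=\; \sum_{t=1}^{n} I(X^{t}; Y_{t} \mid Y^{t-1}), \qquad X^{t} = (X_1, \ldots, X_t), $$

so empowerment-style quantities measure influence flowing from an agent's actions to its future observations, and plasticity reverses the direction of that influence.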

NeurIPS Conference 2023 Conference Paper

A Definition of Continual Reinforcement Learning

  • David Abel
  • Andre Barreto
  • Benjamin Van Roy
  • Doina Precup
  • Hado P. van Hasselt
  • Satinder Singh

In a standard view of the reinforcement learning problem, an agent’s goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that “never stop learning” through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.

NeurIPS Conference 2023 Conference Paper

Combining Behaviors with the Successor Features Keyboard

  • Wilka Carvalho Carvalho
  • Andre Saraiva
  • Angelos Filos
  • Andrew Lampinen
  • Loic Matthey
  • Richard L Lewis
  • Honglak Lee
  • Satinder Singh

The Option Keyboard (OK) was recently proposed as a method for transferring behavioral knowledge across tasks. OK transfers knowledge by adaptively combining subsets of known behaviors using Successor Features (SFs) and Generalized Policy Improvement (GPI). However, it relies on hand-designed state-features and task encodings which are cumbersome to design for every new environment. In this work, we propose the "Successor Features Keyboard" (SFK), which enables transfer with discovered state-features and task encodings. To enable discovery, we propose the "Categorical Successor Feature Approximator" (CSFA), a novel learning algorithm for estimating SFs while jointly discovering state-features and task encodings. With SFK and CSFA, we achieve the first demonstration of transfer with SFs in a challenging 3D environment where all the necessary representations are discovered. We first compare CSFA against other methods for approximating SFs and show that only CSFA discovers representations compatible with SF&GPI at this scale. We then compare SFK against transfer learning baselines and show that it transfers most quickly to long-horizon tasks.

NeurIPS Conference 2023 Conference Paper

Large Language Models can Implement Policy Iteration

  • Ethan Brooks
  • Logan Walls
  • Richard L Lewis
  • Satinder Singh

In this work, we demonstrate a method for implementing policy iteration using a large language model. While the application of foundation models to RL has received considerable attention, most approaches rely on either (1) the curation of expert demonstrations (either through manual design or task-specific pretraining) or (2) adaptation to the task of interest using gradient methods (either fine-tuning or training of adapter layers). Both of these techniques have drawbacks. Collecting demonstrations is labor-intensive, and algorithms that rely on them do not outperform the experts from which the demonstrations were derived. All gradient techniques are inherently slow, sacrificing the “few-shot” quality that makes in-context learning attractive to begin with. Our method demonstrates that a large language model can be used to implement policy iteration using the machinery of in-context learning, enabling it to learn to perform RL tasks without expert demonstrations or gradients. Our approach iteratively updates the contents of the prompt from which it derives its policy through trial-and-error interaction with an RL environment. In order to eliminate the role of in-weights learning (on which approaches like Decision Transformer rely heavily), we demonstrate our method using Codex (M. Chen et al. 2021b), a language model with no prior knowledge of the domains on which we evaluate it.
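
As a rough, hypothetical sketch of the loop described above (none of the names, prompt format, or environment interface below come from the paper), the in-context policy-iteration idea can be pictured as: act according to whatever the prompt currently suggests, then fold the observed returns back into the prompt.

```python
# Hypothetical sketch only: `llm_complete` stands in for a call to a code
# language model, and `env` is assumed to have a gym-like reset/step API.

def llm_complete(prompt: str) -> int:
    """Placeholder: return the action index the LLM completes the prompt with."""
    raise NotImplementedError

def run_episode(env, transcript: list[str]) -> float:
    """One iteration of prompt-based 'policy iteration': derive actions from
    the current prompt, then append the new experience (with its return)
    so that later prompts reflect what worked."""
    header = "Past experience (state, action, return):\n"
    state, done, episode, ret = env.reset(), False, [], 0.0
    while not done:
        prompt = header + "\n".join(transcript) + f"\nstate={state} action="
        action = llm_complete(prompt)
        state, reward, done = env.step(action)
        episode.append((state, action))
        ret += reward
    transcript.extend(f"state={s} action={a} return={ret:.2f}" for s, a in episode)
    return ret
```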

NeurIPS Conference 2023 Conference Paper

Optimistic Meta-Gradients

  • Sebastian Flennerhag
  • Tom Zahavy
  • Brendan O'Donoghue
  • Hado P. van Hasselt
  • András György
  • Satinder Singh

We study the connection between gradient-based meta-learning and convex optimisation. We observe that gradient descent with momentum is a special case of meta-gradients, and building on recent results in optimisation, we prove convergence rates for meta-learning in the single-task setting. While a meta-learned update rule can yield faster convergence up to a constant factor, it is not sufficient for acceleration. Instead, some form of optimism is required. We show that optimism in meta-learning can be captured through the recently proposed Bootstrapped Meta-Gradient (Flennerhag et al., 2022) method, providing deeper insight into its underlying mechanics.

TMLR Journal 2023 Journal Article

POMRL: No-Regret Learning-to-Plan with Increasing Horizons

  • Khimya Khetarpal
  • Claire Vernade
  • Brendan O'Donoghue
  • Satinder Singh
  • Tom Zahavy

We study the problem of planning under model uncertainty in an online meta-reinforcement learning (RL) setting where an agent is presented with a sequence of related tasks with limited interactions per task. The agent can use its experience in each task and across tasks to estimate both the transition model and the distribution over tasks. We propose an algorithm to meta-learn the underlying relatedness across tasks, utilize it to plan in each task, and upper-bound the regret of the planning loss. Our bound suggests that the average regret over tasks decreases as the number of tasks increases and as the tasks are more similar. In the classical single-task setting, it is known that the planning horizon should depend on the estimated model's accuracy, that is, on the number of samples within the task. We generalize this finding to meta-RL and study this dependence of planning horizons on the number of tasks. Based on our theoretical findings, we derive heuristics for selecting slowly increasing discount factors, and we validate their significance empirically.

NeurIPS Conference 2023 Conference Paper

Structured State Space Models for In-Context Reinforcement Learning

  • Chris Lu
  • Yannick Schroecker
  • Albert Gu
  • Emilio Parisotto
  • Jakob Foerster
  • Satinder Singh
  • Feryal Behbahani

Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers in sequence length and performs better than RNNs on a simple memory-based task. We evaluate our modified architecture on a set of partially-observable environments and find that, in practice, our model outperforms RNNs while also running over five times faster. Then, by leveraging the model's ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper show that structured state space models are fast and performant for in-context reinforcement learning tasks. We provide code at https://github.com/luchris429/s5rl.
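
The reset mechanism can be pictured with a toy linear recurrence (a minimal sketch, not the authors' S4 modification itself; the names and shapes below are assumptions): zeroing the state-carry coefficient at episode boundaries resets the hidden state while keeping the per-step map affine, which is what allows the whole sequence to be processed with a parallel scan.

```python
import numpy as np

def resettable_linear_recurrence(a, b, done):
    """h_t = a_t * h_{t-1} + b_t, with h reset to zero whenever done_t = 1.

    a, b: arrays of shape (T, d) -- per-step decay and input terms.
    done: array of shape (T,)    -- 1 at an episode boundary, else 0.
    Written as a loop for clarity; because each step remains an affine map,
    the same result can be computed with an associative (parallel) scan.
    """
    T, d = b.shape
    h, out = np.zeros(d), np.empty((T, d))
    for t in range(T):
        carry = a[t] * (1.0 - done[t])  # kill the carry at a reset
        h = carry * h + b[t]
        out[t] = h
    return out
```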

AAAI Conference 2022 Conference Paper

Adaptive Pairwise Weights for Temporal Credit Assignment

  • Zeyu Zheng
  • Risto Vuorio
  • Richard Lewis
  • Satinder Singh

How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient, lambda (treated as a hyperparameter), raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two. Of course, it isn't clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters, we develop a meta-gradient procedure for learning these weight functions during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions during learning of the policy to achieve better performance than competing approaches.
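
In symbols (notation assumed here, not taken from the paper): the classical heuristic credits the pair $(s_t, a_t)$ for a reward received $k$ steps later with weight $\lambda^{k}$, whereas the proposed scheme replaces that fixed schedule with a learned pairwise weight,

$$ w_{\text{classic}}(s_t, s_{t+k}, k) = \lambda^{k} \quad\longrightarrow\quad w_{\eta}(s_t, s_{t+k}, k), $$

with the parameters $\eta$ trained by meta-gradients alongside the usual policy update.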

NeurIPS Conference 2022 Conference Paper

Approximate Value Equivalence

  • Christopher Grimm
  • Andre Barreto
  • Satinder Singh

Model-based reinforcement learning agents must make compromises about which aspects of the environment their models should capture. The value equivalence (VE) principle posits that these compromises should be made considering the model's eventual use in value-based planning. Given sets of functions and policies, a model is said to be order-$k$ VE to the environment if $k$ applications of the Bellman operators induced by the policies produce the correct result when applied to the functions. Prior work investigated the classes of models induced by VE when we vary $k$ and the sets of policies and functions. This gives rise to a rich collection of topological relationships and conditions under which VE models are optimal for planning. Despite this effort, relatively little is known about the planning performance of models that fail to satisfy these conditions. This is due to the rigidity of the VE formalism, as classes of VE models are defined with respect to exact constraints on their Bellman operators. This limitation gets amplified by the fact that such constraints themselves may depend on functions that can only be approximated in practice. To address these problems we propose approximate value equivalence (AVE), which extends the VE formalism by replacing equalities with error tolerances. This extension allows us to show that AVE models with respect to one set of functions are also AVE with respect to any other set of functions if we tolerate a high enough error. We can then derive bounds on the performance of VE models with respect to arbitrary sets of functions. Moreover, AVE models more accurately reflect what can be learned by our agents in practice, allowing us to investigate previously unexplored tensions between model capacity and the choice of VE model class. In contrast to previous works, we show empirically that there are situations where agents with limited capacity should prefer to learn more accurate models with respect to smaller sets of functions over less accurate models with respect to larger sets of functions.
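
A sketch of the two definitions in assumed notation: writing $\mathcal{T}^{m}_{\pi}$ for the Bellman operator that model $m$ induces for policy $\pi$ and $\mathcal{T}_{\pi}$ for the environment's own operator, order-$k$ VE requires

$$ (\mathcal{T}^{m}_{\pi})^{k} v = (\mathcal{T}_{\pi})^{k} v \qquad \text{for all } \pi \in \Pi,\ v \in \mathcal{V}, $$

while AVE, as described above, relaxes the equality to $\big\| (\mathcal{T}^{m}_{\pi})^{k} v - (\mathcal{T}_{\pi})^{k} v \big\| \le \epsilon$ for some error tolerance $\epsilon$ (the particular norm is left unspecified here).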

EWRL Workshop 2022 Workshop Paper

Discovering Policies with DOMiNO: Diversity Optimization Maintaining Near Optimality

  • Tom Zahavy
  • Yannick Schroecker
  • Feryal Behbahani
  • Kate Baumli
  • Sebastian Flennerhag
  • Shaobo Hou
  • Satinder Singh

Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
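
The constrained problem described above can be sketched (in assumed notation) as maximizing diversity subject to a near-optimality constraint:

$$ \max_{\pi_1, \ldots, \pi_n} \; \mathrm{Diversity}\big(d_{\pi_1}, \ldots, d_{\pi_n}\big) \quad \text{s.t.} \quad J(\pi_i) \ge \alpha\, J(\pi^{*}) \;\; \text{for all } i, $$

where $d_{\pi}$ is the state occupancy of policy $\pi$, $J$ is the extrinsic return, $\pi^{*}$ is an optimal policy, the diversity term is a distance between occupancies, and $\alpha \in [0, 1]$ is one of the interpretable knobs the abstract refers to.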

IJCAI Conference 2022 Conference Paper

On the Expressivity of Markov Reward (Extended Abstract)

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael L. Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. We here set out to understand the expressivity of Markov reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task": (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to perform each task type, and correctly determine when no such reward function exists.

NeurIPS Conference 2022 Conference Paper

Palm up: Playing in the Latent Manifold for Unsupervised Pretraining

  • Hao Liu
  • Tom Zahavy
  • Volodymyr Mnih
  • Satinder Singh

Large and diverse datasets have been the cornerstones of many impressive advancements in artificial intelligence. Intelligent creatures, however, learn by interacting with the environment, which changes the input sensory signals and the state of the environment. In this work, we aim to bring the best of both worlds and propose an algorithm that exhibits an exploratory behavior whilst it utilizes large diverse datasets. Our key idea is to leverage deep generative models that are pretrained on static datasets and introduce a dynamic model in the latent space. The transition dynamics simply mix an action with a randomly sampled latent, apply an exponential moving average for temporal persistence, and decode the resulting latent to an image using the pretrained generator. We then employ an unsupervised reinforcement learning algorithm to explore in this environment and perform unsupervised representation learning on the collected data. We further leverage the temporal information of this data to pair data points as a natural supervision for representation learning. Our experiments suggest that the learned representations can be successfully transferred to downstream tasks in both vision and reinforcement learning domains.

NeurIPS Conference 2022 Conference Paper

Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction

  • Dilip Arumugam
  • Satinder Singh

The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we establish a computationally-intractable, exact planning algorithm that takes advantage of this measure to plan more efficiently. We then conclude by introducing a specific form of state abstraction with the potential to reduce BAMDP complexity, which gives rise to a computationally-tractable, approximate planning algorithm.

NeurIPS Conference 2021 Conference Paper

Discovery of Options via Meta-Learned Subgoals

  • Vivek Veeriah
  • Tom Zahavy
  • Matteo Hessel
  • Zhongwen Xu
  • Junhyuk Oh
  • Iurii Kemaev
  • Hado P. van Hasselt
  • David Silver

Temporal abstractions in the form of options have been shown to help reinforcement learning (RL) agents learn faster. However, despite prior work on this topic, the problem of discovering options through interaction with an environment remains a challenge. In this paper, we introduce a novel meta-gradient approach for discovering useful options in multi-task RL environments. Our approach is based on a manager-worker decomposition of the RL agent, in which a manager maximises rewards from the environment by learning a task-dependent policy over both a set of task-independent discovered-options and primitive actions. The option-reward and termination functions that define a subgoal for each option are parameterised as neural networks and trained via meta-gradients to maximise their usefulness. Empirical analysis on gridworld and DeepMind Lab tasks show that: (1) our approach can discover meaningful and diverse temporally-extended options in multi-task RL domains, (2) the discovered options are frequently used by the agent while learning to solve the training tasks, and (3) that the discovered options help a randomly initialised manager learn faster in completely new tasks.

AAAI Conference 2021 Conference Paper

Efficient Querying for Cooperative Probabilistic Commitments

  • Qi Zhang
  • Edmund H. Durfee
  • Satinder Singh

Multiagent systems can use commitments as the core of a general coordination infrastructure, supporting both cooperative and non-cooperative interactions. Agents whose objectives are aligned, and where one agent can help another achieve greater reward by sacrificing some of its own reward, should choose a cooperative commitment to maximize their joint reward. We present a solution to the problem of how cooperative agents can efficiently find an (approximately) optimal commitment by querying about carefully-selected commitment choices. We prove structural properties of the agents' values as functions of the parameters of the commitment specification, and develop a greedy method for composing a query with provable approximation bounds, which we empirically show can find nearly optimal commitments in a fraction of the time required by methods that lack our insights.

NeurIPS Conference 2021 Conference Paper

Learning State Representations from Random Deep Action-conditional Predictions

  • Zeyu Zheng
  • Vivek Veeriah
  • Risto Vuorio
  • Richard L Lewis
  • Satinder Singh

Our main contribution in this work is an empirical finding that random General Value Functions (GVFs), i.e., deep action-conditional predictions---random both in what feature of observations they predict as well as in the sequence of actions the predictions are conditioned upon---form good auxiliary tasks for reinforcement learning (RL) problems. In particular, we show that random deep action-conditional predictions when used as auxiliary tasks yield state representations that produce control performance competitive with state-of-the-art hand-crafted auxiliary tasks like value prediction, pixel control, and CURL in both Atari and DeepMind Lab tasks. In another set of experiments we stop the gradients from the RL part of the network to the state representation learning part of the network and show, perhaps surprisingly, that the auxiliary tasks alone are sufficient to learn state representations good enough to outperform an end-to-end trained actor-critic baseline. We open-sourced our code at https://github.com/Hwhitetooth/random_gvfs.
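
As a heavily simplified, hypothetical sketch of what a random action-conditional "question" might look like (the paper trains its GVF answers with temporal-difference-style deep networks, not the exact-match Monte Carlo targets used below for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_gvf(obs_dim: int, num_actions: int, depth: int):
    """A random question: predict a random linear feature of the observation
    reached after following a fixed random sequence of `depth` actions."""
    cumulant_w = rng.normal(size=obs_dim)
    action_seq = rng.integers(num_actions, size=depth)
    return cumulant_w, action_seq

def gvf_targets(observations, taken_actions, cumulant_w, action_seq):
    """Auxiliary regression targets at the time steps where the behaviour
    happened to match the question's action sequence."""
    depth = len(action_seq)
    targets = {}
    for t in range(len(taken_actions) - depth):
        if np.array_equal(taken_actions[t:t + depth], action_seq):
            targets[t] = float(observations[t + depth] @ cumulant_w)
    return targets
```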

NeurIPS Conference 2021 Conference Paper

On the Expressivity of Markov Reward

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

NeurIPS Conference 2021 Conference Paper

Proper Value Equivalence

  • Christopher Grimm
  • Andre Barreto
  • Greg Farquhar
  • David Silver
  • Satinder Singh

One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increase, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.

IJCAI Conference 2021 Conference Paper

Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment

  • Wilka Carvalho
  • Anthony Liang
  • Kimin Lee
  • Sungryull Sohn
  • Honglak Lee
  • Richard Lewis
  • Satinder Singh

Learning how to execute complex tasks involving multiple objects in a 3D world is challenging when there is no ground-truth information about the objects or any demonstration to learn from. When an agent only receives a signal from task-completion, this makes it challenging to learn the object-representations which support learning the correct object-interactions needed to complete the task. In this work, we formulate learning an attentive object dynamics model as a classification problem, using random object-images to define incorrect labels for our object-dynamics model. We show empirically that this enables object-representation learning that captures an object's category (is it a toaster?), its properties (is it on?), and object-relations (is something inside of it?). With this, our core learner (a relational RL agent) receives the dense training signal it needs to rapidly learn object-interaction tasks. We demonstrate results in the 3D AI2Thor simulated kitchen environment with a range of challenging food preparation tasks. We compare our method's performance to several related approaches and against the performance of an oracle: an agent that is supplied with ground-truth information about objects in the scene. We find that our agent achieves performance closest to the oracle in terms of both learning speed and maximum success rate.

NeurIPS Conference 2021 Conference Paper

Reward is enough for convex MDPs

  • Tom Zahavy
  • Brendan O'Donoghue
  • Guillaume Desjardins
  • Satinder Singh

Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called 'pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) 'players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
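
In rough notation (assumed here): with $d_\pi$ the stationary state-action distribution induced by policy $\pi$ and $f$ a convex objective, Fenchel duality rewrites the convex MDP as the min-max game mentioned above,

$$ \min_{\pi} f(d_{\pi}) \;=\; \min_{\pi} \max_{\lambda} \; \langle \lambda, d_{\pi} \rangle - f^{*}(\lambda), $$

where $f^{*}$ is the convex conjugate of $f$; the inner player picks a cost vector $\lambda$ (a negative reward), and the outer player is then an ordinary RL policy player facing that reward.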

NeurIPS Conference 2020 Conference Paper

A Self-Tuning Actor-Critic Algorithm

  • Tom Zahavy
  • Zhongwen Xu
  • Vivek Veeriah
  • Matteo Hessel
  • Junhyuk Oh
  • Hado P. van Hasselt
  • David Silver
  • Satinder Singh

Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using metagradients to automatically adapt hyperparameters online by meta-gradient descent (Xu et al., 2018). We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient and does not require a significant increase in compute. Ablative studies show that the overall performance of STAC improved as we adapt more hyperparameters. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).

NeurIPS Conference 2020 Conference Paper

Discovering Reinforcement Learning Algorithms

  • Junhyuk Oh
  • Matteo Hessel
  • Wojciech M. Czarnecki
  • Zhongwen Xu
  • Hado P. van Hasselt
  • Satinder Singh
  • David Silver

Reinforcement learning (RL) algorithms update an agent's parameters according to one of several possible rules, discovered manually through years of research. Automating the discovery of update rules from data could lead to more efficient algorithms, or algorithms that are better adapted to specific environments. Although there have been prior attempts at addressing this significant scientific challenge, it remains an open question whether it is feasible to discover alternatives to fundamental concepts of RL such as value functions and temporal-difference learning. This paper introduces a new meta-learning approach that discovers an entire update rule which includes both 'what to predict' (e.g. value functions) and 'how to learn from it' (e.g. bootstrapping) by interacting with a set of environments. The output of this method is an RL algorithm that we call Learned Policy Gradient (LPG). Empirical results show that our method discovers its own alternative to the concept of value functions. Furthermore, it discovers a bootstrapping mechanism to maintain and use its predictions. Surprisingly, when trained solely on toy environments, LPG generalises effectively to complex Atari games and achieves non-trivial performance. This shows the potential to discover general RL algorithms from data.

AAAI Conference 2020 Conference Paper

How Should an Agent Practice?

  • Janarthanan Rajendran
  • Richard Lewis
  • Vivek Veeriah
  • Honglak Lee
  • Satinder Singh

We present a method for learning intrinsic reward functions to drive the learning of an agent during periods of practice in which extrinsic task rewards are not available. During practice, the environment may differ from the one available for training and evaluation with extrinsic rewards. We refer to this setup of alternating periods of practice and objective evaluation as practice-match, drawing an analogy to regimes of skill acquisition common for humans in sports and games. The agent must effectively use periods in the practice environment so that performance improves during matches. In the proposed method the intrinsic practice reward is learned through a meta-gradient approach that adapts the practice reward parameters to reduce the extrinsic match reward loss computed from matches. We illustrate the method on a simple grid world, and evaluate it in two games in which the practice environment differs from match: Pong with practice against a wall without an opponent, and PacMan with practice in a maze without ghosts. The results show gains from learning in practice in addition to match periods over learning in matches only.

NeurIPS Conference 2020 Conference Paper

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

  • Thomas Anthony
  • Tom Eccles
  • Andrea Tacchetti
  • János Kramár
  • Ian Gemp
  • Thomas Hudson
  • Nicolas Porcel
  • Marc Lanctot

Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, Poker, and StarCraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.

NeurIPS Conference 2020 Conference Paper

Meta-Gradient Reinforcement Learning with an Objective Discovered Online

  • Zhongwen Xu
  • Hado P. van Hasselt
  • Matteo Hessel
  • Junhyuk Oh
  • Satinder Singh
  • David Silver

Deep reinforcement learning includes a broad family of algorithms that parameterise an internal representation, such as a value function or policy, by a deep neural network. Each algorithm optimises its parameters with respect to an objective, such as Q-learning or policy gradient, that defines its semantics. In this work, we propose an algorithm based on meta-gradient descent that discovers its own objective, flexibly parameterised by a deep neural network, solely from interactive experience with its environment. Over time, this allows the agent to learn how to learn increasingly effectively. Furthermore, because the objective is discovered online, it can adapt to changes over time. We demonstrate that the algorithm discovers how to address several important issues in RL, such as bootstrapping, non-stationarity, and off-policy learning. On the Atari Learning Environment, the meta-gradient algorithm adapts over time to learn with greater efficiency, eventually outperforming the median score of a strong actor-critic baseline.

AAAI Conference 2020 Conference Paper

Modeling Probabilistic Commitments for Maintenance Is Inherently Harder than for Achievement

  • Qi Zhang
  • Edmund Durfee
  • Satinder Singh

Most research on probabilistic commitments focuses on commitments to achieve enabling preconditions for other agents. Our work reveals that probabilistic commitments to instead maintain preconditions for others are surprisingly harder to use well than their achievement counterparts, despite strong semantic similarities. We isolate the key difference as being not in how the commitment provider is constrained, but rather in how the commitment recipient can locally use the commitment specification to approximately model the provider’s effects on the preconditions of interest. Our theoretic analyses show that we can more tightly bound the potential suboptimality due to approximate modeling for achievement than for maintenance commitments. We empirically evaluate alternative approximate modeling strategies, confirming that probabilistic maintenance commitments are qualitatively more challenging for the recipient to model well, and indicating the need for more detailed specifications that can sacrifice some of the agents’ autonomy.

NeurIPS Conference 2020 Conference Paper

On Efficiency in Hierarchical Reinforcement Learning

  • Zheng Wen
  • Doina Precup
  • Morteza Ibrahimi
  • Andre Barreto
  • Benjamin Van Roy
  • Satinder Singh

Hierarchical Reinforcement Learning (HRL) approaches promise to provide more efficient solutions to sequential decision making problems, both in terms of statistical as well as computational efficiency. While this has been demonstrated empirically over time in a variety of tasks, theoretical results quantifying the benefits of such methods are still few and far between. In this paper, we discuss the kind of structure in a Markov decision process which gives rise to efficient HRL methods. Specifically, we formalize the intuition that HRL can exploit repeating "subMDPs" with similar reward and transition structure. We show that, under reasonable assumptions, a model-based Thompson sampling-style HRL algorithm that exploits this structure is statistically efficient, as established through a finite-time regret bound. We also establish conditions under which planning with structure-induced options is near-optimal and computationally efficient.

AAAI Conference 2020 Conference Paper

Querying to Find a Safe Policy under Uncertain Safety Constraints in Markov Decision Processes

  • Shun Zhang
  • Edmund Durfee
  • Satinder Singh

An autonomous agent acting on behalf of a human user has the potential of causing side-effects that surprise the user in unsafe ways. When the agent cannot formulate a policy with only side-effects it knows are safe, it needs to selectively query the user about whether other useful side-effects are safe. Our goal is an algorithm that queries about as few potential side-effects as possible to find a safe policy, or to prove that none exists. We extend prior work on irreducible infeasible sets to also handle our problem’s complication that a constraint to avoid a side-effect cannot be relaxed without user permission. By proving that our objectives are also adaptive submodular, we devise a querying algorithm that we empirically show finds nearly-optimal queries with much less computation than a guaranteed-optimal approach, and outperforms competing approximate approaches.

JAAMAS Journal 2020 Journal Article

Semantics and algorithms for trustworthy commitment achievement under model uncertainty

  • Qi Zhang
  • Edmund H. Durfee
  • Satinder Singh

We focus on how an agent can exercise autonomy while still dependably fulfilling commitments it has made to another, despite uncertainty about outcomes of its actions and how its own objectives might evolve. Our formal semantics treats a probabilistic commitment as constraints on the actions an autonomous agent can take, rather than as promises about states of the environment it will achieve. We have developed a family of commitment-constrained (iterative) lookahead algorithms that provably respect the semantics, and that support different tradeoffs between computation and plan quality. Our empirical results confirm that our algorithms' ability to balance (selfish) autonomy and (unselfish) dependability outperforms optimizing either alone, that our algorithms can effectively handle uncertainty about both what actions do and which states are rewarding, and that our algorithms can solve more computationally-demanding problems through judicious parameter choices for how far our algorithms should look ahead and how often they should iterate.

NeurIPS Conference 2020 Conference Paper

The Value Equivalence Principle for Model-Based Reinforcement Learning

  • Christopher Grimm
  • Andre Barreto
  • Satinder Singh
  • David Silver

Learning models of the environment from data is often viewed as an essential component to building intelligent reinforcement learning (RL) agents. The common practice is to separate the learning of the model from its use, by constructing a model of the environment’s dynamics that correctly predicts the observed state transitions. In this paper we argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning. As our main contribution, we introduce the principle of value equivalence: two models are value equivalent with respect to a set of functions and policies if they yield the same Bellman updates. We propose a formulation of the model learning problem based on the value equivalence principle and analyze how the set of feasible solutions is impacted by the choice of policies and functions. Specifically, we show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks, until eventually collapsing to a single point corresponding to a model that perfectly describes the environment. In many problems, directly modelling state-to-state transitions may be both difficult and unnecessary. By leveraging the value-equivalence principle one may find simpler models without compromising performance, saving computation and memory. We illustrate the benefits of value-equivalent model learning with experiments comparing it against more traditional counterparts like maximum likelihood estimation. More generally, we argue that the principle of value equivalence underlies a number of recent empirical successes in RL, such as Value Iteration Networks, the Predictron, Value Prediction Networks, TreeQN, and MuZero, and provides a first theoretical underpinning of those results.

NeurIPS Conference 2019 Conference Paper

Discovery of Useful Questions as Auxiliary Tasks

  • Vivek Veeriah
  • Matteo Hessel
  • Zhongwen Xu
  • Janarthanan Rajendran
  • Richard Lewis
  • Junhyuk Oh
  • Hado van Hasselt
  • David Silver

Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent. We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.

NeurIPS Conference 2019 Conference Paper

Hindsight Credit Assignment

  • Anna Harutyunyan
  • Will Dabney
  • Thomas Mesnard
  • Mohammad Gheshlaghi Azar
  • Bilal Piot
  • Nicolas Heess
  • Hado van Hasselt
  • Gregory Wayne

We consider the problem of efficient credit assignment in reinforcement learning. In order to efficiently and meaningfully utilize new data, we propose to explicitly assign credit to past decisions based on the likelihood of them having led to the observed outcome. This approach uses new information in hindsight, rather than employing foresight. Somewhat surprisingly, we show that value functions can be rewritten through this lens, yielding a new family of algorithms. We study the properties of these algorithms, and empirically show that they successfully address important credit assignment challenges, through a set of illustrative tasks.

AAAI Conference 2019 Conference Paper

Learning to Communicate and Solve Visual Blocks-World Tasks

  • Qi Zhang
  • Richard Lewis
  • Satinder Singh
  • Edmund Durfee

We study emergent communication between speaker and listener recurrent neural-network agents that are tasked to cooperatively construct a blocks-world target image sampled from a generative grammar of blocks configurations. The speaker receives the target image and learns to emit a sequence of discrete symbols from a fixed vocabulary. The listener learns to construct a blocks-world image by choosing block placement actions as a function of the speaker’s full utterance and the image of the ongoing construction. Our contributions are (a) the introduction of a task domain for studying emergent communication that is both challenging and affords useful analyses of the emergent protocols; (b) an empirical comparison of the interpolation and extrapolation performance of training via supervised, (contextual) Bandit, and reinforcement learning; and (c) evidence for the emergence of interesting linguistic properties in the RL agent protocol that are distinct from the other two.

NeurIPS Conference 2019 Conference Paper

No-Press Diplomacy: Modeling Multi-Agent Gameplay

  • Philip Paquette
  • Yuchen Lu
  • SETON STEVEN BOCCO
  • Max Smith
  • Satya O. -G.
  • Jonathan Kummerfeld
  • Joelle Pineau
  • Satinder Singh

Diplomacy is a seven-player non-stochastic, non-cooperative game, where agents acquire resources through a mix of teamwork and betrayal. Reliance on trust and coordination makes Diplomacy the first non-cooperative multi-agent benchmark for complex sequential social dilemmas in a rich environment. In this work, we focus on training an agent that learns to play the No Press version of Diplomacy where there is no dedicated communication channel between players. We present DipNet, a neural-network-based policy model for No Press Diplomacy. The model was trained on a new dataset of more than 150,000 human games. Our model is trained by supervised learning (SL) from expert trajectories, which is then used to initialize a reinforcement learning (RL) agent trained through self-play. Both the SL and the RL agent demonstrate state-of-the-art No Press performance by beating popular rule-based bots.

NeurIPS Conference 2018 Conference Paper

Completing State Representations using Spectral Learning

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh

A central problem in dynamical system modeling is state discovery—that is, finding a compact summary of the past that captures the information needed to predict the future. Predictive State Representations (PSRs) enable clever spectral methods for state discovery; however, while consistent in the limit of infinite data, these methods often suffer from poor performance in the low data regime. In this paper we develop a novel algorithm for incorporating domain knowledge, in the form of an imperfect state representation, as side information to speed spectral learning for PSRs. We prove theoretical results characterizing the relevance of a user-provided state representation, and design spectral algorithms that can take advantage of a relevant representation. Our algorithm utilizes principal angles to extract the relevant components of the representation, and is robust to misspecification. Empirical evaluation on synthetic HMMs, an aircraft identification domain, and a gene splice dataset shows that, even with weak domain knowledge, the algorithm can significantly outperform standard PSR learning.

IJCAI Conference 2018 Conference Paper

Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes

  • Shun Zhang
  • Edmund H. Durfee
  • Satinder Singh

As it achieves a goal on behalf of its human user, an autonomous agent's actions may have side effects that change features of its environment in ways that negatively surprise its user. An agent that can be trusted to operate safely should thus only change features the user has explicitly permitted. We formalize this problem, and develop a planning algorithm that avoids potentially negative side effects given what the agent knows about (un)changeable features. Further, we formulate a provably minimax-regret querying strategy for the agent to selectively ask the user about features that it hasn't explicitly been told about. We empirically show how much faster it is than a more exhaustive approach and how much better its queries are than those found by the best known heuristic.

NeurIPS Conference 2018 Conference Paper

On Learning Intrinsic Rewards for Policy Gradient Methods

  • Zeyu Zheng
  • Junhyuk Oh
  • Satinder Singh

In many sequential decision making tasks, it is challenging to design reward functions that help an RL agent efficiently learn behavior that is considered good by the agent designer. A number of different formulations of the reward-design problem, or close variants thereof, have been proposed in the literature. In this paper we build on the Optimal Rewards Framework of Singh et al. that defines the optimal intrinsic reward function as one that when used by an RL agent achieves behavior that optimizes the task-specifying or extrinsic reward function. Previous work in this framework has shown how good intrinsic reward functions can be learned for lookahead search based planning agents. Whether it is possible to learn intrinsic reward functions for learning agents remains an open problem. In this paper we derive a novel algorithm for learning intrinsic rewards for policy-gradient based learning agents. We compare the performance of an augmented agent that uses our algorithm to provide additive intrinsic rewards to an A2C-based policy learner (for Atari games) and a PPO-based policy learner (for Mujoco domains) with a baseline agent that uses the same policy learners but with only extrinsic rewards. Our results show improved performance on most but not all of the domains.
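
Schematically (notation assumed, not the paper's): the policy parameters are updated on the sum of extrinsic and intrinsic rewards, and the intrinsic-reward parameters are then updated by differentiating the extrinsic objective through that policy update,

$$ \theta' = \theta + \alpha\, \nabla_{\theta} J^{\,\text{ex}+\text{in}}(\theta; \eta), \qquad \eta \leftarrow \eta + \beta\, \nabla_{\eta} J^{\,\text{ex}}\big(\theta'(\eta)\big), $$

where $\theta$ parameterises the policy, $\eta$ parameterises the learned intrinsic reward, and the second gradient flows through $\theta'(\eta)$ by the chain rule.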

AAMAS Conference 2018 Conference Paper

On Querying for Safe Optimality in Factored Markov Decision Processes

  • Shun Zhang
  • Edmund H. Durfee
  • Satinder Singh

As it achieves a goal on behalf of its human user, an autonomous agent’s actions may have side effects that change features of its environment in ways that negatively surprise its user. An agent that can be trusted to operate safely should thus only change features the user has explicitly permitted. We formalize this problem, and develop a planning algorithm that avoids potentially negative side effects given what the agent knows about (un)changeable features. Further, we formulate a provably minimax-regret querying strategy for the agent to selectively ask the user about features that it hasn’t explicitly been told about. We empirically show how much faster it is than a more exhaustive approach and how much better its queries are than those found by the best known heuristic.

RLDM Conference 2017 Conference Abstract

Approximately-Optimal Queries in Reward-Uncertain Markov Decision Processes

  • Shun Zhang
  • Edmund Durfee
  • Satinder Singh

When planning actions to take on behalf of its human operator, a robot might be uncertain about its operator's reward function. We address the problem of how the robot should formulate an (approximately) optimal query to pose to the operator, given how its uncertainty affects which policies it should plan to pursue. We explain how a robot whose queries ask the operator to choose the best from among k choices can, without loss of optimality, restrict consideration to choices only over alternative policies. Further, we present a method for constructing an approximately-optimal policy query that enjoys a performance bound, where the method need not enumerate all policies. Finally, because queries posed to the operator of a robotic system are often expressed in terms of preferences over trajectories rather than policies, we show how our constructed policy query can be projected into the space of trajectory queries. Our empirical results demonstrate that our projection technique can outperform prior techniques for choosing trajectory queries, particularly when the number of trajectories the operator is asked to compare is small.

RLDM Conference 2017 Conference Abstract

Architecture for Predicting Sets

  • Janarthanan Rajendran
  • Satinder Singh

Having an effective model for predicting sets would be useful in tasks such as object detection, image tagging, and forming a team of employees in an office, where the output of interest is a set (order-invariant). Predicting sets with multi-class-classification-style techniques faces the issue of unknown cardinality, along with the difficulty of capturing dependencies between different elements of the set during prediction. Another approach is to use sequence-prediction models to predict sets. In this case we have to come up with a model that can find the right sequence out of the many possible sequences that can constitute a set, as some orders are easier to model than others. We propose a recurrent-neural-network-based architecture for predicting sets, which uses an order-invariant loss function to learn the order of prediction automatically.

NeurIPS Conference 2017 Conference Paper

Repeated Inverse Reinforcement Learning

  • Kareem Amin
  • Nan Jiang
  • Satinder Singh

We introduce a novel repeated Inverse Reinforcement Learning problem: the agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks that it surprises the human by acting suboptimally with respect to how the human would have acted. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.

RLDM Conference 2017 Conference Abstract

Repeated Inverse Reinforcement Learning for AI Safety

  • Kareem Amin
  • Nan Jiang
  • Satinder Singh

How detailed should we make the goals we prescribe to AI agents acting on our behalf in complex environments? Detailed & low-level specification of goals can be tedious and expensive to create, and abstract & high-level goals could lead to negative surprises as the agent may find behaviors that we would not want it to do, i.e., lead to unsafe AI. One approach to addressing this dilemma is for the agent to infer human goals by observing human behavior. This is the Inverse Reinforcement Learning (IRL) problem. However, IRL is generally ill-posed for there are typically many reward functions for which the observed behavior is optimal. While the use of heuristics to select from among the set of feasible reward functions has led to successful applications of IRL to learning from demonstration, such heuristics do not address AI safety. In this paper we introduce a novel repeated IRL problem that captures an aspect of AI safety as follows. The agent has to act on behalf of a human in a sequence of tasks and wishes to minimize the number of tasks in which it surprises the human. Each time the human is surprised, the agent is provided a demonstration of the desired behavior by the human. We formalize this problem, including how the sequence of tasks is chosen, in a few different ways and provide some foundational results.

NeurIPS Conference 2017 Conference Paper

Value Prediction Network

  • Junhyuk Oh
  • Satinder Singh
  • Honglak Lee

This paper proposes a novel deep reinforcement learning (RL) architecture, called Value Prediction Network (VPN), which integrates model-free and model-based RL methods into a single neural network. In contrast to typical model-based RL methods, VPN learns a dynamics model whose abstract states are trained to make option-conditional predictions of future values (discounted sum of rewards) rather than of future observations. Our experimental results show that VPN has several advantages over both model-free and model-based baselines in a stochastic environment where careful planning is required but building an accurate observation-prediction model is difficult. Furthermore, VPN outperforms Deep Q-Network (DQN) on several Atari games even with short-lookahead planning, demonstrating its potential as a new way of learning a good state representation.
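A minimal sketch of the interface this abstract describes, assuming a simple fully-connected encoder and made-up layer sizes (the paper uses convolutional encoders and its own training losses); this is our paraphrase, not the authors' code.

```python
import torch
import torch.nn as nn

class ValuePredictionCore(nn.Module):
    """Predicts the value of an abstract state, plus option-conditional next abstract
    state, reward, and discount, rather than predicting future observations."""
    def __init__(self, obs_dim, num_options, hid=128):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(obs_dim, hid), nn.ReLU())  # observation -> abstract state
        self.value = nn.Linear(hid, 1)                                    # value of abstract state
        self.outcome = nn.Linear(hid + num_options, hid + 2)              # next state, reward, discount

    def forward(self, obs, option_onehot):
        s = self.encode(obs)
        out = self.outcome(torch.cat([s, option_onehot], dim=-1))
        s_next, reward, discount = out[..., :-2], out[..., -2], torch.sigmoid(out[..., -1])
        # Short-lookahead planning can roll this core forward and back up reward + discount * value(s_next).
        return self.value(s), s_next, reward, discount
```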

IJCAI Conference 2016 Conference Paper

Commitment Semantics for Sequential Decision Making under Reward Uncertainty

  • Qi Zhang
  • Edmund Durfee
  • Satinder Singh
  • Anna Chen
  • Stefan Witwicki

Cooperating agents can make commitments to help each other, but commitments might have to be probabilistic when actions have stochastic outcomes. We consider the additional complication in cases where an agent might prefer to change its policy as it learns more about its reward function from experience. How should such an agent be allowed to change its policy while still faithfully pursuing its commitment in a principled decision-theoretic manner? We address this question by defining a class of Dec-POMDPs with Bayesian reward uncertainty, and by developing a novel Commitment Constrained Iterative Mean Reward algorithm that implements the semantics of faithful commitment pursuit while still permitting the agent's response to the evolving understanding of its rewards. We bound the performance of our algorithm theoretically, and evaluate empirically how it effectively balances solution quality and computation cost.

IJCAI Conference 2016 Conference Paper

Deep Learning for Reward Design to Improve Monte Carlo Tree Search in ATARI Games

  • Xiaoxiao Guo
  • Satinder Singh
  • Richard Lewis
  • Honglak Lee

Monte Carlo Tree Search (MCTS) methods have proven powerful in planning for sequential decision-making problems such as Go and video games, but their performance can be poor when the planning depth and sampling trajectories are limited or when the rewards are sparse. We present an adaptation of PGRD (policy-gradient for reward-design) for learning a reward-bonus function to improve UCT (an MCTS algorithm). Unlike previous applications of PGRD, in which the space of reward-bonus functions was limited to linear functions of hand-coded state-action features, we use PGRD with a multi-layer convolutional neural network to automatically learn features from raw perception as well as to adapt the non-linear reward-bonus function parameters. We also adopt a variance-reducing gradient method to improve PGRD's performance. The new method improves UCT's performance on multiple ATARI games compared to UCT without the reward bonus. Combining PGRD and Deep Learning in this way should make adapting rewards for MCTS algorithms far more widely and practically applicable than before.
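A hedged sketch of the core idea, using a linear bonus for brevity where the paper learns a convolutional one; the function and variable names are ours.

```python
import numpy as np

def reward_bonus(features, theta):
    """Hypothetical differentiable bonus; here linear in features, whereas the paper
    learns a convolutional network over raw frames."""
    return float(np.dot(theta, features))

def uct_backup(node_value, node_visits, env_reward, features, theta, child_value, gamma=0.99):
    """One incremental UCT-style value backup, with the learned bonus added to the
    immediate environment reward; PGRD-style updates would then adjust theta to
    increase the agent's return under the *original* (un-bonused) reward."""
    ret = env_reward + reward_bonus(features, theta) + gamma * child_value
    node_visits += 1
    node_value += (ret - node_value) / node_visits
    return node_value, node_visits, ret
```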

AAAI Conference 2016 Conference Paper

Improving Predictive State Representations via Gradient Descent

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh

Predictive state representations (PSRs) model dynamical systems using appropriately chosen predictions about future observations as a representation of the current state. In contrast to the hidden states posited by HMMs or RNNs, PSR states are directly observable in the training data; this gives rise to a moment-matching spectral algorithm for learning PSRs that is computationally efficient and statistically consistent when the model complexity matches that of the true system generating the data. In practice, however, model mismatch is inevitable and while spectral learning remains appealingly fast and simple it may fail to find optimal models. To address this problem, we investigate the use of gradient methods for improving spectrally-learned PSRs. We show that only a small amount of additional gradient optimization can lead to significant performance gains, and moreover that initializing gradient methods with the spectral learning solution yields better models in significantly less time than starting from scratch.
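A minimal sketch of the refinement step, in our own notation: filter a transformed-PSR's predictive state through a sequence, accumulate the log-likelihood, and ascend it by gradient starting from the spectrally learned parameters.

```python
import torch

def psr_log_likelihood(b1, b_inf, B, actions, observations):
    """b1: (k,) initial predictive state; b_inf: (k,) normalizer; B: dict mapping an
    (action, observation) pair to a (k, k) update matrix (the spectrally learned parameters).
    Returns the log-likelihood of the observation sequence given the actions taken."""
    b, ll = b1, torch.zeros(())
    for a, o in zip(actions, observations):
        p = b_inf @ (B[(a, o)] @ b)      # predicted probability of seeing o after doing a
        ll = ll + torch.log(p)
        b = (B[(a, o)] @ b) / p          # Bayes-filter update of the predictive state
    return ll

# Refinement loop sketch: initialize (b1, b_inf, B) from the spectral solution, mark them
# requires_grad=True, then repeatedly take optimizer steps on -psr_log_likelihood(...).
```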

IJCAI Conference 2016 Conference Paper

On Structural Properties of MDPs that Bound Loss Due to Shallow Planning

  • Nan Jiang
  • Satinder Singh
  • Ambuj Tewari

Planning in MDPs often uses a smaller planning horizon than specified in the problem to save computational expense at the risk of a loss due to suboptimal plans. Jiang et al. [2015b] recently showed that smaller than specified planning horizons can in fact be beneficial in cases where the MDP model is learned from data and therefore not accurate. In this paper, we consider planning with accurate models and investigate structural properties of MDPs that bound the loss incurred by using smaller than specified planning horizons. We identify a number of structural parameters some of which depend on the reward function alone, some on the transition dynamics alone, and some that depend on the interaction between rewards and transition dynamics. We provide planning loss bounds in terms of these structural parameters and, in some cases, also show tightness of the upper bounds. Empirical results with randomly generated MDPs are used to validate qualitative properties of our theoretical bounds for shallow planning.
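For orientation, the generic worst-case bound that structural analyses of this kind refine has, to our recollection (hedged here rather than quoted from the paper), the following form for a policy $\pi^*_{\gamma'}$ that is optimal under a reduced discount $\gamma' \le \gamma$ in the true model $M$:

```latex
\left\lVert V^{*}_{M,\gamma} - V^{\pi^{*}_{\gamma'}}_{M,\gamma} \right\rVert_{\infty}
\;\le\; \frac{\gamma - \gamma'}{(1-\gamma)(1-\gamma')}\, R_{\max}
```

The structural parameters identified in the paper tighten this dependence for particular reward and transition structures.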

IJCAI Conference 2016 Conference Paper

The Dependence of Effective Planning Horizon on Model Accuracy

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh
  • Richard Lewis

Because planning with a long horizon (i. e. , looking far into the future) is computationally expensive, it is common in practice to save time by using reduced horizons. This is usually understood to come at the expense of computing suboptimal plans, which is the case when the planning model is exact. However, when the planning model is estimated from data, as is frequently true in the real world, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies available to the planning algorithm, having an intuitive, monotonic relationship with a simple measure of complexity. We prove a planning loss bound predicting that shorter planning horizons can reduce overfitting and improve test performance, and we confirm these predictions empirically.

NeurIPS Conference 2015 Conference Paper

Action-Conditional Video Prediction using Deep Networks in Atari Games

  • Junhyuk Oh
  • Xiaoxiao Guo
  • Honglak Lee
  • Richard Lewis
  • Satinder Singh

Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future (image-)frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually-realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
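A minimal sketch of the action-conditional transformation layer the abstract refers to, with made-up layer sizes (the encoder and decoder in the paper are convolutional and deconvolutional); the class name is ours.

```python
import torch
import torch.nn as nn

class ActionConditionalCore(nn.Module):
    """Encoded frame features interact multiplicatively with an action embedding
    before being decoded into a prediction of the next frame's features."""
    def __init__(self, feat_dim, num_actions, factor_dim=256):
        super().__init__()
        self.enc_factor = nn.Linear(feat_dim, factor_dim, bias=False)
        self.act_factor = nn.Linear(num_actions, factor_dim, bias=False)
        self.dec = nn.Linear(factor_dim, feat_dim)

    def forward(self, encoded_frame, action_onehot):
        gated = self.enc_factor(encoded_frame) * self.act_factor(action_onehot)  # action gates the features
        return self.dec(gated)  # fed to a (de)convolutional decoder to predict the next frame
```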

AAAI Conference 2015 Conference Paper

Spectral Learning of Predictive State Representations with Insufficient Statistics

  • Alex Kulesza
  • Nan Jiang
  • Satinder Singh

Predictive state representations (PSRs) are models of dynamical systems that represent state as a vector of predictions about future observable events (tests) conditioned on past observed events (histories). If a practitioner selects finite sets of tests and histories that are known to be sufficient to completely capture the system, an exact PSR can be learned in polynomial time using spectral methods. However, most real-world systems are complex, and in practice computational constraints limit us to small sets of tests and histories which are therefore never truly sufficient. How, then, should we choose these sets? Existing theory offers little guidance here, and yet we show that the choice is highly consequential—tests and histories selected at random or by a naïve rule significantly underperform the best sets. In this paper we approach the problem both theoretically and empirically. While any fixed system can be represented by an infinite number of equivalent but distinct PSRs, we show that in the computationally unconstrained setting, where existing theory guarantees accurate predictions, the PSRs learned by spectral methods always satisfy a particular spectral bound. Adapting this idea, we propose a simple algorithmic technique to search for sets of tests and histories that approximately satisfy the bound while respecting computational limits. Empirically, our method significantly reduces prediction errors compared to standard spectral learning approaches.

NeurIPS Conference 2014 Conference Paper

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning

  • Xiaoxiao Guo
  • Satinder Singh
  • Honglak Lee
  • Richard Lewis
  • Xiaoshi Wang

The combination of modern Reinforcement Learning and Deep Learning approaches holds the promise of making significant progress on challenging applications requiring both rich perception and policy-selection. The Arcade Learning Environment (ALE) provides a set of Atari games that represent a useful benchmark set of such applications. A recent breakthrough in combining model-free reinforcement learning with deep learning, called DQN, achieves the best real-time agents thus far. Planning-based approaches achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We propose new agents based on this idea and show that they outperform DQN.
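A hedged sketch of the training recipe, reduced to its essentials (our own rendering; the paper's networks and targets differ in detail): the slow UCT planner labels frames offline, and a fast network is trained to imitate those labels for real-time play.

```python
import torch
import torch.nn as nn

def distill_uct_policy(frames, uct_actions, num_actions, epochs=10, lr=1e-3):
    """frames: (N, D) preprocessed observations; uct_actions: (N,) actions chosen by the
    offline UCT planner. Trains a fast classifier to imitate the planner's choices."""
    net = nn.Sequential(nn.Linear(frames.shape[1], 256), nn.ReLU(), nn.Linear(256, num_actions))
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = net(frames)
        loss = loss_fn(logits, uct_actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return net  # real-time agent: act with argmax over net(frame)
```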

AAAI Conference 2014 Conference Paper

Evaluating Trauma Patients: Addressing Missing Covariates with Joint Optimization

  • Alex Van Esbroeck
  • Satinder Singh
  • Ilan Rubinfeld
  • Zeeshan Syed

Missing values are a common problem when applying classification algorithms to real-world medical data. This is especially true for trauma patients, where the emergent nature of the cases makes it difficult to collect all of the relevant data for each patient. Standard methods for handling missingness first learn a model to estimate missing data values, and subsequently train and evaluate a classifier using data imputed with this model. Recently, several proposed methods have demonstrated the benefits of jointly estimating the imputation model and classifier parameters. However, these methods make assumptions that limit their utility with many real-world medical datasets. For example, the assumption that data elements are missing at random is often invalid. We address this situation by exploring a novel approach for jointly learning the imputation model and classifier. Unlike previous algorithms, our approach makes no assumptions about the missingness of the data, can be used with arbitrary probabilistic data models and classification loss functions, and can be used when both the training and testing data have missing values. We investigate the utility of this approach on the prediction of several patient outcomes in a large national registry of trauma patients, and find that it significantly outperforms standard sequential methods.

AAAI Conference 2014 Conference Paper

Predicting Postoperative Atrial Fibrillation from Independent ECG Components

  • Chih-Chun Chia
  • James Blum
  • Zahi Karam
  • Satinder Singh
  • Zeeshan Syed

Postoperative atrial fibrillation (PAF) occurs in 10% to 65% of the patients undergoing cardiothoracic surgery. It is associated with increased post-surgical mortality and morbidity, and results in longer and more expensive hospital stays. Accurately stratifying patients for PAF allows for selective use of prophylactic therapies (e. g. , amiodarone). Unfortunately, existing tools to stratify patients for PAF fail to provide clinically adequate discrimination. Our research addresses this situation through the development of novel electrocardiographic (ECG) markers to identify patients at risk of PAF. As a first step, we explore an eigen-decomposition approach that partitions ECG signals into atrial and ventricular components by exploiting knowledge of the underlying cardiac cycle. We then quantify electrical instability in the myocardium manifesting as probabilistic variations in atrial ECG morphology to assess the risk of PAF. When evaluated on 385 patients undergoing cardiac surgery, this approach of stratifying patients for PAF through an analysis of morphologic variability within decoupled atrial ECG demonstrated substantial promise and improved net reclassification by over 53% relative to the use of baseline clinical characteristics.

NeurIPS Conference 2013 Conference Paper

Reward Mapping for Transfer in Long-Lived Agents

  • Xiaoxiao Guo
  • Satinder Singh
  • Richard Lewis

We consider how to transfer knowledge from previous tasks to a current task in long-lived and bounded agents that must solve a sequence of MDPs over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent's behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent's performance relative to other approaches, including an approach that transfers policies.

AAAI Conference 2012 Conference Paper

Computing Stackelberg Equilibria in Discounted Stochastic Games

  • Yevgeniy Vorobeychik
  • Satinder Singh

Stackelberg games increasingly influence security policies deployed in real-world settings. Much of the work to date focuses on devising a fixed randomized strategy for the defender, accounting for an attacker who optimally responds to it. In practice, defense policies are often subject to constraints and vary over time, allowing an attacker to infer characteristics of future policies based on current observations. A defender must therefore account for an attacker's observation capabilities in devising a security policy. We show that this general modeling framework can be captured using stochastic Stackelberg games (SSGs), where a defender commits to a dynamic policy to which the attacker devises an optimal dynamic response. We then offer the following contributions: 1) we show that Markov stationary policies suffice in SSGs, 2) we present a finite-time mixed-integer non-linear program for computing a Stackelberg equilibrium in SSGs, 3) we present a mixed-integer linear program to approximate it, and 4) we illustrate our algorithms on a simple SSG representing an adversarial patrolling scenario, where we study the impact of attacker patience and risk aversion on optimal defense policies.

AAMAS Conference 2012 Conference Paper

Learning and Predicting Dynamic Networked Behavior with Graphical Multiagent Models

  • Quang Duong
  • Michael Wellman
  • Satinder Singh
  • Michael Kearns

Factored models of multiagent systems address the complexity of joint behavior by exploiting locality in agent interactions. \emph{History-dependent graphical multiagent models} (hGMMs) further capture dynamics by conditioning behavior on history. The challenges of modeling real human behavior motivated us to extend the hGMM representation by distinguishing two types of agent interactions. This distinction opens the opportunity for learning dependence networks that are different from given graphical structures representing observed agent interactions. We propose a greedy algorithm for learning hGMMs from time-series data, inducing both graphical structure and parameters. Our empirical study employs human-subject experiment data for a dynamic consensus scenario, where agents on a network attempt to reach a unanimous vote. We show that the learned hGMMs directly expressing joint behavior outperform alternatives in predicting dynamic human voting behavior, and end-game vote results. Analysis of learned graphical structures reveals patterns of action dependence not directly reflected in the original experiment networks.

AAMAS Conference 2012 Conference Paper

Planning and Evaluating Multiagent Influences Under Reward Uncertainty

  • Stefan Witwicki
  • Inn-Tung Chen
  • Ed Durfee
  • Satinder Singh

Forming commitments about abstract influences that agents can exert on one another has shown promise in improving the tractability of multiagent coordination under uncertainty. We now extend this approach to domains with meta-level reward-model uncertainty. Intuitively, an agent may actually improve collective performance by forming a weaker commitment that allows more latitude to adapt its policy as it refines its reward model. To account for reward uncertainty as such, we introduce and contrast three new techniques.

AAAI Conference 2012 Conference Paper

Security Games with Limited Surveillance

  • Bo An
  • David Kempe
  • Christopher Kiekintveld
  • Eric Shieh
  • Satinder Singh
  • Milind Tambe
  • Yevgeniy Vorobeychik

Randomized first-mover strategies of Stackelberg games are used in several deployed applications to allocate limited resources for the protection of critical infrastructure. Stackelberg games model the fact that a strategic attacker can surveil and exploit the defender’s strategy, and randomization guards against the worst effects by making the defender less predictable. In accordance with the standard game-theoretic model of Stackelberg games, past work has typically assumed that the attacker has perfect knowledge of the defender’s randomized strategy and will react correspondingly. In light of the fact that surveillance is costly, risky, and delays an attack, this assumption is clearly simplistic: attackers will usually act on partial knowledge of the defender’s strategies. The attacker’s imperfect estimate could present opportunities and possibly also threats to a strategic defender. In this paper, we therefore begin a systematic study of security games with limited surveillance. We propose a natural model wherein an attacker forms or updates a belief based on observed actions, and chooses an optimal response. We investigate the model both theoretically and experimentally. In particular, we give mathematical programs to compute optimal attacker and defender strategies for a fixed observation duration, and show how to use them to estimate the attacker’s observation durations. Our experimental results show that the defender can achieve significant improvement in expected utility by taking the attacker’s limited surveillance into account, validating the motivation of our work.

AAMAS Conference 2012 Conference Paper

Strong Mitigation: Nesting Search for Good Policies Within Search for Good Reward

  • Jeshua Bratman
  • Satinder Singh
  • Richard Lewis
  • Jonathan Sorg

Recent work has defined an optimal reward problem (ORP) in which an agent designer, with an objective reward function that \emph{evaluates} an agent's behavior, has a choice of what reward function to build into a learning or planning agent to \emph{guide} its behavior. Existing results on ORP show \emph{weak mitigation} of limited computational resources, i. e. , the existence of reward functions so that agents when guided by them do better than when guided by the objective reward function. These existing results ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves \emph{strong mitigation} of limited computational resources. We show empirically that the designer is better off using a new architecture that spends some of its limited resources learning a good reward function instead of using all of its resources to optimize its behavior with respect to the objective reward function.

AAMAS Conference 2011 Conference Paper

Comparing Action-Query Strategies in Semi-Autonomous Agents

  • Robert Cohn
  • Edmund H. Durfee
  • Satinder Singh

We consider semi-autonomous agents that have uncertain knowledge about their environment, but can ask what action the operator would prefer taking in the current or in a potential future state. Asking queries can help improve behavior, but if queries come at a cost (e. g. , due to limited operator attention), the number of queries needs to be minimized. We develop a new algorithm for selecting action queries by adapting the recently proposed Expected Myopic Gain (EMG) from its prior use in settings with reward or transition probability queries to our setting of action queries, and empirically compare it to the current state of the art.

AAAI Conference 2011 Conference Paper

Comparing Action-Query Strategies in Semi-Autonomous Agents

  • Robert Cohn
  • Edmund Durfee
  • Satinder Singh

We consider settings in which a semi-autonomous agent has uncertain knowledge about its environment, but can ask what action the human operator would prefer taking in the current or in a potential future state. Asking queries can improve behavior, but if queries come at a cost (e. g. , due to limited operator attention), the value of each query should be maximized. We compare two strategies for selecting action queries: 1) based on myopically maximizing expected gain in long-term value, and 2) based on myopically minimizing uncertainty in the agent’s policy representation. We show empirically that the first strategy tends to select more valuable queries, and that a hybrid method can outperform either method alone in settings with limited computation.
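A hedged sketch of a myopic expected-gain criterion of the kind compared here (our notation; `value_of_best_policy` is a hypothetical stand-in for the agent's planner, not a function from the paper).

```python
import numpy as np

def expected_myopic_gain(belief, answer_likelihood, value_of_best_policy):
    """belief: (H,) prior over reward hypotheses.
    answer_likelihood: (A, H) matrix, P(operator answers action a | hypothesis h) for the queried state.
    value_of_best_policy(posterior): value, under that posterior, of the best policy the agent can plan.
    Returns E[value after folding in the answer] minus the value of planning with the current belief."""
    current = value_of_best_policy(belief)
    gain = 0.0
    for a in range(answer_likelihood.shape[0]):
        p_answer = float(answer_likelihood[a] @ belief)
        if p_answer == 0.0:
            continue
        posterior = answer_likelihood[a] * belief / p_answer   # Bayes update on the hypothetical answer
        gain += p_answer * value_of_best_policy(posterior)
    return gain - current

# The agent would evaluate this quantity for each candidate query state and ask about the best one.
```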

AAAI Conference 2011 Conference Paper

Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents

  • Jonathan Sorg
  • Satinder Singh
  • Richard Lewis

Planning agents often lack the computational resources needed to build full planning trees for their environments. Agent designers commonly overcome this finite-horizon approximation by applying an evaluation function at the leaf-states of the planning tree. Recent work has proposed an alternative approach for overcoming computational constraints on agent design: modify the reward function. In this work, we compare this reward design approach to the common leaf-evaluation heuristic approach for improving planning agents. We show that in many agents, the reward design approach strictly subsumes the leaf-evaluation approach, i. e. , there exists a reward function for every leaf-evaluation heuristic that leads to equivalent behavior, but the converse is not true. We demonstrate that this generality leads to improved performance when an agent makes approximations in addition to the finite-horizon approximation. As part of our contribution, we extend PGRD, an online reward design algorithm, to develop reward design algorithms for Sparse Sampling and UCT, two algorithms capable of planning in large state spaces.

AAMAS Conference 2010 Conference Paper

History-Dependent Graphical Multiagent Models

  • Quang Duong
  • Michael Wellman
  • Satinder Singh
  • Yevgeniy Vorobeychik

A dynamic model of a multiagent system defines a probability distribution over possible system behaviors over time. Alternative representations for such models present tradeoffs in expressive power, and in accuracy and cost for inferential tasks of interest. In a history-dependent representation, behavior at a given time is specified as a probabilistic function of some portion of system history. Models may be further distinguished based on whether they specify individual or joint behavior. Joint behavior models are more expressive, but in general grow exponentially in the number of agents. Graphical multiagent models (GMMs) provide a more compact representation of joint behavior when agent interactions exhibit some local structure. We extend GMMs to condition on history, thus supporting inference about system dynamics. To evaluate this hGMM representation we study a voting consensus scenario, where agents on a network attempt to reach a preferred unanimous vote through a process of smooth fictitious play. We induce hGMMs and individual behavior models from example traces, showing that the former provide better predictions, given limited history information. These hGMMs also provide advantages for answering general inference queries compared to sampling the true generative model.

AAMAS Conference 2010 Conference Paper

Linear Options

  • Jonathan Sorg
  • Satinder Singh

Learning, planning, and representing knowledge in large state spaces at multiple levels of temporal abstraction are key, long-standing challenges for building flexible autonomous agents. The options framework provides a formal mechanism for specifying and learning reusable temporally-extended skills. Although past work has demonstrated the benefit of acting according to options in large state spaces, one of the central advantages of temporal abstraction - the ability to plan using a temporally abstract model - remains a challenging problem when the number of environment states is large or infinite. In this work, we develop a knowledge construct, the linear option, which is capable of modeling temporally abstract dynamics in large state spaces. We show that planning with a linear expectation model of an option's dynamics converges to a fixed point with low Temporal Difference (TD) error. Next, building on recent work on linear feature selection, we show conditions under which a linear feature set is sufficient for accurately representing the value function of an option policy. We extend this result to show conditions under which multiple options may be repeatedly composed to create new options with accurate linear models. Finally, we demonstrate linear option learning and planning algorithms in a simulated robot environment.

NeurIPS Conference 2010 Conference Paper

Reward Design via Online Gradient Ascent

  • Jonathan Sorg
  • Richard Lewis
  • Satinder Singh

Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent's goals different from the designer's. This gives rise to the optimization problem of designing the artificial agent's goals---in the RL framework, designing the agent's reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent's lifetime nor do they take advantage of knowledge about the agent's structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent's lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations.
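A hedged, much-simplified sketch of online reward-parameter ascent: if the agent's action distribution depends differentiably on its internal reward parameters, a policy-gradient-style estimator can push those parameters toward higher objective return. The names and the REINFORCE-style form below are ours, not necessarily the paper's estimator.

```python
import numpy as np

def reward_design_step(theta, grad_log_policy, objective_return, baseline, lr=0.01):
    """theta: internal reward parameters that the agent plans/learns against.
    grad_log_policy: sum over the episode of d/dtheta log pi_theta(a_t | s_t), where pi_theta
    is the action distribution induced by planning with the internal reward.
    objective_return: the designer's (objective-reward) return for the episode."""
    return theta + lr * (objective_return - baseline) * grad_log_policy

# Usage sketch: after each episode,
# theta = reward_design_step(theta, episode_grad_log_policy, G_objective, running_baseline)
```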

IJCAI Conference 2009 Conference Paper

  • Quang Duong
  • Yevgeniy Vorobeychik
  • Satinder Singh
  • Michael P. Wellman

Graphical games provide compact representation of a multiagent interaction when agents' payoffs depend only on actions of agents in their local neighborhood. We formally describe the problem of learning a graphical game model from limited observation of the payoff function, define three performance metrics for evaluating learned games, and investigate several learning algorithms based on minimizing empirical loss. Our first algorithm is a branch-and-bound search, which takes advantage of the structure of the empirical loss function to derive upper and lower bounds on loss at every node of the search tree. We also examine a greedy heuristic and local search algorithms. Our experiments with directed graphical games show that (i) when only a small sample of profile payoffs is available, branch-and-bound significantly outperforms other methods, and has competitive running time, but (ii) when many profiles are observed, greedy is nearly optimal and considerably better than other methods, at a fraction of branch-and-bound's running time. The results are comparable for undirected graphical games and when payoffs are sampled with noise.

IJCAI Conference 2009 Conference Paper

  • Erik Talvitie
  • Satinder Singh

A common approach to the control problem in partially observable environments is to perform a direct search in policy space, as defined over some set of features of history. In this paper we consider predictive features, whose values are conditional probabilities of future events, given history. Since predictive features provide direct information about the agent's future, they have a number of advantages for control. However, unlike more typical features defined directly over past observations, it is not clear how to maintain the values of predictive features over time. A model could be used, since a model can make any prediction about the future, but in many cases learning a model is infeasible. In this paper we demonstrate that in some cases it is possible to learn to maintain the values of a set of predictive features even when learning a model is infeasible, and that natural predictive features can be useful for policy-search methods.

AAMAS Conference 2009 Conference Paper

SarsaLandmark: An Algorithm for Learning in POMDPs with Landmarks

  • Michael R. James
  • Satinder Singh

Reinforcement learning algorithms that use eligibility traces, such as Sarsa(λ), have been empirically shown to be effective in learning good estimated-state-based policies in partially observable Markov decision processes (POMDPs). Nevertheless, one can construct counterexamples, problems in which Sarsa(λ < 1) fails to find a good policy even though one exists. Despite this, these algorithms remain of great interest because alternative approaches to learning in POMDPs based on approximating belief states do not scale. In this paper we present SarsaLandmark, an algorithm for learning in POMDPs with "landmark" states (most man-made and many natural environments have landmarks). SarsaLandmark simultaneously preserves the advantages offered by eligibility traces and fixes the cause of the failure of Sarsa(λ) on the motivating counterexamples. We present a theoretical analysis of SarsaLandmark for the policy evaluation problem and present empirical results on a few learning control problems.
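A hedged sketch of the trace-handling idea as we read it from the abstract (not the authors' code): keep full-return backups between landmarks and bootstrap only when a landmark is reached.

```python
import numpy as np

def sarsa_landmark_step(Q, e, s, a, r, s2, a2, is_landmark, alpha=0.1, gamma=0.99):
    """Q, e: (num_states, num_actions) value and accumulating-trace tables; s, a, r, s2, a2:
    one Sarsa transition over estimated states; is_landmark: predicate on estimated states."""
    delta = r + gamma * Q[s2, a2] - Q[s, a]
    e[s, a] += 1.0
    Q += alpha * delta * e
    if is_landmark(s2):
        e[:] = 0.0          # cut traces: bootstrap from the landmark's value
    else:
        e *= gamma          # effectively lambda = 1 between landmarks (decay by gamma only)
    return Q, e
```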

AAMAS Conference 2009 Conference Paper

Transfer via Soft Homomorphisms

  • Jonathan Sorg
  • Satinder Singh

The field of transfer learning aims to speed up learning across multiple related tasks by transferring knowledge between source and target tasks. Past work has shown that when the tasks are specified as Markov Decision Processes (MDPs), a function that maps states in the target task to similar states in the source task can be used to transfer many types of knowledge. Current approaches for autonomously learning such functions are inefficient or require domain knowledge and lack theoretical guarantees of performance. We devise a novel approach that learns a stochastic mapping between tasks. Using this mapping, we present two algorithms for autonomous transfer learning – one that has strong convergence guarantees and another approximate method that learns online from experience. Extending existing work on MDP homomorphisms, we present theoretical guarantees for the quality of a transferred value function.

AAMAS Conference 2008 Conference Paper

Approximate Predictive State Representations

  • Britton Wolfe
  • Michael James
  • Satinder Singh

Predictive state representations (PSRs) are models that represent the state of a dynamical system as a set of predictions about future events. The existing work with PSRs focuses on trying to learn exact models, an approach that cannot scale to complex dynamical systems. In contrast, our work takes the first steps in developing a theory of approximate PSRs. We examine the consequences of using an approximate predictive state representation, bounding the error of the approximate state under certain conditions. We also introduce factored PSRs, a class of PSRs with a particular approximate state representation. We show that the class of factored PSRs allows one to tune the degree of approximation by trading off accuracy for compactness. We demonstrate this trade-off empirically on some example systems, using factored PSRs that were learned from data.

NeurIPS Conference 2008 Conference Paper

Simple Local Models for Complex Dynamical Systems

  • Erik Talvitie
  • Satinder Singh

We present a novel mathematical formalism for the idea of a "local model," a model of a potentially complex dynamical system that makes only certain predictions in only certain situations. As a result of its restricted responsibilities, a local model may be far simpler than a complete model of the system. We then show how one might combine several local models to produce a more detailed model. We demonstrate our ability to learn a collection of local models on a large-scale example and do a preliminary empirical comparison of learning a collection of local models and some other model learning methods.

IJCAI Conference 2007 Conference Paper

  • David Wingate
  • Vishal Soni
  • Britton Wolfe
  • Satinder Singh

Most work on Predictive Representations of State (PSRs) has focused on learning and planning in unstructured domains (for example, those represented by flat POMDPs). This paper extends PSRs to represent relational knowledge about domains, so that they can use policies that generalize across different tasks, capture knowledge that ignores irrelevant attributes of objects, and represent policies in a way that is independent of the size of the state space. Using a blocks world domain, we show how generalized predictions about the future can compactly capture relations between objects, which in turn can be used to naturally specify relational-style options and policies. Because our representation is expressed solely in terms of actions and observations, it has extensive semantics which are statistics about observable quantities.

IJCAI Conference 2007 Conference Paper

  • Erik Talvitie
  • Satinder Singh

A long-lived agent continually faces new tasks in its environment. Such an agent may be able to use knowledge learned in solving earlier tasks to produce candidate policies for its current task. There may, however, be multiple reasonable policies suggested by prior experience, and the agent must choose between them potentially without any a priori knowledge about their applicability to its current situation. We present an "experts" algorithm for efficiently choosing amongst candidate policies in solving an unknown Markov decision process task. We conclude with the results of experiments on two domains in which we generate candidate policies from solutions to related tasks and use our experts algorithm to choose amongst them.
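A hedged sketch of an experts-style selection rule over candidate policies (our own minimal rendering, not the paper's exact algorithm): candidate policies are treated as experts and reweighted by the returns observed when they are run.

```python
import numpy as np

def choose_policy(weights, rng):
    """Sample a candidate policy in proportion to its current weight."""
    p = weights / weights.sum()
    return int(rng.choice(len(weights), p=p)), p

def update_weights(weights, chosen, observed_return, prob_chosen, eta=0.1, return_scale=1.0):
    """EXP3-style importance-weighted update so policies that were not run are not penalized."""
    est = (observed_return / return_scale) / prob_chosen
    weights[chosen] *= np.exp(eta * est)
    return weights

# Usage sketch: rng = np.random.default_rng(0); run the chosen policy for an episode,
# observe its return, then call update_weights before the next episode.
```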

AAMAS Conference 2007 Conference Paper

Constraint Satisfaction Algorithms for Graphical Games

  • Vishal Soni
  • Satinder Singh
  • Michael P. Wellman

We formulate the problem of computing equilibria in multiplayer games represented by arbitrary undirected graphs as a constraint satisfaction problem and present two algorithms. The first is PureProp: an algorithm for computing approximate Nash equilibria in complete information one-shot games and approximate Bayes-Nash equilibria in one-shot games of incomplete information. PureProp unifies existing message-passing based algorithms for solving these classes of games. We also address repeated graphical games, and present a second algorithm, PureProp-R, for computing approximate Nash equilibria in these games.

AAMAS Conference 2007 Conference Paper

On Discovery and Learning of Models with Predictive Representations of State for Agents with Continuous Actions and Observations

  • David Wingate
  • Satinder Singh

Models of agent-environment interaction that use predictive state representations (PSRs) have mainly focused on the case of discrete observations and actions. The theory of discrete PSRs uses an elegant construct called the system dynamics matrix and derives the notion of predictive state as a sufficient statistic via the rank of the matrix. With continuous observations and actions, such a matrix and its rank no longer exist. In this paper, we show how to define an analogous construct for the continuous case, called the system dynamics distributions, and use information theoretic notions to define a sufficient statistic and thus state. Given this new construct, we use kernel density estimation to learn approximate system dynamics distributions from data, and use information-theoretic tools to derive algorithms for discovery of state and learning of model parameters. We illustrate our new modeling method on two example problems.

NeurIPS Conference 2005 Conference Paper

Off-policy Learning with Options and Recognizers

  • Doina Precup
  • Cosmin Paduraru
  • Anna Koop
  • Richard Sutton
  • Satinder Singh

We introduce a new algorithm for off-policy temporal-difference learning with function approximation that has lower variance and requires less knowledge of the behavior policy than prior methods. We develop the notion of a recognizer, a filter on actions that distorts the behavior policy to produce a related target policy with low-variance importance-sampling corrections. We also consider target policies that are deviations from the state distribution of the behavior policy, such as potential temporally abstract options, which further reduces variance. This paper introduces recognizers and their potential advantages, then develops a full algorithm for linear function approximation and proves that its updates are in the same direction as on-policy TD updates, which implies asymptotic convergence. Even though our algorithm is based on importance sampling, we prove that it requires absolutely no knowledge of the behavior policy for the case of state-aggregation function approximators. Off-policy learning is learning about one way of behaving while actually behaving in another way. For example, Q-learning is an off-policy learning method because it learns about the optimal policy while taking actions in a more exploratory fashion, e.g., according to an ε-greedy policy. Off-policy learning is of interest because only one way of selecting actions can be used at any time, but we would like to learn about many different ways of behaving from the single resultant stream of experience. For example, the options framework for temporal abstraction involves considering a variety of different ways of selecting actions. For each such option one would like to learn a model of its possible outcomes suitable for planning and other uses. Such option models have been proposed as fundamental building blocks of grounded world knowledge (Sutton, Precup & Singh, 1999; Sutton, Rafols & Koop, 2005). Using off-policy learning, one would be able to learn predictive models for many options at the same time from a single stream of experience. Unfortunately, off-policy learning using temporal-difference methods has proven problematic when used in conjunction with function approximation. Function approximation is essential in order to handle the large state spaces that are inherent in many problem domains. Q-learning, for example, has been proven to converge to an optimal policy in the tabular case, but is unsound and may diverge in the case of linear function approximation (Baird, 1996). Precup, Sutton, and Dasgupta (2001) introduced and proved convergence for the first off-policy learning algorithm with linear function approximation. They addressed the problem of learning the expected value of a target policy based on experience generated using a different behavior policy. They used importance sampling techniques to reduce the off-policy case to the on-policy case, where existing convergence theorems apply (Tsitsiklis & Van Roy, 1997; Tadic, 2001). There are two important difficulties with that approach. First, the behavior policy needs to be stationary and known, because it is needed to compute the importance sampling corrections. Second, the importance sampling weights are often ill-conditioned. In the worst case, the variance could be infinite and convergence would not occur. The conditions required to prevent this were somewhat awkward and, even when they applied and asymptotic convergence was assured, the variance could still be high and convergence could be slow.
In this paper we address both of these problems in the context of off-policy learning for options. We introduce the notion of a recognizer. Rather than specifying an explicit target policy (for instance, the policy of an option), about which we want to make predictions, a recognizer specifies a condition on the actions that are selected. For example, a recognizer for the temporally extended action of picking up a cup would not specify which hand is to be used, or what the motion should be at all the different positions of the cup. The recognizer would recognize a whole variety of directions of motion and poses as part of picking up the cup. The advantage of this strategy is not that one might prefer a multitude of different behaviors, but that the behavior may be based on a variety of different strategies, all of which are relevant, and we would like to learn from any of them. In general, a recognizer is a function that recognizes or accepts a space of different ways of behaving and thus can learn from a wider range of data. Recognizers have two advantages over direct specification of a target policy: 1) they are a natural and easy way to specify a target policy for which importance sampling will be well conditioned, and 2) they do not require the behavior policy to be known. The latter is important because in many cases we may have little knowledge of the behavior policy, or a stationary behavior policy may not even exist. We show that for the case of state aggregation, even if the behavior policy is unknown, convergence to a good model is achieved. The benefits of using recognizers in off-policy learning can be seen most easily in a non-sequential context with a single continuous action. Suppose you are given a sequence of sample actions $a_i \in [0,1]$, selected i.i.d. according to a probability density $b : [0,1] \to \mathbb{R}^+$ (the behavior density). For example, suppose the behavior density is of the oscillatory form shown as a red line in Figure 1. For each action $a_i$ we observe a corresponding outcome $z_i$, a random variable whose distribution depends only on $a_i$. Thus the behavior density induces an outcome density. The on-policy problem is to estimate the mean $m_b$ of the outcome density. This problem can be solved simply by averaging the sample outcomes: $\hat{m}_b = (1/n)\sum_{i=1}^{n} z_i$. The off-policy problem is to use this same data to learn what the mean would be if actions were selected in some way other than $b$, for example, if the actions were restricted to a designated range, such as between 0.7 and 0.9. There are two natural ways to pose this off-policy problem. The most straightforward way is to be equally interested in all actions within the designated region: one professes to be interested in actions selected according to a target density $\pi : [0,1] \to \mathbb{R}^+$, which in the example would be 5.0 between 0.7 and 0.9, and zero elsewhere, as shown by the dashed line in Figure 1. [Figure 1: probability density functions for the target policy with and without a recognizer, and empirical variances (average of 200 sample variances) with and without a recognizer.]
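The non-sequential example above is easy to reproduce numerically; the following sketch (our code, with an arbitrary oscillatory behavior density) contrasts a fixed target density, which requires knowing b, with a recognizer that merely accepts actions in [0.7, 0.9].

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_behavior(n):
    """Rejection-sample actions from an oscillatory density b(a) on [0, 1] (illustrative choice)."""
    b = lambda a: 1.0 + 0.8 * np.sin(6 * np.pi * a)
    out = []
    while len(out) < n:
        a, u = rng.uniform(0, 1), rng.uniform(0, 1.8)
        if u < b(a):
            out.append(a)
    return np.array(out), b

a, b = sample_behavior(5000)
z = a + 0.1 * rng.normal(size=a.size)          # outcomes depend only on the action

target = lambda x: np.where((x >= 0.7) & (x <= 0.9), 5.0, 0.0)   # fixed target density (needs b)
w_fixed = target(a) / b(a)                                        # importance weights pi(a) / b(a)
w_recog = ((a >= 0.7) & (a <= 0.9)).astype(float)                 # recognizer: accept/reject only

print("fixed-target estimate:", np.sum(w_fixed * z) / np.sum(w_fixed))
print("recognizer estimate:  ", np.sum(w_recog * z) / np.sum(w_recog))
```

Note that the recognizer's target is defined relative to the behavior density (actions it accepts, in proportion to how often b selects them), which is exactly why its weights do not require knowledge of b.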

NeurIPS Conference 2004 Conference Paper

Approximately Efficient Online Mechanism Design

  • David Parkes
  • Dimah Yanovsky
  • Satinder Singh

Online mechanism design (OMD) addresses the problem of sequential decision making in a stochastic environment with multiple self-interested agents. The goal in OMD is to make value-maximizing decisions despite this self-interest. In previous work we presented a Markov decision process (MDP)-based approach to OMD in large-scale problem domains. In practice the underlying MDP needed to solve OMD is too large and hence the mechanism must consider approximations. This raises the possibility that agents may be able to exploit the approximation for selfish gain. We adopt sparse-sampling-based MDP algorithms to implement ε-efficient policies, and retain truth-revelation as an approximate Bayesian-Nash equilibrium. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi connectivity to users in a coffeehouse. Mechanism design (MD) is concerned with the problem of providing incentives to implement desired system-wide outcomes in systems with multiple self-interested agents. Agents are assumed to have private information, for example about their utility for different outcomes and about their ability to implement different outcomes, and act to maximize their own utility. The MD approach to achieving multiagent coordination supposes the existence of a center that can receive messages from agents, implement an outcome, and collect payments from agents. The goal of MD is to implement an outcome with desired system-wide properties in a game-theoretic equilibrium. Classic mechanism design considers static systems in which all agents are present and a one-time decision is made about an outcome. Auctions, used in the context of resource-allocation problems, are a standard example. Online mechanism design [1] departs from this and allows agents to arrive and depart dynamically, requiring decisions to be made with uncertainty about the future. Thus, an online mechanism makes a sequence of decisions without the benefit of hindsight about the valuations of the agents yet to arrive. Without the issue of incentives, the online MD problem is a classic sequential decision problem. In prior work [6], we showed that Markov decision processes (MDPs) can be used to define an online Vickrey-Clarke-Groves (VCG) mechanism [2] that makes truth-revelation by the agents (called incentive-compatibility) a Bayesian-Nash equilibrium [5] and implements a policy that maximizes the expected summed value of all agents. This online VCG model differs from the line of work in online auctions, introduced by Lavi and Nisan [4], in that it assumes that the center has a model, and it handles a general decision space and any decision horizon. Computing the payments and allocations in the online VCG mechanism involves solving the MDP that defines the underlying centralized (ignoring self-interest) decision making problem. For large systems, the MDPs that need to be solved exactly become large and thus computationally infeasible. In this paper we consider the case where the underlying centralized MDPs are indeed too large and thus must be solved approximately, as will be the case in most real applications. Of course, there are several choices of methods for solving MDPs approximately. We show that the sparse-sampling algorithm due to Kearns et al. [3] is particularly well suited to online MD because it produces the needed local approximations to the optimal value and action efficiently.
More challengingly, regardless of our choice the agents in the system can exploit their knowledge of the mechanism's approximation algorithm to try and "cheat" the mechanism to further their own selfish interests. Our main contribution is to demonstrate that our new approximate online VCG mechanism has truth-revelation by the agents as an -Bayesian-Nash equilibrium (BNE). This approximate equilibrium supposes that each agent is indifferent to within an expected utility of, and will play a truthful strategy in best- response to truthful strategies of other agents if no other strategy can improve its utility by more than. For any, our online mechanism has a run-time that is independent of the number of states in the underlying MDP, provides an -BNE, and implements a policy with expected value within of the optimal policy's value. Our approach is empirically illustrated in the context of the dynamic allocation of WiFi con- nectivity to users in a coffeehouse. We demonstrate the speed-up introduced with sparse- sampling (compared with policy calculation via value-iteration), as well as the economic value of adopting an MDP-based approach over a simple fixed-price approach. 2 Preliminaries Here we formalize the multiagent sequential decision problem that defines the online mech- anism design (OMD) problem. The approach is centralized. Each agent is asked to report its private information (for instance about its value and its capabilities) to a central planner or mechanism upon arrival. The mechanism implements a policy based on its view of the state of the world (as reported by agents). The policy defines actions in each state, and the assumption is that all agents acquiesce to the decisions of the planner. The mechanism also collects payments from agents, which can themselves depend on the reports of agents. Online Mechanism Design We consider a finite-horizon problem with a set T of time points and a sequence of decisions k = {k1, .. ., kT }, where kt Kt and Kt is the set of feasible decisions in period t. Agent i I arrives at time ai T, departs at time li T, and has value vi(k) 0 for a sequence of decisions k. By assumption, an agent has no value for decisions outside of interval [ai, li]. Agents also face payments, which can be col- lected after an agent's departure. Collectively, information i = (ai, li, vi) defines the type of agent i with i. Agent types are sampled i. i. d. from a probability distribution f (), assumed known to the agents and to the central mechanism. Multiple agents can arrive and depart at the same time. Agent i, with type i, receives utility ui(k, p; i) = vi(k; i) - p, for decisions k and payment p. Agents are modeled as expected-utility maximizers. Definition 1 (Online Mechanism Design) The OMD problem is to implement the sequence of decisions that maximizes the expected summed value across all agents in equilibrium, given self-interested agents with private information about valuations. In economic terms, an optimal (value-maximizing) policy is the allocatively-efficient, or simply the efficient policy. Note that nothing about the OMD models requires centralized execution of the joint plan. Rather, the agents themselves can have capabilities to perform actions and be asked to perform particular actions by the mechanism. The agents can also have private information about the actions that they are able to perform. Using MDPs to Solve Online Mechanism Design. 
In the MDP-based approach to solv- ing the OMD problem the sequential decision problem is formalized as an MDP with the state at any time encapsulating both the current agent population and constraints on current decisions as reflected by decisions made previously. The reward function in the MDP is simply defined to correspond with the total reported value of all agents for all sequences of decisions. Given types i f () we define an MDP, Mf, as follows. Define the state of the MDP at time t as the history-vector ht = (1, .. ., t; k1, .. ., kt-1), to include the reported types up to and including period t and the decisions made up to and including period t - 1. 1 The set of all possible states at time t is denoted Ht. The set of all possible states across all time is H = T +1 H t=1 t. The set of decisions available in state ht is Kt(ht). Given a decision kt Kt(ht) in state ht, there is some probability distribution Prob(ht+1|ht, kt) over possible next states ht+1. In the setting of OMD, this probability distribution is determined by the uncertainty on new agent arrivals (as represented within f ()), together with departures and the impact of decision kt on state. The payoff function for the induced MDP is defined to reflect the goal of maximizing the total expected reward across all agents. In particular, payoff Ri(ht, kt) = vi(kt; i) - vi(kt-1; i) becomes available from agent i upon taking action kt in state ht. With this, we have Ri(h t=1 t, kt) = vi(k; i), for all periods to provide the required cor- respondence with agent valuations. Let R(ht, kt) = Ri(h i t, kt), denote the payoff obtained from all agents at time t. Given a (nonstationary) policy = {1, 2, .. ., T } where t: Ht Kt, an MDP defines an MDP-value function V as follows: V (ht) is the expected value of the summed payoff obtained from state ht onwards under policy, i. e. , V (ht) = E{R(ht, (ht)) + R(ht+1, (ht+1)) + + R(hT, (hT ))}. An optimal policy is one that maximizes the MDP-value of every state in H. The optimal MDP-value function V can be computed by value-iteration, and is defined so that V (h) = maxkK P rob(h |h, k)V (h )] for t = T - t (h) [R(h, k) + h Ht+1 1, T - 2, .. ., 1 and all h Ht, with V (h HT ) = maxkKT (h) R(h, k). Given the optimal MDP-value function, the optimal policy is derived as follows: for t mi >mi arrive. This is the earliest period in which agent i's total value is known with certainty. Let ht (^ t; ) denote the state in period t given reports ^ t. Let ^ t \i = ^ t \ ^ i. Definition 2 (Online VCG mechanism) Given history h H, mechanism Mvcg = (; , pvcg) implements policy and collects payment, pvcg(^; ) = Ri (^; ) - V (h (^; )) - V (h (^ i mi m m ^ a ^ a ^ a ^ a i i i i i i \i; )) (1) from agent i in some period t mi. 1Using histories as state will make the state space very large. Often, there will be some function g for which g(h) is a sufficient statistic for all possible states h. We ignore this possibility here. Agent i's payment is equal to its reported value discounted by the expected marginal value that it will contribute to the system as determined by the MDP-value function for the policy in its arrival period. The incentive-compatibility of the Online VCG mechanism requires that the MDP satisfies stalling which requires that the expected value from the optimal policy in every state in which an agent arrives is at least the expected value from following the optimal action in that state as though the agent had never arrived and then returning to the optimal policy. 
Clearly, property Kt(ht) Kt(ht \ i) in any period t in which i has just arrived is sufficient for stalling. Stalling is satisfied whenever an agent's arrival does not force a change in action on a policy. Theorem 1 (Parkes & Singh [6]) An online VCG mechanism, Mvcg = (; , pvcg), based on an optimal policy for a correct MDP model that satisfies stalling is Bayesian- Nash incentive compatible and implements the optimal MDP policy. 3 Solving Very Large MDPs Approximately From Equation 1, it can be seen that making outcome and payment decisions at any point in time in an online VCG mechanism does not require a global value function or a global policy. Unlike most methods for approximately solving MDPs that compute global approx- imations, the sparse-sampling methods of Kearns et al. [3] compute approximate values and actions for a single state at a time. Furthermore, sparse-sampling methods provide approx- imation guarantees that will be important to establish equilibrium properties -- they can compute an -approximation to the optimal value and action in a given state in time inde- pendent of the size of the state space (though polynomial in 1 and exponential in the time horizon). Thus, sparse-sampling methods are particularly suited to approximating online VCG and we adopt them here. The sparse-sampling algorithm uses the MDP model Mf as a generative model, i. e. , as a black box from which a sample of the next-state and reward distributions for any given state-action pair can be obtained. Given a state s and an approximation parameter, it computes an -accurate estimate of the optimal value for s as follows. We make the param- eterization on explicit by writing sparse-sampling( ). The algorithm builds out a depth-T sampled tree in which each node is a state and each node's children are obtained by sam- pling each action in that state m times (where m is chosen to guarantee an approximation to the optimal value of s), and each edge is labeled with the sample reward for that transi- tion. The algorithm computes estimates of the optimal value for nodes in the tree working backwards from the leaves as follows. The leaf-nodes have zero value. The value of a node is the maximum over the values for all actions in that node. The value of an action in a node is the summed value of the m rewards on the m outgoing edges for that action plus the summed value of the m children of that node. The estimated optimal value of state s is the value of the root node of the tree. The estimated optimal action in state s is the action that leads to the largest value for the root node in the tree. Lemma 1 (Adapted from Kearns, Mansour & Ng [3]) The sparse-sampling( ) algorithm, given access to a generative model for any n-action MDP M, takes as input any state s S and any > 0, outputs an action, and satisfies the following two conditions: (Running Time) The running time of the algorithm is O((nC)T ), where C = f (n, 1, Rmax, T ) and f is a polynomial function of the approximation parameter 1, the number of actions n, the largest expected reward in a state Rmax and the horizon T. In particular, the running time has no dependence on the number of states. (Near-Optimality) The value function of the stochastic policy implemented by the sparse-sampling( ) algorithm, denoted V ss, satisfies |V (s) - V ss(s)| si- multaneously for all states s S. It is straightforward to derive the following corollary from the proof of Lemma 1 in [3]. 
It is straightforward to derive the following corollary from the proof of Lemma 1 in [3].

Corollary 1 The value function computed by the sparse-sampling(ε) algorithm, denoted V̂^ss, is near-optimal in expectation, i.e., |V*(s) - E{V̂^ss(s)}| ≤ ε simultaneously for all states s ∈ S, where the expectation is over the randomness introduced by the sparse-sampling(ε) algorithm.

4 Approximately Efficient Online Mechanism Design

In this section, we define an approximate online VCG mechanism and consider the effect on incentives of substituting the sparse-sampling(ε) algorithm into the online VCG mechanism. We model agents as indifferent between decisions that differ by at most a utility of ε > 0, and consider an approximate Bayesian-Nash equilibrium. Let v_i(θ; π) denote the final value to agent i after reports θ, given policy π.

Definition 3 (approximate BNE) Mechanism M_vcg, defined by policy π and payment rule p_vcg, is ε-Bayesian-Nash incentive compatible if

E_{θ | θ^t}{v_i(θ; π) - p_vcg,i(θ; π)} + ε ≥ E_{θ | θ^t}{v_i((θ_{-i}, θ̂_i); π) - p_vcg,i((θ_{-i}, θ̂_i); π)}    (2)

where agent i with type θ_i arrives in period t, and the expectation is taken over future types given current reports θ^t. In particular, when truth-telling is an ε-BNE we say that the mechanism is ε-BNE incentive compatible; no agent can improve its expected utility by more than ε > 0, for any type, as long as other agents are bidding truthfully. Equivalently, one can interpret an ε-BNE as an exact equilibrium for agents that face a computational cost of at least ε to compute the exact BNE.

Definition 4 (approximate mechanism) A sparse-sampling(ε) based approximate online VCG mechanism, M_vcg(ε), uses the sparse-sampling(ε) algorithm to implement stochastic policy π̃ and collects payment

p̃_vcg,i(θ̂; π̃) = R_i^{≤ m_i}(θ̂; π̃) - [V̂^ss(h_{â_i}(θ̂; π̃)) - V̂^ss(h_{â_i}(θ̂_{-i}; π̃))]

from agent i in some period t ≥ m_i, for commitment period m_i.

Our proof of incentive compatibility first demonstrates that an approximate delayed VCG mechanism [1, 6] is ε-BNE. With this, we demonstrate that the expected value of the payments in the approximate online VCG mechanism is within 3ε of the payments in the approximate delayed VCG mechanism. The delayed VCG mechanism makes the same decisions as the online VCG mechanism, except that payments are delayed until the final period and computed as

p_Dvcg,i(θ̂; π) = R_i^{≤ T}(θ̂; π) - [R^{≤ T}(θ̂; π) - R^{≤ T}(θ̂_{-i}; π)]    (3)

where the discount is computed ex post, once the effect of an agent on the system value is known. In an approximate delayed-VCG mechanism, the role of the sparse-sampling algorithm is to implement an approximate policy, as well as counterfactual policies for the worlds without each agent i in turn. The total reported reward to agents j ≠ i over this counterfactual series of states is used to define the payment to agent i.

Lemma 2 Truthful bidding is an ε-Bayesian-Nash equilibrium in the sparse-sampling(ε) based approximate delayed-VCG mechanism.

Proof: Let π̃ denote the approximate policy computed by the sparse-sampling algorithm. Assume agents j ≠ i are truthful. Now, if agent i bids truthfully its expected utility is

E_{θ | θ^{a_i}}{v_i(θ; π̃) + Σ_{j ≠ i} R_j^{≤ T}(θ; π̃) - Σ_{j ≠ i} R_j^{≤ T}(θ_{-i}; π̃)}    (4)

where the expectation is over both the types yet to be reported and the randomness introduced by the sparse-sampling(ε) algorithm. Substituting, this is at least

V*(h_{a_i}(θ; π̃)) - V^ss(h_{a_i}(θ_{-i}; π̃)) - ε    (5)

because V^ss(h_{a_i}(θ; π̃)) ≥ V*(h_{a_i}(θ; π̃)) - ε by Lemma 1. Now, ignore the term R^{≤ T}(θ_{-i}; π̃) in Equation (4), which is independent of agent i's bid θ̂_i, and consider the maximal expected utility to agent i from some non-truthful bid.
The effect of θ̂_i on the first two terms is indirect, through a change in the policy for periods t ≥ a_i. An agent can change the policy only indirectly, by changing the center's view of the state through misreporting its type. By definition, the agent can do no better than selecting the optimal policy, which is defined to maximize the expected value of the first two terms. Putting this together, the expected utility from θ̂_i is at most V*(h_{a_i}(θ; π̃)) - V^ss(h_{a_i}(θ_{-i}; π̃)), and thus at most ε better than that from bidding truthfully.

Theorem 2 Truthful bidding is a 4ε-Bayesian-Nash equilibrium in the sparse-sampling(ε) based approximate online VCG mechanism.

Proof: Assume agents j ≠ i bid truthfully, and consider report θ̂_i. Clearly, the policy implemented in the approximate online-VCG mechanism is the same as in the delayed-VCG mechanism for all θ̂_i. Left to show is that the expected values of the payments are within 3ε for all θ̂_i. From this we conclude that the expected utility to agent i in the approximate online-VCG mechanism is always within 3ε of that in the approximate delayed-VCG mechanism, and therefore truthful bidding is a 4ε-BNE by Lemma 2. The expected payment in the approximate online VCG mechanism is

E{R_i^{≤ m_i}(θ̂; π̃)} - E{V̂^ss(h_{â_i}(θ̂; π̃))} + E{V̂^ss(h_{â_i}(θ̂_{-i}; π̃))}.

The value function computed by the sparse-sampling(ε) algorithm is a random variable to agent i at the time of bidding, and the second and third expectations are over the randomness introduced by the sparse-sampling(ε) algorithm. The first term is the same as the first term of the payment in the approximate delayed-VCG mechanism. By Corollary 1, the value function estimated by sparse-sampling(ε) is near-optimal in expectation, and the total of the second two terms is at least V*(h_{â_i}(θ̂_{-i}; π̃)) - V*(h_{â_i}(θ̂; π̃)) - 2ε. Ignoring the first term in p_Dvcg,i, the expected payment in the approximate delayed-VCG mechanism is no more than V*(h_{â_i}(θ̂_{-i}; π̃)) - (V*(h_{â_i}(θ̂; π̃)) - ε), because of the near-optimality of the value function of the stochastic policy (Lemma 1). Putting this together, we have a maximum difference in expected payments of 3ε. A similar analysis yields a maximum difference of 3ε when an upper bound is taken on the payment in the online VCG mechanism and compared with a lower bound on the payment in the delayed mechanism.

Theorem 3 For any parameter ε > 0, the sparse-sampling(ε) based approximate online VCG mechanism achieves ε-efficiency in a 4ε-BNE.

5 Empirical Evaluation of Approximate Online VCG

The WiFi Problem. The WiFi problem considers a fixed number of channels C with a random number of agents (at most A) arriving per period, and a time horizon of T = 50. Agents demand a single channel and arrive with a per-unit value, distributed i.i.d. over V = {v_1, ..., v_k}, and a duration in the system, distributed i.i.d. over D = {d_1, ..., d_l}. The value model requires that any allocation to agent i be for contiguous periods and be made while the agent is present (i.e., during periods [a_i, a_i + d_i], for arrival a_i and duration d_i). An agent's value for an allocation of duration x is v_i · x, where v_i is its per-unit value. Let d_max denote the maximal possible allocated duration.
We define the following MDP components.

State space: We use the following compact, sufficient statistic of history: a resource schedule, a (weakly non-decreasing) vector of length d_max that counts the number of channels available in the current period and the next d_max - 1 periods given previous actions (C channels are available beyond this window); an agent vector of size k × l that counts the agents present but not yet allocated, for each possible per-unit value and each possible duration (the duration is automatically decremented when an agent remains in the system for a period without receiving an allocation); and the time remaining until horizon T.

Action space: The policy can postpone an agent's allocation, or allocate an agent to a channel for the remaining duration of the agent's time in the system, provided a channel is available and the remaining duration is not greater than d_max.

Payoff function: The reward at a time step is the summed value obtained from all agents allocated in that time step; this is the total value such an agent will receive before it departs.

Transition probabilities: The change in the resource schedule, and in the part of the agent vector that relates to agents currently present, is deterministic. The random new additions to the agent vector at each step are unaffected by the actions and independent of time.

Mechanisms. In the exact online VCG mechanism we compute an optimal policy, and optimal MDP values, offline using finite-horizon value iteration [7]. In the sparse-sampling(ε) approach, we define a sampling tree of depth L.

[Figure 1: (A) Value and Revenue vs. Number of Agents. (B) Value and Revenue vs. Number of Channels. (C) Effect of Sampling Width. (D) Pruning speed-up.]

In the sampling-width comparison the model parameters are small: A = 2, C = 2, D = {1, 2, 3}, V = {1, 2, 3}, and L = 4, to allow a comparison with the compute time for an optimal policy. The sparse-sampling method already runs in less than 1% of the time of optimal value iteration (right-hand axis), with an accuracy as high as 96% of optimal. Pruning provides an incremental speed-up and actually improves accuracy, presumably by making better use of each sample. Figure 1 (D) shows that pruning is extremely useful computationally (in comparison with plain sparse-sampling) for the default model parameters as the number of agents is increased from 2 to 12. Pruning is effective, removing around 50% of agents (summed across all states in the lookahead tree) at 5 agents.

Acknowledgments. David Parkes was funded by NSF grant IIS-0238147. Satinder Singh was funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program.
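Returning to the WiFi MDP above, the compact state (resource schedule, agent-count vector, time remaining) is exactly what a generative model for sparse sampling has to expose. The sketch below is an illustrative guess at such a model; the field names, the arrival process, and the action encoding are assumptions made for the example, not the authors' code.

```python
import random
from dataclasses import dataclass

@dataclass
class WiFiState:
    """Compact sufficient statistic of history for the WiFi MDP (illustrative)."""
    schedule: list   # schedule[i] = channels free in period t + i, weakly non-decreasing
    agents: dict     # (per-unit value, remaining duration) -> count of waiting agents
    time_left: int   # periods remaining until the horizon T

def sample_transition(state, allocations, values, durations, num_channels, max_arrivals):
    """Hypothetical generative model: apply the chosen allocations, advance one period,
    and sample i.i.d. new arrivals; returns (reward, next_state)."""
    schedule, agents, reward = list(state.schedule), dict(state.agents), 0.0
    for v, d in allocations:                      # allocate selected agents for their remaining duration
        if agents.get((v, d), 0) > 0 and d <= len(schedule) and schedule[0] > 0:
            agents[(v, d)] -= 1
            for i in range(d):                    # the channel is occupied for d contiguous periods
                schedule[i] -= 1
            reward += v * d                       # total value the agent receives before departing
    schedule = schedule[1:] + [num_channels]      # advance one period; C channels free beyond the window
    agents = {(v, d - 1): c for (v, d), c in agents.items() if c > 0 and d > 1}  # unallocated agents wait
    for _ in range(random.randint(0, max_arrivals)):  # assumed arrival process for the sketch
        key = (random.choice(values), random.choice(durations))
        agents[key] = agents.get(key, 0) + 1
    return reward, WiFiState(schedule, agents, state.time_left - 1)
```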

NeurIPS Conference 2004 Conference Paper

Intrinsically Motivated Reinforcement Learning

  • Nuttapong Chentanez
  • Andrew Barto
  • Satinder Singh

Psychologists call behavior intrinsically motivated when it is engaged in for its own sake rather than as a step toward solving a specific problem of clear practical value. But what we learn during intrinsically motivated behavior is essential for our development as competent autonomous entities able to efficiently solve a wide range of practical problems as they arise. In this paper we present initial results from a computational study of intrinsically motivated reinforcement learning aimed at allowing artificial agents to construct and extend hierarchies of reusable skills that are needed for competent autonomy.

1 Introduction

Psychologists distinguish between extrinsic motivation, which means being moved to do something because of some specific rewarding outcome, and intrinsic motivation, which refers to being moved to do something because it is inherently enjoyable. Intrinsic motivation leads organisms to engage in exploration, play, and other behavior driven by curiosity in the absence of explicit reward. These activities favor the development of broad competence rather than being directed to more specific external goals (e.g., ref. [14]). In contrast, machine learning algorithms are typically applied to single problems and so do not cope flexibly with new problems as they arise over extended periods of time. Although the acquisition of competence may not be driven by specific problems, this competence is routinely enlisted to solve many different specific problems over the agent's lifetime. The skills making up general competence act as the "building blocks" out of which an agent can form solutions to new problems as they arise. Instead of facing each new challenge by trying to create a solution out of low-level primitives, it can focus on combining and adjusting its higher-level skills. In animals, this greatly increases the efficiency of learning to solve new problems, and our main objective is to achieve a similar efficiency in our machine learning algorithms and architectures.

This paper presents an elaboration of the reinforcement learning (RL) framework [11] that encompasses the autonomous development of skill hierarchies through intrinsically motivated reinforcement learning. We illustrate its ability to allow an agent to learn broad competence in a simple "playroom" environment. In a related paper [1], we provide more extensive background for this approach, whereas here the focus is more on algorithmic details. Lack of space prevents us from providing a comprehensive background to the many ideas to which our approach is connected. Many researchers have argued for this kind of developmental approach in which an agent undergoes an extended developmental period during which collections of reusable skills are autonomously learned that will be useful for a wide range of later challenges (e.g., [4, 13]). The previous machine learning research most closely related is that of Schmidhuber (e.g., [8]) on confidence-based curiosity and the ideas of exploration and shaping bonuses [6, 10], although our definition of intrinsic reward differs from these. The most direct inspiration behind the experiment reported in this paper comes from neuroscience. The neuromodulator dopamine has long been associated with reward learning [9].
Recent studies [2, 3] have focused on the idea that dopamine not only plays a critical role in the extrinsic motivational control of behaviors aimed at harvesting explicit rewards, but also in the intrinsic motivational control of behaviors associated with novelty and exploration. For instance, salient, novel sensory stimuli inspire the same sort of phasic activity of dopamine cells as unpredicted rewards. However, this activation extinguishes more or less quickly as the stimuli become familiar. This may underlie the fact that novelty itself has rewarding characteristics [7]. These connections are key components of our approach to intrinsically motivated RL.

2 Reinforcement Learning of Skills

According to the "standard" view of RL (e.g., [11]) the agent-environment interaction is envisioned as the interaction between a controller (the agent) and the controlled system (the environment), with a specialized reward signal coming from a "critic" in the environment that evaluates (usually with a scalar reward value) the agent's behavior (Fig. 1A). The agent learns to improve its skill in controlling the environment in the sense of learning how to increase the total amount of reward it receives over time from the critic.

[Figure 1: Agent-Environment Interaction in RL. A: The usual view. B: An elaboration.]

Sutton and Barto [11] point out that one should not identify this RL agent with an entire animal or robot. An animal's reward signals are determined by processes within its brain that monitor not only external state but also the animal's internal state. The critic is in an animal's head. Fig. 1B makes this more explicit by "factoring" the environment of Fig. 1A into an external environment and an internal environment, the latter of which contains the critic which determines primary reward. This scheme still includes cases in which reward is essentially an external stimulus (e.g., a pat on the head or a word of praise). These are simply stimuli transduced by the internal environment so as to generate the appropriate level of primary reward.

The usual practice in applying RL algorithms is to formulate the problem one wants the agent to learn how to solve (e.g., win at backgammon) and define a reward function specially tailored for this problem (e.g., reward = 1 on a win, reward = 0 on a loss). Sometimes considerable ingenuity is required to craft an appropriate reward function. The point of departure for our approach is to note that the internal environment contains, among other things, the organism's motivational system, which needs to be a sophisticated system that should not have to be redesigned for different problems. Handcrafting a different special-purpose motivational system (as in the usual RL practice) should be largely unnecessary.

Skills--Autonomous mental development should result in a collection of reusable skills. But what do we mean by a skill? Our approach to skills builds on the theory of options [12]. Briefly, an option is something like a subroutine. It consists of 1) an option policy that directs the agent's behavior for a subset of the environment states, 2) an initiation set consisting of all the states in which the option can be initiated, and 3) a termination condition, which specifies the conditions under which the option terminates.
It is important to note that an option is not a sequence of actions; it is a closed-loop control rule, meaning that it is responsive to on-going state changes. Furthermore, because options can invoke other options as actions, hierarchical skills and algorithms for learning them naturally emerge from the conception of skills as options. Theoretically, when options are added to the set of admissible agent actions, the usual Markov decision process (MDP) formulation of RL extends to semi-Markov decision processes (SMDPs), with the one-step actions now becoming the "primitive actions." All of the theory and algorithms applicable to SMDPs can be appropriated for decision making and learning with options [12]. Two components of the options framework are especially important for our approach:

1. Option Models: An option model is a probabilistic description of the effects of executing an option. As a function of an environment state where the option is initiated, it gives the probability with which the option will terminate at any other state, and it gives the total amount of reward expected over the option's execution. Option models can be learned from experience (usually only approximately) using standard methods. Option models allow stochastic planning methods to be extended to handle planning at higher levels of abstraction.

2. Intra-option Learning Methods: These methods allow the policies of many options to be updated simultaneously during an agent's interaction with the environment. If an option could have produced a primitive action in a given state, its policy can be updated on the basis of the observed consequences even though it was not directing the agent's behavior at the time.

In most of the work with options, the set of options must be provided by the system designer. While an option's policy can be improved through learning, each option has to be predefined by providing its initiation set, termination condition, and the reward function that evaluates its performance. Many researchers have recognized the desirability of automatically creating options, and several approaches have recently been proposed (e.g., [5]). For the most part, these methods extract options from the learning system's attempts to solve a particular problem, whereas our approach creates options outside of the context of solving any particular problem.

Developing Hierarchical Collections of Skills--Children accumulate skills while they engage in intrinsically motivated behavior, e.g., while at play. When they notice that something they can do reliably results in an interesting consequence, they remember this in a form that will allow them to bring this consequence about if they wish to do so at a future time when they think it might contribute to a specific goal. Moreover, they improve the efficiency with which they bring about this interesting consequence with repetition, before they become bored and move on to something else. We claim that the concepts of an option and an option model are exactly appropriate to model this type of behavior. Indeed, one of our main contributions is a (preliminary) demonstration of this claim.
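As a rough illustration of the option construct just described (policy, initiation set, termination condition), one possible in-code representation is sketched below; the attribute and method names are assumptions made for the example rather than the notation of [12].

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Set, Tuple

State = Any
Action = Any

@dataclass
class Option:
    """A temporally extended action: closed-loop policy + initiation set + termination."""
    initiation_set: Set[State] = field(default_factory=set)   # states where the option may start
    q_values: Dict[Tuple[State, Action], float] = field(default_factory=dict)  # the option's own Q_o
    termination_prob: Callable[[State], float] = lambda s: 0.0  # beta(s): probability of stopping in s

    def policy(self, state: State, actions) -> Action:
        """Greedy closed-loop choice with respect to the option's action-value function."""
        return max(actions, key=lambda a: self.q_values.get((state, a), 0.0))

    def can_initiate(self, state: State) -> bool:
        return state in self.initiation_set
```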
3 Intrinsically Motivated RL

Our main departure from the usual application of RL is that our agent maintains a knowledge base of skills that it learns using intrinsic rewards. In most other regards, our extended RL framework is based on putting together learning and planning algorithms for options [12].

[Figure 2: Learning Algorithm. Extrinsic reward is denoted r^e and intrinsic reward r^i. Equations of the form x ←α [y] are shorthand for x ← (1 - α)x + α[y]. The behavior action-value function Q_B is updated using a combination of Q-learning and SMDP planning; throughout, γ is a discount factor and α is the step-size. The option action-value functions Q_o are updated using intra-option Q-learning. Note that the intrinsic reward is used only in updating Q_B and not any of the Q_o.]

Behavior--The agent behaves in its environment according to an ε-greedy policy with respect to an action-value function Q_B that is learned using a mix of Q-learning and SMDP planning as described in Fig. 2. Initially only the primitive actions are available to the agent. Over time, skills represented internally as options, along with their models, also become available to the agent as action choices. Thus, Q_B maps states s and actions a (both primitive and options) to the expected long-term utility of taking action a in state s.

Salient Events--In our current implementation we assume that the agent has intrinsic or hardwired notions of interesting or "salient" events in its environment. For example, in the playroom environment we present shortly, the agent finds changes in light and sound intensity to be salient. These are intended to be independent of any specific task and likely to be applicable to many environments.

Reward--In addition to the usual extrinsic rewards, there are occasional intrinsic rewards generated by the agent's critic (see Fig. 1B). In this implementation, the agent's intrinsic reward is generated in a way suggested by the novelty response of dopamine neurons. The intrinsic reward for each salient event is proportional to the error in the prediction of the salient event according to the learned option model for that event (see Fig. 2 for detail).
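To make the Figure 2 updates concrete, here is a small hedged sketch of the two pieces the text emphasizes: an intrinsic reward proportional to the option model's prediction error for a salient event, and the Q-learning backup of the behavior value function Q_B driven by the sum of extrinsic and intrinsic reward. The interface and constants are illustrative assumptions, not the exact quantities in Figure 2.

```python
def intrinsic_reward(predicted_event_prob, scale=1.0):
    """Intrinsic reward when a salient event occurs: proportional to the error in the
    learned option model's prediction of that event (here, 1 - predicted probability)."""
    return scale * (1.0 - predicted_event_prob)

def behavior_q_update(Q_B, state, action, r_ext, r_int, next_state, choices,
                      alpha=0.1, gamma=0.99):
    """Q-learning backup on the behavior value function Q_B; the intrinsic reward enters
    only here, not in the per-option value functions Q_o."""
    best_next = max(Q_B.get((next_state, a), 0.0) for a in choices)
    target = r_ext + r_int + gamma * best_next
    Q_B[(state, action)] = (1 - alpha) * Q_B.get((state, action), 0.0) + alpha * target
```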
Skill-KB--The agent maintains a knowledge base of skills that it has learned in its environment. Initially this may be empty. The first time a salient event occurs, say the light turning on, structures to learn an option that achieves that salient event (a turn-light-on option) are created in the skill-KB. In addition, structures to learn an option model are also created. So for option o, Q_o maps states s and actions a (again, both primitive and options) to the long-term utility of taking action a in state s. The option for a salient event terminates with probability one in any state that achieves that event and never terminates in any other state. The initiation set, I_o, for an option o is incrementally expanded to include states that lead to states in the current initiation set.

Learning--The details of the learning algorithm are presented in Fig. 2.

4 Playroom Domain: Empirical Results

We implemented intrinsically motivated RL (of Fig. 2) in a simple artificial "playroom" domain shown in Fig. 3A. In the playroom are a number of objects: a light switch, a ball, a bell, two movable blocks that are also buttons for turning music on and off, as well as a toy monkey that can make sounds. The agent has an eye, a hand, and a visual marker (seen as a cross hair in the figure). The agent's sensors tell it what objects (if any) are under the eye, hand, and marker. At any time step, the agent has the following actions available to it: 1) move eye to hand, 2) move eye to marker, 3) move eye one step north, south, east or west, 4) move eye to random object, 5) move hand to eye, and 6) move marker to eye. In addition, if both the eye and hand are on some object, then natural operations suggested by the object become available, e.g., if both the hand and the eye are on the light switch, then the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, then the action of kicking the ball becomes available (the ball, when kicked, moves in a straight line to the marker).

The objects in the playroom all have potentially interesting characteristics. The bell rings once and moves to a random adjacent square if the ball is kicked into it. The light switch controls the lighting in the room. The colors of any of the blocks in the room are only visible if the light is on; otherwise they appear similarly gray. The blue block, if pressed, turns music on, while the red block, if pressed, turns music off. Either block can be pushed and as a result moves to a random adjacent square. The toy monkey makes frightened sounds if, simultaneously, the room is dark, the music is on, and the bell is rung. These objects were designed to have varying degrees of difficulty to engage. For example, to get the monkey to cry out requires the agent to do the following sequence of actions: 1) get its eye to the light switch, 2) move hand to eye, 3) push the light switch to turn the light on, 4) find the blue block with its eye, 5) move the hand to the eye, 6) press the blue block to turn music on, 7) find the light switch with its eye, 8) move hand to eye, 9) press light switch to turn light off, 10) find the bell with its eye, 11) move the marker to the eye, 12) find the ball with its eye, 13) move its hand to the ball, and 14) kick the ball to make the bell ring.
Notice that if the agent has already learned how to turn the light on and off, how to turn music on, and how to make the bell ring, then those learned skills would be of obvious use in simplifying this process of engaging the toy monkey.

[Figure 3: A. Playroom domain. B. Speed of learning of various skills. C. The effect of intrinsically motivated learning when extrinsic reward is present. See text for details.]

For this simple example, changes in light and sound intensity are considered salient by the playroom agent. Because the initial action value function, Q_B, is uninformative, the agent starts by exploring its environment randomly. Each first encounter with a salient event initiates the learning of an option and an option model for that salient event. For example, the first time the agent happens to turn the light on, it initiates the data structures necessary for learning and storing the light-on option. As the agent moves around the environment, all the options (initiated so far) and their models are simultaneously updated using intra-option learning. As shown in Fig. 2, the intrinsic reward is used to update Q_B. As a result, when the agent encounters an unpredicted salient event a few times, its updated action value function drives it to repeatedly attempt to achieve that salient event. There are two interesting side effects of this: 1) as the agent tries to repeatedly achieve the salient event, learning improves both its policy for doing so and its option model that predicts the salient event, and 2) as its option policy and option model improve, the intrinsic reward diminishes and the agent gets "bored" with the associated salient event and moves on. Of course, the option policy and model become accurate only in states the agent encounters frequently. Occasionally, the agent encounters the salient event in a state (set of sensor readings) that it has not encountered before, and it generates intrinsic reward again (it is "surprised").

A summary of results is presented in Fig. 4. Each panel of the figure is for a distinct salient event. The graph in each panel shows both the time steps at which the event occurs as well as the intrinsic reward associated by the agent to each occurrence. Each occurrence is denoted by a vertical bar whose height denotes the amount of associated intrinsic reward. Note that as one goes from top to bottom in this figure, the salient events become harder to achieve and, in fact, become more hierarchical. Indeed, the lowest one, for turning on the monkey noise (Non), needs light on, music on, light off, and sound on in sequence. A number of interesting results can be observed in this figure. First note that the salient events that are simpler to achieve occur earlier in time. For example, Lon (light turning on) and Loff (light turning off) are the simplest salient events, and the agent makes these happen quite early. The agent tries them a large number of times before getting bored and moving on to other salient events. The reward obtained for each of these events diminishes after repeated exposure to the event.
Thus, automatically, the skills for achieving the simpler events are learned before those for the more complex events.

[Figure 4: Results from the playroom domain. Each panel depicts the occurrences of salient events as well as the associated intrinsic rewards. See text for details.]

Of course, the events keep happening despite their diminished capacity to reward, because they are needed to achieve the more complex events. Consequently, the agent continues to turn the light on and off even after it has learned this skill, because this is a step along the way toward turning on the music, as well as along the way toward turning on the monkey noise. Finally, note that the more complex skills are learned relatively quickly once the required sub-skills are in place, as one can see from the few rewards the agent receives for them. The agent is able to bootstrap and build upon the options it has already learned for the simpler events. We confirmed the hierarchical nature of the learned options by inspecting the greedy policies for the more complex options like Non and Noff. The fact that all the options are successfully learned is also seen in Fig. 3B, in which we show how long it takes to bring about the events at different points in the agent's experience (there is an upper cutoff of 120 steps). This figure also shows that the simpler skills are learned earlier than the more complex ones.

An agent having a collection of skills learned through intrinsic reward can learn a wide variety of extrinsically rewarded tasks more easily than an agent lacking these skills. To illustrate, we looked at a playroom task in which extrinsic reward was available only if the agent succeeded in making the monkey cry out. This requires the 14 steps described above. This is difficult for an agent to learn if only the extrinsic reward is available, but much easier if the agent can use intrinsic reward to learn a collection of skills, some of which are relevant to the overall task. Fig. 3C compares the performance of two agents in this task. Each starts out with no knowledge of the task, but one employs the intrinsic reward mechanism we have discussed above. The extrinsic reward is always available, but only when the monkey cries out. The figure, which shows the average of 100 repetitions of the experiment, clearly shows the advantage of learning with intrinsic reward.

Discussion--One of the key aspects of the playroom example was that intrinsic reward was generated only by unexpected salient events. But this is only one of the simplest possibilities and has many limitations. It cannot account for what makes many forms of exploration and manipulation "interesting." In the future, we intend to implement computational analogs of other forms of intrinsic motivation as suggested in the psychological, statistical, and neuroscience literatures. Despite the "toy" nature of this domain, these results are among the most sophisticated we have seen involving intrinsically motivated learning. Moreover, they were achieved quite directly by combining a collection of existing RL algorithms for learning options and option-models with a simple notion of intrinsic reward. The idea of intrinsic motivation for artificial agents is certainly not new, but we hope to have shown that the elaboration of the formal RL framework in the direction we have pursued, together with the use of recently-developed hierarchical RL algorithms, provides a fruitful basis for developing competently autonomous agents.
Acknowledgement Satinder Singh and Nuttapong Chentanez were funded by NSF grant CCF 0432027 and by a grant from DARPA's IPTO program. Andrew Barto was funded by NSF grant CCF 0432143 and by a grant from DARPA's IPTO program.

NeurIPS Conference 2003 Conference Paper

A Nonlinear Predictive State Representation

  • Matthew Rudary
  • Satinder Singh

Predictive state representations (PSRs) use predictions of a set of tests to represent the state of controlled dynamical systems. One reason why this representation is exciting as an alternative to partially observable Markov decision processes (POMDPs) is that PSR models of dynamical systems may be much more compact than POMDP models. Empirical work on PSRs to date has focused on linear PSRs, which have not allowed for compression relative to POMDPs. We introduce a new notion of tests which allows us to define a new type of PSR that is nonlinear in general and allows for exponential compression in some deterministic dynamical systems. These new tests, called e-tests, are related to the tests used by Rivest and Schapire [1] in their work with the diversity representation, but our PSR avoids some of the pitfalls of their representation—in particular, its potential to be exponentially larger than the equivalent POMDP.

NeurIPS Conference 2003 Conference Paper

An MDP-Based Approach to Online Mechanism Design

  • David Parkes
  • Satinder Singh

Online mechanism design (MD) considers the problem of providing incentives to implement desired system-wide outcomes in systems with self-interested agents that arrive and depart dynamically. Agents can choose to misrepresent their arrival and departure times, in addition to information about their value for different outcomes. We consider the problem of maximizing the total long-term value of the system despite the self-interest of agents. The online MD problem induces a Markov Decision Process (MDP), which when solved can be used to implement optimal policies in a truth-revealing Bayesian-Nash equilibrium.

NeurIPS Conference 2001 Conference Paper

An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

  • Michael Littman
  • Michael Kearns
  • Satinder Singh

We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games.

AAAI Conference 2000 Conference Paper

Cobot in LambdaMOO: A Social Statistics Agent

  • Charles Lee Isbell
  • Michael Kearns
  • Satinder Singh

We describe our development of Cobot, a software agent who lives in LambdaMOO, a popular virtual world frequented by hundreds of users. We present a detailed discussion of the functionality that has made him one of the objects most frequently interacted with in LambdaMOO, human or artificial.

NeurIPS Conference 1999 Conference Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation

  • Richard Sutton
  • David McAllester
  • Satinder Singh
  • Yishay Mansour

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and determining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, independent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy. Large applications of reinforcement learning (RL) require the use of generalizing function approximators such as neural networks, decision-trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e.g., as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many applications, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e.g., see Singh, Jaakkola, and Jordan, 1994). Second, an arbitrarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to establishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-learning, Sarsa, and dynamic programming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsitsiklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods. In this paper we explore an alternative approach to function approximation in RL.

NeurIPS Conference 1999 Conference Paper

Reinforcement Learning for Spoken Dialogue Systems

  • Satinder Singh
  • Michael Kearns
  • Diane Litman
  • Marilyn Walker

Recently, a number of authors have proposed treating dialogue systems as Markov decision processes (MDPs). However, the practical application of MDP algorithms to dialogue systems faces a number of severe technical challenges. We have built a general software tool (RLDS, for Reinforcement Learning for Dialogue Systems) based on the MDP framework, and have applied it to dialogue corpora gathered from two dialogue systems built at AT&T Labs. Our experiments demonstrate that RLDS holds promise as a tool for "browsing" and understanding correlations in complex, temporally dependent dialogue corpora.

NeurIPS Conference 1998 Conference Paper

Experimental Results on Learning Stochastic Memoryless Policies for Partially Observable Markov Decision Processes

  • John Williams
  • Satinder Singh

Partially Observable Markov Decision Processes (POMDPs) constitute an important class of reinforcement learning problems which present unique theoretical and computational difficulties. In the absence of the Markov property, popular reinforcement learning algorithms such as Q-learning may no longer be effective, and memory-based methods which remove partial observability via state-estimation are notoriously expensive. An alternative approach is to seek a stochastic memoryless policy which for each observation of the environment prescribes a probability distribution over available actions that maximizes the average reward per timestep. A reinforcement learning algorithm which learns a locally optimal stochastic memoryless policy has been proposed by Jaakkola, Singh and Jordan, but not empirically verified. We present a variation of this algorithm, discuss its implementation, and demonstrate its viability using four test problems.

NeurIPS Conference 1998 Conference Paper

Finite-Sample Convergence Rates for Q-Learning and Indirect Algorithms

  • Michael Kearns
  • Satinder Singh

In this paper, we address two issues of long-standing interest in the reinforcement learning literature. First, what kinds of performance guarantees can be made for Q-learning after only a finite number of actions? Second, what quantitative comparisons can be made between Q-learning and model-based (indirect) approaches, which use experience to estimate next-state distributions for off-line value iteration? We first show that both Q-learning and the indirect approach enjoy rather rapid convergence to the optimal policy as a function of the number of state transitions observed. In particular, on the order of only (N log(1/ε)/ε²)(log(N) + log log(1/ε)) transitions are sufficient for both algorithms to come within ε of the optimal policy, in an idealized model that assumes the observed transitions are "well-mixed" throughout an N-state MDP. Thus, the two approaches have roughly the same sample complexity. Perhaps surprisingly, this sample complexity is far less than what is required for the model-based approach to actually construct a good approximation to the next-state distribution. The result also shows that the amount of memory required by the model-based approach is closer to N than to N². For either approach, to remove the assumption that the observed transitions are well-mixed, we consider a model in which the transitions are determined by a fixed, arbitrary exploration policy. Bounds on the number of transitions required in order to achieve a desired level of performance are then related to the stationary distribution and mixing time of this policy.

NeurIPS Conference 1998 Conference Paper

Improved Switching among Temporally Abstract Actions

  • Richard Sutton
  • Satinder Singh
  • Doina Precup
  • Balaraman Ravindran

In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs. In many applications, solutions to parts of a task are known, either because they were hand-crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun and Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5.

NeurIPS Conference 1998 Conference Paper

Optimizing Admission Control while Ensuring Quality of Service in Multimedia Networks via Reinforcement Learning

  • Timothy Brown
  • Hui Tong
  • Satinder Singh

This paper examines the application of reinforcement learning to a telecommunications networking problem. The problem requires that revenue be maximized while simultaneously meeting a quality of service constraint that forbids entry into certain states. We present a general solution to this multi-criteria problem that is able to earn significantly higher revenues than alternatives.

NeurIPS Conference 1997 Conference Paper

How to Dynamically Merge Markov Decision Processes

  • Satinder Singh
  • David Cohn

We are frequently called upon to perform multiple tasks that compete for our attention and resource. Often we know the optimal solution to each task in isolation; in this paper, we describe how this knowledge can be exploited to efficiently find good solutions for doing the tasks in parallel. We formulate this problem as that of dynamically merging multiple Markov decision processes (MDPs) into a composite MDP, and present a new theoretically-sound dynamic programming algorithm for finding an optimal policy for the composite MDP. We analyze various aspects of our algorithm and illustrate its use on a simple merging problem. Every day, we are faced with the problem of doing multiple tasks in parallel, each of which competes for our attention and resource. If we are running a job shop, we must decide which machines to allocate to which jobs, and in what order, so that no jobs miss their deadlines. If we are a mail delivery robot, we must find the intended recipients of the mail while simultaneously avoiding fixed obstacles (such as walls) and mobile obstacles (such as people), and still manage to keep ourselves sufficiently charged up. Frequently we know how to perform each task in isolation; this paper considers how we can take the information we have about the individual tasks and combine it to efficiently find an optimal solution for doing the entire set of tasks in parallel. More importantly, we describe a theoretically-sound algorithm for doing this merging dynamically; new tasks (such as a new job arrival at a job shop) can be assimilated online into the solution being found for the ongoing set of simultaneous tasks.

NeurIPS Conference 1996 Conference Paper

Analytical Mean Squared Error Curves in Temporal Difference Learning

  • Satinder Singh
  • Peter Dayan

We have calculated analytical expressions for how the bias and variance of the estimators provided by various temporal difference value estimation algorithms change with offline updates over trials in absorbing Markov chains using lookup table representations. We illustrate classes of learning curve behavior in various chains, and show the manner in which TD is sensitive to the choice of its step-size and eligibility trace parameters.

NeurIPS Conference 1996 Conference Paper

Predicting Lifetimes in Dynamically Allocated Memory

  • David Cohn
  • Satinder Singh

Predictions of lifetimes of dynamically allocated objects can be used to improve time and space efficiency of dynamic memory management in computer programs. Barrett and Zorn [1993] used a simple lifetime predictor and demonstrated this improvement on a variety of computer programs. In this paper, we use decision trees to do lifetime prediction on the same programs and show significantly better prediction. Our method also has the advantage that during training we can use a large number of features and let the decision tree automatically choose the relevant subset.

1 INTELLIGENT MEMORY ALLOCATION

Dynamic memory allocation is used in many computer applications. The application requests blocks of memory from the operating system or from a memory manager when needed and explicitly frees them up after use. Typically, all of these requests are handled in the same way, without any regard for how or for how long the requested block will be used. Sometimes programmers use runtime profiles to analyze the typical behavior of their program and write special purpose memory management routines specifically tuned to dominant classes of allocation events. Machine learning methods offer the opportunity to automate the process of tuning memory management systems. In a recent study, Barrett and Zorn [1993] used two allocators: a special allocator for objects that are short-lived, and a default allocator for everything else. They tried a simple prediction method on a number of public-domain, allocation-intensive programs and got mixed results on the lifetime prediction problem. Nevertheless, they showed that for all the cases where they were able to predict well, their strategy of assigning objects predicted to be short-lived to the special allocator led to savings in program running times. Their results imply that if we could predict well in all cases we could get similar savings for all programs. We concentrate on the lifetime prediction task in this paper and show that using axis-parallel decision trees does indeed lead to significantly better prediction on all the programs studied by Zorn and Grunwald and some others that we included. Another advantage of our approach is that we can use a large number of features about the allocation requests and let the decision tree decide on their relevance. There are a number of advantages of using lifetime predictions for intelligent memory management. It can improve CPU usage, by using special-purpose allocators; e.g., short-lived objects can be allocated in small spaces by incrementing a pointer and deallocated together when they are all dead. It can decrease memory fragmentation, because the short-lived objects do not pollute the address space of long-lived objects. Finally, it can improve program locality, and thus program speed, because the short-lived objects are all allocated in a small part of the heap. The advantages of prediction must be weighed against the time required to examine each request and make that prediction about its intended use. It is frequently argued that, as computers and memory become faster and cheaper, we need to be less concerned about the speed and efficiency of machine learning algorithms. When the purpose of the algorithm is to save space and computation, however, these concerns are paramount.

1.1 RELATED WORK

Traditionally, memory management has been relegated to a single, general-purpose allocator.
When performance is critical, software developers will frequently build a custom memory manager which they believe is tuned to optimize the performance of the program. Not only is this hand construction inefficient in terms of the programming time required, this "optimization" may seriously degrade the program's performance if it does not accurately reflect the program's use [Wilson et al., 1995]. Customalloc [Grunwald and Zorn, 1992] monitors program runs on benchmark inputs to determine the most commonly requested block sizes. It then produces a set of memory allocation routines which are customized to efficiently allocate those block sizes. Other memory requests are still handled by a general purpose allocator. Barrett and Zorn [1993] studied lifetime prediction based on benchmark inputs. At each allocation request, the call graph (the list of nested procedure/function calls in effect at the time) and the object size was used to identify an allocation site. If all allocations from a particular site were short-lived on the benchmark inputs, their algorithm predicted that future allocations would also be short-lived. Their method produced mixed results at lifetime prediction, but demonstrated the savings that such predictions could bring. In this paper, we discuss an approach to lifetime prediction which uses learned decision trees. In the next section, we first discuss the identification of relevant state features by a decision tree. Section 3 discusses in greater detail the problem of lifetime prediction. Section 4 describes the empirical results of applying this approach to several benchmark programs, and Section 5 discusses the implications of these results and directions for future work.

2 FEATURE SELECTION WITH A DECISION TREE

Barrett and Zorn's approach captures state information in the form of the program's call graph at the time of an allocation request, which is recorded to a fixed pre-determined depth. This graph, plus the request size, specifies an allocation "site"; statistics are gathered separately for each site. A drawback of this approach is that it forces a division for each distinct call graph, preventing generalization across irrelevant features. Computationally, it requires maintaining an explicit call graph (information that the program would not normally provide), as well as storing a potentially large table of call sites from which to make predictions. It also ignores other potentially useful information, such as the parameters of the functions on the call stack, and the contents of heap memory and the program registers at the time of the request. Ideally, we would like to examine as much of the program state as possible at the time of each allocation request, and automatically extract those pieces of information that best allow predicting how the requested block will be used. Decision tree algorithms are useful for this sort of task. A decision tree divides inputs on the basis of how each input feature improves the "purity" of the tree's leaves. Inputs that are statistically irrelevant for prediction are not used in any splits; the tree's final set of decisions examine only input features that improve its predictive performance. Regardless of the parsimony of the final tree, however, training a tree with the entire program state as a feature vector is computationally infeasible.
In our experiments, detailed below, we arbitrarily used the top 20 words on the stack, along with the request size, as an approximate indicator of program state. On the target machine (a Sparcstation), we found that including program registers in the feature set made no significant difference, and so dropped them from consideration for efficiency.

3 LIFETIME PREDICTION

The characteristic of memory requests that we would like to predict is the lifetime of the block - how long it will be before the requested memory is returned to the central pool. Accurate lifetime prediction lets one segregate memory into short-term, long-term and permanent storage. To this end, we have used a decision tree learning algorithm to derive rules that distinguish "short-lived" and "permanent" allocations from the general pool of allocation requests. For short-lived blocks, one can create a very simple and efficient allocation scheme [Barrett and Zorn, 1993]. For "permanent" blocks, allocation is also simple and cheap, because the allocator does not need to compute and store any of the information that would normally be required to keep track of the block and return it to the pool when freed. One complication is that of unequal loss for different types of incorrect predictions. An appropriately routed memory request may save dozens of instruction cycles, but an inappropriately routed one may cost hundreds. The cost in terms of memory may also be unequal: a short-lived block that is incorrectly predicted to be "permanent" will permanently tie up the space occupied by the block (if it is allocated via a method that can not be freed). A "permanent" block, however, that is incorrectly predicted to be short-lived may pollute the allocator's short-term space by preventing a large segment of otherwise free memory from being reclaimed (see Barrett and Zorn for examples). These risks translate into a time-space tradeoff that depends on the properties of the specific allocators used and the space limitations of the target machine. For our experiments, we arbitrarily defined false positives and false negatives to have equal loss, except where noted otherwise. Other cases may be handled by reweighting the splitting criterion, or by rebalancing the training inputs (as described in the following section).
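A minimal sketch of the kind of decision-tree lifetime classifier the paper describes, assuming the features are the request size plus the top-of-stack words captured at each allocation, and that scikit-learn's tree learner is an acceptable stand-in for the one actually used; the class weights illustrate one way the unequal-loss handling mentioned above could be expressed.

```python
from sklearn.tree import DecisionTreeClassifier

def train_lifetime_predictor(features, labels, short_lived_cost=1.0, permanent_cost=1.0):
    """features: one row per allocation request, e.g. [request_size, stack_word_1, ..., stack_word_20]
    labels:   lifetime class for each request, e.g. "short-lived" vs "other"
    The class_weight argument reweights the splitting criterion to reflect unequal
    misprediction costs (an assumed analogue of the rebalancing discussed in the text)."""
    clf = DecisionTreeClassifier(
        class_weight={"short-lived": short_lived_cost, "other": permanent_cost}
    )
    clf.fit(features, labels)
    return clf

# Hypothetical usage: route requests predicted short-lived to a cheap bump-pointer allocator.
# predictor = train_lifetime_predictor(X_train, y_train, short_lived_cost=1.0, permanent_cost=5.0)
# if predictor.predict([new_request_features])[0] == "short-lived":
#     use_short_term_allocator()
```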

NeurIPS Conference 1996 Conference Paper

Reinforcement Learning for Dynamic Channel Allocation in Cellular Telephone Systems

  • Satinder Singh
  • Dimitri Bertsekas

In cellular telephone systems, an important problem is to dynamically allocate the communication resource (channels) so as to maximize service in a stochastic caller environment. This problem is naturally formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traffic patterns. We present results on a large cellular system with approximately 49^49 states.

In cellular communication systems, an important problem is to allocate the communication resource (bandwidth) so as to maximize the service provided to a set of mobile callers whose demand for service changes stochastically. A given geographical area is divided into mutually disjoint cells, and each cell serves the calls that are within its boundaries (see Figure 1a). The total system bandwidth is divided into channels, with each channel centered around a frequency. Each channel can be used simultaneously at different cells, provided these cells are sufficiently separated spatially, so that there is no interference between them. The minimum separation distance required between simultaneous uses of the same channel is called the channel reuse constraint. When a call requests service in a given cell, either a free channel (one that does not violate the channel reuse constraint) is assigned to the call, or else the call is blocked from the system; the latter happens if no free channel can be found. Also, when a mobile caller crosses from one cell to another, the call is "handed off" to the cell of entry; that is, a new free channel is provided to the call at the new cell. If no such channel is available, the call must be dropped/disconnected from the system.

One objective of a channel allocation policy is to allocate the available channels to calls so that the number of blocked calls is minimized. An additional objective is to minimize the number of calls that are dropped when they are handed off to a busy cell. These two objectives must be weighted appropriately to reflect their relative importance, since dropping existing calls is generally more undesirable than blocking new calls. To illustrate the qualitative nature of the channel assignment decisions, suppose that there are only two channels and three cells arranged in a line. Assume a channel reuse constraint of 2, i.e., a channel may be used simultaneously in cells 1 and 3, but may not be used in cell 2 if it is already in use in cell 1 or in cell 3. Suppose that the system is serving one call in cell 1 and another call in cell 3. Then serving both calls on the same channel results in a better channel usage pattern than serving them on different channels, since in the former case the other channel is free to be used in cell 2. The purpose of the channel assignment and channel rearrangement strategy is, roughly speaking, to create such favorable usage patterns that minimize the likelihood of calls being blocked. We formulate the channel assignment problem as a dynamic programming problem, which, however, is too complex to be solved exactly. We introduce approximations based on the methodology of reinforcement learning (RL) (e.g., Barto, Bradtke and Singh, 1995, or the recent textbook by Bertsekas and Tsitsiklis, 1996).
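A toy rendering of the three-cell example above (a simplification that checks reuse only by cell distance; all names are illustrative, not from the paper):

```python
# Toy illustration of the two-channel, three-cell example: packing both calls
# onto the same channel leaves a channel free in the middle cell.
REUSE_DISTANCE = 2  # cells closer than this may not share a channel

def is_free(channel, cell, in_use):
    """A channel is free in `cell` if no nearby cell is already using it."""
    return all(abs(cell - other) >= REUSE_DISTANCE
               for (ch, other) in in_use if ch == channel)

def free_channels(cell, in_use, channels=(0, 1)):
    return [ch for ch in channels if is_free(ch, cell, in_use)]

# Calls in cells 1 and 3 on the SAME channel leave channel 1 free in cell 2.
print(free_channels(2, [(0, 1), (0, 3)]))  # -> [1]
# Calls on DIFFERENT channels block every channel in cell 2.
print(free_channels(2, [(0, 1), (1, 3)]))  # -> []
```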
Our method learns channel allocation policies that outperform not only the most commonly used policy in cellular systems, but also the best heuristic policy we could find in the literature.

1 CHANNEL ASSIGNMENT POLICIES

Many cellular systems are based on a fixed assignment (FA) channel allocation; that is, the set of channels is partitioned, and the partitions are permanently assigned to cells so that all cells can use all the channels assigned to them simultaneously without interference (see Figure 1a). When a call arrives in a cell, if any preassigned channel is unused, it is assigned; otherwise the call is blocked. No rearrangement is done when a call terminates. Such a policy is static and cannot take advantage of temporary stochastic variations in demand for service. More efficient are dynamic channel allocation policies, which assign channels to different cells, so that every channel is available to every cell on a need basis, unless the channel is used in a nearby cell and the reuse constraint is violated.

The best existing dynamic channel allocation policy we found in the literature is Borrowing with Directional Channel Locking (BDCL) of Zhang & Yum (1989). It numbers the channels from 1 to N, and partitions and assigns them to cells as in FA. The channels assigned to a cell are its nominal channels. If a nominal channel is available when a call arrives in a cell, the smallest-numbered such channel is assigned to the call. If no nominal channel is available, then the largest-numbered free channel is borrowed from the neighbour with the most free channels. When a channel is borrowed, the algorithm carefully accounts for the directional effect of which cells can no longer use that channel because of interference. The call is blocked if there are no free channels at all. When a call terminates in a cell and the channel so freed is a nominal channel, say numbered i, of that cell, then if there is a call in that cell on a borrowed channel, the call on the smallest-numbered borrowed channel is reassigned to i and the borrowed channel is returned to the appropriate cell. If there is no call on a borrowed channel, then if there is a call on a nominal channel numbered larger than i, the call on the highest-numbered nominal channel is reassigned to i. If the call just terminated was itself on a borrowed channel, the call on the smallest-numbered borrowed channel is reassigned to it and that channel is returned to the cell from which it was borrowed. Notice that when a borrowed channel is returned to its original cell, a nominal channel becomes free in that cell and triggers a reassignment. Thus, in the worst case a call termination in one cell can sequentially cause reassignments in arbitrarily far away cells, making BDCL somewhat impractical.

BDCL is quite sophisticated and combines the notions of channel ordering, nominal channels, and channel borrowing. Zhang and Yum (1989) show that BDCL is superior to its competitors, including FA. Generally, BDCL has continued to be highly regarded in the literature as a powerful heuristic (Enrico et al., 1996). In this paper, we compare the performance of dynamic channel allocation policies learned by RL with both FA and BDCL.
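The following is a highly simplified sketch of the two baseline assignment rules just described. Directional-locking bookkeeping and the reassignment cascade on call termination are omitted, so this is illustrative rather than a faithful BDCL implementation; all names are ours.

```python
# Highly simplified sketch of the two baseline assignment rules (FA and a
# BDCL-style channel ordering). Directional locking and reassignment on call
# termination are omitted; this is illustrative only.

def fixed_assignment(cell, nominal, in_use):
    """FA: use an unused nominal channel of this cell, else block (None).
    `nominal[cell]` and `in_use[cell]` are sets of channel numbers."""
    for ch in sorted(nominal[cell]):
        if ch not in in_use[cell]:
            return ch
    return None

def bdcl_like_assignment(cell, nominal, neighbours, free_channels):
    """BDCL-style ordering: lowest-numbered free nominal channel first;
    otherwise borrow the highest-numbered free channel from the neighbour
    with the most free channels. `free_channels(c)` returns a set."""
    own = sorted(free_channels(cell) & nominal[cell])
    if own:
        return own[0]
    donor = max(neighbours[cell], key=lambda n: len(free_channels(n)), default=None)
    if donor is not None and free_channels(donor):
        return max(free_channels(donor))
    return None  # block the call
```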
1.1 DYNAMIC PROGRAMMING FORMULATION

We can formulate the dynamic channel allocation problem using dynamic programming (e.g., Bertsekas, 1995). State transitions occur when channels become free due to call departures, or when a call arrives at a given cell and wishes to be assigned a channel, or when there is a handoff, which can be viewed as a simultaneous call departure from one cell and a call arrival at another cell. The state at each time consists of two components: (1) the list of occupied and unoccupied channels at each cell, which we call the configuration of the cellular system; it is exponential in the number of cells; and (2) the event that causes the state transition (arrival, departure, or handoff); this component of the state is uncontrollable.

The decision/control applied at the time of a call departure is the rearrangement of the channels in use, with the aim of creating a more favorable channel packing pattern among the cells (one that will leave more channels free for future assignments). Unlike BDCL, our RL solution will restrict this rearrangement to the cell with the current call departure. The control exercised at the time of a call arrival is the assignment of a free channel, or the blocking of the call if no free channel is currently available. In general, it may also be useful to do admission control, i.e., to allow the possibility of not accepting a new call even when a free channel exists, in order to minimize the dropping of ongoing calls during future handoffs. We address admission control in a separate paper and here restrict ourselves to always accepting a call if a free channel is available. The objective is to learn a policy that assigns decisions (assignment or rearrangement, depending on the event) to each state so as to maximize $J = E\left\{\int_0^{\infty} e^{-\beta t} c(t)\, dt\right\}$, where $E\{\cdot\}$ is the expectation operator, $c(t)$ is the number of ongoing calls at time $t$, and $\beta$ is a discount factor that makes immediate profit more valuable than future profit. Maximizing J is equivalent to minimizing the expected (discounted) number of blocked calls over an infinite horizon.

2 REINFORCEMENT LEARNING SOLUTION

RL methods solve optimal control (or dynamic programming) problems by learning good approximations to the optimal value function, J*, given by the solution to the Bellman optimality equation for the dynamic channel allocation problem.
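For concreteness, a discounted, event-driven Bellman optimality equation of this general type, written in standard semi-Markov form (our notation, not quoted from the paper: x is a configuration, e the triggering event, Δt the random time until it, A(x, e) the feasible decisions, and x' the resulting configuration), looks like:

```latex
% Generic semi-Markov Bellman optimality equation for an event-driven,
% discounted problem (a sketch in standard form, not the paper's equation).
J^*(x) \;=\; \mathbb{E}_{\Delta t,\, e}\!\left[
    \int_{0}^{\Delta t} e^{-\beta t}\, c(x)\, dt
    \;+\; e^{-\beta \Delta t} \max_{a \in A(x, e)} J^*\!\big(x'(x, e, a)\big)
\right],
```

with c(x) the number of ongoing calls in configuration x.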

NeurIPS Conference 1995 Conference Paper

Improving Policies without Measuring Merits

  • Peter Dayan
  • Satinder Singh

Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl), and we show that enforcing these can lead to appropriate policy improvement solely in terms of advantages.
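In standard notation (ours, not the paper's), the advantage of an action is its utility relative to the state's value, and greedy policy improvement depends only on the ordering these relative quantities induce:

```latex
% Advantages: relative (not absolute) utilities of actions at states.
A^{\pi}(s, a) \;=\; Q^{\pi}(s, a) \;-\; V^{\pi}(s),
\qquad
\pi'(s) \;\in\; \arg\max_{a} A^{\pi}(s, a) \;=\; \arg\max_{a} Q^{\pi}(s, a).
```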

NeurIPS Conference 1994 Conference Paper

Reinforcement Learning Algorithm for Partially Observable Markov Decision Problems

  • Tommi Jaakkola
  • Satinder Singh
  • Michael Jordan

Increasing attention has been paid to reinforcement learning algorithms in recent years, partly due to successes in the theoretical analysis of their behavior in Markov environments. If the Markov assumption is removed, however, neither the algorithms nor the analyses continue to be usable in general. We propose and analyze a new learning algorithm to solve a certain class of non-Markov decision problems. Our algorithm applies to problems in which the environment is Markov, but the learner has restricted access to state information. The algorithm involves Monte-Carlo policy evaluation combined with a policy improvement method that is similar to that of Markov decision problems and is guaranteed to converge to a local maximum. The algorithm operates in the space of stochastic policies, a space which can yield a policy that performs considerably better than any deterministic policy. Although the space of stochastic policies is continuous (even for a discrete action space), our algorithm is computationally tractable.
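A loose caricature of that recipe, shown only to make the setting concrete: a memoryless stochastic policy over observations, nudged toward actions with higher Monte-Carlo value estimates. The names and step size are illustrative; this is not the paper's algorithm.

```python
# Caricature of policy improvement over a memoryless stochastic policy
# (illustrative only; not the algorithm analyzed in the paper).
import numpy as np

n_obs, n_actions = 10, 4
policy = np.full((n_obs, n_actions), 1.0 / n_actions)  # stochastic, memoryless

def improve(policy, q_estimates, step=0.1):
    """Shift a little probability mass toward the apparently best action for
    each observation; q_estimates would come from Monte-Carlo rollouts."""
    new_policy = policy.copy()
    for o in range(len(policy)):
        best = int(np.argmax(q_estimates[o]))
        new_policy[o] *= (1.0 - step)      # rows still sum to 1 after the shift
        new_policy[o, best] += step
    return new_policy
```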

NeurIPS Conference 1994 Conference Paper

Reinforcement Learning with Soft State Aggregation

  • Satinder Singh
  • Tommi Jaakkola
  • Michael Jordan

It is widely accepted that the use of more compact representations than lookup tables is crucial to scaling reinforcement learning (RL) algorithms to real-world problems. Unfortunately, almost all of the theory of reinforcement learning assumes lookup table representations. In this paper we address the pressing issue of combining function approximation and RL, and present 1) a function approximator based on a simple extension to state aggregation (a commonly used form of compact representation), namely soft state aggregation, 2) a theory of convergence for RL with arbitrary, but fixed, soft state aggregation, 3) a novel intuitive understanding of the effect of state aggregation on online RL, and 4) a new heuristic adaptive state aggregation algorithm that finds improved compact representations by exploiting the non-discrete nature of soft state aggregation. Preliminary empirical results are also presented.
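In its basic form (standard notation, not quoted from the paper), a soft state aggregation approximator represents values through soft cluster memberships:

```latex
% Soft state aggregation: each state s belongs to every cluster x_i with
% probability P(x_i | s); values are learned per cluster (standard form).
Q(s, a) \;\approx\; \sum_{i=1}^{k} P(x_i \mid s)\, Q(x_i, a),
\qquad \sum_{i=1}^{k} P(x_i \mid s) = 1 .
```

Ordinary (hard) state aggregation is recovered when each membership distribution P(· | s) puts all of its mass on a single cluster.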

NeurIPS Conference 1993 Conference Paper

Convergence of Stochastic Iterative Dynamic Programming Algorithms

  • Tommi Jaakkola
  • Michael Jordan
  • Satinder Singh

Increasing attention has recently been paid to algorithms based on dynamic programming (DP) due to the suitability of DP for learning problems involving control. In stochastic environments where the system being controlled is only incompletely known, however, a unifying theoretical account of these methods has been missing. In this paper we relate DP-based learning algorithms to the powerful techniques of stochastic approximation via a new convergence theorem, enabling us to establish a class of convergent algorithms to which both TD(λ) and Q-learning belong.
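As a concrete member of that class, the standard Q-learning update (written here for reference rather than quoted from the paper) is a stochastic-approximation step toward the Bellman target:

```latex
% Q-learning as a stochastic iterative DP algorithm, with step sizes
% \alpha_t satisfying \sum_t \alpha_t = \infty and \sum_t \alpha_t^2 < \infty.
Q_{t+1}(s_t, a_t) \;=\; (1 - \alpha_t)\, Q_t(s_t, a_t)
  \;+\; \alpha_t \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') \Big).
```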

NeurIPS Conference 1993 Conference Paper

Robust Reinforcement Learning in Motion Planning

  • Satinder Singh
  • Andrew Barto
  • Roderic Grupen
  • Christopher Connolly

While exploring to find better solutions, an agent performing online reinforcement learning (RL) can perform worse than is acceptable. In some cases, exploration might have unsafe, or even catastrophic, results, often modeled in terms of reaching 'failure' states of the agent's environment. This paper presents a method that uses domain knowledge to reduce the number of failures during exploration. This method formulates the set of actions from which the RL agent composes a control policy to ensure that exploration is conducted in a policy space that excludes most of the unacceptable policies. The resulting action set has a more abstract relationship to the task being solved than is common in many applications of RL. Although the cost of this added safety is that learning may result in a suboptimal solution, we argue that this is an appropriate tradeoff in many problems. We illustrate this method in the domain of motion planning. *This work was done while the first author was finishing his Ph.D. in computer science at the University of Massachusetts, Amherst.

NeurIPS Conference 1991 Conference Paper

The Efficient Learning of Multiple Task Sequences

  • Satinder Singh

I present a modular network architecture and a learning algorithm based on incremental dynamic programming that allows a single learning agent to learn to solve multiple Markovian decision tasks (MDTs) with significant transfer of learning across the tasks. I consider a class of MDTs, called composite tasks, formed by temporally concatenating a number of simpler, elemental MDTs. The architecture is trained on a set of composite and elemental MDTs. The temporal structure of a composite task is assumed to be unknown and the architecture learns to produce a temporal decomposition. It is shown that under certain conditions the solution of a composite MDT can be constructed by computationally inexpensive modifications of the solutions of its constituent elemental MDTs.