Arrow Research search

Author name cluster

Michael Littman

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
1 author row

Possible papers

47

NeurIPS Conference 2024 Conference Paper

Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

  • Cameron Allen
  • Aaron Kirtland
  • Ruo Yu Tao
  • Sam Lobel
  • Daniel Scott
  • Nicholas Petrocelli
  • Omer Gottesman
  • Ronald Parr

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to---or knowledge of---an underlying, unobservable state space. Our metric, the λ-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(λ) with a different value of λ. Since TD(λ=0) makes an implicit Markov assumption and TD(λ=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the λ-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
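A minimal sketch of the core quantity, assuming only a single episode's rewards and a per-state value estimate (the paper's agent instead trains two recurrent value networks and minimizes their gap as an auxiliary loss); the function and argument names are illustrative.

```python
import numpy as np

def lambda_returns(rewards, values, gamma, lam):
    """Compute TD(lambda) targets for one episode.

    rewards[t] is the reward after the action at step t; values has length
    len(rewards) + 1, with values[-1] the bootstrap estimate for the final state.
    """
    T = len(rewards)
    returns = np.zeros(T)
    g = values[T]
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1 - lam) * values[t + 1] + lam * g)
        returns[t] = g
    return returns

def lambda_discrepancy(rewards, values, gamma):
    """Per-step gap between the TD(lambda=1) and TD(lambda=0) targets; a
    persistently nonzero gap is the signal used to flag non-Markovianity."""
    return (lambda_returns(rewards, values, gamma, lam=1.0)
            - lambda_returns(rewards, values, gamma, lam=0.0))
```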

RLJ Journal 2024 Journal Article

On Welfare-Centric Fair Reinforcement Learning

  • Cyrus Cousins
  • Kavosh Asadi
  • Elita Lobo
  • Michael Littman

We propose a welfare-centric fair reinforcement-learning setting, in which an agent enjoys vector-valued reward from a set of beneficiaries. Given a welfare function W(·), the task is to select a policy π̂ that approximately optimizes the welfare of their value functions from start state s0, i.e., π̂ ≈ argmax_π W(V_1^π(s0), V_2^π(s0), …, V_g^π(s0)). We find that welfare-optimal policies are stochastic and start-state dependent. Whether individual actions are mistakes depends on the policy, so mistake bounds, regret analysis, and PAC-MDP learning do not readily generalize to our setting. We develop the adversarial-fair KWIK (Kwik-Af) learning model, wherein at each timestep, an agent either takes an exploration action or outputs an exploitation policy, such that the number of exploration actions is bounded and each exploitation policy is ε-welfare optimal. Finally, we reduce PAC-MDP to Kwik-Af, introduce the Equitable Explicit Explore Exploit (E4) learner, and show that it Kwik-Af learns.
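As a toy illustration of the objective above, one might score each candidate policy by a welfare function applied to its vector of per-beneficiary start-state values. The egalitarian (min) welfare and the helper names below are assumptions for the sketch, not the paper's construction (which shows welfare-optimal policies are generally stochastic rather than drawn from a fixed candidate set).

```python
import numpy as np

def select_by_welfare(candidate_policies, beneficiary_values, s0, welfare=np.min):
    """Pick the candidate whose vector (V_1(s0), ..., V_g(s0)) maximizes W.

    beneficiary_values(pi, s0) is assumed to return that vector; welfare
    defaults to the egalitarian choice W = min as one illustrative example.
    """
    scores = [welfare(beneficiary_values(pi, s0)) for pi in candidate_policies]
    return candidate_policies[int(np.argmax(scores))]
```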

RLC Conference 2024 Conference Paper

On Welfare-Centric Fair Reinforcement Learning

  • Cyrus Cousins
  • Kavosh Asadi
  • Elita Lobo
  • Michael Littman

We propose a welfare-centric fair reinforcement-learning setting, in which an agent enjoys vector-valued reward from a set of beneficiaries. Given a welfare function W(·), the task is to select a policy π̂ that approximately optimizes the welfare of their value functions from start state s0, i.e., π̂ ≈ argmax_π W(V_1^π(s0), V_2^π(s0), …, V_g^π(s0)). We find that welfare-optimal policies are stochastic and start-state dependent. Whether individual actions are mistakes depends on the policy, so mistake bounds, regret analysis, and PAC-MDP learning do not readily generalize to our setting. We develop the adversarial-fair KWIK (Kwik-Af) learning model, wherein at each timestep, an agent either takes an exploration action or outputs an exploitation policy, such that the number of exploration actions is bounded and each exploitation policy is ε-welfare optimal. Finally, we reduce PAC-MDP to Kwik-Af, introduce the Equitable Explicit Explore Exploit (E4) learner, and show that it Kwik-Af learns.

RLJ Journal 2024 Journal Article

Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior

  • Zhiyuan Zhou
  • Shreyas Sundara Raman
  • Henry Sowerby
  • Michael Littman

Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our job in the learning process is to design reward functions to express desired behavior and enable the agent to learn such behavior swiftly. However, designing good reward functions to induce the desired behavior is generally hard, let alone the question of which rewards make learning fast. In this work, we introduce a family of reward structures we call Tiered Reward that resolves both of these questions. We consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space to resolve trade-offs in behavior preference. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we introduce Tiered Reward, a class of environment-independent reward functions, and show it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we demonstrate that Tiered Reward leads to fast learning with multiple tabular and deep reinforcement-learning algorithms.
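A hedged sketch of what a tier-indexed, environment-independent reward could look like: the reward depends only on the tier of the resulting state and strictly improves with tier quality. The uniform spacing below is a placeholder; the paper derives specific conditions on the gaps between tiers that yield the Pareto-optimality guarantee.

```python
def make_tiered_reward(tier_of_state, num_tiers, delta=1.0):
    """Reward that depends only on the tier of the next state (tier 0 = states
    to avoid, tier num_tiers - 1 = goal states), strictly increasing with tier.
    `tier_of_state` and `delta` are illustrative names, not the paper's API.
    """
    def reward(state, action, next_state):
        return -delta * (num_tiers - 1 - tier_of_state(next_state))
    return reward
```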

RLC Conference 2024 Conference Paper

Tiered Reward: Designing Rewards for Specification and Fast Learning of Desired Behavior

  • Zhiyuan Zhou
  • Shreyas Sundara Raman
  • Henry Sowerby
  • Michael Littman

Reinforcement-learning agents seek to maximize a reward signal through environmental interactions. As humans, our job in the learning process is to design reward functions to express desired behavior and enable the agent to learn such behavior swiftly. However, designing good reward functions to induce the desired behavior is generally hard, let alone the question of which rewards make learning fast. In this work, we introduce a family of reward structures we call Tiered Reward that resolves both of these questions. We consider the reward-design problem in tasks formulated as reaching desirable states and avoiding undesirable states. To start, we propose a strict partial ordering of the policy space to resolve trade-offs in behavior preference. We prefer policies that reach the good states faster and with higher probability while avoiding the bad states longer. Next, we introduce Tiered Reward, a class of environment-independent reward functions, and show it is guaranteed to induce policies that are Pareto-optimal according to our preference relation. Finally, we demonstrate that Tiered Reward leads to fast learning with multiple tabular and deep reinforcement-learning algorithms.

AAAI Conference 2023 Conference Paper

Computably Continuous Reinforcement-Learning Objectives Are PAC-Learnable

  • Cambridge Yang
  • Michael Littman
  • Michael Carbin

In reinforcement learning, the classic objectives of maximizing discounted and finite-horizon cumulative rewards are PAC-learnable: There are algorithms that learn a near-optimal policy with high probability using a finite amount of samples and computation. In recent years, researchers have introduced objectives and corresponding reinforcement-learning algorithms beyond the classic cumulative rewards, such as objectives specified as linear temporal logic formulas. However, questions about the PAC-learnability of these new objectives have remained open. This work demonstrates the PAC-learnability of general reinforcement-learning objectives through sufficient conditions for PAC-learnability in two analysis settings. In particular, for the analysis that considers only sample complexity, we prove that if an objective given as an oracle is uniformly continuous, then it is PAC-learnable. Further, for the analysis that considers computational complexity, we prove that if an objective is computable, then it is PAC-learnable. In other words, if a procedure computes successive approximations of the objective's value, then the objective is PAC-learnable. We give three applications of our condition on objectives from the literature with previously unknown PAC-learnability and prove that these objectives are PAC-learnable. Overall, our result helps verify existing objectives' PAC-learnability. Also, as some studied objectives that are not uniformly continuous have been shown to be not PAC-learnable, our results could guide the design of new PAC-learnable objectives.

PRL Workshop 2023 Workshop Paper

Task Scoping: Generating Task-Specific Simplifications of Open-Scope Planning Problems

  • Michael Fishman
  • Nishanth Kumar
  • Cameron Allen
  • Natasha Danas
  • Michael Littman
  • Stefanie Tellex
  • George Konidaris

A general-purpose agent must learn an open-scope world model: one rich enough to tackle any of the wide range of tasks it may be asked to solve over its operational lifetime. This stands in contrast with typical planning approaches, where the scope of a model is limited to a specific family of tasks that share significant structure. Unfortunately, planning to solve any specific task within an open-scope model is computationally intractable---even for state-of-the-art methods---due to the many states and actions that are necessarily present in the model but irrelevant to that problem. We propose task scoping: a method that exploits knowledge of the initial state, goal conditions, and transition system to automatically and efficiently remove provably irrelevant variables and actions from grounded planning problems. Our approach leverages causal link analysis and backwards reachability over state variables (rather than states) along with operator merging (when effects on relevant variables are identical). Using task scoping as a pre-planning step can shrink the search space by orders of magnitude and dramatically decrease planning time. We empirically demonstrate that these improvements occur across a variety of open-scope domains, including Minecraft, where our approach reduces search time by a factor of 75 for a state-of-the-art numeric planner, even after including the time required for task scoping itself.

NeurIPS Conference 2022 Conference Paper

Evaluation beyond Task Performance: Analyzing Concepts in AlphaZero in Hex

  • Charles Lovering
  • Jessica Forde
  • George Konidaris
  • Ellie Pavlick
  • Michael Littman

AlphaZero, an approach to reinforcement learning that couples neural networks and Monte Carlo tree search (MCTS), has produced state-of-the-art strategies for traditional board games like chess, Go, shogi, and Hex. While researchers and game commentators have suggested that AlphaZero uses concepts that humans consider important, it is unclear how these concepts are captured in the network. We investigate AlphaZero's internal representations in the game of Hex using two evaluation techniques from natural language processing (NLP): model probing and behavioral tests. In doing so, we introduce several new evaluation tools to the RL community, and illustrate how evaluations other than task performance can be used to provide a more complete picture of a model's strengths and weaknesses. Our analyses in the game of Hex reveal interesting patterns and generate some testable hypotheses about how such models learn in general. For example, we find that the MCTS discovers concepts before the neural network learns to encode them. We also find that concepts related to short-term end-game planning are best encoded in the final layers of the model, whereas concepts related to long-term planning are encoded in the middle layers of the model.
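For readers unfamiliar with model probing, a generic linear-probe sketch is shown below: fit a simple classifier on a layer's activations to predict a concept label, and read held-out accuracy as evidence that the concept is encoded at that layer. This is the standard NLP-style recipe under assumed inputs, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_concept(activations, concept_labels):
    """Fit a logistic-regression probe on one layer's activations and return
    held-out accuracy for a binary concept label (e.g., a Hex board feature).
    `activations` is an array of shape (num_positions, num_features).
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        activations, concept_labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)
```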

NeurIPS Conference 2022 Conference Paper

Faster Deep Reinforcement Learning with Slower Online Network

  • Kavosh Asadi
  • Rasool Fakoor
  • Omer Gottesman
  • Taesup Kim
  • Michael Littman
  • Alexander J. Smola

Deep reinforcement learning algorithms often use two networks for value function optimization: an online network, and a target network that tracks the online network with some delay. Using two separate networks enables the agent to hedge against issues that arise when performing bootstrapping. In this paper we endow two popular deep reinforcement learning algorithms, namely DQN and Rainbow, with updates that incentivize the online network to remain in the proximity of the target network. This improves the robustness of deep reinforcement learning in the presence of noisy updates. The resultant agents, called DQN Pro and Rainbow Pro, exhibit significant performance improvements over their original counterparts on the Atari benchmark, demonstrating the effectiveness of this simple idea in deep reinforcement learning. The code for our paper is available at github.com/amazon-research/fast-rl-with-slow-updates.
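A rough sketch of the kind of proximal update described above, assuming a PyTorch-style setup: the usual DQN TD loss plus a penalty that pulls the online parameters toward the target network's parameters. The coefficient name and value are illustrative, not the paper's tuned settings.

```python
import torch

def proximal_dqn_loss(online_net, target_net, batch, gamma, c_prox=0.1):
    """Standard DQN TD loss plus a proximal term that keeps the online
    network's parameters close to the slowly updated target network.
    `batch` is assumed to hold (states, actions, rewards, next_states, dones)
    as tensors, with `dones` a float mask in {0, 1}.
    """
    s, a, r, s_next, done = batch
    q = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next
    td_loss = torch.nn.functional.smooth_l1_loss(q, target)

    # Proximal penalty: squared distance between online and target parameters.
    prox = sum(((p - p_t).pow(2)).sum()
               for p, p_t in zip(online_net.parameters(), target_net.parameters()))
    return td_loss + 0.5 * c_prox * prox
```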

NeurIPS Conference 2022 Conference Paper

Model-based Lifelong Reinforcement Learning with Bayesian Exploration

  • Haotian Fu
  • Shangqun Yu
  • Michael Littman
  • George Konidaris

We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior combined with a sample-based Bayesian exploration procedure increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods.

NeurIPS Conference 2021 Conference Paper

On the Expressivity of Markov Reward

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.

AAAI Conference 2020 Conference Paper

People Do Not Just Plan, They Plan to Plan

  • Mark Ho
  • David Abel
  • Jonathan Cohen
  • Michael Littman
  • Thomas Griffiths

Planning is useful. It lets people take actions that have desirable long-term consequences. But, planning is hard. It requires thinking about consequences, which consumes limited computational and cognitive resources. Thus, people should plan their actions, but they should also be smart about how they deploy resources used for planning their actions. Put another way, people should also “plan their plans”. Here, we formulate this aspect of planning as a meta-reasoning problem and formalize it in terms of a recursive Bellman objective that incorporates both task rewards and information-theoretic planning costs. Our account makes quantitative predictions about how people should plan and meta-plan as a function of the overall structure of a task, which we test in two experiments with human participants. We find that people’s reaction times reflect a planned use of information processing, consistent with our account. This formulation of planning to plan provides new insight into the function of hierarchical planning, state abstraction, and cognitive control in both humans and machines.

AAAI Conference 2020 Short Paper

Task Scoping for Efficient Planning in Open Worlds (Student Abstract)

  • Nishanth Kumar
  • Michael Fishman
  • Natasha Danas
  • Stefanie Tellex
  • Michael Littman
  • George Konidaris

We propose an abstraction method for open-world environments expressed as Factored Markov Decision Processes (FMDPs) with very large state and action spaces. Our method prunes state and action variables that are irrelevant to the optimal value function on the state subspace the agent would visit when following any optimal policy from the initial state. This method thus enables tractable fast planning within large open-world FMDPs.

IJCAI Conference 2019 Conference Paper

DeepMellow: Removing the Need for a Target Network in Deep Q-Learning

  • Seungchan Kim
  • Kavosh Asadi
  • Michael Littman
  • George Konidaris

Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show that the combination of DQN and Mellowmax, but without a target network, outperforms DQN with a target network.

RLDM Conference 2019 Conference Abstract

Model-based Knowledge Representations

  • Lucas Lehnert
  • Michael Littman

One question central to reinforcement learning is which representations – including aspects of the state space, transition function and reward function – can be generalized or re-used across different tasks. Humans are adept at such flexible transfer but existing reinforcement learning algorithms are much more limited. While transferring successor features between different tasks has been shown to improve learning speed, this representation is overly specific and hence needs to be re-learned when the optimal policy or transition function changes. This article presents Model Features: a latent representation that compresses the state space of a control problem by exploiting states that are equivalent in terms of both transition and reward functions. Because Model Features only extract these equivalences but are not tied to the transition and reward functions themselves, this latent state representation generalizes across tasks that change in both their transition and reward functions. Model Features link successor features to model reductions, facilitating the design of gradient-based optimization algorithms to approximate model reductions directly from transition data. Learning Model Features is akin to model-based reinforcement learning, because the learned representation supports predictions of future reward outcomes. This article first summarizes theoretical results from our extended article. Then empirical simulation results are presented that suggest Model Features serve as a state representation that affords generalization across tasks with different transition and reward functions. Because Model Features construct a latent state representation that supports predictions of future reward outcomes, the presented results motivate further experiments to investigate if humans or animals learn such a representation, and whether neural systems involved in state representation reflect the equivalence abstraction.

AAMAS Conference 2019 Conference Paper

Removing the Target Network from Deep Q-Networks with the Mellowmax Operator

  • Seungchan Kim
  • Kavosh Asadi
  • Michael Littman
  • George Konidaris

Deep Q-Network (DQN) is a learning algorithm that achieves human-level performance in high-dimensional domains like Atari games. We propose that using a softmax operator, Mellowmax, in DQN reduces its need for a separate target network, which is otherwise necessary to stabilize learning. We empirically show that, in the absence of a target network, the combination of Mellowmax and DQN outperforms DQN alone.

IJCAI Conference 2019 Conference Paper

The Expected-Length Model of Options

  • David Abel
  • John Winder
  • Marie desJardins
  • Michael Littman

Effective options can make reinforcement learning easier by enhancing an agent's ability to both explore in a targeted manner and plan further into the future. However, learning an appropriate model of an option's dynamics is hard, requiring estimating a highly parameterized probability distribution. This paper introduces and motivates the Expected-Length Model (ELM) for options, an alternate model for transition dynamics. We prove ELM is a (biased) estimator of the traditional Multi-Time Model (MTM), but provide a non-vacuous bound on their deviation. We further prove that, in stochastic shortest path problems, ELM induces a value function that is sufficiently similar to the one induced by MTM, and is thus capable of supporting near-optimal behavior. We explore the practical utility of this option model experimentally, finding consistent support for the thesis that ELM is a suitable replacement for MTM. In some cases, we find ELM leads to more sample efficient learning, especially when options are arranged in a hierarchy.

AAAI Conference 2018 Conference Paper

Bandit-Based Solar Panel Control

  • David Abel
  • Edward Williams
  • Stephen Brawner
  • Emily Reif
  • Michael Littman

Solar panels sustainably harvest energy from the sun. To improve performance, panels are often equipped with a tracking mechanism that computes the sun’s position in the sky throughout the day. Based on the tracker’s estimate of the sun’s location, a controller orients the panel to minimize the angle of incidence between solar radiant energy and the photovoltaic cells on the surface of the panel, increasing total energy harvested. Prior work has developed efficient tracking algorithms that accurately compute the sun’s location to facilitate solar tracking and control. However, always pointing a panel directly at the sun does not account for diffuse irradiance in the sky, reflected irradiance from the ground and surrounding surfaces, power required to reorient the panel, shading effects from neighboring panels and foliage, or changing weather conditions (such as clouds), all of which are contributing factors to the total energy harvested by a fleet of solar panels. In this work, we show that a bandit-based approach can increase the total energy harvested by solar panels by learning to dynamically account for such other factors. Our contribution is threefold: (1) the development of a test bed based on typical solar and irradiance models for experimenting with solar panel control using a variety of learning methods, (2) simulated validation that bandit algorithms can effectively learn to control solar panels, and (3) the design and construction of an intelligent solar panel prototype that learns to angle itself using bandit algorithms.
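As a concrete (if simplified) illustration of the bandit framing, the sketch below runs UCB1 over a discrete set of candidate panel orientations, treating harvested energy as the reward. The `harvest_energy` callback and other names are assumptions; the paper's test bed and algorithms are richer than this.

```python
import math

def ucb1_panel_controller(angles, harvest_energy, horizon=1000):
    """Generic UCB1 bandit over a discrete set of candidate panel angles.

    harvest_energy(angle) is assumed to return the (noisy) energy collected
    during one control interval at that orientation. Returns the angle with
    the best empirical mean after `horizon` intervals.
    """
    counts = [0] * len(angles)
    means = [0.0] * len(angles)
    for t in range(1, horizon + 1):
        if t <= len(angles):
            arm = t - 1  # play each angle once before using the UCB rule
        else:
            arm = max(range(len(angles)),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = harvest_energy(angles[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return angles[max(range(len(angles)), key=lambda a: means[a])]
```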

RLDM Conference 2017 Conference Abstract

Generalized Inverse Reinforcement Learning

  • Nakul Gopalan
  • Amy Greenwald
  • Michael Littman
  • James MacGlashan

Inverse Reinforcement Learning (IRL) is used to teach behaviors to agents, by having them learn a reward function from example trajectories. The underlying assumption is usually that these trajectories represent optimal behavior. However, it is not always possible for a user to provide examples of optimal trajectories. This problem has been tackled previously by labeling trajectories with a score that indicates good and bad behaviors. In this work, we formalize the IRL problem in a generalized framework that allows for learning from failed demonstrations. In our framework, users can score entire trajectories as well as individual state-action pairs. This allows the agent to learn preferred behaviors from a relatively small number of trajectories. We expect this framework to be especially useful in robotics domains, where the user can collect fewer trajectories at the cost of labeling bad state-action pairs, which might be easier than maneuvering a robot to collect additional (entire) trajectories.

RLDM Conference 2017 Conference Abstract

Improving Solar Panel Efficiency Using Reinforcement Learning

  • David Abel
  • Emily Reif
  • Michael Littman

Solar panels sustainably harvest energy from the sun. To improve performance, panels are often equipped with a tracking mechanism that computes the sun’s position in the sky throughout the day. Based on the tracker’s estimate of the sun’s location, a controller orients the panel to minimize the angle of incidence between solar radiant energy and the photovoltaic cells on the surface of the panel, increasing total energy harvested. Prior work has developed efficient tracking algorithms that accurately compute the sun’s location to facilitate solar tracking and control. However, always pointing a panel directly at the sun does not account for diffuse irradiance in the sky, reflected irradiance from the ground and surrounding surfaces, or changing weather conditions (such as cloud coverage), all of which are contributing factors to the total energy harvested by a solar panel. In this work, we show that a reinforcement learning (RL) approach can increase the total energy harvested by solar panels by learning to dynamically account for such other factors. We advocate for the use of RL for solar panel control due to its effectiveness, negligible cost, and versatility. Our contribution is twofold: (1) an adaptation of typical RL algorithms to the task of improving solar panel performance, and (2) an experimental validation in simulation based on typical solar and irradiance models for experimenting with solar panel control. We evaluate the utility of various RL approaches compared to an idealized controller, an efficient state-of-the-art direct tracking algorithm, and a fixed panel in our simulated environment. We experiment across different time scales, in different places on earth, and with dramatically different percepts (sun coordinates and raw images of the sky with and without clouds), consistently demonstrating that simple RL algorithms improve over existing baselines.

RLDM Conference 2017 Conference Abstract

Learning to Cooperate and Compete*

  • Max Kleiman-Weiner
  • Mark Ho
  • Joseph Austerweil
  • Michael Littman
  • Joshua Tenenbaum

Successfully navigating the social world requires reasoning about both high-level strategic goals, such as whether to cooperate or compete, as well as the low-level actions needed to achieve those goals. While previous work in experimental game theory has examined the former and work on multi-agent systems has examined the latter, there has been little work investigating behavior in environments that require simultaneous planning and inference across both levels. We develop a hierarchical model of social agency that infers the intentions of other agents, strategically decides whether to cooperate or compete, and then executes either a cooperative or competitive planning program. The cooperative planning program formalizes a type of joint intentionality or team reasoning where agents mesh plans to efficiently cooperate. The competitive planning program is a generalization of iterated best-response. These planning programs enable both strategic action as well as action interpretation – the ability to infer whether others are intending to cooperate or compete from ambiguous actions. We test predictions of this model in multi-agent behavioral experiments using rich video-game like environments. Learning occurs across both high-level strategic decisions and low-level actions, leading to the emergence of social norms. These rapidly learned norms coordinate cooperation and make it more efficient after just a few interactions. By grounding strategic behavior in a formal model of planning, we develop abstract notions of both cooperation and competition and shed light on the computational nature of joint intentionality.

RLDM Conference 2017 Conference Abstract

Mellowmax: An Alternative Softmax Operator for Reinforcement Learning

  • Kavosh Asadi
  • Michael Littman

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one’s weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study and evaluate an alternative softmax operator that, among other properties, is both a non-expansion (ensuring convergent behavior in learning and planning) and differentiable (making it possible to improve decisions via gradient descent methods).
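The Mellowmax operator itself is simple to state: mm_ω(x) = (1/ω) log((1/n) Σ_i exp(ω x_i)). A small numerically stable sketch is below; the shift-by-max trick is the standard log-sum-exp stabilization, not something from the paper.

```python
import numpy as np

def mellowmax(values, omega):
    """Mellowmax: (1/omega) * log(mean(exp(omega * x))).

    Approaches the maximum as omega -> infinity and the mean as omega -> 0,
    which is the hedging behavior described in the abstract.
    """
    values = np.asarray(values, dtype=float)
    m = values.max()  # shift for numerical stability (log-sum-exp trick)
    return m + np.log(np.mean(np.exp(omega * (values - m)))) / omega
```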

RLDM Conference 2017 Conference Abstract

Planning with Abstract Markov Decision Processes

  • Nakul Gopalan
  • Michael Littman
  • Shawn Squire
  • Stefanie Tellex
  • John Winder
  • Lawson Wong

Robots acting in human-scale environments must plan under uncertainty in large state–action spaces and face constantly changing reward functions as requirements and goals change. Planning under uncertainty in large state–action spaces requires hierarchical abstraction for efficient computation. We (Gopalan et al. 2017, in press) introduce a new hierarchical planning framework called Abstract Markov Decision Processes (AMDPs) that can plan in a fraction of the time needed for complex decision making in ordinary MDPs. AMDPs provide abstract states, actions, and transition dynamics in multiple layers above a base-level “flat” MDP. AMDPs decompose problems into a series of subtasks with both local reward and local transition functions used to create policies for subtasks. The resulting hierarchical planning method is independently optimal at each level of abstraction, and is recursively optimal when the local reward and transition functions are correct. We present empirical results showing significantly improved planning speed, while maintaining solution quality, in the Taxi domain and in a mobile-manipulation robotics problem. Furthermore, our approach allows specification of a decision-making model for a mobile-manipulation problem on a Turtlebot, spanning from low-level control actions operating on continuous variables all the way up through high-level object manipulation tasks.

NeurIPS Conference 2016 Conference Paper

Showing versus doing: Teaching by demonstration

  • Mark Ho
  • Michael Littman
  • James MacGlashan
  • Fiery Cushman
  • Joseph Austerweil

People often learn from others' demonstrations, and classic inverse reinforcement learning (IRL) algorithms have brought us closer to realizing this capacity in machines. In contrast, teaching by demonstration has been less well studied computationally. Here, we develop a novel Bayesian model for teaching by demonstration. Stark differences arise when demonstrators are intentionally teaching a task versus simply performing a task. In two experiments, we show that human participants systematically modify their teaching behavior consistent with the predictions of our model. Further, we show that even standard IRL algorithms benefit when learning from behaviors that are intentionally pedagogical. We conclude by discussing IRL algorithms that can take advantage of intentional pedagogy.

RLDM Conference 2015 Conference Abstract

Expressing Tasks Robustly via Multiple Discount Factors

  • Ashley Edwards
  • Michael Littman
  • Charles Isbell

Reward engineering is the problem of expressing a target task for an agent in the form of rewards for a Markov decision process. To be useful for learning, it is important that these encodings be robust to structural changes in the underlying domain; that is, the specification remain unchanged for any domain in some target class. We identify problems that are difficult to express robustly via the standard model of discounted rewards. In response, we examine the idea of decomposing a reward function into separate components, each with its own discount factor. We describe a method for finding robust parameters through the concept of task engineering, which additionally modifies the discount factors. We present a method for optimizing behavior in this setting and show that it could provide a more robust language than standard approaches.
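A toy sketch of the decomposition idea above, assuming the reward has already been split into components with one discount factor each; the array layout and function name are illustrative.

```python
import numpy as np

def multi_discount_return(reward_components, gammas):
    """Episode return when the reward is decomposed into components, each
    discounted by its own factor: the total return is the sum of the
    per-component discounted returns. reward_components[t][i] is component
    i's reward at step t; gammas[i] is its discount factor.
    """
    reward_components = np.asarray(reward_components, dtype=float)
    T, _ = reward_components.shape
    discounts = np.array([[g ** t for g in gammas] for t in range(T)])
    return float((discounts * reward_components).sum())
```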

RLDM Conference 2015 Conference Abstract

Teaching Behavior with Punishments and Rewards

  • Mark Ho
  • Michael Littman
  • Fiery Cushman
  • Joseph Austerweil

When teaching a complex behavior that achieves a goal, a teacher could simply deliver rewards once a learner has succeeded. For instance, a coach can train a basketball team by showering players with praise when they win a game and lambasting them otherwise. A more reasonable strategy, however, might be to teach subtasks like dribbling, passing, or shooting the ball. Teaching subtasks through reward or punishment, or shaping, can allow a learner to acquire the overarching task more quickly. Effective shaping also requires coordination between a teacher’s training strategy and a learner’s interpretation of rewards and punishments. Specifically, feedback could be treated as a reward to be maximized, as in standard reinforcement learning (RL), or as a signal for the correctness or incorrectness of an action. If feedback is treated as a pure reward signal, giving rewards for subtasks can introduce problematic positive reward cycles, state-action-feedback sequences that return to an initial state but yield a net positive reward. For example, if a player primarily wants to maximize positive teacher feedback, overly praising them for dribbling may lead them to exploit those rewards and not learn how to play the whole game. In contrast, positive cycles do not pose a problem when teaching a learner who interprets feedback as a signal about the correctness of an action. Here, we examine how humans teach with evaluative feedback and whether they teach consistent with a learner who interprets feedback as a reward signal or a signal of action correctness. We first formalize these two models of feedback interpretation as a reward-maximizing model based on standard RL and an action-feedback model based on Bayesian inference. Then, in two experiments in which people trained virtual learners for isolated actions or while learning over time, we show that people shape in a manner more consistent with the action-feedback model.

AAAI Conference 2014 Conference Paper

A Strategy-Aware Technique for Learning Behaviors from Discrete Human Feedback

  • Robert Loftin
  • James MacGlashan
  • Bei Peng
  • Matthew Taylor
  • Michael Littman
  • Jeff Huang
  • David Roberts

This paper introduces two novel algorithms for learning behaviors from human-provided rewards. The primary novelty of these algorithms is that instead of treating the feedback as a numeric reward signal, they interpret feedback as a form of discrete communication that depends on both the behavior the trainer is trying to teach and the teaching strategy used by the trainer. For example, some human trainers use a lack of feedback to indicate whether actions are correct or incorrect, and interpreting this lack of feedback accurately can significantly improve learning speed. Results from user studies show that humans use a variety of training strategies in practice and both algorithms can learn a contextual bandit task faster than algorithms that treat the feedback as numeric. Simulated trainers are also employed to evaluate the algorithms in both contextual bandit and sequential decision-making tasks with similar results.

RLDM Conference 2013 Conference Abstract

Computational Game Theory in Sequential Environments

  • Michael Littman

Compared to typical single-agent decision problems, general sum games offer a panoply of strategies for maximizing utility. In many games, such as the well-known Prisoner’s dilemma, agents must work together, bearing some individual risk, to arrive at mutually beneficial outcomes. In this talk, I will discuss three algorithmic approaches that have been developed to identify cooperative strategies in non-cooperative games. I will describe a computational folk theorem, an analysis of value-function-based reinforcement learning, and a cognitive hierarchy approach. These methods will be illustrated in both normal form and multi-stage stochastic game representations and the implications for the role of learning in games will be discussed.

RLDM Conference 2013 Conference Abstract

Relative Bellman Error: An Offline Evaluation Metric for Comparing Value Functions

  • Michael Littman

Reinforcement learning (RL) algorithms are typically evaluated online—a value function or policy is used to control the target system and its return is measured. When a target system makes online evaluation expensive (such as driving a robot car), unethical (such as treating a disease), or simply impractical (such as challenging a human chess master), effective offline evaluation metrics can play a critical role. In this paper, we compare several offline evaluation metrics, pointing out significant shortcomings that limit their utility. We propose a new metric we call “relative Bellman update error” (RBUE) that scores pairs of value functions using offline data. We provide analysis and empirical results that suggest the RBUE metric is a viable way of comparing value functions offline.

AAMAS Conference 2012 Conference Paper

A Framework for Modeling Population Strategies by Depth of Reasoning

  • Michael Wunder
  • Michael Kaisers
  • John Robert Yaros
  • Michael Littman

This article presents a population-based cognitive hierarchy model that can be used to estimate the reasoning depth and sophistication of a collection of opponents' strategies from observed behavior in repeated games. This framework provides a compact representation of a distribution of complicated strategies by reducing them to a small number of parameters. This estimated population model can then be used to compute a best response to the observed distribution over these parameters. As such, it provides a basis for building improved strategies given a history of observations of the community of agents. Results show that this model predicts and explains the winning strategies in the recent 2011 Lemonade Stand Game competition, where eight algorithms had been pitted against each other. The Lemonade Stand Game is a three-player game with simple rules that includes both cooperative and competitive elements. Despite its apparent simplicity, the fact that success depends crucially on what other players do gives rise to complex interaction patterns, which our new framework captures well.

AAAI Conference 2012 Conference Paper

Covering Number as a Complexity Measure for POMDP Planning and Learning

  • Zongzhang Zhang
  • Michael Littman
  • Xiaoping Chen

Finding a meaningful way of characterizing the difficulty of partially observable Markov decision processes (POMDPs) is a core theoretical problem in POMDP research. State-space size is often used as a proxy for POMDP difficulty, but it is a weak metric at best. Existing work has shown that the covering number for the reachable belief space, which is a set of belief points that are reachable from the initial belief point, has interesting links with the complexity of POMDP planning, theoretically. In this paper, we present empirical evidence that the covering number for the reachable belief space (or just “covering number”, for brevity) is a far better complexity measure than the state-space size for both planning and learning POMDPs on several small-scale benchmark problems. We connect the covering number to the complexity of learning POMDPs by proposing a provably convergent learning algorithm for POMDPs without reset given knowledge of the covering number.

AAMAS Conference 2011 Conference Paper

Using Iterated Reasoning to Predict Opponent Strategies

  • Michael Wunder
  • John Robert Yaros
  • Michael Littman
  • Michael Kaisers

The field of multiagent decision making is extending its tools from classical game theory by embracing reinforcement learning, statistical analysis, and opponent modeling. For example, behavioral economists conclude from experimental results that people act according to levels of reasoning that form a "cognitive hierarchy" of strategies, rather than merely following the hyper-rational Nash equilibrium solution concept. This paper expands this model of the iterative reasoning process by widening the notion of a level within the hierarchy from one single strategy to a distribution over strategies, leading to a more general framework of multiagent decision making. It provides a measure of sophistication for strategies and can serve as a guide for designing good strategies for multiagent games, drawing its main strength from predicting opponent strategies. We apply these lessons to the recently introduced Lemonade Stand Game, a simple setting that includes both collaborative and competitive elements, where an agent's score is critically dependent on its responsiveness to opponent behavior. The opening moves are significant to the end result and simple heuristics have achieved faster cooperation than intricate learning schemes. Using results from the past two real-world tournaments, we show how the submitted entries fit naturally into our model and explain why the top agents were successful.

AAAI Conference 2010 Conference Paper

Integrating Sample-Based Planning and Model-Based Reinforcement Learning

  • Thomas Walsh
  • Sergiu Goschin
  • Michael Littman

Recent advancements in model-based reinforcement learning have shown that the dynamics of many structured domains (e.g., DBNs) can be learned with tractable sample complexity, despite their exponentially large state spaces. Unfortunately, these algorithms all require access to a planner that computes a near-optimal policy, and while many traditional MDP algorithms make this guarantee, their computation time grows with the number of states. We show how to replace these over-matched planners with a class of sample-based planners—whose computation time is independent of the number of states—without sacrificing the sample-efficiency guarantees of the overall learning algorithms. To do so, we define sufficient criteria for a sample-based planner to be used in such a learning system and analyze two popular sample-based approaches from the literature. We also introduce our own sample-based planner, which combines the strategies from these algorithms and still meets the criteria for integration into our learning system. In doing so, we define the first complete RL solution for compactly represented (exponentially sized) state spaces with efficiently learnable dynamics that is both sample efficient and whose computation time does not grow rapidly with the number of states.

NeurIPS Conference 2008 Conference Paper

Multi-resolution Exploration in Continuous Spaces

  • Ali Nouri
  • Michael Littman

The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE's broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches.

AAMAS Conference 2008 Conference Paper

Social Reward Shaping in the Prisoner's Dilemma

  • Monica Babes
  • Enrique Munoz
  • Michael Littman

Reward shaping is a well-known technique applied to help reinforcement-learning agents converge more quickly to near-optimal behavior. In this paper, we introduce social reward shaping, which is reward shaping applied in the multiagent-learning framework. We present preliminary experiments in the iterated Prisoner’s dilemma setting that show that agents using social reward shaping appropriately can behave more effectively than other classical learning and nonlearning strategies. In particular, we show that these agents can both lead (encourage adaptive opponents to stably cooperate) and follow (adopt a best-response strategy when paired with a fixed opponent), where better known approaches achieve only one of these objectives.
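For reference, generic potential-based reward shaping (Ng, Harada, and Russell) takes the form below; the paper's social variant concerns how the shaping potential is chosen in the multiagent setting, which this sketch does not reproduce.

```python
def shaped_reward(r, s, s_next, potential, gamma):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).

    `potential` is any state-to-float heuristic; shaping of this form is
    known to preserve the optimal policies of the underlying MDP.
    """
    return r + gamma * potential(s_next) - potential(s)
```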

NeurIPS Conference 2007 Conference Paper

Online Linear Regression and Its Application to Model-Based Reinforcement Learning

  • Alexander Strehl
  • Michael Littman

We provide a provably efficient algorithm for learning Markov Decision Processes (MDPs) with continuous state and action spaces in the online setting. Specifically, we take a model-based approach and show that a special type of online linear regression allows us to learn MDPs with (possibly kernelized) linearly parameterized dynamics. This result builds on Kearns and Singh's work that provides a provably efficient algorithm for finite state MDPs. Our approach is not restricted to the linear setting, and is applicable to other classes of continuous MDPs.

NeurIPS Conference 2005 Conference Paper

Cyclic Equilibria in Markov Games

  • Martin Zinkevich
  • Amy Greenwald
  • Michael Littman

Although variants of value iteration have been proposed for finding Nash or correlated equilibria in general-sum Markov games, these variants have not been shown to be effective in general. In this paper, we demonstrate by construction that existing variants of value iteration cannot find stationary equilibrium policies in arbitrary general-sum Markov games. Instead, we propose an alternative interpretation of the output of value iteration based on a new (non-stationary) equilibrium concept that we call “cyclic equilibria.” We prove that value iteration identifies cyclic equilibria in a class of games in which it fails to find stationary equilibria. We also demonstrate empirically that value iteration finds cyclic equilibria in nearly all examples drawn from a random distribution of Markov games.

NeurIPS Conference 2001 Conference Paper

An Efficient, Exact Algorithm for Solving Tree-Structured Graphical Games

  • Michael Littman
  • Michael Kearns
  • Satinder Singh

We describe a new algorithm for computing a Nash equilibrium in graphical games, a compact representation for multi-agent systems that we introduced in previous work. The algorithm is the first to compute equilibria both efficiently and exactly for a non-trivial class of graphical games.

NeurIPS Conference 2001 Conference Paper

PAC Generalization Bounds for Co-training

  • Sanjoy Dasgupta
  • Michael Littman
  • David McAllester

The rule-based bootstrapping introduced by Yarowsky, and its co-training variant by Blum and Mitchell, have met with considerable empirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justifies both the use of confidences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as suggested by Collins and Singer. Our bounds apply to the multiclass case, i.e., where instances are to be assigned one of

NeurIPS Conference 2001 Conference Paper

Predictive Representations of State

  • Michael Littman
  • Richard Sutton

We show that states of a dynamical system can be usefully represented by multi-step, action-conditional predictions of future observations. State representations that are grounded in data in this way may be easier to learn, generalize better, and be less dependent on accurate prior models than, for example, POMDP state representations. Building on prior work by Jaeger and by Rivest and Schapire, in this paper we compare and contrast a linear specialization of the predictive approach with the state representations used in POMDPs and in k-order Markov models. Ours is the first specific formulation of the predictive idea that includes both stochasticity and actions (controls). We show that any system has a linear predictive state representation with number of predictions no greater than the number of states in its minimal POMDP model. In predicting or controlling a sequence of observations, the concepts of state and state estimation inevitably arise. There have been two dominant approaches. The generative-model approach, typified by research on partially observable Markov decision processes (POMDPs), hypothesizes a structure for generating observations and estimates its state and state dynamics. The history-based approach, typified by k-order Markov methods, uses simple functions of past observations as state, that is, as the immediate basis for prediction and control. (The data flow in these two approaches is diagrammed in Figure 1.) Of the two, the generative-model approach is more general. The model's internal state gives it temporally unlimited memory -- the ability to remember an event that happened arbitrarily long ago -- whereas a history-based approach can only remember as far back as its history extends. The bane of generative-model approaches is that they are often strongly dependent on a good model of the system's dynamics. Most uses of POMDPs, for example, assume a perfect dynamics model and attempt only to estimate state. There are algorithms for simultaneously estimating state and dynamics (e.g., Chrisman, 1992), analogous to the Baum-Welch algorithm for the uncontrolled case (Baum et al., 1970), but these are only effective at tuning parameters that are already approximately correct (e.g., Shatkay & Kaelbling, 1997).

NeurIPS Conference 2000 Conference Paper

Exact Solutions to Time-Dependent MDPs

  • Justin Boyan
  • Michael Littman

We describe an extension of the Markov decision process model in which a continuous time dimension is included in the state space. This allows for the representation and exact solution of a wide range of problems in which transitions or rewards vary over time. We examine problems based on route planning with public transportation and telescope observation scheduling.

AIJ Journal 1996 Journal Article

Taggers for parsers

  • Eugene Charniak
  • Glenn Carroll
  • John Adcock
  • Anthony Cassandra
  • Yoshihiko Gotoh
  • Jeremy Katz
  • Michael Littman
  • John McCann

NeurIPS Conference 1993 Conference Paper

Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach

  • Justin Boyan
  • Michael Littman

This paper describes the Q-routing algorithm for packet routing, in which a reinforcement learning module is embedded into each node of a switching network. Only local communication is used by each node to keep accurate statistics on which routing decisions lead to minimal delivery times. In simple experiments involving a 36-node, irregularly connected network, Q-routing proves superior to a nonadaptive algorithm based on precomputed shortest paths and is able to route efficiently even when critical aspects of the simulation, such as the network load, are allowed to vary dynamically. The paper concludes with a discussion of the tradeoff between discovering shortcuts and maintaining stable policies.
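The Q-routing update is compact enough to sketch directly: when node x forwards a packet bound for destination d to neighbor y, x moves its delivery-time estimate toward the queueing and transmission time plus y's best remaining estimate. The data-structure layout below is an assumption, not the paper's implementation.

```python
def q_routing_update(Q, x, y, d, queue_time, transit_time, neighbors, eta=0.5):
    """One Q-routing update at node x after forwarding a packet bound for d
    to neighbor y.

    Q[x][d][y] estimates the total delivery time for destination d via
    neighbor y; neighbors[y] lists y's neighbors, used to read back y's
    best remaining estimate.
    """
    best_from_y = min(Q[y][d][z] for z in neighbors[y])
    target = queue_time + transit_time + best_from_y
    Q[x][d][y] += eta * (target - Q[x][d][y])
    return Q[x][d][y]
```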

NeurIPS Conference 1989 Conference Paper

Generalization and Scaling in Reinforcement Learning

  • David Ackley
  • Michael Littman

In associative reinforcement learning, an environment generates input vectors, a learning system generates possible output vectors, and a reinforcement function computes feedback signals from the input-output pairs. The task is to discover and remember input-output pairs that generate rewards. Especially difficult cases occur when rewards are rare, since the expected time for any algorithm can grow exponentially with the size of the problem. Nonetheless, if a reinforcement function possesses regularities, and a learning algorithm exploits them, learning time can be reduced below that of non-generalizing algorithms. This paper describes a neural network algorithm called complementary reinforcement back-propagation (CRBP), and reports simulation results on problems designed to offer differing opportunities for generalization.

Reinforcement learning (Sutton, 1984; Barto & Anandan, 1985; Ackley, 1988; Allen, 1989) requires more from a learner than does the more familiar supervised learning paradigm. Supervised learning supplies the correct answers to the learner, whereas reinforcement learning requires the learner to discover the correct outputs before they can be stored. The reinforcement paradigm divides neatly into search and learning aspects: when rewarded, the system makes internal adjustments to learn the discovered input-output pair; when punished, the system makes internal adjustments to search elsewhere.

Following work by Anderson (1986) and Williams (1988), we extend the backpropagation algorithm to associative reinforcement learning. Start with a "garden variety" backpropagation network: a vector i of n binary input units propagates through zero or more layers of hidden units, ultimately reaching a vector s of m sigmoid units, each taking continuous values in the range (0, 1). Interpret each s_j as the probability that an associated random bit o_j takes on value 1. Let us call the continuous, deterministic vector s the search vector to distinguish it from the stochastic binary output vector o. Given an input vector, we forward propagate to produce a search vector s, and then perform m independent Bernoulli trials to produce an output vector o. The i-o pair is evaluated by the reinforcement function and reward or punishment ensues. Suppose reward occurs. We therefore want to make o more likely given i. Backpropagation will do just that if we take o as the desired target to produce an error vector (o - s) and adjust weights normally. Now suppose punishment occurs, indicating o does not correspond with i. By choice of error vector, backpropagation allows us to push the search vector in any direction; which way should we go? In the absence of problem-specific information, we cannot pick an appropriate direction with certainty. Any decision will involve assumptions. A very minimal "don't be like o" assumption, employed in Anderson (1986), Williams (1988), and Ackley (1989), pushes s directly away from o by taking (s - o) as the error vector. A slightly stronger "be like not-o" assumption, employed in Barto & Anandan (1985) and Ackley (1987), pushes s directly toward the complement of o by taking ((1 - o) - s) as the error vector. Although the two approaches always agree on the signs of the error terms, they differ in magnitudes.

In this work, we explore the second possibility, embodied in an algorithm called complementary reinforcement back-propagation (CRBP). Figure 1 summarizes the CRBP algorithm. The algorithm in the figure reflects three modifications to the basic approach just sketched. First, in step 2, instead of using the s_j's directly as probabilities, we found it advantageous to "stretch" the values using a parameter v. When v < 1, it is not necessary for the s_j's to reach zero or one to produce a deterministic output. Second, in step 6, we found it important to use a smaller learning rate for punishment compared to reward. Third, consider step 7: another forward propagation is performed, another stochastic binary output vector o* is generated (using the procedure from step 2), and o* is compared to o. If they are identical and punishment occurred, or if they are different and reward occurred, then another error vector is generated and another weight update is performed. This loop continues until a different output is generated (in the case of failure) or until the original output is regenerated (in the case of success). This modification improved performance significantly, and added only a small percentage to the total number of weight updates performed.
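A single-layer sketch of the CRBP error-vector rule described above, assuming a sigmoid output layer and a binary reinforcement signal; the two learning rates are from the description, while the omission of the "stretch" parameter v, the resampling loop, and any hidden layers are simplifications of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def crbp_step(W, x, reinforcement_fn, lr_reward=0.1, lr_punish=0.01):
    """One update of a single-layer CRBP learner.

    W has shape (m, n); x is a length-n input vector; reinforcement_fn(x, o)
    returns True for reward and False for punishment.
    """
    s = 1.0 / (1.0 + np.exp(-W @ x))               # search vector (sigmoids)
    o = (rng.random(s.shape) < s).astype(float)    # stochastic binary output
    rewarded = reinforcement_fn(x, o)
    if rewarded:
        err, lr = o - s, lr_reward                 # make o more likely given x
    else:
        err, lr = (1.0 - o) - s, lr_punish         # push toward complement of o
    # Sigmoid-output gradient: err * s * (1 - s), outer product with input.
    W += lr * np.outer(err * s * (1.0 - s), x)
    return W, o, rewarded
```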