Arrow Research

Author name cluster

Richard Lewis

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
1 author row

Possible papers

14

AAAI Conference 2022 Conference Paper

Adaptive Pairwise Weights for Temporal Credit Assignment

  • Zeyu Zheng
  • Risto Vuorio
  • Richard Lewis
  • Satinder Singh

How much credit (or blame) should an action taken in a state get for a future reward? This is the fundamental temporal credit assignment problem in Reinforcement Learning (RL). One of the earliest and still most widely used heuristics is to assign this credit based on a scalar coefficient, lambda (treated as a hyperparameter), raised to the power of the time interval between the state-action and the reward. In this empirical paper, we explore heuristics based on more general pairwise weightings that are functions of the state in which the action was taken, the state at the time of the reward, as well as the time interval between the two. Of course it isn’t clear what these pairwise weight functions should be, and because they are too complex to be treated as hyperparameters we develop a metagradient procedure for learning these weight functions during the usual RL training of a policy. Our empirical work shows that it is often possible to learn these pairwise weight functions during learning of the policy to achieve better performance than competing approaches.
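
As a rough illustration of the contrast the abstract draws, the sketch below compares the classic lambda**(k-t) credit heuristic with a general pairwise weight that also looks at the two states involved. Everything here (the toy trajectory, the weight functions, the helper names) is hypothetical; in the paper the pairwise weights are learned by a metagradient procedure rather than hand-specified as they are here.

    import numpy as np

    # Minimal sketch, not the paper's method: credit for a reward at time k is
    # assigned back to the action at time t either by the classic lambda**(k-t)
    # heuristic or by a general pairwise weight w(s_t, s_k, k-t).

    def credit_weighted_return(states, rewards, t, credit_fn):
        # Sum of rewards from time t onward, each scaled by the credit it sends back to t
        return sum(credit_fn(states[t], states[k], k - t) * rewards[k]
                   for k in range(t, len(rewards)))

    # Classic heuristic: depends only on the time gap
    lambda_weight = lambda s_t, s_k, gap, lam=0.9: lam ** gap
    # Hypothetical learned pairwise weight: also depends on the two states
    pairwise_weight = lambda s_t, s_k, gap: np.exp(-0.5 * gap) * np.exp(-abs(s_k - s_t))

    states  = np.array([0.0, 0.5, 1.0, 1.5])    # toy scalar "states"
    rewards = np.array([0.0, 0.0, 0.0, 1.0])    # single delayed reward

    print(credit_weighted_return(states, rewards, 0, lambda_weight))
    print(credit_weighted_return(states, rewards, 0, pairwise_weight))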

IJCAI Conference 2021 Conference Paper

Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment

  • Wilka Carvalho
  • Anthony Liang
  • Kimin Lee
  • Sungryull Sohn
  • Honglak Lee
  • Richard Lewis
  • Satinder Singh

Learning how to execute complex tasks involving multiple objects in a 3D world is challenging when there is no ground-truth information about the objects or any demonstration to learn from. When an agent receives a signal only on task completion, it is hard to learn the object representations that support learning the correct object interactions needed to complete the task. In this work, we formulate learning an attentive object-dynamics model as a classification problem, using random object images to define incorrect labels for our object-dynamics model. We show empirically that this enables object-representation learning that captures an object's category (is it a toaster?), its properties (is it on?), and object relations (is something inside of it?). With this, our core learner (a relational RL agent) receives the dense training signal it needs to rapidly learn object-interaction tasks. We demonstrate results in the 3D AI2Thor simulated kitchen environment on a range of challenging food-preparation tasks. We compare our method's performance to several related approaches and against the performance of an oracle: an agent that is supplied with ground-truth information about objects in the scene. We find that our agent achieves performance closest to the oracle in terms of both learning speed and maximum success rate.
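
The classification formulation described above can be sketched roughly as a contrastive-style objective: the true next object embedding must be picked out from random object embeddings acting as incorrect labels. The sketch below is a minimal stand-in with made-up linear models and embedding sizes, not the paper's attentive architecture.

    import numpy as np

    # Illustrative sketch only: an object-dynamics model trained by classification,
    # where the true next object embedding must be distinguished from random
    # object embeddings serving as incorrect labels.

    rng = np.random.default_rng(0)

    def predict_next(obj_emb, action_emb, W):
        # Hypothetical dynamics model: linear map of concatenated [object, action] features
        return W @ np.concatenate([obj_emb, action_emb])

    def classification_loss(pred, positive, negatives):
        # Softmax cross-entropy over {positive} U {negatives}, scored by dot product
        candidates = np.vstack([positive] + list(negatives))
        logits = candidates @ pred
        logits -= logits.max()
        log_probs = logits - np.log(np.exp(logits).sum())
        return -log_probs[0]          # index 0 is the true next embedding

    dim = 8
    W = rng.normal(size=(dim, 2 * dim)) * 0.1
    obj, act, nxt = rng.normal(size=dim), rng.normal(size=dim), rng.normal(size=dim)
    negatives = [rng.normal(size=dim) for _ in range(4)]   # stand-ins for random object images

    print(classification_loss(predict_next(obj, act, W), nxt, negatives))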

AAAI Conference 2020 Conference Paper

How Should an Agent Practice?

  • Janarthanan Rajendran
  • Richard Lewis
  • Vivek Veeriah
  • Honglak Lee
  • Satinder Singh

We present a method for learning intrinsic reward functions to drive the learning of an agent during periods of practice in which extrinsic task rewards are not available. During practice, the environment may differ from the one available for training and evaluation with extrinsic rewards. We refer to this setup of alternating periods of practice and objective evaluation as practice-match, drawing an analogy to regimes of skill acquisition common for humans in sports and games. The agent must effectively use periods in the practice environment so that performance improves during matches. In the proposed method the intrinsic practice reward is learned through a meta-gradient approach that adapts the practice reward parameters to reduce the extrinsic match reward loss computed from matches. We illustrate the method on a simple grid world, and evaluate it in two games in which the practice environment differs from the match environment: Pong with practice against a wall without an opponent, and PacMan with practice in a maze without ghosts. The results show gains from learning during practice periods in addition to match periods, compared with learning during matches only.
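
A schematic of the practice-match loop might look like the sketch below: an intrinsic practice reward parameter eta drives inner-loop policy updates during practice, and eta is then adjusted to reduce the loss measured in matches. The paper uses meta-gradients; here a finite-difference surrogate and a one-parameter "policy" stand in purely for illustration.

    import numpy as np

    # Schematic sketch of alternating practice and match phases (not the paper's algorithm).

    def practice_update(theta, eta, steps=10, lr=0.1):
        # Hypothetical inner loop: nudge the policy parameter toward the intrinsic target eta
        for _ in range(steps):
            theta = theta + lr * (eta - theta)
        return theta

    def match_loss(theta, target=1.0):
        # Extrinsic match objective: distance of the policy parameter from the match target
        return (theta - target) ** 2

    eta, theta = 0.0, 0.0
    for outer in range(50):
        theta = practice_update(theta, eta)
        # Finite-difference surrogate for the meta-gradient d(match_loss)/d(eta)
        eps = 1e-3
        g = (match_loss(practice_update(theta, eta + eps))
             - match_loss(practice_update(theta, eta - eps))) / (2 * eps)
        eta -= 0.1 * g

    print(round(eta, 3), round(match_loss(theta), 5))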

NeurIPS Conference 2019 Conference Paper

Discovery of Useful Questions as Auxiliary Tasks

  • Vivek Veeriah
  • Matteo Hessel
  • Zhongwen Xu
  • Janarthanan Rajendran
  • Richard Lewis
  • Junhyuk Oh
  • Hado van Hasselt
  • David Silver

Arguably, intelligent agents ought to be able to discover their own questions so that in learning answers for them they learn unanticipated useful knowledge and skills; this departs from the focus in much of machine learning on agents learning answers to externally defined questions. We present a novel method for a reinforcement learning (RL) agent to discover questions formulated as general value functions or GVFs, a fairly rich form of knowledge representation. Specifically, our method uses non-myopic meta-gradients to learn GVF-questions such that learning answers to them, as an auxiliary task, induces useful representations for the main task faced by the RL agent. We demonstrate that auxiliary tasks based on the discovered GVFs are sufficient, on their own, to build representations that support main task learning, and that they do so better than popular hand-designed auxiliary tasks from the literature. Furthermore, we show, in the context of Atari2600 videogames, how such auxiliary tasks, meta-learned alongside the main task, can improve the data efficiency of an actor-critic agent.
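
A GVF "question" pairs a cumulant with a continuation (discount) function, and its "answer" is a prediction of the discounted sum of that cumulant along future experience. The short sketch below computes such a target for a hypothetical discovered question; the meta-gradient machinery that discovers the question is not shown.

    import numpy as np

    # Minimal sketch under stated assumptions, not the paper's architecture.

    def gvf_return(cumulants, gammas):
        # Discounted sum of cumulants, computed backward along the trajectory
        g = 0.0
        for c, gam in zip(reversed(cumulants), reversed(gammas)):
            g = c + gam * g
        return g

    # Hypothetical discovered question: cumulant = change in a state feature,
    # continuation = 0.9 everywhere
    states = np.array([0.0, 0.2, 0.7, 1.0])
    cumulants = np.diff(states)          # c_t = s_{t+1} - s_t
    gammas = [0.9] * len(cumulants)

    print(gvf_return(cumulants, gammas))   # auxiliary prediction target for the first state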

AAAI Conference 2019 Conference Paper

Learning to Communicate and Solve Visual Blocks-World Tasks

  • Qi Zhang
  • Richard Lewis
  • Satinder Singh
  • Edmund Durfee

We study emergent communication between speaker and listener recurrent neural-network agents that are tasked to cooperatively construct a blocks-world target image sampled from a generative grammar of blocks configurations. The speaker receives the target image and learns to emit a sequence of discrete symbols from a fixed vocabulary. The listener learns to construct a blocks-world image by choosing block placement actions as a function of the speaker’s full utterance and the image of the ongoing construction. Our contributions are (a) the introduction of a task domain for studying emergent communication that is both challenging and affords useful analyses of the emergent protocols; (b) an empirical comparison of the interpolation and extrapolation performance of training via supervised, (contextual) Bandit, and reinforcement learning; and (c) evidence for the emergence of interesting linguistic properties in the RL agent protocol that are distinct from the other two.

IJCAI Conference 2016 Conference Paper

Deep Learning for Reward Design to Improve Monte Carlo Tree Search in ATARI Games

  • Xiaoxiao Guo
  • Satinder Singh
  • Richard Lewis
  • Honglak Lee

Monte Carlo Tree Search (MCTS) methods have proven powerful in planning for sequential decision-making problems such as Go and video games, but their performance can be poor when the planning depth and sampling trajectories are limited or when the rewards are sparse. We present an adaptation of PGRD (policy-gradient for reward design) for learning a reward-bonus function to improve UCT (an MCTS algorithm). Unlike previous applications of PGRD in which the space of reward-bonus functions was limited to linear functions of hand-coded state-action features, we use PGRD with a multi-layer convolutional neural network to automatically learn features from raw perception as well as to adapt the non-linear reward-bonus function parameters. We also adopt a variance-reducing gradient method to improve PGRD's performance. The new method improves UCT's performance on multiple ATARI games compared to UCT without the reward bonus. Combining PGRD and Deep Learning in this way should make adapting rewards for MCTS algorithms far more widely and practically applicable than before.
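
The reward-bonus idea can be sketched as follows: each simulated step inside the search is scored with the environment reward plus a learned bonus, and the bonus parameters would then be adapted (as in PGRD) against the true, un-bonused return. The code below is a toy stand-in with a linear bonus and a plain rollout sum, not the paper's CNN bonus or UCT implementation.

    import numpy as np

    # Schematic sketch, not the paper's UCT implementation.

    def bonus(state_features, action, theta):
        # Hypothetical bonus: linear in features here; a CNN over raw frames in the paper
        return float(theta[action] @ state_features)

    def rollout_value(trajectory, theta, gamma=0.99):
        # trajectory: list of (state_features, action, env_reward) from a simulated rollout
        v, discount = 0.0, 1.0
        for feats, a, r in trajectory:
            v += discount * (r + bonus(feats, a, theta))   # bonus added to the search's internal value
            discount *= gamma
        return v

    rng = np.random.default_rng(0)
    theta = rng.normal(size=(3, 4)) * 0.01          # one bonus weight vector per action
    traj = [(rng.normal(size=4), rng.integers(3), 0.0) for _ in range(5)]
    print(rollout_value(traj, theta))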

IJCAI Conference 2016 Conference Paper

The Dependence of Effective Planning Horizon on Model Accuracy

  • Nan Jiang
  • Alex Kulesza
  • Satinder Singh
  • Richard Lewis

Because planning with a long horizon (i.e., looking far into the future) is computationally expensive, it is common in practice to save time by using reduced horizons. This is usually understood to come at the expense of computing suboptimal plans, which is the case when the planning model is exact. However, when the planning model is estimated from data, as is frequently true in the real world, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. In this paper we provide a precise explanation for this phenomenon based on principles of learning theory. We show formally that the planning horizon is a complexity control parameter for the class of policies available to the planning algorithm, having an intuitive, monotonic relationship with a simple measure of complexity. We prove a planning loss bound predicting that shorter planning horizons can reduce overfitting and improve test performance, and we confirm these predictions empirically.
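
The phenomenon is easy to reproduce in miniature: estimate a transition model from a handful of samples, plan with different horizons in the estimated model, and evaluate each resulting greedy policy in the true MDP. The toy experiment below (random 4-state MDP, rewards assumed known) is only illustrative; with a noisy model, a shorter horizon can come out ahead.

    import numpy as np

    # Toy illustration, not the paper's analysis.

    rng = np.random.default_rng(1)
    nS, nA = 4, 2
    P_true = rng.dirichlet(np.ones(nS), size=(nS, nA))   # true transition model
    R = rng.normal(size=(nS, nA))                        # rewards, assumed known

    def estimate_model(samples_per_pair=5):
        P_hat = np.zeros_like(P_true)
        for s in range(nS):
            for a in range(nA):
                draws = rng.choice(nS, size=samples_per_pair, p=P_true[s, a])
                counts = np.bincount(draws, minlength=nS)
                P_hat[s, a] = counts / counts.sum()
        return P_hat

    def plan(P, horizon):
        # Finite-horizon value iteration in the given model; return the greedy policy
        V = np.zeros(nS)
        for _ in range(horizon):
            V = (R + P @ V).max(axis=1)
        return (R + P @ V).argmax(axis=1)

    def evaluate(policy, episodes=500, steps=30):
        total = 0.0
        for _ in range(episodes):
            s = 0
            for _ in range(steps):
                a = policy[s]
                total += R[s, a]
                s = rng.choice(nS, p=P_true[s, a])
        return total / episodes

    P_hat = estimate_model()
    for h in (1, 2, 5, 10):
        print(h, round(evaluate(plan(P_hat, h)), 3))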

NeurIPS Conference 2015 Conference Paper

Action-Conditional Video Prediction using Deep Networks in Atari Games

  • Junhyuk Oh
  • Xiaoxiao Guo
  • Honglak Lee
  • Richard Lewis
  • Satinder Singh

Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future (image-)frames are dependent on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled by the actions directly and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually-realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
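
Structurally, the encoding / action-conditional transformation / decoding pipeline can be sketched in a few lines. The version below uses tiny dense maps and a multiplicative action gate purely to show the data flow; the paper's models are convolutional and recurrent networks over raw frames.

    import numpy as np

    # Structural sketch only, far simpler than the paper's CNN/RNN architectures.

    rng = np.random.default_rng(0)
    frame_dim, latent_dim, n_actions = 16, 6, 4

    W_enc = rng.normal(size=(latent_dim, frame_dim)) * 0.1
    W_act = rng.normal(size=(n_actions, latent_dim)) * 0.1   # one gating vector per action
    W_dec = rng.normal(size=(frame_dim, latent_dim)) * 0.1

    def predict_next_frame(frame, action):
        h = np.tanh(W_enc @ frame)          # encoding
        h = h * W_act[action]               # action-conditional (multiplicative) transformation
        return W_dec @ h                    # decoding back to frame space

    frame = rng.normal(size=frame_dim)
    print(predict_next_frame(frame, action=2).shape)   # -> (16,)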

NeurIPS Conference 2014 Conference Paper

Deep Learning for Real-Time Atari Game Play Using Offline Monte-Carlo Tree Search Planning

  • Xiaoxiao Guo
  • Satinder Singh
  • Honglak Lee
  • Richard Lewis
  • Xiaoshi Wang

The combination of modern Reinforcement Learning and Deep Learning approaches holds the promise of making significant progress on challenging applications requiring both rich perception and policy-selection. The Arcade Learning Environment (ALE) provides a set of Atari games that represent a useful benchmark set of such applications. A recent breakthrough in combining model-free reinforcement learning with deep learning, called DQN, achieves the best real-time agents thus far. Planning-based approaches achieve far higher scores than the best model-free approaches, but they exploit information that is not available to human players, and they are orders of magnitude slower than needed for real-time play. Our main goal in this work is to build a better real-time Atari game playing agent than DQN. The central idea is to use the slow planning-based agents to provide training data for a deep-learning architecture capable of real-time play. We propose new agents based on this idea and show that they outperform DQN.
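
The general recipe described here (label states offline with a slow planner, then train a fast policy to imitate those labels) can be sketched as follows. The planner stand-in, feature sizes, and linear softmax policy below are all hypothetical; the paper uses UCT and deep networks.

    import numpy as np

    # Sketch of the offline-planner-to-real-time-policy recipe, details differ from the paper.

    rng = np.random.default_rng(0)
    n_actions, feat_dim = 4, 8

    def slow_planner(state):
        # Stand-in for UCT: a fixed rule, purely for illustration
        return int(state.argmax()) % n_actions

    # Offline data collection: states labeled with the planner's chosen actions
    states = rng.normal(size=(500, feat_dim))
    labels = np.array([slow_planner(s) for s in states])

    # Train a linear softmax policy by gradient descent on cross-entropy
    W = np.zeros((n_actions, feat_dim))
    for _ in range(200):
        logits = states @ W.T
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        probs[np.arange(len(labels)), labels] -= 1.0       # gradient of cross-entropy w.r.t. logits
        W -= 0.1 * (probs.T @ states) / len(labels)

    pred = (states @ W.T).argmax(axis=1)
    print("imitation accuracy:", (pred == labels).mean())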

RLDM Conference 2013 Conference Abstract

Discovering Computationally Rational Eye Movements in the Distractor Ratio Task

  • Xiuli Chen
  • Richard Lewis
  • Christopher Myers
  • Joseph Houpt
  • Andrew Howes

In our recent work we have defined reinforcement learning (RL) problems in which the goal is to discover strategies that are computationally rational given a theory of the constraints on human cognition. These strategies are used to predict human behaviours. In this extended abstract we illustrate this use of RL with an example in which distractor ratio phenomena are explained by deriving strategies for eye movements and target detection given constraints on visual acuity. The distractor ratio effect is shown to be a consequence of computationally rational adaptation to the goal of making presence/absence decisions given location noise in the human visual system.

NeurIPS Conference 2013 Conference Paper

Reward Mapping for Transfer in Long-Lived Agents

  • Xiaoxiao Guo
  • Satinder Singh
  • Richard Lewis

We consider how to transfer knowledge from previous tasks to a current task in long-lived and bounded agents that must solve a sequence of MDPs over a finite lifetime. A novel aspect of our transfer approach is that we reuse reward functions. While this may seem counterintuitive, we build on the insight of recent work on the optimal rewards problem that guiding an agent's behavior with reward functions other than the task-specifying reward function can help overcome computational bounds of the agent. Specifically, we use good guidance reward functions learned on previous tasks in the sequence to incrementally train a reward mapping function that maps task-specifying reward functions into good initial guidance reward functions for subsequent tasks. We demonstrate that our approach can substantially improve the agent's performance relative to other approaches, including an approach that transfers policies.
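
One way to picture the reward mapping is as a regression fit on earlier tasks: given pairs of (task-specifying reward, good guidance reward eventually found), fit a map and use it to warm-start the guidance reward for the next task. The linear least-squares sketch below is an assumption-laden simplification of that idea, not the paper's agent.

    import numpy as np

    # Illustrative sketch: fit a reward-mapping function on previous tasks and
    # apply it to initialize the guidance reward for a new task.

    rng = np.random.default_rng(0)
    dim, n_prev_tasks = 5, 20

    # Hypothetical history: for each previous task, the task reward and the good
    # guidance reward eventually found for it (here related by a fixed unknown map)
    true_map = rng.normal(size=(dim, dim)) * 0.5
    task_rewards = rng.normal(size=(n_prev_tasks, dim))
    guidance_rewards = task_rewards @ true_map.T + 0.05 * rng.normal(size=(n_prev_tasks, dim))

    # Fit a linear reward-mapping function by least squares
    M, *_ = np.linalg.lstsq(task_rewards, guidance_rewards, rcond=None)

    new_task_reward = rng.normal(size=dim)
    initial_guidance = new_task_reward @ M          # warm start for the new task
    print(initial_guidance.round(2))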

AAMAS Conference 2012 Conference Paper

Strong Mitigation: Nesting Search for Good Policies Within Search for Good Reward

  • Jeshua Bratman
  • Satinder Singh
  • Richard Lewis
  • Jonathan Sorg

Recent work has defined an optimal reward problem (ORP) in which an agent designer, with an objective reward function that evaluates an agent's behavior, has a choice of what reward function to build into a learning or planning agent to guide its behavior. Existing results on ORP show weak mitigation of limited computational resources, i.e., the existence of reward functions so that agents when guided by them do better than when guided by the objective reward function. These existing results ignore the cost of finding such good reward functions. We define a nested optimal reward and control architecture that achieves strong mitigation of limited computational resources. We show empirically that the designer is better off using the new architecture, which spends some of its limited resources learning a good reward function instead of using all of its resources to optimize its behavior with respect to the objective reward function.
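
The nesting can be sketched as two interleaved loops drawing on one budget: an outer hill-climbing step occasionally perturbs the internal (guidance) reward, the inner learner keeps improving its policy against whatever internal reward is current, and everything is judged by the objective reward. The one-parameter sketch below is purely schematic, not the paper's architecture.

    import numpy as np

    # Abstract sketch of nesting policy search within reward search.

    rng = np.random.default_rng(0)

    def inner_policy_update(policy, internal_reward, lr=0.2):
        # Hypothetical bounded learner: moves toward whatever the internal reward favors
        return policy + lr * (internal_reward - policy)

    def objective_return(policy, target=0.7):
        return -(policy - target) ** 2

    policy, internal_reward = 0.0, 0.0
    for step in range(100):                      # one shared computational budget
        if step % 5 == 0:                        # occasionally perturb the internal reward
            candidate = internal_reward + 0.1 * rng.normal()
            # keep the candidate if it leads to a better one-step objective outcome
            if (objective_return(inner_policy_update(policy, candidate))
                    > objective_return(inner_policy_update(policy, internal_reward))):
                internal_reward = candidate
        policy = inner_policy_update(policy, internal_reward)

    print(round(internal_reward, 2), round(objective_return(policy), 4))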

AAAI Conference 2011 Conference Paper

Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents

  • Jonathan Sorg
  • Satinder Singh
  • Richard Lewis

Planning agents often lack the computational resources needed to build full planning trees for their environments. Agent designers commonly overcome this finite-horizon approximation by applying an evaluation function at the leaf-states of the planning tree. Recent work has proposed an alternative approach for overcoming computational constraints on agent design: modify the reward function. In this work, we compare this reward design approach to the common leaf-evaluation heuristic approach for improving planning agents. We show that in many agents, the reward design approach strictly subsumes the leaf-evaluation approach, i.e., there exists a reward function for every leaf-evaluation heuristic that leads to equivalent behavior, but the converse is not true. We demonstrate that this generality leads to improved performance when an agent makes approximations in addition to the finite-horizon approximation. As part of our contribution, we extend PGRD, an online reward design algorithm, to develop reward design algorithms for Sparse Sampling and UCT, two algorithms capable of planning in large state spaces.

NeurIPS Conference 2010 Conference Paper

Reward Design via Online Gradient Ascent

  • Jonathan Sorg
  • Richard Lewis
  • Satinder Singh

Recent work has demonstrated that when artificial agents are limited in their ability to achieve their goals, the agent designer can benefit by making the agent's goals different from the designer's. This gives rise to the optimization problem of designing the artificial agent's goals---in the RL framework, designing the agent's reward function. Existing attempts at solving this optimal reward problem do not leverage experience gained online during the agent's lifetime nor do they take advantage of knowledge about the agent's structure. In this work, we develop a gradient ascent approach with formal convergence guarantees for approximately solving the optimal reward problem online during an agent's lifetime. We show that our method generalizes a standard policy gradient approach, and we demonstrate its ability to improve reward functions in agents with various forms of limitations.
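
A compact way to see the optimal-reward idea: a bounded agent acts via a softmax over its internal reward, and the designer does gradient ascent on those internal-reward parameters so that the agent's induced behavior scores well under the designer's objective reward. The three-action bandit below is a self-contained toy, not PGRD itself.

    import numpy as np

    # Sketch under stated assumptions: gradient ascent on internal-reward parameters theta.

    objective_reward = np.array([0.0, 1.0, 0.2])   # what the designer actually values

    def agent_policy(theta):
        # Bounded agent: softmax over its internal reward theta
        z = theta - theta.max()
        p = np.exp(z)
        return p / p.sum()

    def designer_value(theta):
        return agent_policy(theta) @ objective_reward

    def grad_designer_value(theta):
        # d/dtheta of E_pi[objective reward] for a softmax policy (standard softmax Jacobian)
        p = agent_policy(theta)
        v = p @ objective_reward
        return p * (objective_reward - v)

    theta = np.zeros(3)
    for _ in range(200):
        theta += 0.5 * grad_designer_value(theta)

    print(agent_policy(theta).round(3), round(designer_value(theta), 3))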